# World CO2 Emissions Forecasting
## ARIMA Model and LSTM Recurrent Neural Network Model Forecasting Comparison
[*Cristian Castro Álvarez*](https://github.com/cristian-castro-a)

**Goal**: 
- Compare the performance of an ARIMA model and an LSTM recurrent neural network in forecasting the world's CO2 emissions for the next decade.


**Data:**
- The data comes from [Our World in Data](https://github.com/owid/co2-data)
- The `co2` column of the dataframe gives the CO2 emitted into the atmosphere, in million tonnes.
- The dataset includes yearly data from 1750 to 2020, with a total of 271 data points.

**Models:**
- The ARIMA model and the LSTM recurrent neural network were trained separately.
- For details of each model's training, see the Models folder in this repository.
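One common way to compare two forecasting models is to score each forecast against the same held-out observations. The sketch below computes the mean absolute error and the root mean squared error with NumPy. The helper names `mae` and `rmse` and all numbers are made up for illustration; they are not results from this notebook.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Invented toy values, for illustration only
actual = [10.0, 12.0, 14.0]
forecast_a = [11.0, 12.0, 13.0]  # hypothetical forecast of model A
forecast_b = [10.0, 12.0, 17.0]  # hypothetical forecast of model B

for name, forecast in [("A", forecast_a), ("B", forecast_b)]:
    print(name, mae(actual, forecast), rmse(actual, forecast))
```

RMSE penalizes large single errors more heavily than MAE, so the two metrics can rank models differently when one forecast has an occasional large miss.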

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import math

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from pandas.plotting import lag_plot
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMAResults  # current location; statsmodels.tsa.arima_model is the deprecated module
import statsmodels.api as sm
print('Statsmodel Version: ', sm.__version__)

import tensorflow as tf
print('TensorFlow Version: ', tf.__version__)
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error

import warnings
warnings.filterwarnings("ignore")
import os

mpl.rcParams['figure.figsize'] = (10,8)
mpl.rcParams['axes.grid'] = False

Statsmodel Version:  0.13.2
TensorFlow Version:  2.0.0


## Data

In [2]:
# Raw Data
df = pd.read_csv('Data/owid-co2-data.csv')

# Aggregate the data on a yearly basis (the whole world as a single entity; per-country emissions are not needed here)
# Selecting 'co2' before summing avoids aggregating the non-numeric columns
df = df.groupby('year', as_index=False)['co2'].sum()
df.insert(loc = 1, column = 'month', value = 12)
df.insert(loc = 2, column = 'day', value = 31)
values = pd.to_datetime(df[['year','month','day']])
df.insert(loc = 0, column = 'date', value = values)
df.drop(['year','month','day'], axis = 1, inplace = True)
df.head()

| | date | co2 |
|---|---|---|
| 0 | 1750-12-31 | 46.755 |
| 1 | 1751-12-31 | 46.755 |
| 2 | 1752-12-31 | 46.77 |
| 3 | 1753-12-31 | 46.77 |
| 4 | 1754-12-31 | 46.79 |
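The aggregation and date construction above can be tried on a small frame to see what each step produces. This is a minimal sketch on invented rows: the `toy` frame and its values are hypothetical stand-ins for the OWID file, not real data.

```python
import pandas as pd

# Invented rows standing in for the OWID file (two entities, two years)
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "year": [2000, 2000, 2001, 2001],
    "co2": [1.0, 2.0, 1.5, 2.5],
})

# Sum co2 over every row that shares a year
yearly = toy.groupby("year", as_index=False)["co2"].sum()

# Build a year-end timestamp from year/month/day columns,
# the same assembly that pd.to_datetime performs in the cell above
parts = yearly.assign(month=12, day=31)[["year", "month", "day"]]
yearly["date"] = pd.to_datetime(parts)

# Keep only the date and the aggregated value
yearly = yearly[["date", "co2"]]
print(yearly)
```

Note that the sum includes every row for a given year, so any aggregate rows present in the input would be counted alongside the individual countries.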


In [3]:
# Divide by the conversion factor 3.664 (the CO2-to-carbon mass ratio)
df['co2'] = df['co2']/3.664

# Visualizing the world emissions per year
fig = px.line(df, 
                x = 'date', 
                y = 'co2', 
                markers = True, 
                height = 800, 
                width = 1000)

fig.update_layout(title = dict(
        text = 'Total World CO2 Emissions',
        font = dict(
            family = 'Arial',
            size = 30
        ),
        x = 0.5
    )
    )

fig.update_traces(line_color = 'darkblue')

fig.update_xaxes(
    title_text = 'Date',
    title_font = {'size': 20}
)

fig.update_yaxes(
    title_text = 'Million Tonnes of CO2 Emitted into the Atmosphere',
    title_font = {'size': 20}
)

fig.show()

## Import Best LSTM Model

In [4]:
# Import Best LSTM Model
# See the "Models" folder for details of the training process
directory = 'Models/best_lstm_model/best_lstm.h5'
parent_dir = os.path.abspath(os.getcwd())
path = os.path.join(parent_dir, directory)
best_lstm = tf.keras.models.load_model(path)

2022-10-03 17:25:56.204982: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-03 17:25:56.206819: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
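The saved network expects input in the shape it was trained on, which is documented in the Models folder rather than in this notebook. As a rough sketch of how a univariate series is commonly prepared for an LSTM, the example below applies min-max scaling and cuts the series into sliding windows of shape `(samples, timesteps, features)`. The helper `make_windows`, the window length of 3, and the toy series are assumptions for illustration and may differ from the setup used to train `best_lstm`.

```python
import numpy as np


def make_windows(series, window):
    """Split a 1-D series into overlapping input windows and next-step targets."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - window
    X = np.stack([series[i:i + window] for i in range(n_samples)])
    y = series[window:]
    # Add a trailing feature axis: (samples, timesteps, 1)
    return X[..., np.newaxis], y


# Toy series, not the CO2 data
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Min-max scaling to [0, 1], written out by hand
lo, hi = series.min(), series.max()
scaled = (series - lo) / (hi - lo)

X, y = make_windows(scaled, window=3)  # window length is an assumed value
print(X.shape, y.shape)

# Undo the scaling to get targets back in the original units
y_original = y * (hi - lo) + lo
print(y_original)
```

Predictions made on scaled inputs come out on the scaled axis as well, so the same inverse step is needed before comparing them with the raw series.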


## Reproduce Best ARIMA Model

In [7]:
# Reproduce Best ARIMA Model
df_ar = df.copy()

# Set the index as date
df_ar.set_index('date', inplace = True)

# Natural logarithm of co2 (the series the ARIMA model is fitted on)
df_ar['logco2'] = np.log(df_ar['co2'])
df_ar.tail()

| date | co2 | logco2 |
|---|---|---|
| 2016-12-31 | 34035.382642 | 10.435156 |
| 2017-12-31 | 34471.811135 | 10.447897 |
| 2018-12-31 | 35058.007096 | 10.464759 |
| 2019-12-31 | 35049.914574 | 10.464528 |
| 2020-12-31 | 33185.793122 | 10.409877 |
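Because the model is fitted on `logco2`, anything it produces is in log units and has to be mapped back with `np.exp` before it can be read as emissions. The sketch below checks that round trip on the last two rows of the table above, and also shows that a first difference of the log series equals `log(1 + relative change)`, which is the quantity a differenced (d = 1) model works with. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Last two rows of the table above
co2 = pd.Series(
    [35049.914574, 33185.793122],
    index=pd.to_datetime(["2019-12-31", "2020-12-31"]),
)

# Forward transform, as in the notebook
logco2 = np.log(co2)

# Back-transform: exp undoes the natural log
recovered = np.exp(logco2)

# First difference of the log series versus the relative change of the raw series
log_diff = logco2.diff().iloc[-1]
pct_change = co2.pct_change().iloc[-1]

print(recovered)
print(log_diff, np.log1p(pct_change))
```

For small year-to-year changes the log difference is close to the relative change itself, which makes the differenced log series easy to interpret as an approximate growth rate.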


In [9]:
# ARIMA Model
best_arima = sm.tsa.arima.ARIMA(endog = df_ar['logco2'], order = (1,1,1))
best_arima_fit = best_arima.fit()
print(best_arima_fit.summary())

                               SARIMAX Results                                
Dep. Variable:                 logco2   No. Observations:                  271
Model:                 ARIMA(1, 1, 1)   Log Likelihood                 415.919
Date:                Mon, 03 Oct 2022   AIC                           -825.838
Time:                        17:27:35   BIC                           -815.043
Sample:                    12-31-1750   HQIC                          -821.503
                         - 12-31-2020                                         
Covariance Type:                  opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.9976      0.004    280.907      0.000       0.991       1.005
ma.L1         -0.9668      0.016    -58.918      0.000      -0.999      -0.935
sigma2         0.0027      0.000     23.737      0.0