# Climate Change

## Problem 1.1 - Creating Our First Model
We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset *climate_change.csv* into Python.

Then, split the data into a *training* set, consisting of all the observations up to and including 2006, and a *testing* set consisting of the remaining years.

In [1]:
import pandas as pd

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
climate_change = pd.read_csv('../data/climate_change.csv')
climate_change.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308 entries, 0 to 307
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      308 non-null    int64  
 1   Month     308 non-null    int64  
 2   MEI       308 non-null    float64
 3   CO2       308 non-null    float64
 4   CH4       308 non-null    float64
 5   N2O       308 non-null    float64
 6   CFC-11    308 non-null    float64
 7   CFC-12    308 non-null    float64
 8   TSI       308 non-null    float64
 9   Aerosols  308 non-null    float64
 10  Temp      308 non-null    float64
dtypes: float64(9), int64(2)
memory usage: 26.6 KB


In [3]:
train = climate_change[climate_change['Year']<=2006].copy()
test = climate_change[climate_change['Year']>2006].copy()

Next, build a linear regression model to predict the dependent variable **Temp**, using **MEI**, **CO2**, **CH4**, **N2O**, **CFC-11**, **CFC-12**, **TSI**, and **Aerosols** as independent variables (**Year** and **Month** should NOT be used in the model). Use the training set to build the model.

Enter the model R2 (the "Multiple R-squared" value):
- 0.751

In [4]:
features = ['MEI', 'CO2', 'CH4', 'N2O', 'CFC-11', 'CFC-12', 'TSI', 'Aerosols']
X_train = train[features]
y_train = train['Temp']

model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Temp   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.744
Method:                 Least Squares   F-statistic:                     103.6
Date:                Sat, 14 Aug 2021   Prob (F-statistic):           1.94e-78
Time:                        22:12:19   Log-Likelihood:                 280.10
No. Observations:                 284   AIC:                            -542.2
Df Residuals:                     275   BIC:                            -509.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -124.5943     19.887     -6.265      0.0

## Problem 1.2 - Creating Our First Model
Which variables are significant in the model? We will consider a variable significant only if its p-value is below 0.05. *Select all that apply.*
- MEI
- CO2
- CFC-11
- CFC-12
- TSI
- Aerosols

## Problem 2.1 - Understanding the Model
Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are **negative**, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

Which of the following is the *simplest* correct explanation for this contradiction?
- All of the gas concentration variables reflect human development: N2O and CFC-11 are correlated with other variables in the data set.
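
To see why correlated predictors can produce a coefficient with an unexpected sign, consider the toy sketch below. The data are hypothetical and deterministic (not the climate data): `x2` tracks `x1` almost perfectly, so on its own it is positively associated with `y`, yet its coefficient in the joint regression is negative.

```python
import numpy as np

# Two nearly identical predictors (hypothetical, deterministic toy data).
x1 = np.arange(50, dtype=float)
x2 = x1 + 0.5 * np.sin(x1)          # x2 tracks x1 almost perfectly
y = 2.0 * x1 - 1.0 * x2             # x2's partial effect is negative by construction

# Regressing y on x2 alone: the slope is positive because x2 proxies for x1.
slope_alone = np.polyfit(x2, y, 1)[0]

# Regressing y on both: least squares recovers the negative coefficient on x2.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"corr(x1, x2)    = {np.corrcoef(x1, x2)[0, 1]:.4f}")
print(f"x2 alone        = {slope_alone:+.3f}")
print(f"x2 alongside x1 = {coef_joint[2]:+.3f}")
```

The sign of a coefficient in a multiple regression describes the association after holding the other predictors fixed, so it need not match the sign of the one-variable relationship when the predictors move together.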

## Problem 2.2 - Understanding the Model
Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? *Select all that apply.*
- CO2
- CH4
- CFC-12

In [5]:
abs(train.corr()['N2O']) > 0.7

Year         True
Month       False
MEI         False
CO2          True
CH4          True
N2O          True
CFC-11      False
CFC-12       True
TSI         False
Aerosols    False
Temp         True
Name: N2O, dtype: bool

Which of the following independent variables is CFC-11 highly correlated with? *Select all that apply.*
- CH4
- CFC-12

In [6]:
abs(train.corr()['CFC-11']) > 0.7

Year        False
Month       False
MEI         False
CO2         False
CH4          True
N2O         False
CFC-11       True
CFC-12       True
TSI         False
Aerosols    False
Temp        False
Name: CFC-11, dtype: bool
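
The boolean output above also flags the variable itself and, for N2O, the non-predictors **Year** and **Temp**. A small helper can return just the names of interest. This is a sketch: `highly_correlated` and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def highly_correlated(df, target, threshold=0.7):
    """List columns whose absolute Pearson correlation with `target` exceeds threshold."""
    corr = df.corr()[target].drop(target)   # drop the trivial self-correlation
    return corr[corr.abs() > threshold].index.tolist()


# Hypothetical toy frame: 'b' is a linear function of 'a', 'c' just alternates sign.
toy = pd.DataFrame({
    'a': range(10),
    'b': [2 * v + 1 for v in range(10)],
    'c': [1, -1] * 5,
})
print(highly_correlated(toy, 'a'))  # ['b']
```

In the notebook, calling `highly_correlated(train[features], 'N2O')` would restrict the check to the independent variables, which is what the question asks about.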

## Problem 3 - Simplifying the Model
Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

Enter the coefficient of N2O in this reduced model:
- 0.0253

In [7]:
reduced_features = ['MEI', 'N2O', 'TSI', 'Aerosols']
X_train2 = train[reduced_features]
y_train2 = train['Temp']

model2 = sm.OLS(y_train2, sm.add_constant(X_train2)).fit()
print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:                   Temp   R-squared:                       0.726
Model:                            OLS   Adj. R-squared:                  0.722
Method:                 Least Squares   F-statistic:                     184.9
Date:                Sat, 14 Aug 2021   Prob (F-statistic):           3.52e-77
Time:                        22:12:19   Log-Likelihood:                 266.64
No. Observations:                 284   AIC:                            -523.3
Df Residuals:                     279   BIC:                            -505.0
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -116.2269     20.223     -5.747      0.0

Enter the model R2:
- 0.726

## Problem 4 - Automatically Building the Model


In [8]:
x_columns = list(X_train.columns)

def refine_model():
    X = train[x_columns]
    y = train['Temp']
    refined = sm.OLS(y, sm.add_constant(X)).fit()
    print(refined.summary())

In [9]:
refine_model()

                            OLS Regression Results                            
Dep. Variable:                   Temp   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.744
Method:                 Least Squares   F-statistic:                     103.6
Date:                Sat, 14 Aug 2021   Prob (F-statistic):           1.94e-78
Time:                        22:12:19   Log-Likelihood:                 280.10
No. Observations:                 284   AIC:                            -542.2
Df Residuals:                     275   BIC:                            -509.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -124.5943     19.887     -6.265      0.0

In [10]:
x_columns.remove('CH4')
refine_model()

                            OLS Regression Results                            
Dep. Variable:                   Temp   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     118.8
Date:                Sat, 14 Aug 2021   Prob (F-statistic):           1.77e-79
Time:                        22:12:19   Log-Likelihood:                 280.07
No. Observations:                 284   AIC:                            -544.1
Df Residuals:                     276   BIC:                            -515.0
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -124.5152     19.850     -6.273      0.0
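
The manual removal above can be automated. The sketch below is one possible approach, assuming AIC as the selection criterion (computed as `n * log(SSE / n) + 2k`, which equals the Gaussian AIC up to an additive constant). The function names and toy data are hypothetical, and this is not necessarily the rule used to pick CH4 above.

```python
import numpy as np
import pandas as pd


def ols_aic(X, y):
    """AIC of an OLS fit with intercept, up to an additive constant."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = float(np.sum((y - design @ beta) ** 2))  # assumed > 0 (no perfect fit)
    n, k = design.shape
    return n * np.log(sse / n) + 2 * k


def backward_eliminate(df, target, features):
    """Drop one feature at a time for as long as doing so lowers the AIC."""
    current = list(features)
    y = df[target].to_numpy(dtype=float)
    best = ols_aic(df[current].to_numpy(dtype=float), y)
    while len(current) > 1:
        # AIC of every model that omits exactly one remaining feature
        trials = {
            f: ols_aic(df[[c for c in current if c != f]].to_numpy(dtype=float), y)
            for f in current
        }
        drop = min(trials, key=trials.get)
        if trials[drop] >= best:
            break  # no single removal improves the criterion
        current.remove(drop)
        best = trials[drop]
    return current


# Hypothetical deterministic toy data: y depends on x1 only.
i = np.arange(60, dtype=float)
toy = pd.DataFrame({'x1': i, 'x2': np.cos(0.9 * i)})
toy['y'] = 2.0 * toy['x1'] + 0.1 * np.sin(1.7 * i)
kept = backward_eliminate(toy, 'y', ['x1', 'x2'])
print(kept)
```

With the notebook's data, the equivalent call would be `backward_eliminate(train, 'Temp', features)`. Whether it reproduces the CH4 removal depends on the criterion, so the result should be checked against the summaries above.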

## Problem 5 - Testing on Unseen Data
We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?

Using the simplified model from Problem 4 (the one with CH4 removed, standing in for the output of a step function), calculate temperature predictions for the testing data set using the model's predict method.

Enter the testing set R2:
- 0.629

In [11]:
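# Refit the Problem 4 model (all features except CH4) on the training set
X = train[x_columns]
y = train['Temp']
refined = sm.OLS(y, sm.add_constant(X)).fit()

# Predict on the test set and compute the out-of-sample R2
y_pred = refined.predict(sm.add_constant(test[x_columns]))
residuals = test['Temp'] - y_pred
sse = (residuals**2).sum()
# The SST baseline is the training-set mean, not the test-set mean
sst = ((test['Temp'] - train['Temp'].mean())**2).sum()
r2_value = (1 - sse / sst).round(3)
print(f"R2 score for test set: {r2_value}")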

R2 score for test set: 0.629
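
The calculation above measures the model against a baseline that always predicts the training-set mean. A reusable version might look like the sketch below. The function name `out_of_sample_r2` and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def out_of_sample_r2(y_true, y_pred, baseline_mean):
    """R2 on held-out data, measured against a constant baseline prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)          # model's squared error
    sst = np.sum((y_true - baseline_mean) ** 2)   # baseline's squared error
    return 1.0 - sse / sst


# Hypothetical numbers: SSE = 1, SST = 2, so R2 = 1 - 1/2
print(out_of_sample_r2([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], baseline_mean=2.0))  # 0.5
```

Unlike the training R2, this quantity can be negative when the model predicts worse than the baseline. It can also differ from `sklearn.metrics.r2_score`, which uses the mean of the supplied true values as its baseline.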
