# Analytics Edge: Climate Change

## Climate Change

### Compiled By: Dana Hagist
### Project Provided By: Analytics Edge

There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:
- Year: the observation year.
- Month: the observation month.
- Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.
- CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane  (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division.
    - CO2, N2O and CH4 are expressed in ppmv (parts per million by volume  -- i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere)
    - CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume). 
- Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun's energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.
- TSI: the total solar irradiance (TSI) in W/m2 (the rate at which the sun's energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.
- MEI: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.

#### Problem 1.1 - Creating Our First Model

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.

Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability. 

Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.

Indicate the model R2 (the "Multiple R-squared" value).

In [2]:
# Read in dataset and split into training and test
setwd('C:/Users/dhagi/Documents/Coursework/Analytics Edge/Climate Change')
climate = read.csv("climate_change.csv")
train = subset(climate, Year<=2006)
test = subset(climate, Year > 2006)

# Looking at the structure of the training data
str(train)

'data.frame':	284 obs. of  11 variables:
 $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
 $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
 $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
 $ CO2     : num  346 346 344 342 340 ...
 $ CH4     : num  1639 1634 1633 1631 1648 ...
 $ N2O     : num  304 304 304 304 304 ...
 $ CFC.11  : num  191 192 193 194 194 ...
 $ CFC.12  : num  350 352 354 356 357 ...
 $ TSI     : num  1366 1366 1366 1366 1366 ...
 $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
 $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...


In [3]:
# Creating the linear model with the variables outlined above
model = lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data = train)
summary(model)


Call:
lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
    TSI + Aerosols, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.25888 -0.05913 -0.00082  0.05649  0.32433 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.246e+02  1.989e+01  -6.265 1.43e-09 ***
MEI          6.421e-02  6.470e-03   9.923  < 2e-16 ***
CO2          6.457e-03  2.285e-03   2.826  0.00505 ** 
CH4          1.240e-04  5.158e-04   0.240  0.81015    
N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .  
CFC.11      -6.631e-03  1.626e-03  -4.078 5.96e-05 ***
CFC.12       3.808e-03  1.014e-03   3.757  0.00021 ***
TSI          9.314e-02  1.475e-02   6.313 1.10e-09 ***
Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09171 on 275 degrees of freedom
Multiple R-squared:  0.7509,	Adjusted R-squared:  0.7436 
F-statistic: 103.6 on 8 and 275 DF,  p-val

Solution: The R2 for this model is 0.7509, meaning about 75% of the variation in the dependent variable (Temp) in the training set is explained by the model.
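As a minimal sketch of where the "Multiple R-squared" figure comes from, the block below recomputes it by hand as 1 - SSE/SST. It uses R's built-in `mtcars` data rather than climate_change.csv, so the variables `mpg`, `wt`, and `hp` are stand-ins chosen only for illustration.

```r
# Illustrative only: built-in mtcars data, not the climate data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Sum of squared residuals of the fitted model
sse <- sum(residuals(fit)^2)

# Total sum of squares around the mean of the response
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2_manual  <- 1 - sse / sst
r2_summary <- summary(fit)$r.squared

# The two values should agree up to floating-point error
all.equal(r2_manual, r2_summary)
```

The same calculation applied to `model` and `train` should reproduce the value printed by `summary(model)`.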

#### Problem 1.2 - Creating Our First Model

Which variables are significant in the model? We will consider a variable significant only if the p-value is below 0.05.

Solution: Based on the p-values above, MEI, CO2, CFC.11, CFC.12, TSI, and Aerosols are all statistically significant. N2O (p = 0.055) and CH4 (p = 0.810) are not.
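Rather than reading the stars off the printout, the p-values can be pulled from the coefficient table directly. The following is a sketch on the built-in `mtcars` data (an assumption for illustration; the names `coef_table` and `significant_vars` are hypothetical). The same pattern should work with the climate model object.

```r
# Illustrative only: built-in mtcars data, not the climate data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# summary()$coefficients is a matrix with one row per term
coef_table <- summary(fit)$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Drop the intercept, then keep terms below the 0.05 threshold
p_values <- p_values[names(p_values) != "(Intercept)"]
significant_vars <- names(p_values)[p_values < 0.05]
significant_vars
```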

#### Problem 2.1 - Understanding the Model 

Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

Which of the following is the simplest correct explanation for this contradiction?

Solution: The most likely explanation is that N2O and CFC.11 are correlated with other variables in the model. Each coefficient represents the effect of that independent variable on the dependent variable with all other variables held constant, so when predictors move together, an individual coefficient can take an unexpected sign.
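The effect of correlated predictors on coefficient signs can be seen in a small simulation. This is a sketch on made-up data, not the climate data: `x1`, `x2`, and `y` are hypothetical, and `x2` is built to be almost a copy of `x1`. By construction the coefficient of `x2` is negative when `x1` is held constant, yet `x2` on its own is positively associated with `y`.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 500

x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)              # nearly collinear with x1
y  <- 2 * x1 - 1 * x2 + rnorm(n, sd = 0.1)

joint <- lm(y ~ x1 + x2)  # x2 with x1 held constant
alone <- lm(y ~ x2)       # x2 absorbs the effect of x1

cor(x1, x2)        # close to 1
coef(joint)["x2"]  # negative
coef(alone)["x2"]  # positive
```

The sign of a coefficient therefore depends on which other variables are in the model, which is the situation N2O and CFC.11 are in here.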

#### Problem 2.2 - Understanding the Model 

Compute the correlations between all the variables in the training set. Which of the independent variables is N2O highly correlated with (absolute correlation greater than 0.7)?

In [4]:
# Finding correlations between variables in train
cor(train)

| | Year | Month | MEI | CO2 | CH4 | N2O | CFC.11 | CFC.12 | TSI | Aerosols | Temp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1.0 | -0.0279419602 | -0.0369876842 | 0.98274939 | 0.91565945 | 0.99384523 | 0.56910643 | 0.8970116635 | 0.17030201 | -0.3452467 | 0.78679714 |
| Month | -0.02794196 | 1.0 | 0.0008846905 | -0.10673246 | 0.01856866 | 0.01363153 | -0.01311122 | 0.0006751102 | -0.03460619 | 0.01488954 | -0.09985674 |
| MEI | -0.03698768 | 0.0008846905 | 1.0 | -0.04114717 | -0.0334193 | -0.05081978 | 0.06900044 | 0.0082855443 | -0.15449192 | 0.34023779 | 0.17247075 |
| CO2 | 0.98274939 | -0.1067324607 | -0.0411471651 | 1.0 | 0.87727963 | 0.97671982 | 0.51405975 | 0.8526896272 | 0.17742893 | -0.3561548 | 0.78852921 |
| CH4 | 0.91565945 | 0.0185686624 | -0.0334193014 | 0.87727963 | 1.0 | 0.89983864 | 0.77990402 | 0.9636162478 | 0.24552844 | -0.26780919 | 0.70325502 |
| N2O | 0.99384523 | 0.0136315303 | -0.0508197755 | 0.97671982 | 0.89983864 | 1.0 | 0.52247732 | 0.8679307757 | 0.19975668 | -0.33705457 | 0.77863893 |
| CFC.11 | 0.56910643 | -0.0131112236 | 0.0690004387 | 0.51405975 | 0.77990402 | 0.52247732 | 1.0 | 0.8689851828 | 0.27204596 | -0.0439212 | 0.40771029 |
| CFC.12 | 0.89701166 | 0.0006751102 | 0.0082855443 | 0.85268963 | 0.96361625 | 0.86793078 | 0.86898518 | 1.0 | 0.25530281 | -0.22513124 | 0.68755755 |
| TSI | 0.17030201 | -0.0346061935 | -0.1544919227 | 0.17742893 | 0.24552844 | 0.19975668 | 0.27204596 | 0.2553028138 | 1.0 | 0.05211651 | 0.24338269 |
| Aerosols | -0.3452467 | 0.0148895406 | 0.3402377871 | -0.3561548 | -0.26780919 | -0.33705457 | -0.0439212 | -0.225131244 | 0.05211651 | 1.0 | -0.38491375 |


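Solution: Among the independent variables, CO2, CH4, and CFC.12 are highly correlated with N2O (correlations of about 0.98, 0.90, and 0.87). Temp, the dependent variable, also exceeds the 0.7 threshold (about 0.78). These correlations are all positive, meaning that when one variable is high, the others tend to be high as well.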

Task: Which of the independent variables is CFC.11 highly correlated with?

Solution: CH4 and CFC.12 are highly correlated with the CFC.11 variable (about 0.78 and 0.87).
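Scanning a large correlation matrix by eye is error-prone, so the 0.7 threshold can be applied in code. The following is a sketch on the built-in `mtcars` data (an assumption for illustration; `target`, `target_cors`, and `high` are hypothetical names). Replacing `mtcars` with `train` and `"mpg"` with `"N2O"` or `"CFC.11"` should give the answers above.

```r
# Illustrative only: built-in mtcars data, not the climate data set
cors <- cor(mtcars)

target <- "mpg"

# Correlations of every other variable with the target
target_cors <- cors[target, ]
target_cors <- target_cors[names(target_cors) != target]

# Keep the variables whose absolute correlation exceeds 0.7
high <- names(target_cors)[abs(target_cors) > 0.7]
high
```

Taking the absolute value matters because a strong negative correlation is as much a sign of collinearity as a strong positive one.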

#### Problem 3 - Simplifying the Model 

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

Indicate the coefficient of N2O in this reduced model.

In [7]:
# Using an option to reduce scientific notation
options(scipen=999)

# Creating our simplified model
model2 = lm(Temp~ MEI + TSI + Aerosols + N2O, data = train)
summary(model2)


Call:
lm(formula = Temp ~ MEI + TSI + Aerosols + N2O, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27916 -0.05975 -0.00595  0.05672  0.34195 

Coefficients:
               Estimate  Std. Error t value             Pr(>|t|)    
(Intercept) -116.226858   20.223028  -5.747    0.000000023735836 ***
MEI            0.064186    0.006652   9.649 < 0.0000000000000002 ***
TSI            0.079490    0.014875   5.344    0.000000189373222 ***
Aerosols      -1.701737    0.217996  -7.806    0.000000000000119 ***
N2O            0.025320    0.001311  19.307 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09547 on 279 degrees of freedom
Multiple R-squared:  0.7261,	Adjusted R-squared:  0.7222 
F-statistic: 184.9 on 4 and 279 DF,  p-value: < 0.00000000000000022


Solution: The N2O coefficient in this model is 0.025320. Its magnitude is larger than in the previous model (-0.01653), and the sign has reversed.

Task: Indicate the model R2 (R-squared).

Solution: The R2 for this model is 0.7261, slightly lower than the 0.7509 of the larger model. We have lost little explanatory power considering that we were able to drop four variables.

#### Problem 4 - Automatically Building the Model 

We have many variables in this problem, and as we have seen above, dropping some from the model does not decrease model quality. R provides a function, step, that will automate the procedure of trying different combinations of variables to find a good compromise of model simplicity and R2. This trade-off is formalized by the Akaike information criterion (AIC) - it can be informally thought of as the quality of the model with a penalty for the number of variables in the model.

For our purposes, the step function needs only one argument: the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called "climateLM", you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)

Indicate the R2 value of the model produced by the step function.

In [8]:
# Using step function with original model
stepmodel = step(model)
summary(stepmodel)

Start:  AIC=-1348.16
Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols

           Df Sum of Sq    RSS     AIC
- CH4       1   0.00049 2.3135 -1350.1
<none>                  2.3130 -1348.2
- N2O       1   0.03132 2.3443 -1346.3
- CO2       1   0.06719 2.3802 -1342.0
- CFC.12    1   0.11874 2.4318 -1335.9
- CFC.11    1   0.13986 2.4529 -1333.5
- TSI       1   0.33516 2.6482 -1311.7
- Aerosols  1   0.43727 2.7503 -1301.0
- MEI       1   0.82823 3.1412 -1263.2

Step:  AIC=-1350.1
Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols

           Df Sum of Sq    RSS     AIC
<none>                  2.3135 -1350.1
- N2O       1   0.03133 2.3448 -1348.3
- CO2       1   0.06672 2.3802 -1344.0
- CFC.12    1   0.13023 2.4437 -1336.5
- CFC.11    1   0.13938 2.4529 -1335.5
- TSI       1   0.33500 2.6485 -1313.7
- Aerosols  1   0.43987 2.7534 -1302.7
- MEI       1   0.83118 3.1447 -1264.9



Call:
lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + 
    Aerosols, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.25770 -0.05994 -0.00104  0.05588  0.32203 

Coefficients:
                Estimate   Std. Error t value             Pr(>|t|)    
(Intercept) -124.5151778   19.8501125  -6.273     0.00000000136513 ***
MEI            0.0640678    0.0064339   9.958 < 0.0000000000000002 ***
CO2            0.0064015    0.0022689   2.821             0.005129 ** 
N2O           -0.0160211    0.0082873  -1.933             0.054234 .  
CFC.11        -0.0066094    0.0016208  -4.078     0.00005953626341 ***
CFC.12         0.0038676    0.0009812   3.942             0.000103 ***
TSI            0.0931155    0.0147293   6.322     0.00000000103549 ***
Aerosols      -1.5402058    0.2126158  -7.244     0.00000000000436 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09155 on 276 degrees of freedom
Multiple R-square

Solution: The model R2 produced by the step function is 0.7508.

Task: Which variable(s) were eliminated from the full model by the step function?

Solution: CH4 was removed in creation of the step model.
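The AIC values in the step trace can be reproduced from the RSS column. The sketch below assumes the form that `extractAIC` uses for linear models, n * log(RSS / n) + 2 * p, where p counts the estimated coefficients including the intercept. The helper name `step_aic` is hypothetical; the RSS values and n = 284 are taken from the output above.

```r
# Number of training observations, from str(train)
n <- 284

# AIC in the form reported in the step trace for lm fits (assumed form)
step_aic <- function(rss, n_coef, n_obs) {
  n_obs * log(rss / n_obs) + 2 * n_coef
}

# Full model: 8 predictors plus the intercept
aic_full <- step_aic(rss = 2.3130, n_coef = 9, n_obs = n)

# Model without CH4: 7 predictors plus the intercept
aic_no_ch4 <- step_aic(rss = 2.3135, n_coef = 8, n_obs = n)

round(c(full = aic_full, without_CH4 = aic_no_ch4), 2)
```

Dropping CH4 barely changes the RSS but removes one coefficient from the penalty, so the AIC falls by almost 2. That is why step removes CH4 and then stops: every further removal raises the AIC.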

#### Problem 5 - Testing on Unseen Data 

We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?

Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.

Enter the testing set R2.

In [9]:
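# Predict on the test set with the step model
tempPredict = predict(stepmodel, newdata=test)
# Sum of squared errors of the predictions
SSE = sum((tempPredict - test$Temp)^2)
# Baseline error: always predicting the training-set mean
SST = sum((mean(train$Temp) - test$Temp)^2)
# Out-of-sample R2
R2 = 1 - SSE/SST
R2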

Solution: The R2 for the testing set with our step model is about 0.63.
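The test-set R2 compares the model against a baseline that always predicts the training-set mean, so unlike the in-sample R2 it can be negative. The following is a small sketch with made-up numbers (the function name `oos_r2` and the toy vectors are hypothetical):

```r
# Out-of-sample R2 relative to the training-mean baseline
oos_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)   # model error on the test set
  sst <- sum((actual - train_mean)^2)  # baseline error on the test set
  1 - sse / sst
}

actual <- c(1, 2, 3, 4)

r2_perfect  <- oos_r2(actual, predicted = actual,        train_mean = 2.5)
r2_baseline <- oos_r2(actual, predicted = rep(2.5, 4),   train_mean = 2.5)
r2_worse    <- oos_r2(actual, predicted = c(4, 3, 2, 1), train_mean = 2.5)

c(perfect = r2_perfect, baseline = r2_baseline, worse = r2_worse)
```

A perfect prediction gives 1, predicting the training mean gives 0, and predictions worse than the baseline give a negative value.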

### Model Predictions

Below, we take a look at the step model's predictions for the test set, arranged by month.

In [12]:
# The test set covers January 2007 through December 2008
Predict_ts <- ts(tempPredict, start=c(2007, 1), frequency = 12)
Predict_ts

           Jan       Feb       Mar       Apr       May       Jun       Jul
2007 0.4677808 0.4435404 0.4265541 0.4299162 0.4455113 0.4151422 0.4097367
2008 0.3522134 0.3313129 0.3142112 0.3703410 0.4162213 0.4391458 0.4237965
           Aug       Sep       Oct       Nov       Dec
2007 0.3839390 0.3255595 0.3274147 0.3231401 0.3316704
2008 0.3913679 0.3587615 0.3451991 0.3607087 0.3638076