# Exercise 4: Simple and multiple linear regression analysis

### 1. Simple regression

Temperature data for a certain month (November 1977) is available from Falun (Dalarna), Gävle (Gästrikland) and Knon (Värmland) (file: temp_falun.txt). For Falun the data series is not complete.

We want to fill in the missing data for Falun using whichever of the following three data sets is best correlated with the Falun series:

1. Only the data from Gävle
2. Only the data from Knon
3. Both Gävle and Knon, combined using the distances to Falun (Gävle-Falun = 82 km, Knon-Falun = 110 km)

Hint: the inverse distance weighting method can be used to create the third data set.

$T_{Gavle+Knon} = \frac{\left(\frac{1}{82}\right)^2}{\left(\frac{1}{82}\right)^2 + \left(\frac{1}{110}\right)^2}T_{Gavle} + \frac{\left(\frac{1}{110}\right)^2}{\left(\frac{1}{82}\right)^2 + \left(\frac{1}{110}\right)^2}T_{Knon}$  (3)
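The weights in this equation depend only on the two distances, so they can be computed once and reused. The following is a minimal sketch of that arithmetic; the names `w_gavle`, `w_knon` and `idw_temperature` are illustrative and not part of the exercise files.

```python
# Distances from each neighbouring station to Falun, in km (from the exercise text)
d_gavle, d_knon = 82.0, 110.0

# Inverse-distance-squared weights, normalised so that they sum to 1
raw_gavle = (1.0 / d_gavle) ** 2
raw_knon = (1.0 / d_knon) ** 2
w_gavle = raw_gavle / (raw_gavle + raw_knon)
w_knon = raw_knon / (raw_gavle + raw_knon)


def idw_temperature(t_gavle, t_knon):
    """Weighted average of the two station temperatures (floats or arrays)."""
    return w_gavle * t_gavle + w_knon * t_knon


print(round(w_gavle, 3), round(w_knon, 3))  # 0.643 0.357
```

Gävle is closer to Falun and therefore gets the larger weight. In the notebook the function could be applied to whole columns, for example `idw_temperature(df['T_Gavle'], df['T_Knon'])`, assuming the Knon column is named `T_Knon`.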

#### Question 1: Compute the correlation between Falun and data sets (1), (2) and (3), and determine which one should be used as the independent variable.

#### Question 2: Calculate the regression coefficients and how much of the variance is explained by the regression model, i.e. the $R^2$ values.

#### Question 3: Test the significance of the regression coefficients.

#### Question 4: Plot the time series of the observed and calculated dependent variable, including the extended values, on the same graph.

### Procedure:


In [17]:
## Import required modules
from __future__ import division  # float division on Python 2; a __future__ import must precede all other statements
# %matplotlib inline makes plots appear directly in the notebook
%matplotlib inline
import numpy as np               # required for basic calculations
import pandas as pd              # required for data analysis (reading files)
from scipy import stats          # required for statistics
import matplotlib.pyplot as plt  # required for plotting
import statsmodels.formula.api as smf  # module to run ordinary least squares analysis

1) Read the data file and define a dataframe

2) Use the inverse distance weighting equation given above to create the third dataset, $T_{Gavle+Knon}$.

3) Note that there are no temperature observations at Falun for days 22–30, so the table has gaps for these days. Compute the correlations and regression coefficients from days 1–21 only, and use the data from the neighbouring stations to estimate the missing values.

4) Calculate the correlation between two datasets by using the Python function given below.

In [65]:
# Correlation between T_Falun and T_Gavle
temp_data = pd.read_table('temp_falun.dat')  # read the table (adjust the filename if your copy is named differently)
# pd.read_table already returns a DataFrame; wrapping it again is harmless
df = pd.DataFrame(temp_data)

# Look at the data:
df.head()
# print(df)

# Calculate the correlation between T_Falun and T_Gavle.
# r_F_G = np.corrcoef(df['T_Falun'], df['T_Gavle'])[0, 1]  # gives nan because T_Falun contains missing data
# Therefore, calculate the correlation for the first 21 days only (drop the last 9 rows).
r_F_G = np.corrcoef(df['T_Falun'][:-9], df['T_Gavle'][:-9])[0, 1] 
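
Slicing off the last nine rows works here because the gaps are all at the end of the month. A slightly more general variant, sketched below, masks out every position where either series is missing. The function name `corr_complete_pairs` and the sample values are made up for illustration and are not the course data.

```python
import numpy as np


def corr_complete_pairs(x, y):
    """Pearson r using only the positions where both series have a value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = ~np.isnan(x) & ~np.isnan(y)  # True where both values are present
    return np.corrcoef(x[mask], y[mask])[0, 1]


# Made-up values for illustration only
falun = [1.0, 2.5, 3.0, np.nan, np.nan]
gavle = [2.0, 3.0, 4.5, 5.0, 4.0]
r = corr_complete_pairs(falun, gavle)
print(r)
```

With pandas columns, `df['T_Falun'].corr(df['T_Gavle'])` should give the same result, since `Series.corr` excludes missing values.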



##### Simple Linear Regression
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:

$y = \alpha + \beta x $

What does each term represent?

$y$ is the response, or dependent variable

$x$ is the independent variable

$\alpha$ is the intercept

$\beta$ is the slope or the trend

Together, $\alpha$ and $\beta$ are called the model coefficients. To create the model, you must estimate these coefficients from the data. Once they are known, the model can be used to predict temperature.
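
For a single predictor, the least-squares estimates have a closed form: $\beta = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\alpha = \bar{y} - \beta \bar{x}$. The sketch below implements these two formulas with NumPy; the function name `fit_simple_regression` and the sample data are illustrative assumptions.

```python
import numpy as np


def fit_simple_regression(x, y):
    """Return (alpha, beta) of the least-squares line y = alpha + beta * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    beta = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    alpha = y_mean - beta * x_mean
    return alpha, beta


# Made-up data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
alpha, beta = fit_simple_regression(x, y)
print(alpha, beta)
```

Because the sample points lie exactly on a line, the fit recovers an intercept of about 1 and a slope of about 2, up to floating-point rounding.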

In [19]:
# Requires import statsmodels.formula.api as smf

# create a fitted model 
fg = smf.ols(formula='T_Falun ~ T_Gavle', data=df).fit()

# print the coefficients
# print(fg.params)
# print summary statistics
# print(fg.summary())
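
Question 3 asks for a significance test of the regression coefficients. One common approach is a two-sided t-test of the slope against zero with $n - 2$ degrees of freedom. The sketch below runs that test on synthetic data (not the course data) with `scipy.stats.linregress` and reproduces its p-value by hand.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration: 21 points with a clear linear trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 21)
y = 0.5 + 0.9 * x + rng.normal(scale=0.5, size=x.size)

res = stats.linregress(x, y)

# t statistic for H0: slope = 0, with n - 2 degrees of freedom
t_stat = res.slope / res.stderr
dof = x.size - 2
p_manual = 2 * stats.t.sf(abs(t_stat), dof)

# Reject H0 at the 5 % level if the p-value is below 0.05
significant = p_manual < 0.05
print(t_stat, p_manual, significant)
```

For the statsmodels fit in the cell above, the corresponding quantities are available as `fg.tvalues` and `fg.pvalues`, and they also appear in the coefficient table of `fg.summary()`.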

When you repeat the process for all three datasets, you will get three linear regression models of the form:

$ y = \alpha + \beta x$ 

(insert $\alpha$ and $\beta$ into the three models)

##### Evaluating the Regression

For the evaluation the coefficient of determination $R^2$ is used. It is defined as

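$ R^2 = \frac{\text{explained variance}}{\text{total variance}} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} $

where $\hat{y}_i = \alpha + \beta x_i$ is the value given by the model and $\bar{y}$ is the mean of the observed values $y_i$.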

You'll find this value in the upper right corner of the summary table. Select the model with the highest $R^2$ for the next part of the analysis.
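
As a cross-check on the summary table, $R^2$ can also be computed directly from the residuals as $1 - SS_{res}/SS_{tot}$. For simple linear regression with an intercept it equals the squared correlation coefficient. The sketch below uses made-up numbers, and the helper `r_squared` is an illustrative name.

```python
import numpy as np


def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

beta, alpha = np.polyfit(x, y, 1)  # polyfit returns the slope first, then the intercept
r2 = r_squared(y, alpha + beta * x)
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)  # the two values agree
```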


### Using the Model for Prediction

The missing data at Falun can be calculated from one of your three regression models (choose the best one). 
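
One way to build the extended series is to keep the observed values where they exist and use the model prediction only where the observation is missing. The sketch below shows this with `np.where`. The function name `fill_gaps`, the coefficients and the sample values are placeholders, not results from the exercise.

```python
import numpy as np


def fill_gaps(observed, predictor, alpha, beta):
    """Keep observed values; replace NaN gaps with alpha + beta * predictor."""
    observed = np.asarray(observed, dtype=float)
    predicted = alpha + beta * np.asarray(predictor, dtype=float)
    return np.where(np.isnan(observed), predicted, observed)


# Placeholder values for illustration only
obs = [1.0, 2.0, np.nan]
pred = [0.5, 1.5, 2.5]
filled = fill_gaps(obs, pred, alpha=0.5, beta=1.0)
print(filled)  # [1. 2. 3.]
```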


In [56]:
# Select the temperature series at Gavle. Note that we need all 30 values in this task.
T_Gavle = df['T_Gavle']

# Take the coefficients from the fitted model (here fg, i.e. T_Falun ~ T_Gavle).
# If another model fits better, use its params and its predictor instead.
alpha = fg.params['Intercept']
beta = fg.params['T_Gavle']

# Predict the temperature at Falun from the linear regression model:
T_f_G = alpha + beta * T_Gavle

# Plot the complete time series from the regression model against Falun, which contains missing data:
# plt.plot(T_f_G)
# plt.plot(df['T_Falun'])
# plt.legend(['T_f_G', 'T_Falun'])

### 2. Multiple linear regression

a) The file multidata.txt contains a number of numerical variables. Choose Y as the dependent variable and X1, X2, X3 as independent variables. Perform a forward stepwise multiple regression and also a standard multiple regression.

In a forward stepwise multiple regression, start with a simple regression on the independent variable that is best correlated with the dependent variable. Then add a second independent variable, choosing the one with the highest partial correlation with the dependent variable once the influence of the first independent variable has been removed. Continue this procedure to see whether adding a third independent variable helps.

In a standard multiple regression, all the independent variables are used in the regression model. By analysing the result of the regression, you can find out whether some independent variables do not contribute significantly. If there are any, remove them from the model and redo the regression with only the significant independent variables.

b) Present in each case the $R^2$ values and the regression equations.

c) For the forward stepwise method, also present your F-test results (use $\alpha = 5\%$).

d) What are your conclusions?
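
The F-test in the forward stepwise procedure compares a model with and without the candidate variable: $F = \frac{RSS_{small} - RSS_{big}}{RSS_{big} / (n - p)}$, where $p$ is the number of parameters in the larger model. The sketch below implements this partial F-test on synthetic data (not multidata.txt); `rss` and `partial_f_test` are illustrative names.

```python
import numpy as np
from scipy import stats


def rss(y, predictors):
    """Residual sum of squares of an OLS fit of y on an intercept plus predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def partial_f_test(y, kept, candidate, alpha=0.05):
    """F-test for adding one candidate predictor to the predictors already kept."""
    rss_small = rss(y, kept)
    rss_big = rss(y, kept + [candidate])
    dof = len(y) - (len(kept) + 2)  # n minus parameters of the larger model
    f_stat = (rss_small - rss_big) / (rss_big / dof)
    f_crit = stats.f.ppf(1 - alpha, 1, dof)
    return f_stat, f_crit, bool(f_stat > f_crit)


# Synthetic data: y depends on x1 only; x2 is unrelated noise
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

f1, crit1, add_x1 = partial_f_test(y, [], x1)    # step 1: intercept only vs. x1
f2, crit2, add_x2 = partial_f_test(y, [x1], x2)  # step 2: x1 vs. x1 + x2
print(f1, crit1, add_x1)
print(f2, crit2, add_x2)
```

A variable is kept when its F statistic exceeds the critical value. With fitted statsmodels models, the same comparison should be available through `compare_f_test`, for example `ols2.compare_f_test(ols1)`.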

In [64]:
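multi = pd.read_table('multidata.txt')  # read in the data
# Define the dataframe (note: this overwrites the temperature dataframe df from part 1)
df = pd.DataFrame(multi)
df.head()

# First, find the correlation between Y and each of the X variables and determine which X has the greatest
# correlation with Y. Use that X in the first model.


# Then create three fitted models. The order X1, X2, X3 below is only a placeholder:
# enter the variables in the order that your correlations and partial correlations suggest.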
ols1 = smf.ols(formula='Y ~ X1', data=multi).fit()
ols2 = smf.ols(formula='Y ~ X1 + X2', data=multi).fit()
ols3 = smf.ols(formula='Y ~ X1 + X2 + X3', data=multi).fit()

# Compare the R^2 value for the three models.
#print(ols1.summary())

# Read the F-statistic off the summary table, and perform the F test (alpha = 0.05).
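
The partial correlation used to pick the second variable can be obtained by regressing both Y and the candidate on the variable already in the model and correlating the two sets of residuals. The sketch below shows this on synthetic data (not multidata.txt); `residuals` and `partial_corr` are illustrative names.

```python
import numpy as np


def residuals(target, controls):
    """Residuals of an OLS fit of target on an intercept plus the control variables."""
    X = np.column_stack([np.ones(len(target))] + list(controls))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef


def partial_corr(y, x, controls):
    """Correlation between y and x after removing the influence of the controls."""
    return np.corrcoef(residuals(y, controls), residuals(x, controls))[0, 1]


# Synthetic data: x2 is correlated with x1, but y depends on x1 only
rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

r_plain = np.corrcoef(y, x2)[0, 1]     # ordinary correlation of y and x2
r_partial = partial_corr(y, x2, [x1])  # the same with x1 held fixed
print(r_plain, r_partial)
```

Comparing the two printed values shows how much of the apparent relationship between `y` and `x2` comes only from their shared dependence on `x1`.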