## Exercise 5: Introduction to Regression

In [44]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

### Examples: Linear Regression With Statsmodels

Next to ```scikit-learn```, ```statsmodels``` is probably the most popular Python package for regression. While ```scikit-learn``` (covered next week) focuses on machine learning applications, ```statsmodels```, as the name suggests, is more statistics-oriented. We briefly present the basic functionality of its regression functions by revisiting the Iris data set.

In [45]:
# read in data
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"])
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


#### Example 1: Bivariate Prediction
We want to fit a regression model that estimates sepal length from sepal width.

In [46]:
# specify predictors X and target Y
y = df.sepal_length
X = df.sepal_width
#print(X[:10])
# most importantly: we have to add a constant term to estimate the intercept
X = sm.add_constant(X)
X[:10]



Unnamed: 0,const,sepal_width
0,1.0,3.5
1,1.0,3.0
2,1.0,3.2
3,1.0,3.1
4,1.0,3.6
5,1.0,3.9
6,1.0,3.4
7,1.0,3.4
8,1.0,2.9
9,1.0,3.1


In [47]:
# initialize model: OLS = ordinary least squares
model = sm.OLS(y,X)
# fit model: only now are the model parameters actually computed
results = model.fit()

# print a summary, i.e. an overview on parameters and diagnostics
results.summary()

0,1,2,3
Dep. Variable:,sepal_length,R-squared:,0.012
Model:,OLS,Adj. R-squared:,0.005
Method:,Least Squares,F-statistic:,1.792
Date:,"Sun, 04 Sep 2022",Prob (F-statistic):,0.183
Time:,19:18:51,Log-Likelihood:,-183.14
No. Observations:,150,AIC:,370.3
Df Residuals:,148,BIC:,376.3
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.4812,0.481,13.466,0.000,5.530,7.432
sepal_width,-0.2089,0.156,-1.339,0.183,-0.517,0.099

0,1,2,3
Omnibus:,4.455,Durbin-Watson:,0.941
Prob(Omnibus):,0.108,Jarque-Bera (JB):,4.252
Skew:,0.356,Prob(JB):,0.119
Kurtosis:,2.585,Cond. No.,24.3


In [48]:
# get parameters of model, i.e. beta_0 and beta_1
params = results.params
params

const          6.481223
sepal_width   -0.208870
dtype: float64

In [49]:
# we can apply parameters to obtain the predictions of Y based on X
print(np.dot(X,params))
# unsurprisingly, statsmodels also provides a direct prediction function:
results.predict(X)

[5.75017718 5.85461233 5.81283827 5.8337253  5.72929015 5.66662906
 5.77106421 5.77106421 5.87549936 5.8337253  5.70840312 5.77106421
 5.85461233 5.85461233 5.64574204 5.56219392 5.66662906 5.75017718
 5.68751609 5.68751609 5.77106421 5.70840312 5.72929015 5.79195124
 5.77106421 5.85461233 5.77106421 5.75017718 5.77106421 5.81283827
 5.8337253  5.77106421 5.62485501 5.60396798 5.8337253  5.81283827
 5.75017718 5.8337253  5.85461233 5.77106421 5.75017718 6.00082154
 5.81283827 5.75017718 5.68751609 5.85461233 5.68751609 5.81283827
 5.70840312 5.79195124 5.81283827 5.81283827 5.8337253  6.00082154
 5.89638639 5.89638639 5.79195124 5.97993451 5.87549936 5.91727342
 6.06348262 5.85461233 6.02170856 5.87549936 5.87549936 5.8337253
 5.85461233 5.91727342 6.02170856 5.95904748 5.81283827 5.89638639
 5.95904748 5.89638639 5.87549936 5.85461233 5.89638639 5.85461233
 5.87549936 5.93816045 5.97993451 5.97993451 5.91727342 5.91727342
 5.85461233 5.77106421 5.8337253  6.00082154 5.85461233 5.95904

0      5.750177
1      5.854612
2      5.812838
3      5.833725
4      5.729290
         ...   
145    5.854612
146    5.959047
147    5.854612
148    5.771064
149    5.854612
Length: 150, dtype: float64

#### Example 2: Multivariate Regression
Now we want to use all other numerical columns in the data to estimate sepal length.

In [50]:
# statsmodels also provides a formula syntax, which requires an additional import
from statsmodels.formula.api import ols

# formula syntax: dependent variable ~ predictor1 + predictor2 +.....
# note that intercept is fit automatically
model = ols("sepal_length ~ sepal_width + petal_width + petal_length", data=df)

results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,sepal_length,R-squared:,0.859
Model:,OLS,Adj. R-squared:,0.856
Method:,Least Squares,F-statistic:,297.0
Date:,"Sun, 04 Sep 2022",Prob (F-statistic):,6.28e-62
Time:,19:18:56,Log-Likelihood:,-37.0
No. Observations:,150,AIC:,82.0
Df Residuals:,146,BIC:,94.04
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.8451,0.250,7.368,0.000,1.350,2.340
sepal_width,0.6549,0.067,9.823,0.000,0.523,0.787
petal_width,-0.5626,0.127,-4.426,0.000,-0.814,-0.311
petal_length,0.7111,0.057,12.560,0.000,0.599,0.823

0,1,2,3
Omnibus:,0.265,Durbin-Watson:,2.053
Prob(Omnibus):,0.876,Jarque-Bera (JB):,0.432
Skew:,0.003,Prob(JB):,0.806
Kurtosis:,2.737,Cond. No.,54.7


In [51]:
# formula syntax: dependent variable ~ predictor1 + predictor2 +.....
# note that intercept is fit automatically

# add interaction and squared term
model = ols("sepal_length ~ sepal_width:petal_width + np.square(petal_length) + sepal_width + petal_width + petal_length", data=df)

results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,sepal_length,R-squared:,0.866
Model:,OLS,Adj. R-squared:,0.862
Method:,Least Squares,F-statistic:,186.9
Date:,"Sun, 04 Sep 2022",Prob (F-statistic):,4.2e-61
Time:,19:18:57,Log-Likelihood:,-33.031
No. Observations:,150,AIC:,78.06
Df Residuals:,144,BIC:,96.13
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.0983,0.476,4.408,0.000,1.157,3.039
sepal_width:petal_width,-0.1255,0.100,-1.250,0.213,-0.324,0.073
np.square(petal_length),0.0333,0.013,2.571,0.011,0.008,0.059
sepal_width,0.6790,0.123,5.541,0.000,0.437,0.921
petal_width,-0.1154,0.344,-0.335,0.738,-0.796,0.565
petal_length,0.4470,0.117,3.821,0.000,0.216,0.678

0,1,2,3
Omnibus:,0.119,Durbin-Watson:,1.996
Prob(Omnibus):,0.942,Jarque-Bera (JB):,0.276
Skew:,0.021,Prob(JB):,0.871
Kurtosis:,2.794,Cond. No.,483.0


### Task 1: Fitting an Artificial Data Set

We want to implement OLS regression and test it on artificial data. Thus, in this task you may not yet use the `statsmodels` functions (except for checking results).

#### a) Creating artificial data
Create an artificial dataset which consists of:
* a vector $x$ consisting of 100 (float) values between 0 and 1
* a vector $y = 10x +\varepsilon$, in which for each element, the error $\varepsilon_i$ is drawn from the standard normal distribution.

Create a scatterplot of $x$ against $y$!

In [74]:
x = np.random.uniform(low=0.0, high=1.0, size=100)
epsilon = np.random.normal(size = 100)


y = [(10*x_i + epsilon_i) for x_i, epsilon_i in zip(x, epsilon)]
y

[8.412088965518262,
 4.17011724715246,
 0.4527071424380372,
 1.868904928438841,
 3.370402551314641,
 0.1705221819044932,
 2.95697362988258,
 8.421763269007402,
 2.695997168686625,
 9.45794350537482,
 1.4088431884753092,
 1.2751644177536872,
 3.9237935404096937,
 6.271944018973317,
 0.934401160745794,
 5.5065755345362835,
 4.836801118208511,
 1.3633437623694198,
 6.545386493653456,
 2.3619529503234737,
 2.1725995715798665,
 8.914722892702084,
 2.2176079084138,
 0.6357530530472852,
 7.473907486236018,
 5.275582821715281,
 3.3368497894646705,
 7.384308246352183,
 4.755261360868804,
 0.19071340408594128,
 7.095914476183195,
 7.10710088433164,
 8.653637497790118,
 2.236149128207831,
 3.1168809058913642,
 4.920139510448155,
 8.33700428778513,
 8.282358736406888,
 9.85396649976619,
 -1.409707415192136,
 2.42663710128305,
 7.824957680120177,
 1.8534663899025299,
 3.9492977024818536,
 6.005462071189136,
 1.5844458723125854,
 6.242034554934992,
 8.509606309625353,
 4.329450644072513,
 6.08882710
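
The cell above does not yet produce the requested scatterplot. A minimal sketch of how the data generation and the plot could look, assuming `matplotlib` is available (it is not imported elsewhere in this notebook); the seed and the names `x_demo`, `eps_demo`, and `y_demo` are illustrative choices, not part of the original solution:

```python
import numpy as np
import matplotlib.pyplot as plt

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)

x_demo = rng.uniform(low=0.0, high=1.0, size=100)  # 100 floats in [0, 1)
eps_demo = rng.normal(size=100)                    # standard normal errors
y_demo = 10 * x_demo + eps_demo                    # vectorized, no explicit loop

# Scatterplot of x against y
fig, ax = plt.subplots()
ax.scatter(x_demo, y_demo)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Artificial data: y = 10x + noise")
```

Working with NumPy arrays instead of a Python list keeps `y_demo` directly usable in later matrix operations.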

#### b) Implementing OLS regression
Write a function that takes as input a numpy vector of target values $y$, and a matrix of predictors $X$, and returns the parameter vector $\beta$ resulting from OLS regression. Apply this function to fit a model on your artificial data, compute the predictions, and add the resulting regression line to the plot from a). Remember to add a constant term!

In [82]:
def linear_regression(y, X):
    X_with_constant = sm.add_constant(X)
    #print(X_with_constant)
    model = sm.OLS(y, X_with_constant)
    results = model.fit()
    print(results.summary())
    return results.params


In [83]:
linear_regression(y, x)

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.895
Model:                            OLS   Adj. R-squared:                  0.894
Method:                 Least Squares   F-statistic:                     837.0
Date:                Sun, 04 Sep 2022   Prob (F-statistic):           8.51e-50
Time:                        19:31:31   Log-Likelihood:                -132.65
No. Observations:                 100   AIC:                             269.3
Df Residuals:                      98   BIC:                             274.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0850      0.189      0.449      0.6

array([0.08495516, 9.90423412])
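
Note that the function above delegates the fit to `statsmodels`, which this task only permits for checking results. A NumPy-only version could look like the following sketch. It assumes the usual least-squares criterion, whose solution satisfies the normal equations $X^\top X \beta = X^\top y$; the helper name `ols_numpy` is hypothetical.

```python
import numpy as np

def ols_numpy(y, X):
    """Return the OLS coefficients [intercept, slope(s)] (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        # a single predictor vector becomes an (n, 1) matrix
        X = X.reshape(-1, 1)
    # prepend a column of ones so that the intercept is estimated as well
    X_design = np.column_stack([np.ones(len(X)), X])
    # least-squares solution; numerically safer than inverting X^T X explicitly
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

# Sanity check on noise-free data with known intercept 2 and slope 3
x_check = np.linspace(0.0, 1.0, 20)
y_check = 2.0 + 3.0 * x_check
beta_check = ols_numpy(y_check, x_check)
```

Predictions then follow as the design matrix times the returned coefficients, and the result can be compared against the `statsmodels` parameters shown above.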

#### c) Diagnostics 1: The Bootstrap

Based on your regression function, write a function that again takes as input a predictor matrix $X$ and a target column $y$, plus an integer $N$, and bootstraps the data $N$ times to compute the 95% confidence interval for each parameter. Specifically, return one vector containing the lower bounds and one vector containing the upper bounds of the intervals for the $\beta$ values.
Apply this function to estimate the confidence intervals on our artificial data.
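
One possible approach is the percentile bootstrap: resample the rows with replacement, refit the model on each resample, and take the 2.5% and 97.5% percentiles of the resulting coefficients. The sketch below assumes this variant; the helper names `fit_ols` and `bootstrap_ci`, the optional `seed` argument, and the demo data are illustrative and not prescribed by the task.

```python
import numpy as np

def fit_ols(y, X):
    """Least-squares fit with an added constant column (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, np.asarray(y, dtype=float), rcond=None)
    return beta

def bootstrap_ci(X, y, n_boot, seed=None):
    """Return (lower, upper) 95% percentile-bootstrap bounds per coefficient."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    betas = []
    for _ in range(n_boot):
        # draw n row indices with replacement and refit on the resample
        idx = rng.integers(0, n, size=n)
        betas.append(fit_ols(y[idx], X[idx]))
    betas = np.asarray(betas)
    lower = np.percentile(betas, 2.5, axis=0)
    upper = np.percentile(betas, 97.5, axis=0)
    return lower, upper

# Demo on freshly generated artificial data (seed values are arbitrary)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 1.0, size=100)
y_demo = 10 * x_demo + rng.normal(size=100)
lower, upper = bootstrap_ci(x_demo, y_demo, n_boot=500, seed=1)
```

The first entry of `lower` and `upper` bounds the intercept, the second the slope.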

#### d) Diagnostics 2: The $R^2$ score.

Write a function that takes as input a ground truth vector y, its prediction y_hat, and computes the $R^2$ value of that prediction! Does your model explain most of the variance in the artificial data?
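
As a sketch, assuming the standard definition $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, such a function could look as follows (the name `r_squared` and the check values are illustrative):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for ground truth y and prediction y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y_true, y_true)                      # 1.0
baseline = r_squared(y_true, np.full(4, y_true.mean()))  # 0.0
```

A perfect prediction yields 1, while always predicting the mean of $y$ yields 0, so values close to 1 indicate that the model explains most of the variance.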

### Task 2: Predicting Student Performance

We revisit the student performance dataset from the exercise two weeks ago, and aim to estimate the exam performance in math. In this task you may use statsmodels!

#### a) Data Preprocessing

Load the student performance data into a dataframe. Since we want to estimate students' performance in math, separate this column from the dataframe. In the remaining columns, dummy-code all categorical columns as explained in the lecture. Further, check for collinearity: if a pair of columns is highly correlated (i.e. Pearson correlation > 0.9), remove one of the two columns from the predictors. Remember to add a constant term afterwards.

In [104]:
df = pd.read_csv("StudentsPerformance.csv")

# replace whitespace and "/" in column names with underscores
df.columns = ["_".join(column.replace("/", " ").split()) for column in df.columns]
df

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
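
The remaining preprocessing steps could be sketched as follows. To keep the example self-contained it uses a tiny made-up frame `toy` rather than the real file, so all values and the `_toy` names are hypothetical; only the column names mirror the data above.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the real data (values are made up)
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard",
              "free/reduced", "standard", "free/reduced"],
    "reading_score": [72, 57, 95, 78, 55, 90],
    "writing_score": [74, 44, 93, 75, 55, 88],
    "math_score": [72, 47, 90, 76, 62, 69],
})

# Separate the target from the predictors
y_toy = toy["math_score"]
X_toy = toy.drop(columns="math_score")

# Dummy-code categorical columns; drop_first removes one level per variable,
# since keeping all levels would be perfectly collinear with the constant
X_toy = pd.get_dummies(X_toy, drop_first=True, dtype=float)

# Absolute Pearson correlations; keep the upper triangle so each pair appears once
corr = X_toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# For every highly correlated pair, drop the later of the two columns
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_toy = X_toy.drop(columns=to_drop)

# Add the constant term last
X_toy.insert(0, "const", 1.0)
```

In this toy frame only `writing_score` is removed, because its correlation with `reading_score` exceeds 0.9. Which column of a correlated pair to drop is a judgment call; the sketch simply drops the later one.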


#### b) Learning a simple regression model

Apply statsmodels to estimate the exam performance in math from all other columns, without any column interactions. Remember to use a constant term, and properly transform categorical variables. Which significant effects do you observe?

#### c) Adding interactions
Apply statsmodels to fit a regression model that, in addition to the previous model, includes an interaction term between the test preparation course and each of the continuous columns that remain after preprocessing. To do so, first add these columns to your predictor matrix, and then fit the corresponding model.
Does this interaction yield an improvement or rather cause problems?
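
An interaction column is simply the element-wise product of the two columns involved. A minimal sketch on a made-up miniature predictor matrix; the values and the name of the new column are hypothetical, and the dummy column name assumes the output of `pd.get_dummies` with `drop_first=True`:

```python
import pandas as pd

# Hypothetical miniature predictor matrix (made-up values)
X_toy = pd.DataFrame({
    "test_preparation_course_none": [1.0, 0.0, 1.0, 0.0],
    "reading_score": [72.0, 90.0, 95.0, 57.0],
})

# Interaction term: product of the dummy and the continuous column
X_toy["prep_none_x_reading"] = (
    X_toy["test_preparation_course_none"] * X_toy["reading_score"]
)
```

The new column equals the reading score for students without the course and zero otherwise, so its coefficient measures how the slope of the reading score differs between the two groups.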

#### d) Checking Residuals
Create a residual plot for each of your two models and interpret what you observe!
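
A residual plot shows the residuals $y - \hat{y}$ against the fitted values $\hat{y}$. The sketch below illustrates the idea on artificial data with a NumPy least-squares fit, assuming `matplotlib` is available; the data and all variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Artificial data as a stand-in (seed value is arbitrary)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 1.0, size=100)
y_demo = 10 * x_demo + rng.normal(size=100)

# Least-squares fit with a constant term
X_design = np.column_stack([np.ones_like(x_demo), x_demo])
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

fitted = X_design @ beta
residuals = y_demo - fitted

# Residuals against fitted values, with a reference line at zero
fig, ax = plt.subplots()
ax.scatter(fitted, residuals)
ax.axhline(0.0, color="black", linewidth=1)
ax.set_xlabel("fitted values")
ax.set_ylabel("residuals")
```

With a fitted `statsmodels` result, `results.fittedvalues` and `results.resid` provide the same two quantities. Residuals that scatter evenly around zero without a visible pattern suggest that the linear model is adequate.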