## Exercise 5: Introduction to Regression

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

### Examples: Linear Regression With Statsmodels

Next to ```scikit-learn```, ```statsmodels``` is probably the most popular Python package for regression. While ```scikit-learn``` (covered next week) focuses on machine learning applications, ```statsmodels``` (as the name suggests) is more statistics-oriented. We briefly present the basic functionality of its regression functions by revisiting the Iris data set.

In [2]:
# read in data
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    names=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
)
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


#### Example 1: Bivariate Prediction
We want to fit a regression model that estimates sepal length from sepal width.

In [3]:
# specify predictors X and target Y
y = df.sepal_length
X = df.sepal_width
#print(X[:10])
# most importantly: we have to add a constant term to estimate the intercept
X = sm.add_constant(X)
X[:10]


Unnamed: 0,const,sepal_width
0,1.0,3.5
1,1.0,3.0
2,1.0,3.2
3,1.0,3.1
4,1.0,3.6
5,1.0,3.9
6,1.0,3.4
7,1.0,3.4
8,1.0,2.9
9,1.0,3.1


In [4]:
# initialize model: OLS = ordinary least squares
model = sm.OLS(y,X)
# fit model: only now the model, i.e. its parameters, is computed
results = model.fit()

# print a summary, i.e. an overview on parameters and diagnostics
results.summary()

0,1,2,3
Dep. Variable:,sepal_length,R-squared:,0.012
Model:,OLS,Adj. R-squared:,0.005
Method:,Least Squares,F-statistic:,1.792
Date:,"Mon, 05 Sep 2022",Prob (F-statistic):,0.183
Time:,14:37:39,Log-Likelihood:,-183.14
No. Observations:,150,AIC:,370.3
Df Residuals:,148,BIC:,376.3
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.4812,0.481,13.466,0.000,5.530,7.432
sepal_width,-0.2089,0.156,-1.339,0.183,-0.517,0.099

0,1,2,3
Omnibus:,4.455,Durbin-Watson:,0.941
Prob(Omnibus):,0.108,Jarque-Bera (JB):,4.252
Skew:,0.356,Prob(JB):,0.119
Kurtosis:,2.585,Cond. No.,24.3


In [5]:
# get parameters of model, i.e. beta_0 and beta_1
params = results.params
params

const          6.481223
sepal_width   -0.208870
dtype: float64

In [6]:
# we can apply parameters to obtain the predictions of Y based on X
print(np.dot(X,params))
# unsurprisingly, statsmodels also provides a direct prediction function:
results.predict(X)

[5.75017718 5.85461233 5.81283827 5.8337253  5.72929015 5.66662906
 5.77106421 5.77106421 5.87549936 5.8337253  5.70840312 5.77106421
 5.85461233 5.85461233 5.64574204 5.56219392 5.66662906 5.75017718
 5.68751609 5.68751609 5.77106421 5.70840312 5.72929015 5.79195124
 5.77106421 5.85461233 5.77106421 5.75017718 5.77106421 5.81283827
 5.8337253  5.77106421 5.62485501 5.60396798 5.8337253  5.81283827
 5.75017718 5.8337253  5.85461233 5.77106421 5.75017718 6.00082154
 5.81283827 5.75017718 5.68751609 5.85461233 5.68751609 5.81283827
 5.70840312 5.79195124 5.81283827 5.81283827 5.8337253  6.00082154
 5.89638639 5.89638639 5.79195124 5.97993451 5.87549936 5.91727342
 6.06348262 5.85461233 6.02170856 5.87549936 5.87549936 5.8337253
 5.85461233 5.91727342 6.02170856 5.95904748 5.81283827 5.89638639
 5.95904748 5.89638639 5.87549936 5.85461233 5.89638639 5.85461233
 5.87549936 5.93816045 5.97993451 5.97993451 5.91727342 5.91727342
 5.85461233 5.77106421 5.8337253  6.00082154 5.85461233 5.95904

0      5.750177
1      5.854612
2      5.812838
3      5.833725
4      5.729290
         ...   
145    5.854612
146    5.959047
147    5.854612
148    5.771064
149    5.854612
Length: 150, dtype: float64

#### Example 2: Multivariate Regression
Now we include all other numerical columns from the data to estimate sepal length.

In [7]:
# statsmodels also provides a formula syntax, which requires an additional import
from statsmodels.formula.api import ols

# formula syntax: dependent variable ~ predictor1 + predictor2 +.....
# note that intercept is fit automatically
model = ols("sepal_length ~ sepal_width + petal_width + petal_length", data=df)

results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,sepal_length,R-squared:,0.859
Model:,OLS,Adj. R-squared:,0.856
Method:,Least Squares,F-statistic:,297.0
Date:,"Mon, 05 Sep 2022",Prob (F-statistic):,6.28e-62
Time:,14:37:47,Log-Likelihood:,-37.0
No. Observations:,150,AIC:,82.0
Df Residuals:,146,BIC:,94.04
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.8451,0.250,7.368,0.000,1.350,2.340
sepal_width,0.6549,0.067,9.823,0.000,0.523,0.787
petal_width,-0.5626,0.127,-4.426,0.000,-0.814,-0.311
petal_length,0.7111,0.057,12.560,0.000,0.599,0.823

0,1,2,3
Omnibus:,0.265,Durbin-Watson:,2.053
Prob(Omnibus):,0.876,Jarque-Bera (JB):,0.432
Skew:,0.003,Prob(JB):,0.806
Kurtosis:,2.737,Cond. No.,54.7


In [8]:
# formula syntax: dependent variable ~ predictor1 + predictor2 +.....
# note that intercept is fit automatically

# add interaction and squared term
model = ols("sepal_length ~ sepal_width:petal_width + np.square(petal_length) + sepal_width + petal_width + petal_length", data=df)

results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,sepal_length,R-squared:,0.866
Model:,OLS,Adj. R-squared:,0.862
Method:,Least Squares,F-statistic:,186.9
Date:,"Mon, 05 Sep 2022",Prob (F-statistic):,4.2e-61
Time:,14:37:48,Log-Likelihood:,-33.031
No. Observations:,150,AIC:,78.06
Df Residuals:,144,BIC:,96.13
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.0983,0.476,4.408,0.000,1.157,3.039
sepal_width:petal_width,-0.1255,0.100,-1.250,0.213,-0.324,0.073
np.square(petal_length),0.0333,0.013,2.571,0.011,0.008,0.059
sepal_width,0.6790,0.123,5.541,0.000,0.437,0.921
petal_width,-0.1154,0.344,-0.335,0.738,-0.796,0.565
petal_length,0.4470,0.117,3.821,0.000,0.216,0.678

0,1,2,3
Omnibus:,0.119,Durbin-Watson:,1.996
Prob(Omnibus):,0.942,Jarque-Bera (JB):,0.276
Skew:,0.021,Prob(JB):,0.871
Kurtosis:,2.794,Cond. No.,483.0


### Task 1: Fitting an Artificial Data Set

We want to implement OLS regression and test it on artificial data. Thus, in this task you may not yet use the `statsmodels` functions (except for checking results).

#### a) Creating artificial data
Create an artificial dataset which consists of:
* a vector $x$ consisting of 100 (float) values between 0 and 1
* a vector $y = 10x +\varepsilon$, in which for each element, the error $\varepsilon_i$ is drawn from the standard normal distribution.

Create a scatterplot of $x$ against $y$!

In [9]:
x = np.random.uniform(low=0.0, high=1.0, size=100)
epsilon = np.random.normal(size = 100)


y = [(10*x_i + epsilon_i) for x_i, epsilon_i in zip(x, epsilon)]
y

[6.553887848330939,
 5.121214556384347,
 4.611073660495185,
 5.5871815218713,
 8.905864799805864,
 7.326275992098562,
 4.5921025725629345,
 2.081637478020256,
 5.299979287757626,
 8.216541000490327,
 2.328565452643354,
 3.577764838921698,
 4.848785469347599,
 2.9475583705610067,
 3.1463608551947546,
 2.5867941633328293,
 3.179649600229103,
 7.472262236356768,
 2.910886020051016,
 9.187123211096065,
 -0.1818798134218591,
 5.754848710618067,
 4.99725287534912,
 4.57713558148934,
 1.838126858031456,
 9.369777042267431,
 8.622959909710978,
 4.768527095291223,
 8.015959048008765,
 2.9877712232931666,
 6.96342202274414,
 7.563578040572811,
 8.942792373550304,
 5.538053564901247,
 4.1626231742352156,
 9.60224686837955,
 10.044940881092192,
 8.299408515445817,
 5.371630928499645,
 0.7286190862015345,
 5.217041502121383,
 3.408079061800854,
 8.266610907304665,
 4.060885697504838,
 5.919819008646078,
 7.3250219841401325,
 1.2505312151186416,
 7.219168613485516,
 6.147233254697265,
 7.08945498152
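
The cell above builds $y$ with a list comprehension and does not yet draw the scatterplot the task asks for. A minimal sketch of both steps follows, assuming `matplotlib` is installed (it is not imported elsewhere in this notebook). The seed value is arbitrary and only makes the run reproducible.

```python
import numpy as np
import matplotlib.pyplot as plt

# Seeded generator so repeated runs give the same data (seed chosen arbitrarily)
rng = np.random.default_rng(seed=0)

x = rng.uniform(low=0.0, high=1.0, size=100)  # 100 floats in [0, 1)
epsilon = rng.standard_normal(size=100)       # errors drawn from N(0, 1)
y = 10 * x + epsilon                          # vectorized, element by element

# Scatterplot of x against y
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Artificial data: y = 10x + noise")
```

Keeping `y` as a NumPy array rather than a Python list also makes later steps such as index-based resampling straightforward.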

#### b) Implementing OLS regression
Write a function that takes as input a numpy vector of target values $y$, and a matrix of predictors $X$, and returns the parameter vector $\beta$ resulting from OLS regression. Apply this function to fit a model on your artificial data, compute the predictions, and add the resulting regression line to the plot from a). Remember to add a constant term!

In [10]:
def linear_regression(y, X):
    X_with_constant = sm.add_constant(X)
    #print(X_with_constant)
    model = sm.OLS(y, X_with_constant)
    results = model.fit()
    print(results.summary())
    return results.params


In [11]:
linear_regression(y, x)

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.856
Method:                 Least Squares   F-statistic:                     590.6
Date:                Mon, 05 Sep 2022   Prob (F-statistic):           2.80e-43
Time:                        14:37:58   Log-Likelihood:                -143.44
No. Observations:                 100   AIC:                             290.9
Df Residuals:                      98   BIC:                             296.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4921      0.231      2.134      0.0

array([0.49212575, 9.23592933])
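
The function above delegates the fit to `sm.OLS`, which this task reserves for checking results. As a hedged alternative, the sketch below solves the least-squares problem $\min_\beta \lVert y - X\beta \rVert^2$ directly with NumPy; for a design matrix with full column rank this matches the normal-equation solution $\beta = (X^\top X)^{-1} X^\top y$. The names `ols_fit` and `ols_predict` are hypothetical helpers, not part of any library.

```python
import numpy as np

def ols_fit(y, X):
    """Return OLS coefficients [beta_0, beta_1, ...]; an intercept column is added."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # treat a single predictor as a one-column matrix
    X_design = np.column_stack([np.ones(len(X)), X])  # prepend the constant term
    # lstsq minimizes ||y - X_design @ beta||^2 and is numerically more stable
    # than explicitly inverting X^T X
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def ols_predict(X, beta):
    """Apply fitted coefficients to new predictors."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    return beta[0] + X @ beta[1:]

# Sanity check on noise-free data with known intercept 2 and slope 3
x_demo = np.linspace(0.0, 1.0, 50)
y_demo = 2.0 + 3.0 * x_demo
beta_demo = ols_fit(y_demo, x_demo)
```

On the artificial data from a), `ols_fit(y, x)` should agree with the `statsmodels` parameters up to floating-point error, and `ols_predict(x, beta)` gives the points of the regression line.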

#### c) Diagnostics 1: The Bootstrap

Based on your regression function, write a function that again takes as input a predictor matrix $X$ and a target column $y$, plus an integer $N$, and bootstraps the data $N$ times to compute the 95% confidence interval for each parameter. Specifically, return one vector containing the lower bound of each $\beta$ value, and one vector containing the upper bounds.
Apply this function to estimate the confidence intervals on our artificial data.
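
One possible approach is the percentile bootstrap: resample the rows with replacement $N$ times, refit the model on each resample, and take the 2.5th and 97.5th percentiles of each coefficient. The sketch below assumes this variant (other bootstrap intervals exist). `fit_ols` and `bootstrap_ci` are hypothetical names, and the `seed` argument is an addition for reproducibility.

```python
import numpy as np

def fit_ols(y, X):
    """Least-squares fit with an intercept column prepended."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def bootstrap_ci(X, y, N, seed=None):
    """Percentile-bootstrap 95% interval for each coefficient.

    Returns (lower, upper), each with one entry per coefficient.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    betas = []
    for _ in range(N):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        betas.append(fit_ols(y[idx], X[idx]))
    betas = np.array(betas)               # shape (N, number of coefficients)
    lower = np.percentile(betas, 2.5, axis=0)
    upper = np.percentile(betas, 97.5, axis=0)
    return lower, upper

# Demo on freshly generated artificial data (true intercept 0, true slope 10)
rng_demo = np.random.default_rng(0)
x_demo = rng_demo.uniform(size=100)
y_demo = 10 * x_demo + rng_demo.standard_normal(size=100)
lower, upper = bootstrap_ci(x_demo, y_demo, N=200, seed=1)
```

Rows are resampled jointly (the same indices for `X` and `y`), so each predictor value stays paired with its own target.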

#### d) Diagnostics 2: The $R^2$ score.

Write a function that takes as input a ground truth vector $y$ and its prediction $\hat{y}$, and computes the $R^2$ value of that prediction! Does your model explain most of the variance in the artificial data?
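
As a starting point, the usual definition is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, i.e. one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch follows; the function name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])    # prediction equals the truth
mean_only = r_squared([1, 2, 3], [2, 2, 2])  # always predicting the mean
```

A perfect prediction yields 1, while always predicting the mean of $y$ yields 0, which gives two easy reference points for interpreting the score.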

### Task 2: Predicting Student Performance

We revisit the student performance dataset from the exercise two weeks ago, and aim to estimate the exam performance in math. In this task you may use statsmodels!

#### a) Data Preprocessing

Load the student performance data into a dataframe. Since we want to estimate students' performance in math, separate this column from the dataframe. In the remaining columns, dummy-code all categorical columns as explained in the lecture. Further, check for collinearities: if a pair of columns is highly correlated (i.e. absolute Pearson correlation > 0.9), remove one of them from the predictors. Remember to add a constant term afterwards.

In [12]:
df = pd.read_csv("StudentsPerformance.csv")

#for column in df.columns:
#    column = "_".join(column.split())
#    print(column)

#Renaming/modifying columns
df.columns = ["_".join(column.split()) for column in df.columns]
df.columns = ["_".join(column.split("/")) for column in df.columns]
df

pd_dummies = pd.get_dummies(df, prefix=['gender', 'race_ethnicity', 'lunch', 'test_preparation_course', 'test_preparation_course'],drop_first = True)
pd_dummies.columns = ["_".join(column.split()) for column in pd_dummies.columns]
pd_dummies

Unnamed: 0,math_score,reading_score,writing_score,gender_male,race_ethnicity_group_B,race_ethnicity_group_C,race_ethnicity_group_D,race_ethnicity_group_E,lunch_bachelor's_degree,lunch_high_school,lunch_master's_degree,lunch_some_college,lunch_some_high_school,test_preparation_course_standard,test_preparation_course_none
0,72,72,74,0,1,0,0,0,1,0,0,0,0,1,1
1,69,90,88,0,0,1,0,0,0,0,0,1,0,1,0
2,90,95,93,0,1,0,0,0,0,0,1,0,0,1,1
3,47,57,44,1,0,0,0,0,0,0,0,0,0,0,1
4,76,78,75,1,0,1,0,0,0,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,88,99,95,0,0,0,0,1,0,0,1,0,0,1,0
996,62,55,55,1,0,1,0,0,0,1,0,0,0,0,1
997,59,71,65,0,0,1,0,0,0,1,0,0,0,0,0
998,68,78,77,0,0,0,1,0,0,0,0,1,0,1,0
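
Note that the dummy column names in the table above do not line up with their source columns: `lunch_bachelor's_degree` and its siblings are education levels, and the lunch dummy appears as `test_preparation_course_standard`. The `prefix` list in the cell has five entries but lists `test_preparation_course` twice and no entry for the education column. A hedged sketch of one way to avoid this is to omit `prefix` so that pandas derives each prefix from the column name. The toy frame below is made up, and the column name `parental_level_of_education` is an assumption about the renamed CSV header.

```python
import pandas as pd

# Toy frame mimicking the assumed column layout after renaming (values are made up)
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "race_ethnicity": ["group A", "group B", "group C"],
    "parental_level_of_education": ["high school", "some college", "master's degree"],
    "lunch": ["standard", "free/reduced", "standard"],
    "test_preparation_course": ["none", "completed", "none"],
    "math_score": [72, 69, 90],
})

# Without `prefix`, every dummy is named "<source column>_<category>"
dummies = pd.get_dummies(toy, drop_first=True)
# Replace spaces inside category names by underscores, as in the cell above
dummies.columns = ["_".join(c.split()) for c in dummies.columns]
print(list(dummies.columns))
```

With correctly labelled columns, the coefficients in the later regression output can be attributed to the right variable.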


In [13]:
correlation_matrix = pd_dummies.corr(method='pearson').abs()


correlation_matrix

Unnamed: 0,math_score,reading_score,writing_score,gender_male,race_ethnicity_group_B,race_ethnicity_group_C,race_ethnicity_group_D,race_ethnicity_group_E,lunch_bachelor's_degree,lunch_high_school,lunch_master's_degree,lunch_some_college,lunch_some_high_school,test_preparation_course_standard,test_preparation_course_none
math_score,1.0,0.81758,0.802642,0.167982,0.08425,0.073387,0.050071,0.205855,0.079664,0.128725,0.060417,0.037056,0.079852,0.350877,0.177702
reading_score,0.81758,1.0,0.954598,0.244313,0.060283,0.003074,0.035177,0.106712,0.096024,0.151068,0.106452,0.010782,0.071369,0.22956,0.24178
writing_score,0.802642,0.954598,1.0,0.301225,0.078254,0.010203,0.082032,0.089077,0.128297,0.182211,0.125693,0.027989,0.097326,0.245769,0.312946
gender_male,0.167982,0.244313,0.301225,1.0,0.028466,0.063368,0.030566,0.020302,0.011638,0.037952,0.046188,0.00446,0.00899,0.021372,0.006028
race_ethnicity_group_B,0.08425,0.060283,0.078254,0.028466,1.0,0.331479,0.288574,0.195411,0.019121,0.069093,0.056363,0.036203,0.026531,0.008257,0.000106
race_ethnicity_group_C,0.073387,0.003074,0.010203,0.063368,0.331479,1.0,0.407797,0.276145,0.015682,0.007977,0.00163,0.015872,0.045339,0.003385,0.012522
race_ethnicity_group_D,0.050071,0.035177,0.082032,0.030566,0.288574,0.407797,1.0,0.240402,0.020556,0.042118,0.072793,0.042347,0.018402,0.009458,0.055956
race_ethnicity_group_E,0.205855,0.106712,0.089077,0.020302,0.195411,0.276145,0.240402,1.0,0.013221,0.039494,0.00318,0.023153,0.053075,0.052398,0.059393
lunch_bachelor's_degree,0.079664,0.096024,0.128297,0.011638,0.019121,0.015682,0.020556,0.013221,1.0,0.180595,0.091588,0.197647,0.17079,0.013668,0.024285
lunch_high_school,0.128725,0.151068,0.182211,0.037952,0.069093,0.007977,0.042118,0.039494,0.180595,1.0,0.123632,0.266799,0.230545,0.002211,0.074446


In [14]:
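# keep only the strict upper triangle so that each pair of columns appears once
# (np.bool is deprecated since NumPy 1.20; use the builtin bool instead)
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))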
print(upper_tri)

                                  math_score  reading_score  writing_score  \
math_score                               NaN        0.81758       0.802642   
reading_score                            NaN            NaN       0.954598   
writing_score                            NaN            NaN            NaN   
gender_male                              NaN            NaN            NaN   
race_ethnicity_group_B                   NaN            NaN            NaN   
race_ethnicity_group_C                   NaN            NaN            NaN   
race_ethnicity_group_D                   NaN            NaN            NaN   
race_ethnicity_group_E                   NaN            NaN            NaN   
lunch_bachelor's_degree                  NaN            NaN            NaN   
lunch_high_school                        NaN            NaN            NaN   
lunch_master's_degree                    NaN            NaN            NaN   
lunch_some_college                       NaN            NaN     



In [15]:
# columns that are highly correlated (> 0.9) with an earlier column
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.9)]
print(to_drop)


['writing_score']


In [16]:
df1 = pd_dummies.drop(to_drop, axis=1)
df1


Unnamed: 0,math_score,reading_score,gender_male,race_ethnicity_group_B,race_ethnicity_group_C,race_ethnicity_group_D,race_ethnicity_group_E,lunch_bachelor's_degree,lunch_high_school,lunch_master's_degree,lunch_some_college,lunch_some_high_school,test_preparation_course_standard,test_preparation_course_none
0,72,72,0,1,0,0,0,1,0,0,0,0,1,1
1,69,90,0,0,1,0,0,0,0,0,1,0,1,0
2,90,95,0,1,0,0,0,0,0,1,0,0,1,1
3,47,57,1,0,0,0,0,0,0,0,0,0,0,1
4,76,78,1,0,1,0,0,0,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,88,99,0,0,0,0,1,0,0,1,0,0,1,0
996,62,55,1,0,1,0,0,0,1,0,0,0,0,1
997,59,71,0,0,1,0,0,0,1,0,0,0,0,0
998,68,78,0,0,0,1,0,0,0,0,1,0,1,0


#### b) Learning a simple regression model

Apply statsmodels to estimate the exam performance in math from all other columns, without any column interactions. Remember to use a constant term, and properly transform categorical variables. Which significant effects do you observe?

In [17]:
y = df1.math_score
X = df1.iloc[:, 1:]
X = sm.add_constant(X)
X


Unnamed: 0,const,reading_score,gender_male,race_ethnicity_group_B,race_ethnicity_group_C,race_ethnicity_group_D,race_ethnicity_group_E,lunch_bachelor's_degree,lunch_high_school,lunch_master's_degree,lunch_some_college,lunch_some_high_school,test_preparation_course_standard,test_preparation_course_none
0,1.0,72,0,1,0,0,0,1,0,0,0,0,1,1
1,1.0,90,0,0,1,0,0,0,0,0,1,0,1,0
2,1.0,95,0,1,0,0,0,0,0,1,0,0,1,1
3,1.0,57,1,0,0,0,0,0,0,0,0,0,0,1
4,1.0,78,1,0,1,0,0,0,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1.0,99,0,0,0,0,1,0,0,1,0,0,1,0
996,1.0,55,1,0,1,0,0,0,1,0,0,0,0,1
997,1.0,71,0,0,1,0,0,0,1,0,0,0,0,0
998,1.0,78,0,0,0,1,0,0,0,0,1,0,1,0


In [18]:
model = sm.OLS(y, X)
results = model.fit()
results.summary()

params = results.params
params

const                               -7.021095
reading_score                        0.907039
gender_male                         11.409329
race_ethnicity_group_B               0.838018
race_ethnicity_group_C               0.407506
race_ethnicity_group_D               1.617095
race_ethnicity_group_E               5.133762
lunch_bachelor's_degree              0.010595
lunch_high_school                   -0.357857
lunch_master's_degree               -0.925760
lunch_some_college                   0.577912
lunch_some_high_school              -0.576037
test_preparation_course_standard     4.304605
test_preparation_course_none         1.183365
dtype: float64

#### c) Adding interactions
Apply statsmodels to fit a regression model that extends the previous model by an interaction term between the test preparation course and each of the continuous columns that remain after preprocessing. To do so, first add these interaction columns to your predictor matrix, and then fit the corresponding model.
Does this interaction yield an improvement or rather cause problems?
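
An interaction column between a dummy variable and a continuous variable is simply their element-wise product. The sketch below shows this on made-up, noise-free toy data and fits the result with NumPy least squares. The helper `add_interactions`, the toy coefficients, and the choice of `reading_score` as the only continuous column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_interactions(X, dummy_col, continuous_cols):
    """Return a copy of X with one product column per continuous predictor."""
    X_int = X.copy()
    for col in continuous_cols:
        X_int[f"{dummy_col}:{col}"] = X_int[dummy_col] * X_int[col]
    return X_int

# Toy predictor matrix (values are made up, not the student data)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "const": 1.0,
    "reading_score": rng.integers(40, 100, size=200).astype(float),
    "test_preparation_course_none": rng.integers(0, 2, size=200).astype(float),
})
# Noise-free target with a known interaction effect of -0.1
y = (5.0
     + 0.8 * X["reading_score"]
     + 3.0 * X["test_preparation_course_none"]
     - 0.1 * X["reading_score"] * X["test_preparation_course_none"])

X_int = add_interactions(X, "test_preparation_course_none", ["reading_score"])
beta, *_ = np.linalg.lstsq(X_int.to_numpy(), y.to_numpy(), rcond=None)
```

The extended matrix `X_int` can be passed to `sm.OLS(y, X_int)` in the same way as the predictor matrix in the earlier cells. Since the product column is built from columns already in the model, it is worth re-checking the correlation matrix afterwards.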

#### d) Checking Residuals
Create a residual plot of both your models and give an interpretation of what you observe!
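
A residual plot typically shows the residuals $y - \hat{y}$ against the fitted values $\hat{y}$; a structureless band around zero suggests that the linear model is adequate. The following is a minimal sketch, assuming `matplotlib` is available. The helper name `residual_plot` and the toy numbers are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(y, y_hat, ax=None, title=""):
    """Scatter residuals against fitted values and return the residuals."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y - y_hat
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(y_hat, residuals, s=10)
    ax.axhline(0.0, color="black", linewidth=1)  # reference line at zero
    ax.set_xlabel("fitted values")
    ax.set_ylabel("residuals")
    ax.set_title(title)
    return residuals

# Toy example with made-up numbers
res_demo = residual_plot([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], title="toy model")
```

For a fitted `statsmodels` result, the fitted values are available as `results.fittedvalues`, so a call such as `residual_plot(y, results.fittedvalues)` for each of the two models allows a side-by-side comparison.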