<a href="https://colab.research.google.com/github/danielbauer1979/ML_656/blob/main/Session4_LinReg_SomeModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Load packages:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

We rely on `statsmodels.api` here rather than scikit-learn, our usual go-to tool. The reason is to illustrate a more "comfortable" way to run linear regressions, one that looks and feels much like `R`.
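
To give a sense of that `R`-like feel, here is a minimal sketch of the statsmodels formula interface. It is not the route taken in the rest of this notebook, which builds the design matrix by hand and calls `sm.OLS`. The frame `toy` and all of its values are synthetic and serve only as an illustration; the column names are borrowed from the data set used below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic toy data (NOT tel.csv); the numbers are made up for illustration
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "SOA": rng.integers(0, 500, size=n),
    "Hot": rng.integers(0, 30, size=n),
    "Day": np.tile([1, 2, 3, 4, 5], n // 5),  # 1 = Monday, ..., 5 = Friday
})
toy["Hours"] = (
    90 + 0.04 * toy["SOA"] + 0.8 * toy["Hot"]
    - 25 * (toy["Day"] == 5) + rng.normal(0, 5, size=n)
)

# R-style formula: the intercept is added automatically, and C(Day) expands
# Day into dummies with the first level (Day 1) as the base category
fit = smf.ols("Hours ~ SOA + Hot + C(Day)", data=toy).fit()
print(fit.params)
```

With the formula interface there is no need for `sm.add_constant` or hand-made dummies, at the cost of less explicit control over the design matrix.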

Let's load our data:

In [None]:
!git clone https://github.com/danielbauer1979/ML_656.git

Cloning into 'ML_656'...
remote: Enumerating objects: 281, done.
remote: Counting objects: 100% (164/164), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 281 (delta 96), reused 123 (delta 76), pack-reused 117
Receiving objects: 100% (281/281), 24.00 MiB | 16.49 MiB/s, done.
Resolving deltas: 100% (146/146), done.


In [None]:
data = pd.read_csv('ML_656/tel.csv')
data.head()

,Hours,ByDa,RWT,SOA,SOB,SOC,Field,Hot,Day
0,111,62,34,496,0,0,36,12,3
1,114,35,29,258,0,0,34,16,4
2,70,74,19,39,0,1,27,9,5
3,114,97,19,376,5,1,26,28,1
4,87,83,31,107,1,1,9,14,2


Let's look at aggregate statistics:

In [None]:
data.describe()

,Hours,ByDa,RWT,SOA,SOB,SOC,Field,Hot,Day
count,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0
mean,101.806452,60.387097,42.870968,263.806452,2.387097,2.548387,24.870968,12.193548,3.0
std,18.332884,31.679307,25.692725,132.756273,3.242195,6.381425,14.568784,5.20525,1.414214
min,48.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,1.0
25%,92.0,42.0,24.5,170.0,0.0,0.0,18.0,9.0,2.0
50%,108.0,56.0,35.0,283.0,1.0,1.0,22.0,11.0,3.0
75%,114.5,74.0,66.5,352.5,4.0,2.0,28.0,16.0,4.0
max,124.0,174.0,109.0,496.0,12.0,35.0,92.0,28.0,5.0


Treating Day as a number doesn't make sense, since it is a categorical label, so let's replace it with dummies. We will use Monday as our base and add dummies for Tuesday through Friday. (It's not clear whether it makes sense to treat RWT and the other variables as numeric, but let's run with it for now.)

In [None]:
# One 0/1 indicator per weekday; Monday (Day == 1) is the omitted base category
data['Tuesday'] = (data['Day'] == 2).astype(int)
data['Wednesday'] = (data['Day'] == 3).astype(int)
data['Thursday'] = (data['Day'] == 4).astype(int)
data['Friday'] = (data['Day'] == 5).astype(int)
data.head()

,Hours,ByDa,RWT,SOA,SOB,SOC,Field,Hot,Day,Tuesday,Wednesday,Thursday,Friday
0,111,62,34,496,0,0,36,12,3,0,1,0,0
1,114,35,29,258,0,0,34,16,4,0,0,1,0
2,70,74,19,39,0,1,27,9,5,0,0,0,1
3,114,97,19,376,5,1,26,28,1,0,0,0,0
4,87,83,31,107,1,1,9,14,2,1,0,0,0
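
The same dummy coding can also be produced in one call with `pd.get_dummies`. The sketch below uses a small made-up frame `toy` rather than the real data, and the resulting column names (`Day_2`, ..., `Day_5`) differ from the weekday names used in this notebook.

```python
import pandas as pd

# Toy frame with the same Day coding as above (1 = Monday, ..., 5 = Friday);
# the values are made up for illustration
toy = pd.DataFrame({"Day": [3, 4, 5, 1, 2]})

# drop_first=True drops the lowest level (Day 1, Monday), making it the base
dummies = pd.get_dummies(toy["Day"], prefix="Day", drop_first=True, dtype=int)
print(dummies)
```

A Monday row has zeros in all four columns, which is exactly what "Monday as our base" means.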


Let's start running regressions, beginning with the "full" model that includes all explanatory variables.

In [None]:
# Assign dependent and independent / explanatory variables
y = data['Hours']
X = data.drop(columns=['Hours','Day'])
X = sm.add_constant(X) # Add a constant term as the default model doesn't include one
model = sm.OLS(y, X).fit()
# Check regression results
model.summary()

Dep. Variable:,Hours,R-squared:,0.859
Model:,OLS,Adj. R-squared:,0.778
Method:,Least Squares,F-statistic:,10.56
Date:,"Mon, 02 Oct 2023",Prob (F-statistic):,6.41e-06
Time:,17:42:05,Log-Likelihood:,-103.24
No. Observations:,31,AIC:,230.5
Df Residuals:,19,BIC:,247.7
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,83.7338,10.888,7.690,0.000,60.944,106.523
ByDa,-0.0659,0.089,-0.741,0.468,-0.252,0.120
RWT,0.1395,0.102,1.371,0.186,-0.073,0.352
SOA,0.0471,0.015,3.234,0.004,0.017,0.078
SOB,-0.4801,0.706,-0.680,0.504,-1.957,0.997
SOC,0.0170,0.317,0.054,0.958,-0.646,0.680
Field,0.0376,0.152,0.248,0.807,-0.280,0.356
Hot,0.8029,0.378,2.126,0.047,0.012,1.593
Tuesday,-8.1570,6.057,-1.347,0.194,-20.834,4.520

Omnibus:,0.005,Durbin-Watson:,1.729
Prob(Omnibus):,0.997,Jarque-Bera (JB):,0.149
Skew:,0.026,Prob(JB):,0.928
Kurtosis:,2.664,Cond. No.,2840.0


In [None]:
X = data.drop(columns=['Hours','Day','Tuesday','Wednesday','Thursday'])
X = sm.add_constant(X)
model2 = sm.OLS(y, X).fit()
model2.summary()

Dep. Variable:,Hours,R-squared:,0.823
Model:,OLS,Adj. R-squared:,0.759
Method:,Least Squares,F-statistic:,12.8
Date:,"Mon, 02 Oct 2023",Prob (F-statistic):,1.13e-06
Time:,17:43:18,Log-Likelihood:,-106.79
No. Observations:,31,AIC:,231.6
Df Residuals:,22,BIC:,244.5
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,80.1454,9.130,8.778,0.000,61.211,99.080
ByDa,-0.0971,0.078,-1.247,0.225,-0.259,0.064
RWT,0.1824,0.098,1.870,0.075,-0.020,0.385
SOA,0.0438,0.015,2.915,0.008,0.013,0.075
SOB,-0.8148,0.715,-1.140,0.267,-2.297,0.668
SOC,0.0486,0.293,0.166,0.870,-0.559,0.656
Field,0.1392,0.138,1.010,0.323,-0.146,0.425
Hot,0.9392,0.367,2.560,0.018,0.178,1.700
Friday,-25.4658,6.175,-4.124,0.000,-38.272,-12.659

Omnibus:,1.043,Durbin-Watson:,1.915
Prob(Omnibus):,0.594,Jarque-Bera (JB):,0.889
Skew:,-0.15,Prob(JB):,0.641
Kurtosis:,2.226,Cond. No.,1950.0
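
Going from `model` to `model2` drops the Tuesday, Wednesday and Thursday dummies. One way to check whether that is reasonable is a partial F-test on the two nested models. The sketch below is a rough hand calculation from the rounded R-squared values and residual degrees of freedom printed in the two summaries, so the result is only approximate; on the fitted objects, `model.compare_f_test(model2)` should give the same answer up to rounding.

```python
from scipy import stats

# Figures read off the two summaries above (rounded as printed)
r2_full, df_resid_full = 0.859, 19        # model: all 11 regressors
r2_reduced, df_resid_reduced = 0.823, 22  # model2: Tuesday-Thursday dropped

q = df_resid_reduced - df_resid_full      # number of restrictions (3 dummies)

# Partial F statistic in terms of R-squared (same y, same observations)
f_stat = ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid_full)
p_value = stats.f.sf(f_stat, q, df_resid_full)

print(f"F = {f_stat:.3f} on ({q}, {df_resid_full}) df, p = {p_value:.3f}")
```

The F statistic comes out at roughly 1.6, with a p-value above 0.05, so these figures give no strong evidence that the three dummies are needed once Friday is in the model.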


In [None]:
X = data.drop(columns=['Hours','Day','Tuesday','Wednesday','Thursday','SOB','ByDa'])
X = sm.add_constant(X)
model3 = sm.OLS(y, X).fit()
model3.summary()

Dep. Variable:,Hours,R-squared:,0.807
Model:,OLS,Adj. R-squared:,0.759
Method:,Least Squares,F-statistic:,16.72
Date:,"Mon, 02 Oct 2023",Prob (F-statistic):,1.65e-07
Time:,17:45:36,Log-Likelihood:,-108.15
No. Observations:,31,AIC:,230.3
Df Residuals:,24,BIC:,240.3
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,78.2406,8.991,8.702,0.000,59.683,96.798
RWT,0.1120,0.075,1.500,0.147,-0.042,0.266
SOA,0.0445,0.015,2.972,0.007,0.014,0.075
SOC,0.0686,0.288,0.238,0.814,-0.527,0.664
Field,0.0572,0.117,0.487,0.631,-0.185,0.299
Hot,0.8204,0.352,2.333,0.028,0.094,1.546
Friday,-23.6281,5.755,-4.105,0.000,-35.507,-11.750

Omnibus:,3.602,Durbin-Watson:,1.669
Prob(Omnibus):,0.165,Jarque-Bera (JB):,1.678
Skew:,-0.209,Prob(JB):,0.432
Kurtosis:,1.94,Cond. No.,1860.0


In [None]:
X = data[['SOA','Hot','Friday']]
X = sm.add_constant(X)
model4 = sm.OLS(y, X).fit()
model4.summary()

Dep. Variable:,Hours,R-squared:,0.788
Model:,OLS,Adj. R-squared:,0.764
Method:,Least Squares,F-statistic:,33.43
Date:,"Mon, 02 Oct 2023",Prob (F-statistic):,3.1e-09
Time:,17:46:42,Log-Likelihood:,-109.62
No. Observations:,31,AIC:,227.2
Df Residuals:,27,BIC:,233.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,87.1876,6.002,14.528,0.000,74.873,99.502
SOA,0.0424,0.014,3.064,0.005,0.014,0.071
Hot,0.7268,0.326,2.226,0.035,0.057,1.397
Friday,-28.1038,4.712,-5.964,0.000,-37.772,-18.435

Omnibus:,2.302,Durbin-Watson:,1.886
Prob(Omnibus):,0.316,Jarque-Bera (JB):,2.063
Skew:,-0.591,Prob(JB):,0.357
Kurtosis:,2.554,Cond. No.,1260.0
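
To read the `model4` coefficients as a prediction rule, the sketch below plugs them into the fitted equation by hand. The helper `predict_hours` is a hypothetical name introduced here; the coefficients are copied from the summary above, and the "typical" inputs are the medians of SOA and Hot from `data.describe()`.

```python
# Coefficients copied from the model4 summary above
coef = {"const": 87.1876, "SOA": 0.0424, "Hot": 0.7268, "Friday": -28.1038}

def predict_hours(soa, hot, friday):
    """Fitted Hours from the model4 coefficients for a single day."""
    return (coef["const"] + coef["SOA"] * soa
            + coef["Hot"] * hot + coef["Friday"] * friday)

# Medians of SOA (283) and Hot (11) from data.describe() as a "typical" day
typical_weekday = predict_hours(soa=283, hot=11, friday=0)
typical_friday = predict_hours(soa=283, hot=11, friday=1)

print(f"Monday-Thursday: {typical_weekday:.1f}")
print(f"Friday:          {typical_friday:.1f}")
```

The two predictions differ by exactly the Friday coefficient, about 28 fewer Hours on a Friday with everything else held fixed.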


In [None]:
X = data[['Hot','Friday']]
X = sm.add_constant(X)
model5 = sm.OLS(y, X).fit()
model5.summary()

Dep. Variable:,Hours,R-squared:,0.714
Model:,OLS,Adj. R-squared:,0.694
Method:,Least Squares,F-statistic:,34.97
Date:,"Mon, 02 Oct 2023",Prob (F-statistic):,2.44e-08
Time:,17:47:43,Log-Likelihood:,-114.24
No. Observations:,31,AIC:,234.5
Df Residuals:,28,BIC:,238.8
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,99.0913,5.215,19.001,0.000,88.409,109.774
Hot,0.7700,0.372,2.071,0.048,0.008,1.532
Friday,-34.4832,4.819,-7.155,0.000,-44.355,-24.611

Omnibus:,4.311,Durbin-Watson:,1.757
Prob(Omnibus):,0.116,Jarque-Bera (JB):,3.25
Skew:,-0.789,Prob(JB):,0.197
Kurtosis:,3.163,Cond. No.,43.8


In [None]:
X = data[['Friday']]
X = sm.add_constant(X)
model6 = sm.OLS(y, X).fit()
model6.summary()

Dep. Variable:,Hours,R-squared:,0.67
Model:,OLS,Adj. R-squared:,0.659
Method:,Least Squares,F-statistic:,58.96
Date:,"Mon, 02 Oct 2023",Prob (F-statistic):,1.82e-08
Time:,17:48:38,Log-Likelihood:,-116.45
No. Observations:,31,AIC:,236.9
Df Residuals:,29,BIC:,239.8
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,109.0400,2.141,50.922,0.000,104.661,113.419
Friday,-37.3733,4.867,-7.678,0.000,-47.328,-27.419

Omnibus:,2.571,Durbin-Watson:,2.287
Prob(Omnibus):,0.277,Jarque-Bera (JB):,2.16
Skew:,-0.633,Prob(JB):,0.34
Kurtosis:,2.739,Cond. No.,2.64
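
To compare the six models side by side, it helps to collect the fit statistics in one table. The sketch below copies the values printed in the summaries above into a frame called `comparison` (a name introduced here); on the fitted objects the same numbers should be available as the `rsquared_adj`, `aic` and `bic` attributes.

```python
import pandas as pd

# Values copied from the six summaries above
comparison = pd.DataFrame(
    {
        "n_predictors": [11, 8, 6, 3, 2, 1],
        "adj_r2": [0.778, 0.759, 0.759, 0.764, 0.694, 0.659],
        "aic": [230.5, 231.6, 230.3, 227.2, 234.5, 236.9],
        "bic": [247.7, 244.5, 240.3, 233.0, 238.8, 239.8],
    },
    index=["model", "model2", "model3", "model4", "model5", "model6"],
)

print(comparison)
print("highest adj. R-squared:", comparison["adj_r2"].idxmax())
print("lowest AIC:", comparison["aic"].idxmin())
print("lowest BIC:", comparison["bic"].idxmin())
```

The full model has the highest adjusted R-squared, but both AIC and BIC are lowest for `model4` (SOA, Hot and Friday), which penalizes the extra regressors more heavily.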
