# Lab 6.2: Linear Regression

In [1]:
%pylab inline

import pandas as pd
import statsmodels.api as sm
import yaml

from seaborn import pairplot
from sqlalchemy import create_engine

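# safe_load avoids executing arbitrary YAML tags; the context manager closes the file
with open('../../pg_creds.yaml') as f:
    pg_creds = yaml.safe_load(f)['student']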

engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{dbname}'.format(**pg_creds))

Populating the interactive namespace from numpy and matplotlib


**Question 1**  

Using the cars data,

1) Fit a simple linear regression to predict `mpg` using `weight`.  

In [2]:
cars = pd.read_sql("SELECT * FROM cars WHERE horsepower IS NOT NULL;", engine, index_col='index')

In [3]:
cars.head()

index,mpg,cylinders,displacement,horsepower,weight,acceleration,model,origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


In [4]:
X = cars.weight
X = sm.add_constant(X)
y = cars.mpg

model = sm.OLS(y, X)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,mpg,R-squared:,0.693
Model:,OLS,Adj. R-squared:,0.692
Method:,Least Squares,F-statistic:,878.8
Date:,"Mon, 03 Oct 2016",Prob (F-statistic):,6.02e-102
Time:,16:44:50,Log-Likelihood:,-1130.0
No. Observations:,392,AIC:,2264.0
Df Residuals:,390,BIC:,2272.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,46.2165,0.799,57.867,0.000,44.646 47.787
weight,-0.0076,0.000,-29.645,0.000,-0.008 -0.007

0,1,2,3
Omnibus:,41.682,Durbin-Watson:,0.808
Prob(Omnibus):,0.0,Jarque-Bera (JB):,60.039
Skew:,0.727,Prob(JB):,9.18e-14
Kurtosis:,4.251,Cond. No.,11300.0


2) Comment on the model fit.  

The model's $R^2$ is 0.693: about 69.3% of the variation in y (`mpg`) is explained by its least-squares linear relationship with X (`weight`), and the remaining 100% - 69.3% = 30.7% is unexplained.
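
To make the $R^2$ reading concrete, here is a minimal sketch that computes it by hand as $1 - SSE/SST$ for a simple linear regression. The toy values and the helper name `r_squared` are illustrative assumptions, not the cars data.

```python
def r_squared(x, y):
    """R^2 of the least-squares line of y on x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Unexplained (SSE) and total (SST) variation
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Made-up data, for illustration only
toy_x = [1, 2, 3, 4, 5]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared(toy_x, toy_y)
print("explained:", round(r2, 4), "unexplained:", round(1 - r2, 4))
```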

3) Interpret the model. 

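Each one-unit increase in `weight` is associated with a change of -0.0076 in predicted `mpg`, on average. Equivalently, a car that is 1,000 units heavier is predicted to get about 7.6 fewer miles per gallon.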

4) Is `weight` useful for predicting `mpg`? Carry out a formal hypothesis test to show it.  

$H_0: \beta_1 = 0$  

$H_a: \beta_1 \neq 0$ 

Test statistic:  

$ t_{stat} = \frac{b_1 - 0}{s_{b_1}} = \frac{b_1}{s_{b_1}}$

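In [None]:
# t statistic for the slope: estimated coefficient divided by its standard error
t_stat = results.params['weight'] / results.bse['weight']
t_stat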
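
As a hedged illustration of the test statistic above, the sketch below computes $b_1$, $s_{b_1}$ and $t_{stat}$ by hand on made-up data (the helper `slope_t_stat` and the toy values are assumptions for illustration). For the cars model, the summary table already reports t = -29.645 for `weight` with a p-value shown as 0.000, so $H_0$ is rejected at any conventional significance level.

```python
import math

def slope_t_stat(x, y):
    """Return (b1, se_b1, t) for testing H0: beta_1 = 0 in simple regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual standard error uses n - 2 degrees of freedom
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    se_b1 = s_e / math.sqrt(sxx)
    return b1, se_b1, b1 / se_b1

# Made-up data, for illustration only
toy_b1, toy_se_b1, toy_t = slope_t_stat([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(toy_b1, toy_se_b1, toy_t)
```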

5) Make a prediction for the average `mpg` of all cars that have a weight of 2000.  

In [9]:
46.2165 - 0.0076473 * 2000

30.9219

In [10]:
results.predict([1, 2000])

array([ 30.92183948])

6) Make a prediction for a particular car that has a weight of 2000.  

7) Write a Python function to calculate the confidence interval for your prediction in part 5).  

In [22]:
x = cars.weight
se = sqrt(results.mse_resid)
b0, b1 = results.params

x_new = 2000

def confidence_se(s_e, x, x_new):
    mean_x = x.mean()
    var_x = x.var()
    n = len(x)
    return s_e * (1/n + (x_new - mean_x)**2 / ((n - 1) * var_x))**0.5

sign = array([-1., 1.])
b0 + b1 * x_new + sign * 1.96 * confidence_se(se, x, x_new)

array([ 30.26741098,  31.57626797])
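
For reference, the interval computed above corresponds to the expression below. This is written under the cell's own assumptions: a normal critical value of 1.96 rather than a $t$ quantile, with $s_e$ the residual standard error, $s_x^2$ the sample variance of `weight`, and $x^*$ the new weight (2000).

$$\hat{y}^* \pm 1.96 \, s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)\,s_x^2}}, \qquad \hat{y}^* = b_0 + b_1 x^*$$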

8) Write a Python function to calculate the prediction interval for your prediction in part 6).  

In [23]:
def prediction_se(s_e, x, x_new):
    mean_x = x.mean()
    var_x = x.var()
    n = len(x)
    return s_e * (1 + 1/n + (x_new - mean_x)**2 / ((n - 1) * var_x))**0.5

b0 + b1 * x_new + sign * 1.96 * prediction_se(se, x, x_new)

array([ 22.40454496,  39.439134  ])

9) What are the differences between the intervals you found in parts 7) and 8)?
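
One way to see the difference is to compare the widths of the two intervals, using the endpoints reported in parts 7) and 8). The sketch below only does that arithmetic; the variable names are illustrative.

```python
# Endpoints copied from the outputs of parts 7) and 8)
ci = (30.26741098, 31.57626797)   # confidence interval for the mean mpg
pi = (22.40454496, 39.439134)     # prediction interval for one car

ci_width = ci[1] - ci[0]
pi_width = pi[1] - pi[0]
ci_center = (ci[0] + ci[1]) / 2
pi_center = (pi[0] + pi[1]) / 2

print("CI width:", round(ci_width, 3), "center:", round(ci_center, 3))
print("PI width:", round(pi_width, 3), "center:", round(pi_center, 3))
```

Both intervals are centered on the same point prediction (about 30.92), but the prediction interval is far wider. The reason is the extra `1 +` term inside the square root of `prediction_se`, which accounts for the scatter of an individual car around the mean, on top of the uncertainty in the estimated mean itself.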

**Question 2**  

You are shopping for a laptop computer at Best Buy. To help you with your decision, you decide to construct a regression model to predict the selling price of the laptop. The table `laptops` provides the following data for a random sample of laptops on Best Buy’s Web site:  

* Selling price
* Brand
* Screen size (in.)
* Hard drive size (GB)
* Amount of RAM memory (GB)
* Number of USB ports
* Weight (oz.) 

a) Using multiple regression, model selling price using the variables screen size, hard drive size, amount of RAM, number of USB ports, and weight.  

In [27]:
laptops = pd.read_sql("SELECT * FROM laptops AS l;", engine)

In [28]:
laptops.head()

Unnamed: 0,Price ($),Screen Size (in.),RAM Memory (GB),Hard drive (GB),USB Ports,Brand,Weight (oz.)
0,830,13.3,4,500,3,Toshiba,4.9
1,750,13.3,4,640,3,Toshiba,3.2
2,1200,11.6,2,128,2,Apple,2.3
3,1600,18.4,6,640,4,Toshiba,9.7
4,1900,18.4,8,500,4,Toshiba,9.7


In [30]:
# Lowercase the column names and strip periods and spaces
laptops.columns = [col.replace('.', '').replace(' ', '').lower() for col in laptops.columns]

print(laptops.columns)

Index(['price($)', 'screensize(in)', 'rammemory(gb)', 'harddrive(gb)',
       'usbports', 'brand', 'weight(oz)'],
      dtype='object')


In [31]:
X_multi = laptops[['screensize(in)', 'rammemory(gb)', 'harddrive(gb)', 'usbports', 'weight(oz)']]
X_multi = sm.add_constant(X_multi)
y_multi = laptops['price($)']

model_multi = sm.OLS(y_multi, X_multi)
results_multi = model_multi.fit()
results_multi.summary()

0,1,2,3
Dep. Variable:,price($),R-squared:,0.117
Model:,OLS,Adj. R-squared:,0.04
Method:,Least Squares,F-statistic:,1.514
Date:,"Mon, 03 Oct 2016",Prob (F-statistic):,0.2
Time:,17:34:50,Log-Likelihood:,-477.99
No. Observations:,63,AIC:,968.0
Df Residuals:,57,BIC:,980.8
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,761.4987,946.954,0.804,0.425,-1134.744 2657.741
screensize(in),4.1113,96.206,0.043,0.966,-188.539 196.761
rammemory(gb),12.8642,74.411,0.173,0.863,-136.141 161.870
harddrive(gb),0.6561,0.459,1.429,0.159,-0.263 1.576
usbports,-206.5346,123.563,-1.671,0.100,-453.965 40.896
weight(oz),51.6251,99.210,0.520,0.605,-147.040 250.290

0,1,2,3
Omnibus:,9.835,Durbin-Watson:,1.792
Prob(Omnibus):,0.007,Jarque-Bera (JB):,10.384
Skew:,0.993,Prob(JB):,0.00556
Kurtosis:,3.102,Cond. No.,8400.0


b) Perform and interpret the overall F test.  

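F-statistic: 1.514

Prob (F-statistic): 0.200

Since the p-value of 0.200 exceeds 0.05, we fail to reject $H_0$ that all slope coefficients are zero. At the 5% level, the predictors are not jointly significant for predicting price.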
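
The overall F statistic can be recovered from $R^2$, the number of predictors $k$ and the sample size $n$ via $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$. Below is a quick check with the rounded values from the summary table in part a); because $R^2$ is rounded to three decimals, the result only agrees with the reported 1.514 to about two decimals. The helper name `overall_f` is an illustrative assumption.

```python
def overall_f(r2, n, k):
    """Overall F statistic from R^2, sample size n and number of predictors k."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Values read off the summary table: R-squared 0.117, 63 observations, 5 predictors
f_stat = overall_f(r2=0.117, n=63, k=5)
print(round(f_stat, 2))
```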

c) Using p-values, which variables appear to be needed in the model? Justify your answer.   
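
A small sketch for screening the p-values reported in the summary table of part a) against a 5% significance level. The dictionary is copied by hand from that table, and `alpha` is an assumed threshold.

```python
# P>|t| column from the part a) summary table (intercept omitted)
p_values = {
    'screensize(in)': 0.966,
    'rammemory(gb)': 0.863,
    'harddrive(gb)': 0.159,
    'usbports': 0.100,
    'weight(oz)': 0.605,
}
alpha = 0.05  # assumed significance level

significant = [name for name, p in p_values.items() if p < alpha]
closest = min(p_values, key=p_values.get)
print("significant at 5%:", significant)
print("smallest p-value:", closest, p_values[closest])
```

With these values the list is empty: no predictor is individually significant at the 5% level, and `usbports` (p = 0.100) comes closest.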

d) Now create a new predictor that contains random numbers drawn from your favorite distribution, and include this predictor in your multiple regression model. Comment on the model fit. How does the new $R^2$ compare to the one in part a)?  

In [35]:
mu, sigma = 0, 1
# We choose size equal to the size of the dataframe.
new_predictor = np.random.normal(mu, sigma, len(laptops))

In [37]:
laptops['new_predictor'] = new_predictor

In [38]:
laptops.head()

Unnamed: 0,price($),screensize(in),rammemory(gb),harddrive(gb),usbports,brand,weight(oz),new_predictor
0,830,13.3,4,500,3,Toshiba,4.9,0.709908
1,750,13.3,4,640,3,Toshiba,3.2,1.004663
2,1200,11.6,2,128,2,Apple,2.3,0.418535
3,1600,18.4,6,640,4,Toshiba,9.7,1.45236
4,1900,18.4,8,500,4,Toshiba,9.7,-0.145833


In [39]:
X_multi_new = laptops[['screensize(in)', 'rammemory(gb)', 'harddrive(gb)', 'usbports', 'weight(oz)', 'new_predictor']]
X_multi_new = sm.add_constant(X_multi_new)
y_multi_new = laptops['price($)']

model_multi_new = sm.OLS(y_multi_new, X_multi_new)
results_multi_new = model_multi_new.fit()
results_multi_new.summary()

0,1,2,3
Dep. Variable:,price($),R-squared:,0.169
Model:,OLS,Adj. R-squared:,0.08
Method:,Least Squares,F-statistic:,1.901
Date:,"Mon, 03 Oct 2016",Prob (F-statistic):,0.0967
Time:,17:45:48,Log-Likelihood:,-476.08
No. Observations:,63,AIC:,966.2
Df Residuals:,56,BIC:,981.2
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,731.1274,926.972,0.789,0.434,-1125.820 2588.075
screensize(in),22.3282,94.664,0.236,0.814,-167.306 211.962
rammemory(gb),2.0731,73.058,0.028,0.977,-144.279 148.425
harddrive(gb),0.5331,0.454,1.174,0.245,-0.377 1.443
usbports,-236.2965,121.978,-1.937,0.058,-480.648 8.055
weight(oz),42.0702,97.236,0.433,0.667,-152.717 236.857
new_predictor,148.6894,79.453,1.871,0.067,-10.475 307.853

0,1,2,3
Omnibus:,9.125,Durbin-Watson:,1.841
Prob(Omnibus):,0.01,Jarque-Bera (JB):,9.215
Skew:,0.931,Prob(JB):,0.00998
Kurtosis:,3.211,Cond. No.,8400.0


e) Generate another new predictor: you can draw another list of random numbers from the same distribution as above, or you can draw from a different distribution. Add this predictor to the model in part d). What happens to the $R^2$? Does this mean that the new predictor is useful for predicting laptop prices?

In [42]:
lambda_predictor = 2
# We choose size equal to the size of the dataframe.
another_predictor = np.random.exponential(lambda_predictor, len(laptops))

In [43]:
laptops['another_predictor'] = another_predictor

In [44]:
laptops.head()

Unnamed: 0,price($),screensize(in),rammemory(gb),harddrive(gb),usbports,brand,weight(oz),new_predictor,another_predictor
0,830,13.3,4,500,3,Toshiba,4.9,0.709908,2.268801
1,750,13.3,4,640,3,Toshiba,3.2,1.004663,4.260563
2,1200,11.6,2,128,2,Apple,2.3,0.418535,1.215281
3,1600,18.4,6,640,4,Toshiba,9.7,1.45236,1.813176
4,1900,18.4,8,500,4,Toshiba,9.7,-0.145833,1.262967


In [45]:
X_multi_another = laptops[['screensize(in)', 'rammemory(gb)', 'harddrive(gb)', 'usbports', 'weight(oz)', 'new_predictor', 'another_predictor']]
X_multi_another = sm.add_constant(X_multi_another)
y_multi_another = laptops['price($)']

model_multi_another = sm.OLS(y_multi_another, X_multi_another)
results_multi_another = model_multi_another.fit()
results_multi_another.summary()

0,1,2,3
Dep. Variable:,price($),R-squared:,0.169
Model:,OLS,Adj. R-squared:,0.064
Method:,Least Squares,F-statistic:,1.602
Date:,"Mon, 03 Oct 2016",Prob (F-statistic):,0.154
Time:,17:50:51,Log-Likelihood:,-476.08
No. Observations:,63,AIC:,968.2
Df Residuals:,55,BIC:,985.3
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,728.1514,935.936,0.778,0.440,-1147.506 2603.809
screensize(in),23.4894,96.461,0.244,0.809,-169.823 216.802
rammemory(gb),3.9095,76.737,0.051,0.960,-149.875 157.694
harddrive(gb),0.5285,0.461,1.146,0.257,-0.396 1.453
usbports,-236.6897,123.158,-1.922,0.060,-483.505 10.125
weight(oz),39.7843,101.638,0.391,0.697,-163.902 243.471
new_predictor,148.7275,80.168,1.855,0.069,-11.933 309.388
another_predictor,-4.1462,48.154,-0.086,0.932,-100.649 92.357

0,1,2,3
Omnibus:,9.209,Durbin-Watson:,1.837
Prob(Omnibus):,0.01,Jarque-Bera (JB):,9.306
Skew:,0.935,Prob(JB):,0.00953
Kurtosis:,3.221,Cond. No.,8410.0
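
To compare the three fits, note that $R^2$ cannot decrease when a predictor is added, so adjusted $R^2$, which penalizes extra predictors, is the more informative number here. The sketch below recomputes it from the rounded $R^2$ values in the summary tables, using $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$. Small discrepancies from the reported values come from that rounding, and the helper name `adjusted_r2` is an illustrative assumption.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n_obs = 63
# (R-squared, number of predictors) read off the three summary tables
fits = {'part a': (0.117, 5), 'part d': (0.169, 6), 'part e': (0.169, 7)}

adj = {name: adjusted_r2(r2, n_obs, k) for name, (r2, k) in fits.items()}
for name, value in adj.items():
    print(name, round(value, 3))
```

In part e) $R^2$ stays at 0.169 to three decimals while the reported adjusted $R^2$ falls from 0.08 to 0.064, which suggests that `another_predictor` adds no real explanatory power.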


**Question 3**  

Squirt Squad is a cleaning service that sends crews to residential homes on either a once-a-month or twice-a-month schedule, depending on the customer’s preference. The owner would like to predict the amount of time required to clean a house based on the square footage of the house, the total number of rooms in the house, the number of bathrooms it has, the size of the cleaning crew, the frequency of the cleaning schedule, and whether or not the household has children. Data can be found in the tables **`squad`** (containing `squad_id`, `home_id`, `crew` and `freq` (0: once-a-month, 1: twice-a-month)); **`squad_homes`** (containing `home_id`, `footage`, `rooms`, `baths` and `children` (Squirt Squad assumes the number of children in a house will never change. BONUS: how would you change the schema to account for the possibility that it will?)); and **`squad_times`** (containing `squad_id`, `dt`, `time` and `crew` (redundant with `squad` but included in case the squad size changes)). You will need to construct a three-way join using `home_id` and `squad_id`.

a) Construct a regression model using all of the independent variables.  

In [20]:
squad = pd.read_sql("SELECT s.squad_id, s.home_id, s.crew, s.freq, sh.footage,sh.rooms, sh.baths, sh.children, st.dt, st.time FROM squad AS s INNER JOIN squad_homes AS sh ON s.home_id = sh.home_id INNER JOIN squad_times AS st ON s.squad_id = st.squad_id;", engine)

In [21]:
squad.head()

Unnamed: 0,squad_id,home_id,crew,freq,footage,rooms,baths,children,dt,time
0,1,0,3,1,1548,8,2.0,0,2016-09-17,132
1,2,1,2,1,1599,7,1.5,0,2016-09-09,146
2,3,2,3,1,1630,8,2.0,0,2016-09-12,131
3,4,3,3,1,1640,7,1.5,0,2016-09-11,141
4,5,4,3,0,1711,8,2.5,1,2016-09-27,144


b) Test and interpret the significance of the overall regression model (what is the result of the overall F test)?  

c) Interpret the meaning of the regression coefficient for the Rooms, Crew, Children, and Frequency variables.  

d) Using the p-values, identify which independent variables are significant (needed).  

e) Construct a regression model using only the significant variables found in part d) and predict the average time to clean a house that has 2,250 square feet, 11 total rooms, 3.5 bathrooms, and no children. This house is cleaned once a month with a crew of four employees.  

f) Compare the two models you fitted. Which one is the better model, and why?