# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [62]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')


In [63]:
df.head()

Unnamed: 0,Price,Mileage,Make,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather
0,17314.103129,8221,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,1
1,17542.036083,9135,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
2,16218.847862,13196,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
3,16336.91314,16342,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,0
4,16339.170324,19832,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,1


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values, unless you can convert them into some numerical encoding that makes sense.
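One common way to turn a column like Make into numbers is one-hot encoding. The snippet below is only a sketch on a toy frame (the `cars` frame and its values are made up for illustration, and this is not something the notebook itself does); it uses `pd.get_dummies` to create one 0/1 column per make.

```python
import pandas as pd

# Toy frame standing in for the car data; these rows are illustrative only.
cars = pd.DataFrame({
    'Make': ['Buick', 'Saab', 'Buick', 'Chevrolet'],
    'Mileage': [8221, 12000, 13196, 20500],
})

# One 0/1 column per make. drop_first removes one redundant column,
# so the dropped make acts as the baseline category.
encoded = pd.get_dummies(cars, columns=['Make'], drop_first=True, dtype=int)
print(encoded)
```

Each remaining `Make_*` column can then be used as a regular numeric feature, at the cost of one extra coefficient per category.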

Let's scale our feature data into the same range so we can easily compare the coefficients we end up with.
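To make the scaling step concrete, here is a small sketch of the arithmetic behind standardization, done by hand with NumPy on the first five Mileage values from `df.head()`. It assumes the usual definition (subtract the mean, divide by the population standard deviation), which is the convention `StandardScaler` follows.

```python
import numpy as np

# First five Mileage values shown in df.head() above.
mileage = np.array([8221.0, 9135.0, 13196.0, 16342.0, 19832.0])

# Standardize: subtract the mean, divide by the population
# standard deviation (ddof=0).
scaled = (mileage - mileage.mean()) / mileage.std(ddof=0)

print(scaled)
print(scaled.mean())       # approximately 0
print(scaled.std(ddof=0))  # approximately 1
```

Because pandas' `describe()` reports the sample standard deviation (ddof=1), a column scaled this way shows a std slightly above 1 there; for 804 rows that is sqrt(804/803), about 1.000622.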

In [64]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

# Take an explicit copy so the assignment below modifies X, not a slice of df
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
y = df['Price']

# .values replaces as_matrix(), which has been removed from pandas
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

print(X)

# No constant column is added to X, so the model is fit without an intercept
est = sm.OLS(y, X).fit()

est.summary()


      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.556279
1   -1.305902  0.527410  0.556279
2   -0.810128  0.527410  0.556279
3   -0.426058  0.527410  0.556279
4    0.000008  0.527410  0.556279
5    0.293493  0.527410  0.556279
6    0.335001  0.527410  0.556279
7    0.382369  0.527410  0.556279
8    0.511409  0.527410  0.556279
9    0.914768  0.527410  0.556279
10  -1.171368  0.527410  0.556279
11  -0.581834  0.527410  0.556279
12  -0.390532  0.527410  0.556279
13  -0.003899  0.527410  0.556279
14   0.430591  0.527410  0.556279
15   0.480156  0.527410  0.556279
16   0.509822  0.527410  0.556279
17   0.757160  0.527410  0.556279
18   1.594886  0.527410  0.556279
19   1.810849  0.527410  0.556279
20  -1.326046  0.527410  0.556279
21  -1.129860  0.527410  0.556279
22  -0.667658  0.527410  0.556279
23  -0.405792  0.527410  0.556279
24  -0.112796  0.527410  0.556279
25  -0.044552  0.527410  0.556279
26   0.190700  0.527410  0.556279
27   0.337442  0.527410  0.556279
28   0.566102 

Dep. Variable:,Price,R-squared:,0.064
Model:,OLS,Adj. R-squared:,0.06
Method:,Least Squares,F-statistic:,18.11
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,2.23e-11
Time:,12:07:54,Log-Likelihood:,-9207.1
No. Observations:,804,AIC:,18420.0
Df Residuals:,801,BIC:,18430.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Mileage,-1272.3412,804.623,-1.581,0.114,-2851.759,307.077
Cylinder,5587.4472,804.509,6.945,0.000,4008.252,7166.642
Doors,-1404.5513,804.275,-1.746,0.081,-2983.288,174.185

Omnibus:,157.913,Durbin-Watson:,0.008
Prob(Omnibus):,0.0,Jarque-Bera (JB):,257.529
Skew:,1.278,Prob(JB):,1.2e-56
Kurtosis:,4.074,Cond. No.,1.03


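The table of coefficients above gives us the values to plug into an equation of the form:
    B1 * Mileage + B2 * Cylinder + B3 * Doors

(There is no B0 term, since X was passed to OLS without a constant column.)

In this example, it's pretty clear that the number of cylinders is more important than anything else. Because the features were scaled to the same range, the coefficients are directly comparable: Cylinder has by far the largest magnitude (about 5587), and it is the only feature whose p-value falls below 0.05.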
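A short sketch of what leaving out the constant column means. It uses synthetic data and plain NumPy least squares rather than statsmodels, so the names and numbers are assumptions for illustration only. With standardized (zero-mean) features, the slopes come out the same with or without an intercept, and the intercept itself is just the mean of the target; in statsmodels, `sm.add_constant(X)` would play the role of the column of ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: two standardized features, target with a large offset.
X = rng.normal(size=(200, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 20000 + 5000 * X[:, 0] - 1000 * X[:, 1] + rng.normal(scale=500, size=200)

# Fit without an intercept, as in the cell above.
slopes_no_const = np.linalg.lstsq(X, y, rcond=None)[0]

# Fit again with an explicit column of ones for the intercept.
X_const = np.column_stack([np.ones(len(X)), X])
coefs_const = np.linalg.lstsq(X_const, y, rcond=None)[0]

print(slopes_no_const)  # slopes only
print(coefs_const)      # intercept first, then the same slopes
print(y.mean())         # matches the fitted intercept
```

What does change without a constant is the reported fit quality: statsmodels then computes an uncentered R-squared, so the small R-squared in the summary above should not be compared directly with one from a model that includes an intercept.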

Could we have figured that out earlier?

In [65]:
y.groupby(df.Doors).mean()

Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors does not mean a higher price! (Maybe it implies a sports car in some cases?) So it's no wonder that it's pretty useless as a predictor here. This is a very small data set, however, so we can't really read much meaning into it.
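Another quick check along the same lines is the correlation of each candidate feature with the price. The sketch below runs on a tiny made-up sample (the `sample` frame is hypothetical; in the notebook you would use `df` instead) and uses `DataFrame.corrwith`.

```python
import pandas as pd

# Small made-up sample; illustrative values only, not rows from cars.xls.
sample = pd.DataFrame({
    'Price':    [17314.0, 23807.0, 20580.0, 30500.0, 14273.0, 26717.0],
    'Cylinder': [6, 6, 4, 8, 4, 8],
    'Doors':    [4, 2, 4, 2, 4, 4],
})

# Pearson correlation of each candidate feature with Price.
corr = sample[['Cylinder', 'Doors']].corrwith(sample['Price'])
print(corr)
```

A correlation near zero suggests a feature will add little to a linear model on its own, although it says nothing about effects that only show up in combination with other features.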

## Activity

Mess around with the fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?

In [66]:
df.describe()

Unnamed: 0,Price,Mileage,Cylinder,Liter,Doors,Cruise,Sound,Leather
count,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0
mean,21343.143767,19831.93408,5.268657,3.037313,3.527363,0.752488,0.679104,0.723881
std,9884.852801,8196.319707,1.387531,1.105562,0.850169,0.431836,0.467111,0.447355
min,8638.930895,266.0,4.0,1.6,2.0,0.0,0.0,0.0
25%,14273.07387,14623.5,4.0,2.2,4.0,1.0,0.0,0.0
50%,18024.995019,20913.5,6.0,2.8,4.0,1.0,1.0,1.0
75%,26717.316636,25213.0,6.0,3.8,4.0,1.0,1.0,1.0
max,70755.466717,50387.0,8.0,6.0,4.0,1.0,1.0,1.0


In [67]:
X.describe()

Unnamed: 0,Mileage,Cylinder,Doors
count,804.0,804.0,804.0
mean,5.247323e-17,1.350495e-16,-1.591043e-15
std,1.000622,1.000622,1.000622
min,-2.388647,-0.9148957,-1.797659
25%,-0.6358556,-0.9148957,0.5562789
50%,0.1320396,0.5274105,0.5562789
75%,0.6569309,0.5274105,0.5562789
max,3.730221,1.969717,0.5562789


In [68]:
import numpy as np
# Add a random integer from 0 to 4 to each car's door count
df['Doors'] = df['Doors'].apply(lambda x: x + np.random.randint(0, 5))

In [69]:
df.describe()

Unnamed: 0,Price,Mileage,Cylinder,Liter,Doors,Cruise,Sound,Leather
count,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0
mean,21343.143767,19831.93408,5.268657,3.037313,5.575871,0.752488,0.679104,0.723881
std,9884.852801,8196.319707,1.387531,1.105562,1.655352,0.431836,0.467111,0.447355
min,8638.930895,266.0,4.0,1.6,2.0,0.0,0.0,0.0
25%,14273.07387,14623.5,4.0,2.2,4.0,1.0,0.0,0.0
50%,18024.995019,20913.5,6.0,2.8,6.0,1.0,1.0,1.0
75%,26717.316636,25213.0,6.0,3.8,7.0,1.0,1.0,1.0
max,70755.466717,50387.0,8.0,6.0,8.0,1.0,1.0,1.0


In [70]:
df.head()

Unnamed: 0,Price,Mileage,Make,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather
0,17314.103129,8221,Buick,Century,Sedan 4D,Sedan,6,3.1,8,1,1,1
1,17542.036083,9135,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
2,16218.847862,13196,Buick,Century,Sedan 4D,Sedan,6,3.1,8,1,1,0
3,16336.91314,16342,Buick,Century,Sedan 4D,Sedan,6,3.1,7,1,0,0
4,16339.170324,19832,Buick,Century,Sedan 4D,Sedan,6,3.1,8,1,0,1


In [72]:
# Take an explicit copy so the assignment below modifies X, not a slice of df
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
y = df['Price']

# .values replaces as_matrix(), which has been removed from pandas
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

print(X)

est = sm.OLS(y, X).fit()

est.summary()


      Mileage  Cylinder     Doors
0   -1.417485  0.527410  1.465331
1   -1.305902  0.527410 -0.952578
2   -0.810128  0.527410  1.465331
3   -0.426058  0.527410  0.860854
4    0.000008  0.527410  1.465331
5    0.293493  0.527410  0.860854
6    0.335001  0.527410 -0.348101
7    0.382369  0.527410 -0.952578
8    0.511409  0.527410  1.465331
9    0.914768  0.527410  0.256377
10  -1.171368  0.527410  0.256377
11  -0.581834  0.527410 -0.952578
12  -0.390532  0.527410  0.860854
13  -0.003899  0.527410  0.860854
14   0.430591  0.527410  0.860854
15   0.480156  0.527410  0.256377
16   0.509822  0.527410  0.860854
17   0.757160  0.527410  0.256377
18   1.594886  0.527410  1.465331
19   1.810849  0.527410 -0.952578
20  -1.326046  0.527410  0.860854
21  -1.129860  0.527410  1.465331
22  -0.667658  0.527410  1.465331
23  -0.405792  0.527410  1.465331
24  -0.112796  0.527410 -0.952578
25  -0.044552  0.527410 -0.348101
26   0.190700  0.527410  0.860854
27   0.337442  0.527410  0.860854
28   0.566102 

Dep. Variable:,Price,R-squared:,0.061
Model:,OLS,Adj. R-squared:,0.057
Method:,Least Squares,F-statistic:,17.24
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,7.46e-11
Time:,12:08:18,Log-Likelihood:,-9208.4
No. Observations:,804,AIC:,18420.0
Df Residuals:,801,BIC:,18440.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Mileage,-1280.9682,806.854,-1.588,0.113,-2864.767,302.831
Cylinder,5574.2673,805.867,6.917,0.000,3992.406,7156.129
Doors,-616.7605,806.607,-0.765,0.445,-2200.073,966.552

Omnibus:,190.409,Durbin-Watson:,0.01
Prob(Omnibus):,0.0,Jarque-Bera (JB):,354.183
Skew:,1.408,Prob(JB):,1.23e-77
Kurtosis:,4.625,Cond. No.,1.06


In [73]:
df2 = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')

In [74]:
# Add a random integer between 0 and the car's own price, so Doors now depends on Price
df2['Doors'] = df2.apply(lambda x: x.Doors + np.random.randint(0, int(x.Price)), axis=1)

In [75]:
df2.describe()

Unnamed: 0,Price,Mileage,Cylinder,Liter,Doors,Cruise,Sound,Leather
count,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0
mean,21343.143767,19831.93408,5.268657,3.037313,10615.833333,0.752488,0.679104,0.723881
std,9884.852801,8196.319707,1.387531,1.105562,8492.327601,0.431836,0.467111,0.447355
min,8638.930895,266.0,4.0,1.6,10.0,0.0,0.0,0.0
25%,14273.07387,14623.5,4.0,2.2,4344.5,1.0,0.0,0.0
50%,18024.995019,20913.5,6.0,2.8,8774.5,1.0,1.0,1.0
75%,26717.316636,25213.0,6.0,3.8,14762.25,1.0,1.0,1.0
max,70755.466717,50387.0,8.0,6.0,53841.0,1.0,1.0,1.0


In [76]:
df2.head()

Unnamed: 0,Price,Mileage,Make,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather
0,17314.103129,8221,Buick,Century,Sedan 4D,Sedan,6,3.1,17106,1,1,1
1,17542.036083,9135,Buick,Century,Sedan 4D,Sedan,6,3.1,6141,1,1,0
2,16218.847862,13196,Buick,Century,Sedan 4D,Sedan,6,3.1,8606,1,1,0
3,16336.91314,16342,Buick,Century,Sedan 4D,Sedan,6,3.1,15682,1,0,0
4,16339.170324,19832,Buick,Century,Sedan 4D,Sedan,6,3.1,4505,1,0,1


In [77]:
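# Take an explicit copy so the assignment below modifies X2, not a slice of df2
X2 = df2[['Mileage', 'Cylinder', 'Doors']].copy()
y2 = df2['Price']

# .values replaces as_matrix(), which has been removed from pandas
X2[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X2[['Mileage', 'Cylinder', 'Doors']].values)

print(X2)

est = sm.OLS(y2, X2).fit()

est.summary()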

      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.764715
1   -1.305902  0.527410 -0.527255
2   -0.810128  0.527410 -0.236812
3   -0.426058  0.527410  0.596929
4    0.000008  0.527410 -0.720019
5    0.293493  0.527410 -0.606198
6    0.335001  0.527410 -0.140076
7    0.382369  0.527410 -0.377733
8    0.511409  0.527410 -0.262852
9    0.914768  0.527410  0.352439
10  -1.171368  0.527410 -0.912076
11  -0.581834  0.527410 -0.133478
12  -0.390532  0.527410 -0.649559
13  -0.003899  0.527410 -1.123929
14   0.430591  0.527410  0.797470
15   0.480156  0.527410  0.891850
16   0.509822  0.527410 -0.291719
17   0.757160  0.527410  0.428319
18   1.594886  0.527410 -0.269332
19   1.810849  0.527410 -1.230091
20  -1.326046  0.527410  0.770017
21  -1.129860  0.527410 -0.999622
22  -0.667658  0.527410 -1.133708
23  -0.405792  0.527410 -0.049114
24  -0.112796  0.527410  0.539194
25  -0.044552  0.527410 -1.227498
26   0.190700  0.527410  0.748572
27   0.337442  0.527410  0.672692
28   0.566102 



Dep. Variable:,Price,R-squared:,0.092
Model:,OLS,Adj. R-squared:,0.088
Method:,Least Squares,F-statistic:,27.01
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,1.2e-16
Time:,12:08:32,Log-Likelihood:,-9194.8
No. Observations:,804,AIC:,18400.0
Df Residuals:,801,BIC:,18410.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Mileage,-625.9173,800.885,-0.782,0.435,-2197.999,946.164
Cylinder,4038.4060,844.172,4.784,0.000,2381.356,5695.456
Doors,4526.3556,853.154,5.305,0.000,2851.673,6201.038

Omnibus:,238.745,Durbin-Watson:,0.06
Prob(Omnibus):,0.0,Jarque-Bera (JB):,619.698
Skew:,1.53,Prob(JB):,2.72e-135
Kurtosis:,6.021,Cond. No.,1.48
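
Why does Doors only become a strong predictor in the last experiment? In the first attempt the noise added to Doors is independent of the price, while in the second the upper bound of the noise is the car's own price, so the modified Doors column carries information about Price. The sketch below reproduces that difference on synthetic data; all values and names here are made up for illustration and are not taken from cars.xls.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic prices and door counts (illustrative values only).
price = rng.uniform(8000, 70000, size=804)
doors = rng.choice([2, 4], size=804)

# Noise independent of price, similar in spirit to adding randint(0, 5).
doors_indep = doors + rng.integers(0, 5, size=804)

# Noise whose upper bound is the price itself, similar to randint(0, Price).
doors_leaky = doors + rng.integers(0, price.astype(int))

print(np.corrcoef(doors_indep, price)[0, 1])  # close to zero
print(np.corrcoef(doors_leaky, price)[0, 1])  # clearly positive
```

In other words, the "measurable influence" in the final summary comes from leaking the target into the feature, not from anything about doors.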
