# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [4]:
import pandas as pd

df = pd.read_excel('cars.xls')


In [3]:
df.head()

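| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17314.103129 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 |
| 1 | 17542.036083 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 2 | 16218.847862 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 3 | 16336.91314 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 |
| 4 | 16339.170324 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 |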


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values unless you can convert them into some numerical order that makes sense.
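
If you did want to bring a text column such as the make into a regression, one common approach is to turn it into 0/1 indicator columns rather than inventing an order. The sketch below uses `pd.get_dummies` on a few made-up rows; the `toy` frame and its values are assumptions for illustration and are not taken from `cars.xls`.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from cars.xls
toy = pd.DataFrame({
    'Make': ['Buick', 'Saab', 'Buick', 'Cadillac'],
    'Mileage': [8221, 12000, 9135, 15000],
})

# One 0/1 indicator column per make; drop_first drops one level
# so the indicators are not redundant with each other
encoded = pd.get_dummies(toy, columns=['Make'], drop_first=True, dtype=int)
print(encoded)
```

Whether this helps depends on the data: a column with many distinct values (like the model) produces many sparse indicators.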

Let's scale our feature data into the same range so we can easily compare the coefficients we end up with.
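
To make the scaling step concrete, here is a small sketch of what `StandardScaler` does to each column: it subtracts the column mean and divides by the column's standard deviation. The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only
toy = pd.DataFrame({
    'Mileage': [8000.0, 12000.0, 16000.0, 20000.0],
    'Cylinder': [4.0, 6.0, 6.0, 8.0],
})

scaled = StandardScaler().fit_transform(toy)

# The same result by hand: subtract the column mean and divide by
# the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, every column has mean 0 and standard deviation 1, so a coefficient describes the price change per one standard deviation of that feature.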

In [3]:
import statsmodels.api as sm  # OLS regression and statistical summaries
from sklearn.preprocessing import StandardScaler  # standardizes features to zero mean, unit variance

scale = StandardScaler()

features = ['Mileage', 'Cylinder', 'Liter', 'Leather']

# .copy() gives us our own frame, so overwriting its columns below
# does not trigger pandas' SettingWithCopyWarning
X = df[features].copy()
y = df['Price']

# .to_numpy() replaces DataFrame.as_matrix(), which newer pandas versions no longer provide
X[features] = scale.fit_transform(X[features].to_numpy())

print(X)

# Note: sm.OLS does not add an intercept term on its own
est = sm.OLS(y, X).fit()

est.summary()


      Mileage  Cylinder     Liter   Leather
0   -1.417485  0.527410  0.056736  0.617611
1   -1.305902  0.527410  0.056736 -1.619142
2   -0.810128  0.527410  0.056736 -1.619142
3   -0.426058  0.527410  0.056736 -1.619142
4    0.000008  0.527410  0.056736  0.617611
5    0.293493  0.527410  0.056736 -1.619142
6    0.335001  0.527410  0.056736 -1.619142
7    0.382369  0.527410  0.056736 -1.619142
8    0.511409  0.527410  0.056736  0.617611
9    0.914768  0.527410  0.056736  0.617611
10  -1.171368  0.527410  0.509277 -1.619142
11  -0.581834  0.527410  0.509277 -1.619142
12  -0.390532  0.527410  0.509277 -1.619142
13  -0.003899  0.527410  0.509277  0.617611
14   0.430591  0.527410  0.509277  0.617611
15   0.480156  0.527410  0.509277 -1.619142
16   0.509822  0.527410  0.509277 -1.619142
17   0.757160  0.527410  0.509277  0.617611
18   1.594886  0.527410  0.509277 -1.619142
19   1.810849  0.527410  0.509277  0.617611
20  -1.326046  0.527410  0.509277 -1.619142
21  -1.129860  0.527410  0.50927

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.063 |
| Model: | OLS | Adj. R-squared: | 0.058 |
| Method: | Least Squares | F-statistic: | 13.36 |
| Date: | Sun, 06 Oct 2019 | Prob (F-statistic): | 1.52e-10 |
| Time: | 15:37:59 | Log-Likelihood: | -9207.5 |
| No. Observations: | 804 | AIC: | 18420.0 |
| Df Residuals: | 800 | BIC: | 18440.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Mileage | -1266.6470 | 805.846 | -1.572 | 0.116 | -2848.470 | 315.176 |
| Cylinder | 4059.0669 | 2807.356 | 1.446 | 0.149 | -1451.587 | 9569.721 |
| Liter | 1504.4958 | 2809.339 | 0.536 | 0.592 | -4010.051 | 7019.043 |
| Leather | 1116.2450 | 808.463 | 1.381 | 0.168 | -470.713 | 2703.203 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 208.132 | Durbin-Watson: | 0.012 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 425.092 |
| Skew: | 1.466 | Prob(JB): | 4.93e-93 |
| Kurtosis: | 5.024 | Cond. No. | 6.86 |


In [4]:
Z = [y.groupby(df.Mileage).mean(), y.groupby(df.Cylinder).mean(), y.groupby(df.Liter).mean()]
Z

[Mileage
 266      10813.343521
 583      70755.466717
 636      25948.962594
 788      48310.329545
 865      16116.843916
 881      17360.810635
 932      19446.882941
 1160     14584.448122
 1169     15635.796160
 1480     19164.610627
 1592     19822.115392
 1676     35033.215454
 1737     14739.067236
 1787     20021.195206
 1853     21757.049509
 2189     19567.259291
 2202     51154.047216
 2295     23197.436790
 2308     25589.983155
 2392     15110.192598
 2464     14894.982593
 2616     48365.980897
 2846     42741.523666
 2879     16916.869535
 2973     16927.779761
 2992     20698.077083
 3625     46732.606030
 3629     12649.110893
 3828     37088.562413
 3867     32197.340466
              ...     
 34191    31186.741463
 34269    26012.374625
 34447    11149.618304
 34621     9919.048185
 34665    17968.838278
 34815    12741.190233
 34998    11521.525888
 35157    27666.231078
 35299     8768.998585
 35326    32038.339563
 35624    16216.980706
 35662    13585.636802
 3

Surprisingly, more doors does not mean a higher price! (Maybe fewer doors implies a sports car in some cases?) That would explain why door count is pretty useless as a predictor here. This is a very small data set, however, so we can't read much meaning into it.
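
The grouped means shown above cover Mileage, Cylinder, and Liter rather than Doors. A minimal sketch of how the same `groupby` pattern could be pointed at a door count follows; the `toy` frame is made up for illustration, and with the real data you would use `df['Price']` and `df['Doors']` instead.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from cars.xls
toy = pd.DataFrame({
    'Doors': [2, 2, 4, 4, 4],
    'Price': [30000.0, 26000.0, 18000.0, 22000.0, 20000.0],
})

# Mean price for each distinct door count
mean_by_doors = toy['Price'].groupby(toy['Doors']).mean()
print(mean_by_doors)
```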

## Activity

Mess around with the fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?