# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [1]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')


In [5]:
df.describe()

,Price,Mileage,Cylinder,Liter,Doors,Cruise,Sound,Leather,Model_ord
count,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0
mean,21343.143767,19831.93408,5.268657,3.037313,3.527363,0.752488,0.679104,0.723881,14.838308
std,9884.852801,8196.319707,1.387531,1.105562,0.850169,0.431836,0.467111,0.447355,8.706433
min,8638.930895,266.0,4.0,1.6,2.0,0.0,0.0,0.0,0.0
25%,14273.07387,14623.5,4.0,2.2,4.0,1.0,0.0,0.0,6.0
50%,18024.995019,20913.5,6.0,2.8,4.0,1.0,1.0,1.0,14.0
75%,26717.316636,25213.0,6.0,3.8,4.0,1.0,1.0,1.0,22.0
max,70755.466717,50387.0,8.0,6.0,4.0,1.0,1.0,1.0,31.0


In [2]:
df.head()

,Price,Mileage,Make,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather
0,17314.103129,8221,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,1
1,17542.036083,9135,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
2,16218.847862,13196,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
3,16336.91314,16342,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,0
4,16339.170324,19832,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,1


In [9]:
df.Model.unique()

array(['Century', 'Lacrosse', 'Lesabre', 'Park Avenue', 'CST-V', 'CTS',
       'Deville', 'STS-V6', 'STS-V8', 'XLR-V8', 'AVEO', 'Cavalier',
       'Classic', 'Cobalt', 'Corvette', 'Impala', 'Malibu', 'Monte Carlo',
       'Bonneville', 'G6', 'Grand Am', 'Grand Prix', 'GTO', 'Sunfire',
       'Vibe', '9_3', '9_3 HO', '9_5', '9_5 HO', '9-2X AWD', 'Ion',
       'L Series'], dtype=object)

In [6]:
df.Model.value_counts()

AVEO           60
Cavalier       60
Malibu         60
Ion            50
Cobalt         50
9_3 HO         40
Grand Prix     30
Bonneville     30
Monte Carlo    30
Vibe           30
Impala         30
9_5            30
Deville        30
Lacrosse       30
Corvette       20
9_3            20
Grand Am       20
9_5 HO         20
Park Avenue    20
Lesabre        20
G6             20
Classic        10
STS-V8         10
CST-V          10
Sunfire        10
CTS            10
GTO            10
Century        10
L Series       10
XLR-V8         10
STS-V6         10
9-2X AWD        4
Name: Model, dtype: int64

### We can use pandas to split this table into the feature columns we're interested in and the value we're trying to predict.
### Note how we use pandas.Categorical to convert textual category data (model name) into integer codes that we can work with. The codes follow the sorted order of the model names, so they are arbitrary labels rather than a meaningful ranking.

In [12]:
pd.Categorical(df.Model).codes[0:20]

array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 22, 22, 22, 22, 22, 22, 22,
       22, 22, 22], dtype=int8)
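To see where codes like 10 and 22 come from, here is a minimal sketch on a toy list of labels (the list is made up for illustration and is not read from the data set). It shows that `pd.Categorical` sorts the unique labels and uses each label's position in that sorted list as its code.

```python
import pandas as pd

# Toy labels for illustration only; not taken from cars.xls
models = ['Century', 'AVEO', 'Century', 'Malibu', 'AVEO']
cat = pd.Categorical(models)

# Unique labels, in sorted order
print(list(cat.categories))

# Each entry's code is the position of its label in that sorted list
print(cat.codes.tolist())

# Lookup table for turning a code back into its label
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)
```

The same lookup built from `pd.Categorical(df.Model).categories` would let you translate a `Model_ord` value back into a model name.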

In [3]:
import statsmodels.api as sm

df['Model_ord'] = pd.Categorical(df.Model).codes
X = df[['Mileage', 'Model_ord', 'Doors']]
y = df[['Price']]

X1 = sm.add_constant(X)
est = sm.OLS(y, X1).fit()

est.summary()

Dep. Variable:,Price,R-squared:,0.042
Model:,OLS,Adj. R-squared:,0.038
Method:,Least Squares,F-statistic:,11.57
Date:,"Sat, 31 Jul 2021",Prob (F-statistic):,1.98e-07
Time:,20:50:42,Log-Likelihood:,-8519.1
No. Observations:,804,AIC:,17050.0
Df Residuals:,800,BIC:,17060.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,3.125e+04,1809.549,17.272,0.000,2.77e+04,3.48e+04
Mileage,-0.1765,0.042,-4.227,0.000,-0.259,-0.095
Model_ord,-39.0387,39.326,-0.993,0.321,-116.234,38.157
Doors,-1652.9303,402.649,-4.105,0.000,-2443.303,-862.558

Omnibus:,206.41,Durbin-Watson:,0.08
Prob(Omnibus):,0.0,Jarque-Bera (JB):,470.872
Skew:,1.379,Prob(JB):,5.64e-103
Kurtosis:,5.541,Cond. No.,115000.0
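To make the coefficient table concrete, here is a rough sketch that plugs the rounded coefficients from the summary into the fitted linear equation. The example car (20,000 miles, model code 10, 4 doors) and the helper name `predict_price` are assumptions chosen for illustration, and because the constant is only shown to four significant figures the result is approximate.

```python
# Coefficients copied (rounded) from the summary table above
coef = {
    'const': 3.125e4,
    'Mileage': -0.1765,
    'Model_ord': -39.0387,
    'Doors': -1652.9303,
}

def predict_price(mileage, model_ord, doors):
    """Evaluate price = const + b1*Mileage + b2*Model_ord + b3*Doors."""
    return (coef['const']
            + coef['Mileage'] * mileage
            + coef['Model_ord'] * model_ord
            + coef['Doors'] * doors)

# Hypothetical car: 20,000 miles, model code 10, 4 doors
predicted = predict_price(mileage=20000, model_ord=10, doors=4)
print(round(predicted, 2))
```

With the fitted `est` object available, calling `est.predict` on a row laid out like `X1` (constant first) should give the same figure without the rounding.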


In [4]:
y.groupby(df.Doors).mean()

Doors,Price
2,23807.13552
4,20580.670749


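Surprisingly, more doors does not mean a higher price! Two-door cars in this data set average about 23,807 against about 20,581 for four-door cars, and the regression gives Doors a coefficient of about -1,653 per door with a p-value below 0.001. So Doors does carry some signal here, just in the opposite direction from what you might expect; it is Model_ord (p = 0.321) that contributes little. Keep in mind that the model as a whole explains only about 4% of the variance in price (R-squared of 0.042), and this is a very small data set, so we can't really read much meaning into it.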

## Activity

Experiment with the input data and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?
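One possible way to approach the activity is sketched below. It is only an illustration: every number in it (the base price, the per-door bonus, the noise level, the door counts) is invented, and it fits the model with NumPy's least-squares solver rather than the statsmodels OLS call used above, so that it runs without the spreadsheet.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 804                         # same row count as the real data set

# Invented inputs: mileage in miles, and door counts that go past 4
mileage = rng.uniform(0, 50000, n)
doors = rng.choice([2, 4, 6, 8], n)

# Invented relationship: each extra door adds 2000 to the price
true_door_effect = 2000.0
noise = rng.normal(0, 1000, n)
price = 20000 - 0.15 * mileage + true_door_effect * doors + noise

# Design matrix with an explicit constant column, like sm.add_constant
X = np.column_stack([np.ones(n), mileage, doors])

# Ordinary least squares fit
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, mileage_coef, doors_coef = coefs

print('intercept:', intercept)
print('mileage coefficient:', mileage_coef)
print('doors coefficient:', doors_coef)
```

Because the door effect was built into the fake prices, the fitted doors coefficient should land close to the value chosen for `true_door_effect`. Try shrinking that value or raising the noise to see when the effect stops being detectable.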