# Handle categorical predictors

In this notebook we use a categorical variable as a predictor in a regression model.
We work with the auto-mpg dataset ([course link](https://openclassrooms.com/en/courses/5873596-design-effective-statistical-models-to-understand-your-data/6233031-handle-categorical-predictors)).

In [132]:
import numpy as np 
import pandas as pd
df=pd.read_csv("data/auto-mpg.csv")
df.head()


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


This dataset has several categorical variables, but we focus on the origin and name variables.

# Origin category

In [133]:
# Map the numeric origin codes to readable labels
origin = {1: 'Amerique', 2: 'European', 3: 'Japanese'}
df["origin"] = df["origin"].astype(int).map(origin)
df.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
387,27.0,4,140.0,86.0,2790.0,15.6,82,Amerique,ford mustang gl
388,44.0,4,97.0,52.0,2130.0,24.6,82,European,vw pickup
389,32.0,4,135.0,84.0,2295.0,11.6,82,Amerique,dodge rampage
390,28.0,4,120.0,79.0,2625.0,18.6,82,Amerique,ford ranger
391,31.0,4,119.0,82.0,2720.0,19.4,82,Amerique,chevy s-10


In [135]:
df.origin.value_counts()

Amerique    245
Japanese     79
European     68
Name: origin, dtype: int64

In [39]:
df['brand'] = df.name.apply(lambda d : d.split(' ')[0])
df.brand

0      chevrolet
1          buick
2       plymouth
3            amc
4           ford
         ...    
387         ford
388           vw
389        dodge
390         ford
391        chevy
Name: brand, Length: 392, dtype: object

In [137]:
import statsmodels.formula.api as smf
results=smf.ols(formula="mpg ~ origin",data=df).fit()
results.summary()

0,1,2,3
Dep. Variable:,mpg,R-squared:,0.332
Model:,OLS,Adj. R-squared:,0.328
Method:,Least Squares,F-statistic:,96.6
Date:,"Sun, 14 Jun 2020",Prob (F-statistic):,8.67e-35
Time:,00:27:07,Log-Likelihood:,-1282.2
No. Observations:,392,AIC:,2570.0
Df Residuals:,389,BIC:,2582.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,20.0335,0.409,49.025,0.000,19.230,20.837
origin[T.European],7.5695,0.877,8.634,0.000,5.846,9.293
origin[T.Japanese],10.4172,0.828,12.588,0.000,8.790,12.044

0,1,2,3
Omnibus:,26.33,Durbin-Watson:,0.763
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30.217
Skew:,0.679,Prob(JB):,2.74e-07
Kurtosis:,3.066,Cond. No.,3.16


## Interpreting the model

Two things stand out:

* The statistical model handles the categorical variable by creating two new variables: origin[T.European] and origin[T.Japanese].
* We have three categories, but only two variables are created.



## Dummy Encoding

Dummy encoding represents a variable with k levels as k-1 binary variables.

In this case:

* origin[T.European] is true (1) if the car is European and false (0) otherwise
* origin[T.Japanese] is true (1) if the car is Japanese and false (0) otherwise

When a car is neither Japanese nor European, it is American (the Amerique level), which serves as the reference category.
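
As a minimal sketch of this k-1 encoding, pandas can drop one level directly with `drop_first=True`. The `toy_origin` series below is made up for illustration and is not the auto-mpg data; the dropped level (`Amerique`, the first in sorted order) plays the role of the reference category.

```python
import pandas as pd

# Toy data for illustration only (not the auto-mpg dataset)
toy_origin = pd.Series(
    ["Amerique", "European", "Japanese", "Amerique"], name="origin"
)

# drop_first=True keeps k-1 = 2 indicator columns; the first level
# in sorted order ("Amerique") is dropped and becomes the reference
dummies = pd.get_dummies(toy_origin, drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in both remaining columns is therefore an `Amerique` car, so no information is lost by dropping the third column.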





In [138]:
# dtype=int keeps 0/1 columns (pandas 2.x returns booleans by default)
pd.get_dummies(df.origin, dtype=int).head()

Unnamed: 0,Amerique,European,Japanese
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [139]:
# Attach the 0/1 indicator columns to df, aligned on the row index
df = df.merge(pd.get_dummies(df.origin, dtype=int), left_index=True, right_index=True)
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,Amerique,European,Japanese
0,18.0,8,307.0,130.0,3504.0,12.0,70,Amerique,chevrolet chevelle malibu,1,0,0
1,15.0,8,350.0,165.0,3693.0,11.5,70,Amerique,buick skylark 320,1,0,0
2,18.0,8,318.0,150.0,3436.0,11.0,70,Amerique,plymouth satellite,1,0,0
3,16.0,8,304.0,150.0,3433.0,12.0,70,Amerique,amc rebel sst,1,0,0
4,17.0,8,302.0,140.0,3449.0,10.5,70,Amerique,ford torino,1,0,0


## The model mpg ~ Japanese + European

In [141]:
results2 = smf.ols('mpg ~ Japanese + European', data = df).fit()
results2.summary()

0,1,2,3
Dep. Variable:,mpg,R-squared:,0.332
Model:,OLS,Adj. R-squared:,0.328
Method:,Least Squares,F-statistic:,96.6
Date:,"Sun, 14 Jun 2020",Prob (F-statistic):,8.67e-35
Time:,00:42:54,Log-Likelihood:,-1282.2
No. Observations:,392,AIC:,2570.0
Df Residuals:,389,BIC:,2582.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,20.0335,0.409,49.025,0.000,19.230,20.837
Japanese,10.4172,0.828,12.588,0.000,8.790,12.044
European,7.5695,0.877,8.634,0.000,5.846,9.293

0,1,2,3
Omnibus:,26.33,Durbin-Watson:,0.763
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30.217
Skew:,0.679,Prob(JB):,2.74e-07
Kurtosis:,3.066,Cond. No.,3.16


This model gives the same results as mpg ~ origin: the coefficients, standard errors, and R-squared are identical.
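
One way to see why the two fits agree is to write the model out. The symbols $\beta_0$, $\beta_E$, $\beta_J$ and $\varepsilon_i$ are assumed notation for this sketch, not notation taken from the course; European and Japanese are the 0/1 indicator columns.

$$
\text{mpg}_i = \beta_0 + \beta_E \,\text{European}_i + \beta_J \,\text{Japanese}_i + \varepsilon_i
$$

With mpg ~ origin the two indicator columns are created automatically, and with mpg ~ Japanese + European they are supplied by hand. In both cases the same equation is fitted to the same data, so the estimates match.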

## Interpreting the coefficients

In [149]:
df[["mpg","origin"]].groupby(by="origin").mean().reset_index()

Unnamed: 0,origin,mpg
0,Amerique,20.033469
1,European,27.602941
2,Japanese,30.450633


* The intercept (20.0335) is the mean mpg of American cars.
* origin[T.European] = mean mpg of European cars - intercept = 27.602941 - 20.033469 ≈ 7.5695
* origin[T.Japanese] = mean mpg of Japanese cars - intercept = 30.450633 - 20.033469 ≈ 10.4172
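
The relationship between the coefficients and the group means can be checked numerically. The sketch below uses a small made-up dataset (`toy`, not the auto-mpg data) and fits ordinary least squares with `numpy.linalg.lstsq` rather than statsmodels, so it only illustrates the arithmetic and does not reproduce the fit above.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the auto-mpg dataset)
toy = pd.DataFrame({
    "origin": ["Amerique", "Amerique", "European",
               "European", "Japanese", "Japanese"],
    "mpg": [18.0, 22.0, 26.0, 30.0, 29.0, 33.0],
})

# Design matrix: an intercept column plus k-1 indicator columns
X = pd.get_dummies(toy["origin"], drop_first=True, dtype=float)
X.insert(0, "Intercept", 1.0)

# Ordinary least squares fit via numpy
coef, *_ = np.linalg.lstsq(X.to_numpy(), toy["mpg"].to_numpy(), rcond=None)
coef = pd.Series(coef, index=X.columns)

# Group means, and their differences from the reference group
group_means = toy.groupby("origin")["mpg"].mean()
print(coef)
print(group_means - group_means["Amerique"])
```

In this toy example the group means are 20, 28, and 31, so up to floating-point rounding the fitted intercept is 20 and the two coefficients are 8 and 11, the differences from the reference group.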
