<h2>Categorical Variables and One Hot Encoding</h2>

In [1]:
import pandas as pd

<h2 style='color:purple'>Using pandas to create dummy variables</h2>

In [2]:
Real_State_File=pd.read_csv("homeprices.csv")
Real_State_File

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [3]:
Real_State_File.replace(to_replace=['monroe township','west windsor','robinsville'],value=['Elzaywa','Hada2ek Elaasima','Fifth Statment'],inplace=True)

In [4]:
Real_State_File

Unnamed: 0,town,area,price
0,Elzaywa,2600,550000
1,Elzaywa,3000,565000
2,Elzaywa,3200,610000
3,Elzaywa,3600,680000
4,Elzaywa,4000,725000
5,Hada2ek Elaasima,2600,585000
6,Hada2ek Elaasima,2800,615000
7,Hada2ek Elaasima,3300,650000
8,Hada2ek Elaasima,3600,710000
9,Fifth Statment,2600,575000


In [10]:
town_dummies=pd.get_dummies(Real_State_File.town)
town_dummies

Unnamed: 0,Elzaywa,Fifth Statment,Hada2ek Elaasima
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [11]:
merge_of_dummies_to_table=pd.concat([Real_State_File,town_dummies],axis='columns')
merge_of_dummies_to_table

Unnamed: 0,town,area,price,Elzaywa,Fifth Statment,Hada2ek Elaasima
0,Elzaywa,2600,550000,1,0,0
1,Elzaywa,3000,565000,1,0,0
2,Elzaywa,3200,610000,1,0,0
3,Elzaywa,3600,680000,1,0,0
4,Elzaywa,4000,725000,1,0,0
5,Hada2ek Elaasima,2600,585000,0,0,1
6,Hada2ek Elaasima,2800,615000,0,0,1
7,Hada2ek Elaasima,3300,650000,0,0,1
8,Hada2ek Elaasima,3600,710000,0,0,1
9,Fifth Statment,2600,575000,0,1,0


In [18]:
final_file=merge_of_dummies_to_table.drop(['town','Fifth Statment'],axis="columns")
final_file

Unnamed: 0,area,price,Elzaywa,Hada2ek Elaasima
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,1
6,2800,615000,0,1
7,3300,650000,0,1
8,3600,710000,0,1
9,2600,575000,0,0


<h3 style='color:purple'>Dummy Variable Trap</h3>

When one variable can be derived from the other variables, they are said to be multicollinear. Here,
if you know the values of Elzaywa and Hada2ek Elaasima, you can easily infer the value of Fifth Statment:
it is 1 exactly when Elzaywa=0 and Hada2ek Elaasima=0. Therefore these town variables are multicollinear. In this
situation linear regression may not work as expected, so you need to drop one of the dummy columns.

**NOTE: the sklearn library takes care of the dummy variable trap, so it will still work even if you don't drop one of the
    town columns. However, we should make a habit of handling the dummy variable
    trap ourselves in case the library we are using does not handle it for us.**
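
The redundancy described above can be checked numerically. The following is a minimal sketch, assuming a tiny hand-built sample of the town labels from the table; the names `towns`, `design`, and `design_reduced` are illustrative and not part of the original notebook. It uses `drop_first=True`, which drops the alphabetically first town (Elzaywa), whereas the notebook drops Fifth Statment by hand; either choice removes the redundancy. `dtype=int` is passed because recent pandas versions return booleans by default.

```python
import numpy as np
import pandas as pd

# Illustrative sample of town labels taken from the table above.
towns = pd.Series(
    ["Elzaywa", "Elzaywa", "Hada2ek Elaasima", "Fifth Statment"], name="town"
)

# Full one-hot encoding: one column per town.
full = pd.get_dummies(towns, dtype=int)

# Every row contains exactly one 1, so the dummy columns always sum to 1,
# which duplicates the intercept column of a linear model.
row_sums = full.sum(axis=1)

# Design matrix with an explicit intercept column plus all three dummies.
design = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(design)  # lower than the number of columns

# drop_first=True removes the first category, leaving two dummy columns.
reduced = pd.get_dummies(towns, drop_first=True, dtype=int)
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)  # equals the number of columns

print(design.shape[1], rank_full)
print(design_reduced.shape[1], rank_reduced)
```

When the rank is lower than the number of columns, the coefficients are not uniquely determined, which is the practical meaning of the trap.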

In [24]:
X = final_file.drop(['price'],axis='columns') # Preparation step for regression: build the input features
X

Unnamed: 0,area,Elzaywa,Hada2ek Elaasima
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,1
6,2800,0,1
7,3300,0,1
8,3600,0,1
9,2600,0,0


In [25]:
y = final_file.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [33]:
from sklearn.linear_model import LinearRegression
My_Real_Estate_Model= LinearRegression()

In [34]:
My_Real_Estate_Model.fit(X,y) #Training Process

LinearRegression()

In [40]:
My_Real_Estate_Model.predict(X) # predictions for every row of the training data

array([539709.73984091, 590468.71640508, 615848.20468716, 666607.18125133,
       717366.1578155 , 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [42]:
My_Real_Estate_Model.score(X,y) # R^2 (coefficient of determination) on the training data

0.9573929037221873
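
For a regressor, `score` returns the coefficient of determination R² rather than a classification accuracy. As a rough sketch of what that number means, the helper below (a hypothetical `r_squared` function, not part of the notebook) computes it by hand from the first five prices and predictions shown above. Because it uses only a subset of the rows, its result will not match the score printed above.

```python
import numpy as np

# First five actual prices and model predictions copied from the outputs above.
y_true = np.array([550000, 565000, 610000, 680000, 725000], dtype=float)
y_pred = np.array([539709.73984091, 590468.71640508, 615848.20468716,
                   666607.18125133, 717366.1578155])

def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

subset_r2 = r_squared(y_true, y_pred)
print(subset_r2)
```

A value of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean price.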

In [44]:
My_Real_Estate_Model.predict([[200,0,0]]) # 200 sq ft home in Fifth Statment

array([260842.29198029])

In [45]:
My_Real_Estate_Model.predict([[2800,0,1]]) # 2800 sq ft home in Hada2ek Elaasima

array([605103.20361213])
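
When a model is fitted on a DataFrame, recent sklearn versions warn if `predict` receives a plain nested list, because the feature names are missing. The sketch below is one way to avoid that, assuming only the ten rows visible in the table above (so its numbers will differ from the outputs shown); the names `homes`, `features`, and `query` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows visible in the table above.
homes = pd.DataFrame({
    "town": ["Elzaywa"] * 5 + ["Hada2ek Elaasima"] * 4 + ["Fifth Statment"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# One-hot encode the town and drop one dummy column, as in the notebook.
dummies = pd.get_dummies(homes["town"], dtype=int).drop(columns="Fifth Statment")
features = pd.concat([homes[["area"]], dummies], axis="columns")

model = LinearRegression().fit(features, homes["price"])

# Build the query with the same column names and order as the training data.
query = pd.DataFrame([[2800, 0, 1]], columns=features.columns)
prediction = model.predict(query)
print(prediction)
```

Naming the columns in the query also makes it harder to mix up the order of the area and dummy values.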

<h2 style='color:purple'>Using sklearn OneHotEncoder</h2>

The first step is to use a label encoder to convert the town names into numbers.

In [46]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [47]:
dfle = Real_State_File.copy() # work on a copy so the original DataFrame is not overwritten
dfle.town = le.fit_transform(dfle.town)
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [48]:
X = dfle[['town','area']].values

In [49]:
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [50]:
y = dfle.price.values
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000], dtype=int64)

Now use a one hot encoder to create dummy variables for each of the towns.

In [51]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [52]:
X = ct.fit_transform(X)
X

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [53]:
X = X[:,1:] # drop the first dummy column to avoid the dummy variable trap

In [54]:
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])
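
The label encoding and the manual column drop above can also be folded into the encoder itself. This is a minimal sketch under some assumptions: it uses a four-row sample of the table, encodes the string column directly, and relies on `OneHotEncoder(drop="first")`, which requires a reasonably recent sklearn version. `sparse_threshold=0` is set so the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A few rows copied from the town/area table above.
df = pd.DataFrame({
    "town": ["Elzaywa", "Elzaywa", "Hada2ek Elaasima", "Fifth Statment"],
    "area": [2600, 3000, 2800, 2600],
})

# drop="first" removes the first category of each encoded column,
# which plays the same role as X[:, 1:] in the notebook.
ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct.fit_transform(df)

# Categories learned by the encoder, in sorted order; the first one is dropped.
categories = ct.named_transformers_["town"].categories_[0]
print(categories)
print(X_encoded)
```

The remaining columns are ordered as Fifth Statment, Hada2ek Elaasima, area, matching the layout of the array above.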

In [55]:
model = LinearRegression() # `model` was never defined earlier, so create it here
model.fit(X,y)

LinearRegression()

In [56]:
model.predict([[0,1,3400]]) # 3400 sq ft home in Hada2ek Elaasima

array([681241.6684584])

In [57]:
model.predict([[1,0,2800]]) # 2800 sq ft home in Fifth Statment

array([590775.63964739])

<h2 style='color:green'>Exercise</h2>

In [58]:
import pandas as pd
excercise_file=pd.read_csv(r'H:\ML_DL\ML Project 01\py\ML\5_one_hot_encoding\Exercise\carprices.csv')
excercise_file

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


In [59]:
car_dummies=pd.get_dummies(excercise_file['Car Model'])
car_dummies

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


In [64]:
Merged_data=pd.concat([excercise_file,car_dummies],axis='columns')
Merged_data

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,BMW X5,69000,18000,6,0,1,0
1,BMW X5,35000,34000,3,0,1,0
2,BMW X5,57000,26100,5,0,1,0
3,BMW X5,22500,40000,2,0,1,0
4,BMW X5,46000,31500,4,0,1,0
5,Audi A5,59000,29400,5,1,0,0
6,Audi A5,52000,32000,5,1,0,0
7,Audi A5,72000,19300,6,1,0,0
8,Audi A5,91000,12000,8,1,0,0
9,Mercedez Benz C class,67000,22000,6,0,0,1


In [70]:
Final_File=Merged_data.drop(['Car Model','Mercedez Benz C class'], axis='columns')
Final_File

Unnamed: 0,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5
0,69000,18000,6,0,1
1,35000,34000,3,0,1
2,57000,26100,5,0,1
3,22500,40000,2,0,1
4,46000,31500,4,0,1
5,59000,29400,5,1,0
6,52000,32000,5,1,0
7,72000,19300,6,1,0
8,91000,12000,8,1,0
9,67000,22000,6,0,0


In [74]:
from sklearn.linear_model import LinearRegression
Car_model=LinearRegression()

In [75]:
x=Final_File.drop(['Sell Price($)'], axis='columns')
x

Unnamed: 0,Mileage,Age(yrs),Audi A5,BMW X5
0,69000,6,0,1
1,35000,3,0,1
2,57000,5,0,1
3,22500,2,0,1
4,46000,4,0,1
5,59000,5,1,0
6,52000,5,1,0
7,72000,6,1,0
8,91000,8,1,0
9,67000,6,0,0


In [77]:
Y=Final_File['Sell Price($)']
Y

0     18000
1     34000
2     26100
3     40000
4     31500
5     29400
6     32000
7     19300
8     12000
9     22000
10    20000
11    21000
12    33000
Name: Sell Price($), dtype: int64

In [78]:
Car_model.fit(x,Y)

LinearRegression()

In [79]:
Car_model.predict(x)

array([18705.2723644 , 35286.78445645, 24479.19112468, 41245.76426391,
       29882.98779056, 28023.6135243 , 30614.46818502, 21879.57266964,
       12182.34562104, 26183.72387884, 18929.31674102, 20409.80511857,
       30477.15426156])

In [82]:
Car_model.score(x,Y)

0.9417050937281083

In [85]:
Car_model.predict([[45000,4,0,0]]) # Mercedez Benz C class, 4 yr old, mileage 45000

array([36991.31721061])

In [86]:
Car_model.predict([[45000,7,0,1]]) # BMW X5, 7 yr old, mileage 45000

array([26255.74900211])

At the same level as this notebook on GitHub, there is an Exercise folder that contains carprices.csv.
This file has car sell prices for 3 different models. First plot the data points on a scatter plot
to see whether a linear regression model can be applied. If yes, then build a model that can answer
the following questions:

**1) Predict the price of a Mercedez Benz that is 4 yr old with mileage 45000**

**2) Predict price of a BMW X5 that is 7 yr old with mileage 86000**

**3) Tell me the score (accuracy) of your model. (Hint: use LinearRegression().score())**