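There are three types of categorical variables: binary, nominal, and ordinal. Binary means there are exactly two categories, nominal means there are several categories with no inherent order (for example, car models), and ordinal means the categories have a meaningful order (for example, low, medium, high). In any case, we must consider how to handle categorical variables, as they are text information, not numerical in nature. How do we handle nominal data? Is simple label encoding (changing the text to an integer such as 0, 1, or 2) enough? Not for nominal data: the integers would suggest an order and a spacing between categories that do not exist, so we create dummy (one-hot) variables instead.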
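To make the label-encoding question concrete, here is a minimal sketch using the three car models from the dataset below. The list `models` is made up for illustration; only the category names come from the data.

```python
from sklearn.preprocessing import LabelEncoder

# Three nominal categories (car model names taken from the dataset below);
# the list itself is a hypothetical example
models = ["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"]

le = LabelEncoder()
codes = le.fit_transform(models)

# LabelEncoder sorts the labels and numbers them 0..n-1
print(list(le.classes_))
print(list(codes))
```

The resulting integers follow alphabetical order only. A linear model would still read them as "Audi A5 < BMW X5 < Mercedez Benz C class", which is not a property of the cars.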

In [3]:
import pandas as pd
df = pd.read_csv('carprices.csv')
df

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


In [5]:
df.columns = df.columns.str.replace(' ', '')
df

Unnamed: 0,CarModel,Mileage,SellPrice($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


# Pandas Get_Dummies Method

In [28]:
#get_dummies will return the dummy variable columns. You must specify which column you want to convert to dummy variables
dummies = pd.get_dummies(df.CarModel)
dummies

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1
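One detail that may differ on your machine: recent pandas versions (2.0 and later) return boolean dummy columns (True/False) by default instead of the 0/1 shown above. A minimal sketch, assuming you want integer columns as in this notebook; the short `car_models` series is a stand-in for `df.CarModel`.

```python
import pandas as pd

# Stand-in for df.CarModel, with one row per category
car_models = pd.Series(
    ["BMW X5", "Audi A5", "Mercedez Benz C class"], name="CarModel"
)

# dtype=int forces 0/1 columns regardless of the pandas default
dummies = pd.get_dummies(car_models, dtype=int)
print(dummies)
```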


In [29]:
merged = pd.concat([df, dummies], axis=1) #or axis='columns'
merged

Unnamed: 0,CarModel,Mileage,SellPrice($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,BMW X5,69000,18000,6,0,1,0
1,BMW X5,35000,34000,3,0,1,0
2,BMW X5,57000,26100,5,0,1,0
3,BMW X5,22500,40000,2,0,1,0
4,BMW X5,46000,31500,4,0,1,0
5,Audi A5,59000,29400,5,1,0,0
6,Audi A5,52000,32000,5,1,0,0
7,Audi A5,72000,19300,6,1,0,0
8,Audi A5,91000,12000,8,1,0,0
9,Mercedez Benz C class,67000,22000,6,0,0,1


Now that you have the dummy variables, you technically don't need the first column (CarModel) anymore. Let's drop that column!

In [30]:
merged.drop(columns='CarModel')

Unnamed: 0,Mileage,SellPrice($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,69000,18000,6,0,1,0
1,35000,34000,3,0,1,0
2,57000,26100,5,0,1,0
3,22500,40000,2,0,1,0
4,46000,31500,4,0,1,0
5,59000,29400,5,1,0,0
6,52000,32000,5,1,0,0
7,72000,19300,6,1,0,0
8,91000,12000,8,1,0,0
9,67000,22000,6,0,0,1


It is also recommended to drop one dummy variable to avoid what is called the dummy variable trap. This is a case of perfect multicollinearity: every row has exactly one dummy set to 1, so any one dummy column can be computed exactly from the others (for example, Mercedez Benz C class = 1 - Audi A5 - BMW X5). Since one of the three columns is implied by the other two, we drop one. The dropped category becomes the baseline that the remaining dummies are compared against.
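The redundancy can be checked numerically. The sketch below rebuilds the dummy columns for the ten cars (5 BMW, 4 Audi, 1 Mercedez) and compares the rank of the design matrix with and without the third dummy. The explicit intercept column is an assumption made for illustration, standing in for the intercept that a linear regression normally adds.

```python
import numpy as np
import pandas as pd

# Car models in the same order as the dataset above
car_models = pd.Series(
    ["BMW X5"] * 5 + ["Audi A5"] * 4 + ["Mercedez Benz C class"]
)
dummies = pd.get_dummies(car_models, dtype=float)

# Every row has exactly one 1, so the dummies always sum to 1
row_sums = dummies.sum(axis=1)

intercept = np.ones((len(dummies), 1))
full = np.hstack([intercept, dummies.to_numpy()])             # 4 columns
reduced = np.hstack([intercept, dummies.to_numpy()[:, :-1]])  # 3 columns

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(rank_full, rank_reduced)
```

Both matrices have rank 3, so the fourth column of `full` adds no new information.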

In [32]:
final = merged.drop(columns=['CarModel', 'Mercedez Benz C class'])
final

Unnamed: 0,Mileage,SellPrice($),Age(yrs),Audi A5,BMW X5
0,69000,18000,6,0,1
1,35000,34000,3,0,1
2,57000,26100,5,0,1
3,22500,40000,2,0,1
4,46000,31500,4,0,1
5,59000,29400,5,1,0
6,52000,32000,5,1,0
7,72000,19300,6,1,0
8,91000,12000,8,1,0
9,67000,22000,6,0,0
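As a possible shortcut, `pd.get_dummies` can encode and drop a category in one call with `drop_first=True`. This is a sketch, not the notebook's method: it drops the first category in sorted order (Audi A5) rather than the Mercedez column, and the `df` below is retyped from the table above. Which category is dropped only changes the baseline, so the fitted predictions should agree.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows shown above, retyped so the example is self-contained
df = pd.DataFrame({
    "CarModel": ["BMW X5"] * 5 + ["Audi A5"] * 4 + ["Mercedez Benz C class"],
    "Mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "SellPrice($)": [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
})
y = df["SellPrice($)"]

# One call: encode CarModel and drop the first category ("Audi A5")
encoded = pd.get_dummies(df, columns=["CarModel"], drop_first=True, dtype=int)
X_a = encoded.drop(columns="SellPrice($)")

# Manual route from the cell above: drop the Mercedez column instead
manual = pd.concat([df, pd.get_dummies(df["CarModel"], dtype=int)], axis=1)
X_b = manual.drop(columns=["CarModel", "Mercedez Benz C class", "SellPrice($)"])

pred_a = LinearRegression().fit(X_a, y).predict(X_a)
pred_b = LinearRegression().fit(X_b, y).predict(X_b)
same = np.allclose(pred_a, pred_b)
print(encoded.columns.tolist(), same)
```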


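Even if we didn't drop one of the dummy variables, fitting the linear regression would still run: scikit-learn's LinearRegression solves a least-squares problem and still returns a solution when the columns are linearly dependent. The individual coefficients are then not uniquely determined, though, so it is still good practice to drop one.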

In [67]:
from sklearn.linear_model import LinearRegression
model = LinearRegression() #instantiate an object for the LinearRegression class

In [34]:
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [39]:
X = final.drop(columns="SellPrice($)")
X

Unnamed: 0,Mileage,Age(yrs),Audi A5,BMW X5
0,69000,6,0,1
1,35000,3,0,1
2,57000,5,0,1
3,22500,2,0,1
4,46000,4,0,1
5,59000,5,1,0
6,52000,5,1,0
7,72000,6,1,0
8,91000,8,1,0
9,67000,6,0,0


In [37]:
y = final[['SellPrice($)']]
y

Unnamed: 0,SellPrice($)
0,18000
1,34000
2,26100
3,40000
4,31500
5,29400
6,32000
7,19300
8,12000
9,22000


In [40]:
model.fit(X,y) #this is where you are training your machine learning model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Once trained, a model can be saved to a file and loaded later instead of being retrained, since many models take a long time to train (this is covered in another tutorial).
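A minimal sketch of what saving could look like, using Python's built-in `pickle` module. The toy data (y = 2x + 1) is made up for illustration, and the model is serialized to bytes in memory rather than to a file so the example leaves nothing on disk.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2*x + 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Serialize to bytes; pickle.dump(toy_model, file_object) would write to disk
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)

# The restored model predicts exactly like the original
print(np.allclose(toy_model.predict(X_toy), restored.predict(X_toy)))
```

Only unpickle files you trust, and load them with the same scikit-learn version that saved them where possible.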

In [43]:
model.predict([[45000,4,0,0]]) #Predict price of a Mercedez Benz C class that is 4 years old with mileage 45000 (both dummies are 0, i.e. the dropped baseline category)

array([[36991.31721061]])

This predicts the price from the fitted model. The input values must be in the same order as the columns of X: mileage, age, Audi A5, BMW X5.
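Because a bare list is easy to get in the wrong order, one option is to build the query as a one-row DataFrame and reorder it to the training columns. This is a sketch under the assumption that the model was fitted on a DataFrame; `query` and `expected` are names introduced here, and the data is retyped from the tables above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows shown above, retyped so the example is self-contained
df = pd.DataFrame({
    "CarModel": ["BMW X5"] * 5 + ["Audi A5"] * 4 + ["Mercedez Benz C class"],
    "Mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "SellPrice($)": [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
})
dummies = pd.get_dummies(df["CarModel"], dtype=int).drop(columns="Mercedez Benz C class")
X = pd.concat([df[["Mileage", "Age(yrs)"]], dummies], axis=1)
y = df["SellPrice($)"]
model = LinearRegression().fit(X, y)

# Name every value, then reorder to match the training columns
query = pd.DataFrame([{"Age(yrs)": 4, "BMW X5": 0, "Mileage": 45000, "Audi A5": 0}])
query = query[X.columns]
prediction = model.predict(query)

# Same query written positionally, in the training column order
expected = model.predict(pd.DataFrame([[45000, 4, 0, 0]], columns=X.columns))
print(np.allclose(prediction, expected))
```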

In [44]:
model.predict([[86000,7,0,1]]) #Predict price of BMW X5 that is 7 yr old with mileage 86000

array([[11080.74313219]])

In [45]:
model.score(X,y)

0.9417050937281082
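For a regressor, `model.score` returns the coefficient of determination R^2, computed here on the same rows the model was trained on, so it tends to be optimistic. The sketch below recomputes R^2 by hand for a simpler, hypothetical model that uses mileage only (values retyped from the table above), to show what the number means.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Mileage and selling price from the ten rows shown above
mileage = np.array(
    [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    dtype=float,
).reshape(-1, 1)
price = np.array(
    [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    dtype=float,
)

simple_model = LinearRegression().fit(mileage, price)
pred = simple_model.predict(mileage)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((price - pred) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, simple_model.score(mileage, price)))
```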

# One Hot Encoding

In [1]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # instantiate the le object from Label Encoder class

In [8]:
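dfle = df #note: this is an alias, not a copy, so df.CarModel is overwritten too (use df.copy() to keep df unchanged)
dfle.CarModel = le.fit_transform(dfle.CarModel) #takes the label column as input and converts the labels to integers (simple label encoding)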
dfle

Unnamed: 0,CarModel,Mileage,SellPrice($),Age(yrs)
0,1,69000,18000,6
1,1,35000,34000,3
2,1,57000,26100,5
3,1,22500,40000,2
4,1,46000,31500,4
5,0,59000,29400,5
6,0,52000,32000,5
7,0,72000,19300,6
8,0,91000,12000,8
9,2,67000,22000,6


In [60]:
X = dfle[['CarModel', 'Mileage', 'Age(yrs)']].values #use the label-encoded frame
X

array([[    1, 69000,     6],
       [    1, 35000,     3],
       [    1, 57000,     5],
       [    1, 22500,     2],
       [    1, 46000,     4],
       [    0, 59000,     5],
       [    0, 52000,     5],
       [    0, 72000,     6],
       [    0, 91000,     8],
       [    2, 67000,     6],
       [    2, 83000,     7],
       [    2, 79000,     7],
       [    2, 59000,     5]])

In [53]:
y = dfle["SellPrice($)"]
y

0     18000
1     34000
2     26100
3     40000
4     31500
5     29400
6     32000
7     19300
8     12000
9     22000
10    20000
11    21000
12    33000
Name: SellPrice($), dtype: int64

## Using OneHotEncoder

In [61]:
from sklearn.preprocessing import OneHotEncoder
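ohe = OneHotEncoder(categorical_features=[0]) #create an object called ohe of the class OneHotEncoder; note that categorical_features was removed in scikit-learn 0.22, so on newer versions use the ColumnTransformer approach shown further down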

In [62]:
X = ohe.fit_transform(X).toarray()
X

Note (from the scikit-learn warning printed by this cell): if you used a LabelEncoder before the OneHotEncoder only to convert the categories to integers, you can now pass the original string column to the OneHotEncoder directly.


array([[0.00e+00, 1.00e+00, 0.00e+00, 6.90e+04, 6.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 3.50e+04, 3.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 5.70e+04, 5.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 2.25e+04, 2.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 4.60e+04, 4.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 5.90e+04, 5.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 5.20e+04, 5.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 7.20e+04, 6.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 9.10e+04, 8.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 6.70e+04, 6.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 8.30e+04, 7.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 7.90e+04, 7.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 5.90e+04, 5.00e+00]])

## Another way to produce Dummy Variables, using ColumnTransformer

In [55]:
from sklearn.compose import ColumnTransformer 

ct = ColumnTransformer([("encoder", OneHotEncoder(),[0])], remainder="passthrough") # The last arg ([0]) is the list of columns you want to transform in this step
ct.fit_transform(X)   


array([[0.00e+00, 1.00e+00, 0.00e+00, 6.90e+04, 6.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 3.50e+04, 3.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 5.70e+04, 5.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 2.25e+04, 2.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 4.60e+04, 4.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 5.90e+04, 5.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 5.20e+04, 5.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 7.20e+04, 6.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 9.10e+04, 8.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 6.70e+04, 6.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 8.30e+04, 7.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 7.90e+04, 7.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 5.90e+04, 5.00e+00]])
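On recent scikit-learn versions the LabelEncoder step can be skipped, and the encoder can drop the first dummy itself. The sketch below is one way to do that, not the notebook's method: it applies the ColumnTransformer to the original string column, uses `drop="first"`, and sets `sparse_threshold=0` to force a dense array. The `df` is retyped from the first ten rows of the data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Feature columns of the ten rows shown earlier, with CarModel still as text
df = pd.DataFrame({
    "CarModel": ["BMW X5"] * 5 + ["Audi A5"] * 4 + ["Mercedez Benz C class"],
    "Mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
})

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(drop="first"), ["CarModel"])],
    remainder="passthrough",   # keep Mileage and Age(yrs) unchanged
    sparse_threshold=0,        # always return a dense array
)
X_enc = ct.fit_transform(df)

# Categories are sorted; the first one ("Audi A5") is the dropped baseline
kept = list(ct.named_transformers_["encoder"].categories_[0][1:])
print(kept)
print(X_enc[0])
```

The output columns are therefore [BMW X5, Mercedez Benz C class, Mileage, Age(yrs)], which is the order any later `predict` input has to follow.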

In [64]:
X = X[:,1:] #drop 1st column to prevent dummy variable trap- multicollinearity

In [68]:
model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [70]:
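model.predict([[0,0,45000,4]]) #4 years old with mileage 45000; after dropping the first dummy the columns are [BMW X5, Mercedez Benz C class, Mileage, Age], so [0,0,...] encodes the dropped Audi A5 baseline (a Mercedez would be [0,1,45000,4])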

array([34537.77647335])

In [72]:
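model.predict([[0,1,86000,7]]) #7 yr old with mileage 86000; with columns [BMW X5, Mercedez Benz C class, Mileage, Age], [0,1,...] encodes a Mercedez Benz C class (a BMW X5 would be [1,0,86000,7])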

array([17818.95045785])

In [73]:
model.score(X,y)

0.9417050937281082