# 5.4) Encoding Categorical Data

In [1]:
import pandas as pd
import numpy as np

- Data is of two types:
  - Numerical
  - Categorical, which is itself of two types:
    - Nominal: The categories have no inherent order. Ex: States (West Bengal, Bihar, Telangana, etc.)
    - Ordinal: The categories have a natural order. Ex: Grades, Class in Train, etc.
- Problem:
  - Categorical data is usually stored as strings, but most ML algorithms expect numeric input.
  - There are many encoding techniques, of which the following two are the most widely used:
    - (a) Ordinal Encoding: for ordinal categorical data
    - (b) One Hot Encoding: for nominal categorical data

### (a) Ordinal Encoding and Label Encoding:

- If an input column is ordinal, we use Ordinal Encoding. If the output column is categorical (as in classification problems), we use Label Encoding, for which sklearn provides a dedicated class named LabelEncoder.
- Both map each category to an integer. The practical difference is that OrdinalEncoder works on 2-D input features and lets us specify the category order, while LabelEncoder works on a 1-D target and numbers the classes in sorted (alphabetical) order.

- (Q) How do we perform Ordinal Encoding?

In [6]:
# If we have one column as:
df = pd.Series(['HS', 'UG', 'PG', 'PG', 'HS', 'UG'])
df

0    HS
1    UG
2    PG
3    PG
4    HS
5    UG
dtype: object

In [7]:
# Then, after ordinal encoding, this column should look like:
df = pd.Series([0,1,2,2,0,1])   # PG (2) > UG (1) > HS (0)
df

0    0
1    1
2    2
3    2
4    0
5    1
dtype: int64
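The hand-written result above can also be produced from the original strings. The following is a minimal sketch that assumes the rank of each category is supplied as a plain dictionary; the names `education` and `order` are made up for illustration.

```python
import pandas as pd

# Same values as the example column above
education = pd.Series(['HS', 'UG', 'PG', 'PG', 'HS', 'UG'])

# Explicit rank for each category: HS < UG < PG
order = {'HS': 0, 'UG': 1, 'PG': 2}

# Series.map looks up every value in the dictionary
encoded = education.map(order)
print(encoded.tolist())   # [0, 1, 2, 2, 0, 1]
```

This is fine for a quick look at one column. The OrdinalEncoder used below does the same job but remembers the mapping, so it can be reapplied to test data.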

- **Code:**

In [8]:
import numpy as np
import pandas as pd
df = pd.read_csv('customer.csv')

- About the dataset: This dataset has five columns as:
  - age - numerical
  - gender - Nominal Categorical
  - review - Ordinal Categorical (Contains Good, Average, and Poor)
  - education - Ordinal
  - purchased (Whether the customer purchased the recommended product or not) - Label Encoding can be used

In [9]:
df.head()

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No


In [10]:
# For now we select only the review, education, and purchased columns:
df = df.iloc[:,2:]
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [13]:
# Train/test split
from sklearn.model_selection import train_test_split
X_train, x_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df.iloc[:,-1], test_size = 0.2)

In [15]:
# Ordinal Encoding:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories = [['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']]) # one list per column, each ordered from lowest to highest rank
oe.fit(X_train)
X_train = oe.transform(X_train)
X_test = oe.transform(x_test)

X_train  # gives us our ordinal encoded data

array([[0., 1.],
       [2., 2.],
       [2., 0.],
       [0., 2.],
       [2., 1.],
       [1., 2.],
       [2., 1.],
       [1., 0.],
       [2., 2.],
       [0., 0.],
       [0., 2.],
       [1., 1.],
       [1., 0.],
       [2., 1.],
       [1., 0.],
       [2., 2.],
       [0., 0.],
       [2., 2.],
       [1., 0.],
       [1., 1.],
       [2., 2.],
       [0., 1.],
       [2., 0.],
       [0., 2.],
       [2., 1.],
       [0., 0.],
       [0., 2.],
       [0., 1.],
       [2., 1.],
       [2., 2.],
       [0., 2.],
       [1., 2.],
       [2., 0.],
       [0., 2.],
       [1., 1.],
       [2., 1.],
       [1., 1.],
       [1., 1.],
       [0., 2.],
       [2., 0.]])

In [16]:
# Note: To see the categories we fed to OrdinalEncoder, we can do:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
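A fitted OrdinalEncoder can also map the codes back to the original strings, and it can be told what to do with a category it has not seen. The sketch below uses a small made-up DataFrame instead of customer.csv, and the choice of -1 as the code for unknown values is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small made-up frame with the same two ordinal columns as customer.csv
toy = pd.DataFrame({
    'review': ['Poor', 'Good', 'Average'],
    'education': ['UG', 'School', 'PG'],
})

enc = OrdinalEncoder(
    categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']],
    handle_unknown='use_encoded_value',  # unseen categories get unknown_value
    unknown_value=-1,
)
codes = enc.fit_transform(toy)
print(codes)

# Recover the original strings from the codes
decoded = enc.inverse_transform(codes)
print(decoded)

# 'Excellent' is not in the category list, so it is encoded as -1
unseen = pd.DataFrame({'review': ['Excellent'], 'education': ['PG']})
print(enc.transform(unseen))
```

With the default handle_unknown='error', the last call would raise an error instead of returning -1.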

In [17]:
# LabelEncoder for Output Column:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_train)
print(le.classes_)    # gives the array of categories

y_train = le.transform(y_train)
y_test = le.transform(y_test)

y_train

['No' 'Yes']


array([0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
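LabelEncoder does not take a category order. It sorts the classes, which is why 'No' became 0 and 'Yes' became 1 above. A minimal sketch on a made-up target list, which also shows how to get the original labels back:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Yes', 'No', 'No', 'Yes']   # made-up target values

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                     # classes are stored in sorted order
print(encoded)                         # position of each label in classes_
print(le.inverse_transform(encoded))   # back to the original strings
```

For a target column this ordering is harmless, because the integers are treated as class identifiers rather than magnitudes.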

===============================================================================================================

### (b) One Hot Encoding:

- Used on nominal categorical columns.

In [19]:
# Suppose we have the following dataset:
df = pd.DataFrame({'color': ['Yellow', 'Yellow', 'Blue', 'Yellow', 'Red'], 'target': [0,1,1,1,1]})
df

Unnamed: 0,color,target
0,Yellow,0
1,Yellow,1
2,Blue,1
3,Yellow,1
4,Red,1


- Here, color is a nominal categorical column (independent variable) and target is the dependent variable.
- We cannot use ordinal encoding for the color column. If we assign integers to the colors, the model will treat them as quantities and assume that a color with a higher value ranks higher, even though no such order exists.
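One way to see the problem is to compare distances between colors under both encodings. The sketch below assumes an arbitrary integer assignment (Yellow=0, Blue=1, Red=2). Any other assignment has the same flaw.

```python
import numpy as np

# Arbitrary integer codes for a nominal column (an assumption for illustration)
int_codes = {'Yellow': 0, 'Blue': 1, 'Red': 2}

# One-hot vectors for the same three colors
one_hot = {
    'Yellow': np.array([1, 0, 0]),
    'Blue':   np.array([0, 1, 0]),
    'Red':    np.array([0, 0, 1]),
}

pairs = [('Yellow', 'Blue'), ('Yellow', 'Red'), ('Blue', 'Red')]
int_dists, ohe_dists = [], []
for a, b in pairs:
    int_dists.append(abs(int_codes[a] - int_codes[b]))
    ohe_dists.append(float(np.linalg.norm(one_hot[a] - one_hot[b])))

print(int_dists)                          # [1, 2, 1]
print([round(d, 3) for d in ohe_dists])   # [1.414, 1.414, 1.414]
```

With integer codes, Yellow appears twice as far from Red as from Blue, which is an artifact of the numbering. With one-hot vectors every pair of colors is equally far apart.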

In [22]:
# After one-hot encoding, the 'color' column is replaced with three columns:
df = pd.DataFrame({'color_Y': [1, 1, 0, 1, 0], 'color_B': [0, 0, 1, 0, 0], 'color_R': [0, 0, 0, 0, 1], 'target': [0,1,1,1,1]})
df

Unnamed: 0,color_Y,color_B,color_R,target
0,1,0,0,0
1,1,0,0,1
2,0,1,0,1
3,1,0,0,1
4,0,0,1,1


- We get one column for each category.
- Now,  
[1,0,0] vector -> Yellow Color  
[0,1,0] vector -> Blue Color  
[0,0,1] vector -> Red Color
- What if the column had many categories (say 50)? We would follow the same approach and get 50 columns.

- **Dummy Variable Trap:**
  - A dummy variable is a column created by the encoding (here color_Y, color_B, and color_R).
  - The dummy variable trap occurs when we include all the dummy variables for a categorical feature without removing one. It causes perfect multicollinearity, meaning one column can be exactly predicted from the others. For example, in the above table:  
        Yellow = 1 - (Blue + Red)  
        or, equivalently, the encoded values in each row sum to 1.
  - So there is redundancy. This causes problems in regression models, particularly linear regression with an intercept, where the coefficients are no longer uniquely determined and the fit becomes unstable.
  - Multicollinearity:
    - It occurs when two or more independent variables (here dummy variables) in a regression model are highly correlated.
    - Linear Regression assumes that the independent variables are not perfectly correlated.
    - If multicollinearity exists, the model may not be able to separate the individual effect of each predictor, and the coefficient estimates may behave inconsistently.
    - Multicollinearity will be studied in more detail later on.
- (Q) How do we eliminate multicollinearity from One Hot Encoding?
  - We drop one dummy variable column (generally the first one). After dropping color_Y:  
        [0,0] -> Yellow,  
        [1,0] -> Blue,  
        [0,1] -> Red
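The redundancy can be checked numerically. The sketch below builds the design matrix for the color example with an intercept column and compares its rank with the number of columns. Note that pandas orders the categories alphabetically, so drop_first removes Blue here rather than Yellow. The idea is the same whichever column is dropped.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['Yellow', 'Yellow', 'Blue', 'Yellow', 'Red'], name='color')

full = pd.get_dummies(colors, dtype=int)                       # 3 dummy columns
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)   # 2 dummy columns

# Add the intercept column that linear regression normally includes
X_full = np.column_stack([np.ones(len(colors)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(colors)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3: one column is redundant
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3: no redundancy
```

A rank lower than the number of columns means the columns are linearly dependent, which is exactly the perfect multicollinearity described above.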

- **One Hot Encoding using the most frequent categories:**
  - Imagine we have a feature like "City" with hundreds or thousands of unique values (like Delhi, Mumbai, New York, Tokyo, etc.).
  - One Hot Encoding would create hundreds of new columns, which can:
    - Make the dataset huge and slow to process.
    - Lead to sparse data (lots of 0s).
    - Increase the risk of overfitting in machine learning models.
  - A common solution: instead of creating dummy columns for all categories, we keep only the most frequent ones (e.g., the top 10 cities) and group all the less frequent ones into a single category (often called "Other"), or simply ignore them.
  - Example: Let's say we have a column with 100 city names, but 90% of the rows use just four cities (Mumbai, Delhi, Bengaluru, and Hyderabad). We keep those four and treat the rest as "Other" (or just don't encode them).
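The idea can be sketched in a few lines of pandas. The city values and the cut-off `top_n` below are made up for illustration; only the grouping technique matters.

```python
import pandas as pd

# Made-up city column
city = pd.Series(['Mumbai', 'Delhi', 'Mumbai', 'Bengaluru', 'Delhi',
                  'Mumbai', 'Pune', 'Hyderabad', 'Kochi', 'Delhi'])

top_n = 2   # hypothetical cut-off: keep the two most frequent cities
top = city.value_counts().nlargest(top_n).index

# Keep the frequent cities, replace everything else with 'Other'
grouped = city.where(city.isin(top), 'Other')
dummies = pd.get_dummies(grouped, dtype=int)

print(dummies.columns.tolist())   # ['Delhi', 'Mumbai', 'Other']
```

Six distinct cities collapse into three dummy columns. The same pattern is applied to the brand column further down, using a count threshold instead of a top-N cut-off.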

- **Code:**

In [23]:
import numpy as np
import pandas as pd
df = pd.read_csv('cars.csv')
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [25]:
# Some observations on this dataset:
print(df['brand'].nunique()) # Gives count of total different category in brand.
df['brand'].value_counts()  # Gives count of cars corresponding to each brand
# Similarly, we can check for fuel (four categories) and owner (five categories) columns as well.

32


brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [26]:
# Way (I): One Hot Encoding using Pandas:-
pd.get_dummies(df, columns = ['fuel', 'owner'])  # we are ignoring brand column for now as it has many categories.

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


- Output: the fuel column is replaced by four dummy variables and the owner column by five.
- However, this does not yet address the multicollinearity problem. For that we can use the drop_first parameter.

In [27]:
pd.get_dummies(df, columns = ['fuel', 'owner'], drop_first = True)  # Now fuel and owner have three and four dummy variables respectively.

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False
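One caveat with pd.get_dummies is that it does not remember the categories it saw. If it is applied separately to training and test data, the two results can end up with different columns. The sketch below uses made-up train and test frames in which 'LPG' never appears in the test rows, and aligns them with reindex.

```python
import pandas as pd

# Made-up split in which 'LPG' occurs only in the training rows
train = pd.DataFrame({'fuel': ['Diesel', 'Petrol', 'LPG', 'Diesel']})
test = pd.DataFrame({'fuel': ['Petrol', 'Diesel']})

train_d = pd.get_dummies(train, columns=['fuel'], dtype=int)
test_d = pd.get_dummies(test, columns=['fuel'], dtype=int)
print(train_d.columns.tolist())   # ['fuel_Diesel', 'fuel_LPG', 'fuel_Petrol']
print(test_d.columns.tolist())    # ['fuel_Diesel', 'fuel_Petrol']

# Give the test frame the training columns; missing dummies are filled with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

The sklearn encoder shown next avoids this manual step, because it stores the categories learned during fit and reuses them in transform.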


In [32]:
# Way (II): One Hot Encoding using sklearn:
from sklearn.model_selection import train_test_split
X_train, x_test, y_train, y_test = train_test_split(df.iloc[:,0:4], df.iloc[:,-1], test_size = 0.2, random_state = 0)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()    # To avoid multicollinearity, create the object as: ohe = OneHotEncoder(drop='first')
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']]).toarray()
# X_train[['fuel', 'owner']] selects the two columns 'fuel' and 'owner' from the X_train DataFrame.
# fit_transform() learns the categories from the training data and encodes it in one step.
X_test_new = ohe.transform(x_test[['fuel', 'owner']]).toarray()   # the test set is only transformed, never fitted

In [37]:
# Merging with other columns:
np.hstack((X_train[['brand', 'km_driven']].values, X_train_new)).shape
# np.hstack() stacks arrays horizontally, i.e. it joins them side by side (column-wise).

(6502, 11)

In [36]:
pd.DataFrame(np.hstack((X_train[['brand', 'km_driven']].values, X_train_new))).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,Hyundai,60000,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,Tata,150000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,Hyundai,110000,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,Mahindra,28000,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,Maruti,15000,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0


- Note:
  - OneHotEncoder returns a sparse matrix by default, which is why we call toarray(). To get a dense array directly, disable sparse output when creating the object:  
        ohe = OneHotEncoder(sparse_output = False) -> returns a dense NumPy array (scikit-learn 1.2+; older versions name this parameter sparse)
  - We can also control the data type of the dummy variables (float by default):  
        ohe = OneHotEncoder(dtype = np.int32)

- **One Hot Encoding with columns having a lot of categories (Here, brand column):**

In [38]:
counts = df['brand'].value_counts()
counts

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

- We pick a threshold (say 100 cars) and put every brand with at most that many cars into a single 'uncommon' category.

In [39]:
threshold = 100
# counts <= threshold creates a boolean Series where True means the brand has at most 100 cars.
counts[counts<=threshold].index    # getting brand's name with cars count less than or equal to threshold

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuzu', 'Ambassador',
       'Kia', 'MG', 'Daewoo', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [40]:
repl = counts[counts<=threshold].index
pd.get_dummies(df['brand'].replace(repl, 'uncommon'))   # We are saying replace those rare brands with 'uncommon'

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
0,False,False,False,False,False,False,True,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,True,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,False,False,False,False,True,False,False,False,False,False,False,False,False
8124,False,False,False,False,True,False,False,False,False,False,False,False,False
8125,False,False,False,False,False,False,True,False,False,False,False,False,False
8126,False,False,False,False,False,False,False,False,False,True,False,False,False


- pd.get_dummies takes a categorical column (like brand) and turns it into multiple boolean (True/False) columns, one for each unique category (like BMW, Honda, etc.).
- Each row represents one car from our dataset, and each column (like BMW, Honda, uncommon) says whether that car belongs to that brand.

=============================================================================================================