# Encoding Categorical Data

## Ordinal Encoding

Ordinal encoding converts a categorical variable (a feature whose values are text labels) into integers in a way that preserves the order of the categories. It is most useful when the categories have a natural ranking or sequence (e.g., "low", "medium", "high").
Most machine learning models work only with numbers. When a feature is categorical and ordered, ordinal encoding keeps that order information in the numeric representation.
For the input features (X), use OrdinalEncoder; for the output or target (y), use LabelEncoder.
Order matters here: a higher-ranked category is mapped to a larger integer than a lower-ranked one, so "Good" ends up above "Average", which ends up above "Poor".
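To see why the ranking has to be stated explicitly, here is a minimal sketch on made-up review values (the `reviews` frame is hypothetical, not taken from customer.csv). When no `categories` argument is given, `OrdinalEncoder` sorts the labels alphabetically, which does not match the intended Poor < Average < Good ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical, for illustration only)
reviews = pd.DataFrame({"review": ["Poor", "Good", "Average", "Good"]})

# Default behaviour: labels are sorted alphabetically,
# so Average=0, Good=1, Poor=2 (not the ranking we want)
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(reviews).ravel()

# Explicit ranking: Poor=0, Average=1, Good=2
ordered_oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"]])
ordered_codes = ordered_oe.fit_transform(reviews).ravel()

print(default_codes)  # [2. 1. 0. 1.]
print(ordered_codes)  # [0. 2. 1. 2.]
```

With the default ordering, "Poor" would receive the largest code, so listing the categories from lowest to highest is what makes the integers meaningful.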

In [1]:
import pandas as pd
import numpy as np 

In [2]:
df= pd.read_csv("C:\\Users\\utkar\\Downloads\\customer.csv")

In [3]:
df.sample(6)

Unnamed: 0,age,gender,review,education,purchased
29,83,Female,Average,UG,Yes
33,89,Female,Good,PG,Yes
44,77,Female,Average,UG,No
32,92,Male,Average,UG,Yes
18,19,Male,Good,School,No
30,73,Male,Average,UG,No


In [4]:
# Drop age and gender; keep review, education and purchased
df = df.iloc[:, 2:]

In [5]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [6]:
from sklearn.model_selection import train_test_split
# Features: review and education (first two columns); target: purchased (last column)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.2)

In [12]:
X_train.shape

(40, 2)

In [11]:
X_test.shape

(10, 2)

In [13]:
from sklearn.preprocessing import OrdinalEncoder

In [14]:
# One list per column, each ordered from lowest to highest rank
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

In [15]:
# Fit on the training split only so the test split stays unseen
oe.fit(X_train)

In [17]:
X_train=oe.transform(X_train)
X_test=oe.transform(X_test)

In [18]:
X_train

array([[1., 1.],
       [0., 0.],
       [2., 0.],
       [0., 2.],
       [2., 2.],
       [2., 1.],
       [0., 2.],
       [0., 2.],
       [1., 2.],
       [1., 1.],
       [0., 2.],
       [1., 0.],
       [1., 0.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [0., 0.],
       [0., 2.],
       [1., 1.],
       [1., 1.],
       [2., 1.],
       [1., 0.],
       [0., 0.],
       [2., 0.],
       [2., 0.],
       [1., 0.],
       [0., 0.],
       [0., 0.],
       [1., 2.],
       [1., 0.],
       [0., 1.],
       [2., 2.],
       [1., 1.],
       [0., 1.],
       [0., 2.],
       [2., 2.],
       [2., 1.],
       [0., 1.],
       [2., 2.],
       [2., 2.]])

In [19]:
X_test

array([[0., 2.],
       [2., 2.],
       [1., 1.],
       [0., 2.],
       [2., 0.],
       [1., 2.],
       [2., 1.],
       [0., 2.],
       [0., 1.],
       [2., 1.]])

In [20]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
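The fitted encoder can also map codes back to labels, which is handy for checking that the mapping is the intended one. The sketch below uses a small hand-written array (`sample` is hypothetical) with the same two category lists as above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same category order as in the notebook: lowest rank first
oe_demo = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]
)

# Hypothetical rows: (review, education)
sample = np.array(
    [["Good", "UG"], ["Poor", "PG"], ["Average", "School"]], dtype=object
)

encoded = oe_demo.fit_transform(sample)       # labels -> codes
decoded = oe_demo.inverse_transform(encoded)  # codes -> labels

print(encoded)
print(decoded)
```

Each code is the position of the label in its category list, so "Good" becomes 2.0 in the first column and "UG" becomes 1.0 in the second.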

## Label Encoding 

In [21]:
from sklearn.preprocessing import LabelEncoder

In [22]:
le= LabelEncoder()

In [23]:
le.fit(y_train)

In [24]:
# LabelEncoder sorts the classes alphabetically, so 'No' -> 0 and 'Yes' -> 1
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [25]:
y_train

array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1])

In [26]:
y_test

array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1])
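To check which integer stands for which label, the encoder's `classes_` attribute can be inspected. This is a small sketch on made-up target values (`y_demo` is hypothetical); the position of a label in `classes_` is the integer it is encoded as.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, for illustration only
y_demo = np.array(["Yes", "No", "No", "Yes"])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)

print(le_demo.classes_)  # ['No' 'Yes']
print(y_encoded)         # [1 0 0 1]

# Map the integers back to the original labels
y_decoded = le_demo.inverse_transform(y_encoded)
print(y_decoded)
```

Unlike `OrdinalEncoder`, `LabelEncoder` takes no `categories` argument and is meant for a single target column, which is why it is used for y rather than X.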

# One Hot Encoding 

One-hot encoding converts a categorical variable (text labels) into a numeric form that machine learning models can use. It is the usual choice when the categories have no natural order (nominal data).
Each category becomes its own binary column; in every row, exactly one of those columns is 1 and the rest are 0.
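As a minimal illustration of that layout, the sketch below one-hot encodes a tiny made-up fuel column (the `fuel` frame is hypothetical). `dtype=int` is passed only so that the result prints as 1/0 instead of True/False.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
fuel = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Petrol"]})

# One new column per distinct category
one_hot = pd.get_dummies(fuel, columns=["fuel"], dtype=int)
print(one_hot)

# Every row contains exactly one 1
row_sums = one_hot.sum(axis=1)
print(row_sums.tolist())  # [1, 1, 1, 1]
```

Three distinct fuel types produce three columns, `fuel_CNG`, `fuel_Diesel` and `fuel_Petrol`, and no column is treated as larger or smaller than another.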

In [27]:
import numpy as np 
import pandas as pd

In [28]:
df= pd.read_csv("C:\\Users\\utkar\\cars.csv")

In [29]:
df.sample(7)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
7244,Ford,50000,Petrol,First Owner,620000
2051,Mahindra,120000,Diesel,First Owner,650000
6587,BMW,7500,Diesel,First Owner,5400000
5209,Tata,40000,Petrol,First Owner,450000
5248,Jeep,17000,Petrol,First Owner,4100000
1061,Maruti,37000,Diesel,First Owner,625000
5095,Maruti,69779,Petrol,First Owner,600000


In [30]:
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [33]:
df['brand'].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [34]:
df['brand'].nunique()

32

In [35]:
df['fuel'].value_counts()

fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

In [36]:
df['fuel'].nunique()

4

In [38]:
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

In [39]:
df['owner'].nunique()

5

# One Hot Encoding Using Pandas

In [40]:
pd.get_dummies(df, columns=['brand','fuel','owner'])

Unnamed: 0,km_driven,selling_price,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,...,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,145500,450000,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
1,120000,370000,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,True,False,False
2,140000,158000,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,True
3,127000,225000,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
4,120000,130000,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,110000,320000,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
8124,119000,135000,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,True,False,False,False
8125,120000,382000,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
8126,25000,290000,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
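`pd.get_dummies` builds its columns from whatever categories are present in the frame it is given, so a train and a test split encoded separately can end up with different columns. One possible alternative, sketched here on made-up data (the `train` and `test` frames are hypothetical and this is not the approach used in the cells above), is scikit-learn's `OneHotEncoder`, which remembers the categories seen during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy splits (hypothetical); "LPG" never appears in the training data
train = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG"]})
test = pd.DataFrame({"fuel": ["Petrol", "LPG"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# transform returns a sparse matrix by default; toarray() makes it dense
test_encoded = ohe.transform(test).toarray()

print(ohe.categories_)
print(test_encoded)  # rows: [0, 0, 1] for Petrol and [0, 0, 0] for LPG
```

The test output always has the same three columns as the training output, regardless of which fuel types happen to appear in the test split.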


In [41]:
# To reduce the number of columns, leave 'brand' out of the encoding
pd.get_dummies(df, columns=['fuel','owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False
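Leaving `brand` unencoded keeps the column count down, but the column is then still text. One possible alternative, shown here only as a sketch on made-up data, is to merge infrequent brands into a single "Other" label before encoding. The `threshold` value and the "Other" label are assumptions chosen for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical brand column; the counts are made up for illustration
brands = pd.Series(
    ["Maruti"] * 5 + ["Hyundai"] * 4 + ["Opel", "Peugeot"], name="brand"
)

threshold = 2  # assumed cutoff: brands seen fewer than 2 times count as rare
counts = brands.value_counts()
rare = counts[counts < threshold].index

# Keep frequent brands as they are, replace rare ones with "Other"
grouped = brands.where(~brands.isin(rare), "Other")

encoded = pd.get_dummies(grouped, prefix="brand")
print(list(encoded.columns))  # ['brand_Hyundai', 'brand_Maruti', 'brand_Other']
```

Four distinct brands collapse into three columns here; on a column with many rarely occurring values the reduction would be larger, at the cost of no longer distinguishing the rare brands from each other.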


# K-1 Encoding 

In [42]:
pd.get_dummies(df, columns=['fuel','owner'], drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


In [43]:
pd.get_dummies(df, columns=['brand','fuel','owner'], drop_first=True)

Unnamed: 0,km_driven,selling_price,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,...,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,145500,450000,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
1,120000,370000,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,True,False,False
2,140000,158000,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
3,127000,225000,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
4,120000,130000,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,110000,320000,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
8124,119000,135000,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
8125,120000,382000,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
8126,25000,290000,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
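The reason for dropping one column: with k one-hot columns, every row sums to 1, so any one column can be derived from the other k-1. That redundancy (often called the dummy variable trap) can cause trouble for linear models that include an intercept. The sketch below, on a made-up fuel column (the `fuel` frame is hypothetical), shows that `drop_first=True` removes the first category and represents it as a row of all zeros.

```python
import pandas as pd

# Toy data (hypothetical): four fuel types, one row each
fuel = pd.DataFrame({"fuel": ["CNG", "Diesel", "LPG", "Petrol"]})

full = pd.get_dummies(fuel, columns=["fuel"], dtype=int)
k_minus_1 = pd.get_dummies(fuel, columns=["fuel"], drop_first=True, dtype=int)

print(full.shape[1], k_minus_1.shape[1])  # 4 3
print(k_minus_1)

# The dropped category (CNG) is the row where all remaining columns are 0
cng_row = k_minus_1.iloc[0].tolist()
print(cng_row)  # [0, 0, 0]
```

No information is lost: a row with zeros in `fuel_Diesel`, `fuel_LPG` and `fuel_Petrol` can only be CNG.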
