# Data Classification

Data can be broadly classified into two categories: **Numerical data** and **Categorical data**.

Numerical data includes data types such as integers and floats, whereas categorical data usually shows up as strings (the `object` dtype in pandas).

Further, categorical data is of two types: Nominal Categorical data and Ordinal Categorical data.

**Nominal Categorical data:** The values have no inherent order or ranking among themselves.

For example, consider a column containing the hair color of each person. The values are just labels, and no color ranks above or below another.

**Ordinal Categorical data:** The values are related to one another by an order.

Example: In an education column we know that PG > UG > HS. The values follow an order, so there is a relationship between them.

PG = Post Graduate

UG = Under Graduate

HS = High School


Categorical data is mostly in the form of strings, while ML algorithms expect numbers, so we have to convert these strings into numbers. There are many encoding methods that can do this.

Note: This notebook focuses on Ordinal Encoding, Label Encoding and One Hot Encoding.

*The following techniques are used depending on the type of data in the dataset:*

*   Nominal data --> One Hot Encoding
*   Ordinal data in the input features (x) --> Ordinal Encoding
*   Categorical data in the output column (y) --> Label Encoding



# Label Encoding and Ordinal Encoding

Suppose a dataset has input features x and an output y, and one of the feature columns contains ordinal data. We can apply **Ordinal Encoder** to that column: it maps each category to an integer code (0, 1, 2, ...) that follows the category order.


Ordinal Encoder takes care of the input features. What if the output (y) is also categorical instead of numeric?

In that case, it is recommended to use **Label Encoder** on the output column.

In short, Ordinal Encoder is used for ordinal feature columns (x), and Label Encoder is used for the output column (y) when it is categorical. This mostly happens in classification problems.

**How does Ordinal Encoding work?**

Suppose there is a column called Education that contains values such as HS, MS, UG, PG, PhD and more.

We need to convert this data into numerical data.

We know that it is categorical data, but we need to know whether it is nominal or ordinal.

If the values follow an order (PG > UG > HS), so that they are related by rank, it is ordinal data.

In Ordinal Encoding, a higher-ranked category is given a larger code:

PG --> 2

UG --> 1

HS --> 0
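The mapping above can be sketched with scikit-learn's `OrdinalEncoder`. This is a minimal illustration on a hypothetical toy column (the names `toy`, `default_codes` and `ordered_codes` are made up for this example). It also shows why the order should be passed explicitly: left on its own, the encoder sorts the categories alphabetically, which does not match HS < UG < PG.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column using the education levels from the example above.
toy = pd.DataFrame({"education": ["UG", "HS", "PG", "UG"]})

# Without an explicit order, categories are sorted alphabetically: HS, PG, UG.
default_codes = OrdinalEncoder().fit_transform(toy).ravel().tolist()

# Passing the order explicitly gives HS -> 0, UG -> 1, PG -> 2.
ordered = OrdinalEncoder(categories=[["HS", "UG", "PG"]])
ordered_codes = ordered.fit_transform(toy).ravel().tolist()

print(default_codes)  # [2.0, 0.0, 1.0, 2.0]
print(ordered_codes)  # [1.0, 0.0, 2.0, 1.0]
```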

# Code:

In [131]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder , LabelEncoder , OneHotEncoder

In [49]:
from google.colab import files

In [50]:
data = files.upload()

Saving customer.csv to customer (1).csv


In [51]:
df = pd.read_csv('customer.csv')

In [52]:
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
17,22,Female,Poor,UG,Yes
42,30,Female,Good,PG,Yes
4,16,Female,Average,UG,No
10,98,Female,Good,UG,Yes
38,45,Female,Good,School,No


**Inference:** This data contains numeric as well as categorical data.

*Gender column:* Gender is an example of a nominal variable because the categories (woman, man, transgender, non-binary, etc.) cannot be ordered from high to low. We need to apply One Hot Encoder on this column, so for now we are ignoring it.


*review and education columns:* The values follow an order, so ordinal encoding can be applied.

The output column also contains categorical data, so we will perform Ordinal Encoding on these two columns and Label Encoding on the output column.

In [68]:
df = df.iloc[:,2:]  # keep review, education and purchased; the other columns are not needed for encoding

In [69]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


Before applying any feature transformation, we split the data into train and test sets, so that the encoders are fit on the training data only.

In [82]:
x_train,x_test,y_train,y_test = train_test_split(df.iloc[:,0:2] , df.iloc[:,-1] , test_size=0.2)

In [83]:
x_train.head()

Unnamed: 0,review,education
26,Poor,PG
13,Average,School
0,Average,School
7,Poor,School
4,Average,UG


In [84]:
y_train.head()

26     No
13     No
0      No
7     Yes
4      No
Name: purchased, dtype: object

Applying Ordinal Encoder on these two columns.

In [85]:
oe = OrdinalEncoder(categories = [['Poor' , 'Average' , 'Good'] , ['School' , 'UG' , 'PG']])

In [86]:
oe.fit(x_train)

In [87]:
x_train = oe.transform(x_train)

The strings in these two columns have been converted to Numbers.

In [88]:
x_train

array([[0., 2.],
       [1., 0.],
       [1., 0.],
       [0., 0.],
       [1., 1.],
       [1., 1.],
       [1., 2.],
       [2., 2.],
       [2., 0.],
       [0., 0.],
       [2., 0.],
       [1., 0.],
       [1., 1.],
       [0., 0.],
       [2., 2.],
       [0., 1.],
       [1., 0.],
       [2., 1.],
       [1., 0.],
       [0., 1.],
       [0., 2.],
       [0., 1.],
       [2., 0.],
       [0., 1.],
       [1., 2.],
       [2., 1.],
       [2., 1.],
       [0., 2.],
       [1., 1.],
       [0., 2.],
       [2., 0.],
       [0., 2.],
       [2., 1.],
       [2., 0.],
       [2., 1.],
       [2., 2.],
       [0., 0.],
       [2., 1.],
       [0., 2.],
       [0., 2.]])

In [89]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
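The cells above transform only `x_train`. The usual pattern is to fit the encoder on the training split and then reuse that same fitted encoder on the test split. Here is a minimal sketch of that pattern on hypothetical toy rows (`toy_train`, `toy_test` and `enc` are illustrative names, not part of the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows with the same two columns as the notebook.
toy_train = pd.DataFrame({"review": ["Poor", "Good", "Average"],
                          "education": ["PG", "School", "UG"]})
toy_test = pd.DataFrame({"review": ["Good", "Poor"],
                         "education": ["UG", "PG"]})

# Same category order as used in the notebook.
enc = OrdinalEncoder(categories=[["Poor", "Average", "Good"],
                                 ["School", "UG", "PG"]])
enc.fit(toy_train)  # learn the mapping from the training split only

toy_train_enc = enc.transform(toy_train)
toy_test_enc = enc.transform(toy_test)  # reuse the fitted encoder, no refit

print(toy_test_enc)
```

Because the category lists are given explicitly, the test split gets exactly the same codes as the training split.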

Now let's convert the output column into numerical data using Label Encoder.

In [90]:
le = LabelEncoder()

In [91]:
le.fit(y_train)

In [92]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [93]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [94]:
y_train

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1])

This is how we converted the output into Numerical Data.
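If the original labels are needed again later (for example, to report predictions as Yes/No), the fitted encoder can map the codes back. A small sketch on a hypothetical toy target (`toy_y` and `toy_le` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy target with the same two labels as the notebook.
toy_y = np.array(["No", "Yes", "Yes", "No"])

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_y)        # classes are sorted: 'No' -> 0, 'Yes' -> 1
decoded = toy_le.inverse_transform(codes)  # map the codes back to the labels

print(codes.tolist())    # [0, 1, 1, 0]
print(decoded.tolist())  # ['No', 'Yes', 'Yes', 'No']
```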

# One Hot Encoding

One Hot Encoding is used to handle nominal categorical data.

Suppose we have a column named Colors with values such as yellow, blue and red. If we use ordinal encoding here, the codes would imply an order among the colors that does not exist, and the ML algorithm could wrongly treat one color as greater than another.

In this case, we can use **One Hot Encoder**.

**How does One Hot Encoder work?**

In One Hot Encoding we create a separate column for each category. Here that means 3 new columns are added to the dataset: color_Y, color_R and color_B. If the original color column has the values Y, Y, B, Y, R, ..., each row gets a 1 in the column of its own category and a 0 in the others:

| color | color_Y | color_R | color_B |
|-------|---------|---------|---------|
| Y     | 1       | 0       | 0       |
| Y     | 1       | 0       | 0       |
| B     | 0       | 0       | 1       |
| Y     | 1       | 0       | 0       |
| R     | 0       | 1       | 0       |



One Hot Encoding converts strings into vectors.
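The color example can be reproduced with `pd.get_dummies`. This is a minimal sketch on the five toy values Y, Y, B, Y, R. The `dtype=int` argument is used so the output shows 0/1 rather than booleans, and the columns are reordered by hand only to match the order used in the explanation.

```python
import pandas as pd

# The five example values from the explanation above.
colors = pd.DataFrame({"color": ["Y", "Y", "B", "Y", "R"]})

# One new 0/1 column per category.
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)

# get_dummies orders the columns alphabetically; reorder to match the text.
one_hot = one_hot[["color_Y", "color_R", "color_B"]]

print(one_hot)
```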

**Quick Question:** So if we have 50 unique values in the color column, are we going to create 50 new columns?


YES!!!!! THIS IS CRAZYYYYYY:)

**Dummy Variable Trap in One Hot Encoding:**


The new columns formed in our data are called **Dummy Variables.**

The dummy columns of one feature always sum to 1 in every row, so any one of them can be derived from the others. To avoid this dependency, we remove one dummy column for each encoded feature. This keeps the variables independent and avoids multicollinearity, which matters most for linear models.
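The dependency is easy to check on a toy column. In this sketch (the names `colors`, `full` and `reduced` are illustrative), every row of the full set of dummy columns sums to 1, and after `drop_first=True` the dropped category is still recoverable as the row where all remaining dummies are 0.

```python
import pandas as pd

colors = pd.Series(["Y", "Y", "B", "Y", "R"], name="color")

# Full one-hot encoding: columns color_B, color_R, color_Y.
full = pd.get_dummies(colors, prefix="color", dtype=int)
row_sums = full.sum(axis=1).tolist()  # each row has exactly one 1

# Drop the first (alphabetical) category, here color_B.
reduced = pd.get_dummies(colors, prefix="color", drop_first=True, dtype=int)

print(row_sums)                  # [1, 1, 1, 1, 1]
print(list(reduced.columns))     # ['color_R', 'color_Y']
print(reduced.iloc[2].tolist())  # the 'B' row is all zeros: [0, 0]
```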

**Back to the quick question:** If a column has 50 unique values, applying One Hot Encoder produces 50 dummy variables. This increases the dimensionality of the data and makes processing slower.


To avoid this problem, we create dummy variables only for the most frequent values and put the less frequent ones together in a single dummy column named 'Others'.
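One way to do this grouping is to count the values, pick a frequency cutoff, and replace everything below it with 'Others' before creating the dummies. The sketch below uses a hypothetical toy column, and the cutoff `min_count = 2` is an arbitrary assumption; in practice the threshold depends on the dataset.

```python
import pandas as pd

# Hypothetical toy column: two frequent values and three rare ones.
fuel = pd.Series(["Petrol"] * 6 + ["Diesel"] * 4 + ["CNG", "LPG", "Hybrid"],
                 name="fuel")

min_count = 2  # assumed cutoff: values seen fewer times than this are "rare"

counts = fuel.value_counts()
rare = counts[counts < min_count].index

# Keep frequent values as they are, replace rare ones with 'Others'.
fuel_grouped = fuel.where(~fuel.isin(rare), "Others")

dummies = pd.get_dummies(fuel_grouped, prefix="fuel", dtype=int)
print(dummies.sum())
```

Here five distinct values collapse into three dummy columns instead of five.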

# Code:

In [95]:
from google.colab import files

In [96]:
file = files.upload()

Saving used_car_dataset.csv to used_car_dataset.csv


In [153]:
df2 = pd.read_csv('used_car_dataset.csv')

In [154]:
df2.head()

Unnamed: 0,car_name,car_price_in_rupees,kms_driven,fuel_type,city,year_of_manufacture
0,Hyundai Grand i10 Magna 1.2 Kappa VTVT [2017-2...,₹ 4.45 Lakh,"22,402 km",Petrol,Mumbai,2016
1,Maruti Suzuki Alto 800 Lxi,₹ 2.93 Lakh,"10,344 km",Petrol,Kolkata,2019
2,Tata Safari XZ Plus New,₹ 22.49 Lakh,"12,999 km",Diesel,Bangalore,2021
3,Maruti Suzuki Ciaz ZXI+,₹ 6.95 Lakh,"45,000 km",Petrol,Thane,2016
4,Jeep Compass Sport Plus 1.4 Petrol [2019-2020],₹ 12 Lakh,"11,193 km",Petrol,Kolkata,2019


In [155]:
data = df2.drop(columns = ['year_of_manufacture' , 'car_price_in_rupees','car_name','kms_driven'])  # columns= already selects the axis
data.head()

Unnamed: 0,fuel_type,city
0,Petrol,Mumbai
1,Petrol,Kolkata
2,Diesel,Bangalore
3,Petrol,Thane
4,Petrol,Kolkata


In [156]:
data['fuel_type'].value_counts()

Petrol        1348
Diesel         636
CNG             82
Petrol + 1      18
Electric        10
Diesel + 1       7
Hybrid           2
LPG              2
Name: fuel_type, dtype: int64

In [157]:
data['city'].value_counts()

Bangalore      248
Pune           247
Mumbai         246
Ahmedabad      246
Kolkata        245
Hyderabad      245
Thane          244
Delhi          190
Chennai         78
Noida           41
Ambattur        19
Pallikarnai     17
Thiruvallur     16
Gurgaon          8
Poonamallee      8
Faridabad        7
Name: city, dtype: int64

## One Hot Encoding using pandas: get_dummies

In [158]:
pd.get_dummies(data,columns=['fuel_type','city'])

Unnamed: 0,fuel_type_CNG,fuel_type_Diesel,fuel_type_Diesel + 1,fuel_type_Electric,fuel_type_Hybrid,fuel_type_LPG,fuel_type_Petrol,fuel_type_Petrol + 1,city_Ahmedabad,city_Ambattur,...,city_Gurgaon,city_Hyderabad,city_Kolkata,city_Mumbai,city_Noida,city_Pallikarnai,city_Poonamallee,city_Pune,city_Thane,city_Thiruvallur
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2100,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2101,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2102,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2103,0,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0


Too many columns. Let's drop one dummy column for each feature.

In [159]:
pd.get_dummies(data,columns=['fuel_type','city'],drop_first = True)

Unnamed: 0,fuel_type_Diesel,fuel_type_Diesel + 1,fuel_type_Electric,fuel_type_Hybrid,fuel_type_LPG,fuel_type_Petrol,fuel_type_Petrol + 1,city_Ambattur,city_Bangalore,city_Chennai,...,city_Gurgaon,city_Hyderabad,city_Kolkata,city_Mumbai,city_Noida,city_Pallikarnai,city_Poonamallee,city_Pune,city_Thane,city_Thiruvallur
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2100,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2101,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2102,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2103,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


You can see that two columns have been dropped, one for each feature (fuel_type_CNG and city_Ahmedabad).

## One Hot Encoding using sklearn

In [160]:
data.head()          #even after creating dummies the original data remains the same

Unnamed: 0,fuel_type,city
0,Petrol,Mumbai
1,Petrol,Kolkata
2,Diesel,Bangalore
3,Petrol,Thane
4,Petrol,Kolkata


In [161]:
# data has only two columns, so y here is just the city column again; only x is encoded below
x_train , x_test , y_train , y_test = train_test_split(data.iloc[:,0:2] ,data.iloc[:,-1],test_size = 0.2 , random_state = 2 )

In [162]:
x_train.head()

Unnamed: 0,fuel_type,city
1659,Petrol,Bangalore
1624,Petrol,Hyderabad
794,Petrol,Pune
1526,Diesel,Chennai
921,Petrol,Pune


In [163]:
x_test.head()

Unnamed: 0,fuel_type,city
826,Petrol,Ahmedabad
803,Petrol,Mumbai
778,Petrol,Ahmedabad
1505,Petrol,Ahmedabad
1644,Petrol,Bangalore


In [164]:
ohe = OneHotEncoder()

In [168]:
x_train_new = ohe.fit_transform(x_train[['fuel_type','city']]).toarray()

In [169]:
x_train_new

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
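The cell above keeps every dummy column and encodes only the training split. A possible next step, sketched here on hypothetical toy rows rather than the car dataset, is to pass `drop='first'` so that one column per feature is removed (as with `drop_first=True` in pandas), and to reuse the fitted encoder on the test split. The names `toy_train`, `toy_test` and `toy_ohe` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows with the same two columns as the notebook.
toy_train = pd.DataFrame({"fuel_type": ["Petrol", "Diesel", "CNG", "Petrol"],
                          "city": ["Mumbai", "Pune", "Mumbai", "Thane"]})
toy_test = pd.DataFrame({"fuel_type": ["Diesel", "CNG"],
                         "city": ["Thane", "Pune"]})

# drop='first' removes the first category of each feature (CNG and Mumbai here).
toy_ohe = OneHotEncoder(drop="first")
train_arr = toy_ohe.fit_transform(toy_train).toarray()  # fit on train only
test_arr = toy_ohe.transform(toy_test).toarray()        # reuse on test

print(toy_ohe.categories_)
print(train_arr.shape)  # 3 + 3 categories, minus one per feature -> 4 columns
print(test_arr)
```

Note that this toy test split contains only categories seen during fitting; a category that appears only in the test data would need separate handling.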