## **Ordinal number encoding**



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import datetime

In [None]:
today_date=datetime.datetime.today()

In [None]:
today_date

datetime.datetime(2024, 4, 29, 12, 35, 20, 251144)

In [None]:
today_date-datetime.timedelta(3)

datetime.datetime(2024, 4, 26, 12, 35, 20, 251144)

In [None]:
# List comprehension: the last 15 days, counting back from today

days=[today_date-datetime.timedelta(x) for x in range(0,15)]

In [None]:
import pandas as pd

data=pd.DataFrame(days)
data.columns=["Day"]

data.head()

Unnamed: 0,Day
0,2024-04-29 12:35:20.251144
1,2024-04-28 12:35:20.251144
2,2024-04-27 12:35:20.251144
3,2024-04-26 12:35:20.251144
4,2024-04-25 12:35:20.251144


In [None]:
data["weekday"]=data["Day"].dt.day_name()
data.head()

Unnamed: 0,Day,weekday
0,2024-04-29 12:35:20.251144,Monday
1,2024-04-28 12:35:20.251144,Sunday
2,2024-04-27 12:35:20.251144,Saturday
3,2024-04-26 12:35:20.251144,Friday
4,2024-04-25 12:35:20.251144,Thursday


In [None]:
dictionary={
    'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6,'Sunday':7
}

In [None]:
dictionary

{'Monday': 1,
 'Tuesday': 2,
 'Wednesday': 3,
 'Thursday': 4,
 'Friday': 5,
 'Saturday': 6,
 'Sunday': 7}

In [None]:
data['weekday_ordinal']=data['weekday'].map(dictionary)

In [None]:
data

Unnamed: 0,Day,weekday,weekday_ordinal
0,2024-04-29 12:35:20.251144,Monday,1
1,2024-04-28 12:35:20.251144,Sunday,7
2,2024-04-27 12:35:20.251144,Saturday,6
3,2024-04-26 12:35:20.251144,Friday,5
4,2024-04-25 12:35:20.251144,Thursday,4
5,2024-04-24 12:35:20.251144,Wednesday,3
6,2024-04-23 12:35:20.251144,Tuesday,2
7,2024-04-22 12:35:20.251144,Monday,1
8,2024-04-21 12:35:20.251144,Sunday,7
9,2024-04-20 12:35:20.251144,Saturday,6
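
The hand-written dictionary above can also be derived from pandas itself. The following is a minimal sketch, assuming the same Monday=1 ... Sunday=7 convention. The frame name `demo`, the fixed start date, and the column `weekday_code` are illustrative and not part of the notebook.

```python
import pandas as pd

# Fixed dates so the result is reproducible (illustrative, not the notebook's data).
demo = pd.DataFrame({"Day": pd.date_range("2024-04-22", periods=7, freq="D")})
demo["weekday"] = demo["Day"].dt.day_name()

# Option 1: dt.dayofweek is 0 for Monday ... 6 for Sunday, so add 1.
demo["weekday_ordinal"] = demo["Day"].dt.dayofweek + 1

# Option 2: an ordered categorical yields the same codes from the names alone.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
weekday_cat = pd.Categorical(demo["weekday"], categories=order, ordered=True)
demo["weekday_code"] = weekday_cat.codes + 1

print(demo)
```

Option 2 may be the more general pattern, since it works for any ordered set of labels and not only for dates.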


#### **Count or frequency encoding**


In [None]:
train_set=pd.read_csv('adult.data', index_col=None,header=None)

train_set.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
len(train_set[1].unique())

columns=[1,3,5,6,7,8,9,13]

In [None]:
# .copy() avoids a SettingWithCopyWarning when a column is reassigned later
train_set=train_set[columns].copy()

In [None]:
# Rename the selected columns to descriptive names
train_set.columns=['Employment','Degree','Status','Designation','Family_Job','Race','Sex','Country']

In [None]:
train_set.head()

Unnamed: 0,Employment,Degree,Status,Designation,Family_Job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [None]:
for feature in train_set.columns:
    print(feature,":",len(train_set[feature].unique()),'labels')

Employment : 9 labels
Degree : 16 labels
Status : 7 labels
Designation : 15 labels
Family_Job : 6 labels
Race : 5 labels
Sex : 2 labels
Country : 42 labels


In [None]:
country_map=train_set['Country'].value_counts().to_dict()

In [None]:
train_set['Country']=train_set['Country'].map(country_map)

train_set.head(20)

Unnamed: 0,Employment,Degree,Status,Designation,Family_Job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,29170
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,81
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,29170
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170


#### Advantages

1. Easy to implement.
2. It does not expand the feature space (no extra dimensions), unlike other encodings such as one-hot encoding.

#### Disadvantages

1. Labels with the same frequency receive the same encoded value, so the model cannot tell them apart (for example, two countries that occur equally often).
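
The collision described above, and the question of how to treat labels that were not present when the counts were learned, can be seen on a small example. This is a sketch on made-up data: the frames `train` and `test`, the column `Country_count`, and the choice to fill unseen labels with 0 are all assumptions, not something the notebook does.

```python
import pandas as pd

# Toy data (hypothetical); the notebook uses the Country column of adult.data.
train = pd.DataFrame(
    {"Country": ["US", "US", "US", "Cuba", "Cuba", "India", "India"]}
)
test = pd.DataFrame({"Country": ["US", "India", "Peru"]})

# Learn the counts from the training data only.
country_map = train["Country"].value_counts().to_dict()

train["Country_count"] = train["Country"].map(country_map)

# Labels unseen in training map to NaN; filling with 0 is one possible choice.
test["Country_count"] = test["Country"].map(country_map).fillna(0).astype(int)

print(country_map)
print(test)
```

Here `Cuba` and `India` both occur twice, so both are encoded as 2 and become indistinguishable after encoding.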

#### **Target Guided Ordinal encoding**

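1. Order the labels by the mean of the target within each label.
2. Replace each label with its position in that ordering. (Alternatively, replace it with the probability of the target being 1, which is what mean encoding does.)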







In [None]:
import pandas as pd

df=pd.read_csv('Titanic-Dataset.csv', usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,






In [None]:
# Assign the result back; inplace=True on a selected column is deprecated in recent pandas
df['Cabin']=df['Cabin'].fillna('Missing')

df['Cabin']=df['Cabin'].astype(str).str[0]

df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [None]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [None]:
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [None]:
# Sort the labels by mean survival and keep only the index, so ranks can be assigned in that order
df.groupby(['Cabin'])['Survived'].mean().sort_values().index

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [None]:
ordinal_labels=df.groupby(['Cabin'])['Survived'].mean().sort_values().index

ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [None]:
# Map each label to its rank in the sorted order (0 = lowest mean survival)

ordinal_labels={k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [None]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels)

df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1
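
The steps above can be wrapped into a small reusable helper. This is a sketch only: the function name `target_guided_ordinal` and the toy frame are hypothetical, and ties between labels with equal means are not handled specially.

```python
import pandas as pd


def target_guided_ordinal(frame, column, target):
    """Return a {label: rank} dict, ranking labels by the mean of `target`."""
    ordered = frame.groupby(column)[target].mean().sort_values().index
    return {label: rank for rank, label in enumerate(ordered)}


# Toy data (hypothetical); per-cabin means are A=0.5, B=1.0, C=0.0.
toy = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "C", "C"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

mapping = target_guided_ordinal(toy, "Cabin", "Survived")
toy["Cabin_ordinal"] = toy["Cabin"].map(mapping)

print(mapping)
print(toy)
```

When a separate test set exists, the mapping would normally be computed on the training rows only and then applied to the test rows with `map`.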


#### **Mean Encoding**

Replace each categorical value with the mean of the target for that category.

In [None]:
mean_ordinal=df.groupby(['Cabin'])['Survived'].mean().to_dict()

In [None]:
df['mean_ordinal_encode']=df['Cabin'].map(mean_ordinal)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels,mean_ordinal_encode
0,0,M,1,0.299854
1,1,C,4,0.59322
2,1,M,1,0.299854
3,1,C,4,0.59322
4,0,M,1,0.299854




#### Advantages

1. It does not increase the dimensionality of the data.

#### Disadvantages

1. It can lead to overfitting, because the encoded values are computed directly from the target.
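
One commonly used way to soften the overfitting risk is additive smoothing, which pulls the mean of rare categories toward the global mean. The notebook does not do this. The sketch below uses made-up data, and the smoothing strength `m = 2` is an arbitrary assumption.

```python
import pandas as pd

# Toy data (hypothetical): cabin T appears only once.
toy = pd.DataFrame({
    "Cabin": ["M", "M", "M", "M", "C", "C", "T"],
    "Survived": [0, 0, 1, 0, 1, 1, 0],
})

m = 2  # smoothing strength (assumed value; larger means more shrinkage)
global_mean = toy["Survived"].mean()

# Per-category mean and number of rows.
stats = toy.groupby("Cabin")["Survived"].agg(["mean", "count"])

# Weighted average of the category mean and the global mean.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

toy["Cabin_smoothed_mean"] = toy["Cabin"].map(smoothed)
print(smoothed)
```

With plain mean encoding, cabin `T` would be encoded as exactly 0 on the basis of a single row; with smoothing its value moves toward the global mean. Computing the encoding on training folds only is another option along the same lines.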