## Ordinal Encoding

* Use ordinal encoding when the feature labels have a natural order, so that each label can be assigned a rank.

* For example: Low, Medium and High or Grading System (A-1, B-2, C-3, D-4, F-5)

* Experience of a Batsman (A-10 years-1st Rank, B-5 years-2nd Rank, C-3 years-3rd Rank)

* Weekdays and weekends (Mon-7, Tue-6, Wed-5, Thu-4, Fri-3, Sat-2, Sun-1)
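
As a minimal sketch of the idea, the Low/Medium/High example can be encoded with an explicit rank dictionary. The DataFrame `toy`, the column name `size` and the dictionary `size_rank` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data; the column name 'size' and its labels are assumptions.
toy = pd.DataFrame({"size": ["Low", "High", "Medium", "Low"]})

# Explicit rank for each label, from lowest to highest.
size_rank = {"Low": 1, "Medium": 2, "High": 3}

# Replace each label with its rank.
toy["size_ordinal"] = toy["size"].map(size_rank)
print(toy)
```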

In [None]:
import datetime

In [None]:
today_date = datetime.datetime.today()
today_date

datetime.datetime(2020, 8, 5, 20, 35, 32, 423050)

In [None]:
today_date-datetime.timedelta(3)

datetime.datetime(2020, 8, 2, 20, 35, 32, 423050)

#### List Comprehension

In [None]:
days = [today_date - datetime.timedelta(x) for x in range(0,15)]
days

[datetime.datetime(2020, 8, 5, 20, 35, 32, 423050),
 datetime.datetime(2020, 8, 4, 20, 35, 32, 423050),
 datetime.datetime(2020, 8, 3, 20, 35, 32, 423050),
 datetime.datetime(2020, 8, 2, 20, 35, 32, 423050),
 datetime.datetime(2020, 8, 1, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 31, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 30, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 29, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 28, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 27, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 26, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 25, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 24, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 23, 20, 35, 32, 423050),
 datetime.datetime(2020, 7, 22, 20, 35, 32, 423050)]

In [None]:
import pandas as pd

In [None]:
data=pd.DataFrame(days)

In [None]:
data.columns=["Day"]

In [None]:
data.head()

Unnamed: 0,Day
0,2020-08-05 20:35:32.423050
1,2020-08-04 20:35:32.423050
2,2020-08-03 20:35:32.423050
3,2020-08-02 20:35:32.423050
4,2020-08-01 20:35:32.423050


* It is advisable to extract the **hour, minute, second, month, year and day of the month** from the 'Day' column as separate features and then drop the 'Day' column.
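
One way this could look, sketched on a small stand-alone frame so that the notebook's `data` is left untouched. The frame `demo`, the fixed start date and the new column names are assumptions chosen for illustration.

```python
import datetime
import pandas as pd

# Fixed start date so the example is reproducible (the notebook uses today()).
start = datetime.datetime(2020, 8, 5, 20, 35, 32)
demo = pd.DataFrame({"Day": [start - datetime.timedelta(days=x) for x in range(3)]})

# Pull each component out of the datetime column with the .dt accessor.
demo["year"] = demo["Day"].dt.year
demo["month"] = demo["Day"].dt.month
demo["day_of_month"] = demo["Day"].dt.day
demo["hour"] = demo["Day"].dt.hour
demo["minute"] = demo["Day"].dt.minute
demo["second"] = demo["Day"].dt.second

# The raw timestamp is no longer needed once the parts are extracted.
demo = demo.drop(columns=["Day"])
print(demo)
```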

In [None]:
data['weekday']=data['Day'].dt.day_name()
data.head()

Unnamed: 0,Day,weekday
0,2020-08-05 20:35:32.423050,Wednesday
1,2020-08-04 20:35:32.423050,Tuesday
2,2020-08-03 20:35:32.423050,Monday
3,2020-08-02 20:35:32.423050,Sunday
4,2020-08-01 20:35:32.423050,Saturday


#### Ranking the days of the week

In [None]:
dictionary={'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6,'Sunday':7}

In [None]:
dictionary

{'Monday': 1,
 'Tuesday': 2,
 'Wednesday': 3,
 'Thursday': 4,
 'Friday': 5,
 'Saturday': 6,
 'Sunday': 7}

* The `.map()` function replaces each weekday name with the rank assigned to it in the dictionary.

In [None]:
data['weekday_ordinal']=data['weekday'].map(dictionary)

In [None]:
data.head(10)

Unnamed: 0,Day,weekday,weekday_ordinal
0,2020-08-05 20:35:32.423050,Wednesday,3
1,2020-08-04 20:35:32.423050,Tuesday,2
2,2020-08-03 20:35:32.423050,Monday,1
3,2020-08-02 20:35:32.423050,Sunday,7
4,2020-08-01 20:35:32.423050,Saturday,6
5,2020-07-31 20:35:32.423050,Friday,5
6,2020-07-30 20:35:32.423050,Thursday,4
7,2020-07-29 20:35:32.423050,Wednesday,3
8,2020-07-28 20:35:32.423050,Tuesday,2
9,2020-07-27 20:35:32.423050,Monday,1
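
One thing worth checking after a dictionary-based `.map()` is whether every label was covered, because a label that is absent from the dictionary becomes `NaN`. The short sketch below uses a deliberately incomplete dictionary and a made-up label (`'Funday'`) to show this; both are assumptions for illustration only.

```python
import pandas as pd

# Deliberately incomplete mapping, for illustration.
weekday_rank = {"Monday": 1, "Tuesday": 2}

# 'Funday' is an invalid label that the dictionary does not contain.
names = pd.Series(["Monday", "Tuesday", "Funday"])
ranks = names.map(weekday_rank)

print(ranks)               # the unmapped label shows up as NaN
print(ranks.isna().sum())  # number of labels that had no rank
```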


## Count Or Frequency Encoding

* It replaces each feature label with the **frequency or count** of that label in the data.

* Replace **'NaN' or '?'** values with a **'Missing'/'Others'** label.

* If a new country appears after deployment, it can be assigned to the **'Others'** label.
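
The points above can be sketched on toy data as follows. The series `train` and `test`, their values, and the choice to give `'Others'` a count of 0 are assumptions made for illustration, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical toy train/test columns; names and values are assumptions.
train = pd.Series(["US", "US", "Mexico", "?", "US", "Mexico"], name="Country")
test = pd.Series(["US", "Brazil", "?"], name="Country")

# Treat the '?' placeholder as an explicit 'Missing' label.
train = train.replace("?", "Missing")
test = test.replace("?", "Missing")

# Learn the counts from the training data only.
counts = train.value_counts().to_dict()

# Labels never seen in training ('Brazil') are folded into 'Others'.
test = test.where(test.isin(list(counts)), "Others")

# 'Others' has no training count, so it is given 0 here (one possible choice).
test_encoded = test.map(counts).fillna(0).astype(int)
print(test_encoded.tolist())
```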

In [None]:
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                        header = None,index_col=None)
train_set.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
columns=[1,3,5,6,7,8,9,13]

In [None]:
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
train_set=train_set[columns].copy()

In [None]:
train_set.columns=['Employment','Degree','Status','Designation','family_job','Race','Sex','Country']

In [None]:
train_set.head()

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [None]:
for feature in train_set.columns:
    print(feature,":",len(train_set[feature].unique()),'labels')

Employment : 9 labels
Degree : 16 labels
Status : 7 labels
Designation : 15 labels
family_job : 6 labels
Race : 5 labels
Sex : 2 labels
Country : 42 labels


In [None]:
# Without parentheses, value_counts is not called: this stores the bound method itself
country_map=train_set['Country'].value_counts
country_map

<bound method IndexOpsMixin.value_counts of 0         United-States
1         United-States
2         United-States
3         United-States
4                  Cuba
              ...      
32556     United-States
32557     United-States
32558     United-States
32559     United-States
32560     United-States
Name: Country, Length: 32561, dtype: object>

In [None]:
country_map=train_set['Country'].value_counts().to_dict()
country_map

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' Greece': 29,
 ' France': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Cambodia': 19,
 ' Trinadad&Tobago': 19,
 ' Laos': 18,
 ' Thailand': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Hungary': 13,
 ' Honduras': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [None]:
train_set['Country']=train_set['Country'].map(country_map)
train_set.head(10)

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,29170
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,81
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,29170
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170


1. **Advantages**

    * Easy to use
    * Does not increase the feature space


2. **Disadvantage**

    * Labels with the same frequency receive the same encoded value, so the model cannot tell them apart.
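
The disadvantage is visible in the `country_map` output above, where Greece and France both have a count of 29. A minimal sketch, using a hand-copied subset of those counts (leading spaces dropped) and a made-up series `countries`:

```python
import pandas as pd

# Counts copied from the country_map output above; only three labels are kept.
country_counts = {"Greece": 29, "France": 29, "Cuba": 95}

# Hypothetical column with one row per country.
countries = pd.Series(["Greece", "France", "Cuba"])
encoded = countries.map(country_counts)

# Greece and France are indistinguishable after encoding.
print(encoded.tolist())
```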

## Target Guided Ordinal Encoding

1. Order the labels according to the target feature, for example by the mean of the target (the proportion of 1s) within each label.


2. Replace each label with its rank in that ordering.
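
Before the Titanic walkthrough, the two steps can be sketched on a tiny made-up frame. The names `toy`, `group` and `y` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'group' is the categorical feature, 'y' the binary target.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "y":     [0,   0,   1,   1,   0,   1],
})

# Step 1: mean of the target per label, sorted from lowest to highest.
ordered = toy.groupby("group")["y"].mean().sort_values().index

# Step 2: rank 0 goes to the label with the lowest target mean.
rank = {label: i for i, label in enumerate(ordered)}
toy["group_ordinal"] = toy["group"].map(rank)
print(rank)
```

Here the target means are 0.0 for `a`, 0.5 for `c` and 1.0 for `b`, so the ranks follow that order.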

In [None]:
df=pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [None]:
# Assign the result back; a chained inplace fillna on a column is deprecated in recent pandas
df['Cabin']=df['Cabin'].fillna('Missing')

In [None]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [None]:
df['Cabin']=df['Cabin'].astype(str)
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [None]:
df['Cabin']=df['Cabin'].astype(str).str[0]

In [None]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [None]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

* We group the rows by the **Cabin Index** (the first letter of the cabin) and take the **proportion of people who survived** for each **Cabin Index**.

In [None]:
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [None]:
df.groupby(['Cabin'])['Survived'].mean().sort_values().index

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [None]:
ordinal_labels=df.groupby(['Cabin'])['Survived'].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [None]:
enumerate(ordinal_labels, start=0)

<enumerate at 0x7fee4e717910>

In [None]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels2

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

* `enumerate` yields pairs with the **counter (0 to 8)** first and the **ordinal label (T to D)** second, which is why the comprehension unpacks them as `i, k`.


* Here, the higher the **'Cabin_ordinal_labels'** value, the higher the **survival rate**.
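
The order in which `enumerate` yields its pairs can be checked directly on a short list. The list `labels` below is a shortened stand-in for `ordinal_labels`, used only for illustration.

```python
# Shortened stand-in for ordinal_labels.
labels = ["T", "M", "A"]

# Each pair is (counter, label), with the counter first.
pairs = list(enumerate(labels, start=0))
print(pairs)

# Swapping the pair inside the comprehension gives a label -> rank dictionary.
label_to_rank = {label: i for i, label in pairs}
print(label_to_rank)
```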

In [None]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels2)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1


## Mean Encoding

* Here, instead of ranking the **survival rate** from 0 to 8, we replace the **Cabin feature** with the **survival rate value** itself.

In [None]:
mean_ordinal=df.groupby(['Cabin'])['Survived'].mean().to_dict()

In [None]:
mean_ordinal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [None]:
df['Cabin_Mean_Encoding'] = df['Cabin'].map(mean_ordinal)

In [None]:
df.head(10)

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels,Cabin_Mean_Encoding
0,0,M,1,0.299854
1,1,C,4,0.59322
2,1,M,1,0.299854
3,1,C,4,0.59322
4,0,M,1,0.299854
5,0,M,1,0.299854
6,0,E,7,0.75
7,0,M,1,0.299854
8,1,M,1,0.299854
9,1,M,1,0.299854
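
As an aside, the same column can be produced without building the dictionary first, by using `groupby(...).transform('mean')`. The sketch below compares both routes on a small made-up frame; `toy` and its values are assumptions and do not come from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "Cabin": ["C", "C", "M", "M", "M", "E"],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Dictionary route, as in the cells above.
mean_map = toy.groupby("Cabin")["Survived"].mean().to_dict()
via_map = toy["Cabin"].map(mean_map)

# transform('mean') broadcasts each group's mean back onto its rows.
via_transform = toy.groupby("Cabin")["Survived"].transform("mean")

print(via_map.tolist())
print(via_transform.tolist())
```

The dictionary route is the one to keep when the mapping has to be reused on new data, since `transform` only works on the frame it was computed from.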
