### Techniques we will follow
Nominal Encoding:
1. One Hot Encoding 
2. One Hot Encoding with top 10 features
3. Count or Frequency Encoding
4. Mean Encoding 

Ordinal Encoding:
1. Normal Ordinal Encoding 
2. Target Guided Ordinal Encoding

In [1]:
#### Import all necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### One Hot Encoding

In [2]:
df = pd.read_csv('titanic.csv', usecols = ['Sex', 'Survived', 'Cabin', 'Embarked'])
df.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,male,,S
1,1,female,C85,C
2,1,female,,S
3,1,female,C123,S
4,0,male,,S


In [3]:
#### One Hot Encoding of the "Sex" and "Embarked" columns
pd.get_dummies(df, columns = ['Sex', 'Embarked'])

Unnamed: 0,Survived,Cabin,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,,0,1,0,0,1
1,1,C85,1,0,1,0,0
2,1,,1,0,0,0,1
3,1,C123,1,0,0,0,1
4,0,,0,1,0,0,1
...,...,...,...,...,...,...,...
886,0,,0,1,0,0,1
887,1,B42,1,0,0,0,1
888,0,,1,0,0,0,1
889,1,C148,0,1,1,0,0


#### Advantage:
1. Easy to implement and popular

#### Disadvantage:
1. If a column has too many categories, it creates many additional features, which leads to the CURSE OF DIMENSIONALITY
2. A sparse matrix is created, meaning most entries in each row are zeros
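
Both disadvantages can be seen on a small made-up column. The sketch below is not part of the original notebook: the `city` column and its values are assumptions chosen for illustration. It counts the columns that `pd.get_dummies` creates and the share of zeros in the result.

```python
import pandas as pd

# Toy data (hypothetical): one categorical column with 5 distinct values.
toy = pd.DataFrame({"city": ["A", "B", "C", "D", "E", "A", "B", "A"]})

# dtype=int keeps the 0/1 style of output on pandas versions
# whose default dummy dtype is bool.
encoded = pd.get_dummies(toy, columns=["city"], dtype=int)

n_new_columns = encoded.shape[1]                # one column per category
zero_share = (encoded == 0).to_numpy().mean()   # fraction of cells equal to 0

print(n_new_columns)
print(round(zero_share, 2))
```

Each row holds exactly one 1, so the share of zeros grows with the number of categories.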

#### One Hot Encoding with top 10 features

In [4]:
df = pd.read_csv('adult.csv', na_values = ['?', '??'], usecols = ['occupation'])
df.head()

#### Number of unique categories in "occupation"
df.loc[:, 'occupation'].nunique()

#### Taking top 10 categories from the dataset
top_ten_ = list(df.loc[:, 'occupation'].value_counts(ascending = False).head(10).index)
for top_ in top_ten_:
    print(top_, end = " || ")

#### Replace every category outside the top 10 (including NaN) with 'unknown'
df.loc[:, 'occupation'] = df.loc[:, 'occupation'].where(df.loc[:, 'occupation'].isin(top_ten_), 'unknown')
print("Done !")

df.loc[:, 'occupation'].nunique()

#### Now One Hot Encoding 
pd.get_dummies(df, columns = ['occupation'], drop_first = True)

Prof-specialty || Craft-repair || Exec-managerial || Adm-clerical || Sales || Other-service || Machine-op-inspct || Transport-moving || Handlers-cleaners || Farming-fishing || Done !


Unnamed: 0,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Sales,occupation_Transport-moving,occupation_unknown
0,0,0,0,0,1,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...
48837,0,0,0,0,0,0,0,0,0,1
48838,0,0,0,0,1,0,0,0,0,0
48839,0,0,0,0,0,0,0,0,0,0
48840,0,0,0,0,0,0,0,0,0,0


#### Advantage:
1. Easy to implement and popular
2. The number of new features stays small, because rare categories share one column

#### Disadvantage:
1. It can still lead to the curse of dimensionality, since up to 10 new features are created per column
2. The result is still a sparse matrix
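
The top-10 step can be wrapped in a small reusable helper. This is a minimal sketch rather than the notebook's own code: the function name `keep_top_k`, its parameters, and the toy values are hypothetical.

```python
import pandas as pd

def keep_top_k(series, k=10, other_label="unknown"):
    """Keep the k most frequent categories; map the rest (and NaN) to other_label."""
    top_k = series.value_counts().head(k).index   # value_counts ignores NaN
    return series.where(series.isin(top_k), other_label)

# Toy data (hypothetical): 'a' and 'b' are the two most frequent values.
toy = pd.Series(["a", "a", "a", "b", "b", "c", None])
reduced = keep_top_k(toy, k=2)

print(reduced.tolist())
```

The reduced column can then be passed to `pd.get_dummies` as usual.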

#### Count or Frequency Encoding

In [5]:
df = pd.read_csv('adult.csv', na_values = ['?', '??'])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [6]:
df.loc[:, 'native-country'].nunique()
#### With One Hot Encoding this column alone would create 41 new features (40 with drop_first = True).

41

In [7]:
df.loc[:, 'native-country'] = df.loc[:, 'native-country'].map(df.loc[:, 'native-country'].value_counts().to_dict())
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,43832.0,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,43832.0,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,43832.0,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,43832.0,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,43832.0,<=50K


#### Advantage:
1. Easy to implement
2. Does not add any features

#### Disadvantage:
1. If two categories in the column have the same count, they receive the same encoded value and can no longer be told apart
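
The collision is easy to reproduce. The following sketch uses a made-up `country` column (an assumption, not the adult dataset) in which two categories occur equally often.

```python
import pandas as pd

# Toy data (hypothetical): 'X' and 'Y' both appear twice, 'Z' once.
toy = pd.DataFrame({"country": ["X", "X", "Y", "Y", "Z"]})

# Same technique as above: map each category to its count.
counts = toy["country"].value_counts().to_dict()
toy["country_count"] = toy["country"].map(counts)

n_categories = toy["country"].nunique()     # distinct categories before encoding
n_codes = toy["country_count"].nunique()    # distinct values after encoding

print(n_categories, n_codes)
```

Three categories collapse into two encoded values, so the model cannot distinguish `X` from `Y`.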

#### Mean Encoding

In [8]:
df = pd.read_csv('adult.csv', na_values = ['?', '??'])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [9]:
#### Mean encoding needs a numeric target, so encode the target first
df.loc[:, 'income'] = df.loc[:, 'income'].map({'<=50K':0 , '>50K':1})
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,0


In [10]:
df.loc[:, 'native-country'] = df.loc[:, 'native-country'].map(df.groupby(['native-country'])['income'].mean().to_dict())
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,0.243977,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,0.243977,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,0.243977,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,0.243977,1
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,0.243977,0


#### Advantage:
1. Easy to implement
2. Does not add any features

#### Disadvantage:
1. It is computed from the `Target` column, so it leaks target information into the feature and can lead to overfitting; it needs proper validation
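
One common way to reduce this leakage is out-of-fold mean encoding: each row is encoded with means computed from the other folds only. This is a swapped-in variant, not the notebook's method; the helper name `out_of_fold_mean_encode`, its parameters, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_mean_encode(df, column, target, n_splits=4, seed=0):
    """Encode `column` with target means learned on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kfold.split(df):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(column)[target].mean()
        fold_values = df[column].iloc[apply_idx].map(means)
        # Categories unseen in the fitting folds fall back to the fold's overall mean.
        fold_values = fold_values.fillna(fit_part[target].mean())
        encoded.iloc[apply_idx] = fold_values.to_numpy()
    return encoded

# Toy data (hypothetical), with the target already encoded as 0/1.
toy = pd.DataFrame({
    "country": ["X"] * 4 + ["Y"] * 4,
    "income": [1, 1, 0, 0, 1, 0, 0, 0],
})
toy["country_oof_mean"] = out_of_fold_mean_encode(toy, "country", "income")
print(toy)
```

A row's own target value never contributes to its encoded feature, which is the point of the extra loop.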

#### Normal Ordinal Encoding (Label Encoding)

In [11]:
df = pd.read_csv('adult.csv', na_values = ['?', '??'])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [12]:
df.loc[:, 'workclass'].nunique()

8

In [13]:
df.loc[:, 'workclass'].value_counts()

Private             33906
Self-emp-not-inc     3862
Local-gov            3136
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64

In [14]:
df.loc[:, 'workclass'] = df.loc[:, 'workclass'].map({'Federal-gov': 8, 'State-gov': 7, 
                                                     'Local-gov': 6, 'Private': 5, 'Self-emp-inc': 4,
                                                     'Self-emp-not-inc' : 3, 'Without-pay': 2, 
                                                     'Never-worked' : 1})

In [15]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,5.0,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,5.0,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,6.0,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,5.0,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


#### Advantage:
1. Easy to implement
2. Does not add any features

#### Disadvantage:
1. It is hard to decide which rank each category should get
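
Whatever ranking is chosen, it helps to state it once as an ordered list instead of a hand-written dictionary. The sketch below assumes a made-up `size` column and an assumed order; it uses an ordered `pd.Categorical` to derive the ranks.

```python
import pandas as pd

# Assumed ranking, lowest to highest; the order itself is a modelling choice.
size_order = ["small", "medium", "large"]

# Toy data (hypothetical).
toy = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

ordered = pd.Categorical(toy["size"], categories=size_order, ordered=True)

# Codes start at 0, so add 1 to start ranks at 1.
# Values missing from size_order get code -1, i.e. rank 0 here.
toy["size_rank"] = ordered.codes + 1

print(toy["size_rank"].tolist())
```

Changing the ranking then means reordering one list.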

#### Target Guided Ordinal Encoding

In [16]:
df = pd.read_csv('adult.csv', na_values = ['?', '??'])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [17]:
df.loc[:, 'education'].isnull().sum()

0

In [18]:
#### Target guided ordinal encoding needs a numeric target, so encode the target first
df.loc[:, 'income'] = df.loc[:, 'income'].map({'<=50K':0 , '>50K':1})
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,0


In [19]:
#### Target Guided Ordinal Encoding
target_guided_ = list(df.groupby(['education'])['income'].mean().sort_values(ascending = True).index)
dict_ = {value_: index_ for index_, value_ in enumerate(target_guided_)}
df.loc[:, 'education'] = df.loc[:, 'education'].map(dict_)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,2,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,89814,8,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,336951,11,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,160323,9,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,,103497,9,10,Never-married,,Own-child,White,Female,0,0,30,United-States,0


#### Advantage:
1. Easy to implement
2. Does not add any features

#### Disadvantage:
1. The ranking is derived from the target, so it can lead to overfitting and requires proper validation
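
One way to validate properly is to learn the ranking on the training split only and then apply it to the test split. This is a minimal sketch under stated assumptions: the toy rows, the split size, and the choice of `-1` for categories unseen in training are all hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical), with the target already encoded as 0/1.
toy = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "MSc", "MSc", "HS", "BSc", "MSc", "PhD"],
    "income":    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
})

train, test = train_test_split(toy, test_size=0.3, random_state=0)

# Learn the ranking from the training rows only.
ordered = train.groupby("education")["income"].mean().sort_values().index
rank = {category: position for position, category in enumerate(ordered)}

train_encoded = train["education"].map(rank)
# Categories that never appeared in training are marked with -1 (an assumption).
test_encoded = test["education"].map(rank).fillna(-1).astype(int)

print(rank)
print(test_encoded.tolist())
```

The test rows then give an honest estimate, because their targets played no part in the mapping.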

#### Implementation

In [20]:
import numpy as np 
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, f1_score, recall_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, BaggingClassifier
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:
df = pd.read_csv('adult.csv', na_values = ['?', '??'])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [22]:
#### Null value check
df.isnull().sum()

age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64

In [None]:
#### For "workclass" and "occupation" we use "Missing"
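
The comment above suggests a "Missing" label for the two columns. A minimal sketch of that step on made-up rows (not the adult dataset) could look like this.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with gaps in both columns.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Local-gov"],
    "occupation": [np.nan, "Sales", np.nan],
})

# Treat missing values as their own category.
for column in ["workclass", "occupation"]:
    toy[column] = toy[column].fillna("Missing")

print(toy)
```

Keeping "Missing" as a separate category lets any of the encodings above handle these rows without dropping them.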