### Feature engineering: LabelEncoder, OneHotEncoder, get_dummies

Please read these to understand encoding:
   -  https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
   -  https://towardsdatascience.com/one-hot-encoding-multicollinearity-and-the-dummy-variable-trap-b5840be3c41a
   -  https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
   -  https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
   -  https://inmachineswetrust.com/posts/drop-first-columns/

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
## Add these lines to turn off the warnings
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [3]:
## import data
train_df = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Titanic/titanic_train.csv")

In [4]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
train_df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [7]:
## Get the Categorical Data In Dataset
df_cat_objects = train_df.select_dtypes(include= "object")
df_cat_objects.head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S


In [8]:
cat_features = df_cat_objects.columns.values.tolist()

In [9]:
cat_features

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [10]:
### Label-encode every categorical (object) column
## Note: astype(str) turns missing values into the string "nan", which becomes its own class
for col in cat_features:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col].astype(str))

In [11]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,108,1,22.0,1,0,523,7.25,147,2
1,2,1,1,190,0,38.0,1,0,596,71.2833,81,0
2,3,1,3,353,0,26.0,0,0,669,7.925,147,2
3,4,1,1,272,0,35.0,1,0,49,53.1,55,2
4,5,0,3,15,1,35.0,0,0,472,8.05,147,2


In [12]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    int32  
 4   Sex          891 non-null    int32  
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    int32  
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    int32  
 11  Embarked     891 non-null    int32  
dtypes: float64(2), int32(5), int64(5)
memory usage: 66.3 KB


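The **cat_features** columns are now numeric: `LabelEncoder` maps the sorted unique values of each column to the integers **0** to **n_classes - 1**. Most scikit-learn estimators need numeric input, so some encoding of this kind is required before fitting. Note that scikit-learn documents `LabelEncoder` as a transformer for target labels; for input features, `OrdinalEncoder` does the same job on 2-D data. The integer codes also imply an order that nominal features such as `Name` or `Embarked` do not actually have.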
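
To make the 0 to n_classes - 1 mapping concrete, here is a minimal sketch on a made-up column (the values only mimic `Embarked`; they are not taken from the dataset). It reads the learned mapping from `classes_` and reverses it with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for Embarked; not the Titanic data itself.
embarked = ["S", "C", "S", "Q"]

le = LabelEncoder()
codes = le.fit_transform(embarked)

# classes_ holds the sorted unique values; a value's code is its index there.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)            # {'C': 0, 'Q': 1, 'S': 2}
print(codes.tolist())     # [2, 0, 2, 1]

# inverse_transform recovers the original labels from the codes.
restored = le.inverse_transform(codes).tolist()
print(restored)           # ['S', 'C', 'S', 'Q']
```

This matches the encoded `Embarked` values in the table above, where `S` became 2 and `C` became 0.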

### One-Hot Encoding the Categorical Features
Now we will convert the categorical features **cat_features** to one-hot encoded features with `OneHotEncoder`.

You may have observed that we first integer-encoded the categorical columns with `LabelEncoder`. Older scikit-learn releases (before 0.20) required this, because `OneHotEncoder` only accepted integer input. Current releases also accept string columns directly, so the extra step is optional.
      1. The input to this transformer should be a 2-D array of integers or strings, denoting the values taken on by categorical (discrete) features.
      2. The output is a sparse matrix by default, where each column corresponds to one possible value of one feature.
      3. The categories of each feature are learned from the data during `fit` (older releases assumed integer values in the range [0, n_values)).
      4. This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

We can also use pandas' `get_dummies` to encode the categorical variables.
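
As a small illustration of the input and output described above, the sketch below feeds raw strings straight into `OneHotEncoder`, assuming scikit-learn 0.20 or later. The data is a toy column, not the Titanic frame.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column of raw strings; no LabelEncoder step beforehand.
sex = np.array(["male", "female", "female", "male"]).reshape(-1, 1)

enc = OneHotEncoder()
onehot = enc.fit_transform(sex).toarray()   # default output is sparse

# categories_ lists the learned values per input column, in output-column order.
categories = [str(c) for c in enc.categories_[0]]
print(categories)         # ['female', 'male']
print(onehot.tolist())    # [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```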

In [13]:
## cat_features: the same list used for label encoding, plus Pclass, which is also a categorical feature
## Total categorical variables
cat_features.append("Pclass")
cat_features


['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Pclass']

In [14]:
hot_encoding_features_count = train_df[['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Pclass']].nunique().sum()
hot_encoding_features_count

1729

In [15]:
train_df_coulmns = train_df.columns.values.tolist()
train_df_coulmns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [16]:
not_hot_enc_features = [i for i in train_df_coulmns if i not in cat_features]
not_hot_enc_features

['PassengerId', 'Survived', 'Age', 'SibSp', 'Parch', 'Fare']

In [17]:
Total_hot_encoded_columns = len(not_hot_enc_features) + hot_encoding_features_count + len(cat_features)
Total_hot_encoded_columns

1741
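
The same total can be checked by hand. The sketch below hard-codes the level counts from `nunique()` above; `Cabin` and `Embarked` each count one extra level because their missing values were cast to the string `"nan"` before label encoding. It assumes, as the notebook does, that the original columns are kept next to the new indicator columns.

```python
# Distinct values per encoded column after label encoding. Cabin and Embarked
# each gain one extra level because missing values were cast to the string "nan".
levels = {"Name": 891, "Sex": 2, "Ticket": 681, "Cabin": 148, "Embarked": 4, "Pclass": 3}

original_columns = 12                     # columns in train_df before encoding
new_columns = sum(levels.values())        # one indicator column per level
total = original_columns + new_columns

print(new_columns, total)                 # 1729 1741
```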

In [18]:
## Let's start one-hot encoding with a single feature: "Sex"
## Reshape the 1-D Sex array to 2-D, because fit_transform expects 2-D input, then fit and transform
encoded_feature = OneHotEncoder().fit_transform(train_df.Sex.values.reshape(-1,1)).toarray()

In [19]:
encoded_feature

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [20]:
### Now we will assign column names to this array (encoded_feature)
## How many columns do we need for "Sex"? It has 2 unique values (male, female). When a feature has many unique values, count them with nunique()
n = train_df["Sex"].nunique()

In [21]:
## Assign column names for these 2 values (Sex_1, Sex_2)

In [22]:
## Use the format method to build the Sex column names (Sex_1, Sex_2).
## Sex_1 corresponds to label 0 (female) and Sex_2 to label 1 (male).
cols = ["{}_{}".format("Sex",i) for i in range(1, n+1)]

In [23]:
cols

['Sex_1', 'Sex_2']

In [24]:
### Our goal now is to add these columns (Sex_1 and Sex_2), with their values, to the dataframe train_df.
### There are many ways to add columns to train_df; here we use the concat method.
### First build a dataframe that holds the columns Sex_1 and Sex_2
df_feature = pd.DataFrame(encoded_feature,columns=cols)

In [25]:
df_feature

Unnamed: 0,Sex_1,Sex_2
0,0.0,1.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,0.0,1.0
...,...,...
886,0.0,1.0
887,1.0,0.0
888,1.0,0.0
889,0.0,1.0


In [26]:
## Append the dataframe df_feature to the training dataset
train_df = pd.concat([train_df,df_feature],axis=1)
train_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_1,Sex_2
0,1,0,3,108,1,22.0,1,0,523,7.2500,147,2,0.0,1.0
1,2,1,1,190,0,38.0,1,0,596,71.2833,81,0,1.0,0.0
2,3,1,3,353,0,26.0,0,0,669,7.9250,147,2,1.0,0.0
3,4,1,1,272,0,35.0,1,0,49,53.1000,55,2,1.0,0.0
4,5,0,3,15,1,35.0,0,0,472,8.0500,147,2,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,548,1,27.0,0,0,101,13.0000,147,2,0.0,1.0
887,888,1,1,303,0,19.0,0,0,14,30.0000,30,2,1.0,0.0
888,889,0,3,413,0,,1,2,675,23.4500,147,2,1.0,0.0
889,890,1,1,81,1,26.0,0,0,8,30.0000,60,0,0.0,1.0


In [27]:
## Now one-hot encode the remaining categorical features
cat_features

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Pclass']

In [28]:
cat_features.remove("Sex")
cat_features

['Name', 'Ticket', 'Cabin', 'Embarked', 'Pclass']

In [29]:
for val in cat_features:
    ## reshape to 2-D because fit_transform expects 2-D input
    encoded_feature = OneHotEncoder().fit_transform(train_df[val].values.reshape(-1,1)).toarray()
    n = train_df[val].nunique()
    ## use i as the column index so that it does not shadow n
    cols = ["{}_{}".format(val,i) for i in range(1, n+1)]
    df_val = pd.DataFrame(encoded_feature,columns=cols)
    train_df = pd.concat([train_df,df_val],axis= 1)
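
The per-column loop can also be written as a single `OneHotEncoder` call over several columns, with column names taken from the learned categories instead of positional indices. This is only a sketch on a made-up frame (`toy` and `onehot_df` are illustrative names), not a replacement for the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame; column names mirror the notebook but the rows are made up.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"], "Pclass": [3, 1, 3, 2]})
cat_cols = ["Embarked", "Pclass"]

enc = OneHotEncoder()
encoded = enc.fit_transform(toy[cat_cols]).toarray()

# Build "<column>_<category>" names from the categories the encoder learned.
names = [f"{col}_{cat}" for col, cats in zip(cat_cols, enc.categories_) for cat in cats]

onehot_df = pd.DataFrame(encoded, columns=names, index=toy.index)
print(list(onehot_df.columns))
# ['Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_1', 'Pclass_2', 'Pclass_3']
```

Naming columns after the category values makes it easier to tell which indicator belongs to which level than names such as `Embarked_1`.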

In [30]:
train_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Cabin_146,Cabin_147,Cabin_148,Embarked_1,Embarked_2,Embarked_3,Embarked_4,Pclass_1,Pclass_2,Pclass_3
0,1,0,3,108,1,22.0,1,0,523,7.2500,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,2,1,1,190,0,38.0,1,0,596,71.2833,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,3,1,3,353,0,26.0,0,0,669,7.9250,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,4,1,1,272,0,35.0,1,0,49,53.1000,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,5,0,3,15,1,35.0,0,0,472,8.0500,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,548,1,27.0,0,0,101,13.0000,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
887,888,1,1,303,0,19.0,0,0,14,30.0000,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
888,889,0,3,413,0,,1,2,675,23.4500,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
889,890,1,1,81,1,26.0,0,0,8,30.0000,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [31]:
Total_col_count_after_hot_encoding = len(train_df.columns.values.tolist())
Total_col_count_after_hot_encoding

1741

The total column count (1741) matches the number we predicted before encoding.

#### Challenges of One-Hot Encoding: Dummy Variable Trap
One-hot encoding creates the dummy variable trap: the indicator columns of a feature always sum to 1, so the value of any one of them can be predicted exactly from the remaining ones.

   **The dummy variable trap is a scenario in which the dummy variables are perfectly (or highly) correlated with each other.**

The dummy variable trap leads to the problem known as multicollinearity, which occurs when there is a dependency between the independent features. Multicollinearity is a serious issue for models like Linear Regression and Logistic Regression, because the coefficients are no longer uniquely determined.

So, in order to overcome the problem of multicollinearity, one of the dummy variables of each feature has to be dropped.
https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
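
The dependency can be seen directly in the rank of the design matrix. The following is a minimal NumPy sketch on a made-up 3-level feature with an intercept column. With all three indicators the matrix is rank deficient, and dropping one indicator removes the dependency.

```python
import numpy as np

# Toy one-hot block for a 3-level feature (rows: levels 0, 1, 2, 0).
onehot = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)
intercept = np.ones((4, 1))

# Every row of the one-hot block sums to 1, which equals the intercept column.
full = np.hstack([intercept, onehot])
print(np.linalg.matrix_rank(full))      # 3, although there are 4 columns

# Dropping one indicator removes the exact linear dependency.
reduced = np.hstack([intercept, onehot[:, 1:]])
print(np.linalg.matrix_rank(reduced))   # 3, equal to its 3 columns
```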

However, if we apply even the tiniest bit of regularization, we can handle features that are perfectly correlated without removing any columns, because regularization innately addresses the effects of multicollinearity.
Likewise, iterative numerical methods such as gradient descent, with or without regularization, don't involve matrix inversions, so there is no reason to drop one of the one-hot encoded columns from each categorical feature when using them.
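
A small numerical sketch of the regularization argument, under simplifying assumptions: toy data, an arbitrary penalty `lam`, and a ridge penalty applied to every coefficient including the intercept. The Gram matrix of the full one-hot design is singular, and adding `lam * I` makes the normal equations uniquely solvable.

```python
import numpy as np

# Toy design matrix: intercept plus all three one-hot columns of one feature.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0]], dtype=float)
y = np.array([1.0, 2.0, 3.0, 1.0])

gram = X.T @ X
print(np.linalg.matrix_rank(gram))        # 3: singular, so plain OLS has no unique solution

# Ridge adds lam * I to the Gram matrix, which makes it positive definite.
lam = 1e-3                                # illustrative value, not tuned
ridge_gram = gram + lam * np.eye(X.shape[1])
print(np.linalg.matrix_rank(ridge_gram))  # 4

# The regularized normal equations now have a unique solution.
beta = np.linalg.solve(ridge_gram, X.T @ y)
fitted = X @ beta                         # close to y for such a small penalty
```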

**Side note**: I recommend avoiding pandas' `get_dummies` in favor of a more robust one-hot encoder, such as `OneHotEncoder` from scikit-learn, which is designed to handle these frequent scenarios:
1. A categorical feature containing values that appear in the test set but not the training set
2. A categorical feature in the test set containing a subset of the total possible values
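
Both scenarios can be reproduced on a made-up train/test split. The sketch assumes `handle_unknown="ignore"` is an acceptable policy, so an unseen category is encoded as an all-zero row.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy split: "Q" appears only in the test set, and "C" is missing from it.
train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
test = pd.DataFrame({"Embarked": ["S", "Q"]})

# get_dummies derives columns from whatever frame it is given,
# so the train and test layouts disagree.
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)
print(train_cols)    # ['Embarked_C', 'Embarked_S']
print(test_cols)     # ['Embarked_Q', 'Embarked_S']

# OneHotEncoder learns the categories once from the training data and reuses them.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
test_encoded = enc.transform(test).toarray()
print(test_encoded.tolist())    # [[0.0, 1.0], [0.0, 0.0]]
```

The encoded test matrix keeps the training layout (`C`, `S`), which is what a model fitted on the training data expects.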


### Dummy Encoding


In [32]:
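## Reload the raw data from the same source used at the top of the notebook
train_df_1 = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Titanic/titanic_train.csv")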
train_df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [33]:
cat_features = train_df_1.select_dtypes(include= "object").columns.values.tolist()
cat_features

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [34]:
train_df_2 = pd.get_dummies(train_df_1[cat_features])

In [35]:
train_df_2.head()

Unnamed: 0,"Name_Abbing, Mr. Anthony","Name_Abbott, Mr. Rossmore Edward","Name_Abbott, Mrs. Stanton (Rosa Hunt)","Name_Abelson, Mr. Samuel","Name_Abelson, Mrs. Samuel (Hannah Wizosky)","Name_Adahl, Mr. Mauritz Nils Martin","Name_Adams, Mr. John","Name_Ahlin, Mrs. Johan (Johanna Persdotter Larsson)","Name_Aks, Mrs. Sam (Leah Rosen)","Name_Albimona, Mr. Nassef Cassem",...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [36]:
train_df_3 = pd.get_dummies(train_df_1[["Pclass","Sex"]])
train_df_3.head()

Unnamed: 0,Pclass,Sex_female,Sex_male
0,3,0,1
1,1,1,0
2,3,1,0
3,1,1,0
4,3,0,1


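By default, `get_dummies` only encodes object (string) and category columns and passes numeric columns through unchanged, which is why `Pclass` (int) was left as is. To encode it as well, we convert `Pclass` to object.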
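
As an alternative to changing the dtype, `get_dummies` has a `columns` parameter that names the columns to encode regardless of their dtype. Here is a sketch on a toy frame. `dtype=int` is passed only to get 0/1 output as shown in this notebook, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with an integer-coded categorical column and a string column.
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2], "Sex": ["male", "female", "female", "male"]})

# By default only object/string/category columns are expanded; Pclass passes through.
default = pd.get_dummies(toy, dtype=int)
print(list(default.columns))    # ['Pclass', 'Sex_female', 'Sex_male']

# Naming the columns explicitly encodes Pclass too, without changing its dtype first.
explicit = pd.get_dummies(toy, columns=["Pclass", "Sex"], dtype=int)
print(list(explicit.columns))
# ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male']
```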

In [37]:
train_df_1["Pclass"] = train_df_1["Pclass"].apply(str)

In [38]:
train_df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    object 
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(6)
memory usage: 83.7+ KB


In [39]:
cat_features = train_df_1.select_dtypes(include="object").columns.values.tolist()
cat_features

['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [40]:
train_df_4 = pd.get_dummies(train_df_1[cat_features])

In [41]:
train_df_4.head()

Unnamed: 0,Pclass_1,Pclass_2,Pclass_3,"Name_Abbing, Mr. Anthony","Name_Abbott, Mr. Rossmore Edward","Name_Abbott, Mrs. Stanton (Rosa Hunt)","Name_Abelson, Mr. Samuel","Name_Abelson, Mrs. Samuel (Hannah Wizosky)","Name_Adahl, Mr. Mauritz Nils Martin","Name_Adams, Mr. John",...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
