## Feature Encoding in Machine Learning

YouTube explanations:

- Binary Categorical Features Encoding : https://youtu.be/Guis2MvnJfU
- Ordinal Categorical Features Encoding : https://youtu.be/XgAanLBmips
- Nominal Categorical Features Encoding : https://youtu.be/xPhwsbbK2eo
- When to use One-Hot , Label and Ordinal Encoding in Machine Learning : https://youtu.be/vt1rsKy2vSY
- Sklearn OneHotEncoder vs pd.get_dummies : https://youtu.be/tFMD56ypgPY
- Dummy Variable Trap : https://youtu.be/bUEsTEzrpy8
- Frequency Encoding : https://youtu.be/NvaNhPX6TCQ

Most machine learning algorithms cannot work with categorical variables directly; the categories must first be converted to numbers.
Even if we found a way to work with the raw category strings, the model would become tied to the language of the labels. For example, in an animal classification task with the labels {‘rat’, ‘dog’, ‘ant’}, the model would learn to predict labels only in English, which places a linguistic restriction on where the model can be applied.

Because most machine learning models accept only numerical variables, preprocessing the categorical variables is a necessary step. We need to convert them to numbers so that the model can extract useful information from them.

We can generally divide categorical variables (features) into 3 types:

 - Binary: exactly two possible values.
 
        (Yes, No), (True, False)
        
 - Ordinal: groups with a specific order.

         economic status ("low income", "middle income", "high income"),
         
         education level ("high school", "BS", "MS", "PhD"),
         
         income level ("less than 50K", "50K-100K", "over 100K"),
         
         satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like")
        
 - Nominal: unordered groups.
 
        (cat, dog, tiger), (pizza, burger, coke)
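
The difference between ordinal and nominal features can be made concrete with pandas' `Categorical` type. The following is a minimal sketch with made-up values (they are not read from the dataset used below): an ordered categorical knows which level is larger, while an unordered one does not.

```python
import pandas as pd

# Ordinal: the order of the levels is declared explicitly (illustrative values)
education = pd.Categorical(
    ["ms", "bs", "phd", "bs"],
    categories=["bs", "ms", "phd"],
    ordered=True,
)
print(education.codes)   # integer position of each value in `categories`
print(education.max())   # meaningful only because the categorical is ordered

# Nominal: same container, but no order is declared
city = pd.Categorical(["A", "C", "B"])
print(city.ordered)
```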

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# load the dataset
df = pd.read_excel("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Feature_Engineering/feature_encoding.xlsx")

In [3]:
df

Unnamed: 0,CUSTID,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,Y,A,extremely like,phd,200,Premium
1,2,N,A,like,ms,100,NonPremium
2,3,Y,C,neutral,bs,130,NonPremium
3,4,Y,B,extremely dislike,bs,120,NonPremium
4,5,Y,D,extremely like,ms,160,Premium
5,6,N,C,like,ms,170,NonPremium
6,7,Y,B,dislike,bs,130,NonPremium
7,8,N,A,neutral,ms,127,NonPremium
8,9,Y,C,like,phd,157,NonPremium
9,10,N,D,dislike,ms,182,Premium


In [4]:
df.drop("CUSTID",axis=1,inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PAYMENT_MODE_CARD  26 non-null     object
 1   CITY               26 non-null     object
 2   RATING             26 non-null     object
 3   EDUCATION          26 non-null     object
 4   PURCHAGE_AMOUNT    26 non-null     int64 
 5   CUSTOMER_TYPE      20 non-null     object
dtypes: int64(1), object(5)
memory usage: 1.3+ KB


In [6]:
df_train = df.loc[0:19]
df_test = df.loc[20:]

In [7]:
df_test

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
20,Y,B,neutral,ms,170,
21,Y,A,dislike,phd,175,
22,Y,B,like,bs,184,
23,Y,B,extremely like,ms,188,
24,N,A,neutral,ms,193,
25,N,A,extremely dislike,bs,173,


In [8]:
df_train

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,Y,A,extremely like,phd,200,Premium
1,N,A,like,ms,100,NonPremium
2,Y,C,neutral,bs,130,NonPremium
3,Y,B,extremely dislike,bs,120,NonPremium
4,Y,D,extremely like,ms,160,Premium
5,N,C,like,ms,170,NonPremium
6,Y,B,dislike,bs,130,NonPremium
7,N,A,neutral,ms,127,NonPremium
8,Y,C,like,phd,157,NonPremium
9,N,D,dislike,ms,182,Premium


In [9]:
# Let's identify the binary, ordinal and nominal columns in the dataset.
# Binary  : "PAYMENT_MODE_CARD", "CUSTOMER_TYPE"
# Ordinal : "RATING", "EDUCATION"
# Nominal : "CITY"

## Encoding of Binary Features

- Binary features have only two (usable) values.
- We can always use a simple mapping on binary features, for example with `replace`, `apply`, `map`, or any similar method.
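
As a small illustration of the mapping idea, here is a sketch that uses `Series.map` with a dictionary. The toy series is hypothetical and only mimics the Y/N values of `PAYMENT_MODE_CARD`; any value missing from the dictionary (including `None`) becomes `NaN`.

```python
import pandas as pd

# Hypothetical toy column with the same Y/N coding as PAYMENT_MODE_CARD
payment = pd.Series(["Y", "N", "Y", None])

# Values not present in the dictionary are mapped to NaN
payment_encoded = payment.map({"Y": 1, "N": 0})
print(payment_encoded)
```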

In [10]:
df_train.head()

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,Y,A,extremely like,phd,200,Premium
1,N,A,like,ms,100,NonPremium
2,Y,C,neutral,bs,130,NonPremium
3,Y,B,extremely dislike,bs,120,NonPremium
4,Y,D,extremely like,ms,160,Premium


In [11]:
# Running the assignments below directly on df_train (a slice of df) makes pandas
# raise a SettingWithCopyWarning: the changes may not propagate back to the
# original data (df). The warning could be silenced with
#pd.options.mode.chained_assignment = None
# but taking an explicit copy of df_train first is the cleaner fix.
#df_train["PAYMENT_MODE_CARD"] = df_train["PAYMENT_MODE_CARD"].apply(lambda x : 1 if x =='Y' 
#                                                                    else (0 if x == 'N' else None))
#df_train["CUSTOMER_TYPE"] = df_train["CUSTOMER_TYPE"].apply(lambda x : 1 if x == "Premium" 
#                                                            else (0 if x == "NonPremium" else None))

In [12]:
df_train = df_train.copy()

In [13]:
df_train["PAYMENT_MODE_CARD"] = df_train["PAYMENT_MODE_CARD"].apply(lambda x : 1 if x =='Y' 
                                                                    else (0 if x == 'N' else None))
df_train["CUSTOMER_TYPE"] = df_train["CUSTOMER_TYPE"].apply(lambda x : 1 if x == "Premium" 
                                                            else (0 if x == "NonPremium" else None))

In [14]:
df_train

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,extremely like,phd,200,1
1,0,A,like,ms,100,0
2,1,C,neutral,bs,130,0
3,1,B,extremely dislike,bs,120,0
4,1,D,extremely like,ms,160,1
5,0,C,like,ms,170,0
6,1,B,dislike,bs,130,0
7,0,A,neutral,ms,127,0
8,1,C,like,phd,157,0
9,0,D,dislike,ms,182,1


## Feature Encoding of Ordinal Features

 - We use this encoding technique when the categorical feature is ordinal. In this case, retaining the order is important, so the encoding should reflect the sequence.
 - Each label is converted into an integer value.

### OrdinalEncoder- Sklearn

In this technique, each label is assigned a unique integer based on alphabetical ordering.

In [15]:
df_train.head(10)

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,extremely like,phd,200,1
1,0,A,like,ms,100,0
2,1,C,neutral,bs,130,0
3,1,B,extremely dislike,bs,120,0
4,1,D,extremely like,ms,160,1
5,0,C,like,ms,170,0
6,1,B,dislike,bs,130,0
7,0,A,neutral,ms,127,0
8,1,C,like,phd,157,0
9,0,D,dislike,ms,182,1


In [16]:
df_train.RATING.unique()

array(['extremely like', 'like', 'neutral', 'extremely dislike',
       'dislike'], dtype=object)

In [17]:
from sklearn.preprocessing import OrdinalEncoder

In [18]:
oenc_sk = OrdinalEncoder()
oenc_sk.fit_transform(df_train.RATING.values.reshape(-1,1)) # reshape to 2D; a 1D input raises "Expected 2D array, got 1D array instead"
#oenc_sk.fit_transform(df_train.RATING)  # fails because the input is 1D

array([[2.],
       [3.],
       [4.],
       [1.],
       [2.],
       [3.],
       [0.],
       [4.],
       [3.],
       [0.],
       [4.],
       [3.],
       [4.],
       [2.],
       [0.],
       [1.],
       [3.],
       [2.],
       [4.],
       [0.]])

In [19]:
oenc_sk.categories_

[array(['dislike', 'extremely dislike', 'extremely like', 'like',
        'neutral'], dtype=object)]

Each value is encoded by its position in this (alphabetically sorted) array.

In [20]:
oenc_feat = OrdinalEncoder()
oenc_feat.fit_transform(df_train[["RATING",'EDUCATION']])

array([[2., 2.],
       [3., 1.],
       [4., 0.],
       [1., 0.],
       [2., 1.],
       [3., 1.],
       [0., 0.],
       [4., 1.],
       [3., 2.],
       [0., 1.],
       [4., 0.],
       [3., 2.],
       [4., 1.],
       [2., 1.],
       [0., 0.],
       [1., 0.],
       [3., 1.],
       [2., 2.],
       [4., 0.],
       [0., 0.]])

In [21]:
oenc_feat.categories_

[array(['dislike', 'extremely dislike', 'extremely like', 'like',
        'neutral'], dtype=object),
 array(['bs', 'ms', 'phd'], dtype=object)]


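**Disadvantage** 

By default, the integer codes do not follow the order you have in mind: internally the `fit` method collects the unique values in sorted order (in the manner of `numpy.unique`), so the categories end up sorted alphabetically rather than by order of appearance or by meaning. Here, for example, 'dislike' gets 0 and 'neutral' gets 4.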
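
One way around the alphabetical default, besides a manual dictionary, is the `categories` parameter of sklearn's `OrdinalEncoder`, which takes one explicit list of levels per column. The sketch below uses a small illustrative dataframe rather than `df_train`, and the encoder then assigns 0 to the first listed level, 1 to the second, and so on.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data; the level names mirror the RATING column
ratings = pd.DataFrame(
    {"RATING": ["extremely like", "like", "neutral", "extremely dislike", "dislike"]}
)

# The intended order, from lowest to highest
rating_order = ["extremely dislike", "dislike", "neutral", "like", "extremely like"]

# One list of categories per encoded column
enc = OrdinalEncoder(categories=[rating_order])
codes = enc.fit_transform(ratings)
print(codes.ravel())   # [4. 3. 2. 0. 1.]
```

Note that the codes start at 0 here, whereas a hand-written dictionary can start at any value.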

In [23]:
# We can impose the intended order ourselves by mapping with a dictionary
df_train_sk_ord = df_train.copy()
# Create the dictionary: rating label -> ordinal code
dic_rating = {"extremely dislike":1,"dislike":2,"neutral":3,"like":4,"extremely like":5}
df_train_sk_ord["ENC_RATING"] = df_train.RATING.map(dic_rating)
df_train_sk_ord = df_train_sk_ord.drop("RATING",axis=1)
df_train_sk_ord

Unnamed: 0,PAYMENT_MODE_CARD,CITY,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE,ENC_RATING
0,1,A,phd,200,1,5
1,0,A,ms,100,0,4
2,1,C,bs,130,0,3
3,1,B,bs,120,0,1
4,1,D,ms,160,1,5
5,0,C,ms,170,0,4
6,1,B,bs,130,0,2
7,0,A,ms,127,0,3
8,1,C,phd,157,0,4
9,0,D,ms,182,1,2


### LabelEncoder- Sklearn

 - In this technique, each label is assigned a unique integer based on alphabetical ordering.
 - Encode target labels with value between 0 and n_classes-1

In [24]:
from sklearn.preprocessing import LabelEncoder
lenc = LabelEncoder()

In [25]:
lenc.fit_transform(df_train["RATING"])

array([2, 3, 4, 1, 2, 3, 0, 4, 3, 0, 4, 3, 4, 2, 0, 1, 3, 2, 4, 0])

In [26]:
lenc.classes_

array(['dislike', 'extremely dislike', 'extremely like', 'like',
       'neutral'], dtype=object)

In [27]:
# LabelEncoder accepts only a single 1D column, which is why it is meant for the target variable.
# Uncommenting the next line raises an error because two columns are passed:
#lenc.fit_transform(df_train[["RATING","EDUCATION"]])

**Disadvantage**

The codes do not follow the order you have in mind: the classes are sorted alphabetically, not by order of appearance or by meaning.


**Comparison of sklearn.LabelEncoder and sklearn.OrdinalEncoder**

 - OrdinalEncoder is for converting features, while LabelEncoder is for converting the target variable.
 - OrdinalEncoder works on 2D data of shape (n_samples, n_features); LabelEncoder works on 1D data of shape (n_samples,).
 - Another difference between the encoders is the name of their learned parameter:
 
        LabelEncoder learns classes_        
        OrdinalEncoder learns categories_
 - By default both sort the categories alphabetically rather than by meaning, so neither can be used as-is to encode ordinal categorical variables. (OrdinalEncoder does accept a `categories` parameter for supplying an explicit order; LabelEncoder has no such option.)
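
The points in this comparison can be checked side by side. This is a minimal sketch on made-up values (not `df_train`): `LabelEncoder` takes a 1D target and exposes `classes_`, while `OrdinalEncoder` takes a 2D feature array and exposes `categories_`, one array per column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# 1D target (illustrative values)
y = ["Premium", "NonPremium", "NonPremium"]
label_enc = LabelEncoder()
y_encoded = label_enc.fit_transform(y)
print(y_encoded, label_enc.classes_)

# 2D features: one column of ratings, one of education levels (illustrative values)
X = np.array(
    [["like", "ms"], ["dislike", "bs"], ["like", "phd"]],
    dtype=object,
)
ordinal_enc = OrdinalEncoder()
X_encoded = ordinal_enc.fit_transform(X)
print(X_encoded)
print(ordinal_enc.categories_)
```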

### category_encoders - OrdinalEncoder

- To overcome the default alphabetical ordering of sklearn's OrdinalEncoder and LabelEncoder, we can use the OrdinalEncoder from category_encoders. It takes a `mapping` when the object is created, where we specify the order we need.
- It encodes categorical features as ordinal, in one ordered feature.

In [28]:
# https://contrib.scikit-learn.org/category_encoders/
import category_encoders as ce

In [29]:
# Create object for OrdinalEncoding
encoder =  ce.OrdinalEncoder(cols =["RATING"],return_df=True,
                            mapping = [{'col':'RATING','mapping':{"extremely dislike":1,"dislike":2,
                                                                  "neutral":3,"like":4,"extremely like":5}}])

In [30]:
# fit and transform df_train
import warnings
warnings.filterwarnings("ignore")  # silence library warnings for cleaner notebook output
encoder.fit_transform(df_train)

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,5,phd,200,1
1,0,A,4,ms,100,0
2,1,C,3,bs,130,0
3,1,B,1,bs,120,0
4,1,D,5,ms,160,1
5,0,C,4,ms,170,0
6,1,B,2,bs,130,0
7,0,A,3,ms,127,0
8,1,C,4,phd,157,0
9,0,D,2,ms,182,1


In [31]:
df_train.head()

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,extremely like,phd,200,1
1,0,A,like,ms,100,0
2,1,C,neutral,bs,130,0
3,1,B,extremely dislike,bs,120,0
4,1,D,extremely like,ms,160,1


In [32]:
encoder_feat = ce.OrdinalEncoder(cols = ["RATING","EDUCATION"])

In [33]:
df_train_ord = df_train.copy()

In [34]:
df_train_ord[["RATING",'EDUCATION']] = encoder_feat.fit_transform(df_train[["RATING",'EDUCATION']])
df_train_ord

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,1,1,200,1
1,0,A,2,2,100,0
2,1,C,3,3,130,0
3,1,B,4,3,120,0
4,1,D,1,2,160,1
5,0,C,2,2,170,0
6,1,B,5,3,130,0
7,0,A,3,2,127,0
8,1,C,2,1,157,0
9,0,D,5,2,182,1


**Advantages and Disadvantage** 

- If we don't provide a mapping, the integers will not reflect the intended order. In the output above they follow the order of first appearance ('extremely like' -> 1, 'like' -> 2, ...; 'phd' -> 1, 'ms' -> 2, 'bs' -> 3), not the meaning of the categories.
- So to overcome the sequence problem, we have to provide the mapping when creating the object.

## Feature Encoding of Nominal Features

If we have a nominal feature and apply label encoding, the values are encoded on the basis of their alphabetical order. There is then a high chance that the model picks up a relationship between the resulting numbers during training, which is misleading because a nominal feature has no such order.

This is something that we do not want! So how can we overcome this obstacle? Here comes the concept of One-Hot Encoding.

We can encode the categorical variable values to numbers as below:

Case 1:
The ordinal categorical variable can be encoded as:
{‘Low’: 1, ‘Medium’: 2, ‘High’: 3}.

Case 2:
For the animal classification task the label variable, which is a nominal categorical variable can be encoded as:
{‘rat’: 1, ‘dog’: 2, ‘ant’: 3}.

Let’s discuss case 2 first:

There is one major issue with this: the labels in the animal classification problem should not be encoded as integers (as we have done above), since that enforces a spurious natural ordering ‘rat’ < ‘dog’ < ‘ant’. We understand that no such ordering exists and that the numbers 1, 2 and 3 carry no numerical meaning for these labels, but a machine learning model cannot know that. If we feed these numbers directly into a model, the cost/loss function is likely to be affected by their values. We need to express our understanding mathematically, and one-hot encoding is how we do it.

Case 1:

Suppose the variable is speed, an ordinal variable with the levels ‘low’, ‘medium’ and ‘high’. We may argue that the relation ‘low’ < ‘medium’ < ‘high’ makes sense and therefore using the labels 1, 2 and 3 should not be an issue. Unfortunately it is not so. Labels such as 100, 101 and 3000 in place of 1, 2 and 3 would preserve the same order, so there is nothing special about 1, 2 and 3. In other words, we do not know how much greater a ‘medium’ speed is than a ‘low’ speed, or how much smaller it is than a ‘high’ speed. The differences between these labels can affect the model we train. So we might want to one-hot encode the variable ‘speed’ as well.
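
The distance argument in both cases can be checked numerically. The sketch below is only an illustration, using the integer codes from case 2: with integer labels some pairs of animals are "further apart" than others, while with one-hot vectors every pair of different animals is equally far apart.

```python
import numpy as np

animals = ["rat", "dog", "ant"]

# Integer encoding from case 2: rat=1, dog=2, ant=3
int_codes = np.array([1.0, 2.0, 3.0])
int_dist = np.abs(int_codes[:, None] - int_codes[None, :])
print(int_dist)   # the rat-ant distance is twice the rat-dog distance

# One-hot encoding: one row per animal
one_hot = np.eye(len(animals))
oh_dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)
print(oh_dist)    # every pair of different animals is equally far apart
```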

By now we should understand what categorical variables are and why we might want to one-hot encode them.


### One-Hot Encoding

- Every unique value in the category is added as a feature and represented as a one-hot vector.
- Each category is mapped to a binary variable containing either 0 or 1. Here, 0 represents the absence and 1 the presence of that category.
- These newly created binary features are known as dummy variables.
- The number of dummy variables depends on the number of levels present in the categorical variable.


- Suppose we have a dataset with a feature "City" containing cities like "Delhi", "Mumbai", "Lucknow" and "Pune". Let's see how to apply one-hot encoding to this feature.

### category_encoders - OneHotEncoder

In [35]:
from category_encoders import OneHotEncoder

In [36]:
# Let's create the dataframe
data_df = pd.DataFrame({"City":["Delhi","Mumbai","Lucknow","Pune"]})
# pass the column list via `cols`; the first positional argument of this encoder is `verbose`
enc_datadf = OneHotEncoder(cols=["City"]
                           ,use_cat_names=True)
hot = enc_datadf.fit_transform(data_df)
print(hot)
pd.concat([data_df,hot],axis = 1)

   City_Delhi  City_Mumbai  City_Lucknow  City_Pune
0           1            0             0          0
1           0            1             0          0
2           0            0             1          0
3           0            0             0          1


Unnamed: 0,City,City_Delhi,City_Mumbai,City_Lucknow,City_Pune
0,Delhi,1,0,0,0
1,Mumbai,0,1,0,0
2,Lucknow,0,0,1,0
3,Pune,0,0,0,1


In [37]:
df_train

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,extremely like,phd,200,1
1,0,A,like,ms,100,0
2,1,C,neutral,bs,130,0
3,1,B,extremely dislike,bs,120,0
4,1,D,extremely like,ms,160,1
5,0,C,like,ms,170,0
6,1,B,dislike,bs,130,0
7,0,A,neutral,ms,127,0
8,1,C,like,phd,157,0
9,0,D,dislike,ms,182,1


In [38]:
enc_one = OneHotEncoder(cols="CITY",
    drop_invariant=False, 
    return_df=True,
    handle_missing='value',
    handle_unknown='return_nan',
    use_cat_names=True)

In [39]:
# Fit and Transform the data
enc_one.fit_transform(df_train)

Unnamed: 0,PAYMENT_MODE_CARD,CITY_A,CITY_C,CITY_B,CITY_D,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,1.0,0.0,0.0,0.0,extremely like,phd,200,1
1,0,1.0,0.0,0.0,0.0,like,ms,100,0
2,1,0.0,1.0,0.0,0.0,neutral,bs,130,0
3,1,0.0,0.0,1.0,0.0,extremely dislike,bs,120,0
4,1,0.0,0.0,0.0,1.0,extremely like,ms,160,1
5,0,0.0,1.0,0.0,0.0,like,ms,170,0
6,1,0.0,0.0,1.0,0.0,dislike,bs,130,0
7,0,1.0,0.0,0.0,0.0,neutral,ms,127,0
8,1,0.0,1.0,0.0,0.0,like,phd,157,0
9,0,0.0,0.0,0.0,1.0,dislike,ms,182,1


### Sklearn - OneHotEncoder

In [40]:
from sklearn.preprocessing import OneHotEncoder

In [41]:
# Create the one-hot encoder object and fit it. The same fitted object can later
# transform both the train and the test data.
sk_enc_one = OneHotEncoder()
sk_enc_one.fit(df_train_ord.CITY.values.reshape(-1,1))

OneHotEncoder()

In [42]:
# reshape the 1-D "CITY" array to 2-D (transform expects 2-D input), then transform it
sk_one_val = sk_enc_one.transform(df_train_ord.CITY.values.reshape(-1,1)).toarray()

In [43]:
sk_one_val

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])

In [44]:
len(df_train_ord.CITY.unique())

4

In [45]:
p = [i+1 for i in range(len(df_train_ord.CITY.unique()))]

In [46]:
p

[1, 2, 3, 4]

In [47]:
# Lets assign the column name to each one hot vector
df_one = pd.DataFrame(sk_one_val,columns=["CITY_"+str(int(i+1)) for i in range(len(df_train_ord.CITY.unique()))])

In [48]:
df_one

Unnamed: 0,CITY_1,CITY_2,CITY_3,CITY_4
0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,1.0
5,0.0,0.0,1.0,0.0
6,0.0,1.0,0.0,0.0
7,1.0,0.0,0.0,0.0
8,0.0,0.0,1.0,0.0
9,0.0,0.0,0.0,1.0


In [49]:
# Lets add these value back in our main dataframe
df_train_cp = df_train_ord.copy()

In [50]:
df_train_cp

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,1,1,200,1
1,0,A,2,2,100,0
2,1,C,3,3,130,0
3,1,B,4,3,120,0
4,1,D,1,2,160,1
5,0,C,2,2,170,0
6,1,B,5,3,130,0
7,0,A,3,2,127,0
8,1,C,2,1,157,0
9,0,D,5,2,182,1


In [51]:
df_train_hot = pd.concat([df_one,df_train_cp],axis=1)
df_train_hot

Unnamed: 0,CITY_1,CITY_2,CITY_3,CITY_4,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1.0,0.0,0.0,0.0,1,A,1,1,200,1
1,1.0,0.0,0.0,0.0,0,A,2,2,100,0
2,0.0,0.0,1.0,0.0,1,C,3,3,130,0
3,0.0,1.0,0.0,0.0,1,B,4,3,120,0
4,0.0,0.0,0.0,1.0,1,D,1,2,160,1
5,0.0,0.0,1.0,0.0,0,C,2,2,170,0
6,0.0,1.0,0.0,0.0,1,B,5,3,130,0
7,1.0,0.0,0.0,0.0,0,A,3,2,127,0
8,0.0,0.0,1.0,0.0,1,C,2,1,157,0
9,0.0,0.0,0.0,1.0,0,D,5,2,182,1


In [52]:
# dropping the "CITY" column
df_train_hot.drop("CITY",inplace=True,axis=1)

In [53]:
df_train_hot.head()

Unnamed: 0,CITY_1,CITY_2,CITY_3,CITY_4,PAYMENT_MODE_CARD,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1.0,0.0,0.0,0.0,1,1,1,200,1
1,1.0,0.0,0.0,0.0,0,2,2,100,0
2,0.0,0.0,1.0,0.0,1,3,3,130,0
3,0.0,1.0,0.0,0.0,1,4,3,120,0
4,0.0,0.0,0.0,1.0,1,1,2,160,1


### pandas.get_dummies

In [54]:
df_train_dmy = pd.concat([pd.get_dummies(prefix="CITY",data=df_train_cp["CITY"],
                                         drop_first=True),df_train_cp],axis=1)

In [55]:
df_train_dmy.drop("CITY",inplace=True,axis=1)
df_train_dmy

Unnamed: 0,CITY_B,CITY_C,CITY_D,PAYMENT_MODE_CARD,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,0,0,0,1,1,1,200,1
1,0,0,0,0,2,2,100,0
2,0,1,0,1,3,3,130,0
3,1,0,0,1,4,3,120,0
4,0,0,1,1,1,2,160,1
5,0,1,0,0,2,2,170,0
6,1,0,0,1,5,3,130,0
7,0,0,0,0,3,2,127,0
8,0,1,0,1,2,1,157,0
9,0,0,1,0,5,2,182,1


#### Drawbacks of One-Hot and Dummy Encoding

- If a feature has many categories, we need a similar number of dummy variables to encode it. For example, a column with 100 different values requires 100 new variables.
- If the dataset has several categorical features, we again end up with many binary features, one for each category of each feature.
- In both cases, these two encoding schemes introduce sparsity in the dataset, i.e. many columns containing mostly 0s and only a few 1s. In other words, they create many dummy features without adding much information.
- The large increase in the number of columns slows down training and can hurt overall performance, making the model computationally expensive.
- Further, these encodings are often not an optimal choice for tree-based models.
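
The size and sparsity effect is easy to reproduce. This sketch uses a synthetic column with 100 distinct values (the names `cat_0` ... `cat_99` are made up) and counts the resulting dummy columns and the share of cells that are 1.

```python
import pandas as pd

# Synthetic high-cardinality column: 100 rows, 100 distinct categories
high_card = pd.Series([f"cat_{i}" for i in range(100)], name="FEATURE")

dummies = pd.get_dummies(high_card)
print(dummies.shape)               # one new column per category

# Share of non-zero cells: each row has a single 1 among 100 columns
share_of_ones = dummies.to_numpy().mean()
print(share_of_ones)
```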


**When to use Label Encoding, Ordinal Encoding and One-Hot Encoding**

The answer generally depends on your dataset and the model you wish to apply. Still, a few points to note before choosing the encoding technique:

- We apply One-Hot Encoding when:
            
      The categorical feature is not ordinal (like the CITY feature)
      The number of categories is small, so one-hot encoding can be applied effectively

- We apply Label Encoding and Ordinal Encoding when:

      The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, high school)
      The number of categories is quite large, as one-hot encoding can lead to high memory consumption


**I prefer using sklearn.preprocessing.OneHotEncoder instead of pd.get_dummies**

- This is because sklearn.preprocessing.OneHotEncoder is an estimator object that remembers the categories it was fitted on. We can fit it on the training set and then use the same object to transform the test set.
- On the other hand, pd.get_dummies returns a dataframe with encodings based only on the values in the dataframe we pass to it. This might be good for a quick analysis, but for a longer model-building project, where you train on a training set and later evaluate on a test set, I would suggest using sklearn.preprocessing.OneHotEncoder.
- Scikit-learn also offers fit_transform, which combines fit and transform and saves a line or two of code.
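
The practical difference shows up as soon as the test set contains a different set of categories. The sketch below uses two tiny made-up frames; the value "E" is a hypothetical city that never occurs in training. `handle_unknown="ignore"` is one possible policy (unknown categories become all-zero rows), chosen here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"CITY": ["A", "B", "C", "D"]})
test = pd.DataFrame({"CITY": ["B", "A", "E"]})   # "E" was never seen in training

# get_dummies encodes each frame on its own, so the columns do not line up
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(train_dummies.columns.tolist())
print(test_dummies.columns.tolist())

# A fitted OneHotEncoder always produces the columns learned from the training set
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["CITY"]])
test_encoded = enc.transform(test[["CITY"]]).toarray()
print(test_encoded)   # 4 columns; the row for "E" is all zeros
```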


### Dummy Variable Trap in OneHotEncoding

In [56]:
df_train

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,extremely like,phd,200,1
1,0,A,like,ms,100,0
2,1,C,neutral,bs,130,0
3,1,B,extremely dislike,bs,120,0
4,1,D,extremely like,ms,160,1
5,0,C,like,ms,170,0
6,1,B,dislike,bs,130,0
7,0,A,neutral,ms,127,0
8,1,C,like,phd,157,0
9,0,D,dislike,ms,182,1


In [57]:
df_train.CITY.unique()

array(['A', 'C', 'B', 'D'], dtype=object)

In [58]:
df_train_hot.head(6)

Unnamed: 0,CITY_1,CITY_2,CITY_3,CITY_4,PAYMENT_MODE_CARD,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1.0,0.0,0.0,0.0,1,1,1,200,1
1,1.0,0.0,0.0,0.0,0,2,2,100,0
2,0.0,0.0,1.0,0.0,1,3,3,130,0
3,0.0,1.0,0.0,0.0,1,4,3,120,0
4,0.0,0.0,0.0,1.0,1,1,2,160,1
5,0.0,0.0,1.0,0.0,0,2,2,170,0


We can see from the above table that the "CITY" column no longer exists in the returned dataframe and has been replaced by new columns. We can also observe that ‘A’ has been mapped to a vector, A -> [1,0,0,0], and similarly B -> [0,1,0,0], C -> [0,0,1,0] and D -> [0,0,0,1]. Notice that each vector contains exactly one ‘1’ and is 4-dimensional, because "CITY" has four unique values. In general, the number of dimensions of the one-hot vectors equals the number of unique values the categorical column takes in the dataset. Here, a 1 in the first position means "A", a 1 in the second position means "B", and so forth.

From the encoded dataset in "df_train_hot", we can observe the following linear relationship among the dummy variables. It holds for every row:

row 1 --> b0 + CITY_1*b1 + CITY_2*b2 + CITY_3*b3 + CITY_4*b4 = 1    >>>>>>>>equation-1

row 2 --> b0 + CITY_1*b1 + CITY_2*b2 + CITY_3*b3 + CITY_4*b4 = 1

...

row 20 --> b0 + CITY_1*b1 + CITY_2*b2 + CITY_3*b3 + CITY_4*b4 = 1

With b0 = 0 and b1 = b2 = b3 = b4 = 1 this reduces to CITY_1 + CITY_2 + CITY_3 + CITY_4 = 1, because exactly one dummy equals 1 in each row. We can therefore express any one of the four dummy variables in terms of the other three. Taking each in turn on the LHS:

    CITY_1*b1  = 1 - b0 - CITY_2*b2 - CITY_3*b3 - CITY_4*b4 >>>>>>>>>equation-2
    CITY_2*b2  = 1 - b0 - CITY_1*b1 - CITY_3*b3 - CITY_4*b4 >>>>>>>>>equation-3
    CITY_3*b3  = 1 - b0 - CITY_1*b1 - CITY_2*b2 - CITY_4*b4 >>>>>>>>>equation-4
    CITY_4*b4  = 1 - b0 - CITY_1*b1 - CITY_2*b2 - CITY_3*b3 >>>>>>>>>equation-5
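
The linear dependence can also be seen in the rank of the design matrix. The following sketch uses illustrative city codes (0=A, 1=B, 2=C, 3=D; they are not read from the dataset): with an intercept and all four dummies there are five columns but only rank four, and dropping one dummy restores full column rank.

```python
import numpy as np

# Illustrative city codes (0=A, 1=B, 2=C, 3=D)
city_codes = np.array([0, 0, 2, 1, 3, 2, 1, 0, 2, 3])
dummies = np.eye(4)[city_codes]                      # one one-hot row per sample
intercept = np.ones((len(city_codes), 1))

X_all = np.hstack([intercept, dummies])              # intercept + 4 dummies
X_dropped = np.hstack([intercept, dummies[:, 1:]])   # intercept + 3 dummies

print(dummies.sum(axis=1))                           # every row sums to 1
print(X_all.shape[1], np.linalg.matrix_rank(X_all))
print(X_dropped.shape[1], np.linalg.matrix_rank(X_dropped))
```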

- *The vectors that we use to encode the categorical columns are called ‘Dummy Variables’. We intended to solve the problem of using categorical variables, but got trapped by the problem of multicollinearity. This is called the Dummy Variable Trap: the resulting features are highly correlated with (in fact linearly dependent on) each other.*
- Multicollinearity occurs when there is a dependency between the independent features. It is a serious issue in models like Linear Regression and Logistic Regression.
- So, in order to overcome the problem of multicollinearity, one of the dummy variables has to be dropped.

- Multicollinearity also poses some other problems in machine learning tasks. Let us say we train a logistic regression model on the dataset. We would expect our model to learn weights for the following equation:
 
$$
y =\frac{1}{1 + e^{-wX}}
$$


In particular, for the features we have in our dataset, logistic regression would learn the following weights:

w = (w_city1 , w_city2 , w_city3 , w_city4)

And the feature vector X is:

X = (CITY_1, CITY_2, CITY_3, CITY_4)

Clearly, it is the exponent wX in the denominator of the sigmoid function that determines the value of y and contains the trainable weights. This expression expands to:

wX = (w_city1 * CITY_1 + w_city2 * CITY_2 + w_city3 * CITY_3 + w_city4 * CITY_4)

Observations from the above:
    
- We can substitute CITY_1 in equation-1 with its value from equation-2. This means that (at least) one of the features we are working with is redundant. That feature could be any one of the four, since equation-2 could be written with any of them on the LHS. **So we are making our model learn an additional weight that is not really needed, which consumes computational power and time.** It also yields an optimisation objective that may not be well behaved and may be difficult to work with. **Too many independent variables may also lead to the Curse of Dimensionality. If multicollinearity comes along with that, things become worse.**

- Since one-hot encoding directly induces perfect multicollinearity, we drop one of the dummy columns from the encoded features. For example, we may choose to drop "CITY_2" in this case, but the choice is completely arbitrary.

#### How is the problem of multicollinearity introduced by one-hot encoding?

Let's check for multicollinearity with the **Variance Inflation Factor (VIF)**. A common rule of thumb:
   - VIF = 1: very little multicollinearity
   - VIF < 5: moderate multicollinearity
   - VIF > 5: extreme multicollinearity (this is what we have to avoid)


**What Is a Variance Inflation Factor (VIF)?**

- Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. 
- Mathematically, the VIF of an independent variable is the ratio of the variance of that variable's coefficient estimate in the full model to its variance in a model that includes only that single independent variable. Equivalently, VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared obtained by regressing variable i on all the other independent variables. This ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
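
To see the formula VIF_i = 1 / (1 - R_i^2) in action, here is a small NumPy-only sketch on synthetic data. The helper name `vif_values` and the three random columns are assumptions made for this illustration. Unlike the notebook's own function, this version adds an intercept to each auxiliary regression, so the numbers are not directly comparable with the notebook output.

```python
import numpy as np

def vif_values(X):
    """Return VIF_i = 1 / (1 - R_i^2) for every column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on the other columns, with an intercept
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                 # independent of a
c = a + 0.1 * rng.normal(size=200)       # nearly a copy of a

vifs = vif_values(np.column_stack([a, b, c]))
print(vifs)   # large for a and c, close to 1 for b
```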

In [59]:
# Let's create the VIF function
import statsmodels.regression.linear_model as sm
def check_vif(data):
    df_vif = pd.DataFrame(columns=["feature","VIF"])
    feature_name = data.columns
    for i in range(feature_name.shape[0]):
        # regress feature i (target) on all the remaining features (predictors)
        target = data[feature_name[i]]
        predictors = data[feature_name.drop([feature_name[i]])]
        # note: no intercept is added, so statsmodels reports the uncentered R-squared
        r_squared = sm.OLS(target,predictors).fit().rsquared
        vif = round(1/(1-r_squared),2)   # VIF = 1 / (1 - R^2)
        df_vif.loc[i] = [feature_name[i],vif]
    return df_vif.sort_values(by="VIF",axis=0,ascending=False,inplace=False)

In [60]:
df_train_hot.iloc[:,0:-1].head()

Unnamed: 0,CITY_1,CITY_2,CITY_3,CITY_4,PAYMENT_MODE_CARD,RATING,EDUCATION,PURCHAGE_AMOUNT
0,1.0,0.0,0.0,0.0,1,1,1,200
1,1.0,0.0,0.0,0.0,0,2,2,100
2,0.0,0.0,1.0,0.0,1,3,3,130
3,0.0,1.0,0.0,0.0,1,4,3,120
4,0.0,0.0,0.0,1.0,1,1,2,160


In [61]:
check_vif(df_train_hot.iloc[:,0:-1])

Unnamed: 0,feature,VIF
0,CITY_1,21.42
1,CITY_2,17.93
3,CITY_4,11.01
2,CITY_3,10.69
5,RATING,2.45
6,EDUCATION,2.41
4,PAYMENT_MODE_CARD,1.47
7,PURCHAGE_AMOUNT,1.17


From the output, we can see that the dummy variables which are created using one-hot encoding have VIF above 5. We have a multicollinearity problem.

In [62]:
df_train_dmy.iloc[:,0:-1].head()

Unnamed: 0,CITY_B,CITY_C,CITY_D,PAYMENT_MODE_CARD,RATING,EDUCATION,PURCHAGE_AMOUNT
0,0,0,0,1,1,1,200
1,0,0,0,0,2,2,100
2,0,1,0,1,3,3,130
3,1,0,0,1,4,3,120
4,0,0,1,1,1,2,160


In [63]:
# Now check the VIF after dropping one dummy column (get_dummies with drop_first=True)
check_vif(df_train_dmy.iloc[:,0:-1])

Unnamed: 0,feature,VIF
5,EDUCATION,19.57
4,RATING,12.79
6,PURCHAGE_AMOUNT,7.95
3,PAYMENT_MODE_CARD,3.66
0,CITY_B,2.73
1,CITY_C,1.8
2,CITY_D,1.75


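Earlier all the dummy variables had a VIF greater than 5 (above 10, in fact), and now all of them are below 3. Great! The VIF of the dummy variables has decreased, so dropping one dummy column removed the multicollinearity introduced by one-hot encoding. Note, however, that in this output the VIFs of EDUCATION, RATING and PURCHAGE_AMOUNT have risen above 5. This is probably an artifact of `check_vif` fitting the regressions without an intercept: with all four dummies present, their sum acts as a constant column, which disappears once a dummy is dropped. These values should be re-checked before building the model.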

### Frequency Encoding for Feature Encoding in Machine Learning

We can also encode considering the frequency distribution. This method can be effective at times for nominal features.

In [64]:
df_train

Unnamed: 0,PAYMENT_MODE_CARD,CITY,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE
0,1,A,extremely like,phd,200,1
1,0,A,like,ms,100,0
2,1,C,neutral,bs,130,0
3,1,B,extremely dislike,bs,120,0
4,1,D,extremely like,ms,160,1
5,0,C,like,ms,170,0
6,1,B,dislike,bs,130,0
7,0,A,neutral,ms,127,0
8,1,C,like,phd,157,0
9,0,D,dislike,ms,182,1


In [65]:
# Lets group the "CITY" variable by frequency
frq = df_train.groupby("CITY").size()

In [66]:
frq

CITY
A    7
B    6
C    4
D    3
dtype: int64

In [67]:
len(df_train)

20

In [68]:
frq_dis = df_train.groupby("CITY").size()/len(df_train)  # use df_train only, so the frequencies sum to 1 and no test rows leak in
frq_dis

CITY
A    0.35
B    0.30
C    0.20
D    0.15
dtype: float64

In [69]:
# Let's map frq_dis onto a copy of the dataframe df_train
df_train_freq = df_train.copy()

In [70]:
df_train_freq["CITY_FRQ_ENC"] = df_train_freq.CITY.map(frq_dis)

In [71]:
df_train_freq.drop("CITY",axis=1,inplace=True)

In [72]:
df_train_freq

Unnamed: 0,PAYMENT_MODE_CARD,RATING,EDUCATION,PURCHAGE_AMOUNT,CUSTOMER_TYPE,CITY_FRQ_ENC
0,1,extremely like,phd,200,1,0.35
1,0,like,ms,100,0,0.35
2,1,neutral,bs,130,0,0.2
3,1,extremely dislike,bs,120,0,0.3
4,1,extremely like,ms,160,1,0.15
5,0,like,ms,170,0,0.2
6,1,dislike,bs,130,0,0.3
7,0,neutral,ms,127,0,0.35
8,1,like,phd,157,0,0.2
9,0,dislike,ms,182,1,0.15
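
A frequency table learned on the training data can be reused on new data in the same way. The sketch below uses two small made-up series; "E" is a hypothetical category that is absent from training, and filling it with 0.0 is an assumption made for this example, not the only reasonable choice.

```python
import pandas as pd

# Illustrative training and test values
train_city = pd.Series(["A", "A", "B", "C"])
test_city = pd.Series(["A", "C", "E"])     # "E" does not occur in training

# Relative frequency of each category, learned from the training data only
city_freq = train_city.value_counts(normalize=True)

# Apply the training frequencies to the test data; unseen categories get 0.0
test_encoded = test_city.map(city_freq).fillna(0.0)
print(test_encoded.tolist())   # [0.5, 0.25, 0.0]
```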
