# Encoding
- Converting a categorical variable to a numerical variable is called encoding

### Advantage

- Straightforward to implement
- Does not require hours of variable exploration
- Does not massively expand the feature space, i.e. the number of columns added to the dataset

### Disadvantage

- Does not add any information that may make the variable more predictive
- Does not keep the information of the ignored labels


### Type of encoding

- Nominal Encoding
    - for nominal categorical variables like Gender, State etc.
- Ordinal Encoding
    - for ordinal categorical variables like Education
    - We need to care about rank
    - an example is the education of a person: BE, BCom, PhD etc.
    - We can rank PhD on top, which can help in predicting salary
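The ranking idea above can be sketched with a plain dictionary and `Series.map`. The names and the rank values below are made up for illustration; in particular, the relative order of BCom and BE is only an assumption.

```python
import pandas as pd

# Hypothetical toy data, not taken from any dataset used in this notebook
people = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Education": ["BCom", "PhD", "BE", "PhD"],
})

# Assumed ranking: a higher number means a higher rank (PhD on top)
education_rank = {"BCom": 1, "BE": 2, "PhD": 3}

# Replace each label by its rank
people["Education_rank"] = people["Education"].map(education_rank)
print(people)
```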
    
### Major Nominal encoding techniques
- One hot encoding
- One hot encoding with many categorical variables
- Mean encoding
- Count (or Frequency) encoding
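Mean encoding is listed here but not demonstrated in this notebook. As a rough sketch, each label is replaced by the mean of the target for that label. The `Bought` target column and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'Bought' is an assumed binary target
toy = pd.DataFrame({
    "State": ["New York", "London", "Germany", "New York", "London", "New York"],
    "Bought": [1, 0, 1, 1, 1, 0],
})

# Mean of the target for each label
state_means = toy.groupby("State")["Bought"].mean()

# Replace each label by the mean target value of its group
toy["State_mean_enc"] = toy["State"].map(state_means)
print(toy)
```

In practice the means are usually computed on the training data only, so that the target of the test rows does not leak into the feature.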

### Major Ordinal encoding techniques
- Label Encoding
- Target guided ordinal encoding
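Target guided ordinal encoding is not demonstrated below either. A minimal sketch, assuming a made-up binary target `Bought`: the labels are ordered by their mean target value and then replaced by their position in that order.

```python
import pandas as pd

# Hypothetical toy data: 'Bought' is an assumed binary target
visits = pd.DataFrame({
    "State": ["New York", "London", "Germany", "New York", "London", "New York"],
    "Bought": [1, 0, 1, 1, 1, 0],
})

# Labels sorted from lowest to highest mean target
ordered_labels = visits.groupby("State")["Bought"].mean().sort_values().index

# Assign 0, 1, 2, ... following that order
rank_map = {label: rank for rank, label in enumerate(ordered_labels)}

visits["State_ordinal"] = visits["State"].map(rank_map)
print(rank_map)
```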

## Let's demonstrate - One hot encoding

![image.png](attachment:image.png)

In [16]:
import pandas as pd

# Small example DataFrame with one categorical column, 'State'
data = {'Name': ['Charls','Bruce','Diana','Wanda'],
        'State': ['New York','London','Germany','New York']
        }

df = pd.DataFrame(data, columns=['Name', 'State'])

print(df)

     Name     State
0  Charls  New York
1   Bruce    London
2   Diana   Germany
3   Wanda  New York


In [17]:
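# drop_first=True drops the first label (Germany); dtype=int keeps 0/1 output
# (newer pandas versions return booleans by default)
encoded_df = pd.get_dummies(df['State'], drop_first=True, dtype=int)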
encoded_df

|   | London | New York |
|---|--------|----------|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 1 |


In [18]:
df = df.drop('State', axis=1)

In [19]:
df.head()

|   | Name |
|---|------|
| 0 | Charls |
| 1 | Bruce |
| 2 | Diana |
| 3 | Wanda |


In [21]:
encoded_df.head()

|   | London | New York |
|---|--------|----------|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 1 |


In [22]:
df = pd.concat([df, encoded_df], axis=1)
df

|   | Name | London | New York |
|---|------|--------|----------|
| 0 | Charls | 0 | 1 |
| 1 | Bruce | 1 | 0 |
| 2 | Diana | 0 | 0 |
| 3 | Wanda | 0 | 1 |
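Note that Germany has no column of its own in the result. A small sketch of why, using the same four states: `get_dummies` orders the labels alphabetically, `drop_first=True` removes the first one, and that label is then represented by a row of all zeros.

```python
import pandas as pd

states = pd.Series(["New York", "London", "Germany", "New York"], name="State")

# One column per label
full = pd.get_dummies(states, dtype=int)
# Same, but the first label (alphabetically) is dropped
reduced = pd.get_dummies(states, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# Rows that are all zeros in `reduced` belong to the dropped label
dropped_rows = reduced.sum(axis=1) == 0
print(states[dropped_rows].tolist())
```

Dropping one column loses no information, because the dropped label can always be recovered from the remaining columns.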


## Let's demonstrate - One hot encoding with many categorical variables

In [None]:
import pandas as pd
import numpy as np

# Let's use the Mercedes-Benz dataset, and use only categorical variables

df = pd.read_csv("./dataset/mercedesbenz.csv", usecols = ['X1','X2','X3','X4','X5','X6'])

In [None]:
df.head()

In [None]:
# Let's check how many distinct values each column has

total_categorical_values = 0
for col in df.columns:
    n_distinct = len(df[col].unique())
    print("{} column has {} distinct values.".format(col, n_distinct))
    total_categorical_values += n_distinct

print("So total new columns are {}".format(total_categorical_values))

# Meaning: if we do one hot encoding here, column X1 alone will create 27 more columns! That is a problem.
    

In [None]:
# Let's see how many columns will get created with one hot encoding on these variables

pd.get_dummies(df, drop_first=True).shape

So, for 6 categorical columns, we will end up with 177 new columns. This many features can hurt model accuracy.
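The column count can be predicted without running the encoding. The sketch below uses a made-up two-column frame: plain one hot encoding gives one column per distinct label, while `drop_first=True` gives one fewer per original column.

```python
import pandas as pd

# Hypothetical toy data: column A has 3 labels, column B has 2
toy = pd.DataFrame({
    "A": ["a", "b", "c", "a"],
    "B": ["x", "y", "x", "x"],
})

n_labels = toy.nunique()                      # distinct labels per column
expected_full = int(n_labels.sum())           # one column per label
expected_reduced = int((n_labels - 1).sum())  # drop_first removes one per column

full_cols = pd.get_dummies(toy).shape[1]
reduced_cols = pd.get_dummies(toy, drop_first=True).shape[1]

print(expected_full, full_cols)
print(expected_reduced, reduced_cols)
```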

What can we do now?

The following paper is from the 2009 KDD Cup, where the authors handle this problem by using only the top 10 most frequent labels.
http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf 

Here we are going to do the same. The cutoff does not have to be exactly 10; it could be 10, 12, 13, etc.

In [None]:
# Let's look at the 20 most frequent categories in column X2

df.X2.value_counts().sort_values(ascending=False).head(20)

In [None]:
# Let's take the top 10 labels of column X2 as a list.
# value_counts() returns a Series whose index holds the labels,
# so the labels are read from .index

col_X2_top10 = list(df.X2.value_counts().sort_values(ascending=False).head(10).index)

col_X2_top10

In [None]:
## [IMP] Iterate over the top 10 labels of column X2: for each label, put 1 where X2 equals that label, else 0.
# (row 2, column n) is 1, because X2 is n in row 2
# (row 2, column s) is 0, because X2 is not s in row 2

for label in col_X2_top10:
    df[label] = np.where(df['X2'] == label, 1,0)

In [None]:
df[['X2']+col_X2_top10].head(40)

In [None]:
# We can drop column X2 now, as we have done one hot encoding on it.
# drop() returns a new DataFrame, so assign the result back.

df = df.drop('X2', axis=1)

df.head()

In [None]:
## So, we have done one hot encoding for column X2.

# Now, let's start fresh and do this for all categorical columns

df = pd.read_csv("./dataset/mercedesbenz.csv", usecols = ['X1','X2','X3','X4','X5','X6'])
df.head()

In [None]:
df = pd.read_csv("./dataset/mercedesbenz.csv", usecols = ['X1','X2','X3','X4','X5','X6'])

# Return the 10 most frequently occurring values in a column
def get_top10_labels(df, col):
    return list(df[col].value_counts().sort_values(ascending=False).head(10).index)

# Apply one hot encoding (top 10 labels only) to the given columns, then drop the originals
def apply_one_hot_coding(df, cols):
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    for col in cols:
        col_top10_labels = get_top10_labels(df, col)

        for label in col_top10_labels:
            df[col + "_" + label] = np.where(df[col] == label, 1, 0)

        df = df.drop(col, axis=1)
    return df




df = apply_one_hot_coding(df, ['X1','X2','X3','X4','X5','X6'])

df.head()
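If the Mercedes-Benz CSV is not at hand, the same idea can be tried on a tiny made-up column. This is only a sketch: `one_hot_top_k` is a hypothetical helper, not part of pandas, and `k` plays the role of the "top 10" cutoff.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(frame, col, k):
    """Return a copy of `frame` where `col` is replaced by 0/1
    indicator columns for its k most frequent labels."""
    out = frame.copy()
    top_labels = out[col].value_counts().head(k).index
    for label in top_labels:
        out[f"{col}_{label}"] = np.where(out[col] == label, 1, 0)
    return out.drop(columns=col)

# Hypothetical toy data: 'a' occurs 3 times, 'b' twice, 'c' once
toy = pd.DataFrame({"X": ["a", "a", "a", "b", "b", "c"]})
encoded = one_hot_top_k(toy, "X", k=2)
print(encoded)
```

The row holding the rare label `c` ends up with zeros in every indicator column, which is how all labels outside the top k are represented.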



## Let's demonstrate - Count (or Frequency) encoding

As the name suggests, in this technique the column values are replaced by their number of occurrences.

### Advantage
- It is very simple to implement
- Doesn't increase the feature dimensional space

### Disadvantage
- If two or more labels occur the same number of times, they get the same encoded value and can no longer be told apart, so valuable information may be lost
- Adds arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power
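The first disadvantage is easy to see on a made-up example. In the sketch below, `red` and `blue` both occur twice, so after count encoding they share the value 2.

```python
import pandas as pd

# Hypothetical toy data: 'red' and 'blue' have the same count
colours = pd.Series(["red", "red", "blue", "blue", "green"])

counts = colours.value_counts().to_dict()
encoded = colours.map(counts)

print(counts)
print(encoded.tolist())
```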

Refer here for more info: https://www.kaggle.com/general/16927


Let's implement it.

In [47]:
df = pd.read_csv("./dataset/mercedesbenz.csv", usecols=['X1','X2'])

In [48]:
df.head()

|   | X1 | X2 |
|---|----|----|
| 0 | v | at |
| 1 | t | av |
| 2 | w | n |
| 3 | t | n |
| 4 | v | n |


In [49]:
df_col_x2_label_frequency = df['X2'].value_counts().to_dict()
df_col_x2_label_frequency

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'k': 25,
 'i': 25,
 'b': 21,
 'ao': 20,
 'ag': 19,
 'z': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'ap': 11,
 'y': 11,
 'x': 10,
 'aw': 8,
 'at': 6,
 'h': 6,
 'al': 5,
 'q': 5,
 'an': 5,
 'av': 4,
 'p': 4,
 'ah': 4,
 'au': 3,
 'l': 1,
 'am': 1,
 'j': 1,
 'aa': 1,
 'c': 1,
 'o': 1,
 'ar': 1,
 'af': 1}

In [50]:
df.head()

|   | X1 | X2 |
|---|----|----|
| 0 | v | at |
| 1 | t | av |
| 2 | w | n |
| 3 | t | n |
| 4 | v | n |


In [51]:
# Let's replace the X2 labels in df with their counts

df.X2 = df.X2.map(df_col_x2_label_frequency)

In [53]:
df

|      | X1 | X2 |
|------|----|----|
| 0 | v | 6 |
| 1 | t | 4 |
| 2 | w | 137 |
| 3 | t | 137 |
| 4 | v | 137 |
| ... | ... | ... |
| 4204 | s | 1659 |
| 4205 | o | 29 |
| 4206 | v | 153 |
| 4207 | r | 81 |
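The cells above use raw counts. For the frequency variant, a minimal sketch on a made-up column replaces each label by its share of the rows instead, using `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy column, named like the one above for readability
labels = pd.Series(["n", "n", "at", "av"], name="X2")

# Share of rows per label instead of the raw count
freq = labels.value_counts(normalize=True)
encoded = labels.map(freq)

print(encoded.tolist())
```

Frequencies stay between 0 and 1 regardless of the dataset size, which can be convenient when the same encoding is compared across datasets of different lengths.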


## Let's demonstrate - Label Encoding or Ordinal number encoding

##### Ordinal Categorical variable

Ordinal data is a categorical, statistical data type in which the variables have natural, ordered categories and the distances between the categories are not known.
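One way to express such an order in pandas is an ordered `Categorical`. The sketch below uses made-up size labels; the order given in `size_order` is an assumption supplied by hand, since pandas cannot infer it from the strings.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# Assumed order, from lowest to highest
size_order = ["small", "medium", "large"]

ordered = pd.Categorical(sizes, categories=size_order, ordered=True)

# Integer codes follow the position of each label in size_order
codes = ordered.codes
print(codes.tolist())
```

The codes only record the order. The gap between 0 and 1 says nothing about how far apart `small` and `medium` really are.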

![image.png](attachment:image.png)



In [100]:
import pandas as pd
import datetime as dt

In [101]:
today = dt.datetime.today()

# Let's create a few dates: today and the 19 days before it
date_list = [today - dt.timedelta(days=x) for x in range(0,20)]

type(date_list)

list

In [102]:
df_dates = pd.DataFrame(date_list)

In [103]:
df_dates.columns = ['day']

In [104]:
df_dates

|   | day |
|---|-----|
| 0 | 2020-08-15 21:51:22.822036 |
| 1 | 2020-08-14 21:51:22.822036 |
| 2 | 2020-08-13 21:51:22.822036 |
| 3 | 2020-08-12 21:51:22.822036 |
| 4 | 2020-08-11 21:51:22.822036 |
| 5 | 2020-08-10 21:51:22.822036 |
| 6 | 2020-08-09 21:51:22.822036 |
| 7 | 2020-08-08 21:51:22.822036 |
| 8 | 2020-08-07 21:51:22.822036 |
| 9 | 2020-08-06 21:51:22.822036 |


In [105]:
df_dates['day_of_week'] = df_dates['day'].dt.day_name()

df_dates

|   | day | day_of_week |
|---|-----|-------------|
| 0 | 2020-08-15 21:51:22.822036 | Saturday |
| 1 | 2020-08-14 21:51:22.822036 | Friday |
| 2 | 2020-08-13 21:51:22.822036 | Thursday |
| 3 | 2020-08-12 21:51:22.822036 | Wednesday |
| 4 | 2020-08-11 21:51:22.822036 | Tuesday |
| 5 | 2020-08-10 21:51:22.822036 | Monday |
| 6 | 2020-08-09 21:51:22.822036 | Sunday |
| 7 | 2020-08-08 21:51:22.822036 | Saturday |
| 8 | 2020-08-07 21:51:22.822036 | Friday |
| 9 | 2020-08-06 21:51:22.822036 | Thursday |


In [106]:
day_map = {
    'Monday':1,
    'Tuesday':2,
    'Wednesday':3,
    'Thursday':4,
    'Friday':5,
    'Saturday':6,
    'Sunday':7
}

In [107]:
df_dates['day_ordinal_numbers'] = df_dates.day_of_week.map(day_map)

In [108]:
df_dates

|   | day | day_of_week | day_ordinal_numbers |
|---|-----|-------------|---------------------|
| 0 | 2020-08-15 21:51:22.822036 | Saturday | 6 |
| 1 | 2020-08-14 21:51:22.822036 | Friday | 5 |
| 2 | 2020-08-13 21:51:22.822036 | Thursday | 4 |
| 3 | 2020-08-12 21:51:22.822036 | Wednesday | 3 |
| 4 | 2020-08-11 21:51:22.822036 | Tuesday | 2 |
| 5 | 2020-08-10 21:51:22.822036 | Monday | 1 |
| 6 | 2020-08-09 21:51:22.822036 | Sunday | 7 |
| 7 | 2020-08-08 21:51:22.822036 | Saturday | 6 |
| 8 | 2020-08-07 21:51:22.822036 | Friday | 5 |
| 9 | 2020-08-06 21:51:22.822036 | Thursday | 4 |
