<a href="https://colab.research.google.com/github/abdullahhsamir/practicing-machine-learning/blob/main/Target_Encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Use Cases for Target Encoding
Target encoding is great for:
1. **High-cardinality features:** A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features, and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.

2. **Domain-motivated features:** You might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness.
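
To make the high-cardinality point concrete, here is a minimal sketch comparing the width of a one-hot encoding with a plain (unsmoothed) target encoding. The `zip_code` column, the random data, and the variable names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: a categorical feature with up to 500 distinct values.
toy = pd.DataFrame({
    "zip_code": rng.integers(0, 500, size=5_000).astype(str),
    "y": rng.integers(0, 2, size=5_000),
})

# One-hot encoding creates one column per distinct category.
one_hot = pd.get_dummies(toy["zip_code"])

# Plain target encoding replaces each category with the mean of y
# for that category, so it always produces a single column.
target_enc = toy.groupby("zip_code")["y"].transform("mean")

n_categories = toy["zip_code"].nunique()
print(one_hot.shape[1], "one-hot columns vs. 1 target-encoded column")
```

The one-hot width grows with the number of categories, while the target encoding stays a single numeric column regardless of cardinality.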




In [1]:
#source 1: https://maxhalford.github.io/blog/target-encoding/
#source 2: https://www.kaggle.com/ryanholbrook/target-encoding
import pandas as pd 
import numpy as np

In [2]:
df = pd.DataFrame({'col1':np.random.rand(10)*100,
                   'cat_0':['red']*5 + ['yellow']*5,
                   'cat_1':['blue']*9 + ['green']*1,
                   'y': [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]
                   })

In [3]:
df

,col1,cat_0,cat_1,y
0,54.317197,red,blue,1
1,37.48052,red,blue,0
2,44.335796,red,blue,1
3,98.531253,red,blue,1
4,83.473592,red,blue,1
5,14.595431,yellow,blue,1
6,3.866097,yellow,blue,0
7,75.259831,yellow,blue,0
8,78.363267,yellow,blue,0
9,7.571011,yellow,green,0


In [6]:
mean_0 = df.groupby(['cat_0'])['y'].transform('mean').to_dict()
mean_0

{0: 0.8,
 1: 0.8,
 2: 0.8,
 3: 0.8,
 4: 0.8,
 5: 0.2,
 6: 0.2,
 7: 0.2,
 8: 0.2,
 9: 0.2}

In [7]:
mean_1 = df.groupby(['cat_1'])['y'].transform('mean').to_dict()
mean_1

{0: 0.5555555555555556,
 1: 0.5555555555555556,
 2: 0.5555555555555556,
 3: 0.5555555555555556,
 4: 0.5555555555555556,
 5: 0.5555555555555556,
 6: 0.5555555555555556,
 7: 0.5555555555555556,
 8: 0.5555555555555556,
 9: 0.0}
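
The two dictionaries above are keyed by row index because `transform('mean')` returns one value per row. When the encoding has to be reused on other rows, a mapping keyed by category is often more convenient. The sketch below rebuilds the categorical columns of the toy frame under the assumed name `df_demo` and applies the mapping to a hypothetical `new_rows` series.

```python
import pandas as pd

# Same categorical columns and target as the toy frame above (col1 omitted).
df_demo = pd.DataFrame({
    "cat_0": ["red"] * 5 + ["yellow"] * 5,
    "cat_1": ["blue"] * 9 + ["green"] * 1,
    "y": [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})

# One entry per category instead of one entry per row.
cat_0_means = df_demo.groupby("cat_0")["y"].mean().to_dict()
cat_1_means = df_demo.groupby("cat_1")["y"].mean().to_dict()
print(cat_0_means)
print(cat_1_means)

# Hypothetical new rows: categories missing from the mapping become NaN.
new_rows = pd.Series(["yellow", "red", "purple"])
encoded = new_rows.map(cat_0_means)
print(encoded)
```

Note that `green` gets its mean from a single row, which is the kind of unreliable estimate that smoothing is meant to temper.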

In [8]:
df.groupby(['cat_0'])['y'].agg(['mean','count'])

cat_0,mean,count
red,0.8,5
yellow,0.2,5
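
The per-category mean and count are the two ingredients of a smoothed target encoding. Assuming the additive smoothing described in source 1, with $n_c$ the row count of category $c$, $\bar{y}_c$ its target mean, $\bar{y}$ the global target mean, and $w$ the smoothing weight (symbols chosen here for illustration), the encoding and a worked value for `red` with $w = 10$ would be:

```latex
\mathrm{enc}(c) = \frac{n_c \, \bar{y}_c + w \, \bar{y}}{n_c + w}
\qquad\text{e.g.}\qquad
\mathrm{enc}(\text{red}) = \frac{5 \times 0.8 + 10 \times 0.5}{5 + 10} = \frac{9}{15} = 0.6
```

Because the result is a weighted average of $\bar{y}_c$ and $\bar{y}$, it always lies between the two; for a 0/1 target it therefore stays within $[0, 1]$.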


In [9]:
def cal_smooth(df, cat_name, y, weight):
  # Smoothed target encoding for one categorical column:
  #   1- compute the global mean of the target
  #   2- compute the per-category mean and row count
  #   3- blend the two; `weight` controls the pull toward the global mean

  # 1- global mean of the target
  glob_mean = df[y].mean()
  # 2- per-category mean and count
  aggs = df.groupby(cat_name)[y].agg(['mean', 'count'])
  mean = aggs['mean']
  count = aggs['count']

  # 3- smoothed mean: the whole numerator is divided by (count + weight)
  smth = (count * mean + weight * glob_mean) / (count + weight)

  # replace each category with its smoothed mean
  return df[cat_name].map(smth)


In [10]:
weight = 10  # smoothing weight, a hyperparameter to tune

df['cat_0_enc'] = cal_smooth(df, 'cat_0', 'y', weight)

df

,col1,cat_0,cat_1,y,cat_0_enc
0,54.317197,red,blue,1,0.6
1,37.48052,red,blue,0,0.6
2,44.335796,red,blue,1,0.6
3,98.531253,red,blue,1,0.6
4,83.473592,red,blue,1,0.6
5,14.595431,yellow,blue,1,0.4
6,3.866097,yellow,blue,0,0.4
7,75.259831,yellow,blue,0,0.4
8,78.363267,yellow,blue,0,0.4
9,7.571011,yellow,green,0,0.4


In [11]:
df['cat_1_enc'] = cal_smooth(df, 'cat_1', 'y', weight)
df

,col1,cat_0,cat_1,y,cat_0_enc,cat_1_enc
0,54.317197,red,blue,1,0.6,0.526316
1,37.48052,red,blue,0,0.6,0.526316
2,44.335796,red,blue,1,0.6,0.526316
3,98.531253,red,blue,1,0.6,0.526316
4,83.473592,red,blue,1,0.6,0.526316
5,14.595431,yellow,blue,1,0.4,0.526316
6,3.866097,yellow,blue,0,0.4,0.526316
7,75.259831,yellow,blue,0,0.4,0.526316
8,78.363267,yellow,blue,0,0.4,0.526316
9,7.571011,yellow,green,0,0.4,0.454545


In [12]:
weight = 5  # a smaller weight pulls less toward the global mean
df['cat_1_enc'] = cal_smooth(df, 'cat_1', 'y', weight)
df

,col1,cat_0,cat_1,y,cat_0_enc,cat_1_enc
0,54.317197,red,blue,1,0.6,0.535714
1,37.48052,red,blue,0,0.6,0.535714
2,44.335796,red,blue,1,0.6,0.535714
3,98.531253,red,blue,1,0.6,0.535714
4,83.473592,red,blue,1,0.6,0.535714
5,14.595431,yellow,blue,1,0.4,0.535714
6,3.866097,yellow,blue,0,0.4,0.535714
7,75.259831,yellow,blue,0,0.4,0.535714
8,78.363267,yellow,blue,0,0.4,0.535714
9,7.571011,yellow,green,0,0.4,0.416667
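
To see how the weight shapes the encoding, the following sketch sweeps a few weights for `cat_1`. It assumes the additive smoothing form (count-weighted category mean blended with the global mean); the helper `smoothed_means` and the frame `demo` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Same cat_1 column and target as the toy frame above.
demo = pd.DataFrame({
    "cat_1": ["blue"] * 9 + ["green"],
    "y": [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})

def smoothed_means(frame, cat_name, y, weight):
    """Return one smoothed target mean per category (hypothetical helper)."""
    glob_mean = frame[y].mean()
    aggs = frame.groupby(cat_name)[y].agg(["mean", "count"])
    return (aggs["count"] * aggs["mean"] + weight * glob_mean) / (aggs["count"] + weight)

# One column per weight, one row per category.
sweep = pd.DataFrame({w: smoothed_means(demo, "cat_1", "y", w) for w in [0, 5, 10, 100]})
print(sweep.round(3))
```

With a weight of 0 each category keeps its raw mean, so the single `green` row encodes to 0. As the weight grows, both categories move toward the global mean of 0.5, and the rare `green` category moves fastest because its count is small.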
