## Method Description

1. Split the original dataset into train and test sets.
2. On the train set, compute the mean of the target for each category of a variable (for numerical variables, for each interval after binning). These means replace the original category values.
3. In the test set, encode the categorical and binned numerical variables with the target means learned on the train set.
4. Treat each encoded variable as a prediction of the target and compute a performance metric for it against the true test target.

## Mean Encoding: Example

##### Categorical Variables
Color: 3 red (targets 1, 0, 1), 2 blue (targets 1, 1), 2 green (targets 0, 0), 2 yellow (targets 1, 0), for 9 observations in total

* Mean for red: 0.67 (2 out of 3 with target=1)
* Mean for blue: 1 (2 out of 2)
* Mean for green: 0 (0 out of 2)
* Mean for yellow: 0.5 (1 out of 2)

Capture this as a dictionary in Python: `{'red': 0.67, 'blue': 1, 'green': 0, 'yellow': 0.5}`  
Replace each color value (not the target) with its target mean, hence "mean encoding"  
Using the encoded variable and the target, compute a performance metric
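
As a minimal sketch, the color example can be reproduced with pandas. The DataFrame `toy` and the column names `color`, `target`, and `color_encoded` are made up for illustration; they are not part of the Titanic demo further down.

```python
import pandas as pd

# Toy data matching the color example (9 observations)
toy = pd.DataFrame({
    "color": ["red", "red", "red", "blue", "blue",
              "green", "green", "yellow", "yellow"],
    "target": [1, 0, 1, 1, 1, 0, 0, 1, 0],
})

# Target mean per category, captured as a dictionary
color_means = toy.groupby("color")["target"].mean().to_dict()

# Replace each category (not the target) with its target mean
toy["color_encoded"] = toy["color"].map(color_means)

print(color_means)
```

The resulting `color_encoded` column is what gets scored against `target` in the next step.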

##### Numerical Variables
For example, a continuous variable such as price can be split into bins, e.g. [1-5], [6-10], [11-15], etc.
Then apply the same procedure, treating these intervals as "categories"
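
A short sketch of the binning idea, assuming a hypothetical `price` column and hand-picked edges `[0, 5, 10, 15]`. Note that `pd.cut` builds right-closed intervals, so these edges give (0, 5], (5, 10], (10, 15].

```python
import pandas as pd

# Hypothetical prices with a binary target
prices = pd.DataFrame({
    "price": [2, 4, 7, 9, 12, 14],
    "target": [0, 0, 1, 0, 1, 1],
})

# Assign each price to an interval; labels=False returns the bin number
prices["price_bin"] = pd.cut(prices["price"], bins=[0, 5, 10, 15], labels=False)

# Treat the bin number as a category and compute the target mean per bin
bin_means = prices.groupby("price_bin")["target"].mean().to_dict()
prices["price_encoded"] = prices["price_bin"].map(bin_means)

print(prices)
```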

### Using a performance metric (for example, the ROC-AUC)
* Obtain a performance metric for each feature
* Then **rank** the features by that metric; the top-ranked features are the best candidates

We can use any performance metric we like:
* (Classification) ROC-AUC, accuracy, precision, recall, etc.
* (Regression) MSE, RMSE, R-squared, etc.
* Different metrics may lead to different selected features
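
The ranking step might look like the sketch below. The encoded values and the names `feature_a` and `feature_b` are invented for illustration; ROC-AUC is used here, but any other metric could be swapped in.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical true target and two already-encoded features
y_true = pd.Series([0, 0, 1, 1, 0, 1])
encoded = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.8, 0.9, 0.3, 0.7],  # higher values go with target=1
    "feature_b": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # constant, carries no information
})

# One metric value per feature, then sort from best to worst
scores = pd.Series(
    {col: roc_auc_score(y_true, encoded[col]) for col in encoded.columns}
)
ranking = scores.sort_values(ascending=False)

print(ranking)
```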

# Example Demo (using the Titanic Dataset)

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

In [29]:
# load the titanic dataset
df = pd.read_csv('../precleaned-datasets/titanic.csv')
df.shape

(418, 12)

In [30]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [31]:
df = df.copy()

In [32]:
# keep only the following columns
df = df[['Pclass','Survived','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']]

In [33]:
df.head()

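| | Pclass | Survived | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | male | 34.5 | 0 | 0 | 7.8292 | NaN | Q |
| 1 | 3 | 1 | female | 47.0 | 1 | 0 | 7.0 | NaN | S |
| 2 | 2 | 0 | male | 62.0 | 0 | 0 | 9.6875 | NaN | Q |
| 3 | 3 | 0 | male | 27.0 | 0 | 0 | 8.6625 | NaN | S |
| 4 | 3 | 1 | female | 22.0 | 1 | 1 | 12.2875 | NaN | S |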


In [34]:
df.isnull().sum()

Pclass        0
Survived      0
Sex           0
Age          86
SibSp         0
Parch         0
Fare          1
Cabin       327
Embarked      0
dtype: int64

In [35]:
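# Quick missing-data imputation
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
df["Age"] = df["Age"].fillna(df["Age"].median())  # use the median age
df["Cabin"] = df["Cabin"].fillna("Unknown")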

In [36]:
df.isnull().sum()

Pclass      0
Survived    0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        1
Cabin       0
Embarked    0
dtype: int64

In [37]:
# Fare has a single missing value, so just drop that row
df = df.dropna()

In [38]:
df

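| | Pclass | Survived | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | male | 34.5 | 0 | 0 | 7.8292 | Unknown | Q |
| 1 | 3 | 1 | female | 47.0 | 1 | 0 | 7.0000 | Unknown | S |
| 2 | 2 | 0 | male | 62.0 | 0 | 0 | 9.6875 | Unknown | Q |
| 3 | 3 | 0 | male | 27.0 | 0 | 0 | 8.6625 | Unknown | S |
| 4 | 3 | 1 | female | 22.0 | 1 | 1 | 12.2875 | Unknown | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 3 | 0 | male | 27.0 | 0 | 0 | 8.0500 | Unknown | S |
| 414 | 1 | 1 | female | 39.0 | 0 | 0 | 108.9000 | C105 | C |
| 415 | 3 | 0 | male | 38.5 | 0 | 0 | 7.2500 | Unknown | S |
| 416 | 3 | 0 | male | 27.0 | 0 | 0 | 8.0500 | Unknown | S |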


In [39]:
# keep only the first letter of the cabin ('U' stands for the imputed 'Unknown')
df['Cabin'] = df['Cabin'].str[0]
df['Cabin'].unique()

array(['U', 'B', 'E', 'A', 'C', 'D', 'F', 'G'], dtype=object)

## Feature Selection on Categorical Variables

In [40]:
# separate train and test sets
# the target means are learned on the training set only, to avoid overfitting
# ('Survived' stays in X because mean_encoding needs it to compute the means)

X_train, X_test, y_train, y_test = train_test_split(
    df[['Pclass', 'Sex', 'Embarked', 'Cabin', 'Survived']],
    df['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((291, 5), (126, 5))

### Replace Categories by Target Mean

In [45]:
# function that determines the target mean per category

def mean_encoding(df_train, df_test, categorical_vars):
    
    # temporary copy of the original dataframes
    df_train_temp = df_train.copy()
    df_test_temp = df_test.copy()
    
    # iterate over each variable
    for col in categorical_vars:
        
        # make a dictionary of categories, target-mean pairs
        target_mean_dict = df_train.groupby([col])['Survived'].mean().to_dict()
        
        # replace the categories by the mean of the target
        df_train_temp[col] = df_train[col].map(target_mean_dict)
        df_test_temp[col] = df_test[col].map(target_mean_dict)
    
    # drop the target from the dataset
    df_train_temp.drop(['Survived'], axis=1, inplace=True)
    df_test_temp.drop(['Survived'], axis=1, inplace=True)
    
    # return the remapped datasets
    return df_train_temp, df_test_temp

In [46]:
categorical_vars = ['Pclass', 'Sex', 'Embarked', 'Cabin']

X_train_enc, X_test_enc = mean_encoding(X_train, X_test, categorical_vars)

X_train_enc.head()
#X_test_enc.head()

| | Pclass | Sex | Embarked | Cabin |
|---|---|---|---|---|
| 96 | 0.414286 | 1.0 | 0.305263 | 0.47619 |
| 381 | 0.309677 | 0.0 | 0.53125 | 0.30131 |
| 89 | 0.30303 | 0.0 | 0.305263 | 0.30131 |
| 234 | 0.414286 | 0.0 | 0.318841 | 0.47619 |
| 192 | 0.309677 | 0.0 | 0.305263 | 0.30131 |
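
One caveat with `.map`: a category that appears in the test set but never in the train set has no entry in the dictionary and is encoded as `NaN`, which `roc_auc_score` cannot handle. The sketch below shows one possible way to deal with it, by falling back to the overall train target mean. This fallback is an assumption added here, not part of the `mean_encoding` function above, and the toy frames `train` and `test` are made up.

```python
import pandas as pd

# Hypothetical data: cabin "T" appears in test only
train = pd.DataFrame({"Cabin": ["A", "A", "B", "B"], "Survived": [1, 0, 1, 1]})
test = pd.DataFrame({"Cabin": ["A", "B", "T"]})

# Target means learned on train only
means = train.groupby("Cabin")["Survived"].mean().to_dict()
global_mean = train["Survived"].mean()

# Unseen categories map to NaN ...
encoded = test["Cabin"].map(means)

# ... so fill them with the overall train mean (one option among several)
encoded_filled = encoded.fillna(global_mean)

print(encoded_filled.tolist())
```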


### Determine the ROC-AUC using the encoded variable values as input

In [47]:
roc_values = []

for feature in categorical_vars:
    
    roc_values.append(roc_auc_score(y_test, X_test_enc[feature])) 

In [48]:
m1 = pd.Series(roc_values)
m1.index = categorical_vars
m1.sort_values(ascending=False)

Sex         1.000000
Pclass      0.581818
Embarked    0.569654
Cabin       0.519078
dtype: float64

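#### All variables have a ROC-AUC above 0.5; Sex and Pclass rank highest

Note that Sex reaches a ROC-AUC of exactly 1.0: its encoded values are 1.0 and 0.0, meaning that in this copy of the dataset sex alone separates survivors from non-survivors. A perfect score like this is worth double-checking against how the target column was produced.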

## Feature Selection on Numerical Variables
* The procedure is identical to the one for categorical variables, except that each continuous variable must first be divided into bins.

In [49]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df[['Age', 'Fare', 'Survived']],
    df['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((291, 3), (126, 3))

### Divide age into bins

In [53]:
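# Use qcut (quantile cut) to split Age into bins with roughly equal counts
# retbins=True also returns the bin edges, so they can be reused on the test set
# duplicates='drop' merges repeated edges, so there may be fewer than q bins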

X_train['age_binned'], intervals = pd.qcut(
    X_train['Age'],
    q = 5,
    labels=False,
    retbins=True,
    precision=3,
    duplicates='drop',
)

X_train[['age_binned', 'Age']].head(10)

| | age_binned | Age |
|---|---|---|
| 96 | 3 | 76.0 |
| 381 | 1 | 26.0 |
| 89 | 0 | 2.0 |
| 234 | 2 | 39.0 |
| 192 | 0 | 11.5 |
| 354 | 0 | 0.17 |
| 254 | 2 | 32.5 |
| 92 | 1 | 27.0 |
| 241 | 3 | 45.0 |
| 195 | 2 | 33.0 |


In [61]:
#X_train['age_binned'].nunique()

In [60]:
#intervals

In [57]:
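# bin the test set with the edges learned on the train set
# (test values outside the train range would become NaN)
X_test['age_binned'] = pd.cut(x = X_test['Age'], bins=intervals, labels=False)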

In [59]:
X_test[['age_binned', 'Age']].head(10)

| | age_binned | Age |
|---|---|---|
| 410 | 1 | 27.0 |
| 171 | 1 | 27.0 |
| 225 | 1 | 27.0 |
| 391 | 3 | 51.0 |
| 309 | 3 | 45.0 |
| 308 | 3 | 55.0 |
| 150 | 1 | 23.0 |
| 10 | 1 | 27.0 |
| 21 | 0 | 9.0 |
| 262 | 2 | 29.0 |
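
Reusing the train edges with `pd.cut` works as long as every test value falls inside the train range. A test value below the lowest edge or above the highest one gets no bin and becomes `NaN`. The sketch below, on made-up ages, shows one possible safeguard: widening the outer edges to infinity. This is an optional variation, not what the cells above do.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the test set contains values outside the train range
train_age = pd.Series([2.0, 11.5, 26.0, 27.0, 32.5, 39.0, 45.0, 76.0])
test_age = pd.Series([1.0, 27.0, 80.0])

# Learn quantile edges on train
_, edges = pd.qcut(train_age, q=4, labels=False, retbins=True, duplicates="drop")

# Reusing the edges as they are: out-of-range test values get NaN
naive = pd.cut(test_age, bins=edges, labels=False)

# Widening the outer edges so every test value lands in a bin
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
safe = pd.cut(test_age, bins=open_edges, labels=False)

print(naive.tolist())
print(safe.tolist())
```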


### Do the same with the Fare variable

In [62]:
# train
X_train['fare_binned'], intervals = pd.qcut(
    X_train['Fare'],
    q=5,
    labels=False,
    retbins=True,
    precision=3,
    duplicates='drop',
)

# test
X_test['fare_binned'] = pd.cut(x = X_test['Fare'], bins=intervals, labels=False)

In [63]:
X_test['fare_binned'].nunique()

5

### Use the previous function to encode the variables (replace bins with target mean)

In [66]:
binned_vars = ['age_binned', 'fare_binned']

X_train_enc, X_test_enc = mean_encoding(
    X_train[binned_vars+['Survived']], X_test[binned_vars+['Survived']], binned_vars)

X_train_enc.head()

| | age_binned | fare_binned |
|---|---|---|
| 96 | 0.362069 | 0.482759 |
| 381 | 0.263158 | 0.238095 |
| 89 | 0.428571 | 0.350877 |
| 234 | 0.339286 | 0.482759 |
| 192 | 0.428571 | 0.320755 |


### Determine the ROC-AUC using encodings

In [67]:
roc_values = []

for feature in binned_vars:
    
    roc_values.append(roc_auc_score(y_test, X_test_enc[feature])) 

In [68]:
m1 = pd.Series(roc_values)
m1.index = binned_vars
m1.sort_values(ascending=False)

fare_binned    0.652369
age_binned     0.459411
dtype: float64

#### Fare is the better predictor of survival. Age is not helpful with this encoding: its ROC-AUC is below 0.5, i.e. worse than a random ranking on the test set
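
To make the "below 0.5" reading concrete: a ROC-AUC under 0.5 means the scores order the test observations slightly the wrong way round, and reversing the scores gives 1 minus the original value. For mean encoding this suggests that the ordering of bins learned on the train set did not carry over to the test set. The sketch below uses invented scores, not the Titanic values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical target and scores that tend to rank class 0 above class 1
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.9, 0.2, 0.3, 0.8, 0.7, 0.4])

auc = roc_auc_score(y_true, scores)
auc_reversed = roc_auc_score(y_true, -scores)

# The two values add up to 1
print(auc, auc_reversed)
```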