
## Select with Target Mean as Performance Proxy

The `SelectByTargetMeanPerformance` transformer implements the feature selection method described in the notebook **06.2-Method-used-in-a-KDD-competition**.

This functionality is now included in Feature-engine.

Feature-engine automatically detects categorical and numerical variables:

- Categories in categorical variables are replaced by the mean value of the target.

- Numerical variables are first discretised, and then each bin is replaced by the mean value of the target.
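
The two steps listed above can be illustrated with plain pandas. This is only a minimal sketch on a made-up toy frame (the column names `colour`, `size` and `target` are hypothetical), not Feature-engine's internal implementation:

```python
import pandas as pd

# toy data (hypothetical): one categorical variable, one numerical
# variable and a binary target
df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "blue", "green"],
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target": [1, 0, 1, 1, 0, 0],
})

# categorical: replace each category by the mean of the target
# observed for that category
colour_means = df.groupby("colour")["target"].mean()
df["colour_enc"] = df["colour"].map(colour_means)

# numerical: first discretise into 3 equal-frequency bins, then replace
# each bin by the mean of the target observed in that bin
df["size_bin"] = pd.qcut(df["size"], q=3, labels=False)
size_means = df.groupby("size_bin")["target"].mean()
df["size_enc"] = df["size_bin"].map(size_means)

print(df[["colour", "colour_enc", "size", "size_enc"]])
```

The encoded columns can then be treated as a prediction of the target and scored with a metric such as the ROC-AUC, which is what makes the target mean a proxy for performance.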

In [1]:
pip install feature_engine


In [26]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from feature_engine.selection import SelectByTargetMeanPerformance

In [27]:
# load the titanic dataset

data = pd.read_csv('https://raw.githubusercontent.com/Venkatpandey/DataScience_ML/main/dataset/titanic.csv')
data.shape

(1306, 9)

In [28]:
data.head()

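|   | pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked |
|---|--------|----------|-----|-----|-------|-------|------|-------|----------|
| 0 | 1 | 1 | female | 29.0 | 0 | 0 | 211.3375 | B5 | S |
| 1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.55 | C22 C26 | S |
| 2 | 1 | 0 | female | 2.0 | 1 | 2 | 151.55 | C22 C26 | S |
| 3 | 1 | 0 | male | 30.0 | 1 | 2 | 151.55 | C22 C26 | S |
| 4 | 1 | 0 | female | 25.0 | 1 | 2 | 151.55 | C22 C26 | S |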


In [29]:
# Variable preprocessing:

# narrow down the different cabins by keeping only the first letter,
# which represents the deck on which the cabin was located
data['cabin'] = data['cabin'].str[0]

# group the rare cabin letters (T and G), which appear in only
# a few rows of the dataset, under the label N
data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])

data['cabin'].unique()

array(['B', 'C', 'E', 'D', 'A', nan, 'N', 'F'], dtype=object)

In [30]:
# number of passengers per cabin

data['cabin'].value_counts()

C    94
B    63
D    46
E    41
A    22
F    21
N     6
Name: cabin, dtype: int64

In [31]:
# number of passengers per value of parch
data['parch'].value_counts()

0    999
1    170
2    113
3      8
5      6
4      6
9      2
6      2
Name: parch, dtype: int64

In [32]:
# cap the variable at 3; higher values are
# shown by too few observations

data['parch'] = np.where(data['parch'] > 3, 3, data['parch'])
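
As an aside, the same capping can also be written with pandas' `Series.clip`. A small sketch on a made-up Series (the values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# hypothetical values standing in for a discrete variable such as parch
parch = pd.Series([0, 1, 2, 3, 4, 5, 9])

# the approach used in this notebook: values above 3 are replaced by 3
capped_where = np.where(parch > 3, 3, parch)

# equivalent result with Series.clip, which keeps the pandas index
capped_clip = parch.clip(upper=3)

print(capped_clip.tolist())
```

Both versions give the same values; `np.where` returns a NumPy array, while `clip` returns a pandas Series.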

In [33]:
data['sibsp'].value_counts()

0    888
1    319
2     42
4     22
3     20
8      9
5      6
Name: sibsp, dtype: int64

In [34]:
# cap the variable at 3; higher values are
# shown by too few observations

data['sibsp'] = np.where(data['sibsp'] > 3, 3, data['sibsp'])

In [35]:
# cast discrete variables as categorical

# Feature-engine treats all variables of type object as categorical.
# So, to work with numerical variables as if they were categorical,
# we need to cast them as object

data[['pclass', 'sibsp', 'parch']] = data[['pclass', 'sibsp', 'parch']].astype('O')

In [36]:
# check for missing data (only cabin shows missing values)

data.isnull().sum()

pclass         0
survived       0
sex            0
age            0
sibsp          0
parch          0
fare           0
cabin       1013
embarked       0
dtype: int64

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.
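
To make this concrete, here is a minimal sketch, assuming a made-up toy train and test set (the rows below are invented and only borrow the column names `sex` and `survived`). The category-to-target-mean mapping is learned from the training rows alone and only then applied to the test rows:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# hypothetical toy data, not the real Titanic rows
train = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "survived": [1, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "survived": [1, 0, 0, 1],
})

# learn the category -> target mean mapping from the training set only
mapping = train.groupby("sex")["survived"].mean()

# apply the learned mapping to the test set and use it as a score
test_score = test["sex"].map(mapping)

# evaluate how well the encoded feature ranks the test observations
auc = roc_auc_score(test["survived"], test_score)
print(mapping.to_dict(), auc)
```

Because the test set plays no part in computing the mapping, the resulting score is not inflated by information leaked from the test observations.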

In [37]:
# separate train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((914, 8), (392, 8))

In [38]:
# replace missing values (present only in cabin) with 0
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

In [39]:
# Feature-engine automates the selection for both
# categorical and numerical variables

sel = SelectByTargetMeanPerformance(
    variables=None,  # automatically finds categorical and numerical variables
    scoring="roc_auc_score",  # the metric used to evaluate performance
    threshold=0.6,  # the performance threshold for feature selection
    bins=3,  # the number of intervals to discretise the numerical variables
    strategy="equal_frequency",  # whether the intervals should be of equal size or equal number of observations
    cv=2,  # number of cross-validation folds
    random_state=1,  # seed for reproducibility
)

sel.fit(X_train, y_train)

SelectByTargetMeanPerformance(bins=3, cv=2, random_state=1,
                              strategy='equal_frequency', threshold=0.6)

In [40]:
# after fitting, we can find the categorical variables
# using this attribute

sel.variables_categorical_

['sex', 'cabin', 'embarked']

In [41]:
# and here we find the numerical variables

sel.variables_numerical_

['pclass', 'age', 'sibsp', 'parch', 'fare']

In [42]:
# here the selector stores the roc-auc per feature

sel.feature_performance_

{'age': 0.5442311856532609,
 'cabin': 0.6356486869241289,
 'embarked': 0.5653882429366952,
 'fare': 0.6566784837126252,
 'parch': 0.5,
 'pclass': 0.6547771899168928,
 'sex': 0.7490977488556398,
 'sibsp': 0.5094691535150646}

In [43]:
# and these are the features that will be dropped

sel.features_to_drop_

['age', 'sibsp', 'parch', 'embarked']
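
The dropped features are consistent with the scores and threshold shown above. As a minimal sketch, assuming the selector drops every feature whose score falls below the threshold (the variable names below are illustrative, not Feature-engine attributes), the list can be reproduced by hand:

```python
# scores copied from the sel.feature_performance_ output above
feature_performance = {
    "age": 0.5442311856532609,
    "cabin": 0.6356486869241289,
    "embarked": 0.5653882429366952,
    "fare": 0.6566784837126252,
    "parch": 0.5,
    "pclass": 0.6547771899168928,
    "sex": 0.7490977488556398,
    "sibsp": 0.5094691535150646,
}

# the same threshold that was passed to the selector
threshold = 0.6

# assumption: a feature is dropped when its score is below the threshold
to_drop = [f for f, score in feature_performance.items() if score < threshold]
to_keep = [f for f in feature_performance if f not in to_drop]

print(sorted(to_drop))
print(sorted(to_keep))
```

Four features remain (`cabin`, `fare`, `pclass` and `sex`), which matches the four columns left after the transformation.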

In [44]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((914, 4), (392, 4))

That is all for this lecture. I hope you enjoyed it, and see you in the next one!