### **Supervised**: `Data Leakage`

**Definition**: Data leakage occurs when information from outside the training dataset is used to create the model. This additional information lets the model learn something it would not otherwise know, which in turn invalidates the estimated performance of the model being constructed.

**Do I have data leakage?**

A common warning sign of data leakage is performance that seems a little too good to be true.

For example, suppose you normalize or standardize your entire dataset and then estimate the performance of your model using cross-validation. The scaling statistics were computed from rows that later serve as validation folds, so the effect is overfitting and an overly optimistic evaluation of your model's performance on unseen data. You have committed the sin of data leakage.

**Tips to Combat Data Leakage**

- Use Pipelines. Rely on pipeline architectures that perform the whole sequence of data preparation steps within each cross-validation fold.
- Use a Holdout Dataset. Hold back an unseen dataset as a final sanity check of your model before you use it.
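
The two tips can be combined. The following is a minimal sketch, not a definitive recipe: it assumes synthetic data and a `StandardScaler` plus `LogisticRegression` pipeline, and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical. The holdout set is split off first, cross-validation runs only on the remaining rows, and the holdout is scored once at the end.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y depends on the first feature
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Tip 2: set the holdout aside before any preprocessing or model selection
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Tip 1: the scaler lives inside the pipeline, so it is refit on the
# training part of every cross-validation fold
scaled_model = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', LogisticRegression()),
])

cv_scores = cross_val_score(scaled_model, X_train, y_train, cv=5, scoring='accuracy')

# Final sanity check: fit on all training rows, score the holdout once
scaled_model.fit(X_train, y_train)
holdout_accuracy = scaled_model.score(X_holdout, y_holdout)
print(cv_scores.mean(), holdout_accuracy)
```

If the cross-validated score and the holdout score disagree sharply, that is a hint that something in the workflow saw data it should not have seen.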

The data below is generated so that there is no relationship between X and y.

In [6]:
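import numpy as np

# X is pure noise and y is drawn independently of it, so there is no real signal
rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 1000))     # 100 samples, 1000 features
y = rnd.randint(0, 2, size=(100,))   # random binary labels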

**Data Leakage**

In [7]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import f_classif, SelectPercentile

In [8]:
selector = SelectPercentile(score_func=f_classif, percentile=5)  # select features by ANOVA F-score
selector.fit(X, y)  # leakage: the selector sees every row, including future validation folds

X_selected = selector.transform(X)

cross_val_score(  # evaluated on leaked data
    LogisticRegression(),
    X_selected,
    y,
    cv=5,
    scoring='accuracy'
)


array([0.8 , 0.85, 0.9 , 0.75, 0.85])

This result suggests a very good model even though the data was generated entirely at random, which should look suspicious. The cause is that feature selection was performed on the entire dataset, including the rows that later serve as validation folds.
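
Why can selection on random data find "useful" features at all? With 1000 noise features and only 100 samples, some features look associated with `y` purely by chance. The sketch below counts them, assuming the same data generation as above and a 0.05 p-value cutoff chosen only for illustration; `n_nominally_significant` is a hypothetical name.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Same random data as above: no true relationship between X and y
rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 1000))
y = rnd.randint(0, 2, size=(100,))

# ANOVA F-test of every feature against y, computed on ALL rows
f_scores, p_values = f_classif(X, y)

# Under pure noise, roughly 5% of features fall below p = 0.05 by chance
n_nominally_significant = int((p_values < 0.05).sum())
print(n_nominally_significant)
```

Selecting the top 5% of features on the full dataset picks exactly these chance associations, and because they were measured on the validation rows too, the classifier appears to generalize.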

**No Information Leakage**

In [9]:
selector = SelectPercentile(score_func=f_classif, percentile=5)

model_pipeline = Pipeline([
    ('selection', selector),
    ('estimator', LogisticRegression())
])

cross_val_score(  # with the pipeline, feature selection is refit inside each training fold
    model_pipeline,
    X,
    y,
    cv=5,
    scoring='accuracy'
)

array([0.4 , 0.45, 0.35, 0.4 , 0.5 ])

This result is reasonable for completely random data, because the feature selection process is performed only on the training portion of each fold, not on the entire dataset.
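
To make the mechanism explicit, here is a rough sketch of what the pipeline does inside cross-validation, written as a manual loop. It assumes the same random data as above and uses `StratifiedKFold(n_splits=5)` as the splitter; `fold_selector` and `fold_scores` are hypothetical names.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 1000))
y = rnd.randint(0, 2, size=(100,))

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit the selector on the training rows of this fold only
    fold_selector = SelectPercentile(score_func=f_classif, percentile=5)
    X_train_sel = fold_selector.fit_transform(X[train_idx], y[train_idx])

    # Apply the already-fitted selection to the validation rows
    X_test_sel = fold_selector.transform(X[test_idx])

    clf = LogisticRegression().fit(X_train_sel, y[train_idx])
    fold_scores.append(clf.score(X_test_sel, y[test_idx]))

print(np.round(fold_scores, 2))
```

The validation rows never influence which features are kept, so the scores reflect how the whole procedure behaves on unseen data.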

**How to Build a Pipeline?**

The example below builds a more complex Pipeline that combines a ColumnTransformer with a DecisionTreeClassifier, passes it to GridSearchCV, and displays its visual representation.
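
When a Pipeline is tuned with GridSearchCV, the parameters of a step are addressed as `<step name>__<parameter>`. The following toy sketch illustrates that naming on synthetic data; the data and the names `toy_pipeline` and `toy_search` are assumptions for illustration, not part of the example that follows.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption): the label is the sign of the first feature
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),  # an instance, not the class
])

# get_params() lists every tunable name, including nested step parameters
print('classifier__max_depth' in toy_pipeline.get_params())

toy_search = GridSearchCV(
    toy_pipeline,
    param_grid={'classifier__max_depth': [1, 2, 3]},  # step name + '__' + parameter
    scoring='accuracy',
    cv=3,
)
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
```

A key without the step prefix, such as plain `max_depth`, is not a parameter of the pipeline itself, so the search would raise an error when fitted.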

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

numerical_preprocessor = Pipeline([
    ('imputation', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_preprocessor = Pipeline([
    # strategy='constant' replaces missing values with fill_value
    ('imputation', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder())
])

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, ['state', 'gender']),
    ('numerical', numerical_preprocessor, ['age', 'income'])
])

model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())  # pass an instance, not the class itself
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    'classifier__max_depth': np.arange(1, 21),
    'classifier__min_samples_split': np.arange(5, 51, 5),
    'classifier__criterion': ['gini', 'entropy']
}

tuned_model = GridSearchCV(
    model_pipeline,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=1,
    cv=5
)

tuned_model

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('categorical',
                                                                         Pipeline(steps=[('imputation',
                                                                                          SimpleImputer(fill_value='unknown',
                                                                                                        strategy='constant')),
                                                                                         ('onehot',
                                                                                          OneHotEncoder())]),
                                                                         ['state',
                                                                          'gender']),
                                                                        ('numerical',
              