<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/4_DataLeakage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

This notebook discusses the data leakage that can occur when splitting data into training and test sets. If the data is prepared (for example, scaled) before it is split, the preparation step sees the test rows, which causes data leakage.

In [None]:
from IPython.display import Image
Image("splitting-data.png" , width=640)

**Data Leakage can occur during the data preparation phase of machine learning**<br>
How data preparation techniques are applied to data matters. <br>
A common approach to data preparation:<br>

1. Prepare Dataset<br>
2. Split Data<br>
3. Evaluate Models<br>

Although common, this order of steps is incorrect in most cases.

Applying data preparation techniques before splitting the data for model evaluation can lead to data leakage, which can result in an incorrect estimate of a model’s performance on the problem. <br>

Data leakage refers to the problem where information about the test or validation dataset is made available to the model through the training dataset.<br>
For example, a scaler fitted on the full dataset computes its minimum and maximum from the test rows as well as the training rows.<br>
This leakage is often small and subtle, but it can have a marked effect on the measured performance.
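
As a minimal sketch of why the order matters, the snippet below compares the per-feature range that `MinMaxScaler` learns from the full dataset with the range it learns from the training split alone. The dataset parameters match the examples in this notebook; the names `leaky_scaler` and `clean_scaler` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic dataset (same parameters as the examples in this notebook)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)

# scaler fitted on every row: its statistics include the test rows
leaky_scaler = MinMaxScaler().fit(X)
# scaler fitted on the training rows only
clean_scaler = MinMaxScaler().fit(X_train)

# count the features whose learned range was widened by a test row
wider_min = (leaky_scaler.data_min_ < clean_scaler.data_min_).sum()
wider_max = (leaky_scaler.data_max_ > clean_scaler.data_max_).sum()
print('Features whose minimum comes from a test row: %d' % wider_min)
print('Features whose maximum comes from a test row: %d' % wider_max)
```

Any non-zero count means the scaler fitted on the full dataset has encoded something about the test rows into the transformed training data.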

In [None]:
from IPython.display import Image
Image("splt1.png" , width=640)

**Import the libraries**

In [None]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

**The Naive Approach**

In [None]:
# naive approach to normalizing the data before splitting the data and evaluating the model

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
# normalize the dataset (the scaler is fitted on all rows, including future test rows)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

**Recommended Approach**
1. Split Data.
2. Fit Data Preparation on Training Dataset.
3. Apply Data Preparation to Train and Test Datasets.
4. Evaluate Models.

In [None]:
from IPython.display import Image
Image("splt2.png" , width=640)

In [None]:
# correct approach: split the data first, then fit the scaler
# on the training set only, before the model is evaluated
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, 
                           n_redundant=5,random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, 
                                                    random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

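In this case, the naive evaluation, which suffers from data leakage, reported a lower accuracy than the correct evaluation. We would normally expect data leakage to inflate the accuracy estimate, because the model has indirectly seen information about the test set.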
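
A single train-test split gives only one noisy measurement, so the gap between the two approaches may simply reflect that particular split. One way to check, sketched below, is to repeat both procedures over several split seeds and compare the averages. The helper `evaluate` and the choice of ten seeds are assumptions made for illustration, not part of the original experiment.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)

def evaluate(X, y, seed, leaky):
    """Return the test accuracy for one split (hypothetical helper).

    leaky=True fits the scaler on all rows before splitting;
    leaky=False fits it on the training rows only.
    """
    if leaky:
        X = MinMaxScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    if not leaky:
        scaler = MinMaxScaler().fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# repeat both procedures over the same set of split seeds
seeds = range(10)
naive_scores = [evaluate(X, y, s, leaky=True) for s in seeds]
correct_scores = [evaluate(X, y, s, leaky=False) for s in seeds]
print('Naive mean accuracy:   %.3f' % (mean(naive_scores)*100))
print('Correct mean accuracy: %.3f' % (mean(correct_scores)*100))
```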

**Assignment**<br>
1. Change the synthetic dataset - add more features, more data points. 
Does the accuracy change between the two methods?
2. Change the train-test split. Does the accuracy change between the two methods? 


**Cross Validation Example**<br>
Using a data pipeline

**Naive method**<br>
This method produces an unreliable accuracy score, because the scaler is fitted on the full dataset before cross-validation, so each fold's training data carries information from its held-out rows.

In [None]:
# naive data preparation for model evaluation with k-fold cross-validation
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5,random_state=7)
# normalize the dataset (the scaler is fitted on all rows, before any fold is held out)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# define the model
model = LogisticRegression()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

**Recommended Method**<br>


In [None]:
from sklearn.pipeline import Pipeline

Use a pipeline to prepare the data.<br>
A pipeline assembles several steps into a single estimator, which means the steps can be cross-validated together.<br>
Within each fold, the scaler is fitted on that fold's training rows only, so pipelines help avoid leaking the test set into the training set.<br>

[sklearn.pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
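
To make the pipeline's behaviour concrete, the sketch below reproduces by hand what happens inside each fold: the scaler is fitted on that fold's training rows only, then applied to both the training and the held-out rows. It uses a plain `StratifiedKFold` with five splits for brevity, which is an assumption for illustration and differs from the repeated procedure used in this notebook.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# manual version: refit the scaler inside every fold
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    scaler = MinMaxScaler().fit(X[train_idx])    # training rows only
    X_train = scaler.transform(X[train_idx])
    X_test = scaler.transform(X[test_idx])       # reuse training statistics
    model = LogisticRegression()
    model.fit(X_train, y[train_idx])
    manual_scores.append(accuracy_score(y[test_idx], model.predict(X_test)))

# pipeline version: cross_val_score does the per-fold fitting for us
pipeline = Pipeline(steps=[('scaler', MinMaxScaler()),
                           ('model', LogisticRegression())])
pipeline_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)

print('Manual mean accuracy:   %.3f' % (mean(manual_scores)*100))
print('Pipeline mean accuracy: %.3f' % (mean(pipeline_scores)*100))
```

If the two sets of fold scores agree, the pipeline is performing the same leak-free, per-fold preparation as the manual loop, with far less code.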

In [None]:
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
# define the pipeline: the scaler runs first, then the model
steps = [('scaler', MinMaxScaler()),
         ('model', LogisticRegression())]
pipeline = Pipeline(steps=steps)

In [None]:
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', 
                         cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))