# Data Leakage

Data leakage occurs when a machine learning model has, during training, already seen some part of the test data. The model then appears to perform well in evaluation but, much like an overfit model, fails to generalize to genuinely new data. <br> <br>

In machine learning, data leakage is a mistake in which the creator of a model accidentally shares information between the training and test data sets. When a data set is split into training and test sets, the goal is that the two sets have no data in common, because the purpose of the test set is to simulate real-world data that the model has never seen. During evaluation, however, we have full access to both sets, so it is our responsibility to ensure that they do not overlap (i.e., that their intersection is empty). <br><br>
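The "no intersection" requirement can be checked directly. The following is a minimal sketch in plain Python; the helper name `train_test_split_indices`, the 80/20 ratio, and the row count are assumptions made for illustration, not part of the original example.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into disjoint train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(n_rows=1000)

# The two sets should share no rows and together cover the whole data set.
overlap = set(train_idx) & set(test_idx)
print("Rows in both sets:", len(overlap))
print("Train rows:", len(train_idx), "Test rows:", len(test_idx))
```

Disjoint indices do not rule out duplicated records that appear on both sides of the split; that needs a separate check on the row contents.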

In other words, leakage causes a model to look accurate until you start making decisions with the model, and then the model becomes very inaccurate.

There are two main types of leakage: target leakage and train-test contamination. <br><br>

## Target Leakage
Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions. <br>

![](tl_1.png)
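One way to apply the timing rule is to note, for every candidate feature, whether its value is known before or after the moment of prediction, and to keep only the "before" features. The sketch below assumes a hypothetical medical data set; the feature names and their timing labels are invented for illustration.

```python
def usable_features(feature_timing):
    """Keep only features whose values are known before the prediction is made."""
    return [name for name, timing in feature_timing.items() if timing == "before"]

# Hypothetical annotation: when each value becomes known, relative to the
# moment the prediction has to be made.
feature_timing = {
    "age": "before",
    "weight": "before",
    "took_antibiotic_medicine": "after",  # recorded only after the diagnosis
}

safe = usable_features(feature_timing)
print(safe)
```

A feature labeled "after" may correlate strongly with the target, which is exactly why it has to be excluded: the correlation exists only because the outcome has already happened.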

## Train-Test Contamination
A different type of leak occurs when you aren't careful to distinguish training data from validation data. <br><br>

Recall that validation is meant to measure how the model does on data that it hasn't seen before. You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior, for example by fitting an imputer or a scaler on the full data set before splitting it. This is sometimes called train-test contamination.
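A minimal sketch of the difference, written in plain Python rather than scikit-learn and using made-up numbers: a scaler that is fitted on all rows absorbs information from the validation rows, while a scaler fitted on the training rows alone does not. The helpers `fit_scaler` and `transform` are hypothetical names for this illustration.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]

# Made-up numbers for illustration only.
train = [1.0, 2.0, 3.0, 4.0]
valid = [10.0, 12.0]

# Contaminated: statistics computed on training and validation rows together.
leaky_mean, leaky_std = fit_scaler(train + valid)

# Clean: statistics computed on the training rows only, then reused as-is.
clean_mean, clean_std = fit_scaler(train)
valid_scaled = transform(valid, clean_mean, clean_std)

print("Mean seen by a contaminated scaler:", leaky_mean)
print("Mean seen by a clean scaler:", clean_mean)
```

In scikit-learn, the same separation is obtained by placing the preprocessing step inside the pipeline that is passed to `cross_val_score`, so that each fold fits the step on that fold's training rows only.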


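## Example

The following example uses a credit card application data set to show how target leakage can be detected and removed.

```python
import pandas as pd

# Read the data, mapping 'yes'/'no' strings to booleans
data = pd.read_csv('./credit_card_data.csv', true_values=['yes'], false_values=['no'])

# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)

print("Number of rows in the dataset:", X.shape[0])
X.head()
```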


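```
Number of rows in the dataset: 1319
```

|   | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |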


```python
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())
```

```
Cross-validation accuracy: 0.981052
```


With experience, you'll find that it's very rare to find models that are accurate 98% of the time. It happens, but it's uncommon enough that we should inspect the data more closely for target leakage.

Here is a brief description of each variable:

- **card**: Binary variable indicating whether the credit card application was accepted (1) or not (0).
- **reports**: Number of major derogatory reports.
- **age**: Age in years plus twelfths of a year.
- **income**: Yearly income divided by 10,000.
- **share**: Ratio of monthly credit card expenditure to yearly income.
- **expenditure**: Average monthly credit card expenditure.
- **owner**: Binary variable indicating whether the applicant owns a home (1) or rents (0).
- **selfemp**: Binary variable indicating whether the applicant is self-employed (1) or not (0).
- **dependents**: Number of dependents.
- **months**: Number of months the applicant has been living at their current address.
- **majorcards**: Number of major credit cards held.
- **active**: Number of active credit accounts.

A few variables look suspicious. For example, does `expenditure` mean expenditure on this card or on cards used before applying?

A simple comparison of the data helps answer this question:


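```python
# Split expenditure values by whether the applicant received a card
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))
```

```
Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02
```

Everyone who did not receive a card had no expenditures, while only 2% of those who received a card had none. This suggests that `expenditure` means spending on the card being applied for, which makes it a case of target leakage and explains the high accuracy above.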
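The same per-class comparison can be wrapped in a small helper and applied to any numeric column. This is a sketch in plain Python; the function name `zero_fraction_by_class` is hypothetical, and the toy rows are made up for illustration rather than taken from the credit card data.

```python
def zero_fraction_by_class(values, labels):
    """Return the fraction of zero values among rows with each boolean label."""
    fractions = {}
    for label in (True, False):
        group = [v for v, l in zip(values, labels) if l == label]
        # An empty group has no meaningful fraction, so report NaN for it.
        fractions[label] = sum(v == 0 for v in group) / len(group) if group else float("nan")
    return fractions

# Made-up toy rows for illustration only.
expenditure = [120.0, 0.0, 35.5, 0.0, 0.0, 80.0]
card = [True, True, True, False, False, True]

result = zero_fraction_by_class(expenditure, card)
print(result)
```

A column whose zero fraction is close to 1.0 for one class and close to 0.0 for the other separates the target almost perfectly on its own, which is a reason to ask when and how that column is recorded.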


```python
# Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, cv=5, scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())
```

```
Cross-val accuracy: 0.833958
```


In this case, the predictors in `potential_leaks` (`expenditure`, `share`, `active`, and `majorcards`) are treated as leaky. `expenditure` appears to record spending on the card being applied for, so its value would only be known after the application has been approved, and `share` is computed from `expenditure`. `active` and `majorcards` are less clear-cut, but their descriptions suggest that they may also reflect the outcome of the application, so they are dropped as a precaution. None of these values can be relied on at the moment a new application has to be decided, however well they seem to predict approval in the historical data. <br><br>

With these leaky predictors included, the model had access to information that effectively revealed the target, which produced the overly optimistic cross-validation accuracy of about 98%. With them removed, accuracy drops to about 83%, which is a more realistic estimate of how the model would perform on new applications. Leaky predictors should therefore be removed before a model is built and evaluated.

As a simpler illustration, suppose diabetes (the target) depends strongly on a patient's `sugar_level`, but that measurement is not available at the moment you need to predict a new patient's condition. A model trained with `sugar_level` would fit the training set very closely, yet it may perform poorly on new patients, for whom the value is missing. Avoiding such predictors is what allows the model to predict the target accurately for new data.
