# Information Leakage

In this notebook we will explore how information about the test data can leak into model training. As an example, we will create a dummy dataset.


In [0]:
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

We will create a dataset with 500 rows and 10,000 feature columns. The features are drawn independently from a standard normal distribution (mean 0, standard deviation 1). Each target is one of two randomly chosen categories, independent of the features.

In [0]:
X = np.random.randn(500, 10000)
y = np.random.choice(2, size=500)

We will select the *best* 25 features to base our model on. Scikit-learn has a built-in class, `SelectKBest`, that scores each column against the targets (by default with an ANOVA F-test, `f_classif`) and keeps the `k` highest-scoring columns.

In [0]:
X_best = SelectKBest(k=25).fit_transform(X, y)
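
To make the selection step less of a black box, here is a minimal sketch on a much smaller random dataset. The names `X_demo`, `y_demo`, and `selector` are made up for this illustration and are not used elsewhere in the notebook. It shows that `SelectKBest` keeps exactly the columns with the highest scores.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Small illustrative dataset (hypothetical sizes, chosen only for speed)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 200))
y_demo = rng.integers(0, 2, size=100)

# f_classif is the default score function; it is spelled out here for clarity
selector = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# Indices of the columns that were kept
selected = selector.get_support(indices=True)

# The same indices, recovered by sorting the per-column scores ourselves
top5 = np.argsort(selector.scores_)[-5:]

X_demo_best = selector.transform(X_demo)
print(sorted(selected), sorted(top5))
print(X_demo_best.shape)
```

Note that the scores are computed from both the features and the targets, so the selector has "seen" the labels of every row it was fitted on.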

As usual, we split the data into a training set and a test set.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_best, y)

## Stop and think!

If we train a model with **X_train** and **y_train**, how well do you expect this model to perform on the test set?

Once you have thought about this question, you can see if the result is as expected.

In [0]:
model = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train)
performance = (model.predict(X_test) == y_test).mean()
print(performance)
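
For comparison, here is a sketch of the same experiment with the order of the steps swapped: split first, then fit the feature selector on the training rows only. The variable names (`X_raw`, `selector`, `honest_accuracy`, and so on) and the fixed seeds are assumptions made for this example. Because the data is regenerated here, the numbers are not directly comparable row for row with the cell above, only in spirit.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Regenerate a random dataset of the same shape (seeded for repeatability)
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((500, 10000))
y_raw = rng.integers(0, 2, size=500)

# Split BEFORE any step that looks at the targets
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, random_state=0)

# Fit the selector on the training rows only
selector = SelectKBest(k=25).fit(X_tr, y_tr)

# Train on the selected training columns, then apply the same selection to the test rows
clf = LogisticRegression(max_iter=1000).fit(selector.transform(X_tr), y_tr)
honest_accuracy = clf.score(selector.transform(X_te), y_te)
print(honest_accuracy)
```

Since the targets are random, the test rows played no part in choosing the columns, and the accuracy should land near chance level.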

We can also evaluate the model with cross-validation:

In [0]:
cross_val_score(LogisticRegression(solver='lbfgs', max_iter=1000), X_best, y, cv=5)
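
One common way to keep feature selection inside each cross-validation fold is to wrap it in a scikit-learn pipeline. The sketch below assumes a freshly generated dataset of the same shape; `leaky_scores`, `honest_scores`, and `pipeline` are names invented for this example. It runs both variants side by side.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Random features and random targets, as in the rest of the notebook (seeded here)
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10000))
y = rng.integers(0, 2, size=500)

# Leaky variant: columns are selected using all rows, then cross-validated
X_best = SelectKBest(k=25).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_best, y, cv=5)

# Pipeline variant: the selector is refitted on the training part of each fold
pipeline = make_pipeline(SelectKBest(k=25), LogisticRegression(max_iter=1000))
honest_scores = cross_val_score(pipeline, X, y, cv=5)

print("selection before CV:", leaky_scores.mean())
print("selection inside CV:", honest_scores.mean())
```

With purely random targets, the second number should sit close to 0.5, while the first is typically well above it even though there is nothing to learn.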