### Data Leakage
- A naive approach to data preparation fits the transform on the entire dataset before evaluating the performance of the model. This results in a problem referred to as **data leakage**, where knowledge of the hold-out test set leaks into the dataset used to train the model. This can result in an incorrect estimate of model performance when making predictions on new data.
- More generally, leakage occurs whenever a model is given information that it shouldn't have access to when making predictions in real time in production, for example when data from the future is leaked to the past.
- Applying data preparation techniques to the entire dataset is not a direct type of data leakage, where we would train the model on the test dataset. Instead, it is an indirect type of data leakage, where some knowledge about the test dataset, captured in summary statistics such as the minimum, maximum, or mean, is available to the model during training.

#### Example 1: Using Future Data (Time Series Leakage)
- Suppose you are predicting tomorrow's stock price using past data.
- Your dataset accidentally includes a feature like:

| Date  | Closing Price | 7-Day Moving Avg (calculated using future days!) |
| ----- | ------------- | ------------------------------------------------ |
| Jan 1 | 100           | 105                                              |
| Jan 2 | 98            | 103                                              |
| ...   |               |                                                  |

- If that moving average uses future prices, then your model is seeing information from after the prediction moment.
- 👉 In reality, when predicting Jan 1, you cannot know Jan 7 values.
- 📌 Result (illustrative numbers):
    - Training accuracy: 98%
    - Real-world accuracy: 50% or random
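
The difference between a leaky and a safe moving average can be sketched with pandas. This is a minimal illustration rather than the dataset from the table: the `prices` values and the 3-day window are made up for the example.

```python
import pandas as pd

# Hypothetical closing prices, one row per trading day (oldest first)
prices = pd.Series([100, 98, 101, 103, 99, 104, 106, 105], name="close")

# Leaky: the value at day t averages days t, t+1, t+2 (two future days)
leaky_ma = prices.rolling(window=3).mean().shift(-2)

# Safe: the value at day t averages days t-3, t-2, t-1 (past days only)
safe_ma = prices.rolling(window=3).mean().shift(1)

print(pd.DataFrame({"close": prices, "leaky_ma": leaky_ma, "safe_ma": safe_ma}))
```

The `shift(1)` step is what keeps the current day's price out of its own feature; without it, the window would still include the value at the prediction moment.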

#### Example 2: Target Leakage
- Suppose you are predicting whether a customer will default on a loan.
- The dataset includes a column `paid_late_last_month`, and the label you are predicting is `loan_default`.
- This feature is strongly correlated with the label, but it is recorded after the loan approval, so the model is learning something it wouldn't know at decision time.
- The model learns:
    - 👉 "If `paid_late_last_month` = Yes → predict default."
- But in real deployment this feature won't exist at the time of loan approval.
- 📌 Result: Unrealistically high validation score.
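
One way to guard against this is to drop every column that is only known after the decision before building the feature matrix. The sketch below assumes a small made-up `loans` table; the column values and the `post_decision_columns` list are hypothetical and would come from domain knowledge in practice.

```python
import pandas as pd

# Hypothetical loan records; all values are made up for illustration
loans = pd.DataFrame({
    "income": [52000, 31000, 78000, 45000],
    "credit_score": [710, 640, 760, 680],
    "paid_late_last_month": ["No", "Yes", "No", "Yes"],  # recorded after approval
    "loan_default": [0, 1, 0, 1],
})

# Assumption: these columns are only known after the approval decision
post_decision_columns = ["paid_late_last_month"]

# Features exclude the label and anything unavailable at decision time
X = loans.drop(columns=post_decision_columns + ["loan_default"])
y = loans["loan_default"]
print(list(X.columns))
```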

#### Example 3: Data Split Leakage
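- Suppose you scale or normalize the full dataset before splitting into train and test:

```
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)   # WRONG: fit on all rows of df, test rows included
train, test = train_test_split(df_scaled)
```
- `fit_transform` used statistics from the whole dataset, including the test set.
- So the model indirectly sees the test distribution.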

#### How to Prevent Leakage
- ✔ Split data before preprocessing, scaling, and feature selection
- ✔ For time series: split chronologically (train → past, test → future)
- ✔ Avoid using future or outcome-derived features
- ✔ Validate feature logic: "Would I know this before prediction?"
- ✔ Use pipelines (`sklearn.pipeline.Pipeline`) to prevent accidental leakage
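
For the chronological split, scikit-learn's `TimeSeriesSplit` is one option. The sketch below is a minimal illustration on made-up data and assumes the rows are already sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 observations assumed to be sorted from oldest to newest
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training row precedes every test row, so no future data leaks back
    assert train_idx.max() < test_idx.min()
    print(f"Fold {fold}: train rows 0-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}")
```

Unlike a shuffled `train_test_split`, each fold here trains only on rows that come before the rows it is tested on.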

#### Quick Detection Rule
- Ask this question:
    - Would this information be available at the exact moment the prediction is made?
    - If the answer is No → leakage.

#### Generate Test Datasets Using sklearn
- sklearn's `datasets` module has various data generation functions.
    - Examples: `make_classification`, `make_regression`, `make_circles`, `make_moons`
- It also bundles small real datasets.
    - Examples: `load_iris`, `load_diabetes`, `load_wine` (`load_boston` was removed in scikit-learn 1.2)
- We can use `faker` or `synthetic-data-generator` to generate synthetic data.
- We can also use NumPy/pandas to build DataFrames manually with `np.random.randn` and `np.random.randint`.

- https://scikit-learn.org/stable/api/sklearn.datasets.html

- https://github.com/mwaskom/seaborn-data/tree/master
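
As a minimal sketch of the manual NumPy/pandas route, the snippet below builds a small DataFrame from random draws. The column names and the row count are arbitrary choices for illustration, and the labels are random, so they carry no relationship to the features.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # make the random draws reproducible

n_rows = 100
df = pd.DataFrame({
    "feature_a": np.random.randn(n_rows),           # standard normal floats
    "feature_b": np.random.randn(n_rows),
    "label": np.random.randint(0, 2, size=n_rows),  # random 0/1 labels
})

print(df.shape)
print(df.head())
```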



In [2]:
# List all the data generation functions present in sklearn
from sklearn import datasets
import pprint


# List all callables that start with "make_"
make_funcs = [name for name in dir(datasets) if name.startswith("make_")]
pprint.pprint(make_funcs)

['make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circles',
 'make_classification',
 'make_friedman1',
 'make_friedman2',
 'make_friedman3',
 'make_gaussian_quantiles',
 'make_hastie_10_2',
 'make_low_rank_matrix',
 'make_moons',
 'make_multilabel_classification',
 'make_regression',
 'make_s_curve',
 'make_sparse_coded_signal',
 'make_sparse_spd_matrix',
 'make_sparse_uncorrelated',
 'make_spd_matrix',
 'make_swiss_roll']


In [6]:
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(
        n_samples=1000,      # total number of data points (rows) to generate
        n_features=20,      # total number of columns (features) per sample
        n_informative=15,   # how many of those 20 features actually influence the class label
        n_redundant=5,      # how many features are linear combinations of the informative ones
        random_state=7)     # seed for the random number generator → reproducible results

# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


#### Model training with Data Leakage   

In [8]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the whole dataset to [0, 1] (leaky: the scaler also sees the future test rows)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 84.848


#### Model training without Data Leakage   
 

In [9]:
# correct approach: split the data first, then fit the scaler on the training set only

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 85.455


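- In this case, the estimate without data leakage is about 85.455 percent, compared with 84.848 percent for the evaluation with data leakage in the previous section.
- The difference is small here, and the leaky estimate happens to be the lower one. The direction is not the point: we expect data leakage to result in an incorrect estimate of model performance, and the leak-free procedure is the one that mirrors how the model is used on new data.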
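
To see what "knowledge captured in summary statistics" means for this dataset, the sketch below compares a `MinMaxScaler` fit on all rows with one fit on the training rows only. It reuses the dataset and split parameters from the cells above; the variable names `leaky_scaler`, `clean_scaler`, and `n_changed` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

leaky_scaler = MinMaxScaler().fit(X)        # sees training and test rows
clean_scaler = MinMaxScaler().fit(X_train)  # sees training rows only

# A feature is affected when its minimum or maximum comes from a test row
changed = (leaky_scaler.data_min_ != clean_scaler.data_min_) | \
          (leaky_scaler.data_max_ != clean_scaler.data_max_)
n_changed = int(changed.sum())
print(f"{n_changed} of {X.shape[1]} features have a min or max set by a test row")
```

Any feature counted here is scaled differently in the leaky run than it would be with training data alone, which is the indirect route by which test information reaches the model.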

#### Data Preparation With k-fold Cross-Validation
- **k-fold cross-validation** splits a dataset into k non-overlapping groups of rows (folds). The model is trained on all folds but one and then evaluated on the held-out fold.
- This process is repeated so that each fold is used once as the hold-out test set. Finally, the average performance across all evaluations is reported.
- The k-fold cross-validation procedure generally gives a more reliable estimate of model performance than a train-test split, although it is more computationally expensive given the repeated fitting and evaluation of models.
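
The fold mechanics can be checked with a small sketch using scikit-learn's `KFold` on ten made-up rows: the train and test indices never overlap, and every row lands in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy rows

kf = KFold(n_splits=5, shuffle=True, random_state=1)
test_rows = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Training and held-out rows are disjoint within a fold
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    test_rows.extend(test_idx.tolist())
    print(f"Fold {fold}: train={train_idx.tolist()}, test={test_idx.tolist()}")

# Across all folds, each row was held out exactly once
assert sorted(test_rows) == list(range(10))
```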

- `RepeatedStratifiedKFold` is a cross-validation iterator from scikit-learn (`sklearn.model_selection`).
- It combines two ideas:
    - **StratifiedKFold**: the data are split into K folds while preserving the class-label distribution in each fold (i.e., each fold has roughly the same proportion of each class as the whole dataset). This is important for classification problems with imbalanced classes.
    - **RepeatedKFold**: the whole K-fold splitting process is performed multiple times, each time with a different random shuffling of the data. This yields a larger set of train-test splits, which can give a more reliable estimate of model performance.

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(
        n_splits=10,      # create 10 folds (i.e., 10-fold CV)
        n_repeats=3,      # repeat the 10-fold split 3 times → 30 total train/test splits
        random_state=1)   # seed for reproducibility of the random shuffling

In [12]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

# -------------------------------------------------
# 1. Create a tiny binary-classification dataset
# -------------------------------------------------
X, y = make_classification(
    n_samples=30,          # only 30 rows, easy to inspect
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.7, 0.3],    # 70% class 0, 30% class 1 (imbalanced)
    random_state=42,
)

# -------------------------------------------------
# 2. Set up repeated stratified K-fold
#    - 3 folds per repeat
#    - repeat the whole split 2 times
# -------------------------------------------------
n_splits, n_repeats = 3, 2
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1)

# -------------------------------------------------
# 3. Iterate over the generated splits
# -------------------------------------------------
for i, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    # Which repeat are we in? (every n_splits folds belong to the same repeat)
    # Note: cv.get_n_splits() returns n_splits * n_repeats (6 here), so use n_splits directly
    repeat_num = (i - 1) // n_splits + 1
    fold_num   = (i - 1) % n_splits + 1

    # Count how many samples of each class appear in train / test
    train_counts = np.bincount(y[train_idx], minlength=2)
    test_counts  = np.bincount(y[test_idx],  minlength=2)

    print(f"Repeat {repeat_num}, Fold {fold_num}")
    print(f"  Train class distribution → 0:{train_counts[0]}, 1:{train_counts[1]}")
    print(f"  Test  class distribution → 0:{test_counts[0]}, 1:{test_counts[1]}")
    print("---")

Repeat 1, Fold 1
  Train class distribution → 0:14, 1:6
  Test  class distribution → 0:7, 1:3
---
Repeat 1, Fold 2
  Train class distribution → 0:14, 1:6
  Test  class distribution → 0:7, 1:3
---
Repeat 1, Fold 3
  Train class distribution → 0:14, 1:6
  Test  class distribution → 0:7, 1:3
---
Repeat 2, Fold 1
  Train class distribution → 0:14, 1:6
  Test  class distribution → 0:7, 1:3
---
Repeat 2, Fold 2
  Train class distribution → 0:14, 1:6
  Test  class distribution → 0:7, 1:3
---
Repeat 2, Fold 3
  Train class distribution → 0:14, 1:6
  Test  class distribution → 0:7, 1:3
---


#### Cross-Validation Evaluation With Naive Data Preparation (with Data Leakage)

In [13]:
# naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the whole dataset to [0, 1] (leaky: the scaler sees the rows of every held-out fold)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# define the model
model = LogisticRegression()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Accuracy: 85.300 (3.607)


#### Cross-Validation Evaluation With Correct Data Preparation (without Data Leakage)
- Data preparation without data leakage is slightly more challenging when using cross-validation.

- Each preprocessing technique must be fit only on the training rows of a fold and then applied to the held-out rows (and, later, to future data).

- We can achieve this by defining a modeling pipeline: a sequence of data preparation steps that ends in the model to fit and evaluate.

- The evaluation procedure then treats the entire pipeline of data preparation and model together as a single atomic unit. This can be achieved using the `Pipeline` class.

- 
    ```
    # define the pipeline
    steps = list()
    steps.append(('scaler', MinMaxScaler())) 
    steps.append(('model', LogisticRegression())) 
    pipeline = Pipeline(steps=steps)
    ```
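
To make the "single atomic unit" idea concrete, the sketch below spells out roughly what cross-validating a pipeline amounts to: for each fold, a fresh copy of the pipeline is fit on the training rows only, so the scaler never sees the held-out rows. It uses a smaller made-up dataset and a plain `StratifiedKFold` to keep it short; the manual loop illustrates the idea and is not scikit-learn's internal code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_redundant=2, random_state=7)
pipeline = Pipeline(steps=[("scaler", MinMaxScaler()), ("model", LogisticRegression())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Manual loop: scaler and model are both refit on each fold's training rows
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_pipeline = clone(pipeline)                # unfitted copy for this fold
    fold_pipeline.fit(X[train_idx], y[train_idx])  # scaler statistics from train rows only
    manual_scores.append(fold_pipeline.score(X[test_idx], y[test_idx]))

# The same evaluation via cross_val_score
auto_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print(np.allclose(manual_scores, auto_scores))
```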


In [15]:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Accuracy: 85.433 (3.471)
