# Intermediate Machine Learning
(https://www.kaggle.com/learn/intermediate-machine-learning)

# --- 1. Missing Values ---

## Three Approaches
- Drop Columns with Missing Values
- Imputation (for instance, fill in the mean value)
- Extension to Imputation (impute missing values and add columns that show the location of imputed entries)
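
As a rough illustration of the three approaches, here is a minimal sketch on a made-up four-row table. The column names `rooms` and `beds` are hypothetical, and plain pandas `fillna` stands in for the scikit-learn imputer used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'beds' has one missing value, 'rooms' has none
df = pd.DataFrame({"rooms": [3, 4, 2, 5],
                   "beds": [1.0, np.nan, 2.0, 3.0]})

# Approach 1: drop every column that contains a missing value
dropped = df.dropna(axis=1)

# Approach 2: impute each missing entry with its column mean
imputed = df.fillna(df.mean())

# Approach 3: record where values were missing, then impute
extended = df.copy()
extended["beds_was_missing"] = extended["beds"].isnull()
extended["beds"] = extended["beds"].fillna(extended["beds"].mean())

print(list(dropped.columns))                  # ['rooms']
print(imputed["beds"].tolist())               # [1.0, 2.0, 2.0, 3.0]
print(extended["beds_was_missing"].tolist())  # [False, True, False, False]
```

Approach 1 loses the `beds` column entirely, while Approach 3 keeps both the imputed value and a flag that the model can use.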

### Drop Columns with Missing Values
# <img src="img/1.DropColumnswithMissingValues.png" alt="Drop Columns with Missing Values" width = "1000" heigth = "1000">

### Imputation
# <img src="img/1.Imputation.png" alt="Imputation" width = "1000" heigth = "1000">

### Extension to Imputation
# <img src="img/1.ExtensionImputation.png" alt="Extension to Imputation" width = "1000" heigth = "1000">

Working with the Melbourne Housing Dataset

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Load the data
data = pd.read_csv('./melb_data.csv')

# Select target
y = data.Price

# To keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

Define score_dataset() to compare the approaches by mean absolute error (MAE)

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [4]:
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

## Score  from 1st Approach (Drop Columns with Missing Values)

In [5]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
print(cols_with_missing)

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values): ")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

['Car', 'BuildingArea', 'YearBuilt']
MAE from Approach 1 (Drop columns with missing values): 
183550.22137772635


## Score  from 2nd Approach (Imputation)

In [6]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE from Approach 2 (Imputation):
178166.46269899711


## Score  from 3rd Approach (Extension to Imputation)

In [7]:
# Make copy to avoid changing original data (when imputing)

X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
178927.503183954


Conclusion\
As is common, imputing missing values (Approaches 2 and 3) yielded better results than simply dropping the columns with missing values (Approach 1). Here the extension (Approach 3) scored slightly worse than plain imputation (Approach 2).

In [8]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64


# --- 2. Categorical Variables ---

A **categorical variable** takes only a limited number of values.

## Three Approaches
- Drop Categorical Variables (works well only if the columns do not contain useful information)
- Ordinal Encoding (assigns each unique value a different integer; best suited to ordinal variables)
- One-Hot Encoding (creates new columns indicating the presence of each possible value)

ordinal variables: categorical variables with an indisputable ranking of the categories\
nominal variables: categorical variables without an intrinsic ranking
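
To make the two encodings concrete, here is a small sketch on made-up data. The columns `breakfast` and `color` and the ranking in `order` are illustrative assumptions, and plain pandas (`map`, `get_dummies`) is used instead of the scikit-learn encoders shown later.

```python
import pandas as pd

# Hypothetical toy data: 'breakfast' is ordinal, 'color' is nominal
df = pd.DataFrame({
    "breakfast": ["Never", "Rarely", "Every day", "Rarely"],
    "color": ["Red", "Yellow", "Green", "Red"],
})

# Ordinal encoding: the integers follow a ranking we state explicitly
order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
df["breakfast_encoded"] = df["breakfast"].map(order)

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df["breakfast_encoded"].tolist())  # [0, 1, 3, 1]
print(sorted(one_hot.columns))           # ['color_Green', 'color_Red', 'color_Yellow']
```

The ordinal codes only make sense because the categories have a ranking; for `color`, any integer assignment would impose an order that does not exist, which is why one-hot encoding is the safer default for nominal variables.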

### Ordinal Encoding
# <img src="img/2.OrdinalEncoding.png" alt="Ordinal Encoding" width = "1000" heigth = "1000">

### One-Hot Encoding
# <img src="img/2.OneHoteEncoding.png" alt="One-Hot Encoding" width = "1000" heigth = "1000">

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [10]:
# Read the data
data = pd.read_csv('./melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()] 
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [11]:
X_train.head()

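|       | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|-------|------|--------|------------|-------|----------|----------|----------|----------|----------|-----------|------------|---------------|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |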


For this dataset, the columns with text indicate categorical variables.

In [12]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print('Categorical variables: ')
print(object_cols)

Categorical variables: 
['Type', 'Method', 'Regionname']


In [13]:
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

## Score  from 1st Approach (Drop Categorical Variables)

We drop the object columns with the select_dtypes() method.

In [14]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables): ")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables): 
175703.48185157913


## Score  from 2nd Approach (Ordinal Encoding)

Scikit-learn has an OrdinalEncoder class that can be used to get ordinal encodings.

In [15]:
from sklearn.preprocessing import OrdinalEncoder

In [16]:
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))


MAE from Approach 2 (Ordinal Encoding):
165936.40548390493


## Score  from 3rd Approach (One-Hot Encoding)

We use the OneHotEncoder class from scikit-learn to get one-hot encodings.
- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data
- Setting sparse_output=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix).
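
The effect of handle_unknown='ignore' can be seen on a tiny made-up example. The column `kind` and its values are hypothetical. To stay independent of the scikit-learn version, this sketch keeps the default sparse output and converts it with `.toarray()` rather than passing sparse_output=False.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'c' appears only in the validation set
train = pd.DataFrame({"kind": ["a", "b", "a"]})
valid = pd.DataFrame({"kind": ["b", "c"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; .toarray() makes it dense
encoded_valid = encoder.transform(valid).toarray()
print(encoded_valid)
# [[0. 1.]
#  [0. 0.]]
```

The unseen value `c` becomes a row of zeros instead of raising an error, so the model simply sees "none of the known categories".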

In [17]:
from sklearn.preprocessing import OneHotEncoder

In [18]:
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features 
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))


MAE from Approach 3 (One-Hot Encoding):
166089.4893009678


The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!

### Handling Categories That Appear in Validation Data but Not in Training Data

In [19]:
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

Categorical columns that will be ordinal encoded: ['Type', 'Method', 'Regionname']

Categorical columns that will be dropped from the dataset: []
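
Dropping problematic columns is one option. As a possible alternative that this notebook does not use, scikit-learn's OrdinalEncoder can map unseen categories to a placeholder value. The sketch below assumes a hypothetical column `grade` and the placeholder -1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 'z' appears only in the validation set
train = pd.DataFrame({"grade": ["x", "y", "x"]})
valid = pd.DataFrame({"grade": ["y", "z"]})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded_valid = encoder.transform(valid)
print(encoded_valid.ravel())  # [ 1. -1.]
```

Whether a placeholder is acceptable depends on the model; tree-based models usually tolerate it, but it is worth validating on your own data.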


### Get number of Unique Entries

In [20]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Type', 3), ('Method', 5), ('Regionname', 8)]

### Get Columns with Cardinality Below 10

In [21]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
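
Cardinality matters because one-hot encoding creates one column per unique value. A quick back-of-the-envelope sketch, using the unique-value counts printed earlier for this dataset:

```python
# Unique-value counts copied from the output above
cardinality = {"Type": 3, "Method": 5, "Regionname": 8}

# One-hot encoding replaces each column with one indicator column per value
n_one_hot_cols = sum(cardinality.values())
n_added = n_one_hot_cols - len(cardinality)

print(n_one_hot_cols, n_added)  # 16 13
```

Three categorical columns become sixteen indicator columns here, which is manageable. A column with hundreds of unique values would inflate the dataset far more, which is why high-cardinality columns are typically dropped or ordinal encoded instead.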


# --- 3. Pipelines ---

Pipelines streamline the process of building machine learning models by bundling preprocessing and modeling steps into one cohesive workflow. Instead of handling data manually at every stage, a pipeline lets you treat the entire process as a single unit.

Key benefits include:
- Cleaner code with less data tracking overhead.
- Fewer bugs thanks to consistent preprocessing.
- Easier deployment, making models more production-ready.
- Improved validation through tools like cross-validation.

It's like turning a jumble of parts into a smooth assembly line: one step in, one prediction out.

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('./melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [23]:
X_train.head()

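|       | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|-------|------|--------|------------|-------|----------|----------|----------|----------|-----|----------|--------------|-----------|-----------|------------|---------------|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1940.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 1.0 | 193.0 | NaN | NaN | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 1.0 | 555.0 | NaN | NaN | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 1.0 | 265.0 | NaN | 1995.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 2.0 | 673.0 | 673.0 | 1970.0 | -37.7623 | 144.8272 | 4217.0 |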


## Step 1 : Define Preprocessing Steps
$ColumnTransformer$ acts like a mini-pipeline for preprocessing: it lets you apply different transformations to different columns all in one go. In this case, it:
- Fills in missing values for numerical columns.
- Fills in missing values and then one-hot encodes the categorical columns.

It’s a clean and efficient way to prep mixed-type data before feeding it into a model.

In [24]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
# (strategy='constant' fills missing values with fill_value, which defaults to 0 for numeric data)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
])

## Step 2 : Define the Model

In [25]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

## Step 3 : Create and Evaluate the Pipeline
The $Pipeline$ class allows you to bundle preprocessing and modeling into one streamlined process. This means:
- You can fit the model and preprocess the training data in one line, avoiding the chaos of handling each step manually.
- When predicting, you simply pass in the raw validation data, and the pipeline handles all necessary preprocessing automatically.

In short: fewer steps, cleaner code, and less chance of forgetting a crucial transformation.
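
To check that a pipeline really does the same work as the manual steps, here is a minimal sketch on made-up data. The arrays are hypothetical, and LinearRegression is swapped in for the random forest only so that the comparison is exactly deterministic.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with one missing value in each set
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_valid = np.array([[np.nan], [3.0]])

# Manual route: fit the imputer on training data, reuse it on validation data
imputer = SimpleImputer()
model = LinearRegression().fit(imputer.fit_transform(X_train), y_train)
manual_preds = model.predict(imputer.transform(X_valid))

# Pipeline route: the same two steps behind a single fit/predict
pipe = Pipeline(steps=[("imputer", SimpleImputer()),
                       ("model", LinearRegression())])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_valid)

print(np.allclose(manual_preds, pipe_preds))  # True
```

Both routes give the same predictions; the pipeline just removes the need to remember which transformer was fit on which data.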

In [26]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                              ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print("MAE: ", score)

MAE:  160679.18917034855


Conclusion : Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.

# --- 4. Cross-Validation ---

Cross-validation is a technique for evaluating model performance more reliably by splitting the data into multiple parts (folds) and rotating which part is used for validation.
For example, in 5-fold cross-validation:
- The data is divided into 5 equal parts.
- The model is trained on 4 parts and validated on the remaining 1 — repeating this 5 times so each part serves as validation once.

This way, every data point is used for validation exactly once, and the overall model quality is based on all parts of the dataset. It helps reduce bias and gives a more robust estimate of how the model will perform on unseen data.
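
The fold rotation can be made visible with scikit-learn's KFold splitter. This is a sketch on ten hypothetical samples, not the housing data.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten hypothetical samples

seen_in_validation = []
for fold, (train_idx, valid_idx) in enumerate(KFold(n_splits=5).split(X)):
    print(f"fold {fold}: validate on rows {valid_idx.tolist()}")
    seen_in_validation.extend(valid_idx.tolist())

# Every row lands in a validation fold exactly once
print(sorted(seen_in_validation) == list(range(10)))  # True
```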

# <img src="img/4.CrossValidation.png" alt="Cross-Validation" width = "1000" heigth = "1000">

## When should you use cross-validation?
Use cross-validation when accuracy matters more than speed—especially for small datasets or when you’re fine-tuning your model.

Use a single validation set when working with large datasets or when your model takes a long time to train.

Still unsure? Try both approaches—if your cross-validation scores are consistent, a single split might be all you need.

In [34]:
import pandas as pd

# Read the data
data = pd.read_csv('./melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

In [36]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

We obtain the cross-validation scores with the $cross\_val\_score()$ function from scikit-learn. We set the number of folds with the cv parameter.

In [38]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

print("MAE scores :\n", scores)

MAE scores :
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


In [39]:
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
277707.3795913405
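
One way to judge whether the folds are "consistent", in the sense discussed earlier, is to look at their spread relative to the mean. This sketch reuses the five fold scores printed above; the choice of standard deviation divided by mean as the yardstick is an assumption, not a rule from the course.

```python
import numpy as np

# Fold scores copied from the output above
scores = np.array([301628.7893587, 303164.4782723, 287298.331666,
                   236061.84754543, 260383.45111427])

mean_mae = scores.mean()
spread = scores.std()

# Relative spread: how much the folds disagree, as a fraction of the mean
print(f"mean MAE: {mean_mae:.0f}, std: {spread:.0f}, ratio: {spread / mean_mae:.2f}")
```

A small ratio suggests a single validation split would tell a similar story; a large one suggests that any single split could be misleading.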


Conclusion: Using cross-validation yields a much better measure of model quality, with the added benefit of cleaning up our code: note that we no longer need to keep track of separate training and validation sets. So, especially for small datasets, it's a good improvement!

# --- 5. XGBoost ---

**Ensemble methods** combine the predictions of several models (e.g., several trees, in the case of random forests).

**Gradient Boosting** is an iterative ensemble method that builds models one at a time, with each new model correcting the errors of the combined previous ones.

Here’s the cycle in a nutshell:
- Start with a basic (even inaccurate) model.
- Use it to make predictions and calculate the loss (how wrong the predictions are).
- Train a new model to predict the errors (gradients) and reduce the loss.
- Add this new model to the ensemble.
- Repeat the process until performance stabilizes or a limit is reached.

It’s like assembling a team of increasingly specialized experts, each one joining to patch up where the others fell short.
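
The cycle above can be sketched by hand for squared-error loss, where the errors each new model is trained on are simply the residuals. This is an illustrative toy, not how XGBoost is implemented: the data, the tree depth, the learning rate of 0.1 and the 50 rounds are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy regression problem
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # naive starting model
initial_mse = np.mean((y - prediction) ** 2)

ensemble = []
for _ in range(50):
    residuals = y - prediction            # negative gradient of squared loss (up to a constant)
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                # new model targets the current errors
    prediction += learning_rate * tree.predict(X)
    ensemble.append(tree)                 # add it to the ensemble

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```

Each round shrinks the training error a little; the learning rate controls how much of each new tree's correction is applied.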

# <img src="img/5.GradientBoosting.png" alt="Gradient Boosting" width = "1000" heigth = "1000">

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('./melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)


**XGBoost** stands for extreme gradient boosting, which is an implementation of gradient boosting with several additional features focused on performance and speed.

In [3]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

In [4]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))

Mean Absolute Error: 241542.77457198084


$n\_estimators$ in XGBoost controls how many models (trees) are added to the ensemble, that is, how many boosting rounds are performed.
- Low values → underfitting (model too simple, poor accuracy on both training and test sets).
- High values → overfitting (model memorizes training data, but struggles on test data).
- Typical range: 100 to 1000, often tuned alongside learning_rate to find the right balance.

In [5]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
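
To see how the number of trees affects validation error, one option is to track the error after each boosting round. The sketch below uses scikit-learn's GradientBoostingRegressor and its staged_predict method as a stand-in for XGBoost, on made-up data, so the numbers say nothing about the housing model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical noisy toy data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                  random_state=0)
model.fit(X_tr, y_tr)

# staged_predict yields validation predictions after 1, 2, ..., 300 trees
valid_mae = [mean_absolute_error(y_va, preds)
             for preds in model.staged_predict(X_va)]
best_n = int(np.argmin(valid_mae)) + 1

print(f"lowest validation MAE at {best_n} of {len(valid_mae)} trees")
```

If the best round comes well before the last one, the extra trees are fitting noise, which is the overfitting risk described above.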

# --- 6. Data Leakage ---

Data leakage occurs when a model is trained with information it wouldn’t realistically have during real-world predictions, leading to overly optimistic performance during training or validation but poor performance in production.

There are two main types:
- **Target Leakage**: This happens when features used for training include future information that won’t be available at prediction time. For example, using whether a patient took antibiotics to predict pneumonia, even though antibiotics are given after diagnosis. To avoid this, exclude features created after the target event.
# <img src="img/6.TargetLeakage.png" alt="Target Leakage" width = "1000" heigth = "1000">

- **Train-Test Contamination**: This occurs when validation data influences how the model is trained, for example when a preprocessing step such as an imputer is fit on the full dataset before it is split. The validation scores then look better than they should. To prevent it, split the data first and fit every preprocessing step on the training portion only, ideally using tools like scikit-learn pipelines to keep things clean and contained.
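
Train-test contamination can be shown with a single imputer. In this sketch the six-row feature is made up, and the validation rows are deliberately extreme so the effect is easy to see.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy feature with one missing value
X = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0], [200.0]])
X_train, X_valid = train_test_split(X, test_size=2, shuffle=False)

# Contaminated: the imputer is fit on all rows, including validation rows
leaky_mean = SimpleImputer().fit(X).statistics_[0]

# Clean: the imputer is fit on training rows only
clean_mean = SimpleImputer().fit(X_train).statistics_[0]

print(leaky_mean, clean_mean)  # 61.2 2.0
```

With the contaminated imputer, the missing training value would be filled with 61.2, a number that only exists because of the validation rows. Putting the imputer inside a pipeline and passing the pipeline to cross_val_score avoids this, because the imputer is refit on the training folds each time.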

In [1]:
import pandas as pd

# Read the data
data = pd.read_csv('./AER_credit_card_data.csv', true_values=['yes'], false_values=['no'])

# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)

print("Number of rows in the dataset:", X.shape[0])
X.head()

Number of rows in the dataset: 1319


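|   | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |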


In [4]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')
print("Cross-validation accuracy: ", cv_scores.mean())

Cross-validation accuracy:  0.9802915082382764


In [5]:
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print("Fraction of those who did not receive a card and had no expenditures: ", (expenditures_noncardholders == 0).mean())
print("Fraction of those who received a card and had no expenditures: ",(expenditures_cardholders == 0).mean())

Fraction of those who did not receive a card and had no expenditures:  1.0
Fraction of those who received a card and had no expenditures:  0.020527859237536656


Everyone who did not receive a card had no expenditures, while only about 2% of cardholders had none. This suggests that **expenditure** reflects how much was spent after getting the card, which is future information that shouldn't be used in prediction, and it falsely boosted the model's accuracy. Since **share** depends on **expenditure**, it carries the same leakage risk. Variables like **active** and **majorcards** might be problematic too. If their connection to the outcome isn't clearly safe, it's best to leave them out to avoid hidden leakage.

In [9]:
# Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, cv=5, scoring='accuracy')
print("Cross-val accuracy: ", cv_scores.mean())

Cross-val accuracy:  0.8354735568613896


The model now shows lower accuracy, which may seem discouraging, but it is a more reliable estimate of performance on new, unseen data: about 84% accuracy in this run.

### Conclusion
Data leakage can lead to extremely costly errors in data science. To avoid this:
- Keep training and validation data strictly separate to prevent contamination.
- Use pipelines to ensure consistent and clean preprocessing.
- Apply common sense and thorough data exploration to catch any hidden target leakage early on.