# ML&F - Scikit-learn ML Pipelines

## Step 0: Load the dataset


We first define our data and target. In this case we build a classification model: the target indicates whether a house sold for more than 200,000 dollars.

In [20]:
import pandas as pd

ames_housing = pd.read_csv("datasets/house_prices.csv", na_values="?")
ames_housing.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [21]:
target_name = "SalePrice"
data, target = (
    ames_housing.drop(columns=target_name),
    ames_housing[target_name],
)
target = (target > 200_000).astype(int)
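
The comparison `target > 200_000` produces a boolean Series, and `.astype(int)` maps it to 0/1 labels. As a minimal sketch, the same operation can be tried on the five sale prices shown in the first output above; the names `prices` and `labels` are only illustrative:

```python
import pandas as pd

# Sale prices copied from the first five rows displayed above.
prices = pd.Series([208500, 181500, 223500, 140000, 250000], name="SalePrice")

# True where the price exceeds the threshold, then cast to 0/1.
labels = (prices > 200_000).astype(int)
print(labels.tolist())  # [1, 0, 1, 0, 1]

# Share of each class; worth checking before picking an evaluation metric.
print(labels.value_counts(normalize=True))
```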

We inspect the resulting dataframe; the notebook displays its first and last rows:



In [22]:
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,8,2007,WD,Normal
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,,0,2,2010,WD,Normal
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,0,,GdPrv,Shed,2500,5,2010,WD,Normal
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2010,WD,Normal


In [23]:
target

0       1
1       0
2       1
3       0
4       1
       ..
1455    0
1456    1
1457    1
1458    0
1459    0
Name: SalePrice, Length: 1460, dtype: int64

For the sake of simplicity, we can cherry-pick some features and only retain this arbitrary subset of data:



In [24]:
numeric_features = ["LotArea", "FullBath", "HalfBath"]
categorical_features = ["Neighborhood", "HouseStyle"]
data = data[numeric_features + categorical_features]
data.head()

Unnamed: 0,LotArea,FullBath,HalfBath,Neighborhood,HouseStyle
0,8450,2,1,CollgCr,2Story
1,9600,2,0,Veenker,1Story
2,11250,2,1,CollgCr,2Story
3,9550,1,0,Crawfor,2Story
4,14260,2,1,NoRidge,2Story


## Step 1: Train-Test Split (Critical First Step!)

**Before any data preprocessing or pipeline creation**, we must split our data into training and testing sets. This prevents data leakage by ensuring that:
- No information from the test set influences our preprocessing steps
- Our model evaluation is unbiased and realistic
- We follow proper ML workflow practices

In [25]:
from sklearn.model_selection import train_test_split

# Split data BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    data, target, 
    test_size=0.2, 
    random_state=42, 
    stratify=target  # Preserves the class proportions in both splits
)
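
To see what `stratify` does, here is a small sketch on made-up labels (70% class 0, 30% class 1). The arrays `X_demo` and `y_demo` are assumptions for illustration and are not part of the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 samples of class 0, 30 of class 1.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the 30% share of class 1.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the class shares in the two splits would vary with the random seed, which matters most for small or imbalanced datasets.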

## Step 2: Create the ML Pipeline for Preprocessing

Now we create our pipeline. The key advantage is that all preprocessing steps will be applied consistently to both training and test data, and the pipeline learns preprocessing parameters **only from the training data**.

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        (
            "scaler",
            StandardScaler(),
        ),
    ]
)

categorical_transformer = OneHotEncoder(handle_unknown="ignore")
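
The `handle_unknown="ignore"` option matters when the test set contains a category that was absent from the training set. A minimal sketch on toy data (the value `"SplitLevel"` is made up to play the role of an unseen category):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_demo = pd.DataFrame({"HouseStyle": ["1Story", "2Story", "1Story"]})
# "SplitLevel" is a hypothetical category that never appears in train_demo.
new_demo = pd.DataFrame({"HouseStyle": ["2Story", "SplitLevel"]})

encoder = OneHotEncoder(handle_unknown="ignore").fit(train_demo)
encoded = encoder.transform(new_demo).toarray()

print(encoder.categories_)  # categories learned from the training rows only
print(encoded)              # the unseen category becomes an all-zero row
```

With the default `handle_unknown="error"`, the same `transform` call would raise an error instead.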

The next step is to route each group of columns to its transformer using ``ColumnTransformer``:



In [27]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
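
As a rough illustration of what the `ColumnTransformer` produces, the sketch below applies the same kind of preprocessor to a four-row toy dataframe. The `Id` column and all values are assumptions chosen for the example. Columns not listed in `transformers` are dropped by default (`remainder="drop"`).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],  # not listed below, so it is dropped
        "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
        "HouseStyle": ["2Story", "1Story", "2Story", "2Story"],
    }
)

toy_numeric = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", toy_numeric, ["LotArea"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["HouseStyle"]),
    ]
)

transformed = toy_preprocessor.fit_transform(toy)
# One scaled numeric column plus two one-hot columns.
print(transformed.shape)  # (4, 3)
```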

Then we define the model and chain the preprocessing and the classifier, in order, in a single ``Pipeline``.

**Important**: The pipeline must be fitted only on training data (we do this in Step 3). This ensures that:
- Imputation values (median) are calculated only from training data
- Scaling parameters (mean, std) are calculated only from training data  
- One-hot encoding learns categories only from training data
- No information leaks from the test set

In [28]:
from sklearn.linear_model import LogisticRegression

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)
model
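
Once a pipeline like this is fitted, the statistics learned by each step can be read back through `named_steps`. The sketch below uses a simplified single-column pipeline on made-up rows so that the values are easy to verify by hand; `X_toy`, `y_toy`, and `toy_model` are illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training rows with one missing value.
X_toy = pd.DataFrame(
    {"LotArea": [8000.0, 9000.0, np.nan, 12000.0, 15000.0, 7000.0]}
)
y_toy = [0, 0, 0, 1, 1, 0]

toy_model = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)
toy_model.fit(X_toy, y_toy)

# Median of the observed values, computed from the rows passed to fit.
learned_median = toy_model.named_steps["imputer"].statistics_[0]
# Mean used for scaling, computed after imputation.
learned_mean = toy_model.named_steps["scaler"].mean_[0]
print(learned_median, learned_mean)  # 9000.0 10000.0
```

Because these statistics come only from the rows passed to `fit`, calling `fit` on the training set alone keeps test information out of them.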

## Step 3: Train the Pipeline

In [29]:
model.fit(X_train, y_train)  # Fit ONLY on training data

## Step 4: Evaluate on Test Set

Now we can evaluate our trained pipeline on the test set to get an unbiased performance estimate:

### Option 1: Simple Train-Test Evaluation

In [30]:
test_score = model.score(X_test, y_test)
train_score = model.score(X_train, y_train)

print(f"Training accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")
print(f"Difference: {abs(train_score - test_score):.3f}")

Training accuracy: 0.874
Test accuracy: 0.866
Difference: 0.008


### Option 2: Cross-Validation on Training Set

In [31]:
# We can also do cross-validation, but ONLY on training data
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, X_train, y_train, cv=5)
scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.860 ± 0.022
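
When a whole pipeline is passed to `cross_validate`, it is re-fitted on the training part of every fold, so the preprocessing never sees the held-out fold. The sketch below shows this on synthetic data from `make_classification` (an assumption, since the housing CSV is not bundled here) and also requests the training scores with `return_train_score=True`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training set.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())

cv_out = cross_validate(pipe, X_syn, y_syn, cv=5, return_train_score=True)

print(sorted(cv_out))  # keys of the result dictionary
print(f"train: {cv_out['train_score'].mean():.3f}")
print(f"validation: {cv_out['test_score'].mean():.3f}")
```

Note that the key `"test_score"` refers to the held-out fold of each split, not to the final test set.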


In [32]:
# Compare the accuracy of the fitted pipeline on the training and test sets
test_score = model.score(X_test, y_test)
train_score = model.score(X_train, y_train)

print(f"Training accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")
print(f"Difference: {abs(train_score - test_score):.3f}")

# A large train/test gap suggests overfitting (0.05 is an arbitrary threshold)
if train_score - test_score > 0.05:
    print("⚠️  Potential overfitting detected!")
else:
    print("✅ Good generalization performance")

Training accuracy: 0.874
Test accuracy: 0.866
Difference: 0.008
✅ Good generalization performance


Notice that the pipeline diagram displayed by the notebook changes color once the estimator is fit.

So far we used ``Pipeline`` and ``ColumnTransformer``, which let us customize the names of the steps in the pipeline. An alternative is to use ``make_column_transformer`` and ``make_pipeline``, which do not require, and do not permit, naming the estimators. Instead, the names are set automatically to the lowercase of the estimators' class names.

In [33]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler()
)
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X_train, y_train)  # Fit ONLY on training data
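
The automatically generated names can be listed from the `steps` attribute, and they are the prefixes to use when setting parameters. A short sketch with a simplified pipeline (`auto_named` is an illustrative name):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), LogisticRegression()
)

step_names = [name for name, _ in auto_named.steps]
print(step_names)  # ['simpleimputer', 'standardscaler', 'logisticregression']

# Parameters are addressed as <step name>__<parameter name>.
auto_named.set_params(logisticregression__C=0.1)
```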

## Key Takeaways


**✅ What we did RIGHT:**
1. **Split data FIRST** before any preprocessing or analysis
2. **Fit pipeline only on training data** - all preprocessing parameters learned from training set only
3. **Evaluate on test set** for unbiased performance estimate
4. **Use cross-validation on training data** for robust model selection

**❌ Common mistakes that cause data leakage:**
- Preprocessing entire dataset before train-test split
- Using statistics from test set for normalization/scaling  
- Feature selection using entire dataset
- Cross-validating on entire dataset including test set
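
The first mistake in this list can be made concrete with a small sketch on synthetic data (the feature values are random numbers, not housing data). A scaler fitted on the full dataset stores statistics that include the test rows, while a scaler fitted on the training rows does not:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_all = rng.normal(loc=10.0, scale=3.0, size=(200, 1))  # synthetic feature

X_tr_demo, X_te_demo = train_test_split(X_all, test_size=0.2, random_state=42)

leaky_scaler = StandardScaler().fit(X_all)      # WRONG: sees the test rows
clean_scaler = StandardScaler().fit(X_tr_demo)  # RIGHT: training rows only

print("mean learned with leakage:   ", leaky_scaler.mean_)
print("mean learned without leakage:", clean_scaler.mean_)
```

The difference between the two means is usually small for a single feature like this one, but the leaky version still makes the test score an optimistic estimate.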

**🔑 Why pipelines help:**
- Ensure consistent preprocessing between training and inference
- Automatically apply same transformations to new data
- Make it much harder to accidentally use test set information during training, because every preprocessing step is fitted inside the pipeline's ``fit`` call
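
As an illustration of these points, a fitted pipeline accepts raw new rows directly, including rows with missing values or categories that were absent during training. The sketch below uses a few made-up rows; `"SFoyer"` plays the role of a category absent from the toy training rows:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows and labels, for illustration only.
train_rows = pd.DataFrame(
    {
        "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0, 7000.0],
        "HouseStyle": ["2Story", "1Story", "2Story", "2Story", "2Story", "1Story"],
    }
)
train_labels = [1, 0, 1, 0, 1, 0]

clf = make_pipeline(
    make_column_transformer(
        (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["LotArea"]),
        (OneHotEncoder(handle_unknown="ignore"), ["HouseStyle"]),
    ),
    LogisticRegression(),
)
clf.fit(train_rows, train_labels)

# New raw rows: one missing value and one category unseen during fit.
new_rows = pd.DataFrame(
    {"LotArea": [np.nan, 10000.0], "HouseStyle": ["SFoyer", "1Story"]}
)
predictions = clf.predict(new_rows)
print(predictions)
```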

Note:
In this case, the pipeline correctly predicts whether the price of a house is above or below the 200,000 dollar threshold around 86% of the time. Be aware, however, that this score was obtained with a handful of hand-picked features, which is not necessarily the best choice for this classification task. We can hope that fitting a more complex machine learning pipeline on a richer set of features would improve on this performance level.

Reducing a price estimation problem to a binary classification problem with a single threshold at 200,000 dollars is probably too coarse to be useful in practice. Treating this problem as a regression problem is probably a better idea.
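
As a rough sketch of the regression variant, the same preprocessing can be kept and only the final estimator swapped. The choice of `LinearRegression` is an assumption made for illustration, and the five rows are copied from the outputs above, which is far too few to train a meaningful model:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# First five rows shown earlier; only enough to demonstrate the API.
houses = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550, 14260],
        "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"],
    }
)
sale_prices = pd.Series([208500, 181500, 223500, 140000, 250000])

regressor = make_pipeline(
    make_column_transformer(
        (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["LotArea"]),
        (OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
    ),
    LinearRegression(),  # illustrative choice of regressor
)
regressor.fit(houses, sale_prices)

# Predictions are now prices rather than 0/1 labels.
predicted = regressor.predict(houses)
print(predicted.round())
```

For a regressor, `score` returns the coefficient of determination (R²) instead of accuracy, so the evaluation code above would need to be read accordingly.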