# Pandas + ML — Part 7: Pipelines & ColumnTransformer

Goal: show a *production-style* sklearn workflow using `Pipeline` and `ColumnTransformer`.

What we do:
- Ensure cleaned Iris dataset exists (from Parts 2/4/5/6)
- Engineer a simple *categorical* feature to demonstrate one-hot encoding
- Build a preprocessing pipeline (scale numerics + one-hot categoricals)
- Train **LogisticRegression** and **RandomForest** inside full pipelines
- Evaluate with Accuracy and Macro-F1


In [4]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

%matplotlib inline

CLEAN_PATH = "ml_projects/data/iris_cleaned_engineered.csv"

# Ensure cleaned dataset exists (create a minimal version if missing)
if not os.path.exists(CLEAN_PATH):
    iris = load_iris(as_frame=True)
    df0 = iris.frame.copy()
    df0.rename(columns={'target': 'species_index'}, inplace=True)
    df0['species'] = df0['species_index'].map(dict(enumerate(iris.target_names)))
    # simple engineered areas
    df0['sepal_area'] = df0['sepal length (cm)'] * df0['sepal width (cm)']
    df0['petal_area'] = df0['petal length (cm)'] * df0['petal width (cm)']
    # standardize numerics for consistency with earlier parts
    # (ddof=0 matches StandardScaler's population standard deviation)
    num_cols0 = [c for c in df0.columns
                 if pd.api.types.is_numeric_dtype(df0[c]) and c != 'species_index']
    df0[num_cols0] = (df0[num_cols0] - df0[num_cols0].mean()) / df0[num_cols0].std(ddof=0)
    os.makedirs("ml_projects/data", exist_ok=True)
    df0.to_csv(CLEAN_PATH, index=False)
    print("Created:", CLEAN_PATH)
else:
    print("Found:", CLEAN_PATH)

df = pd.read_csv(CLEAN_PATH)
df.head()

Found: ml_projects/data/iris_cleaned_engineered.csv


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species_index | species | sepal_area | petal_area |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 | 0 | setosa | 0.008098 | -1.174041 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 | 0 | setosa | -0.932024 | -1.174041 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 | 0 | setosa | -0.830551 | -1.178299 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 | 0 | setosa | -1.063343 | -1.169783 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 | 0 | setosa | 0.052866 | -1.174041 |


## Create a categorical feature
Iris has only numeric features plus a string target (`species`). For demonstration purposes, we derive one *categorical feature* by binning a numeric column, so that the `ColumnTransformer` has something to one-hot encode.


In [5]:
# Derive a categorical feature by binning petal length
quantiles = df['petal length (cm)'].quantile([0.33, 0.66]).values
def bucket_pl(v):
    if v <= quantiles[0]:
        return 'short'
    elif v <= quantiles[1]:
        return 'medium'
    else:
        return 'long'
df['petal_len_bucket'] = df['petal length (cm)'].apply(bucket_pl)
df[['petal length (cm)','petal_len_bucket']].head()

|   | petal length (cm) | petal_len_bucket |
|---|---|---|
| 0 | -1.340227 | short |
| 1 | -1.340227 | short |
| 2 | -1.397064 | short |
| 3 | -1.283389 | short |
| 4 | -1.340227 | short |
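
As a side note, the same three-way binning can be expressed without a Python-level `apply`. The sketch below uses `pd.cut` with the 33rd and 66th percentiles as bin edges. The toy values and the variable names (`s`, `q_low`, `q_high`, `buckets`) are assumptions made for illustration, not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Toy petal lengths (hypothetical values, not the tutorial's data)
s = pd.Series([1.0, 1.5, 3.0, 4.5, 5.0, 6.0], name="petal length (cm)")

# Same cut points as the cell above: 33rd and 66th percentiles
q_low, q_high = s.quantile([0.33, 0.66]).values

# right=True (the default) closes each bin on the right,
# which mirrors the `<=` comparisons in bucket_pl
buckets = pd.cut(
    s,
    bins=[-np.inf, q_low, q_high, np.inf],
    labels=["short", "medium", "long"],
)
print(buckets.tolist())
```

`pd.cut` returns a categorical Series, so the bucket order (`short` < `medium` < `long`) is preserved. It raises an error if the bin edges are not strictly increasing, which can happen when the two quantiles coincide.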


## Define features/target & preprocessing
- **Target**: `species`
- **Numeric features**: all non-object columns except `species_index`
- **Categorical features**: `petal_len_bucket` (engineered above)

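Preprocessing:
- Scale numeric features with `StandardScaler`. The code passes `with_mean=False`, so each column is divided by its training-set standard deviation without being re-centered.
- One-hot encode categorical features with `OneHotEncoder(handle_unknown='ignore')`, which encodes a category unseen during fitting as all zeros instead of raising an error.

Then pipe the result into an estimator (LogReg or RandomForest).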


In [6]:
X = df.drop(columns=['species','species_index'])
y = df['species']

# mark columns for preprocessing
cat_cols = ['petal_len_bucket']
num_cols = [c for c in X.columns if c not in cat_cols]

# ColumnTransformer: scale numerics, one-hot encode categoricals
preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=False), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
len(X_train), len(X_test)

(105, 45)

## Pipelines: Logistic Regression and RandomForest
Each pipeline = **preprocess** → **model**. Evaluate Accuracy and Macro-F1.
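
Macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its support. The sketch below checks that arithmetic on a handful of hypothetical labels (the label lists are made up for illustration and are not the tutorial's predictions).

```python
from sklearn.metrics import f1_score

# Hypothetical labels, chosen only to illustrate the arithmetic
y_true_demo = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred_demo = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

labels = ["setosa", "versicolor", "virginica"]

# average=None returns one F1 score per class, in the order given by `labels`
per_class = f1_score(y_true_demo, y_pred_demo, labels=labels, average=None)
macro_by_hand = per_class.mean()

# average='macro' performs the same unweighted mean internally
macro = f1_score(y_true_demo, y_pred_demo, average="macro")

for label, score in zip(labels, per_class):
    print(f"{label:>10}: {score:.4f}")
print(f"macro (by hand): {macro_by_hand:.4f}  macro (sklearn): {macro:.4f}")
```

Here one versicolor sample is misclassified as virginica, which lowers recall for versicolor and precision for virginica, and both effects show up in the macro average.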


In [7]:
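from sklearn.base import clone

# clone() gives each pipeline its own unfitted copy of the preprocessor,
# so fitting one pipeline does not refit the transformer used by the other
pipelines = {
    'LogReg': Pipeline([
        ('prep', clone(preprocess)),
        ('model', LogisticRegression(max_iter=300))
    ]),
    'RandomForest': Pipeline([
        ('prep', clone(preprocess)),
        ('model', RandomForestClassifier(random_state=42))
    ])
}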

rows = []
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1m = f1_score(y_test, y_pred, average='macro')
    rows.append((name, acc, f1m))
    print(f"\n{name}\nAccuracy: {acc:.4f}  Macro-F1: {f1m:.4f}")
    # Optional: classification report
    print(classification_report(y_test, y_pred))

results = pd.DataFrame(rows, columns=['model','accuracy','f1_macro'])
results


LogReg
Accuracy: 0.9333  Macro-F1: 0.9333
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.88      0.93      0.90        15
   virginica       0.93      0.87      0.90        15

    accuracy                           0.93        45
   macro avg       0.93      0.93      0.93        45
weighted avg       0.93      0.93      0.93        45


RandomForest
Accuracy: 0.9111  Macro-F1: 0.9107
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.82      0.93      0.88        15
   virginica       0.92      0.80      0.86        15

    accuracy                           0.91        45
   macro avg       0.92      0.91      0.91        45
weighted avg       0.92      0.91      0.91        45



|   | model | accuracy | f1_macro |
|---|---|---|---|
| 0 | LogReg | 0.933333 | 0.933259 |
| 1 | RandomForest | 0.911111 | 0.910714 |
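
With only 45 test rows, a single misclassification moves accuracy by about two percentage points, so a single split gives a noisy comparison. Because a `Pipeline` refits its preprocessing inside each fold, it can be passed directly to cross-validation. The sketch below is a minimal illustration under stated assumptions: it uses the raw Iris data from scikit-learn rather than the tutorial's engineered CSV, and the names `X_raw`, `y_raw`, `pipe`, and `cv` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: raw Iris from scikit-learn, not the tutorial's engineered CSV
X_raw, y_raw = load_iris(return_X_y=True, as_frame=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=300)),
])

# Stratified folds keep the three species balanced in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is fit on each training fold only, so no test-fold statistics leak in
scores = cross_val_score(pipe, X_raw, y_raw, cv=cv, scoring="f1_macro")
print(f"Macro-F1 per fold: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

The per-fold spread gives a rough sense of how much of the gap between two models could be explained by the choice of split alone.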
