<a href="https://colab.research.google.com/github/asifahsaan/data-preprocessing-beginners/blob/main/notebooks/12_combining_with_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 12 — Combining Preprocessing with Pipelines
In this final notebook, we’ll **combine all preprocessing steps into a reusable `Pipeline`** for efficient and modular workflows.

We’ll use scikit-learn’s `Pipeline` and `ColumnTransformer` to process:
- Missing values
- Scaling (numeric features)
- Encoding (categorical features)
- Feature selection (optional)

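Bundling these steps into one object means the transformations learned from the training data are reapplied unchanged at prediction time, which is what makes this approach a good fit for production-ready machine learning pipelines.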

## 1. Sample Dataset with Mixed Data Types

In [None]:
import pandas as pd
import numpy as np

# Create a small mixed-type dataset; each feature column has one missing value
data = {
    "Age": [25, 32, np.nan, 40, 29],
    "Income": [50000, 60000, 55000, np.nan, 52000],
    "City": ["NY", "SF", "NY", "LA", np.nan],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"]
}

df = pd.DataFrame(data)
df
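Before choosing imputers, it can help to confirm where the gaps are. The sketch below rebuilds the same DataFrame and counts missing values per column. The names `missing_counts` and `numeric_cols` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Same toy data as in the cell above
df = pd.DataFrame({
    "Age": [25, 32, np.nan, 40, 29],
    "Income": [50000, 60000, 55000, np.nan, 52000],
    "City": ["NY", "SF", "NY", "LA", np.nan],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})

# Count missing entries per column
missing_counts = df.isna().sum()
print(missing_counts)

# Numeric columns can be picked out by dtype instead of typing them by hand
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Selecting columns by dtype is one possible way to build the numeric feature list for larger datasets; for a table this small, listing the names by hand works just as well.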

## 2. Define Preprocessing Steps

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define column types
num_features = ["Age", "Income"]
cat_features = ["City"]

# Numeric pipeline: fill gaps with the column mean, then standardize to zero mean and unit variance
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

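# Categorical pipeline: fill gaps with the most frequent value, then one-hot encode
# (handle_unknown="ignore" encodes categories unseen during fit as all zeros)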
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
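To see what each sub-pipeline does before they are combined, we can run them on their own columns. This is a minimal sketch using the same toy values. `X_num`, `X_cat`, `imputer_means`, and `categories` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy values as the notebook, split by column type
X_num = pd.DataFrame({
    "Age": [25, 32, np.nan, 40, 29],
    "Income": [50000, 60000, 55000, np.nan, 52000],
})
X_cat = pd.DataFrame({"City": ["NY", "SF", "NY", "LA", np.nan]})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric: the imputer learns one mean per column, then the scaler standardizes
X_num_scaled = num_pipeline.fit_transform(X_num)
imputer_means = num_pipeline.named_steps["imputer"].statistics_
print(imputer_means)
print(X_num_scaled)

# Categorical: the encoder returns a sparse matrix, so convert it for display
X_cat_encoded = cat_pipeline.fit_transform(X_cat).toarray()
categories = list(cat_pipeline.named_steps["encoder"].categories_[0])
print(categories)
print(X_cat_encoded)
```

Because a missing value is replaced by the column mean, the imputed rows land at 0 after standardization. The missing city is filled with "NY", the most frequent value in this sample.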

## 3. Build ColumnTransformer

In [None]:
# Route each column group to its pipeline; columns not listed here are dropped by default
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)
])

## 4. Create Full Pipeline with Classifier

In [None]:
from sklearn.linear_model import LogisticRegression

full_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression())
])

## 5. Train Pipeline

In [None]:
X = df.drop("Purchased", axis=1)
y = df["Purchased"]

# Encode the target as 1 for "Yes" and 0 for "No"
y_encoded = (y == "Yes").astype(int)

full_pipeline.fit(X, y_encoded)
print("Pipeline trained successfully.")
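Because the fitted pipeline is a single object, preprocessing and model can be saved and reloaded together. The sketch below shows one way to do that with `joblib`, which is installed alongside scikit-learn. The file name `purchase_pipeline.joblib` is a hypothetical choice, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Age": [25, 32, np.nan, 40, 29],
    "Income": [50000, 60000, 55000, np.nan, 52000],
    "City": ["NY", "SF", "NY", "LA", np.nan],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})
X = df.drop("Purchased", axis=1)
y_encoded = (df["Purchased"] == "Yes").astype(int)

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["Age", "Income"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["City"]),
])
full_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression()),
])
full_pipeline.fit(X, y_encoded)

# Save the fitted pipeline, then load it back as if in a separate process
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "purchase_pipeline.joblib")  # hypothetical file name
    joblib.dump(full_pipeline, model_path)
    restored_pipeline = joblib.load(model_path)

original_pred = full_pipeline.predict(X)
restored_pred = restored_pipeline.predict(X)
print((original_pred == restored_pred).all())
```

A saved pipeline should generally be loaded with the same scikit-learn version that created it, and only from sources you trust, since loading runs pickled code.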

## 6. Make Predictions

In [None]:
new_data = pd.DataFrame({
    "Age": [34],
    "Income": [58000],
    "City": ["NY"]
})

prediction = full_pipeline.predict(new_data)
print("Prediction (1 = Yes, 0 = No):", prediction[0])
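The same pipeline also copes with imperfect inputs at prediction time. The sketch below retrains the toy pipeline and scores two hypothetical rows: a complete one, and one with a missing `Age` and a city ("Boston") that never appeared during training. The second row is an illustrative assumption added here, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Age": [25, 32, np.nan, 40, 29],
    "Income": [50000, 60000, 55000, np.nan, 52000],
    "City": ["NY", "SF", "NY", "LA", np.nan],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})
X = df.drop("Purchased", axis=1)
y_encoded = (df["Purchased"] == "Yes").astype(int)

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["Age", "Income"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["City"]),
])
full_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression()),
])
full_pipeline.fit(X, y_encoded)

# Second row is hypothetical: missing Age and a city unseen during training
new_data = pd.DataFrame({
    "Age": [34, np.nan],
    "Income": [58000, 61000],
    "City": ["NY", "Boston"],
})

# Class probabilities; columns follow full_pipeline.classes_, here [0, 1]
probabilities = full_pipeline.predict_proba(new_data)
print(probabilities)

# Look at what the classifier actually receives for these rows
encoded = full_pipeline.named_steps["preprocessing"].transform(new_data)
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()
print(encoded)
```

The missing `Age` is filled with the mean learned during training, and "Boston" becomes all zeros across the city columns because the encoder was created with `handle_unknown="ignore"`. With only five training rows, the probabilities themselves are illustrative rather than meaningful.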

## Summary
- Combined numeric and categorical preprocessing using `ColumnTransformer`
- Integrated preprocessing and modeling in one `Pipeline`
- The resulting pipeline is a single estimator, so it can be passed to cross-validation and hyperparameter search or saved for deployment
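As a rough illustration of the cross-validation and tuning mentioned in the summary, the sketch below runs `cross_val_score` and `GridSearchCV` on a pipeline of the same shape. Five rows are too few to split, so it uses a hypothetical synthetic dataset (`demo`, `y_demo`) created only for this example. The grid values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data (not from the notebook), same column layout
rng = np.random.default_rng(42)
n = 60
demo = pd.DataFrame({
    "Age": rng.integers(20, 60, size=n).astype(float),
    "Income": rng.normal(55000, 8000, size=n),
    "City": rng.choice(["NY", "SF", "LA"], size=n),
})
demo.loc[::10, "Age"] = np.nan  # inject a few gaps so imputation matters
y_demo = (demo["Income"] + rng.normal(0, 4000, size=n) > 55000).astype(int)

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["Age", "Income"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["City"]),
])
full_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression()),
])

# Each fold refits the imputers, scaler, and encoder on its own training split
cv_scores = cross_val_score(full_pipeline, demo, y_demo, cv=5)
print(cv_scores.mean())

# Nested parameters are addressed as <step>__<sub-step>__<parameter>
param_grid = {
    "classifier__C": [0.1, 1.0, 10.0],
    "preprocessing__num__imputer__strategy": ["mean", "median"],
}
search = GridSearchCV(full_pipeline, param_grid, cv=5)
search.fit(demo, y_demo)
print(search.best_params_)
```

Tuning the whole pipeline rather than the classifier alone lets preprocessing choices, such as the imputation strategy, be compared under the same cross-validation splits.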

## Congratulations!
You've now completed the **Beginner-Friendly Tabular Data Preprocessing** series. You're ready to:

- Clean and preprocess real-world datasets
- Build modular and production-ready ML pipelines
- Apply feature selection and transformations efficiently

Try building your own pipeline using a real dataset next!