# Python Programming for the Machine Learning Pipeline (Lecture Notebook)

**Course:** Python Programming  
**Focus:** How core Python skills power each step of an ML workflow  

This notebook emphasizes **programming practice** over deep ML theory.  
We will follow a typical ML pipeline, with two extra sections along the way (exploratory data analysis and scikit-learn pipelines):

1. Define strategy  
2. Data collection / ingestion  
3. Data preprocessing  
4. Data modeling  
5. Training and evaluation  
6. Optimization  
7. Deployment  
8. Performance monitoring  

You will see **two styles** of solutions where appropriate:
- A more “manual Python” approach (loops, lists, basic NumPy)
- A cleaner “library-first” approach (pandas, scikit-learn)

---

## Learning Outcomes
By the end of this notebook, you should be able to:
1. Identify which Python tools match each ML pipeline step.
2. Read and inspect data using pandas and scikit-learn datasets.
3. Apply basic EDA and preprocessing with both manual and library approaches.
4. Split datasets manually and with `train_test_split`.
5. Train and evaluate a simple model using clean, reusable code.
6. Compare manual tuning loops with `GridSearchCV`.
7. Save and load models for simple deployment demos.
8. Sketch a minimal monitoring workflow using Python.

## 0. Imports and Setup

We will use a small set of widely used third-party libraries:
- `numpy` and `pandas` for data handling
- `matplotlib` for basic visualization
- `scikit-learn` for clean ML utilities
- `joblib` for saving and loading trained models

The goal is not to memorize every function but to learn a reusable **pattern** for writing ML-ready Python.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

from sklearn.model_selection import GridSearchCV

import joblib

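RNG = 42  # single seed reused wherever randomness is involved
np.random.seed(RNG)  # seeds NumPy's legacy global random generator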

---
## 1. Define Strategy (Programming View)

In real projects, “define strategy” means more than stating a goal.  
From a programming standpoint, you want to make your intent **explicit** and **configurable**.

Good habits include:
- Defining constants for random seeds
- Keeping dataset and model choices in a dictionary
- Writing small helper functions for repeatable steps

This avoids “magic numbers” scattered across your notebook.

In [None]:
# A simple configuration pattern
CONFIG = {
    "random_state": RNG,  # reuse the seed defined in the setup cell
    "test_size": 0.25,
    "knn_default_k": 5,
    "grid_k_values": [1, 3, 5, 7, 9]
}

CONFIG
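
The third habit in the list above, small helper functions, can be sketched as follows. This is a minimal illustration rather than a fixed design: `DEMO_CONFIG`, `split_data`, and `make_knn` are hypothetical names introduced here, and the dictionary is a local copy so the cell runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical local copy of the configuration so this sketch is self-contained
DEMO_CONFIG = {"random_state": 42, "test_size": 0.25, "knn_default_k": 5}


def split_data(X, y, config):
    """Stratified train/test split driven entirely by the config dict."""
    return train_test_split(
        X, y,
        test_size=config["test_size"],
        random_state=config["random_state"],
        stratify=y,
    )


def make_knn(config):
    """Build an unfitted KNN classifier from the config dict."""
    return KNeighborsClassifier(n_neighbors=config["knn_default_k"])


X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = split_data(X_demo, y_demo, DEMO_CONFIG)
demo_model = make_knn(DEMO_CONFIG).fit(X_tr, y_tr)

print("Train/test sizes:", X_tr.shape[0], X_te.shape[0])
```

Changing a value in the dictionary now changes every step that reads it, which is the point of keeping settings in one place.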

---
## 2. Data Collection / Ingestion

In practice, “data collection” often becomes **data ingestion** in Python:
- Reading CSV files
- Loading Excel sheets
- Pulling from APIs or databases (covered later in the course)
- Using benchmark datasets for learning

For classroom-safe demonstrations, we will:
1. Load the Iris dataset from scikit-learn.
2. Convert it to a pandas DataFrame.
3. Save it to CSV and read it back to simulate a real file workflow.

### 2.1 Load a Built-in Dataset (Iris)

In [None]:
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # includes features + target

df.head()

### 2.2 Simulate “Real-World” File Reading

Even when using known datasets, it's helpful to practice file I/O.

In [None]:
# Save to a local CSV (in the same working directory)
csv_path = "iris_demo.csv"
df.to_csv(csv_path, index=False)

# Read it back
df_csv = pd.read_csv(csv_path)

print("Loaded from CSV:", df_csv.shape)
df_csv.head()
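
After reading a file, it can help to confirm that the expected columns arrived with usable types. The sketch below shows one possible check. `validate_schema` is a hypothetical helper written for this example (not a pandas function), and an in-memory buffer stands in for the CSV file so nothing is written to disk.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.datasets import load_iris


def validate_schema(frame, expected_columns):
    """Return a list of problems; an empty list means the frame passed."""
    problems = []
    missing = [c for c in expected_columns if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in expected_columns:
        if col in frame.columns and not is_numeric_dtype(frame[col]):
            problems.append(f"non-numeric column: {col}")
    return problems


# Round-trip the Iris table through CSV text held in memory
iris_frame = load_iris(as_frame=True).frame
buffer = io.StringIO()
iris_frame.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

expected = list(iris_frame.columns)
print("Full table:", validate_schema(round_trip, expected))
print("Target dropped:", validate_schema(round_trip.drop(columns=["target"]), expected))
```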

---
## 3. Data Understanding / EDA (Programming Essentials)

Before modeling, Python helps you answer:
- What columns do we have?
- Are there missing values?
- What types are our features?
- Are classes balanced?

You don't need fancy plots at this stage.  
Good **first-pass** EDA uses:
- `head()`, `info()`, `describe()`
- simple counts
- lightweight plots

In [None]:
df_csv.info()

In [None]:
df_csv.describe(include="all")

### 3.1 Target Distribution

In [None]:
target_counts = df_csv["target"].value_counts().sort_index()
target_counts

In [None]:
plt.figure()
target_counts.plot(kind="bar")
plt.title("Iris Target Distribution")
plt.xlabel("Class label")
plt.ylabel("Count")
plt.show()

---
## 4. Data Preprocessing: Manual vs Library

Many ML errors trace back to preprocessing: unhandled missing values, unscaled features, or inconsistently encoded categories.  
Programming-wise, you should recognize **when to write a loop** and when to use a **library**.

We will demonstrate:
1. Handling missing values (simulated)
2. Scaling numerical features
3. Encoding categorical features (small synthetic example)

The goal is to learn patterns you can reuse across datasets.

### 4.1 Simulate Missing Values

The Iris dataset is clean.  
To practice preprocessing, we'll intentionally insert a few missing values.

In [None]:
df_miss = df_csv.copy()
rng = np.random.default_rng(RNG)

# Randomly set ~2% of numeric cells to NaN
num_cols = iris.feature_names
mask = rng.random((df_miss.shape[0], len(num_cols))) < 0.02

for j, col in enumerate(num_cols):
    df_miss.loc[mask[:, j], col] = np.nan

df_miss.isna().sum()

### 4.2 Missing Value Handling — Manual Loop Approach

We'll fill missing numeric values using the **column mean** manually.  
This approach is fine for learning, but Python-level loops are slow on large columns and easy to get subtly wrong.

In [None]:
df_manual = df_miss.copy()

for col in num_cols:
    mean_val = df_manual[col].mean()
    # loop-style fill
    values = []
    for v in df_manual[col].tolist():
        values.append(mean_val if pd.isna(v) else v)
    df_manual[col] = values

df_manual.isna().sum()

### 4.3 Missing Value Handling — pandas / sklearn Approach

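Here is the cleaner “library-first” version using pandas. scikit-learn's `SimpleImputer` (imported in the setup cell) offers the same mean fill as a reusable transformer.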

In [None]:
df_pandas = df_miss.copy()
df_pandas[num_cols] = df_pandas[num_cols].fillna(df_pandas[num_cols].mean())

df_pandas.isna().sum()
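
For the scikit-learn side of this comparison, a minimal sketch with `SimpleImputer` might look like the following. The tiny `demo` table is made up for illustration so that the learned means are easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up table with one missing value per column
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 60.0],
})

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(demo)  # NumPy array with NaNs replaced by column means

print("Learned means:", imputer.statistics_)
print(filled)

# Same result as the pandas one-liner
same_as_pandas = np.allclose(filled, demo.fillna(demo.mean()).to_numpy())
print("Matches pandas fillna:", same_as_pandas)
```

Unlike `fillna`, the fitted imputer remembers the means it learned, so it can apply the same fill to data it has not seen before.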

### 4.4 Scaling — Manual Standardization

Standardization transforms each feature to:

\[
z = \frac{x - \mu}{\sigma}
\]

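where \(\mu\) is the feature's mean and \(\sigma\) its standard deviation. We'll do this manually to see the mechanics, using the population standard deviation (`ddof=0`), which is what `StandardScaler` computes.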

In [None]:
X_num = df_pandas[num_cols].to_numpy()

mu = X_num.mean(axis=0)
sigma = X_num.std(axis=0, ddof=0)

X_scaled_manual = (X_num - mu) / sigma

X_scaled_manual[:5]

### 4.5 Scaling — StandardScaler

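`StandardScaler` stores the mean and standard deviation learned in `fit`, so the same transformation can be reapplied to new data. In a real workflow, fit it on the training split only; here we fit on the full table just to compare with the manual result.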

In [None]:
scaler = StandardScaler()
X_scaled_lib = scaler.fit_transform(df_pandas[num_cols])

X_scaled_lib[:5]
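
A quick way to confirm that the manual formula and `StandardScaler` agree is to compare the two arrays numerically. The sketch below uses the clean Iris features directly (an assumption made to keep it self-contained) and also shows that the sample standard deviation (`ddof=1`) gives a slightly different result.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_check, _ = load_iris(return_X_y=True)

# Manual z-scores with the population standard deviation (ddof=0)
manual = (X_check - X_check.mean(axis=0)) / X_check.std(axis=0, ddof=0)
# Manual z-scores with the sample standard deviation (ddof=1)
manual_sample = (X_check - X_check.mean(axis=0)) / X_check.std(axis=0, ddof=1)

library = StandardScaler().fit_transform(X_check)

print("ddof=0 matches StandardScaler:", np.allclose(manual, library))
print("ddof=1 matches StandardScaler:", np.allclose(manual_sample, library))
```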

### 4.6 Encoding Categorical Features (Small Synthetic Demo)

Iris is purely numeric.  
To practice encoding, let's create a tiny toy dataset with a categorical column.

In [None]:
toy = pd.DataFrame({
    "height_cm": [160, 170, 165, 180, 175],
    "diet_type": ["A", "B", "A", "C", "B"],
    "target": [0, 1, 0, 1, 1]
})

toy

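#### Manual Encoding with a Mapping

A dictionary mapping is simple, but the integer codes imply an order (A < B < C) that a model may treat as meaningful.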

In [None]:
toy_manual = toy.copy()
mapping = {"A": 0, "B": 1, "C": 2}
toy_manual["diet_type_enc"] = toy_manual["diet_type"].map(mapping)

toy_manual

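#### OneHotEncoder (Preferred for Many Cases)

One-hot encoding creates one 0/1 column per category, so no artificial order is introduced. That makes it a common default for unordered categories.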

In [None]:
# `sparse_output` requires scikit-learn 1.2+; older releases call this argument `sparse`
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
diet_ohe = ohe.fit_transform(toy[["diet_type"]])

diet_cols = [f"diet_{c}" for c in ohe.categories_[0]]
diet_df = pd.DataFrame(diet_ohe, columns=diet_cols)

pd.concat([toy[["height_cm"]], diet_df, toy[["target"]]], axis=1)

---
## 5. Train/Test Split — Manual vs Library

In ML, we split data to estimate how well our model generalizes.  
You *can* implement splitting manually, but in practice we prefer:

- `train_test_split` for simplicity and correctness
- Stratified splitting when class balance matters

We will show both approaches using the Iris dataset.

### 5.1 Prepare Features and Target

In [None]:
df_ready = df_pandas.copy()  # cleaned version

X = df_ready[num_cols].to_numpy()
y = df_ready["target"].to_numpy()

X.shape, y.shape

### 5.2 Manual Split with Loops

This demonstrates the idea of:
- shuffling indices
- slicing into train/test
- building lists/arrays

This is educational, but it does not stratify by class, so prefer the library version for real projects unless you have a special constraint.

In [None]:
# Manual split
indices = list(range(len(X)))

rng = np.random.default_rng(CONFIG["random_state"])
rng.shuffle(indices)

split_point = int(len(indices) * (1 - CONFIG["test_size"]))

train_idx = indices[:split_point]
test_idx = indices[split_point:]

# Loop-based assembly
X_train_manual, y_train_manual = [], []
X_test_manual, y_test_manual = [], []

for i in train_idx:
    X_train_manual.append(X[i])
    y_train_manual.append(y[i])

for i in test_idx:
    X_test_manual.append(X[i])
    y_test_manual.append(y[i])

X_train_manual = np.array(X_train_manual)
y_train_manual = np.array(y_train_manual)
X_test_manual = np.array(X_test_manual)
y_test_manual = np.array(y_test_manual)

X_train_manual.shape, X_test_manual.shape

### 5.3 Library Split with train_test_split

Cleaner and safer: `stratify=y` keeps the class proportions roughly equal in the train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=CONFIG["test_size"],
    random_state=CONFIG["random_state"],
    stratify=y
)

X_train.shape, X_test.shape
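
To see what stratification changes, one option is to count the classes in the test set with and without `stratify`. This sketch reloads Iris so it runs independently; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_s, y_s = load_iris(return_X_y=True)

# Stratified split: class proportions are preserved as closely as possible
_, _, _, y_te_strat = train_test_split(
    X_s, y_s, test_size=0.25, random_state=42, stratify=y_s
)
# Plain split: class counts depend on the shuffle
_, _, _, y_te_plain = train_test_split(
    X_s, y_s, test_size=0.25, random_state=42
)

strat_counts = np.bincount(y_te_strat, minlength=3)
plain_counts = np.bincount(y_te_plain, minlength=3)

print("Stratified test counts:", strat_counts)
print("Plain test counts:     ", plain_counts)
```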

---
## 6. Data Modeling (Programming View)

At this stage, Python shifts from data manipulation to object-oriented usage of ML estimators.  
With scikit-learn, models follow a consistent API:

- `model = Estimator(**params)`
- `model.fit(X_train, y_train)`
- `model.predict(X_test)`

This uniformity is one reason scikit-learn is so effective for learning.

### 6.1 Train a Simple KNN Model

In [None]:
knn = KNeighborsClassifier(n_neighbors=CONFIG["knn_default_k"])
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
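
Because every estimator shares the same methods, swapping models is mostly a one-line change. The sketch below illustrates this with KNN and logistic regression; the second model and the `max_iter=1000` setting are choices made for this example. It also shows the `score` method, which for classifiers returns mean accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_api, y_api = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_api, y_api, test_size=0.25, random_state=42, stratify=y_api
)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
}

api_results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)                 # same call for every estimator
    preds = estimator.predict(X_te)           # same call for every estimator
    api_results[name] = (estimator.score(X_te, y_te), accuracy_score(y_te, preds))
    print(name, api_results[name])
```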

---
## 7. Training & Evaluation — Manual vs Library Metrics

Evaluation is both a programming habit and a scientific one.  
We'll compute accuracy in two ways:
1. Manual computation
2. Using `accuracy_score`

### 7.1 Manual Accuracy

In [None]:
correct = 0
for yt, yp in zip(y_test, y_pred):
    if yt == yp:
        correct += 1

acc_manual = correct / len(y_test)
acc_manual

### 7.2 accuracy_score + classification_report

In [None]:
acc_lib = accuracy_score(y_test, y_pred)

print("Manual accuracy:", acc_manual)
print("Library accuracy:", acc_lib)
print()
print(classification_report(y_test, y_pred, target_names=iris.target_names))
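
The same manual-versus-library comparison works for a confusion matrix, which shows where the errors fall rather than just how many there are. The labels below are made up for illustration so the counts can be verified by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 0])

# Vectorized accuracy: the mean of a boolean array is the fraction of matches
acc_vectorized = np.mean(y_true_demo == y_pred_demo)

# Manual confusion matrix: rows are true classes, columns are predicted classes
n_classes = 3
cm_manual = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true_demo, y_pred_demo):
    cm_manual[t, p] += 1

cm_lib = confusion_matrix(y_true_demo, y_pred_demo)

print("Accuracy:", acc_vectorized)
print(cm_manual)
print("Matches sklearn:", np.array_equal(cm_manual, cm_lib))
```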

---
## 8. Optimization — Manual Hyperparameter Loop vs GridSearchCV

Optimization in ML often means choosing hyperparameters that improve generalization.  
Programming-wise, this is a great place to compare:

- **Manual loops** for transparency
- **Library tools** for reliability and speed

We'll tune \(k\) for KNN.

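### 8.1 Manual Loop Tuning

Note that this loop scores each \(k\) on the test set. That keeps the code transparent, but it means the test set is used to choose the model, so the best accuracy it reports is likely to be optimistic.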

In [None]:
k_values = CONFIG["grid_k_values"]

manual_rows = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    manual_rows.append({"k": k, "accuracy": acc})

manual_results = pd.DataFrame(manual_rows).sort_values("accuracy", ascending=False)
manual_results
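
A manual loop can stay transparent without touching the test set if each candidate is scored by cross-validation on the training data. The sketch below is one way to do that with `cross_val_score`; it reloads Iris and repeats the split so it runs on its own, and `k_grid` mirrors the values in `CONFIG["grid_k_values"]`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42, stratify=y_all
)

k_grid = [1, 3, 5, 7, 9]
cv_means = {}
for k in k_grid:
    # 5-fold cross-validation on the training split only
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5, scoring="accuracy"
    )
    cv_means[k] = scores.mean()

# Choose k from the cross-validation results, then use the test set once
best_k = max(cv_means, key=cv_means.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)

print("Mean CV accuracy per k:", cv_means)
print("Chosen k:", best_k, "| test accuracy:", test_acc)
```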

### 8.2 GridSearchCV

`GridSearchCV` automates the search using cross-validation on the training data only.  
Because the test set plays no part in choosing \(k\), it remains available for an unbiased final check.

In [None]:
param_grid = {"n_neighbors": k_values}

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy"
)

grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)

---
## 9. Preprocessing + Modeling with Pipelines

When projects grow, you should avoid scattered preprocessing steps.  
A **Pipeline** helps you:
- keep steps in order
- reduce the risk of data leakage, because preprocessing is fitted on the training split only
- simplify experimentation

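We'll build a minimal numeric pipeline. Because `df_ready` was already imputed with full-dataset means in Section 4.3, only the scaling step is fitted on the training split here.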

In [None]:
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=CONFIG["knn_default_k"]))
])

# Here we go back to DataFrame-style features for clarity
X_df = df_ready[num_cols]
y_series = df_ready["target"]

X_train_df, X_test_df, y_train_s, y_test_s = train_test_split(
    X_df, y_series,
    test_size=CONFIG["test_size"],
    random_state=CONFIG["random_state"],
    stratify=y_series
)

pipe.fit(X_train_df, y_train_s)
pred = pipe.predict(X_test_df)

print("Pipeline accuracy:", accuracy_score(y_test_s, pred))
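
To keep imputation inside the pipeline as well, the split can be made on data that still contains missing values. The following sketch assumes the same ~2% missingness simulated earlier and regenerates it locally so the cell is self-contained; `full_pipe` and the other names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_iris(return_X_y=True)

# Blank out roughly 2% of the cells
rng_demo = np.random.default_rng(42)
X_missing = X_raw.copy()
X_missing[rng_demo.random(X_missing.shape) < 0.02] = np.nan

# Split BEFORE any preprocessing so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_missing, y_raw, test_size=0.25, random_state=42, stratify=y_raw
)

full_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # means learned from X_tr only
    ("scaler", StandardScaler()),                 # mean/std learned from X_tr only
    ("model", KNeighborsClassifier(n_neighbors=5)),
])

full_pipe.fit(X_tr, y_tr)
pipe_acc = full_pipe.score(X_te, y_te)
print("Pipeline accuracy with in-pipeline imputation:", pipe_acc)
```

Here the imputer's stored means come from the training rows alone, which is the behavior the leakage bullet above refers to.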

---
## 10. Deployment (Intro-Level Programming Demo)

Deployment can mean many things, from a simple script to a production service.  
For a first programming demonstration, we focus on:

- saving a trained model to disk
- loading it later
- using it in a small prediction function

This introduces reproducibility and modular code design.

In [None]:
# Save the pipeline
model_path = "iris_knn_pipeline.joblib"
joblib.dump(pipe, model_path)

# Load it back (joblib files are pickle-based, so only load files you trust)
loaded_pipe = joblib.load(model_path)

# Predict the first 3 rows of the test set
loaded_pipe.predict(X_test_df.head(3))
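
The third bullet above, a small prediction function, could look like the sketch below. `predict_species` is a hypothetical helper written for this example, and the model is saved to a temporary directory (an assumption made so the demo leaves no files behind).

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris_data = load_iris(as_frame=True)
X_full, y_full = iris_data.data, iris_data.target

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=5)),
])
demo_pipe.fit(X_full, y_full)


def predict_species(model, rows, class_names):
    """Return readable class names for a DataFrame of measurements."""
    labels = model.predict(rows)
    return [str(class_names[i]) for i in labels]


# Save and reload through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(demo_pipe, demo_path)
    restored = joblib.load(demo_path)

sample_rows = X_full.head(3)
names_original = predict_species(demo_pipe, sample_rows, iris_data.target_names)
names_restored = predict_species(restored, sample_rows, iris_data.target_names)

print(names_restored)
```

Wrapping the call this way gives callers readable names instead of integer codes and keeps the model object out of the rest of the code.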

---
## 11. Performance Monitoring (Simple Classroom Pattern)

In professional ML, monitoring helps detect when a model's performance degrades after deployment.  
A minimal monitoring script might:

- collect new data batches
- generate predictions
- compare against ground truth when available
- log metrics over time

Below is a simplified example that logs accuracy for a “new batch” (simulated).

In [None]:
# Simulate a "new batch" by sampling from the existing test set
new_batch_X = X_test_df.sample(20, random_state=RNG)
new_batch_y = y_test_s.loc[new_batch_X.index]

new_pred = loaded_pipe.predict(new_batch_X)
new_acc = accuracy_score(new_batch_y, new_pred)

log_row = {
    "timestamp": pd.Timestamp.now(),
    "batch_size": len(new_batch_X),
    "accuracy": new_acc
}

log_path = "monitoring_log.csv"

# Append or create
try:
    existing = pd.read_csv(log_path)
    updated = pd.concat([existing, pd.DataFrame([log_row])], ignore_index=True)
except FileNotFoundError:
    updated = pd.DataFrame([log_row])

updated.to_csv(log_path, index=False)
updated.tail()
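
The read-concatenate-rewrite pattern above is easy to follow, but it rereads the whole log on every update. An alternative sketch appends one row at a time using `to_csv(mode="a")`. `append_log_row` is a hypothetical helper, and the accuracy values are made-up placeholders rather than real results.

```python
import os
import tempfile

import pandas as pd


def append_log_row(path, row):
    """Append one metrics row to a CSV, writing the header only for a new file."""
    is_new_file = not os.path.exists(path)
    pd.DataFrame([row]).to_csv(path, mode="a", header=is_new_file, index=False)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_log = os.path.join(tmp_dir, "demo_monitoring_log.csv")

    # Placeholder metrics for two simulated batches
    append_log_row(demo_log, {"timestamp": pd.Timestamp.now(), "batch_size": 20, "accuracy": 0.95})
    append_log_row(demo_log, {"timestamp": pd.Timestamp.now(), "batch_size": 20, "accuracy": 0.90})

    log_history = pd.read_csv(demo_log)

print(log_history)
```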

---
## 12. Summary: Python Skills Mapped to the ML Pipeline

**Define strategy**
- Use configuration dictionaries, constants, and helper functions.

**Data collection / ingestion**
- Read CSV/Excel; load benchmark datasets; validate schema.

**Data understanding / EDA**
- Use pandas inspection methods and simple plots.

**Preprocessing**
- Handle missingness, encode categories, scale numerics.
- Prefer reusable transformers and pipelines.

**Splitting**
- Understand manual splitting logic, but use `train_test_split`.

**Modeling**
- Learn the estimator API: `fit`, `predict`, `score`.

**Training & evaluation**
- Compute metrics manually to understand them, then use library tools.

**Optimization**
- Start with simple loops; progress to `GridSearchCV`.

**Deployment**
- Save/load models with `joblib`; wrap prediction into functions.

**Monitoring**
- Log batch metrics; watch for drift in real systems.

The big idea: strong ML work depends on **strong Python fundamentals** plus disciplined use of libraries.

---
## Suggested Student Practice (Optional)

1. Replace KNN with Logistic Regression and repeat the manual vs library comparisons.  
2. Create a small synthetic dataset with noise and observe how accuracy changes with different splits.  
3. Extend the monitoring log to also record:
   - class distribution in new batches
   - predicted probability summaries (if using probabilistic models)
