# Day 4: Model Selection & Training

In this notebook, we train several machine learning models, evaluate them on held-out data, and explore cross-validation and hyperparameter tuning.

## Step 1: Load Dataset
We use the Breast Cancer dataset that ships with scikit-learn. It is a binary classification dataset: the task is to predict whether a tumor is malignant (label 0) or benign (label 1) from 30 numeric features computed for each of 569 samples.

In [8]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X.shape, y.shape

((569, 30), (569,))
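
Before splitting, it helps to know how the two classes are distributed. The following is a minimal sketch that reloads the dataset on its own; the name `class_counts` is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# np.bincount counts how often each integer label occurs;
# position i in the result corresponds to data.target_names[i].
counts = np.bincount(data.target)
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}
print(class_counts)
```

The classes are moderately imbalanced (212 malignant, 357 benign), which is one reason to stratify the split in the next step.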

## Step 2: Train/Test Split
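We split the dataset into training and test sets so that models are fit on one part of the data and evaluated on data they have not seen. The held-out score estimates generalization and exposes overfitting, which training accuracy alone would hide. Passing `stratify=y` keeps the malignant/benign ratio the same in both sets.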

In [9]:
from sklearn.model_selection import train_test_split

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape

((455, 30), (114, 30))
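
To see what stratification does, one can compare the share of benign samples in the full data, the training set, and the test set. This is a self-contained sketch; the helper `benign_fraction` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def benign_fraction(labels):
    """Share of samples labelled 1 (benign)."""
    return float(np.mean(labels == 1))

fractions = {
    'full': benign_fraction(y),
    'train': benign_fraction(y_train),
    'test': benign_fraction(y_test),
}
for split_name, frac in fractions.items():
    print(f'{split_name}: {frac:.3f}')
```

With stratification the three fractions should agree closely; without it, a small test set can drift away from the overall class ratio.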

## Step 3: Train Baseline Models
Here, we train four models:
- **Logistic Regression**: A linear model good for baseline classification.
- **K-Nearest Neighbors (KNN)**: Classifies each sample by the majority label among its k nearest training points (k=5 by default).
- **Decision Tree**: A simple tree-based model that splits data by feature thresholds.
- **Random Forest**: An ensemble of decision trees for better generalization.

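We wrap each model in a Pipeline with StandardScaler, which standardizes every feature to zero mean and unit variance using statistics learned from the training data only. Scaling matters for Logistic Regression and KNN; the tree-based models are insensitive to it, but using the same pipeline keeps the comparison uniform.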
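
As a rough illustration of what the Pipeline does internally, the sketch below scales the data by hand (fit on the training set, then transform both sets) and checks that a KNN model trained this way predicts the same labels as the pipeline version. Variable names such as `manual_pred` and `pipe_pred` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Manual route: learn mean and std from the training data only,
# then apply the same transformation to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
manual_pred = KNeighborsClassifier().fit(X_train_scaled, y_train).predict(X_test_scaled)

# Pipeline route: the same two steps, bundled into one estimator.
pipe = Pipeline([('scaler', StandardScaler()), ('model', KNeighborsClassifier())])
pipe_pred = pipe.fit(X_train, y_train).predict(X_test)

print('Identical predictions:', np.array_equal(manual_pred, pipe_pred))
```

The pipeline is the safer choice because it is impossible to forget the scaling step at prediction time or to fit the scaler on test data by accident.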

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

results = {}
for name, model in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    results[name] = classification_report(y_test, y_pred, output_dict=True)

results

{'Logistic Regression': {'0': {'precision': 0.9761904761904762,
   'recall': 0.9761904761904762,
   'f1-score': 0.9761904761904762,
   'support': 42.0},
  '1': {'precision': 0.9861111111111112,
   'recall': 0.9861111111111112,
   'f1-score': 0.9861111111111112,
   'support': 72.0},
  'accuracy': 0.9824561403508771,
  'macro avg': {'precision': 0.9811507936507937,
   'recall': 0.9811507936507937,
   'f1-score': 0.9811507936507937,
   'support': 114.0},
  'weighted avg': {'precision': 0.9824561403508771,
   'recall': 0.9824561403508771,
   'f1-score': 0.9824561403508771,
   'support': 114.0}},
 'KNN': {'0': {'precision': 0.9512195121951219,
   'recall': 0.9285714285714286,
   'f1-score': 0.9397590361445783,
   'support': 42.0},
  '1': {'precision': 0.958904109589041,
   'recall': 0.9722222222222222,
   'f1-score': 0.9655172413793104,
   'support': 72.0},
  'accuracy': 0.956140350877193,
  'macro avg': {'precision': 0.9550618108920814,
   'recall': 0.9503968253968254,
   'f1-score': 0.952
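
The nested report dictionaries are hard to compare by eye. One possible way to condense them, sketched here as a self-contained cell, is to pull a few headline metrics into a DataFrame with one row per model. The column names (`accuracy`, `macro_f1`, `malignant_recall`) are choices made for this sketch, not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}

rows = {}
for name, model in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    pipe.fit(X_train, y_train)
    report = classification_report(y_test, pipe.predict(X_test), output_dict=True)
    rows[name] = {
        'accuracy': report['accuracy'],
        'macro_f1': report['macro avg']['f1-score'],
        # Class '0' is malignant; missing a malignant tumor is the costly error.
        'malignant_recall': report['0']['recall'],
    }

summary = pd.DataFrame(rows).T.sort_values('accuracy', ascending=False)
print(summary.round(3))
```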

## Step 4: Cross-Validation
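Cross-validation gives a more reliable estimate than a single train/test split. The data is divided into k folds; each fold serves once as the validation set while the model is trained on the remaining folds, and the k scores are averaged, which reduces the variance of the estimate. Because the scaler is inside the pipeline, it is refit on the training folds in every round.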

In [11]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Cross-validation example with Logistic Regression
log_reg = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression(max_iter=1000))])
cv_scores = cross_val_score(log_reg, X, y, cv=5)
print('Cross-validation accuracy scores:', cv_scores)
print('Mean CV accuracy:', np.mean(cv_scores))

Cross-validation accuracy scores: [0.98245614 0.98245614 0.97368421 0.97368421 0.99115044]
Mean CV accuracy: 0.9806862288464524
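
To make the mechanics concrete, the loop below is a sketch of what `cross_val_score` does for a classifier when given `cv=5`: it uses stratified folds without shuffling, fits a fresh copy of the pipeline on each training portion, and scores it on the held-out fold. The names `manual_scores` and `auto_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression(max_iter=1000))])

# An integer cv with a classifier means StratifiedKFold(n_splits=cv), unshuffled.
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in skf.split(X, y):
    fold_model = clone(pipe)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(pipe, X, y, cv=5)

print('Manual loop:     ', manual_scores)
print('cross_val_score: ', auto_scores)
print(f'Mean: {manual_scores.mean():.4f}, std: {manual_scores.std():.4f}')
```

Reporting the standard deviation alongside the mean shows how much the score moves from fold to fold.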


## Step 5: Hyperparameter Tuning with GridSearchCV
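Models like Random Forest have hyperparameters that are set before training (e.g., number of trees, maximum depth). GridSearchCV fits every combination in a grid, scores each with cross-validation, and keeps the best one. Since the estimator is a Pipeline, parameter names carry the step name as a prefix: `model__n_estimators` targets `n_estimators` of the step named `model`.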
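
The size of a grid search grows multiplicatively with each added parameter. As a small sketch, `ParameterGrid` can enumerate the candidates for a grid with two tree counts and three depth settings, assuming 3-fold cross-validation; `n_cv_fits` is a name introduced here.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 5, 10],
}

# Every combination of the listed values becomes one candidate.
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)

n_folds = 3
n_cv_fits = len(candidates) * n_folds
print(f'{len(candidates)} candidates x {n_folds} folds = {n_cv_fits} fits')
```

With the default `refit=True`, GridSearchCV performs one extra fit of the best candidate on the whole training set after the search.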

In [12]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning with GridSearchCV (Random Forest)
param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 5, 10]
}

pipe = Pipeline([('scaler', StandardScaler()), ('model', RandomForestClassifier(random_state=42))])
grid = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Best CV score:', grid.best_score_)

Best parameters: {'model__max_depth': 5, 'model__n_estimators': 100}
Best CV score: 0.953831183920065
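
The best CV score is an average over validation folds of the training data, not a test score. A natural final step, sketched below as a self-contained cell, is to evaluate the refitted best pipeline once on the held-out test set. The sketch repeats the search with the same grid; `test_accuracy` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', RandomForestClassifier(random_state=42))])
param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 5, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

# With refit=True (the default), grid.score and grid.predict use the
# best pipeline retrained on the full training set.
test_accuracy = grid.score(X_test, y_test)
print('Best parameters:', grid.best_params_)
print(f'Test accuracy: {test_accuracy:.4f}')
print(classification_report(y_test, grid.predict(X_test),
                            target_names=data.target_names))
```

The test set should be used only for this final check; choosing hyperparameters based on test performance would make the reported score optimistic.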
