# Model Development, Training, and Hyperparameter Optimization

## 📚 Learning Objectives

By completing this notebook, you will:
- Implement model architecture and training pipeline
- Train models with different hyperparameter configurations
- Perform hyperparameter optimization using grid search and automated tools
- Evaluate model performance using appropriate metrics
- Iteratively refine models based on validation results

## 🔗 Prerequisites

- ✅ Unit 2: Data collection and preprocessing completed
- ✅ Understanding of ML model training
- ✅ Python, scikit-learn, TensorFlow/PyTorch knowledge

---

## Official Structure Reference

This notebook covers practical activities from **Course 12, Unit 3**:
- Implementing model architecture and training pipeline
- Training models with different hyperparameter configurations
- Performing hyperparameter optimization using grid search or automated tools
- Evaluating model performance using appropriate metrics
- Analyzing model outputs and identifying areas for improvement
- Iteratively refining the model based on validation results
- Documenting training procedures and results
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 3 Practical Content

---

## Introduction

**Model development** covers selecting an appropriate algorithm, designing the architecture, training the model, and tuning its hyperparameters to get the best performance on held-out data.


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.datasets import make_classification

print("✅ Libraries imported!")


✅ Libraries imported!


In [2]:
# Generate sample dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set: {X_train.shape}, Validation set: {X_val.shape}")


Training set: (800, 20), Validation set: (200, 20)


## Part 1: Baseline Model Training

Train a baseline model to establish performance benchmarks.


In [3]:
# Baseline model with default hyperparameters
baseline_model = RandomForestClassifier(random_state=42)
baseline_model.fit(X_train, y_train)

# Evaluate baseline
y_val_pred = baseline_model.predict(X_val)
baseline_accuracy = accuracy_score(y_val, y_val_pred)
baseline_precision = precision_score(y_val, y_val_pred)
baseline_recall = recall_score(y_val, y_val_pred)
baseline_f1 = f1_score(y_val, y_val_pred)

print("=" * 60)
print("Baseline Model Performance")
print("=" * 60)
print(f"Accuracy:  {baseline_accuracy:.4f}")
print(f"Precision: {baseline_precision:.4f}")
print(f"Recall:    {baseline_recall:.4f}")
print(f"F1-Score:  {baseline_f1:.4f}")

print("\nClassification Report:")
print(classification_report(y_val, y_val_pred))


Baseline Model Performance
Accuracy:  0.9050
Precision: 0.9175
Recall:    0.8900
F1-Score:  0.9036

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.92      0.91       100
           1       0.92      0.89      0.90       100

    accuracy                           0.91       200
   macro avg       0.91      0.91      0.90       200
weighted avg       0.91      0.91      0.90       200
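
The baseline above is scored on the validation set, whereas the grid search in Part 2 reports a cross-validated F1 on the training set, so the two numbers are not directly comparable. A minimal sketch of a cross-validated baseline, assuming the same synthetic data and split as the cells above (the names `baseline_cv_scores` and `baseline_cv_f1` are illustrative and not part of the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Recreate the dataset and split used earlier in the notebook
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5-fold cross-validated F1 of the default-parameter model, training data only
baseline_cv_scores = cross_val_score(
    RandomForestClassifier(random_state=42), X_train, y_train, cv=5, scoring="f1"
)
baseline_cv_f1 = baseline_cv_scores.mean()

print(f"Baseline CV F1: {baseline_cv_f1:.4f} (+/- {baseline_cv_scores.std():.4f})")
```

This figure can be compared with `best_score_` from a search that uses the same `cv` and `scoring` settings.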



## Part 2: Hyperparameter Optimization - Grid Search


In [4]:
# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid Search
print("=" * 60)
print("Grid Search Hyperparameter Optimization")
print("=" * 60)
print("Searching over parameter combinations...")
print(f"Total combinations: {np.prod([len(v) for v in param_grid.values()])}")

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate best model
best_model = grid_search.best_estimator_
y_val_pred_best = best_model.predict(X_val)
best_accuracy = accuracy_score(y_val, y_val_pred_best)

print(f"\nValidation Accuracy with best params: {best_accuracy:.4f}")
print(f"Improvement over baseline: {best_accuracy - baseline_accuracy:.4f}")


Grid Search Hyperparameter Optimization
Searching over parameter combinations...
Total combinations: 108
Fitting 5 folds for each of 108 candidates, totalling 540 fits



Best parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
Best cross-validation score: 0.8924

Validation Accuracy with best params: 0.8950
Improvement over baseline: -0.0100
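
Note that the search optimizes F1 (`scoring='f1'`) while the comparison above uses accuracy, and a difference of 0.01 on 200 validation samples amounts to two predictions. The sketch below compares both models on validation F1 instead. It assumes a reduced grid (`small_grid`, a hypothetical name) to keep the run short, so its best parameters need not match those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Recreate the dataset and split used earlier in the notebook
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Default-parameter baseline
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Reduced grid: an assumption made only to keep this sketch fast
small_grid = {"max_depth": [5, 10, None], "min_samples_leaf": [1, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=5, scoring="f1"
)
search.fit(X_train, y_train)

# Compare on the metric the search actually optimized
baseline_val_f1 = f1_score(y_val, baseline.predict(X_val))
tuned_val_f1 = f1_score(y_val, search.best_estimator_.predict(X_val))

print(f"Baseline validation F1: {baseline_val_f1:.4f}")
print(f"Tuned validation F1:    {tuned_val_f1:.4f}")
print(f"Difference:             {tuned_val_f1 - baseline_val_f1:+.4f}")
```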


## Part 3: Hyperparameter Optimization - Random Search

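Random search evaluates a fixed number of sampled combinations (`n_iter`), so its cost does not grow with the size of the search space. This makes it more efficient than grid search when the parameter space is large.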
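
To make the cost difference concrete, the sketch below counts the model fits an exhaustive grid would need over the search space used for the random search in this notebook, and compares that with a budget of 20 sampled candidates. `ParameterGrid` is used here only to count combinations; nothing is trained.

```python
from sklearn.model_selection import ParameterGrid

# Search space from this notebook's random search cell
param_distributions = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 2, 4, 6],
}

cv = 5       # folds per candidate
n_iter = 20  # candidates sampled by random search

# 4 * 5 * 4 * 4 = 320 combinations in the full grid
n_candidates = len(ParameterGrid(param_distributions))

grid_fits = n_candidates * cv  # 1600 fits
random_fits = n_iter * cv      # 100 fits

print(f"Exhaustive grid: {n_candidates} candidates, {grid_fits} fits")
print(f"Random search:   {n_iter} candidates, {random_fits} fits")
```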


## Summary

### Key Concepts:
1. **Baseline Model**: Establish a performance benchmark with default parameters
2. **Grid Search**: Exhaustive search over all parameter combinations (computationally expensive)
3. **Random Search**: Sample a fixed number of random combinations (more efficient for large spaces)
4. **Cross-Validation**: Select hyperparameters using CV scores on the training set, not the validation set
5. **Model Refinement**: Iteratively improve the model based on validation performance

### Best Practices:
- Always establish a baseline first
- Use cross-validation for hyperparameter tuning
- Keep the validation set separate for final evaluation
- Document all hyperparameter configurations tried
- Track training time vs. performance trade-offs
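
One way to cover the last two practices is to read the fitted search's `cv_results_` attribute, which records every configuration tried together with its scores and fit times. The sketch below assumes a deliberately tiny grid and 3-fold CV so it runs quickly; `log` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Same synthetic dataset as the rest of the notebook
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Tiny grid: an assumption made only to keep this sketch fast
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [25, 50], "max_depth": [5, None]},
    cv=3,
    scoring="f1",
)
search.fit(X, y)

# One row per configuration: parameters, CV score, and mean fit time in seconds
results = pd.DataFrame(search.cv_results_)
log = results[
    ["params", "mean_test_score", "std_test_score", "mean_fit_time", "rank_test_score"]
].sort_values("rank_test_score")

print(log.to_string(index=False))
```

Saving `log` with `log.to_csv(...)` after each search gives a simple record of what was tried and what it cost.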

**Reference:** Course 12, Unit 3: "Model Development and Training" - All practical activities covered


In [5]:
# Random Search - samples random combinations
param_distributions = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6]
}

print("=" * 60)
print("Random Search Hyperparameter Optimization")
print("=" * 60)
print("Searching random parameter combinations (more efficient for large spaces)...")

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_distributions,
    n_iter=20,  # Try 20 random combinations
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.4f}")

# Compare with grid search
print(f"\nGrid Search best score: {grid_search.best_score_:.4f}")
print(f"Random Search best score: {random_search.best_score_:.4f}")
print(f"Difference: {abs(grid_search.best_score_ - random_search.best_score_):.4f}")


Random Search Hyperparameter Optimization
Searching random parameter combinations (more efficient for large spaces)...
Fitting 5 folds for each of 20 candidates, totalling 100 fits



Best parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 20}
Best cross-validation score: 0.8913

Grid Search best score: 0.8924
Random Search best score: 0.8913
Difference: 0.0010
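
The cell above samples from fixed lists of values. `RandomizedSearchCV` also accepts distribution objects that expose an `rvs` method, such as `scipy.stats.randint`, so integer parameters can be drawn from a whole range. A minimal sketch, assuming ranges that span the lists used above and a small budget (`n_iter=5`, `cv=3`) chosen only to keep the run short:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Recreate the dataset and split used earlier in the notebook
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# randint(low, high) draws integers from low to high - 1 inclusive
param_dist = {
    "n_estimators": randint(50, 301),
    "max_depth": [5, 10, 15, 20, None],  # lists are still allowed
    "min_samples_split": randint(2, 16),
    "min_samples_leaf": randint(1, 7),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=5,   # small budget for illustration
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validation score: {search.best_score_:.4f}")
```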
