
# Module 5 — In-Class Activity (Starter)
**Topic:** Ensemble Learning in Practice  
**Time:** ~45–60 minutes

You will:
- Generate a small, human-readable dataset.
- Split and scale data.
- Train **baseline** models (Logistic Regression, Decision Tree, Random Forest).
- Train **ensemble** models (Bagging and AdaBoost).
- Build a **comparison table** (Accuracy, F1) and a quick bar chart.

> This is a **starter** notebook: several spots are marked with `# TODO:`.  
> Fill them in with Python code before running.


In [None]:
# Step 1 — Imports and Setup (pre-filled)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier

import warnings
warnings.filterwarnings("ignore")
np.random.seed(7)  # reproducibility



## Step 2. Generate a simple dataset

Dataset: **Project Habits → High Grade**  
Each row represents a student. Your goal is to predict whether they achieve a high final project grade (≥ 85).
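
One possible shape for this step is sketched below. The feature names, weights, noise level, and `THRESHOLD` value are all illustrative assumptions, not part of the assignment; pick your own habits and weights.

```python
import numpy as np
import pandas as pd

np.random.seed(7)  # same seed as the setup cell
n = 300            # assumed sample size

# Hypothetical habit features (np.random.randint excludes the upper bound).
df = pd.DataFrame({
    "StudyHours": np.random.randint(0, 21, size=n),
    "DraftsSubmitted": np.random.randint(0, 6, size=n),
    "OfficeHourVisits": np.random.randint(0, 5, size=n),
    "DaysStartedEarly": np.random.randint(0, 15, size=n),
    "TeamMeetings": np.random.randint(0, 9, size=n),
})

# A score that rises with good habits, plus noise so the task is not trivial.
score = (
    0.30 * df["StudyHours"]
    + 0.80 * df["DraftsSubmitted"]
    + 0.50 * df["OfficeHourVisits"]
    + 0.20 * df["DaysStartedEarly"]
    + 0.30 * df["TeamMeetings"]
    + np.random.normal(0, 1.0, size=n)
)

# Standardize the score, then map it to a probability with the logistic function.
z = (score - score.mean()) / score.std()
prob = 1.0 / (1.0 + np.exp(-z))

THRESHOLD = 0.5  # assumed cutoff
df["HighGrade"] = (prob > THRESHOLD).astype(int)

# Quick checks: first rows and class balance.
print(df.head())
print(df["HighGrade"].value_counts(normalize=True))
```

The score is kept outside `df` on purpose, so it cannot leak into the feature matrix when `HighGrade` is dropped in Step 3.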


In [None]:
# TODO:
# 1) Create 5+ integer features with np.random.randint(...)
# 2) Assemble a DataFrame df = pd.DataFrame({...})
# 3) Build a 'score' that increases with good habits
# 4) Map score → probability (logistic or threshold)
# 5) df["HighGrade"] = (prob > THRESHOLD).astype(int)
# 6) Quick checks: df.head(), class balance



## Step 3. Split and scale the data

- Use `train_test_split` with `stratify=y` and `test_size=0.25`.
- Standardize features for models that need it (e.g., Logistic Regression) using `StandardScaler`.
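
A minimal sketch of the split-and-scale pattern follows. So that it runs on its own, it uses `make_classification` as stand-in data; in the notebook you would use your `df` from Step 2 instead. The column names `f0`...`f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with X, y built from your own df.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=7
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr, name="HighGrade")

# stratify=y keeps the class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no test-set statistics leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print("train positive rate:", round(y_train.mean(), 3))
print("test positive rate: ", round(y_test.mean(), 3))
```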


In [None]:
# TODO:
# 1) X = df.drop(columns=["HighGrade"]); y = df["HighGrade"]
# 2) train_test_split(..., stratify=y, test_size=0.25, random_state=7)
# 3) StandardScaler: fit on train only, then transform train & test (Step 4 uses the scaled copies for Logistic Regression)



## Step 4. Train baseline models

Train three baselines:
- **Logistic Regression** (use **scaled** features)
- **Decision Tree** (unscaled)
- **Random Forest** (unscaled)

Store **Accuracy** and **F1** in a results dictionary keyed by model name, so that `pd.DataFrame(results).T` gives one row per model.
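
The loop below is one way to organize the three baselines; treat it as a sketch rather than the expected answer. It again uses `make_classification` as stand-in data, and the hyperparameters (`max_iter=1000`, `n_estimators=100`) are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): use your own split from Step 3 in the notebook.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Each entry: (model, training features, test features).
# Only Logistic Regression gets the scaled copies.
baselines = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), X_train_scaled, X_test_scaled),
    "Decision Tree": (DecisionTreeClassifier(random_state=7), X_train, X_test),
    "Random Forest": (RandomForestClassifier(n_estimators=100, random_state=7), X_train, X_test),
}

results = {}
for name, (model, X_tr, X_te) in baselines.items():
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_te)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
    }

table = pd.DataFrame(results).T  # one row per model
print(table.round(3))
```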


In [None]:
# TODO:
# 1) Initialize, fit, predict for each baseline
# 2) Compute Accuracy, F1 with sklearn.metrics
# 3) Populate results dict
# 4) Display a table (pd.DataFrame(results).T)



## Step 5. Train ensemble models

Implement two ensembles:
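- **Bagging (Tree)**: reduces variance by training many trees on bootstrap samples of the training set and combining their votes.
- **AdaBoost**: fits shallow trees sequentially, reweighting the training examples so that each new tree concentrates on the ones misclassified so far.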

Add their metrics to the same results dictionary.
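
A hedged sketch of both ensembles is shown below, on stand-in data from `make_classification`. The base tree is passed as the first positional argument because the keyword for it was renamed from `base_estimator` to `estimator` in newer scikit-learn releases; the values of `n_estimators` and `learning_rate` are assumptions to tune.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): use your own split from Step 3 in the notebook.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

ensembles = {
    # Full-depth trees, each fit on a bootstrap sample; predictions are voted.
    "Bagging (Tree)": BaggingClassifier(
        DecisionTreeClassifier(random_state=7), n_estimators=50, random_state=7
    ),
    # Shallow trees fit one after another on reweighted examples.
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=2), n_estimators=100,
        learning_rate=0.5, random_state=7
    ),
}

results = {}  # in the notebook, reuse the dict from Step 4 instead
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
    }

print(pd.DataFrame(results).T.sort_values("F1", ascending=False).round(3))
```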


In [None]:
# TODO:
# 1) Fit Bagging with DecisionTree base learner
# 2) Fit AdaBoost with shallow trees (e.g., max_depth=2)
# 3) Predict, compute metrics, update results
# 4) Show updated table sorted by F1



## Step 6. Visualize comparison

Create a quick bar chart of F1 scores to compare models.
Then, tweak **two hyperparameters** (e.g., `n_estimators`, `max_depth`, `learning_rate`) and re-run.
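
As a rough illustration of the tweak-and-compare loop, the sketch below varies two Random Forest hyperparameters on stand-in data and plots the resulting F1 scores. The specific settings and the `RF(...)` labels are arbitrary assumptions; in the notebook you would plot your own `results` dictionary.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): use your own split from Step 3 in the notebook.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# Two hyperparameters to tweak: n_estimators and max_depth.
settings = [(10, 2), (10, None), (200, 2), (200, None)]

results = {}
for n_estimators, max_depth in settings:
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=7
    )
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[f"RF(n={n_estimators}, depth={max_depth})"] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
    }

comparison = pd.DataFrame(results).T.sort_values("F1", ascending=False)
print(comparison.round(3))

# barh draws rows bottom-to-top, so sort ascending to put the best model on top.
fig, ax = plt.subplots(figsize=(6, 3))
comparison["F1"].sort_values().plot.barh(ax=ax)
ax.set_xlabel("F1 (test set)")
ax.set_xlim(0, 1)
ax.set_title("Model comparison")
fig.tight_layout()
```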


In [None]:
# TODO:
# 1) comparison = pd.DataFrame(results).T.sort_values("F1", ascending=False)
# 2) display(comparison)
# 3) Optional plot: barh of F1 by model
