# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 8 - Template Notebook: Machine Learning Model Lifecycle
**Instructor:** Amir Charkhi | **Goal:** Full ML Pipeline: From Data to Deployment


### Learning Objectives
- Understand the complete ML workflow from data loading to model evaluation
- Learn proper data splitting to avoid data leakage
- Compare linear and tree-based models
- Master cross-validation and hyperparameter tuning
- Apply best practices for model evaluation

---

## 1. Import Libraries

**What you need to do:**  
Import all necessary libraries for data manipulation, visualization, and machine learning.

**Required imports:**
- NumPy and Pandas for data handling
- Matplotlib and Seaborn for visualization
- Scikit-learn for dataset, preprocessing, models, and evaluation

**💡 Hint:** Import `train_test_split`, `LinearRegression`, `DecisionTreeRegressor`, `cross_val_score`, `GridSearchCV`, and regression metrics.

In [None]:
# Your code here
# Import all necessary libraries


---
## 2. Load the Dataset

**What you need to do:**  
Load the California Housing dataset using sklearn's built-in dataset.

**Theory:**  
The California Housing dataset is derived from the 1990 U.S. census, with one row per census block group and features such as median income, house age, and location (latitude and longitude). The target variable is the median house value, expressed in units of $100,000.

**💡 Hint:** Use `fetch_california_housing(as_frame=True)`. The returned object's `.frame` attribute is a pandas DataFrame holding the features and the target together.

In [None]:
# Your code here
# Load the California Housing dataset


---
## 3. Initial Data Inspection

**What you need to do:**  
Perform a quick inspection of the dataset before any splitting.

**Tasks:**
- Display the first few rows
- Check dataset shape
- Display feature names and target variable
- Check for missing values

**💡 Hint:** Use `.head()`, `.shape`, `.info()`, and `.isnull().sum()` methods.

In [None]:
# Your code here
# Inspect the dataset structure


---
## 4. Train-Validation-Test Split

**⚠️ CRITICAL: Split BEFORE detailed EDA to prevent data leakage!**

**What you need to do:**  
Split the data into three sets:
- **Training set (60%)**: For model training
- **Validation set (20%)**: For model selection and hyperparameter tuning
- **Test set (20%)**: For final, unbiased evaluation (DO NOT TOUCH until the very end!)

**Theory:**  
The test set represents unseen data in production. It must remain completely isolated from all training decisions to give an honest estimate of model performance.

**💡 Hint:** Use `train_test_split()` twice. First split off the test set with `test_size=0.2`, leaving train+val (80%). Then split train+val with `test_size=0.25`, which gives train (75% of 80% = 60% of the total) and validation (25% of 80% = 20% of the total). Set `random_state=42` in both calls for reproducibility.
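
The two-stage split can be sketched on synthetic stand-in data. The array shapes and coefficients below are assumptions made purely for illustration; the exercise itself should use the California Housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# First split: hold out 20% of all rows as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% becomes validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # prints: 600 200 200
```

Printing the three lengths is a quick sanity check that the proportions come out as 60/20/20.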

In [None]:
# Your code here
# Split data into train, validation, and test sets
# Remember: Test set should be locked away!


---
## 5. Exploratory Data Analysis (EDA)

**⚠️ IMPORTANT: Perform EDA ONLY on the training set to avoid data leakage!**

**What you need to do:**  
Analyze the training data to understand patterns, distributions, and relationships.

**Tasks:**
1. Display summary statistics for all features
2. Visualize target variable distribution (histogram)
3. Create a correlation heatmap
4. Identify the top 3 features most correlated with the target
5. Create scatter plots for top correlated features vs target
6. Check for outliers using box plots

**💡 Hint:** Use `.describe()`, `plt.hist()`, `sns.heatmap()`, and `sns.scatterplot()` on training data only.
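
One possible way to rank features by their correlation with the target is sketched below. The DataFrame `train_df` and its column names (`feat_a` to `feat_d`, `target`) are hypothetical stand-ins, and ranking by absolute correlation is a choice made here so that strong negative relationships are not overlooked.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame (assumption): four features and a target.
rng = np.random.default_rng(0)
n = 500
train_df = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "feat_d": rng.normal(size=n),
})
train_df["target"] = (
    3.0 * train_df["feat_a"]
    - 1.0 * train_df["feat_b"]
    + rng.normal(scale=0.5, size=n)
)

# Pearson correlation of every feature with the target (target itself dropped).
corr_with_target = train_df.corr()["target"].drop("target")

# Rank by absolute value and keep the three strongest.
top3 = corr_with_target.abs().sort_values(ascending=False).head(3)
print(top3)
```

The same `train_df.corr()` matrix can be passed to `sns.heatmap()` for the heatmap task.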

In [None]:
# Your code here
# Summary statistics


In [None]:
# Your code here
# Target variable distribution


In [None]:
# Your code here
# Correlation analysis and heatmap


In [None]:
# Your code here
# Scatter plots for top features


---
## 6. Baseline Model: Linear Regression

**Theory:**  
Linear Regression assumes a linear relationship between features and target. It's fast, interpretable, and serves as an excellent baseline. The model learns coefficients (weights) for each feature to minimize the sum of squared errors.

**What you need to do:**  
Train a Linear Regression model and evaluate it on the validation set.

**Tasks:**
1. Initialize the Linear Regression model
2. Train (fit) the model on training data
3. Make predictions on validation set
4. Calculate and display:
   - Mean Absolute Error (MAE)
   - Mean Squared Error (MSE)
   - Root Mean Squared Error (RMSE)
   - R² Score

**💡 Hint:** Use `.fit()`, `.predict()`, and metrics from `sklearn.metrics`.
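
The fit, predict, and score pattern might look like the sketch below. It runs on synthetic data (an assumption for illustration), so the numbers it prints say nothing about the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a linear signal plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training data only, then predict on the validation set.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

# The four regression metrics listed in the tasks.
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_val, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

RMSE and MAE are in the same units as the target, which makes them easier to interpret than MSE.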

In [None]:
# Your code here
# Train Linear Regression model


In [None]:
# Your code here
# Make predictions and calculate metrics


---
## 7. Cross-Validation for Linear Regression

**Theory:**  
Cross-validation provides a more robust estimate of model performance by training and evaluating the model multiple times on different subsets of data. K-Fold CV splits data into K folds, trains on K-1 folds, and validates on the remaining fold, rotating through all combinations.

**What you need to do:**  
Perform 5-fold cross-validation on the training set to get a better estimate of model performance.

**Tasks:**
1. Use `cross_val_score()` with 5 folds
2. Calculate RMSE for each fold (use `scoring='neg_mean_squared_error'` and take square root)
3. Display mean and standard deviation of CV scores

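**💡 Hint:** With `scoring='neg_mean_squared_error'`, `cross_val_score()` returns negated MSE values (scikit-learn treats higher scores as better), so negate them and take the square root to get RMSE. Alternatively, use `scoring='neg_root_mean_squared_error'` if your scikit-learn version supports it, and simply negate the result.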
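
A minimal sketch of the sign flip and square root, again on synthetic training data (an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in training data (assumption).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(400, 3))
y_train = X_train @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=400)

# Each of the 5 scores is a negated MSE, so every value is <= 0.
neg_mse = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring="neg_mean_squared_error",
)

# Negate to recover MSE, then take the square root to get RMSE per fold.
rmse_per_fold = np.sqrt(-neg_mse)

print("RMSE per fold:", np.round(rmse_per_fold, 3))
print(f"Mean RMSE: {rmse_per_fold.mean():.3f} (std {rmse_per_fold.std():.3f})")
```

A small standard deviation across folds suggests the estimate does not depend much on which rows land in which fold.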

In [None]:
# Your code here
# Perform cross-validation


---
## 8. Tree-Based Model: Decision Tree Regressor

**Theory:**  
Decision Trees partition the feature space into regions through recursive binary splits. They can capture non-linear relationships and interactions between features without requiring feature scaling. However, they tend to overfit if not properly regularized.

**What you need to do:**  
Train a Decision Tree Regressor and compare its performance to Linear Regression.

**Tasks:**
1. Initialize a Decision Tree Regressor with `random_state=42`
2. Train on training data
3. Evaluate on validation set
4. Calculate the same metrics as Linear Regression
5. Compare performance to Linear Regression

**💡 Hint:** Without constraints, Decision Trees can perfectly memorize training data. We'll tune this in the next section.
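
The memorization effect can be seen by comparing training and validation R² for an unconstrained tree. The sketch below uses a synthetic non-linear target (an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): a non-linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(600, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Default settings place no limit on depth, so the tree keeps splitting
# until each leaf reproduces its training targets.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # .score() returns R² for regressors
val_r2 = tree.score(X_val, y_val)

print(f"Train R2: {train_r2:.3f}")
print(f"Validation R2: {val_r2:.3f}")
```

A gap between the two scores is the symptom of overfitting that the tuning section addresses.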

In [None]:
# Your code here
# Train Decision Tree model


In [None]:
# Your code here
# Make predictions and calculate metrics


---
## 9. Cross-Validation for Decision Tree

**What you need to do:**  
Perform 5-fold cross-validation on the Decision Tree model.

**💡 Hint:** If CV scores vary significantly from validation scores, the model may be overfitting. This motivates hyperparameter tuning.

In [None]:
# Your code here
# Perform cross-validation for Decision Tree


---
## 10. Hyperparameter Tuning: Decision Tree

**Theory:**  
Hyperparameter tuning finds the optimal model configuration that balances bias and variance. For Decision Trees, key hyperparameters include:
- `max_depth`: Maximum tree depth (prevents overfitting)
- `min_samples_split`: Minimum samples required to split a node
- `min_samples_leaf`: Minimum samples required at leaf nodes
- `max_features`: Number of features to consider for each split

**What you need to do:**  
Use GridSearchCV to find the best hyperparameters for the Decision Tree.

**Tasks:**
1. Define a parameter grid with:
   - `max_depth`: [3, 5, 7, 10, None]
   - `min_samples_split`: [2, 5, 10]
   - `min_samples_leaf`: [1, 2, 4]
2. Use GridSearchCV with 5-fold CV
3. Fit on training data
4. Display best parameters and best CV score
5. Evaluate the best model on validation set

**💡 Hint:** Use `scoring='neg_mean_squared_error'` and set `n_jobs=-1` to use all CPU cores.
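
A sketch of the grid search using the parameter grid from the tasks above, run on synthetic data (an assumption for illustration). `n_jobs` is left at its default here to keep the sketch portable; adding `n_jobs=-1` parallelizes the fits.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in training data (assumption).
rng = np.random.default_rng(42)
X_train = rng.uniform(-2, 2, size=(400, 2))
y_train = (
    np.sin(3 * X_train[:, 0])
    + 0.5 * X_train[:, 1]
    + rng.normal(scale=0.3, size=400)
)

# 5 x 3 x 3 = 45 candidate configurations, each scored with 5-fold CV.
param_grid = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# best_score_ is a negated MSE, so flip the sign before taking the root.
best_cv_rmse = np.sqrt(-search.best_score_)
print("Best parameters:", search.best_params_)
print(f"Best CV RMSE: {best_cv_rmse:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is already refit on all of the data passed to `.fit()` and can be evaluated directly on the validation set.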

In [None]:
# Your code here
# Define parameter grid and perform GridSearchCV


In [None]:
# Your code here
# Display best parameters and evaluate on validation set


---
## 11. Model Comparison

**What you need to do:**  
Create a summary comparison of all models tested.

**Tasks:**
1. Create a DataFrame or table comparing:
   - Linear Regression
   - Decision Tree (default)
   - Decision Tree (tuned)
2. Include metrics: RMSE, MAE, R²
3. Identify which model performs best on validation data

**💡 Hint:** Store all results in a dictionary and convert to a pandas DataFrame for clean visualization.
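
One way to build the comparison table is sketched below. The helper `evaluate` is a hypothetical name introduced here, the data is synthetic, and the tree with `max_depth=5` merely stands in for whatever configuration the grid search selects.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(600, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)


def evaluate(model, X_eval, y_eval):
    """Hypothetical helper: return RMSE, MAE, and R² for a fitted model."""
    pred = model.predict(X_eval)
    return {
        "RMSE": np.sqrt(mean_squared_error(y_eval, pred)),
        "MAE": mean_absolute_error(y_eval, pred),
        "R2": r2_score(y_eval, pred),
    }


models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree (default)": DecisionTreeRegressor(random_state=42),
    # Stand-in for the tuned tree (assumption): a fixed max_depth of 5.
    "Decision Tree (tuned)": DecisionTreeRegressor(max_depth=5, random_state=42),
}

# Fit each model on the training data and score it on the validation set.
results = {
    name: evaluate(model.fit(X_train, y_train), X_val, y_val)
    for name, model in models.items()
}

# One row per model, one column per metric, best (lowest) RMSE first.
comparison = pd.DataFrame.from_dict(results, orient="index").sort_values("RMSE")
print(comparison.round(3))
```

Sorting by RMSE puts the best-performing model on validation data in the first row.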

In [None]:
# Your code here
# Create model comparison table


---
## 12. Final Evaluation on Test Set

**⚠️ CRITICAL: This is your ONE AND ONLY test set evaluation!**

**Theory:**  
The test set provides an unbiased estimate of how your model will perform on completely unseen data in production. This is your final report card. If you used the test set during development, this number would be artificially optimistic.

**What you need to do:**  
Evaluate your best model (from validation performance) on the held-out test set.

**Tasks:**
1. Select your best model based on validation performance
2. Make predictions on the test set
3. Calculate final metrics: RMSE, MAE, R²
4. Compare test set performance to validation performance
5. Create a scatter plot: Actual vs Predicted values
6. Display residuals distribution

**💡 Hint:** If test performance is significantly worse than validation, your model may have overfit to the validation set.
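
The validation-versus-test comparison and the residual calculation could be sketched as follows. The data is synthetic and the chosen model (a tree with `max_depth=5`) is an assumption standing in for your own best model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), split 60/20/20 as in section 4.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(1000, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

# Stand-in for the selected best model (assumption).
best_model = DecisionTreeRegressor(max_depth=5, random_state=42)
best_model.fit(X_train, y_train)

# Compare validation RMSE with the single test-set RMSE.
val_rmse = np.sqrt(mean_squared_error(y_val, best_model.predict(X_val)))
y_test_pred = best_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Residuals: actual minus predicted. A mean near zero suggests no systematic bias.
residuals = y_test - y_test_pred

print(f"Validation RMSE: {val_rmse:.3f}")
print(f"Test RMSE: {test_rmse:.3f}")
print(f"Residual mean: {residuals.mean():.3f}, std: {residuals.std():.3f}")
```

The `residuals` array can be passed to `plt.hist()` for the distribution plot, and `y_test` against `y_test_pred` to a scatter plot for the actual-versus-predicted view.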

In [None]:
# Your code here
# Final evaluation on test set


In [None]:
# Your code here
# Visualize predictions vs actual values


In [None]:
# Your code here
# Analyze residuals


---
## 13. Key Takeaways & Next Steps

**What you should have learned:**
1. ✅ Proper data splitting prevents data leakage
2. ✅ EDA helps understand data before modeling
3. ✅ Start with simple baselines (Linear Regression)
4. ✅ Cross-validation provides robust performance estimates
5. ✅ Hyperparameter tuning improves model performance
6. ✅ Test set evaluation gives final, unbiased performance

**Reflection Questions:**
- Which model performed better and why?
- How did hyperparameter tuning affect Decision Tree performance?
- What's the difference between validation and test set performance?
- Which features were most important for prediction?

---

### 🚀 Extension Activities

**This notebook structure is ready for plug-and-play with other models!**

Try replacing the Decision Tree with:
- **Random Forest Regressor** (ensemble of trees)
- **Gradient Boosting Regressor** (sequential boosting)
- **XGBoost Regressor** (optimized gradient boosting)
- **LightGBM Regressor** (fast gradient boosting)
- **Support Vector Regressor** (SVR)

For each new model:
1. Follow the same workflow (sections 8-10)
2. Use appropriate hyperparameters for that model
3. Compare results in section 11
4. Update final evaluation if it becomes the best model

---

**AI Tech Institute** | *Building Tomorrow's AI Engineers Today*