# Machine Learning Model Lifecycle - SOLUTION
## California Housing Price Prediction

**AI Tech Institute**  
*Full ML Pipeline: From Data to Deployment*

---

### Learning Objectives
- Understand the complete ML workflow from data loading to model evaluation
- Learn proper data splitting to avoid data leakage
- Compare linear and tree-based models
- Master cross-validation and hyperparameter tuning
- Apply best practices for model evaluation

---

## 1. Import Libraries

**What you need to do:**  
Import all necessary libraries for data manipulation, visualization, and machine learning.

**Required imports:**
- NumPy and Pandas for data handling
- Matplotlib and Seaborn for visualization
- Scikit-learn for dataset, preprocessing, models, and evaluation

**💡 Hint:** Import `train_test_split`, `LinearRegression`, `DecisionTreeRegressor`, `cross_val_score`, `GridSearchCV`, and regression metrics.

In [None]:
# Import core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import sklearn components
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

---
## 2. Load the Dataset

**What you need to do:**  
Load the California Housing dataset using sklearn's built-in dataset.

**Theory:**  
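The California Housing dataset is derived from the 1990 U.S. census. Each row describes a census block group, with features such as median income, median house age, and location (latitude and longitude). The target variable, `MedHouseVal`, is the median house value in units of $100,000.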

**💡 Hint:** Use `fetch_california_housing()` and convert to a pandas DataFrame. Set `as_frame=True` for easy handling.

In [None]:
# Load the California Housing dataset
housing = fetch_california_housing(as_frame=True)

# Create DataFrame with features
df = housing.frame

# Display basic information
print("✅ Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"Features: {list(housing.feature_names)}")
print(f"Target: {housing.target_names[0]}")

---
## 3. Initial Data Inspection

**What you need to do:**  
Perform a quick inspection of the dataset before any splitting.

**Tasks:**
- Display the first few rows
- Check dataset shape
- Display feature names and target variable
- Check for missing values

**💡 Hint:** Use `.head()`, `.shape`, `.info()`, and `.isnull().sum()` methods.

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
print(df.head())

print("\n" + "="*80 + "\n")

# Dataset information
print("Dataset Information:")
print(df.info())

print("\n" + "="*80 + "\n")

# Check for missing values
print("Missing values per column:")
missing = df.isnull().sum()
print(missing)
print(f"\n✅ Total missing values: {missing.sum()}")

---
## 4. Train-Validation-Test Split

**⚠️ CRITICAL: Split BEFORE detailed EDA to prevent data leakage!**

**What you need to do:**  
Split the data into three sets:
- **Training set (60%)**: For model training
- **Validation set (20%)**: For model selection and hyperparameter tuning
- **Test set (20%)**: For final, unbiased evaluation (DO NOT TOUCH until the very end!)

**Theory:**  
The test set represents unseen data in production. It must remain completely isolated from all training decisions to give an honest estimate of model performance.

**💡 Hint:** Use `train_test_split()` twice. First split into train+val (80%) and test (20%), then split train+val into train (75% of 80% = 60% total) and validation (25% of 80% = 20% total). Set `random_state=42` for reproducibility.
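
The arithmetic in the hint can be checked without loading any data. The sketch below uses a hypothetical helper, `split_sizes`, and assumes that the held-out portion is rounded up with the remainder going to the other side, which is how `train_test_split` is understood to treat a fractional `test_size`. The row count is the 20,640 rows of the California Housing dataset.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up and the rest stays on the
    other side of the split.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out


n_total = 20640  # rows in the California Housing dataset

# First split: 80% train+val, 20% test
n_temp, n_test = split_sizes(n_total, 0.20)

# Second split: 75% / 25% of the remaining 80%
n_train, n_val = split_sizes(n_temp, 0.25)

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name:<5} {n:>6,} rows ({n / n_total:.1%})")
```

The point of the two-stage split is that the second `test_size` is a fraction of what remains after the first split, not of the full dataset.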

In [None]:
# Separate features (X) and target (y)
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: separate train and validation from temp (75/25 split = 60/20 of total)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Display split sizes
print("\n" + "="*80)
print("Data Split Summary:")
print("="*80)
print(f"Training set:   {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test set:       {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"Total:          {len(X):,} samples")
print("\n🔒 Test set is now locked and will not be used until final evaluation!")

---
## 5. Exploratory Data Analysis (EDA)

**⚠️ IMPORTANT: Perform EDA ONLY on the training set to avoid data leakage!**

**What you need to do:**  
Analyze the training data to understand patterns, distributions, and relationships.

**Tasks:**
1. Display summary statistics for all features
2. Visualize target variable distribution (histogram)
3. Create a correlation heatmap
4. Identify the top 3 features most correlated with the target
5. Create scatter plots for top correlated features vs target
6. Check for outliers using box plots

**💡 Hint:** Use `.describe()`, `plt.hist()`, `sns.heatmap()`, and `sns.scatterplot()` on training data only.
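
When picking the features "most correlated" with the target, a strong negative correlation matters as much as a strong positive one, so it helps to rank by absolute value. The sketch below illustrates this with a hand-written Pearson correlation on toy data. The feature names (`income`, `distance`, `noise`) and all values are hypothetical and are not taken from the housing dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical values)
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "income": [1.1, 1.9, 3.2, 3.9, 5.1],    # rises with the target
    "distance": [5.0, 4.1, 2.9, 2.2, 0.8],  # falls as the target rises
    "noise": [2.0, 1.0, 3.0, 1.0, 2.0],     # unrelated to the target
}

corrs = {name: pearson(values, target) for name, values in features.items()}

# Rank by absolute correlation so strong negative relationships are kept
ranked = sorted(corrs, key=lambda name: abs(corrs[name]), reverse=True)

for name in ranked:
    print(f"{name:<10} {corrs[name]:+.3f}")
```

Sorting by the signed value instead would push `distance` to the bottom of the list even though it is nearly as informative as `income`.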

In [None]:
# Summary statistics
print("Training Set - Summary Statistics:")
print("="*80)
print(X_train.describe())

print("\n" + "="*80)
print("Target Variable - Summary Statistics:")
print("="*80)
print(y_train.describe())

In [None]:
# Target variable distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(y_train, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Median House Value (100k $)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Target Variable', fontsize=14, fontweight='bold')
plt.axvline(y_train.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {y_train.mean():.2f}')
plt.axvline(y_train.median(), color='green', linestyle='--', linewidth=2, label=f'Median: {y_train.median():.2f}')
plt.legend()

plt.subplot(1, 2, 2)
plt.boxplot(y_train, vert=True)
plt.ylabel('Median House Value (100k $)', fontsize=12)
plt.title('Box Plot of Target Variable', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\n📊 Target variable statistics (in units of $100k):")
print(f"   Mean:   {y_train.mean():.2f}")
print(f"   Median: {y_train.median():.2f}")
print(f"   Std:    {y_train.std():.2f}")
print(f"   Range:  {y_train.min():.2f} to {y_train.max():.2f}")

In [None]:
# Correlation analysis
# Combine training features and target for correlation analysis
train_data = X_train.copy()
train_data['MedHouseVal'] = y_train

# Calculate correlation matrix
corr_matrix = train_data.corr()

# Create correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            fmt='.2f')
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Identify top correlated features with target.
# Rank by absolute correlation so strong negative relationships are not overlooked.
target_corr = corr_matrix['MedHouseVal'].drop('MedHouseVal').sort_values(key=abs, ascending=False)
print("\n📈 Feature Correlations with Target (MedHouseVal), ranked by absolute value:")
print("="*80)
for feature, corr in target_corr.items():
    print(f"{feature:20s}: {corr:+.4f}")

print(f"\n✅ Top 3 most correlated features: {list(target_corr.head(3).index)}")

In [None]:
# Scatter plots for top correlated features
top_features = list(target_corr.head(3).index)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, feature in enumerate(top_features):
    axes[idx].scatter(X_train[feature], y_train, alpha=0.3, s=10)
    axes[idx].set_xlabel(feature, fontsize=12)
    axes[idx].set_ylabel('Median House Value', fontsize=12)
    axes[idx].set_title(f'{feature} vs Target\n(Correlation: {target_corr[feature]:.3f})', 
                       fontsize=12, fontweight='bold')
    
    # Add trend line
    z = np.polyfit(X_train[feature], y_train, 1)
    p = np.poly1d(z)
    axes[idx].plot(X_train[feature], p(X_train[feature]), "r--", linewidth=2, alpha=0.8)

plt.tight_layout()
plt.show()

print("✅ Scatter plots created for top 3 features")

---
## 6. Baseline Model: Linear Regression

**Theory:**  
Linear Regression assumes a linear relationship between features and target. It's fast, interpretable, and serves as an excellent baseline. The model learns coefficients (weights) for each feature to minimize the sum of squared errors.
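
To make "minimize the sum of squared errors" concrete, here is a minimal single-feature sketch in plain Python. It uses the closed-form least-squares solution for one feature (slope = covariance / variance). The helper names `fit_line` and `sum_squared_errors` and the toy data are hypothetical; `LinearRegression` solves the same kind of problem for many features at once.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Toy data (hypothetical values, roughly y = 2x)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, slope, intercept)

# Any other line gives a larger sum of squared errors
worse_sse = sum_squared_errors(xs, ys, slope + 0.1, intercept)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"SSE at optimum: {best_sse:.4f}, SSE with perturbed slope: {worse_sse:.4f}")
```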

**What you need to do:**  
Train a Linear Regression model and evaluate it on the validation set.

**Tasks:**
1. Initialize the Linear Regression model
2. Train (fit) the model on training data
3. Make predictions on validation set
4. Calculate and display:
   - Mean Absolute Error (MAE)
   - Mean Squared Error (MSE)
   - Root Mean Squared Error (RMSE)
   - R² Score

**💡 Hint:** Use `.fit()`, `.predict()`, and metrics from `sklearn.metrics`.
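
The four metrics listed above are simple to compute by hand, which can help when interpreting the numbers that `sklearn.metrics` reports. The sketch below uses a hypothetical helper, `regression_metrics`, on toy values. It illustrates the standard definitions and is not a replacement for the library functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # average absolute error
    mse = sum(e * e for e in errors) / n    # average squared error
    rmse = math.sqrt(mse)                   # same units as the target

    # R² = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy values (hypothetical)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 5.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:<5} {value:.4f}")
```

RMSE is larger than MAE here because the single error of 1.0 is squared before averaging, which is what "penalizes larger errors more" means in practice.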

In [None]:
# Initialize and train Linear Regression model
lr_model = LinearRegression()

print("Training Linear Regression model...")
lr_model.fit(X_train, y_train)
print("✅ Model trained successfully!\n")

# Display feature coefficients
print("Feature Coefficients:")
print("="*80)
coef_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': lr_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(coef_df.to_string(index=False))
print(f"\nIntercept: {lr_model.intercept_:.4f}")

In [None]:
# Make predictions on training and validation sets
y_train_pred_lr = lr_model.predict(X_train)
y_val_pred_lr = lr_model.predict(X_val)

# Calculate metrics for training set
train_mae_lr = mean_absolute_error(y_train, y_train_pred_lr)
train_mse_lr = mean_squared_error(y_train, y_train_pred_lr)
train_rmse_lr = np.sqrt(train_mse_lr)
train_r2_lr = r2_score(y_train, y_train_pred_lr)

# Calculate metrics for validation set
val_mae_lr = mean_absolute_error(y_val, y_val_pred_lr)
val_mse_lr = mean_squared_error(y_val, y_val_pred_lr)
val_rmse_lr = np.sqrt(val_mse_lr)
val_r2_lr = r2_score(y_val, y_val_pred_lr)

# Display results
print("\n" + "="*80)
print("LINEAR REGRESSION - PERFORMANCE METRICS")
print("="*80)
print(f"\n{'Metric':<25} {'Training Set':>15} {'Validation Set':>15} {'Difference':>15}")
print("-"*80)
print(f"{'MAE':<25} {train_mae_lr:>15.4f} {val_mae_lr:>15.4f} {abs(train_mae_lr-val_mae_lr):>15.4f}")
print(f"{'MSE':<25} {train_mse_lr:>15.4f} {val_mse_lr:>15.4f} {abs(train_mse_lr-val_mse_lr):>15.4f}")
print(f"{'RMSE':<25} {train_rmse_lr:>15.4f} {val_rmse_lr:>15.4f} {abs(train_rmse_lr-val_rmse_lr):>15.4f}")
print(f"{'R² Score':<25} {train_r2_lr:>15.4f} {val_r2_lr:>15.4f} {abs(train_r2_lr-val_r2_lr):>15.4f}")
print("="*80)

print(f"\n💡 Interpretation:")
print(f"   - The model explains {val_r2_lr*100:.2f}% of variance in validation data")
print(f"   - Average prediction error: ${val_mae_lr*100:.2f}k")
print(f"   - RMSE: ${val_rmse_lr*100:.2f}k (penalizes larger errors more)")

---
## 7. Cross-Validation for Linear Regression

**Theory:**  
Cross-validation provides a more robust estimate of model performance by training and evaluating the model multiple times on different subsets of data. K-Fold CV splits data into K folds, trains on K-1 folds, and validates on the remaining fold, rotating through all combinations.

**What you need to do:**  
Perform 5-fold cross-validation on the training set to get a better estimate of model performance.

**Tasks:**
1. Use `cross_val_score()` with 5 folds
2. Calculate RMSE for each fold (use `scoring='neg_mean_squared_error'` and take square root)
3. Display mean and standard deviation of CV scores

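**💡 Hint:** With `scoring='neg_mean_squared_error'`, `cross_val_score()` returns negative MSE values, because scikit-learn scorers follow a higher-is-better convention. Negate the scores and take the square root to get RMSE. Recent scikit-learn versions also accept `scoring='neg_root_mean_squared_error'`, which returns negative RMSE directly.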
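
The fold rotation described above can be sketched in a few lines of plain Python. The helper `k_fold_indices` is hypothetical. It mimics contiguous, unshuffled folds, which is what an integer `cv` is understood to produce for a regressor. The second part shows the negate-then-square-root step on made-up negative MSE scores.

```python
import math


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, val_idx) in enumerate(folds, 1):
    print(f"Fold {i}: validate on {val_idx}, train on {len(train_idx)} samples")

# Converting negative MSE scores (made-up values) to RMSE
neg_mse_scores = [-0.25, -0.36, -0.49]
rmse_scores = [math.sqrt(-score) for score in neg_mse_scores]
print([round(score, 2) for score in rmse_scores])
```

Every sample lands in exactly one validation fold, so each row contributes to the performance estimate once.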

In [None]:
# Perform 5-fold cross-validation
print("Performing 5-Fold Cross-Validation for Linear Regression...\n")

# Calculate negative MSE for each fold
cv_scores_mse = cross_val_score(lr_model, X_train, y_train, 
                                 cv=5, 
                                 scoring='neg_mean_squared_error',
                                 n_jobs=-1)

# Convert to RMSE (negate and take square root)
cv_scores_rmse = np.sqrt(-cv_scores_mse)

# Display results
print("="*80)
print("LINEAR REGRESSION - CROSS-VALIDATION RESULTS (5-Fold)")
print("="*80)
for i, score in enumerate(cv_scores_rmse, 1):
    print(f"Fold {i}: RMSE = {score:.4f}")

print("-"*80)
print(f"Mean RMSE:   {cv_scores_rmse.mean():.4f}")
print(f"Std RMSE:    {cv_scores_rmse.std():.4f}")
print(f"Min RMSE:    {cv_scores_rmse.min():.4f}")
print(f"Max RMSE:    {cv_scores_rmse.max():.4f}")
print("="*80)

print(f"\n💡 Cross-Validation Insights:")
print(f"   - Average RMSE across 5 folds: {cv_scores_rmse.mean():.4f}")
print(f"   - Standard deviation: {cv_scores_rmse.std():.4f}")
print(f"   - A low std relative to the mean suggests performance is stable across data splits")

# Store for comparison
lr_cv_rmse_mean = cv_scores_rmse.mean()
lr_cv_rmse_std = cv_scores_rmse.std()

---
## 8. Tree-Based Model: Decision Tree Regressor

**Theory:**  
Decision Trees partition the feature space into regions through recursive binary splits. They can capture non-linear relationships and interactions between features without requiring feature scaling. However, they tend to overfit if not properly regularized.

**What you need to do:**  
Train a Decision Tree Regressor and compare its performance to Linear Regression.

**Tasks:**
1. Initialize a Decision Tree Regressor with `random_state=42`
2. Train on training data
3. Evaluate on validation set
4. Calculate the same metrics as Linear Regression
5. Compare performance to Linear Regression

**💡 Hint:** Without constraints, Decision Trees can perfectly memorize training data. We'll tune this in the next section.
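
A single step of the "recursive binary splits" mentioned in the theory can be sketched as a search for the threshold that minimizes the total squared error of the two resulting groups. The helpers `sse` and `best_split`, the feature name `income`, and the toy values are hypothetical. Candidate thresholds are taken at midpoints between neighbouring values, which is an assumption of this sketch.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold on x that minimizes SSE(left) + SSE(right)."""
    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Toy data (hypothetical): low-income rows are cheap, high-income rows are expensive
income = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
value = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

threshold, cost = best_split(income, value)
print(f"Best split: income <= {threshold} (total SSE = {cost:.2f})")
```

A tree repeats this search inside each resulting group. Without a stopping rule it keeps splitting until every leaf holds a single training row, which is the memorization the hint warns about.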

In [None]:
# Initialize and train Decision Tree model (no constraints)
dt_model = DecisionTreeRegressor(random_state=42)

print("Training Decision Tree Regressor (default parameters)...")
dt_model.fit(X_train, y_train)
print("✅ Model trained successfully!\n")

# Display tree information
print(f"Tree Depth: {dt_model.get_depth()}")
print(f"Number of Leaves: {dt_model.get_n_leaves()}")
print(f"Number of Input Features: {dt_model.n_features_in_}")

In [None]:
# Make predictions
y_train_pred_dt = dt_model.predict(X_train)
y_val_pred_dt = dt_model.predict(X_val)

# Calculate metrics for training set
train_mae_dt = mean_absolute_error(y_train, y_train_pred_dt)
train_mse_dt = mean_squared_error(y_train, y_train_pred_dt)
train_rmse_dt = np.sqrt(train_mse_dt)
train_r2_dt = r2_score(y_train, y_train_pred_dt)

# Calculate metrics for validation set
val_mae_dt = mean_absolute_error(y_val, y_val_pred_dt)
val_mse_dt = mean_squared_error(y_val, y_val_pred_dt)
val_rmse_dt = np.sqrt(val_mse_dt)
val_r2_dt = r2_score(y_val, y_val_pred_dt)

# Display results
print("\n" + "="*80)
print("DECISION TREE (DEFAULT) - PERFORMANCE METRICS")
print("="*80)
print(f"\n{'Metric':<25} {'Training Set':>15} {'Validation Set':>15} {'Difference':>15}")
print("-"*80)
print(f"{'MAE':<25} {train_mae_dt:>15.4f} {val_mae_dt:>15.4f} {abs(train_mae_dt-val_mae_dt):>15.4f}")
print(f"{'MSE':<25} {train_mse_dt:>15.4f} {val_mse_dt:>15.4f} {abs(train_mse_dt-val_mse_dt):>15.4f}")
print(f"{'RMSE':<25} {train_rmse_dt:>15.4f} {val_rmse_dt:>15.4f} {abs(train_rmse_dt-val_rmse_dt):>15.4f}")
print(f"{'R² Score':<25} {train_r2_dt:>15.4f} {val_r2_dt:>15.4f} {abs(train_r2_dt-val_r2_dt):>15.4f}")
print("="*80)

# Compare with Linear Regression
print("\n" + "="*80)
print("COMPARISON: Decision Tree vs Linear Regression (Validation Set)")
print("="*80)
print(f"\n{'Metric':<25} {'Linear Reg':>15} {'Decision Tree':>15} {'Improvement':>15}")
print("-"*80)
print(f"{'MAE':<25} {val_mae_lr:>15.4f} {val_mae_dt:>15.4f} {val_mae_lr-val_mae_dt:>+15.4f}")
print(f"{'RMSE':<25} {val_rmse_lr:>15.4f} {val_rmse_dt:>15.4f} {val_rmse_lr-val_rmse_dt:>+15.4f}")
print(f"{'R² Score':<25} {val_r2_lr:>15.4f} {val_r2_dt:>15.4f} {val_r2_dt-val_r2_lr:>+15.4f}")
print("="*80)

print(f"\n⚠️  Overfitting Warning:")
print(f"   - Training R² = {train_r2_dt:.4f} vs Validation R² = {val_r2_dt:.4f}")
print(f"   - Large gap suggests overfitting - the tree memorized training data!")
print(f"   - Hyperparameter tuning should help constrain the tree complexity")

---
## 9. Cross-Validation for Decision Tree

**What you need to do:**  
Perform 5-fold cross-validation on the Decision Tree model.

**💡 Hint:** If the cross-validated scores are much worse than the training scores, or vary widely between folds, the model is likely overfitting. This motivates hyperparameter tuning.

In [None]:
# Perform 5-fold cross-validation
print("Performing 5-Fold Cross-Validation for Decision Tree...\n")

cv_scores_mse_dt = cross_val_score(dt_model, X_train, y_train,
                                    cv=5,
                                    scoring='neg_mean_squared_error',
                                    n_jobs=-1)

cv_scores_rmse_dt = np.sqrt(-cv_scores_mse_dt)

# Display results
print("="*80)
print("DECISION TREE (DEFAULT) - CROSS-VALIDATION RESULTS (5-Fold)")
print("="*80)
for i, score in enumerate(cv_scores_rmse_dt, 1):
    print(f"Fold {i}: RMSE = {score:.4f}")

print("-"*80)
print(f"Mean RMSE:   {cv_scores_rmse_dt.mean():.4f}")
print(f"Std RMSE:    {cv_scores_rmse_dt.std():.4f}")
print(f"Min RMSE:    {cv_scores_rmse_dt.min():.4f}")
print(f"Max RMSE:    {cv_scores_rmse_dt.max():.4f}")
print("="*80)

# Compare CV results
print("\n" + "="*80)
print("CROSS-VALIDATION COMPARISON")
print("="*80)
print(f"\n{'Model':<30} {'Mean CV RMSE':>20} {'Std':>15}")
print("-"*80)
print(f"{'Linear Regression':<30} {lr_cv_rmse_mean:>20.4f} {lr_cv_rmse_std:>15.4f}")
print(f"{'Decision Tree (default)':<30} {cv_scores_rmse_dt.mean():>20.4f} {cv_scores_rmse_dt.std():>15.4f}")
print("="*80)

# Store for comparison
dt_cv_rmse_mean = cv_scores_rmse_dt.mean()
dt_cv_rmse_std = cv_scores_rmse_dt.std()

---
## 10. Hyperparameter Tuning: Decision Tree

**Theory:**  
Hyperparameter tuning finds the optimal model configuration that balances bias and variance. For Decision Trees, key hyperparameters include:
- `max_depth`: Maximum tree depth (prevents overfitting)
- `min_samples_split`: Minimum samples required to split a node
- `min_samples_leaf`: Minimum samples required at leaf nodes
- `max_features`: Number of features to consider for each split

**What you need to do:**  
Use GridSearchCV to find the best hyperparameters for the Decision Tree.

**Tasks:**
1. Define a parameter grid with:
   - `max_depth`: [3, 5, 7, 10, None]
   - `min_samples_split`: [2, 5, 10]
   - `min_samples_leaf`: [1, 2, 4]
2. Use GridSearchCV with 5-fold CV
3. Fit on training data
4. Display best parameters and best CV score
5. Evaluate the best model on validation set

**💡 Hint:** Use `scoring='neg_mean_squared_error'` and set `n_jobs=-1` to use all CPU cores.
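
Before running the search, it can be useful to see how many models the grid implies. The sketch below enumerates the task's parameter grid with `itertools.product`, which is a plain-Python stand-in for what a grid search does conceptually. The variable names `candidates` and `n_fits` are hypothetical.

```python
from itertools import product

# Parameter grid from the task description
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_folds = 5
n_fits = len(candidates) * n_folds  # one fit per candidate per fold

print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
print("First candidate:", candidates[0])
```

The cost grows multiplicatively with every value added to the grid, so it is worth keeping each list short.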

In [None]:
# Define parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

print("Starting Hyperparameter Tuning with GridSearchCV...")
print(f"Testing {len(param_grid['max_depth']) * len(param_grid['min_samples_split']) * len(param_grid['min_samples_leaf'])} combinations")
print("This may take a few moments...\n")

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

print("\n✅ Hyperparameter tuning completed!")

In [None]:
# Display best parameters
print("="*80)
print("HYPERPARAMETER TUNING RESULTS")
print("="*80)
print("\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

best_cv_rmse = np.sqrt(-grid_search.best_score_)
print(f"\nBest Cross-Validation RMSE: {best_cv_rmse:.4f}")

# Get the best model
best_dt_model = grid_search.best_estimator_

print(f"\nBest Tree Information:")
print(f"  Tree Depth: {best_dt_model.get_depth()}")
print(f"  Number of Leaves: {best_dt_model.get_n_leaves()}")

# Evaluate on validation set
y_val_pred_dt_tuned = best_dt_model.predict(X_val)
y_train_pred_dt_tuned = best_dt_model.predict(X_train)

# Calculate metrics
train_mae_dt_tuned = mean_absolute_error(y_train, y_train_pred_dt_tuned)
train_rmse_dt_tuned = np.sqrt(mean_squared_error(y_train, y_train_pred_dt_tuned))
train_r2_dt_tuned = r2_score(y_train, y_train_pred_dt_tuned)

val_mae_dt_tuned = mean_absolute_error(y_val, y_val_pred_dt_tuned)
val_rmse_dt_tuned = np.sqrt(mean_squared_error(y_val, y_val_pred_dt_tuned))
val_r2_dt_tuned = r2_score(y_val, y_val_pred_dt_tuned)

# Display results
print("\n" + "="*80)
print("DECISION TREE (TUNED) - PERFORMANCE METRICS")
print("="*80)
print(f"\n{'Metric':<25} {'Training Set':>15} {'Validation Set':>15} {'Difference':>15}")
print("-"*80)
print(f"{'MAE':<25} {train_mae_dt_tuned:>15.4f} {val_mae_dt_tuned:>15.4f} {abs(train_mae_dt_tuned-val_mae_dt_tuned):>15.4f}")
print(f"{'RMSE':<25} {train_rmse_dt_tuned:>15.4f} {val_rmse_dt_tuned:>15.4f} {abs(train_rmse_dt_tuned-val_rmse_dt_tuned):>15.4f}")
print(f"{'R² Score':<25} {train_r2_dt_tuned:>15.4f} {val_r2_dt_tuned:>15.4f} {abs(train_r2_dt_tuned-val_r2_dt_tuned):>15.4f}")
print("="*80)

print(f"\n💡 Improvement from Tuning:")
print(f"   - Validation RMSE: {val_rmse_dt:.4f} → {val_rmse_dt_tuned:.4f} (Δ = {val_rmse_dt-val_rmse_dt_tuned:+.4f})")
print(f"   - Validation R²: {val_r2_dt:.4f} → {val_r2_dt_tuned:.4f} (Δ = {val_r2_dt_tuned-val_r2_dt:+.4f})")
print(f"   - Train-Val gap reduced: {abs(train_r2_dt-val_r2_dt):.4f} → {abs(train_r2_dt_tuned-val_r2_dt_tuned):.4f}")

---
## 11. Model Comparison

**What you need to do:**  
Create a summary comparison of all models tested.

**Tasks:**
1. Create a DataFrame or table comparing:
   - Linear Regression
   - Decision Tree (default)
   - Decision Tree (tuned)
2. Include metrics: RMSE, MAE, R²
3. Identify which model performs best on validation data

**💡 Hint:** Store all results in a dictionary and convert to a pandas DataFrame for clean visualization.

In [None]:
# Create comprehensive comparison
results = {
    'Model': ['Linear Regression', 'Decision Tree (Default)', 'Decision Tree (Tuned)'],
    'Training RMSE': [train_rmse_lr, train_rmse_dt, train_rmse_dt_tuned],
    'Validation RMSE': [val_rmse_lr, val_rmse_dt, val_rmse_dt_tuned],
    'Training MAE': [train_mae_lr, train_mae_dt, train_mae_dt_tuned],
    'Validation MAE': [val_mae_lr, val_mae_dt, val_mae_dt_tuned],
    'Training R²': [train_r2_lr, train_r2_dt, train_r2_dt_tuned],
    'Validation R²': [val_r2_lr, val_r2_dt, val_r2_dt_tuned],
    # Note: the tuned value is the square root of the mean CV MSE, while the
    # other two are means of per-fold RMSE, so they differ slightly in definition
    'CV RMSE (Mean)': [lr_cv_rmse_mean, dt_cv_rmse_mean, best_cv_rmse],
    'Overfitting Gap': [
        abs(train_r2_lr - val_r2_lr),
        abs(train_r2_dt - val_r2_dt),
        abs(train_r2_dt_tuned - val_r2_dt_tuned)
    ]
}

comparison_df = pd.DataFrame(results)

print("="*120)
print("MODEL COMPARISON SUMMARY")
print("="*120)
print(comparison_df.to_string(index=False))
print("="*120)

# Identify best model
best_model_idx = comparison_df['Validation RMSE'].idxmin()
best_model_name = comparison_df.loc[best_model_idx, 'Model']
best_model_rmse = comparison_df.loc[best_model_idx, 'Validation RMSE']
best_model_r2 = comparison_df.loc[best_model_idx, 'Validation R²']

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"   - Validation RMSE: {best_model_rmse:.4f}")
print(f"   - Validation R²: {best_model_r2:.4f}")
print(f"   - This model will be used for final test set evaluation")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# RMSE Comparison
x = np.arange(len(comparison_df))
width = 0.35

axes[0].bar(x - width/2, comparison_df['Training RMSE'], width, label='Training', alpha=0.8)
axes[0].bar(x + width/2, comparison_df['Validation RMSE'], width, label='Validation', alpha=0.8)
axes[0].set_xlabel('Model', fontsize=12)
axes[0].set_ylabel('RMSE', fontsize=12)
axes[0].set_title('RMSE Comparison: Training vs Validation', fontsize=14, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(comparison_df['Model'], rotation=15, ha='right')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# R² Comparison
axes[1].bar(x - width/2, comparison_df['Training R²'], width, label='Training', alpha=0.8)
axes[1].bar(x + width/2, comparison_df['Validation R²'], width, label='Validation', alpha=0.8)
axes[1].set_xlabel('Model', fontsize=12)
axes[1].set_ylabel('R² Score', fontsize=12)
axes[1].set_title('R² Score Comparison: Training vs Validation', fontsize=14, fontweight='bold')
axes[1].set_xticks(x)
axes[1].set_xticklabels(comparison_df['Model'], rotation=15, ha='right')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

---
## 12. Final Evaluation on Test Set

**⚠️ CRITICAL: This is your ONE AND ONLY test set evaluation!**

**Theory:**  
The test set provides an unbiased estimate of how your model will perform on completely unseen data in production. This is your final report card. If you used the test set during development, this number would be artificially optimistic.

**What you need to do:**  
Evaluate your best model (from validation performance) on the held-out test set.

**Tasks:**
1. Select your best model based on validation performance
2. Make predictions on the test set
3. Calculate final metrics: RMSE, MAE, R²
4. Compare test set performance to validation performance
5. Create a scatter plot: Actual vs Predicted values
6. Display residuals distribution

**💡 Hint:** If test performance is significantly worse than validation, your model may have overfit to the validation set.

In [None]:
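# Select the final model. The tuned Decision Tree is used here, and the
# Training/Validation columns reported below are its metrics. If Section 11
# picked a different model, swap in that model and its metrics instead.
final_model = best_dt_model
final_model_name = 'Decision Tree (Tuned)'
if best_model_name != final_model_name:
    print(f"Note: best validation model is {best_model_name}, "
          f"but {final_model_name} is being evaluated.")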

print("="*80)
print("🔓 UNLOCKING TEST SET FOR FINAL EVALUATION")
print("="*80)
print(f"\nSelected Model: {final_model_name}")
print("\nMaking predictions on test set...\n")

# Make predictions on test set
y_test_pred = final_model.predict(X_test)

# Calculate final metrics
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)

# Display results
print("="*80)
print("FINAL TEST SET RESULTS")
print("="*80)
print(f"\n{'Metric':<25} {'Training':>15} {'Validation':>15} {'Test':>15}")
print("-"*80)
print(f"{'MAE':<25} {train_mae_dt_tuned:>15.4f} {val_mae_dt_tuned:>15.4f} {test_mae:>15.4f}")
print(f"{'RMSE':<25} {train_rmse_dt_tuned:>15.4f} {val_rmse_dt_tuned:>15.4f} {test_rmse:>15.4f}")
print(f"{'R² Score':<25} {train_r2_dt_tuned:>15.4f} {val_r2_dt_tuned:>15.4f} {test_r2:>15.4f}")
print("="*80)

# Performance comparison
val_test_diff = abs(val_rmse_dt_tuned - test_rmse)
print(f"\n📊 Performance Analysis:")
print(f"   - Test RMSE: ${test_rmse*100:.2f}k")
print(f"   - Test R²: {test_r2:.4f} (model explains {test_r2*100:.2f}% of variance)")
print(f"   - Validation vs Test RMSE difference: {val_test_diff:.4f}")

if val_test_diff < 0.05:
    print(f"   ✅ Excellent! Model generalizes well to unseen data")
elif val_test_diff < 0.10:
    print(f"   ✅ Good! Model performance is consistent")
else:
    print(f"   ⚠️  Warning: Larger gap may indicate some overfitting to validation set")

print(f"\n💡 Final Interpretation:")
print(f"   - On average, predictions are ${test_mae*100:.2f}k off from actual prices")
print(f"   - Model can be deployed with expected RMSE of ${test_rmse*100:.2f}k")

In [None]:
# Visualize predictions vs actual values
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Scatter plot: Actual vs Predicted
axes[0].scatter(y_test, y_test_pred, alpha=0.5, s=20)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Values', fontsize=12)
axes[0].set_ylabel('Predicted Values', fontsize=12)
axes[0].set_title('Actual vs Predicted Values (Test Set)', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Add R² annotation
axes[0].text(0.05, 0.95, f'R² = {test_r2:.4f}\nRMSE = {test_rmse:.4f}',
             transform=axes[0].transAxes, fontsize=11,
             verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Residual plot
residuals = y_test - y_test_pred
axes[1].scatter(y_test_pred, residuals, alpha=0.5, s=20)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted Values', fontsize=12)
axes[1].set_ylabel('Residuals (Actual - Predicted)', fontsize=12)
axes[1].set_title('Residual Plot (Test Set)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Visualizations created successfully")

In [None]:
# Analyze residuals
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Histogram of residuals
axes[0].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[0].axvline(residuals.mean(), color='red', linestyle='--', linewidth=2, 
                label=f'Mean: {residuals.mean():.4f}')
axes[0].axvline(residuals.median(), color='green', linestyle='--', linewidth=2,
                label=f'Median: {residuals.median():.4f}')
axes[0].set_xlabel('Residuals', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Residuals', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Q-Q plot for normality check
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot (Normality Check)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Residual statistics
print("\n" + "="*80)
print("RESIDUAL ANALYSIS")
print("="*80)
print(f"Mean of Residuals:     {residuals.mean():>10.4f} (should be close to 0)")
print(f"Std of Residuals:      {residuals.std():>10.4f}")
print(f"Min Residual:          {residuals.min():>10.4f}")
print(f"Max Residual:          {residuals.max():>10.4f}")
print(f"Median Abs Residual:   {np.abs(residuals).median():>10.4f}")
print("="*80)

print(f"\n💡 Residual Insights:")
print(f"   - Residuals should be randomly distributed around 0")
print(f"   - If the Q-Q plot points follow the red line, residuals are approximately normally distributed")
print(f"   - Patterns in residuals suggest the model is missing structure, such as important features")

---
## 13. Key Takeaways & Next Steps

**What you should have learned:**
1. ✅ Proper data splitting prevents data leakage
2. ✅ EDA helps understand data before modeling
3. ✅ Start with simple baselines (Linear Regression)
4. ✅ Cross-validation provides robust performance estimates
5. ✅ Hyperparameter tuning improves model performance
6. ✅ Test set evaluation gives final, unbiased performance

**Reflection Questions:**
- Which model performed better and why?
- How did hyperparameter tuning affect Decision Tree performance?
- What's the difference between validation and test set performance?
- Which features were most important for prediction?

---

### 🚀 Extension Activities

**This notebook structure is ready for plug-and-play with other models!**

Try replacing the Decision Tree with:
- **Random Forest Regressor** (ensemble of trees)
- **Gradient Boosting Regressor** (sequential boosting)
- **XGBoost Regressor** (optimized gradient boosting)
- **LightGBM Regressor** (fast gradient boosting)
- **Support Vector Regressor** (SVR)

For each new model:
1. Follow the same workflow (sections 8-10)
2. Use appropriate hyperparameters for that model
3. Compare results in section 11
4. Update final evaluation if it becomes the best model
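
As one possible starting point for the first suggestion, the sketch below swaps in a `RandomForestRegressor` and tunes it with `GridSearchCV`. To stay self-contained it runs on synthetic data from `make_regression` rather than the housing splits. The grid values, the 3-fold setting, and the variable names (`X_demo`, `rf_grid`, `rf_search`) are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data (assumption: illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=42)

# Small illustrative grid; a real search would cover more values
rf_grid = {
    'n_estimators': [20, 50],
    'max_depth': [5, None],
}

rf_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=rf_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
rf_search.fit(X_tr, y_tr)

# Same reporting pattern as the Decision Tree sections
rf_cv_rmse = np.sqrt(-rf_search.best_score_)
rf_val_pred = rf_search.best_estimator_.predict(X_va)
rf_val_rmse = np.sqrt(np.mean((y_va - rf_val_pred) ** 2))

print("Best parameters:", rf_search.best_params_)
print(f"CV RMSE: {rf_cv_rmse:.4f} | Validation RMSE: {rf_val_rmse:.4f}")
```

In the notebook itself, `X_tr`, `y_tr`, `X_va`, and `y_va` would be replaced by the existing `X_train`, `y_train`, `X_val`, and `y_val`.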

---

**AI Tech Institute** | *Building Tomorrow's AI Engineers Today*

In [None]:
# Final summary
print("\n" + "="*80)
print("🎓 MACHINE LEARNING LIFECYCLE - COMPLETION SUMMARY")
print("="*80)
print(f"\n✅ Dataset: California Housing ({len(df):,} samples)")
print(f"✅ Train/Val/Test Split: {len(X_train):,} / {len(X_val):,} / {len(X_test):,}")
print(f"✅ Models Trained: Linear Regression, Decision Tree (Default & Tuned)")
print(f"✅ Best Model: {final_model_name}")
print(f"✅ Final Test RMSE: {test_rmse:.4f}")
print(f"✅ Final Test R²: {test_r2:.4f}")
print("\n" + "="*80)
print("🎉 Congratulations! You've completed the full ML lifecycle!")
print("="*80)