# Q7: Modeling

**Phase 8:** Modeling  
**Points:** 9

**Focus:** Train multiple models, evaluate performance, compare models, extract feature importance.

**Lecture Reference:** Lecture 11, Notebook 4 ([`11/demo/04_modeling_results.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/04_modeling_results.ipynb)), Phase 8. Also see Lecture 10 (modeling with sklearn and XGBoost).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import os

# Load prepared data from Q6
X_train = pd.read_csv('output/q6_X_train.csv')
X_test = pd.read_csv('output/q6_X_test.csv')
y_train = pd.read_csv('output/q6_y_train.csv').squeeze()  # Convert to Series
y_test = pd.read_csv('output/q6_y_test.csv').squeeze()

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Features: `{list(X_train.columns[:5])}...` ({len(X_train.columns)} total)")

Training set: (82847, 9)
Test set: (20712, 9)
Features: `['hour', 'day_of_week', 'month', 'Wet Bulb Temperature', 'Barometric Pressure']...` (9 total)


---

## Objective

Train multiple models, evaluate performance, compare models, and extract feature importance.

---

## ⚠️ Data Leakage Warning

If you see suspiciously perfect model performance, this likely indicates data leakage. Common warning signs:

**Warning Metrics:**
- **Perfect R² = 1.0000** (or very close, like 0.9999+)
- **Zero or near-zero RMSE/MAE** (e.g., RMSE < 0.01°C for temperature prediction)
- **Train and test performance nearly identical** (difference < 0.01)
- **Unrealistic precision**: Errors smaller than measurement precision (e.g., < 0.1°C for temperature sensors)
- **Feature correlation > 0.99** with target (check correlations between features and target)

**Common Causes:**
- **Circular prediction logic**: Using rolling windows of the target variable to predict itself
  - Example: Using `air_temp_rolling_7h` to predict `Air Temperature`
  - This is like predicting temperature from smoothed temperature - circular reasoning!
- **Features nearly identical to target**: Any feature with correlation > 0.99 with the target
- **Including target variable directly**: Accidentally including the target in features

**How to Check:**
- Calculate correlations between each feature and the target
- If any feature has correlation > 0.95, investigate whether it's legitimate or leakage
- For time series: Be especially careful with rolling windows, lag features, or any transformation of the target variable

**Example of Problematic Feature:**
- `air_temp_rolling_7h` (7-hour rolling mean of Air Temperature) when predicting Air Temperature
- This feature has ~99.4% correlation with the target - too high to be useful and indicates circular logic

**Solution:**
- Only create rolling windows for **predictor variables**, not the target
- Use rolling windows of: Wind Speed, Humidity, Barometric Pressure, etc.
- Avoid rolling windows of: Air Temperature (if that's your target)
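
The rule above can be sketched as a small helper that builds rolling means for predictor columns and refuses to touch the target. This is a minimal sketch, not the course's reference solution: the function name `add_rolling_means`, the generated column naming, and the tiny demo frame are assumptions made for illustration.

```python
import pandas as pd

TARGET = "Air Temperature"  # target column named in this assignment


def add_rolling_means(df, columns, window=7, target=TARGET):
    """Add rolling-mean features for predictor columns only (hypothetical helper)."""
    if target in columns:
        raise ValueError(f"Refusing to build a rolling window on the target: {target}")
    out = df.copy()
    for col in columns:
        # e.g. "Wind Speed" with window=7 -> "wind_speed_rolling_7h"
        name = f"{col.lower().replace(' ', '_')}_rolling_{window}h"
        out[name] = out[col].rolling(window=window, min_periods=1).mean()
    return out


# Made-up hourly values, for illustration only
demo = pd.DataFrame({
    "Air Temperature": [10.0, 11.0, 12.0, 13.0],
    "Wind Speed": [2.0, 4.0, 6.0, 8.0],
})
demo_rolled = add_rolling_means(demo, ["Wind Speed"], window=2)
print(demo_rolled)
```

Passing `["Air Temperature"]` as `columns` raises a `ValueError`, which makes the circular feature hard to create by accident.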

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q7_predictions.csv`
**Format:** CSV file
**Required Columns (exact names):**
- `actual` - Actual target values from test set
- `predicted_linear` or `predicted_model1` - Predictions from first model (e.g., Linear Regression)
- `predicted_xgboost` or `predicted_model2` - Predictions from second model (e.g., XGBoost)
- Additional columns for additional models (e.g., `predicted_random_forest` or `predicted_model3`)

**Requirements:**
- Must have at least 2 model prediction columns (in addition to `actual`)
- All values must be numeric (float)
- Same number of rows as test set
- **No index column** (save with `index=False`)

**Example:**
```csv
actual,predicted_linear,predicted_xgboost
15.2,14.8,15.1
15.3,15.0,15.2
...
```

### 2. `output/q7_model_metrics.txt`
**Format:** Plain text file
**Content:** Performance metrics for each model
**Required information for each model:**
- Model name
- At least R² score for both train and test sets (additional metrics like RMSE, MAE recommended but optional)

**Requirements:**
- Clearly labeled (model name, metric name)
- **At minimum:** R² (or R-squared or R^2) for train and test for each model
- Additional metrics (RMSE, MAE) are recommended for a complete analysis
- Format should be readable

**Example format (minimum - R² only):**
```
MODEL PERFORMANCE METRICS
========================

LINEAR REGRESSION:
  Train R²: 0.3048
  Test R²:  0.3046

XGBOOST:
  Train R²: 0.9091
  Test R²:  0.7684
```

**Example format (recommended - with additional metrics):**
```
MODEL PERFORMANCE METRICS
========================

LINEAR REGRESSION:
  Train R²: 0.3048
  Test R²:  0.3046
  Train RMSE: 8.42
  Test RMSE:  8.43
  Train MAE:  7.03
  Test MAE:   7.04

XGBOOST:
  Train R²: 0.9091
  Test R²:  0.7684
  Train RMSE: 3.45
  Test RMSE:  4.87
  Train MAE:  2.58
  Test MAE:   3.66
```

### 3. `output/q7_feature_importance.csv`
**Format:** CSV file
**Required Columns (exact names):** `feature`, `importance`
**Content:** Feature importance from tree-based models (XGBoost, Random Forest)
**Requirements:**
- One row per feature
- `feature`: Feature name (string)
- `importance`: Importance score (float, typically between 0 and 1, with the scores summing to 1 across features)
- Sorted by importance (descending)
- **No index column** (save with `index=False`)

**Note:** Tree-based models (XGBoost, Random Forest) provide feature importance directly via `.feature_importances_`. If using only Linear Regression, you can use the absolute values of coefficients as a proxy for importance.

**Example:**
```csv
feature,importance
Air Temperature,0.6539
hour,0.1234
month,0.0892
Water Temperature,0.0456
...
```
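
For the Linear Regression fallback mentioned in the note above, one possible way to turn coefficients into the required `feature`/`importance` table is sketched here. The helper name `coefficient_importance` and the coefficient values are made up for illustration; with a fitted scikit-learn model you would pass `model.coef_` instead. Keep in mind that raw coefficient magnitudes are only comparable when the features are on similar scales.

```python
import numpy as np
import pandas as pd


def coefficient_importance(feature_names, coefficients):
    """Normalize absolute coefficients so they sum to 1 (hypothetical helper)."""
    abs_coef = np.abs(np.asarray(coefficients, dtype=float))
    total = abs_coef.sum()
    # Guard against an all-zero coefficient vector
    importance = abs_coef / total if total > 0 else abs_coef
    return (
        pd.DataFrame({"feature": list(feature_names), "importance": importance})
        .sort_values("importance", ascending=False)
        .reset_index(drop=True)
    )


# Illustrative coefficients only, not values from this notebook
demo_importance = coefficient_importance(
    ["hour", "month", "Wet Bulb Temperature"],
    [0.5, -1.5, 2.0],
)
print(demo_importance)
```

The resulting frame can be written with `to_csv(..., index=False)` to match the artifact format above.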

---

## Requirements Checklist

- [ ] At least 2 different models trained
  - **Suggested:** Linear Regression and XGBoost (or Random Forest)
  - You may choose other models if appropriate
- [ ] Performance evaluated on both train and test sets
- [ ] Models compared
- [ ] Feature importance extracted
  - Tree-based models: use `.feature_importances_`
  - Linear Regression: use absolute coefficient values
- [ ] Model performance documented with **at least R²** (additional metrics like RMSE, MAE recommended)
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Check for data leakage** - Before training, compute correlations between features and target. Any feature with correlation > 0.95 should be investigated and considered for removal.
2. **Train at least 2 models** - Fit models to training data, generate predictions for both train and test sets
3. **Calculate metrics** - At minimum R² for train and test; RMSE and MAE recommended
4. **Extract feature importance** - Use `.feature_importances_` for tree-based models, or coefficient magnitudes for linear models
5. **Save predictions** - DataFrame with `actual` column plus `predicted_*` columns for each model
6. **Save metrics** - Write clearly labeled metrics to text file

---

## Decision Points

- **Model selection:** Train at least 2 different models. We suggest starting with **Linear Regression** and **XGBoost**, which work well together and demonstrate different modeling approaches (linear vs gradient boosting). You may choose other models if appropriate (e.g., Random Forest or Gradient Boosting). See Lecture 11 Notebook 4 for examples.
- **Evaluation metrics:** Report at least one metric for each model. We suggest the **R² score** (coefficient of determination), which applies to Linear Regression, XGBoost, and any other regression model. It measures the proportion of variance explained and is easy to interpret. Alternative metrics that work well for both models include **RMSE** (Root Mean Squared Error) or **MAE** (Mean Absolute Error). You may include additional metrics if relevant (e.g., MAPE, adjusted R²). Compare train vs test performance to check for overfitting.
- **Feature importance:** If using tree-based models (like XGBoost), extract feature importance to understand which features matter most.
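
To make the three suggested metrics concrete, here is a small sketch that computes R², RMSE, and MAE directly from their definitions using NumPy. The function name `regression_metrics` and the four-point example are assumptions for illustration; in the notebook itself the scikit-learn functions imported in Setup do this work.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "mae": np.mean(np.abs(residuals)),
    }


# One prediction is off by 2, the rest are exact:
# r2 is about 0.2, rmse is 1.0, mae is 0.5
demo_metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(demo_metrics)
```

Note how a single large error pulls RMSE (1.0) well above MAE (0.5); comparing the two is a quick way to spot a few large misses.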

---

## Interpreting Model Performance

**Warning Signs of Data Leakage:**
- R² = 1.0000 (perfect score) or R² > 0.999
- RMSE or MAE = 0.0 or unrealistically small (< 0.01 for temperature)
- Train and test performance nearly identical (difference < 0.01)
- Any feature with correlation > 0.99 with target

**Realistic Expectations:**
- For temperature prediction: RMSE of 0.5-2.0°C is realistic
- R² of 0.85-0.98 is strong but realistic
- Some difference between train and test performance is normal

**If you see warning signs:**
1. Check your features for data leakage (see Data Leakage Warning above)
2. Calculate correlations between features and target
3. Remove features that are transformations of the target variable
4. Re-train models and verify performance is now realistic
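
Steps 2 and 3 could be sketched as follows. This is only an illustration: the helper name `split_off_suspect_features`, the 0.95 default, and the synthetic data are assumptions. A high correlation is a prompt to investigate, not proof of leakage, so review the flagged list before dropping anything.

```python
import numpy as np
import pandas as pd


def split_off_suspect_features(X_train, X_test, y_train, threshold=0.95):
    """Drop features whose absolute correlation with the target exceeds threshold.

    Correlations are computed on the training set only, and the same columns
    are then removed from both sets so they stay aligned.
    """
    corr = X_train.corrwith(y_train).abs()
    suspect = corr[corr > threshold].index.tolist()
    return X_train.drop(columns=suspect), X_test.drop(columns=suspect), suspect


# Synthetic data: "leaky" is the target plus tiny noise, "noise" is unrelated
rng = np.random.default_rng(0)
target = pd.Series(rng.normal(size=200), name="Air Temperature")
features = pd.DataFrame({
    "leaky": target + rng.normal(scale=0.01, size=200),
    "noise": rng.normal(size=200),
})

X_tr, X_te, flagged = split_off_suspect_features(
    features.iloc[:150], features.iloc[150:], target.iloc[:150]
)
print("Flagged for review:", flagged)
```

After removing a feature, re-fit the models and compare the new metrics against the realistic ranges listed above.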

---

## Checkpoint

After Q7, you should have:
- [ ] At least 2 models trained (suggested: Linear Regression and XGBoost)
- [ ] Performance metrics calculated (at minimum: one metric like R², RMSE, or MAE for train and test; additional metrics recommended)
- [ ] Models compared
- [ ] Feature importance extracted (if applicable - tree-based models like XGBoost)
- [ ] All 3 artifacts saved: `q7_predictions.csv`, `q7_model_metrics.txt`, `q7_feature_importance.csv`

---

**Next:** Continue to `q8_results.md` for Results.


## Start Modeling

### Check for Data Leakage

In [2]:
# Check 1: Are any features too highly correlated with target?
# High correlation (>0.95) suggests potential leakage
print("Correlation with target (Air Temperature):")
correlations = X_train.corrwith(y_train).abs().sort_values(ascending=False)
suspicious = correlations[correlations > 0.95]
if len(suspicious) > 0:
    print(f"⚠️ WARNING: {len(suspicious)} features have correlation > 0.95 with target:")
    print(suspicious)
else:
    print("✅ No suspiciously high correlations")

# Check 2: Do any feature names suggest they depend on the target?
target_related = [col for col in X_train.columns if 'air' in col.lower() or 'temperature' in col.lower()]
if target_related:
    print(f"\n⚠️ WARNING: Features with 'Air' or 'Temperature' in name: {target_related}")
    print("Verify these don't leak target information!")
else:
    print("\n✅ No obviously problematic feature names")

Correlation with target (Air Temperature):
Wet Bulb Temperature    0.981478
dtype: float64

Verify these don't leak target information!


In [3]:
# Adapted from Lecture 11 demo
# Model evaluation helper functions
def evaluate_model(y_true, y_pred, dataset_name="Dataset"):
    """
    Calculate standard regression metrics.

    Demonstrates DRY principle: evaluation logic in one place.

    Parameters:
    -----------
    y_true : array-like
        True values
    y_pred : array-like
        Predicted values
    dataset_name : str
        Name for display purposes

    Returns:
    --------
    dict : Dictionary containing RMSE, MAE, and R² scores
    """
    return {
        'dataset': dataset_name,
        'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
        'mae': mean_absolute_error(y_true, y_pred),
        'r2': r2_score(y_true, y_pred)
    }

def assess_overfitting(train_r2, test_r2):
    """
    Assess model overfitting by comparing train and test R² scores.

    Overfitting gap = Train R² - Test R²
    - < 5%: Excellent generalization
    - 5-10%: Good generalization
    - 10-20%: Some overfitting - consider regularization
    - > 20%: Severe overfitting - model needs adjustment

    Parameters:
    -----------
    train_r2 : float
        R² score on training set
    test_r2 : float
        R² score on test set

    Returns:
    --------
    tuple : (gap, status_message)
    """
    gap = train_r2 - test_r2

    if gap < 0.05:
        return gap, "✅ Excellent generalization"
    elif gap < 0.10:
        return gap, "✅ Good generalization"
    elif gap < 0.20:
        return gap, "⚠️ Some overfitting - consider regularization"
    else:
        return gap, "❌ Severe overfitting - model needs adjustment"

# Model hyperparameters
RANDOM_SEED = 42  # For reproducible results

# Random Forest hyperparameters
RF_N_ESTIMATORS = 100  # Number of trees (more = better but slower)
RF_MAX_DEPTH = 10      # Max tree depth (lower = less overfitting)

# XGBoost hyperparameters
XGB_N_ESTIMATORS = 100    # Number of boosting rounds
XGB_MAX_DEPTH = 6         # Max tree depth (XGBoost default, shallower than RF)
XGB_LEARNING_RATE = 0.1   # Step size shrinkage (lower = more conservative)
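
As a quick illustration of how the gap thresholds in `assess_overfitting` behave, the sketch below applies the same cut-offs to the example R² values from the metrics file format shown under Required Artifacts. The compact function `overfitting_band` is a stand-in written for this example so the snippet runs on its own; the input numbers are the document's example values, not results from this notebook.

```python
def overfitting_band(train_r2, test_r2):
    """Return (gap, label) using the same thresholds as assess_overfitting."""
    gap = train_r2 - test_r2
    if gap < 0.05:
        label = "excellent"
    elif gap < 0.10:
        label = "good"
    elif gap < 0.20:
        label = "some overfitting"
    else:
        label = "severe overfitting"
    return gap, label


# Example values from the sample q7_model_metrics.txt format
linear_gap, linear_band = overfitting_band(0.3048, 0.3046)
xgb_gap, xgb_band = overfitting_band(0.9091, 0.7684)

print(f"Linear Regression: gap={linear_gap:.4f} ({linear_band})")
print(f"XGBoost:           gap={xgb_gap:.4f} ({xgb_band})")
```

A small gap only says the model generalizes consistently; the Linear Regression example still has a low test R², so the gap should always be read alongside the absolute scores.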

### Baseline Model: Linear Regression

In [4]:
# Adapted from Lecture 11 demo
# Train linear regression model
print("📊 Model 1: Linear Regression")

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_train_pred_lr = lr_model.predict(X_train)
y_test_pred_lr = lr_model.predict(X_test)

# Evaluate using helper function
train_metrics_lr = evaluate_model(y_train, y_train_pred_lr, "Training")
test_metrics_lr = evaluate_model(y_test, y_test_pred_lr, "Test")

# Check for overfitting using helper function
overfit_lr, overfit_status = assess_overfitting(train_metrics_lr['r2'], test_metrics_lr['r2'])

print("Performance Results")
display(pd.DataFrame({
    'Metric': ['RMSE', 'MAE', 'R²'],
    'Training': [
        f"{train_metrics_lr['rmse']:.2f}°C",
        f"{train_metrics_lr['mae']:.2f}°C",
        f"{train_metrics_lr['r2']:.4f}"
    ],
    'Test': [
        f"{test_metrics_lr['rmse']:.2f}°C",
        f"{test_metrics_lr['mae']:.2f}°C",
        f"{test_metrics_lr['r2']:.4f}"
    ]
}))
print(f"Overfitting (R² difference): {overfit_lr:.4f} - {overfit_status}")

# Store for comparison later
train_mae_lr, test_mae_lr = train_metrics_lr['mae'], test_metrics_lr['mae']
train_rmse_lr, test_rmse_lr = train_metrics_lr['rmse'], test_metrics_lr['rmse']
train_r2_lr, test_r2_lr = train_metrics_lr['r2'], test_metrics_lr['r2']

📊 Model 1: Linear Regression
Performance Results

| | Metric | Training | Test |
|---|---|---|---|
| 0 | RMSE | 1.81°C | 1.73°C |
| 1 | MAE | 1.36°C | 1.25°C |
| 2 | R² | 0.9703 | 0.9697 |

Overfitting (R² difference): 0.0006 - ✅ Excellent generalization


### Comparison Model: XGBoost

In [5]:
# Adapted from Lecture 11 demo
# Train XGBoost model
print("🚀 Model 2: XGBoost")

xgb_model = xgb.XGBRegressor(
    n_estimators=XGB_N_ESTIMATORS,
    max_depth=XGB_MAX_DEPTH,
    learning_rate=XGB_LEARNING_RATE,
    random_state=RANDOM_SEED,
    n_jobs=-1
)
xgb_model.fit(X_train, y_train)

# Make predictions
y_train_pred_xgb = xgb_model.predict(X_train)
y_test_pred_xgb = xgb_model.predict(X_test)

# Evaluate using helper function
train_metrics_xgb = evaluate_model(y_train, y_train_pred_xgb, "Training")
test_metrics_xgb = evaluate_model(y_test, y_test_pred_xgb, "Test")

# Check for overfitting using helper function
overfit_xgb, overfit_status = assess_overfitting(train_metrics_xgb['r2'], test_metrics_xgb['r2'])

print("Performance Results")
display(pd.DataFrame({
    'Metric': ['RMSE', 'MAE', 'R²'],
    'Training': [
        f"{train_metrics_xgb['rmse']:.2f}°C",
        f"{train_metrics_xgb['mae']:.2f}°C",
        f"{train_metrics_xgb['r2']:.4f}"
    ],
    'Test': [
        f"{test_metrics_xgb['rmse']:.2f}°C",
        f"{test_metrics_xgb['mae']:.2f}°C",
        f"{test_metrics_xgb['r2']:.4f}"
    ]
}))
print(f"Overfitting (R² difference): {overfit_xgb:.4f} - {overfit_status}")

# Store for comparison later
train_mae_xgb, test_mae_xgb = train_metrics_xgb['mae'], test_metrics_xgb['mae']
train_rmse_xgb, test_rmse_xgb = train_metrics_xgb['rmse'], test_metrics_xgb['rmse']
train_r2_xgb, test_r2_xgb = train_metrics_xgb['r2'], test_metrics_xgb['r2']

🚀 Model 2: XGBoost
Performance Results

| | Metric | Training | Test |
|---|---|---|---|
| 0 | RMSE | 1.28°C | 1.55°C |
| 1 | MAE | 0.92°C | 1.10°C |
| 2 | R² | 0.9850 | 0.9758 |

Overfitting (R² difference): 0.0092 - ✅ Excellent generalization


In [6]:
# Adapted from Lecture 11 demo
# Extract feature importance from the trained model
xgb_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print("🔑 Top 10 Most Important Features")
display(xgb_importance.head(10))

🔑 Top 10 Most Important Features

| | feature | importance |
|---|---|---|
| 3 | Wet Bulb Temperature | 0.961444 |
| 7 | wind_v | 0.008494 |
| 6 | Solar Radiation | 0.008435 |
| 0 | hour | 0.007726 |
| 8 | wind_u | 0.0043 |
| 2 | month | 0.003175 |
| 5 | Total Rain | 0.002878 |
| 4 | Barometric Pressure | 0.002697 |
| 1 | day_of_week | 0.00085 |


### Save Modeling Results

In [None]:
# Save prediction results
predictions = pd.DataFrame({
    'actual': y_test,
    'predicted_linear': y_test_pred_lr,
    'predicted_xgboost': y_test_pred_xgb
})

# Check
display(predictions.head())
# Save to csv without index
predictions.to_csv('output/q7_predictions.csv', index=False)

# Build the metrics report and write it to a text file
metric_results = f"""MODEL PERFORMANCE METRICS
========================

LINEAR REGRESSION:
  Train R²: {train_r2_lr:.4f}
  Test R²:  {test_r2_lr:.4f}
  Train RMSE: {train_rmse_lr:.2f}
  Test RMSE:  {test_rmse_lr:.2f}
  Train MAE:  {train_mae_lr:.2f}
  Test MAE:   {test_mae_lr:.2f}

XGBOOST:
  Train R²: {train_r2_xgb:.4f}
  Test R²:  {test_r2_xgb:.4f}
  Train RMSE: {train_rmse_xgb:.2f}
  Test RMSE:  {test_rmse_xgb:.2f}
  Train MAE:  {train_mae_xgb:.2f}
  Test MAE:   {test_mae_xgb:.2f}
"""
with open('output/q7_model_metrics.txt', 'w') as f:
    f.write(metric_results)

# Save Feature Importance
xgb_importance.to_csv('output/q7_feature_importance.csv', index=False)

| | actual | predicted_linear | predicted_xgboost |
|---|---|---|---|
| 0 | 20.7 | 22.067598 | 20.780703 |
| 1 | 21.3 | 21.742437 | 21.65295 |
| 2 | 21.0 | 22.233919 | 21.434092 |
| 3 | 21.3 | 22.979513 | 21.942102 |
| 4 | 21.5 | 22.126428 | 21.918514 |
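
Before submitting, it may help to re-read the saved CSVs and check them against the Required Artifacts section. The sketch below shows one way to do that; the helper names `check_predictions` and `check_feature_importance` are hypothetical, and the demo uses in-memory CSV text built from the example rows above rather than the real `output/` files. In practice you would pass `pd.read_csv('output/q7_predictions.csv')` and `len(X_test)`.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_predictions(df, expected_rows):
    """Return a list of problems found in a predictions table (empty = OK)."""
    problems = []
    if "actual" not in df.columns:
        problems.append("missing 'actual' column")
    pred_cols = [c for c in df.columns if str(c).startswith("predicted_")]
    if len(pred_cols) < 2:
        problems.append("need at least 2 'predicted_*' columns")
    # pandas names a saved index column "Unnamed: 0" when the CSV is re-read
    if any(str(c).startswith("Unnamed") for c in df.columns):
        problems.append("index column was saved (use index=False)")
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems


def check_feature_importance(df):
    """Return a list of problems found in a feature importance table."""
    problems = []
    if list(df.columns) != ["feature", "importance"]:
        problems.append("columns must be exactly ['feature', 'importance']")
    elif not df["importance"].is_monotonic_decreasing:
        problems.append("rows are not sorted by importance (descending)")
    return problems


# Demo with in-memory CSV text instead of real files
demo = pd.DataFrame({
    "actual": [15.2, 15.3],
    "predicted_linear": [14.8, 15.0],
    "predicted_xgboost": [15.1, 15.2],
})
good = pd.read_csv(io.StringIO(demo.to_csv(index=False)))
bad = pd.read_csv(io.StringIO(demo.to_csv(index=True)))  # index saved by mistake

good_problems = check_predictions(good, expected_rows=2)
bad_problems = check_predictions(bad, expected_rows=2)
unsorted_problems = check_feature_importance(
    pd.DataFrame({"feature": ["hour", "month"], "importance": [0.1, 0.9]})
)
print(good_problems, bad_problems, unsorted_problems)
```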
