# Q8: Results

**Phase 9:** Results & Insights  
**Points:** 3

**Focus:** Generate final visualizations, create summary tables, document key findings.

**Lecture Reference:** Lecture 11, Notebook 4 ([`11/demo/04_modeling_results.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/04_modeling_results.ipynb)), Phase 9. Also see Lecture 07 (visualization).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load model results from Q7
predictions = pd.read_csv('output/q7_predictions.csv')
# Use a context manager so the file handle is closed after reading
with open('output/q7_model_metrics.txt') as f:
    metrics = f.read()
feature_importance = pd.read_csv('output/q7_feature_importance.csv')

---

## Objective

Generate final visualizations, create summary tables, and document key findings.

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q8_final_visualizations.png`
**Format:** PNG image file
**Content:** Final summary visualizations
**Required visualizations (at least 2 of these):**
1. **Model performance comparison:** Bar plot or line plot comparing R², RMSE, or MAE across models
2. **Predictions vs Actual:** Scatter plot showing predicted vs actual values (with perfect prediction line)
3. **Feature importance:** Bar plot showing top N features by importance
4. **Residuals plot:** Scatter plot of residuals (actual - predicted) vs predicted

**Requirements:**
- Clear axis labels (xlabel, ylabel)
- Title for each subplot
- Overall figure title (optional but recommended)
- Legend if multiple series shown
- Saved as PNG with sufficient resolution (dpi=150 or higher)
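
A residuals panel (option 4 in the list above) could be sketched as follows. This is a minimal illustration, not a reference solution: the data is synthetic, and the filename `residuals_sketch.png` is a placeholder chosen so the sketch does not overwrite a required artifact. With real results, the actual and predicted columns from `output/q7_predictions.csv` would take the place of the generated arrays.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for actual and predicted values (assumption, not course data)
rng = np.random.default_rng(0)
actual = rng.normal(loc=15.0, scale=10.0, size=500)
predicted = actual + rng.normal(loc=0.0, scale=3.0, size=500)

# Same sign convention as the list above: actual minus predicted
residuals = actual - predicted

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(predicted, residuals, alpha=0.3, s=10, label="Residuals")
ax.axhline(0, color="red", linestyle="--", label="Zero error")
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual (actual - predicted)")
ax.set_title("Residuals vs Predicted")
ax.legend()
fig.tight_layout()

out_path = Path("residuals_sketch.png")  # placeholder filename
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

Points scattered evenly around the zero line suggest no systematic bias; a funnel or curve in the cloud suggests the model misses some structure.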

### 2. `output/q8_summary.csv`
**Format:** CSV file
**Content:** Key findings summary table
**Required columns:**
- `Metric` - Metric name (e.g., "R² Score", "RMSE", "MAE")
- One column per model (e.g., `Linear Regression`, `Random Forest`, `XGBoost`)

**Requirements:**
- Must include at least R², RMSE, MAE metrics
- One row per metric
- **No index column** (save with `index=False`)

**Example:**
```csv
Metric,Linear Regression,Random Forest,XGBoost
R² Score,-0.0201,0.9705,0.9967
RMSE,12.7154,2.1634,0.7276
MAE,9.8468,1.3545,0.4480
```
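
One way to produce a table in this shape is to build a DataFrame from a nested dictionary and move the metric names out of the index. The sketch below is hedged: the numbers are copied from the example table above purely as placeholders, and `summary_sketch.csv` is a hypothetical filename used so the sketch does not overwrite a required artifact.

```python
import pandas as pd

# Placeholder values copied from the example table above, not real results
metrics_by_model = {
    "Linear Regression": {"R² Score": -0.0201, "RMSE": 12.7154, "MAE": 9.8468},
    "Random Forest": {"R² Score": 0.9705, "RMSE": 2.1634, "MAE": 1.3545},
    "XGBoost": {"R² Score": 0.9967, "RMSE": 0.7276, "MAE": 0.4480},
}

# Outer dict keys become columns (models); inner keys become the index (metrics)
summary = pd.DataFrame(metrics_by_model)
summary.index.name = "Metric"
summary = summary.reset_index()  # turn the metric names into a regular column

# index=False keeps pandas from writing an extra unnamed index column
summary.to_csv("summary_sketch.csv", index=False, encoding="utf-8")

# Read the file back to confirm the layout matches the requirements
check = pd.read_csv("summary_sketch.csv")
print(check.columns.tolist())
```

Reading the file back is a cheap way to catch a stray index column before submitting.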

### 3. `output/q8_key_findings.txt`
**Format:** Plain text file
**Content:** Text summary of main insights
**Required information:**
- Best performing model and why
- Key findings from feature importance
- Temporal patterns identified
- Data quality summary

**Example format:**
```
KEY FINDINGS SUMMARY
===================

MODEL PERFORMANCE:
- Best performing model: XGBoost (R² = 0.9967)
- Tree-based models show strong performance (R² > 0.7); Linear Regression does not (R² = -0.0201)
- XGBoost achieves lowest RMSE: 0.73°C

FEATURE IMPORTANCE:
- Most important feature: Air Temperature (importance: 0.6539)
- Top 3 features account for 93.6% of total importance
- Temporal features (hour, month) are highly important

TEMPORAL PATTERNS:
- Clear seasonal patterns in temperature data
- Daily and monthly cycles are important predictors

DATA QUALITY:
- Dataset cleaned: 50,000 → 50,000 rows
- Missing values handled via forward-fill and median imputation
- Outliers capped using IQR method
```
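
Statements such as "Top 3 features account for 93.6% of total importance" are easier to trust when computed rather than typed by hand. The following sketch shows the arithmetic. It assumes a table with `feature` and `importance` columns, and the feature names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical importances for illustration only; real values come from Q7
feature_importance = pd.DataFrame({
    "feature": ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"],
    "importance": [0.50, 0.25, 0.15, 0.06, 0.04],
})

ranked = feature_importance.sort_values("importance", ascending=False)
top3 = ranked.head(3)

# Share of total importance held by the top 3 features, as a percentage
top3_share = 100 * top3["importance"].sum() / ranked["importance"].sum()

top_feature = ranked.iloc[0]
print(f"- Most important feature: {top_feature['feature']} "
      f"(importance: {top_feature['importance']:.4f})")
print(f"- Top 3 features account for {top3_share:.1f}% of total importance")
```

Dividing by the column total keeps the percentage correct even when the importances do not sum to exactly 1.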

---

## Requirements Checklist

- [ ] Final visualizations created (model performance, key insights)
- [ ] Summary tables generated
- [ ] Key findings documented
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Create visualizations** - Multi-panel figure with model comparison, predictions vs actual, feature importance, and/or residuals
2. **Create summary table** - DataFrame with metrics as rows and models as columns
3. **Document key findings** - Text summary covering model performance, feature importance insights, temporal patterns, and data quality notes

---

## Decision Points

- **Visualizations:** What best communicates your findings? Model performance plots? Time series with predictions? Feature importance plots?
- **Summary:** What are the key takeaways? Document the most important findings from your analysis.

---

## Checkpoint

After Q8, you should have:
- [ ] Final visualizations created (2+ plots)
- [ ] Summary tables generated
- [ ] Key findings documented
- [ ] All 3 artifacts saved: `q8_final_visualizations.png`, `q8_summary.csv`, `q8_key_findings.txt`
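
The last checklist item can be verified with a short script. This is a sketch under the assumption that the artifacts live in an `output/` directory relative to the working directory; `missing_artifacts` is a hypothetical helper name, not part of the course materials.

```python
from pathlib import Path

REQUIRED = [
    "q8_final_visualizations.png",
    "q8_summary.csv",
    "q8_key_findings.txt",
]


def missing_artifacts(output_dir="output"):
    """Return required filenames that are absent or empty in output_dir."""
    out = Path(output_dir)
    return [
        name for name in REQUIRED
        if not (out / name).is_file() or (out / name).stat().st_size == 0
    ]


missing = missing_artifacts()
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All 3 artifacts present")
```

The size check also catches a file that was created but never written to, for example after an exception partway through a cell.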

---

**Next:** Continue to `q9_writeup.md` for Writeup.


In [2]:
# =========================
# Q8: RESULTS
# =========================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import os

# Ensure output dir exists
os.makedirs("output", exist_ok=True)

In [3]:

# -------------------------
# Load Q7 outputs
# -------------------------
predictions = pd.read_csv("output/q7_predictions.csv")
feature_importance = pd.read_csv("output/q7_feature_importance.csv")

# If you want to see the raw text (context manager closes the file afterwards):
with open("output/q7_model_metrics.txt") as f:
    metrics_text = f.read()
print(metrics_text)


MODEL PERFORMANCE METRICS

LINEAR REGRESSION:
  Train R²:  0.4219
  Test R²:   0.2229
  Train RMSE: 8.021
  Test RMSE:  8.986
  Train MAE:  6.505
  Test MAE:   7.608

XGBOOST:
  Train R²:  0.7685
  Test R²:   0.3977
  Train RMSE: 5.076
  Test RMSE:  7.912
  Train MAE:  3.766
  Test MAE:   6.083

RANDOM FOREST:
  Train R²:  0.9304
  Test R²:   0.3561
  Train RMSE: 2.784
  Test RMSE:  8.180
  Train MAE:  1.869
  Test MAE:   6.172
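
The gap between train and test R² in the output above is a quick overfitting signal. As a rough sketch, the gap can be computed by parsing text in this layout. The excerpt below repeats only the R² lines printed above, and the parsing pattern assumes exactly this `Train R²:` / `Test R²:` format.

```python
import re

# Excerpt in the same layout as the printed metrics above (R² lines only)
metrics_text = """\
LINEAR REGRESSION:
  Train R²:  0.4219
  Test R²:   0.2229

XGBOOST:
  Train R²:  0.7685
  Test R²:   0.3977

RANDOM FOREST:
  Train R²:  0.9304
  Test R²:   0.3561
"""

r2_by_model = {}
current = None
for raw_line in metrics_text.splitlines():
    line = raw_line.strip()
    header = re.fullmatch(r"([A-Z ]+):", line)
    if header:
        # A line such as "XGBOOST:" starts a new model section
        current = header.group(1)
        r2_by_model[current] = {}
        continue
    value = re.fullmatch(r"(Train|Test) R²:\s+(-?\d+\.\d+)", line)
    if value and current:
        r2_by_model[current][value.group(1)] = float(value.group(2))

for name, r2 in r2_by_model.items():
    print(f"{name}: train - test R² gap = {r2['Train'] - r2['Test']:.4f}")
```

A larger gap means the model fits the training data much better than unseen data; here Random Forest has the widest gap of the three.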



In [4]:
# =========================
# 1. Recompute TEST metrics per model
# =========================

y_true = predictions["actual"]

model_cols = {
    "Linear Regression": "predicted_linear",
    "Random Forest": "predicted_random_forest",
    "XGBoost": "predicted_xgboost",
}

metrics_by_model = {}

for model_name, col in model_cols.items():
    y_pred = predictions[col]
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    metrics_by_model[model_name] = {
        "R2": r2,
        "RMSE": rmse,
        "MAE": mae,
    }

print("Test metrics by model:")
for m, vals in metrics_by_model.items():
    print(m, vals)


Test metrics by model:
Linear Regression {'R2': 0.22290079264296436, 'RMSE': np.float64(8.986378002325663), 'MAE': 7.60829985161003}
Random Forest {'R2': 0.3561490328310547, 'RMSE': np.float64(8.179732965621744), 'MAE': 6.171701917112534}
XGBoost {'R2': 0.3976542191649065, 'RMSE': np.float64(7.911692257831281), 'MAE': 6.082769868632844}


In [5]:

# =========================
# 2. Create Q8 summary table (q8_summary.csv)
# =========================

summary_rows = []

for metric in ["R2", "RMSE", "MAE"]:
    row = {"Metric": metric}
    for model_name in model_cols.keys():
        row[model_name] = metrics_by_model[model_name][metric]
    summary_rows.append(row)

summary_df = pd.DataFrame(summary_rows)
summary_df.to_csv("output/q8_summary.csv", index=False)
print("✅ output/q8_summary.csv saved")


✅ output/q8_summary.csv saved


In [11]:
# =========================
# 3. Final visualizations (q8_final_visualizations.png)
# =========================

sns.set_style("whitegrid")

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# ---- (A) Bar plot of R2 by model ----
ax0 = axes[0]
model_names = list(model_cols.keys())
r2_values = [metrics_by_model[m]["R2"] for m in model_names]

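ax0.bar(model_names, r2_values)
ax0.set_title("Test R² by Model")
ax0.set_xlabel("Model")
ax0.set_ylabel("R²")
# Start the axis at 0 unless a model has negative R², so no bar is clipped
ax0.set_ylim(min(0, min(r2_values) - 0.1), max(r2_values) + 0.1)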

# ---- (B) Actual vs Predicted (XGBoost) ----
ax1 = axes[1]
y_pred_xgb = predictions["predicted_xgboost"]

ax1.scatter(y_true, y_pred_xgb, alpha=0.3, s=10, label="XGBoost predictions")
min_val = min(y_true.min(), y_pred_xgb.min())
max_val = max(y_true.max(), y_pred_xgb.max())
ax1.plot([min_val, max_val], [min_val, max_val], "r--", label="Perfect prediction")

ax1.set_title("Actual vs Predicted (XGBoost)")
ax1.set_xlabel("Actual Air Temperature")
ax1.set_ylabel("Predicted Air Temperature")
ax1.legend()

# ---- (C) Feature importance (top 10 from XGBoost) ----
ax2 = axes[2]
fi_top = feature_importance.sort_values("importance", ascending=False).head(10)

ax2.barh(fi_top["feature"], fi_top["importance"])
ax2.set_title("Top 10 Feature Importances (XGBoost)")
ax2.set_xlabel("Importance")
ax2.set_ylabel("Feature")
ax2.invert_yaxis()  # most important on top

plt.suptitle("Final Model Results and Key Insights", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.95])

plt.savefig("output/q8_final_visualizations.png", dpi=200)
plt.close(fig)
print("✅ output/q8_final_visualizations.png saved")



✅ output/q8_final_visualizations.png saved


In [12]:
# =========================
# 4. Key findings text file (q8_key_findings.txt)
# =========================

# Pick the model with the highest test R² instead of hard-coding the name
best_model = max(metrics_by_model, key=lambda m: metrics_by_model[m]["R2"])

key_lines = [
    "KEY FINDINGS SUMMARY",
    "===================",
    "",
    "MODEL PERFORMANCE:",
    f"- Best performing model on the test set: {best_model} "
    f"(R² ≈ {metrics_by_model[best_model]['R2']:.3f}, "
    f"RMSE ≈ {metrics_by_model[best_model]['RMSE']:.2f}, "
    f"MAE ≈ {metrics_by_model[best_model]['MAE']:.2f}).",
    "- Linear Regression captured only part of the variability in air temperature and had the weakest test performance.",
    "- Random Forest fit the training data extremely well but showed clear overfitting (very high train R² with noticeably lower test R²).",
    "",
    "FEATURE IMPORTANCE (XGBoost):",
    "- Total Rain, Barometric Pressure, and Solar Radiation were the most important predictors of air temperature.",
    "- Station location indicators (e.g., Foster and Oak Street Weather Stations) also contributed substantially, suggesting local microclimate effects.",
    "- Interaction terms such as wind–humidity interaction and normalized solar radiation further improved performance, highlighting combined effects of wind, moisture, and radiation.",
    "",
    "TEMPORAL PATTERNS (from earlier phases):",
    "- Air temperature showed clear seasonal patterns: warmer values in mid-year months and cooler values in winter months.",
    "- Daily patterns were evident, with higher temperatures during daytime hours and lower temperatures overnight.",
    "- Humidity and precipitation-related variables tended to increase during cooler, wetter periods.",
    "",
    "DATA QUALITY SUMMARY:",
    "- The cleaned dataset contained approximately 196,000 sensor readings after preprocessing.",
    "- Missing values were primarily concentrated in rain-related and heading fields and were handled using time-series appropriate methods (forward-fill/backward-fill and occasional median imputation).",
    "- A small number of extreme outliers were capped using an IQR-based approach, and duplicate records were not a major issue.",
    "- Final modeling matrices contained no missing values and only predictors (no target-derived features), minimizing the risk of data leakage.",
]

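# Explicit UTF-8: the text contains non-ASCII characters (², ≈) that can fail under other default encodings
with open("output/q8_key_findings.txt", "w", encoding="utf-8") as f: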
    f.write("\n".join(key_lines))

print("✅ output/q8_key_findings.txt saved")

✅ output/q8_key_findings.txt saved
