# Q8: Results

**Phase 9:** Results & Insights  
**Points:** 3

**Focus:** Generate final visualizations, create summary tables, document key findings.

**Lecture Reference:** See **Lecture 11, Notebook 4** (`11/demo/04_modeling_results.ipynb`), Phase 9 for examples of final visualizations and results communication.

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load model results from Q7
predictions = pd.read_csv('output/q7_predictions.csv')
# Use a context manager so the file handle is closed after reading
with open('output/q7_model_metrics.txt') as f:
    metrics = f.read()
feature_importance = pd.read_csv('output/q7_feature_importance.csv')

---

## Objective

Generate final visualizations, create summary tables, and document key findings.

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q8_final_visualizations.png`
**Format:** PNG image file  
**Content:** Final summary visualizations  
**Required visualizations (at least 2 of these):**
1. **Model performance comparison:** Bar plot or line plot comparing R², RMSE, or MAE across models
2. **Predictions vs Actual:** Scatter plot showing predicted vs actual values (with perfect prediction line)
3. **Feature importance:** Bar plot showing top N features by importance
4. **Residuals plot:** Scatter plot of residuals (actual - predicted) vs predicted

**Requirements:**
- Clear axis labels (xlabel, ylabel)
- Title for each subplot
- Overall figure title (optional but recommended)
- Legend if multiple series shown
- Saved as PNG with sufficient resolution (dpi=150 or higher)
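
As a rough illustration of the labelling and export requirements above, the sketch below draws two of the listed panels (predictions vs actual, and residuals) from synthetic data. The arrays `actual` and `predicted`, the helper name `draw_diagnostics`, and the temporary output path are assumptions made for illustration, not part of the assignment; swap in your Q7 predictions and the required filename.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def draw_diagnostics(actual, predicted, out_path, dpi=150):
    """Hypothetical helper: predicted-vs-actual and residual panels, saved as PNG."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted

    fig, (ax_pred, ax_res) = plt.subplots(1, 2, figsize=(12, 5))

    # Panel 1: predictions vs actual, with the y = x perfect-prediction line
    ax_pred.scatter(actual, predicted, alpha=0.5, s=20, label="Predictions")
    lo = min(actual.min(), predicted.min())
    hi = max(actual.max(), predicted.max())
    ax_pred.plot([lo, hi], [lo, hi], "k--", label="Perfect prediction")
    ax_pred.set_xlabel("Actual")
    ax_pred.set_ylabel("Predicted")
    ax_pred.set_title("Predictions vs Actual")
    ax_pred.legend()

    # Panel 2: residuals (actual - predicted) against predicted values
    ax_res.scatter(predicted, residuals, alpha=0.5, s=20)
    ax_res.axhline(0, color="red", linestyle="--")
    ax_res.set_xlabel("Predicted")
    ax_res.set_ylabel("Residual (actual - predicted)")
    ax_res.set_title("Residuals vs Predicted")

    fig.suptitle("Model diagnostics (synthetic data)")
    fig.tight_layout()
    fig.savefig(out_path, dpi=dpi, bbox_inches="tight")
    plt.close(fig)
    return out_path


# Synthetic stand-in for the Q7 predictions (assumed values, not real results)
rng = np.random.default_rng(0)
actual = rng.normal(15.0, 5.0, size=200)
predicted = actual + rng.normal(0.0, 1.0, size=200)

demo_path = os.path.join(tempfile.mkdtemp(), "demo_visualizations.png")
draw_diagnostics(actual, predicted, demo_path)
```

For the real artifact you would extend this to the panels you choose and save to `output/q8_final_visualizations.png`.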

### 2. `output/q8_summary.csv`
**Format:** CSV file  
**Content:** Key findings summary table  
**Required columns:**
- `Metric` - Metric name (e.g., "R² Score", "RMSE", "MAE")
- One column per model (e.g., `Linear Regression`, `Random Forest`, `XGBoost`)

**Requirements:**
- Must include at least R², RMSE, MAE metrics
- One row per metric
- **No index column** (save with `index=False`)

**Example:**
```csv
Metric,Linear Regression,Random Forest,XGBoost
R² Score,-0.0201,0.9705,0.9967
RMSE,12.7154,2.1634,0.7276
MAE,9.8468,1.3545,0.4480
```
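
A negative R² like the Linear Regression value in the example can legitimately occur: R² compares a model against always predicting the mean of the actual values, so it drops below zero when the model does worse than that baseline. The sketch below spells out the standard definitions with NumPy on made-up numbers. The function name `regression_metrics` is an assumption for illustration; in practice `sklearn.metrics` provides `r2_score`, `mean_squared_error`, and `mean_absolute_error`.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Hypothetical helper computing R², RMSE, and MAE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # variation around the mean

    return {
        # Undefined when y_true is constant (ss_tot == 0); not handled in this sketch
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
    }


y_true = [1.0, 2.0, 3.0, 4.0]

# Predictions close to the truth: R² is about 0.98
good = regression_metrics(y_true, [1.1, 1.9, 3.2, 3.8])

# Reversed predictions are worse than predicting the mean: R² is -3.0
bad = regression_metrics(y_true, [4.0, 3.0, 2.0, 1.0])
```

RMSE and MAE stay in the units of the target, which is why they are easier to quote in the key findings than R² alone.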

### 3. `output/q8_key_findings.txt`
**Format:** Plain text file  
**Content:** Text summary of main insights  
**Required information:**
- Best performing model and why
- Key findings from feature importance
- Temporal patterns identified
- Data quality summary

**Example format:**
```
KEY FINDINGS SUMMARY
===================

MODEL PERFORMANCE:
- Best performing model: XGBoost (R² = 0.9967)
- Tree-based models perform well (R² > 0.97); Linear Regression does not (R² = -0.0201)
- XGBoost achieves lowest RMSE: 0.73°C

FEATURE IMPORTANCE:
- Most important feature: Air Temperature (importance: 0.6539)
- Top 3 features account for 93.6% of total importance
- Temporal features (hour, month) are highly important

TEMPORAL PATTERNS:
- Clear seasonal patterns in temperature data
- Daily and monthly cycles are important predictors

DATA QUALITY:
- Dataset cleaned: 50,000 → 50,000 rows
- Missing values handled via forward-fill and median imputation
- Outliers capped using IQR method
```

---

## Requirements Checklist

- [ ] Final visualizations created (model performance, key insights)
- [ ] Summary tables generated
- [ ] Key findings documented
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Create visualizations:**
   ```python
   fig, axes = plt.subplots(2, 2, figsize=(16, 12))
   
   # Plot 1: Model performance comparison
   # Plot 2: Predictions vs Actual
   # Plot 3: Feature importance
   # Plot 4: Residuals plot
   
   plt.savefig('output/q8_final_visualizations.png', dpi=150, bbox_inches='tight')
   ```

2. **Create summary table:**
   ```python
   summary_data = {
       'Metric': ['R² Score', 'RMSE', 'MAE'],
       'Linear Regression': [lr_r2, lr_rmse, lr_mae],
       'Random Forest': [rf_r2, rf_rmse, rf_mae],
       'XGBoost': [xgb_r2, xgb_rmse, xgb_mae]
   }
   summary_df = pd.DataFrame(summary_data)
   summary_df.to_csv('output/q8_summary.csv', index=False)
   ```

3. **Document key findings:**
   - Write summary to `output/q8_key_findings.txt`
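
Step 3 could be sketched along these lines. This is one possible approach, not a required one: the helper `format_key_findings` is hypothetical, the `metrics` and `top_features` values are copied from the example artifacts in this notebook rather than from real results, and the temporal and data quality sections are left as stubs to fill in from your own earlier questions.

```python
import os
import tempfile


def format_key_findings(metrics, top_features):
    """Hypothetical helper: build the findings text from per-model metrics."""
    best_model = max(metrics, key=lambda name: metrics[name]["R2"])
    lowest_rmse = min(metrics, key=lambda name: metrics[name]["RMSE"])
    title = "KEY FINDINGS SUMMARY"

    lines = [
        title,
        "=" * len(title),
        "",
        "MODEL PERFORMANCE:",
        f"- Best performing model: {best_model} (R² = {metrics[best_model]['R2']:.4f})",
        f"- Lowest RMSE: {lowest_rmse} ({metrics[lowest_rmse]['RMSE']:.2f})",
        "",
        "FEATURE IMPORTANCE:",
    ]
    for name, importance in top_features:
        lines.append(f"- {name} (importance: {importance:.4f})")

    # Stubs: replace with findings from your own analysis
    lines += [
        "",
        "TEMPORAL PATTERNS:",
        "- TODO: describe the patterns you observed",
        "",
        "DATA QUALITY:",
        "- TODO: summarize cleaning steps and row counts",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values taken from the example summary table, not real results
metrics = {
    "Linear Regression": {"R2": -0.0201, "RMSE": 12.7154, "MAE": 9.8468},
    "Random Forest": {"R2": 0.9705, "RMSE": 2.1634, "MAE": 1.3545},
    "XGBoost": {"R2": 0.9967, "RMSE": 0.7276, "MAE": 0.4480},
}
top_features = [("Air Temperature", 0.6539)]

findings_text = format_key_findings(metrics, top_features)

# Written to a temporary folder here; the assignment expects output/q8_key_findings.txt
out_path = os.path.join(tempfile.mkdtemp(), "q8_key_findings.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(findings_text)
```

Deriving the best model from the metrics, instead of typing it by hand, keeps the text consistent with `q8_summary.csv` if you retrain.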

---

## Decision Points

- **Visualizations:** What best communicates your findings? Model performance plots? Time series with predictions? Feature importance plots?
- **Summary:** What are the key takeaways? Document the most important findings from your analysis.

---

## Checkpoint

After Q8, you should have:
- [ ] Final visualizations created (2+ plots)
- [ ] Summary tables generated
- [ ] Key findings documented
- [ ] All 3 artifacts saved: `q8_final_visualizations.png`, `q8_summary.csv`, `q8_key_findings.txt`
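
Before moving on, a quick self-check along these lines can catch a misnamed file or a stray index column. It is only a sketch: the helper `check_q8_artifacts` and its messages are assumptions, not part of any grading script, and the demo runs against a temporary directory with placeholder files rather than your real `output/` folder.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED_FILES = ["q8_final_visualizations.png", "q8_summary.csv", "q8_key_findings.txt"]
REQUIRED_METRICS = ["R²", "RMSE", "MAE"]


def check_q8_artifacts(output_dir):
    """Hypothetical helper: return a list of problems; empty means basic checks passed."""
    output_dir = Path(output_dir)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (output_dir / name).is_file()
    ]

    summary_path = output_dir / "q8_summary.csv"
    if summary_path.is_file():
        with open(summary_path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        header = rows[0] if rows else []

        # An index column saved by mistake shows up as an unnamed first column
        if not header or header[0] != "Metric":
            problems.append("first CSV column should be 'Metric' (was an index saved?)")
        if len(header) < 2:
            problems.append("summary has no model columns")

        metric_names = [row[0] for row in rows[1:] if row]
        for required in REQUIRED_METRICS:
            if not any(required in name for name in metric_names):
                problems.append(f"summary is missing a {required} row")
    return problems


# Demo: start with only the summary CSV (contents copied from the example table)
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "q8_summary.csv").write_text(
    "Metric,Linear Regression,Random Forest,XGBoost\n"
    "R² Score,-0.0201,0.9705,0.9967\n"
    "RMSE,12.7154,2.1634,0.7276\n"
    "MAE,9.8468,1.3545,0.4480\n",
    encoding="utf-8",
)
problems_before = check_q8_artifacts(demo_dir)  # two files still missing

# Add empty placeholders for the other two artifacts and check again
(demo_dir / "q8_final_visualizations.png").touch()
(demo_dir / "q8_key_findings.txt").touch()
problems_after = check_q8_artifacts(demo_dir)   # no problems reported
```

Calling `check_q8_artifacts("output")` after running your Q8 cell applies the same checks to the real files. It only verifies names and CSV layout, not the quality of the plots or findings.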

---

**Next:** Continue to `q9_writeup.md` for Writeup.


In [10]:
# Q8: Final Visualizations, Summary Table, Key Findings
# ======================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Ensure output directory exists
os.makedirs("output", exist_ok=True)

# -------------------------
# Load Q7 outputs
# -------------------------
predictions = pd.read_csv("output/q7_predictions.csv")
feature_importance = pd.read_csv("output/q7_feature_importance.csv")

# Expect predictions to contain:
# 'actual', 'predicted_linear', 'predicted_random_forest', 'predicted_xgboost'
# Map expected prediction column names -> model friendly names
pred_to_model = {
    "predicted_linear": "Linear Regression",
    "predicted_random_forest": "Random Forest",
    "predicted_xgboost": "XGBoost"
}

# Ensure 'actual' exists
if "actual" not in predictions.columns:
    raise ValueError("predictions CSV must contain an 'actual' column.")

# -------------------------
# Compute metrics per model (safe if column missing)
# -------------------------
metrics = {}
for pred_col, model_name in pred_to_model.items():
    if pred_col in predictions.columns:
        y_true = predictions["actual"].values
        y_pred = predictions[pred_col].values
        r2 = r2_score(y_true, y_pred)
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        mae = mean_absolute_error(y_true, y_pred)
        metrics[model_name] = {"R2": r2, "RMSE": rmse, "MAE": mae}
    else:
        metrics[model_name] = {"R2": np.nan, "RMSE": np.nan, "MAE": np.nan}

# -------------------------
# Build q8_summary.csv exactly with columns Metric,Linear Regression,Random Forest,XGBoost
# -------------------------
summary_df = pd.DataFrame({
    "Metric": ["R² Score", "RMSE", "MAE"],
    "Linear Regression": [metrics["Linear Regression"]["R2"],
                          metrics["Linear Regression"]["RMSE"],
                          metrics["Linear Regression"]["MAE"]],
    "Random Forest": [metrics["Random Forest"]["R2"],
                      metrics["Random Forest"]["RMSE"],
                      metrics["Random Forest"]["MAE"]],
    "XGBoost": [metrics["XGBoost"]["R2"],
                metrics["XGBoost"]["RMSE"],
                metrics["XGBoost"]["MAE"]],
})

# Save csv (no index)
summary_df.to_csv("output/q8_summary.csv", index=False)
print("Saved: output/q8_summary.csv")

# -------------------------
# Create visualizations (4-panel)
# -------------------------
plt.style.use("default")
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Panel 1: Model performance comparison (R², RMSE, MAE) as grouped bars
ax = axes[0, 0]
metrics_plot = summary_df.melt(id_vars="Metric", var_name="Model", value_name="Value")
# Plot grouped bars
sns.barplot(data=metrics_plot, x="Metric", y="Value", hue="Model", ax=ax)
ax.set_title("Model Performance Comparison (R², RMSE, MAE)")
ax.set_xlabel("")
ax.set_ylabel("Metric value")
ax.legend(title="Model", loc="best")

# Panel 2: Predictions vs Actual scatter (each model)
ax = axes[0, 1]
colors = {"Linear Regression": "tab:blue", "Random Forest": "tab:green", "XGBoost": "tab:orange"}
for pred_col, model_name in pred_to_model.items():
    if pred_col in predictions.columns:
        # Use the per-model color so each series is consistent and distinguishable
        ax.scatter(predictions["actual"], predictions[pred_col], label=model_name,
                   color=colors[model_name], alpha=0.5, s=20)
# Perfect prediction line
min_val = predictions["actual"].min()
max_val = predictions["actual"].max()
ax.plot([min_val, max_val], [min_val, max_val], "k--", linewidth=1.5, label="Perfect")
ax.set_title("Predictions vs Actual")
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.legend()

# Panel 3: Feature importance (top N)
ax = axes[1, 0]
if feature_importance.shape[0] > 0:
    top_n = min(15, feature_importance.shape[0])
    top_features = feature_importance.sort_values("importance", ascending=False).head(top_n)
    sns.barplot(x="importance", y="feature", data=top_features, ax=ax)
    ax.set_title(f"Top {top_n} Feature Importances (tree-based model)")
    ax.set_xlabel("Importance")
    ax.set_ylabel("Feature")
else:
    ax.text(0.5, 0.5, "No feature importance available", ha="center", va="center")
    ax.set_title("Feature Importance")
    ax.set_axis_off()

# Panel 4: Residuals plot (choose XGBoost if available, else first available)
ax = axes[1, 1]
chosen_pred_col = None
for col in ["predicted_xgboost", "predicted_random_forest", "predicted_linear"]:
    if col in predictions.columns:
        chosen_pred_col = col
        break

if chosen_pred_col is not None:
    resid = predictions["actual"] - predictions[chosen_pred_col]
    ax.scatter(predictions[chosen_pred_col], resid, alpha=0.5, s=20)
    ax.axhline(0, color="red", linestyle="--")
    model_label = pred_to_model.get(chosen_pred_col, chosen_pred_col)
    ax.set_title(f"Residuals vs Predicted ({model_label})")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Residual (actual - predicted)")
else:
    ax.text(0.5, 0.5, "No model predictions available for residuals", ha="center", va="center")
    ax.set_axis_off()

plt.tight_layout()
plt.suptitle("Final Summary Visualizations (Q8)", fontsize=16, y=1.02)
plt.savefig("output/q8_final_visualizations.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved: output/q8_final_visualizations.png")

# -------------------------
# Compose key findings (text)
# -------------------------
# Determine best model by R2 (ignore NaN)
best_model = None
best_r2 = -np.inf
for model_name, vals in metrics.items():
    r2v = vals["R2"]
    if not np.isnan(r2v) and r2v > best_r2:
        best_r2 = r2v
        best_model = model_name

# Top feature summary
top_features_list = []
if feature_importance.shape[0] > 0:
    fi_sorted = feature_importance.sort_values("importance", ascending=False)
    top_features_list = fi_sorted.head(3).apply(lambda r: f"{r['feature']} ({r['importance']:.3f})", axis=1).tolist()
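    # Share of total importance held by the top 3 features
    # (dividing by the total keeps this a valid share even if importances do not sum to 1)
    top3_sum = fi_sorted.head(3)["importance"].sum() / fi_sorted["importance"].sum()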
else:
    top3_sum = np.nan

with open("output/q8_key_findings.txt", "w", encoding="utf-8") as f:
    f.write("KEY FINDINGS SUMMARY\n")
    f.write("===================\n\n")
    # Model performance
    f.write("MODEL PERFORMANCE:\n")
    if best_model is not None:
        f.write(f"- Best performing model: {best_model} (R² = {best_r2:.4f})\n")
    else:
        f.write("- Best performing model: None (no valid model R²)\n")
    # list metrics per model
    for model_name in ["Linear Regression", "Random Forest", "XGBoost"]:
        vals = metrics.get(model_name, {"R2": np.nan, "RMSE": np.nan, "MAE": np.nan})
        # Fixed 4-decimal formatting matches the summary example; missing values print as 'nan'
        f.write(f"- {model_name}: R² = {vals['R2']:.4f}, "
                f"RMSE = {vals['RMSE']:.4f}, MAE = {vals['MAE']:.4f}\n")
    f.write("\nFEATURE IMPORTANCE:\n")
    if top_features_list:
        f.write(f"- Top features: {', '.join(top_features_list)}\n")
        f.write(f"- Top 3 features account for {top3_sum*100:.1f}% of total importance\n")
    else:
        f.write("- No feature importance available\n")
    f.write("\nTEMPORAL PATTERNS:\n")
    f.write("- Daily (hour) and monthly cycles were used as temporal features and contribute to model performance (see feature importances).\n")
    f.write("\nDATA QUALITY:\n")
    f.write("- Dataset was cleaned in earlier steps (missing values handled, outliers capped via IQR method) as documented in Q2/Q3.\n")
    f.write(f"- Predictions file rows (test set size): {len(predictions)}\n")

print("Saved: output/q8_key_findings.txt")


Saved: output/q8_summary.csv
Saved: output/q8_final_visualizations.png
Saved: output/q8_key_findings.txt
