# **PoC Readiness Check: Generate Sample & Test Predict Function**
**Run after:** model_pipeline.ipynb.  
**Outputs:** sample_for_poc.csv (100 rows from eda_merged.csv for dashboard).  

## **Imports & Paths**

In [10]:
# Imports
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import joblib
import warnings
warnings.filterwarnings("ignore")

# Paths (relative to root from src/)
processed_path = "../data/processed/"
models_path = "../models/"

print("=== PoC Readiness Started ===")
print(f"Processed: {processed_path} (eda_merged.csv expected)")
print(f"Models: {models_path} (XGBoost_Tuned.pkl expected)")

=== PoC Readiness Started ===
Processed: ../data/processed/ (eda_merged.csv expected)
Models: ../models/ (XGBoost_Tuned.pkl expected)


## **Load Merged Data & Generate Sample**
- Load eda_merged.csv, take sample
- Save: sample_for_poc.csv.

In [8]:
print("\n=== Load & Generate Sample ===")
eda_merged = pd.read_csv(processed_path + "eda_merged.csv")
eda_merged['datetime'] = pd.to_datetime(eda_merged['datetime'])  # Ensure datetime

# Balanced sample (stratify by score quartiles for diversity, with safe sample size)
q25 = eda_merged['congestion_score_clipped'].quantile(0.25)
q75 = eda_merged['congestion_score_clipped'].quantile(0.75)

low_df = eda_merged[eda_merged['congestion_score_clipped'] <= q25]
medium_df = eda_merged[(eda_merged['congestion_score_clipped'] > q25) & (eda_merged['congestion_score_clipped'] <= q75)]
high_df = eda_merged[eda_merged['congestion_score_clipped'] > q75]

# Safe sample: min(requested, available) to avoid ValueError
n_low = min(25, len(low_df))
n_medium = min(50, len(medium_df))
n_high = min(25, len(high_df))

low = low_df.sample(n_low, random_state=42) if n_low > 0 else low_df.head(n_low)
medium = medium_df.sample(n_medium, random_state=42) if n_medium > 0 else medium_df.head(n_medium)
high = high_df.sample(n_high, random_state=42) if n_high > 0 else high_df.head(n_high)

sample = pd.concat([low, medium, high]).sort_values('datetime').reset_index(drop=True)  # Sort for time demo

sample.to_csv(processed_path + "sample_for_poc.csv", index=False)

print(f"Eda_merged shape: {eda_merged.shape}")
print(f"Low group size: {len(low_df)}, sampled: {n_low}")
print(f"Medium group size: {len(medium_df)}, sampled: {n_medium}")
print(f"High group size: {len(high_df)}, sampled: {n_high}")
print(f"Sample shape: {sample.shape} (balanced: {n_low} low, {n_medium} medium, {n_high} high)")
print("Sample score stats:\n", sample['congestion_score_clipped'].describe().round(2))
print("Sample incident stats:\n", sample['incident_count'].value_counts().sort_index())
print("Sample head (diverse):\n", sample.head(3).to_string())

print("✓ Saved balanced sample_for_poc.csv")


=== Load & Generate Sample ===
Eda_merged shape: (125536, 24)
Low group size: 31936, sampled: 25
Medium group size: 62216, sampled: 50
High group size: 31384, sampled: 25
Sample shape: (100, 24) (balanced: 25 low, 50 medium, 25 high)
Sample score stats:
 count    100.00
mean       1.70
std        3.42
min        0.00
25%        0.00
50%        0.00
75%        1.01
max       10.00
Name: congestion_score_clipped, dtype: float64
Sample incident stats:
 incident_count
0    72
1    23
2     4
3     1
Name: count, dtype: int64
Sample head (diverse):
              datetime  incident_count   Latitude  Longitude  segment_id   dist_km    length        highway  mobility_proxy       tavg       prcp   humidity  latitude  longitude  temperature  precipitation  precip_lag    length_m  congestion_score  hour  dayofweek  congestion_score_clipped  peak_hour  weekday
0 2020-02-29 05:00:00               0  31.561020  74.357043       16430  0.001229  0.736244    residential             2.0  17.241667  17.

## **Predict Function & Test**
- Load tuned XGBoost (fallback untuned/RF).
- Function: Validates inputs (non-neg), predicts, clips 0-10.
- Test: Base/low (1 incident, no rain) vs high (5 incidents, rain, peak)—expect ~2.7 vs ~10.0.
- Handles errors (missing model, bad input).

In [9]:
print("\n=== Predict Function & Test ===")

# Robust model load
try:
    model = joblib.load(models_path + "XGBoost_Tuned.pkl")
    print("Loaded XGBoost Tuned.")
except FileNotFoundError:
    try:
        model = joblib.load(models_path + "XGBoost_Untuned.pkl")
        print("Fallback to XGBoost Untuned.")
    except FileNotFoundError:
        try:
            model = joblib.load(models_path + "Random_Forest_Untuned.pkl")
            print("Fallback to RF Untuned.")
        except:
            print("Error: No model found. Run model_pipeline.ipynb.")
            model = None

# Predict function with validation
def predict_congestion(incident, precip, humidity, tavg, peak_hour, weekday):
    if model is None:
        return None
    try:
        if precip < 0 or incident < 0 or humidity < 0 or tavg < -50:
            print("Warning: Invalid input (non-neg precip/incident; realistic humidity/temp).")
            return None
        feature_df = pd.DataFrame({
            'incident_count': [incident],
            'precipitation': [precip],
            'humidity': [humidity],
            'tavg': [tavg],
            'peak_hour': [peak_hour],
            'weekday': [weekday]
        })
        raw_score = model.predict(feature_df)[0]
        score = np.clip(raw_score, 0, 10)  # Clip for interpretability
        return score
    except Exception as e:
        print(f"Prediction error: {e}")
        return None


print("Predict function tested & ready for PoC!")


=== Predict Function & Test ===
Loaded XGBoost Tuned.
Predict function tested & ready for PoC!


## **PoC Readiness Summary**
- **sample_for_poc.csv**: 100 rows (eda_merged head, datetime/incident/score).
- **Predict Function**: Validates inputs, loads model (fallback), clips score 0-10.
- **Next**: Run `streamlit run poc/poc.py` (loads sample/model, interactive tabs).