
# AI for IoT — TP2: Pipelines, Resource Profiling, and ESP32 Deployment Readiness

This notebook builds on TP1 to:
1) Wrap preprocessing + models into **sklearn Pipelines**  
2) Measure **model size** and **inference time**  
3) Draft a short **deployment analysis** for **ESP32** (≈520 KB SRAM, MBs of Flash)

> Run the cells in order. If you use the Kaggle download, first upload your `kaggle.json` in Colab (left sidebar → Files → Upload).


In [None]:

# =====================
# 0) Setup & Installs
# =====================
!pip -q install xgboost seaborn

import os, sys, io, time, pickle, warnings, zipfile
warnings.filterwarnings("ignore")
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

from xgboost import XGBClassifier

print("Environment ready ✅")



## 1) Data Acquisition
Preferred: download from Kaggle (`deepcontractor/smoke-detection-dataset`).  
Alternative: manually upload a CSV and set `csv_path` accordingly.


In [None]:

# Try to configure Kaggle if kaggle.json is present
use_kaggle = False
kaggle_json = Path('/content/kaggle.json')  # Colab default suggestion
dataset_dir = Path('/content/data')
dataset_dir.mkdir(parents=True, exist_ok=True)
csv_path = dataset_dir / 'smoke_detection.csv'

if kaggle_json.exists():
    use_kaggle = True
    !mkdir -p ~/.kaggle
    !cp /content/kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    print("Kaggle API configured ✅")
else:
    print("⚠️ kaggle.json not found. You can still upload a CSV manually.")

if use_kaggle:
    # Download and extract
    !kaggle datasets download -d deepcontractor/smoke-detection-dataset -p /content/data -q
    for zf in dataset_dir.glob("*.zip"):
        with zipfile.ZipFile(zf, 'r') as z:
            z.extractall(dataset_dir)
    # Pick a CSV
    candidates = list(dataset_dir.glob("*.csv"))
    if not candidates:
        raise FileNotFoundError("No CSV found after unzip. Please check dataset.")
    pref = [p for p in candidates if 'smoke' in p.name.lower()]
    target = pref[0] if pref else candidates[0]
    target.rename(csv_path)
    print(f"Data ready at: {csv_path}")
else:
    # If you manually uploaded a CSV, set its path here:
    # Example: csv_path = Path('/content/your_uploaded_file.csv')
    if csv_path.exists():
        print(f"Using existing local CSV: {csv_path}")
    else:
        print("Upload a CSV to /content and set csv_path to its path if needed.")



## 2) Load & Quick Clean
- Fill numeric NaNs with median
- Auto-detect target column among common names (edit if needed)
- Drop likely time columns


In [None]:

df = pd.read_csv(csv_path)
print("Shape:", df.shape)
display(df.head())

# Fill numeric NaNs
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Detect target
possible_target_names = [
    "Fire Alarm", "fire_alarm", "FIRE_ALARM", "Target", "target",
    "label", "Label", "fireAlarm", "smoke_detected", "Smoke_Detected"
]
target_col = None
for name in possible_target_names:
    if name in df.columns:
        target_col = name
        break

if target_col is None:
    raise ValueError("Target column not found automatically. Please set `target_col` manually.")

print("✅ Target column:", target_col)

# Drop likely time columns and index artifacts (pandas names an unnamed CSV index "Unnamed: 0")
drop_like = ["UTC", "Time", "Date", "Unnamed"]
to_drop = [c for c in df.columns if any(k.lower() in c.lower() for k in drop_like) and c != target_col]
df = df.drop(columns=to_drop, errors="ignore")
print("Dropped (if any):", to_drop)

X = df.drop(columns=[target_col])
y = df[target_col].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train/Test:", X_train.shape, X_test.shape)
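
The median fill above is computed on the full table, so test rows influence the values imputed into training rows. One alternative is to move imputation into the pipeline with `SimpleImputer`, so the medians are learned from the training split only. The sketch below is minimal and makes several assumptions: synthetic data from `make_classification` stands in for the smoke CSV, and the step names, the 5% missing rate, and the `Xd_*` variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor table (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

# Knock out roughly 5% of the values to simulate missing sensor readings
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The imputer is fitted on the training split only, then reused at predict time
leakfree_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
leakfree_pipeline.fit(Xd_train, yd_train)

train_medians = leakfree_pipeline.named_steps["imputer"].statistics_
demo_accuracy = leakfree_pipeline.score(Xd_test, yd_test)
print("Medians learned from train split:", np.round(train_medians, 3))
print(f"Test accuracy on synthetic data: {demo_accuracy:.3f}")
```

With this layout the raw (unfilled) features can be passed straight to `fit` and `predict`, and the imputer travels with the model when the pipeline is serialized.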



## 3) Pipelines
We standardize features and train two models:
- **LR-Pipeline**: `StandardScaler` → `LogisticRegression`
- **XGB-Pipeline**: `StandardScaler` → `XGBClassifier` (tree models do not need scaling; the scaler is kept so both pipelines share the same preprocessing)


In [None]:

# Logistic Regression pipeline
lr_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000))  # n_jobs omitted: it does not speed up a binary fit
])

# XGBoost pipeline
xgb_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        random_state=42,
        n_jobs=-1,
        eval_metric="logloss"
    ))
])

# Fit
lr_pipeline.fit(X_train, y_train)
xgb_pipeline.fit(X_train, y_train)

# Evaluate
def evaluate(model, name):
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"=== {name} ===")
    print(f"Accuracy: {acc:.4f} | F1: {f1:.4f}")
    print(classification_report(y_test, y_pred, zero_division=0))
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d')
    plt.title(f"Confusion Matrix - {name}")
    plt.xlabel("Predicted"); plt.ylabel("Actual")
    plt.show()
    return acc, f1

acc_lr, f1_lr = evaluate(lr_pipeline, "LR-Pipeline")
acc_xgb, f1_xgb = evaluate(xgb_pipeline, "XGB-Pipeline")

metrics_df = pd.DataFrame({
    "Model": ["LR-Pipeline", "XGB-Pipeline"],
    "Accuracy": [acc_lr, acc_xgb],
    "F1": [f1_lr, f1_xgb]
})
display(metrics_df)
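
On a microcontroller, the LR pipeline does not have to run as two separate steps. `StandardScaler` computes `(x - mean) / scale` and `LogisticRegression` computes `w·z + b`, so the two can be folded into a single dot product on raw features, with `w' = w / scale` and `b' = b - sum(w * mean / scale)`. The sketch below checks this on synthetic data. It assumes a fitted `StandardScaler` → `LogisticRegression` pipeline shaped like `lr_pipeline`, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_demo = X_demo * 10.0 + 3.0  # give the features a non-trivial mean and scale

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

scaler = demo_pipeline.named_steps["scaler"]
clf = demo_pipeline.named_steps["clf"]

# Fold the scaler into the linear model:
#   w' = w / scale,  b' = b - sum(w * mean / scale)
w_folded = clf.coef_[0] / scaler.scale_
b_folded = clf.intercept_[0] - np.sum(clf.coef_[0] * scaler.mean_ / scaler.scale_)

# Inference on raw (unscaled) features: one dot product plus a sigmoid
logits_folded = X_demo @ w_folded + b_folded
proba_folded = 1.0 / (1.0 + np.exp(-logits_folded))

logits_pipeline = demo_pipeline.decision_function(X_demo)
print("Max logit difference:", np.max(np.abs(logits_folded - logits_pipeline)))
```

After folding, the device only needs one weight per feature plus one bias, instead of the scaler's mean and scale arrays alongside the classifier weights.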



## 4) Resource Profiling — Model Size (KB)
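We serialize each pipeline with `pickle` and measure the file size. Treat this as a proxy for relative model size only: the pickle includes Python object overhead, and an ESP32 would need the model exported to another format (e.g. C arrays) rather than a pickle.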


In [None]:

import os

with open("lr_model.pkl", "wb") as f:
    pickle.dump(lr_pipeline, f)
with open("xgb_model.pkl", "wb") as f:
    pickle.dump(xgb_pipeline, f)

size_lr_kb = os.path.getsize("lr_model.pkl") / 1024
size_xgb_kb = os.path.getsize("xgb_model.pkl") / 1024

size_df = pd.DataFrame({
    "Model": ["LR-Pipeline", "XGB-Pipeline"],
    "Size (KB)": [size_lr_kb, size_xgb_kb]
})
display(size_df)
print(f"Files saved in working dir: lr_model.pkl ({size_lr_kb:.1f} KB), xgb_model.pkl ({size_xgb_kb:.1f} KB)")
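
Because the pickle carries Python object overhead, it can help to compare it with the bytes needed for the learned numbers alone. The sketch below is a rough lower-bound estimate, assuming the LR pipeline's parameters would be stored as `float32` values on the device. The helper `lr_param_bytes` is hypothetical (not part of scikit-learn), and synthetic data stands in for the notebook's dataset.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def lr_param_bytes(pipeline, bytes_per_value=4):
    """Bytes for scaler mean/scale plus LR weights/bias (hypothetical helper)."""
    scaler = pipeline.named_steps["scaler"]
    clf = pipeline.named_steps["clf"]
    n_values = (scaler.mean_.size + scaler.scale_.size
                + clf.coef_.size + clf.intercept_.size)
    return n_values * bytes_per_value

X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=1)
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

pickle_bytes = len(pickle.dumps(demo_pipeline))  # same bytes pickle.dump would write
param_bytes = lr_param_bytes(demo_pipeline)      # 12 + 12 + 12 + 1 = 37 values

print(f"Pickle: {pickle_bytes} B | float32 parameters only: {param_bytes} B")
```

The same idea does not transfer directly to the XGB pipeline, whose footprint depends on the number and depth of trees rather than on the feature count.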



## 5) Resource Profiling — Inference Time
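Measure total prediction time on the full test set and the average per-sample time (total ÷ number of rows). These are batch-amortized timings on the notebook's CPU, not on-device latencies.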


In [None]:

def measure_inference(model, X, repeats=3):
    # Average over several runs to smooth out fluctuations
    totals = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        _ = model.predict(X)
        totals.append(time.perf_counter() - start)
    total = np.mean(totals)
    per_sample_ms = (total / len(X)) * 1000.0  # batch-amortized, not single-call latency
    return total, per_sample_ms

tot_lr, ps_lr = measure_inference(lr_pipeline, X_test)
tot_xgb, ps_xgb = measure_inference(xgb_pipeline, X_test)

time_df = pd.DataFrame({
    "Model": ["LR-Pipeline", "XGB-Pipeline"],
    "Total Test Inference Time (s)": [tot_lr, tot_xgb],
    "Single Inference Time (ms)": [ps_lr, ps_xgb]
})
display(time_df)
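
Dividing a batch time by the number of rows gives an amortized figure. A device that scores one reading at a time pays the per-call overhead on every prediction. The sketch below times one-row calls with `time.perf_counter` and reports the median and 95th percentile. Synthetic data and a small LR pipeline stand in for the notebook's models, and `single_sample_latency_ms` is an illustrative name.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def single_sample_latency_ms(model, X, n_calls=200):
    """Time predict() on one row at a time; return (median_ms, p95_ms)."""
    X = np.asarray(X)
    model.predict(X[:1])  # warm-up call, excluded from the measurements
    latencies = []
    for i in range(min(n_calls, len(X))):
        row = X[i:i + 1]  # keep the 2-D shape (1, n_features)
        start = time.perf_counter()
        model.predict(row)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(latencies)), float(np.percentile(latencies, 95))

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

median_ms, p95_ms = single_sample_latency_ms(demo_pipeline, X_demo)
print(f"median: {median_ms:.3f} ms | p95: {p95_ms:.3f} ms")
```

These remain desktop-CPU numbers, dominated by Python call overhead, so they are useful for comparing models against each other rather than for predicting ESP32 latency.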



## 6) ESP32 Deployment Readiness — Auto Narrative
This section drafts a short analysis referencing your measured sizes and timings.


In [None]:

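ESP32_SRAM_KB = 520  # typical ballpark for total SRAM
REALTIME_PERIOD_S = 1.0  # e.g., need to infer every 1 second
budget_ms = REALTIME_PERIOD_S * 1000.0

# Simplistic rule: smallest model that meets both budgets; refine as needed (Flash vs SRAM, etc.)
model_stats = [("LR-Pipeline", size_lr_kb, ps_lr), ("XGB-Pipeline", size_xgb_kb, ps_xgb)]
feasible = [m for m in model_stats if m[1] <= ESP32_SRAM_KB and m[2] <= budget_ms]
if not feasible:
    print("⚠️ Neither model meets both budgets; reporting the smallest one anyway.")
choice = min(feasible or model_stats, key=lambda m: m[1])[0]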

analysis = f'''
**Memory Feasibility:**  
- LR size ≈ {size_lr_kb:.1f} KB | XGB size ≈ {size_xgb_kb:.1f} KB.  
Typical ESP32 offers ~{ESP32_SRAM_KB} KB SRAM (runtime) and several MB Flash (storage).  
Smaller models are easier to store and load; LR is usually lighter than XGB.

**Time Efficiency:**  
- LR avg single inference ≈ {ps_lr:.3f} ms  
- XGB avg single inference ≈ {ps_xgb:.3f} ms  
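If the application needs 1 prediction every {REALTIME_PERIOD_S:.0f} s, both per-sample times should be well below {REALTIME_PERIOD_S * 1000:.0f} ms. These timings come from the notebook's CPU; on-device latency must be measured on the ESP32 itself.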

**Conclusion:**  
Based purely on efficiency and embedded constraints, **{choice}** is the safer default for ESP32.  
If XGB is preferred for accuracy, consider pruning, lowering n_estimators/max_depth, or quantization. Also reduce features where possible.
'''
print(analysis)
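
One way to act on the conclusion for the LR model is to ship only its numbers. The sketch below renders a weight vector and bias as a C header that a firmware project could include. The array names, header guard, and example values are made up for illustration; in the notebook the values would come from the fitted LR step.

```python
def to_c_header(weights, bias, guard="LR_MODEL_H"):
    """Render LR parameters as a C header string (illustrative layout)."""
    values = ", ".join(f"{w:.8e}f" for w in weights)
    lines = [
        f"#ifndef {guard}",
        f"#define {guard}",
        "",
        f"#define LR_N_FEATURES {len(weights)}",
        f"static const float LR_WEIGHTS[LR_N_FEATURES] = {{ {values} }};",
        f"static const float LR_BIAS = {bias:.8e}f;",
        "",
        f"#endif  // {guard}",
    ]
    return "\n".join(lines) + "\n"

# Made-up example values, not results from this notebook
example_weights = [0.5, -1.25, 2.0]
example_bias = -0.75

header_text = to_c_header(example_weights, example_bias)
print(header_text)
print(f"Parameter payload: {4 * (len(example_weights) + 1)} bytes as float32")
```

Declaring the arrays `const` lets the toolchain place them in Flash rather than SRAM on many embedded targets, which is the Flash vs SRAM distinction the selection rule leaves open.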
