
# Week 4+ Mini‑Project — Business Regression (Consulting Style, Colab)

**Goal:** build and compare two models (Linear Regression & Random Forest) to predict a business KPI, write a short stakeholder summary, and export your results.

You can either **(A)** generate a realistic synthetic dataset right here (recommended; pick the sales or support generator in section 0), or **(B)** upload your own CSV with a numeric target column and feature columns.

---

## 📦 What you'll do
1. Create or upload a dataset (sales-style or support-style)
2. Explore & clean
3. Train/test split
4. Baseline: Linear Regression
5. Model 2: Random Forest
6. Compare MAE & R²
7. Feature importance
8. Export predictions + a 5-bullet stakeholder summary


## 0) Choose your dataset

In [None]:

import pandas as pd
import numpy as np

np.random.seed(42)

# Synthetic SALES dataset (used when DATASET = "sales")
def make_sales_data(n=800):
    # Features
    weeks = np.arange(1, n+1)
    marketing_spend = np.random.gamma(shape=6, scale=150, size=n).round(0)
    price = np.random.normal(loc=49, scale=5, size=n).round(2)
    season = np.sin(2*np.pi*weeks/52)
    promo = np.random.binomial(1, 0.25, size=n)
    web_visits = (np.random.normal(5000, 900, size=n) + 800*promo + 400*season).round(0)

    # Outcome with noise
    base = 1200 + 0.6*marketing_spend - 20*price + 900*season + 500*promo + 0.2*web_visits
    noise = np.random.normal(0, 300, size=n)
    sales = (base + noise).clip(min=0).round(0)

    df = pd.DataFrame({
        "week": weeks,
        "marketing_spend": marketing_spend,
        "price": price,
        "promo": promo,
        "web_visits": web_visits,
        "sales": sales
    })
    return df

# Synthetic SUPPORT dataset (used when DATASET = "support")
def make_support_data(n=600):
    days = pd.date_range("2024-01-01", periods=n, freq="D")
    vip = np.random.binomial(1, 0.2, size=n)  # 0/1 flag per day (stored in the "vip_ratio" column below)
    agents = np.random.choice(["Amara","Liam","Noor","Chen","Sofia"], size=n, p=[0.25,0.2,0.2,0.2,0.15])
    tickets = np.random.poisson(140 + 25*vip, size=n)
    backlog = np.maximum(0, np.random.normal(80, 30, size=n)).round(0)
    first_response_min = np.random.normal(42 - 6*vip + 0.03*backlog, 5, size=n).round(1)
    same_day_rate = np.clip(np.random.normal(0.78 - 0.002*backlog + 0.05*vip, 0.05, size=n), 0, 1)
    csat = np.clip(np.random.normal(4.4 + 0.3*vip - 0.01*(first_response_min-40), 0.25, size=n), 1, 5).round(2)
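    # Target: tomorrow's ticket count, i.e. today's series shifted back by one day.
    # np.roll wraps the first value around to the end, so overwrite the last entry with the mean.
    # Note: days are drawn independently here, so today's features carry little signal
    # about tomorrow's tickets; expect R² near 0 on this synthetic set.
    next_day_tickets = np.roll(tickets, -1)
    next_day_tickets[-1] = int(next_day_tickets[:-1].mean())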

    df = pd.DataFrame({
        "date": days,
        "vip_ratio": vip,
        "agent": agents,
        "tickets": tickets,
        "backlog": backlog,
        "first_response_min": first_response_min,
        "same_day_rate": same_day_rate,
        "csat": csat,
        "next_day_tickets": next_day_tickets
    })
    return df

# Pick ONE option here:
DATASET = "sales"   # change to "support" if you prefer

if DATASET == "sales":
    df = make_sales_data()
    target_col = "sales"
    feature_cols = ["marketing_spend", "price", "promo", "web_visits", "week"]
elif DATASET == "support":
    df = make_support_data()
    target_col = "next_day_tickets"
    df["agent"] = df["agent"].astype("category")
    df["agent_code"] = df["agent"].cat.codes
    feature_cols = ["vip_ratio", "tickets", "backlog", "first_response_min", "same_day_rate", "csat", "agent_code"]
else:
    raise ValueError("DATASET must be 'sales' or 'support'")

print("Dataset:", DATASET)
df.head()



### (Optional) Upload your own CSV
Your file should have a **target column** (what to predict) and feature columns.


In [None]:

# Uncomment to upload your own
# from google.colab import files
# uploaded = files.upload()
# import io
# df = pd.read_csv(io.BytesIO(next(iter(uploaded.values()))))
# target_col = "YOUR_TARGET_COLUMN"
# feature_cols = [c for c in df.select_dtypes("number").columns if c != target_col]  # numeric columns only; the models below cannot take raw text
# df.head()
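
Before modelling an uploaded file, it can help to confirm that the target column exists, that it is numeric, and which columns are usable as features. The sketch below is one way to do that; `check_dataset` is a hypothetical helper (not part of the notebook), and the toy frame stands in for a real upload.

```python
import pandas as pd

def check_dataset(df, target_col):
    """Hypothetical helper: validate a frame and return its numeric feature columns."""
    if target_col not in df.columns:
        raise KeyError(f"target column {target_col!r} not found; columns are {list(df.columns)}")
    if not pd.api.types.is_numeric_dtype(df[target_col]):
        raise TypeError(f"target column {target_col!r} must be numeric for regression")
    numeric_cols = df.select_dtypes(include="number").columns
    return [c for c in numeric_cols if c != target_col]

# Toy frame standing in for an uploaded CSV (made-up values, for illustration only).
demo = pd.DataFrame({
    "region": ["north", "south", "east"],   # text column, skipped as a feature
    "price": [49.0, 52.5, 47.0],
    "promo": [0, 1, 0],
    "sales": [3100, 3650, 2980],
})
demo_features = check_dataset(demo, "sales")
print(demo_features)  # ['price', 'promo']
```

With a real upload, `feature_cols = check_dataset(df, target_col)` would replace the list comprehension above. Text columns you want to keep need encoding first, as the support option does with `agent_code`.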


## 1) Explore & clean

In [None]:

df.info()


In [None]:

# Basic summaries
display(df.describe(include='all'))
print("Missing values per column:")
print(df.isna().sum())


In [None]:

# Simple cleanup: drop rows with any missing value, then compare row counts before and after
df_clean = df.dropna().copy()
len(df), len(df_clean)


## 2) Train / Test split

In [None]:

from sklearn.model_selection import train_test_split

X = df_clean[feature_cols]
y = df_clean[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape
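
The split above is random, so early and late periods end up in both sets. Both synthetic datasets are time-ordered (`week`, `date`), so if the model is meant to forecast future periods, holding out the most recent rows may be a more realistic check. The sketch below shows that alternative on a toy frame; `chronological_split` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
import pandas as pd

def chronological_split(df, time_col, test_size=0.2):
    """Hypothetical helper: train on the earliest rows, test on the most recent ones."""
    ordered = df.sort_values(time_col)
    cut = int(round(len(ordered) * (1 - test_size)))
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy data: 10 weeks with made-up sales values.
toy = pd.DataFrame({"week": np.arange(1, 11), "sales": np.arange(100, 110)})
train_part, test_part = chronological_split(toy, "week", test_size=0.2)

# Every training week comes before every test week.
print(train_part["week"].max(), test_part["week"].min())  # 8 9
```

In this notebook that would mean splitting `df_clean` on `week` (sales) or `date` (support) and then selecting `feature_cols` and `target_col` from each part.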


## 3) Baseline — Linear Regression

In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

lin = LinearRegression()
lin.fit(X_train, y_train)
lin_pred = lin.predict(X_test)

lin_mae = mean_absolute_error(y_test, lin_pred)
lin_r2  = r2_score(y_test, lin_pred)

print("Linear Regression → MAE:", round(lin_mae,2), " R²:", round(lin_r2,3))
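
To make the two metrics concrete, here is a small worked example that computes them by hand with NumPy. MAE is the average absolute miss, in the same units as the target. R² is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The function names and the four data points are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in target units."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """R²: 1 - (model squared error) / (squared error of predicting the mean)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up values, chosen so the arithmetic is easy to follow.
actual = [100, 200, 300, 400]
predicted = [110, 190, 320, 380]

demo_mae = mae(actual, predicted)   # (10 + 10 + 20 + 20) / 4 = 15
demo_r2 = r2(actual, predicted)     # 1 - 1000 / 50000 = 0.98
print(round(demo_mae, 2), round(demo_r2, 3))
```

An R² of 0 means the model is no better than predicting the average, and a negative value means it is worse.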


## 4) Model 2 — Random Forest

In [None]:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

rf_mae = mean_absolute_error(y_test, rf_pred)
rf_r2  = r2_score(y_test, rf_pred)

print("Random Forest     → MAE:", round(rf_mae,2), " R²:", round(rf_r2,3))
print("\nComparison (lower MAE, higher R² is better)")
print(f"Linear     → MAE={lin_mae:.2f}, R²={lin_r2:.3f}")
print(f"RandomForest→ MAE={rf_mae:.2f}, R²={rf_r2:.3f}")
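
For the "by how much" part of the stakeholder summary, a relative change in MAE is often easier to communicate than two raw numbers. A minimal sketch follows; `relative_mae_change` is a hypothetical helper, and the inputs are placeholders, not results from this notebook.

```python
def relative_mae_change(baseline_mae, candidate_mae):
    """Percent change in MAE versus the baseline; negative means the candidate has lower error."""
    return 100.0 * (candidate_mae - baseline_mae) / baseline_mae

# Placeholder numbers for illustration only.
example_change = relative_mae_change(baseline_mae=250.0, candidate_mae=200.0)
print(f"{example_change:+.1f}%")  # -20.0%
```

In the notebook, `relative_mae_change(lin_mae, rf_mae)` would give the Random Forest's change relative to the linear baseline.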


## 5) Feature Importance (Random Forest)

In [None]:

# Impurity-based importances from the fitted forest, largest first (pandas is already imported as pd in section 0)
importance = pd.Series(rf.feature_importances_, index=feature_cols).sort_values(ascending=False)
importance
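
The `feature_importances_` values above are impurity-based: they are computed from the training data and can overstate features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses scikit-learn's `permutation_importance` on toy data where the true drivers are known; the data, the model size, and the variable names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Toy target: strong dependence on column 0, weak on column 1, none on column 2.
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Fit on the first 200 rows, score on the remaining 100.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_demo[:200], y_demo[:200])

# Shuffle each column in turn and measure how much the held-out score drops.
result = permutation_importance(
    model, X_demo[200:], y_demo[200:], n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]  # column indices, most important first
print(ranking, result.importances_mean.round(3))
```

In this notebook the equivalent call would be `permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)`, with `result.importances_mean` indexed by `feature_cols`. If the two rankings disagree sharply, that is worth mentioning in the summary.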


## 6) Visual Check — Predicted vs Actual

In [None]:

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(y_test, rf_pred, alpha=0.3)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Predicted vs Actual (Random Forest)")
plt.show()


## 7) Export predictions

In [None]:

pred_df = X_test.copy()
pred_df["actual"] = y_test.values
pred_df["predicted_rf"] = rf_pred
pred_df["predicted_lin"] = lin_pred

pred_path = "mini_project_predictions.csv"
pred_df.to_csv(pred_path, index=False)
pred_path


## 8) Stakeholder summary (fill these bullets)


- **Business question:** What are we predicting and why does it matter?  
- **Data used:** Rows/columns, key features included.  
- **Model comparison:** Linear vs Random Forest — which performed better (MAE, R²) and by how much?  
- **Drivers:** Top 3 most important features and the intuitive story behind them.  
- **Action:** One recommendation the business can try next (e.g., adjust pricing, schedule promo weeks, manage backlog).  
