## 📊 Day 7 - Predictive Modeling & Dashboard Integration
In this final phase, we’ll build a predictive model that flags high-profit orders and integrate its predictions into a Streamlit dashboard.

### 🧹 Step 1: Load Cleaned Data

In [1]:
import pandas as pd

# Load cleaned data
df = pd.read_csv('../data/superstore_cleaned.csv')
df.head()

|  | row_id | order_id | order_date | order_priority | order_quantity | sales | discount | ship_mode | profit | unit_price | ... | customer_name | province | region | customer_segment | product_category | product_sub-category | product_name | product_container | product_base_margin | ship_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 3 | 10/13/2010 | Low | 6 | 261.54 | 0.04 | Regular Air | -213.25 | 38.94 | ... | Muhammed MacIntyre | Nunavut | Nunavut | Small Business | Office Supplies | Storage & Organization | Eldon Base for stackable storage shelf, platinum | Large Box | 0.8 | 10/20/2010 |
| 1 | 49 | 293 | 10/1/2012 | High | 49 | 10123.02 | 0.07 | Delivery Truck | 457.81 | 208.16 | ... | Barry French | Nunavut | Nunavut | Consumer | Office Supplies | Appliances | 1.7 Cubic Foot Compact "Cube" Office Refrigera... | Jumbo Drum | 0.58 | 10/2/2012 |
| 2 | 50 | 293 | 10/1/2012 | High | 27 | 244.57 | 0.01 | Regular Air | 46.71 | 8.69 | ... | Barry French | Nunavut | Nunavut | Consumer | Office Supplies | Binders and Binder Accessories | Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl | Small Box | 0.39 | 10/3/2012 |
| 3 | 80 | 483 | 7/10/2011 | High | 30 | 4965.7595 | 0.08 | Regular Air | 1198.97 | 195.99 | ... | Clay Rozendal | Nunavut | Nunavut | Corporate | Technology | Telephones and Communication | R380 | Small Box | 0.58 | 7/12/2011 |
| 4 | 85 | 515 | 8/28/2010 | Not Specified | 19 | 394.27 | 0.08 | Regular Air | 30.94 | 21.78 | ... | Carlos Soltero | Nunavut | Nunavut | Consumer | Office Supplies | Appliances | Holmes HEPA Air Purifier | Medium Box | 0.5 | 8/30/2010 |


### 🧪 Step 2: Feature Engineering
Create a binary target column (`high_profit`, set to 1 when profit exceeds 100), select the model features, and one-hot encode the categorical columns.

In [2]:
# Create target column: 1 if profit > 100, else 0
df['high_profit'] = (df['profit'] > 100).astype(int)

# Select features
features = ['sales', 'order_quantity', 'discount', 'shipping_cost', 'product_category', 'product_sub-category', 'region']
df_model = df[features + ['high_profit']]

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df_model, drop_first=True)
df_encoded.head()

|  | sales | order_quantity | discount | shipping_cost | high_profit | product_category_Office Supplies | product_category_Technology | product_sub-category_Binders and Binder Accessories | product_sub-category_Bookcases | product_sub-category_Chairs & Chairmats | ... | product_sub-category_Storage & Organization | product_sub-category_Tables | product_sub-category_Telephones and Communication | region_Northwest Territories | region_Nunavut | region_Ontario | region_Prarie | region_Quebec | region_West | region_Yukon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 261.54 | 6 | 0.04 | 35.0 | 0 | True | False | False | False | False | ... | True | False | False | False | True | False | False | False | False | False |
| 1 | 10123.02 | 49 | 0.07 | 68.02 | 1 | True | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 2 | 244.57 | 27 | 0.01 | 2.99 | 0 | True | False | True | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 3 | 4965.7595 | 30 | 0.08 | 3.99 | 1 | False | True | False | False | False | ... | False | False | True | False | True | False | False | False | False | False |
| 4 | 394.27 | 19 | 0.08 | 5.94 | 0 | True | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
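
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is illustrative and not taken from the Superstore file). A categorical column with k distinct values yields k-1 indicator columns, because the first category in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Illustrative data only; not rows from superstore_cleaned.csv
toy = pd.DataFrame({
    "sales": [10.0, 20.0, 30.0],
    "region": ["West", "Ontario", "Quebec"],
})

# Full encoding: one indicator column per category
full = pd.get_dummies(toy)

# Reduced encoding: the first category (alphabetically "Ontario") is dropped
reduced = pd.get_dummies(toy, drop_first=True)

full_cols = full.columns.tolist()
reduced_cols = reduced.columns.tolist()

print(full_cols)     # ['sales', 'region_Ontario', 'region_Quebec', 'region_West']
print(reduced_cols)  # ['sales', 'region_Quebec', 'region_West']
```

A row whose indicators are all `False` therefore belongs to the dropped baseline category.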


### 🧠 Step 3: Train-Test Split

In [3]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('high_profit', axis=1)
y = df_encoded['high_profit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
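
Because `high_profit` is the minority class (see the support counts in Step 4), a purely random split can leave slightly different class ratios in the train and test sets. One possible alternative, not what the cell above does, is to pass `stratify=y`. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo` are hypothetical) to show that the positive rate is preserved on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 30% positives (not the Superstore data)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(y_tr), len(y_te))
print(train_rate, test_rate)
```

Switching the notebook to a stratified split would change which rows land in the test set, so the metrics reported in Step 4 would no longer match exactly.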

### 🌲 Step 4: Train Random Forest Classifier

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1211
           1       0.93      0.86      0.89       469

    accuracy                           0.94      1680
   macro avg       0.94      0.92      0.93      1680
weighted avg       0.94      0.94      0.94      1680
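
To put the report in context, the short calculation below uses only the numbers printed above. It computes the accuracy of a trivial baseline that always predicts class 0, and rechecks the F1 scores from the rounded precision and recall values. This is a sanity check on the reported figures, not a new evaluation.

```python
# Values copied from the classification report above
support = {0: 1211, 1: 469}
precision = {0: 0.95, 1: 0.93}
recall = {0: 0.97, 1: 0.86}

total = sum(support.values())

# Accuracy of always predicting the majority class (class 0)
baseline_accuracy = max(support.values()) / total


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


f1_scores = {label: round(f1(precision[label], recall[label]), 2) for label in support}

print(total)                        # 1680
print(round(baseline_accuracy, 3))  # 0.721
print(f1_scores)                    # {0: 0.96, 1: 0.89}
```

The model's 0.94 accuracy is well above the roughly 0.72 majority-class baseline, while the 0.86 recall on class 1 means about 14% of high-profit orders in the test set are missed.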



### 💾 Step 5: Save Model and Feature List

In [5]:
import pickle
import os

# Save model and features
os.makedirs('model', exist_ok=True)

with open('model/model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('model/features.pkl', 'wb') as f:
    pickle.dump(X.columns.tolist(), f)
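
The saved feature list matters at prediction time: a single new order, once one-hot encoded, will contain only the categories present in that row, so its columns must be realigned to the training layout before calling `model.predict`. The sketch below shows one way to do that with `DataFrame.reindex`. The helper `align_to_features`, the shortened `demo_features` list, and the `new_order` values are all hypothetical stand-ins; the real list comes from `model/features.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def align_to_features(raw: pd.DataFrame, feature_columns: list) -> pd.DataFrame:
    """One-hot encode raw input and match the training column order (hypothetical helper)."""
    # No drop_first here: with one row it would drop the only category present
    encoded = pd.get_dummies(raw)
    # Missing indicator columns are filled with 0; columns the model never saw are dropped
    return encoded.reindex(columns=feature_columns, fill_value=0)


# Shortened, made-up stand-in for the list stored in model/features.pkl
demo_features = ["sales", "order_quantity", "discount",
                 "product_category_Technology", "region_Ontario", "region_West"]

# Round-trip the list through pickle, as the cell above does for the real one
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "features.pkl"
    with open(path, "wb") as f:
        pickle.dump(demo_features, f)
    with open(path, "rb") as f:
        loaded_features = pickle.load(f)

# A single illustrative order, e.g. collected from dashboard input widgets
new_order = pd.DataFrame([{
    "sales": 500.0, "order_quantity": 10, "discount": 0.05,
    "product_category": "Technology", "region": "West",
}])

X_new = align_to_features(new_order, loaded_features)
print(X_new.columns.tolist())
# With the real model and feature list loaded, model.predict(X_new) would follow
```

A category that was dropped as the baseline during training produces an indicator column that is not in the saved list, so `reindex` discards it and the row ends up with zeros in that group, which matches the training encoding. Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.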

## ✅ Summary of Day 7
- Trained a Random Forest classifier to predict high-profit transactions (0.94 accuracy on the held-out test set)
- Prepared cleaned input features for prediction
- Integrated the trained model into the Streamlit dashboard
- Handled prediction inputs and displayed results interactively

🚀 Project Complete! Time to polish the Streamlit app and finalize the presentation.