# 1. Modeling Pipeline

In this section, we build and evaluate machine learning models to predict product prices.

## Modeling Pipeline Flow

LOAD CLEANED DATA  
        ↓  
FEATURE ENGINEERING  
        ├── Category-level features  
        ├── Seller-level features  
        ├── Pricing ratios  
        └── Time-based features  
        ↓  
BUILD MODELING DATASET  
        ├── Select features (X)  
        ├── Define target (y)  
        └── Drop missing values  
        ↓  
TRAIN / TEST SPLIT  
        ↓  
BASELINE MODELS  
        ├── Linear Regression  
        └── Ridge Regression  
        ↓  
TREE MODELS  
        ├── Random Forest  
        └── XGBoost  
        ↓  
EVALUATE MODEL PERFORMANCE  
        ├── RMSE  
        ├── MAE  
        └── Compare all models  
        ↓  
SELECT BEST MODEL  
        ↓  
SAVE MODEL (`final_model_<model_name>.pkl`)

In [161]:
# imports

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

sns.set(style="whitegrid")



## 1.1 Load Cleaned Listings Dataset

We load the cleaned, unified `listings` dataset saved from Notebook 1.

In [162]:
listings = pd.read_csv("../data/processed/cleaned_listings.csv")
print("Loaded cleaned listings shape:", listings.shape)
listings.head()

Loaded cleaned listings shape: (112086, 29)


| | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value | product_category_name | product_name_lenght | product_description_lenght | ... | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | seller_zip_code_prefix | seller_city | seller_state | product_volume_cm3 | shipping_time_days | delivery_delay_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.9 | 13.29 | cool_stuff | 58.0 | 598.0 | ... | 2017-09-13 09:45:35 | 2017-09-19 18:34:16 | 2017-09-20 23:43:48 | 2017-09-29 | 27277 | volta redonda | SP | 3528.0 | 7.0 | -9.0 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.9 | 19.93 | pet_shop | 56.0 | 239.0 | ... | 2017-04-26 11:05:13 | 2017-05-04 14:35:00 | 2017-05-12 16:04:24 | 2017-05-15 | 3471 | sao paulo | SP | 60000.0 | 16.0 | -3.0 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.0 | 17.87 | moveis_decoracao | 59.0 | 695.0 | ... | 2018-01-14 14:48:30 | 2018-01-16 12:36:48 | 2018-01-22 13:19:16 | 2018-02-05 | 37564 | borda da mata | MG | 14157.0 | 7.0 | -14.0 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 | perfumaria | 42.0 | 480.0 | ... | 2018-08-08 10:10:18 | 2018-08-10 13:28:00 | 2018-08-14 13:32:39 | 2018-08-20 | 14403 | franca | SP | 2400.0 | 6.0 | -6.0 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.9 | 18.14 | ferramentas_jardim | 59.0 | 409.0 | ... | 2017-02-04 14:10:13 | 2017-02-16 09:46:09 | 2017-03-01 16:42:31 | 2017-03-17 | 87900 | loanda | PR | 42000.0 | 25.0 | -16.0 |


## 1.2 Feature Engineering for Modeling

We now engineer additional features that are useful for price prediction:

- **Category-level stats**: median, mean, std, count of prices per category  
- **Seller-level stats**: median price, average freight, item count per seller  
- **Pricing ratios**: price relative to category median, price per volume, freight/price ratio  
- **Time features**: purchase month and day of week  

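The category, seller, and time features help the model capture marketplace structure and pricing behavior. The pricing ratios are computed from `price` itself, so they are not used as model inputs in Section 1.3.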

In [163]:
# Ensure category_english exists (from Notebook 1)
assert "category_english" in listings.columns, "category_english column missing."

# 1.2.1 Category-level aggregations
category_stats = listings.groupby("category_english").agg(
    category_median_price=("price", "median"),
    category_mean_price=("price", "mean"),
    category_price_std=("price", "std"),
    category_count=("price", "count"),
).reset_index()

print("Category stats shape:", category_stats.shape)
category_stats.head()

Category stats shape: (71, 5)


| | category_english | category_median_price | category_mean_price | category_price_std | category_count |
|---|---|---|---|---|---|
| 0 | agro_industry_and_commerce | 228.0 | 282.4175 | 272.938718 | 204 |
| 1 | air_conditioning | 139.99 | 180.493108 | 171.219659 | 296 |
| 2 | art | 97.5 | 85.113654 | 42.317594 | 208 |
| 3 | arts_and_craftmanship | 44.9 | 75.58375 | 73.997815 | 24 |
| 4 | audio | 89.0 | 139.254121 | 159.685656 | 364 |
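
As a minimal illustration of the named-aggregation pattern used above, the sketch below runs the same kind of `groupby(...).agg(...)` call on a tiny table and merges the result back with a left join. The `toy` frame and its values are made up for illustration and are not taken from the listings data.

```python
import pandas as pd

# Hypothetical toy data; not taken from the listings dataset
toy = pd.DataFrame({
    "category_english": ["audio", "audio", "art", "art", "art"],
    "price": [10.0, 30.0, 5.0, 7.0, 9.0],
})

# Named aggregation: output_column=(input_column, aggregation)
toy_stats = toy.groupby("category_english").agg(
    category_median_price=("price", "median"),
    category_count=("price", "count"),
).reset_index()

# A left merge keeps every listing row and attaches its category's stats
toy_merged = toy.merge(toy_stats, on="category_english", how="left")
print(toy_merged)
```

Because the merge key is unique in `toy_stats`, the merged frame has exactly as many rows as the original, which is why the notebook's row count stays at 112086 after each merge.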


In [164]:
# Merge category stats into main table
listings_model = listings.merge(category_stats, on="category_english", how="left")

print("After merging category stats:", listings_model.shape)
listings_model[[
    "category_english", "price", "category_median_price", "category_mean_price"
]].head()

After merging category stats: (112086, 33)


| | category_english | price | category_median_price | category_mean_price |
|---|---|---|---|---|
| 0 | cool_stuff | 58.9 | 129.99 | 159.634187 |
| 1 | pet_shop | 239.9 | 89.7 | 105.085564 |
| 2 | furniture_decor | 199.0 | 65.49 | 86.492851 |
| 3 | perfumery | 12.99 | 84.99 | 116.737312 |
| 4 | garden_tools | 199.9 | 59.9 | 98.360433 |


### 1.2.2 Seller-Level Features

We aggregate per-seller:

- Median selling price  
- Average freight value  
- Total number of items sold  

These features help capture seller pricing behavior and scale.

In [165]:
seller_stats = listings_model.groupby("seller_id").agg(
    seller_median_price=("price", "median"),
    seller_avg_freight=("freight_value", "mean"),
    seller_total_items=("order_id", "count"),
).reset_index()

print("Seller stats shape:", seller_stats.shape)
seller_stats.head()

Seller stats shape: (3062, 4)


| | seller_id | seller_median_price | seller_avg_freight | seller_total_items |
|---|---|---|---|---|
| 0 | 0015a82c2db000af6aaaf3ae2ecb0532 | 895.0 | 21.02 | 3 |
| 1 | 001cca7ae9ae17fb1caed9dfb1094831 | 99.0 | 37.046611 | 239 |
| 2 | 001e6ad469a905060d959994f1b41e4f | 250.0 | 17.94 | 1 |
| 3 | 002100f778ceb8431b7a1020ff7ab48f | 17.9 | 14.430182 | 55 |
| 4 | 003554e2dce176b5555353e4f3555ac8 | 120.0 | 19.38 | 1 |


In [166]:
listings_model = listings_model.merge(seller_stats, on="seller_id", how="left")

print("After merging seller stats:", listings_model.shape)
listings_model[[
    "seller_id", "price", "seller_median_price", "seller_avg_freight", "seller_total_items"
]].head()

After merging seller stats: (112086, 36)


| | seller_id | price | seller_median_price | seller_avg_freight | seller_total_items |
|---|---|---|---|---|---|
| 0 | 48436dade18ac8b2bce089ec2a041202 | 58.9 | 55.9 | 19.284305 | 151 |
| 1 | dd7ddc04e1b6c2c614352b383efe2d36 | 239.9 | 45.9 | 20.234196 | 143 |
| 2 | 5b51032eddd242adc84c38acab88f23d | 199.0 | 209.0 | 19.210714 | 14 |
| 3 | 9d7a1d34a5052409006425275ba1c2b4 | 12.99 | 49.99 | 17.315625 | 16 |
| 4 | df560393f3a51e74553ab94004ba5c87 | 199.9 | 87.9 | 20.901724 | 29 |


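### 1.2.3 Pricing Ratios & Time Features

We add:

- `price_to_category_median`: price divided by the category median price  
- `price_to_volume`: price divided by `product_volume_cm3`  
- `freight_ratio`: `freight_value` divided by price  
- `purchase_month` and `purchase_dayofweek` from the purchase timestamp (`order_purchase_timestamp`)  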

In [167]:
# Ensure product_volume_cm3 exists from Notebook 1
assert "product_volume_cm3" in listings_model.columns, "product_volume_cm3 missing. Make sure it was created in Notebook 1."

# Ratios
listings_model["price_to_category_median"] = (
    listings_model["price"] / listings_model["category_median_price"]
)

listings_model["price_to_volume"] = (
    listings_model["price"] / listings_model["product_volume_cm3"]
)

listings_model["freight_ratio"] = (
    listings_model["freight_value"] / listings_model["price"]
)

# Time features
purchase_dt = pd.to_datetime(listings_model["order_purchase_timestamp"])
listings_model["purchase_month"] = purchase_dt.dt.month
listings_model["purchase_dayofweek"] = purchase_dt.dt.dayofweek

listings_model[[
    "price", "category_median_price", "price_to_category_median",
    "product_volume_cm3", "price_to_volume",
    "freight_value", "freight_ratio",
    "purchase_month", "purchase_dayofweek"
]].head()

| | price | category_median_price | price_to_category_median | product_volume_cm3 | price_to_volume | freight_value | freight_ratio | purchase_month | purchase_dayofweek |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 58.9 | 129.99 | 0.453112 | 3528.0 | 0.016695 | 13.29 | 0.225637 | 9 | 2 |
| 1 | 239.9 | 89.7 | 2.67447 | 60000.0 | 0.003998 | 19.93 | 0.083076 | 4 | 2 |
| 2 | 199.0 | 65.49 | 3.038632 | 14157.0 | 0.014057 | 17.87 | 0.089799 | 1 | 6 |
| 3 | 12.99 | 84.99 | 0.152842 | 2400.0 | 0.005412 | 12.79 | 0.984604 | 8 | 2 |
| 4 | 199.9 | 59.9 | 3.337229 | 42000.0 | 0.00476 | 18.14 | 0.090745 | 2 | 5 |
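
Ratio features divide by a column that could contain zeros, in which case pandas returns `inf` rather than raising an error. Whether the listings data actually contains zero volumes or zero prices is not shown here, so the following is only a defensive sketch on made-up rows: it replaces zeros in the denominator with `NaN` so that such rows become ordinary missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the zero volume is there to show the guard
toy = pd.DataFrame({
    "price": [50.0, 20.0, 80.0],
    "product_volume_cm3": [1000.0, 0.0, 4000.0],
})

# Plain division gives inf for the zero-volume row
raw_ratio = toy["price"] / toy["product_volume_cm3"]

# Replacing 0 with NaN first turns that row into a missing value instead
safe_volume = toy["product_volume_cm3"].replace(0, np.nan)
toy["price_to_volume"] = toy["price"] / safe_volume

print(raw_ratio.tolist())
print(toy["price_to_volume"].tolist())
```

A later `dropna()` removes `NaN` rows but not `inf` rows, which is the main reason to prefer `NaN` here.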


## 1.3 Prepare Features (X) and Target (y)

We now select the final set of features for modeling and define:

- **y** = price  
- **X** = numerical features: raw product attributes (weight, volume, freight) plus the engineered category, seller, and time features  

We also drop any remaining rows with missing values in these columns.


In [168]:
# Define modeling columns
feature_cols = [
    "product_weight_g",
    "product_volume_cm3",
    "freight_value",
    "category_median_price",
    "category_mean_price",
    "category_price_std",
    "category_count",
    "seller_median_price",
    "seller_avg_freight",
    "seller_total_items",
    "purchase_month",
    "purchase_dayofweek",
]

target_col = "price"

# Subset to the modeling columns and drop rows with missing values in any of them
model_df = listings_model[feature_cols + [target_col]].dropna()

print("Model dataframe shape:", model_df.shape)
model_df.head()

Model dataframe shape: (110469, 13)


| | product_weight_g | product_volume_cm3 | freight_value | category_median_price | category_mean_price | category_price_std | category_count | seller_median_price | seller_avg_freight | seller_total_items | purchase_month | purchase_dayofweek | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 650.0 | 3528.0 | 13.29 | 129.99 | 159.634187 | 149.935482 | 3783.0 | 55.9 | 19.284305 | 151 | 9 | 2 | 58.9 |
| 1 | 30000.0 | 60000.0 | 19.93 | 89.7 | 105.085564 | 107.424306 | 1941.0 | 45.9 | 20.234196 | 143 | 4 | 2 | 239.9 |
| 2 | 3050.0 | 14157.0 | 17.87 | 65.49 | 86.492851 | 79.173141 | 8328.0 | 209.0 | 19.210714 | 14 | 1 | 6 | 199.0 |
| 3 | 200.0 | 2400.0 | 12.79 | 84.99 | 116.737312 | 101.874864 | 3419.0 | 49.99 | 17.315625 | 16 | 8 | 2 | 12.99 |
| 4 | 3750.0 | 42000.0 | 18.14 | 59.9 | 98.360433 | 120.071827 | 4314.0 | 87.9 | 20.901724 | 29 | 2 | 5 | 199.9 |


In [169]:
from sklearn.model_selection import train_test_split

X = model_df[feature_cols]
y = model_df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

X_train.shape, X_test.shape

((88375, 12), (22094, 12))
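
One caveat: the category and seller aggregates in Section 1.2 are computed on the full table before this split, so test-set prices (and each row's own price) contribute to features such as `category_median_price`. A leakage-aware variant, sketched below on hypothetical toy data, splits first and fits the aggregate on training rows only. The helper `add_category_median` and the fallback to the global training median for unseen categories are assumptions for illustration, not something this notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy listings
toy = pd.DataFrame({
    "category_english": ["audio"] * 6 + ["art"] * 6,
    "price": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0,
              1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Split first, so test prices never reach the aggregates
train, test = train_test_split(toy, test_size=0.25, random_state=42)

# Fit the aggregate on training rows only
train_medians = train.groupby("category_english")["price"].median()
global_median = train["price"].median()

def add_category_median(frame):
    """Attach the training-set category median; unseen categories
    fall back to the global training median (an assumption)."""
    out = frame.copy()
    out["category_median_price"] = (
        out["category_english"].map(train_medians).fillna(global_median)
    )
    return out

train_fe = add_category_median(train)
test_fe = add_category_median(test)
print(test_fe)
```

Scores obtained this way are usually a more honest estimate of performance on unseen listings, at the cost of slightly noisier aggregates.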

## 1.4 Baseline Models

Baseline models give us a simple reference point before training more advanced tree-based models.

We train:

**1. Linear Regression**  
- Fast, simple, interpretable  
- Helps detect if relationships are roughly linear  

**2. Ridge Regression**  
- Adds L2 regularization  
- Can be more stable than plain linear regression when features are correlated or the data is noisy  

In [170]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

pred_lr = lr.predict(X_test)

rmse_lr = np.sqrt(mean_squared_error(y_test, pred_lr))
mae_lr = mean_absolute_error(y_test, pred_lr)

print("Linear Regression RMSE:", rmse_lr)
print("Linear Regression MAE :", mae_lr)

Linear Regression RMSE: 92.24226553482937
Linear Regression MAE : 45.97477786049536


### Ridge Regression

Ridge helps reduce overfitting, especially with correlated features (common in pricing datasets).

In [171]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

pred_ridge = ridge.predict(X_test)

rmse_ridge = np.sqrt(mean_squared_error(y_test, pred_ridge))
mae_ridge = mean_absolute_error(y_test, pred_ridge)

print("Ridge Regression RMSE:", rmse_ridge)
print("Ridge Regression MAE :", mae_ridge)

Ridge Regression RMSE: 92.24226543442363
Ridge Regression MAE : 45.9747774111609


## 1.5 Tree-Based Models

Linear and Ridge models give us a useful baseline, but real-world pricing data is:

- Non-linear  
- Noisy  
- Full of interactions (category × seller × attributes)

Tree-based models usually handle this much better.

Here we train:

- **Random Forest Regressor** – strong, robust ensemble of trees  
- **XGBoost Regressor** – gradient boosting, often state-of-the-art for tabular pricing data  
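
To make the point about interactions concrete, the sketch below uses a synthetic target that is a pure product of two features. A linear model has no term for that product, while a random forest can approximate it. The data and variable names are illustrative only and unrelated to the listings dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Target depends only on the interaction x0 * x1
X_toy = rng.uniform(-1, 1, size=(2000, 2))
y_toy = X_toy[:, 0] * X_toy[:, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

lin = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

rmse_lin = np.sqrt(mean_squared_error(y_te, lin.predict(X_te)))
rmse_forest = np.sqrt(mean_squared_error(y_te, forest.predict(X_te)))

print(f"linear RMSE: {rmse_lin:.3f}")
print(f"forest RMSE: {rmse_forest:.3f}")
```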

In [172]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

rmse_rf = np.sqrt(mean_squared_error(y_test, pred_rf))
mae_rf = mean_absolute_error(y_test, pred_rf)

print("Random Forest RMSE:", rmse_rf)
print("Random Forest MAE :", mae_rf)

Random Forest RMSE: 54.98748582145604
Random Forest MAE : 19.94891093071462


### 1.5.1 XGBoost Regressor

XGBoost is a gradient boosting method that often performs extremely well on structured/tabular data like marketplace prices.

In [173]:
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=600,
    learning_rate=0.05,
    max_depth=8,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    objective="reg:squarederror"
)

xgb.fit(X_train, y_train)
pred_xgb = xgb.predict(X_test)

rmse_xgb = np.sqrt(mean_squared_error(y_test, pred_xgb))
mae_xgb = mean_absolute_error(y_test, pred_xgb)

print("XGBoost RMSE:", rmse_xgb)
print("XGBoost MAE :", mae_xgb)

XGBoost RMSE: 54.731974486591284
XGBoost MAE : 25.121655720942233


## 1.6 Model Performance Comparison

We now compare all trained models side-by-side:

- Linear Regression  
- Ridge Regression  
- Random Forest  
- XGBoost  

This helps us choose the best candidate to use in the pricing recommendation and simulation notebook.

In [174]:
import pandas as pd

# Collect metrics in a table
results = pd.DataFrame({
    "model_name": ["linear", "ridge", "random_forest", "xgboost"],
    "RMSE": [rmse_lr, rmse_ridge, rmse_rf, rmse_xgb],
    "MAE":  [mae_lr, mae_ridge, mae_rf, mae_xgb],
})

results_sorted = results.sort_values(by="RMSE")
results_sorted

| | model_name | RMSE | MAE |
|---|---|---|---|
| 3 | xgboost | 54.731974 | 25.121656 |
| 2 | random_forest | 54.987486 | 19.948911 |
| 1 | ridge | 92.242265 | 45.974777 |
| 0 | linear | 92.242266 | 45.974778 |
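
Note that XGBoost is ahead on RMSE while Random Forest is ahead on MAE. The two metrics can disagree because RMSE squares each error and therefore weights a few large misses more heavily than MAE does. The worked example below uses made-up error vectors to show the effect; it is one possible reading of the table, not a diagnosis of these particular models.

```python
import numpy as np

# Hypothetical absolute errors for two models on the same five items
errors_a = np.array([1.0, 1.0, 1.0, 1.0, 16.0])  # mostly tiny, one large miss
errors_b = np.array([5.0, 5.0, 5.0, 5.0, 5.0])   # uniformly moderate

def rmse(errors):
    """Root mean squared error from a vector of errors."""
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(errors):
    """Mean absolute error from a vector of errors."""
    return float(np.mean(np.abs(errors)))

print(f"model A: MAE={mae(errors_a):.2f}, RMSE={rmse(errors_a):.2f}")
print(f"model B: MAE={mae(errors_b):.2f}, RMSE={rmse(errors_b):.2f}")
```

Model A wins on MAE (4.00 vs 5.00) but loses on RMSE (about 7.21 vs 5.00), so the choice of selection metric determines which model is picked.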


## 1.7 Select Best Model & Save to Disk

We now:

1. Select the model with the lowest RMSE on the test set  
2. Save it to `../models/final_model_<model_name>.pkl` (for this run, `../models/final_model_xgboost.pkl`)  

This serialized model will be used in the next notebook for:
- price recommendations  
- mispricing detection  
- revenue/profit simulations  

In [175]:
# Map names to actual model objects
model_objects = {
    "linear": lr,
    "ridge": ridge,
    "random_forest": rf,
    "xgboost": xgb,
}

best_row = results_sorted.iloc[0]
best_name = best_row["model_name"]
best_rmse = best_row["RMSE"]
best_mae = best_row["MAE"]

best_model = model_objects[best_name]

print(f"Best model: {best_name}")
print(f"RMSE: {best_rmse:.4f}")
print(f"MAE : {best_mae:.4f}")

Best model: xgboost
RMSE: 54.7320
MAE : 25.1217


In [176]:
import joblib
import os

os.makedirs("../models", exist_ok=True)

model_path = f"../models/final_model_{best_name}.pkl"
joblib.dump(best_model, model_path)

print("Saved best model to:", model_path)

Saved best model to: ../models/final_model_xgboost.pkl
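
For reference, a model saved with `joblib.dump` can be restored with `joblib.load`. The round trip below is a minimal sketch that uses a small stand-in model and a temporary directory, so it runs without the notebook's data; `demo_model` and `demo_path` are hypothetical names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model; in the notebook this role is played by `best_model`
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, demo_path)   # serialize to disk
    reloaded = joblib.load(demo_path)    # restore from disk

# The reloaded model should reproduce the original predictions
same = np.allclose(demo_model.predict(X_demo), reloaded.predict(X_demo))
print("Predictions match after reload:", same)
```

When loading the real model, the input passed to `predict` should contain the same columns, in the same order, as `feature_cols`, and the loading environment should have compatible library versions installed.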
