# Analysis of the impact of shipping mode on revenue & cost (MLE)

Objective: estimate the impact of `Ship Mode` on revenue (`Sales`) and cost (estimated as `Cost = Sales - Profit`) using a maximum likelihood model. Here that model is OLS, which coincides with the maximum likelihood estimate when the errors are assumed to be normally distributed.

Region (`Region`) and product category (`Category`) are included as control variables to adjust for differences in demand.
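
To make the "OLS ~ normal distribution" link concrete, here is a minimal sketch on synthetic data (the data, seed, and variable names are illustrative assumptions, not part of the project dataset). It fits OLS with least squares and checks that the Gaussian log-likelihood is higher at the OLS coefficients than at a slightly shifted coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): one binary group dummy plus Gaussian noise.
n = 500
group = rng.integers(0, 2, size=n)
X = np.column_stack([np.ones(n), group])  # intercept + dummy
y = 2.0 + 0.5 * group + rng.normal(0.0, 1.0, n)


def gaussian_loglik(beta, sigma2):
    """Log-likelihood of y under y ~ Normal(X @ beta, sigma2)."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2


# OLS coefficients from ordinary least squares.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The MLE of the noise variance is RSS / n (not the unbiased RSS / (n - k)).
resid_ols = y - X @ beta_ols
sigma2_mle = (resid_ols @ resid_ols) / n

ll_ols = gaussian_loglik(beta_ols, sigma2_mle)
ll_shifted = gaussian_loglik(beta_ols + np.array([0.05, -0.05]), sigma2_mle)

print("OLS coefficients:", beta_ols)
print("log-likelihood at OLS:", ll_ols)
print("log-likelihood at shifted coefficients:", ll_shifted)
```

For a fixed variance, the Gaussian log-likelihood decreases as the residual sum of squares grows, so the least-squares coefficients are also the likelihood-maximizing ones.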

In [None]:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
from pathlib import Path

PATH = "/Users/locseo/Desktop/Master DS/IT5425E_Data management & Visualization/Project/[DATASET C] Retail Supply Chain Sales Analysis/[C] Retail-Supply-Chain-Sales-Analysis.xlsx"
DATA_PATH = Path(PATH)
ORDERS_SHEET = 'Retails Order Full Dataset'

orders = pd.read_excel(DATA_PATH, sheet_name=ORDERS_SHEET)
orders.head()


### Preprocessing
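- Keep rows with non-missing `Ship Mode`, `Sales`, `Profit`, `Region`, and `Category`
- Calculate `Cost = Sales - Profit`
- Apply `log1p` to `Sales` and `Cost` to reduce skew (negative `Cost` values are clipped to 0 first)
- Cast the grouping columns to `category` dtype for modeling convenience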
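
As a rough illustration of why `log1p` is used, the sketch below draws a right-skewed synthetic sample (an assumption standing in for `Sales`, not the real data) and compares skewness before and after the transform. It also shows that `expm1` inverts `log1p`, and what clipping at 0 does to a negative cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "sales" values (illustrative only).
sales = rng.lognormal(mean=4.0, sigma=1.0, size=2000)


def skewness(x):
    """Sample skewness: third central moment divided by std cubed."""
    centered = x - x.mean()
    return (centered ** 3).mean() / x.std() ** 3


skew_raw = skewness(sales)
skew_log = skewness(np.log1p(sales))
print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")

# expm1 is the inverse of log1p, so the original scale can be recovered.
roundtrip_ok = np.allclose(np.expm1(np.log1p(sales)), sales)

# log1p is only defined for values greater than -1, so negative costs
# are clipped to 0 before the transform.
cost = np.array([-5.0, 0.0, 20.0])
log_cost = np.log1p(np.clip(cost, 0.0, None))
print(log_cost)
```

Clipping treats loss-making rows with a negative estimated cost as zero-cost rows, which is a modeling choice worth keeping in mind when reading the cost results.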

In [None]:
orders_clean = orders.dropna(subset=['Ship Mode', 'Sales', 'Profit', 'Region', 'Category']).copy()

# Estimated cost: Sales - Profit
orders_clean['Cost'] = orders_clean['Sales'] - orders_clean['Profit']

# Add snake_case copies of the grouping columns, cast to categorical
orders_clean['ship_mode'] = orders_clean['Ship Mode'].astype('category')
orders_clean['region'] = orders_clean['Region'].astype('category')
orders_clean['category'] = orders_clean['Category'].astype('category')

# Log-transform to stabilize variance
orders_clean['log_sales'] = np.log1p(orders_clean['Sales'])
orders_clean['log_cost'] = np.log1p(orders_clean['Cost'].clip(lower=0))

orders_clean[['ship_mode', 'region', 'category', 'Sales', 'Profit', 'Cost']].head()


### MLE model (OLS ~ Gaussian)
Models estimated:
- Uncontrolled: `log_sales ~ C(ship_mode)` and `log_cost ~ C(ship_mode)`
- Controlled: `log_sales ~ C(ship_mode) + C(region) + C(category)` and similarly for `log_cost`

`C(...)` automatically creates dummy variables. By default, the reference group is the first level in the category order (usually the first value in alphabetical order).
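
The sketch below mimics this default with pandas alone. The ship-mode labels are hypothetical, and `pd.get_dummies(..., drop_first=True)` is used only as a stand-in for the treatment coding that `C(...)` applies. It also shows one way to choose the baseline: reorder the categories so the desired level comes first.

```python
import pandas as pd

# Hypothetical ship-mode labels, for illustration only.
s = pd.Series(
    ["Standard", "Express", "Standard", "Priority", "Express"]
).astype("category")

# Categories built from strings are sorted, so "Express" comes first.
print(list(s.cat.categories))

# Move a chosen baseline to the front so it becomes the reference level.
baseline = "Standard"
new_order = [baseline] + [c for c in s.cat.categories if c != baseline]
s_releveled = s.cat.reorder_categories(new_order)

# drop_first=True drops the first level, which then acts as the reference.
dummies_default = pd.get_dummies(s, drop_first=True)
dummies_releveled = pd.get_dummies(s_releveled, drop_first=True)

print(list(dummies_default.columns))
print(list(dummies_releveled.columns))
```

Inside a formula, the baseline can also be set explicitly with patsy's `Treatment` coding, for example `C(ship_mode, Treatment(reference='...'))`, where the reference label must match a value that exists in the data.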

In [None]:
# Uncontrolled model
model_sales = smf.ols('log_sales ~ C(ship_mode)', data=orders_clean).fit()
model_cost = smf.ols('log_cost ~ C(ship_mode)', data=orders_clean).fit()

# Model with region + category controls
model_sales_ctrl = smf.ols('log_sales ~ C(ship_mode) + C(region) + C(category)', data=orders_clean).fit()
model_cost_ctrl = smf.ols('log_cost ~ C(ship_mode) + C(region) + C(category)', data=orders_clean).fit()

print('=== Revenue (log_sales) - uncontrolled ===')
print(model_sales.summary().tables[1])
print('=== Cost (log_cost) - uncontrolled ===')
print(model_cost.summary().tables[1])

print('=== Revenue (log_sales) - controlling for Region + Category ===')
print(model_sales_ctrl.summary().tables[1])
print('=== Cost (log_cost) - controlling for Region + Category ===')
print(model_cost_ctrl.summary().tables[1])
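
The coefficients in these tables are on the log scale. A common way to read them is as approximate percent changes relative to the reference group, using `exp(beta) - 1`. The helper below is a hypothetical convenience function and the coefficient values are made up for illustration. Because the targets are `log1p` values, the percent change strictly applies to `1 + Sales` (or `1 + Cost`), which is close to the raw value when amounts are large.

```python
import numpy as np


def pct_effect(beta):
    """Convert a log-scale coefficient into an approximate percent change."""
    return np.expm1(beta) * 100.0


# Hypothetical coefficients, not taken from the fitted models.
for beta in (0.10, -0.05):
    print(f"beta = {beta:+.2f} -> {pct_effect(beta):+.2f}%")
```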


### Expected revenue/cost by shipping method (controlled)
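Back-transform the log-scale predictions to the original scale with `expm1` for comparison. Because the models are fit on `log1p` values, the back-transformed number is closer to a typical (geometric-mean-like) value than to the arithmetic mean, so the `(mean)` column labels should be read loosely.

`Region` and `Category` are fixed at their most common values to isolate the effect of `Ship Mode`.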
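
A small synthetic check of the back-transformation (the data and seed are illustrative assumptions): applying `expm1` to the mean of `log1p` values gives a number below the arithmetic mean for skewed data, since `log1p` is concave.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic right-skewed "sales" values (illustrative only).
sales = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

# What a log-scale model effectively reports after back-transformation.
back_transformed = np.expm1(np.log1p(sales).mean())

arithmetic_mean = sales.mean()
median = np.median(sales)

print(f"expm1(mean(log1p)): {back_transformed:.1f}")
print(f"arithmetic mean:    {arithmetic_mean:.1f}")
print(f"median:             {median:.1f}")
```

This is why the predicted values are best used to compare ship modes against each other rather than as estimates of average revenue or cost per order.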

In [None]:

# Get modal values for region and category to hold fixed when predicting
region_mode = orders_clean['region'].mode()[0]
category_mode = orders_clean['category'].mode()[0]

ship_modes = orders_clean['ship_mode'].cat.categories
pred_table = []
for mode in ship_modes:
    df_mode = pd.DataFrame({'ship_mode': [mode], 'region': [region_mode], 'category': [category_mode]})
    # predict() returns a Series; take the single value by position
    pred_log_sales = model_sales_ctrl.predict(df_mode).iloc[0]
    pred_log_cost = model_cost_ctrl.predict(df_mode).iloc[0]
    pred_table.append({
        'Ship Mode': mode,
        'Region (fixed)': region_mode,
        'Category (fixed)': category_mode,
        'Pred Sales (mean)': np.expm1(pred_log_sales),
        'Pred Cost (mean)': np.expm1(pred_log_cost)
    })

pred_df = pd.DataFrame(pred_table).sort_values('Pred Sales (mean)', ascending=False)
pred_df


### Notes
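- A positive coefficient on `C(ship_mode)[T.*]` means higher log revenue/cost than the reference group after controlling for Region and Category. The reference group is the first `ship_mode` category (alphabetical by default), which is not necessarily the standard ship mode; check `orders_clean['ship_mode'].cat.categories`.

- `Pred Sales` and `Pred Cost` give a practical comparison between ship modes with Region and Category held at their most common values.

- `region_mode` and `category_mode` can be replaced with other values or with a weighted average to better reflect the actual sales mix.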
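
One possible shape for the weighted-average variant is sketched below. The predictions and region shares are hypothetical numbers chosen for illustration. In practice the predictions would come from the controlled models, and the weights could be taken from something like `orders_clean['region'].value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical predictions for two ship modes across two regions.
preds = pd.DataFrame({
    "ship_mode": ["A", "A", "B", "B"],
    "region": ["East", "West", "East", "West"],
    "pred_sales": [100.0, 60.0, 110.0, 70.0],
})

# Hypothetical share of orders per region (sums to 1).
weights = pd.Series({"East": 0.25, "West": 0.75}, name="weight")

# Attach each region's weight, then average predictions per ship mode.
preds = preds.join(weights, on="region")
weighted = (preds["pred_sales"] * preds["weight"]).groupby(preds["ship_mode"]).sum()

print(weighted)
```

Compared with fixing a single modal region, this averages each ship mode's prediction over the regional mix, so a large but atypical region does not drive the comparison on its own.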