<a href="https://colab.research.google.com/github/adarsh182005/Logistics-Optimization-Project/blob/main/logistics_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📦 Order Fulfilment & Logistics Optimization Pipeline

## 🚀 Project Overview
An automated data engineering pipeline for logistics operations. It ingests raw delivery logs (CSV) and turns them into order-level transit times, per-courier KPIs, and a short-term forecast of monthly transit time.

## 🔑 Key Capabilities
* **Dynamic Sessionization:** Algorithmically detects "Order Sessions" from continuous GPS/timestamp streams using time-gap thresholding.
* **Transit Time Analysis:** Computes per-order delivery durations and summarizes them with both the mean and the median, since the median is less sensitive to outlier deliveries.
* **Courier Performance:** Aggregates metrics to evaluate agent efficiency and speed.
* **Forecasting Engine:** Uses Holt's linear-trend exponential smoothing (`statsmodels`) to project monthly average transit time three months ahead.
* **Automated Reporting:** Exports clean, structured datasets for dashboarding tools (Power BI/Tableau).
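
The notebook cells report the median transit time but do not drop outliers themselves. If explicit filtering is wanted, one option is a modified z-score built on the median and the median absolute deviation (MAD). The sketch below is an illustration only: the function name `filter_outliers_mad`, the 3.5 cutoff, and the sample values are assumptions, not part of the pipeline.

```python
import pandas as pd

def filter_outliers_mad(values: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Keep values whose modified z-score (median/MAD based) is within `threshold`."""
    median = values.median()
    mad = (values - median).abs().median()
    if mad == 0:
        # More than half the values equal the median; the score is undefined,
        # so return the input unchanged rather than guess.
        return values
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    # for normally distributed data (a common convention).
    modified_z = 0.6745 * (values - median) / mad
    return values[modified_z.abs() <= threshold]

# Hypothetical transit times in hours, with one obviously bad record
transit_hours = pd.Series([5.1, 4.8, 5.3, 5.0, 48.0, 5.2])
clean = filter_outliers_mad(transit_hours)
print(clean.tolist())
```

The cutoff is a judgment call; a looser threshold keeps more long-haul deliveries, a tighter one removes more.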

## ⚙️ How It Works (Standard Operating Procedure)
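1.  **Ingest:** The notebook loads a raw logistics CSV, uploaded in Colab or read from disk.
2.  **Process:** It detects the time and courier-ID columns by name, coerces unparseable timestamps to `NaT`, and drops rows whose transit time cannot be computed.
3.  **Analyze:** It computes order-level transit times, per-courier KPIs, monthly averages, and a Holt trend forecast.
4.  **Output:** It writes `order_times.csv`, `courier_kpis.csv`, and `monthly_kpis.csv` to the `output/` folder.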
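
Name-based detection is simple, but it can match columns that are not timestamps: in the sample dataset, the boolean `on_time` flag contains the substring "time". A slightly stricter variant, sketched below, also checks that the values parse as datetimes. The helper name `detect_time_columns` and the 90% parse ratio are assumptions made for this example.

```python
import pandas as pd

def detect_time_columns(df: pd.DataFrame, min_parse_ratio: float = 0.9) -> list:
    """Return columns whose name hints at a timestamp and whose values mostly parse as datetimes."""
    candidates = []
    for col in df.columns:
        name = col.lower()
        if "time" not in name and "date" not in name:
            continue
        # Boolean and numeric columns (e.g. an `on_time` flag) are not timestamps.
        if pd.api.types.is_bool_dtype(df[col]) or pd.api.types.is_numeric_dtype(df[col]):
            continue
        parsed = pd.to_datetime(df[col], errors="coerce")
        if parsed.notna().mean() >= min_parse_ratio:
            candidates.append(col)
    return candidates

# Small frame shaped like the sample dataset
sample = pd.DataFrame({
    "order_time": ["2023-11-04", "2023-08-12"],
    "delivery_time": ["2023-11-04 01:51:00", "2023-08-12 06:05:00"],
    "on_time": [True, False],
    "city": ["Kolhapur", "Pune"],
})
print(detect_time_columns(sample))
```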

## 🛠 Usage
1.  Open the notebook in **Google Colab** or **Jupyter**.
2.  Run all cells.
3.  Upload your dataset when prompted (or place it in the root directory).
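
The upload cell imports `google.colab`, which is only available inside Colab. For plain Jupyter, a loader along the following lines could read the file from disk first and only fall back to the upload prompt when it is missing. This is a sketch: `load_dataset` is a hypothetical helper, and the default file name is taken from the sample run.

```python
import os
import pandas as pd

def load_dataset(path: str = "logistics_dataset.csv") -> pd.DataFrame:
    """Read the CSV from disk if present; otherwise fall back to the Colab upload prompt."""
    if os.path.exists(path):
        return pd.read_csv(path)
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError as exc:
        raise FileNotFoundError(
            f"{path!r} not found and the Colab upload widget is unavailable"
        ) from exc
    uploaded = files.upload()          # dict mapping file name -> file bytes
    first_name = next(iter(uploaded))  # use the first uploaded file
    return pd.read_csv(first_name)
```

Calling `df = load_dataset()` would then work in both environments, provided the CSV sits in the working directory when running outside Colab.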

In [12]:

# Install libraries (uncomment in a fresh Colab runtime)
# !pip install pandas plotly statsmodels scikit-learn

import os

import numpy as np
import pandas as pd
import plotly.express as px
from statsmodels.tsa.holtwinters import Holt  # Holt's linear-trend model, used in the forecasting step

print("Libraries ready")


Libraries ready


In [13]:

# ✅ STEP 1: Upload your dataset
from google.colab import files
uploaded = files.upload()

file_name = list(uploaded.keys())[0]
df = pd.read_csv(file_name)

print("Loaded:", file_name)
print("Shape:", df.shape)
display(df.head())


Saving logistics_dataset.csv to logistics_dataset.csv
Loaded: logistics_dataset.csv
Shape: (10000, 10)


Unnamed: 0,order_id,courier_id,city,route_id,order_time,delivery_time,transit_hours,distance_km,shipping_cost,on_time
0,ORD00000,C052,Kolhapur,R10,2023-11-04,2023-11-04 01:51:00,1.85,212,1431.01,True
1,ORD00001,C093,Kolhapur,R05,2023-08-12,2023-08-12 06:05:00,6.083333,318,1312.24,False
2,ORD00002,C015,Pune,R03,2023-02-26,2023-02-26 04:00:00,4.0,247,1137.3,True
3,ORD00003,C072,Nagpur,R16,2023-03-16,2023-03-16 07:09:00,7.15,349,2131.42,False
4,ORD00004,C061,Mumbai,R08,2023-09-11,2023-09-11 06:16:00,6.266667,104,707.59,False




In [30]:

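# ✅ STEP 2: Detect & parse time column
# Matching is by name only, so a non-timestamp column such as `on_time` is also listed
time_candidates = [c for c in df.columns if "time" in c.lower() or "date" in c.lower()]
print("Detected time columns:", time_candidates)

# The first match is used as the ordering timestamp
time_col = time_candidates[0]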
df[time_col] = pd.to_datetime(df[time_col], errors='coerce')
df = df.sort_values(time_col)

print("Using time column:", time_col)


Detected time columns: ['order_time', 'delivery_time', 'on_time']
Using time column: order_time


In [15]:

# ✅ STEP 3: Detect courier / agent
id_candidates = [c for c in df.columns if "postman" in c.lower() or "courier" in c.lower() or "agent" in c.lower()]

if not id_candidates:
    df["courier_id"] = "C_" + (df.index // 30).astype(str)
    courier_col = "courier_id"
    print("Auto-created courier column:", courier_col)
else:
    courier_col = id_candidates[0]
    print("Using courier column:", courier_col)


Using courier column: courier_id


In [16]:

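# ✅ STEP 4: Create Sessions (Orders)
df = df.sort_values([courier_col, time_col])
# Minutes since the same courier's previous record (0 for each courier's first record)
df["dt_min"] = df.groupby(courier_col)[time_col].diff().dt.total_seconds().div(60).fillna(0)

GAP_MIN = 60  # a gap longer than this many minutes starts a new session
df["new_session"] = (df["dt_min"] > GAP_MIN).astype(int)
# Stored as session_id so the dataset's own order_id column is not overwritten
df["session_id"] = df.groupby(courier_col)["new_session"].cumsum()
df["order_uid"] = df[courier_col].astype(str) + "_" + df["session_id"].astype(str)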
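
To see what the gap threshold does, here is the same idea on a toy frame with minute-level timestamps. The column names and values are made up for illustration. A record starts a new session when it arrives more than `GAP_MIN` minutes after the same courier's previous record.

```python
import pandas as pd

GAP_MIN = 60  # same threshold as in the cell above

# Hypothetical pings: courier C1 has a 100-minute pause, C2 a 180-minute pause
pings = pd.DataFrame({
    "courier_id": ["C1", "C1", "C1", "C1", "C2", "C2"],
    "ts": pd.to_datetime([
        "2023-01-01 09:00", "2023-01-01 09:20", "2023-01-01 11:00",
        "2023-01-01 11:30", "2023-01-01 09:00", "2023-01-01 12:00",
    ]),
}).sort_values(["courier_id", "ts"])

# Minutes since the previous ping of the same courier
gap_min = pings.groupby("courier_id")["ts"].diff().dt.total_seconds().div(60).fillna(0)

# Each gap above the threshold increments the courier's session counter
pings["session_id"] = (gap_min > GAP_MIN).astype(int).groupby(pings["courier_id"]).cumsum()
print(pings["session_id"].tolist())  # [0, 0, 1, 1, 0, 1]

# Session duration = last ping minus first ping within the session
sessions = pings.groupby(["courier_id", "session_id"])["ts"].agg(start="min", end="max")
sessions["duration_min"] = (sessions["end"] - sessions["start"]).dt.total_seconds() / 60
print(sessions["duration_min"].tolist())  # [20.0, 30.0, 0.0, 0.0]
```

A session made of a single record, or of records with identical timestamps, has a duration of zero. With a date-only column such as `order_time`, any two records in the same session must share the same date, so every session-based duration is 0.0.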


In [17]:

# ✅ STEP 5: Order-Level Transit Calculation
order_times = df.groupby("order_uid")[time_col].agg(
    accept_time="min",
    delivery_time="max"
).reset_index()

order_times["transit_hours"] = (order_times["delivery_time"] - order_times["accept_time"]).dt.total_seconds() / 3600
order_times = order_times.dropna()

display(order_times.head())


Unnamed: 0,order_uid,accept_time,delivery_time,transit_hours
0,C001_0,2023-01-03,2023-01-03,0.0
1,C001_1,2023-01-05,2023-01-05,0.0
2,C001_10,2023-02-04,2023-02-04,0.0
3,C001_11,2023-02-08,2023-02-08,0.0
4,C001_12,2023-02-14,2023-02-14,0.0




In [18]:

# ✅ STEP 6: KPIs
total_orders = len(order_times)
avg_transit = round(order_times["transit_hours"].mean(), 2)
median_transit = round(order_times["transit_hours"].median(), 2)

print("Total Orders:", total_orders)
print("Avg Transit Hours:", avg_transit)
print("Median Transit Hours:", median_transit)


Total Orders: 8772
Avg Transit Hours: 0.0
Median Transit Hours: 0.0


In [19]:

# ✅ STEP 7: Courier Performance
mapping = df.groupby("order_uid")[courier_col].first().reset_index()
order_times = order_times.merge(mapping, on="order_uid")

courier_kpis = order_times.groupby(courier_col).agg(
    total_orders=("order_uid", "count"),
    avg_transit_hours=("transit_hours", "mean")
).reset_index().sort_values("avg_transit_hours")

display(courier_kpis.head(10))


Unnamed: 0,courier_id,total_orders,avg_transit_hours
0,C001,87,0.0
1,C002,89,0.0
2,C003,86,0.0
3,C004,91,0.0
4,C005,74,0.0
5,C006,93,0.0
6,C007,75,0.0
7,C008,90,0.0
8,C009,90,0.0
9,C010,80,0.0




In [20]:
# ✅ FIX: Recompute transit time + monthly trend correctly

import pandas as pd
import plotly.express as px

# Force datetime parsing
df["order_time"] = pd.to_datetime(df["order_time"], errors="coerce")
df["delivery_time"] = pd.to_datetime(df["delivery_time"], errors="coerce")

# Recalculate transit hours safely
df["transit_hours"] = (df["delivery_time"] - df["order_time"]).dt.total_seconds() / 3600

# Drop invalid rows
df = df.dropna(subset=["transit_hours"])

# Create month column properly
df["month"] = df["order_time"].dt.to_period("M").dt.to_timestamp()

# Monthly aggregation
monthly = df.groupby("month")["transit_hours"].mean().reset_index()

print("Monthly KPI Preview:")
display(monthly.head())

# ✅ WORKING Monthly Plot
fig = px.line(
    monthly,
    x="month",
    y="transit_hours",
    title="✅ Average Transit Time per Month (Corrected)"
)
fig.show()


Monthly KPI Preview:


Unnamed: 0,month,transit_hours
0,2023-01-01,5.201528
1,2023-02-01,5.284284
2,2023-03-01,5.33984
3,2023-04-01,5.327269
4,2023-05-01,5.099692




In [21]:

# ✅ STEP 9: Forecasting
from statsmodels.tsa.holtwinters import Holt

ts = monthly.set_index("month")["transit_hours"]

model = Holt(ts).fit()
forecast = model.forecast(3)

forecast




No frequency information was provided, so inferred frequency MS will be used.



Unnamed: 0,0
2024-01-01,5.299645
2024-02-01,5.30752
2024-03-01,5.315394


In [22]:
from statsmodels.tsa.holtwinters import Holt
import pandas as pd
import plotly.express as px

# 1) Time series with monthly frequency
ts = monthly.set_index("month")["transit_hours"].asfreq("MS")

# 2) Train Holt trend model
model = Holt(ts).fit()

# 3) In-sample fitted values (for existing months)
fitted = model.fittedvalues   # same index as ts

# 4) Out-of-sample forecast (next 3 months)
forecast = model.forecast(3)  # Jan–Mar 2024

print("Forecast values:")
print(forecast)

# 5) Build combined DataFrame
actual_df = ts.reset_index()
actual_df.columns = ["month", "transit_hours"]
actual_df["type"] = "Actual"

fitted_df = fitted.reset_index()
fitted_df.columns = ["month", "transit_hours"]
fitted_df["type"] = "Fitted"

forecast_df = forecast.reset_index()
forecast_df.columns = ["month", "transit_hours"]
forecast_df["type"] = "Forecast"

plot_df = pd.concat([actual_df, fitted_df, forecast_df], ignore_index=True)

# 6) Plot
fig = px.line(
    plot_df,
    x="month",
    y="transit_hours",
    color="type",
    line_dash="type",
    markers=True,
    title="Holt Trend – Actual vs Fitted vs Forecast Transit Time"
)
fig.show()


Forecast values:
2024-01-01    5.299645
2024-02-01    5.307520
2024-03-01    5.315394
Freq: MS, dtype: float64
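
For intuition about what `Holt` computes, the recursion behind Holt's linear-trend method can be written out by hand. The sketch below uses fixed smoothing parameters and a simple initialization (first value as level, first difference as trend). `statsmodels` instead estimates its parameters and initial state by optimization, so this will not reproduce the forecast values above. The function name and the `alpha`/`beta` values are assumptions for illustration.

```python
import numpy as np

def holt_forecast(y, alpha: float, beta: float, horizon: int) -> np.ndarray:
    """Holt's linear-trend smoothing with fixed parameters; returns `horizon` forecasts."""
    y = np.asarray(y, dtype=float)
    level, trend = y[0], y[1] - y[0]  # simple initialization (an assumption)
    for value in y[1:]:
        prev_level = level
        # Level: blend the new observation with the previous one-step projection
        level = alpha * value + (1 - alpha) * (prev_level + trend)
        # Trend: blend the latest change in level with the previous trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # h-step-ahead forecast extends the last level along the last trend
    return np.array([level + h * trend for h in range(1, horizon + 1)])

# On a perfectly linear series the method continues the line exactly
print(holt_forecast([1, 2, 3, 4, 5], alpha=0.8, beta=0.2, horizon=3))
```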


In [23]:
print(type(ts.index))
print(type(forecast.index))
print(ts.index[-3:])
print(forecast.index)


<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
DatetimeIndex(['2023-10-01', '2023-11-01', '2023-12-01'], dtype='datetime64[ns]', name='month', freq='MS')
DatetimeIndex(['2024-01-01', '2024-02-01', '2024-03-01'], dtype='datetime64[ns]', freq='MS')


In [24]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

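# Actual vs fitted values as NumPy arrays
# (separate names so re-running this cell does not overwrite the `fitted` Series)
actual_values = ts.to_numpy()
fitted_values = fitted.to_numpy()

# MAE: mean absolute error, in hours
mae = mean_absolute_error(actual_values, fitted_values)

# RMSE: root mean squared error, in hours (penalizes large errors more than MAE)
rmse = np.sqrt(mean_squared_error(actual_values, fitted_values))

# MAPE (in %); assumes no actual value is zero
mape = np.mean(np.abs((actual_values - fitted_values) / actual_values)) * 100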

print(f"MAE  (Average Error in Hours): {mae:.4f}")
print(f"RMSE (Weighted Error):        {rmse:.4f}")
print(f"MAPE (Percentage Error):     {mape:.2f}%")


MAE  (Average Error in Hours): 0.1346
RMSE (Weighted Error):        0.1666
MAPE (Percentage Error):     2.57%
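
These three metrics are easy to check by hand on a small example. The sketch below computes them with NumPy only, on made-up numbers. `forecast_errors` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def forecast_errors(y_true, y_fit) -> dict:
    """Return MAE, RMSE and MAPE (in %) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    errors = y_true - y_fit
    return {
        "mae": np.mean(np.abs(errors)),
        "rmse": np.sqrt(np.mean(errors ** 2)),
        # MAPE divides by the actual values, so it is undefined if any of them is zero
        "mape": np.mean(np.abs(errors / y_true)) * 100,
    }

# Errors are -0.5, 0.0, 1.0, -0.5 hours
metrics = forecast_errors([5.0, 4.0, 6.0, 5.0], [5.5, 4.0, 5.0, 5.5])
print(metrics)
```

Here MAE is 0.5 hours, while RMSE is about 0.61 hours because the single 1.0-hour miss is squared before averaging.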


In [25]:

# ✅ STEP 10: Export Results
os.makedirs("output", exist_ok=True)

order_times.to_csv("output/order_times.csv", index=False)
courier_kpis.to_csv("output/courier_kpis.csv", index=False)
monthly.to_csv("output/monthly_kpis.csv", index=False)

print("✅ Files exported")
print(os.listdir("output"))


✅ Files exported
['courier_kpis.csv', 'monthly_kpis.csv', 'order_times.csv']


In [26]:
# ✅ FINAL TRANSIT TIME FIX (MANDATORY)
df["order_time"] = pd.to_datetime(df["order_time"], errors="coerce")
df["delivery_time"] = pd.to_datetime(df["delivery_time"], errors="coerce")

df["transit_hours"] = (df["delivery_time"] - df["order_time"]).dt.total_seconds() / 3600

order_times = df.groupby("order_id")[["order_time","delivery_time","transit_hours"]].first().reset_index()
order_times["month"] = order_times["order_time"].dt.to_period("M").dt.to_timestamp()


In [27]:
courier_kpis = df.groupby("courier_id").agg(
    total_orders=("order_id", "count"),
    avg_transit_hours=("transit_hours", "mean")
).reset_index().sort_values("avg_transit_hours")

courier_kpis.tail(10)


Unnamed: 0,courier_id,total_orders,avg_transit_hours
9,C010,92,5.629167
6,C007,85,5.641176
33,C034,85,5.666667
92,C093,107,5.674766
2,C003,99,5.722896
56,C057,105,5.725714
66,C067,98,5.766667
74,C075,103,5.772492
82,C083,82,5.778455
21,C022,98,5.853912


In [29]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
