
# ***Amazon Delivery Time Prediction***



##### **Project Type**    - Supervised Machine Learning Regression Project.
##### **Contribution**    - Individual

# **Project Summary -**

This project uses supervised regression to predict Amazon delivery times from e-commerce order, agent, product, environmental (weather and traffic), and route data. The workflow covers data preparation and cleaning, feature engineering, exploratory data analysis, training of several regression models, and experiment tracking with MLflow. The resulting model estimates delivery completion time from the parameters of an order and the conditions at the time of delivery.

Given the details of a new order, the model returns an estimated delivery time that logistics teams can use for scheduling, cost control, and setting customer expectations. The accompanying Streamlit front end lets users enter these variables and receive an estimate immediately. The aim is to help Amazon optimize delivery performance and streamline supply chain operations.




# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Given the attributes of an order (agent age and rating, store and drop-off coordinates, order date and time, pickup time, weather, traffic, vehicle, and area), predict **Delivery_Time**, the time taken to complete the delivery. This is a supervised regression problem: models are trained on historical deliveries and compared on RMSE, MAE, and R².

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way, following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that each algorithm answers the following questions.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from geopy.distance import geodesic
from sklearn.model_selection import GridSearchCV

#!pip install mlflow

import mlflow
import mlflow.sklearn

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('amazon_delivery.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The dataset contains detailed information about orders, agents, and delivery conditions:

● **Order_ID**: Unique identifier for each order.

● **Agent_Age**: Age of the delivery agent.

● **Agent_Rating**: Rating of the delivery agent.

● **Store_Latitude/Longitude**: Geographic location of the store.

● **Drop_Latitude/Longitude**: Geographic location of the delivery address.

● **Order_Date/Order_Time**: Date and time when the order was placed.

● **Pickup_Time**: Time when the delivery agent picked up the order.

● **Weather**: Weather conditions during delivery.

● **Traffic**: Traffic conditions during delivery.

● **Vehicle**: Mode of transportation used for delivery.

● **Area**: Type of delivery area (Urban/Metropolitan).

● **Delivery_Time**: Target variable representing the actual time taken for delivery (in hours).

● **Category**: Category of the product being delivered.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
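One way to fill in this cell is `DataFrame.nunique()`, which counts distinct non-null values per column. The sketch below runs on a small made-up frame (`demo` is hypothetical and only borrows three column names from the dataset); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the delivery data; the values are made up.
demo = pd.DataFrame({
    "Weather": ["Sunny", "Fog", "Sunny", None],
    "Traffic": ["Low", "Jam", "Low", "High"],
    "Agent_Age": [25, 31, 25, 40],
})

# Number of distinct non-null values in each column.
unique_counts = demo.nunique()
print(unique_counts)

# Distinct values of the categorical columns, missing entries excluded.
for col in ["Weather", "Traffic"]:
    print(col, "->", sorted(demo[col].dropna().unique()))
```

Columns with very few distinct values are candidates for categorical encoding, while a column that is unique per row (such as an ID) carries no predictive signal.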

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Drop the identifier and the product category, which are not used as features
df = df.drop(columns=['Order_ID', 'Category'])

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Distribution of Delivery Time

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10,6))
sns.histplot(df['Delivery_Time'], bins=30, kde=True)
plt.title('Distribution of Delivery Time')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay shows the overall distribution of delivery times: its shape, skew, and any extreme values.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Delivery Time by Weather

In [None]:
# Chart - 2 visualization code
sns.boxplot(x="Weather", y="Delivery_Time", data=df)
plt.title("Delivery Time by Weather")
plt.show()


##### 1. Why did you pick the specific chart?

Check how delivery times vary under different weather conditions.


##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Delivery Time by Traffic Level

In [None]:
# Chart - 3 visualization code
sns.violinplot(x="Traffic", y="Delivery_Time", data=df)
plt.title("Delivery Time Distribution by Traffic")
plt.show()


##### 1. Why did you pick the specific chart?

Analyze delivery times under different traffic conditions.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Delivery Time vs Agent Rating

In [None]:
# Chart - 4 visualization code
sns.scatterplot(x="Agent_Rating", y="Delivery_Time", data=df)
plt.title("Agent Rating vs Delivery Time")
plt.show()


##### 1. Why did you pick the specific chart?

Examine whether an agent's rating is related to delivery time.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Delivery Time by Vehicle Type and Area

In [None]:
# Chart - 5 visualization code
sns.boxplot(x="Vehicle", y="Delivery_Time", data=df)
plt.title("Delivery Time by Vehicle Type")
plt.show()

sns.boxplot(x="Area", y="Delivery_Time", data=df)
plt.title("Delivery Time by Delivery Area")
plt.show()


##### 1. Why did you pick the specific chart?

Explore how vehicle type and delivery area impact times.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Correlation Heatmap

In [None]:
corr = df.select_dtypes(include=['number']).corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="Blues")
plt.title("Feature Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

See overall relationships between numerical features.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
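As one possible way to fill in this cell, a one-way ANOVA can test whether mean delivery time differs across the levels of a categorical variable. The sketch below is illustrative only: the data is synthetic, and choosing `Traffic` as the grouping variable is an assumption, not a hypothesis stated in this notebook. The null hypothesis is that all groups share the same mean delivery time.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: three traffic levels with different mean delivery times.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Traffic": np.repeat(["Low", "Medium", "Jam"], 50),
    "Delivery_Time": np.concatenate([
        rng.normal(100, 15, 50),
        rng.normal(120, 15, 50),
        rng.normal(150, 15, 50),
    ]),
})

# One array of delivery times per traffic level.
groups = [g["Delivery_Time"].to_numpy() for _, g in demo.groupby("Traffic")]

# H0: every traffic level has the same mean delivery time.
f_stat, p_value = stats.f_oneway(*groups)

alpha = 0.05
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

ANOVA assumes roughly normal groups with similar variances; if the real data violates this, a Kruskal-Wallis test (`stats.kruskal`) is a common non-parametric alternative.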

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
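# Assign the result back instead of calling inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df['Agent_Rating'] = df['Agent_Rating'].fillna(df['Agent_Rating'].mean())
df['Weather'] = df['Weather'].fillna(df['Weather'].mode()[0])
df['Traffic'] = df['Traffic'].fillna(df['Traffic'].mode()[0])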
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
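A common treatment that could go in this cell is clipping values to the 1.5 × IQR fences. The sketch below is one option under that assumption, not the method used in this project; `clip_iqr` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up delivery times with one extreme value.
times = pd.Series([90, 100, 110, 120, 130, 600], name="Delivery_Time")
clipped = clip_iqr(times)
print(clipped.tolist())
```

Clipping keeps every row, which matters when the dataset is small; dropping the rows instead is the stricter alternative. Because the target itself is being modified here, it is worth checking first whether the extreme values are genuine long deliveries rather than recording errors.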


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### Convert date columns to datetime

In [None]:
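df['Order_Date'] = pd.to_datetime(df['Order_Date'])
# Order_Date carries no time of day, so take the hour from Order_Time
# (assumes HH:MM:SS strings; unparseable entries become NaT)
order_time = pd.to_datetime(df['Order_Time'], format='%H:%M:%S', errors='coerce')
# Extract useful time features; rows with no parseable time get hour 0
df['Hour'] = order_time.dt.hour.fillna(0).astype(int)
df['DayOfWeek'] = df['Order_Date'].dt.dayofweek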

In [None]:
df['CalculatedDistance'] = df.apply(lambda row: geodesic((row['Store_Latitude'], row['Store_Longitude']), (row['Drop_Latitude'], row['Drop_Longitude'])).km, axis=1)
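The row-wise `apply` with `geodesic` can be slow on large frames. A vectorized haversine formula is a common approximation; the sketch below is an alternative under that assumption, not a drop-in replacement. It treats the Earth as a sphere, so results differ slightly from geopy's ellipsoidal `geodesic`. `haversine_km` is a hypothetical helper name.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays, or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# One degree of latitude along a meridian, and a zero-length trip.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
same_point = haversine_km(12.9, 77.6, 12.9, 77.6)
print(one_degree, same_point)
```

In the notebook this could be called as `haversine_km(df['Store_Latitude'], df['Store_Longitude'], df['Drop_Latitude'], df['Drop_Longitude'])`, which computes all rows in one pass.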


### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
df = pd.get_dummies(df, columns=['Weather', 'Traffic', 'Vehicle','Area'],dtype=int)
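One practical consequence of one-hot encoding is that a new order seen at prediction time will usually contain only one category per column, so its dummy columns will not match the training columns. A minimal sketch of aligning them with `reindex` follows; the frames are made up and this alignment step is an assumption about how deployment might be handled, not something this notebook does.

```python
import pandas as pd

# Made-up training rows and a single new order.
train = pd.DataFrame({"Traffic": ["Low", "Jam", "High"], "Agent_Age": [25, 31, 40]})
new_order = pd.DataFrame({"Traffic": ["Jam"], "Agent_Age": [29]})

train_encoded = pd.get_dummies(train, columns=["Traffic"], dtype=int)
new_encoded = pd.get_dummies(new_order, columns=["Traffic"], dtype=int)

# Add the dummy columns the new row is missing (filled with 0) and
# put all columns in the same order the model was trained on.
new_encoded = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(new_encoded)
```

Storing `train_encoded.columns` alongside the saved model makes this alignment reproducible outside the notebook.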


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Drop the raw coordinates and raw date/time columns now that the derived
# features (distance, hour, day of week) have been created
df = df.drop(columns=[
    'Store_Latitude', 'Store_Longitude', 'Drop_Latitude', 'Drop_Longitude',
    'Order_Time', 'Pickup_Time', 'Order_Date',
])

In [None]:
df.head()

#### 2. Feature Selection

In [None]:
X = df.drop('Delivery_Time', axis=1)
y = df['Delivery_Time']

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
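If scaling is used, a typical choice is standardization with scikit-learn's `StandardScaler`, fitted on the training portion only so that no information from the test set leaks into the transform. The sketch below uses made-up arrays and is one option rather than the approach taken in this notebook; tree-based models such as Random Forest and Gradient Boosting generally do not need scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices: two numeric features on different scales.
X_train_demo = np.array([[20.0, 1.0], [30.0, 3.0], [40.0, 5.0]])
X_test_demo = np.array([[30.0, 3.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train_demo)
# Reuse the training statistics on the test data.
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled)
print(X_test_scaled)
```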

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### Model 1: Linear Regression

In [None]:
# ML Model - 1 Implementation
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_test)

# RMSE is the square root of the mean squared error
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))
lr_mae = mean_absolute_error(y_test, lr_preds)
lr_r2 = r2_score(y_test, lr_preds)

print(f"Linear Regression -- RMSE: {lr_rmse:.4f}, MAE: {lr_mae:.4f}, R2: {lr_r2:.4f}")
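RMSE is the square root of MSE, so it is expressed in the same units as the target, while MSE is in squared units. The helper below is a small sketch for computing the three metrics reported here in one place; `regression_report` is a hypothetical name and the numbers are made up so the arithmetic can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return RMSE, MAE and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "rmse": float(np.sqrt(mse)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Every prediction is off by exactly 10, so MSE = 100 and RMSE = MAE = 10.
y_true = [100, 120, 140, 160]
y_pred = [110, 110, 150, 150]
report = regression_report(y_true, y_pred)
print(report)
```

Here the residual sum of squares is 400 and the total sum of squares is 2000, which gives R2 = 1 - 400/2000 = 0.8.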

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### Model 2: Random Forest Regressor

In [None]:
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)

# RMSE is the square root of the mean squared error
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_preds))
rf_mae = mean_absolute_error(y_test, rf_preds)
rf_r2 = r2_score(y_test, rf_preds)

print(f"Random Forest -- RMSE: {rf_rmse:.4f}, MAE: {rf_mae:.4f}, R2: {rf_r2:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
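A minimal sketch of how this cell could be filled in with `GridSearchCV` follows. It runs on synthetic regression data so that it is self-contained; the parameter grid, the 3-fold setting, and the variable names are assumptions chosen to keep the example fast, not tuned values for the delivery dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the delivery features and target.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negation
)

# Fit the algorithm on the training portion only.
search.fit(X_tr, y_tr)

# Predict with the refitted best estimator and score on held-out data.
best_cv_rmse = -search.best_score_
test_rmse = float(np.sqrt(np.mean((y_te - search.predict(X_te)) ** 2)))

print("Best params:", search.best_params_)
print(f"CV RMSE: {best_cv_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

Comparing the cross-validated RMSE with the test RMSE gives a quick check on whether the tuned model generalizes.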

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### Model 3: Gradient Boosting Regressor

In [None]:
# ML Model - 3 Implementation
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
gb_preds = gb_model.predict(X_test)

# RMSE is the square root of the mean squared error
gb_rmse = np.sqrt(mean_squared_error(y_test, gb_preds))
gb_mae = mean_absolute_error(y_test, gb_preds)
gb_r2 = r2_score(y_test, gb_preds)

print(f"Gradient Boosting -- RMSE: {gb_rmse:.4f}, MAE: {gb_mae:.4f}, R2: {gb_r2:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
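One model-agnostic explainability tool available in scikit-learn is permutation importance: each feature is shuffled in turn and the resulting drop in score is recorded. The sketch below is illustrative only. The data is synthetic, the target is constructed so that distance dominates, and the `noise` column is a made-up feature with no effect, so the ranking shown says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic features: the target depends strongly on distance, weakly on rating.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "CalculatedDistance": rng.uniform(1, 20, 300),
    "Agent_Rating": rng.uniform(3, 5, 300),
    "noise": rng.normal(size=300),
})
y_demo = 10 * X_demo["CalculatedDistance"] - 5 * X_demo["Agent_Rating"] + rng.normal(0, 2, 300)

# Hold out the last 60 rows so importance is measured on unseen data.
X_tr, X_te = X_demo.iloc[:240], X_demo.iloc[240:]
y_tr, y_te = y_demo.iloc[:240], y_demo.iloc[240:]

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature several times and average the drop in R2.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
importances = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(importances)
```

Unlike the impurity-based `feature_importances_` attribute of tree ensembles, permutation importance can be applied to any fitted estimator, including the linear model.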

In [None]:
## 8. Model Training, Evaluation & MLflow Tracking Separately

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

best_model_name = None
best_rmse = float('inf')
best_model = None

for name, model in models.items():
    with mlflow.start_run(run_name=name):
        print(f"Training and evaluating {name} model...")

        # Train
        model.fit(X_train, y_train)

        # Predict
        predictions = model.predict(X_test)

        # Evaluate
        # RMSE is the square root of the mean squared error
        rmse = np.sqrt(mean_squared_error(y_test, predictions))
        mae = mean_absolute_error(y_test, predictions)
        r2 = r2_score(y_test, predictions)

        # Log to MLflow
        mlflow.sklearn.log_model(model, f"{name.lower().replace(' ', '_')}_model")
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("mae", mae)
        mlflow.log_metric("r2", r2)

        print(f"{name} - RMSE: {rmse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}")

        # Track best model based on RMSE
        if rmse < best_rmse:
            best_rmse = rmse
            best_model_name = name
            best_model = model

print(f"\nBest Model: {best_model_name} with RMSE: {best_rmse:.4f}")

In [None]:
## 9. Save the Best Model for Deployment

import joblib

model_filename = f"{best_model_name.lower().replace(' ', '_')}_delivery_model.pkl"
joblib.dump(best_model, model_filename)
print(f"Best model saved as {model_filename}")
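On the consuming side (for example a Streamlit app), the saved file is read back with `joblib.load` and used for prediction. The round trip below is a sketch on a toy model. The file name and data are made up, and a real caller would also need to pass features in the same column order used for training.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model: the target follows y = 10 * x + 2 exactly.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([12.0, 22.0, 32.0, 42.0])
model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory and load it back, as a deployment would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_delivery_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model reproduces the original predictions.
prediction = restored.predict(np.array([[5.0]]))[0]
print(f"Predicted value for x = 5: {prediction:.1f}")
```

Pickled scikit-learn models are tied to the library version that produced them, so the deployment environment should pin the same scikit-learn version as the notebook.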


# **Conclusion**


- Completed separate training, evaluation, and MLflow experiment tracking for Linear Regression, Random Forest, and Gradient Boosting models.
- Identified the best model based on RMSE metric for deployment.
- Saved the best model in a serialized format for later use in a Streamlit app or other deployment environment.
- Next steps: Build interactive interfaces and automate data pipelines for real-time delivery time prediction.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***