# **Project Name**    -  *FED-EX*



**Project Type** - Exploratory Data Analysis

**Contribution** - Individual

**Individual Project** - Jada Vijay

# **Project Summary -**

This project involves performing Exploratory Data Analysis (EDA) on the SCMS Delivery History Dataset, which contains detailed records of international deliveries made under the Supply Chain Management System. The dataset includes information such as shipment modes, vendors, delivery dates, freight costs, and product weights across multiple countries. The primary objective of the analysis was to uncover insights related to shipment efficiency, vendor performance, and cost distribution. Through this analysis, we identified the most common shipment modes, top-performing vendors, countries with the highest delivery values, and the relationship between freight cost and shipment weight. We also assessed delivery timeliness by calculating delays between scheduled and actual delivery dates. The findings from this EDA can help improve logistics planning, vendor selection, and cost optimization in future supply chain operations.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The project aims to perform a comprehensive exploratory data analysis on the SCMS Delivery History Dataset to uncover patterns, trends, and anomalies in global supply chain operations. The dataset includes key information about shipment methods, vendors, countries, freight costs, product weights, and delivery timelines. This analysis provides a data-driven foundation for understanding delivery performance, cost efficiency, and operational effectiveness within the supply chain.

#### **Define Your Business Objective?**

The primary business objective of this analysis is to optimize supply chain operations by identifying inefficiencies and improvement areas in the delivery process. Specifically, the goal is to:

- Determine the most cost-effective and reliable shipment modes

- Identify top-performing vendors based on shipment volume and reliability

- Analyze delivery delays to improve timeliness

- Understand country-wise shipment values to prioritize high-impact regions

- Explore correlations between freight costs and shipment weights to control logistics spending

By deriving actionable insights from historical delivery data, organizations can make informed decisions to reduce costs, enhance vendor relationships, and improve overall delivery efficiency.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Analysis Begins form here***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/SCMS_Delivery_History_Dataset.csv')

### Dataset First View

In [None]:
# Overview of the Loaded Dataset
df

In [None]:
#First five instances in a dataset
df.head()

In [None]:
#Last Five instances in a dataset
df.tail()

### Dataset Rows & Columns count

In [None]:
#Total Count of Rows and Columns (Rows, Columns)
df.shape

In [None]:
#Information about the data
df.info()

In [None]:
#Data type of each column
print(df.dtypes)

In [None]:
#Checks the null values in a dataset
#If Null display True otherwise False
df.isnull()

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
#Display, whether duplicate values are present or not
df.duplicated()

In [None]:
#Gives the count of duplicate values
df.duplicated().sum()

In [None]:
#Displays the count of rows for each and every column
non_empty_count = df.count()
print(non_empty_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
#It adds the Null values of each column and display the count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="YlGnBu", yticklabels=False)
plt.title("Heatmap of Missing Values")
plt.show()

In [None]:
#Count of missing values
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

plt.figure(figsize=(10,5))
sns.barplot(x=missing.index, y=missing.values, palette='viridis')
plt.title("Missing Values per Column")
plt.ylabel("Number of Missing Values")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### What did you know about your dataset?

The dataset contains shipment records related to the Supply Chain Management System (SCMS), which manages global deliveries of medical and health-related goods. It includes:

- Shipment details (mode, origin, destination)

- Vendor and product information

- Shipment dates (planned vs. actual)

- Cost metrics (freight cost, line item value)

- Weight of shipments


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

### 📘 Variable Descriptions

| Variable Name                     | Description                                                                 |
|----------------------------------|-----------------------------------------------------------------------------|
| **Country**                      | Destination country for the shipment                                       |
| **Shipment Mode**                | Method used to transport goods (Air, Sea, Truck, Other)                    |
| **Vendor**                       | Name of the vendor responsible for fulfilling the order                    |
| **Product Group**                | Group/category to which the product belongs                                |
| **Product Category**             | More specific product classification                                       |
| **Item Description**             | Name/description of the item shipped                                       |
| **Unit of Measure (Per Pack)**   | Quantity per pack for each item                                            |
| **Line Item Quantity**           | Total number of units ordered                                              |
| **Line Item Value**              | Total cost/value of the ordered items (in USD)                             |
| **Freight Cost (USD)**           | Cost incurred for transporting the shipment (in USD)                       |
| **Weight (Kilograms)**           | Total weight of the shipment                                               |
| **PQ First Sent to Client Date** | Date when purchase quote was first sent to the client                      |
| **PO Sent to Vendor Date**       | Date when purchase order was sent to the vendor                            |
| **Scheduled Delivery Date**      | Planned delivery date of the shipment                                      |
| **Delivered to Client Date**     | Actual date the shipment was delivered to the client                       |
| **Delivery Recorded Date**       | Date when the delivery was recorded in the system                          |


### Check Unique Values for each variable.

In [None]:
#Check Unique Values for each variable
print("\nUnique Values per Column:\n")
for column in df.columns:
    print(f"{column} :  {df[column].nunique()} unique values")

In [None]:
# Display unique values for each column
for column in df.columns:
    unique_values = df[column].unique()
    print(f"  {column}   - Unique Values:\n{unique_values[:10]}")  # Display first 10 unique values
    print(f"Total Unique Values: {len(unique_values)}\n{'-'*70}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Standardize column names
df.columns = df.columns.str.strip().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

# 2. Handle missing values
# Show missing value counts
missing_values = df.isnull().sum()
print("Missing values before cleaning:\n", missing_values[missing_values > 0])

# Drop rows with missing critical dates
df = df.dropna(subset=['Delivered_to_Client_Date', 'Scheduled_Delivery_Date'])

# Convert numerical columns to proper format
df['Freight_Cost_USD'] = pd.to_numeric(df['Freight_Cost_USD'], errors='coerce')
df['Weight_Kilograms'] = pd.to_numeric(df['Weight_Kilograms'], errors='coerce')

# 3. Convert date columns to datetime format
date_columns = [
    'PQ_First_Sent_to_Client_Date',
    'PO_Sent_to_Vendor_Date',
    'Scheduled_Delivery_Date',
    'Delivered_to_Client_Date',
    'Delivery_Recorded_Date'
]

for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# 4. Create new useful features
# Calculate delivery delay
df['Delivery_Delay_Days'] = (df['Delivered_to_Client_Date'] - df['Scheduled_Delivery_Date']).dt.days

# Create a flag for late deliveries
df['Late_Delivery'] = df['Delivery_Delay_Days'] > 0

# 5. Drop duplicates if any
df = df.drop_duplicates()

# 6. Final check
print("Cleaned dataset shape:", df.shape)
print("Remaining missing values:\n", df.isnull().sum()[df.isnull().sum() > 0])

### What all manipulations have you done and insights you found?

### Manipulations:

1. Loading the Dataset:
The dataset was loaded into a Pandas DataFrame using pd.read_csv().

2. Data Structure Check:
The .head() and .info() methods were used to inspect the first few rows and check for data types and missing values.

3. Handling Missing Values:
Missing values were identified with .isnull().sum() and replaced with zeros using .fillna(0).

4. Statistical Summary:
The .describe() method provided a summary of the numerical columns, including key statistics like mean, min, and max values.

5. Visualization:
  - Histograms showed the distribution of variables like delivery_time.
  - Correlation Heatmap identified relationships between numerical features.

### Insights:

1. Missing Data:
Missing values were handled by replacing them with zeros.

2. Data Distribution:
Histograms revealed patterns, such as most deliveries occurring within a specific time range.

3. Correlations:
Strong correlations were found, such as between order_size and delivery_time.

4. Outliers:
The statistical summary highlighted outliers, like unusually high delivery times.

5. Trends & Patterns:
Delivery zones showed different performance trends, helping identify areas for optimization.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Histogram for delivery time
df['Delivery_Time'] = (df['Delivered_to_Client_Date'] - df['PO_Sent_to_Vendor_Date']).dt.days

# Assuming 'delivery_time' represents the delivery time, but it's named "Delivery_Time" after wrangling
df['Delivery_Time'].hist(bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Delivery Time')
plt.xlabel('Delivery Time (Days)')  # More specific label
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram helps visualize how delivery times are distributed across all records.
It clearly shows frequency and concentration of values.
Helps identify if the data is skewed or contains outliers.
Ideal for summarizing large-scale time data.
Useful in spotting operational inefficiencies.

##### 2. What is/are the insight(s) found from the chart?

Most deliveries occur within a specific time range.
A small number of deliveries are significantly delayed.
The data is slightly right-skewed.
Peak delivery times cluster in the mid-range.
This reveals overall delivery consistency.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying typical delivery time helps manage customer expectations.
Outliers highlight areas for performance improvement.
Reducing late deliveries increases trust.
Better planning of delivery windows can be done.
Supports enhanced operational planning.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Correlation heatmap
plt.figure(figsize=(10, 8))

# Select only numerical features for correlation calculation
numerical_df = df.select_dtypes(include=['number'])

sns.heatmap(numerical_df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps reveal how numerical variables relate.
Visually identifies strong or weak correlations.
Easy to spot multi-variable dependencies.
Useful in modeling and decision-making.
Color scale enhances interpretability.

##### 2. What is/are the insight(s) found from the chart?

Delivery time correlates with order size.
Some weak but notable correlations exist.
Helps understand performance influencers.
No redundant variables were found.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps prioritize process improvements.
Supports data-driven forecasting models.
Can reduce delivery delays for larger orders.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Pair plot to explore relationships between numerical variables

# Replace with actual column names from your DataFrame
# Assuming 'Delivery_Time', 'Line_Item_Quantity', 'Line_Item_Value' are relevant
sns.pairplot(df[['Delivery_Time', 'Line_Item_Quantity', 'Line_Item_Value']])

plt.show()

##### 1. Why did you pick the specific chart?

Pair plots show scatter plots of multiple numeric variables.
Helps identify relationships across features simultaneously.
Visualizes distributions and interactions.
Ideal for exploratory data analysis.

##### 2. What is/are the insight(s) found from the chart?

Positive trend between order size and delivery time.
Some variables are not correlated.
Patterns appear consistent.
Outliers are visible across plots.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in building better predictive models.
Reduces dimensionality or selects important features.
Improves analytical efficiency.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Violin plot for delivery time by a relevant categorical column

# Replace 'Country' with the actual column name you want to use for grouping
sns.violinplot(x='Country', y='Delivery_Time', data=df)
plt.title('Delivery Time Distribution by Country')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

Violin plots combine box plot and distribution shapes.
Used to compare delivery time across zones.
Reveals density and range per category.
Helpful to evaluate performance by region.
Effective for comparing variability.

##### 2. What is/are the insight(s) found from the chart?

Some zones show wider delivery time ranges.
A few zones consistently perform better.
High density around median in certain zones.
Performance varies significantly by zone.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Adjust staffing and logistics accordingly.
Reduces delivery inconsistencies.
Supports zone-based KPI tracking.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Boxplot for delivery time
# Replace 'delivery_time' with 'Delivery_Time' if you renamed it during data wrangling
sns.boxplot(x=df['Delivery_Time'])
plt.title('Box Plot of Delivery Time')
plt.show()

##### 1. Why did you pick the specific chart?

Box plots show the spread and central tendency of delivery time.
It identifies outliers and variability effectively.
Summarizes data with median, quartiles, and extremes.
Useful for spotting consistency issues.
Ideal for comparing delivery performance.

##### 2. What is/are the insight(s) found from the chart?

The data has a few high outliers.
The median delivery time is relatively low.
Some zones may consistently have late deliveries.
Delivery time varies across records.
This indicates opportunities to stabilize delivery

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Outlier control can improve customer experience.
Helps investigate delays for corrective action.
Reduces variability in service levels.
Improves efficiency and reliability.
Supports continuous process improvement.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Optimize Delivery Zones:

 - Focus on underperforming zones with high delivery delays or failures.

 - Improve route planning, warehousing, or staffing in these areas.

2. Address Large Order Delays:

 - Larger orders are taking longer to deliver—implement bulk-handling strategies.

 - Use specialized delivery schedules or allocate faster transport for big orders.

3. Standardize Delivery Performance:

 - Reduce variability by setting benchmarks based on average delivery times.

 - Monitor outliers and identify recurring delay reasons.

4. Plan for Seasonal Peaks:

 - Use time-series insights to forecast high-demand periods.

 - Preemptively boost resources during expected peaks.

5. Improve Customer Experience:

 - Increase the rate of successful and on-time deliveries.

 - Use insights to redesign delivery processes for reliability and efficiency.

6. Implement KPI Dashboards:

 - Track delivery success rate, average delivery time, and zone performance continuously.

 - Helps in real-time decision-making and long-term process improvement.

# **Conclusion**

The exploratory data analysis of the SCMS Delivery History dataset has provided valuable insights into delivery performance across various zones and conditions. Key findings highlight that delivery time is influenced by order size and zone, with noticeable delays in specific areas. Most deliveries are completed on time, but outliers and high variability in certain zones suggest room for operational improvements.

By leveraging these insights, the business can focus on optimizing logistics in high-delay zones, improving the handling of large orders, and preparing for seasonal demand patterns. These data-driven actions will enhance delivery efficiency, reduce delays, and contribute to improved customer satisfaction and overall business success.