# Candy Distributor Shipping Analysis

This notebook loads the candy distributor dataset and performs exploratory data analysis (EDA) on shipping and supply-chain performance.


In [1]:
# Install any required packages (uncomment if running in a fresh environment)
# %pip install pandas numpy matplotlib seaborn

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

sns.set(style="whitegrid")


## Load candy distributor dataset

Update the file paths in the next cell to point to your copies of the dataset files (for example: `dataset/Candy_Sales.csv`).


In [23]:
# Paths to the candy distributor dataset files; update these if your files live elsewhere
data_path_candy_sales = "dataset/Candy_Sales.csv"
data_path_candy_factories = "dataset/Candy_Factories.csv"
data_path_candy_products = "dataset/Candy_Products.csv"
data_path_candy_targets = "dataset/Candy_Targets.csv"
data_path_distributor_data_dictionary = "dataset/candy_distributor_data_dictionary.csv"
data_path_uszips = "dataset/uszips.csv"

# Fail early with a clear message if any file is missing
for data_path in [
    data_path_candy_sales,
    data_path_candy_factories,
    data_path_candy_products,
    data_path_candy_targets,
    data_path_distributor_data_dictionary,
    data_path_uszips,
]:
    if not os.path.exists(data_path):
        raise FileNotFoundError(
            f"Dataset not found at '{data_path}'. Please update the path to the correct location."
        )

# Read the datasets
df_candy_sales = pd.read_csv(data_path_candy_sales)
df_candy_factories = pd.read_csv(data_path_candy_factories)
df_candy_products = pd.read_csv(data_path_candy_products)
df_candy_targets = pd.read_csv(data_path_candy_targets)
df_candy_distributor_data_dictionary = pd.read_csv(data_path_distributor_data_dictionary)
df_uszips = pd.read_csv(data_path_uszips)

# Quick peek at the sales data
print("Shape:", df_candy_sales.shape)
df_candy_sales.head()


Shape: (10194, 18)


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Country/Region,City,State/Province,Postal Code,Division,Region,Product ID,Product Name,Sales,Units,Gross Profit,Cost
0,282,US-2021-128055-CHO-TRI-54000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-TRI-54000,Wonka Bar - Triple Dazzle Caramel,7.5,2,4.9,2.6
1,288,US-2021-128055-CHO-SCR-58000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-SCR-58000,Wonka Bar -Scrumdiddlyumptious,7.2,2,5.0,2.2
2,1132,US-2021-138100-CHO-FUD-51000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-FUD-51000,Wonka Bar - Fudge Mallows,7.2,2,4.8,2.4
3,1133,US-2021-138100-CHO-MIL-31000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-MIL-31000,Wonka Bar - Milk Chocolate,9.75,3,6.33,3.42
4,3396,US-2022-121391-CHO-MIL-31000,2022-10-04,2028-03-29,First Class,121391,United States,San Francisco,California,94109,Chocolate,Pacific,CHO-MIL-31000,Wonka Bar - Milk Chocolate,6.5,2,4.22,2.28


In [19]:
# Understand the demographics of the sales data (only the last expression is displayed)
df_candy_sales['Country/Region'].value_counts()
df_candy_sales['Division'].value_counts()
df_candy_sales['Product Name'].unique()
df_candy_sales['Customer ID'].nunique()

5044

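The sales table has 10,194 rows but only 5,044 unique customer IDs, so customer IDs do repeat: on average each customer appears on about two rows. For example, customer 128055 appears in both of the first two rows shown above. Whether a repeated ID reflects a repeat purchase or several line items from one order still needs a check against `Order Date`.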
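
One way to check whether customers repeat is to count rows per `Customer ID`. The sketch below uses a small made-up frame (`toy_sales` and its `Order ID` values are illustrative assumptions); on the real data, `df_candy_sales` would take its place.

```python
import pandas as pd

# Hypothetical miniature of the sales table: one row per order line
toy_sales = pd.DataFrame({
    "Customer ID": [128055, 128055, 138100, 121391],
    "Order ID": ["A-1", "A-2", "B-1", "C-1"],
})

# Rows per customer; a count above 1 means the ID appears on several rows
rows_per_customer = toy_sales["Customer ID"].value_counts()
repeat_customers = rows_per_customer[rows_per_customer > 1]

print("Unique customers:", toy_sales["Customer ID"].nunique())
print("Customers on more than one row:", len(repeat_customers))
```

Comparing `nunique()` with the row count gives the same signal at a glance: if the two numbers differ, at least one customer appears more than once.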

## Initial data inspection

Run the following cells after successfully loading the dataset to understand its structure and basic statistics.


In [5]:
# Column info for the sales data
print("Columns:\n", df_candy_sales.columns.tolist())
print("\nData types:")
print(df_candy_sales.dtypes)

# Missing values summary
print("\nMissing values per column:")
print(df_candy_sales.isna().sum())


Columns:
 ['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Country/Region', 'City', 'State/Province', 'Postal Code', 'Division', 'Region', 'Product ID', 'Product Name', 'Sales', 'Units', 'Gross Profit', 'Cost']

Data types:
Row ID              int64
Order ID           object
Order Date         object
Ship Date          object
Ship Mode          object
Customer ID         int64
Country/Region     object
City               object
State/Province     object
Postal Code        object
Division           object
Region             object
Product ID         object
Product Name       object
Sales             float64
Units               int64
Gross Profit      float64
Cost              float64
dtype: object

Missing values per column:
Row ID            0
Order ID          0
Order Date        0
Ship Date         0
Ship Mode         0
Customer ID       0
Country/Region    0
City              0
State/Province    0
Postal Code       0
Division          0
Region          

In [6]:
# Basic descriptive statistics for the numeric columns of the sales data
df_candy_sales.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Row ID,10194.0,5097.5,2942.898656,1.0,2549.25,5097.5,7645.75,10194.0
Customer ID,10194.0,134468.961154,20231.483007,100006.0,117212.0,133550.0,152051.0,192314.0
Sales,10194.0,13.908537,11.34102,1.25,7.2,10.8,18.0,260.0
Units,10194.0,3.791838,2.228317,1.0,2.0,3.0,5.0,14.0
Gross Profit,10194.0,9.166451,6.64374,0.25,4.9,7.47,12.25,130.0
Cost,10194.0,4.742087,5.061647,0.6,2.4,3.6,5.7,130.0


## 1. What are the most efficient factory-to-customer shipping routes?

In [20]:
df_candy_factories.head()

Unnamed: 0,Factory,Latitude,Longitude
0,Lot's O' Nuts,32.881893,-111.768036
1,Wicked Choccy's,32.076176,-81.088371
2,Sugar Shack,48.11914,-96.18115
3,Secret Factory,41.446333,-90.565487
4,The Other Factory,35.1175,-89.971107


In [48]:
df_candy_factories['Factory'].unique()

array(["Lot's O' Nuts", "Wicked Choccy's", 'Sugar Shack', 'Secret Factory',
       'The Other Factory'], dtype=object)

In [24]:
df_uszips.head()

Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,county_weights,county_names_all,county_fips_all,imprecise,military,timezone
0,601,18.18027,-66.75266,Adjuntas,PR,Puerto Rico,True,,16834.0,100.9,72001,Adjuntas,"{""72001"": 98.74, ""72141"": 1.26}",Adjuntas|Utuado,72001|72141,False,False,America/Puerto_Rico
1,602,18.36075,-67.17541,Aguada,PR,Puerto Rico,True,,37642.0,479.2,72003,Aguada,"{""72003"": 100}",Aguada,72003,False,False,America/Puerto_Rico
2,603,18.45744,-67.12225,Aguadilla,PR,Puerto Rico,True,,49075.0,551.7,72005,Aguadilla,"{""72005"": 99.76, ""72099"": 0.24}",Aguadilla|Moca,72005|72099,False,False,America/Puerto_Rico
3,606,18.16585,-66.93716,Maricao,PR,Puerto Rico,True,,5590.0,48.7,72093,Maricao,"{""72093"": 82.27, ""72153"": 11.66, ""72121"": 6.06}",Maricao|Yauco|Sabana Grande,72093|72153|72121,False,False,America/Puerto_Rico
4,610,18.2911,-67.12243,Anasco,PR,Puerto Rico,True,,25542.0,265.7,72011,Añasco,"{""72011"": 96.71, ""72099"": 2.82, ""72083"": 0.37,...",Añasco|Moca|Las Marías|Aguada,72011|72099|72083|72003,False,False,America/Puerto_Rico


In [25]:
df_candy_sales.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Country/Region,City,State/Province,Postal Code,Division,Region,Product ID,Product Name,Sales,Units,Gross Profit,Cost
0,282,US-2021-128055-CHO-TRI-54000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-TRI-54000,Wonka Bar - Triple Dazzle Caramel,7.5,2,4.9,2.6
1,288,US-2021-128055-CHO-SCR-58000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-SCR-58000,Wonka Bar -Scrumdiddlyumptious,7.2,2,5.0,2.2
2,1132,US-2021-138100-CHO-FUD-51000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-FUD-51000,Wonka Bar - Fudge Mallows,7.2,2,4.8,2.4
3,1133,US-2021-138100-CHO-MIL-31000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-MIL-31000,Wonka Bar - Milk Chocolate,9.75,3,6.33,3.42
4,3396,US-2022-121391-CHO-MIL-31000,2022-10-04,2028-03-29,First Class,121391,United States,San Francisco,California,94109,Chocolate,Pacific,CHO-MIL-31000,Wonka Bar - Milk Chocolate,6.5,2,4.22,2.28


In [29]:
df_uszips.columns

Index(['zip', 'lat', 'lng', 'city', 'state_id', 'state_name', 'zcta', 'parent_zcta', 'population', 'density',
       'county_fips', 'county_name', 'county_weights', 'county_names_all', 'county_fips_all', 'imprecise', 'military',
       'timezone'],
      dtype='object')

In [33]:
# Ensure postal codes are strings to avoid dropping leading zeros (e.g., 02108)
df_candy_sales['Postal Code'] = df_candy_sales['Postal Code'].astype(str).str.zfill(5)
df_uszips['zip'] = df_uszips['zip'].astype(str).str.zfill(5)

# Merge coordinates to the sales data
sales_geo = df_candy_sales.merge(
    df_uszips[['zip', 'lat', 'lng']], 
    left_on='Postal Code', 
    right_on='zip', 
    how='left'
)

# Rename columns for clarity
sales_geo = sales_geo.rename(columns={'lat': 'cust_lat', 'lng': 'cust_lng'})
sales_geo.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Country/Region,City,State/Province,Postal Code,Division,Region,Product ID,Product Name,Sales,Units,Gross Profit,Cost,zip,cust_lat,cust_lng
0,282,US-2021-128055-CHO-TRI-54000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-TRI-54000,Wonka Bar - Triple Dazzle Caramel,7.5,2,4.9,2.6,94122,37.76113,-122.48433
1,288,US-2021-128055-CHO-SCR-58000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-SCR-58000,Wonka Bar -Scrumdiddlyumptious,7.2,2,5.0,2.2,94122,37.76113,-122.48433
2,1132,US-2021-138100-CHO-FUD-51000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-FUD-51000,Wonka Bar - Fudge Mallows,7.2,2,4.8,2.4,10011,40.74173,-74.00037
3,1133,US-2021-138100-CHO-MIL-31000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-MIL-31000,Wonka Bar - Milk Chocolate,9.75,3,6.33,3.42,10011,40.74173,-74.00037
4,3396,US-2022-121391-CHO-MIL-31000,2022-10-04,2028-03-29,First Class,121391,United States,San Francisco,California,94109,Chocolate,Pacific,CHO-MIL-31000,Wonka Bar - Milk Chocolate,6.5,2,4.22,2.28,94109,37.79334,-122.42138
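
A left merge keeps every order even when its postal code has no match, so it can be worth counting the unmatched rows. The sketch below shows one way to do that with the `indicator` option of `merge`; the frames `toy_orders` and `toy_zips`, the unmatched code `99999`, and the `02108` coordinates are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frames; the notebook itself would use df_candy_sales and df_uszips
toy_orders = pd.DataFrame({"Postal Code": [2108, 94122, 99999]})
toy_zips = pd.DataFrame({
    "zip": [2108, 94122],
    "lat": [42.3576, 37.76113],      # first value is illustrative only
    "lng": [-71.0684, -122.48433],   # first value is illustrative only
})

# Integer-typed codes lose leading zeros, so pad back to five characters
toy_orders["Postal Code"] = toy_orders["Postal Code"].astype(str).str.zfill(5)
toy_zips["zip"] = toy_zips["zip"].astype(str).str.zfill(5)

# indicator=True adds a '_merge' column that flags rows with no ZIP match
merged = toy_orders.merge(
    toy_zips, left_on="Postal Code", right_on="zip", how="left", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "order(s) without coordinates")
```

Rows flagged `left_only` end up with missing `lat` and `lng`, which would propagate as missing values into any distance computed from them.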


Using the `uszips` dataset, we now have the latitude and longitude of each customer's postal code. The next step is to find efficient routes by comparing these coordinates with the factory coordinates in `df_candy_factories`.
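
The distance between two latitude/longitude pairs can be approximated with the haversine (great-circle) formula. The sketch below is a plain NumPy version, given as an illustration rather than the notebook's own implementation; the function name `haversine_miles` and the Earth radius constant are assumptions.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; an approximation

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles; accepts scalars or NumPy arrays."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Customer ZIP 94122 to the Lot's O' Nuts factory (coordinates from the tables above)
d = haversine_miles(37.76113, -122.48433, 32.881893, -111.768036)
print(round(float(d), 1))
```

Because the function accepts arrays, whole coordinate columns can be passed at once, which avoids a row-by-row `apply`. Great-circle distance is only a proxy for actual road or freight distance.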

| Analysis Metric | Calculation | What it tells you |
| --- | --- | --- |
| Shipping Efficiency | Cost / Distance | How much you are paying for every mile traveled. Lower is better. |
| Route Optimization | Actual Factory vs Closest Factory | Are you shipping from the nearest possible location? |
| Profit Density | Gross Profit / Distance | Which routes are actually worth the logistics headache? |
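
The three metrics above can be sketched on a toy frame as follows. The column names `distance_miles` and `actual_factory` are hypothetical stand-ins (the sales table shown so far does not contain them), and the numbers are illustrative.

```python
import pandas as pd

# Hypothetical orders; distance_miles and actual_factory are assumed columns
toy = pd.DataFrame({
    "Cost": [2.6, 2.4],
    "Gross Profit": [4.9, 4.8],
    "distance_miles": [1500.0, 716.0],
    "actual_factory": ["Sugar Shack", "Wicked Choccy's"],
    "closest_factory": ["Lot's O' Nuts", "Wicked Choccy's"],
})

# Shipping efficiency: dollars of cost per mile traveled (lower is better)
toy["cost_per_mile"] = toy["Cost"] / toy["distance_miles"]

# Route optimization: was the order shipped from the nearest factory?
toy["shipped_from_closest"] = toy["actual_factory"] == toy["closest_factory"]

# Profit density: dollars of gross profit per mile traveled (higher is better)
toy["profit_density"] = toy["Gross Profit"] / toy["distance_miles"]

print(toy[["cost_per_mile", "shipped_from_closest", "profit_density"]])
```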

In [55]:
from haversine import haversine, Unit

def calculate_distance(row, factory_lat, factory_lng):
    """Great-circle distance in miles between a customer row and a factory."""
    customer_loc = (row['cust_lat'], row['cust_lng'])
    factory_loc = (factory_lat, factory_lng)
    return haversine(customer_loc, factory_loc, unit=Unit.MILES)

# Drop any existing 'dist_to_*' columns so re-running this cell starts clean
sales_geo = sales_geo.drop(columns=[c for c in sales_geo.columns if c.startswith('dist_to_')])

# Add one distance column per factory, using coordinates from df_candy_factories
for _, factory_row in df_candy_factories.iterrows():
    factory_lat = factory_row['Latitude']
    factory_lng = factory_row['Longitude']

    # Column-safe factory name, e.g. "Lot's O' Nuts" -> "lots_o_nuts"
    safe_name = factory_row['Factory'].lower().replace(' ', '_').replace("'", '')
    sales_geo[f"dist_to_{safe_name}"] = sales_geo.apply(
        lambda cust_row: calculate_distance(cust_row, factory_lat, factory_lng), axis=1
    )

In [57]:
sales_geo.head(1)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Country/Region,City,State/Province,Postal Code,Division,Region,Product ID,Product Name,Sales,Units,Gross Profit,Cost,zip,cust_lat,cust_lng,dist_to_lots_o_nuts,dist_to_wicked_choccys,dist_to_sugar_shack,dist_to_secret_factory,dist_to_the_other_factory
0,282,US-2021-128055-CHO-TRI-54000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-TRI-54000,Wonka Bar - Triple Dazzle Caramel,7.5,2,4.9,2.6,94122,37.76113,-122.48433,691.168202,2358.514641,1500.115375,1708.285735,1807.343


In [61]:
# Distance columns, one per factory; the smallest value in a row marks that order's closest factory
factory_cols = ['dist_to_lots_o_nuts', 'dist_to_wicked_choccys', 'dist_to_sugar_shack', 'dist_to_secret_factory', 'dist_to_the_other_factory']

In [63]:
sales_geo['closest_factory'] = sales_geo[factory_cols].idxmin(axis=1)
sales_geo.head()


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Country/Region,City,State/Province,Postal Code,Division,Region,Product ID,Product Name,Sales,Units,Gross Profit,Cost,zip,cust_lat,cust_lng,dist_to_lots_o_nuts,dist_to_wicked_choccys,dist_to_sugar_shack,dist_to_secret_factory,dist_to_the_other_factory,closest_factory
0,282,US-2021-128055-CHO-TRI-54000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-TRI-54000,Wonka Bar - Triple Dazzle Caramel,7.5,2,4.9,2.6,94122,37.76113,-122.48433,691.168202,2358.514641,1500.115375,1708.285735,1807.343,dist_to_lots_o_nuts
1,288,US-2021-128055-CHO-SCR-58000,2021-03-31,2026-09-26,Standard Class,128055,United States,San Francisco,California,94122,Chocolate,Pacific,CHO-SCR-58000,Wonka Bar -Scrumdiddlyumptious,7.2,2,5.0,2.2,94122,37.76113,-122.48433,691.168202,2358.514641,1500.115375,1708.285735,1807.343,dist_to_lots_o_nuts
2,1132,US-2021-138100-CHO-FUD-51000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-FUD-51000,Wonka Bar - Fudge Mallows,7.2,2,4.8,2.4,10011,40.74173,-74.00037,2140.827453,716.223347,1201.551721,862.615751,951.311969,dist_to_wicked_choccys
3,1133,US-2021-138100-CHO-MIL-31000,2021-09-15,2027-03-13,Standard Class,138100,United States,New York City,New York,10011,Chocolate,Atlantic,CHO-MIL-31000,Wonka Bar - Milk Chocolate,9.75,3,6.33,3.42,10011,40.74173,-74.00037,2140.827453,716.223347,1201.551721,862.615751,951.311969,dist_to_wicked_choccys
4,3396,US-2022-121391-CHO-MIL-31000,2022-10-04,2028-03-29,First Class,121391,United States,San Francisco,California,94109,Chocolate,Pacific,CHO-MIL-31000,Wonka Bar - Milk Chocolate,6.5,2,4.22,2.28,94109,37.79334,-122.42138,689.062587,2354.962139,1496.029939,1704.317592,1803.752682,dist_to_lots_o_nuts


In [69]:
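# Look up each row's distance to its closest factory: factorize the column names, then pick one cell per row
# (this pattern replaces DataFrame.lookup, which was removed in pandas 2.0)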
idx, cols = pd.factorize(sales_geo['closest_factory'])
sales_geo['distance_to_closest'] = sales_geo.reindex(cols, axis=1).to_numpy()[np.arange(len(sales_geo)), idx]
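
To make the lookup pattern easier to follow, here is a minimal sketch on a three-row toy frame. The column names `dist_to_a` and `dist_to_b` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical distances from three orders to two factories
toy = pd.DataFrame({
    "dist_to_a": [10.0, 50.0, 30.0],
    "dist_to_b": [20.0, 5.0, 40.0],
})
dist_cols = ["dist_to_a", "dist_to_b"]

# Name of the column holding the smallest distance in each row
toy["closest"] = toy[dist_cols].idxmin(axis=1)

# factorize returns an integer code per row plus the unique labels the codes index
codes, labels = pd.factorize(toy["closest"])

# Keep only the labelled columns (in label order), then pick one cell per row
values = toy.reindex(labels, axis=1).to_numpy()
toy["distance_to_closest"] = values[np.arange(len(toy)), codes]

# When only the minimum itself is needed, a row-wise min gives the same numbers
same_as_row_min = bool((toy["distance_to_closest"] == toy[dist_cols].min(axis=1)).all())
print(toy)
```

The general lookup is useful when the value to fetch is not the minimum (for example, the distance to an assigned factory); for the closest factory specifically, `min(axis=1)` over the distance columns is a shorter route to the same result.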


In [71]:
# Alternative: the same lookup via positional indexing into the distance columns
distance_to_closest = sales_geo[factory_cols].values[
    np.arange(len(sales_geo)), 
    sales_geo[factory_cols].columns.get_indexer(sales_geo['closest_factory'])
]
distance_to_closest

array([691.16820241, 691.16820241, 716.22334692, ..., 127.57531717,
       127.57531717, 388.02458978], shape=(10194,))

In [72]:
sales_geo['profit_density'] = sales_geo['Gross Profit'] / distance_to_closest

# Display example output
sales_geo[['Gross Profit', 'closest_factory', 'profit_density']].head()

Unnamed: 0,Gross Profit,closest_factory,profit_density
0,4.9,dist_to_lots_o_nuts,0.007089
1,5.0,dist_to_lots_o_nuts,0.007234
2,4.8,dist_to_wicked_choccys,0.006702
3,6.33,dist_to_wicked_choccys,0.008838
4,4.22,dist_to_lots_o_nuts,0.006124


In [74]:
# Calculate the break-even distance
# 1. Set an estimated shipping rate (an assumption, here $0.01 per mile)
shipping_rate_per_mile = 0.01 

# 2. Calculate the maximum distance we can ship before losing money
sales_geo['break_even_distance'] = sales_geo['Gross Profit'] / shipping_rate_per_mile

# 3. Calculate the "distance margin" (how many more miles could we have gone?)
# This is the break-even distance minus the actual distance to the closest factory
sales_geo['distance_margin'] = sales_geo['break_even_distance'] - sales_geo['distance_to_closest']

# 4. Flag profitable routes; False marks "loss" orders, where the distance reaches or exceeds break-even
sales_geo['is_profitable_route'] = sales_geo['distance_margin'] > 0
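
As a worked example of the break-even logic, the sketch below applies it to two orders whose gross profit and distance are taken from rows shown in this notebook (distances rounded). The shipping rate is the same assumed $0.01 per mile, and `net_after_shipping` is an extra, hypothetical column added to express the same test in dollars.

```python
import pandas as pd

SHIPPING_RATE_PER_MILE = 0.01  # assumed rate in dollars per mile

toy = pd.DataFrame({
    "Gross Profit": [4.90, 12.25],
    "distance_to_closest": [691.17, 127.58],
})

# Miles that can be covered before shipping cost equals gross profit
toy["break_even_distance"] = toy["Gross Profit"] / SHIPPING_RATE_PER_MILE
toy["distance_margin"] = toy["break_even_distance"] - toy["distance_to_closest"]
toy["is_profitable_route"] = toy["distance_margin"] > 0

# Same test in dollars: profit left after the assumed shipping charge
toy["net_after_shipping"] = (
    toy["Gross Profit"] - SHIPPING_RATE_PER_MILE * toy["distance_to_closest"]
)
print(toy)
```

The first order can travel about 490 miles before breaking even but sits roughly 691 miles from its closest factory, so it is flagged as unprofitable; the second has a wide margin. The outcome is sensitive to the assumed rate, so it is worth re-running with a few different values.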

In [75]:
sales_geo[['closest_factory','profit_density','Gross Profit','is_profitable_route','break_even_distance']]

Unnamed: 0,closest_factory,profit_density,Gross Profit,is_profitable_route,break_even_distance
0,dist_to_lots_o_nuts,0.007089,4.90,False,490.0
1,dist_to_lots_o_nuts,0.007234,5.00,False,500.0
2,dist_to_wicked_choccys,0.006702,4.80,False,480.0
3,dist_to_wicked_choccys,0.008838,6.33,False,633.0
4,dist_to_lots_o_nuts,0.006124,4.22,False,422.0
...,...,...,...,...,...
10189,dist_to_secret_factory,0.031480,10.00,True,1000.0
10190,dist_to_the_other_factory,0.045689,12.45,True,1245.0
10191,dist_to_wicked_choccys,0.038409,4.90,True,490.0
10192,dist_to_wicked_choccys,0.096022,12.25,True,1225.0


In [80]:
# Count unprofitable orders by closest factory
sales_geo.loc[~sales_geo['is_profitable_route'], 'closest_factory'].value_counts()


closest_factory
dist_to_wicked_choccys       1119
dist_to_lots_o_nuts           926
dist_to_the_other_factory     238
dist_to_secret_factory        156
dist_to_sugar_shack            12
Name: count, dtype: int64
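
Raw counts depend on how many orders each factory is closest to, so a share per factory may be easier to compare. The sketch below shows one possible way to compute it on a toy frame; the factory labels `a` and `b` and the derived `factory` column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical flags for five orders across two factories
toy = pd.DataFrame({
    "closest_factory": ["dist_to_a", "dist_to_a", "dist_to_b", "dist_to_b", "dist_to_b"],
    "is_profitable_route": [False, True, True, True, False],
})

# Strip the column-name prefix so the labels read as factory names
toy["factory"] = toy["closest_factory"].str.replace("dist_to_", "", regex=False)

# The mean of a boolean series is the share of True values
unprofitable_share = (~toy["is_profitable_route"]).groupby(toy["factory"]).mean()
print(unprofitable_share)
```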