# Superstore Sales Exploratory Data Analysis

## Objective
Explore historical Superstore sales data to identify trends, profitability drivers,
shipping performance issues, and customer behavior insights.

This analysis is intended as a portfolio project demonstrating practical,
business-focused data analysis skills.


## Dataset
- Source: Public Superstore Sales dataset
- Data type: Transaction-level retail orders
- Key fields include order date, ship date, ship mode, product category, region, and sales (the extract loaded below has no profit column)

## Key Business Questions
- Which product categories and sub-categories drive the most revenue?
- How does sales performance vary over time?
- Which regions are underperforming in terms of profit?
- Are there high-sales but low-profit products?

## Tools
- Python (pandas, matplotlib, seaborn) for data cleaning and analysis
- Excel for pivot-table summaries
- Tableau for dashboard visualizations

In [1]:
import sys
print("Python executable:", sys.executable)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option("display.max_columns", None)

# Load dataset
df = pd.read_csv("../data/raw/superstore.csv")

# Preview data
df.head()

Python executable: /Users/philg/sales-data-analysis/superstore-sales-analysis/.venv/bin/python


| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 |
| 1 | 2 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 |
| 2 | 3 | CA-2017-138688 | 12/06/2017 | 16/06/2017 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | California | 90036.0 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.62 |
| 3 | 4 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 |
| 4 | 5 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.368 |


In [2]:
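# df was already loaded in In [1]; no need to import pandas or re-read the CSV.

# Column dtypes and non-null counts (info() prints its report directly)
df.info()

# Summary statistics for the numeric columns (displayed as the cell result)
df.describe()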



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float64(2), int64(1), object(15)

| | Row ID | Postal Code | Sales |
|---|---|---|---|
| count | 9800.0 | 9789.0 | 9800.0 |
| mean | 4900.5 | 55273.322403 | 230.769059 |
| std | 2829.160653 | 32041.223413 | 626.651875 |
| min | 1.0 | 1040.0 | 0.444 |
| 25% | 2450.75 | 23223.0 | 17.248 |
| 50% | 4900.5 | 58103.0 | 54.49 |
| 75% | 7350.25 | 90008.0 | 210.605 |
| max | 9800.0 | 99301.0 | 22638.48 |


## Initial Observations

- The dataset contains 9,800 rows, one per order line item (a single Order ID can span several rows).
- Sales values are strongly right-skewed: the maximum (22,638.48) is more than 100 times the 75th percentile (210.61).
- The median sale per row (54.49) is well below the mean (230.77), so a small number of very large line items pull the mean up.
- 11 rows (9,800 - 9,789) have missing postal codes, which may affect geographic analysis.
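
The skew and missing-value observations can be checked with a few pandas calls. The following is a minimal sketch, not part of the original notebook: `sales` and `postal` are small made-up stand-ins for `df["Sales"]` and `df["Postal Code"]`, and the 10% cutoff for "top rows" is an arbitrary illustrative choice.

```python
import pandas as pd

# Illustrative stand-in for df["Sales"]; values are made up, not from the dataset.
sales = pd.Series([12.0, 15.5, 20.0, 48.0, 55.0, 60.0, 210.0, 2500.0])

mean_sales = sales.mean()
median_sales = sales.median()
skewness = sales.skew()  # sample skewness; a value > 0 indicates a long right tail

# Share of total sales contributed by the top 10% of rows (at least one row)
top_n = max(1, int(len(sales) * 0.10))
top_share = sales.nlargest(top_n).sum() / sales.sum()

# Illustrative stand-in for df["Postal Code"] with one missing value
postal = pd.Series([42420.0, None, 90036.0, 33311.0])
missing_postal = int(postal.isna().sum())

print(f"mean={mean_sales:.2f} median={median_sales:.2f} skew={skewness:.2f}")
print(f"top {top_n} row(s) hold {top_share:.1%} of sales; missing postal codes: {missing_postal}")
```

On the real data, replacing the two stand-in series with the corresponding `df` columns should be enough.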


In [3]:
df[['Order Date', 'Ship Date']].head()
df[['Order Date', 'Ship Date']].dtypes


Order Date    object
Ship Date     object
dtype: object

In [4]:
# Convert date columns to datetime; values that fail to parse become NaT.
# Caveat: no explicit format or dayfirst flag is given, so day-first strings
# such as 16/06/2017 may be coerced to NaT or have day and month swapped.
# (The deprecated infer_datetime_format argument has been dropped.)
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], errors='coerce')
# Show parsed dtypes (detailed KPI derivation follows in later cells)
df[['Order Date', 'Ship Date']].dtypes


Order Date    datetime64[ns]
Ship Date     datetime64[ns]
dtype: object


## Shipping KPI derivation and analysis

This section derives `Shipping Delay (Days)` as a core operational KPI, handles mixed/invalid dates, and computes descriptive and comparative statistics by `Ship Mode`, `Category`, and `Region`.

In [7]:
# Robust date parsing and KPI derivation
import pandas as pd
from dateutil import parser

def safe_parse_dates(series):
    """Parse a column to datetime, falling back to dateutil for values pandas rejects."""
    # Fast path: vectorised pandas parsing; unparseable values become NaT
    s = pd.to_datetime(series, errors='coerce')
    mask = s.isna()
    if mask.any():
        def _try_parse(x):
            try:
                # dateutil reads ambiguous dates month-first unless dayfirst=True is passed
                return parser.parse(str(x))
            except Exception:
                return pd.NaT
        # Slow path: retry only the values the fast path could not parse
        s.loc[mask] = series[mask].apply(_try_parse)
    return pd.to_datetime(s, errors='coerce')

# Apply safe parsing to the order and ship date columns.
# These columns were already converted in In [4], so NaT values from that step carry over.
df['Order Date Parsed'] = safe_parse_dates(df['Order Date'])
df['Ship Date Parsed'] = safe_parse_dates(df['Ship Date'])

# Create flags for parsing issues
df['Order Date Invalid'] = df['Order Date Parsed'].isna()
df['Ship Date Invalid'] = df['Ship Date Parsed'].isna()

# Derived KPI: Shipping Delay in days (can be negative if data issue)
df['Shipping Delay (Days)'] = (df['Ship Date Parsed'] - df['Order Date Parsed']).dt.days

# Basic quality summary
quality = {
    'total_records': len(df),
    'order_date_invalid': int(df['Order Date Invalid'].sum()),
    'ship_date_invalid': int(df['Ship Date Invalid'].sum()),
    'both_dates_valid': int((~df['Order Date Invalid'] & ~df['Ship Date Invalid']).sum())
}
quality


{'total_records': 9800,
 'order_date_invalid': 5841,
 'ship_date_invalid': 5985,
 'both_dates_valid': 2676}
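
The invalid counts above are consistent with a day/month mix-up. Preview rows such as 12/06/2017 (order) and 16/06/2017 (ship) only make sense as DD/MM/YYYY, and any value whose day is greater than 12 cannot be parsed month-first. The following is a minimal sketch of parsing with an explicit format. It assumes the raw columns are uniformly DD/MM/YYYY, which is inferred from the five preview rows and not verified for the whole file. The `raw` frame and the `DATE_FORMAT` name are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for the raw string columns, using values from the preview rows.
raw = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2017", "11/10/2016"],
    "Ship Date": ["11/11/2017", "16/06/2017", "18/10/2016"],
})

DATE_FORMAT = "%d/%m/%Y"  # assumption: every value is day-first

# Parse each column with the explicit format; non-matching values become NaT.
parsed = pd.DataFrame({
    col: pd.to_datetime(raw[col], format=DATE_FORMAT, errors="coerce")
    for col in ["Order Date", "Ship Date"]
})

# Count values that did not match the format.
invalid_counts = parsed.isna().sum()

# Shipping delay in whole days; negative values would point to data-entry issues.
delay_days = (parsed["Ship Date"] - parsed["Order Date"]).dt.days

print(invalid_counts.to_dict())
print(delay_days.tolist())  # [3, 4, 7]
```

To try this on the real data, the CSV would need to be re-read first, because In [4] overwrites the original string columns.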

In [8]:
# Descriptive statistics for Shipping Delay (describe() ignores NaN delays)
delay = df['Shipping Delay (Days)']
desc = delay.describe(percentiles=[0.25, 0.5, 0.75])
overall_stats = {
    'count': int(desc['count']),
    'mean': float(desc['mean']),
    'median': float(desc['50%']),
    'std': float(desc['std']),
    'min': float(desc['min']),
    'max': float(desc['max']),
    # Interquartile range: 75th minus 25th percentile
    'IQR_days': float(desc['75%'] - desc['25%']),
}
overall_stats
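
The overall statistics can also be broken down by `Ship Mode`. The following is a minimal sketch, not the analysis script itself: the `toy` frame holds made-up values, and the helper names (`iqr`, `neg_rate`, `long_rate_gt7d`) are illustrative. The 7-day cutoff is an assumption chosen to match the `long_rate_gt7d` label in the summary output.

```python
import pandas as pd

# Toy stand-in for df[["Ship Mode", "Shipping Delay (Days)"]]; values are made up.
toy = pd.DataFrame({
    "Ship Mode": ["Same Day", "Same Day", "First Class", "First Class",
                  "Standard Class", "Standard Class", "Standard Class"],
    "Shipping Delay (Days)": [0, 0, 2, 3, 5, -4, 9],
})

def iqr(s: pd.Series) -> float:
    """Interquartile range: 75th minus 25th percentile."""
    return float(s.quantile(0.75) - s.quantile(0.25))

def neg_rate(s: pd.Series) -> float:
    """Share of rows where the ship date precedes the order date."""
    return float((s < 0).mean())

def long_rate_gt7d(s: pd.Series) -> float:
    """Share of rows that took more than 7 days to ship."""
    return float((s > 7).mean())

# One row per ship mode; custom functions become columns named after the function.
by_mode = (
    toy.groupby("Ship Mode")["Shipping Delay (Days)"]
       .agg(["count", "median", iqr, neg_rate, long_rate_gt7d])
)
print(by_mode)
```

The same `agg` call should work with `Category` or `Region` as the grouping key.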

## Generated Visuals

Below are the charts generated by the shipping analysis script.

![Histogram of Shipping Delay](../visuals/shipping_delay_hist.png)

![Boxplot: Shipping Delay by Ship Mode](../visuals/shipping_delay_boxplot.png)

![Sales vs Shipping Delay](../visuals/sales_vs_delay.png)
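
A rank correlation puts a number on what the Sales vs Shipping Delay scatter shows. The following is a minimal sketch that computes Spearman's rho as the Pearson correlation of the ranks, so only pandas is needed. It does not produce a p-value (`scipy.stats.spearmanr` is one option for that), and the `toy` values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df[["Sales", "Shipping Delay (Days)"]]; values are made up.
toy = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Shipping Delay (Days)": [5, 3, 4, 1, 2],
})

# Keep only rows where both values are present.
pairs = toy[["Sales", "Shipping Delay (Days)"]].dropna()

# Spearman's rho equals the Pearson correlation of the ranked values.
rho = pairs["Sales"].rank().corr(pairs["Shipping Delay (Days)"].rank())
print(round(rho, 3))  # -0.8
```

A rho close to zero would mean order value and shipping delay have no consistent monotonic relationship.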

In [9]:
# Print concise summary generated by the analysis script
with open('../visuals/summary.txt') as f:  # context manager closes the file handle
    print(f.read())

Concise Shipping Analysis Summary
--------------------------------
Total records: 9800
Order date invalid: 0
Ship date invalid: 0

Overall shipping delay (days):
- count: 9800
- mean: 9.22265306122449
- median: 4.0
- std: 95.4475432406012
- min: -321.0
- max: 214.0
- iqr_days: 59.0

Ship Mode summary (median, IQR, neg_rate, long_rate_gt7d):
- Same Day: median=0.0, IQR=0.0, neg_rate=0.004, long_rate_gt7d=0.030
- First Class: median=3.0, IQR=57.0, neg_rate=0.105, long_rate_gt7d=0.360
- Second Class: median=4.0, IQR=59.0, neg_rate=0.160, long_rate_gt7d=0.309
- Standard Class: median=5.0, IQR=91.0, neg_rate=0.208, long_rate_gt7d=0.308

Suspicious records flagged: 1684 (threshold > 290.3 days)

Spearman correlation (Sales vs Delay):
- correlation: 0.00036495685987423474, p-value: 0.971183263499668

Recommended next steps:
- Investigate negative-delay records and fix date-entry or ETL issues
- Review high-variability ship modes and long-tail delays (>7 days)
- Consider prioritizing high-sale