# EDA

This notebook cleans the Chocolate Sales dataset and creates two static visuals to support our dashboard user story (sales trends over time by country).

## Why we do these checks

Before making plots, we do quick data checks that keep the dashboard stable: we confirm data types, check missing values and duplicates, and make sure key fields like Date and Amount are usable. We also create simple Year/Month fields because most dashboard views (monthly trends and filtering) depend on them.
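As a rough illustration of the checks described above, the sketch below bundles them into one small report. The helper name `summarize_checks` and the toy rows are our own assumptions for illustration; they are not part of the dataset or of the cells that follow.

```python
import pandas as pd


def summarize_checks(frame: pd.DataFrame, required: list[str]) -> dict:
    """Hypothetical helper: collect the basic checks into one dictionary."""
    return {
        # Key fields the dashboard depends on but that are absent
        "missing_columns": [c for c in required if c not in frame.columns],
        # Total count of missing cells across all columns
        "n_missing_values": int(frame.isna().sum().sum()),
        # Rows that are exact copies of an earlier row
        "n_duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy rows, made up for illustration (one duplicate row, one missing Date)
sample = pd.DataFrame(
    {
        "Date": ["04/01/2022", "04/01/2022", None],
        "Amount": ["$5,320.00", "$5,320.00", "$100.00"],
    }
)

report = summarize_checks(sample, required=["Date", "Amount", "Country"])
print(report)
```

On the real data the same three numbers come from `df_raw.info()`, `df_raw.isna().sum()` and `df_raw.duplicated().sum()`; a helper like this is only a convenience if the checks need to be re-run after each cleaning step.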

In [154]:
import pandas as pd
import altair as alt

DATA_PATH = "../data/raw/chocolate-sales.csv"
df_raw = pd.read_csv(DATA_PATH)
df_raw.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped
0,Jehu Rudeforth,UK,Mint Chip Choco,04/01/2022,"$5,320.00",180
1,Van Tuxwell,India,85% Dark Bars,01/08/2022,"$7,896.00",94
2,Gigi Bohling,India,Peanut Butter Cubes,07/07/2022,"$4,501.00",91
3,Jan Morforth,Australia,Peanut Butter Cubes,27/04/2022,"$12,726.00",342
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24/02/2022,"$13,685.00",184


In [155]:
display(df_raw.shape)

(3282, 6)

In [156]:
df_raw.info()
display(df_raw.isna().sum())
display(df_raw.duplicated().sum())

<class 'pandas.DataFrame'>
RangeIndex: 3282 entries, 0 to 3281
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Sales Person   3282 non-null   str  
 1   Country        3282 non-null   str  
 2   Product        3282 non-null   str  
 3   Date           3282 non-null   str  
 4   Amount         3282 non-null   str  
 5   Boxes Shipped  3282 non-null   int64
dtypes: int64(1), str(5)
memory usage: 154.0 KB


Sales Person     0
Country          0
Product          0
Date             0
Amount           0
Boxes Shipped    0
dtype: int64

np.int64(0)

In [157]:
# Number of unique values per column
display(df_raw.nunique())

# Country categories
display(sorted(df_raw["Country"].unique()))

Sales Person       25
Country             6
Product            22
Date              504
Amount           3013
Boxes Shipped     507
dtype: int64

['Australia', 'Canada', 'India', 'New Zealand', 'UK', 'USA']

In [158]:
df = df_raw.copy()

In [159]:
# Parse Date so we can do time trends
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y", errors="coerce")
df["Date"].isna().sum()

np.int64(0)
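A zero count of unparsed dates is only meaningful if the format string matches how the file writes dates. The sketch below uses three date strings from the preview rows above to show why day-first (`%d/%m/%Y`) is the right reading: a month-first format cannot parse `27/04/2022`, and it would silently swap day and month for the other two. The variable names are illustrative.

```python
import pandas as pd

# Three date strings as they appear in the raw preview
raw = pd.Series(["04/01/2022", "01/08/2022", "27/04/2022"])

# Day-first, as used in this notebook
day_first = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Month-first, shown only for comparison
month_first = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(day_first.dt.month.tolist())   # [1, 8, 4]: January, August, April
print(int(month_first.isna().sum())) # 1: "27/04/2022" has no valid month 27
```

Note that the first two strings parse under both formats but land in different months, so the NaT count alone would not catch a wrong format in a file where no day exceeds 12.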

In [160]:
# Convert Amount to numeric for sums/plots
df["Amount"] = (
    df["Amount"].astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)
df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")
df["Amount"].isna().sum()

np.int64(0)

In [161]:
# Add Year/YearMonth for filters and time-series grouping
df["Year"] = df["Date"].dt.year
df["YearMonth_period"] = df["Date"].dt.to_period("M")        # true time type
df["YearMonth"] = df["YearMonth_period"].astype(str)         # label like 2022-01
df["MonthName"] = df["Date"].dt.strftime("%b")               # "Jan", "Feb", ...

df[["Date", "Year", "YearMonth", "MonthName"]].head()

Unnamed: 0,Date,Year,YearMonth,MonthName
0,2022-01-04,2022,2022-01,Jan
1,2022-08-01,2022,2022-08,Aug
2,2022-07-07,2022,2022-07,Jul
3,2022-04-27,2022,2022-04,Apr
4,2022-02-24,2022,2022-02,Feb


In [162]:
# Check which months appear in the data
df["MonthNum"] = df["Date"].dt.month
display(sorted(df["MonthNum"].dropna().unique()))

[np.int32(1),
 np.int32(2),
 np.int32(3),
 np.int32(4),
 np.int32(5),
 np.int32(6),
 np.int32(7),
 np.int32(8)]
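The list above shows which months appear anywhere in the data, but a year-over-year comparison also needs each year to cover the same months. A minimal sketch of that per-year check, using made-up dates rather than the real file, could look like this:

```python
import pandas as pd

# Toy dates, made up for illustration: two years, January and February each
dates = pd.to_datetime(
    pd.Series(["2023-01-15", "2023-02-10", "2024-01-20", "2024-02-05"])
)
frame = pd.DataFrame({"year": dates.dt.year, "month": dates.dt.month})

# Sorted list of months observed in each year
months_by_year = {
    int(year): sorted(int(m) for m in grp["month"].unique())
    for year, grp in frame.groupby("year")
}

# True when every year covers exactly the same set of months
same_window = len({tuple(v) for v in months_by_year.values()}) == 1

print(months_by_year)
print(same_window)
```

Applied to `df`, the same grouping on `Year` and `MonthNum` would confirm whether each year really spans months 1 to 8.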

In [163]:
from pathlib import Path

out_dir = Path("..") / "data" / "processed"
out_dir.mkdir(parents=True, exist_ok=True)

df_export = df.copy()
df_export.columns = (
    df_export.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
)

out_path = out_dir / "chocolate_sales_clean.csv"
# Export is disabled here; uncomment the next line to write the cleaned file to out_path
#df_export.to_csv(out_path, index=False)

out_path

PosixPath('../data/processed/chocolate_sales_clean.csv')

## EDA for User Story: Sales trends over time by country

**User Story:** As a sales manager, I want to view sales trends over time by country so I can identify which markets are growing or declining.

### Visual 1: Quarterly sales trend by country

In [164]:
# Aggregate to quarterly totals (less noisy than monthly, easier to compare trends)
df["Quarter_start"] = df["Date"].dt.to_period("Q").dt.start_time
df["Quarter_label"] = df["Date"].dt.to_period("Q").astype(str)  # e.g., 2022Q1

quarterly = (
    df.groupby(["Quarter_start", "Quarter_label", "Country"], as_index=False)
      .agg(total_amount=("Amount", "sum"))
      .sort_values(["Quarter_start", "Country"])
)

# Quarterly sales trend by country
chart_q = (
    alt.Chart(quarterly)
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "Quarter_label:N",
            sort=alt.EncodingSortField(field="Quarter_start", order="ascending"),
            axis=alt.Axis(title="Quarter", labelAngle=-45)
        ),
        y=alt.Y(
            "total_amount:Q",
            axis=alt.Axis(title="Total sales ($)", format=",.0f")
        ),
        color=alt.Color("Country:N", legend=alt.Legend(title="Country"))
    )
    .properties(title="Quarterly sales trend (Amount) by country", width=800, height=350)
)

chart_q

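We aggregated sales to the quarterly level to reduce month-to-month noise and make trends easier to compare across countries. This chart helps a sales manager quickly see which countries are generally rising or falling over time, which supports decisions about which markets to focus on. Because the data only covers January through August, each Q3 point sums at most two months (July and August) rather than three, and no Q4 points appear, so Q3 totals should not be read against Q1 and Q2 as a like-for-like comparison.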
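One thing worth checking with quarterly buckets is whether every quarter actually contains three months of data. Since the month check earlier shows only months 1 to 8, Q3 can hold July and August at most. The sketch below, on made-up dates, counts distinct months per quarter and flags the incomplete ones; the column names are illustrative.

```python
import pandas as pd

# Toy dates, made up for illustration: one per month from January to August
dates = pd.to_datetime(
    pd.Series(
        [
            "2022-01-04", "2022-02-24", "2022-03-10", "2022-04-27",
            "2022-05-02", "2022-06-30", "2022-07-07", "2022-08-01",
        ]
    )
)
frame = pd.DataFrame({"Date": dates})
frame["Quarter"] = frame["Date"].dt.to_period("Q").astype(str)    # e.g. 2022Q1
frame["YearMonth"] = frame["Date"].dt.to_period("M").astype(str)  # e.g. 2022-01

# Number of distinct months that contribute to each quarter
months_per_quarter = frame.groupby("Quarter")["YearMonth"].nunique()

# Quarters with fewer than three months of data
partial = months_per_quarter[months_per_quarter < 3].index.tolist()
print(partial)  # ['2022Q3']
```

If partial quarters turn out to matter for the dashboard, one option is to plot an average per month within each quarter instead of the raw total.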

### Visual 2: Year-over-year sales growth by country (Jan–Aug 2024 vs Jan–Aug 2023)

In [165]:
# Compare Jan–Aug 2024 vs Jan–Aug 2023 (dataset covers Jan–Aug, so we use the same window for a fair YoY comparison)
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

yoy_totals = (
    df[df["Year"].isin([2023, 2024]) & df["Month"].between(1, 8)]
    .groupby(["Country", "Year"], as_index=False)
    .agg(total_amount=("Amount", "sum"))
)

wide = yoy_totals.pivot(index="Country", columns="Year", values="total_amount").reset_index()

# Altair requires column names to be strings (pivot creates 2023/2024 as int column names)
wide = wide.rename(columns={2023: "2023", 2024: "2024"})

wide["pct_change"] = (wide["2024"] - wide["2023"]) / wide["2023"] * 100

growth_chart = (
    alt.Chart(wide)
    .mark_bar()
    .encode(
        y=alt.Y("Country:N", sort="-x", title="Country"),
        x=alt.X("pct_change:Q", title="Percent change in sales (%), Jan–Aug 2024 vs Jan–Aug 2023"),
        tooltip=[
            "Country:N",
            alt.Tooltip("2023:Q", title="2023 total", format=",.0f"),
            alt.Tooltip("2024:Q", title="2024 total", format=",.0f"),
            alt.Tooltip("pct_change:Q", title="% change", format=".2f"),
        ],
    )
    .properties(
        title="Market growth/decline by country (Jan–Aug 2024 vs Jan–Aug 2023)",
        width=800,
        height=250
    )
)

growth_chart

This bar chart ranks countries by the year-over-year percent change in total sales (Jan–Aug 2024 vs Jan–Aug 2023). It helps the sales manager quickly identify which markets are growing fastest (e.g., India) and which are growing more slowly (e.g., New Zealand), which supports prioritizing where to focus sales efforts.
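To make the percent-change calculation concrete, here is a small worked example with made-up totals for two placeholder countries, "A" and "B". The numbers are not from the dataset; the steps mirror the pivot and formula used for the chart.

```python
import pandas as pd

# Made-up totals for two placeholder countries (not real data)
totals = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "Year": [2023, 2024, 2023, 2024],
        "total_amount": [100.0, 125.0, 200.0, 180.0],
    }
)

# One row per country, one column per year
wide = totals.pivot(index="Country", columns="Year", values="total_amount")

# Percent change = (current - previous) / previous * 100
wide["pct_change"] = (wide[2024] - wide[2023]) / wide[2023] * 100

# Rank from fastest growth to steepest decline, as the bar chart does
ranked = wide.sort_values("pct_change", ascending=False)
print(ranked)
```

Here "A" rises from 100 to 125 (about +25%) and "B" falls from 200 to 180 (about -10%), so "A" ranks first. A country with a zero total in the earlier year would produce a division by zero and would need separate handling.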