# EDA

This notebook cleans the Chocolate Sales dataset and creates 1–2 static visuals to support our dashboard user story (sales trends over time by country).

## Why we do these checks

Before making plots, we do quick data checks that keep the dashboard stable: we confirm data types, check missing values and duplicates, and make sure key fields like Date and Amount are usable. We also create simple Year/Month fields because most dashboard views (monthly trends and filtering) depend on them.

In [79]:
import pandas as pd

DATA_PATH = "../data/raw/chocolate-sales.csv"
df_raw = pd.read_csv(DATA_PATH)
df_raw.shape

(3282, 6)

In [80]:
df_raw.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped
0,Jehu Rudeforth,UK,Mint Chip Choco,04/01/2022,"$5,320.00",180
1,Van Tuxwell,India,85% Dark Bars,01/08/2022,"$7,896.00",94
2,Gigi Bohling,India,Peanut Butter Cubes,07/07/2022,"$4,501.00",91
3,Jan Morforth,Australia,Peanut Butter Cubes,27/04/2022,"$12,726.00",342
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24/02/2022,"$13,685.00",184


In [81]:
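df_raw.info()               # prints dtypes and non-null counts per column
df_raw.isna().sum()         # missing values per column (not echoed: only the last expression in a cell is displayed)
df_raw.duplicated().sum()   # number of fully duplicated rows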

<class 'pandas.DataFrame'>
RangeIndex: 3282 entries, 0 to 3281
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Sales Person   3282 non-null   str  
 1   Country        3282 non-null   str  
 2   Product        3282 non-null   str  
 3   Date           3282 non-null   str  
 4   Amount         3282 non-null   str  
 5   Boxes Shipped  3282 non-null   int64
dtypes: int64(1), str(5)
memory usage: 154.0 KB


np.int64(0)
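
Only the last expression of a cell is echoed, so the per-column missing-value counts are not visible in the output above (the non-null counts from `info()` do confirm there are none). One way to keep every check visible is to collect them in a single summary. This is a minimal sketch on made-up rows, and `quality_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame) -> dict:
    """Collect basic data-quality counts so that none is silently dropped."""
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy data for illustration only (not rows from the real dataset).
toy = pd.DataFrame({
    "Country": ["UK", "India", "UK", None],
    "Boxes Shipped": [180, 94, 180, 91],
})

summary = quality_summary(toy)
print(summary)
```

Calling `quality_summary(df_raw)` would report the same three counts for the real data in one place.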

In [82]:
df = df_raw.copy()

In [83]:
# Parse Date (day/month/year) so we can do monthly trends;
# errors="coerce" turns unparseable values into NaT, which we count below
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y", errors="coerce")
df["Date"].isna().sum()

np.int64(0)
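
The explicit `format="%d/%m/%Y"` matters because several dates in this file are ambiguous. The sketch below shows that `01/08/2022` is 1 August when read day-first but 8 January when read month-first, and that `errors="coerce"` turns an impossible date into `NaT` instead of raising. The first two strings appear in the sample rows above; `31/02/2022` is an invented bad value added for illustration.

```python
import pandas as pd

raw_dates = pd.Series(["01/08/2022", "27/04/2022", "31/02/2022"])

# Day-first parsing, as used in this notebook; invalid dates become NaT.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")
n_failed = int(parsed.isna().sum())

# The same string read month-first lands in a different month.
month_first = pd.to_datetime("01/08/2022", format="%m/%d/%Y")

print(parsed)
print("failed to parse:", n_failed)
print("month-first reading:", month_first.date())
```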

In [84]:
# Convert Amount to numeric for sums/plots
df["Amount"] = (
    df["Amount"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
df[["Amount"]].head()

Unnamed: 0,Amount
0,5320.0
1,7896.0
2,4501.0
3,12726.0
4,13685.0
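
`astype(float)` raises a `ValueError` if any cleaned string is not a valid number. That works here because every value converts cleanly. If a future version of the file contained stray text, a more forgiving variant could use `pd.to_numeric(..., errors="coerce")` and count the failures, mirroring what we do for `Date`. This is a sketch on a toy Series; the `"n/a"` entry is invented to show the failure path.

```python
import pandas as pd

raw_amounts = pd.Series(["$5,320.00", "$7,896.00", "n/a"])

# Same string cleanup as in the cell above.
cleaned = (
    raw_amounts
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Values that cannot be converted become NaN instead of raising.
amounts = pd.to_numeric(cleaned, errors="coerce")
n_bad = int(amounts.isna().sum())

print(amounts)
print("unparseable amounts:", n_bad)
```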


In [85]:
# Add Year/YearMonth for filters and time-series grouping
df["Year"] = df["Date"].dt.year
df["YearMonth_period"] = df["Date"].dt.to_period("M")        # true time type
df["YearMonth"] = df["YearMonth_period"].astype(str)         # label like 2022-01
df["MonthName"] = df["Date"].dt.strftime("%b")               # "Jan", "Feb", ...

df[["Date", "Year", "YearMonth", "MonthName"]].head()

Unnamed: 0,Date,Year,YearMonth,MonthName
0,2022-01-04,2022,2022-01,Jan
1,2022-08-01,2022,2022-08,Aug
2,2022-07-07,2022,2022-07,Jul
3,2022-04-27,2022,2022-04,Apr
4,2022-02-24,2022,2022-02,Feb
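
One caveat with `MonthName`: it is a plain string, so it sorts alphabetically (Apr, Aug, ...) rather than by calendar order. If a dashboard view ever sorts or groups by month name, an ordered categorical is one option. The sketch below uses three dates from the sample rows above and builds the month order from the standard-library `calendar` module. This is an illustration, not a step the notebook currently performs.

```python
import calendar

import pandas as pd

dates = pd.to_datetime(pd.Series(["2022-08-01", "2022-01-04", "2022-04-27"]))
month_name = dates.dt.strftime("%b")

# Plain strings sort alphabetically, which scrambles the calendar order.
alphabetical = sorted(month_name)

# An ordered categorical sorts by calendar position instead.
month_order = list(calendar.month_abbr)[1:]  # index 0 is an empty string
month_cat = pd.Categorical(month_name, categories=month_order, ordered=True)
chronological = list(pd.Series(month_cat).sort_values())

print("alphabetical: ", alphabetical)
print("chronological:", chronological)
```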


In [86]:
from pathlib import Path

out_dir = Path("..") / "data" / "processed"
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "chocolate_sales_clean.csv"
df.to_csv(out_path, index=False)

out_path

PosixPath('../data/processed/chocolate_sales_clean.csv')
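
CSV does not store dtypes, so whatever loads `chocolate_sales_clean.csv` will get `Date` back as text unless it asks for dates explicitly. The sketch below demonstrates the round trip in memory with `io.StringIO` and two rows taken from the sample output above. The dashboard's actual loading code is not shown in this notebook, so treat this as an assumption about how it might read the file.

```python
import io

import pandas as pd

clean = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-08-01"]),
    "Amount": [5320.0, 7896.0],
})

# Write to an in-memory buffer instead of disk for the demonstration.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

buffer.seek(0)
naive = pd.read_csv(buffer)                        # Date comes back as text

buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["Date"])  # Date restored as datetime

print(naive.dtypes)
print(typed.dtypes)
```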

## EDA for User Story 1: Sales trends over time by country

In [87]:
# Compare sales trends over time by country (Amount)
monthly_country = (
    df.groupby(["YearMonth_period", "Country"], as_index=False)
      .agg(total_amount=("Amount", "sum"))
      .sort_values(["YearMonth_period", "Country"])
)

monthly_country["YearMonth"] = monthly_country["YearMonth_period"].astype(str)
monthly_country.head()

Unnamed: 0,YearMonth_period,Country,total_amount,YearMonth
0,2022-01,Australia,187383.0,2022-01
1,2022-01,Canada,143997.0,2022-01
2,2022-01,India,143430.0,2022-01
3,2022-01,New Zealand,124488.0,2022-01
4,2022-01,UK,188531.0,2022-01
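
`groupby` only returns month/country pairs that actually occur, so a country with no sales in some month would simply be absent, and a line chart would draw straight across the gap. A quick way to check is to pivot to a wide table and count empty cells. The toy frame below is invented (India is deliberately missing February) and does not reflect the real data.

```python
import pandas as pd

toy_monthly = pd.DataFrame({
    "YearMonth": ["2022-01", "2022-01", "2022-02"],
    "Country": ["UK", "India", "UK"],
    "total_amount": [100.0, 50.0, 70.0],
})

# One row per month, one column per country; absent pairs show up as NaN.
wide = toy_monthly.pivot(index="YearMonth", columns="Country", values="total_amount")
missing_cells = int(wide.isna().sum().sum())

print(wide)
print("missing month/country pairs:", missing_cells)
```

Applied to `monthly_country`, a result of zero would confirm that every country has a value for every month.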


In [88]:
# Overall total sales by country (summary table)
country_totals = (
    df.groupby("Country", as_index=False)
      .agg(total_amount=("Amount", "sum"))
      .sort_values("total_amount", ascending=False)
)

country_totals

Unnamed: 0,Country,total_amount
0,Australia,3646444.35
4,UK,3365388.9
2,India,3343730.83
5,USA,3313858.09
1,Canada,3078495.65
3,New Zealand,3043654.04
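
Absolute totals are hard to compare at a glance, so it can help to express each country as a share of overall sales. The sketch below re-enters the totals from the table above by hand so that it runs on its own. In the notebook the same calculation could be applied to `country_totals` directly; the `share` column is our addition.

```python
import pandas as pd

# Totals copied from the summary table above.
totals = pd.DataFrame({
    "Country": ["Australia", "UK", "India", "USA", "Canada", "New Zealand"],
    "total_amount": [
        3646444.35, 3365388.90, 3343730.83,
        3313858.09, 3078495.65, 3043654.04,
    ],
})

# Each country's fraction of the overall total.
totals["share"] = totals["total_amount"] / totals["total_amount"].sum()
top_country = totals.loc[totals["share"].idxmax(), "Country"]

print(totals.round({"share": 3}))
print("largest share:", top_country)
```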


### Visual 1: Quarterly sales trend by country

In [89]:
import altair as alt

# Aggregate to quarterly totals (less noisy than monthly, easier to compare trends)
df["Quarter_start"] = df["Date"].dt.to_period("Q").dt.start_time
df["Quarter_label"] = df["Date"].dt.to_period("Q").astype(str)  # e.g., 2022Q1

quarterly = (
    df.groupby(["Quarter_start", "Quarter_label", "Country"], as_index=False)
      .agg(total_amount=("Amount", "sum"))
      .sort_values(["Quarter_start", "Country"])
)

# Visual 1: Quarterly sales trend by country
chart_q = (
    alt.Chart(quarterly)
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "Quarter_label:N",
            sort=alt.SortField(field="Quarter_start", order="ascending"),
            axis=alt.Axis(title="Quarter", labelAngle=-45)
        ),
        y=alt.Y(
            "total_amount:Q",
            axis=alt.Axis(title="Total sales ($)", format=",.0f")
        ),
        color=alt.Color("Country:N", legend=alt.Legend(title="Country"))
    )
    .properties(title="Quarterly sales trend (Amount) by country", width=800, height=350)
)

chart_q

We aggregated sales to the quarterly level to reduce month-to-month noise and make trends easier to compare across countries. This chart helps a sales manager quickly see which countries are generally rising or falling over time, which supports decisions about which markets to focus on.
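
To put a number on "rising or falling", one option is the quarter-over-quarter percentage change per country. The sketch below uses invented totals to show the mechanics. The column name `qoq_change` is hypothetical, and the same steps could be applied to the `quarterly` frame built above.

```python
import pandas as pd

# Invented quarterly totals for two countries.
toy_quarterly = pd.DataFrame({
    "Quarter_label": ["2022Q1", "2022Q2", "2022Q1", "2022Q2"],
    "Country": ["UK", "UK", "India", "India"],
    "total_amount": [100.0, 120.0, 200.0, 150.0],
})

# Sort so that each country's quarters are in time order before differencing.
toy_quarterly = toy_quarterly.sort_values(["Country", "Quarter_label"])

# Fractional change versus the previous quarter, computed within each country.
toy_quarterly["qoq_change"] = (
    toy_quarterly.groupby("Country")["total_amount"].pct_change()
)

print(toy_quarterly)
```

The first quarter of each country has no previous value, so its change is `NaN`.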