# Problem statement — 4-hour hands-on lab (single file / dataset)

> You are a data engineer working for a small retail chain. You have a CSV file, sales_sample.csv, containing daily sales transactions from multiple stores.

Import required modules

In [2]:
import pandas as pd

## SECTION 1 — Create DataFrames

Create a pandas DataFrame from the CSV and from Python lists/dicts.

In [3]:
dataframe = pd.read_csv("sample.csv")
print("Dataframe Shape:", dataframe.shape)
dataframe.head()

Dataframe Shape: (10, 11)


,transaction_id,store_id,store_name,date,product_id,product_name,quantity,unit_price,total,customer_id,promo_code
0,1001,1,Downtown,2025-10-01,501,Notebook,2,3.5,7.0,2001.0,
1,1002,1,Downtown,2025-10-01,502,Pen,5,0.8,4.0,2002.0,DISC10
2,1003,2,Suburb,2025-10-02,501,Notebook,1,3.5,3.5,2003.0,
3,1004,2,Suburb,2025-10-02,503,Stapler,1,5.25,5.25,,
4,1005,1,Downtown,2025-10-03,504,Folder,3,1.2,3.6,2004.0,DISC10
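
The task for this section also mentions building DataFrames from Python lists and dicts, which the cell above does not show. A minimal sketch follows, assuming a reduced set of columns; the values are copied from the sample rows, and the variable names (`from_dict`, `from_records`, `from_rows`, `combined`) are illustrative only.

```python
import pandas as pd

columns = ["transaction_id", "store_name", "quantity", "unit_price"]

# 1) From a dict of columns: each key becomes a column name
from_dict = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "store_name": ["Downtown", "Downtown", "Suburb"],
    "quantity": [2, 5, 1],
    "unit_price": [3.5, 0.8, 3.5],
})

# 2) From a list of dicts: each dict becomes one row
records = [
    {"transaction_id": 1004, "store_name": "Suburb", "quantity": 1, "unit_price": 5.25},
    {"transaction_id": 1005, "store_name": "Downtown", "quantity": 3, "unit_price": 1.2},
]
from_records = pd.DataFrame(records)

# 3) From a list of lists: column names must be supplied explicitly
rows = [[1006, "Airport", 10, 0.8]]
from_rows = pd.DataFrame(rows, columns=columns)

# Stack the three pieces into one frame with a fresh 0..n-1 index
combined = pd.concat([from_dict, from_records, from_rows], ignore_index=True)
print(combined)
```

All three constructors should produce the same column layout, so the pieces can be concatenated without realignment.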


## SECTION 2 — Filter & Subset Data

Filter and subset rows/columns (e.g., sales for a store, date ranges, high-value transactions).

Unique store names

In [4]:
store_names = list(dataframe['store_name'].unique())
print("Unique Stores:",store_names)


Unique Stores: ['Downtown', 'Suburb', 'Airport']


High-value transactions

In [8]:
print("\nHigh Value Transactions")
dataframe[dataframe['total'] > 5]


High Value Transactions


,transaction_id,store_id,store_name,date,product_id,product_name,quantity,unit_price,total,customer_id,promo_code
0,1001,1,Downtown,2025-10-01,501,Notebook,2,3.5,7.0,2001.0,
3,1004,2,Suburb,2025-10-02,503,Stapler,1,5.25,5.25,,
5,1006,3,Airport,2025-10-03,502,Pen,10,0.8,8.0,2005.0,
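
The section brief also lists per-store and date-range filters, which are not shown above. The sketch below is one way to write them, assuming the `date` column has already been parsed to datetimes; it uses a small inline subset of the sample rows so that it runs on its own.

```python
import pandas as pd

# Subset of the sample rows; only the columns needed for filtering
sales = pd.DataFrame({
    "transaction_id": [1001, 1003, 1005, 1006, 1009],
    "store_name": ["Downtown", "Suburb", "Downtown", "Airport", "Suburb"],
    "date": pd.to_datetime(
        ["2025-10-01", "2025-10-02", "2025-10-03", "2025-10-03", "2025-10-05"]
    ),
    "total": [7.0, 3.5, 3.6, 8.0, 3.6],
})

# Rows for a single store
downtown = sales[sales["store_name"] == "Downtown"]

# Rows inside a date range (both endpoints are included by default)
start, end = pd.Timestamp("2025-10-02"), pd.Timestamp("2025-10-03")
in_range = sales[sales["date"].between(start, end)]

# Combined condition plus a column subset via .loc
suburb_high = sales.loc[
    (sales["store_name"] == "Suburb") & (sales["total"] > 3.5),
    ["transaction_id", "total"],
]

print(downtown, in_range, suburb_high, sep="\n\n")
```

Each condition in a combined mask needs its own parentheses, because `&` binds more tightly than the comparison operators.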


## SECTION 3 — Descriptive Statistics & Grouping

Compute descriptive statistics and group summaries (totals by store, average basket).

In [9]:
print("Basic Stats")
dataframe[["quantity", "total", "unit_price"]].describe()


Basic Stats


,quantity,total,unit_price
count,10.0,10.0,10.0
mean,3.1,4.355,2.2
std,2.766867,1.889511,1.603642
min,1.0,1.6,0.8
25%,1.25,3.5,0.825
50%,2.0,3.6,1.475
75%,3.75,4.9375,3.5
max,10.0,8.0,5.25


Total Revenue by Store

In [10]:
print("\nTotal Revenue by Stores")
print(dataframe.groupby("store_name")["total"].sum())


Total Revenue by Stores
store_name
Airport     11.50
Downtown    16.20
Suburb      15.85
Name: total, dtype: float64


Average basket size (mean quantity per transaction)

In [11]:
avg_quantity = float(dataframe["quantity"].mean())
print("Average quantity units per Transaction", avg_quantity)

Average quantity units per Transaction 3.1
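
The figure above is a single mean across all stores. To get the per-store view that the section brief hints at, several aggregates can be computed in one `groupby` call. The sketch below uses pandas named aggregation on the ten sample rows; the output column names (`revenue`, `transactions`, `avg_quantity`) are assumptions chosen for illustration.

```python
import pandas as pd

# store_name, quantity and total for the ten sample transactions
sales = pd.DataFrame({
    "store_name": ["Downtown", "Downtown", "Suburb", "Suburb", "Downtown",
                   "Airport", "Airport", "Downtown", "Suburb", "Suburb"],
    "quantity": [2, 5, 1, 1, 3, 10, 2, 2, 4, 1],
    "total": [7.0, 4.0, 3.5, 5.25, 3.6, 8.0, 3.5, 1.6, 3.6, 3.5],
})

# One row per store: total revenue, number of transactions, mean basket size
store_summary = sales.groupby("store_name").agg(
    revenue=("total", "sum"),
    transactions=("total", "count"),
    avg_quantity=("quantity", "mean"),
)
print(store_summary)
```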


Unique Products Available

In [12]:
product_names = list(dataframe['product_name'].unique())
product_names

['Notebook', 'Pen', 'Stapler', 'Folder', 'Marker', 'Highlighter']

Revenue by Product

In [13]:
revenue_by_product = dataframe.groupby("product_name")["total"].sum()
revenue_by_product

product_name
Folder          3.60
Highlighter     3.60
Marker          3.50
Notebook       14.00
Pen            13.60
Stapler         5.25
Name: total, dtype: float64
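
The grouped result is ordered alphabetically by product name. For reporting it may be more useful sorted by revenue. The following sketch rebuilds the same series from the sample rows, ranks it, and derives each product's share of revenue; `ranked`, `revenue_share`, and `top_product` are illustrative names.

```python
import pandas as pd

# product_name and total for the ten sample transactions
sales = pd.DataFrame({
    "product_name": ["Notebook", "Pen", "Notebook", "Stapler", "Folder",
                     "Pen", "Marker", "Pen", "Highlighter", "Notebook"],
    "total": [7.0, 4.0, 3.5, 5.25, 3.6, 8.0, 3.5, 1.6, 3.6, 3.5],
})

revenue_by_product = sales.groupby("product_name")["total"].sum()

# Highest revenue first
ranked = revenue_by_product.sort_values(ascending=False)

# Fraction of overall revenue contributed by each product
revenue_share = (ranked / ranked.sum()).round(3)

# Label of the best-selling product by revenue
top_product = ranked.idxmax()

print(ranked, revenue_share, top_product, sep="\n\n")
```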

## SECTION 4 — Data Cleaning

Perform simple cleaning (handle missing values, fix data types, drop duplicates).

Check Missing Values

In [14]:
dataframe.isnull().sum()

transaction_id    0
store_id          0
store_name        0
date              0
product_id        0
product_name      0
quantity          0
unit_price        0
total             0
customer_id       1
promo_code        7
dtype: int64

Fill missing customer_id

In [15]:
dataframe['customer_id'] = dataframe['customer_id'].fillna(0).astype(int)
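
Filling with `0` gives an integer column, but a missing customer then looks the same as a customer whose ID is 0. One possible alternative, shown here as a sketch rather than as the approach this lab uses, is pandas' nullable `Int64` dtype, which keeps integers and still marks the gap as missing. The values are the `customer_id` entries from the first five sample rows.

```python
import pandas as pd

# customer_id as read from the CSV: floats, because of the missing value
customer_id = pd.Series([2001.0, 2002.0, 2003.0, None, 2004.0], name="customer_id")

# Nullable integer dtype: whole numbers stay integers, the gap becomes <NA>
nullable_ids = customer_id.astype("Int64")

# Boolean mask of rows that have a known customer
is_known = nullable_ids.notna()

print(nullable_ids)
print("Known customers:", int(is_known.sum()))
```

Whether the sentinel or the nullable dtype is the better choice depends on what the downstream consumer of the data expects.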

Fill missing promo_code

In [16]:
dataframe['promo_code'] = dataframe['promo_code'].fillna("NO CODE")

Convert Data Types

In [17]:
dataframe['date'] = pd.to_datetime(dataframe['date'])

Drop Duplicates

In [18]:
dataframe.drop_duplicates(inplace=True)
dataframe

,transaction_id,store_id,store_name,date,product_id,product_name,quantity,unit_price,total,customer_id,promo_code
0,1001,1,Downtown,2025-10-01,501,Notebook,2,3.5,7.0,2001,NO CODE
1,1002,1,Downtown,2025-10-01,502,Pen,5,0.8,4.0,2002,DISC10
2,1003,2,Suburb,2025-10-02,501,Notebook,1,3.5,3.5,2003,NO CODE
3,1004,2,Suburb,2025-10-02,503,Stapler,1,5.25,5.25,0,NO CODE
4,1005,1,Downtown,2025-10-03,504,Folder,3,1.2,3.6,2004,DISC10
5,1006,3,Airport,2025-10-03,502,Pen,10,0.8,8.0,2005,NO CODE
6,1007,3,Airport,2025-10-04,505,Marker,2,1.75,3.5,2006,NO CODE
7,1008,1,Downtown,2025-10-04,502,Pen,2,0.8,1.6,2002,NO CODE
8,1009,2,Suburb,2025-10-05,506,Highlighter,4,0.9,3.6,2007,NO CODE
9,1010,2,Suburb,2025-10-05,501,Notebook,1,3.5,3.5,2008,DISC5
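
The sample data has no repeated rows, so `drop_duplicates` leaves all ten in place. To see the call do something, the sketch below adds one hypothetical duplicate (a repeated transaction 1002, which is not in the original file), counts duplicates before removing them, and deduplicates on `transaction_id` as an assumed key.

```python
import pandas as pd

# Rows 1001 and 1002 come from the sample; the second 1002 row is a
# hypothetical duplicate added only for demonstration.
sales = pd.DataFrame({
    "transaction_id": [1001, 1002, 1002],
    "product_name": ["Notebook", "Pen", "Pen"],
    "total": [7.0, 4.0, 4.0],
})

# Count rows that exactly repeat an earlier row
n_exact_duplicates = int(sales.duplicated().sum())

# Treat transaction_id as the key and keep the first occurrence of each
deduped = sales.drop_duplicates(subset="transaction_id", keep="first")
deduped = deduped.reset_index(drop=True)

print("Exact duplicates found:", n_exact_duplicates)
print(deduped)
```

Checking the count first makes it visible how many rows were removed, which `inplace=True` on its own does not report.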


## SECTION 5 — Build a Mini ETL Pipeline + Extend with More Data

Build a mini ETL pipeline: read CSV → clean & transform → output JSON.

### Extract

In [19]:
df = pd.read_csv("sales_sample_extended.csv")

### Transform

In [20]:
df['date'] = pd.to_datetime(df['date'])
df['promo_code'] = df['promo_code'].fillna("NO CODE")
df['customer_id'] = df['customer_id'].fillna(0).astype(int)

clean_df = df

### Load (Output JSON)

In [21]:
clean_df.to_json("clean_sales_extended.json", orient="records", indent=4)
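
The three steps can also be wrapped in small functions so that the pipeline is reusable for other files. The sketch below is one possible arrangement, not the lab's reference solution. The function names (`extract`, `transform`, `load`) and the demo file names are assumptions, and the input is a two-row CSV written to a temporary directory so that the example runs without the lab files. It passes `date_format="iso"` because pandas has historically written datetime columns to JSON as epoch milliseconds unless told otherwise.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def extract(csv_path):
    """Read the raw CSV into a DataFrame."""
    return pd.read_csv(csv_path)


def transform(raw):
    """Apply the cleaning steps from this section to a copy of the input."""
    df = raw.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["promo_code"] = df["promo_code"].fillna("NO CODE")
    df["customer_id"] = df["customer_id"].fillna(0).astype(int)
    return df.drop_duplicates()


def load(df, json_path):
    """Write one JSON object per row, with dates as ISO 8601 strings."""
    df.to_json(json_path, orient="records", indent=4, date_format="iso")


# Demo run on a tiny hypothetical input file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo_sales.csv"
    json_path = Path(tmp) / "demo_sales.json"
    csv_path.write_text(
        "transaction_id,date,customer_id,promo_code\n"
        "1004,2025-10-02,,\n"
        "1002,2025-10-01,2002,DISC10\n"
    )
    load(transform(extract(csv_path)), json_path)
    records = json.loads(json_path.read_text())

print(records)
```

Working on a copy inside `transform` leaves the raw frame untouched, whereas `clean_df = df` above only creates a second name for the same object.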