# Week 4 Lab – Data Wrangling: From Business Question to Analysis

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/04_wk4_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In this lab, we’ll use **three datasets** from the Complete Journey retail grocery data:

1. **transactions** – product purchases by households (receipt-level detail)  
2. **demographics** – household-level demographic data  
3. **products** – metadata about products purchased  

This lab reinforces this week’s readings:

- **[Reading 10: Manipulating Data](https://bradleyboehmke.github.io/uc-bana-4080/10-manipulating-data.html)**
- **[Reading 11: Summarizing Data](https://bradleyboehmke.github.io/uc-bana-4080/11_aggregating_data.html)**
- **[Reading 12: Joining Data](https://bradleyboehmke.github.io/uc-bana-4080/12-joining-data.html)**

We will:
- Start with simple data exploration
- Progress to manipulating and summarizing data
- End with joining datasets to answer more complex questions
- Practice breaking business questions into **analytical steps**

You are encouraged to work in small groups of **2–4 students**. This week’s **homework** is based on this lab. You may not finish everything today; save your work as you go.


## Setup

In [None]:
# If you don't have completejourney_py installed, run: pip install completejourney-py
from completejourney_py import get_data
import pandas as pd

# Load datasets
cj_data = get_data()
transactions = cj_data['transactions']
products = cj_data['products']
demographics = cj_data['demographics']

# Quick preview
transactions.head()


## Part 1 – Basic Exploration


**Q0:** How many transactions are in our dataset, what is the date range, how many households have demographic data, how many products exist, and what are the min/max/mean sales values?  

**Step-by-step instructions:**
1. Use `.shape[0]` on `transactions` to count rows.  
2. Use `.min()` and `.max()` on `transaction_timestamp` to find the date range.  
3. Use `.shape[0]` on `demographics` and `products` to get counts.  
4. Use `.min()`, `.max()`, `.mean()` on `sales_value` for basic stats.
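
As a minimal sketch of these calls, the example below runs them on a small made-up table (`toy`, with invented values, not the Complete Journey data) that reuses the two column names from the steps above.

```python
import pandas as pd

# Toy stand-in for `transactions`; the values are invented for illustration.
toy = pd.DataFrame({
    "transaction_timestamp": pd.to_datetime(
        ["2017-01-01 09:30", "2017-01-02 14:00", "2017-01-02 18:45"]
    ),
    "sales_value": [2.50, 4.00, 1.50],
})

n_rows = toy.shape[0]                          # number of rows
first_ts = toy["transaction_timestamp"].min()  # earliest timestamp
last_ts = toy["transaction_timestamp"].max()   # latest timestamp
avg_sale = toy["sales_value"].mean()           # average sales value

print(n_rows, first_ts, last_ts, avg_sale)
```

`.shape` is a `(rows, columns)` tuple, so index `0` gives the row count and index `1` would give the column count.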


In [None]:
# Starter code with blanks to fill
# total number of transactions
num_transactions = transactions._____

In [None]:
# date range of transactions
min_date = transactions['transaction_timestamp'].__()
max_date = transactions['transaction_timestamp'].__()

In [None]:
# number of households with demographic data and number of products (row counts)
num_households = demographics.shape[__]
num_products = products.shape[__]

In [None]:
# summary statistics for sales_value
min_sales = transactions['sales_value'].__()
max_sales = transactions['sales_value'].__()
mean_sales = transactions['sales_value'].__()


**Q1:** Which day had the highest total sales?  

**Step-by-step instructions:**
1. Create a new column `date` by extracting only the date from `transaction_timestamp` (`.dt.date`).  
2. Group by `date` and sum `sales_value`.  
3. Sort results in descending order.  
4. Select the top row.
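
The four steps above can be sketched on a toy table as follows. The data is invented, so the result says nothing about the real answer; only the column names come from the lab.

```python
import pandas as pd

# Toy stand-in for `transactions`; the values are invented for illustration.
toy = pd.DataFrame({
    "transaction_timestamp": pd.to_datetime(
        ["2017-01-01 09:30", "2017-01-02 14:00", "2017-01-02 18:45"]
    ),
    "sales_value": [2.50, 4.00, 1.50],
})

# Step 1: keep only the calendar date of each timestamp.
toy["date"] = toy["transaction_timestamp"].dt.date

# Steps 2-3: total sales per date, largest first.
daily_sales = (
    toy.groupby("date")["sales_value"]
    .sum()
    .sort_values(ascending=False)
)

# Step 4: the top row is the day with the highest total.
top_day = daily_sales.head(1)
print(top_day)
```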


In [None]:
# Your code here



**Q2:** What are the top 5 departments by total sales?  

**Step-by-step instructions:**
1. Join `transactions` to `products` on `product_id` using an inner join.  
2. Group by `department` and sum `sales_value`.  
3. Sort results in descending order.  
4. Display the top 5.
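
A minimal sketch of the join-then-aggregate pattern, using two invented tables (`toy_tx` and `toy_products` are hypothetical names). It also shows one consequence of an inner join: a transaction whose `product_id` has no match in the products table is dropped.

```python
import pandas as pd

# Invented data; product 9 has no matching row in toy_products.
toy_tx = pd.DataFrame({
    "product_id": [1, 1, 2, 3, 9],
    "sales_value": [2.0, 3.0, 4.0, 1.0, 7.0],
})
toy_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "department": ["GROCERY", "PRODUCE", "GROCERY"],
})

# Inner join keeps only rows whose product_id appears in both tables.
joined = toy_tx.merge(toy_products, on="product_id", how="inner")

# Total sales per department, largest first, then the top 5.
dept_sales = (
    joined.groupby("department")["sales_value"]
    .sum()
    .sort_values(ascending=False)
)
top_departments = dept_sales.head(5)
print(top_departments)
```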


In [None]:
# Your code here


## Part 2 – Manipulating Data


**Q3:** What is the average unit price for each department?  

**Step-by-step instructions:**
1. Create a `unit_price` column: `sales_value / quantity`.  
2. Join `transactions` to `products` to bring in `department`.  
3. Group by `department` and calculate the mean of `unit_price`.


In [None]:
# Your code here



**Q4:** Do we have missing values in `unit_price`?  

**Step-by-step instructions:**
1. Use `.isna().sum()` on `unit_price` to count missing values.  
2. Consider filtering rows where `quantity == 0` to see if that’s the cause.
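
One thing worth checking here: in pandas, dividing by zero does not raise an error. `0 / 0` becomes `NaN`, while a non-zero value divided by zero becomes `inf`, and `.isna()` counts only the `NaN` rows. The sketch below shows this on invented data; whether the real dataset contains either case is for you to find out.

```python
import numpy as np
import pandas as pd

# Invented rows: a normal sale, a 0/0 case, and a non-zero sale with quantity 0.
toy = pd.DataFrame({
    "sales_value": [3.0, 0.0, 2.0],
    "quantity": [2, 0, 0],
})
toy["unit_price"] = toy["sales_value"] / toy["quantity"]  # 1.5, NaN, inf

n_missing = toy["unit_price"].isna().sum()      # counts NaN only
n_infinite = np.isinf(toy["unit_price"]).sum()  # counts +inf and -inf

# Rows with zero quantity are the ones behind both problem values.
zero_qty = toy[toy["quantity"] == 0]
print(n_missing, n_infinite, len(zero_qty))
```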


In [None]:
# Your code here


## Part 3 – Aggregations


**Q5:** Which income level spends the most on average? 
 
*Hint:* Join `transactions` to `demographics`, group by `income`, and calculate the mean sales per household.
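
One possible reading of "mean sales per household" is a two-step aggregation: first total each household's spend, then average those totals within each income level. The sketch below follows that reading on invented data. The `household_id` join key and the income labels are assumptions for illustration.

```python
import pandas as pd

# Invented data; household_id is assumed to be the join key.
toy_tx = pd.DataFrame({
    "household_id": [1, 1, 2, 3],
    "sales_value": [10.0, 20.0, 5.0, 15.0],
})
toy_demo = pd.DataFrame({
    "household_id": [1, 2, 3],
    "income": ["50-74K", "50-74K", "100K+"],
})

joined = toy_tx.merge(toy_demo, on="household_id", how="inner")

# Step 1: total spend per household (keeping income alongside).
household_totals = joined.groupby(
    ["income", "household_id"], as_index=False
)["sales_value"].sum()

# Step 2: average of those household totals within each income level.
avg_by_income = (
    household_totals.groupby("income")["sales_value"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_income)
```

On this toy data a plain per-transaction mean (`joined.groupby("income")["sales_value"].mean()`) would rank the two groups the other way round, so it is worth deciding which definition the question calls for.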


In [None]:
# Your code here



**Q6:** Do households with kids spend more (on average) than households without kids?  

*Hint:* Use `kid_count` to group households by creating a new column (e.g., `has_kids`) that identifies whether a household has kids (`kid_count > 0`) or not (`kid_count == 0`). Compute the average spend for those with kids and those without.
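
A minimal sketch of the flag-then-group step, assuming `kid_count` is numeric (if it is stored as text in the real data, it would need converting first). The `households` table and its `total_spend` column are hypothetical, standing in for a per-household spend you would compute yourself.

```python
import pandas as pd

# Hypothetical household-level table: one row per household with its total spend.
households = pd.DataFrame({
    "household_id": [1, 2, 3, 4],
    "kid_count": [0, 2, 1, 0],
    "total_spend": [30.0, 5.0, 15.0, 10.0],
})

# Boolean flag: True when the household has at least one kid.
households["has_kids"] = households["kid_count"] > 0

# Average spend for households with and without kids.
avg_spend = households.groupby("has_kids")["total_spend"].mean()
print(avg_spend)
```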


In [None]:
# Your code here



**Q7:** What are the top 5 departments by total quantity of items sold?  

*Hint:* Join to products, group by department, sum quantity, and sort.


In [None]:
# Your code here


## Part 4 – Joins for Deeper Insights


**Q8:** Which product is purchased most frequently?  

*Hint:* Group by `product_id`, sum quantity, then join to products for description.


In [None]:
# Your code here



**Q9:** Identify all products with “pizza” in `product_type` and find the one with the greatest total sales.  

*Hint:* Filter products where product_type contains "pizza" with `.str.contains("pizza", case=False, na=False)`, join to transactions, sum sales by product.
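
The filter in the hint can be sketched on invented data as follows. `case=False` makes the match case-insensitive, and `na=False` treats a missing `product_type` as a non-match rather than leaving a missing value in the mask.

```python
import pandas as pd

# Invented data, including one product with a missing product_type.
toy_products = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_type": ["FROZEN PIZZA", "Pizza/Premium", None, "SOFT DRINKS"],
})
toy_tx = pd.DataFrame({
    "product_id": [1, 2, 2, 4],
    "sales_value": [5.0, 3.0, 4.0, 9.0],
})

# Boolean mask: True where product_type contains "pizza" in any casing.
mask = toy_products["product_type"].str.contains("pizza", case=False, na=False)
pizza_products = toy_products[mask]

# Join the filtered products to transactions and total sales per product.
pizza_sales = (
    toy_tx.merge(pizza_products, on="product_id", how="inner")
    .groupby("product_id")["sales_value"]
    .sum()
)
best_product_id = pizza_sales.idxmax()  # product_id with the largest total
print(pizza_sales, best_product_id)
```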


In [None]:
# Your code here



**Q10:** Which product category brings in the most revenue for the highest-income households with kids? 
 
*Hint:* Filter demographics for the highest income level & `kid_count > 0`, join to transactions and products, group by category and compute the sum of sales value.


In [None]:
# Your code here



**Q11:** Which manufacturer has the highest total sales, and which department do they primarily sell in?  

*Hint:* Join transactions to products, group by manufacturer, sum sales, find top. Then, filter products for that top manufacturer and check which department(s) they are associated with.


In [None]:
# Your code here



**Q12:** For each income level, what is the most frequently purchased product category?  

*Hint:* Join demographics → transactions → products, group by income and category, sum `quantity`, then take the top category for each income level.
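
A "top row per group" result can be sketched by sorting and then taking the first row of each group. The example below uses invented data and takes "most frequently purchased" to mean the largest summed `quantity`, which is an assumption. The `household_id` key and the `product_category` column name are also assumed for illustration.

```python
import pandas as pd

# Invented data for a three-table join.
toy_demo = pd.DataFrame({
    "household_id": [1, 2],
    "income": ["50-74K", "100K+"],
})
toy_tx = pd.DataFrame({
    "household_id": [1, 1, 1, 2, 2],
    "product_id": [10, 11, 10, 11, 10],
    "quantity": [1, 3, 1, 2, 5],
})
toy_products = pd.DataFrame({
    "product_id": [10, 11],
    "product_category": ["MILK", "BREAD"],
})

# Chain the joins: demographics -> transactions -> products.
joined = (
    toy_demo.merge(toy_tx, on="household_id", how="inner")
    .merge(toy_products, on="product_id", how="inner")
)

# Total quantity for every (income, category) pair.
totals = joined.groupby(
    ["income", "product_category"], as_index=False
)["quantity"].sum()

# Sort largest first, then keep the first row within each income level.
top_per_income = (
    totals.sort_values("quantity", ascending=False)
    .groupby("income")
    .head(1)
)
print(top_per_income)
```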


In [None]:
# Your code here


## Homework Deliverable


- Implement the code to answer the above questions.  
- Once you have all your answers, go to the homework quiz on Canvas and submit them.  
- Save your notebook; you'll upload it to Canvas as part of the homework.
