<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/wk4_data_detective.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4: Data Wrangling in Python 🛠️

## Today's Mission:
- Learn to manipulate, summarize, and join data
- Master data wrangling skills with Complete Journey data
- Solve real business questions through code-with-me challenges

**Follow along with the slides and complete the challenges below!**

## Getting Started: Load Complete Journey Data

Let's start by loading the Complete Journey datasets that we'll use throughout today's lesson!

In [None]:
# You may need to install the package first
# !pip install completejourney-py

import pandas as pd
from completejourney_py import get_data

# Load all Complete Journey datasets
cj_data = get_data()
print("Available datasets:")
cj_data.keys()

## Explore the Data Relationships

Let's take a quick look at each dataset to understand what we're working with:

In [None]:
# Quick overview of each dataset
for name, df in cj_data.items():
    print(f"\n{name.upper()}:")
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")

In [None]:
# take a look at the transactions data
transactions = cj_data['transactions']
transactions.head()

## Brainstorm Business Questions

Before we dive into data wrangling, let's think about what kinds of questions a grocery retailer might want to answer:

**Example questions:**
- What income level is buying the most?
- Do families with kids spend more than families without kids?
- Which departments and products are purchased most often?
- Which coupon was used the most?

**Your group's questions:**

1. question 1 here
2. question 2 here
3. question 3 here

---

# Part 1: Manipulating Data 🔧

Let's start with some examples using the Ames housing data to learn the techniques:

## Loading Example Data: Ames Housing

We'll use this messy dataset to learn data manipulation techniques:

In [None]:
# Load the Ames housing data
ames = pd.read_csv("https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_raw.csv")
ames.head()

## Renaming Columns

Notice the inconsistent column names? Let's clean them up:

In [None]:
# Method 1: Rename specific columns
ames.rename(columns={
    'MS SubClass': 'ms_subclass',
    'MS Zoning': 'ms_zoning'
    }, inplace=True)

ames.head()

In [None]:
# Method 2: Clean ALL column names at once
ames.columns = (
  ames.columns
  .str.strip()              # strip leading/trailing spaces first
  .str.lower()              # convert to lowercase
  .str.replace(' ', '_')    # replace spaces with underscores
  .str.replace('-', '_')    # replace hyphens with underscores
)

ames.head()
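
To see what each step of the chain does, here is a minimal sketch on a tiny made-up DataFrame (the column names are illustrative, not taken from the Ames file). It strips whitespace first so that stray leading or trailing spaces do not turn into underscores.

```python
import pandas as pd

# Toy frame with messy, made-up column names (not the real Ames columns)
toy = pd.DataFrame(columns=[' Lot Area', 'Year-Built ', 'SalePrice'])

toy.columns = (
    toy.columns
    .str.strip()              # drop leading/trailing spaces first
    .str.lower()              # convert to lowercase
    .str.replace(' ', '_')    # replace spaces with underscores
    .str.replace('-', '_')    # replace hyphens with underscores
)

print(list(toy.columns))
```

Assigning the result back to `toy.columns` is what makes the change stick; the string methods themselves return a new index.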

## Dropping Columns

Sometimes we want to remove columns we don't need:

In [None]:
# Drop columns we don't need
cols_to_drop = ['order', 'pid', 'ms_subclass']
ames.drop(columns=cols_to_drop, inplace=True)
ames.head()

## Adding New Columns

Create new metrics from existing data:

In [None]:
# Create a price per square foot column
ames['price_per_sqft'] = ames['saleprice'] / ames['gr_liv_area']
ames.head()

In [None]:
# Transform month numbers to month names
months = {
    1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 
    7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'
}

ames['mo_sold'] = ames['mo_sold'].map(months)
ames.head()
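
One thing to watch with `.map()`: any value that is not a key in the dictionary becomes `NaN`. That also means running the cell above a second time would turn every month into `NaN`, because the column would then hold names rather than numbers. A small sketch with made-up values and a shortened mapping:

```python
import pandas as pd

months = {1: 'Jan', 2: 'Feb', 3: 'Mar'}   # shortened mapping for illustration

s = pd.Series([1, 3, 7])                  # 7 has no entry in the mapping
mapped = s.map(months)
print(mapped.tolist())                    # the unmatched value comes back as NaN

# Mapping the already-mapped values again finds no matching keys at all
remapped = mapped.map(months)
print(remapped.isna().all())
```

If you need to re-run the cell, reload the data first or write the result to a new column instead of overwriting `mo_sold`.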

## Handling Missing Values

Check for and handle missing data:

In [None]:
# Check for missing values
ames.isnull().sum().sort_values(ascending=False).head(10)

In [None]:
# Fill missing pool quality values
ames['pool_qc'] = ames['pool_qc'].fillna('no pool')
ames['pool_qc'].value_counts()
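
Different columns usually call for different fill strategies. The sketch below uses a made-up two-column frame: a label for a categorical column where "missing" has a clear meaning, and the median for a numeric column. Using the median is only one common choice and an assumption here, not something this notebook prescribes.

```python
import pandas as pd
import numpy as np

# Made-up values for illustration only
toy = pd.DataFrame({
    'pool_qc': ['Gd', np.nan, np.nan],
    'lot_frontage': [60.0, np.nan, 80.0],
})

# Count missing values per column before filling
print(toy.isnull().sum())

# Categorical: missing means "no pool", so fill with a label
toy['pool_qc'] = toy['pool_qc'].fillna('no pool')

# Numeric: fill with the column median (one possible strategy)
toy['lot_frontage'] = toy['lot_frontage'].fillna(toy['lot_frontage'].median())

print(toy)
```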

## 🧑‍💻 Code With Me: Clean Product Names

**Business Question:** Our marketing team wants cleaner product category names for their dashboard. Can we standardize the Complete Journey product categories?

Let's work together to solve this!

In [None]:
# Load the products data
products = cj_data["products"]
products['product_category'].value_counts()

**Your turn:** Help me clean these category names by:
1. Converting to lowercase
2. Replacing spaces/hyphens with underscores  
3. Creating a new column called `clean_category`

In [None]:
# Fill in the blanks together!
products['clean_category'] = (
    products['product_category']
    .str.______()                    # convert to lowercase
    .str.replace(' ', '_')           # replace spaces
    .str.replace('_______', '_______')   # replace hyphens (fill in the blanks)
)

products[['product_category', 'clean_category']].head()

## 🧑‍💻 Code With Me: Create Business Metrics

**Business Question:** Our analytics team wants to calculate unit prices to identify premium vs budget products.

In [None]:
# Let's look at our transaction data
transactions = cj_data["transactions"]
transactions[['sales_value', 'quantity']].head()

**Your turn:** Help me create a `unit_price` column:

In [None]:
# Fill in the blanks together!
transactions['unit_price'] = transactions['_______'] / transactions['_______']

# Preview the new unit_price column next to the columns it was built from
transactions[['product_id', 'sales_value', 'quantity', 'unit_price']].head()
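
Division can misbehave when `quantity` is 0: pandas returns `inf` or `NaN` rather than raising an error. Whether the Complete Journey transactions contain such rows is something to check, not something assumed here. A sketch on made-up rows showing one way to guard against it:

```python
import pandas as pd
import numpy as np

# Made-up rows; two of them have a quantity of zero
toy = pd.DataFrame({
    'product_id': ['A', 'B', 'C'],
    'sales_value': [4.00, 3.00, 0.00],
    'quantity': [2, 0, 0],
})

# Replace zero quantities with NaN so the division yields NaN instead of inf
toy['unit_price'] = toy['sales_value'] / toy['quantity'].replace(0, np.nan)

print(toy)
```

Rows with a `NaN` unit price can then be filtered out or inspected separately before ranking products by price.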

---

# Part 2: Summarizing Data 📊

Now let's learn how to aggregate and summarize our data:

## Simple Aggregations

Computing summary statistics for individual columns:

In [None]:
# Basic statistics on the Ames data
avg_price = ames['saleprice'].mean()
print(f"Avg Sale Price: ${avg_price:,.2f}")

min_ppsqft = ames['price_per_sqft'].min()
max_ppsqft = ames['price_per_sqft'].max()
print(f"Min & Max Price per Sqft: ${min_ppsqft:,.2f} - ${max_ppsqft:,.2f}")

In [None]:
# Summary stats for multiple columns
cols = ['gr_liv_area', 'saleprice', 'price_per_sqft']
ames[cols].mean()

## Multiple Aggregations with .agg()

When you need different statistics for different columns:

In [None]:
ames.agg({
    'saleprice': ['mean', 'median'],
    'price_per_sqft': ['mean', 'min', 'max']
})
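
When the dictionary asks for different statistics per column, the result is a table with one row per statistic and `NaN` wherever a statistic was not requested for that column. A small sketch with made-up numbers:

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'saleprice': [100, 200, 300],
    'price_per_sqft': [50.0, 80.0, 110.0],
})

summary = toy.agg({
    'saleprice': ['mean', 'median'],
    'price_per_sqft': ['mean', 'min', 'max'],
})

# 'median' was only requested for saleprice, so price_per_sqft shows NaN there
print(summary)
```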

## Group-level Aggregations

The real power comes from grouping and summarizing:

In [None]:
# Mean and median sale price by neighborhood
(
  ames.
  groupby('neighborhood', as_index=False).
  agg({'saleprice': ['mean', 'median']})
)

In [None]:
# Group by multiple variables
(
  ames.
  groupby(['neighborhood', 'mo_sold'], as_index=False).
  agg({'saleprice': 'mean'})
)
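
Passing a list of functions to `.agg()` produces two-level column labels such as `('saleprice', 'mean')`, which can be awkward to sort or select later. Named aggregation is an alternative that gives flat, readable column names. A minimal sketch on made-up data (the output names `mean_price` and `median_price` are my own choice):

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'neighborhood': ['NAmes', 'NAmes', 'Gilbert'],
    'saleprice': [100, 200, 300],
})

by_hood = (
    toy
    .groupby('neighborhood', as_index=False)
    .agg(
        mean_price=('saleprice', 'mean'),      # new_name=(column, function)
        median_price=('saleprice', 'median'),
    )
)

print(by_hood)
```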

## 🧑‍💻 Code With Me: Top Revenue Products

**Business Question:** Which products generate the most revenue? Our merchandising team needs this for inventory planning.

In [None]:
# Let's look at our transaction data
transactions = cj_data["transactions"]
transactions[['product_id', 'sales_value', 'quantity']].head()

**Your turn:** Help me find the top revenue-generating products:

In [None]:
# Fill in the blanks together!
product_revenue = (
    transactions
    .groupby('_______', as_index=False)
    .agg({'sales_value': '_______'})
    .sort_values('sales_value', ascending=False)
)

product_revenue.head(10)

## 🧑‍💻 Code With Me: Store Performance

**Business Question:** Which stores are performing best? Our operations team wants to understand store-level performance.

In [None]:
# Look at store information in our transactions
transactions['store_id'].value_counts().head()

**Your turn:** Help me compare total sales and transaction counts by store:

In [None]:
# Fill in the blanks together!
store_performance = (
    transactions
    .groupby('_______', as_index=False)
    .agg({
        'sales_value': ['_______', '_______'],  # sum, mean
        'basket_id': '_______'                  # count (rows per store; nunique would count distinct baskets)
    })
)

store_performance.head()
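
If each row of `transactions` is one product within a basket (worth confirming with `transactions.head()`), then `'count'` on `basket_id` counts rows, while `'nunique'` counts distinct baskets. The sketch below uses made-up rows to show the difference; the output column names are illustrative.

```python
import pandas as pd

# Made-up rows: store S1 has three line items spread over two baskets
toy = pd.DataFrame({
    'store_id': ['S1', 'S1', 'S1', 'S2'],
    'basket_id': ['b1', 'b1', 'b2', 'b3'],
    'sales_value': [2.0, 3.0, 5.0, 4.0],
})

per_store = (
    toy
    .groupby('store_id', as_index=False)
    .agg(
        total_sales=('sales_value', 'sum'),
        line_items=('basket_id', 'count'),     # number of rows
        baskets=('basket_id', 'nunique'),      # number of distinct baskets
    )
)

print(per_store)
```

Which of the two counts matches "number of transactions" depends on how the operations team defines a transaction.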

---

# Part 3: Joining Data 🔗

Most business questions require combining multiple datasets:

## Understanding Data Relationships

In the Complete Journey data:
- `household_id` connects **transactions** with **demographics**
- `product_id` connects **transactions** with **products**
- `coupon_upc` connects **coupons** with **coupon_redemptions**

## Basic Merge Example

Let's see how to join transactions with products:

In [None]:
# Example: What is the total sales value for the top 10 selling products?
transactions = cj_data["transactions"]
products = cj_data["products"]

(
    transactions
    .merge(products, how='inner', on='product_id')
    .groupby(['product_id', 'product_category'], as_index=False)
    .agg({'sales_value': 'sum'})
    .nlargest(10, 'sales_value')
)
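
The `how` argument decides what happens to rows without a match. An inner join drops them; a left join keeps every row from the left table and fills the missing columns with `NaN`. A sketch on made-up tables, using `indicator=True` to label where each row came from:

```python
import pandas as pd

# Made-up tables; product p9 has no entry in the products table
transactions_toy = pd.DataFrame({
    'product_id': ['p1', 'p2', 'p9'],
    'sales_value': [3.0, 4.0, 5.0],
})
products_toy = pd.DataFrame({
    'product_id': ['p1', 'p2'],
    'product_category': ['MILK', 'BREAD'],
})

inner = transactions_toy.merge(products_toy, how='inner', on='product_id')
left = transactions_toy.merge(
    products_toy, how='left', on='product_id', indicator=True
)

print(len(inner), len(left))            # inner drops the unmatched row
print(left['_merge'].value_counts())    # 'left_only' marks rows with no match
```

Comparing row counts before and after a merge is a quick way to notice unexpectedly dropped or duplicated rows.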

## 🧑‍💻 Code With Me: Customer Demographics Analysis

**Business Question:** Do families with kids spend more than families without kids? Our marketing team wants to target family-friendly promotions.

In [None]:
# Let's explore what we have
transactions = cj_data["transactions"]
demographics = cj_data["demographics"] 

demographics[['household_id', 'kids_count']].head()

**Your turn:** Help me join transactions with demographics:

In [None]:
# Fill in the blanks together!
family_data = (
    transactions
    .merge(demographics, on='_______', how='_______')
)

family_data[['household_id', 'sales_value', 'kids_count']].head()

## 🧑‍💻 Code With Me: Family Spending Analysis

**Continuing our analysis:** Now let's compare spending between families with and without kids.

In [None]:
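# First, let's create a family type column
# kids_count may be stored as text (e.g. '3+'), so coerce it to a number first
kids = pd.to_numeric(
    family_data['kids_count'].astype(str).str.replace('+', '', regex=False),
    errors='coerce'
)
family_data['family_type'] = kids.apply(
    lambda x: 'Has Kids' if x > 0 else 'No Kids'
)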

family_data['family_type'].value_counts()

**Your turn:** Help me compare average spending by family type:

In [None]:
# Fill in the blanks together!
family_spending = (
    family_data
    .groupby('_______', as_index=False)
    .agg({'sales_value': ['_______', '_______', 'count']})  # mean, sum
)

family_spending

**Discussion:** What does this tell us about family spending patterns?
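
One caveat to bring into the discussion: the mean of `sales_value` across rows is an average per row, not per household. If each row is a single product purchase, a household that shops often contributes many rows. A sketch on made-up data of a two-step alternative, which totals spending per household first and then averages those totals (the output column names are illustrative):

```python
import pandas as pd

# Made-up rows: h1 makes three small purchases, h2 makes one larger purchase
toy = pd.DataFrame({
    'household_id': ['h1', 'h1', 'h1', 'h2'],
    'family_type': ['Has Kids', 'Has Kids', 'Has Kids', 'No Kids'],
    'sales_value': [2.0, 3.0, 4.0, 6.0],
})

# Step 1: total spend per household
household_totals = (
    toy
    .groupby(['family_type', 'household_id'], as_index=False)
    .agg(total_spend=('sales_value', 'sum'))
)

# Step 2: average of those household totals within each family type
avg_household_spend = (
    household_totals
    .groupby('family_type', as_index=False)
    .agg(avg_spend=('total_spend', 'mean'))
)

print(avg_household_spend)
```

In this toy data the row-level mean is lower for "Has Kids" (3.0 vs 6.0), while the household-level average is higher (9.0 vs 6.0), so the choice of unit can change the conclusion.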

---

# Practice Section: Your Turn! 🎯

Now try some challenges on your own using the Complete Journey data:

## Challenge 1: Data Manipulation

**Task:** Clean up the demographics data by:
1. Creating a new column `income_level` that categorizes income ranges as "Low", "Medium", or "High"
2. Creating a new column `has_kids` that is True/False based on kids_count

In [None]:
# Your code here:



## Challenge 2: Data Summarization

**Task:** Find the top 5 product categories by total sales value

In [None]:
# Your code here:



## Challenge 3: Data Joining

**Task:** Answer one of your group's brainstormed questions by joining appropriate datasets and summarizing the results

In [None]:
# Your code here:



---

# 🧾 Quick Reference

| Task | Syntax Example |
|------|----------------|
| Rename columns | `df.rename(columns={"old":"new"}, inplace=True)` |
| Create a new column | `df["unit_price"] = df["sales_value"] / df["quantity"]` |
| Drop column(s) | `df.drop(columns=["col1","col2"], inplace=True)` |
| Fill missing values | `df["col"] = df["col"].fillna(0)` |
| Group and single aggregation | `df.groupby("dept")["sales_value"].sum()` |
| Group with multiple aggregations | `df.groupby("dept").agg({"sales_value":["sum","mean"], "quantity":"sum"})` |
| Sort results | `df.sort_values(["sales_value"], ascending=False)` |
| Most frequent items | `df["product_id"].value_counts()` |
| Join two tables (inner) | `pd.merge(left_df, right_df, on="key", how="inner")` |

## Key Takeaways

- **Clean data first:** Standardized columns and clear keys prevent surprises
- **groupby + agg:** This combination unlocks most business questions
- **Joins are essential:** Most real insights require combining multiple datasets
- **Build systematically:** manipulate → summarize → join → summarize → interpret

**You're now ready for Thursday's lab! 🚀**