# Week 7: Hands-On Practice - Joining Data Sources

**PORA Academy Cohort 5 - Data Analytics & AI Bootcamp**

---

## Instructions
Work through these exercises to practice pandas merge operations. Each exercise builds on the concepts from the lecture. Try to complete them without looking at the solutions first!

## Setup: Load the Data

In [None]:
import pandas as pd
import numpy as np

# Load all datasets
customers = pd.read_csv('../datasets/customers.csv')
orders = pd.read_csv('../datasets/orders.csv')
order_items = pd.read_csv('../datasets/order_items.csv')
products = pd.read_csv('../datasets/products.csv')
sellers = pd.read_csv('../datasets/sellers.csv')
reviews = pd.read_csv('../datasets/order_reviews.csv')
payments = pd.read_csv('../datasets/order_payments.csv')

# Convert date columns
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'])
reviews['review_creation_date'] = pd.to_datetime(reviews['review_creation_date'])

print("Data loaded successfully!")
print(f"Customers: {len(customers)} rows")
print(f"Orders: {len(orders)} rows")
print(f"Order items: {len(order_items)} rows")

## Exercise 1: Basic Inner Merge

**Task**: Merge `orders` with `customers` and show only the orders placed by customers in Rio de Janeiro state (`'RJ'`), including each customer's city.

**Expected output columns**: `order_id`, `customer_id`, `order_status`, `customer_city`, `customer_state`

**Hints**:
- Use `pd.merge()` with `how='inner'`
- Join on `customer_id`
- Filter for `customer_state == 'RJ'`
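
The hints above describe a merge-then-filter pattern. As a minimal sketch of that pattern (not the exercise solution), the example below uses two made-up miniature tables, `toy_orders` and `toy_customers`, whose column names mirror the real datasets; the rows themselves are invented for illustration.

```python
import pandas as pd

# Made-up miniature tables (assumption: column names mirror the real CSVs).
toy_orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c9"],   # c9 has no matching customer
    "order_status": ["delivered", "shipped", "delivered"],
})
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_city": ["niteroi", "sao paulo", "rio de janeiro"],
    "customer_state": ["RJ", "SP", "RJ"],
})

# An inner merge keeps only rows whose key appears in BOTH tables.
merged = pd.merge(toy_orders, toy_customers, on="customer_id", how="inner")

# The boolean mask filters rows; the list selects columns.
rj_only = merged.loc[
    merged["customer_state"] == "RJ",
    ["order_id", "customer_city", "customer_state"],
]
print(rj_only)
```

Order `o3` disappears because its `customer_id` has no match, which is the defining behaviour of `how='inner'`.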

In [None]:
# Your code here


## Exercise 2: Left Merge for Data Quality

**Task**: Find all products that have NEVER been ordered. 

**Steps**:
1. Perform a left merge of `products` with `order_items` on `product_id`
2. Filter for rows where `order_id` is NaN (no orders)
3. Display `product_id`, `product_category_name`, and `product_weight_g`

**Hints**:
- Use `how='left'` to keep all products
- Use `.isna()` to find missing order_ids
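
The left-merge-plus-`.isna()` idea from the steps above is sometimes called an anti-join. Here is a small sketch on invented data (`toy_products` and `toy_items` are hypothetical stand-ins, not the real tables) showing why the unmatched rows end up with NaN.

```python
import pandas as pd

# Invented stand-ins for the products and order_items tables.
toy_products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["toys", "books", "garden"],
})
toy_items = pd.DataFrame({
    "order_id": ["o1", "o2", "o2"],
    "product_id": ["p1", "p1", "p3"],
})

# Left merge: every product survives; products with no match
# get NaN in the columns that come from the right-hand table.
left = toy_products.merge(toy_items, on="product_id", how="left")

# Rows where order_id is missing are products that never appeared in an order.
never_ordered = left[left["order_id"].isna()]
print(never_ordered[["product_id", "product_category_name"]])
```

Note that `p1` appears twice in `left` because it was ordered twice; a left merge preserves every left-hand key but can still add rows.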

In [None]:
# Your code here


## Exercise 3: Multi-DataFrame Analysis

**Task**: Calculate total revenue by seller state.

**Steps**:
1. Merge `orders` → `order_items` → `sellers`
2. Filter for delivered orders only
3. Group by `seller_state`
4. Calculate total revenue (sum of `price`)
5. Sort by revenue descending

**Expected columns in result**: `seller_state`, `total_revenue`

**Hints**:
- Chain merges: `.merge().merge()`
- Use `.groupby()` and `.agg()`
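
To illustrate the chaining and aggregation hints, here is a hedged sketch on three invented tables. The join keys used (`order_id`, then `seller_id`) are an assumption about how the real tables relate; check the actual column names before reusing the pattern.

```python
import pandas as pd

# Invented miniature tables; the key columns are assumptions.
toy_orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "delivered", "canceled"],
})
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "seller_id": ["s1", "s2", "s1", "s2"],
    "price": [10.0, 20.0, 5.0, 99.0],
})
toy_sellers = pd.DataFrame({
    "seller_id": ["s1", "s2"],
    "seller_state": ["SP", "MG"],
})

# Each .merge() returns a new DataFrame, so calls can be chained.
combined = (
    toy_orders
    .merge(toy_items, on="order_id")      # default is how='inner'
    .merge(toy_sellers, on="seller_id")
)

# Named aggregation: output_column=(input_column, function).
revenue = (
    combined[combined["order_status"] == "delivered"]
    .groupby("seller_state", as_index=False)
    .agg(total_revenue=("price", "sum"))
    .sort_values("total_revenue", ascending=False)
)
print(revenue)
```

Using `as_index=False` keeps `seller_state` as an ordinary column, which matches the two-column result the exercise asks for.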

In [None]:
# Your code here


## Exercise 4: Payment Analysis

**Task**: Analyze payment methods for delivered orders.

**Steps**:
1. Merge `orders` with `payments`
2. Filter for delivered orders
3. Calculate for each payment type:
   - Number of orders (count distinct `order_id` values, since an order may have more than one payment row)
   - Total payment value
   - Average payment value
4. Sort by total payment value descending

**Expected output**: Summary by `payment_type`
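
One thing to watch when counting orders per payment type: if a payments table can hold several rows for the same order, counting rows and counting orders give different answers. The sketch below uses an invented `toy_payments` table to show the difference between `count` and `nunique`; whether the real data contains such repeats is something to verify yourself.

```python
import pandas as pd

# Invented data: order o1 was paid with two voucher rows.
toy_payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "payment_type": ["voucher", "voucher", "credit_card", "credit_card"],
    "payment_value": [10.0, 15.0, 40.0, 60.0],
})

summary = (
    toy_payments
    .groupby("payment_type")
    .agg(
        payment_rows=("order_id", "count"),    # counts rows
        n_orders=("order_id", "nunique"),      # counts distinct orders
        total_value=("payment_value", "sum"),
        avg_value=("payment_value", "mean"),
    )
    .sort_values("total_value", ascending=False)
)
print(summary)
```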

In [None]:
# Your code here


## Exercise 5: Customer Satisfaction Analysis

**Task**: Find the average review score for each customer state.

**Steps**:
1. Merge `customers` → `orders` → `reviews`
2. Filter for delivered orders with reviews
3. Group by `customer_state`
4. Calculate:
   - Average review score
   - Number of reviews
5. Sort by average score descending

**Bonus**: Filter to show only states with at least 2 reviews
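
For the bonus, one possible approach is to aggregate first and then filter the aggregated table on its count column. A small sketch on invented review data (the table `toy_reviews` is hypothetical and already "merged" for brevity):

```python
import pandas as pd

# Invented, already-merged rows: one row per review.
toy_reviews = pd.DataFrame({
    "customer_state": ["RJ", "RJ", "SP", "MG", "MG", "MG"],
    "review_score": [5, 3, 4, 2, 4, 3],
})

by_state = (
    toy_reviews
    .groupby("customer_state")
    .agg(
        avg_score=("review_score", "mean"),
        n_reviews=("review_score", "count"),
    )
)

# Keep only states with at least 2 reviews, then rank by average score.
enough = by_state[by_state["n_reviews"] >= 2].sort_values(
    "avg_score", ascending=False
)
print(enough)
```

Filtering on a minimum count keeps a single unusually high or low review from dominating the ranking.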

In [None]:
# Your code here


## Exercise 6: Product Category Performance

**Task**: Identify the best-selling product categories.

**Steps**:
1. Merge `products` → `order_items` → `orders`
2. Filter for delivered orders
3. Group by `product_category_name`
4. Calculate:
   - Total items sold (count of order_item_id)
   - Total revenue (sum of price)
   - Average item price
5. Sort by total items sold descending

In [None]:
# Your code here


## Exercise 7: Complex Join with Multiple Conditions

**Task**: Create a comprehensive order summary including customer, seller, product, payment, and review information.

**Steps**:
1. Start with `orders`
2. Merge with `customers` (inner)
3. Merge with `order_items` (inner)
4. Merge with `products` (inner)
5. Merge with `sellers` (inner)
6. Merge with `payments` (left, because some orders might not have payments)
7. Merge with `reviews` (left, because some orders might not have reviews)
8. Filter for delivered orders
9. Select relevant columns and display first 10 rows

**Columns to include**: 
- `order_id`
- `customer_state`
- `product_category_name`
- `seller_state`
- `price`
- `payment_type`
- `review_score`
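
When many tables are merged in sequence, row counts can grow unexpectedly if both sides of a merge repeat the same key. The sketch below, on invented data, shows that effect and how the `validate` parameter of `merge` can flag it. Whether the real `payments` table repeats `order_id` is an assumption to check, not something this example proves.

```python
import pandas as pd

# Invented data: one order with two items and two payment rows.
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "price": [10.0, 20.0],
})
toy_payments = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "payment_type": ["credit_card", "voucher"],
})

# Every item row pairs with every payment row for the same order.
wide = toy_items.merge(toy_payments, on="order_id", how="left")
print(len(toy_items), "rows became", len(wide))

# validate= asks pandas to check the key relationship and raise
# MergeError if the right-hand keys are not unique.
try:
    toy_items.merge(
        toy_payments, on="order_id", how="left", validate="many_to_one"
    )
    relationship_ok = True
except pd.errors.MergeError:
    relationship_ok = False
print("many_to_one holds:", relationship_ok)
```

Comparing `len()` before and after each merge is a cheap habit that catches this kind of duplication early, especially before summing a column such as `price`.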

In [None]:
# Your code here


## Exercise 8: Debugging Challenge

**Task**: The code below contains three errors, each marked with a comment. Find and fix them.

```python
# This code should show orders with their payment types
result = pd.merge(
    orders,
    payments,
    on='customer_id',  # Error 1: Wrong join key
    how='right'        # Error 2: Wrong join type (should be 'left' or 'inner')
)

# Filter for delivered orders
delivered = result[result['status'] == 'delivered']  # Error 3: Wrong column name

print(delivered[['order_id', 'payment_type']].head())
```
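
A quick way to diagnose join-key and column-name problems like these is to inspect the column lists before merging. The sketch below uses two empty, made-up frames; the column names are assumptions for illustration, so run the same checks on the real `orders` and `payments` to see what they actually contain.

```python
import pandas as pd

# Empty, made-up frames: only the column names matter here.
toy_orders = pd.DataFrame(columns=["order_id", "customer_id", "order_status"])
toy_payments = pd.DataFrame(columns=["order_id", "payment_type", "payment_value"])

# Columns present in both frames are the candidate join keys.
shared = toy_orders.columns.intersection(toy_payments.columns).tolist()
print("Shared columns:", shared)

# Membership tests catch misspelled column names before a KeyError does.
has_status = "status" in toy_orders.columns
has_order_status = "order_status" in toy_orders.columns
print(has_status, has_order_status)
```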

Fix the code below:

In [None]:
# Fixed code here


## Exercise 9: Advanced - Outer Merge Analysis

**Task**: Use an outer merge to find data quality issues between orders and payments.

**Steps**:
1. Perform an outer merge of `orders` and `payments` with `indicator=True`
2. Identify:
   - Orders without payments (left_only)
   - Payments without orders (right_only)
   - Orders with payments (both)
3. Display counts for each category
4. Show examples of each category
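
As a sketch of how `indicator=True` behaves, the example below merges two invented tables and inspects the `_merge` column that pandas adds. The data is made up; only the mechanics carry over to the exercise.

```python
import pandas as pd

# Invented data: o1 has no payment, o4 has no order.
toy_orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
toy_payments = pd.DataFrame({
    "order_id": ["o2", "o3", "o4"],
    "payment_value": [40.0, 60.0, 5.0],
})

# indicator=True adds a `_merge` column with the values
# 'left_only', 'right_only', or 'both'.
audit = toy_orders.merge(
    toy_payments, on="order_id", how="outer", indicator=True
)

counts = audit["_merge"].value_counts()
print(counts)

# Filter on the indicator to pull out examples from one category.
orphans = audit[audit["_merge"] == "right_only"]
print(orphans)
```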

In [None]:
# Your code here


## Exercise 10: Business Intelligence Challenge

**Task**: Answer this business question: "Which seller state has the highest customer satisfaction (review score) and what's the most popular product category in that state?"

**Steps**:
1. Merge all necessary tables
2. Calculate average review score by seller_state
3. Identify the state with highest score
4. For that state, find the most popular product category

**Expected output**: 
- Best seller state
- Average review score
- Top product category in that state
- Number of orders for that category
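
Steps 2 to 4 can be sketched with `idxmax()` and `value_counts()`. The table below is invented and already merged, so it only demonstrates the final two-stage lookup, not the merges themselves. It also counts rows rather than distinct orders, which is a simplification.

```python
import pandas as pd

# Invented, already-merged rows (one row per reviewed item).
toy = pd.DataFrame({
    "seller_state": ["SP", "SP", "MG", "MG", "MG"],
    "review_score": [3, 4, 5, 4, 5],
    "product_category_name": ["toys", "books", "garden", "garden", "toys"],
})

# Stage 1: average score per state, then the label of the maximum.
avg_by_state = toy.groupby("seller_state")["review_score"].mean()
best_state = avg_by_state.idxmax()

# Stage 2: within that state, count rows per category (sorted descending).
top = toy.loc[
    toy["seller_state"] == best_state, "product_category_name"
].value_counts()
top_category, top_count = top.index[0], top.iloc[0]

print(best_state, round(avg_by_state[best_state], 2), top_category, top_count)
```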

In [None]:
# Your code here


## Bonus Exercise: Create Your Own Analysis

**Task**: Come up with your own business question and answer it using pandas merges.

Some ideas:
- Which customer state spends the most on average per order?
- What's the relationship between product weight and shipping cost (freight_value)?
- Do certain payment types correlate with higher review scores?
- What's the average delivery time by seller state?

Write your question and analysis below:

In [None]:
# Your business question: 
# 

# Your analysis code:


## Summary

Great work! You've practiced:
- Inner, left, right, and outer merges
- Chaining multiple merges
- Finding missing data with left merges
- Aggregating merged data
- Debugging merge issues
- Real-world business analysis

Remember: tomorrow's SQL class will cover the same concepts using SQL `JOIN` syntax!