# WEEK 8 PYTHON EXERCISES: Advanced Filtering & Data Analysis

**Student Name:** _________________________

**Date:** _________________________

---

## INSTRUCTIONS:
1. Read each question carefully
2. Write your Python code in the provided cells
3. Test your code to ensure it runs without errors
4. Compare your results with the expected outputs
5. Ask for help if you are stuck for more than 15 minutes!

**DATASET:** Olist Brazilian E-Commerce Dataset

**ESTIMATED TIME:** 90 minutes

---

## Setup: Import Libraries

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Setup: Load Data

**IMPORTANT:** Update the file paths below to match your local data location. If you're using a database connection, replace the CSV loading code with SQLAlchemy connection code.

### Option 1: Load from CSV files

In [None]:
# TODO: Update this path to your actual data location.
# Keep the trailing slash: the file names below are appended directly to it.
DATA_PATH = 'path/to/olist_data/'  # Update this!

# Load datasets
customers = pd.read_csv(f'{DATA_PATH}olist_customers_dataset.csv')
orders = pd.read_csv(f'{DATA_PATH}olist_orders_dataset.csv')
order_items = pd.read_csv(f'{DATA_PATH}olist_order_items_dataset.csv')
order_reviews = pd.read_csv(f'{DATA_PATH}olist_order_reviews_dataset.csv')
order_payments = pd.read_csv(f'{DATA_PATH}olist_order_payments_dataset.csv')
products = pd.read_csv(f'{DATA_PATH}olist_products_dataset.csv')

# Convert date columns
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'])

print("Data loaded successfully!")
print(f"\nCustomers: {len(customers):,} rows")
print(f"Orders: {len(orders):,} rows")
print(f"Order Items: {len(order_items):,} rows")
print(f"Reviews: {len(order_reviews):,} rows")
print(f"Payments: {len(order_payments):,} rows")
print(f"Products: {len(products):,} rows")

### Option 2: Load from Database (Alternative)

Uncomment and use this if you have database access:

In [None]:
# from sqlalchemy import create_engine

# # Create database connection
# engine = create_engine('postgresql://user:password@host:port/database')

# # Load tables
# customers = pd.read_sql('SELECT * FROM olist_sales_data_set.olist_customers_dataset', engine)
# orders = pd.read_sql('SELECT * FROM olist_sales_data_set.olist_orders_dataset', engine)
# order_items = pd.read_sql('SELECT * FROM olist_sales_data_set.olist_order_items_dataset', engine)
# order_reviews = pd.read_sql('SELECT * FROM olist_sales_data_set.olist_order_reviews_dataset', engine)
# order_payments = pd.read_sql('SELECT * FROM olist_sales_data_set.olist_order_payments_dataset', engine)
# products = pd.read_sql('SELECT * FROM olist_sales_data_set.olist_products_dataset', engine)

# print("Data loaded from database successfully!")

---

# PART 1: ADVANCED FILTERING (25 minutes)

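In this section, you'll practice complex boolean filtering with the `&`, `|`, and `~` operators and the `.isin()` method. Wrap each condition in parentheses when you combine them, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.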
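
As a warm-up, here is a minimal sketch of these building blocks on a made-up table. The `toy` DataFrame and its `method` and `amount` columns are invented for illustration and are not part of the Olist data.

```python
import pandas as pd

# Toy payments table (made-up values, not the Olist data)
toy = pd.DataFrame({
    "method": ["card", "cash", "card", "transfer", "cash"],
    "amount": [20.0, 75.0, 120.0, 60.0, 200.0],
})

# & means AND: each condition needs its own parentheses
card_and_large = toy[(toy["method"] == "card") & (toy["amount"] > 100)]

# | means OR
cash_or_small = toy[(toy["method"] == "cash") | (toy["amount"] < 50)]

# .isin() replaces a chain of == comparisons joined with |
card_or_cash = toy[toy["method"].isin(["card", "cash"])]

# ~ negates a boolean mask
not_card_or_cash = toy[~toy["method"].isin(["card", "cash"])]

# .between() includes both endpoints by default
in_range = toy[toy["amount"].between(50, 150)]

print(card_and_large)
print(not_card_or_cash)
print(in_range)
```

Storing a mask in a named variable (for example `is_card = toy["method"] == "card"`) often makes long filters easier to read and debug.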

## Exercise 1.1: Payment Method Analysis

**Task:** Find all payments made with 'boleto' or 'voucher' where the payment value is between R$ 50 and R$ 150 (inclusive). Note that the payments table has one row per payment, so an order can appear more than once.

**Expected columns:** order_id, payment_type, payment_value

**Sort by:** payment_value (descending)

**Limit:** Display first 15 rows

**Hint:** Use `.isin()` for payment types and comparison operators or `.between()` for the range.

In [None]:
# YOUR CODE HERE:



## Exercise 1.2: Northeast Region Customers

**Task:** Identify customers from Northeast states ('BA', 'PE', 'CE', 'RN', 'PB') who have placed MORE than one delivered order.

**Expected columns:** customer_unique_id, customer_state, order_count

**Sort by:** order_count (descending), then customer_state

**Limit:** Display first 20 rows

**Hint:** 
1. Filter customers by state using `.isin()`
2. Merge with orders table
3. Filter for delivered orders
4. Group by customer and count orders
5. Filter for count > 1
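
The same filter, merge, count, filter pattern can be sketched on unrelated toy data. The `members` and `visits` tables and all of their columns are hypothetical; only the shape of the pipeline carries over to the exercise.

```python
import pandas as pd

# Hypothetical toy tables: members and their gym visits
members = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "city": ["A", "B", "A", "C"],
})
visits = pd.DataFrame({
    "member_id": [1, 1, 2, 3, 3, 3, 4],
    "status": ["done", "done", "done", "done", "cancelled", "done", "done"],
})

visit_counts = (
    members[members["city"].isin(["A", "B"])]      # 1. filter the lookup table
    .merge(visits, on="member_id", how="inner")    # 2. attach the event rows
    .loc[lambda df: df["status"] == "done"]        # 3. keep one status only
    .groupby(["member_id", "city"])                # 4. count rows per member
    .size()
    .reset_index(name="visit_count")
)

# 5. keep members with more than one completed visit
repeat_members = (
    visit_counts[visit_counts["visit_count"] > 1]
    .sort_values(["visit_count", "city"], ascending=[False, True])
    .reset_index(drop=True)
)
print(repeat_members)
```

Filtering the small lookup table before the merge keeps the intermediate result small, which matters more as the tables grow.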

In [None]:
# YOUR CODE HERE:



## Exercise 1.3: Negative Reviews with Comments

**Task:** Find all reviews with a score of 1 or 2 where the customer left a written comment (`review_comment_message` is not null).

**Expected columns:** review_id, order_id, review_score, review_comment_message

**Sort by:** review_score (ascending)

**Limit:** Display first 10 rows

**Hint:** Use `.isin()` or comparison operators for scores, and `.notna()` for checking non-null comments.
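
A small sketch of combining a value filter with a null check, using an invented `feedback` table (the column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up feedback table with some missing comments
feedback = pd.DataFrame({
    "rating": [1, 2, 5, 1, 3],
    "comment": ["late", np.nan, "great", None, "ok"],
})

# Both NaN and None count as missing for .notna()
low_with_comment = feedback[
    feedback["rating"].isin([1, 2]) & feedback["comment"].notna()
]
print(low_with_comment)
```

Keep in mind that `.notna()` treats an empty string as a present value. If your data may contain blank comments, you would need an extra condition to exclude them.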

In [None]:
# YOUR CODE HERE:



## Exercise 1.4: Dormant High-Value Customers (CHALLENGE!)

**Task:** Create a list of customers who:
- Have total lifetime spending over R$ 300
- Haven't ordered in 200+ days (use '2018-09-01' as reference date)
- Are from 'SP', 'RJ', or 'MG' states

**Expected columns:** customer_unique_id, customer_state, lifetime_value, last_order_date, days_since_last_order

**Sort by:** lifetime_value (descending)

**Limit:** Display first 15 rows

**Hint:** 
1. Merge customers, orders, and order_items
2. Calculate total_value = price + freight_value
3. Group by customer_unique_id to calculate lifetime_value and last_order_date
4. Calculate days_since_last_order using reference date
5. Apply all filters

In [None]:
# YOUR CODE HERE:



---

# PART 2: QUERY METHOD (25 minutes)

In this section, you'll practice using the `.query()` method, which provides SQL-like syntax for filtering DataFrames.
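
Before starting, here is a brief sketch of `.query()` next to the equivalent boolean indexing. The `toy` table, `threshold`, and `wanted` are made-up names for illustration.

```python
import pandas as pd

# Made-up product table
toy = pd.DataFrame({
    "category": ["a", "b", "a", "c"],
    "price": [10.0, 40.0, 25.0, 5.0],
    "shipping": [2.0, 5.0, 5.0, 1.0],
})

threshold = toy["price"].mean()
wanted = ["a", "c"]

# @name refers to a Python variable from the surrounding scope
result = toy.query("price > @threshold and category in @wanted")

# The same filter written with boolean indexing
same_result = toy[(toy["price"] > threshold) & (toy["category"].isin(wanted))]

# Column arithmetic is allowed inside the query string
expensive_total = toy.query("price + shipping > 25")

print(result)
print(expensive_total)
```

Inside the query string, `and`, `or`, and `not` play the roles of `&`, `|`, and `~`, and the extra parentheses around each condition are not required.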

## Exercise 2.1: Above-Average Order Values Using .query()

**Task:** Find all order items where the total value (price + freight_value) exceeds the AVERAGE total value by more than 50%.

**Formula:** (price + freight_value) > (AVG * 1.5)

**Expected columns:** order_id, product_id, price, freight_value, total_value

**Sort by:** total_value (descending)

**Limit:** Display first 20 rows

**Hint:** 
1. Calculate total_value column first
2. Calculate the average total_value
3. Use `.query()` for the comparison (prefix a Python variable with `@` to reference it inside the query string)

In [None]:
# YOUR CODE HERE:



## Exercise 2.2: Top Product Categories

**Task:** Find all products that belong to the top 5 product categories by total sales count. First identify the top 5 categories, then filter products.

**Expected columns:** product_id, product_category_name

**Sort by:** product_category_name, then product_id

**Limit:** Display first 30 rows

**Hint:** 
1. Join order_items and products
2. Count sales by category
3. Get top 5 category names
4. Filter products using `.isin()` or `.query()` with the top categories list

In [None]:
# YOUR CODE HERE:



## Exercise 2.3: Silent High-Value Customers

**Task:** Identify customers who have:
- 3 or more delivered orders
- NEVER left a review (none of their orders appear in the reviews table)
- Total spending > R$ 200

**Expected columns:** customer_unique_id, customer_state, total_orders, lifetime_value

**Sort by:** lifetime_value (descending)

**Limit:** Display first 20 rows

**Hint:** 
1. Create customer metrics (orders count, lifetime value)
2. Find customers who appear in orders but not in reviews (use `.isin()` with `~`); reviews are keyed by order_id, so map reviewed orders back to customers first
3. Apply all filters
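
The "in one table but not in another" step is easy to get subtly wrong when the second table is keyed by a different ID. A sketch on hypothetical `tickets` and `surveys` tables (all names invented) shows the difference:

```python
import pandas as pd

# Made-up support tickets; surveys are keyed by ticket, not by user
tickets = pd.DataFrame({
    "ticket_id": [10, 11, 12, 13],
    "user": ["u1", "u1", "u2", "u3"],
})
surveys = pd.DataFrame({"ticket_id": [10, 13]})

# Step 1: users who answered at least one survey
users_with_survey = tickets.loc[
    tickets["ticket_id"].isin(surveys["ticket_id"]), "user"
].unique()

# Step 2: users who never answered any survey
silent_users = tickets.loc[
    ~tickets["user"].isin(users_with_survey), "user"
].drop_duplicates()

# Pitfall: negating at the ticket level keeps u1, who has one unsurveyed ticket
row_level = tickets.loc[
    ~tickets["ticket_id"].isin(surveys["ticket_id"]), "user"
].unique()

print(silent_users.tolist())
print(list(row_level))
```

Negating at the row level answers "which tickets have no survey", while the two-step version answers "which users have no survey at all". The exercise asks the second kind of question.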

In [None]:
# YOUR CODE HERE:



## Exercise 2.4: Seasonal Shoppers

**Task:** Find customers who made purchases in Q1 2018 (Jan-Mar) but NOT in Q2 2018 (Apr-Jun).

**Expected columns:** customer_unique_id, customer_state, q1_orders

**Sort by:** q1_orders (descending)

**Limit:** Display first 15 rows

**Hint:** 
1. Filter orders for Q1 2018
2. Filter orders for Q2 2018
3. Get customer IDs from each quarter
4. Find customers in Q1 but NOT in Q2 (use `.isin()` with `~`)

In [None]:
# YOUR CODE HERE:



---

# PART 3: PERFORMANCE OPTIMIZATION (40 minutes)

In this section, you'll practice creating efficient, optimized pandas code for large-scale data analysis.

## Exercise 3.1: Product Performance Dashboard

**Task:** Create a comprehensive product analysis by building a multi-stage data pipeline:

**Stage 1 (product_sales):** Calculate total sales, revenue, and average price per product

**Stage 2 (product_reviews):** Calculate review metrics per product

**Stage 3 (product_intelligence):** Combine both and categorize products

**Final output should show products with:**
- "High Volume" (50+ sales) or "Medium Volume" (20-49 sales)
- "Low Satisfaction" (avg score < 3) or "Medium Satisfaction" (3-4)
- At least 5 reviews

**Expected columns:** product_id, product_category_name, times_sold, total_revenue, avg_review_score, total_reviews, volume_category, satisfaction_category

**Sort by:** times_sold (descending)

**Limit:** Display first 15 rows

**Hint:** Create separate DataFrames for each stage, then merge them together.
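
One possible shape for a staged pipeline, sketched on toy data. The `stage_a` and `stage_b` frames, their columns, and the band labels are assumptions for illustration; `np.select` is just one way to assign categories.

```python
import numpy as np
import pandas as pd

# Stage A: made-up volume metrics per item
stage_a = pd.DataFrame({"item": ["p", "q", "r"], "units": [60, 25, 3]})

# Stage B: made-up quality metrics (item "r" has no score)
stage_b = pd.DataFrame({"item": ["p", "q"], "avg_score": [2.5, 3.8]})

# Stage C: combine the stages, then categorize
combined = stage_a.merge(stage_b, on="item", how="inner")

# Conditions are checked in order; the first match wins
combined["volume_band"] = np.select(
    [combined["units"] >= 50, combined["units"] >= 20],
    ["High", "Medium"],
    default="Low",
)
print(combined)
```

An inner merge drops items that are missing from either stage. Use `how="left"` instead if rows without a match should be kept with missing values.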

In [None]:
# YOUR CODE HERE:

# Stage 1: Product sales metrics


# Stage 2: Product review metrics


# Stage 3: Combine and categorize


# Apply filters and display results


## Exercise 3.2: Customer Lifecycle Analysis

**Task:** Build a pipeline that identifies customer lifecycle stages:

**Pipeline 1:** Get first and last order dates for each customer

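**Pipeline 2:** Categorize customers, measuring inactivity as days between the last order and the reference date '2018-09-01':
- "New" (only 1 order)
- "Active Repeat" (2+ orders, last order 0-90 days before the reference date)
- "At Risk" (2+ orders, last order 91-180 days before the reference date)
- "Dormant" (2+ orders, last order 181+ days before the reference date)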

**Final output:** Show count and average metrics per segment

**Expected columns:** segment, customer_count, avg_orders, avg_days_inactive

**Sort by:** customer_count (descending)

**Hint:** Use `.agg()` to calculate multiple metrics at once.
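
A minimal sketch of computing several metrics in one `.agg()` call with named aggregation. The `people` table already contains a made-up `segment` column, so only the final aggregation step is shown.

```python
import pandas as pd

# Made-up customers that have already been assigned a segment
people = pd.DataFrame({
    "segment": ["New", "New", "Dormant", "At Risk", "New"],
    "orders": [1, 1, 3, 2, 1],
    "days_inactive": [10, 30, 200, 120, 50],
})

# Each keyword is an output column: new_name=(source_column, function)
segment_summary = (
    people.groupby("segment", as_index=False)
    .agg(
        customer_count=("orders", "count"),
        avg_orders=("orders", "mean"),
        avg_days_inactive=("days_inactive", "mean"),
    )
    .sort_values("customer_count", ascending=False)
    .reset_index(drop=True)
)
print(segment_summary)
```

Named aggregation produces flat, readable column names directly, which avoids renaming a multi-level column index afterwards.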

In [None]:
# YOUR CODE HERE:

# Pipeline 1: Customer order dates


# Pipeline 2: Categorize customers


# Final: Aggregate by segment


## Exercise 3.3: State Performance Comparison

**Task:** Build a pipeline that compares each state's performance to national averages:

**Pipeline 1 (state_metrics):** Calculate metrics per state (orders, revenue, customers)

**Pipeline 2 (national_averages):** Calculate overall national averages

**Pipeline 3 (state_comparison):** Show each state vs national average

**Expected columns:** customer_state, total_orders, total_revenue, national_avg_revenue, revenue_vs_national_pct

**Sort by:** revenue_vs_national_pct (descending)

**Limit:** Top 10 states

**Hint:** Calculate percentage difference as: `(state_revenue - national_avg) / national_avg * 100`
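
The formula can be checked on a tiny made-up table. This sketch assumes the benchmark is the mean of the per-region totals; the `regions` table and column names are hypothetical.

```python
import pandas as pd

# Made-up revenue totals per region
regions = pd.DataFrame({
    "region": ["N", "S", "E"],
    "revenue": [300.0, 100.0, 200.0],
})

# A scalar benchmark is broadcast to every row
overall_avg = regions["revenue"].mean()
regions["overall_avg"] = overall_avg
regions["vs_avg_pct"] = (regions["revenue"] - overall_avg) / overall_avg * 100

print(regions.sort_values("vs_avg_pct", ascending=False))
```

A positive value means the region is above the benchmark, a negative value means it is below, and the values are percentages rather than fractions.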

In [None]:
# YOUR CODE HERE:

# Pipeline 1: State metrics


# Pipeline 2: National averages


# Pipeline 3: Compare and calculate percentage


## Exercise 3.4: Retention Campaign Targets (ULTIMATE CHALLENGE!)

**Task:** Design a comprehensive data pipeline that creates a prioritized retention campaign target list:

**Pipeline 1:** Customer purchase behavior (orders, spend, dates)

**Pipeline 2:** Customer satisfaction metrics (reviews, scores)

**Pipeline 3:** Customer payment preferences (most used payment method)

**Pipeline 4:** Risk scoring and segmentation

**Pipeline 5:** Final prioritization with recommended actions

**Target customers who are:**
- High lifetime value (>R$ 400)
- Prefer credit card payments
- Either: At risk (90-180 days inactive) OR had a bad experience (review ≤ 2)

**Expected columns:** customer_unique_id, customer_state, lifetime_value, days_inactive, avg_review_score, preferred_payment, risk_category, priority_score, recommended_action

**Sort by:** priority_score (descending)

**Limit:** Display first 25 rows

**Hint for priority_score:** Combine factors like:
- Lifetime value weight (higher = better)
- Days inactive weight (more = worse)
- Review score weight (lower = worse)
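
There is no single correct scoring formula. One possible approach, sketched here on toy data, is to rescale each factor to the 0-1 range and take a weighted sum. The weights, the `min_max` helper, and the reading that longer inactivity and lower review scores should raise priority are all assumptions, not requirements of the exercise.

```python
import pandas as pd

# Made-up candidates that already passed the target filters
candidates = pd.DataFrame({
    "customer": ["a", "b", "c"],
    "lifetime_value": [500.0, 1000.0, 750.0],
    "days_inactive": [100, 180, 140],
    "avg_review_score": [5.0, 1.0, 3.0],
})

def min_max(series):
    """Rescale a numeric Series to the 0-1 range (hypothetical helper)."""
    return (series - series.min()) / (series.max() - series.min())

# Assumed weights; they sum to 1 so the score lands between 0 and 100
W_VALUE, W_INACTIVE, W_REVIEW = 0.5, 0.3, 0.2

candidates["priority_score"] = 100 * (
    W_VALUE * min_max(candidates["lifetime_value"])
    + W_INACTIVE * min_max(candidates["days_inactive"])
    + W_REVIEW * (1 - min_max(candidates["avg_review_score"]))  # low score raises priority
)
print(candidates.sort_values("priority_score", ascending=False))
```

Rescaling first keeps one factor from dominating only because of its units. Note that `min_max` divides by zero when every value in a column is identical, so real data may need a guard for that case.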

In [None]:
# YOUR CODE HERE:

# Pipeline 1: Purchase behavior


# Pipeline 2: Satisfaction metrics


# Pipeline 3: Payment preferences


# Pipeline 4: Merge and create risk scoring


# Pipeline 5: Prioritization and recommended actions


---

# BONUS CHALLENGES (Optional)

These are advanced exercises for students who finish early or want extra practice.

## Bonus 1: Cross-Sell Opportunities

**Task:** Find products frequently bought together:
1. Identify orders with exactly 2 products
2. Find the most common product pairs
3. Show pairs where both products have good reviews (avg score ≥ 4)

**Expected columns:** product_1, product_2, times_bought_together, product_1_avg_score, product_2_avg_score

**Sort by:** times_bought_together (descending)

**Limit:** Top 20 pairs
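
One way to build and count pairs is sketched below on a made-up `basket` table. It assumes "exactly 2 products" means two distinct items per basket, and it sorts the items so that (x, y) and (y, x) count as the same pair.

```python
import pandas as pd

# Made-up baskets: one row per item in a basket
basket = pd.DataFrame({
    "basket_id": [1, 1, 2, 2, 3, 3, 4],
    "item": ["x", "y", "y", "x", "x", "z", "x"],
})

# 1. Keep baskets with exactly two distinct items
two_item = basket.groupby("basket_id").filter(
    lambda group: group["item"].nunique() == 2
)

# 2. Sort items within each basket, then spread them into two columns
pairs = (
    two_item.sort_values(["basket_id", "item"])
    .groupby("basket_id")["item"]
    .agg(["first", "last"])
    .rename(columns={"first": "item_1", "last": "item_2"})
)

# 3. Count how often each pair occurs, most frequent first
pair_counts = pairs.value_counts().reset_index(name="times_together")
print(pair_counts)
```

The review-score condition can then be applied by merging per-item average scores onto `item_1` and `item_2` separately.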

In [None]:
# YOUR CODE HERE:



## Bonus 2: Cohort Retention Analysis

**Task:** Create a cohort analysis showing customer retention by their first purchase month. Show what percentage of each monthly cohort made repeat purchases in subsequent months.

**Expected output:** A cohort table showing:
- Cohort month (first purchase month)
- Total customers in cohort
- Customers who returned in Month 1, Month 2, Month 3, etc.
- Retention percentages

**Hint:** 
1. Extract cohort month from first purchase date
2. Calculate months since first purchase for each order
3. Create a pivot table showing retention by cohort and month
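
The three steps above can be sketched on a tiny made-up event log. The `events` table and its columns are hypothetical, and the cohort label is stored as a plain `YYYY-MM` string to keep the example simple.

```python
import pandas as pd

# Made-up purchase events
events = pd.DataFrame({
    "user": ["a", "a", "b", "b", "c", "a"],
    "date": pd.to_datetime([
        "2018-01-05", "2018-02-10", "2018-01-20",
        "2018-03-02", "2018-02-15", "2018-01-25",
    ]),
})

# 1. Cohort month = month of each user's first purchase
first_date = events.groupby("user")["date"].transform("min")
events["cohort_month"] = first_date.dt.strftime("%Y-%m")

# 2. Whole months between the first purchase and each purchase
events["months_since"] = (
    (events["date"].dt.year - first_date.dt.year) * 12
    + (events["date"].dt.month - first_date.dt.month)
)

# 3. Distinct users per cohort and month offset
cohort_counts = events.pivot_table(
    index="cohort_month",
    columns="months_since",
    values="user",
    aggfunc="nunique",
    fill_value=0,
)

# Divide each row by its month-0 size to get retention percentages
retention_pct = cohort_counts.div(cohort_counts[0], axis=0) * 100
print(cohort_counts)
print(retention_pct)
```

Counting distinct users rather than rows matters here: a user who buys twice in the same month should only count once toward that month's retention.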

In [None]:
# YOUR CODE HERE:



---

# REFLECTION QUESTIONS

Answer these questions in the markdown cells below. Think about what you learned and how you can apply it.

### Q1: When would you use `.isin()` vs multiple OR conditions? Give a real business example.

**YOUR ANSWER:**



### Q2: What's the main advantage of the `.query()` method over traditional boolean indexing?

**YOUR ANSWER:**



### Q3: How would you optimize a pandas pipeline that's running slowly on large data?

**YOUR ANSWER:**



### Q4: Describe a business scenario from your own experience where advanced filtering and multi-stage data pipelines would be valuable.

**YOUR ANSWER:**



---

# END OF EXERCISES

**Remember to:**
- Save your work regularly
- Test all code before submission
- Compare your solutions with the provided solutions file
- Ask questions if anything is unclear!

**Great job working through these exercises! 🎉**

---