# WEEK 8: ADVANCED FILTERING IN PYTHON - PART 1
## Topic: Complex Boolean Logic for Customer Retention Analysis
## Business Case: Identifying At-Risk Customers in E-Commerce

---

### LEARNING OBJECTIVES:
1. Master complex boolean filtering with multiple conditions
2. Use `&` and `|` operators (equivalent to SQL AND/OR)
3. Apply `.isin()` method (equivalent to SQL IN)
4. Use `~` operator for NOT logic (equivalent to SQL NOT)
5. Handle NaN values in complex conditions
6. Combine conditions with proper parentheses

### BUSINESS CONTEXT:
As a data analyst for Olist (Brazilian e-commerce marketplace), you've been tasked with identifying at-risk customers who may churn. Your analysis will help the marketing team design targeted retention campaigns.

### FROM EXCEL TO PYTHON:
- Excel Multiple Filters → Python boolean indexing with `&`/`|`
- Excel "Filter by List" → Python `.isin()` method
- Excel "Between" custom filter → Python comparison operators or `.between()`
- Excel "Blanks" filter → Python `.isna()` / `.notna()`

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
# Silence warnings to keep the lesson output tidy (avoid this in production analysis)
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Section 1: Loading Data

For this lesson, we'll work with the Olist e-commerce dataset. In a real scenario, you'd load from a database or CSV files.

In [None]:
# NOTE: Update these paths to match your local data location
# Or connect to database using SQLAlchemy

# Example: Load from CSV (adjust paths as needed)
# customers = pd.read_csv('path/to/olist_customers_dataset.csv')
# orders = pd.read_csv('path/to/olist_orders_dataset.csv')
# order_items = pd.read_csv('path/to/olist_order_items_dataset.csv')
# reviews = pd.read_csv('path/to/olist_order_reviews_dataset.csv')
# payments = pd.read_csv('path/to/olist_order_payments_dataset.csv')

# For demonstration, we'll create sample data
# In your exercises, you'll use the actual dataset

# Sample customers data
customers = pd.DataFrame({
    'customer_id': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'],
    'customer_unique_id': ['cu1', 'cu2', 'cu3', 'cu4', 'cu5', 'cu6', 'cu7', 'cu8'],
    'customer_state': ['SP', 'RJ', 'MG', 'BA', 'SP', 'PE', 'SP', 'RJ'],
    'customer_city': ['Sao Paulo', 'Rio de Janeiro', 'Belo Horizonte', 'Salvador', 
                      'Campinas', 'Recife', 'Santos', 'Niteroi']
})

# Sample orders data
orders = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7', 'o8'],
    'customer_id': ['c1', 'c2', 'c3', 'c4', 'c5', 'c1', 'c6', 'c2'],
    'order_status': ['delivered', 'delivered', 'canceled', 'delivered', 'delivered', 
                     'delivered', 'delivered', 'shipped'],
    'order_purchase_timestamp': pd.to_datetime([
        '2018-01-15', '2018-02-20', '2018-03-10', '2018-04-05',
        '2017-12-01', '2017-10-15', '2018-05-20', '2018-06-15'
    ])
})

# Sample order items data
order_items = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7', 'o8'],
    'product_id': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8'],
    'price': [150.50, 220.00, 89.90, 350.00, 125.00, 420.00, 95.50, 180.00],
    'freight_value': [20.50, 35.00, 15.00, 45.00, 18.00, 55.00, 12.00, 25.00]
})

# Sample reviews data
reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3', 'r4', 'r5', 'r6'],
    'order_id': ['o1', 'o2', 'o4', 'o5', 'o6', 'o7'],
    'review_score': [5, 2, 4, 3, 1, 5],
    'review_comment_message': ['Great!', 'Terrible service', None, 'OK', 'Bad product', None]
})

# Sample payments data
payments = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7', 'o8'],
    'payment_type': ['credit_card', 'boleto', 'credit_card', 'voucher', 
                     'credit_card', 'credit_card', 'boleto', 'credit_card'],
    'payment_value': [171.00, 255.00, 104.90, 395.00, 143.00, 475.00, 107.50, 205.00],
    'payment_installments': [3, 1, 2, 1, 2, 12, 1, 4]
})

print("Data loaded successfully!")
print(f"\nCustomers: {len(customers)} rows")
print(f"Orders: {len(orders)} rows")
print(f"Order Items: {len(order_items)} rows")
print(f"Reviews: {len(reviews)} rows")
print(f"Payments: {len(payments)} rows")

## Section 2: Review - Basic Filtering

Let's start with simple filtering, similar to SQL WHERE clauses.

In [None]:
# Simple condition: Find all customers from São Paulo (SP)
sp_customers = customers[customers['customer_state'] == 'SP']
print("Customers from São Paulo:")
print(sp_customers)
print(f"\nTotal: {len(sp_customers)} customers")

In [None]:
# Combine with join: Find delivered orders from São Paulo customers
# Verbose approach (avoid): this works, but it re-runs the same merge three times
sp_delivered_orders = orders.merge(customers, on='customer_id') \
    [(orders.merge(customers, on='customer_id')['customer_state'] == 'SP') & 
     (orders.merge(customers, on='customer_id')['order_status'] == 'delivered')]

# Better approach: Create the merged dataframe once, then filter it
orders_with_customers = orders.merge(customers, on='customer_id')
sp_delivered = orders_with_customers[
    (orders_with_customers['customer_state'] == 'SP') & 
    (orders_with_customers['order_status'] == 'delivered')
]

print("Delivered orders from São Paulo customers:")
print(sp_delivered[['order_id', 'customer_id', 'customer_state', 'order_status']].head(10))

## Section 3: The `.isin()` Method - Equivalent to SQL IN

### BAD PRACTICE: Multiple OR conditions
### BEST PRACTICE: Using `.isin()` method

In [None]:
# BAD PRACTICE: Multiple OR conditions (verbose and error-prone)
southeast_bad = customers[
    (customers['customer_state'] == 'SP') |
    (customers['customer_state'] == 'RJ') |
    (customers['customer_state'] == 'MG')
]

print("BAD approach - Multiple OR conditions:")
print(southeast_bad)
print("\n" + "="*60 + "\n")

In [None]:
# BEST PRACTICE: Using .isin() method (clean and maintainable)
southeast_states = ['SP', 'RJ', 'MG']
southeast_good = customers[customers['customer_state'].isin(southeast_states)]

print("BEST PRACTICE - Using .isin():")
print(southeast_good)
print(f"\nTotal: {len(southeast_good)} customers from Southeast region")

### Business Example: Focus on Top Revenue States

In [None]:
# Find customers from high-value states for targeted campaigns
high_value_states = ['SP', 'RJ', 'MG', 'RS', 'PR']

high_value_customers = customers[customers['customer_state'].isin(high_value_states)]

# Count customers by state
state_counts = high_value_customers.groupby('customer_state')['customer_unique_id'].count() \
    .sort_values(ascending=False)

print("Customer distribution in high-value states:")
print(state_counts)
print(f"\nTotal high-value state customers: {len(high_value_customers)}")
print(f"Percentage of total: {len(high_value_customers)/len(customers)*100:.1f}%")

## Section 4: NOT IN - Using the `~` Operator

The `~` operator negates a boolean Series element by element (equivalent to SQL NOT).
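
As a minimal sketch of how `~` combines with other conditions, the example below uses a hypothetical four-row `demo_orders` table (not the lesson's `orders` data). It assumes nothing beyond standard pandas behavior: wrap a compound condition in parentheses before negating it, and note that NOT (A OR B) selects the same rows as (NOT A) AND (NOT B).

```python
import pandas as pd

# Hypothetical mini-sample that mirrors the shape of the lesson's `orders` table
demo_orders = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4'],
    'order_status': ['delivered', 'canceled', 'shipped', 'unavailable'],
})

is_delivered = demo_orders['order_status'] == 'delivered'
is_shipped = demo_orders['order_status'] == 'shipped'

# Negate a compound condition: wrap it in parentheses first
not_done_a = demo_orders[~(is_delivered | is_shipped)]

# De Morgan's law: NOT (A OR B) is the same as (NOT A) AND (NOT B)
not_done_b = demo_orders[~is_delivered & ~is_shipped]

# The NOT IN form selects the same rows
not_done_c = demo_orders[~demo_orders['order_status'].isin(['delivered', 'shipped'])]

print(not_done_a)
print(not_done_a.equals(not_done_b) and not_done_a.equals(not_done_c))
```

All three forms return the canceled and unavailable orders; the `.isin()` version is usually the easiest to maintain as the list of statuses grows.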

In [None]:
# Find customers NOT from the Southeast region
southeast_states = ['SP', 'RJ', 'MG', 'ES']
non_southeast = customers[~customers['customer_state'].isin(southeast_states)]

print("Customers NOT from Southeast region:")
print(non_southeast)
print(f"\nTotal: {len(non_southeast)} customers")

# Count by state
non_southeast_counts = non_southeast.groupby('customer_state').size() \
    .sort_values(ascending=False)
print("\nDistribution:")
print(non_southeast_counts)

### Business Example: Analyze Non-Delivered Orders (Potential Issues)

In [None]:
# Find orders that are neither delivered nor shipped (potential problems)
successful_statuses = ['delivered', 'shipped']
problem_orders = orders[~orders['order_status'].isin(successful_statuses)]

print("Problem orders (not delivered/shipped):")
print(problem_orders)

# Calculate percentages
status_counts = orders['order_status'].value_counts()
status_pct = (status_counts / len(orders) * 100).round(2)

print("\nOrder status distribution:")
status_df = pd.DataFrame({
    'Count': status_counts,
    'Percentage': status_pct
})
print(status_df)

## Section 5: Range Filtering - Equivalent to SQL BETWEEN

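pandas has no BETWEEN keyword. Instead, combine comparison operators (`>=`, `<=`, `>`, `<`) with `&`. SQL BETWEEN is inclusive on both ends, so use `>=` and `<=` to match it.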

In [None]:
# Find order items with medium-value transactions (R$ 100 - R$ 300)
order_items['total_value'] = order_items['price'] + order_items['freight_value']

medium_value_orders = order_items[
    (order_items['total_value'] >= 100) & 
    (order_items['total_value'] <= 300)
]

print("Medium-value order items (R$ 100-300):")
print(medium_value_orders.sort_values('total_value', ascending=False))
print(f"\nTotal: {len(medium_value_orders)} order items")

### Using `.between()` Method (Alternative Approach)

In [None]:
# Alternative: Use .between() method (more readable)
medium_value_orders_v2 = order_items[
    order_items['total_value'].between(100, 300, inclusive='both')
]

print("Using .between() method:")
print(medium_value_orders_v2.sort_values('total_value', ascending=False))

# Verify both methods select exactly the same rows (not just the same number of rows)
print(f"\nBoth methods match: {medium_value_orders.equals(medium_value_orders_v2)}")
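
The `inclusive` argument controls which endpoints count as "in range". The sketch below uses a hypothetical Series of totals and assumes pandas 1.3 or later, where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'`, and `'right'`.

```python
import pandas as pd

# Hypothetical order totals in R$, chosen to sit on and around the boundaries
values = pd.Series([100.0, 150.0, 300.0, 301.0])

both = values.between(100, 300, inclusive='both')        # 100 <= x <= 300
left = values.between(100, 300, inclusive='left')        # 100 <= x <  300
neither = values.between(100, 300, inclusive='neither')  # 100 <  x <  300

print(both.tolist())     # [True, True, True, False]
print(left.tolist())     # [True, True, False, False]
print(neither.tolist())  # [False, True, False, False]
```

Stating `inclusive` explicitly, as the cell above does, documents the boundary behavior for the next reader.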

### Date Range Filtering

In [None]:
# Find orders placed in Q4 2017 (Oct-Dec 2017)
# Use a half-open range [start, end) so orders placed later in the day on Dec 31 are included
start_date = pd.to_datetime('2017-10-01')
end_date = pd.to_datetime('2018-01-01')  # exclusive upper bound

q4_2017_orders = orders[
    (orders['order_purchase_timestamp'] >= start_date) &
    (orders['order_purchase_timestamp'] < end_date)
]

print("Orders in Q4 2017:")
print(q4_2017_orders.sort_values('order_purchase_timestamp', ascending=False))
print(f"\nTotal: {len(q4_2017_orders)} orders")
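
When timestamps carry a time of day, the choice of upper bound matters. The sketch below uses hypothetical timestamps (the sample data in this lesson has dates only) to compare an inclusive bound at midnight on the last day with a half-open range that ends at the start of the next period.

```python
import pandas as pd

# Hypothetical timestamps; the third one falls on Dec 31 in the afternoon
stamps = pd.Series(pd.to_datetime([
    '2017-10-15 09:30:00',
    '2017-12-31 00:00:00',
    '2017-12-31 15:45:00',
    '2018-01-01 00:00:00',
]))

start = pd.Timestamp('2017-10-01')

# Inclusive upper bound at midnight on Dec 31 misses the 15:45 order
closed = stamps[(stamps >= start) & (stamps <= pd.Timestamp('2017-12-31'))]

# Half-open range [start, next period) keeps the whole final day
half_open = stamps[(stamps >= start) & (stamps < pd.Timestamp('2018-01-01'))]

print(len(closed))     # 2
print(len(half_open))  # 3
```

A date string such as `'2017-12-31'` is parsed as midnight at the start of that day, which is why the inclusive version drops the afternoon order.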

## Section 6: Complex AND/OR Logic

### CRITICAL: Always use parentheses with `&` and `|` operators!

**Operator Precedence:**
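- `&` and `|` bind more tightly than comparison operators (`==`, `>`, `<=`), so an unparenthesized comparison is evaluated in the wrong order
- `&` (AND) takes precedence over `|` (OR)
- Wrap EACH condition in parentheses, and add an outer pair to group OR branches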
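
The precedence rules above can be checked on a small example. This is a sketch on a hypothetical three-row `demo` table; the exact exception type raised by the unparenthesized filter may vary by pandas version, so the code catches both likely types rather than asserting one.

```python
import pandas as pd

# Hypothetical payments sample
demo = pd.DataFrame({
    'payment_type': ['credit_card', 'boleto', 'credit_card'],
    'payment_value': [250.0, 300.0, 50.0],
    'payment_installments': [1, 1, 12],
})

# Missing parentheses: `&` is applied before `==` and `>`, so the filter fails
failed = False
try:
    demo[demo['payment_type'] == 'credit_card' & demo['payment_value'] > 200]
except (TypeError, ValueError) as exc:
    failed = True
    print(f"Failed as expected: {type(exc).__name__}")

is_card = demo['payment_type'] == 'credit_card'
high_value = demo['payment_value'] > 200
many_installments = demo['payment_installments'] >= 10

implicit = high_value | is_card & many_installments      # `&` is applied first
explicit = high_value | (is_card & many_installments)    # same meaning, stated clearly
regrouped = (high_value | is_card) & many_installments   # a different question

print(implicit.equals(explicit))  # True
print(implicit.tolist())          # [True, True, True]
print(regrouped.tolist())         # [False, False, True]
```

The last two lines show why grouping matters: moving the parentheses changes which rows are selected, without raising any error.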

In [None]:
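# WITHOUT PARENTHESES (WRONG! - raises an error)
# `&` binds tighter than `==` and `>`, so Python first tries to evaluate
# 'credit_card' & payments['payment_value'], which fails:
# wrong_filter = payments[payments['payment_type'] == 'credit_card' & payments['payment_value'] > 200]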

print("⚠️ Without parentheses causes errors or unexpected results!")
print("Always wrap EACH condition in parentheses.\n")

In [None]:
# CORRECT: WITH PARENTHESES
credit_high_value = payments[
    (payments['payment_type'] == 'credit_card') & 
    ((payments['payment_value'] > 200) | (payments['payment_installments'] >= 10))
]

print("Credit card payments with (high value OR many installments):")
print(credit_high_value)
print(f"\nLogic: Credit card AND (value > 200 OR installments >= 10)")
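
One way to keep long filters readable is to store each condition in a named variable first. The sketch below applies the same logic (credit card AND (value > 200 OR installments >= 10)) to a hypothetical four-row `demo_payments` table; the mask names are illustrative, not part of the lesson's code.

```python
import pandas as pd

# Hypothetical subset of payment rows
demo_payments = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o4', 'o6'],
    'payment_type': ['credit_card', 'boleto', 'voucher', 'credit_card'],
    'payment_value': [171.00, 255.00, 395.00, 475.00],
    'payment_installments': [3, 1, 1, 12],
})

# Each mask is a boolean Series with a name that states the business rule
is_credit_card = demo_payments['payment_type'] == 'credit_card'
is_high_value = demo_payments['payment_value'] > 200
has_many_installments = demo_payments['payment_installments'] >= 10

# Named masks still need parentheses to group the OR branch
selected = demo_payments[is_credit_card & (is_high_value | has_many_installments)]
print(selected)
```

Named masks can also be inspected on their own (for example with `.sum()`) to see how many rows each rule matches before combining them.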

### Business Example: At-Risk Customer Indicators

In [None]:
# Identify at-risk orders: delivered AND (low review score OR no review) AND high order value
# First, merge data
orders_with_reviews = orders.merge(reviews, on='order_id', how='left')
orders_full = orders_with_reviews.merge(order_items, on='order_id')
orders_full['order_value'] = orders_full['price'] + orders_full['freight_value']

# Define at-risk conditions
at_risk_orders = orders_full[
    (orders_full['order_status'] == 'delivered') &
    (
        (orders_full['review_score'] <= 2) |  # Low satisfaction
        (orders_full['review_score'].isna())    # No feedback (disengaged?)
    ) &
    (orders_full['order_value'] > 150)  # High-value orders only
]

print("At-risk high-value orders:")
print(at_risk_orders[['order_id', 'customer_id', 'order_value', 'review_score', 'order_status']])
print(f"\nTotal: {len(at_risk_orders)} at-risk orders")
print(f"Total value at risk: R$ {at_risk_orders['order_value'].sum():.2f}")
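
The "no review" condition depends on the left join leaving `NaN` where an order has no matching review. As a small sketch on hypothetical tables, the `indicator=True` option of `merge` adds a `_merge` column that makes those unmatched rows explicit.

```python
import pandas as pd

# Hypothetical tables: order o2 has no review
demo_orders = pd.DataFrame({'order_id': ['o1', 'o2', 'o3']})
demo_reviews = pd.DataFrame({'order_id': ['o1', 'o3'], 'review_score': [5, 1]})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join
merged = demo_orders.merge(demo_reviews, on='order_id', how='left', indicator=True)

no_review = merged[merged['_merge'] == 'left_only']

print(merged)
print(no_review['order_id'].tolist())  # ['o2']
```

Note that `review_score` becomes a float column after the join, because the missing value is stored as `NaN`.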

## Section 7: Handling NULL/NaN Values in Complex Conditions

### NULL Complications:
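- Comparisons against NaN (`<`, `<=`, `>`, `>=`, `==`) evaluate to False, so rows with missing values silently drop out of a filter; `!=` evaluates to True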
- Always use `.isna()` or `.notna()` explicitly
- Use `.fillna()` to provide default values
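
The points above can be seen on a three-element example. This is a minimal sketch with a hypothetical `scores` Series; the sentinel value `0` passed to `.fillna()` is an illustrative choice, not a recommendation for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical review scores; the last one is missing
scores = pd.Series([1.0, 4.0, np.nan], name='review_score')

print((scores <= 2).tolist())       # [True, False, False]  NaN fails the test
print((scores > 2).tolist())        # [False, True, False]  ...and its opposite
print((scores != 5).tolist())       # [True, True, True]    NaN != 5 is True
print((scores == scores).tolist())  # [True, True, False]   NaN is not equal to itself

# Explicit missing-value check
print(scores.isna().tolist())       # [False, False, True]

# Negation turns the NaN row into True, which is easy to overlook
print((~(scores <= 2)).tolist())    # [False, True, True]

# Treat a missing score as the lowest possible value before comparing
low_or_missing = scores.fillna(0) <= 2
print(low_or_missing.tolist())      # [True, False, True]
```

Because a missing score fails both `<= 2` and `> 2`, any filter that should include or exclude missing rows needs an explicit `.isna()` or `.notna()` condition.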

In [None]:
# Check for reviews with and without comments
print("Review comment status:")
print(f"Total reviews: {len(reviews)}")
print(f"Reviews with comments: {reviews['review_comment_message'].notna().sum()}")
print(f"Reviews without comments: {reviews['review_comment_message'].isna().sum()}")

# Count by score
review_summary = reviews.groupby('review_score').agg({
    'review_id': 'count',
    'review_comment_message': lambda x: x.notna().sum()
}).rename(columns={
    'review_id': 'total_reviews',
    'review_comment_message': 'reviews_with_comments'
})
review_summary['reviews_without_comments'] = \
    review_summary['total_reviews'] - review_summary['reviews_with_comments']

print("\nReview breakdown by score:")
print(review_summary)

### Filtering with NULL Handling

In [None]:
# Find low-rated reviews WITHOUT comments (missed opportunity for feedback)
low_score_no_comment = reviews[
    (reviews['review_score'] <= 2) &
    (reviews['review_comment_message'].isna())
]

print("Low-rated reviews without comments:")
print(low_score_no_comment)
print("\n⚠️ These customers are unhappy but didn't tell us why!")

In [None]:
# Find low-rated reviews WITH comments (actionable feedback)
low_score_with_comment = reviews[
    (reviews['review_score'] <= 2) &
    (reviews['review_comment_message'].notna())
]

print("Low-rated reviews with comments (actionable):")
print(low_score_with_comment)
print("\n✅ These reviews provide specific improvement opportunities!")

## Section 8: Comprehensive Business Case - At-Risk Customer Identification

### RETENTION RISK FACTORS:
1. Low review scores (1-2 stars) = Dissatisfied
2. High-value customers = Critical to retain
3. Recent purchases = Active but potentially volatile
4. No review submitted = Disengaged

### GOAL: Create a filtered list of at-risk VIP customers

In [None]:
# Build comprehensive customer profile
# Step 1: Calculate customer metrics
customer_orders = orders.merge(customers, on='customer_id')
customer_orders = customer_orders.merge(order_items, on='order_id')
customer_orders = customer_orders.merge(reviews, on='order_id', how='left')

# Calculate lifetime value and metrics
customer_orders['order_value'] = customer_orders['price'] + customer_orders['freight_value']

customer_metrics = customer_orders[
    customer_orders['order_status'] == 'delivered'
].groupby('customer_unique_id').agg({
    'order_id': 'nunique',  # distinct orders; 'count' would tally item and review rows
    'order_value': 'sum',
    'review_score': ['mean', 'min'],
    'order_purchase_timestamp': 'max',
    'customer_state': 'first',
    'customer_city': 'first'
}).reset_index()

# Flatten column names
customer_metrics.columns = ['customer_unique_id', 'total_orders', 'lifetime_value', 
                            'avg_review_score', 'lowest_review_score', 
                            'last_order_date', 'customer_state', 'customer_city']

print("Customer metrics calculated:")
print(customer_metrics.head())

In [None]:
# Step 2: Calculate days since last order
reference_date = pd.to_datetime('2018-09-01')
customer_metrics['days_since_last_order'] = \
    (reference_date - customer_metrics['last_order_date']).dt.days

# Step 3: Create risk category
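def categorize_risk(row):
    # NaN fails every comparison, so check for a missing score first;
    # otherwise customers without reviews would fall through to 'Low Risk'
    if pd.isna(row['avg_review_score']):
        return 'No Review'
    elif row['avg_review_score'] < 3:
        return 'High Risk'
    elif row['avg_review_score'] < 4:
        return 'Medium Risk'
    else:
        return 'Low Risk'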

customer_metrics['risk_category'] = customer_metrics.apply(categorize_risk, axis=1)

print("\nEnriched customer metrics:")
print(customer_metrics)
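
As an alternative to a row-by-row `apply`, the same categories can be assigned with boolean masks and `np.select`. This is a sketch on a hypothetical `demo_metrics` table; the `'No Review'` label for missing scores is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer metrics; cu4 has no review score
demo_metrics = pd.DataFrame({
    'customer_unique_id': ['cu1', 'cu2', 'cu3', 'cu4'],
    'avg_review_score': [2.0, 3.5, 5.0, np.nan],
})

score = demo_metrics['avg_review_score']

# Conditions are checked in order; the first match wins
conditions = [score.isna(), score < 3, score < 4]
labels = ['No Review', 'High Risk', 'Medium Risk']

demo_metrics['risk_category'] = np.select(conditions, labels, default='Low Risk')
print(demo_metrics)
```

Placing the `.isna()` mask first keeps missing scores out of the `default` branch, where they would otherwise be labeled 'Low Risk'.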

In [None]:
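# Step 4: Filter at-risk VIP customers
# Note: risk_category is sorted alphabetically below (High, Low, Medium), not by severity
at_risk_vips = customer_metrics[
    (customer_metrics['lifetime_value'] > 200) &  # High-value customers
    (customer_metrics['total_orders'] >= 1) &  # Always true here; raise to target repeat buyers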
    (
        (customer_metrics['avg_review_score'] < 4) |  # Not fully satisfied
        (customer_metrics['lowest_review_score'] <= 2) |  # Had bad experience
        (customer_metrics['days_since_last_order'].between(60, 180))  # Recent but not current
    )
].sort_values(['risk_category', 'lifetime_value'], ascending=[True, False])

print("\n" + "="*80)
print("AT-RISK VIP CUSTOMERS - RETENTION CAMPAIGN TARGETS")
print("="*80)
print(at_risk_vips)
print(f"\nTotal at-risk VIPs: {len(at_risk_vips)}")
print(f"Total revenue at risk: R$ {at_risk_vips['lifetime_value'].sum():.2f}")
print(f"Average customer value: R$ {at_risk_vips['lifetime_value'].mean():.2f}")

## Key Takeaways

### 1. `.isin()` Method
- Clean way to filter multiple values (equivalent to SQL IN)
- More readable than multiple OR conditions

### 2. `~` Operator (NOT)
- Negates boolean conditions
- Use with `.isin()` for NOT IN functionality

### 3. Range Filtering
- Use `>=` and `<=` for inclusive ranges
- Or use `.between()` method for cleaner code

### 4. Complex Boolean Logic
- **ALWAYS use parentheses** around each condition
- `&` for AND, `|` for OR
- Control evaluation order with parentheses

### 5. NULL/NaN Handling
- Use `.isna()` and `.notna()` explicitly
- Never use `== None` or `!= None`: these comparisons do not detect missing values
- Consider `.fillna()` for default values
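
A short sketch of why `== None` is unreliable, using a hypothetical `comments` Series stored with `dtype=object` (an assumption made so the behavior matches a text column loaded with missing entries):

```python
import pandas as pd

# Hypothetical review comments; two are missing
comments = pd.Series(['Great!', None, 'OK', None],
                     dtype=object, name='review_comment_message')

eq_none = comments == None  # noqa: E711 (shown only to illustrate the pitfall)
print(eq_none.tolist())           # [False, False, False, False]

# .isna() and .notna() find the missing entries correctly
print(comments.isna().tolist())   # [False, True, False, True]
print(comments.notna().tolist())  # [True, False, True, False]
```

The `== None` comparison reports no missing values at all, so a filter built on it would silently return an empty result.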

### 6. Business Value
- Layer multiple conditions to create meaningful segments
- Combine metrics (value, behavior, satisfaction) for holistic view
- Prioritize by business impact (revenue at risk, customer lifetime value)

---

## Next Steps

In **Part 2**, we'll learn about the `.query()` method - a SQL-like way to write filters in Python that's even more readable for complex conditions!

In **Part 3**, we'll cover performance optimization techniques to make these filtering operations faster on large datasets.

## Practice Exercises

Try these yourself before checking the solutions!

**Q1:** Find all payments made with 'boleto' or 'voucher' where the payment value is between R$ 50 and R$ 150.

**Q2:** Identify customers from Northeast states ('BA', 'PE', 'CE') who have placed more than one order.

**Q3:** Find reviews with scores of 1 or 2 where the customer left a comment.

**Q4:** Create a list of dormant high-value customers: those who spent over R$ 300 total but haven't ordered in 200+ days.

*Solutions will be provided in the solutions notebook!*