# WEEK 8: ADVANCED FILTERING IN PYTHON - PART 2
## Topic: The `.query()` Method for SQL-Like Filtering
## Business Case: Multi-Level Customer Segmentation

---

### LEARNING OBJECTIVES:
1. Master the `.query()` method for readable filtering
2. Write SQL-like string expressions in pandas
3. Use variables in query expressions with `@` syntax
4. Combine `.query()` with other pandas methods
5. Understand when to use `.query()` vs boolean indexing
6. Handle complex multi-condition queries efficiently

### BUSINESS CONTEXT:
The `.query()` method lets you write filters using SQL-like syntax, making your code more readable and maintainable - especially for analysts who know SQL well!

### FROM SQL TO PYTHON:
```sql
WHERE customer_state = 'SP' AND order_value > 200
```
becomes
```python
df.query("customer_state == 'SP' and order_value > 200")
```

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")

## Section 1: Setup - Load Sample Data

In [None]:
# Create sample Olist e-commerce data
customers = pd.DataFrame({
    'customer_id': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10'],
    'customer_unique_id': ['cu1', 'cu2', 'cu3', 'cu4', 'cu5', 'cu6', 'cu7', 'cu8', 'cu9', 'cu10'],
    'customer_state': ['SP', 'RJ', 'MG', 'BA', 'SP', 'PE', 'SP', 'RJ', 'RS', 'PR'],
    'customer_city': ['Sao Paulo', 'Rio de Janeiro', 'Belo Horizonte', 'Salvador', 
                      'Campinas', 'Recife', 'Santos', 'Niteroi', 'Porto Alegre', 'Curitiba']
})

orders = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7', 'o8', 'o9', 'o10'],
    'customer_id': ['c1', 'c2', 'c3', 'c4', 'c5', 'c1', 'c6', 'c2', 'c7', 'c8'],
    'order_status': ['delivered', 'delivered', 'canceled', 'delivered', 'delivered', 
                     'delivered', 'delivered', 'shipped', 'delivered', 'delivered'],
    'order_purchase_timestamp': pd.to_datetime([
        '2018-01-15', '2018-02-20', '2018-03-10', '2018-04-05', '2017-12-01',
        '2017-10-15', '2018-05-20', '2018-06-15', '2018-07-10', '2018-08-05'
    ])
})

order_items = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7', 'o8', 'o9', 'o10'],
    'product_id': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8', 'p9', 'p10'],
    'price': [150.50, 220.00, 89.90, 350.00, 125.00, 420.00, 95.50, 180.00, 275.00, 310.00],
    'freight_value': [20.50, 35.00, 15.00, 45.00, 18.00, 55.00, 12.00, 25.00, 32.00, 40.00]
})

reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8'],
    'order_id': ['o1', 'o2', 'o4', 'o5', 'o6', 'o7', 'o9', 'o10'],
    'review_score': [5, 2, 4, 3, 1, 5, 4, 2],
    'review_comment_message': ['Great!', 'Terrible', None, 'OK', 'Bad product', None, 'Good', 'Poor quality']
})

# Create comprehensive dataset
customer_data = orders.merge(customers, on='customer_id') \
    .merge(order_items, on='order_id') \
    .merge(reviews, on='order_id', how='left')

customer_data['order_value'] = customer_data['price'] + customer_data['freight_value']
customer_data['order_month'] = customer_data['order_purchase_timestamp'].dt.to_period('M')

print("Data loaded successfully!")
print(f"Total records: {len(customer_data)}")
print("\nDataset preview:")
print(customer_data.head())

## Section 2: Introduction to `.query()` Method

### Basic Syntax:
```python
df.query("column_name operator value")
```

### Key Differences from Boolean Indexing:
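- Use `and` instead of `&`
- Use `or` instead of `|`
- Use `not` instead of `~`
- Use `==` for equality (same as boolean indexing)
- No need for parentheses around each condition, because `and` and `or` bind more loosely than comparisons
- `&`, `|`, and `~` are still accepted inside the query string, so habits from Part 1 keep working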
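
To make these differences concrete, here is a minimal sketch on a hypothetical four-row DataFrame (`toy` is an illustrative name, not the notebook's dataset). It compares a boolean mask, a query with word operators, and a query with symbol operators.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "customer_state": ["SP", "RJ", "SP", "MG"],
    "order_status": ["delivered", "delivered", "canceled", "delivered"],
})

# Boolean indexing: each condition needs its own parentheses
mask_result = toy[
    (toy["customer_state"] == "SP") & (toy["order_status"] == "delivered")
]

# .query() with word operators: no extra parentheses required
word_result = toy.query("customer_state == 'SP' and order_status == 'delivered'")

# .query() also accepts & inside the string and treats it like `and`
symbol_result = toy.query("customer_state == 'SP' & order_status == 'delivered'")

print(mask_result.equals(word_result), word_result.equals(symbol_result))
```

All three filters should select the same single row here, which is why the choice between them is mostly about readability.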

In [None]:
# COMPARISON: Boolean Indexing vs .query()

# Boolean indexing (Part 1 approach)
sp_delivered_bool = customer_data[
    (customer_data['customer_state'] == 'SP') & 
    (customer_data['order_status'] == 'delivered')
]

# Using .query() - more readable!
sp_delivered_query = customer_data.query(
    "customer_state == 'SP' and order_status == 'delivered'"
)

print("Boolean Indexing Result:")
print(sp_delivered_bool[['order_id', 'customer_state', 'order_status']].head())
print(f"\nRows: {len(sp_delivered_bool)}")

print("\n" + "="*60 + "\n")

print(".query() Method Result:")
print(sp_delivered_query[['order_id', 'customer_state', 'order_status']].head())
print(f"\nRows: {len(sp_delivered_query)}")

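# Verify the claim instead of asserting it: .equals() compares values, index, and dtypes
print(f"\nBoth methods produce identical results: {sp_delivered_bool.equals(sp_delivered_query)}")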

## Section 3: Basic `.query()` Operations

### Comparison Operators:
- `==` equal
- `!=` not equal
- `>`, `>=`, `<`, `<=` comparisons
- `in` for membership (like SQL IN)
- `not in` for exclusion (like SQL NOT IN)
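
Range filters (SQL `BETWEEN`) can be written with these comparison operators too. The sketch below, on a hypothetical `toy` DataFrame, shows a chained comparison next to the explicit `and` form and the `.between()` mask it corresponds to.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_value": [80.0, 100.0, 175.5, 300.0],
})

# Chained comparison, read the way you would write it on paper
chained = toy.query("100 <= order_value <= 250")

# Equivalent explicit form with `and`
explicit = toy.query("order_value >= 100 and order_value <= 250")

# Equivalent boolean mask (both bounds inclusive by default)
between = toy[toy["order_value"].between(100, 250)]

print(chained)
```

The chained form is the shortest, but the explicit form may be easier to scan when the two bounds come from different variables.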

In [None]:
# Simple numeric comparison
high_value_orders = customer_data.query("order_value > 200")
print("High-value orders (>R$ 200):")
print(high_value_orders[['order_id', 'order_value', 'customer_state']])
print(f"\nTotal: {len(high_value_orders)} orders")

In [None]:
# Using 'in' operator (equivalent to .isin())
southeast_customers = customer_data.query(
    # ES is included so the list matches the 'not in' example in the next cell
    "customer_state in ['SP', 'RJ', 'MG', 'ES']"
)

print("Customers from Southeast region:")
print(southeast_customers[['customer_id', 'customer_state', 'customer_city']])
print(f"\nTotal: {len(southeast_customers)} records")

In [None]:
# Using 'not in' operator
non_southeast = customer_data.query(
    "customer_state not in ['SP', 'RJ', 'MG', 'ES']"
)

print("Customers NOT from Southeast:")
print(non_southeast[['customer_id', 'customer_state', 'order_value']])
print(f"\nTotal: {len(non_southeast)} records")

## Section 4: Complex Multi-Condition Queries

### Logical Operators:
- `and` - both conditions must be true
- `or` - at least one condition must be true
- `not` - negates a condition

### Pro Tip: Use parentheses for complex logic!
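
Why the parentheses matter: `and` binds more tightly than `or`, so an ungrouped expression may not mean what you intended. A minimal sketch on hypothetical data (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "order_status": ["delivered", "canceled", "delivered", "canceled"],
    "review_score": [1, 1, 5, 5],
    "order_value": [300.0, 300.0, 300.0, 50.0],
})

# Parsed as: (delivered and score <= 2) or (value > 250)
ungrouped = toy.query(
    "order_status == 'delivered' and review_score <= 2 or order_value > 250"
)

# Parsed as: delivered and (score <= 2 or value > 250)
grouped = toy.query(
    "order_status == 'delivered' and (review_score <= 2 or order_value > 250)"
)

print("ungrouped rows:", list(ungrouped.index))
print("grouped rows:  ", list(grouped.index))
```

The ungrouped version lets a canceled order through because the `or` branch alone is enough to match; the grouped version keeps the `delivered` requirement on every row.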

In [None]:
# Multiple AND conditions
vip_orders = customer_data.query(
    "order_value > 200 and order_status == 'delivered' and customer_state == 'SP'"
)

print("VIP orders from São Paulo:")
print(vip_orders[['order_id', 'customer_state', 'order_value', 'order_status']])
print(f"\nTotal value: R$ {vip_orders['order_value'].sum():.2f}")

In [None]:
# Combining AND and OR with parentheses
at_risk_orders = customer_data.query(
    "order_status == 'delivered' and "
    "(review_score <= 2 or review_score.isna()) and "
    "order_value > 150"
)

print("At-risk high-value orders:")
print(at_risk_orders[['order_id', 'order_value', 'review_score', 'customer_state']])
print(f"\nTotal at-risk revenue: R$ {at_risk_orders['order_value'].sum():.2f}")
print(f"Average order value: R$ {at_risk_orders['order_value'].mean():.2f}")

## Section 5: Using Variables in Queries with `@` Symbol

Prefix a Python variable with `@` to reference it inside a query string. Names without `@` are treated as column (or index) names.
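
As a rough illustration of why `@` is convenient, the sketch below compares it with building the same query via an f-string, and shows `@` disambiguating a local variable that shares a column's name. The `toy` data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "customer_state": ["SP", "RJ", "MG"],
    "order_value": [250.0, 120.0, 310.0],
})

min_value = 200
target_states = ["SP", "MG"]

# @ lets pandas look the values up; no manual quoting or formatting
with_at = toy.query("order_value > @min_value and customer_state in @target_states")

# f-string alternative: the values are pasted into the string as text,
# so lists and strings need repr-style formatting (!r) to stay valid
with_fstring = toy.query(
    f"order_value > {min_value} and customer_state in {target_states!r}"
)

# A local variable may share a column's name: the bare name is the column,
# the @-prefixed name is the Python variable
order_value = 300
above_local = toy.query("order_value > @order_value")

print(with_at.equals(with_fstring))
print(above_local)
```

The f-string version works here, but it becomes fragile once values contain quotes, which is one reason to prefer `@` for anything user-supplied.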

In [None]:
# Define threshold variables
min_value = 200
target_states = ['SP', 'RJ']
target_status = 'delivered'

# Use variables in query with @ symbol
filtered_orders = customer_data.query(
    "order_value > @min_value and "
    "customer_state in @target_states and "
    "order_status == @target_status"
)

print(f"Orders with value > R$ {min_value} from states {target_states}:")
print(filtered_orders[['order_id', 'customer_state', 'order_value']])
print(f"\nTotal: {len(filtered_orders)} orders")

In [None]:
# Dynamic filtering - easy to change thresholds
def get_high_value_customers(df, min_order_value, min_review_score):
    """
    Find high-value customers with good satisfaction.
    
    Parameters:
    - df: customer data DataFrame
    - min_order_value: minimum order value threshold
    - min_review_score: minimum acceptable review score
    """
    result = df.query(
        "order_value >= @min_order_value and "
        "review_score >= @min_review_score"
    )
    return result

# Test with different thresholds
satisfied_vips = get_high_value_customers(customer_data, 200, 4)
print("High-value satisfied customers:")
print(satisfied_vips[['customer_id', 'order_value', 'review_score']])
print(f"\nCount: {len(satisfied_vips)}")

## Section 6: String Operations in Queries

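You can call `.str` accessor methods directly inside `.query()`. If the default engine rejects a method call in your environment, passing `engine='python'` to `.query()` is a common workaround.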
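
One practical wrinkle is missing text: `.str.contains()` returns a missing value for a missing string, and a mask containing missing values cannot be used to filter. The sketch below assumes a hypothetical `toy` DataFrame with one missing city and passes `na=False` so those rows simply do not match. `engine="python"` is used here as a cautious choice, not a requirement.

```python
import pandas as pd

# Hypothetical toy data with one missing city, used only for illustration
toy = pd.DataFrame({
    "customer_city": ["Sao Paulo", None, "Santos", "sao carlos"],
})

# Case-sensitive match; na=False treats the missing city as "no match"
matches = toy.query(
    "customer_city.str.contains('Sao', na=False)", engine="python"
)

# Case-insensitive match
matches_ci = toy.query(
    "customer_city.str.contains('sao', case=False, na=False)", engine="python"
)

# Prefix match with another .str method
starts = toy.query(
    "customer_city.str.startswith('San', na=False)", engine="python"
)

print(list(matches.index), list(matches_ci.index), list(starts.index))
```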

In [None]:
# String contains (case-sensitive)
sao_paulo_customers = customer_data.query(
    "customer_city.str.contains('Sao')"
)

print("Customers from cities containing 'Sao':")
print(sao_paulo_customers[['customer_id', 'customer_city', 'customer_state']].drop_duplicates())

# Case-insensitive search
paulo_customers = customer_data.query(
    "customer_city.str.contains('paulo', case=False)"
)
print("\nCase-insensitive search for 'paulo':")
print(paulo_customers[['customer_id', 'customer_city']].drop_duplicates())

## Section 7: Chaining `.query()` with Other Methods

The power of `.query()` is that it works seamlessly in method chains!
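
One way to keep such a chain tidy is named aggregation, which sets the output column names inside `.agg()` so no separate renaming step is needed. This is a sketch on hypothetical data, offered as an alternative style rather than a replacement for the cells below.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_state": ["SP", "SP", "RJ", "RJ"],
    "order_status": ["delivered", "delivered", "delivered", "canceled"],
    "order_value": [100.0, 300.0, 150.0, 999.0],
})

# filter -> group -> aggregate -> sort, wrapped in parentheses
# so no backslash line continuations are needed
summary = (
    toy
    .query("order_status == 'delivered'")
    .groupby("customer_state")
    .agg(
        order_count=("order_id", "count"),
        total_revenue=("order_value", "sum"),
        avg_order_value=("order_value", "mean"),
    )
    .sort_values("total_revenue", ascending=False)
)

print(summary)
```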

In [None]:
# Method chaining: filter → group → aggregate
state_performance = customer_data \
    .query("order_status == 'delivered'") \
    .groupby('customer_state') \
    .agg({
        'order_id': 'count',
        'order_value': ['sum', 'mean'],
        'review_score': 'mean'
    }) \
    .round(2)

state_performance.columns = ['order_count', 'total_revenue', 'avg_order_value', 'avg_review_score']
state_performance = state_performance.sort_values('total_revenue', ascending=False)

print("State Performance Summary (Delivered Orders Only):")
print(state_performance)
print(f"\nTotal revenue: R$ {state_performance['total_revenue'].sum():.2f}")

In [None]:
# Complex chaining: filter → calculate → filter again → sort
premium_customers = customer_data \
    .query("order_status == 'delivered'") \
    .assign(
        revenue_category=lambda x: pd.cut(
            x['order_value'],
            bins=[0, 100, 200, 1000],
            labels=['Low', 'Medium', 'High']
        )
    ) \
    .query("revenue_category == 'High'") \
    .sort_values('order_value', ascending=False)

print("Premium orders (High revenue category):")
print(premium_customers[['order_id', 'customer_state', 'order_value', 'revenue_category']])
print(f"\nHigh-value order count: {len(premium_customers)}")

## Section 8: Business Case - Multi-Level Customer Segmentation

Let's build a complete customer segmentation using `.query()` for each segment.

In [None]:
# Define segmentation criteria
high_value_threshold = 200
satisfied_threshold = 4
at_risk_score = 3
target_regions = ['SP', 'RJ', 'MG']

# Segment 1: Champions (High value + High satisfaction + Target region)
champions = customer_data.query(
    "order_value >= @high_value_threshold and "
    "review_score >= @satisfied_threshold and "
    "customer_state in @target_regions and "
    "order_status == 'delivered'"
)

print("SEGMENT 1: CHAMPIONS")
print("=" * 60)
print(champions[['customer_id', 'customer_state', 'order_value', 'review_score']])
print(f"\nChampions: {len(champions)} customers")
print(f"Total revenue: R$ {champions['order_value'].sum():.2f}")
print(f"Avg order value: R$ {champions['order_value'].mean():.2f}")
print(f"Avg satisfaction: {champions['review_score'].mean():.2f}")

In [None]:
# Segment 2: At Risk (High value BUT low satisfaction)
at_risk = customer_data.query(
    "order_value >= @high_value_threshold and "
    "review_score <= @at_risk_score and "
    "review_score.notna() and "
    "order_status == 'delivered'"
)

print("\n\nSEGMENT 2: AT RISK (High Value, Low Satisfaction)")
print("=" * 60)
print(at_risk[['customer_id', 'customer_state', 'order_value', 'review_score']])
print(f"\nAt-risk customers: {len(at_risk)}")
print(f"Revenue at risk: R$ {at_risk['order_value'].sum():.2f}")
print(f"Avg satisfaction: {at_risk['review_score'].mean():.2f}")
print("\n⚠️ URGENT: These customers need immediate attention!")

In [None]:
# Segment 3: Silent Customers (No reviews submitted)
silent_customers = customer_data.query(
    "order_status == 'delivered' and "
    "review_score.isna()"
)

print("\n\nSEGMENT 3: SILENT CUSTOMERS (No Reviews)")
print("=" * 60)
print(silent_customers[['customer_id', 'customer_state', 'order_value', 'order_status']])
print(f"\nSilent customers: {len(silent_customers)}")
print(f"Total revenue: R$ {silent_customers['order_value'].sum():.2f}")
print("\n💡 Opportunity: Encourage these customers to leave reviews!")

In [None]:
# Segment 4: Potential Loyal (Medium value + Good satisfaction)
potential_loyal = customer_data.query(
    "order_value >= 150 and order_value < @high_value_threshold and "
    "review_score >= @satisfied_threshold and "
    "order_status == 'delivered'"
)

print("\n\nSEGMENT 4: POTENTIAL LOYAL")
print("=" * 60)
print(potential_loyal[['customer_id', 'customer_state', 'order_value', 'review_score']])
print(f"\nPotential loyal: {len(potential_loyal)} customers")
print(f"Total revenue: R$ {potential_loyal['order_value'].sum():.2f}")
print("\n✅ Strategy: Upsell these satisfied customers to Champion status!")

### Segment Summary Dashboard

In [None]:
# Create summary of all segments
segment_summary = pd.DataFrame({
    'Segment': ['Champions', 'At Risk', 'Silent Customers', 'Potential Loyal'],
    'Customer Count': [
        len(champions),
        len(at_risk),
        len(silent_customers),
        len(potential_loyal)
    ],
    'Total Revenue': [
        champions['order_value'].sum(),
        at_risk['order_value'].sum(),
        silent_customers['order_value'].sum(),
        potential_loyal['order_value'].sum()
    ],
    'Avg Order Value': [
        champions['order_value'].mean(),
        at_risk['order_value'].mean(),
        silent_customers['order_value'].mean(),
        potential_loyal['order_value'].mean()
    ]
}).round(2)

segment_summary['Revenue %'] = (
    segment_summary['Total Revenue'] / segment_summary['Total Revenue'].sum() * 100
).round(2)

print("\n\nCUSTOMER SEGMENTATION DASHBOARD")
print("=" * 80)
print(segment_summary.to_string(index=False))
print("\n" + "=" * 80)
# Each row is one order, so this total counts segmented orders rather than unique customers
print(f"\nTotal orders segmented: {segment_summary['Customer Count'].sum()}")
print(f"Total revenue: R$ {segment_summary['Total Revenue'].sum():.2f}")
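
Because each segment above is just a query string, the rules can also be kept in one dictionary and applied in a loop. The following is a sketch under assumptions: the `toy` data, the `segment_rules` name, and the simplified rules (no region or status conditions) are hypothetical, and `engine="python"` is a cautious choice for the `.isna()` call.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_value": [250.0, 300.0, 220.0, 160.0],
    "review_score": [5.0, 2.0, None, 4.0],
})

high_value_threshold = 200
satisfied_threshold = 4
at_risk_score = 3

# One query string per segment (simplified versions of the rules above)
segment_rules = {
    "Champions": "order_value >= @high_value_threshold and review_score >= @satisfied_threshold",
    "At Risk": "order_value >= @high_value_threshold and review_score <= @at_risk_score",
    "Silent": "review_score.isna()",
    "Potential Loyal": "150 <= order_value < @high_value_threshold and review_score >= @satisfied_threshold",
}

# A plain loop keeps the @ variables visible to .query(),
# which looks them up in the calling scope
segments = {}
for name, rule in segment_rules.items():
    segments[name] = toy.query(rule, engine="python")

for name, frame in segments.items():
    print(f"{name}: {list(frame['order_id'])}")
```

Keeping the rules as data makes it easier to review them side by side and to add a segment without copying a whole cell.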

## Section 9: When to Use `.query()` vs Boolean Indexing

### Use `.query()` when:
- ✅ You have multiple conditions (more readable)
- ✅ Your team knows SQL well
- ✅ You want to write more maintainable code
- ✅ Working with many column names

### Use Boolean Indexing when:
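- ✅ The filter is a single simple condition
- ✅ You need pandas operations that are awkward to express inside a string
- ✅ Column names have spaces or special characters (`.query()` can handle these, but only with backtick quoting)
- ✅ You filter on index labels or positions, where `.loc` / `.iloc` are usually more direct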
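
For reference, `.query()` is not completely stuck in the last two situations: backticks quote column names that are not valid Python identifiers, and the index can be referenced as `index` or by its name. A minimal sketch on hypothetical data (the column `order value` and the index name `row_key` are made up):

```python
import pandas as pd

# Hypothetical toy data with a spaced column name and a named index
toy = pd.DataFrame(
    {
        "order value": [120.0, 260.0, 310.0],
        "customer_state": ["SP", "RJ", "SP"],
    },
    index=pd.Index([10, 20, 30], name="row_key"),
)

# Backticks quote a column name that contains a space
spaced = toy.query("`order value` > 200")

# The index is available under the generic name `index` ...
by_index = toy.query("index >= 20")

# ... and under its own name, mixed freely with column conditions
by_name = toy.query("row_key >= 20 and customer_state == 'SP'")

print(list(spaced.index), list(by_index.index), list(by_name.index))
```

Whether this reads better than a boolean mask is a judgment call; with many backticked names the string can get harder to scan than the mask.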

In [None]:
# Comparison: readability, plus a rough single-run timing
# (one run on a ten-row frame is dominated by overhead, so treat the timings as illustrative only)
import time

# Complex filter with boolean indexing
start = time.perf_counter()  # perf_counter is the recommended clock for short intervals
result1 = customer_data[
    (customer_data['order_value'] > 150) &
    (customer_data['customer_state'].isin(['SP', 'RJ'])) &
    (customer_data['order_status'] == 'delivered') &
    (
        (customer_data['review_score'] >= 4) |
        (customer_data['review_score'].isna())
    )
]
time1 = time.perf_counter() - start

# Same filter with .query()
start = time.perf_counter()
result2 = customer_data.query(
    "order_value > 150 and "
    "customer_state in ['SP', 'RJ'] and "
    "order_status == 'delivered' and "
    "(review_score >= 4 or review_score.isna())"
)
time2 = time.perf_counter() - start

print("Boolean Indexing:")
print(f"Execution time: {time1*1000:.2f}ms")
print(f"Results: {len(result1)} rows")

print("\n.query() Method:")
print(f"Execution time: {time2*1000:.2f}ms")
print(f"Results: {len(result2)} rows")

print("\n📊 Readability winner: .query() (much cleaner!)")
# Compare the full frames, not just the row counts
print(f"Results match: {result1.equals(result2)}")
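
A single timed run is noisy, so a more dependable comparison repeats each filter many times. The sketch below uses the standard-library `timeit` module on a hypothetical 1,000-row frame. The sizes and repeat counts are arbitrary assumptions, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

# Hypothetical toy data: 1,000 rows built by repeating four orders
toy = pd.DataFrame({
    "order_value": [120.0, 260.0, 310.0, 90.0] * 250,
    "customer_state": ["SP", "RJ", "MG", "SP"] * 250,
})


def with_mask():
    """Filter with boolean indexing."""
    return toy[
        (toy["order_value"] > 150) & (toy["customer_state"].isin(["SP", "RJ"]))
    ]


def with_query():
    """Same filter expressed with .query()."""
    return toy.query("order_value > 150 and customer_state in ['SP', 'RJ']")


# Confirm both approaches agree before comparing their speed
results_match = with_mask().equals(with_query())

# Best of 5 batches, 20 calls per batch; the minimum is the least noisy figure
mask_best = min(timeit.repeat(with_mask, number=20, repeat=5))
query_best = min(timeit.repeat(with_query, number=20, repeat=5))

print(f"Results match: {results_match}")
print(f"Boolean indexing: {mask_best / 20 * 1000:.3f} ms per call")
print(f".query():         {query_best / 20 * 1000:.3f} ms per call")
```

On small frames the string parsing in `.query()` tends to dominate, so readability rather than speed is usually the deciding factor at this scale.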

## Key Takeaways

### 1. `.query()` Syntax
- Use `and`, `or`, `not` instead of `&`, `|`, `~`
- No need for excessive parentheses
- String-based expressions are more readable

### 2. Variable References
- Use `@variable_name` to reference Python variables
- Makes queries dynamic and reusable
- Perfect for parameterized functions

### 3. String Operations
- Can use `.str.contains()`, `.str.startswith()` etc.
- Handles case sensitivity with `case=False`

### 4. Method Chaining
- `.query()` works seamlessly in pandas pipelines
- Enables clear, readable data transformation workflows

### 5. Business Applications
- Customer segmentation
- Multi-criteria filtering
- Dynamic threshold-based analysis
- Readable, maintainable analytics code

---

## Next Steps

In **Part 3**, we'll explore performance optimization techniques for filtering large datasets, including:
- Vectorized operations
- Memory-efficient filtering
- Using `eval()` for complex expressions
- Benchmarking different approaches

## Practice Exercises

Try writing each of these filters with `.query()`:

**Q1:** Find all orders where value is between R$ 100-250 AND customer is from SP or RJ.

**Q2:** Identify customers with review scores below 3 OR no review submitted, from any state except SP.

**Q3:** Create a dynamic function that accepts min_value and states list as parameters.

**Q4:** Build a customer segment finder that accepts multiple thresholds.

*Solutions in the solutions notebook!*