# Polars Joins - Comprehensive Guide

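This notebook covers the seven join strategies accepted by `DataFrame.join` in Polars (the `how` argument), with practical examples. Specialized joins such as `join_asof` are out of scope.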

## Join Types:
- **inner**: Returns rows that have matching values in both tables
- **left**: Returns all rows from left table + matched rows from right
- **right**: Returns all rows from right table + matched rows from left
- **full/outer**: Returns all rows from both tables, with nulls where a row has no match on the other side
- **cross**: Returns the Cartesian product of rows from both tables
- **semi**: Returns rows from left that have a match in right (no right columns)
- **anti**: Returns rows from left that have NO match in right

In [None]:
import polars as pl
import numpy as np
from datetime import datetime, timedelta

## Setup: Create Sample DataFrames

We'll use a realistic e-commerce scenario with customers and orders.

In [None]:
# Customers DataFrame
customers = pl.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'signup_date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-12']
})

# Orders DataFrame (not every customer has an order, and some orders reference customers missing from our list)
orders = pl.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 1, 2, 3, 6, 7],  # Note: customer_id 6 and 7 don't exist in customers
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam', 'Headset'],
    'amount': [1200.00, 25.00, 75.00, 350.00, 89.99, 120.00],
    'order_date': ['2023-06-01', '2023-06-05', '2023-06-03', '2023-06-07', '2023-06-10', '2023-06-12']
})

print("CUSTOMERS:")
print(customers)
print("\nORDERS:")
print(orders)

## 1. INNER JOIN

**Use Case**: When you only want rows where there's a match in BOTH tables.

**Example**: Find all customers who have placed orders (ignore customers without orders and orders without matching customers)

In [None]:
inner_result = customers.join(orders, on='customer_id', how='inner')

print("INNER JOIN - Only customers with orders:")
print(inner_result)
print(f"\nRows: {len(inner_result)}")
print("\nNote: Diana (4) and Eve (5) are excluded (no orders)")
print("Note: Orders from customer_id 6 and 7 are excluded (customers don't exist)")

## 2. LEFT JOIN

**Use Case**: When you want ALL rows from the left table, regardless of matches.

**Example**: List all customers and their orders (if any). Customers without orders will have null values in order columns.

In [None]:
left_result = customers.join(orders, on='customer_id', how='left')

print("LEFT JOIN - All customers, with their orders if any:")
print(left_result)
print(f"\nRows: {len(left_result)}")
print("\nNote: Diana and Eve appear with null order values")
print("Note: Orders from customer_id 6 and 7 are still excluded")

### Left Join - Finding Customers Without Orders

In [None]:
# Practical use: Find customers who haven't ordered yet (for marketing campaign)
customers_no_orders = left_result.filter(pl.col('order_id').is_null())

print("Customers without orders (target for marketing):")
print(customers_no_orders.select(['customer_id', 'name', 'city']))

## 3. RIGHT JOIN

**Use Case**: When you want ALL rows from the right table, regardless of matches.

**Example**: List all orders with customer information (if available). Orders without matching customers will have null customer values.

In [None]:
right_result = customers.join(orders, on='customer_id', how='right')

print("RIGHT JOIN - All orders, with customer info if available:")
print(right_result)
print(f"\nRows: {len(right_result)}")
print("\nNote: Orders from customer_id 6 and 7 appear with null customer info")
print("Note: Diana and Eve don't appear (they have no orders)")

### Right Join - Finding Orphaned Orders

In [None]:
# Practical use: Find orders with invalid customer_ids (data quality issue)
orphaned_orders = right_result.filter(pl.col('name').is_null())

print("Orders with invalid customer_ids (data issue):")
print(orphaned_orders.select(['order_id', 'customer_id', 'product', 'amount']))

## 4. FULL/OUTER JOIN

**Use Case**: When you want ALL rows from BOTH tables, with nulls where there's no match.

**Example**: Complete view of customers and orders - see everyone and everything.

In [None]:
full_result = customers.join(orders, on='customer_id', how='full')

print("FULL/OUTER JOIN - All customers and all orders:")
print(full_result)
print(f"\nRows: {len(full_result)}")
print("\nNote: Includes customers without orders (Diana, Eve)")
print("Note: Includes orders without valid customers (customer_id 6, 7)")

### Full Join - Analysis

In [None]:
# Analyze the full join results
print("Summary of Full Join:")
print(f"Total rows: {len(full_result)}")
print(f"Customers without orders: {full_result.filter(pl.col('order_id').is_null()).height}")
print(f"Orders without valid customers: {full_result.filter(pl.col('name').is_null()).height}")
print(f"Valid customer-order pairs: {full_result.filter(pl.col('name').is_not_null() & pl.col('order_id').is_not_null()).height}")

## 5. CROSS JOIN (Cartesian Product)

**Use Case**: When you need every possible combination of rows from both tables.

**Example**: Generate all possible customer-product combinations (for recommendation system).

In [None]:
# Create a products catalog
products = pl.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [1200.00, 25.00, 75.00],
    'category': ['Electronics', 'Accessories', 'Accessories']
})

# Create smaller customer list for cross join demo
customers_small = customers.head(3).select(['customer_id', 'name'])

print("Customers:")
print(customers_small)
print("\nProducts:")
print(products)

In [None]:
# Cross join: every customer with every product
cross_result = customers_small.join(products, how='cross')

print("\nCROSS JOIN - All possible customer-product combinations:")
print(cross_result)
print(f"\nRows: {len(cross_result)} = {len(customers_small)} customers × {len(products)} products")

### Cross Join - Practical Example: Product Recommendations

In [None]:
# Use a cross join to build a recommendation matrix,
# then filter out products the customer has already bought

all_combinations = customers.select(['customer_id', 'name']).join(products, how='cross')

# Get products each customer already ordered
purchased = orders.select(['customer_id', 'product']).unique()

# Recommend products NOT yet purchased
recommendations = all_combinations.join(
    purchased,
    left_on=['customer_id', 'product_name'],
    right_on=['customer_id', 'product'],
    how='anti'  # Keep only non-matches (see anti join below)
)

print("Product recommendations (products customer hasn't bought):")
print(recommendations.sort('customer_id', 'product_name'))

## 6. SEMI JOIN

**Use Case**: When you want rows from LEFT table that have a match in RIGHT table, but you DON'T need columns from the right table.

**Example**: Find customers who have placed orders (but we don't need order details).

In [None]:
semi_result = customers.join(orders, on='customer_id', how='semi')

print("SEMI JOIN - Customers who have placed orders:")
print(semi_result)
print(f"\nRows: {len(semi_result)}")
print("\nNote: Only customer columns, no order details")
print("Note: Each customer appears once (even if they have multiple orders)")

### Semi Join vs Inner Join

In [None]:
print("Comparison: Semi Join vs Inner Join")
print("\nSemi Join (customers who ordered):")
print(semi_result)
print(f"Rows: {len(semi_result)}")

print("\nInner Join (customer-order pairs):")
print(inner_result.select(customers.columns))  # Show only customer columns for comparison
print(f"Rows: {len(inner_result)}")

print("\nDifference: Alice appears once in semi join, twice in inner join (she has 2 orders)")

### Semi Join - Practical Use Cases

In [None]:
# Example 1: Find customers who bought high-value items (>$100)
high_value_orders = orders.filter(pl.col('amount') > 100)
premium_customers = customers.join(high_value_orders, on='customer_id', how='semi')

print("Premium customers (bought items >$100):")
print(premium_customers)

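# Example 2: Find active customers (ordered after 2023-06-05)
# Note: order_date is a string column; the comparison works because ISO dates sort lexicographically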
recent_orders = orders.filter(pl.col('order_date') > '2023-06-05')
active_customers = customers.join(recent_orders, on='customer_id', how='semi')

print("\nActive customers (ordered after 2023-06-05):")
print(active_customers)

## 7. ANTI JOIN

**Use Case**: When you want rows from LEFT table that DON'T have a match in RIGHT table.

**Example**: Find customers who have NEVER placed an order.

In [None]:
anti_result = customers.join(orders, on='customer_id', how='anti')

print("ANTI JOIN - Customers who have NOT placed any orders:")
print(anti_result)
print(f"\nRows: {len(anti_result)}")
print("\nNote: Only Diana and Eve (no orders)")
print("Note: Only customer columns (like semi join)")

### Anti Join - Practical Use Cases

In [None]:
# Example 1: Find customers who haven't ordered specific products
laptop_orders = orders.filter(pl.col('product') == 'Laptop')
customers_without_laptop = customers.join(laptop_orders, on='customer_id', how='anti')

print("Customers who haven't bought a laptop:")
print(customers_without_laptop)

# Example 2: Find customers who ordered in the past but not recently
recent_orders = orders.filter(pl.col('order_date') > '2023-06-05')
customers_with_orders = customers.join(orders, on='customer_id', how='semi')
churned_customers = customers_with_orders.join(recent_orders, on='customer_id', how='anti')

print("\nChurned customers (ordered before but not recently):")
print(churned_customers)

## Advanced Join Scenarios

### Multiple Join Keys

In [None]:
# Create sales data with region and product
sales_target = pl.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'product_category': ['Electronics', 'Accessories', 'Electronics', 'Accessories'],
    'target': [100000, 50000, 80000, 40000]
})

sales_actual = pl.DataFrame({
    'region': ['East', 'East', 'West', 'South'],
    'product_category': ['Electronics', 'Accessories', 'Electronics', 'Electronics'],
    'actual': [105000, 45000, 75000, 60000]
})

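# Join on multiple columns; coalesce=True merges the key columns so rows that
# exist on only one side still show their region and product_category
sales_comparison = sales_target.join(
    sales_actual,
    on=['region', 'product_category'],
    how='full',
    coalesce=True
).with_columns([
    (pl.col('actual').fill_null(0) - pl.col('target').fill_null(0)).alias('variance')
])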

print("Sales Target vs Actual (multi-key join):")
print(sales_comparison)

### Join with Different Column Names

In [None]:
# Customer table uses 'customer_id', but another table uses 'cust_id'
customer_preferences = pl.DataFrame({
    'cust_id': [1, 2, 3],
    'preferred_category': ['Electronics', 'Accessories', 'Electronics'],
    'newsletter': [True, False, True]
})

# Join using left_on and right_on
customer_with_prefs = customers.join(
    customer_preferences,
    left_on='customer_id',
    right_on='cust_id',
    how='left'
)

print("Customers with preferences (different column names):")
print(customer_with_prefs)

### Handling Duplicate Column Names

In [None]:
# When both tables have same column names (other than join key)
customer_info = pl.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'status': ['Gold', 'Silver', 'Bronze']
})

order_info = pl.DataFrame({
    'customer_id': [1, 1, 2],
    'order_id': [101, 102, 103],
    'status': ['Shipped', 'Delivered', 'Processing']  # Same column name!
})

# Polars appends a suffix to clashing right-table columns (default: '_right');
# the `suffix` argument overrides it
joined = customer_info.join(order_info, on='customer_id', suffix='_order')

print("Join with duplicate column names:")
print(joined)
print("\nNote: 'status' from customer table, 'status_order' from order table")

## Join Performance Tips

In [None]:
# Create larger datasets for performance testing
import time

large_customers = pl.DataFrame({
    'customer_id': range(100000),
    'name': [f'Customer_{i}' for i in range(100000)],
    'value': np.random.randn(100000)
})

large_orders = pl.DataFrame({
    'order_id': range(50000),
    'customer_id': np.random.randint(0, 100000, 50000),
    'amount': np.random.uniform(10, 1000, 50000)
})

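# Tip 1: Use semi/anti join instead of left join + filter when possible.
# Note: the results differ in shape. Left join + filter returns one row per
# matching order (with order columns); semi join returns each matching customer once.
start = time.perf_counter()
result1 = large_customers.join(large_orders, on='customer_id', how='left').filter(
    pl.col('order_id').is_not_null()
)
time1 = time.perf_counter() - start

start = time.perf_counter()
result2 = large_customers.join(large_orders, on='customer_id', how='semi')
time2 = time.perf_counter() - start

print(f"Left join + filter: {time1:.4f} seconds ({len(result1)} rows)")
print(f"Semi join: {time2:.4f} seconds ({len(result2)} rows)")
print(f"Time ratio (left+filter / semi): {time1/time2:.2f}x (single run; timings vary)")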

## Summary: When to Use Each Join Type

| Join Type | Use When | Keeps Columns From | Duplicate Rows |
|-----------|----------|-------------------|----------------|
| **INNER** | Only want matching rows from both sides | Both tables | Yes (if multiple matches) |
| **LEFT** | Want all from left, with optional matches from right | Both tables | Yes (if multiple matches) |
| **RIGHT** | Want all from right, with optional matches from left | Both tables | Yes (if multiple matches) |
| **FULL** | Want all rows from both sides | Both tables | Yes (if multiple matches) |
| **CROSS** | Need all possible combinations | Both tables | By design |
| **SEMI** | Filter left table by existence in right table | Left only | No (each left row is kept at most once) |
| **ANTI** | Filter left table by NON-existence in right table | Left only | No (each left row is kept at most once) |

### Quick Decision Guide:
- Need data from both tables? → **INNER/LEFT/RIGHT/FULL**
- Just filtering one table by another? → **SEMI/ANTI**
- Need all combinations? → **CROSS**

## Practice Exercises

In [None]:
# Setup for exercises
employees = pl.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Sarah', 'Mike', 'Emma', 'David'],
    'department': ['IT', 'HR', 'IT', 'Finance', 'Marketing']
})

projects = pl.DataFrame({
    'project_id': [101, 102, 103, 104],
    'emp_id': [1, 1, 2, 6],  # Note: emp_id 6 doesn't exist
    'project_name': ['Website', 'App', 'Recruitment', 'Mystery'],
    'hours': [120, 80, 40, 60]
})

print("EMPLOYEES:")
print(employees)
print("\nPROJECTS:")
print(projects)

In [None]:
# Exercise 1: Find employees who are working on projects
# Your code here:


In [None]:
# Exercise 2: Find employees who are NOT working on any projects
# Your code here:


In [None]:
# Exercise 3: Find projects with invalid employee assignments
# Your code here:


In [None]:
# Exercise 4: Show all employees and all projects in one view
# Your code here:
