# Week 10 Wednesday - Part 1: Data Merging with Pandas

**Duration:** 45 minutes  
**Topic:** Combining Multiple DataFrames using `merge()`  
**Business Context:** Lagos E-Commerce Inventory Management  

---

## Learning Objectives

By the end of this session, you will be able to:

1. Understand the relationship between SQL JOINs and pandas `merge()`
2. Perform inner, left, right, and outer joins using pandas
3. Merge multiple datasets to create comprehensive inventory analysis
4. Handle common merging issues (duplicate keys, missing values, column name conflicts)
5. Apply merge operations to real-world inventory management scenarios

---

## Introduction: Why Merging Matters

In real-world data analysis, information is rarely contained in a single table. For inventory management:

- **Products table:** Contains product details (weight, dimensions, category)
- **Inventory table:** Contains stock levels and warehouse locations
- **Orders table:** Contains sales transactions
- **Suppliers table:** Contains supplier information

Consider business questions like:
- "Which high-demand products are low in stock?"
- "What's the average inventory value by warehouse?"
- "Which suppliers provide the best-selling products?"

No single table can answer these. We need to **combine (merge)** the separate tables into unified datasets.

---

## Setup: Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np

# Display settings for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

In [None]:
# Load all datasets
products = pd.read_csv('../datasets/products.csv')
inventory = pd.read_csv('../datasets/inventory.csv')
orders = pd.read_csv('../datasets/orders.csv')
order_items = pd.read_csv('../datasets/order_items.csv')
suppliers = pd.read_csv('../datasets/suppliers.csv')
warehouses = pd.read_csv('../datasets/warehouses.csv')

print("✓ Datasets loaded successfully")
print("\nDataset sizes:")
print(f"  Products: {len(products)} rows")
print(f"  Inventory: {len(inventory)} rows")
print(f"  Orders: {len(orders)} rows")
print(f"  Order Items: {len(order_items)} rows")
print(f"  Suppliers: {len(suppliers)} rows")
print(f"  Warehouses: {len(warehouses)} rows")

### Quick Data Preview

In [None]:
# Preview each dataset
print("PRODUCTS:")
print(products.head(3))
print("\nINVENTORY:")
print(inventory.head(3))
print("\nWAREHOUSES:")
print(warehouses.head())

---

## Section 1: SQL JOIN ↔ Pandas Merge Mapping (10 minutes)

### The Four Types of Joins

| SQL JOIN Type | Pandas Parameter | Result | Use Case |
|---------------|------------------|--------|----------|
| `INNER JOIN` | `how='inner'` | Only matching records | Find products with active inventory |
| `LEFT JOIN` | `how='left'` | All from left + matches from right | Keep all products, show inventory where available |
| `RIGHT JOIN` | `how='right'` | All from right + matches from left | Keep all inventory, add product details |
| `FULL OUTER JOIN` | `how='outer'` | All records from both tables | Identify data gaps (products without inventory, inventory without products) |

### Basic Syntax Comparison

**SQL:**
```sql
SELECT *
FROM products p
INNER JOIN inventory i ON p.product_id = i.product_id;
```

**Pandas:**
```python
result = pd.merge(products, inventory, on='product_id', how='inner')
```
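
To see how the four `how` values differ, here is a minimal sketch on two made-up tables. The names `toy_products` and `toy_inventory` and their contents are assumptions for illustration, not the course datasets. Only `P2` and `P3` appear in both tables.

```python
import pandas as pd

# Hypothetical toy tables (not the course CSV files)
toy_products = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "category": ["phones", "laptops", "audio"],
})
toy_inventory = pd.DataFrame({
    "product_id": ["P2", "P3", "P4"],
    "stock_level": [15, 0, 40],
})

# Row count produced by each join type on the same pair of tables
join_sizes = {
    how: len(pd.merge(toy_products, toy_inventory, on="product_id", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(join_sizes)
```

The inner join keeps the two shared keys, the left and right joins keep three rows each, and the outer join keeps all four distinct keys.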

---

## Section 2: Inner Join - Matching Records Only (10 minutes)

**Business Question:** "Show me products that currently have inventory, with their stock levels"

### Example 1: Products with Active Inventory

In [None]:
# Inner join: Only products that have inventory records
products_with_inventory = pd.merge(
    products,
    inventory,
    on='product_id',
    how='inner'
)

print(f"Products in dataset: {len(products)}")
print(f"Inventory records: {len(inventory)}")
print(f"Products with inventory (after INNER join): {len(products_with_inventory)}")
print("\nSample result:")
print(products_with_inventory[['product_id', 'category', 'stock_level', 'status']].head())

### Example 2: Add Warehouse Information

In [None]:
# Chain multiple merges to add warehouse details
inventory_with_location = pd.merge(
    products_with_inventory,
    warehouses,
    on='warehouse_id',
    how='inner'
)

print(f"Records after adding warehouse info: {len(inventory_with_location)}")
print("\nInventory by warehouse:")
print(inventory_with_location[['product_id', 'category', 'stock_level', 'city', 'region']].head())

**💡 Key Insight:** Inner joins keep only the records whose key appears in both tables. This is useful when you only want complete data.
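
One caveat about duplicate keys: if a key appears more than once on either side, a merge can return more rows than the input table, because every matching pair becomes a row. The sketch below uses made-up tables (an assumption, not the course data) and the `validate` parameter of `pd.merge`, which raises `pandas.errors.MergeError` when the stated relationship does not hold.

```python
import pandas as pd

# Hypothetical toy tables: P1 is stocked in two warehouses,
# so its key appears twice on the inventory side
toy_products = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "category": ["phones", "laptops"],
})
toy_inventory = pd.DataFrame({
    "product_id": ["P1", "P1", "P2"],
    "warehouse_id": ["W1", "W2", "W1"],
    "stock_level": [10, 5, 7],
})

merged = pd.merge(toy_products, toy_inventory, on="product_id", how="inner")
print(len(toy_products), "products ->", len(merged), "merged rows")

# Claiming one-to-one fails because product_id repeats in toy_inventory
try:
    pd.merge(toy_products, toy_inventory, on="product_id", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print("one_to_one holds:", one_to_one_ok)

# one_to_many passes: keys are unique on the left side only
checked = pd.merge(toy_products, toy_inventory, on="product_id", validate="one_to_many")
```

Stating the expected relationship up front turns a silent row explosion into an explicit error.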

---

## Section 3: Left Join - Keep All Records from Left Table (10 minutes)

**Business Question:** "Show me ALL products, and their inventory status (even if they don't have inventory)"

### Example 3: All Products with Optional Inventory

In [None]:
# Left join: Keep all products, add inventory data where available
all_products_inventory = pd.merge(
    products,
    inventory,
    on='product_id',
    how='left'
)

print(f"Rows after LEFT join (all products kept): {len(all_products_inventory)}")
print(f"Products WITHOUT inventory: {all_products_inventory['stock_level'].isna().sum()}")
print("\nSample with missing inventory:")
print(all_products_inventory[all_products_inventory['stock_level'].isna()].head())

### Handling Missing Values After Left Join

In [None]:
# Fill missing stock levels with 0 for products not in inventory
all_products_inventory['stock_level'] = all_products_inventory['stock_level'].fillna(0)
all_products_inventory['status'] = all_products_inventory['status'].fillna('Not Tracked')

print("Stock level distribution after filling NaN:")
print(all_products_inventory['status'].value_counts())

**💡 Key Insight:** Left joins preserve all records from the left DataFrame. Missing matches result in NaN values that need to be handled.
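
A side effect worth knowing: when a left join introduces NaN into an integer column, pandas converts that column to float. The sketch below, on made-up tables (an assumption, not the course data), shows the dtype change and one way to restore integers after filling.

```python
import pandas as pd

# Hypothetical toy tables: P3 has no inventory record
toy_products = pd.DataFrame({"product_id": ["P1", "P2", "P3"]})
toy_inventory = pd.DataFrame({"product_id": ["P1", "P2"], "stock_level": [10, 5]})

merged = pd.merge(toy_products, toy_inventory, on="product_id", how="left")

# The unmatched row holds NaN, which forces the integer column to float
dtype_after_merge = merged["stock_level"].dtype
print(dtype_after_merge)

# Fill the gap, then cast back to an integer dtype
merged["stock_level"] = merged["stock_level"].fillna(0).astype("int64")
print(merged)
```

Whether 0 is the right fill value is a business decision: it treats "no inventory record" as "nothing in stock", which may or may not be true for your data.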

---

## Section 4: Right Join and Outer Join (8 minutes)

### Example 4: Right Join - Keep All Inventory Records

In [None]:
# Right join: Keep all inventory, add product details where available
all_inventory_products = pd.merge(
    products,
    inventory,
    on='product_id',
    how='right'
)

print(f"Inventory records preserved: {len(all_inventory_products)}")
print(f"Inventory without product details: {all_inventory_products['category'].isna().sum()}")

### Example 5: Outer Join - Keep Everything

In [None]:
# Outer join: Keep all products AND all inventory records
full_product_inventory = pd.merge(
    products,
    inventory,
    on='product_id',
    how='outer'
)

print(f"Total records (outer join): {len(full_product_inventory)}")
print(f"Products without inventory: {full_product_inventory['stock_level'].isna().sum()}")
print(f"Inventory without product: {full_product_inventory['category'].isna().sum()}")

# Identify orphaned inventory (no matching product)
orphaned_inventory = full_product_inventory[full_product_inventory['category'].isna()]
print(f"\n⚠️ Data Quality Issue: {len(orphaned_inventory)} inventory records have no matching product!")

**💡 Key Insight:** Outer joins are excellent for data quality checks - they reveal orphaned records and data inconsistencies.
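
Checking for NaN in a column such as `category` works only if that column never has genuine missing values of its own. A more direct option is the `indicator=True` parameter of `pd.merge`, which adds a `_merge` column recording where each row came from. A minimal sketch on made-up tables (an assumption, not the course data):

```python
import pandas as pd

# Hypothetical toy tables: P1 has no inventory, P9 has no product record
toy_products = pd.DataFrame({"product_id": ["P1", "P2"], "category": ["phones", "audio"]})
toy_inventory = pd.DataFrame({"product_id": ["P2", "P9"], "stock_level": [5, 12]})

audit = pd.merge(
    toy_products,
    toy_inventory,
    on="product_id",
    how="outer",
    indicator=True,  # adds a '_merge' column: 'left_only', 'right_only', or 'both'
)
print(audit["_merge"].value_counts())

# Inventory rows with no matching product, and products with no inventory
orphaned = audit[audit["_merge"] == "right_only"]
unstocked = audit[audit["_merge"] == "left_only"]
print(orphaned)
```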

---

## Section 5: Advanced Merging Techniques (7 minutes)

### Handling Column Name Conflicts with Suffixes

In [None]:
# When both tables have columns with the same name (besides the key)
# Use suffixes to distinguish them
result_with_suffixes = pd.merge(
    inventory,
    warehouses,
    on='warehouse_id',
    how='left',
    suffixes=('_inventory', '_warehouse')
)

print("Columns after merge with suffixes:")
print(result_with_suffixes.columns.tolist())
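
Suffixes are applied only to non-key columns that share a name in both tables. If no names overlap, the `suffixes` argument has no visible effect. The sketch below uses made-up tables that both contain a `status` column (an assumption for illustration; the course tables may not overlap this way) to compare the default suffixes with custom ones.

```python
import pandas as pd

# Hypothetical toy tables that both carry a 'status' column
toy_inventory = pd.DataFrame({
    "warehouse_id": ["W1", "W2"],
    "status": ["In Stock", "Low Stock"],
})
toy_warehouses = pd.DataFrame({
    "warehouse_id": ["W1", "W2"],
    "status": ["Open", "Closed"],
})

# Default behaviour: pandas appends '_x' and '_y'
default_cols = pd.merge(toy_inventory, toy_warehouses, on="warehouse_id").columns.tolist()

# Custom suffixes make the origin of each column explicit
custom_cols = pd.merge(
    toy_inventory,
    toy_warehouses,
    on="warehouse_id",
    suffixes=("_inventory", "_warehouse"),
).columns.tolist()

print(default_cols)
print(custom_cols)
```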

### Merging on Different Column Names

In [None]:
# When the joining columns have different names in each table
# Example: order_items has 'seller_id', suppliers has 'supplier_id'
items_with_suppliers = pd.merge(
    order_items,
    suppliers,
    left_on='seller_id',
    right_on='supplier_id',
    how='left'
)

print(f"Order items with supplier info: {len(items_with_suppliers)}")
print("\nSample:")
print(items_with_suppliers[['order_id', 'seller_id', 'city', 'state']].head())
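
Note that `left_on` / `right_on` keeps both key columns in the result, even though they hold the same values for matched rows. A minimal sketch on made-up tables (an assumption, not the course data) shows this and drops the redundant column afterwards.

```python
import pandas as pd

# Hypothetical toy tables with differently named key columns
toy_items = pd.DataFrame({"order_id": [1, 2], "seller_id": ["S1", "S2"]})
toy_suppliers = pd.DataFrame({"supplier_id": ["S1", "S2"], "city": ["Lagos", "Ibadan"]})

merged = pd.merge(
    toy_items,
    toy_suppliers,
    left_on="seller_id",
    right_on="supplier_id",
    how="left",
)
print(merged.columns.tolist())  # both seller_id and supplier_id are present

# Drop the duplicate key once the merge is done
deduped = merged.drop(columns="supplier_id")
print(deduped)
```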

### Merging on Multiple Columns

In [None]:
# Sometimes you need to match on multiple columns for uniqueness
# Example: Match products and inventory by both product_id AND warehouse_id
# (This is hypothetical since our inventory already has warehouse_id)

# Syntax example:
# result = pd.merge(df1, df2, on=['col1', 'col2'], how='inner')

print("Syntax for multi-column merge:")
print("pd.merge(df1, df2, on=['key1', 'key2'], how='inner')")
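
As a runnable version of that syntax, here is a sketch with two made-up tables keyed by product and warehouse together. The `toy_reorder` table and its `reorder_point` column are hypothetical and do not exist in the course datasets. Joining on `product_id` alone is included for comparison, to show why the second key matters.

```python
import pandas as pd

# Hypothetical toy tables: each row is unique per (product_id, warehouse_id)
toy_stock = pd.DataFrame({
    "product_id": ["P1", "P1", "P2"],
    "warehouse_id": ["W1", "W2", "W1"],
    "stock_level": [10, 5, 7],
})
toy_reorder = pd.DataFrame({
    "product_id": ["P1", "P1", "P2"],
    "warehouse_id": ["W1", "W2", "W1"],
    "reorder_point": [8, 8, 3],
})

# Matching on both keys pairs each stock row with exactly one reorder row
combined = pd.merge(toy_stock, toy_reorder, on=["product_id", "warehouse_id"], how="inner")

# Matching on product_id alone cross-matches P1's two warehouses
single_key = pd.merge(toy_stock, toy_reorder, on="product_id", how="inner")

print(len(combined), "rows with both keys;", len(single_key), "rows with product_id only")
```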

---

## Section 6: Real-World Business Analysis (5 minutes)

**Complete Example:** Create a comprehensive inventory report with all relevant information

### Multi-Step Merge: Products → Inventory → Warehouses

In [None]:
# Step 1: Merge products with inventory (left join to keep all products)
step1 = pd.merge(products, inventory, on='product_id', how='left')

# Step 2: Add warehouse details (left join to preserve products without inventory)
inventory_report = pd.merge(step1, warehouses, on='warehouse_id', how='left')

# Clean up the result
inventory_report['stock_level'] = inventory_report['stock_level'].fillna(0)
inventory_report['status'] = inventory_report['status'].fillna('Not in Inventory')
inventory_report['city'] = inventory_report['city'].fillna('Not Assigned')

print("Complete Inventory Report:")
print(inventory_report[[
    'product_id', 'category', 'stock_level', 
    'status', 'city', 'region'
]].head(10))

### Business Insights from Merged Data

In [None]:
# Now we can answer complex business questions

# 1. Which warehouse has the most inventory?
print("Total Stock by Warehouse:")
print(inventory_report.groupby('city')['stock_level'].sum().sort_values(ascending=False))

# 2. Which product categories are low in stock?
print("\nLow Stock Products by Category:")
low_stock = inventory_report[inventory_report['status'] == 'Low Stock']
print(low_stock['category'].value_counts())

# 3. Which region has the most product variety?
print("\nProduct Variety by Region:")
print(inventory_report.groupby('region')['product_id'].nunique().sort_values(ascending=False))

---

## Summary and Key Takeaways

### What We Learned Today:

1. **Merge Types:**
   - `how='inner'`: Only matching records (intersection)
   - `how='left'`: All from left table + matches from right
   - `how='right'`: All from right table + matches from left
   - `how='outer'`: All records from both tables (union)

2. **Key Parameters:**
   - `on`: Column name(s) to join on
   - `left_on` / `right_on`: Different column names
   - `suffixes`: Handle column name conflicts

3. **Best Practices:**
   - Always check record counts before and after merging
   - Handle missing values (NaN) after left/right/outer joins
   - Use outer joins to identify data quality issues
   - Chain merges step-by-step for complex analyses
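
The first best practice can be wrapped in a small helper. `merge_with_report` below is a hypothetical function written for this lesson, not part of pandas; it forwards its keyword arguments to `pd.merge` and prints the row counts before and after. The toy tables are also made up.

```python
import pandas as pd

def merge_with_report(left, right, **merge_kwargs):
    """Hypothetical helper: merge two DataFrames and print row counts before and after."""
    result = pd.merge(left, right, **merge_kwargs)
    print(f"left={len(left)} rows, right={len(right)} rows -> merged={len(result)} rows")
    return result

# Hypothetical toy tables (not the course CSV files)
toy_products = pd.DataFrame({"product_id": ["P1", "P2", "P3"]})
toy_inventory = pd.DataFrame({"product_id": ["P1", "P1", "P4"], "stock_level": [10, 5, 3]})

report = merge_with_report(toy_products, toy_inventory, on="product_id", how="left")
```

A merged row count that differs from what you expected is usually the first sign of duplicate or unmatched keys.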

### SQL to Pandas Quick Reference:

```python
# SQL: INNER JOIN
pd.merge(df1, df2, on='key', how='inner')

# SQL: LEFT JOIN
pd.merge(df1, df2, on='key', how='left')

# SQL: RIGHT JOIN
pd.merge(df1, df2, on='key', how='right')

# SQL: FULL OUTER JOIN
pd.merge(df1, df2, on='key', how='outer')
```

---

## Practice Exercise (5 minutes)

**Challenge:** Merge `orders`, `order_items`, and `products` to answer:
"What product categories are generating the most revenue?"

### Your Task:

In [None]:
# Step 1: Merge orders with order_items
# Step 2: Add product information
# Step 3: Calculate total revenue by category
# Step 4: Sort and display top 5 categories

# Your code here:


### Solution (Reveal After Attempting)

In [None]:
# Solution:
orders_items = pd.merge(orders, order_items, on='order_id', how='inner')
full_sales = pd.merge(orders_items, products, on='product_id', how='left')
revenue_by_category = full_sales.groupby('category')['price'].sum().sort_values(ascending=False)

print("Top 5 Revenue-Generating Categories:")
print(revenue_by_category.head())

---

## Next Session Preview

**Part 2: Data Reshaping (Wednesday, 45 minutes)**
- Pivot tables: Convert rows to columns
- Melt: Convert wide data to long format
- Stack/Unstack: Multi-level reshaping
- Real-world use cases: Monthly sales trends, category performance matrices

---

## Resources

- [Pandas merge() documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html)
- [SQL to pandas comparison](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html)
- Week 10 SQL content (Thursday): Database normalization and views