# Week 10 Wednesday - Part 3: Data Concatenation with Pandas

**Duration:** 30 minutes  
**Topic:** Combining DataFrames with `pd.concat()`  
**Business Context:** Combining Multi-Period Inventory Reports  

---

## Learning Objectives

By the end of this session, you will be able to:

1. Combine multiple DataFrames vertically (stacking rows)
2. Combine multiple DataFrames horizontally (adding columns)
3. Understand when to use `concat()` vs `merge()`
4. Handle index alignment and duplicate index labels
5. Apply concatenation to real-world multi-file scenarios

---

## Introduction: Why Concatenation Matters

In real business scenarios, data often comes in multiple files:

- **Time-based splits:** Monthly sales reports (jan_sales.csv, feb_sales.csv, mar_sales.csv)
- **Regional splits:** Lagos_inventory.csv, Abuja_inventory.csv, PH_inventory.csv
- **Category splits:** Electronics_data.csv, Furniture_data.csv, etc.

**Concatenation (`pd.concat()`)** allows you to:
- Stack these files vertically (combine rows)
- Combine them horizontally (add new columns)
- Create unified datasets for analysis
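
In practice, split files like these are usually read in a loop and stacked in a single call. The sketch below assumes hypothetical files named `jan_sales.csv`, `feb_sales.csv`, and `mar_sales.csv` with identical columns. It writes small stand-in files to a temporary folder first so that it runs on its own; the `units_sold` and `source_file` column names are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create three small stand-in CSV files (hypothetical data) in a temp folder
data_dir = Path(tempfile.mkdtemp())
for month, units in [("jan", [10, 20]), ("feb", [15, 25]), ("mar", [12, 30])]:
    pd.DataFrame({"product_id": ["P1", "P2"], "units_sold": units}).to_csv(
        data_dir / f"{month}_sales.csv", index=False
    )

# Read every matching file, tagging each row with the file it came from
frames = []
for path in sorted(data_dir.glob("*_sales.csv")):
    df = pd.read_csv(path)
    df["source_file"] = path.name
    frames.append(df)

# Stack all files into one DataFrame with a fresh sequential index
all_sales = pd.concat(frames, axis=0, ignore_index=True)
print(all_sales)
```

Collecting the frames in a list and calling `pd.concat()` once is generally preferable to concatenating inside the loop, which copies the growing DataFrame on every iteration.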

### Concat vs Merge:

| Operation | Use Case | Key Difference |
|-----------|----------|----------------|
| **merge()** | Combine based on common columns (keys) | Intelligent matching (like SQL JOIN) |
| **concat()** | Stack DataFrames together | Simple stacking (no matching logic) |
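
The difference in the table can be seen on two tiny made-up DataFrames whose rows are in a different order. This is a minimal sketch, not data from the course datasets:

```python
import pandas as pd

stock = pd.DataFrame({"product_id": ["P1", "P2"], "stock": [100, 150]})
prices = pd.DataFrame({"product_id": ["P2", "P1"], "price": [45000, 25000]})

# merge() matches rows by the value in product_id, regardless of row order
merged = pd.merge(stock, prices, on="product_id")

# concat(axis=1) pairs rows by index label and ignores the product_id values,
# so P1's stock ends up next to P2's price
side_by_side = pd.concat([stock, prices], axis=1)

print(merged)
print(side_by_side)
```

Note that `side_by_side` also ends up with two `product_id` columns, because `concat()` does not treat any column as a key.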

---

## Setup: Import Libraries and Create Sample Data

In [None]:
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [None]:
# Load our main datasets
products = pd.read_csv('../datasets/products.csv')
inventory = pd.read_csv('../datasets/inventory.csv')
warehouses = pd.read_csv('../datasets/warehouses.csv')

print("✓ Datasets loaded")

---

## Section 1: Vertical Concatenation - Stacking Rows (12 minutes)

**Business Scenario:** You receive monthly inventory snapshots as separate files. You need to combine them into one dataset.

### Creating Sample Monthly Data

In [None]:
# Simulate monthly inventory snapshots
jan_inventory = inventory.head(7).copy()
jan_inventory['snapshot_date'] = '2025-01-31'

feb_inventory = inventory.iloc[7:14].copy()
feb_inventory['snapshot_date'] = '2025-02-28'

mar_inventory = inventory.iloc[14:21].copy() if len(inventory) > 14 else inventory.head(7).copy()
mar_inventory['snapshot_date'] = '2025-03-31'

print("January Snapshot:")
print(jan_inventory.head())
print(f"\nJanuary records: {len(jan_inventory)}")
print(f"February records: {len(feb_inventory)}")
print(f"March records: {len(mar_inventory)}")

### Example 1: Basic Vertical Concatenation

In [None]:
# Combine all monthly snapshots into one DataFrame
all_months = pd.concat(
    [jan_inventory, feb_inventory, mar_inventory],
    axis=0,              # axis=0 means vertical stacking (rows)
    ignore_index=True    # Create new sequential index
)

print(f"Combined inventory records: {len(all_months)}")
print(f"Expected: {len(jan_inventory) + len(feb_inventory) + len(mar_inventory)}")
print("\nSample of combined data:")
print(all_months.head())
print("\nMonths represented:")
print(all_months['snapshot_date'].value_counts())

**💡 Key Insight:** `axis=0` stacks DataFrames vertically, and `ignore_index=True` creates a new continuous index.

### Example 2: Keeping Original Index with Keys

In [None]:
# Add identifying keys to track source of each row
all_months_keyed = pd.concat(
    [jan_inventory, feb_inventory, mar_inventory],
    axis=0,
    keys=['January', 'February', 'March'],  # Add hierarchical index
    names=['Month', 'Original_Index']       # Name the index levels
)

print("With hierarchical index:")
print(all_months_keyed.head(10))
print("\nAccess January data:")
print(all_months_keyed.loc['January'].head())
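
If a plain column is easier to work with than a hierarchical index, the key level can be moved into a regular column with `reset_index()`. A minimal sketch using small stand-in frames (the `jan` and `feb` data here are made up):

```python
import pandas as pd

jan = pd.DataFrame({"product_id": ["P1", "P2"], "stock_level": [100, 150]})
feb = pd.DataFrame({"product_id": ["P1", "P2"], "stock_level": [90, 160]})

keyed = pd.concat(
    [jan, feb],
    keys=["January", "February"],
    names=["Month", "Original_Index"],
)

# Move the 'Month' level out of the index into a regular column,
# then discard the remaining original row labels
flat = keyed.reset_index(level="Month").reset_index(drop=True)
print(flat)
```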

### Example 3: Handling Missing Columns

In [None]:
# Create DataFrames with different columns (common real-world issue)
df1 = pd.DataFrame({
    'product_id': ['A1', 'A2', 'A3'],
    'stock': [100, 150, 200],
    'warehouse': ['Lagos', 'Abuja', 'Lagos']
})

df2 = pd.DataFrame({
    'product_id': ['B1', 'B2'],
    'stock': [75, 120],
    'supplier': ['SupplierX', 'SupplierY']  # Different column!
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Concatenate with outer join (default) - keeps all columns
combined_outer = pd.concat([df1, df2], axis=0, ignore_index=True)
print("\nCombined (outer join - keeps all columns):")
print(combined_outer)
print("\n⚠️ Notice the NaN values where columns don't match")

In [None]:
# Concatenate with inner join - only common columns
combined_inner = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)
print("Combined (inner join - only common columns):")
print(combined_inner)
print("\n✓ Only 'product_id' and 'stock' columns kept")

**💡 Key Insight:** Use `join='inner'` to keep only common columns, or `join='outer'` (default) to keep all columns with NaN for missing values.

---

## Section 2: Horizontal Concatenation - Adding Columns (8 minutes)

**Business Scenario:** You have product information in one file and pricing information in another. You want to add the pricing data as new columns.

### Example 4: Basic Horizontal Concatenation

In [None]:
# Create sample data with aligned indices
product_info = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3', 'P4'],
    'category': ['Electronics', 'Furniture', 'Electronics', 'Home']
})

price_info = pd.DataFrame({
    'price': [25000, 45000, 15000, 8000],
    'currency': ['₦', '₦', '₦', '₦']
})

print("Product Info:")
print(product_info)
print("\nPrice Info:")
print(price_info)

In [None]:
# Combine horizontally (add columns)
complete_product = pd.concat(
    [product_info, price_info],
    axis=1  # axis=1 means horizontal concatenation (columns)
)

print("Combined (horizontal):")
print(complete_product)
print(f"\nColumn count: {len(product_info.columns)} + {len(price_info.columns)} = {len(complete_product.columns)}")

**⚠️ Important:** Horizontal concatenation aligns on the index. If indices don't match, you'll get NaN values.

### Example 5: Index Alignment Issues

In [None]:
# Create DataFrames with mismatched indices
df_a = pd.DataFrame({
    'A': [1, 2, 3]
}, index=[0, 1, 2])

df_b = pd.DataFrame({
    'B': [4, 5, 6]
}, index=[1, 2, 3])  # Different indices!

print("DataFrame A (index 0,1,2):")
print(df_a)
print("\nDataFrame B (index 1,2,3):")
print(df_b)

# Concatenate horizontally
result = pd.concat([df_a, df_b], axis=1)
print("\nConcatenated Result:")
print(result)
print("\n⚠️ Notice NaN values where indices don't align!")

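**💡 Solution:** Decide which rows you want to keep. An outer join on the index (the default for `concat(axis=1)`, and what `merge(..., how='outer')` does) keeps every index label and fills the gaps with NaN. An inner join keeps only the labels present in both DataFrames. If rows should be matched on a column rather than the index, use `merge(on=...)` instead:

```python
# Keep only index labels present in both DataFrames (here: 1 and 2)
result = pd.concat([df_a, df_b], axis=1, join='inner')
# Equivalent result with merge():
result = pd.merge(df_a, df_b, left_index=True, right_index=True, how='inner')
```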
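
When two DataFrames describe the same rows in the same order but carry different index labels (for example, after one of them was filtered or sliced), resetting both indices first lets `axis=1` line them up by position. A minimal sketch with made-up data; it is only safe when the row order really does correspond:

```python
import pandas as pd

names = pd.DataFrame({"product_id": ["P1", "P2", "P3"]}, index=[10, 11, 12])
prices = pd.DataFrame({"price": [25000, 45000, 15000]})  # default index 0, 1, 2

# The labels differ, so a plain axis=1 concat yields 6 rows padded with NaN
misaligned = pd.concat([names, prices], axis=1)

# Resetting both indices aligns the rows by position instead
aligned = pd.concat(
    [names.reset_index(drop=True), prices.reset_index(drop=True)],
    axis=1,
)
print(aligned)
```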

---

## Section 3: Real-World Use Cases (8 minutes)

### Use Case 1: Combining Regional Inventory Files

In [None]:
# Simulate regional inventory files
lagos_warehouse = warehouses[warehouses['city'] == 'Lagos']
lagos_inventory = inventory[inventory['warehouse_id'].isin(lagos_warehouse['warehouse_id'])].copy()
lagos_inventory['region'] = 'Lagos'

abuja_warehouse = warehouses[warehouses['city'] == 'Abuja']
abuja_inventory = inventory[inventory['warehouse_id'].isin(abuja_warehouse['warehouse_id'])].copy()
abuja_inventory['region'] = 'Abuja'

print(f"Lagos inventory: {len(lagos_inventory)} records")
print(f"Abuja inventory: {len(abuja_inventory)} records")

# Combine regional files
national_inventory = pd.concat(
    [lagos_inventory, abuja_inventory],
    axis=0,
    ignore_index=True
)

print(f"\nNational inventory: {len(national_inventory)} records")
print("\nRegional distribution:")
print(national_inventory['region'].value_counts())
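
As an alternative to adding a `region` column by hand, `pd.concat()` also accepts a dictionary, in which case the dictionary keys are used as the `keys` argument. A minimal sketch with made-up regional data (the `row` level name is arbitrary):

```python
import pandas as pd

regional = {
    "Lagos": pd.DataFrame({"product_id": ["P1", "P2"], "stock_level": [100, 150]}),
    "Abuja": pd.DataFrame({"product_id": ["P1"], "stock_level": [80]}),
}

# Passing a dict: its keys become the outer level of the index
national = pd.concat(regional, names=["region", "row"])

# Total stock per region, grouped on that index level
totals = national.groupby(level="region")["stock_level"].sum()
print(totals)
```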

### Use Case 2: Appending New Data to Existing Dataset

In [None]:
# Existing inventory
existing = inventory.head(10).copy()
print(f"Existing records: {len(existing)}")
print(f"Last product_id: {existing['product_id'].iloc[-1]}")

# New incoming data (e.g., from daily update)
new_data = pd.DataFrame({
    'product_id': ['NEW001', 'NEW002'],
    'warehouse_id': [1, 2],
    'stock_level': [50, 75],
    'reorder_point': [10, 15],
    'last_restocked': ['2025-10-26', '2025-10-26'],
    'status': ['In Stock', 'In Stock']
})

print("\nNew incoming data:")
print(new_data)

# Append new data
updated_inventory = pd.concat([existing, new_data], axis=0, ignore_index=True)
print(f"\nUpdated inventory: {len(updated_inventory)} records")
print("\nLast 3 records (including new):")
print(updated_inventory.tail(3))
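
A daily update can contain rows that already exist in the current data. The sketch below uses made-up rows and assumes that `product_id` plus `warehouse_id` uniquely identifies a record; under that assumption, `drop_duplicates()` keeps the newest version of each record after the append:

```python
import pandas as pd

existing = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "warehouse_id": [1, 1],
    "stock_level": [100, 150],
})
update = pd.DataFrame({
    "product_id": ["P2", "P3"],   # P2 in warehouse 1 already exists
    "warehouse_id": [1, 2],
    "stock_level": [140, 60],
})

combined = pd.concat([existing, update], axis=0, ignore_index=True)

# Keep the last occurrence of each (product_id, warehouse_id) pair,
# so the row from the update replaces the older one
deduped = combined.drop_duplicates(
    subset=["product_id", "warehouse_id"], keep="last"
).reset_index(drop=True)
print(deduped)
```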

### Use Case 3: Combining Data from Multiple File Formats

In [None]:
# Realistic scenario: You have data in different formats
# CSV file data
csv_data = inventory.head(5)[['product_id', 'warehouse_id', 'stock_level']].copy()
csv_data['source'] = 'CSV'

# Excel file data (simulated)
excel_data = pd.DataFrame({
    'product_id': ['E1', 'E2', 'E3'],
    'warehouse_id': [1, 2, 3],
    'stock_level': [120, 95, 80],
    'source': ['Excel', 'Excel', 'Excel']
})

# Database query data (simulated)
db_data = pd.DataFrame({
    'product_id': ['D1', 'D2'],
    'warehouse_id': [1, 1],
    'stock_level': [200, 150],
    'source': ['Database', 'Database']
})

# Combine all sources
consolidated = pd.concat(
    [csv_data, excel_data, db_data],
    axis=0,
    ignore_index=True
)

print("Consolidated inventory from multiple sources:")
print(consolidated)
print("\nData source breakdown:")
print(consolidated['source'].value_counts())
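
Different sources often encode the same column differently. The sketch below uses made-up data and assumes one source stores `warehouse_id` as text. The concatenation does not fail, but the combined column is no longer an integer column (it typically falls back to the generic `object` dtype), so it is worth checking dtypes after combining:

```python
import pandas as pd

csv_like = pd.DataFrame({"product_id": ["C1"], "warehouse_id": [1]})      # integers
excel_like = pd.DataFrame({"product_id": ["E1"], "warehouse_id": ["2"]})  # text

mixed = pd.concat([csv_like, excel_like], axis=0, ignore_index=True)

# Record the dtype before cleaning: mixing int and text loses the integer type
dtype_before = mixed["warehouse_id"].dtype
print(f"Before: {dtype_before}")

# Convert to a single numeric type before analysis
mixed["warehouse_id"] = pd.to_numeric(mixed["warehouse_id"])
print(f"After: {mixed['warehouse_id'].dtype}")
```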

---

## Section 4: Best Practices and Common Pitfalls (2 minutes)

### Decision Tree: Concat vs Merge

```
Do you need to match records based on common values (keys)?
├─ YES → Use merge() or join()
│   └─ Example: Combine products with their suppliers
│
└─ NO → Use concat()
    ├─ Same structure, different time periods?
    │   └─ concat(axis=0) [vertical stacking]
    │
    └─ Different columns, same rows?
        └─ concat(axis=1) [horizontal stacking]
```

### Quick Reference Table

| Scenario | Method | Parameters |
|----------|--------|------------|
| Stack monthly files | `pd.concat()` | `axis=0, ignore_index=True` |
| Combine regional data | `pd.concat()` | `axis=0, keys=['Region1', ...]` |
| Add new columns | `pd.concat()` | `axis=1` |
| Combine with matching | `pd.merge()` | `on='key_column'` |
| Only common columns | `pd.concat()` | `axis=0, join='inner'` |
| All columns | `pd.concat()` | `axis=0, join='outer'` |

### Common Pitfalls

In [None]:
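# Pitfall 1: Forgetting to reset index
# Files loaded separately each start at index 0; reset_index() simulates that here
# (the slices above keep their original, non-overlapping labels)
bad_concat = pd.concat(
    [jan_inventory.reset_index(drop=True), feb_inventory.reset_index(drop=True)],
    axis=0
)
print("Without ignore_index (duplicate indices):")
print(bad_concat.head(10))
print(f"\nIndex has duplicates: {bad_concat.index.duplicated().any()}")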

# Pitfall 2: Wrong axis
print("\n❌ Using axis=1 when you meant axis=0:")
wrong_axis = pd.concat([jan_inventory.head(3), feb_inventory.head(3)], axis=1)
print(f"Shape: {wrong_axis.shape} (way too many columns!)")

# Pitfall 3: Not tracking data source
print("\n⚠️ No way to identify which month each record came from!")
print("✓ Solution: Add 'source' column or use keys parameter")
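
One way to catch the first pitfall automatically is the `verify_integrity=True` parameter, which makes `pd.concat()` raise a `ValueError` when the combined index would contain duplicate labels. A minimal sketch with made-up frames:

```python
import pandas as pd

jan = pd.DataFrame({"stock_level": [100, 150]})  # index 0, 1
feb = pd.DataFrame({"stock_level": [90, 160]})   # index 0, 1 again

try:
    pd.concat([jan, feb], axis=0, verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    overlap_detected = True
    print(f"Duplicate index labels detected: {err}")

# ignore_index=True sidesteps the problem by building a fresh index
clean = pd.concat([jan, feb], axis=0, ignore_index=True)
print(clean.index.is_unique)
```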

---

## Summary and Key Takeaways

### What We Learned Today:

1. **Vertical Concatenation (`axis=0`):**
   - Stacks DataFrames on top of each other
   - Use `ignore_index=True` for new sequential index
   - Use `keys` parameter to track source
   - Use `join='inner'` for only common columns

2. **Horizontal Concatenation (`axis=1`):**
   - Adds columns side-by-side
   - Aligns on index (be careful with mismatches!)
   - Use `join='inner'` to keep only matching index labels, or `merge()` when rows should be matched on a key column

3. **Concat vs Merge:**
   - **concat()**: Simple stacking, no matching logic
   - **merge()**: Intelligent matching based on keys (like SQL JOIN)

### Quick Syntax Reference:

```python
# Vertical stacking (common use case)
pd.concat([df1, df2, df3], axis=0, ignore_index=True)

# With source tracking
pd.concat([df1, df2], axis=0, keys=['Source1', 'Source2'])

# Horizontal (add columns)
pd.concat([df1, df2], axis=1)

# Only common columns
pd.concat([df1, df2], axis=0, join='inner')
```

---

## Practice Exercise (5 minutes)

**Challenge:** 
1. Create three separate DataFrames representing Q1, Q2, and Q3 product sales
2. Add a 'quarter' column to each
3. Combine them into a single report covering all three quarters
4. Calculate total sales by quarter

### Your Task:

In [None]:
# Step 1: Create Q1, Q2, Q3 DataFrames
# Step 2: Add quarter identifier
# Step 3: Concatenate
# Step 4: Analyze

# Your code here:


### Solution (Reveal After Attempting)

In [None]:
# Solution:
q1_sales = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [10000, 15000, 12000],
    'quarter': ['Q1', 'Q1', 'Q1']
})

q2_sales = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [12000, 16000, 13000],
    'quarter': ['Q2', 'Q2', 'Q2']
})

q3_sales = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [15000, 18000, 14000],
    'quarter': ['Q3', 'Q3', 'Q3']
})

annual_report = pd.concat([q1_sales, q2_sales, q3_sales], axis=0, ignore_index=True)

print("Annual Sales Report:")
print(annual_report)

print("\nTotal Sales by Quarter:")
print(annual_report.groupby('quarter')['sales'].sum())

print("\nTotal Sales by Product:")
print(annual_report.groupby('product')['sales'].sum())

---

## Session Wrap-Up

### Today's Complete Learning Journey:

**Part 1 - Merging (45 min):**
- Inner, left, right, outer joins
- Combining related datasets based on keys

**Part 2 - Reshaping (45 min):**
- Pivot tables (long → wide)
- Melt (wide → long)
- Stack/Unstack operations

**Part 3 - Concatenation (30 min - Today):**
- Vertical stacking (combining rows)
- Horizontal stacking (adding columns)
- Real-world multi-file scenarios

### Next Steps:

**Practice Exercises** (Remaining class time):
- Complete comprehensive exercises notebook
- Apply all three techniques to inventory analysis
- Prepare questions for Q&A

**Thursday SQL Session:**
- Database normalization concepts
- Creating views and materialized views
- Data integrity and constraints
- See how today's pandas concepts map to SQL design

---

## Resources

- [Pandas concat() documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
- [Merge, join, concatenate guide](https://pandas.pydata.org/docs/user_guide/merging.html)
- [Comparison with SQL: JOIN](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#join)
- Week 10 Resources folder: Cheat sheets and quick reference guides