# 📘 Section 6: Merging, Joining, and Concatenating DataFrames

**Level:** Intermediate

In data analysis, you often need to combine data from multiple sources, such as users, transactions, and logs. Pandas provides three main tools for combining DataFrames:

- `pd.merge()` → SQL-style joins
- `.join()` → index-based joins
- `pd.concat()` → stacking datasets vertically or horizontally

---

## 🔹 6.1 Creating Sample Datasets

Let's create two DataFrames that mimic real-world datasets:

In [None]:
import pandas as pd

# Users dataset
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Paris', 'Berlin', 'Tokyo'],
    'signup_year': [2020, 2019, 2021, 2018]
})

# Transactions dataset
transactions = pd.DataFrame({
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'user_id': [1, 2, 2, 4, 5, 6],
    'amount': [250, 120, 75, 300, 90, 450],
    'product': ['Laptop', 'Book', 'Pen', 'Phone', 'Mouse', 'Keyboard']
})

display(users)
display(transactions)

## 🔹 6.2 `pd.merge()`: SQL-style Joins

`pd.merge()` combines DataFrames on keys (common columns), similar to SQL joins.

**Example: Inner Join.** Keeps only the rows whose key appears in both DataFrames.

In [None]:
merged_inner = pd.merge(users, transactions, on='user_id', how='inner')
merged_inner
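
To see which keys the inner join discarded, one option is to compare the key columns directly. The following is a minimal sketch that rebuilds trimmed copies of the sample frames so it runs on its own; the names `users_without_tx` and `tx_without_user` are illustrative only.

```python
import pandas as pd

# Trimmed copies of the sample data from 6.1 (only the columns needed here)
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
})
transactions = pd.DataFrame({
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'user_id': [1, 2, 2, 4, 5, 6],
    'amount': [250, 120, 75, 300, 90, 450],
})

merged_inner = pd.merge(users, transactions, on='user_id', how='inner')

# Keys that exist on only one side are dropped by an inner join
users_without_tx = users.loc[
    ~users['user_id'].isin(transactions['user_id']), 'user_id'
].tolist()
tx_without_user = transactions.loc[
    ~transactions['user_id'].isin(users['user_id']), 'user_id'
].tolist()

print(len(merged_inner))   # 4 rows: user 2 matches two transactions
print(users_without_tx)    # [3]
print(tx_without_user)     # [5, 6]
```

Note that the result has one row per matching pair, so Bob (user 2) appears twice.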

### 💡 Join Types

| Join Type | Keeps | Use Case |
|------------|--------|-----------|
| `inner` | Matching keys only | Customers with purchases |
| `left` | All rows from left DF | Users including those with no transactions |
| `right` | All rows from right DF | All transactions even if user missing |
| `outer` | All keys from both | Combine everything |

Let's compare them side by side 👇

In [None]:
merged_left = pd.merge(users, transactions, on='user_id', how='left', indicator=True)
merged_right = pd.merge(users, transactions, on='user_id', how='right', indicator=True)
merged_outer = pd.merge(users, transactions, on='user_id', how='outer', indicator=True)

print('Left Join:')
display(merged_left)

print('Right Join:')
display(merged_right)

print('Outer Join:')
display(merged_outer)
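
The `_merge` column added by `indicator=True` can also be summarized rather than read row by row. A small sketch, assuming the same sample data (rebuilt here in trimmed form so the block is self-contained):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'name': ['Alice', 'Bob', 'Charlie', 'David']})
transactions = pd.DataFrame({'transaction_id': [101, 102, 103, 104, 105, 106],
                             'user_id': [1, 2, 2, 4, 5, 6],
                             'amount': [250, 120, 75, 300, 90, 450]})

merged_outer = pd.merge(users, transactions, on='user_id',
                        how='outer', indicator=True)

# _merge is a categorical column: 'left_only', 'right_only', or 'both'
counts = merged_outer['_merge'].value_counts()
print(counts)

# Rows that found no partner on the other side
unmatched = merged_outer[merged_outer['_merge'] != 'both']
print(unmatched[['user_id', 'name', 'transaction_id', '_merge']])
```

Here `left_only` flags users with no transactions and `right_only` flags transactions whose user is missing from `users`.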

## 🔹 6.3 Joining on Different Column Names

Sometimes your key columns have different names. Use `left_on` and `right_on`.

In [None]:
transactions_renamed = transactions.rename(columns={'user_id': 'customer_id'})
merged_diff = pd.merge(users, transactions_renamed, left_on='user_id', right_on='customer_id', how='inner')
merged_diff.head()
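
When the key names differ, the merged result keeps both key columns. The sketch below shows this and one possible cleanup; `merged_clean` is an illustrative name, and dropping the column is only safe here because an inner join guarantees the two keys hold identical values.

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'name': ['Alice', 'Bob', 'Charlie', 'David']})
transactions_renamed = pd.DataFrame({
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 2, 4, 5, 6],
    'amount': [250, 120, 75, 300, 90, 450],
})

merged_diff = pd.merge(users, transactions_renamed,
                       left_on='user_id', right_on='customer_id', how='inner')

# Both key columns survive the merge
print(merged_diff.columns.tolist())

# For an inner join the keys are identical, so the duplicate can be dropped
merged_clean = merged_diff.drop(columns='customer_id')
print(merged_clean.columns.tolist())
```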

## 🔹 6.4 Concatenating DataFrames (`pd.concat()`)

`pd.concat()` stacks DataFrames **vertically** (adding rows) or **horizontally** (adding columns). Use it to append new records or to attach columns that share an index.

In [None]:
# Vertical concatenation (adding rows)
new_users = pd.DataFrame({
    'user_id': [5, 6],
    'name': ['Eve', 'Frank'],
    'city': ['Sydney', 'Rome'],
    'signup_year': [2022, 2023]
})

all_users = pd.concat([users, new_users], ignore_index=True)
all_users
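
To illustrate what `ignore_index=True` changes, here is a short sketch with two tiny frames (cut down from the sample data). It also shows `keys=`, an alternative that labels each block with its source; the labels `'existing'` and `'new'` are made up for this example.

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
new_users = pd.DataFrame({'user_id': [5, 6], 'name': ['Eve', 'Frank']})

# Without ignore_index, each frame keeps its own 0..n-1 labels
stacked = pd.concat([users, new_users])
print(stacked.index.tolist())       # [0, 1, 0, 1]

# ignore_index=True rebuilds a clean 0..n-1 index
renumbered = pd.concat([users, new_users], ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3]

# keys= adds an outer index level naming the source of each block
labelled = pd.concat([users, new_users], keys=['existing', 'new'])
print(labelled.loc['new'])
```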

In [None]:
# Horizontal concatenation (adding columns)
user_scores = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'satisfaction_score': [4.5, 3.8, 4.9, 4.2]
})

combined_side = pd.concat([users.set_index('user_id'), user_scores.set_index('user_id')], axis=1)
combined_side
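
Horizontal concatenation aligns rows by index label, not by position. The sketch below uses a hypothetical `partial_scores` frame (not part of the sample data) in which user 3 is missing and user 5 is extra, to show how the `join` argument handles the mismatch.

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'name': ['Alice', 'Bob', 'Charlie', 'David']}
                     ).set_index('user_id')

# Hypothetical scores: no entry for user 3, plus an unknown user 5
partial_scores = pd.DataFrame({'user_id': [1, 2, 4, 5],
                               'satisfaction_score': [4.5, 3.8, 4.2, 3.1]}
                              ).set_index('user_id')

# Default join='outer': union of index labels, gaps filled with NaN
outer = pd.concat([users, partial_scores], axis=1)
print(outer)

# join='inner': only labels present in every frame
inner = pd.concat([users, partial_scores], axis=1, join='inner')
print(inner)
```

If the frames were concatenated without `set_index('user_id')`, rows would be paired by their default 0..n-1 labels, which can silently misalign users and scores.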

## 🔹 6.5 `.join()`: Index-based Joins

If your keys are already indexes, `.join()` provides a clean and readable way to combine DataFrames.

In [None]:
cities = pd.DataFrame({
    'city': ['New York', 'Paris', 'Berlin', 'Tokyo'],
    'continent': ['North America', 'Europe', 'Europe', 'Asia']
}).set_index('city')

joined = users.set_index('city').join(cities)
joined
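
If only the right-hand frame is indexed by the key, `.join()` also accepts an `on=` argument, so the left frame can keep its key as a regular column. A minimal sketch using the same data:

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'name': ['Alice', 'Bob', 'Charlie', 'David'],
                      'city': ['New York', 'Paris', 'Berlin', 'Tokyo']})
cities = pd.DataFrame({
    'city': ['New York', 'Paris', 'Berlin', 'Tokyo'],
    'continent': ['North America', 'Europe', 'Europe', 'Asia'],
}).set_index('city')

# on='city' matches the caller's 'city' column against the index of `cities`
joined_on = users.join(cities, on='city')
print(joined_on)
```

The result keeps the original row index of `users` and `city` stays a normal column, which avoids a `set_index` / `reset_index` round trip. `.join()` defaults to a left join.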

## 🔹 6.6 Real-World Example: Customer Insights

Combine users with transactions and city data to analyze total spending per customer.

In [None]:
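# Left join keeps every user, even those without transactions
merged = pd.merge(users, transactions, on='user_id', how='left')
# Sum spending per user (users with no transactions total 0)
user_spend = merged.groupby(['user_id', 'name', 'city'])['amount'].sum().reset_index().rename(columns={'amount': 'total_spent'})
# Attach each user's continent via an index-based join on city
final = user_spend.set_index('city').join(cities).reset_index()
final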

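## ⚙️ Best Practices & Common Pitfalls

- ✅ Inspect merge keys with `.unique()` (or check `.is_unique`) before joining.
- ✅ Use `indicator=True` to audit join results: the added `_merge` column shows which side each row came from.
- ⚠️ Avoid joining on non-unique keys unless you aggregate afterwards, because every duplicate match adds a row to the result.
- ⚙️ Prefer `pd.concat()` for union-like operations, such as stacking frames that share the same columns.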
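
The key checks above can be turned into code. This sketch uses `Series.is_unique` and the `validate` argument of `pd.merge()`, which raises `pandas.errors.MergeError` when the key relationship is not the one you declared. The `rejected` flag is only there to make the outcome visible.

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'name': ['Alice', 'Bob', 'Charlie', 'David']})
transactions = pd.DataFrame({'transaction_id': [101, 102, 103, 104, 105, 106],
                             'user_id': [1, 2, 2, 4, 5, 6],
                             'amount': [250, 120, 75, 300, 90, 450]})

# Each user_id should appear once in users; transactions may repeat it
print(users['user_id'].is_unique)          # True
print(transactions['user_id'].is_unique)   # False: user 2 has two transactions

# Declaring the expected relationship lets merge verify it for you
ok = pd.merge(users, transactions, on='user_id', validate='one_to_many')

# A wrong assumption fails loudly instead of silently duplicating rows
rejected = False
try:
    pd.merge(users, transactions, on='user_id', validate='one_to_one')
except pd.errors.MergeError as exc:
    rejected = True
    print('Merge rejected:', exc)
```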

---
### 💪 Real-World Challenge

Given two DataFrames:
- `students(id, name, department)`
- `marks(student_id, subject, score)`

**Tasks:**
1. Merge both DataFrames.
2. Compute each student's average score.
3. Find the top 3 departments by average performance.

---
_End of Section 6. Next: Grouping & Aggregation._