# 9. Joining, Merging, and Concatenating

## 1. Introduction to Dataset Combination

### Objective
Combining datasets is a fundamental step in data analysis when working with fragmented, complementary, or diverse data sources. This process helps to:
- Bring together related data for comprehensive analysis.
- Create enriched datasets for better insights.
- Simplify workflows by consolidating data into a single DataFrame.

### Key Concepts

#### What are Concatenation and Merging?
- **Concatenation**:
  - Combines datasets along a particular axis (rows or columns).
  - Does not match rows on the values of a key column. pandas only aligns the labels of the other axis: column names when stacking rows, index labels when placing datasets side by side.
  - Useful when datasets share a similar structure (e.g., identical columns for row concatenation or identical indices for column concatenation).

- **Merging**:
  - Combines datasets by aligning rows using one or more keys (columns or indices).
  - More flexible and powerful than concatenation, as it matches data based on content.
  - Useful for combining datasets with different structures or when keys are critical.
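
To make the distinction concrete, here is a minimal sketch. The frames `left`, `right`, and `price_lookup` are made up for illustration. It shows `pd.concat` stacking rows while lining columns up by name, and `pd.merge` matching rows on the values of a key column.

```python
import pandas as pd

# Hypothetical frames: same columns, listed in a different order
left = pd.DataFrame({'Product': ['A'], 'Sales': [100]})
right = pd.DataFrame({'Sales': [200], 'Product': ['B']})

# concat stacks the rows and lines the columns up by name, not by position
stacked = pd.concat([left, right], ignore_index=True)

# merge matches rows whose 'Product' values are equal
price_lookup = pd.DataFrame({'Product': ['A', 'B'], 'Price': [10, 20]})
matched = pd.merge(stacked, price_lookup, on='Product')

print(stacked)
print(matched)
```

Even though `right` lists `Sales` first, its values still land in the `Sales` column of `stacked`.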

#### When to Use Concatenation vs. Merging?
- **Use Concatenation**:
  - Datasets have the same columns (for vertical concatenation) or the same indices (for horizontal concatenation).
  - The goal is to extend data by adding rows or columns without complex alignment.

- **Use Merging**:
  - Datasets have shared or complementary keys.
  - The goal is to enrich data by combining relevant information based on key relationships.
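
As a rough illustration of the enrichment case, the sketch below adds prices to a sales table with a left merge. The frames `sales` and `price_lookup` and their values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical sales and price tables that share the 'Product' key
sales = pd.DataFrame({'Product': ['A', 'B', 'D'], 'Sales': [100, 200, 50]})
price_lookup = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [10, 20, 30]})

# how='left' keeps every sales row; products without a price get NaN
enriched = pd.merge(sales, price_lookup, on='Product', how='left')
print(enriched)
```

Product `D` has no entry in `price_lookup`, so its `Price` is `NaN`, while product `C` does not appear because it has no sales row.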



## 2. Vertical and Horizontal Concatenation

### Key Topics

#### 1. Concatenating Rows (Vertical Concatenation)
- Adds rows from one or more datasets to another.
- Columns are aligned by name. Ideally all datasets share the same column names; where a column is missing from one dataset, pandas fills its values with `NaN`.



```python
import pandas as pd

# Sample monthly sales datasets
sales_jan = pd.DataFrame({
    'Product': ['A', 'B'],
    'Sales': [100, 200]
})
sales_feb = pd.DataFrame({
    'Product': ['A', 'C'],
    'Sales': [150, 300]
})

# Vertical concatenation
sales_combined = pd.concat([sales_jan, sales_feb], axis=0, ignore_index=True)
print(sales_combined)
```

##### Additional Examples for Vertical Concatenation
1. Concatenating Datasets with Different Columns:
```python
# sales_extra has 'Revenue' instead of 'Sales', so the result holds both
# columns and fills the gaps with NaN
sales_extra = pd.DataFrame({
    'Product': ['D'],
    'Revenue': [400]
})
result = pd.concat([sales_combined, sales_extra], axis=0, ignore_index=True)
print(result)
```
2. Adding Hierarchical Keys:
```python
# Adding keys
sales_with_keys = pd.concat([sales_jan, sales_feb], keys=['January', 'February'])
print(sales_with_keys)
```
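
Once the keys are in place, the outer index level can be used to pull out or summarize one source at a time. The following is a small sketch of that idea; it rebuilds the January and February frames so that it runs on its own, and the names `january_rows` and `totals` are illustrative.

```python
import pandas as pd

sales_jan = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [100, 200]})
sales_feb = pd.DataFrame({'Product': ['A', 'C'], 'Sales': [150, 300]})

sales_with_keys = pd.concat([sales_jan, sales_feb], keys=['January', 'February'])

# Select every row that came from the January frame
january_rows = sales_with_keys.loc['January']

# Sum sales per source by grouping on the outer index level
totals = sales_with_keys.groupby(level=0)['Sales'].sum()

print(january_rows)
print(totals)
```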



#### 2. Concatenating Columns (Horizontal Concatenation)
- Adds columns from one or more datasets to another.
- Rows are aligned by index label, not by position; a label that is missing from one dataset produces `NaN` in that dataset's columns.



```python
# Sample datasets with matching indices
prices = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [10, 20, 30]
})
discounts = pd.DataFrame({
    'Discount': [0.1, 0.2, 0.15]
}, index=[0, 1, 2])

# Horizontal concatenation: both frames share the index 0, 1, 2
prices_with_discounts = pd.concat([prices, discounts], axis=1)
print(prices_with_discounts)
```

##### Additional Examples for Horizontal Concatenation
1. Adding Columns with Different Indices:
```python
# ratings is indexed 1-3 while prices is indexed 0-2,
# so rows 0 and 3 of the result contain NaN
ratings = pd.DataFrame({
    'Rating': [4.5, 3.8, 4.9]
}, index=[1, 2, 3])
combined_with_ratings = pd.concat([prices, ratings], axis=1)
print(combined_with_ratings)
```
2. Adding Metadata with Keys:
```python
# With axis=1, the keys become the top level of a column MultiIndex
merged_with_keys = pd.concat([prices, discounts], axis=1, keys=['Prices', 'Discounts'])
print(merged_with_keys)
```
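
When the indices only partly overlap, the `join` parameter of `pd.concat` controls which rows survive. The sketch below rebuilds the `prices` and `ratings` frames so it can run standalone and compares the default `join='outer'` with `join='inner'`.

```python
import pandas as pd

prices = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [10, 20, 30]})
ratings = pd.DataFrame({'Rating': [4.5, 3.8, 4.9]}, index=[1, 2, 3])

# Default join='outer': keep every index label, fill the gaps with NaN
outer = pd.concat([prices, ratings], axis=1)

# join='inner': keep only the labels present in both frames (1 and 2)
inner = pd.concat([prices, ratings], axis=1, join='inner')

print(outer)
print(inner)
```

An inner join avoids `NaN` values at the cost of dropping the rows that exist in only one dataset.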


### Handling Indices During Concatenation

- **`ignore_index`**:
  - Resets the index to a continuous integer range after concatenation.
  - Use when the original index is not meaningful.

```python
# Example with ignore_index
sales_combined_reset = pd.concat([sales_jan, sales_feb], axis=0, ignore_index=True)
print(sales_combined_reset)
```
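
To see why resetting the index matters, the following sketch compares the result with and without `ignore_index`. The variable names `kept` and `reset` are illustrative, and the frames are rebuilt here so the block runs on its own.

```python
import pandas as pd

sales_jan = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [100, 200]})
sales_feb = pd.DataFrame({'Product': ['A', 'C'], 'Sales': [150, 300]})

# Without ignore_index, each frame keeps its own labels: 0, 1, 0, 1
kept = pd.concat([sales_jan, sales_feb])

# With ignore_index=True, the result gets a fresh 0..3 index
reset = pd.concat([sales_jan, sales_feb], ignore_index=True)

# A duplicated label selects more than one row
rows_labeled_zero = kept.loc[0]
print(rows_labeled_zero)
```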

- **`keys`**:
  - Adds hierarchical indexing to distinguish data from different sources.

```python
# Example with keys
sales_with_keys = pd.concat([sales_jan, sales_feb], axis=0, keys=['January', 'February'])
print(sales_with_keys)
```

**Output**:
```
           Product  Sales
January  0       A    100
         1       B    200
February 0       A    150
         1       C    300
```
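
If a regular column is more convenient than a hierarchical index, the key level can be named and then moved into the frame. This is a minimal sketch of one way to do that; the level names `Month` and `Row` are chosen for illustration.

```python
import pandas as pd

sales_jan = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [100, 200]})
sales_feb = pd.DataFrame({'Product': ['A', 'C'], 'Sales': [150, 300]})

# names= labels the index levels created by keys=
labeled = pd.concat(
    [sales_jan, sales_feb],
    keys=['January', 'February'],
    names=['Month', 'Row'],
)

# Move the 'Month' level out of the index and into an ordinary column
flat = labeled.reset_index(level='Month')
print(flat)
```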

### Best Practices
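- Before concatenating, check that column names (for `axis=0`) or index labels (for `axis=1`) match; a mismatch does not raise an error but silently introduces `NaN` values.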
- Use parameters like `ignore_index` and `keys` to control how indices are handled.
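
These checks can be sketched in a few lines. The example below is one possible approach, not the only one. It compares column sets before concatenating and uses the `verify_integrity` parameter of `pd.concat` to detect duplicate index labels; the names `mismatched` and `duplicates_found` are illustrative.

```python
import pandas as pd

sales_jan = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [100, 200]})
sales_extra = pd.DataFrame({'Product': ['D'], 'Revenue': [400]})

# Columns that appear in only one of the two frames
mismatched = set(sales_jan.columns) ^ set(sales_extra.columns)
print(mismatched)

# verify_integrity=True raises ValueError if the result would
# contain duplicate index labels (here, label 0 appears twice)
try:
    pd.concat([sales_jan, sales_extra], verify_integrity=True)
    duplicates_found = False
except ValueError:
    duplicates_found = True
print(duplicates_found)
```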


## Examples

### 1. Combine Monthly Sales Datasets Vertically
- Combine sales data from different months into a single dataset.

```python
# Example from above with January and February sales
sales_combined = pd.concat([sales_jan, sales_feb], axis=0, ignore_index=True)
```
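
With more than two months, it can help to record which month each row came from before stacking. The sketch below assumes the monthly frames are held in a dictionary named `monthly` (a made-up name) and adds a `Month` column with `DataFrame.assign`.

```python
import pandas as pd

# Hypothetical container mapping month names to their sales frames
monthly = {
    'January': pd.DataFrame({'Product': ['A', 'B'], 'Sales': [100, 200]}),
    'February': pd.DataFrame({'Product': ['A', 'C'], 'Sales': [150, 300]}),
}

# Tag each frame with its month, then stack everything in one call
frames = [df.assign(Month=name) for name, df in monthly.items()]
sales_all = pd.concat(frames, axis=0, ignore_index=True)
print(sales_all)
```

Building the list first and calling `pd.concat` once is generally preferable to concatenating inside a loop, since each call copies the data.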

### 2. Concatenate Datasets with Matching Indices Horizontally
- Combine product details (prices and discounts) that share the same index into a single dataset.

```python
prices_with_discounts = pd.concat([prices, discounts], axis=1)
```
