# [Merging, Joining, and Concatenating DataFrames](#)

In data analysis and manipulation, it's common to work with multiple datasets that need to be combined in various ways. Pandas provides powerful tools for merging, joining, and concatenating DataFrames, allowing you to integrate data from different sources efficiently.


The main methods for combining DataFrames in Pandas are:

1. **Concatenation**: Combining DataFrames by stacking them vertically (along rows) or horizontally (along columns).
2. **Merging**: Combining DataFrames based on common columns or indexes, similar to SQL joins.
3. **Joining**: A specialized form of merging that uses the index of DataFrames as the joining key.


Let's start by importing Pandas and creating some sample DataFrames with meaningful data to work with:


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Customer information DataFrame
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown', 'Charlie Davis'],
    'email': ['john@example.com', 'jane@example.com', 'bob@example.com', 'alice@example.com', 'charlie@example.com']
})

# Order information DataFrame
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [3, 2, 1, 3, 5],
    'order_date': ['2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19'],
    'total_amount': [150.50, 200.75, 50.25, 300.00, 175.80]
})

# Product information DataFrame
products = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Smartwatch'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories'],
    'price': [1200.00, 800.00, 500.00, 150.00, 250.00]
})


In [3]:
print("Customers DataFrame:")
customers


Customers DataFrame:


Unnamed: 0,customer_id,name,email
0,1,John Doe,john@example.com
1,2,Jane Smith,jane@example.com
2,3,Bob Johnson,bob@example.com
3,4,Alice Brown,alice@example.com
4,5,Charlie Davis,charlie@example.com


In [4]:
print("\nOrders DataFrame:")
orders



Orders DataFrame:


Unnamed: 0,order_id,customer_id,order_date,total_amount
0,101,3,2023-01-15,150.5
1,102,2,2023-01-16,200.75
2,103,1,2023-01-17,50.25
3,104,3,2023-01-18,300.0
4,105,5,2023-01-19,175.8


In [5]:
print("\nProducts DataFrame:")
products


Products DataFrame:


Unnamed: 0,product_id,product_name,category,price
0,P1,Laptop,Electronics,1200.0
1,P2,Smartphone,Electronics,800.0
2,P3,Tablet,Electronics,500.0
3,P4,Headphones,Accessories,150.0
4,P5,Smartwatch,Accessories,250.0


These sample DataFrames represent a simple e-commerce scenario:

1. `customers`: Contains customer information.
2. `orders`: Represents order details, linked to customers via `customer_id`.
3. `products`: Contains product information.


Throughout this lecture, we'll explore:

- How to use `pd.concat()` for combining DataFrames
- Various types of merges using `pd.merge()` and DataFrame's `merge()` method
- The differences between merging and joining
- Advanced techniques for handling complex combining scenarios
- Best practices and performance considerations


Understanding how to effectively combine DataFrames is crucial for data preprocessing, feature engineering, and creating comprehensive datasets for analysis. By mastering these techniques, you'll be able to handle complex data integration tasks with ease and efficiency.


In the following sections, we'll dive deep into each method of combining DataFrames, exploring their syntax, use cases, and potential pitfalls. We'll use these realistic DataFrames to demonstrate various combining techniques and highlight important concepts.

## <a id='toc1_'></a>[Concatenating DataFrames](#toc0_)

Concatenation is the process of combining DataFrames by stacking them either vertically (along rows) or horizontally (along columns). This is particularly useful when you have multiple DataFrames with similar structures that you want to combine into a single DataFrame.


<img src="../images/concatenation.png" width="800">

### <a id='toc1_1_'></a>[Vertical Concatenation (along rows)](#toc0_)


Vertical concatenation is used when you want to stack DataFrames on top of each other. This is common when dealing with data from different time periods or sources.


In [6]:
# Create two DataFrames with customer data from different regions
customers_us = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
    'country': ['USA', 'USA', 'USA']
})
customers_us

Unnamed: 0,customer_id,name,country
0,1,John Doe,USA
1,2,Jane Smith,USA
2,3,Bob Johnson,USA


In [7]:
customers_uk = pd.DataFrame({
    'customer_id': [4, 5, 6],
    'name': ['Alice Brown', 'Charlie Davis', 'Emma Wilson'],
    'country': ['UK', 'UK', 'UK']
})
customers_uk

Unnamed: 0,customer_id,name,country
0,4,Alice Brown,UK
1,5,Charlie Davis,UK
2,6,Emma Wilson,UK


In [8]:
# Concatenate vertically
all_customers = pd.concat([customers_us, customers_uk])

In [9]:
print("Concatenated customers:")
all_customers

Concatenated customers:


Unnamed: 0,customer_id,name,country
0,1,John Doe,USA
1,2,Jane Smith,USA
2,3,Bob Johnson,USA
0,4,Alice Brown,UK
1,5,Charlie Davis,UK
2,6,Emma Wilson,UK


By default, `pd.concat()` stacks along rows (`axis=0`). The result contains all rows from both input DataFrames and keeps each input's original index labels, which is why the labels 0 to 2 appear twice above.


### <a id='toc1_2_'></a>[Horizontal Concatenation (along columns)](#toc0_)


Horizontal concatenation is used when you want to combine DataFrames side by side, adding new columns to the resulting DataFrame.


In [10]:
# Create two DataFrames with different customer information
customer_info = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown']
})
customer_info

Unnamed: 0,customer_id,name
0,1,John Doe
1,2,Jane Smith
2,3,Bob Johnson
3,4,Alice Brown


In [11]:
customer_orders = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'total_orders': [5, 3, 7, 2],
    'total_spend': [500, 300, 700, 200]
})
customer_orders

Unnamed: 0,customer_id,total_orders,total_spend
0,1,5,500
1,2,3,300
2,3,7,700
3,4,2,200


In [12]:
# Concatenate horizontally
customer_data = pd.concat([customer_info, customer_orders], axis=1)

In [13]:
print("Horizontally concatenated customer data:")
customer_data

Horizontally concatenated customer data:


Unnamed: 0,customer_id,name,customer_id,total_orders,total_spend
0,1,John Doe,1,5,500
1,2,Jane Smith,2,3,300
2,3,Bob Johnson,3,7,700
3,4,Alice Brown,4,2,200


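Note the use of `axis=1` to specify horizontal concatenation. Both inputs carry a `customer_id` column, so the result contains it twice; `pd.concat()` does not deduplicate columns.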
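Horizontal concatenation aligns rows by index label, not by position. The following is a minimal sketch of what happens when the indexes only partly overlap; the frames `left` and `right` are hypothetical and not part of the sample data above.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
left = pd.DataFrame({'name': ['John Doe', 'Jane Smith', 'Bob Johnson']}, index=[0, 1, 2])
right = pd.DataFrame({'total_orders': [3, 7, 2]}, index=[1, 2, 3])

# axis=1 aligns on index labels; labels missing from one side become NaN
aligned = pd.concat([left, right], axis=1)
print(aligned)

# join='inner' keeps only the labels present in every input
overlap = pd.concat([left, right], axis=1, join='inner')
print(overlap)
```

If the rows are meant to line up by position instead, resetting both indexes with `reset_index(drop=True)` before concatenating is one way to get there.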



### <a id='toc1_3_'></a>[Handling Index and Axis](#toc0_)


When concatenating DataFrames, it's important to consider how the index is handled. By default, `pd.concat()` keeps the original indexes, which can lead to duplicate index values.


In [15]:
# Concatenate with default index handling
result_default = pd.concat([customers_us, customers_uk])

In [16]:
print("Default concatenation (potentially duplicate indexes):")
result_default

Default concatenation (potentially duplicate indexes):


Unnamed: 0,customer_id,name,country
0,1,John Doe,USA
1,2,Jane Smith,USA
2,3,Bob Johnson,USA
0,4,Alice Brown,UK
1,5,Charlie Davis,UK
2,6,Emma Wilson,UK


In [17]:
# Reset index after concatenation
result_reset_index = pd.concat([customers_us, customers_uk]).reset_index(drop=True)
print("\nConcatenation with reset index:")
result_reset_index


Concatenation with reset index:


Unnamed: 0,customer_id,name,country
0,1,John Doe,USA
1,2,Jane Smith,USA
2,3,Bob Johnson,USA
3,4,Alice Brown,UK
4,5,Charlie Davis,UK
5,6,Emma Wilson,UK


In [18]:
# Concatenate with a hierarchical index
result_hierarchical = pd.concat([customers_us, customers_uk], keys=['US', 'UK'])
print("\nConcatenation with hierarchical index:")
result_hierarchical


Concatenation with hierarchical index:


Unnamed: 0,Unnamed: 1,customer_id,name,country
US,0,1,John Doe,USA
US,1,2,Jane Smith,USA
US,2,3,Bob Johnson,USA
UK,0,4,Alice Brown,UK
UK,1,5,Charlie Davis,UK
UK,2,6,Emma Wilson,UK
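A hierarchical index produced by `keys` can be used to recover each original block later. This is a small sketch with cut-down stand-ins for the regional frames; the level names `region` and `row` are illustrative choices, not something the examples above define.

```python
import pandas as pd

us = pd.DataFrame({'customer_id': [1, 2], 'name': ['John Doe', 'Jane Smith']})
uk = pd.DataFrame({'customer_id': [4, 5], 'name': ['Alice Brown', 'Charlie Davis']})

# names= labels the two index levels (illustrative names)
stacked = pd.concat([us, uk], keys=['US', 'UK'], names=['region', 'row'])

# Selecting on the outer level returns the rows that came from one input
uk_rows = stacked.loc['UK']
print(uk_rows)

# Alternatively, move the outer level into an ordinary column
flat = stacked.reset_index(level='region')
print(flat)
```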


Additional considerations:

1. **Ignoring index**: Use `ignore_index=True` to create a new sequential index.

In [19]:
pd.concat([customers_us, customers_uk], ignore_index=True)

Unnamed: 0,customer_id,name,country
0,1,John Doe,USA
1,2,Jane Smith,USA
2,3,Bob Johnson,USA
3,4,Alice Brown,UK
4,5,Charlie Davis,UK
5,6,Emma Wilson,UK


2. **Joining on specific axes**: Use `join='inner'` to keep only columns that are present in all input DataFrames.

In [20]:
# Assuming customers_us has an extra 'state' column
customers_us['state'] = ['NY', 'CA', 'TX']
customers_us


Unnamed: 0,customer_id,name,country,state
0,1,John Doe,USA,NY
1,2,Jane Smith,USA,CA
2,3,Bob Johnson,USA,TX


In [21]:
pd.concat([customers_us, customers_uk], join='inner')

Unnamed: 0,customer_id,name,country
0,1,John Doe,USA
1,2,Jane Smith,USA
2,3,Bob Johnson,USA
0,4,Alice Brown,UK
1,5,Charlie Davis,UK
2,6,Emma Wilson,UK


In [22]:
pd.concat([customers_us, customers_uk], join='outer')

Unnamed: 0,customer_id,name,country,state
0,1,John Doe,USA,NY
1,2,Jane Smith,USA,CA
2,3,Bob Johnson,USA,TX
0,4,Alice Brown,UK,
1,5,Charlie Davis,UK,
2,6,Emma Wilson,UK,


In [23]:
pd.concat([customers_us, customers_uk])

Unnamed: 0,customer_id,name,country,state
0,1,John Doe,USA,NY
1,2,Jane Smith,USA,CA
2,3,Bob Johnson,USA,TX
0,4,Alice Brown,UK,
1,5,Charlie Davis,UK,
2,6,Emma Wilson,UK,


3. **Handling missing data**: When concatenating DataFrames with different columns, missing values are filled with NaN by default.

In [24]:
# Adding a new column to customers_uk
customers_uk['loyalty_score'] = [100, 150, 75]
pd.concat([customers_us, customers_uk])

Unnamed: 0,customer_id,name,country,state,loyalty_score
0,1,John Doe,USA,NY,
1,2,Jane Smith,USA,CA,
2,3,Bob Johnson,USA,TX,
0,4,Alice Brown,UK,,100.0
1,5,Charlie Davis,UK,,150.0
2,6,Emma Wilson,UK,,75.0


Concatenation is a powerful tool for combining DataFrames, but it's important to carefully consider how indexes and columns are handled to ensure the resulting DataFrame meets your needs. Always verify the output to make sure it aligns with your expectations, especially when dealing with complex data structures or different column sets.
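One way to automate part of that verification is the `verify_integrity` argument of `pd.concat()`, which raises a `ValueError` when the combined index would contain duplicates. The sketch below uses two small hypothetical frames, `us` and `uk`.

```python
import pandas as pd

us = pd.DataFrame({'customer_id': [1, 2], 'country': ['USA', 'USA']})
uk = pd.DataFrame({'customer_id': [4, 5], 'country': ['UK', 'UK']})

# Both frames use the default index 0..1, so the stacked index would repeat
try:
    pd.concat([us, uk], verify_integrity=True)
    duplicate_index_detected = False
except ValueError as exc:
    duplicate_index_detected = True
    print(f"concat refused: {exc}")

# ignore_index=True builds a fresh sequential index instead
combined = pd.concat([us, uk], ignore_index=True)
print(combined.index.is_unique)
```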

## <a id='toc2_'></a>[Merging DataFrames](#toc0_)

Merging is a powerful operation in Pandas that allows you to combine DataFrames based on common columns or indexes, similar to SQL joins. It's particularly useful when you need to integrate data from different sources based on shared keys.


<img src="../images/merge.png" width="800">

Let's use our previously defined DataFrames for these examples:


In [25]:
print("Customers DataFrame:")
customers

Customers DataFrame:


Unnamed: 0,customer_id,name,email
0,1,John Doe,john@example.com
1,2,Jane Smith,jane@example.com
2,3,Bob Johnson,bob@example.com
3,4,Alice Brown,alice@example.com
4,5,Charlie Davis,charlie@example.com


In [26]:
print("\nOrders DataFrame:")
orders


Orders DataFrame:


Unnamed: 0,order_id,customer_id,order_date,total_amount
0,101,3,2023-01-15,150.5
1,102,2,2023-01-16,200.75
2,103,1,2023-01-17,50.25
3,104,3,2023-01-18,300.0
4,105,5,2023-01-19,175.8


### <a id='toc2_1_'></a>[Inner Merge](#toc0_)


An inner merge returns only the rows where there's a match in both DataFrames.


In [27]:
# Perform an inner merge on customer_id
pd.merge(customers, orders, on='customer_id', how='inner')

Unnamed: 0,customer_id,name,email,order_id,order_date,total_amount
0,1,John Doe,john@example.com,103,2023-01-17,50.25
1,2,Jane Smith,jane@example.com,102,2023-01-16,200.75
2,3,Bob Johnson,bob@example.com,101,2023-01-15,150.5
3,3,Bob Johnson,bob@example.com,104,2023-01-18,300.0
4,5,Charlie Davis,charlie@example.com,105,2023-01-19,175.8


In this case, only customers who have placed orders will appear in the result.


### <a id='toc2_2_'></a>[Outer Merge](#toc0_)


An outer merge returns all rows from both DataFrames, filling in NaN where there's no match.


In [28]:
# Perform an outer merge on customer_id
pd.merge(customers, orders, on='customer_id', how='outer')

Unnamed: 0,customer_id,name,email,order_id,order_date,total_amount
0,1,John Doe,john@example.com,103.0,2023-01-17,50.25
1,2,Jane Smith,jane@example.com,102.0,2023-01-16,200.75
2,3,Bob Johnson,bob@example.com,101.0,2023-01-15,150.5
3,3,Bob Johnson,bob@example.com,104.0,2023-01-18,300.0
4,4,Alice Brown,alice@example.com,,,
5,5,Charlie Davis,charlie@example.com,105.0,2023-01-19,175.8


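This includes all customers (even those without orders) and all orders (even those whose customer is missing from the customers DataFrame). In this sample every order has a matching customer, so the only unmatched row is Alice Brown, whose order columns are NaN.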


### <a id='toc2_3_'></a>[Cross Merge](#toc0_)

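A cross merge (or Cartesian product) returns every combination of rows from both DataFrames; no key is used. With 5 customers and 5 orders the full result has 25 rows, and the output below shows only the first 10.

In [29]:
# Perform a cross merge (no key column is needed)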
pd.merge(customers, orders, how='cross')

Unnamed: 0,customer_id_x,name,email,order_id,customer_id_y,order_date,total_amount
0,1,John Doe,john@example.com,101,3,2023-01-15,150.5
1,1,John Doe,john@example.com,102,2,2023-01-16,200.75
2,1,John Doe,john@example.com,103,1,2023-01-17,50.25
3,1,John Doe,john@example.com,104,3,2023-01-18,300.0
4,1,John Doe,john@example.com,105,5,2023-01-19,175.8
5,2,Jane Smith,jane@example.com,101,3,2023-01-15,150.5
6,2,Jane Smith,jane@example.com,102,2,2023-01-16,200.75
7,2,Jane Smith,jane@example.com,103,1,2023-01-17,50.25
8,2,Jane Smith,jane@example.com,104,3,2023-01-18,300.0
9,2,Jane Smith,jane@example.com,105,5,2023-01-19,175.8
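The row count of a cross merge is always the product of the two input lengths. Here is a minimal sketch with small hypothetical frames that makes the arithmetic easy to check.

```python
import pandas as pd

customers_demo = pd.DataFrame({'name': ['John Doe', 'Jane Smith']})
products_demo = pd.DataFrame({'product_name': ['Laptop', 'Tablet', 'Headphones']})

# how='cross' takes no key: each customer is paired with each product
pairs = pd.merge(customers_demo, products_demo, how='cross')
print(pairs)
print(len(pairs) == len(customers_demo) * len(products_demo))
```

Because the size grows multiplicatively, a cross merge is usually reserved for small inputs.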


### <a id='toc2_4_'></a>[Left and Right Merge](#toc0_)


A left merge keeps all rows from the left DataFrame, while a right merge keeps all rows from the right DataFrame.


In [30]:
# Left merge
pd.merge(customers, orders, on='customer_id', how='left')

Unnamed: 0,customer_id,name,email,order_id,order_date,total_amount
0,1,John Doe,john@example.com,103.0,2023-01-17,50.25
1,2,Jane Smith,jane@example.com,102.0,2023-01-16,200.75
2,3,Bob Johnson,bob@example.com,101.0,2023-01-15,150.5
3,3,Bob Johnson,bob@example.com,104.0,2023-01-18,300.0
4,4,Alice Brown,alice@example.com,,,
5,5,Charlie Davis,charlie@example.com,105.0,2023-01-19,175.8


In [31]:
# Right merge
pd.merge(customers, orders, on='customer_id', how='right')

Unnamed: 0,customer_id,name,email,order_id,order_date,total_amount
0,3,Bob Johnson,bob@example.com,101,2023-01-15,150.5
1,2,Jane Smith,jane@example.com,102,2023-01-16,200.75
2,1,John Doe,john@example.com,103,2023-01-17,50.25
3,3,Bob Johnson,bob@example.com,104,2023-01-18,300.0
4,5,Charlie Davis,charlie@example.com,105,2023-01-19,175.8


Left merge includes all customers, even those without orders. Right merge includes all orders, even if the customer is not in the customers DataFrame.
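The difference between the two directions is easiest to see when each side has a key the other lacks. The sketch below assumes a made-up order for `customer_id` 99, which matches no customer; the frames `customers_demo` and `orders_demo` are illustrative.

```python
import pandas as pd

customers_demo = pd.DataFrame({
    'customer_id': [1, 2, 4],
    'name': ['John Doe', 'Jane Smith', 'Alice Brown'],
})
# customer_id 99 is a made-up key with no matching customer
orders_demo = pd.DataFrame({
    'order_id': [102, 103, 106],
    'customer_id': [2, 1, 99],
    'total_amount': [200.75, 50.25, 123.00],
})

left_result = pd.merge(customers_demo, orders_demo, on='customer_id', how='left')
right_result = pd.merge(customers_demo, orders_demo, on='customer_id', how='right')

print(left_result)   # Alice Brown is kept; her order columns are NaN
print(right_result)  # order 106 is kept; its name is NaN
```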


### <a id='toc2_5_'></a>[Merging on Index](#toc0_)


You can merge DataFrames based on their index instead of a column.


In [32]:
# Set customer_id as index for both DataFrames
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')

In [33]:
# Merge on index
pd.merge(customers_indexed, orders_indexed, left_index=True, right_index=True)


customer_id,name,email,order_id,order_date,total_amount
1,John Doe,john@example.com,103,2023-01-17,50.25
2,Jane Smith,jane@example.com,102,2023-01-16,200.75
3,Bob Johnson,bob@example.com,101,2023-01-15,150.5
3,Bob Johnson,bob@example.com,104,2023-01-18,300.0
5,Charlie Davis,charlie@example.com,105,2023-01-19,175.8


This is useful when your key is already the index of your DataFrames.
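The key does not have to be the index on both sides. A sketch, using hypothetical frames, of matching a column in one DataFrame against the index of the other with `left_on` and `right_index`:

```python
import pandas as pd

customers_by_id = pd.DataFrame(
    {'name': ['John Doe', 'Jane Smith', 'Bob Johnson']},
    index=pd.Index([1, 2, 3], name='customer_id'),
)
orders_flat = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [3, 2, 1],
})

# Match the orders' customer_id column against the customers' index
mixed = pd.merge(orders_flat, customers_by_id, left_on='customer_id', right_index=True)
print(mixed)
```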


### <a id='toc2_6_'></a>[Merging on Multiple Keys](#toc0_)


Sometimes you need to merge based on multiple columns to uniquely identify rows.


In [34]:
# Create a new DataFrame with customer purchases
purchases = pd.DataFrame({
    'customer_id': [1, 2, 3, 2, 1],
    'product_id': ['P1', 'P2', 'P3', 'P1', 'P3'],
    'quantity': [1, 2, 1, 1, 3]
})

# Merge customers, purchases, and products
pd.merge(customers, purchases, on='customer_id').merge(products, on='product_id')

Unnamed: 0,customer_id,name,email,product_id,quantity,product_name,category,price
0,1,John Doe,john@example.com,P1,1,Laptop,Electronics,1200.0
1,1,John Doe,john@example.com,P3,3,Tablet,Electronics,500.0
2,2,Jane Smith,jane@example.com,P2,2,Smartphone,Electronics,800.0
3,2,Jane Smith,jane@example.com,P1,1,Laptop,Electronics,1200.0
4,3,Bob Johnson,bob@example.com,P3,1,Tablet,Electronics,500.0


This example chains two single-key merges in sequence: first on `customer_id`, then on `product_id`.
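For a composite key, `on` accepts a list of column names, and a row matches only when every listed column agrees. The following is a minimal sketch; `reviews_demo` is a hypothetical table introduced purely for illustration.

```python
import pandas as pd

purchases_demo = pd.DataFrame({
    'customer_id': [1, 2, 1],
    'product_id': ['P1', 'P2', 'P3'],
    'quantity': [1, 2, 3],
})
# Hypothetical review table keyed by the same (customer, product) pair
reviews_demo = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'product_id': ['P1', 'P3', 'P1'],
    'rating': [5, 4, 3],
})

# Customer 2 bought P2 but reviewed P1, so that purchase gets no rating
multi_key = pd.merge(purchases_demo, reviews_demo, on=['customer_id', 'product_id'], how='left')
print(multi_key)
```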


Additional considerations:

1. **Handling duplicate columns**: When merging, you might end up with duplicate column names. You can use the `suffixes` parameter to distinguish them:

In [35]:
customers

Unnamed: 0,customer_id,name,email
0,1,John Doe,john@example.com
1,2,Jane Smith,jane@example.com
2,3,Bob Johnson,bob@example.com
3,4,Alice Brown,alice@example.com
4,5,Charlie Davis,charlie@example.com


In [36]:
orders

Unnamed: 0,order_id,customer_id,order_date,total_amount
0,101,3,2023-01-15,150.5
1,102,2,2023-01-16,200.75
2,103,1,2023-01-17,50.25
3,104,3,2023-01-18,300.0
4,105,5,2023-01-19,175.8


In [37]:
# Add a 'name' column to orders so that it duplicates the one in customers
orders['name'] = ['Bob Johnson', 'Jane Smith', 'John Doe', 'Bob Johnson', 'Charlie Davis']

pd.merge(customers, orders, on='customer_id', suffixes=('_customer', '_order'))

Unnamed: 0,customer_id,name_customer,email,order_id,order_date,total_amount,name_order
0,1,John Doe,john@example.com,103,2023-01-17,50.25,John Doe
1,2,Jane Smith,jane@example.com,102,2023-01-16,200.75,Jane Smith
2,3,Bob Johnson,bob@example.com,101,2023-01-15,150.5,Bob Johnson
3,3,Bob Johnson,bob@example.com,104,2023-01-18,300.0,Bob Johnson
4,5,Charlie Davis,charlie@example.com,105,2023-01-19,175.8,Charlie Davis


In [38]:
# Drop the 'name' column from orders
orders.drop(columns='name', inplace=True)

2. **Indicator column**: You can add an indicator column to show the merge result for each row:


In [39]:
pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)

Unnamed: 0,customer_id,name,email,order_id,order_date,total_amount,_merge
0,1,John Doe,john@example.com,103.0,2023-01-17,50.25,both
1,2,Jane Smith,jane@example.com,102.0,2023-01-16,200.75,both
2,3,Bob Johnson,bob@example.com,101.0,2023-01-15,150.5,both
3,3,Bob Johnson,bob@example.com,104.0,2023-01-18,300.0,both
4,4,Alice Brown,alice@example.com,,,,left_only
5,5,Charlie Davis,charlie@example.com,105.0,2023-01-19,175.8,both
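The `_merge` column can also drive a filter. As a sketch with small hypothetical frames, the rows marked `left_only` are exactly the customers with no order, which gives a simple anti-join:

```python
import pandas as pd

customers_demo = pd.DataFrame({
    'customer_id': [1, 2, 4],
    'name': ['John Doe', 'Jane Smith', 'Alice Brown'],
})
orders_demo = pd.DataFrame({'order_id': [102, 103], 'customer_id': [2, 1]})

flagged = pd.merge(customers_demo, orders_demo, on='customer_id', how='left', indicator=True)

# Rows marked 'left_only' are customers with no matching order
no_orders = flagged.loc[flagged['_merge'] == 'left_only', ['customer_id', 'name']]
print(no_orders)
```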


3. **Validating merges**: Use the `validate` parameter to ensure your merge behaves as expected:


In [40]:
pd.merge(customers, orders, on='customer_id', validate="one_to_many")

Unnamed: 0,customer_id,name,email,order_id,order_date,total_amount
0,1,John Doe,john@example.com,103,2023-01-17,50.25
1,2,Jane Smith,jane@example.com,102,2023-01-16,200.75
2,3,Bob Johnson,bob@example.com,101,2023-01-15,150.5
3,3,Bob Johnson,bob@example.com,104,2023-01-18,300.0
4,5,Charlie Davis,charlie@example.com,105,2023-01-19,175.8
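When the declared relationship does not hold, `validate` raises `pandas.errors.MergeError` instead of silently returning extra rows. A sketch with hypothetical frames in which one customer has two orders:

```python
import pandas as pd

customers_demo = pd.DataFrame({'customer_id': [1, 2], 'name': ['John Doe', 'Jane Smith']})
orders_demo = pd.DataFrame({'order_id': [101, 102, 103], 'customer_id': [1, 1, 2]})

# Orders repeat customer_id 1, so a one-to-one claim is wrong and pandas raises
try:
    pd.merge(customers_demo, orders_demo, on='customer_id', validate='one_to_one')
    validation_failed = False
except pd.errors.MergeError as exc:
    validation_failed = True
    print(f"validation failed: {exc}")

# The actual relationship (one customer, many orders) passes the check
checked = pd.merge(customers_demo, orders_demo, on='customer_id', validate='one_to_many')
print(len(checked))
```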


Merging is a fundamental operation in data manipulation, allowing you to combine information from different sources. Always carefully consider which type of merge to use based on your data and analysis requirements. It's also crucial to inspect your merged data to ensure it meets your expectations and doesn't introduce unintended consequences like data loss or duplication.

## <a id='toc3_'></a>[Joining DataFrames](#toc0_)

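Joining combines DataFrames on their indexes. For this section we redefine `customers` and `orders` (replacing the earlier versions) and add `support_tickets`, each indexed by `customer_id`: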


In [41]:
# Customer information
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['Alice Brown', 'Bob Charles', 'Charlie Davis', 'David Evans', 'Eve Franklin'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com', 'eve@example.com']
}).set_index('customer_id')
customers

customer_id,name,email
101,Alice Brown,alice@example.com
102,Bob Charles,bob@example.com
103,Charlie Davis,charlie@example.com
104,David Evans,david@example.com
105,Eve Franklin,eve@example.com


In [42]:
# Order information
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer_id': [101, 103, 102, 105, 101],
    'order_date': ['2023-06-01', '2023-06-02', '2023-06-03', '2023-06-04', '2023-06-05'],
    'total_amount': [150.50, 200.75, 50.25, 300.00, 175.80]
}).set_index('customer_id')
orders

customer_id,order_id,order_date,total_amount
101,1001,2023-06-01,150.5
103,1002,2023-06-02,200.75
102,1003,2023-06-03,50.25
105,1004,2023-06-04,300.0
101,1005,2023-06-05,175.8


In [43]:
# Customer support tickets
support_tickets = pd.DataFrame({
    'ticket_id': [5001, 5002, 5003, 5004],
    'customer_id': [102, 104, 101, 103],
    'issue': ['Payment failed', 'Item not received', 'Wrong item sent', 'Refund request'],
    'status': ['Resolved', 'In Progress', 'Resolved', 'Pending']
}).set_index('customer_id')
support_tickets

customer_id,ticket_id,issue,status
102,5001,Payment failed,Resolved
104,5002,Item not received,In Progress
101,5003,Wrong item sent,Resolved
103,5004,Refund request,Pending


Now, let's explore joining these DataFrames.


### <a id='toc3_1_'></a>[Understanding the Difference between Merge and Join](#toc0_)


While `merge()` and `join()` can often be used interchangeably, `join()` is specifically designed for index-based joining and is often more convenient when working with DataFrames that have meaningful indexes.


In [44]:
# Using merge() with index
pd.merge(customers, orders, left_index=True, right_index=True, how='outer')

customer_id,name,email,order_id,order_date,total_amount
101,Alice Brown,alice@example.com,1001.0,2023-06-01,150.5
101,Alice Brown,alice@example.com,1005.0,2023-06-05,175.8
102,Bob Charles,bob@example.com,1003.0,2023-06-03,50.25
103,Charlie Davis,charlie@example.com,1002.0,2023-06-02,200.75
104,David Evans,david@example.com,,,
105,Eve Franklin,eve@example.com,1004.0,2023-06-04,300.0


In [45]:
# Using join()
customers.join(orders, how='outer')

customer_id,name,email,order_id,order_date,total_amount
101,Alice Brown,alice@example.com,1001.0,2023-06-01,150.5
101,Alice Brown,alice@example.com,1005.0,2023-06-05,175.8
102,Bob Charles,bob@example.com,1003.0,2023-06-03,50.25
103,Charlie Davis,charlie@example.com,1002.0,2023-06-02,200.75
104,David Evans,david@example.com,,,
105,Eve Franklin,eve@example.com,1004.0,2023-06-04,300.0


In this case, both operations produce the same result, but `join()` is more concise.
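One difference worth keeping in mind is the default join type: `join()` defaults to `how='left'`, while `merge()` defaults to `how='inner'`. The sketch below uses two small hypothetical frames to show how the same call without `how` yields different row counts.

```python
import pandas as pd

left = pd.DataFrame({'name': ['Alice Brown', 'David Evans']},
                    index=pd.Index([101, 104], name='customer_id'))
right = pd.DataFrame({'order_id': [1001]},
                     index=pd.Index([101], name='customer_id'))

joined = left.join(right)  # default how='left': customer 104 is kept
merged = pd.merge(left, right, left_index=True, right_index=True)  # default how='inner'

print(len(joined), len(merged))
```

Passing `how` explicitly, as the examples above do, avoids relying on either default.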


### <a id='toc3_2_'></a>[Different Types of Joins](#toc0_)


1. **Inner Join:**
   Keeps only the rows with matching indexes in both DataFrames.


In [46]:
customers.join(orders, how='inner')

customer_id,name,email,order_id,order_date,total_amount
101,Alice Brown,alice@example.com,1001,2023-06-01,150.5
101,Alice Brown,alice@example.com,1005,2023-06-05,175.8
102,Bob Charles,bob@example.com,1003,2023-06-03,50.25
103,Charlie Davis,charlie@example.com,1002,2023-06-02,200.75
105,Eve Franklin,eve@example.com,1004,2023-06-04,300.0


2. **Left Join:**
   Keeps all rows from the left DataFrame and matching rows from the right DataFrame.


In [47]:
customers.join(support_tickets, how='left')

customer_id,name,email,ticket_id,issue,status
101,Alice Brown,alice@example.com,5003.0,Wrong item sent,Resolved
102,Bob Charles,bob@example.com,5001.0,Payment failed,Resolved
103,Charlie Davis,charlie@example.com,5004.0,Refund request,Pending
104,David Evans,david@example.com,5002.0,Item not received,In Progress
105,Eve Franklin,eve@example.com,,,


3. **Right Join:**
   Keeps all rows from the right DataFrame and matching rows from the left DataFrame.


In [48]:
customers.join(orders, how='right')

customer_id,name,email,order_id,order_date,total_amount
101,Alice Brown,alice@example.com,1001,2023-06-01,150.5
103,Charlie Davis,charlie@example.com,1002,2023-06-02,200.75
102,Bob Charles,bob@example.com,1003,2023-06-03,50.25
105,Eve Franklin,eve@example.com,1004,2023-06-04,300.0
101,Alice Brown,alice@example.com,1005,2023-06-05,175.8


4. **Outer Join:**
   Keeps all rows from both DataFrames, filling with NaN where there's no match.


In [49]:
customers.join([orders, support_tickets], how='outer')

customer_id,name,email,order_id,order_date,total_amount,ticket_id,issue,status
101,Alice Brown,alice@example.com,1001.0,2023-06-01,150.5,5003.0,Wrong item sent,Resolved
101,Alice Brown,alice@example.com,1005.0,2023-06-05,175.8,5003.0,Wrong item sent,Resolved
102,Bob Charles,bob@example.com,1003.0,2023-06-03,50.25,5001.0,Payment failed,Resolved
103,Charlie Davis,charlie@example.com,1002.0,2023-06-02,200.75,5004.0,Refund request,Pending
104,David Evans,david@example.com,,,,5002.0,Item not received,In Progress
105,Eve Franklin,eve@example.com,1004.0,2023-06-04,300.0,,,


Additional considerations:


1. **Joining multiple DataFrames:**
   You can join multiple DataFrames in a single operation:

In [50]:
customers.join([orders, support_tickets], how='outer')

customer_id,name,email,order_id,order_date,total_amount,ticket_id,issue,status
101,Alice Brown,alice@example.com,1001.0,2023-06-01,150.5,5003.0,Wrong item sent,Resolved
101,Alice Brown,alice@example.com,1005.0,2023-06-05,175.8,5003.0,Wrong item sent,Resolved
102,Bob Charles,bob@example.com,1003.0,2023-06-03,50.25,5001.0,Payment failed,Resolved
103,Charlie Davis,charlie@example.com,1002.0,2023-06-02,200.75,5004.0,Refund request,Pending
104,David Evans,david@example.com,,,,5002.0,Item not received,In Progress
105,Eve Franklin,eve@example.com,1004.0,2023-06-04,300.0,,,


2. **Handling duplicate column names:**
   When joining DataFrames with overlapping column names, you can use the `lsuffix` and `rsuffix` parameters:


In [51]:
# Assuming we have a 'customer_rating' in both customers and orders DataFrames
customers['customer_rating'] = [4.5, 3.8, 4.2, 5.0, 4.7]
orders['customer_rating'] = [4.0, 4.5, 3.5, 5.0, 4.8]

customers.join(orders, lsuffix='_customer', rsuffix='_order')

customer_id,name,email,customer_rating_customer,order_id,order_date,total_amount,customer_rating_order
101,Alice Brown,alice@example.com,4.5,1001.0,2023-06-01,150.5,4.0
101,Alice Brown,alice@example.com,4.5,1005.0,2023-06-05,175.8,4.8
102,Bob Charles,bob@example.com,3.8,1003.0,2023-06-03,50.25,3.5
103,Charlie Davis,charlie@example.com,4.2,1002.0,2023-06-02,200.75,4.5
104,David Evans,david@example.com,5.0,,,,
105,Eve Franklin,eve@example.com,4.7,1004.0,2023-06-04,300.0,5.0


3. **Performance tip:**
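   For large DataFrames, `join()` is convenient when both keys are already indexes. Because `join()` relies on the same merge machinery internally, any speed difference from `merge(left_index=True, right_index=True)` is usually small; benchmark on your own data before relying on it.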


Using meaningful data in these examples helps illustrate how joining can be used in real-world scenarios, such as combining customer information with their orders and support tickets. This allows for comprehensive analysis of customer behavior, order history, and customer service interactions all in one DataFrame.
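`join()` is not limited to index-on-index matching: its `on` argument names a column in the calling DataFrame that is matched against the other DataFrame's index. A minimal sketch with hypothetical frames, where orders keep `customer_id` as a regular column:

```python
import pandas as pd

customers_by_id = pd.DataFrame(
    {'name': ['Alice Brown', 'Bob Charles']},
    index=pd.Index([101, 102], name='customer_id'),
)
orders_flat = pd.DataFrame({
    'order_id': [1001, 1003, 1005],
    'customer_id': [101, 102, 101],
})

# `on` names a column in the caller; it is matched against the other's index
orders_named = orders_flat.join(customers_by_id, on='customer_id')
print(orders_named)
```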

## <a id='toc4_'></a>[Best Practices and Common Pitfalls](#toc0_)

When working with merging, joining, and concatenating DataFrames in Pandas, it's important to follow best practices and be aware of common pitfalls. This knowledge will help you avoid errors, improve performance, and ensure the integrity of your data.


### <a id='toc4_1_'></a>[Best Practices](#toc0_)


1. **Verify Data Types Before Merging**
   Ensure that the columns you're merging on have the same data type in both DataFrames.

In [52]:
# Check data types
print(customers.reset_index()['customer_id'].dtype)
print(orders.reset_index()['customer_id'].dtype)

int64
int64


In [53]:
# Convert if necessary (astype returns a new Index, so assign it back)
orders.index = orders.index.astype('int64')
orders.index

Index([101, 103, 102, 105, 101], dtype='int64', name='customer_id')
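A typical mismatch is an integer key on one side and a string key on the other. The sketch below uses hypothetical frames; recent pandas versions refuse such a merge with a `ValueError`, but since that behavior may vary by version, the code handles both outcomes before aligning the types.

```python
import pandas as pd

customers_demo = pd.DataFrame({'customer_id': [101, 102], 'name': ['Alice Brown', 'Bob Charles']})
# Keys read from a text source often arrive as strings
orders_demo = pd.DataFrame({'customer_id': ['101', '102'], 'order_id': [1001, 1003]})

try:
    pd.merge(customers_demo, orders_demo, on='customer_id')
    mismatch_detected = False
except ValueError as exc:
    mismatch_detected = True
    print(f"merge refused: {exc}")

# Align the dtypes, then merge
orders_demo['customer_id'] = orders_demo['customer_id'].astype('int64')
fixed = pd.merge(customers_demo, orders_demo, on='customer_id')
print(len(fixed))
```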

2. **Use Appropriate Join Types**
   Choose the right join type (inner, left, right, outer) based on your specific needs.


In [54]:
# Example: Use left join to keep all customers, even those without orders
customers.merge(orders, on='customer_id', how='left')

customer_id,name,email,customer_rating_x,order_id,order_date,total_amount,customer_rating_y
101,Alice Brown,alice@example.com,4.5,1001.0,2023-06-01,150.5,4.0
101,Alice Brown,alice@example.com,4.5,1005.0,2023-06-05,175.8,4.8
102,Bob Charles,bob@example.com,3.8,1003.0,2023-06-03,50.25,3.5
103,Charlie Davis,charlie@example.com,4.2,1002.0,2023-06-02,200.75,4.5
104,David Evans,david@example.com,5.0,,,,
105,Eve Franklin,eve@example.com,4.7,1004.0,2023-06-04,300.0,5.0


3. **Set and Use Meaningful Indexes**
   When appropriate, set meaningful indexes to make joins more intuitive and potentially more efficient.


In [55]:
# customers.set_index('customer_id', inplace=True)
# orders.set_index('customer_id', inplace=True)
customers.join(orders, rsuffix='_order', lsuffix='_customer', how='left')

customer_id,name,email,customer_rating_customer,order_id,order_date,total_amount,customer_rating_order
101,Alice Brown,alice@example.com,4.5,1001.0,2023-06-01,150.5,4.0
101,Alice Brown,alice@example.com,4.5,1005.0,2023-06-05,175.8,4.8
102,Bob Charles,bob@example.com,3.8,1003.0,2023-06-03,50.25,3.5
103,Charlie Davis,charlie@example.com,4.2,1002.0,2023-06-02,200.75,4.5
104,David Evans,david@example.com,5.0,,,,
105,Eve Franklin,eve@example.com,4.7,1004.0,2023-06-04,300.0,5.0


4. **Handle Duplicate Column Names**
   Use suffixes or prefixes to avoid confusion when merging DataFrames with overlapping column names.


   ```python
   merged_df = pd.merge(df1, df2, on='key', suffixes=('_df1', '_df2'))
   ```


5. **Validate Merge Results**
   Always check the shape and content of your merged DataFrame to ensure it meets your expectations.


In [56]:
merged = customers.merge(orders, on='customer_id', how='left')
print(f"Shape before merge: {customers.shape}, {orders.shape}")
print(f"Shape after merge: {merged.shape}")

Shape before merge: (5, 3), (5, 4)
Shape after merge: (6, 7)


6. **Use Indicators for Debugging**
   When troubleshooting, use the `indicator` parameter to see which DataFrame each row came from.


In [57]:
merged_with_indicator = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(merged_with_indicator['_merge'].value_counts())

_merge
both          5
left_only     1
right_only    0
Name: count, dtype: int64


### <a id='toc4_2_'></a>[Common Pitfalls](#toc0_)


1. **Unintended Cartesian Products**
   Be cautious when merging on columns with duplicate values, as this can lead to an explosion in the number of rows.


   ```python
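   # If customer_id repeats in both DataFrames, every matching pair of rows is combined
   problematic_merge = pd.merge(customers, orders, on='customer_id')

   # Compare row counts: a result larger than both inputs points to repeated keys
   print(len(customers), len(orders), len(problematic_merge))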
   ```
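To make the row explosion concrete, here is a small sketch with hypothetical frames in which the key `'A'` appears twice on each side. Each pair of matching rows produces one output row, so `'A'` alone contributes four.

```python
import pandas as pd

# The key 'A' appears twice on each side
left = pd.DataFrame({'key': ['A', 'A', 'B'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'A', 'B'], 'right_val': [10, 20, 30]})

exploded = pd.merge(left, right, on='key')
# 'A' contributes 2 x 2 = 4 rows and 'B' contributes 1: 5 rows from 3-row inputs
print(len(left), len(right), len(exploded))

# Counting repeated keys on each side before merging exposes the risk
print(left['key'].duplicated().sum(), right['key'].duplicated().sum())
```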


2. **Losing Data Due to Mismatched Keys**
   When using inner joins, be aware that you might lose data if keys don't match across DataFrames.


In [58]:
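# An inner merge drops customers without orders
inner_join = pd.merge(customers, orders, on='customer_id', how='inner')
# Compare unique keys rather than row counts: repeated orders can mask dropped customers
lost = customers.index.difference(inner_join.index)
print(f"Customers lost: {len(lost)}")

Customers lost: 1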


4. **Incorrect Assumptions About Join Behavior**
   Different types of joins (left, right, inner, outer) can produce very different results. Always double-check that you're using the correct join for your needs.


In [59]:
# These will produce different results
pd.merge(customers, orders, on='customer_id', how='left')

customer_id,name,email,customer_rating_x,order_id,order_date,total_amount,customer_rating_y
101,Alice Brown,alice@example.com,4.5,1001.0,2023-06-01,150.5,4.0
101,Alice Brown,alice@example.com,4.5,1005.0,2023-06-05,175.8,4.8
102,Bob Charles,bob@example.com,3.8,1003.0,2023-06-03,50.25,3.5
103,Charlie Davis,charlie@example.com,4.2,1002.0,2023-06-02,200.75,4.5
104,David Evans,david@example.com,5.0,,,,
105,Eve Franklin,eve@example.com,4.7,1004.0,2023-06-04,300.0,5.0


In [60]:
pd.merge(customers, orders, on='customer_id', how='right')

customer_id,name,email,customer_rating_x,order_id,order_date,total_amount,customer_rating_y
101,Alice Brown,alice@example.com,4.5,1001,2023-06-01,150.5,4.0
103,Charlie Davis,charlie@example.com,4.2,1002,2023-06-02,200.75,4.5
102,Bob Charles,bob@example.com,3.8,1003,2023-06-03,50.25,3.5
105,Eve Franklin,eve@example.com,4.7,1004,2023-06-04,300.0,5.0
101,Alice Brown,alice@example.com,4.5,1005,2023-06-05,175.8,4.8


5. **Ignoring Formatting Differences in String Keys**
   Be cautious when joining on string columns, as differences in capitalization or whitespace can prevent matches.


   ```python
   # Normalize string columns before joining
   customers['email'] = customers['email'].str.lower().str.strip()
   ```
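A sketch of the effect, using two hypothetical tables (`crm` and `billing`) whose email keys differ only in case and trailing whitespace:

```python
import pandas as pd

crm = pd.DataFrame({
    'email': ['Alice@Example.com ', 'bob@example.com'],
    'name': ['Alice Brown', 'Bob Charles'],
})
billing = pd.DataFrame({
    'email': ['alice@example.com', 'bob@example.com'],
    'plan': ['pro', 'free'],
})

# Without cleanup, only the exactly matching key joins
raw_matches = pd.merge(crm, billing, on='email')

# Normalize case and surrounding whitespace on both sides before merging
crm['email'] = crm['email'].str.strip().str.lower()
billing['email'] = billing['email'].str.strip().str.lower()
clean_matches = pd.merge(crm, billing, on='email')

print(len(raw_matches), len(clean_matches))
```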


6. **Overlooking the Impact of Missing Data**
   NaN values can affect merge results. Decide how to handle missing data before merging.


   ```python
   # Fill NaN values before merging if appropriate
   customers.fillna({'email': 'unknown@example.com'}, inplace=True)
   ```
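Missing values in the key column itself deserve particular care: unlike SQL, pandas matches a NaN key on one side with a NaN key on the other. The sketch below uses hypothetical frames to show the accidental match and one way to avoid it.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': [1.0, np.nan], 'left_val': ['a', 'b']})
right = pd.DataFrame({'key': [1.0, np.nan], 'right_val': ['x', 'y']})

# The NaN key on the left is paired with the NaN key on the right
with_nan = pd.merge(left, right, on='key')
print(len(with_nan))

# Dropping rows with missing keys first avoids the accidental match
without_nan = pd.merge(left.dropna(subset=['key']), right.dropna(subset=['key']), on='key')
print(len(without_nan))
```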


By following these best practices and being aware of common pitfalls, you can ensure more reliable and efficient data manipulation when combining DataFrames in Pandas. Always take the time to validate your results and understand the implications of each operation on your data.