# Working With Multiple DataFrames

To store data efficiently, we often spread related information across multiple tables. For instance, imagine that we own an e-commerce business and want to track the products that have been ordered from our website. We could record everything about each order (the customer, the product, and the order itself) in a single table.

However, a lot of this information would be repeated. If the same customer makes multiple orders, that customer’s name, address, and phone number will be stored multiple times. If the same product is ordered by multiple customers, then the product price and description will be repeated. This would make our orders table large and unmanageable.

So instead, we can split our data into three tables: `orders`, `products`, and `customers`.
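As a rough sketch of what such a split might look like, the snippet below builds three small DataFrames by hand. The column names and values are hypothetical and chosen only for illustration; the real CSV files may be laid out differently.

```python
import pandas as pd

# Hypothetical sample data; IDs, names, and prices are made up.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [2, 1, 2],
    "product_id": [3, 2, 1],
    "quantity": [1, 3, 2],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "customer_name": ["Jane Doe", "John Smith"],
    "phone_number": ["555-0100", "555-0101"],
})
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "description": ["thing-a-ma-jig", "whatcha-ma-call-it", "doo-hickey"],
    "price": [5, 10, 7],
})

# Customer 2 placed two orders, but their details are stored only once.
print(orders)
print(customers)
print(products)
```

Each order row refers to a customer and a product by ID, so the descriptive details never need to be repeated in the orders table.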

## Inner Merge I

In [None]:
import pandas as pd

orders = pd.read_csv('orders.csv')

products = pd.read_csv('products.csv')

customers = pd.read_csv('customers.csv')

print(orders)
print(products)
print(customers)

# Values found by matching rows across the tables by hand
order_3_description = "thing-a-ma-jig"
order_5_phone_number = "112-358-1321"

## Inner Merge II

It is easy to do this kind of matching for one row, but hard to do it for multiple rows.

Luckily, pandas can efficiently do this for the entire table. We use the `pd.merge()` function.

The `pd.merge()` function looks for columns that are common between two DataFrames and then looks for rows where those columns’ values are the same. It then combines the matching rows into a single row in a new table.
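To make the matching concrete, here is a minimal sketch using two small hand-built DataFrames. The `month`, `revenue`, and `target` columns and their values are assumptions for illustration, not the contents of any real file.

```python
import pandas as pd

# Hypothetical data: both tables share a "month" column.
sales = pd.DataFrame({
    "month": ["January", "February", "March"],
    "revenue": [300, 290, 310],
})
targets = pd.DataFrame({
    "month": ["January", "February", "March"],
    "target": [310, 270, 300],
})

# No column is named explicitly: pandas merges on the shared "month" column.
merged = pd.merge(sales, targets)
print(merged)
```

The result has one row per month, with `revenue` and `target` side by side.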

In [None]:
import pandas as pd

sales = pd.read_csv('sales.csv')
print(sales)
targets = pd.read_csv('targets.csv')
print(targets)

sales_vs_targets = pd.merge(sales, targets)
print(sales_vs_targets)

crushing_it = sales_vs_targets[sales_vs_targets.revenue > sales_vs_targets.target]
print(crushing_it)

## Inner Merge III

In addition to using `pd.merge()`, each DataFrame has its own `.merge()` method. The `.merge()` method is the same as the `pd.merge()` function, but it is called on one of the DataFrames.

We generally use this when we are joining more than two DataFrames together because we can “chain” the commands. For example, the following command would merge orders to customers, and then the resulting DataFrame to products:


```python
big_df = orders.merge(customers).merge(products)
```
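The chained call above can be tried on small hand-built tables. This is only a sketch: the column names and values below are hypothetical, and the tables are arranged so that each pair shares exactly one column.

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [2, 1, 2],
    "product_id": [3, 2, 1],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "customer_name": ["Jane Doe", "John Smith"],
})
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "description": ["thing-a-ma-jig", "whatcha-ma-call-it", "doo-hickey"],
})

# First merge matches on customer_id; the second matches on product_id.
big_df = orders.merge(customers).merge(products)
print(big_df)
```

Each `.merge()` returns a new DataFrame, which is why the next `.merge()` can be called directly on the result.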

In [None]:
import pandas as pd

sales = pd.read_csv('sales.csv')
print(sales)
targets = pd.read_csv('targets.csv')
print(targets)

men_women = pd.read_csv("men_women_sales.csv")

all_data = men_women.merge(sales).merge(targets)

results = all_data[(all_data.revenue > all_data.target) & (all_data.women > all_data.men)]
print(results)

## Merge on Specific Columns

In the previous example, `.merge()` “knew” how to combine tables based on the columns that were the same between two tables. For instance, `products` and `orders` both had a column called `product_id`. This won’t always be true when we want to perform a merge. When the matching columns have different names, one option is to use `.rename()` to rename a column so that the names match before merging.
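As a small illustration of the renaming approach, the sketch below assumes a `products` table whose key column is called `id` rather than `product_id`. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data: the key is "product_id" in orders but "id" in products.
orders = pd.DataFrame({
    "order_id": [1, 2],
    "product_id": [2, 1],
})
products = pd.DataFrame({
    "id": [1, 2],
    "description": ["thing-a-ma-jig", "whatcha-ma-call-it"],
})

# rename() returns a new DataFrame; the original products table is unchanged.
renamed = products.rename(columns={"id": "product_id"})
orders_products = pd.merge(orders, renamed)
print(orders_products)
```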

In [None]:
orders_products = pd.merge(orders, products.rename(columns={'id': 'product_id'}))
print(orders_products)

## Merge on Specific Columns II

In the previous exercise, we learned how to use `.rename()` to merge two DataFrames whose columns don’t match.

If we don’t want to do that, we have another option. We could use the keywords `left_on` and `right_on` to specify which columns we want to perform the merge on. 
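The sketch below shows one way this might look when both tables also have a column named `id`, which is where the `suffixes` argument becomes useful. The tables and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: both tables have an "id" column with different meanings.
orders = pd.DataFrame({
    "id": [1, 2, 3],           # order ID
    "product_id": [2, 1, 2],
})
products = pd.DataFrame({
    "id": [1, 2],              # product ID
    "description": ["thing-a-ma-jig", "whatcha-ma-call-it"],
})

# Match orders.product_id against products.id, and label the two
# clashing "id" columns so it is clear which table each came from.
orders_products = pd.merge(
    orders,
    products,
    left_on="product_id",
    right_on="id",
    suffixes=["_orders", "_products"],
)
print(orders_products)
```

Without `suffixes`, pandas would label the two clashing columns `id_x` and `id_y`, which is harder to read.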

In [None]:
orders = pd.read_csv('orders.csv')
print(orders)
products = pd.read_csv('products.csv')
print(products)

# Merge orders and products using left_on and right_on
orders_products = pd.merge(
    orders,
    products,
    left_on='product_id',
    right_on='id',
    suffixes=['_orders', '_products'],
)
print(orders_products)

## Mismatched Merges

In our previous examples, there were always matching values when we were performing our merges. What happens when that isn’t true?

Let’s imagine that our products table is out of date and is missing the newest product: Product 5. What happens when someone orders it?
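A minimal sketch of this situation, using hypothetical hand-built tables in which one order refers to Product 5 but the products table stops at Product 3:

```python
import pandas as pd

# Hypothetical data: order 2 is for product 5, which products does not list.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product_id": [2, 5, 1],
})
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "description": ["thing-a-ma-jig", "whatcha-ma-call-it", "doo-hickey"],
})

# The default merge keeps only rows whose product_id appears in both tables.
merged_df = pd.merge(orders, products)
print(merged_df)
```

The order for Product 5 does not appear in `merged_df`, and neither does Product 3, which nobody ordered.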

In [None]:
orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')

print(orders)
print(products)

merged_df = pd.merge(orders, products)
print(merged_df)

## Outer Merge

In the previous exercise, we saw that when we merge two DataFrames whose rows don’t match perfectly, we lose the unmatched rows.

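This type of merge (where we only include matching rows) is called an inner merge. There are other types of merges that we can use when we want to keep information from the unmatched rows. An outer merge, requested with `how="outer"`, includes all rows from both tables, even if they don’t match; any missing values are filled in with `NaN`.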
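For comparison with the inner merge, here is a sketch of an outer merge on two hypothetical store inventories. The `item` and inventory columns are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data: each store stocks one item the other does not.
store_a = pd.DataFrame({
    "item": ["hammer", "nails", "screwdriver"],
    "store_a_inventory": [12, 200, 15],
})
store_b = pd.DataFrame({
    "item": ["hammer", "nails", "wrench"],
    "store_b_inventory": [6, 250, 9],
})

# how="outer" keeps every item from both tables; gaps become NaN.
store_a_b_outer = pd.merge(store_a, store_b, how="outer")
print(store_a_b_outer)
```

The screwdriver row has `NaN` for `store_b_inventory`, and the wrench row has `NaN` for `store_a_inventory`.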

In [None]:
store_a = pd.read_csv('store_a.csv')
print(store_a)
store_b = pd.read_csv('store_b.csv')
print(store_b)

store_a_b_outer = pd.merge(store_a, store_b, how="outer")
print(store_a_b_outer)

## Left Merge

Suppose `company_a` holds customer emails and `company_b` holds customer phone numbers, and we want to identify which customers are missing phone information. We would want a list of all customers who have an email but don’t have a phone number.

We could get this by performing a left merge. A left merge includes all rows from the first (left) table, but only rows from the second (right) table that match the first table.

For this command, the order of the arguments matters. If the first DataFrame is `company_a` and we do a left merge, we’ll only end up with rows that appear in `company_a`.

By listing `company_a` first, we get all customers from Company A, and only customers from Company B who are also customers of Company A.
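The following sketch illustrates the idea with two hypothetical tables, where `company_a` is assumed to hold emails and `company_b` phone numbers. All names and contact details are made up.

```python
import pandas as pd

# Hypothetical data: company_a stores emails, company_b stores phone numbers.
company_a = pd.DataFrame({
    "name": ["Sally Sparrow", "Peter Grant", "Leslie May"],
    "email": ["sally@example.com", "peter@example.com", "leslie@example.com"],
})
company_b = pd.DataFrame({
    "name": ["Peter Grant", "Leslie May", "Aaron Burr"],
    "phone": ["555-0102", "555-0103", "555-0104"],
})

# Keep every row of company_a; phone is NaN where company_b has no match.
left_merged = pd.merge(company_a, company_b, how="left")

# Customers with an email on file but no phone number.
missing_phone = left_merged[left_merged.phone.isnull()]
print(missing_phone)
```

Aaron Burr appears only in `company_b`, so he is absent from the left-merged result.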

## Right Merge

Right merge is the exact opposite of left merge. Here, the merged table will include all rows from the second (right) table, but only rows from the first (left) table that match the second table.

By listing `company_a` first and `company_b` second, we get all customers from Company B, and only customers from Company A who are also customers of Company B.
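A right merge can be sketched the same way by passing `how="right"`. The two tables below are hypothetical and exist only to show which rows survive.

```python
import pandas as pd

# Hypothetical data: company_a stores emails, company_b stores phone numbers.
company_a = pd.DataFrame({
    "name": ["Sally Sparrow", "Peter Grant", "Leslie May"],
    "email": ["sally@example.com", "peter@example.com", "leslie@example.com"],
})
company_b = pd.DataFrame({
    "name": ["Peter Grant", "Leslie May", "Aaron Burr"],
    "phone": ["555-0102", "555-0103", "555-0104"],
})

# Keep every row of company_b; email is NaN where company_a has no match.
right_merged = pd.merge(company_a, company_b, how="right")
print(right_merged)
```

Here Sally Sparrow is dropped because she appears only in `company_a`, while Aaron Burr is kept with a missing email.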

In [None]:
store_a = pd.read_csv('store_a.csv')
print(store_a)
store_b = pd.read_csv('store_b.csv')
print(store_b)

store_a_b_left = pd.merge(store_a, store_b, how="left")
store_b_a_left = pd.merge(store_b, store_a, how="left")

print(store_a_b_left)
print(store_b_a_left)

## Concatenate DataFrames

Sometimes, a dataset is broken into multiple tables. For instance, data is often split into multiple CSV files so that each download is smaller.

When we need to reconstruct a single DataFrame from multiple smaller DataFrames, we can use the function `pd.concat([df1, df2, df3, ...])`. This stacks the rows cleanly only if all of the DataFrames have the same columns; where the columns differ, pandas keeps every column and fills the gaps with `NaN`.
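As a brief sketch, the example below stacks two hypothetical menu tables that share the same columns. The `ignore_index=True` argument is an addition not used elsewhere in this lesson; it renumbers the rows of the combined table from zero.

```python
import pandas as pd

# Hypothetical data: two menus with identical columns.
bakery = pd.DataFrame({
    "item": ["cookie", "brownie"],
    "price": [2.50, 3.50],
})
ice_cream = pd.DataFrame({
    "item": ["single scoop", "double scoop"],
    "price": [3.00, 4.50],
})

# Stack the rows of both tables into one DataFrame.
menu = pd.concat([bakery, ice_cream], ignore_index=True)
print(menu)
```

Without `ignore_index=True`, each row would keep its original index, so the combined table would contain the labels 0 and 1 twice.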

In [None]:
import pandas as pd

bakery = pd.read_csv('bakery.csv')
print(bakery)
ice_cream = pd.read_csv('ice_cream.csv')
print(ice_cream)

menu = pd.concat([bakery, ice_cream])
print(menu)

## Review

In [None]:
visits = pd.read_csv('visits.csv', parse_dates=[1])
checkouts = pd.read_csv('checkouts.csv', parse_dates=[1])

print(visits)
print(checkouts)

v_to_c = pd.merge(visits, checkouts)
v_to_c['time'] = v_to_c.checkout_time - v_to_c.visit_time
print(v_to_c['time'].mean())