# Python Pandas 4 -- Working w/ Multiple Data Frames

We looked at merging briefly in the previous examples (mostly because the data I selected called for it!). Here we're going to look at something a little bit more focused. 

### Setup
Let's start by setting up our environment

In [None]:
# pandas 
# in many examples and codebases you'll find statements that alias the import (e.g. import pandas as pd). 
# I don't like doing this due to the ambiguity. Autocomplete protects our fingers. 
import pandas

In [None]:
# orders
orders = pandas.read_csv('orders.csv')
orders

In [None]:
# customers
customers = pandas.read_csv('customers.csv')
customers

In [None]:
# products
products = pandas.read_csv('products.csv')
products

In [None]:
# monthly_sales
sales = pandas.read_csv('monthly_sales.csv')
sales

In [None]:
# monthly sales targets
targets = pandas.read_csv('monthly_targets.csv')
targets

In [None]:
# This is a comparison of east vs. west regions 
east_vs_west = pandas.read_csv('east_vs_west.csv')
east_vs_west

### Inner Merge

Let's just create a basic inner merge. This is the default merge, so we don't have to specify it, but if we decide to do it anyway it would be **how='inner'**.

In [None]:
# comparing sales and targets
sales_and_targets = pandas.merge(sales, targets)
sales_and_targets
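For reference, here is the same kind of merge with nothing left to the defaults. This is a minimal sketch on made-up frames: the `month`, `revenue`, and `target` columns are my assumption about what the CSVs hold, and the `_demo` names are only for illustration.

```python
import pandas

# Made-up stand-ins for monthly_sales.csv and monthly_targets.csv (assumed columns)
sales_demo = pandas.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'revenue': [300, 290, 310]})
targets_demo = pandas.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'target': [310, 270, 300]})

# Default merge: inner, on every column name the two frames share
implicit = pandas.merge(sales_demo, targets_demo)

# The same merge, spelled out
explicit = pandas.merge(sales_demo, targets_demo, how='inner', on='month')

# True when both calls produce the same frame
print(implicit.equals(explicit))
```

Spelling out `how` and `on` costs a few keystrokes, but it makes the merge easier to read later.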

In [None]:
# We can do different things like calculate the difference
sales_and_targets['difference'] = sales_and_targets['revenue'] - sales_and_targets['target']
sales_and_targets

In [None]:
# ... or separate out the months we're over 
sales_and_targets[sales_and_targets.difference > 0]

In [None]:
# (We didn't need to create a new column to do this...) 
sales_and_targets[sales_and_targets.revenue > sales_and_targets.target]

Our previous merges used the top-level pandas.merge() function; however, each DataFrame instance also has its own merge() method. This is the preferred way to merge when you are going to chain more than 2 dataframes together. (Basic method chaining.)

In [None]:
# let's chain our sales data
chained_sales = sales.merge(targets).merge(east_vs_west)
chained_sales
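If you're wondering whether the chained form does anything different from nesting calls to pandas.merge(), it shouldn't. A small sketch with made-up frames (the column names are assumptions, not the real CSV contents):

```python
import pandas

# Made-up frames that all share a single 'month' column
sales_demo = pandas.DataFrame({'month': ['Jan', 'Feb'], 'revenue': [300, 290]})
targets_demo = pandas.DataFrame({'month': ['Jan', 'Feb'], 'target': [310, 270]})
regions_demo = pandas.DataFrame({'month': ['Jan', 'Feb'], 'east': [120, 160], 'west': [180, 130]})

# Nested function calls read inside-out...
nested = pandas.merge(pandas.merge(sales_demo, targets_demo), regions_demo)

# ...while the method form reads left to right
chained = sales_demo.merge(targets_demo).merge(regions_demo)

# True when both spellings build the same frame
print(nested.equals(chained))
```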

In [None]:
# we can now perform complex filters on multiple data frames
chained_sales[(chained_sales.revenue > chained_sales.target) & (chained_sales.east > chained_sales.west)]

### Merging on specific columns

Prior to this, the examples automatically merged on the columns that made sense, because there was only a single shared column name. This is a neat feature of pandas. (It can be painful when working w/ many datasets that you aren't familiar with, because pandas merges on every column name the two frames share. **ALWAYS INSPECT YOUR DATA**)

It's far more common that you'll find data frames that don't have shared column names (or if they do, the column names don't have the same semantics).
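Here is a sketch of how the "same name, different meaning" case can bite. Both frames below are made up: they share `customer_id` (a real key) and `name` (a person's name in one frame, an account name in the other).

```python
import pandas

# Hypothetical frames: 'name' appears in both but means different things
customers_demo = pandas.DataFrame({'customer_id': [1, 2], 'name': ['Ana', 'Ben']})
accounts_demo = pandas.DataFrame({'customer_id': [1, 2], 'name': ['Savings', 'Checking']})

# The default merges on BOTH shared columns, so no row matches
accidental = pandas.merge(customers_demo, accounts_demo)
print(len(accidental))

# Naming the key explicitly gives the merge we actually wanted
intended = pandas.merge(customers_demo, accounts_demo, on='customer_id')
print(list(intended.columns))
```

The default merge comes back empty without raising an error, which is why it's worth checking the shape of a merge result before moving on.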

In [None]:
# Let's take a look at our orders and products data frames. I know, we looked at them above.. 
print(orders)
print(products)


We have no common column names in this case, but we have a foreign key relationship between **orders.product_id** and **products.id**. This means we have a way to merge the two frames... but how?  

The simplest solution would be to make the names match so that pandas can work its black magic!

In [None]:
# Solution 1: Trust the Black Magic -- rename your columns. 
orders_and_products = pandas.merge(
    orders,
    products.rename(columns={'id':'product_id'})
)
orders_and_products

In [None]:
# Solution 2: A more SQL-like solution -- use merge, but specify the column relationship
## Suffixes are how pandas resolves columns w/ the same name in both frames. It won't keep two identically named columns in the result. 
## The suffixes get appended to the overlapping names and tell you which frame each column came from (the defaults are _x and _y)
orders_and_products = pandas.merge(
    orders,
    products,
    left_on='product_id',
    right_on='id',
    suffixes=['_orders','_products'])
orders_and_products
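To see what the suffixes actually do to the column names, here's a small sketch. The frames are made up, and I'm assuming both have an `id` column of their own, which is the situation suffixes exist for.

```python
import pandas

# Hypothetical frames: each has its own 'id'; orders points at products via 'product_id'
orders_demo = pandas.DataFrame({'id': [1, 2, 3], 'product_id': [2, 1, 2]})
products_demo = pandas.DataFrame({'id': [1, 2], 'description': ['hammer', 'wrench']})

merged = pandas.merge(
    orders_demo,
    products_demo,
    left_on='product_id',
    right_on='id',
    suffixes=('_orders', '_products'))

# The two 'id' columns come back renamed so they can coexist
print(list(merged.columns))
```

Only the overlapping names get a suffix; `product_id` and `description` are unique, so they come through untouched.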

Let's show how merges go wrong... (let's look at products and orders again)

In [None]:
# I lied, let's look at a different example of orders
orders2 = pandas.read_csv('orders2.csv')
orders2

In [None]:
products

Did you notice that there is a **product_id=5** that corresponds to **order_id=3**?? 
However, that **product_id** doesn't exist in the products dataframe.

Let's do a default inner merge to see what happens.

In [None]:
busted_merge = pandas.merge(
    orders2,
    products,
    left_on="product_id",
    right_on="id"
)
busted_merge

????
Where did order 3 go??

Ok. Ok. You got me. Inner merge is the same concept as an inner join, so the "query" (or merge in this case) is only going to include rows that have complete/perfect matches. 
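If you'd rather catch those dropped rows than be surprised by them, one option is a left merge with `indicator=True`, which adds a `_merge` column saying where each row came from. A sketch on made-up data shaped like the example above (the column names are assumptions):

```python
import pandas

# Made-up data: order 3 points at product 5, which isn't in the product table
orders_demo = pandas.DataFrame({'order_id': [1, 2, 3], 'product_id': [1, 2, 5]})
products_demo = pandas.DataFrame({'id': [1, 2], 'description': ['hammer', 'wrench']})

# Keep every order, and let pandas label each row's origin
checked = pandas.merge(
    orders_demo,
    products_demo,
    how='left',
    left_on='product_id',
    right_on='id',
    indicator=True)

# 'left_only' rows are orders with no matching product
orphans = checked[checked['_merge'] == 'left_only']
print(orphans[['order_id', 'product_id']])
```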

Let's look at two different locations of "Sully's Hahdware" stores in the Greater Boston area. (Yes, I made it up).

In [None]:
# Billerica store
billerica = pandas.read_csv('billerica.csv')
billerica

In [None]:
# Methuen store
methuen = pandas.read_csv('methuen.csv')
methuen

As you can see there are a lot of rows without a perfect match in the other store, so an inner merge probably isn't going to work. Introducing

### Outer Merges!

An outer merge includes ALL rows from both tables even if they don't match. (remember?? We used this strategy w/ the Celtics!) 

In [None]:
pandas.merge(billerica, methuen, how='outer')
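Here's the same idea on data small enough to check by eye. The two inventories are made up, and the `item`, `billerica_qty`, and `methuen_qty` column names are assumptions for illustration.

```python
import pandas

# Hypothetical inventories; 'item' is the only shared column
billerica_demo = pandas.DataFrame({'item': ['hammer', 'nails'], 'billerica_qty': [10, 500]})
methuen_demo = pandas.DataFrame({'item': ['hammer', 'saw'], 'methuen_qty': [4, 7]})

# Inner keeps only items stocked in both stores
inner = pandas.merge(billerica_demo, methuen_demo)

# Outer keeps every item from either store, filling the gaps with NaN
outer = pandas.merge(billerica_demo, methuen_demo, how='outer')

print(len(inner), len(outer))
print(outer)
```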

### Left and Right Merge

As we saw above, rows without a match end up with NaNs or Nones in the columns that came from the other frame. In the case above, that's a valid value, because we're measuring inventory across multiple stores. 

Let's look at left and right merge. 

In [None]:
# left merge for Billerica
## This keeps every product stocked in Billerica, whether or not Methuen stocks it too. 
## In other words, rows that only exist on the "right" (Methuen) are dropped, and the Methuen columns are NaN wherever there's no match.
pandas.merge(billerica, methuen, how='left')

In [None]:
# Let's swap this to a left merge for Methuen (same concept, just switching which data frame is on the left)
pandas.merge(methuen, billerica, how='left') 

In [None]:
# What about a right merge??
## This will show the same result we had when performing a left merge on billerica, but we've reordered the columns. 
pandas.merge(methuen, billerica, how='right')

In [None]:
# right merge on methuen to go full circle...
pandas.merge(billerica, methuen, how='right')
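To convince yourself that a right merge is just a left merge with the arguments swapped, here's a sketch on made-up inventories (the column names are assumptions). It reorders the columns and sorts by item before comparing cell by cell.

```python
import pandas

# Hypothetical inventories; 'item' is the only shared column
billerica_demo = pandas.DataFrame({'item': ['hammer', 'nails', 'tape'], 'billerica_qty': [10, 500, 25]})
methuen_demo = pandas.DataFrame({'item': ['hammer', 'saw'], 'methuen_qty': [4, 7]})

left = pandas.merge(billerica_demo, methuen_demo, how='left')
right = pandas.merge(methuen_demo, billerica_demo, how='right')

# Same columns, different order
print(list(left.columns))
print(list(right.columns))

def normalized(frame):
    # Same column order, same row order, NaN swapped for -1 so it compares equal
    ordered = frame[['item', 'billerica_qty', 'methuen_qty']]
    return ordered.sort_values('item').fillna(-1).values.tolist()

same = normalized(left) == normalized(right)
print(same)
```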

### Concatenation

This is a wonderful tool. Concatenate allows you to stack multiple dataframes on top of each other. This is very useful when you want to store large (**LARGE**) datasets in the cloud, but want to avoid massive download times: split them into smaller files and stitch them back together after loading. It's also useful in breaking up spreadsheets or csv files to avoid LFS limitations in git. 

In [None]:
# go get our CSVs.
black_lion = pandas.read_csv('black_lion.csv')
red_lion = pandas.read_csv('red_lion.csv')
blue_lion = pandas.read_csv('blue_lion.csv')
green_lion = pandas.read_csv('green_lion.csv')
yellow_lion = pandas.read_csv('yellow_lion.csv')

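# drop=True throws away each file's old row labels instead of keeping them as an extra 'index' column
voltron = pandas.concat([black_lion, red_lion, blue_lion, green_lion, yellow_lion]).reset_index(drop=True)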
voltron
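A quick sketch of what happens to the row labels when you concatenate, using two tiny made-up frames (the names and columns are for illustration only):

```python
import pandas

# Two made-up pieces of one larger dataset
part_one = pandas.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})
part_two = pandas.DataFrame({'name': ['c', 'd'], 'value': [3, 4]})

# Plain concat: each piece keeps its own 0, 1 labels, so labels repeat
stacked = pandas.concat([part_one, part_two])
print(list(stacked.index))

# ignore_index=True renumbers the rows from 0 in one step
renumbered = pandas.concat([part_one, part_two], ignore_index=True)
print(list(renumbered.index))

# reset_index() without drop=True keeps the old labels as a new 'index' column
print(list(stacked.reset_index().columns))
```

Repeated labels are easy to miss and can trip up later `.loc` lookups, so renumbering right after a concat is usually the safer habit.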