Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.

## Merging on a specific column
This exercise follows on from the last one, using the DataFrames revenue and managers for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier, so you add a branch_id column to both DataFrames. New cities have also been added to both the revenue and managers DataFrames. pandas has been imported as pd and both DataFrames are available in your namespace.

At present, there should be a 1-to-1 relationship between the city and branch_id fields. In that case, the result of a merge on the city columns ought to give you the same output as a merge on the branch_id columns. Do they? Can you spot an ambiguity in one of the DataFrames?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
revenue = pd.DataFrame({'city':['Austin', 'Denver', 'Springfield', 'Mendocino'], 'branch_id':[10,20,30,47], 'revenue':[100,83,4,200]})
revenue

Unnamed: 0,city,branch_id,revenue
0,Austin,10,100
1,Denver,20,83
2,Springfield,30,4
3,Mendocino,47,200


In [4]:
managers = pd.DataFrame({'city':['Austin', 'Denver', 'Springfield', 'Mendocino'], 'branch_id':[10,20,47,31], 'managers':['Charles', 'Joel', 'Brett', 'Sally']})
managers

Unnamed: 0,city,branch_id,managers
0,Austin,10,Charles
1,Denver,20,Joel
2,Springfield,47,Brett
3,Mendocino,31,Sally


In [5]:
from IPython import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [6]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')

# Print merge_by_city
merge_by_city

# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')

# Print merge_by_id
merge_by_id

Unnamed: 0,city,branch_id_x,revenue,branch_id_y,managers
0,Austin,10,100,10,Charles
1,Denver,20,83,20,Joel
2,Springfield,30,4,47,Brett
3,Mendocino,47,200,31,Sally


Unnamed: 0,city_x,branch_id,revenue,city_y,managers
0,Austin,10,100,Austin,Charles
1,Denver,20,83,Denver,Joel
2,Mendocino,47,200,Springfield,Brett


__Well done! Notice that when you merge on 'city', the resulting DataFrame has a peculiar result: in row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield, one in the state of Illinois and the other in Missouri. The revenue DataFrame has the one from Illinois, and the managers DataFrame has the one from Missouri. Consequently, a merge on 'branch_id' cannot pair these two rows with each other. In the output above, Mendocino also has mismatched branch IDs (47 and 31), so the merge on 'branch_id' keeps only the three IDs present in both DataFrames (10, 20, and 47), and ID 47 pairs Mendocino from revenue with Springfield from managers.__
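One way to spot this kind of ambiguity without reading the output by eye is to merge on 'city' and then filter for rows where the two branch IDs disagree. The following is a minimal sketch that rebuilds the same revenue and managers data; the names `by_city`, `conflicts`, and `conflict_cities` are illustrative and not part of the exercise.

```python
import pandas as pd

# Same data as the revenue and managers DataFrames defined above.
revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'branch_id': [10, 20, 30, 47],
                        'revenue': [100, 83, 4, 200]})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                         'branch_id': [10, 20, 47, 31],
                         'managers': ['Charles', 'Joel', 'Brett', 'Sally']})

# Merge on 'city'; the overlapping 'branch_id' column gets _x/_y suffixes.
by_city = pd.merge(revenue, managers, on='city')

# Keep only the rows where the two DataFrames disagree on the branch ID.
conflicts = by_city[by_city['branch_id_x'] != by_city['branch_id_y']]
conflict_cities = conflicts['city'].tolist()
print(conflict_cities)
```

If city and branch_id really were in a 1-to-1 relationship across both DataFrames, `conflicts` would be empty.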

In [11]:
managers['branch'] = managers['city']
managers.drop('city', axis = 1, inplace = True)
managers

Unnamed: 0,branch_id,managers,branch
0,10,Charles,Austin
1,20,Joel,Denver
2,47,Brett,Springfield
3,31,Sally,Mendocino


## Merging on columns with non-matching labels
You continue working with the revenue and managers DataFrames from before. This time, someone has changed the field name 'city' to 'branch' in the managers table. Now, when you attempt to merge the DataFrames, an exception is raised:
```python 
>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'
 ```
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the left_on and right_on parameters in the call to pd.merge().

As before, pandas has been pre-imported as pd and the revenue and managers DataFrames are in your namespace. They have been printed in the IPython Shell so you can examine the columns prior to merging.

Are you able to merge better than in the last exercise? How should the rows with Springfield be handled?

In [12]:
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')

# Print combined
combined

Unnamed: 0,city,branch_id_x,revenue,branch_id_y,managers,branch
0,Austin,10,100,10,Charles,Austin
1,Denver,20,83,20,Joel,Denver
2,Springfield,30,4,47,Brett,Springfield
3,Mendocino,47,200,31,Sally,Mendocino
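After a merge with left_on and right_on, both key columns survive in the result, so 'branch' simply repeats 'city'. A small sketch of one possible cleanup follows: it passes more descriptive suffixes and drops the redundant key. The suffix strings '_rev' and '_mgr' and the name `tidy` are choices made for this illustration only.

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'branch_id': [10, 20, 30, 47],
                        'revenue': [100, 83, 4, 200]})
# managers uses the label 'branch' where revenue uses 'city'.
managers = pd.DataFrame({'branch': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                         'branch_id': [10, 20, 47, 31],
                         'managers': ['Charles', 'Joel', 'Brett', 'Sally']})

# Name the key column on each side, and label the overlapping
# 'branch_id' columns by their source instead of the default _x/_y.
combined = pd.merge(revenue, managers, left_on='city', right_on='branch',
                    suffixes=('_rev', '_mgr'))

# 'branch' duplicates 'city' after the merge, so it can be dropped.
tidy = combined.drop(columns='branch')
print(tidy.columns.tolist())
```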


In [14]:
managers.head()
managers.rename(columns = {'branch':'city'}, inplace = True)
managers.head()

Unnamed: 0,branch_id,managers,branch
0,10,Charles,Austin
1,20,Joel,Denver
2,47,Brett,Springfield
3,31,Sally,Mendocino


Unnamed: 0,branch_id,managers,city
0,10,Charles,Austin
1,20,Joel,Denver
2,47,Brett,Springfield
3,31,Sally,Mendocino


## Merging on multiple columns
Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called state to both DataFrames from the preceding exercises. Again, pandas has been pre-imported as pd and the revenue and managers DataFrames are in your namespace.

Your goal in this exercise is to use pd.merge() to merge DataFrames using multiple columns (using 'branch_id', 'city', and 'state' in this case).

Are you able to match all your company's branches correctly?

In [15]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX', 'CO', 'IL', 'CA']

# Add 'state' column to managers: managers['state']
managers['state'] = ['TX', 'CO', 'CA', 'MO']

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])

# Print combined
combined

Unnamed: 0,city,branch_id,revenue,state,managers
0,Austin,10,100,TX,Charles
1,Denver,20,83,CO,Joel
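Only two branches survive the inner merge on all three columns. To see which rows failed to match and which side they came from, one option is an outer merge with indicator=True, which adds a `_merge` column. This is a sketch using the same data as above; the names `audit` and `unmatched` are illustrative.

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'branch_id': [10, 20, 30, 47],
                        'revenue': [100, 83, 4, 200],
                        'state': ['TX', 'CO', 'IL', 'CA']})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                         'branch_id': [10, 20, 47, 31],
                         'managers': ['Charles', 'Joel', 'Brett', 'Sally'],
                         'state': ['TX', 'CO', 'CA', 'MO']})

# An outer merge keeps every row from both sides; indicator=True adds a
# '_merge' column with the values 'both', 'left_only', or 'right_only'.
audit = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'],
                 how='outer', indicator=True)

# Rows that did not find a partner on all three key columns.
unmatched = audit[audit['_merge'] != 'both']
print(unmatched[['city', 'branch_id', 'state', '_merge']])
```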


In [17]:
managers.head()
managers.rename(columns = {'city':'branch'}, inplace = True)
managers.head()

Unnamed: 0,branch_id,managers,city,state
0,10,Charles,Austin,TX
1,20,Joel,Denver,CO
2,47,Brett,Springfield,CA
3,31,Sally,Mendocino,MO


Unnamed: 0,branch_id,managers,branch,state
0,10,Charles,Austin,TX
1,20,Joel,Denver,CO
2,47,Brett,Springfield,CA
3,31,Sally,Mendocino,MO


## Left & right merging on multiple columns
You now have, in addition to the revenue and managers DataFrames from prior exercises, a DataFrame sales that summarizes units sold from specific branches (identified by city and state but not branch_id).

Once again, the managers DataFrame uses the label branch where the other two DataFrames use city. Your task here is to employ left and right merges to preserve data and identify where data is missing.

By merging revenue and sales with a right merge, you can identify the missing revenue values. Here, you don't need to specify left_on or right_on because the columns to merge on have matching labels.

By merging sales and managers with a left merge, you can identify the missing manager. Here, the columns to merge on have conflicting labels, so you must specify left_on and right_on. In both cases, you're looking to figure out how to connect the fields in rows containing Springfield.

In [26]:
sales = pd.DataFrame({'city':['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'], 'state':['CA', 'CO', 'TX', 'MO', 'IL'], 
                     'units':[1,4,2,5,1]})
sales

Unnamed: 0,city,state,units
0,Mendocino,CA,1
1,Denver,CO,4
2,Austin,TX,2
3,Springfield,MO,5
4,Springfield,IL,1


In [27]:
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Print revenue_and_sales
revenue_and_sales

# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
sales_and_managers

Unnamed: 0,city,branch_id,revenue,state,units
0,Austin,10.0,100.0,TX,2
1,Denver,20.0,83.0,CO,4
2,Springfield,30.0,4.0,IL,1
3,Mendocino,47.0,200.0,CA,1
4,Springfield,,,MO,5


Unnamed: 0,city,state,units,branch_id,managers,branch
0,Mendocino,CA,1,,,
1,Denver,CO,4,20.0,Joel,Denver
2,Austin,TX,2,10.0,Charles,Austin
3,Springfield,MO,5,,,
4,Springfield,IL,1,,,
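Because a left merge keeps every row of the left DataFrame and fills unmatched columns with NaN, the missing data can be pulled out programmatically with isna(). The sketch below rebuilds sales and managers with the same values used in this notebook; the name `missing` is illustrative.

```python
import pandas as pd

sales = pd.DataFrame({'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
                      'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
                      'units': [1, 4, 2, 5, 1]})
managers = pd.DataFrame({'branch': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                         'branch_id': [10, 20, 47, 31],
                         'managers': ['Charles', 'Joel', 'Brett', 'Sally'],
                         'state': ['TX', 'CO', 'CA', 'MO']})

# Left merge: every sales row is kept, matched or not.
sales_and_managers = pd.merge(sales, managers, how='left',
                              left_on=['city', 'state'],
                              right_on=['branch', 'state'])

# Sales rows for which no manager was found.
missing = sales_and_managers[sales_and_managers['managers'].isna()]
print(missing[['city', 'state', 'units']])
```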


In [28]:
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)

# Print merge_default
merge_default

# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')

# Print merge_outer
merge_outer

# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, how='outer', on=['city', 'state'])

# Print merge_outer_on
merge_outer_on

Unnamed: 0,city,state,units,branch_id,managers,branch,revenue
0,Denver,CO,4,20.0,Joel,Denver,83.0
1,Austin,TX,2,10.0,Charles,Austin,100.0
2,Springfield,MO,5,,,,


Unnamed: 0,city,state,units,branch_id,managers,branch,revenue
0,Mendocino,CA,1,,,,
1,Denver,CO,4,20.0,Joel,Denver,83.0
2,Austin,TX,2,10.0,Charles,Austin,100.0
3,Springfield,MO,5,,,,
4,Springfield,IL,1,,,,
5,Springfield,IL,1,30.0,,,4.0
6,Mendocino,CA,1,47.0,,,200.0


Unnamed: 0,city,state,units_x,branch_id_x,managers,branch,branch_id_y,revenue,units_y
0,Mendocino,CA,1,,,,47.0,200.0,1
1,Denver,CO,4,20.0,Joel,Denver,20.0,83.0,4
2,Austin,TX,2,10.0,Charles,Austin,10.0,100.0,2
3,Springfield,MO,5,,,,,,5
4,Springfield,IL,1,,,,30.0,4.0,1
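The three results differ because pd.merge() without an on argument joins on every column label the two DataFrames share, whereas on=['city', 'state'] restricts the join keys and suffixes the remaining shared columns. The sketch below shows this on two small hypothetical DataFrames, `left` and `right`, which are invented for this illustration and are not the exercise data.

```python
import pandas as pd

# Hypothetical frames sharing 'city', 'state', and 'units'.
left = pd.DataFrame({'city': ['Austin', 'Denver'],
                     'state': ['TX', 'CO'],
                     'units': [2, 4]})
right = pd.DataFrame({'city': ['Austin', 'Denver'],
                      'state': ['TX', 'CO'],
                      'units': [2, 5],
                      'revenue': [100, 83]})

# With no 'on' argument, the join keys are all shared column labels.
common = [c for c in left.columns if c in right.columns]
print(common)

# Default merge: Denver is dropped because 'units' (4 vs 5) must match too.
default = pd.merge(left, right)

# Explicit keys: both rows match, and 'units' becomes units_x / units_y.
explicit = pd.merge(left, right, on=['city', 'state'])
print(explicit.columns.tolist())
```

Inspecting the shared columns before merging is a cheap way to avoid being surprised by which rows survive.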


## Using merge_ordered()
This exercise uses pre-loaded DataFrames austin and houston that contain weather data from the cities Austin and Houston respectively. They have been printed in the IPython Shell for you to examine.

Weather conditions were recorded on separate days and you need to merge these two DataFrames together such that the dates are ordered. To do this, you'll use pd.merge_ordered(). After you're done, note the order of the rows before and after merging.

In [29]:
austin = pd.DataFrame({'date':['2016-01-01', '2016-02-08', '2016-01-17'], 'ratings':['Cloudy', 'Cloudy', 'Sunny']})
austin

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-02-08,Cloudy
2,2016-01-17,Sunny


In [30]:
houston = pd.DataFrame({'date':['2016-01-04', '2016-01-01', '2016-03-01'], 'ratings':['Rainy', 'Cloudy', 'Sunny']})
houston

Unnamed: 0,date,ratings
0,2016-01-04,Rainy
1,2016-01-01,Cloudy
2,2016-03-01,Sunny


In [31]:
# Perform an ordered merge on austin and houston: tx_weather
tx_weather = pd.merge_ordered(austin, houston)

# Print tx_weather. The rows are sorted by date, but you cannot tell which observation came from which city
tx_weather

# Perform a second ordered merge with on='date' and suffixes=['_aus', '_hus'] so the rows can be distinguished: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus'])

# Print tx_weather_suff to examine its contents
tx_weather_suff

# Perform a third ordered merge, adding fill_method='ffill' to forward-fill NaN entries with the most recent non-null entry: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus'], fill_method='ffill')

# Print tx_weather_ffill
tx_weather_ffill

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-01-04,Rainy
2,2016-01-17,Sunny
3,2016-02-08,Cloudy
4,2016-03-01,Sunny


Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,,Rainy
2,2016-01-17,Sunny,
3,2016-02-08,Cloudy,
4,2016-03-01,,Sunny


Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,Cloudy,Rainy
2,2016-01-17,Sunny,Rainy
3,2016-02-08,Cloudy,Rainy
4,2016-03-01,Cloudy,Sunny
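The dates in austin and houston are stored as strings, which happen to sort correctly because they are in ISO `YYYY-MM-DD` format. A more robust variant, sketched below under the assumption that you want true chronological ordering regardless of string format, converts the column with pd.to_datetime() before calling pd.merge_ordered(). The name `tx` is illustrative.

```python
import pandas as pd

austin = pd.DataFrame({'date': ['2016-01-01', '2016-02-08', '2016-01-17'],
                       'ratings': ['Cloudy', 'Cloudy', 'Sunny']})
houston = pd.DataFrame({'date': ['2016-01-04', '2016-01-01', '2016-03-01'],
                        'ratings': ['Rainy', 'Cloudy', 'Sunny']})

# Convert the string dates to real datetimes so ordering is chronological.
austin['date'] = pd.to_datetime(austin['date'])
houston['date'] = pd.to_datetime(houston['date'])

# Ordered merge on 'date', forward-filling gaps in each city's column.
tx = pd.merge_ordered(austin, houston, on='date',
                      suffixes=('_aus', '_hus'), fill_method='ffill')
print(tx)
```

Note that forward-filling reports the last known observation for a city on days when nothing was recorded there, which may or may not be appropriate for your analysis.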
