# Merging (joining) two data sets

It is very common to want to bring in some extra data from a second table and add it to our primary table.

- Our primary table was prepared in tidy form and saved to a CSV file in the [SplitExplodeLists](SplitExplodeLists.ipynb) notebook. It has people and the states they've lived in.
- The second table we'll load from the second sheet in an Excel workbook. It contains populations of states.

### This is the Pandas equivalent of an SQL JOIN command!

## Load tidy data as main table


In [1]:
import pandas as pd

In [8]:
ps_tidy = pd.read_csv('data/people_states_tidy.csv')
ps_tidy

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Bobby,Michigan
2,Sue,Wisconsin
3,Sue,Nevada
4,Sue,California
5,Tamika,Florida
6,Tamika,Washington
7,Cale,South Dakota
8,Iris,Washington
9,Iris,Oregon


## Load state level data to join to our main table

*Note here how we specify the sheet name since the table we want isn't the first sheet in the Excel workbook.*

In [12]:
state_pop = pd.read_excel('./data/PeopleStates.xlsx', sheet_name='Sheet2')
state_pop.tail(7)

Unnamed: 0,state,population_2010,house_seats
43,Montana,989417,1
44,Delaware,897936,1
45,South Dakota,814191,1
46,North Dakota,672591,1
47,Alaska,710249,1
48,Vermont,625745,1
49,Wyoming,563767,1


## LEFT JOIN

We'll do a LEFT JOIN using the `merge()` method. We need to say which DataFrame is on the "left" and which is on the "right" for the JOIN: the DataFrame we call `merge()` on (`ps_tidy`) is the left table, and the DataFrame we pass as the first argument (`state_pop`) is the right table.

*(Note: For more details on JOIN types, see the [JoinTypes notebook](JoinTypes.ipynb).)*

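We also need to specify which column contains the ID fields / keys to join on. Both tables here call that column `state`, so we can pass it once with the `on` argument. When the key columns have different names in the two tables, use the `left_on` and `right_on` arguments instead.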

In [4]:
ps_tidy_pop = ps_tidy.merge(state_pop, how='left', on='state')
ps_tidy_pop

Unnamed: 0,name,state,population_2010,house_seats
0,Bobby,Wyoming,563767,1
1,Bobby,Michigan,9884129,14
2,Sue,Wisconsin,5687289,8
3,Sue,Nevada,2700691,4
4,Sue,California,37254503,53
5,Tamika,Florida,18804623,27
6,Tamika,Washington,6724543,10
7,Cale,South Dakota,814191,1
8,Iris,Washington,6724543,10
9,Iris,Oregon,3831073,5
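
As a minimal sketch of the `left_on` / `right_on` case, the example below builds two small hypothetical tables (`people` and `pops`, not part of this notebook's data files) whose key columns have different names. The `state_name` column and the `Puerto Rico` row are assumptions made up for illustration. It also passes `indicator=True`, which adds a `_merge` column showing which rows found a match on the right.

```python
import pandas as pd

# Hypothetical miniature tables; only the Wyoming and Nevada figures
# come from the state table shown above.
people = pd.DataFrame({
    'name': ['Bobby', 'Sue', 'Cale'],
    'state': ['Wyoming', 'Nevada', 'Puerto Rico'],
})
# Here the key column is called 'state_name' rather than 'state'.
pops = pd.DataFrame({
    'state_name': ['Wyoming', 'Nevada'],
    'population_2010': [563767, 2700691],
})

merged = people.merge(
    pops,
    how='left',             # keep every row of the left table
    left_on='state',        # key column in the left table
    right_on='state_name',  # key column in the right table
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Rows from the left table that found no match on the right
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['name', 'state']])
```

In a LEFT JOIN, unmatched rows are kept and their right-hand columns are filled with `NaN`, so a check like this is a quick way to spot misspelled or missing keys.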


## Sort by values

Then, we'll sort the rows "descending" and "in place" by the population column.

In [5]:
ps_tidy_pop.sort_values(by=['population_2010'], ascending=False, inplace=True)
ps_tidy_pop

Unnamed: 0,name,state,population_2010,house_seats
4,Sue,California,37254503,53
10,Iris,California,37254503,53
5,Tamika,Florida,18804623,27
1,Bobby,Michigan,9884129,14
6,Tamika,Washington,6724543,10
8,Iris,Washington,6724543,10
2,Sue,Wisconsin,5687289,8
9,Iris,Oregon,3831073,5
3,Sue,Nevada,2700691,4
7,Cale,South Dakota,814191,1
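
An alternative to `inplace=True` is to keep the sorted result in a new variable, which leaves the original DataFrame untouched. The sketch below uses a small hypothetical DataFrame `df` (the population figures are taken from the table above) and also calls `reset_index(drop=True)` so the row labels run 0, 1, 2 in the new order.

```python
import pandas as pd

# Hypothetical three-row table for illustration
df = pd.DataFrame({
    'state': ['Wyoming', 'California', 'Oregon'],
    'population_2010': [563767, 37254503, 3831073],
})

# sort_values returns a new DataFrame when inplace is not set
sorted_df = (
    df.sort_values(by='population_2010', ascending=False)
      .reset_index(drop=True)  # discard the old row labels
)

print(sorted_df)
```

Whether to sort in place or assign the result is largely a matter of taste; assigning makes it easier to re-run notebook cells without changing earlier variables.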


---

# Saving table out to a CSV file

We could also save to an Excel file, but that would require installing another module,
so we'll save as a CSV file for now, which is a very useful format.

- It's good practice to specify the `encoding`, which is the method used for recording characters beyond the 128-character ASCII set.
- In this case we also don't need to save the `index` column to the file, so we'll turn that option off.

In [6]:
ps_tidy_pop.to_csv('./data/PeopleStates_Merged.csv', encoding='utf-8', index=False)
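
To see what the `index` option changes, here is a small round-trip sketch. It writes a hypothetical two-row DataFrame to a temporary folder (so it does not touch the `data` directory) and reads it back both ways. The file names are made up for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table for illustration
df = pd.DataFrame({'name': ['Bobby', 'Sue'], 'state': ['Wyoming', 'Nevada']})

with tempfile.TemporaryDirectory() as tmp_dir:
    no_index_path = os.path.join(tmp_dir, 'no_index.csv')
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')

    # index=False: only the real columns are written
    df.to_csv(no_index_path, encoding='utf-8', index=False)
    # index=True (the default): the row labels are written as an extra column
    df.to_csv(with_index_path, encoding='utf-8', index=True)

    round_trip = pd.read_csv(no_index_path, encoding='utf-8')
    round_trip_with_index = pd.read_csv(with_index_path, encoding='utf-8')

print(round_trip.columns.tolist())
print(round_trip_with_index.columns.tolist())
```

When the index is saved, `read_csv` has no header for that first column and names it `Unnamed: 0`, which is usually just clutter for a plain 0, 1, 2 index.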

## Save to JSON

Another option is to save as a JSON file. There are multiple "orientations":
[to_json docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html)

The `records` orientation writes a list of rows, each one an object/dictionary.


In [7]:
ps_tidy_pop.to_json('./data/PeopleStates_Merged.json', orient='records')
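
To compare orientations without writing files, `to_json()` can be called with no path, in which case it returns the JSON as a string. The sketch below uses a hypothetical two-row DataFrame and parses the result with the standard `json` module to show the structure of `records` next to `columns` (the default for a DataFrame).

```python
import json

import pandas as pd

# Hypothetical table for illustration
df = pd.DataFrame({'name': ['Bobby', 'Sue'], 'house_seats': [1, 4]})

# With no path argument, to_json returns a JSON string
records = json.loads(df.to_json(orient='records'))
columns = json.loads(df.to_json(orient='columns'))

print(records)  # a list with one dictionary per row
print(columns)  # a dictionary per column, keyed by row label
```

The `records` form drops the row labels entirely, while `columns` keeps them as string keys, so `records` tends to be the more convenient choice when the index carries no meaning.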