# Merging (joining) two data sets

It is very common to want to bring in some extra data from a second table and add it to our primary table.

- Our primary table was prepared in tidy form and saved to a CSV file in the [SplitExplodeLists](SplitExplodeLists.ipynb) notebook. It has people and the states they've lived in.
- The second table we'll load from the second sheet in an Excel workbook. It contains populations of states.
- **This is the Pandas equivalent of an SQL JOIN command!**

---

*To preserve the mystery, select from the notebook menus*

`Edit -> Clear All Outputs`

---

## Load people/states tidy data as main table


In [1]:
import pandas as pd

In [2]:
ps_tidy = pd.read_csv('data/people_states_tidy.csv')
ps_tidy

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Bobby,Michigan
2,Sue,Wisconsin
3,Sue,Nevada
4,Sue,California
5,Tamika,Florida
6,Tamika,Washington
7,Cale,South Dakota
8,Iris,Washington
9,Iris,Oregon


## Load state level data to join to our main table

- **Note here how we specify the sheet name since the table we want isn't the first sheet in the Excel workbook.** 
- Reading an `.xlsx` file requires the `openpyxl` module, which should be installed by default with the Anaconda Python distribution.
- *Notice that this table also has a "state" variable, which we'll use as the "key" to join the two.*

In [3]:
state_pop = pd.read_excel('data/PeopleStates.xlsx', sheet_name='Sheet2')
state_pop.tail(7)

Unnamed: 0,state,population_2010,house_seats
43,Montana,989417,1
44,Delaware,897936,1
45,South Dakota,814191,1
46,North Dakota,672591,1
47,Alaska,710249,1
48,Vermont,625745,1
49,Wyoming,563767,1


## LEFT JOIN

We'll do a LEFT JOIN by using the `merge()` function. 
[DataFrame.merge documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)

- A LEFT JOIN keeps every row from the left table and drops rows from the right table whose key doesn't appear in the left table's key column.
- Where a left-table key has no match on the right, the right-hand columns are filled with nulls (NaN in Pandas). Our example has a match for every state, so no NaN appears in the result below.
- The default for `merge()` is `how='inner'`, which keeps only rows whose key appears in both the left and right tables.

<img src="images/left_join.png" width=487>

*(Note: For more details on JOIN types, see the [JoinTypes notebook](JoinTypes.ipynb). The JOIN image comes from the [Tableau tutorial](http://www.tableau.com/learn/tutorials/on-demand/join-types-8.2).)*
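To make the difference between a left join and the default inner join concrete, here is a minimal sketch on two tiny tables. The `people` and `pops` DataFrames are hypothetical stand-ins built for this illustration (the "Ana" / "Puerto Rico" row is invented so that one key has no match); only the population figures are taken from the state table above.

```python
import pandas as pd

# Hypothetical miniature tables, invented for illustration only.
people = pd.DataFrame({
    'name': ['Bobby', 'Sue', 'Ana'],
    'state': ['Wyoming', 'Nevada', 'Puerto Rico'],  # 'Puerto Rico' has no match in pops
})
pops = pd.DataFrame({
    'state': ['Wyoming', 'Nevada', 'Vermont'],      # 'Vermont' has no match in people
    'population_2010': [563767, 2700691, 625745],
})

# LEFT JOIN: every row of people survives; unmatched keys get NaN on the right side.
left = people.merge(pops, how='left', on='state')

# Default (inner) join: only keys present in both tables survive.
inner = people.merge(pops, on='state')

print(left)
print(inner)
```

In the left join, Ana's row is kept with a NaN population and Vermont never appears; in the inner join, Ana's row is dropped as well.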

### There are two equivalent variants of the same command:

```
result_df = pd.merge(left_df, right_df, how='left', on='state')
result_df = left_df.merge(right_df, how='left', on='state')
```

- If the variable you want to join on, called the "key" field, has different names in each table, instead of `on=` you can specify `left_on=` and `right_on=`
- If you need to join on a combination of multiple key columns, you just put the column names in a list
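A short sketch of both options, assuming made-up tables: the column name `home_state`, the `year` column, and the `label` values are hypothetical and exist only to show the syntax.

```python
import pandas as pd

# Key column has a different name in each table (hypothetical 'home_state').
people = pd.DataFrame({
    'name': ['Bobby', 'Cale'],
    'home_state': ['Wyoming', 'South Dakota'],
})
pops = pd.DataFrame({
    'state': ['Wyoming', 'South Dakota'],
    'population_2010': [563767, 814191],
})
diff_names = people.merge(pops, how='left',
                          left_on='home_state', right_on='state')
# Note: both key columns ('home_state' and 'state') are kept in the result.

# Joining on a combination of columns: pass a list to on=.
left_multi = pd.DataFrame({
    'state': ['Wyoming', 'Wyoming'],
    'year': [2010, 2020],
    'name': ['Bobby', 'Bobby'],
})
right_multi = pd.DataFrame({
    'state': ['Wyoming', 'Wyoming'],
    'year': [2010, 2020],
    'label': ['a', 'b'],   # placeholder values, not real data
})
multi = left_multi.merge(right_multi, how='left', on=['state', 'year'])

print(diff_names)
print(multi)
```

With `left_on=`/`right_on=` you will usually want to drop one of the two key columns afterwards, since they carry the same information.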

In [4]:
ps_tidy_pop = ps_tidy.merge(state_pop, how='left', on='state')
ps_tidy_pop

Unnamed: 0,name,state,population_2010,house_seats
0,Bobby,Wyoming,563767,1
1,Bobby,Michigan,9884129,14
2,Sue,Wisconsin,5687289,8
3,Sue,Nevada,2700691,4
4,Sue,California,37254503,53
5,Tamika,Florida,18804623,27
6,Tamika,Washington,6724543,10
7,Cale,South Dakota,814191,1
8,Iris,Washington,6724543,10
9,Iris,Oregon,3831073,5


### There is also a less-flexible `.join()` method

- [DataFrame.join documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)
- `.merge()` can join on any columns in the DataFrames
- With `.merge()` you can also match on the DataFrame index(es) instead of columns by setting `left_index=True` and/or `right_index=True`
- **`.join()` is fine, but it has fewer options and always matches against the other DataFrame's index** (on the calling DataFrame it uses the index by default, or a column if you pass `on=`)
- My key columns are very often not in the index, so I just use `.merge()`
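As a rough comparison, the sketch below gets the same result three ways. The small `people` and `pops` tables are hypothetical stand-ins for the notebook's data; because `.join()` matches against the other DataFrame's index, the `state` column of the right-hand table is moved into the index first.

```python
import pandas as pd

# Hypothetical miniature tables for illustration.
people = pd.DataFrame({
    'name': ['Bobby', 'Cale'],
    'state': ['Wyoming', 'South Dakota'],
})
pops = pd.DataFrame({
    'state': ['South Dakota', 'Wyoming'],
    'population_2010': [814191, 563767],
})

# .join() looks up keys in the *index* of the other DataFrame.
pops_by_state = pops.set_index('state')
via_join = people.join(pops_by_state, on='state', how='left')

# The same lookup with .merge(), using right_index=True.
via_merge = people.merge(pops_by_state, how='left',
                         left_on='state', right_index=True)

# Plain column-to-column merge; no index manipulation needed.
via_cols = people.merge(pops, how='left', on='state')

print(via_join)
```

All three produce the same columns and values here; the column-to-column `.merge()` simply skips the `set_index()` step.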

---

## Drop unwanted columns

There are two alternative ways to specify rows or columns to drop:

- The older style, which I don't like as much, is to specify row or column labels, and then the `axis` along which to drop (`0` or `'index'`, `1` or `'columns'`; default `0`)
- **The way I prefer, because it's easier for me to read, is to specify `columns=` or `index=` and then a list of labels I want to drop**
- As with `.sort_values()` and many other methods, `.drop()` returns a new DataFrame and leaves the original unchanged, so you need to remember to either reassign the result to a variable (preferred) or specify `inplace=True`

In [5]:
ps_tidy_final = ps_tidy_pop.drop(columns=['house_seats'])
ps_tidy_final

Unnamed: 0,name,state,population_2010
0,Bobby,Wyoming,563767
1,Bobby,Michigan,9884129
2,Sue,Wisconsin,5687289
3,Sue,Nevada,2700691
4,Sue,California,37254503
5,Tamika,Florida,18804623
6,Tamika,Washington,6724543
7,Cale,South Dakota,814191
8,Iris,Washington,6724543
9,Iris,Oregon,3831073
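For comparison, here is a small sketch of the two drop styles described above, side by side. The two-row `df` is a hypothetical cut-down version of the merged table, used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical cut-down table for illustration.
df = pd.DataFrame({
    'name': ['Bobby', 'Sue'],
    'state': ['Wyoming', 'Nevada'],
    'house_seats': [1, 4],
})

# Older style: labels plus an axis (1 or 'columns' means drop columns).
older = df.drop(['house_seats'], axis=1)

# Keyword style: say directly whether the labels are columns or index entries.
newer = df.drop(columns=['house_seats'])

# The same keyword style works for rows, using index labels.
without_first_row = df.drop(index=[0])

# None of these calls modified df itself; each returned a new DataFrame.
print(newer)
print(df.columns.tolist())
```

Both column-dropping calls return the same result, and `df` still has its `house_seats` column afterwards because nothing was reassigned to `df`.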
