# joining tables
a key concept for tidy data is the separation of data into tables, such that only relevant data is stored in each table. this reduces data size and increases readability. 

however, which data are relevant together depends on the use case, so we often need to combine information from two or more tables into a single table. the `join` operations that do this are the topic of today's session.

## today's exercise
consider the new york rodent inspection dataset from last time. we have already discussed how to read in that data, how to clean/manipulate the datetime information in it, and how to use group by and aggregation to calculate things like the monthly number of inspections. 

one might suspect that the weather plays a role in how many rodent inspections occur in a given day/week/month. to explore that hypothesis we must join the data on daily new york rodent inspection counts with data on the daily weather in new york. in the `data` folder there is a data file containing the daily weather summary for new york. read it in, count the number of inspections per day, and join the result with the daily weather: precipitation (`PRCP` and `SNOW`) and temperature (`TMAX`, `TMIN`, and `TAVG`). then calculate the average number of daily inspections for days when the precipitation is nonzero, and the average number of daily inspections for days when the precipitation is zero.

## note on image credits
this notebook contains links to animated gifs which i copied from here: https://github.com/gadenbuie/tidyexplain

# the art of the join
- in the following, when we talk of tables we mean a `pandas` dataframe.
- when we talk about joining tables, order matters. that is, joining table `x` with table `y` is in general not the same as joining `y` to `x`.
- in order to be able to join two tables into a single table, they must have at least one column in common, a column that is the key to the join and lets us know which row of one table to match against the other. 
- the resulting joined table retains some subset of the rows and columns of the two tables. 
- we will be using the `pd.merge()` function:
```
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
```
consider two tables, `x` and `y`, defined as dataframes below (for comparison to this week's exercise, imagine `x` has information on the count of inspections per day, and table `y` has the daily weather information):

In [13]:
import pandas as pd
x = pd.DataFrame([{'a':1, 'b':'x1'}, {'a':2, 'b':'x2'}, {'a':3, 'b':'x3'}], index=[1,2,3])
y = pd.DataFrame([{'a':1, 'c':'y1'}, {'a':2, 'c':'y2'}, {'a':4, 'c':'y4'}], index=[11,12,13])

In [2]:
print(x)

   a   b
1  1  x1
2  2  x2
3  3  x3


In [3]:
print(y)

    a   c
11  1  y1
12  2  y2
13  4  y4


note that `x` and `y` have one column in common: column `a`. each dataframe also has another column unique to itself. 
it is a good idea to pause here and think about what you expect the resulting joined table to look like. in fact, there are several different ways to join these tables, with different results.

let's start with the simplest.

## full outer join
a *full outer join* of `x` and `y` is a table containing all the rows of `x` and all the rows of `y`, matched up on columns in common. wherever one table has a gap (lacks a value of the key), a `NaN` is inserted in its columns. 
![full join](images/full-join.gif)
this is the kind of join we would use when we are concerned about not dropping any of the data. 

here is how we can execute this kind of join:

In [4]:
outer_join = pd.merge(x,y,how='outer')
print(outer_join)

   a    b    c
0  1   x1   y1
1  2   x2   y2
2  3   x3  NaN
3  4  NaN   y4


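note that the original indexes of `x` and `y` are not used: the rows are matched on the common column `a`, and the result gets a fresh index starting at 0.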
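
to see which table each row of an outer join came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column to the result. a minimal sketch, redefining the same toy `x` and `y` so the cell runs on its own:

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2, 3], 'b': ['x1', 'x2', 'x3']}, index=[1, 2, 3])
y = pd.DataFrame({'a': [1, 2, 4], 'c': ['y1', 'y2', 'y4']}, index=[11, 12, 13])

# indicator=True adds a categorical `_merge` column with the values
# 'both', 'left_only' or 'right_only' for every row of the result
outer_flagged = pd.merge(x, y, how='outer', on='a', indicator=True)
print(outer_flagged)

# keys that were found in only one of the two tables
unmatched_keys = outer_flagged.loc[outer_flagged['_merge'] != 'both', 'a'].tolist()
print(unmatched_keys)
```

this is a quick way to check how much of the data would be lost by switching to a stricter join.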

## inner join
an *inner join* of `x` and `y` is a table containing only the rows of `x` and only the rows of `y` that have a match in the key column.
![inner join](images/inner-join.gif)
this is the kind of join we would often use when we want to investigate (model or plot) the relationship between columns `b` and `c` and cannot use rows where either is missing.

In [5]:
inner_join = pd.merge(x,y,how='inner', on='a')
print(inner_join)

   a   b   c
0  1  x1  y1
1  2  x2  y2
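
if a key value occurs more than once in either table, a join repeats the matching rows, which can silently inflate counts. the `validate` argument of `pd.merge()` checks the key relationship we expect. a sketch, where `y_dup` is a hypothetical table made up for illustration:

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2, 3], 'b': ['x1', 'x2', 'x3']})
# hypothetical table in which the key a=1 appears twice
y_dup = pd.DataFrame({'a': [1, 1, 2], 'c': ['y1', 'y1b', 'y2']})

# without a check, the x row with a=1 appears twice in the result
inflated = pd.merge(x, y_dup, how='inner', on='a')
print(len(inflated))

# validate='one_to_one' raises MergeError when the keys are not unique
try:
    pd.merge(x, y_dup, how='inner', on='a', validate='one_to_one')
    key_check_failed = False
except pd.errors.MergeError as err:
    key_check_failed = True
    print('merge rejected:', err)
```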


## left join
a *left join* of `x` and `y` is a table containing all the rows of `x` and only the rows of `y` which have matching values in the key column(s).
![left join](images/left-join.gif)
this is the most frequently used type of join. we use this kind of join when `x` is our main data table that we do not want to drop rows from, but we are augmenting it with some other external data, `y`.

In [6]:
left_join = pd.merge(x,y,how='left', on='a')
print(left_join)

   a   b    c
0  1  x1   y1
1  2  x2   y2
2  3  x3  NaN
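
after a left join it is worth checking how many rows of `x` found no partner in `y`, since those rows carry `NaN` in the columns that came from `y`. a minimal sketch using the same toy tables:

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2, 3], 'b': ['x1', 'x2', 'x3']})
y = pd.DataFrame({'a': [1, 2, 4], 'c': ['y1', 'y2', 'y4']})

left_join = pd.merge(x, y, how='left', on='a')

# a left join never drops rows of x (and adds none when the keys in y are unique)
assert len(left_join) == len(x)

# rows of x without a match in y have NaN in the columns taken from y
n_unmatched = int(left_join['c'].isna().sum())
print(n_unmatched, 'of', len(left_join), 'rows have no match in y')
```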


## right join
a *right join* of `x` and `y` is a table containing all the rows of `y` and only the rows of `x` which have matching values in the key column(s).
![right join](images/right-join.gif)
a right join of `x` and `y` is exactly like a left join of `y` and `x`. what matters is which table is the one whose data we are not willing to drop in case of missing key values.

In [7]:
right_join = pd.merge(x,y,how='right', on='a')
print(right_join)

   a    b   c
0  1   x1  y1
1  2   x2  y2
2  4  NaN  y4
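
the equivalence between a right join and a left join with the tables swapped can be checked directly. only the column order differs, so the sketch below reorders the columns before comparing:

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2, 3], 'b': ['x1', 'x2', 'x3']})
y = pd.DataFrame({'a': [1, 2, 4], 'c': ['y1', 'y2', 'y4']})

right_of_xy = pd.merge(x, y, how='right', on='a')  # columns: a, b, c
left_of_yx = pd.merge(y, x, how='left', on='a')    # columns: a, c, b

# same rows and values once the columns are put in the same order
same_rows = right_of_xy.equals(left_of_yx[right_of_xy.columns])
print(same_rows)

# swapping the tables without changing `how` gives a different result
left_of_xy = pd.merge(x, y, how='left', on='a')
print(left_of_xy['a'].tolist(), left_of_yx['a'].tolist())
```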


## specifying columns
if the key columns are named differently in the two tables, we must name the key column of each table explicitly, using `left_on` and `right_on`.

In [None]:
x = pd.DataFrame([{'x_a':1, 'x_b':'x1'}, {'x_a':2, 'x_b':'x2'}, {'x_a':3, 'x_b':'x3'}], index=[1,2,3])
print(x)
y = pd.DataFrame([{'y_a':1, 'y_c':'y1'}, {'y_a':2, 'y_c':'y2'}, {'y_a':4, 'y_c':'y4'}], index=[1,2,3])
print(y)

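In [None]:
outer_join = pd.merge(x, y, how='outer', left_on='x_a', right_on='y_a')
# rows found only in y have no x_a, so copy the key from y_a before dropping it
outer_join['x_a'] = outer_join['x_a'].fillna(outer_join['y_a'])
outer_join = outer_join.drop(columns=['y_a'])
print(outer_join)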
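
an alternative that sidesteps the duplicated key column is to rename the key in one table first and then join with `on`. a sketch of that approach (the tables are redefined so the cell runs on its own, and `y_renamed` is a name introduced here for illustration):

```python
import pandas as pd

x = pd.DataFrame({'x_a': [1, 2, 3], 'x_b': ['x1', 'x2', 'x3']})
y = pd.DataFrame({'y_a': [1, 2, 4], 'y_c': ['y1', 'y2', 'y4']})

# give the key the same name in both tables, then join on that single column
y_renamed = y.rename(columns={'y_a': 'x_a'})
outer_join = pd.merge(x, y_renamed, how='outer', on='x_a')
print(outer_join)

# with a shared key name every row keeps its key value, including y-only rows
keys = outer_join['x_a'].tolist()
```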


# filtered joins

## left semi join
a *left semi join* of `x` and `y` is a table which retains only the columns of `x` and only the rows where `x` and `y` have matching keys.
![semi join](images/semi-join.gif)
in effect, the rows of `x` get filtered by the intersection of the keys in `x` and `y` (a rarely seen *right semi join* works the same way with the roles swapped). we use a semi join to trim a dataframe based on another dataframe or list.

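In [26]:
# restore the original x and y, which the previous section overwrote
x = pd.DataFrame([{'a':1, 'b':'x1'}, {'a':2, 'b':'x2'}, {'a':3, 'b':'x3'}], index=[1,2,3])
y = pd.DataFrame([{'a':1, 'c':'y1'}, {'a':2, 'c':'y2'}, {'a':4, 'c':'y4'}], index=[11,12,13])
boolean_mask = x['a'].isin(y['a'])
print(boolean_mask)
semi_join = x.loc[boolean_mask]
print(semi_join)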

1     True
2     True
3    False
Name: a, dtype: bool
   a   b
1  1  x1
2  2  x2
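
the same filter works when the keys come from a plain list rather than a second dataframe. a sketch, where `wanted_keys` is a hypothetical list made up for illustration:

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2, 3], 'b': ['x1', 'x2', 'x3']}, index=[1, 2, 3])

# hypothetical list of key values we want to keep
wanted_keys = [2, 3, 5]

# isin() accepts any list-like; x keeps its own columns and index
semi_join = x.loc[x['a'].isin(wanted_keys)]
print(semi_join)
```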


## intersect
when `x` and `y` have some columns in common, an *intersect join* of `x` and `y` is a table with only the common columns, containing only the rows whose values in those columns occur in both `x` and `y`; rows that occur in only one of the tables are dropped.
![intersect join](images/intersect.gif)

In [38]:
x = pd.DataFrame([{'a':1, 'b':'x1'}, {'a':2, 'b':'x2'}, {'a':3, 'b':'x3'}])
y = pd.DataFrame([{'a':1, 'b':'x1', 'c':'y1'}, {'a':2, 'b':'x4', 'c':'y2'}, {'a':4, 'b':'x1', 'c':'y4'}])
print(x)
print(y)

   a   b
0  1  x1
1  2  x2
2  3  x3
   a   b   c
0  1  x1  y1
1  2  x4  y2
2  4  x1  y4


In [39]:
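# match on (a, b) pairs: testing each column separately with isin() would also
# keep rows whose a and b each occur in y, but never in the same row
common = ['a', 'b']
intersect = pd.merge(x, y[common].drop_duplicates(), how='inner', on=common)
print(intersect)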

   a   b
0  1  x1
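
one caveat when matching on several columns: testing each column separately with `isin()` is not the same as matching whole rows, because a row can pass when its `a` and its `b` each occur somewhere in `y` but never together. a sketch with hypothetical tables `x2` and `y2`, made up to show the difference:

```python
import pandas as pd

# hypothetical data: the pair (2, 'x1') occurs in x2 but not in y2
x2 = pd.DataFrame({'a': [1, 2], 'b': ['x1', 'x1']})
y2 = pd.DataFrame({'a': [1, 2], 'b': ['x1', 'x9'], 'c': ['y1', 'y2']})

# column-by-column test: keeps (2, 'x1') because 2 and 'x1' each occur in y2
per_column = x2.loc[x2['a'].isin(y2['a']) & x2['b'].isin(y2['b'])]

# row-wise test: inner join on the de-duplicated common columns
common = ['a', 'b']
row_wise = pd.merge(x2, y2[common].drop_duplicates(), how='inner', on=common)

print(len(per_column), len(row_wise))
```

the `drop_duplicates()` call keeps repeated rows in `y2` from repeating rows of `x2` in the result.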


this gets simpler when the dataframes have the same columns:

In [8]:
x = pd.DataFrame([{'col1':1, 'col2':'a'}, {'col1':1, 'col2':'b'}, {'col1':2, 'col2':'a'}])
y = pd.DataFrame([{'col1':1, 'col2':'a'}, {'col1':2, 'col2':'b'}])
print(x)
print(y)


   col1 col2
0     1    a
1     1    b
2     2    a
   col1 col2
0     1    a
1     2    b


In [9]:
intersect_join = pd.merge(x,y,how='inner')
print(intersect_join)

   col1 col2
0     1    a
