## Joins

For the full discussion and instructions, see the [README for this episode](README.md). 

Before looking at join examples, we need to do our standard import and data loading:


In [9]:
import pandas as pd

data_dir = '../../../data/input/ch1/'
airports_file = data_dir + 'airports.csv'
routes_file = data_dir + 'routes.csv'

# read data from csv
airports = pd.read_csv(airports_file, header=0)
routes = pd.read_csv(routes_file, header=0)


### Merge()

Pandas provides a `merge()` function, which is the most flexible way to perform join operations on multiple datasets. First, let's take a look at an *inner join*, which returns only the rows whose join key has a match in both the left and right `DataFrame`s:


In [11]:
# filter for only NY airports
ny_airports = airports[airports.state == 'NY']
print(ny_airports)
print("Just New York airports:\n", ny_airports.shape)

# drop duplicate routes
src_dest = routes[['src', 'dest']].drop_duplicates(ignore_index=True)
print(src_dest)
print("All possible src and dest pairs:\n", src_dest.shape)

# inner join on the resulting DataFrames using merge()
ny_routes_inner = pd.merge(ny_airports, src_dest, left_on='iata', right_on='src')
print(ny_routes_inner.head(5)) 
print(ny_routes_inner.shape)


     iata                 airport            city state country        lat  \
3     01G            Perry-Warsaw           Perry    NY     USA  42.741347   
19    06N                Randall       Middletown    NY     USA  41.431566   
42    0B8              Elizabeth   Fishers Island    NY     USA  41.251308   
54    0G0  North Buffalo Suburban        Lockport    NY     USA  43.103184   
57    0G7   Finger Lakes Regional    Seneca Falls    NY     USA  42.880623   
...   ...                     ...             ...   ...     ...        ...   
2901  SCH         Schenectady Cty     Schenectady    NY     USA  42.852456   
2972  SLK              Adirondack    Saranac Lake    NY     USA  44.385310   
3039  SWF                 Stewart        Newburgh    NY     USA  41.504094   
3048  SYR   Syracuse-Hancock Intl        Syracuse    NY     USA  43.111187   
3193  UCA              Oneida Cty           Utica    NY     USA  43.145119   

            lon  
3    -78.052081  
19   -74.391917  
42   -72.

There are no null values, illustrating the fact that an inner join returns rows with matches in both DataFrames.
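
To see this on a scale small enough to check by eye, here is a minimal sketch using two made-up frames. The names `toy_airports` and `toy_routes` and their contents are assumptions for illustration only; they are not taken from `airports.csv` or `routes.csv`.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
toy_airports = pd.DataFrame({
    'iata': ['AAA', 'BBB', 'CCC'],
    'city': ['Alpha', 'Beta', 'Gamma'],
})
toy_routes = pd.DataFrame({
    'src': ['AAA', 'AAA', 'CCC', 'ZZZ'],
    'dest': ['BBB', 'CCC', 'AAA', 'AAA'],
})

# how='inner' is the default for pd.merge()
toy_inner = pd.merge(toy_airports, toy_routes, left_on='iata', right_on='src')
print(toy_inner)

# BBB (no outgoing route) and ZZZ (not in toy_airports) are both dropped,
# so the result contains no null cells.
print(toy_inner.isna().sum().sum())  # prints 0
```

Note that `AAA` appears twice in the result: an inner join emits one row per matching pair, so a key that matches several rows on the other side is repeated.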

Now let's look at an outer join using the same input DataFrames:

In [14]:
# outer join:
ny_routes_outer = pd.merge(ny_airports, src_dest, how='outer', left_on='iata', right_on='src')
print(ny_routes_outer.head(5)) 
print(ny_routes_outer[30:35])
print(ny_routes_outer.shape)

  iata                 airport            city state country        lat  \
0  01G            Perry-Warsaw           Perry    NY     USA  42.741347   
1  06N                Randall       Middletown    NY     USA  41.431566   
2  0B8              Elizabeth   Fishers Island    NY     USA  41.251308   
3  0G0  North Buffalo Suburban        Lockport    NY     USA  43.103184   
4  0G7   Finger Lakes Regional    Seneca Falls    NY     USA  42.880623   

         lon  src dest  
0 -78.052081  NaN  NaN  
1 -74.391917  NaN  NaN  
2 -72.031611  NaN  NaN  
3 -78.703346  NaN  NaN  
4 -76.781620  NaN  NaN  
   iata           airport     city state country        lat        lon  src  \
30  9G0  Buffalo Airfield  Buffalo    NY     USA  42.862003 -78.716585  NaN   
31  9G3             Akron    Akron    NY     USA  43.021087 -78.482968  NaN   
32  9G5          Royalton  Gasport    NY     USA  43.182002 -78.557805  NaN   
33  ALB        Albany Cty   Albany    NY     USA  42.748119 -73.802979  ALB   
34  

Notice that the outer join returns many more rows than the inner join. An outer join is much less restrictive: it keeps every row from both the left and right DataFrames and fills the columns from the other side with `NaN` wherever there is no match.
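
One way to see where each row of an outer join came from is the `indicator` argument of `merge()`, which adds a `_merge` column. The sketch below uses made-up frames (`toy_airports` and `toy_routes` are hypothetical names and data, not the tutorial's files).

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
toy_airports = pd.DataFrame({
    'iata': ['AAA', 'BBB', 'CCC'],
    'city': ['Alpha', 'Beta', 'Gamma'],
})
toy_routes = pd.DataFrame({
    'src': ['AAA', 'AAA', 'CCC', 'ZZZ'],
    'dest': ['BBB', 'CCC', 'AAA', 'AAA'],
})

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only' or 'right_only' for every output row.
toy_outer = pd.merge(toy_airports, toy_routes, how='outer',
                     left_on='iata', right_on='src', indicator=True)
print(toy_outer)

# Count how many rows matched on both sides and how many came from one side only.
print(toy_outer['_merge'].value_counts())
```

Rows marked `left_only` have `NaN` in the route columns, and rows marked `right_only` have `NaN` in the airport columns.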

Now let's look at a left outer join, which returns every row of the left DataFrame, whether or not it has a match in the right DataFrame:

In [17]:
# left outer join:
ny_routes_left_outer = pd.merge(ny_airports, src_dest, how='left', left_on='iata', right_on='src')
print(ny_routes_left_outer.head()) 
print(ny_routes_left_outer.shape)

  iata                 airport            city state country        lat  \
0  01G            Perry-Warsaw           Perry    NY     USA  42.741347   
1  06N                Randall       Middletown    NY     USA  41.431566   
2  0B8              Elizabeth   Fishers Island    NY     USA  41.251308   
3  0G0  North Buffalo Suburban        Lockport    NY     USA  43.103184   
4  0G7   Finger Lakes Regional    Seneca Falls    NY     USA  42.880623   

         lon  src dest  
0 -78.052081  NaN  NaN  
1 -74.391917  NaN  NaN  
2 -72.031611  NaN  NaN  
3 -78.703346  NaN  NaN  
4 -76.781620  NaN  NaN  
(437, 9)


The left outer join returns more rows than the inner join, but far fewer than the outer join.

Finally, let's look at a right outer join:

In [18]:
# right join:
ny_routes_right_join = pd.merge(ny_airports, src_dest, how='right', left_on='iata', right_on='src')
print(ny_routes_right_join.head()) 
print(ny_routes_right_join.shape)

  iata airport city state country  lat  lon  src dest
0  NaN     NaN  NaN   NaN     NaN  NaN  NaN  ASF  KZN
1  NaN     NaN  NaN   NaN     NaN  NaN  NaN  ASF  MRV
2  NaN     NaN  NaN   NaN     NaN  NaN  NaN  CEK  KZN
3  NaN     NaN  NaN   NaN     NaN  NaN  NaN  CEK  OVB
4  NaN     NaN  NaN   NaN     NaN  NaN  NaN  DME  KZN
(37594, 9)


The right outer join returns 37,594 rows. Why does it return so many more rows than the left outer join?
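
A small experiment can help with this question. The sketch below, built on made-up data (the frames `toy_airports` and `toy_routes` are hypothetical and deliberately lopsided), counts the rows returned for each value of `how` when the right frame is larger than the left and mostly contains keys the left frame lacks.

```python
import pandas as pd

# Hypothetical toy data: a short left frame and a longer right frame
# whose 'src' values mostly do not appear in the left frame.
toy_airports = pd.DataFrame({'iata': ['AAA', 'BBB'], 'city': ['Alpha', 'Beta']})
toy_routes = pd.DataFrame({
    'src': ['AAA', 'XXX', 'YYY', 'ZZZ'],
    'dest': ['BBB', 'AAA', 'AAA', 'BBB'],
})

# Run the same merge with each join type and record the row count.
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    joined = pd.merge(toy_airports, toy_routes, how=how,
                      left_on='iata', right_on='src')
    row_counts[how] = len(joined)

print(row_counts)  # {'inner': 1, 'left': 2, 'right': 4, 'outer': 5}
```

In this sketch the right join keeps all four rows of `toy_routes`, matched or not, which suggests comparing `src_dest.shape` with `ny_airports.shape` in the real data.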

### Join with Pandas join() method
A `DataFrame` also has a built-in `join()` method. This method uses `merge()` internally, but can be a more direct way to join other data onto a specific `DataFrame`. `join()` always matches against the index of the other `DataFrame`, so it is less flexible, but it behaves much like `merge()` once the join key of the other `DataFrame` has been set as its index:

In [16]:
left_join = ny_airports.join(src_dest.set_index('src'), how='left', on='iata')
print(left_join.head()) 
print(left_join.shape)

inner_join = ny_airports.join(src_dest.set_index('src'), how='inner', on='iata')
print(inner_join.head()) 
print(inner_join.shape)

   iata                 airport            city state country        lat  \
3   01G            Perry-Warsaw           Perry    NY     USA  42.741347   
19  06N                Randall       Middletown    NY     USA  41.431566   
42  0B8              Elizabeth   Fishers Island    NY     USA  41.251308   
54  0G0  North Buffalo Suburban        Lockport    NY     USA  43.103184   
57  0G7   Finger Lakes Regional    Seneca Falls    NY     USA  42.880623   

          lon dest  
3  -78.052081  NaN  
19 -74.391917  NaN  
42 -72.031611  NaN  
54 -78.703346  NaN  
57 -76.781620  NaN  
(437, 8)
    iata     airport    city state country        lat        lon dest
825  ALB  Albany Cty  Albany    NY     USA  42.748119 -73.802979  BOS
825  ALB  Albany Cty  Albany    NY     USA  42.748119 -73.802979  MSS
825  ALB  Albany Cty  Albany    NY     USA  42.748119 -73.802979  OGS
825  ALB  Albany Cty  Albany    NY     USA  42.748119 -73.802979  CLT
825  ALB  Albany Cty  Albany    NY     USA  42.748119 -73.
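
The relationship between `join()` and `merge()` can be checked directly on small data. This is a minimal sketch with made-up frames (`toy_airports` and `toy_routes` are hypothetical); it compares a left `join()` against the corresponding left `merge()`.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
toy_airports = pd.DataFrame({
    'iata': ['AAA', 'BBB', 'CCC'],
    'city': ['Alpha', 'Beta', 'Gamma'],
})
toy_routes = pd.DataFrame({
    'src': ['AAA', 'AAA', 'CCC', 'ZZZ'],
    'dest': ['BBB', 'CCC', 'AAA', 'AAA'],
})

# join(): the key of the other frame must be its index.
via_join = toy_airports.join(toy_routes.set_index('src'), on='iata', how='left')

# merge(): the key columns are named explicitly on both sides.
via_merge = pd.merge(toy_airports, toy_routes, how='left',
                     left_on='iata', right_on='src')

# 'src' was consumed as the index in the join() version, so only merge() keeps it.
print(via_join.columns.tolist())   # ['iata', 'city', 'dest']
print(via_merge.columns.tolist())  # ['iata', 'city', 'src', 'dest']

# join() repeats the left index labels, while merge() renumbers the rows,
# so align the two before comparing.
same = via_join.reset_index(drop=True).equals(via_merge.drop(columns='src'))
print(same)
```

The repeated index labels in `via_join` are the same effect as the repeated `825` in the output above.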

### Concat()

The last way to combine `DataFrames` we will discuss is the `pd.concat()` function. Unlike the joins we have been discussing, `concat()` does not attempt to match common data within columns. Instead, it can be thought of as simply stacking the `DataFrames` on top of or next to each other. The first argument of `concat()` is a list of objects to concatenate, usually `DataFrames`.
We then specify which way to concatenate them with the `axis` argument. With `axis=0`, the `DataFrames` are concatenated vertically (stacked on top of each other). All column names are retained: columns that share a name are lined up, and a column missing from one of the `DataFrames` is filled with `NaN` for its rows. Similarly, `axis=1` is a horizontal concatenation; the `DataFrames` are placed next to each other, aligned by index.
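
Before running this on the airport data, the two directions can be sketched on tiny made-up frames. The names `top` and `bottom` and their contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the frames share 'iata' but differ in their other column.
top = pd.DataFrame({'iata': ['AAA', 'BBB'], 'city': ['Alpha', 'Beta']})
bottom = pd.DataFrame({'iata': ['CCC', 'DDD'], 'dest': ['AAA', 'BBB']})

# axis=0: rows are stacked; the result has the union of the column names,
# with NaN where a frame lacks a column.
stacked = pd.concat([top, bottom], axis=0)
print(stacked)
print(stacked.index.tolist())  # [0, 1, 0, 1] (original index labels are kept)

# axis=1: columns are placed side by side, with rows aligned on the index.
side_by_side = pd.concat([top, bottom], axis=1)
print(side_by_side)
print(side_by_side.columns.tolist())  # ['iata', 'city', 'iata', 'dest']
```

Neither direction deduplicates anything: the vertical result repeats index labels and the horizontal result repeats the `iata` column name, which is worth keeping in mind when selecting from the output.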

In [7]:
# vertical concatenate
first_airports = airports.head()
first_routes = routes.head().copy()
second_airports = airports.iloc[5:10]
first_routes.rename(columns={'src': 'iata'}, inplace=True)
print("Vertical concat:\n", pd.concat([first_airports, first_routes, second_airports], axis=0))
print("Horizontal concat:\n", pd.concat([first_airports, first_routes, second_airports], axis=1))

Vertical concat:
   iata               airport              city state country        lat  \
0  00M              Thigpen        Bay Springs    MS     USA  31.953765   
1  00R  Livingston Municipal        Livingston    TX     USA  30.685861   
2  00V           Meadow Lake  Colorado Springs    CO     USA  38.945749   
3  01G          Perry-Warsaw             Perry    NY     USA  42.741347   
4  01J      Hilliard Airpark          Hilliard    FL     USA  30.688012   
0  ASF                   NaN               NaN   NaN     NaN        NaN   
1  ASF                   NaN               NaN   NaN     NaN        NaN   
2  CEK                   NaN               NaN   NaN     NaN        NaN   
3  CEK                   NaN               NaN   NaN     NaN        NaN   
4  DME                   NaN               NaN   NaN     NaN        NaN   
5  01M     Tishomingo County           Belmont    MS     USA  34.491667   
6  02A           Gragg-Wade            Clanton    AL     USA  32.850487   
7  02C 