<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Combining DataFrames
_**Author**: Boom D. (DSI-NYC), Mahdi S. (DSI-NYC)_
***

__First, we'll cover a _simplification_ of the two most common Pandas methods for combining dataframes.__

## Import packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # for Pandas plotting

## Loading data

_Note: I've drastically modified and simplified the data from its original source, the [Central Park Squirrel Dataset](https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw)_

In [2]:
age      = pd.read_csv("./datasets/squirrel_age.csv")
color    = pd.read_csv("./datasets/squirrel_color.csv")
location = pd.read_csv("./datasets/squirrel_location.csv")

In [3]:
age # notice number of observations

Unnamed: 0,Unique Squirrel ID,Age
0,8A-AM-1013-06,Juvenile
1,7H-PM-1006-07,Adult
2,3G-PM-1013-03,Adult
3,22F-AM-1007-07,Juvenile
4,20A-PM-1017-01,Adult


In [4]:
color # notice number of observations

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,20A-PM-1017-01,Gray
1,22F-AM-1007-07,Cinnamon
2,7H-PM-1006-07,Gray
3,3G-PM-1013-03,Gray


In [5]:
location # notice number of observations

Unnamed: 0,unique_squirrel_id,lat,long
0,3G-PM-1013-03,-73.974437,40.767428
1,7H-PM-1006-07,-73.970026,40.769934
2,31F-AM-1013-01,-73.959687,40.789379
3,8A-AM-1013-06,-73.97731,40.773805
4,22F-AM-1007-07,-73.96466,40.78277
5,20A-PM-1017-01,-73.970069,40.782889


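<span style="color:blue"><b>Task</b></span>: Get the shape of these dataframes --> notice that both the number of rows and the number of columns differ.

Notice also that the squirrel ID column is named differently: `Unique Squirrel ID` in `age` and `color`, but `unique_squirrel_id` in `location`.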

In [6]:
print(f'age_dataframe shape: {age.shape}')
print(f'color_dataframe shape: {color.shape}')
print(f'location_dataframe shape: {location.shape}')

age_dataframe shape: (5, 2)
color_dataframe shape: (4, 2)
location_dataframe shape: (6, 3)


## `.merge()`

When we use `.merge()`:
- Only merges 2 dataframes
- We MUST merge on a common column - this is information that is shared by both dataframes.
- Remember, if you're familiar with spreadsheets, this is the Python equivalent of `vlookup`
---
__What is the common column in the `age` and `color` dataframes?__
- Look for the `primary_key`

<span style="color:blue"><b>Task</b></span>: Use `merge()` to combine `age` and `color`. Share what you infer from the output.

In [7]:
pd.merge(left = age,
         right = color,
         on = "Unique Squirrel ID")

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color
0,7H-PM-1006-07,Adult,Gray
1,3G-PM-1013-03,Adult,Gray
2,22F-AM-1007-07,Juvenile,Cinnamon
3,20A-PM-1017-01,Adult,Gray


__Are we missing an observation?__
- Note: we did an inner join (the default for `how`) --> only IDs present in both dataframes are merged
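To see exactly which observation went missing, one option is an outer join with `indicator=True`. The sketch below rebuilds small stand-ins for `age` and `color` inline (so it runs without the CSV files); the names `audit` and `dropped_by_inner` are only illustrative.

```python
import pandas as pd

# Inline stand-ins for the age and color dataframes shown above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})

# An outer join keeps every ID; indicator=True adds a "_merge" column
# recording whether each row came from the left side, the right side, or both
audit = pd.merge(left=age, right=color, how="outer",
                 on="Unique Squirrel ID", indicator=True)

# Rows found only in age are the ones an inner join drops
dropped_by_inner = audit.loc[audit["_merge"] == "left_only",
                             "Unique Squirrel ID"].tolist()
print(dropped_by_inner)  # ['8A-AM-1013-06']
```

Squirrel `8A-AM-1013-06` has an age but no fur color record, so the inner join leaves it out.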

In [8]:
# Alternative syntax that does the same thing
age.merge(color, on = "Unique Squirrel ID") # gives flexibility to merge on multiple columns in a list
# syntax would just revise to: df1.merge(df2, on=['col1', 'col2'])

Unnamed: 0,Unique Squirrel ID,Age,Primary Fur Color
0,7H-PM-1006-07,Adult,Gray
1,3G-PM-1013-03,Adult,Gray
2,22F-AM-1007-07,Juvenile,Cinnamon
3,20A-PM-1017-01,Adult,Gray
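As a rough illustration of merging on several columns at once, here is a minimal sketch. The `sightings` and `weather` dataframes and all of their column names are hypothetical and do not come from the squirrel files.

```python
import pandas as pd

# Hypothetical data, invented only to illustrate a multi-column key
sightings = pd.DataFrame({
    "hectare": ["8A", "8A", "7H"],
    "shift": ["AM", "PM", "PM"],
    "squirrels_seen": [3, 5, 2],
})
weather = pd.DataFrame({
    "hectare": ["8A", "8A", "7H"],
    "shift": ["AM", "PM", "AM"],
    "temperature_f": [61, 68, 59],
})

# A row matches only when BOTH hectare and shift agree
combined = sightings.merge(weather, on=["hectare", "shift"])
print(combined)
```

Only the two `8A` rows survive, because `7H` was recorded in different shifts in the two dataframes.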


### What if we reverse the input order? What changes?
- Note: as in SQL joins, the left table is treated as *primary*: its columns come first, and columns from the *secondary* (right) table are added **after** them

In [9]:
pd.merge(left = color,
         right = age,
         on = "Unique Squirrel ID") 

Unnamed: 0,Unique Squirrel ID,Primary Fur Color,Age
0,20A-PM-1017-01,Gray,Adult
1,22F-AM-1007-07,Cinnamon,Juvenile
2,7H-PM-1006-07,Gray,Adult
3,3G-PM-1013-03,Gray,Adult


We still get the same rows, but the order of the rows and columns follows the primary (`left`) dataframe, which we defined as `color`.

### What if I don't want the _intersection_, but instead want to keep everything from the right table (i.e. `age`, the bigger one)?
- Concept understanding: [SQL right join](https://stackoverflow.com/questions/13997365/sql-joins-as-venn-diagram)

In [10]:
pd.merge(left = color,
         right = age,
         how = "right",
         on = "Unique Squirrel ID")

Unnamed: 0,Unique Squirrel ID,Primary Fur Color,Age
0,8A-AM-1013-06,,Juvenile
1,7H-PM-1006-07,Gray,Adult
2,3G-PM-1013-03,Gray,Adult
3,22F-AM-1007-07,Cinnamon,Juvenile
4,20A-PM-1017-01,Gray,Adult


Using `how="right"`, what's changed?
- Hint: recap the shapes of these dataframes
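A quick way to compare the join types is to count the rows each one returns. This is a small sketch using inline stand-ins for `age` and `color`; `row_counts` is just an illustrative name.

```python
import pandas as pd

# Inline stand-ins for the age (5 rows) and color (4 rows) dataframes
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})

# Run the same merge with each join type and record the number of rows
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left=color, right=age, how=how, on="Unique Squirrel ID")
    row_counts[how] = merged.shape[0]

print(row_counts)  # {'inner': 4, 'left': 4, 'right': 5, 'outer': 5}
```

With `color` on the left, `inner` and `left` return its 4 rows, while `right` and `outer` return all 5 rows of `age`, with `NaN` where no fur color exists.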

### What if I have a dataframe with a _different_ name for the column I wish to join "on"?

In [12]:
age

Unnamed: 0,Unique Squirrel ID,Age
0,8A-AM-1013-06,Juvenile
1,7H-PM-1006-07,Adult
2,3G-PM-1013-03,Adult
3,22F-AM-1007-07,Juvenile
4,20A-PM-1017-01,Adult


In [13]:
location

Unnamed: 0,unique_squirrel_id,lat,long
0,3G-PM-1013-03,-73.974437,40.767428
1,7H-PM-1006-07,-73.970026,40.769934
2,31F-AM-1013-01,-73.959687,40.789379
3,8A-AM-1013-06,-73.97731,40.773805
4,22F-AM-1007-07,-73.96466,40.78277
5,20A-PM-1017-01,-73.970069,40.782889


In [14]:
# This breaks...
pd.merge(left = age,
         right = location,
         on = "Unique Squirrel ID")

KeyError: 'Unique Squirrel ID'

#### The ID column isn't named 'Unique Squirrel ID' in the `location` dataframe. Only the column names differ; the IDs themselves still match those in the other dataframes.

In [14]:
# This WORKS!
pd.merge(left = age,
         right = location,
         left_on = "Unique Squirrel ID",
         right_on = "unique_squirrel_id")

Unnamed: 0,Unique Squirrel ID,Age,unique_squirrel_id,lat,long
0,8A-AM-1013-06,Juvenile,8A-AM-1013-06,-73.97731,40.773805
1,7H-PM-1006-07,Adult,7H-PM-1006-07,-73.970026,40.769934
2,3G-PM-1013-03,Adult,3G-PM-1013-03,-73.974437,40.767428
3,22F-AM-1007-07,Juvenile,22F-AM-1007-07,-73.96466,40.78277
4,20A-PM-1017-01,Adult,20A-PM-1017-01,-73.970069,40.782889


We see some redundancy: because the `primary key` column has a different name in each dataframe, both ID columns are kept. This is working as expected...
- Code may break if it expects incoming dataframes to have the column "unique_squirrel_id" in some places and "Unique Squirrel ID" in others
- Consider using `.rename()` to change the column name in the `location` dataframe to match the rest or, vice versa, change the column name in the `age` and `color` dataframes (since it doesn't follow best-practice naming, given its spaces and capitalization). Then we only need to merge `on = 'unique_squirrel_id'`
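The renaming approach could look like the sketch below. It uses inline stand-ins for `age` and `location`; renaming `Age` to `age` is an extra, optional choice for consistency, and `age_renamed` is an illustrative name.

```python
import pandas as pd

# Inline stand-ins for the age and location dataframes shown above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# rename returns a new dataframe; the original age is left untouched
age_renamed = age.rename(columns={"Unique Squirrel ID": "unique_squirrel_id",
                                  "Age": "age"})

# With matching names, a single `on` is enough and the key appears only once
merged = pd.merge(left=age_renamed, right=location, on="unique_squirrel_id")
print(merged.columns.tolist())  # ['unique_squirrel_id', 'age', 'lat', 'long']
```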

## `.concat()`

Recommended read on [key differences between concat and merge for concept understanding](https://towardsdatascience.com/3-key-differences-between-merge-and-concat-functions-of-pandas-ab2bab224b59)

#### Concatenating by columns _(not recommended)_
Wondering why? Look at the primary key columns after `concat()`. Compare it with the output from `merge()` above

In [15]:
# axis = 1 --> concatenate by column, axis = 0 --> concatenate by row
pd.concat(objs = [age, location], axis = 1)

Unnamed: 0,Unique Squirrel ID,Age,unique_squirrel_id,lat,long
0,8A-AM-1013-06,Juvenile,3G-PM-1013-03,-73.974437,40.767428
1,7H-PM-1006-07,Adult,7H-PM-1006-07,-73.970026,40.769934
2,3G-PM-1013-03,Adult,31F-AM-1013-01,-73.959687,40.789379
3,22F-AM-1007-07,Juvenile,8A-AM-1013-06,-73.97731,40.773805
4,20A-PM-1017-01,Adult,22F-AM-1007-07,-73.96466,40.78277
5,,,20A-PM-1017-01,-73.970069,40.782889


Notice how we can concatenate two dataframes without the same number of rows, but...
- The cells with no counterpart are filled with `NaN` values, because the concatenated dataframes have different shapes
- `concat` simply pasted the 2 dataframes side by side, lining rows up by their index labels (0, 1, 2, ...) **without** ensuring a one-to-one mapping on the ID (primary key) column
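Because `pd.concat(axis=1)` aligns rows on index labels, one possible workaround (a sketch, not necessarily preferable to `merge()`) is to move the ID into the index first. The dataframes below are inline stand-ins for `age` and `location`.

```python
import pandas as pd

# Inline stand-ins for the age and location dataframes shown above
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# With the ID as the index, concat lines rows up by squirrel, not by position
aligned = pd.concat(
    objs=[age.set_index("Unique Squirrel ID"),
          location.set_index("unique_squirrel_id")],
    axis=1,
)

# Each squirrel now sits next to its own coordinates
print(aligned.loc["8A-AM-1013-06"])
```

The squirrel that appears only in `location` (`31F-AM-1013-01`) still gets a row, with `NaN` for `Age`.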

### Can we `.concat()` more than 2 dataframes?

In [15]:
# let's recap the shapes of the 3 dataframes again
print(f'age_dataframe shape: {age.shape}')
print(f'color_dataframe shape: {color.shape}')
print(f'location_dataframe shape: {location.shape}')

age_dataframe shape: (5, 2)
color_dataframe shape: (4, 2)
location_dataframe shape: (6, 3)


In [16]:
pd.concat(objs = [age, location, color], axis = 1)

Unnamed: 0,Unique Squirrel ID,Age,unique_squirrel_id,lat,long,Unique Squirrel ID.1,Primary Fur Color
0,8A-AM-1013-06,Juvenile,3G-PM-1013-03,-73.974437,40.767428,20A-PM-1017-01,Gray
1,7H-PM-1006-07,Adult,7H-PM-1006-07,-73.970026,40.769934,22F-AM-1007-07,Cinnamon
2,3G-PM-1013-03,Adult,31F-AM-1013-01,-73.959687,40.789379,7H-PM-1006-07,Gray
3,22F-AM-1007-07,Juvenile,8A-AM-1013-06,-73.97731,40.773805,3G-PM-1013-03,Gray
4,20A-PM-1017-01,Adult,22F-AM-1007-07,-73.96466,40.78277,,
5,,,20A-PM-1017-01,-73.970069,40.782889,,


#### Concatenating by rows _(THIS is useful! And it is the typical industry use case for `concat()`)_
Remember this about `concat`: it is an **effective stacker**

Syntax difference: the `axis` argument takes its default value, `axis=0`

In [20]:
# Creating a new data point (row)
new_datapoint = pd.DataFrame(data = [['8A-AM-1013-06', "Cinnamon"]],
                             columns = ['Unique Squirrel ID', 'Primary Fur Color'])

# alternate way to create the new_datapoint dataframe:
new_datapoint_revised = pd.DataFrame({'Unique Squirrel ID':['8A-AM-1013-06'], 'Primary Fur Color':['Cinnamon']})

In [21]:
new_datapoint

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,8A-AM-1013-06,Cinnamon


In [22]:
new_datapoint_revised

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,8A-AM-1013-06,Cinnamon


In [23]:
# Concatenate the new datapoint to the existing color dataframe - after all, they have the same columns!
new_color = pd.concat(objs = [color, new_datapoint], axis = 0)
new_color

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,20A-PM-1017-01,Gray
1,22F-AM-1007-07,Cinnamon
2,7H-PM-1006-07,Gray
3,3G-PM-1013-03,Gray
0,8A-AM-1013-06,Cinnamon


In [24]:
# alternatively, below simplified syntax gets the job done too
pd.concat([color, new_datapoint])

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,20A-PM-1017-01,Gray
1,22F-AM-1007-07,Cinnamon
2,7H-PM-1006-07,Gray
3,3G-PM-1013-03,Gray
0,8A-AM-1013-06,Cinnamon


<span style="color:blue"><b>Task</b></span>: Look carefully at the above dataframe. __Is there anything odd about this new dataframe?__

In [26]:
# Reset the index to fix it! But... that's not enough: the old index is kept as a new column
new_color.reset_index()

Unnamed: 0,index,Unique Squirrel ID,Primary Fur Color
0,0,20A-PM-1017-01,Gray
1,1,22F-AM-1007-07,Cinnamon
2,2,7H-PM-1006-07,Gray
3,3,3G-PM-1013-03,Gray
4,0,8A-AM-1013-06,Cinnamon


In [21]:
# drop=True discards the old index instead of keeping it as a column
new_color.reset_index(drop=True)

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,20A-PM-1017-01,Gray
1,22F-AM-1007-07,Cinnamon
2,7H-PM-1006-07,Gray
3,3G-PM-1013-03,Gray
4,8A-AM-1013-06,Cinnamon


In [27]:
# but wait, new_color itself still has the old, duplicated index: reset_index returned a new dataframe
new_color

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,20A-PM-1017-01,Gray
1,22F-AM-1007-07,Cinnamon
2,7H-PM-1006-07,Gray
3,3G-PM-1013-03,Gray
0,8A-AM-1013-06,Cinnamon


<span style="color:blue"><b>Task</b></span>: How do you make the `reset_index` change *stick*?

In [29]:
new_color.reset_index(drop=True, inplace = True)
new_color

Unnamed: 0,Unique Squirrel ID,Primary Fur Color
0,20A-PM-1017-01,Gray
1,22F-AM-1007-07,Cinnamon
2,7H-PM-1006-07,Gray
3,3G-PM-1013-03,Gray
4,8A-AM-1013-06,Cinnamon
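Two alternatives to `inplace = True` are worth knowing: reassigning the result of `reset_index(drop=True)`, or asking `concat` for a fresh index up front with `ignore_index=True`. A minimal sketch with inline stand-ins for `color` and `new_datapoint` (`stacked` is an illustrative name):

```python
import pandas as pd

# Inline stand-ins for the color dataframe and the new row
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})
new_datapoint = pd.DataFrame({"Unique Squirrel ID": ["8A-AM-1013-06"],
                              "Primary Fur Color": ["Cinnamon"]})

# Option 1: reassign the result instead of using inplace=True
stacked = pd.concat([color, new_datapoint]).reset_index(drop=True)

# Option 2: let concat build a fresh 0..n-1 index directly
new_color = pd.concat([color, new_datapoint], ignore_index=True)

print(new_color.index.tolist())  # [0, 1, 2, 3, 4]
```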


## `.join()`
Difference vs `merge`: `.join()` combines dataframes on their indexes by default and (best part) is **not restricted to just 2 dataframes**, since it also accepts a list of dataframes.

Think about it: a dataframe's index can *in itself act as a primary key* present in each dataframe

[Recommended read](https://towardsdatascience.com/pandas-join-vs-merge-c365fd4fbf49)
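As a rough sketch of joining three dataframes in one call, the example below first moves the squirrel ID into the index of each dataframe, then passes a list to `.join()`. The dataframes are inline stand-ins for the CSV files, and the `_i` names are illustrative.

```python
import pandas as pd

# Inline stand-ins for the three dataframes used in this notebook
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Gray"],
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# join works on the index, so make the squirrel ID the index everywhere
age_i = age.set_index("Unique Squirrel ID")
color_i = color.set_index("Unique Squirrel ID")
location_i = location.set_index("unique_squirrel_id")

# A left join (the default) keeps the 5 squirrels in age and attaches the rest
joined = age_i.join([color_i, location_i], how="left")
print(joined)
```

Here `8A-AM-1013-06` keeps its age and coordinates but gets `NaN` for fur color, and `31F-AM-1013-01` is absent because it is not in `age`.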

## References
- [Central Park Squirrel Census](https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw)