# Merging and Joining

One important task that will come up again and again is merging data together. In Real Life™ we rarely have a single dataset - we usually need to combine data from multiple sources.
Luckily, pandas makes it very simple to join datasets together! Let's start by reading in our previous data, and then we can think about some external data to bring in.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('data/Consumo_cerveja.csv', 
                 decimal=',', 
                 thousands='.', 
                 header=0, 
                 names=['date','median_temp','min_temp','max_temp','precip','weekend','consumption'], 
                 parse_dates=['date'], 
                 nrows=365)

In [None]:
df.head()

We have a time series, so holidays are always relevant to look at - is beer consumption affected by holidays? Let's find some holiday data (and show off pandas `.read_html`)!
For all our informational needs, we turn to [Wikipedia](https://en.wikipedia.org/wiki/Public_holidays_in_Brazil):

![warning](images/warning.resized.png) `.read_html` requires an HTML parser to be installed: `lxml` is tried first, with `beautifulsoup4` plus `html5lib` as the fallback

`.read_html` reads a webpage, finds all `<table>` elements and converts them to dataframes - we always end up with a list of dataframes, even if there's only one table!

In [None]:
holidays = pd.read_html('https://en.wikipedia.org/wiki/Public_holidays_in_Brazil', header=0)[0]
holidays

Now we have a beautiful dataframe straight from Wikipedia! We do have one more step before we are ready to merge - we need to turn `Date` into a `datetime` type so we can match it with our data. We have to remember to add a year, otherwise pandas will fall back to the default year, 1900.

In [None]:
holidays['Date'] = pd.to_datetime(holidays.Date + ', 2015', format='%B %d, %Y')
holidays
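To see the year handling in isolation, here is a minimal sketch on a few made-up date strings (the `dates` series is an assumption for illustration, not the Wikipedia table itself). It parses the same strings with and without an appended year:

```python
import pandas as pd

# Made-up strings in the same "Month day" shape as the holiday table
dates = pd.Series(['January 1', 'April 21', 'December 25'])

# No year in the string or the format: parsing falls back to 1900
no_year = pd.to_datetime(dates, format='%B %d')

# Append the year to the string and add %Y to the format
with_year = pd.to_datetime(dates + ', 2015', format='%B %d, %Y')

print(no_year.dt.year.tolist())
print(with_year.dt.year.tolist())
```

Dates parsed without a year would never match the 2015 dates in the consumption data, so the merge would silently find no holidays.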

Now we have a nicely formatted dataframe, ready to join. We don't really want all the extra information, we are mostly interested in a binary holiday/not_holiday marker, so let's add a column of ones:

In [None]:
holidays['holiday'] = 1
holidays

To merge, we simply use `df.merge()`. It takes a number of options, so let's try a first merge and then work through them:

In [None]:
df.merge(holidays, left_on='date', right_on='Date')

First off, we are only getting 8 rows back - why is that? 

By default, `.merge` assumes you want to do an `inner join` - for those of you who know SQL, this makes perfect sense :-)

An inner join returns only those rows where we can match the key in both datasets - in this case we keep only those days that are in both our consumption dataset and our holidays dataset.

What we want is to keep all rows in our consumption dataset and add on the rows from holidays that match - we want a `left join`. ("Left" is of course relative to which dataframe we call `.merge` on.)

In [None]:
df_holidays = df.merge(holidays, left_on='date', right_on='Date', how='left')
df_holidays
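The difference between the two join types is easier to see on a toy example. The sketch below uses two small made-up frames (`days` and `special` are hypothetical stand-ins for the consumption and holiday data) and runs the same merge with both settings:

```python
import pandas as pd

# Three days of made-up consumption data
days = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']),
    'consumption': [25.5, 28.9, 30.8],
})

# Two made-up holidays, only one of which falls on a day in `days`
special = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01', '2015-12-25']),
    'holiday': [1, 1],
})

# Inner join (the default): only keys present in both frames survive
inner = days.merge(special, left_on='date', right_on='Date')

# Left join: every row of `days` survives, unmatched rows get NaN
left = days.merge(special, left_on='date', right_on='Date', how='left')

print(len(inner), len(left))
print(left)
```

Note that the December holiday disappears in both results: a left join keeps all rows of the left frame, not of the right one.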

Now we have all our rows, and pandas simply fills in `NaN` in those rows that don't match. In order to get our binary holiday marker, we simply `.fillna()` our holiday column and convert it back to integers, since the `NaN`s turned it into a float column:

In [None]:
df_holidays['holiday'] = df_holidays.holiday.fillna(0).astype(int)
df_holidays
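As a small illustration of why the `.astype(int)` is there, the sketch below starts from a made-up marker column containing `NaN` (the `flags` series is an assumption standing in for the merged `holiday` column):

```python
import pandas as pd

# A marker column as it might look after a left join: NaN where nothing matched
flags = pd.Series([1, float('nan'), float('nan'), 1])
print(flags.dtype)      # NaN forces a floating point column

# Fill the gaps with 0, then cast back to whole numbers
cleaned = flags.fillna(0).astype(int)
print(cleaned.tolist())
```

Calling `.astype(int)` before `.fillna(0)` would raise an error, because `NaN` has no integer representation, so the order of the two calls matters.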

We now have a lot of junk that we are not interested in, particularly the duplication of the date columns - I could simply drop them, but let's look at how we can keep them out of the merge altogether

In [None]:
merge_holidays = holidays.rename(columns={"Date": "date"})[["date", "holiday"]]
df_holidays = df.merge(merge_holidays, on='date', how='left').assign(holiday=lambda x: x.holiday.fillna(0).astype(int))
df_holidays

If the keys have the same name in both datasets, we can use the `on` parameter, which avoids the duplicated key column - we also select only the columns we are interested in before merging. Now we have our prepared data!
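To make the effect on the columns concrete, here is a minimal sketch comparing the two approaches on made-up data (`days`, `special` and the `Name` column are hypothetical stand-ins):

```python
import pandas as pd

days = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-01-02']),
    'consumption': [25.5, 28.9],   # made-up values
})
special = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01']),
    'Name': ["New Year's Day"],
    'holiday': [1],
})

# Different key names: both key columns and all extras end up in the result
two_keys = days.merge(special, left_on='date', right_on='Date', how='left')

# Same key name plus a column selection: one key column, no extras
slim = special.rename(columns={'Date': 'date'})[['date', 'holiday']]
one_key = days.merge(slim, on='date', how='left')

print(list(two_keys.columns))
print(list(one_key.columns))
```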

# Joining

Joining is a special case of merging, where we simply merge on the index - this can be useful when we know our indexes hold the same keys.

The main differences are that `.join` defaults to a `left join` and that, by default, it matches on the indexes rather than on columns

In [None]:
df = df.set_index('date')
holidays = holidays.set_index('Date')

In [None]:
df.join(holidays['holiday']).assign(holiday=lambda x: x.holiday.fillna(0).astype(int))
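As a rough check of the claim that joining is merging on the index, the sketch below builds two small made-up frames indexed by date (`days` and `special` are hypothetical) and compares `.join` with the equivalent `.merge` call:

```python
import pandas as pd

# Made-up data, indexed by date
days = pd.DataFrame(
    {'consumption': [25.5, 28.9, 30.8]},
    index=pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']),
)
special = pd.DataFrame(
    {'holiday': [1]},
    index=pd.to_datetime(['2015-01-01']),
)

# .join: left join on the index, no extra arguments needed
joined = days.join(special)

# The same thing spelled out with .merge
merged = days.merge(special, left_index=True, right_index=True, how='left')

print(joined)
print(merged)
```

Both calls keep all three days and leave `NaN` in `holiday` for the two days without a match.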

# Concatenation

Concatenation also comes up often. You can concatenate along both axes - adding more rows or adding more columns.

Let's say I want to look only at the 10 days with the highest and the 10 days with the lowest consumption

In [None]:
top10 = df_holidays.sort_values(by='consumption', ascending=False).reset_index(drop=True).head(10)
bottom10 = df_holidays.sort_values(by='consumption').reset_index(drop=True).head(10)

In [None]:
top10

In [None]:
bottom10

If I want to compare them easily, I can just concatenate them together

In [None]:
pd.concat([bottom10, top10]).sort_values(by='consumption', ascending=False)
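One thing to watch when stacking rows is the index: `pd.concat` keeps the original labels, so two frames that were both reset to start at 0 produce repeated labels. A minimal sketch with made-up values (`low` and `high` are hypothetical stand-ins for the bottom and top frames):

```python
import pandas as pd

low = pd.DataFrame({'consumption': [14.3, 14.9]})    # index 0, 1
high = pd.DataFrame({'consumption': [37.9, 37.7]})   # index 0, 1

# Default: original index labels are kept, so 0 and 1 appear twice
stacked = pd.concat([low, high])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([low, high], ignore_index=True)

# keys= adds an outer index level recording where each row came from
labeled = pd.concat([low, high], keys=['bottom', 'top'])

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labeled.loc['top'])
```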

What if I want to compare the consumption side by side?

In [None]:
pd.concat([bottom10.consumption, top10.consumption], axis=1, keys=['bottom', 'top'])
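Side-by-side concatenation lines rows up by index label, not by position, which is why resetting the index of both frames beforehand matters. The sketch below uses made-up series (`a`, `b` and `shifted` are hypothetical) to show an aligned and a misaligned case:

```python
import pandas as pd

a = pd.Series([37.9, 37.7, 37.0], name='consumption')    # index 0, 1, 2
b = pd.Series([14.3, 14.9, 15.2], name='consumption')    # index 0, 1, 2

# Matching indexes: three rows, two columns named by `keys`
aligned = pd.concat([b, a], axis=1, keys=['bottom', 'top'])

# Same values for the second series, but under different index labels
shifted = pd.Series([37.9, 37.7, 37.0], index=[2, 3, 4], name='consumption')

# Only label 2 is shared, so the result has five rows padded with NaN
misaligned = pd.concat([b, shifted], axis=1, keys=['bottom', 'top'])

print(aligned)
print(misaligned)
```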