In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%reload_ext postcell
%postcell register

In [None]:
%matplotlib inline

### `pd.concat` combines dataframes vertically or horizontally

Combining multiple datasets is very common. Given these two dataframes, you can combine them via `pd.concat`:

In [None]:
simpsons_2assignments_pd = pd.DataFrame(((np.random.rand(5,2) * 100) )
             , columns=['Assignment 1', 'Assignment 2']
             , index=['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']
            )
simpsons_2assignments_pd = simpsons_2assignments_pd.round()

got_2assignments_pd = pd.DataFrame(((np.random.rand(5,2) * 100) )
             , columns=['Assignment 1', 'Assignment 2']
             , index=[ 'Jon', 'Arya', 'Ned', 'Danny', 'That red lady']
            )
got_2assignments_pd = got_2assignments_pd.round()

#### Combine the following two dataframes vertically

In [None]:
#simpsons_2assignments_pd['Assignment 3'] = '100'

In [None]:
simpsons_2assignments_pd

In [None]:
got_2assignments_pd

In [None]:
pd.concat([simpsons_2assignments_pd, got_2assignments_pd])

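We can pass the `axis` argument explicitly, but it isn't needed here: by default, `pd.concat` stacks dataframes vertically (`axis='rows'`, which is the same as `axis=0`).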

In [None]:
pd.concat([simpsons_2assignments_pd, got_2assignments_pd], axis='rows')
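
A small self-contained check of that default. The frames `top` and `bottom` are made-up stand-ins, not part of this notebook's data:

```python
import pandas as pd

# Two tiny illustrative frames with the same column
top = pd.DataFrame({'Assignment 1': [90.0, 75.0]}, index=['Homer', 'Marge'])
bottom = pd.DataFrame({'Assignment 1': [60.0, 88.0]}, index=['Jon', 'Arya'])

# Omitting axis stacks the frames vertically ...
stacked_default = pd.concat([top, bottom])
# ... which matches asking for axis=0 explicitly
stacked_explicit = pd.concat([top, bottom], axis=0)

print(stacked_default.equals(stacked_explicit))
print(stacked_default)
```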

**Exercise** Please take the first two rows of `simpsons_2assignments_pd` and the last two rows of `got_2assignments_pd` and combine them (vertically) into a single dataframe

In [None]:
%%postcell exercise_030_140_a

#type your answer here


#### Combine the following two dataframes horizontally

In [None]:
simpsons_2assignments_pd

In [None]:
simpsons_2more_assignments_pd = pd.DataFrame(((np.random.rand(5,2) * 100) )
             , columns=['Assignment 3', 'Assignment 4']
             , index=['Homer', 'Marge', 'Bart', 'Maggie', 'Lisa']
            )
simpsons_2more_assignments_pd = simpsons_2more_assignments_pd.round()
simpsons_2more_assignments_pd

In [None]:
pd.concat([simpsons_2assignments_pd, simpsons_2more_assignments_pd], axis='columns')

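What if the second table doesn't have an entry for everyone in the first dataframe? Remember the importance of the `index`: `pd.concat` aligns rows by index label, so anyone missing from the second table gets `NaN` in that table's columns.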

In [None]:
simpsons_2more_assignments_pd.loc['Homer':'Bart']

In [None]:
pd.concat([simpsons_2assignments_pd, simpsons_2more_assignments_pd.loc['Homer':'Bart']], axis='columns', sort=False)
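
To see the index alignment in isolation, here is a minimal sketch with two made-up frames (`first` and `second` are illustrative names). It also shows the `join='inner'` option of `pd.concat`, which keeps only the labels present in every frame:

```python
import pandas as pd

first = pd.DataFrame({'Assignment 1': [90.0, 75.0, 60.0]},
                     index=['Homer', 'Marge', 'Bart'])
# Rows are in a different order, and there is no entry for Bart
second = pd.DataFrame({'Assignment 3': [55.0, 80.0]},
                      index=['Marge', 'Homer'])

# Default join='outer': keep every index label, fill the gaps with NaN
aligned = pd.concat([first, second], axis='columns')

# join='inner': keep only the labels found in both frames
common = pd.concat([first, second], axis='columns', join='inner')

print(aligned)
print(common)
```

Values are matched by label, not by position, so Homer's row picks up Homer's `Assignment 3` score even though he is listed second in `second`.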

**Note** `pd.concat` takes a _list_ of dataframes
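
Because the argument is a list, it can hold more than two dataframes. A short sketch with three made-up single-row frames:

```python
import pandas as pd

# Any number of frames can go in the list
frames = [
    pd.DataFrame({'score': [90]}, index=['Homer']),
    pd.DataFrame({'score': [75]}, index=['Jon']),
    pd.DataFrame({'score': [60]}, index=['Arya']),
]

combined = pd.concat(frames)
print(combined)
```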

### `pd.merge` for SQL-style joins

`pd.concat` can do quite a bit more than just stack two dataframes vertically or horizontally; it provides many features of SQL's joins. However, I generally use `pd.merge` or `df.merge` to duplicate the functionality of SQL.

This Stack Overflow answer provides a better comparison of `merge`, `join` and `concat` than any book or documentation I've read: https://stackoverflow.com/questions/40468069/merge-two-dataframes-by-index/40468090#40468090

Note that since corporate environments generally store data in SQL databases, I prefer to do my joins there, rather than in pandas. SQL servers usually have more memory, faster computation and better optimizers, resulting in faster joins.

#### Combine the following two dataframes

In [None]:
simpsons_jobs_df = pd.DataFrame({'age':[38, 107, 42, 56, 37, 60, 46], 
              'iq': [55, 107, 100, 111, 83, 97, 110],
              'profession': ['Nuclear Safety Inspector', 'CEO', 'Teacher', 'Physician', 'Business Owner', 'Business Owner', 'School Principal'],
              'name': ['Homer', 'Mr. Burns', 'Mrs Krabapple', 'Dr. Hiburt', 'Moe', 'Ned', 'Principal Skinner']
             }
            #, index = ['Homer', 'Mr. Burns', 'Mrs Krabapple', 'Dr. Hiburt', 'Moe', 'Ned', 'Principal Skinner']
            )
simpsons_jobs_df

In [None]:
profession_df = pd.DataFrame({ 
    'profession': ['CEO', 'Teacher', 'Physician', 'Business Owner', 'Nuclear Safety Inspector', 'Mayor'],
    'salary':[17000000, 29000, 120000, 80000, 36000, 98000], 
    'vacation_days': [90, 90, 12, 3, 10, 10]
             })
profession_df

Notice that we can't just `concat` the two tables together. For each person in the first table, we need to look up their profession, find the matching profession in the second table, and bring that row's columns back to the first table.
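
To make that lookup concrete, here is a sketch that does it by hand for a single column, using `Series.map` on two small made-up tables (`people` and `jobs` are illustrative names). `pd.merge` does the same kind of matching for all columns at once:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Homer', 'Moe'],
    'profession': ['Nuclear Safety Inspector', 'Business Owner'],
})
jobs = pd.DataFrame({
    'profession': ['Business Owner', 'Nuclear Safety Inspector'],
    'salary': [80000, 36000],
})

# Build a profession -> salary lookup Series
salary_by_profession = jobs.set_index('profession')['salary']

# For each person, look up the salary that goes with their profession
people_with_salary = people.assign(
    salary=people['profession'].map(salary_by_profession)
)
print(people_with_salary)
```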

In [None]:
pd.merge(simpsons_jobs_df, profession_df, left_on='profession', right_on='profession')

If the column name in both tables is the same, we can just use the `on` argument

In [None]:
pd.merge(simpsons_jobs_df, profession_df, on='profession')
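
When the key columns are named differently, `on` is not an option and `left_on`/`right_on` are required. A sketch on made-up tables, where the left key is called `job` (a hypothetical column name used only for this example):

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Homer', 'Moe'],
    'job': ['Nuclear Safety Inspector', 'Business Owner'],
})
jobs = pd.DataFrame({
    'profession': ['Business Owner', 'Nuclear Safety Inspector'],
    'salary': [80000, 36000],
})

# The key is 'job' on the left and 'profession' on the right
merged = pd.merge(people, jobs, left_on='job', right_on='profession')
print(merged)
```

Both key columns are kept in the result, so you may want to drop one of them afterwards.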

If you want to join with an index, rather than a column, you can use the `right_index` or the `left_index` arguments.

In [None]:
profession_idx_df = profession_df.set_index('profession')
profession_idx_df

Notice that we don't have to do the `reindex` and `set_index` silliness here

In [None]:
pd.merge(simpsons_jobs_df, profession_idx_df, left_on='profession', right_index=True)
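
As a sanity check, joining against the index should give the same data as joining against the column. A sketch on small made-up tables:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Homer', 'Moe'],
    'profession': ['Nuclear Safety Inspector', 'Business Owner'],
})
jobs = pd.DataFrame({
    'profession': ['Business Owner', 'Nuclear Safety Inspector'],
    'salary': [80000, 36000],
})
jobs_idx = jobs.set_index('profession')

# Join on the 'profession' column of both tables
by_column = pd.merge(people, jobs, on='profession')
# Join the left column against the right table's index
by_index = pd.merge(people, jobs_idx, left_on='profession', right_index=True)

# Compare the data only; the row labels of the two results may differ
same = by_column.reset_index(drop=True).equals(by_index.reset_index(drop=True))
print(same)
```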

#### Outer join

In [None]:
simpsons_jobs_df

In [None]:
profession_df

In [None]:
profession_idx_df = profession_df.set_index('profession')
profession_idx_df

Notice that Principal Skinner is not in any of the combined tables. This is because there is no corresponding profession in the second table

In [None]:
pd.merge(simpsons_jobs_df, profession_idx_df, left_on='profession', right_index=True)

We can change the merge style from the default of `inner` to `outer` to include all rows from both tables, whether or not they have a match in the other table

In [None]:
pd.merge(simpsons_jobs_df, profession_idx_df, left_on='profession', right_index=True, how='outer')

Notice that we now also get a row for _Mayor_, which didn't show up before because there was no mayor in the first table
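
If it isn't obvious which table a row came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A sketch on made-up tables that mirror the Skinner and Mayor situation:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Homer', 'Principal Skinner'],
    'profession': ['Nuclear Safety Inspector', 'School Principal'],
})
jobs = pd.DataFrame({
    'profession': ['Nuclear Safety Inspector', 'Mayor'],
    'salary': [36000, 98000],
})

# _merge is 'both', 'left_only' or 'right_only' for each row
outer = pd.merge(people, jobs, on='profession', how='outer', indicator=True)
print(outer)
```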

If you want to keep all rows from only the left or only the right table, use `left` or `right` as the value of the `how` argument. Note that your _SQL_ class will explain these joins in more detail

In [None]:
pd.merge(simpsons_jobs_df, profession_idx_df, left_on='profession', right_index=True, how='left')
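
A side-by-side sketch of `left` and `right` on small made-up tables, to show which unmatched rows each one keeps:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Homer', 'Principal Skinner'],
    'profession': ['Nuclear Safety Inspector', 'School Principal'],
})
jobs = pd.DataFrame({
    'profession': ['Nuclear Safety Inspector', 'Mayor'],
    'salary': [36000, 98000],
})

# how='left': every person is kept; Skinner has no matching salary
left_join = pd.merge(people, jobs, on='profession', how='left')
# how='right': every profession is kept; Mayor has no matching person
right_join = pd.merge(people, jobs, on='profession', how='right')

print(left_join)
print(right_join)
```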

**Dealing with join complexity** For now, it is enough to know that pandas lets you do such joins. A detailed description of joins will come in your SQL class. Once you understand these concepts via SQL, using `merge` will become trivial.