In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%reload_ext postcell
%postcell register

In [None]:
%matplotlib inline

### `pd.concat` combines dataframes vertically or horizontally

Combining multiple datasets is very common. Given these two dataframes, you can combine them via `pd.concat`:

In [None]:
# Random scores between 0 and 100 for two assignments, rounded to whole numbers
simpsons_2assignments_pd = pd.DataFrame(
    np.random.rand(5, 2) * 100,
    columns=['Assignment 1', 'Assignment 2'],
    index=['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie'],
).round()

got_2assignments_pd = pd.DataFrame(
    np.random.rand(5, 2) * 100,
    columns=['Assignment 1', 'Assignment 2'],
    index=['Jon', 'Arya', 'Ned', 'Danny', 'That red lady'],
).round()

#### Combine the following two dataframes vertically

In [None]:
#simpsons_2assignments_pd['Assignment 3'] = '100'

In [None]:
simpsons_2assignments_pd

In [None]:
got_2assignments_pd

In [None]:
pd.concat([simpsons_2assignments_pd, got_2assignments_pd])

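We can pass an `axis` parameter, but it isn't needed here: by default, `pd.concat` stacks the dataframes vertically (`axis=0`, which the call below spells as `axis='rows'`).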

In [None]:
pd.concat([simpsons_2assignments_pd, got_2assignments_pd], axis='rows')
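
As a quick check of the default, here is a minimal sketch on two tiny dataframes. The names `top` and `bottom` and the scores in them are made up for illustration. It compares the default call with the explicit `axis=0` and `axis='index'` spellings:

```python
import pandas as pd

# Hypothetical two-row dataframes with fixed scores
top = pd.DataFrame({'Assignment 1': [90.0, 75.0]}, index=['Homer', 'Marge'])
bottom = pd.DataFrame({'Assignment 1': [60.0, 88.0]}, index=['Jon', 'Arya'])

default_result = pd.concat([top, bottom])                # no axis given
explicit_result = pd.concat([top, bottom], axis=0)       # same thing, by number
named_result = pd.concat([top, bottom], axis='index')    # same thing, by name

# All three stack the rows of `bottom` underneath the rows of `top`
print(default_result.equals(explicit_result))
print(default_result.equals(named_result))
```

Spelling out the axis is mostly a readability choice: it tells the reader which direction you intended without making them recall the default.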

**Exercise** Please take the first two rows of `simpsons_2assignments_pd` and the last two rows of `got_2assignments_pd` and combine them (vertically) into a single dataframe

In [None]:
%%postcell exercise_030_140_a

#type your answer here


#### Combine the following two dataframes horizontally

In [None]:
simpsons_2assignments_pd

In [None]:
# Two more assignments for the same five students
simpsons_2more_assignments_pd = pd.DataFrame(
    np.random.rand(5, 2) * 100,
    columns=['Assignment 3', 'Assignment 4'],
    index=['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie'],
).round()
simpsons_2more_assignments_pd

In [None]:
pd.concat([simpsons_2assignments_pd, simpsons_2more_assignments_pd], axis='columns')

What if the second table doesn't have entries for everyone in the first dataframe? `pd.concat` aligns rows on the `index`, so anyone missing from the second table gets `NaN` in its columns:

In [None]:
simpsons_2more_assignments_pd.loc['Homer':'Bart']

In [None]:
pd.concat([simpsons_2assignments_pd, simpsons_2more_assignments_pd.loc['Homer':'Bart']], axis='columns', sort=False)
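
To see the index alignment more concretely, here is a small sketch with fixed, made-up scores (the names `first` and `second` are hypothetical). It also tries the `join='inner'` option of `pd.concat`, which keeps only the index labels present in every dataframe instead of filling the gaps with `NaN`:

```python
import pandas as pd

# Hypothetical data: `second` has no row for Bart
first = pd.DataFrame({'Assignment 1': [90.0, 75.0, 60.0]},
                     index=['Homer', 'Marge', 'Bart'])
second = pd.DataFrame({'Assignment 3': [80.0, 70.0]},
                      index=['Homer', 'Marge'])

# Default (join='outer'): keep every index label, fill missing cells with NaN
outer = pd.concat([first, second], axis='columns')

# join='inner': keep only the labels found in both dataframes
inner = pd.concat([first, second], axis='columns', join='inner')

print(outer)
print(inner)
```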

**Note** `pd.concat` takes a _list_ of dataframes as its first argument (hence the square brackets), not the dataframes as separate arguments.
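
Because the dataframes are passed as a single collection, nothing limits you to two of them. The sketch below uses made-up data, including a hypothetical third group called `futurama`. It stacks three dataframes in one call, then uses the `keys` argument to record which dataframe each row came from:

```python
import pandas as pd

# Hypothetical scores for three groups of students
simpsons = pd.DataFrame({'Assignment 1': [90.0, 75.0]}, index=['Homer', 'Marge'])
got = pd.DataFrame({'Assignment 1': [60.0, 88.0]}, index=['Jon', 'Arya'])
futurama = pd.DataFrame({'Assignment 1': [55.0, 99.0]}, index=['Fry', 'Leela'])

# One call stacks all three dataframes
combined = pd.concat([simpsons, got, futurama])

# `keys` adds an outer index level naming the source of each row
labeled = pd.concat([simpsons, got, futurama],
                    keys=['simpsons', 'got', 'futurama'])

print(combined)
print(labeled.loc['got'])   # just the rows that came from `got`
```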

### `pd.merge` performs SQL-style joins

`pd.concat` can do more than stack two dataframes vertically or horizontally. It provides many features of SQL joins. However, I generally use `pd.merge` or `df.merge` to reproduce the functionality of SQL.

This Stack Overflow answer provides a better comparison of `merge`, `join` and `concat` than any book or documentation I've read: https://stackoverflow.com/questions/40468069/merge-two-dataframes-by-index/40468090#40468090

Note that since corporate environments generally store data in SQL databases, I prefer to do my joins there rather than in pandas. SQL servers usually have more memory, faster computation and better optimizers, resulting in faster joins.

#### Combine the following two dataframes

In [None]:
simpsons_jobs_df = pd.DataFrame({'age':[38, 107, 42, 56, 37], 
              'iq': [55, 107, 100, 111, 83],
              'profession': ['Nuclear Safety Inspector', 'CEO', 'Teacher', 'Physician', 'Business Owner'],
             }
            , index = ['Homer', 'Mr. Burns', 'Mrs Krabapple', 'Dr. Hiburt', 'Moe'])
simpsons_jobs_df

In [None]:
profession_df = pd.DataFrame({'salary':[36000, 17000000, 29000, 120000, 80000], 
              'vacation_days': [10, 90, 90, 12, 3]
             }
            , index = ['Nuclear Safety Inspector', 'CEO', 'Teacher', 'Physician', 'Business Owner'])
profession_df

Notice that for each row in the dataframe containing Simpsons characters, we need to look up the value in the `profession` column and match it with the index of the second dataframe. Instead of doing this manually, we can let `merge` do it for us:

In [None]:
pd.merge(simpsons_jobs_df, profession_df, left_on='profession', right_index=True)
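
The call above uses the default `how='inner'`, which silently drops any row whose profession has no match in the second dataframe. As a hedged sketch on made-up data (the `people` and `pay` dataframes and their values are hypothetical), here is how `how='left'` keeps such rows and fills the missing columns with `NaN`:

```python
import pandas as pd

# Hypothetical data: 'Student' has no entry in `pay`
people = pd.DataFrame({'profession': ['Nuclear Safety Inspector', 'CEO', 'Student']},
                      index=['Homer', 'Mr. Burns', 'Lisa'])
pay = pd.DataFrame({'salary': [36000, 17000000]},
                   index=['Nuclear Safety Inspector', 'CEO'])

# Default inner join: Lisa is dropped because 'Student' is not in pay's index
inner = pd.merge(people, pay, left_on='profession', right_index=True)

# Left join: every row of `people` is kept; Lisa's salary becomes NaN
left = pd.merge(people, pay, left_on='profession', right_index=True, how='left')

print(inner)
print(left)
```

Comparing the row counts of the inner and left results is a cheap way to notice lookup values that failed to match.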

**Dealing with join complexity** For now, you only need to know that pandas lets you do such joins. Joins will be covered in detail in your SQL class. Once you understand these concepts in SQL, using `merge` will become trivial.