# 1. Combining Data

### Objectives

+ Concatenate multiple DataFrames vertically and horizontally
+ Use basic SQL-style joins with **`merge`**
+ Use pandas **`join`** to combine datasets
+ Know the difference between **`join`** and **`merge`**


### Resources
+ [Merge, join, and concatenate](http://pandas.pydata.org/pandas-docs/stable/merging.html)

### Introduction
Most data analyses use several datasets, or at least several tables derived from the same source. Pandas has tools to merge and combine DataFrames in a wide variety of ways.

In [56]:
import pandas as pd

## Concatenating Data
[Concatenating data](http://pandas.pydata.org/pandas-docs/stable/merging.html) in Pandas refers to stacking DataFrames either on top of one another or side by side. The **`pd.concat`** function takes many different arguments that give you the power to combine two or more datasets at the same time.


### Concatenating very similar DataFrames
**`pd.concat`** provides many different and sometimes confusing arguments. We can use the IEX trading API to get some stock data for Amazon and Apple. We select just three columns and the first 5 rows of each. We will use these small datasets to illustrate how the `concat` function works.

In [35]:
url = 'https://api.iextrading.com/1.0/stock/{}/chart/5y'
cols = ['date', 'close', 'volume']
amzn = pd.read_json(url.format('amzn'))[cols]
aapl = pd.read_json(url.format('aapl'))[cols]

amzn_head = amzn.head()
aapl_head = aapl.head()

In [36]:
aapl_head

Unnamed: 0,date,close,volume
0,2013-10-22,68.0008,133515753
1,2013-10-23,68.6669,78431122
2,2013-10-24,69.576,96191095
3,2013-10-25,68.7975,84448133
4,2013-10-28,69.31,137610123


In [37]:
amzn_head

Unnamed: 0,date,close,volume
0,2013-10-22,332.54,3942953
1,2013-10-23,326.756,2818158
2,2013-10-24,332.21,5884655
3,2013-10-25,363.39,12043903
4,2013-10-28,358.16,3635848
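
If the API endpoint is unreachable, the two five-row DataFrames can be rebuilt by hand from the values printed above. This is only a convenience sketch: converting `date` with `pd.to_datetime` is an assumption about how the original column was parsed.

```python
import pandas as pd

# Values copied from the two outputs shown above.
dates = pd.to_datetime(['2013-10-22', '2013-10-23', '2013-10-24',
                        '2013-10-25', '2013-10-28'])

aapl_head = pd.DataFrame({
    'date': dates,
    'close': [68.0008, 68.6669, 69.576, 68.7975, 69.31],
    'volume': [133515753, 78431122, 96191095, 84448133, 137610123],
})

amzn_head = pd.DataFrame({
    'date': dates,
    'close': [332.54, 326.756, 332.21, 363.39, 358.16],
    'volume': [3942953, 2818158, 5884655, 12043903, 3635848],
})
```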


## Stacking data one on top of the other
The first argument to `concat` needs to be a list of DataFrames. As usual in Pandas, the default is to do the action vertically (`axis=0`). We stack them with the following command:

In [38]:
pd.concat([amzn_head, aapl_head])

Unnamed: 0,date,close,volume
0,2013-10-22,332.54,3942953
1,2013-10-23,326.756,2818158
2,2013-10-24,332.21,5884655
3,2013-10-25,363.39,12043903
4,2013-10-28,358.16,3635848
0,2013-10-22,68.0008,133515753
1,2013-10-23,68.6669,78431122
2,2013-10-24,69.576,96191095
3,2013-10-25,68.7975,84448133
4,2013-10-28,69.31,137610123


Notice that each DataFrame kept its original index, so the labels 0 through 4 now appear twice. Use `ignore_index=True` to make a completely new `RangeIndex` from 0 to n-1.

In [39]:
pd.concat([amzn_head, aapl_head], ignore_index=True)

Unnamed: 0,date,close,volume
0,2013-10-22,332.54,3942953
1,2013-10-23,326.756,2818158
2,2013-10-24,332.21,5884655
3,2013-10-25,363.39,12043903
4,2013-10-28,358.16,3635848
5,2013-10-22,68.0008,133515753
6,2013-10-23,68.6669,78431122
7,2013-10-24,69.576,96191095
8,2013-10-25,68.7975,84448133
9,2013-10-28,69.31,137610123
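
Why the repeated labels matter can be seen in a small sketch. The two tiny frames below are stand-ins built from a few of the closing prices above; the names `top`, `bottom`, `kept`, and `fresh` are made up for this illustration.

```python
import pandas as pd

top = pd.DataFrame({'close': [332.54, 326.756]})
bottom = pd.DataFrame({'close': [68.0008, 68.6669]})

kept = pd.concat([top, bottom])                       # index labels: 0, 1, 0, 1
fresh = pd.concat([top, bottom], ignore_index=True)   # index labels: 0, 1, 2, 3

# With duplicate labels, selecting label 0 returns two rows instead of one.
print(kept.index.is_unique)    # False
print(len(kept.loc[0]))        # 2
print(fresh.index.is_unique)   # True
```

Passing `verify_integrity=True` to `pd.concat` raises an error instead of silently producing duplicate labels.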



### Label each piece of the DataFrame with the `keys` parameter
You can use the `keys` parameter to label each piece of the DataFrame. This creates a `MultiIndex` whose outer level holds the keys.

In [41]:
pd.concat([amzn_head, aapl_head], keys=['amzn', 'aapl'])

Unnamed: 0,Unnamed: 1,date,close,volume
amzn,0,2013-10-22,332.54,3942953
amzn,1,2013-10-23,326.756,2818158
amzn,2,2013-10-24,332.21,5884655
amzn,3,2013-10-25,363.39,12043903
amzn,4,2013-10-28,358.16,3635848
aapl,0,2013-10-22,68.0008,133515753
aapl,1,2013-10-23,68.6669,78431122
aapl,2,2013-10-24,69.576,96191095
aapl,3,2013-10-25,68.7975,84448133
aapl,4,2013-10-28,69.31,137610123
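
Once the pieces are labeled, the outer level can be used to pull a piece back out. The sketch below uses two tiny stand-in frames; the level names `symbol` and `row` passed through the `names` parameter are chosen here for illustration only.

```python
import pandas as pd

amzn_small = pd.DataFrame({'close': [332.54, 326.756]})
aapl_small = pd.DataFrame({'close': [68.0008, 68.6669]})

labeled = pd.concat([amzn_small, aapl_small], keys=['amzn', 'aapl'],
                    names=['symbol', 'row'])

# Select one piece by its outer label.
print(labeled.loc['aapl'])

# Select a single value with an (outer, inner) tuple.
print(labeled.loc[('amzn', 1), 'close'])

# Move the outer level into an ordinary column.
flat = labeled.reset_index(level='symbol')
print(flat)
```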


### Perhaps it's better to just make a new column beforehand

In [44]:
# assign returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is set directly on the result of head()
amzn_head = amzn_head.assign(symbol='amzn')
aapl_head = aapl_head.assign(symbol='aapl')
pd.concat([amzn_head, aapl_head])

Unnamed: 0,date,close,volume,symbol
0,2013-10-22,332.54,3942953,amzn
1,2013-10-23,326.756,2818158,amzn
2,2013-10-24,332.21,5884655,amzn
3,2013-10-25,363.39,12043903,amzn
4,2013-10-28,358.16,3635848,amzn
0,2013-10-22,68.0008,133515753,aapl
1,2013-10-23,68.6669,78431122,aapl
2,2013-10-24,69.576,96191095,aapl
3,2013-10-25,68.7975,84448133,aapl
4,2013-10-28,69.31,137610123,aapl


## Beware! Automatic Alignment of Index
Of extreme importance to **`pd.concat`** (and all of pandas) is the automatic alignment of indexes that happens behind the scenes. For instance, let's rename the `close` column of `amzn_head` and concatenate once again.

In [49]:
amzn_head2 = amzn_head.rename(columns={'close': 'closing price'})
pd.concat([amzn_head2, aapl_head])


Unnamed: 0,close,closing price,date,symbol,volume
0,,332.54,2013-10-22,amzn,3942953
1,,326.756,2013-10-23,amzn,2818158
2,,332.21,2013-10-24,amzn,5884655
3,,363.39,2013-10-25,amzn,12043903
4,,358.16,2013-10-28,amzn,3635848
0,68.0008,,2013-10-22,aapl,133515753
1,68.6669,,2013-10-23,aapl,78431122
2,69.576,,2013-10-24,aapl,96191095
3,68.7975,,2013-10-25,aapl,84448133
4,69.31,,2013-10-28,aapl,137610123


## Column names align first
`pd.concat` does automatic alignment on the column names and by default does an outer join, keeping every column that appears in any of the DataFrames. Notice the missing values where the column names do not match. We can force an `inner` join, where only the columns in common are kept.

In [51]:
pd.concat([amzn_head2, aapl_head], join='inner')

Unnamed: 0,date,volume,symbol
0,2013-10-22,3942953,amzn
1,2013-10-23,2818158,amzn
2,2013-10-24,5884655,amzn
3,2013-10-25,12043903,amzn
4,2013-10-28,3635848,amzn
0,2013-10-22,133515753,aapl
1,2013-10-23,78431122,aapl
2,2013-10-24,96191095,aapl
3,2013-10-25,84448133,aapl
4,2013-10-28,137610123,aapl


## Use `axis=1` to change the direction of concatenation
Passing `axis=1` (or, equivalently, `axis='columns'`) places the DataFrames side by side. An automatic alignment on the index still happens here.

In [54]:
pd.concat([amzn_head, aapl_head], axis='columns')

Unnamed: 0,date,close,volume,symbol,date.1,close.1,volume.1,symbol.1
0,2013-10-22,332.54,3942953,amzn,2013-10-22,68.0008,133515753,aapl
1,2013-10-23,326.756,2818158,amzn,2013-10-23,68.6669,78431122,aapl
2,2013-10-24,332.21,5884655,amzn,2013-10-24,69.576,96191095,aapl
3,2013-10-25,363.39,12043903,amzn,2013-10-25,68.7975,84448133,aapl
4,2013-10-28,358.16,3635848,amzn,2013-10-28,69.31,137610123,aapl
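
Because both frames above share the index 0 through 4, the alignment is not visible. A minimal sketch with deliberately mismatched indexes shows it; the frames `left` and `right` and their column names are invented for this example.

```python
import pandas as pd

left = pd.DataFrame({'amzn_close': [332.54, 326.756, 332.21]}, index=[0, 1, 2])
right = pd.DataFrame({'aapl_close': [68.6669, 69.576, 68.7975]}, index=[1, 2, 3])

# Default outer join: every index label is kept, gaps become NaN.
outer = pd.concat([left, right], axis='columns')
print(outer)

# Inner join: only index labels present in both frames are kept.
inner = pd.concat([left, right], axis='columns', join='inner')
print(inner)
```

Rows are matched by index label, not by position, so label 1 of `left` lines up with label 1 of `right` even though they sit in different positions.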


# SQL-style joins with `merge`

Many people will come to learn Pandas after learning SQL (structured query language). A very important component of SQL is its ability to join tables. Pandas provides a similar feature with the **`merge`** method.

**`merge`** is available both as a pandas function and as a DataFrame method, and the two do the exact same thing. It joins tables horizontally by aligning column or index values. **`merge`** cannot be used to stack two frames on top of one another like **`concat`** and **`append`** (note that `append` was deprecated in pandas 1.4 and removed in 2.0, so use `pd.concat` for stacking in current versions).
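
A minimal sketch of `merge` on a shared column follows. The two tables are invented for illustration: the tickers `xyz` and `abc`, their values, and the frame names `prices` and `names` are hypothetical and not part of the stock data used earlier.

```python
import pandas as pd

prices = pd.DataFrame({
    'symbol': ['amzn', 'aapl', 'xyz'],        # 'xyz' is a made-up ticker
    'close': [332.54, 68.0008, 10.0],         # 10.0 is a made-up value
})
names = pd.DataFrame({
    'symbol': ['amzn', 'aapl', 'abc'],        # 'abc' is a made-up ticker
    'name': ['Amazon', 'Apple', 'Hypothetical Co'],
})

# Function form and method form give the same result (inner join by default).
inner = pd.merge(prices, names, on='symbol')
same = prices.merge(names, on='symbol')

# how='left' keeps every row of prices; how='outer' keeps rows from both.
left = prices.merge(names, on='symbol', how='left')
outer = prices.merge(names, on='symbol', how='outer')

print(inner)
print(left)
print(outer)
```

The `how` argument plays the role of `INNER JOIN`, `LEFT JOIN`, and `FULL OUTER JOIN` in SQL, and `on` names the column whose values are matched.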

# End of Section Summary
1. Add a single column
2. Use **`insert`** to add a column in a specific place
3. Drop a column
4. Automatic alignment happens first with **`pd.concat`**
5. Concatenate by row or column with the **`axis`** argument
6. Create multiple levels of indexes and label them with the **`keys`** argument
7. Change the type of alignment with **`outer`** and **`inner`** joins
8. Know how to spot a pandas function vs a DataFrame method
9. **`append`** is a DataFrame method for simple concatenation (removed in pandas 2.0). **`pd.concat`** is a function with more power
10. Use **`merge`** to do SQL-like joins based on column values
11. **`merge`** is both a DataFrame method and a pandas function
12. Use **`join`** to do SQL joins mainly on the index
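
The difference between **`join`** and **`merge`** noted in the last items can be sketched as follows. The frames `closes` and `volumes` are small stand-ins indexed by ticker, built from values shown earlier; the equivalence below assumes both frames carry the key in their index.

```python
import pandas as pd

closes = pd.DataFrame({'close': [332.54, 68.0008]}, index=['amzn', 'aapl'])
volumes = pd.DataFrame({'volume': [3942953, 133515753]}, index=['amzn', 'aapl'])

# join aligns on the index and performs a left join by default.
joined = closes.join(volumes)

# merge aligns on columns by default; index alignment must be requested.
merged = closes.merge(volumes, left_index=True, right_index=True, how='left')

print(joined)
print(merged)
```

In short, `join` is a convenience for index-based combination, while `merge` is the more general tool and defaults to matching on shared column names.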

# Problem Set
We will be working with the college admissions dataset for the questions in this notebook. Run the following command before attempting the problems.

In [36]:
import pandas as pd
import numpy as np

college = pd.read_csv('../data/college.csv')
pd.options.display.max_columns = 40

### Problem 1
<span  style="color:green; font-size:16px">Insert a column called **`SAT_AVG`**, the average of the math and verbal SAT scores, immediately before the **`SATVRMID`** column.</span>

In [37]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Read in all three stock csv files and concatenate them horizontally and vertically. Create a hierarchical index that labels each year of data.</span>

In [38]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Take a look at the DataFrame below. Count the total appearances of each letter.</span>

In [39]:
from string import ascii_lowercase
np.random.seed(1)
df = pd.DataFrame(np.random.choice(list(ascii_lowercase), (20,5), replace=True), 
                  columns = ['col1', 'col2', 'col3', 'col4', 'col5'])

df

Unnamed: 0,col1,col2,col3,col4,col5
0,f,l,m,i,j
1,l,f,p,a,q
2,b,m,h,n,g
3,z,s,u,f,s
4,u,l,k,o,s
5,e,x,x,j,r
6,x,a,w,n,j
7,j,h,w,z,b
8,a,r,i,y,n
9,t,p,k,z,i


In [40]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Each Series below represents the amount of TV watched for each sport on one day. Combine all Series so that each column represents a different labeled day. Fill in the missing values with 0. Save the result to **`df_sports`**.</span>

In [41]:
day1 = pd.Series({'soccer':45, 'basketball':30, 'tennis':10})
day2 = pd.Series({'soccer':55, 'basketball':10, 'bowling':10, 'volleyball':30})
day3 = pd.Series({'soccer':15, 'basketball':20, 'volleyball':40})
day4 = pd.Series({'bowling':100, 'volleyball':20})

In [42]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Use **`df_sports`** to find the total TV watched per sport for all the days and also the total amount of TV watched per day. Sort both results from greatest to least.</span>

In [43]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Look up the method **`isnull`** and count the number of nulls per sport.</span>

In [44]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Combine all Series again, keeping only the sports that have no missing values for any days.</span>

In [45]:
# your code here