<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Merging and Concatenation with `pandas`

_Authors: Kiefer Katovich (SF)_

---


### Learning Objectives
- Understand the use cases for concatenating vectors, matrices, and DataFrames.
- Practice concatenating DataFrames using `pandas`.
- Join `pandas` DataFrames using SQL-style JOIN operations.

### Lesson Guide
- [Overview of Concatenation and Joining](#introduction)
- [Concatenation using `pandas`](#pandas_concatenation)
- [SQL-style JOINs using `pandas`](#pandas_joins)

<a id='introduction'></a>

### Overview of Concatenation and Joining

---

**Concatenation** is the process of joining separate objects along a dimension to create a single new object. In
programming, for example, two or more character strings are often concatenated so that they can be handled as a single item.

In `pandas`, we will be concatenating DataFrames together along rows or columns. Likewise, in `numpy`, you can concatenate vectors and matrices along an axis.
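
As a small illustration of the `numpy` case, the sketch below concatenates two 2x2 matrices along each axis. The array names and values are made up for this example.

```python
import numpy as np

# Two small matrices with the same shape (illustrative values)
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 stacks along the rows: the result has more rows
rows = np.concatenate([a, b], axis=0)

# axis=1 stacks along the columns: the result has more columns
cols = np.concatenate([a, b], axis=1)

print(rows.shape)
print(cols.shape)
```

The same `axis` convention carries over to `pd.concat()`.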

**JOINs** in `pandas` combine the columns of two DataFrames by matching rows on an index or key column. The concept is the same as SQL JOINs. In `pandas`, JOINs are typically accomplished using the `pd.merge()` function (or the equivalent `DataFrame.merge()` method).

Here is a representation of LEFT, RIGHT, INNER, and OUTER JOINs in Venn diagrams:

![](./assets/joins.png)

<a id='pandas_concatenation'></a>

### Concatenation using `pandas`

---

Oftentimes, you'll want to concatenate two DataFrames. Perhaps your data is split into two groups of subjects with the same variables/columns and you want to join them together (stacking vertically, i.e., adding rows). Or perhaps you want to add new variables for all of your existing subjects (stacking horizontally, i.e., adding columns).

Below we have two simple data sets we can use to practice `pandas` concatenation.

In [17]:
import pandas as pd

In [18]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])

In [32]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [33]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [36]:
# Stacking
pd.concat([df1,df2],axis=0)

# NOTE: the old shortcut df1.append(df2) was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat() instead.

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [38]:
pd.concat([df1,df2],axis=1)

Unnamed: 0,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,,,,
1,A1,B1,C1,D1,,,,
2,A2,B2,C2,D2,,,,
3,A3,B3,C3,D3,,,,
4,,,,,A4,B4,C4,D4
5,,,,,A5,B5,C5,D5
6,,,,,A6,B6,C6,D6
7,,,,,A7,B7,C7,D7


In [39]:
# Selecting 'A' returns two columns because the concatenated DataFrame has duplicate column labels
pd.concat([df1,df2],axis=1)['A']

Unnamed: 0,A,A
0,A0,
1,A1,
2,A2,
3,A3,
4,,A4
5,,A5
6,,A6
7,,A7


In `pandas`, we can use the `pd.concat()` function to stack DataFrames vertically or horizontally. `pd.concat()` takes a list of `pandas` DataFrames as its first argument, then an `axis` keyword argument indicating how to concatenate the DataFrames.

The `axis` argument works the same way as in `numpy`: `axis=0` (the default) stacks along the rows, and `axis=1` stacks along the columns.

**Concatenate `df1` and `df2` by stacking them vertically.**

In [19]:
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])

0    a
1    b
0    c
1    d
dtype: object
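
The output above repeats the index labels `0` and `1` because `pd.concat()` keeps each input's original index. If the original labels are not needed, one option is the `ignore_index=True` argument, sketched here with the same two Series (the name `combined` is only for illustration):

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])

# ignore_index=True discards the original labels and
# assigns a fresh 0..n-1 index to the result
combined = pd.concat([s1, s2], ignore_index=True)
print(combined)
```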

In [20]:
# Vertical concatenation
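
One possible way to complete this cell is sketched below. It rebuilds `df1` and `df2` with the same values as above so the snippet runs on its own; the name `vertical` is just for illustration.

```python
import pandas as pd

# Rebuild df1 and df2 with the same values used earlier in the lesson
cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame({c: [f'{c}{i}' for i in range(0, 4)] for c in cols},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({c: [f'{c}{i}' for i in range(4, 8)] for c in cols},
                   index=[4, 5, 6, 7])

# axis=0 stacks the rows of df2 underneath the rows of df1
vertical = pd.concat([df1, df2], axis=0)
print(vertical)
```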

**Concatenate `df1` and `df2` by stacking them horizontally.**

In [21]:
# Horizontal concatenation
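
A sketch of one possible answer follows, again rebuilding `df1` and `df2` so it is self-contained (the name `horizontal` is illustrative). Because the two DataFrames share no row labels, every row is filled with nulls on one side.

```python
import pandas as pd

# Rebuild df1 and df2 with the same values used earlier in the lesson
cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame({c: [f'{c}{i}' for i in range(0, 4)] for c in cols},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({c: [f'{c}{i}' for i in range(4, 8)] for c in cols},
                   index=[4, 5, 6, 7])

# axis=1 places the columns side by side, aligning rows on the index
horizontal = pd.concat([df1, df2], axis=1)
print(horizontal)

# The column labels are duplicated, so selecting 'A' returns two columns
print(horizontal['A'].shape)
```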

You can see that, because the row indices of the two DataFrames differ, `pd.concat()` aligns the rows on the index and fills the unmatched cells with null values. Perhaps we don't care about the row labels during the horizontal concatenation. If you reset the index of `df2` prior to concatenation, its rows line up with those of `df1` and no null values are filled in:

In [22]:
# Horizontal concatenation ignoring row labels
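
One way to do this, as a sketch, is `reset_index(drop=True)`, which replaces the labels `4..7` on `df2` with `0..3` so they match `df1`. The name `side_by_side` is illustrative.

```python
import pandas as pd

# Rebuild df1 and df2 with the same values used earlier in the lesson
cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame({c: [f'{c}{i}' for i in range(0, 4)] for c in cols},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({c: [f'{c}{i}' for i in range(4, 8)] for c in cols},
                   index=[4, 5, 6, 7])

# drop=True discards the old index instead of keeping it as a new column
df2_reset = df2.reset_index(drop=True)

# Both frames now share the index 0..3, so the rows line up
side_by_side = pd.concat([df1, df2_reset], axis=1)
print(side_by_side)
```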

<a id='pandas_joins'></a>

### LEFT, RIGHT, INNER, and OUTER JOINs in `pandas`

---

The `pandas` `pd.merge()` function allows us to join DataFrames together using columns as keys.

The same walkthrough can be found [here](http://chrisalbon.com/python/pandas_join_merge_dataframe.html).

Below we have two DataFrames with information on `subject_id`, `first_name`, and `last_name`. We also have a third DataFrame with information on `subject_id` and `test_id`.

In [23]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data)
df_a

Unnamed: 0,subject_id,first_name,last_name
0,1,Alex,Anderson
1,2,Amy,Ackerman
2,3,Allen,Ali
3,4,Alice,Aoni
4,5,Ayoung,Atiches


In [24]:
raw_data = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data)
df_b

Unnamed: 0,subject_id,first_name,last_name
0,4,Billy,Bonder
1,5,Brian,Black
2,6,Bran,Balwner
3,7,Bryce,Brice
4,8,Betty,Btisan


In [25]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data)
df_n

Unnamed: 0,subject_id,test_id
0,1,51
1,2,15
2,3,15
3,4,61
4,5,16
5,7,14
6,8,15
7,9,1
8,10,61
9,11,16


#### `pandas` `pd.merge()` for SQL-style JOINs

A LEFT JOIN produces a complete set of records from `df_a`, along with the matching records (where available) in `df_b`. If there is no match, the right side will contain null.

The `pandas` `pd.merge()` function has arguments for:
- A left-hand data set.
- A right-hand data set.
- `on=` : A keyword argument specifying the key column on which to join the DataFrames.
- `how=` : A keyword argument specifying the type of JOIN (LEFT, RIGHT, INNER, OUTER).

#### LEFT JOIN `df_b` onto `df_a` by `subject_id`.

In [26]:
# left join
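
A possible answer is sketched below, with `df_a` and `df_b` rebuilt so the snippet runs on its own (the name `left_joined` is illustrative). Since both DataFrames have `first_name` and `last_name` columns, `pandas` adds its default `_x` (left) and `_y` (right) suffixes.

```python
import pandas as pd

df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']})

# Keep every row of df_a; attach df_b values where subject_id matches
left_joined = pd.merge(df_a, df_b, on='subject_id', how='left')
print(left_joined)
```

Subjects 1, 2, and 3 have no match in `df_b`, so their `_y` columns are null.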

#### RIGHT JOIN `df_b` onto `df_a` by `subject_id`.

Merging with a RIGHT JOIN produces a complete set of records from `df_b`, along with the matching records (where available) in `df_a`. If there is no match, the left side will contain null.


In [27]:
# right join
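
A sketch of one possible answer, using the same data as above (the name `right_joined` is illustrative):

```python
import pandas as pd

df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']})

# Keep every row of df_b; attach df_a values where subject_id matches
right_joined = pd.merge(df_a, df_b, on='subject_id', how='right')
print(right_joined)
```

Here subjects 6, 7, and 8 have no match in `df_a`, so their `_x` columns are null.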

#### OUTER JOIN `df_b` onto `df_a` by `subject_id`.

An OUTER JOIN produces the set of all records in `df_a` and `df_b`, along with matching records from both sides (where available). If there is no match, the missing side will contain null.

In [28]:
# outer join
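
One possible answer, sketched with the same data (the name `outer_joined` is illustrative):

```python
import pandas as pd

df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']})

# Keep every subject_id that appears in either DataFrame
outer_joined = pd.merge(df_a, df_b, on='subject_id', how='outer')
print(outer_joined)
```

The result has one row per distinct `subject_id` across both inputs, with nulls on whichever side has no match.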

#### INNER JOIN `df_b` onto `df_a` by `subject_id`.

An INNER JOIN produces only the set of records that matches in both `df_a` and `df_b`.

In [29]:
# inner join
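
One possible answer, sketched with the same data (the name `inner_joined` is illustrative):

```python
import pandas as pd

df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']})

# Keep only the subject_id values present in both DataFrames
inner_joined = pd.merge(df_a, df_b, on='subject_id', how='inner')
print(inner_joined)
```

Only subjects 4 and 5 appear in both inputs, so the result has two rows and no nulls.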

#### Combine the information in `df_a`, `df_b`, and `df_n` using JOINs.

No information should be lost.

In [30]:
# A:
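
One way to approach this, as a sketch rather than the only answer, is to chain two OUTER JOINs so that every `subject_id` from any of the three DataFrames is kept. The `suffixes` values and the name `combined` are choices made for this example.

```python
import pandas as pd

df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']})
df_n = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]})

# First outer join: all subjects from df_a and df_b,
# with explicit suffixes to tell the name columns apart
names = pd.merge(df_a, df_b, on='subject_id', how='outer',
                 suffixes=('_a', '_b'))

# Second outer join: add test_id, keeping subjects found only in df_n
combined = pd.merge(names, df_n, on='subject_id', how='outer')
print(combined)
```

An alternative design would be to stack `df_a` and `df_b` vertically with `pd.concat()` first, since they hold the same kind of information, and then join the result to `df_n`.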

#### Combine the information in the three data sets, keeping only the rows that have information from all three.

In [31]:
# A:
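
A sketch of one possible answer: chaining two INNER JOINs keeps only the subjects present in all three DataFrames, so no cell in the output is null. The `suffixes` values and the name `complete` are choices made for this example.

```python
import pandas as pd

df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']})
df_n = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]})

# Inner join df_a and df_b: only subjects in both remain
names = pd.merge(df_a, df_b, on='subject_id', how='inner',
                 suffixes=('_a', '_b'))

# Inner join with df_n: only subjects that also have a test_id remain
complete = pd.merge(names, df_n, on='subject_id', how='inner')
print(complete)
```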

### Additional Resources
[Vector Shapes: (4, ) vs (4,1)](http://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r)