<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Merging and Concatenation with `numpy` and `pandas`

_Authors: Kiefer Katovich (SF)_

---


### Learning Objectives
- Understand the use cases of concatenation of vectors, matrices, and DataFrames.
- Practice concatenating vectors and matrices using `numpy`.
- Practice concatenating DataFrames using `pandas`.
- Join `pandas` DataFrames using SQL-style JOIN operations.

### Lesson Guide
- [Overview of Concatenation and Joining](#introduction)
- [Concatenation using `numpy`](#numpy_concatenation)
- [Concatenation using `pandas`](#pandas_concatenation)
- [SQL-style JOINs using `pandas`](#pandas_joins)

<a id='introduction'></a>

### Overview of Concatenation and Joining

---

**Concatenation** is the process of joining separate objects along a dimension to create a single new object. In programming, the most familiar case is string concatenation, where two or more character strings are joined end to end so that they can be addressed as a single item.

In `pandas`, we will be concatenating DataFrames together along rows or columns. Likewise, in `numpy`, you can concatenate vectors and matrices along an axis.

**JOINs** in `pandas` combine the columns of two DataFrames by matching rows on an index or key column. The concept is the same as a SQL JOIN. In `pandas`, JOINs are typically performed with the `pd.merge()` function or the equivalent `DataFrame.merge()` method.

Here is a representation of LEFT, RIGHT, INNER, and OUTER JOINs in Venn diagrams:

![](./assets/joins.png)

<a id='numpy_concatenation'></a>

### Concatenation Using `numpy`

---

Concatenating vectors and matrices in `numpy` is a common operation that's useful to know. Because `pandas` is built on `numpy`, concatenation in `pandas` follows the same axis conventions, with the added step that `pandas` aligns data on index and column labels.


In [None]:
import numpy as np
import pandas as pd

In [None]:
vector1 = np.array([1,2,3,4])
vector2 = np.array([5,6,7,8])

#### Concatenate `vector1` and `vector2` together with `np.concatenate()`.

**Note**: Unlike Python lists, you cannot simply add two `numpy` vectors together and expect them to concatenate. The addition operator has a different meaning in `numpy` (i.e., element-wise addition).
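As a minimal sketch of the difference described in the note, the block below compares `+` on lists, `+` on arrays, and `np.concatenate()`. The arrays `a` and `b` are hypothetical examples, not the lesson's `vector1` and `vector2`.

```python
import numpy as np

# Hypothetical example arrays (not the lesson's vector1/vector2)
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

# With Python lists, + concatenates
list_sum = [10, 20, 30] + [1, 2, 3]   # [10, 20, 30, 1, 2, 3]

# With numpy arrays, + adds element by element
elementwise = a + b                   # array([11, 22, 33])

# np.concatenate takes a sequence of arrays and joins them end to end
joined = np.concatenate([a, b])       # array([10, 20, 30,  1,  2,  3])

print(list_sum)
print(elementwise)
print(joined)
```

Note that `np.concatenate()` expects the arrays wrapped in a single list or tuple, not passed as separate positional arguments.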

In [None]:
# Concatenate the vectors

Our two arrays are one-dimensional. We can make them two-dimensional by adding an axis with `np.newaxis`, which is one of several ways to do this.

#### Add a new axis to `vector1` to make it two-dimensional. Print out the shape to verify.

In [None]:
# Add a dimension to the vector

There is a big difference between a vector of shape `(4,)` and one of shape `(4,1)`, especially in matrix multiplication and linear algebra. A `(4,)` array is one-dimensional, while a `(4,1)` array is a two-dimensional matrix with four rows and one column. When an operation needs an explicit column vector, use shape `(R,1)`, where `R` is the number of rows.
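To make the shape difference concrete, here is a small sketch showing how the same values behave differently in arithmetic depending on shape. The array `v` is a hypothetical example, and the operations are chosen only for illustration.

```python
import numpy as np

v = np.array([1, 2, 3, 4])      # hypothetical 1D array, shape (4,)
col = v[:, np.newaxis]          # same values as a column, shape (4, 1)

# Adding two (4,) arrays is element-wise and stays one-dimensional
same_shape_sum = v + v          # shape (4,)

# Adding a (4,) array to a (4, 1) array broadcasts to a full grid
broadcast_sum = v + col         # shape (4, 4)

# The matrix product of two 1D arrays is a scalar ...
dot_1d = v @ v                  # 30

# ... while the product of a (1, 4) and a (4, 1) matrix is a (1, 1) matrix
dot_2d = col.T @ col            # array([[30]])

print(same_shape_sum.shape, broadcast_sum.shape, dot_2d.shape)
```

An unexpected `(4, 4)` result from what was meant to be element-wise arithmetic is a common symptom of mixing the two shapes.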

Alternatively, put the new axis in the first position:

In [None]:
# Add a dimension in another axis to the vector

With arrays of two or more dimensions, concatenation is performed along an **axis**. With a two-dimensional matrix, you can think of concatenating along `axis 0` as stacking vertically (adding rows) and concatenating along `axis 1` as stacking horizontally (adding columns).
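The two directions can be sketched as follows. The arrays `top` and `bottom` are hypothetical 2D examples, not the lesson's vectors.

```python
import numpy as np

# Hypothetical 2D arrays, each with shape (2, 3)
top = np.array([[1, 2, 3],
                [4, 5, 6]])
bottom = np.array([[7, 8, 9],
                   [10, 11, 12]])

# axis=0 stacks vertically: the row count grows, the column count stays the same
stacked_rows = np.concatenate([top, bottom], axis=0)   # shape (4, 3)

# axis=1 stacks horizontally: the column count grows, the row count stays the same
stacked_cols = np.concatenate([top, bottom], axis=1)   # shape (2, 6)

print(stacked_rows.shape)
print(stacked_cols.shape)
```

The arrays must agree in every dimension except the one being concatenated along; otherwise `np.concatenate()` raises a `ValueError`.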

#### Make `vector1` and `vector2` 2D. Concatenate them vertically and horizontally by specifying the axis in the function.

In [None]:
# Make the vectors 2D and join them along both axes 

<a id='pandas_concatenation'></a>

### Concatenation using `pandas`

---

Oftentimes, you'll want to concatenate two DataFrames together. Perhaps your data is split up into two groups of subjects with the same variables/columns and you want to join them together (stacking vertically, i.e., adding rows). Or perhaps you want to add new variables for all of your existing subjects (stacking horizontally, i.e., adding columns).

Below we have two simple data sets we can use to practice `pandas` concatenation.

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])

In `pandas`, we can use the `pd.concat()` function to stack DataFrames vertically or horizontally. `pd.concat()` takes a list of `pandas` DataFrames as its first argument, then an `axis` keyword argument indicating how to concatenate the DataFrames.

The `axis` argument works the same as in `numpy`: `axis=0` stacks rows (vertically) and `axis=1` stacks columns (horizontally).
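A minimal sketch of both directions, using hypothetical frames `left` and `right` rather than the lesson's `df1` and `df2`:

```python
import pandas as pd

# Hypothetical frames sharing the same columns and the same index labels
left = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
right = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})

# axis=0 stacks rows; the original index labels are kept, so they repeat
rows = pd.concat([left, right], axis=0)   # shape (4, 2), index 0, 1, 0, 1

# axis=1 stacks columns; rows are matched up by index label
cols = pd.concat([left, right], axis=1)   # shape (2, 4), columns x, y, x, y

print(rows)
print(cols)
```

Because labels are preserved, the results can contain repeated index or column labels, which is worth checking before selecting by label.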

**Concatenate `df1` and `df2` by stacking them vertically.**

In [None]:
# Vertical concatenation

**Concatenate `df1` and `df2` by stacking them horizontally.**

In [None]:
# Horizontal concatenation

You can see that, because the row labels (indices) of the two DataFrames differ, `pandas` fills the unmatched cells with null values. Perhaps we don't care about the row labels during horizontal concatenation. If you reset the index of `df2` before concatenating, its rows line up with those of `df1` and no null values are introduced:
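The effect of label alignment can be sketched with two small hypothetical frames, `first` and `second`, whose row labels do not overlap. Using `reset_index(drop=True)` is one way to discard the old labels; it is shown here as an illustration, not as the only approach.

```python
import pandas as pd

# Hypothetical frames whose row labels do not overlap
first = pd.DataFrame({'p': ['p0', 'p1']}, index=[0, 1])
second = pd.DataFrame({'q': ['q0', 'q1']}, index=[2, 3])

# Rows are aligned by label, so unmatched labels are filled with NaN
misaligned = pd.concat([first, second], axis=1)   # shape (4, 2)

# Resetting the index relabels `second` as 0 and 1, which matches `first`
aligned = pd.concat([first, second.reset_index(drop=True)], axis=1)   # shape (2, 2)

print(misaligned)
print(aligned)
```

Without `drop=True`, `reset_index()` would keep the old labels as an extra column named `index`.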

In [None]:
# Horizontal concatenation ignoring row labels

<a id='pandas_joins'></a>

### LEFT, RIGHT, INNER, and OUTER JOINs in `pandas`

---

The `pandas` `pd.merge()` function allows us to join DataFrames together using columns as keys.

The same walkthrough can be found [here](http://chrisalbon.com/python/pandas_join_merge_dataframe.html).

Below we have two DataFrames with information on `subject_id`, `first_name`, and `last_name`. We also have a third DataFrame with information on `subject_id` and `test_id`.

In [None]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data)
df_a

In [None]:
raw_data = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data)
df_b

In [None]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data)
df_n

#### `pandas` `pd.merge()` for SQL-style JOINs

A LEFT JOIN produces a complete set of records from `df_a`, along with the matching records (where available) in `df_b`. If there is no match, the right side will contain null.

The `pandas` `pd.merge()` function has arguments for:
- A left-hand DataFrame.
- A right-hand DataFrame.
- `on=`: A keyword argument specifying the key column on which to join the DataFrames.
- `how=`: A keyword argument specifying the type of JOIN: `'left'`, `'right'`, `'inner'` (the default), or `'outer'`.
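The arguments listed above can be sketched on two small hypothetical tables, `people` and `scores`, which share a hypothetical key column named `key`. These are not the lesson's DataFrames.

```python
import pandas as pd

# Hypothetical tables sharing a 'key' column
people = pd.DataFrame({'key': ['a', 'b', 'c'], 'name': ['Ann', 'Bob', 'Cy']})
scores = pd.DataFrame({'key': ['b', 'c', 'd'], 'score': [90, 85, 70]})

# LEFT: every row of `people`; `score` is NaN where there is no match ('a')
left_join = pd.merge(people, scores, on='key', how='left')

# INNER: only keys present in both tables ('b' and 'c')
inner_join = pd.merge(people, scores, on='key', how='inner')

# OUTER: every key from either table ('a', 'b', 'c', 'd')
outer_join = pd.merge(people, scores, on='key', how='outer')

print(left_join)
print(inner_join)
print(outer_join)
```

When the two tables share non-key column names, `pd.merge()` distinguishes them in the result with the suffixes `_x` and `_y` by default.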

#### LEFT JOIN `df_b` onto `df_a` by `subject_id`.

In [None]:
# left join

#### RIGHT JOIN `df_b` onto `df_a` by `subject_id`.

Merging with a RIGHT JOIN produces a complete set of records from `df_b`, along with the matching records (where available) in `df_a`. If there is no match, the left side will contain null.


In [None]:
# right join

#### OUTER JOIN `df_b` onto `df_a` by `subject_id`.

An OUTER JOIN produces the set of all records in `df_a` and `df_b`, along with matching records from both sides (where available). If there is no match, the missing side will contain null.

In [None]:
# outer join

#### INNER JOIN `df_b` onto `df_a` by `subject_id`.

An INNER JOIN produces only the set of records that match in both `df_a` and `df_b`.

In [None]:
# inner join

#### Combine the information in `df_a`, `df_b`, and `df_n` using JOINs.

No information should be lost.

In [None]:
# A:

#### Combine the information in the three data sets, keeping only the rows that have complete information in every column of the output.

In [None]:
# A:

### Additional Resources
[Vector Shapes: (4, ) vs (4,1)](http://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r)