<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Merging and Concatenation with numpy and pandas

_Authors: Kiefer Katovich (SF)_

---


### Learning Objectives
- Understand use-cases of concatenation of vectors, matrices, and dataframes
- Practice concatenating vectors and matrices using numpy
- Practice concatenating dataframes using pandas
- Join pandas dataframes using SQL-style join operations

### Lesson Guide
- [Overview of concatenation and joining](#introduction)
- [Concatenation using numpy](#numpy_concatenation)
- [Concatenation using pandas](#pandas_concatenation)
- [SQL-style joins using pandas](#pandas_joins)

<a id='introduction'></a>

### Overview of concatenation and joining

---

**Concatenation** is the process of joining separate objects along a dimension to create a new single object. In computer programming and data processing, for example, two or more character strings are often concatenated so that they can be addressed as a single item.

In pandas, we will be concatenating dataframes together along rows or columns. Likewise, in numpy you can concatenate vectors and matrices along an axis.

**Joins** in pandas combine the columns of two DataFrames by matching rows on the index or on a key column. The concept is the same as SQL joins. In pandas, joins are typically done with the `pd.merge()` function (or the equivalent `.merge()` method on a DataFrame).

Here is a representation of left, right, inner, and outer joins with Venn diagrams:

![](../assets/joins.png)
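
The four join types can be sketched with plain Python sets of key values, before any pandas is involved. This is only an illustration: the key values and the `join_keys` name are made up, and a real join also carries the other columns along with each key.

```python
# Hypothetical key values from a "left" and a "right" table.
left_keys = {1, 2, 3, 4, 5}
right_keys = {4, 5, 6, 7, 8}

# Which keys survive each type of join.
join_keys = {
    "left": left_keys,                 # every key from the left table
    "right": right_keys,               # every key from the right table
    "inner": left_keys & right_keys,   # keys present in both tables
    "outer": left_keys | right_keys,   # keys present in either table
}

for how, keys in join_keys.items():
    print(how, sorted(keys))
```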

<a id='numpy_concatenation'></a>

### Concatenation using numpy

---

Concatenating vectors and matrices in numpy is a common operation that is wise to practice. Because pandas uses numpy under the hood, concatenation in pandas follows the same axis logic as concatenation in numpy, with one addition: pandas also aligns the data on row and column labels.


In [1]:
import numpy as np
import pandas as pd

In [2]:
vector1 = np.array([1,2,3,4])
vector2 = np.array([5,6,7,8])

#### Concatenate `vector1` and `vector2` together with `np.concatenate()`.

Note: unlike python lists, you cannot simply add two numpy vectors together and expect them to concatenate. The addition operator has a different meaning in numpy (element-wise addition).

In [3]:
vector1+vector2

array([ 6,  8, 10, 12])

In [4]:
np.concatenate([vector1, vector2])

array([1, 2, 3, 4, 5, 6, 7, 8])
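
As a side-by-side comparison, here is a small sketch of how `+` behaves on Python lists versus numpy arrays. The variable names are chosen for this example only.

```python
import numpy as np

list1, list2 = [1, 2, 3, 4], [5, 6, 7, 8]
arr1, arr2 = np.array(list1), np.array(list2)

list_sum = list1 + list2                  # lists: + concatenates
arr_sum = arr1 + arr2                     # arrays: + adds element by element
arr_cat = np.concatenate([arr1, arr2])    # arrays: explicit concatenation

print(list_sum)
print(arr_sum)
print(arr_cat)
```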

Our two arrays are 1-dimensional. We can make them 2-dimensional by adding an axis with `np.newaxis`. (One of the many ways to do this.)

#### Add a new axis to `vector1` to make it 2-dimensional. Print out the shape to verify.

In [5]:
vec1_2d = vector1[:, np.newaxis]
print(vector1, vector1.shape)
print(vec1_2d, vec1_2d.shape)

[1 2 3 4] (4,)
[[1]
 [2]
 [3]
 [4]] (4, 1)


In [24]:
vec1_2d = vector1[np.newaxis, np.newaxis]


In [25]:
print(vec1_2d, vec1_2d.shape)

[[[1 2 3 4]]] (1, 1, 4)


There is a big difference between a vector of shape `(4,)` and one of shape `(4, 1)`, especially when it comes to matrix multiplication and linear algebra. A shape of `(R, 1)`, where `R` is the number of rows, is an explicit column vector, whereas `(R,)` is a flat 1-D array with no row or column orientation.

[Here is a great answer to a question on StackOverflow explaining more on the topic.](http://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r)
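
A minimal sketch of why the shape matters, using transposes and matrix products. The names `flat`, `column`, and `row` are made up for this example.

```python
import numpy as np

flat = np.array([1, 2, 3, 4])      # shape (4,)
column = flat[:, np.newaxis]       # shape (4, 1)
row = flat[np.newaxis, :]          # shape (1, 4)

# Transposing a 1-D array does nothing; transposing a 2-D one swaps the axes.
print(flat.T.shape)                # (4,)
print(column.T.shape)              # (1, 4)

# Matrix products depend on the shapes involved.
inner = row @ column               # (1, 4) @ (4, 1) gives shape (1, 1)
outer = column @ row               # (4, 1) @ (1, 4) gives shape (4, 4)
dot = flat @ flat                  # (4,) @ (4,) gives a scalar

print(inner.shape, outer.shape, dot)
```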

Alternatively, put the new axis in the first position:

In [6]:
vec1_2d = vector1[np.newaxis, :]
print(vector1)
print(vec1_2d)

[1 2 3 4]
[[1 2 3 4]]


With 2D (and higher-dimensional) arrays, concatenation must be done along an **axis**. With a two-dimensional matrix, you can think of concatenating along axis 0 as stacking vertically and concatenating along axis 1 as stacking horizontally.

#### Make `vector1` and `vector2` 2D. Concatenate them vertically and horizontally by specifying the axis in the function.

In [7]:
vec1_2d = vector1[np.newaxis, :]
vec2_2d = vector2[np.newaxis, :]

print("Vertical Stack")
print(np.concatenate([vec1_2d, vec2_2d], axis=0))
print("Horizontal Stack")
print(np.concatenate([vec1_2d, vec2_2d], axis=1))

Vertical Stack
[[1 2 3 4]
 [5 6 7 8]]
Horizontal Stack
[[1 2 3 4 5 6 7 8]]
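
For 2D arrays, numpy also offers `np.vstack()` and `np.hstack()` as shorthand for the two cases above. A short sketch, with the arrays `a` and `b` invented for this example:

```python
import numpy as np

a = np.array([[1, 2, 3, 4]])   # shape (1, 4)
b = np.array([[5, 6, 7, 8]])   # shape (1, 4)

# For 2-D inputs these match np.concatenate with axis=0 and axis=1.
vertical = np.vstack([a, b])
horizontal = np.hstack([a, b])

print(vertical.shape)     # (2, 4)
print(horizontal.shape)   # (1, 8)
```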


<a id='pandas_concatenation'></a>

### Concatenation using pandas

---

It is often the case that you would like to concatenate two dataframes together. Perhaps your data is split up into two groups of subjects with the same variables/columns and you want to join them together (stacking vertically, adding rows). Or perhaps you have new variables for all of your existing subjects (stacking horizontally, adding columns).

Below we have two simple datasets we can use to practice pandas concatenation.

In [8]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])

In pandas we can use the `pd.concat` function to stack DataFrames vertically or horizontally. `pd.concat()` takes a list of pandas dataframes as its first argument, and then an axis keyword argument indicating how to concatenate the dataframes. 

The axis argument works the same as in numpy.

**Concatenate `df1` and `df2` by stacking them vertically.**

In [9]:
pd.concat([df1, df2], axis=0)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
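
Vertical concatenation keeps each dataframe's original row labels. In the example above the labels happen not to clash (0 to 3 and 4 to 7), but when both frames use the default index the labels repeat. A small sketch, assuming two made-up frames `top` and `bottom`, of how `ignore_index=True` renumbers the rows:

```python
import pandas as pd

# Two small made-up frames that both use the default index 0, 1.
top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Default: the original row labels are kept, so 0 and 1 appear twice.
kept = pd.concat([top, bottom], axis=0)

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```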


**Concatenate `df1` and `df2` by stacking them horizontally.**

In [10]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1
0,A0,B0,C0,D0,,,,
1,A1,B1,C1,D1,,,,
2,A2,B2,C2,D2,,,,
3,A3,B3,C3,D3,,,,
4,,,,,A4,B4,C4,D4
5,,,,,A5,B5,C5,D5
6,,,,,A6,B6,C6,D6
7,,,,,A7,B7,C7,D7


You can see that because the row indices are different for the two dataframes, pandas fills in null values. Perhaps we don't care about the row labels during the horizontal concatenation. If you reset the index for `df2` prior to the concatenation, it will not fill in null values:

In [11]:
pd.concat([df1, df2.reset_index(drop=True)], axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1
0,A0,B0,C0,D0,A4,B4,C4,D4
1,A1,B1,C1,D1,A5,B5,C5,D5
2,A2,B2,C2,D2,A6,B6,C6,D6
3,A3,B3,C3,D3,A7,B7,C7,D7
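
When the row labels do matter, `pd.concat()` also accepts a `join` keyword that controls what happens to labels that appear in only one dataframe. A minimal sketch with two made-up frames whose indices only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
right = pd.DataFrame({'B': ['B1', 'B2', 'B3']}, index=[1, 2, 3])

# Default (join='outer'): keep every row label, fill the gaps with NaN.
outer = pd.concat([left, right], axis=1)

# join='inner': keep only the row labels present in both frames.
inner = pd.concat([left, right], axis=1, join='inner')

print(outer)
print(inner)
```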


<a id='pandas_joins'></a>

### Left, right, inner, and outer joins in pandas

---

The pandas `merge` function allows us to join together DataFrames using columns as keys.

[(The same walkthrough can be found here.)](http://chrisalbon.com/python/pandas_join_merge_dataframe.html)

Below we have two dataframes with information on `subject_id`, `first_name`, and `last_name`. We also have a third dataframe with information on `subject_id` and `test_id`.

In [12]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data)
df_a

Unnamed: 0,first_name,last_name,subject_id
0,Alex,Anderson,1
1,Amy,Ackerman,2
2,Allen,Ali,3
3,Alice,Aoni,4
4,Ayoung,Atiches,5


In [13]:
raw_data = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data)
df_b

Unnamed: 0,first_name,last_name,subject_id
0,Billy,Bonder,4
1,Brian,Black,5
2,Bran,Balwner,6
3,Bryce,Brice,7
4,Betty,Btisan,8


In [14]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data)
df_n

Unnamed: 0,subject_id,test_id
0,1,51
1,2,15
2,3,15
3,4,61
4,5,16
5,7,14
6,8,15
7,9,1
8,10,61
9,11,16


**Pandas `pd.merge()` for SQL-style joins**

A left join produces a complete set of records from `df_a`, with the matching records (where available) in `df_b`. If there is no match, the right side will contain null.

The pandas `pd.merge()` command has arguments:
- left-hand dataset
- right-hand dataset
- `on=` : keyword argument specifying the key column to join the dataframes on.
- `how=` : keyword argument specifying the type of join (left, right, inner, outer).
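
Two further keyword arguments are often handy alongside `on=` and `how=`: `suffixes=` renames overlapping non-key columns, and `indicator=True` adds a `_merge` column recording where each row came from. The sketch below uses made-up frames named `left` and `right`, not the lesson's data.

```python
import pandas as pd

# Made-up frames that share a key column and a non-key column name.
left = pd.DataFrame({'subject_id': ['1', '2', '3'],
                     'first_name': ['Alex', 'Amy', 'Allen']})
right = pd.DataFrame({'subject_id': ['2', '3', '4'],
                      'first_name': ['Billy', 'Brian', 'Bran']})

merged = pd.merge(left, right, on='subject_id', how='outer',
                  suffixes=('_a', '_b'),   # replaces the default _x / _y
                  indicator=True)          # adds a _merge column

print(merged)
```

The `_merge` column takes the values `left_only`, `right_only`, or `both`, which makes it easy to see which rows found a match.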

#### Left join `df_b` onto `df_a` by `subject_id`.

In [15]:
pd.merge(df_a, df_b, on='subject_id', how='left')

Unnamed: 0,first_name_x,last_name_x,subject_id,first_name_y,last_name_y
0,Alex,Anderson,1,,
1,Amy,Ackerman,2,,
2,Allen,Ali,3,,
3,Alice,Aoni,4,Billy,Bonder
4,Ayoung,Atiches,5,Brian,Black


#### Right join `df_b` onto `df_a` by `subject_id`

Merge with a right join produces a complete set of records from `df_b`, with the matching records (where available) in `df_a`. If there is no match, the left side will contain null.


In [16]:
pd.merge(df_a, df_b, on='subject_id', how='right')

Unnamed: 0,first_name_x,last_name_x,subject_id,first_name_y,last_name_y
0,Alice,Aoni,4,Billy,Bonder
1,Ayoung,Atiches,5,Brian,Black
2,,,6,Bran,Balwner
3,,,7,Bryce,Brice
4,,,8,Betty,Btisan


#### Outer join `df_b` onto `df_a` by `subject_id`

An outer join produces the set of all records in `df_a` and `df_b`, with matching records from both sides where available. If there is no match, the missing side will contain null.

In [17]:
pd.merge(df_a, df_b, on='subject_id', how='outer')

Unnamed: 0,first_name_x,last_name_x,subject_id,first_name_y,last_name_y
0,Alex,Anderson,1,,
1,Amy,Ackerman,2,,
2,Allen,Ali,3,,
3,Alice,Aoni,4,Billy,Bonder
4,Ayoung,Atiches,5,Brian,Black
5,,,6,Bran,Balwner
6,,,7,Bryce,Brice
7,,,8,Betty,Btisan


#### Inner join `df_b` onto `df_a` by `subject_id`

An inner join produces only the set of records that match in both `df_a` and `df_b`.

In [18]:
pd.merge(df_a, df_b, on='subject_id', how='inner')

Unnamed: 0,first_name_x,last_name_x,subject_id,first_name_y,last_name_y
0,Alice,Aoni,4,Billy,Bonder
1,Ayoung,Atiches,5,Brian,Black
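
As a quick check on the four join types, the sketch below counts the rows each one returns. It uses cut-down copies of the lesson's frames (same keys, one name column each); the `row_counts` name is made up for this example.

```python
import pandas as pd

# Cut-down versions of the lesson's frames: same keys, one name column each.
df_a = pd.DataFrame({'subject_id': ['1', '2', '3', '4', '5'],
                     'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung']})
df_b = pd.DataFrame({'subject_id': ['4', '5', '6', '7', '8'],
                     'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty']})

# Number of rows returned by each join type.
row_counts = {how: len(pd.merge(df_a, df_b, on='subject_id', how=how))
              for how in ['left', 'right', 'inner', 'outer']}
print(row_counts)

# Leaving out how= gives an inner join.
default = pd.merge(df_a, df_b, on='subject_id')
print(len(default))
```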


#### Combine the information in `df_a`, `df_b` and `df_n` using joins

No information should be lost.

In [19]:
df_1 = pd.merge(df_a, df_b, on='subject_id', how='outer')

df_2 = pd.merge(df_1, df_n, on='subject_id', how='outer')
df_2

Unnamed: 0,first_name_x,last_name_x,subject_id,first_name_y,last_name_y,test_id
0,Alex,Anderson,1,,,51.0
1,Amy,Ackerman,2,,,15.0
2,Allen,Ali,3,,,15.0
3,Alice,Aoni,4,Billy,Bonder,61.0
4,Ayoung,Atiches,5,Brian,Black,16.0
5,,,6,Bran,Balwner,
6,,,7,Bryce,Brice,14.0
7,,,8,Betty,Btisan,15.0
8,,,9,,,1.0
9,,,10,,,61.0
10,,,11,,,16.0
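
When there are more than two dataframes, chaining merges by hand gets repetitive. One possible approach, sketched here with made-up frames and a hypothetical helper `outer_merge`, is to fold the list with `functools.reduce`:

```python
from functools import reduce

import pandas as pd

# Made-up frames that all share a subject_id key column.
names = pd.DataFrame({'subject_id': ['1', '2'], 'name': ['Alex', 'Amy']})
ages = pd.DataFrame({'subject_id': ['2', '3'], 'age': [31, 45]})
tests = pd.DataFrame({'subject_id': ['3', '4'], 'test_id': [15, 61]})

def outer_merge(left, right):
    """Outer join two frames on the shared subject_id column."""
    return pd.merge(left, right, on='subject_id', how='outer')

# Equivalent to outer_merge(outer_merge(names, ages), tests).
combined = reduce(outer_merge, [names, ages, tests])

print(combined)
```

The frames here have distinct non-key column names. If several frames shared a column name, as `df_a` and `df_b` do, the `_x` / `_y` suffixes would need some care.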


#### Combine the information in the three datasets, keeping only subjects that appear in all three.

In [20]:
df_1 = pd.merge(df_a, df_b, on='subject_id', how='inner')

df_2 = pd.merge(df_1, df_n, on='subject_id', how='inner')
df_2

Unnamed: 0,first_name_x,last_name_x,subject_id,first_name_y,last_name_y,test_id
0,Alice,Aoni,4,Billy,Bonder,61
1,Ayoung,Atiches,5,Brian,Black,16
