In many situations, data comes from several sources, and we need to combine this data to perform analyses. 

pandas provides several ways to combine dataframes. The most important are:

- [`merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
- [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)

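In this notebook, we will see how to use the pandas `concat` function to join datasets, and we will compare it with related methods: `append`, `join` and `merge`. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, where `concat` replaces it.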

## pandas.concat

`pandas.concat` concatenates pandas objects along a particular axis, with optional set logic (union or intersection) along the other axes.

It can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

In [1]:
import pandas as pd

### Default Parameters

To show how pandas.concat works, we will create two sample datasets:

* employees1, with 2 columns: id and name  
* employees2, with 2 columns: id and name  

In [2]:
employees1 = pd.DataFrame({'id':[1,2,3,4],'name':['Johana','Mike','Patricia','James']})
employees1

Unnamed: 0,id,name
0,1,Johana
1,2,Mike
2,3,Patricia
3,4,James


In [3]:
employees2 = pd.DataFrame({'id':[5,6,7,8],'name':['John','Jennifer','Michael','Mary']})
employees2

Unnamed: 0,id,name
0,5,John
1,6,Jennifer
2,7,Michael
3,8,Mary


First, we will use concat with the default parameters:

In [4]:
pd.concat([employees1, employees2])

Unnamed: 0,id,name
0,1,Johana
1,2,Mike
2,3,Patricia
3,4,James
0,5,John
1,6,Jennifer
2,7,Michael
3,8,Mary


As we can see, using the default parameters, pd.concat just concatenated the second dataframe to the first one along the vertical axis (index).

![pandas concat](./concat-default.png "Pandas Concat")

The parameters of `pd.concat` are:
    
* **objs**: a sequence or mapping of Series or DataFrame objects  
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected. Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.  


* **axis**: The axis to concatenate along.  
{0/’index’, 1/’columns’}, default 0

* **join**  

* **ignore_index**  

* **keys**  

* **levels**  

* **names**  

* **verify_integrity**  

* **sort**  

* **copy**  
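
Two of the parameters listed above, `join` and `verify_integrity`, do not get their own section in this notebook. The following is a minimal sketch of how they behave, using `employees1` and the small `salary` dataframe from the Axis section. The variable names `outer`, `inner` and `duplicates_detected` are only illustrative.

```python
import pandas as pd

employees1 = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Johana', 'Mike', 'Patricia', 'James']})
salary = pd.DataFrame({'id': [1, 2, 3, 5],
                       'salary': [120000, 100000, 130000, 90000]})

# join='outer' (the default) keeps the union of the columns,
# filling the missing cells with NaN.
outer = pd.concat([employees1, salary], join='outer')
print(list(outer.columns))

# join='inner' keeps only the columns present in every input.
inner = pd.concat([employees1, salary], join='inner')
print(list(inner.columns))

# verify_integrity=True raises a ValueError when the result would
# contain duplicate index labels (both inputs are labeled 0..3 here).
try:
    pd.concat([employees1, salary], verify_integrity=True)
    duplicates_detected = False
except ValueError:
    duplicates_detected = True
print(duplicates_detected)
```

With these inputs, the inner join should keep only the shared `id` column, and the integrity check should report the duplicated labels.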

### Objs

Objs is typically a sequence of dataframes, as in our previous example, but sometimes series are also useful.

In [5]:
serie1 = pd.Series(['Johana','Mike','Patricia','James'])
serie2 = pd.Series(['John','Jennifer','Michael','Mary'])

In [6]:
pd.concat([serie1, serie2])

0      Johana
1        Mike
2    Patricia
3       James
0        John
1    Jennifer
2     Michael
3        Mary
dtype: object

As we can see, the resulting object is a pandas Series, stacked in the same way as our previously concatenated dataframe (note the repeated index labels).

In [7]:
concat_series = pd.concat([serie1, serie2])
type(concat_series)

pandas.core.series.Series
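
Series can also be placed side by side. As a small sketch, assuming we want one column per Series, passing `axis=1` (described in the Axis section) returns a DataFrame, and `keys` supplies the column names. The labels `team_a` and `team_b` are made up for this illustration.

```python
import pandas as pd

serie1 = pd.Series(['Johana', 'Mike', 'Patricia', 'James'])
serie2 = pd.Series(['John', 'Jennifer', 'Michael', 'Mary'])

# With axis=1 each Series becomes a column; keys name the columns.
side_by_side = pd.concat([serie1, serie2], axis=1, keys=['team_a', 'team_b'])

print(type(side_by_side).__name__)
print(list(side_by_side.columns))
```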

If we pass a dict, the dict keys are used as the outermost level of a MultiIndex on the result:

In [8]:
pd.concat({'a':employees1, 'b':employees2})

Unnamed: 0,Unnamed: 1,id,name
a,0,1,Johana
a,1,2,Mike
a,2,3,Patricia
a,3,4,James
b,0,5,John
b,1,6,Jennifer
b,2,7,Michael
b,3,8,Mary
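
As noted in the description of `objs`, when a dict is passed together with `keys`, the keys select which values are concatenated and in which order. A minimal sketch of that behaviour (the names `frames`, `reordered` and `only_b` are illustrative):

```python
import pandas as pd

employees1 = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Johana', 'Mike', 'Patricia', 'James']})
employees2 = pd.DataFrame({'id': [5, 6, 7, 8],
                           'name': ['John', 'Jennifer', 'Michael', 'Mary']})

frames = {'a': employees1, 'b': employees2}

# keys picks entries from the dict, in the order given.
reordered = pd.concat(frames, keys=['b', 'a'])
only_b = pd.concat(frames, keys=['b'])

print(reordered.index.get_level_values(0).unique().tolist())
print(only_b['id'].tolist())
```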


### Axis

The axis parameter is used to specify the axis to concatenate along.

There are two options: 
    
* 0, along the index. This is the default behaviour.  
* 1, along the columns.  

Let's see how `axis=1` works. We will create a dataframe with salaries for some of the employees in our previously created dataframes.

In [9]:
salary = pd.DataFrame({'id':[1,2,3,5],'salary':[120000,100000,130000,90000]})
salary

Unnamed: 0,id,salary
0,1,120000
1,2,100000
2,3,130000
3,5,90000


Note that there is an employee with id 5, who is not in `employees1` but is in `employees2`.

In [10]:
pd.concat([employees1, salary], axis=1)

Unnamed: 0,id,name,id.1,salary
0,1,Johana,1,120000
1,2,Mike,2,100000
2,3,Patricia,3,130000
3,4,James,5,90000


![pandas concat axis 1](./concat-axis-1.png "Pandas Concat")

As you can see, `pd.concat([employees1, salary], axis=1)` was not very useful here. It aligns rows by index label, not by `id`, so the 4th row pairs the employee with id 4 with the salary of the employee with id 5.

In this case, a [merge](http://datacomy.com/courses/data_analysis/pandas/merge/) is preferred, optionally with `how='outer'` to keep the employees that appear in only one of the dataframes.

In [11]:
pd.merge(employees1, salary, how='outer')

Unnamed: 0,id,name,salary
0,1,Johana,120000.0
1,2,Mike,100000.0
2,3,Patricia,130000.0
3,4,James,
4,5,,90000.0


But if we have a meaningful index, say the index of each dataframe holds the employee id, `pd.concat` with `axis=1` aligns rows on that index and correctly keeps the information for employees 4 and 5 in separate rows:

In [12]:
employees1_indexed = pd.DataFrame({'name':['Johana','Mike','Patricia','James']})
employees1_indexed.index = [1, 2, 3, 4]
employees1_indexed

Unnamed: 0,name
1,Johana
2,Mike
3,Patricia
4,James


In [13]:
salary_indexed = pd.DataFrame({'salary':[120000,100000,130000,90000]})
salary_indexed.index = [1, 2, 3, 5]  # note the 5
salary_indexed

Unnamed: 0,salary
1,120000
2,100000
3,130000
5,90000


In [14]:
pd.concat([employees1_indexed, salary_indexed], axis=1)

Unnamed: 0,name,salary
1,Johana,120000.0
2,Mike,100000.0
3,Patricia,130000.0
4,James,
5,,90000.0


![pandas concat axis 1](./concat-axis-1_indexed.png "pandas concat axis 1")
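
Instead of building the indexed dataframes by hand, one possible shortcut is to move the existing `id` column into the index with `set_index` before concatenating. This is only a sketch of that idea; the name `aligned` is illustrative.

```python
import pandas as pd

employees1 = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Johana', 'Mike', 'Patricia', 'James']})
salary = pd.DataFrame({'id': [1, 2, 3, 5],
                       'salary': [120000, 100000, 130000, 90000]})

# Use the employee id as the index so that axis=1 aligns rows on it.
aligned = pd.concat(
    [employees1.set_index('id'), salary.set_index('id')],
    axis=1,
)
print(aligned)
```

Employees 4 and 5 each keep their own row, with a missing value where the other dataframe has no entry.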

### ignore_index

> If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
>
> Default value is False.

Recall our first example when we concatenated two dataframes with employees' names data. In that case, the index information was kept, resulting in an index with repeated keys.

In [15]:
pd.concat([employees1, employees2])

Unnamed: 0,id,name
0,1,Johana
1,2,Mike
2,3,Patricia
3,4,James
0,5,John
1,6,Jennifer
2,7,Michael
3,8,Mary


If we pass the parameter ignore_index = True, the index will be reset:

In [16]:
pd.concat([employees1, employees2], axis=0, ignore_index=True)

Unnamed: 0,id,name
0,1,Johana
1,2,Mike
2,3,Patricia
3,4,James
4,5,John
5,6,Jennifer
6,7,Michael
7,8,Mary
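
To see why repeated index labels can be a problem, here is a small sketch comparing label lookups with and without `ignore_index=True`. The names `stacked` and `renumbered` are illustrative.

```python
import pandas as pd

employees1 = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Johana', 'Mike', 'Patricia', 'James']})
employees2 = pd.DataFrame({'id': [5, 6, 7, 8],
                           'name': ['John', 'Jennifer', 'Michael', 'Mary']})

# Default: the original labels are kept, so each label appears twice.
stacked = pd.concat([employees1, employees2])
print(stacked.index.is_unique)
print(stacked.loc[0])        # two rows share the label 0

# ignore_index=True relabels the rows 0..n-1, so labels are unique.
renumbered = pd.concat([employees1, employees2], ignore_index=True)
print(renumbered.index.is_unique)
print(renumbered.loc[0])     # a single row
```

Calling `stacked.reset_index(drop=True)` afterwards should give the same result as passing `ignore_index=True` up front.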


### keys

> If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.

Using the `keys` parameter, we can label each input object. The labels become the outermost level of the resulting MultiIndex.

For example, if our `employees1` dataframe contained information about people in the Production section, and the `employees2` dataframe contained information about people in the Marketing section, we could add that information when performing the concatenation:

In [17]:
pd.concat([employees1, employees2], keys=['production', 'marketing'])

Unnamed: 0,Unnamed: 1,id,name
production,0,1,Johana
production,1,2,Mike
production,2,3,Patricia
production,3,4,James
marketing,0,5,John
marketing,1,6,Jennifer
marketing,2,7,Michael
marketing,3,8,Mary


We can achieve the same result by passing a dict instead of a list:

In [18]:
pd.concat({'production':employees1, 'marketing':employees2})

Unnamed: 0,Unnamed: 1,id,name
production,0,1,Johana
production,1,2,Mike
production,2,3,Patricia
production,3,4,James
marketing,0,5,John
marketing,1,6,Jennifer
marketing,2,7,Michael
marketing,3,8,Mary
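
One practical benefit of the extra index level is that each original dataframe can be recovered by its key. A minimal sketch using `.loc` (the name `by_section` is illustrative):

```python
import pandas as pd

employees1 = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Johana', 'Mike', 'Patricia', 'James']})
employees2 = pd.DataFrame({'id': [5, 6, 7, 8],
                           'name': ['John', 'Jennifer', 'Michael', 'Mary']})

by_section = pd.concat([employees1, employees2],
                       keys=['production', 'marketing'])

# Select every row that came from the 'marketing' dataframe.
marketing = by_section.loc['marketing']
print(marketing)

# Select a single cell with a (key, original label) tuple.
print(by_section.loc[('production', 2), 'name'])
```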


When keys is used alongside `axis=1`, it adds a level in the `columns` attribute.

In [19]:
pd.concat([employees1, employees2], axis=1, keys=['employee','supervisor'])

Unnamed: 0_level_0,employee,employee,supervisor,supervisor
Unnamed: 0_level_1,id,name,id,name
0,1,Johana,5,John
1,2,Mike,6,Jennifer
2,3,Patricia,7,Michael
3,4,James,8,Mary
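
The extra column level can be used for selection in the same way. The sketch below assumes the same `employee` and `supervisor` labels as the cell above; the name `paired` is illustrative.

```python
import pandas as pd

employees1 = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Johana', 'Mike', 'Patricia', 'James']})
employees2 = pd.DataFrame({'id': [5, 6, 7, 8],
                           'name': ['John', 'Jennifer', 'Michael', 'Mary']})

paired = pd.concat([employees1, employees2], axis=1,
                   keys=['employee', 'supervisor'])

# The outer key returns the block of columns from one input.
print(paired['supervisor'])

# A (key, column) tuple returns a single column as a Series.
print(paired[('employee', 'name')])
```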


### levels

> Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.


`levels` is used in conjunction with the `keys` parameter. It is rarely needed and rather esoteric, so we will leave it aside for now.

### names

> Names for the levels in the resulting hierarchical index.

`names` is rarely used. It allows us to specify the names of the levels in the resulting MultiIndex.

In the example `concat_dfs = pd.concat([employees1, employees2], keys=['production', 'marketing'])`, the levels of the resulting MultiIndex are unnamed: `[None, None]`.

We can name the levels Sector and Employee ID using `names=['Sector', 'Employee ID']`. The resulting MultiIndex levels are then named `['Sector', 'Employee ID']`.

In [24]:
concat_dfs = pd.concat([employees1, employees2], keys=['production', 'marketing'])
concat_dfs

Unnamed: 0,Unnamed: 1,id,name
production,0,1,Johana
production,1,2,Mike
production,2,3,Patricia
production,3,4,James
marketing,0,5,John
marketing,1,6,Jennifer
marketing,2,7,Michael
marketing,3,8,Mary


In [25]:
concat_dfs.index

MultiIndex([('production', 0),
            ('production', 1),
            ('production', 2),
            ('production', 3),
            ( 'marketing', 0),
            ( 'marketing', 1),
            ( 'marketing', 2),
            ( 'marketing', 3)],
           )

In [26]:
concat_dfs.index.names

FrozenList([None, None])

In [27]:
concat_dfs_with_names = pd.concat([employees1, employees2], keys=['production', 'marketing'], names=['Sector', 'Employee ID'])
concat_dfs_with_names

Sector,Employee ID,id,name
production,0,1,Johana
production,1,2,Mike
production,2,3,Patricia
production,3,4,James
marketing,0,5,John
marketing,1,6,Jennifer
marketing,2,7,Michael
marketing,3,8,Mary


In [28]:
concat_dfs_with_names.index.names

FrozenList(['Sector', 'Employee ID'])
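
Named levels are mostly a convenience for later steps. As a sketch of two possible uses, the level name can be passed to `xs` for selection, and `reset_index` turns the named levels into regular columns. The names `named`, `marketing` and `flat` are illustrative.

```python
import pandas as pd

employees1 = pd.DataFrame({'id': [1, 2, 3, 4],
                           'name': ['Johana', 'Mike', 'Patricia', 'James']})
employees2 = pd.DataFrame({'id': [5, 6, 7, 8],
                           'name': ['John', 'Jennifer', 'Michael', 'Mary']})

named = pd.concat([employees1, employees2],
                  keys=['production', 'marketing'],
                  names=['Sector', 'Employee ID'])

# Select a cross-section by referring to the level by name.
marketing = named.xs('marketing', level='Sector')
print(marketing)

# Named levels become named columns after reset_index.
flat = named.reset_index()
print(list(flat.columns))
```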

### sort

> Sort non-concatenation axis if it is not already aligned when join is ‘outer’. This has no effect when join='inner', which already preserves the order of the non-concatenation axis.

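Default is False.

In the example below, `employees1` and `salary` have different columns, so the non-concatenation axis (the columns) is not aligned. With `sort=True`, the columns of the result are sorted alphabetically.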

In [33]:
employees1

Unnamed: 0,id,name
0,1,Johana
1,2,Mike
2,3,Patricia
3,4,James


In [31]:
pd.concat([employees1, salary], sort=True)

Unnamed: 0,id,name,salary
0,1,Johana,
1,2,Mike,
2,3,Patricia,
3,4,James,
0,1,,120000.0
1,2,,100000.0
2,3,,130000.0
3,5,,90000.0
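
In the output above the columns already happen to be in alphabetical order, so the effect of `sort` is hard to see. The sketch below uses two small made-up dataframes, `left` and `right` (not part of the original examples), whose columns are deliberately out of order.

```python
import pandas as pd

# Hypothetical inputs with columns that are not in alphabetical order.
left = pd.DataFrame({'b': [1, 2], 'a': [3, 4]})
right = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

# sort=False keeps the columns in order of first appearance.
unsorted_cols = pd.concat([left, right], sort=False)
print(list(unsorted_cols.columns))

# sort=True sorts the non-concatenation axis (here, the columns).
sorted_cols = pd.concat([left, right], sort=True)
print(list(sorted_cols.columns))
```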
