## Combining Datasets (Concat, Append, Merge & Join)

These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets. Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward. Here we’ll take a look at simple concatenation of Series and DataFrames with the pd.concat function; later we’ll dive into more sophisticated in-memory merges and joins implemented in Pandas.



In [1]:
import pandas as pd
import numpy as np

In [8]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c:[str(c) + str(i) for i in ind] for c in cols}
    print(data)
    return pd.DataFrame(data, ind)

In [9]:
make_df.__doc__

'Quickly make a DataFrame'

In [10]:
make_df('ABC', range(3))

{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}


Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


#### Quick Concat Reminder (NumPy)

In [11]:
x = [1,2,3]
y = [4,5,6]
z = [7,8,9]
np.concatenate([x, y, z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [12]:
# Specify the axis along which the arrays will be concatenated
x_1 = [[1,2,3], [4,5,6]]
np.concatenate([x_1, x_1], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

#### Pandas Concat

``` python
pd.concat(objs, axis=0, join='outer', ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          sort=False, copy=True)
```
* pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays:



In [18]:
# Note: a shared index value (3 here) is kept once per input, so it appears in two output rows
ser_1 = pd.Series(['A', 'B', 'C'], index=[1,2,3])
ser_2 = pd.Series(['D', 'E', 'F'], index=[4,3,6])
pd.concat([ser_1, ser_2])

1    A
2    B
3    C
4    D
3    E
6    F
dtype: object

In [20]:
df_1 = make_df('AB', [1,2])
df_2 = make_df('AB', [3,4])
pd.concat([df_1, df_2])

{'A': ['A1', 'A2'], 'B': ['B1', 'B2']}
{'A': ['A3', 'A4'], 'B': ['B3', 'B4']}


Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


* By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0). Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place.

In [22]:
df_3 = make_df('AB', [0, 1])
df_4 = make_df('CD', [0,1])
display(df_3)
display(df_4)

{'A': ['A0', 'A1'], 'B': ['B0', 'B1']}
{'C': ['C0', 'C1'], 'D': ['D0', 'D1']}


Unnamed: 0,A,B
0,A0,B0
1,A1,B1


Unnamed: 0,C,D
0,C0,D0
1,C1,D1


In [24]:
print(pd.concat([df_3, df_4], axis=1))

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1


In [26]:
print(pd.concat([df_3, df_4])) 
# By default (join='outer'), stacking frames with different columns keeps every column
# and fills the entries a frame does not provide with NaN

     A    B    C    D
0   A0   B0  NaN  NaN
1   A1   B1  NaN  NaN
0  NaN  NaN   C0   D0
1  NaN  NaN   C1   D1


### Duplicate indices 
* One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices! Consider this simple example:

In [32]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2,3])
display(x)
display(y)
print(x.index, y.index)

{'A': ['A0', 'A1'], 'B': ['B0', 'B1']}
{'A': ['A2', 'A3'], 'B': ['B2', 'B3']}


Unnamed: 0,A,B
0,A0,B0
1,A1,B1


Unnamed: 0,A,B
2,A2,B2
3,A3,B3


Int64Index([0, 1], dtype='int64') Int64Index([2, 3], dtype='int64')


In [34]:
print(pd.concat([x, y]))

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3


In [35]:
y.index = x.index
print(pd.concat([x, y]))

    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3


* Although valid, a result with repeated index values is often undesirable. Let's see how we can handle this.

#### Catching the repeats as an error
* To verify that the indices in the result of pd.concat() do not overlap, specify the `verify_integrity` flag. When set to `True`, the concatenation will raise an exception if there are duplicate indices.

In [38]:
try:
    pd.concat([x, y], verify_integrity=True)
except Exception as e:
    print(type(e))
    print(e)

<class 'ValueError'>
Indexes have overlapping values: Int64Index([0, 1], dtype='int64')


In [39]:
# With verify_integrity=False (the default), duplicate indices are kept without raising an error
pd.concat([x,y], verify_integrity=False)

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
0,A2,B2
1,A3,B3
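
* Another way to handle duplicates is to discard the original labels altogether with the `ignore_index` flag of pd.concat(). The sketch below rebuilds `x` and `y` by hand (rather than through `make_df`) so that it runs on its own; the name `combined` is only illustrative.

```python
import pandas as pd

# Two small frames that share the same index labels (0 and 1)
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# ignore_index=True drops the original labels and builds a fresh integer index
combined = pd.concat([x, y], ignore_index=True)
print(combined)
```

* This is a reasonable choice when the index carries no meaning of its own; if the labels matter, `verify_integrity` or `keys` are likely better options.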


#### Adding MultiIndex Keys
Use the `keys` argument to specify a label for each data source; the result will be a hierarchically indexed object containing the data.

In [41]:
display(x)
display(y)

Unnamed: 0,A,B
0,A0,B0
1,A1,B1


Unnamed: 0,A,B
0,A2,B2
1,A3,B3


In [42]:
print(pd.concat([x, y], keys=['x', 'y']))

      A   B
x 0  A0  B0
  1  A1  B1
y 0  A2  B2
  1  A3  B3
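
Once the keys are in place, the source label can be used to pull a block back out of the result. A minimal sketch, assuming the same two frames as above (rebuilt by hand here so the cell is self-contained; `stacked` and `from_y` are illustrative names):

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

stacked = pd.concat([x, y], keys=['x', 'y'])

# The outer index level holds the source label, the inner level the original index
from_y = stacked.loc['y']              # all rows that came from y
single = stacked.loc[('y', 1), 'A']    # one cell, addressed by (source, original label)
print(from_y)
print(single)
```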


### Concatenation w/Joins
In practice, data from different sources might have different sets of column names, and pd.concat offers several options. 
* Let's start with combining two dataframes that have some (not all) columns in common

In [44]:
df_5 = make_df('ABC', [1,2])
df_6 = make_df('BCD', [3,4])
pd.concat([df_5, df_6], keys=['df_5', 'df_6'])

{'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}
{'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}


Unnamed: 0,Unnamed: 1,A,B,C,D
df_5,1,A1,B1,C1,
df_5,2,A2,B2,C2,
df_6,3,,B3,C3,D3
df_6,4,,B4,C4,D4


* By default, entries for which no data is available (for example, column D for key df_5) are filled with NA values.
* Using the `join` argument, we can specify how the columns of the inputs are combined
    * The default is `join='outer'`, which keeps the union of all input columns; `join='inner'` keeps only the columns common to every input

In [45]:
print(pd.concat([df_5, df_6], join='inner'))

    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4


* Columns that are not present in both df_5 and df_6 ('A' & 'D') are not included
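
* The inner join keeps exactly the intersection of the column labels. The sketch below checks this, and also shows one possible way (an assumption on our part, not something pd.concat offers directly) to keep only the columns of the first frame: do the default outer concat and then select `df_5.columns`.

```python
import pandas as pd

df_5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df_6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}, index=[3, 4])

# join='inner' keeps only the columns present in every input
inner = pd.concat([df_5, df_6], join='inner')
shared = df_5.columns.intersection(df_6.columns)
print(list(inner.columns), list(shared))

# Keep the columns of df_5 only: outer concat, then select those columns
left_cols = pd.concat([df_5, df_6])[df_5.columns]
print(left_cols)
```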

In [48]:
pd.concat([df_5, df_6], axis=1)

Unnamed: 0,A,B,C,B.1,C.1,D
1,A1,B1,C1,,,
2,A2,B2,C2,,,
3,,,,B3,C3,D3
4,,,,B4,C4,D4


In [52]:
df1_axis = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df4_axis = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']], columns=['animal', 'name'])
display(df1_axis)
display(df4_axis)
display(pd.concat([df1_axis, df4_axis], axis=1))
display(pd.concat([df1_axis, df4_axis], axis=0))

Unnamed: 0,letter,number
0,a,1
1,b,2


Unnamed: 0,animal,name
0,bird,polly
1,monkey,george


Unnamed: 0,letter,number,animal,name
0,a,1,bird,polly
1,b,2,monkey,george


Unnamed: 0,letter,number,animal,name
0,a,1.0,,
1,b,2.0,,
0,,,bird,polly
1,,,monkey,george


### The append() method
* pd.concat([df_1, df_2]) can also be written simply as df_1.append(df_2)

In [53]:
df_1.append(df_2)



Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


* However, as the FutureWarning emitted by this call indicates, append() is deprecated (it was removed in pandas 2.0), so pd.concat should be used instead
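
* A common pattern that replaces repeated append() calls is to collect the pieces in a list and concatenate once at the end. A minimal sketch, where `batches` is a hypothetical list of frames invented for illustration:

```python
import pandas as pd

# Hypothetical pieces of data that arrive one at a time
batches = [
    pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2]),
    pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4]),
    pd.DataFrame({'A': ['A5'], 'B': ['B5']}, index=[5]),
]

# Concatenate once, instead of growing a DataFrame with append() in a loop
result = pd.concat(batches)
print(result)
```

* Each concatenation builds a new object, so doing it once over a list avoids copying the accumulated data again on every step.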

### Merge & Join
* In-memory join and merge operations
    * Very similar to RDBMS (Database) style merges/joins

#### One-to-one joins

In [54]:
dframe_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'], 'group': ['Accounting', 'Engineering', 
                                                                               'Engineering', 'HR']})
dframe_2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'], 'hire_data': [2004, 2008, 2012, 2014]})

In [55]:
dframe_1

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR


In [56]:
dframe_2

Unnamed: 0,employee,hire_data
0,Lisa,2004
1,Bob,2008
2,Jake,2012
3,Sue,2014


In [57]:
dframe_3 = pd.merge(dframe_1, dframe_2)
dframe_3

Unnamed: 0,employee,group,hire_data
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


* The pd.merge() function recognizes that each DataFrame has an “employee” column, and automatically joins using this column as a key. The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the “employee” column differs between dframe_1 and dframe_2, and the pd.merge() function correctly accounts for this.

#### Many-to-one joins
* Joins in which one of the two key columns contains duplicate entries.
* Resulting DataFrame will preserve those duplicate entries as appropriate

In [58]:
dframe_4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'], 
                        'supervisor': ['Carly', 'Guido', 'Steve']})
dframe_4

Unnamed: 0,group,supervisor
0,Accounting,Carly
1,Engineering,Guido
2,HR,Steve


In [59]:
print(pd.merge(dframe_3, dframe_4))

  employee        group  hire_data supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve


* The resulting DataFrame has an additional column with the “supervisor” information, where the information is repeated in one or more locations as required by the inputs.
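
* If you want pandas to confirm which kind of join you are performing, pd.merge() accepts a `validate` argument. The sketch below uses hand-built frames (`employees` and `supervisors` are illustrative names mirroring the data above) and shows a many-to-one check passing while a one-to-one check fails.

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# 'many_to_one' only requires the key to be unique in the right frame
merged = pd.merge(employees, supervisors, validate='many_to_one')
print(merged)

# 'group' repeats on the left, so a one-to-one check is expected to fail
try:
    pd.merge(employees, supervisors, validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError as err:
    one_to_one_ok = False
    print('MergeError:', err)
```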

#### Many-to-many joins

In [61]:
dframe_5 = pd.DataFrame({'group':['Accounting', 'Accounting', 'Engineering',
                                 'Engineering', 'HR', 'HR'],
                        'skills':['math', 'spreadsheets', 'coding', 'linux',
                                 'spreadsheets', 'organization']})
print(pd.merge(dframe_1, dframe_5))

  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
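
* When the key is duplicated in both inputs, every left row for a key is paired with every right row for that key. The sketch below (frames rebuilt by hand; `expected_rows` is an illustrative name) checks the row count of the result against that rule.

```python
import pandas as pd

dframe_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
dframe_5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                                   'Engineering', 'HR', 'HR'],
                         'skills': ['math', 'spreadsheets', 'coding', 'linux',
                                    'spreadsheets', 'organization']})

merged = pd.merge(dframe_1, dframe_5)

# Per key: (rows on the left) x (rows on the right); summed over all keys
left_counts = dframe_1['group'].value_counts()
right_counts = dframe_5['group'].value_counts()
expected_rows = int((left_counts * right_counts).sum())

print(expected_rows, len(merged))
```

* Here that is 1x2 (Accounting) + 2x2 (Engineering) + 1x2 (HR) = 8 rows, matching the output above.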


### Specification of the Merge Key
* Above is the default behavior for pd.merge()
    * It looks for one or more matching column names between the two inputs, and uses them as the key
* However, often the column names will not match so nicely
    * As such, pd.merge() provides a variety of options for handling this
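
* The default key selection can be reproduced by hand: take the column labels the two inputs share and pass them as `on`. A minimal sketch, with the frames rebuilt here so the cell is self-contained (`common`, `implicit` and `explicit` are illustrative names):

```python
import pandas as pd

dframe_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
dframe_2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                         'hire_data': [2004, 2008, 2012, 2014]})

# Column labels present in both frames
common = dframe_1.columns.intersection(dframe_2.columns)
print(list(common))

implicit = pd.merge(dframe_1, dframe_2)                   # key inferred
explicit = pd.merge(dframe_1, dframe_2, on=list(common))  # key spelled out
print(implicit.equals(explicit))
```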

#### On Keyword

In [63]:
display(dframe_1)
display(dframe_2)

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR


Unnamed: 0,employee,hire_data
0,Lisa,2004
1,Bob,2008
2,Jake,2012
3,Sue,2014


In [62]:
print(pd.merge(dframe_1, dframe_2, on='employee'))

  employee        group  hire_data
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014


* This option only works if both the left and right `DataFrames` have the specified column name, as in the example above
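
* To see what happens otherwise, the sketch below merges against a hypothetical `salaries` frame whose key column is called `name` instead of `employee`; the merge is expected to fail with a KeyError.

```python
import pandas as pd

dframe_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
# Hypothetical frame: the key column is labeled 'name', not 'employee'
salaries = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})

try:
    pd.merge(dframe_1, salaries, on='employee')
    merged_ok = True
except KeyError as err:
    # 'employee' does not exist in the right-hand frame
    merged_ok = False
    print('KeyError:', err)
```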

#### The left_on and right_on keywords
* At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as `name` rather than `employee`.
* In this case, we can use the `left_on` and `right_on` keywords to specify the two column names


In [64]:
dframe_3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'salary': [70000, 80000, 120000, 90000]})
dframe_3

Unnamed: 0,name,salary
0,Bob,70000
1,Jake,80000
2,Lisa,120000
3,Sue,90000


In [66]:
dframe_1

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR


In [67]:
pd.merge(dframe_1, dframe_3, left_on='employee', right_on='name')

Unnamed: 0,employee,group,name,salary
0,Bob,Accounting,Bob,70000
1,Jake,Engineering,Jake,80000
2,Lisa,Engineering,Lisa,120000
3,Sue,HR,Sue,90000


In [68]:
pd.merge(dframe_3, dframe_1, left_on='name', right_on='employee')

Unnamed: 0,name,salary,employee,group
0,Bob,70000,Bob,Accounting
1,Jake,80000,Jake,Engineering
2,Lisa,120000,Lisa,Engineering
3,Sue,90000,Sue,HR


In [70]:
# Drop the redundant 'name' column; drop() targets rows by default, so pass axis=1 to drop a column
pd.merge(dframe_1, dframe_3, left_on='employee', right_on='name').drop('name', axis=1)

Unnamed: 0,employee,group,salary
0,Bob,Accounting,70000
1,Jake,Engineering,80000
2,Lisa,Engineering,120000
3,Sue,HR,90000
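
* An alternative that avoids the redundant column in the first place is to rename the key before merging. This is a sketch of that option, not the approach taken above; `renamed`, `merged` and `dropped` are illustrative names.

```python
import pandas as pd

dframe_1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
dframe_3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})

# Give the key the same label on both sides, then merge on it
renamed = dframe_3.rename(columns={'name': 'employee'})
merged = pd.merge(dframe_1, renamed, on='employee')

# Same result as merging with left_on/right_on and dropping 'name' afterwards
dropped = pd.merge(dframe_1, dframe_3,
                   left_on='employee', right_on='name').drop('name', axis=1)
print(merged.equals(dropped))
```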


#### The left_index and right_index keywords
* Sometimes, rather than merging on a column, you would instead like to merge on an index

In [74]:
df1a = dframe_1.set_index('employee')
df1a

employee,group
Bob,Accounting
Jake,Engineering
Lisa,Engineering
Sue,HR


In [77]:
df2a = dframe_2.set_index('employee')
df2a

employee,hire_data
Lisa,2004
Bob,2008
Jake,2012
Sue,2014


In [78]:
pd.merge(df1a, df2a, left_index=True, right_index=True)

employee,group,hire_data
Bob,Accounting,2008
Jake,Engineering,2012
Lisa,Engineering,2004
Sue,HR,2014


* `DataFrames` implement the join() method, which performs a merge that defaults to joining on indices

In [81]:
df1a.join(df2a)

employee,group,hire_data
Bob,Accounting,2008
Jake,Engineering,2012
Lisa,Engineering,2004
Sue,HR,2014
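
* One difference worth keeping in mind: join() performs a left join by default, whereas pd.merge() performs an inner join. The two agree above because every employee appears in both frames. The sketch below uses a hypothetical right-hand frame that is missing Sue to make the difference visible.

```python
import pandas as pd

left = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']}
                    ).set_index('employee')
# Hypothetical variation of the hire data: Sue has no entry
right = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                      'hire_data': [2004, 2008, 2012]}).set_index('employee')

joined = left.join(right)   # how='left' by default: Sue is kept, hire_data is NaN
inner = pd.merge(left, right, left_index=True, right_index=True)  # how='inner': Sue is dropped

print(joined)
print(inner)
```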


* To mix indices and columns, combine left_index with right_on or left_on with right_index 

In [83]:
display(df1a)
display(dframe_3)
pd.merge(df1a, dframe_3, left_index=True, right_on='name')

employee,group
Bob,Accounting
Jake,Engineering
Lisa,Engineering
Sue,HR


Unnamed: 0,name,salary
0,Bob,70000
1,Jake,80000
2,Lisa,120000
3,Sue,90000


Unnamed: 0,group,name,salary
0,Accounting,Bob,70000
1,Engineering,Jake,80000
2,Engineering,Lisa,120000
3,HR,Sue,90000


### Specifying Set Arithmetic for Joins

In [86]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', sep=<no_default>, delimiter=None, header='infer', names=<no_default>, index_col=None, usecols=None, squeeze=None, prefix=<no_default>, mangle_dupe_cols=True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression: 'CompressionOptions' = 'infer', thousands=None, decimal: 'str' = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors: 'str | None' = 'strict', dialect=None, error_bad_li