# `pandas` - Concatenation

__Contents:__

1.  Concatenating objects
2.  Set logic on other axes
3.  Concatenating using append
4.  Ignoring indexes on the concatenation axis
5.  Concatenating with mixed ndims

Related/useful documentation:
- https://pandas.pydata.org/pandas-docs/stable/merging.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html

### Setup

In [1]:
%%sh
git clone https://github.com/datalab-datasets/file-samples.git

Cloning into 'file-samples'...


In [2]:
%ls /content/file-samples/iris.csv

/content/file-samples/iris.csv


### Load libraries

In [3]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

('0.24.2', '1.16.4')

__Concatenation__

The `concat` function does all of the heavy lifting: it concatenates objects along one axis while applying optional set logic (union or intersection) to the indexes (if any) on the other axes.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html

__1. Concatenating Objects__

`pandas.concat` takes a list or dict of homogeneously-typed objects and concatenates them, with configurable handling of the other axes:

__Example__

Define two sample pandas DataFrames, `df1` and `df2`:

In [4]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [5]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [0]:
frames = [df1, df2]

Use the `concat` function to concatenate the two DataFrames `df1` and `df2`. The concatenated DataFrame is stored in `result`.

In [7]:
result = pd.concat(frames)
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


Suppose we want to associate a specific key with each of the pieces that make up the concatenated DataFrame. We can use the `keys` argument:

In [10]:
result = pd.concat(frames, keys=['x', 'y'])
result.index

MultiIndex(levels=[['x', 'y'], [0, 1, 2, 3, 4, 5, 6, 7]],
           codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7]])

The resulting object has a hierarchical index (a `MultiIndex`). We can select each chunk by key:

In [11]:
result.loc['y']

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
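
As noted earlier, `concat` also accepts a dict of objects. The following is a minimal sketch, using two small toy frames that are assumptions made up for illustration (not the `df1` and `df2` above), suggesting that the dict keys then act like the `keys` argument:

```python
import pandas as pd

# Small illustrative frames (hypothetical data, smaller than df1/df2 above).
left = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
right = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

# Passing a dict: its keys become the outer level of the hierarchical index.
from_dict = pd.concat({'x': left, 'y': right})

# Passing a list plus an explicit `keys` argument.
from_keys = pd.concat([left, right], keys=['x', 'y'])

print(from_dict.equals(from_keys))
print(from_dict.loc['y'])
```

Either form lets us recover a piece later with `.loc[key]`.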


__2. Set logic on other axes__:

When concatenating multiple DataFrames, you can choose how the other axes (those not being concatenated along) are handled. There are three options:

1. `join='outer'` takes the sorted union of the indexes, so no information is lost. This is the default.
2. `join='inner'` takes the intersection of the indexes.
3. `join_axes` uses a specific index or indexes.

`join = 'outer'`

In [12]:
df3 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])
df3

Unnamed: 0,B,D,F
2,B2,D2,F2
3,B3,D3,F3
6,B6,D6,F6
7,B7,D7,F7


In [13]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [14]:
result = pd.concat([df1, df3], axis=1)
result

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


`join = 'inner'`

In [15]:
result = pd.concat([df1, df3], axis=1, join='inner')
result

Unnamed: 0,A,B,C,D,B.1,D.1,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


`join_axes`

In [16]:
result = pd.concat([df1, df3], axis=1, join_axes=[df1.index])
result

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


__3. Concatenating using append__

A useful shortcut to `concat` is the `append` instance method on Series and DataFrame. These methods actually predate `concat`. They concatenate along `axis=0`, namely the index:

In [17]:
result = df1.append(df2)
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


`append` may take multiple objects to concatenate:

In [18]:
result = df1.append([df2, df3])
result

FutureWarning (message truncated in the original output): a future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'.


Unnamed: 0,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
4,A4,B4,C4,D4,
5,A5,B5,C5,D5,
6,A6,B6,C6,D6,
7,A7,B7,C7,D7,
2,,B2,,D2,F2
3,,B3,,D3,F3
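
`DataFrame.append` works in the pandas 0.24.2 used here, but it was deprecated in pandas 1.4 and removed in 2.0. A sketch of the equivalent `concat` call, which should run on both old and new versions (passing `sort=False` explicitly is our choice here, to keep the column order and avoid the warning shown above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

# Counterpart of df1.append([df2, df3]): stack all three frames row-wise.
stacked = pd.concat([df1, df2, df3], sort=False)

print(stacked.shape)
print(stacked.columns.tolist())
```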


__4. Ignoring indexes on the concatenation axis__

For DataFrames that don’t have a meaningful index, you may wish to append them and ignore the fact that they may have overlapping indexes. To do this, pass `ignore_index=True`. The same argument works the same way with `DataFrame.append`.

In [19]:
result = pd.concat([df1, df3], ignore_index=True)
result

FutureWarning (message truncated in the original output): a future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'.


Unnamed: 0,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
4,,B2,,D2,F2
5,,B3,,D3,F3
6,,B6,,D6,F6
7,,B7,,D7,F7
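
To make the effect of `ignore_index` concrete, the sketch below compares the row labels with and without it. It rebuilds `df1` and `df3` so that it runs on its own, and passes `sort=False`, which is our choice to avoid the warning above:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df3 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

# Without ignore_index the original labels are kept, so 2 and 3 appear twice.
kept = pd.concat([df1, df3], sort=False)
print(kept.index.tolist(), kept.index.is_unique)

# With ignore_index=True the rows are relabelled 0..n-1.
relabelled = pd.concat([df1, df3], ignore_index=True, sort=False)
print(relabelled.index.tolist(), relabelled.index.is_unique)
```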


__5. Concatenating with mixed ndims__

We can also concatenate a mix of Series and DataFrames. Each Series is converted to a DataFrame whose single column is named after the Series.

In [20]:
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
s1

0    X0
1    X1
2    X2
3    X3
Name: X, dtype: object

In [21]:
result = pd.concat([df1, s1], axis=1)
result

Unnamed: 0,A,B,C,D,X
0,A0,B0,C0,D0,X0
1,A1,B1,C1,D1,X1
2,A2,B2,C2,D2,X2
3,A3,B3,C3,D3,X3
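
The example above relies on the Series having a name. As a hedged sketch of what happens otherwise, the unnamed Series `s2` below (a hypothetical addition, not part of the original notebook) is concatenated twice, and pandas numbers the resulting columns consecutively:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

# A Series without a name (illustrative data).
s2 = pd.Series(['_0', '_1', '_2', '_3'])

# Unnamed Series become columns labelled with consecutive integers.
mixed = pd.concat([df1, s2, s2], axis=1)

print(mixed.columns.tolist())
```

Giving each Series a `name` up front avoids these integer labels.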


## Example of concat using the iris dataset

In [22]:
iris = pd.read_csv('/content/file-samples/iris.csv')
iris

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [23]:
iris.shape

(150, 5)

Split the dataset into two DataFrames of different sizes (60% and 40% of the rows):

In [0]:
from sklearn.model_selection import train_test_split
iris_df1, iris_df2 = train_test_split(iris, test_size=0.4, train_size=0.6)

Now we apply the functions shown above to the two subsets (`iris_df1`, `iris_df2`) of the iris dataset.

In [0]:
frame = [iris_df1, iris_df2]

In [26]:
iris_concat1 = pd.concat(frame, keys=['x', 'y'])
iris_concat1

Unnamed: 0,Unnamed: 1,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
x,106,4.9,2.5,4.5,1.7,Iris-virginica
x,17,5.1,3.5,1.4,0.3,Iris-setosa
x,114,5.8,2.8,5.1,2.4,Iris-virginica
x,125,7.2,3.2,6.0,1.8,Iris-virginica
x,39,5.1,3.4,1.5,0.2,Iris-setosa
x,126,6.2,2.8,4.8,1.8,Iris-virginica
x,116,6.5,3.0,5.5,1.8,Iris-virginica
x,132,6.4,2.8,5.6,2.2,Iris-virginica
x,92,5.8,2.6,4.0,1.2,Iris-versicolor
x,118,7.7,2.6,6.9,2.3,Iris-virginica


In [27]:
iris_concat2 = pd.concat([iris_df1, iris_df2])
iris_concat2.shape

(150, 5)

In [28]:
iris_df1

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
106,4.9,2.5,4.5,1.7,Iris-virginica
17,5.1,3.5,1.4,0.3,Iris-setosa
114,5.8,2.8,5.1,2.4,Iris-virginica
125,7.2,3.2,6.0,1.8,Iris-virginica
39,5.1,3.4,1.5,0.2,Iris-setosa
126,6.2,2.8,4.8,1.8,Iris-virginica
116,6.5,3.0,5.5,1.8,Iris-virginica
132,6.4,2.8,5.6,2.2,Iris-virginica
92,5.8,2.6,4.0,1.2,Iris-versicolor
118,7.7,2.6,6.9,2.3,Iris-virginica


In [29]:
iris_append = iris_df1.append(iris_df2)
iris_append.shape

(150, 5)
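
Matching shapes suggest that no rows were lost, but they do not prove that the concatenated frame holds the same data as the original. The sketch below shows one way to check this. It uses a small made-up frame and `DataFrame.sample` instead of the iris file and `train_test_split`, so that it runs without extra files or scikit-learn. Both substitutions are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the iris DataFrame.
data = pd.DataFrame({'value': range(10),
                     'label': list('aabbccddee')})

# Split into a 60% part and the remaining 40%, keeping the original row labels.
part1 = data.sample(frac=0.6, random_state=0)
part2 = data.drop(part1.index)

# Concatenate the parts and restore the original row order by index.
rejoined = pd.concat([part1, part2]).sort_index()

print(rejoined.shape == data.shape)
print(rejoined.equals(data))
```

Because the split keeps the original row labels, sorting by index after `concat` is enough to line the rows up again.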

In [30]:
iris

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


__The End__