## CMPINF 2100 Week 05
### Combine DataFrames - Concatenation
## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Read Data
Read in the Example A CSV file discussed in the previous recording.

In [2]:
dfA0 = pd.read_csv("Example_A.csv")

In [3]:
dfA0

Unnamed: 0,A,B,C,D,E,F
0,a,0,-100,Jan,aa,10
1,b,1,-200,Feb,aa,20
2,c,2,-300,Mar,aa,10
3,d,3,-400,Apr,bb,20
4,e,4,-500,May,bb,10
5,f,5,-600,Jun,bb,20
6,g,6,-700,Jul,cc,10
7,h,7,-800,Aug,cc,20
8,i,8,-900,Sep,cc,10
9,j,9,-1000,Oct,dd,20


Add a column with a constant value of 0.

In [4]:
dfA0['attempt'] = 0

In [5]:
dfA0

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,0


In [6]:
dfA1 = pd.read_csv("Example_A.csv")

In [7]:
dfA1

Unnamed: 0,A,B,C,D,E,F
0,a,0,-100,Jan,aa,10
1,b,1,-200,Feb,aa,20
2,c,2,-300,Mar,aa,10
3,d,3,-400,Apr,bb,20
4,e,4,-500,May,bb,10
5,f,5,-600,Jun,bb,20
6,g,6,-700,Jul,cc,10
7,h,7,-800,Aug,cc,20
8,i,8,-900,Sep,cc,10
9,j,9,-1000,Oct,dd,20


Read the same file again and add the same `attempt` column, but this time with a constant value of 1.

In [8]:
dfA1['attempt'] = 1

In [9]:
dfA1

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,1
1,b,1,-200,Feb,aa,20,1
2,c,2,-300,Mar,aa,10,1
3,d,3,-400,Apr,bb,20,1
4,e,4,-500,May,bb,10,1
5,f,5,-600,Jun,bb,20,1
6,g,6,-700,Jul,cc,10,1
7,h,7,-800,Aug,cc,20,1
8,i,8,-900,Sep,cc,10,1
9,j,9,-1000,Oct,dd,20,1


## Vertically Concatenate

Vertically combining means we STACK the objects on top of each other.

In [12]:
pd.concat([dfA0, dfA1])

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,0


This stacks cleanly because BOTH DFs have the SAME column NAMES!

In [13]:
dfA0.columns == dfA1.columns

array([ True,  True,  True,  True,  True,  True,  True])
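
To see why matching column names matter, here is a minimal sketch using two small made-up DataFrames (`top` and `bottom` are hypothetical, they are NOT the Example A data). With the default `join="outer"` behavior, `pd.concat()` keeps the UNION of the column names and fills the gaps with missing values.

```python
import pandas as pd

# Hypothetical toy frames: they share column "A" but differ otherwise.
top = pd.DataFrame({"A": ["a", "b"], "B": [0, 1]})
bottom = pd.DataFrame({"A": ["c", "d"], "C": [-300, -400]})

# .equals() compares the whole column Index at once and also works
# when the two DataFrames have a different number of columns.
same_columns = top.columns.equals(bottom.columns)
print(same_columns)

# Stacking still runs, but columns missing from one input are filled with NaN.
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)
print(stacked.isna().sum())
```

So the stack does not fail when the names differ, it just quietly creates missing values. Checking the column names first is a cheap safeguard.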

Look closely at the `.index` attribute of the COMBINED VERTICALLY STACKED DataFrames!

In [14]:
pd.concat([dfA1, dfA0]).loc[10]

Unnamed: 0,A,B,C,D,E,F,attempt
10,k,10,-1100,Nov,dd,10,1
10,k,10,-1100,Nov,dd,10,0


By default, the `.index` labels are allowed to repeat. A label in the `.index` does NOT uniquely identify a row in the new stacked DataFrame!
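
A quick way to check for repeated labels is the `.index.is_unique` attribute. The sketch below uses small made-up DataFrames (`top` and `bottom` are hypothetical stand-ins for `dfA1` and `dfA0`). It also shows the `verify_integrity=True` argument of `pd.concat()`, which raises an error instead of allowing duplicates.

```python
import pandas as pd

# Hypothetical stand-ins for dfA1 and dfA0: same columns, same 0..1 index.
top = pd.DataFrame({"A": ["a", "b"], "attempt": 1})
bottom = pd.DataFrame({"A": ["a", "b"], "attempt": 0})

stacked = pd.concat([top, bottom])

# False: the labels 0 and 1 each appear twice.
print(stacked.index.is_unique)

# Selecting a single label returns TWO rows, one from each input.
print(stacked.loc[0])

# verify_integrity=True refuses to build an index with repeated labels.
raised = False
try:
    pd.concat([top, bottom], verify_integrity=True)
except ValueError as err:
    raised = True
    print("concat refused:", err)
```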

Setting `ignore_index=True` discards the original labels and assigns a new 0, 1, 2, ... index, so each stacked row gets a UNIQUE label!

In [15]:
pd.concat([dfA1, dfA0], ignore_index=True)

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,1
1,b,1,-200,Feb,aa,20,1
2,c,2,-300,Mar,aa,10,1
3,d,3,-400,Apr,bb,20,1
4,e,4,-500,May,bb,10,1
5,f,5,-600,Jun,bb,20,1
6,g,6,-700,Jul,cc,10,1
7,h,7,-800,Aug,cc,20,1
8,i,8,-900,Sep,cc,10,1
9,j,9,-1000,Oct,dd,20,1


I also like to force a DEEP COPY with `.copy()` just in case.

In [16]:
pd.concat([dfA1, dfA0], ignore_index=True).copy()

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,1
1,b,1,-200,Feb,aa,20,1
2,c,2,-300,Mar,aa,10,1
3,d,3,-400,Apr,bb,20,1
4,e,4,-500,May,bb,10,1
5,f,5,-600,Jun,bb,20,1
6,g,6,-700,Jul,cc,10,1
7,h,7,-800,Aug,cc,20,1
8,i,8,-900,Sep,cc,10,1
9,j,9,-1000,Oct,dd,20,1
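
The reason for wanting a copy is that changes to the stacked result should never leak back into the inputs. A minimal sketch of that check, using made-up DataFrames (`top` and `bottom` are hypothetical, not the Example A data):

```python
import pandas as pd

# Hypothetical stand-ins for dfA1 and dfA0.
top = pd.DataFrame({"A": ["a", "b", "c"], "attempt": 1})
bottom = pd.DataFrame({"A": ["a", "b", "c"], "attempt": 0})

stacked = pd.concat([top, bottom], ignore_index=True).copy()

# Overwrite a value in the stacked result...
stacked.loc[0, "attempt"] = 99

# ...and confirm the original input still holds its own value.
print(stacked.loc[0, "attempt"])
print(top.loc[0, "attempt"])
```

In this toy case the input is untouched, which is the behavior the explicit `.copy()` is meant to guarantee.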


We can assign the result to an object.

In [17]:
dfA_double = pd.concat([dfA1, dfA0], ignore_index=True, copy=True)

In [18]:
dfA_double.shape

(24, 7)

In [19]:
dfA0.shape

(12, 7)
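
Comparing shapes like this is a useful sanity check: a vertical stack should have the SUM of the input rows and the SAME number of columns. Here is a small sketch of that check on made-up DataFrames (`top` and `bottom` are hypothetical), with the `attempt` column used to count how many rows came from each input.

```python
import pandas as pd

# Hypothetical stand-ins for dfA1 and dfA0.
top = pd.DataFrame({"A": ["a", "b", "c"], "attempt": 1})
bottom = pd.DataFrame({"A": ["a", "b", "c"], "attempt": 0})

stacked = pd.concat([top, bottom], ignore_index=True)

# Rows add up, the number of columns is unchanged.
expected_shape = (len(top) + len(bottom), top.shape[1])
print(stacked.shape == expected_shape)

# The constant column records which input each row came from.
rows_per_attempt = stacked["attempt"].value_counts().to_dict()
print(rows_per_attempt)
```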

## Horizontal Concatenation
BINDING columns together!

The default value of the `axis` argument is 0, meaning the DataFrames are combined VERTICALLY.

In [20]:
pd.concat([dfA1, dfA0], axis=0)

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,1
1,b,1,-200,Feb,aa,20,1
2,c,2,-300,Mar,aa,10,1
3,d,3,-400,Apr,bb,20,1
4,e,4,-500,May,bb,10,1
5,f,5,-600,Jun,bb,20,1
6,g,6,-700,Jul,cc,10,1
7,h,7,-800,Aug,cc,20,1
8,i,8,-900,Sep,cc,10,1
9,j,9,-1000,Oct,dd,20,1


If we change the `axis` to `axis=1` then the two DFs will be combined HORIZONTALLY!!!

In [21]:
pd.concat([dfA1, dfA0], axis=1)

Unnamed: 0,A,B,C,D,E,F,attempt,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,1,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,1,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,1,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,1,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,1,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,1,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,1,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,1,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,1,j,9,-1000,Oct,dd,20,0
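
The effect of `axis` is easiest to see through the shapes. The sketch below uses made-up DataFrames (`top` and `bottom` are hypothetical): `axis=0` adds ROWS and `axis=1` adds COLUMNS. pandas also accepts the strings `"index"` and `"columns"` in place of 0 and 1.

```python
import pandas as pd

# Hypothetical stand-ins for dfA1 and dfA0: 3 rows and 2 columns each.
top = pd.DataFrame({"A": ["a", "b", "c"], "attempt": 1})
bottom = pd.DataFrame({"A": ["a", "b", "c"], "attempt": 0})

tall = pd.concat([top, bottom], axis=0)  # stack rows
wide = pd.concat([top, bottom], axis=1)  # bind columns

print(tall.shape)  # twice the rows, same columns
print(wide.shape)  # same rows, twice the columns

# String aliases give the same result as the integers.
wide_alias = pd.concat([top, bottom], axis="columns")
print(wide_alias.shape == wide.shape)
```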


In [22]:
pd.concat([dfA1, dfA0], axis=1).columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'attempt', 'A', 'B', 'C', 'D', 'E', 'F',
       'attempt'],
      dtype='object')

The column names are NO LONGER UNIQUE!!

In [23]:
pd.concat([dfA1, dfA0], axis=1).loc[:, ["A", "B"]]

Unnamed: 0,A,A,B,B
0,a,a,0,0
1,b,b,1,1
2,c,c,2,2
3,d,d,3,3
4,e,e,4,4
5,f,f,5,5
6,g,g,6,6
7,h,h,7,7
8,i,i,8,8
9,j,j,9,9


I think this is VERY BAD. I really dislike that Pandas allows combining DFs horizontally even if they have the SAME COLUMN NAMES!!

BE CAREFUL WHEN HORIZONTALLY CONCATENATING!
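
One way to protect yourself is to check the column names after (or before) binding. The sketch below uses made-up DataFrames (`left` and `right` are hypothetical) and the suffixes `"_1"` and `"_0"` are just illustrative choices. It flags the repeated names with `.columns.duplicated()` and then renames the columns with `.add_suffix()` so the combined names stay unique.

```python
import pandas as pd

# Hypothetical stand-ins for dfA1 and dfA0 with identical column names.
left = pd.DataFrame({"A": ["a", "b"], "attempt": 1})
right = pd.DataFrame({"A": ["a", "b"], "attempt": 0})

wide = pd.concat([left, right], axis=1)

# False: every name appears twice.
print(wide.columns.is_unique)

# List the names that are repeats of an earlier column.
repeated = wide.columns[wide.columns.duplicated()].tolist()
print(repeated)

# Rename BEFORE binding so each column can be selected unambiguously.
safe = pd.concat([left.add_suffix("_1"), right.add_suffix("_0")], axis=1)
print(safe.columns.tolist())
```

With unique names, `safe.loc[:, "A_1"]` returns a single column instead of two.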

So why would we ever horizontally combine?

In [24]:
dfA_left = dfA0.loc[:, dfA0.columns[:3]].copy()

In [25]:
dfA_left

Unnamed: 0,A,B,C
0,a,0,-100
1,b,1,-200
2,c,2,-300
3,d,3,-400
4,e,4,-500
5,f,5,-600
6,g,6,-700
7,h,7,-800
8,i,8,-900
9,j,9,-1000


In [29]:
dfA_right = dfA0.loc[:, dfA0.columns[-2:]].copy()

In [30]:
dfA_right

Unnamed: 0,F,attempt
0,10,0
1,20,0
2,10,0
3,20,0
4,10,0
5,20,0
6,10,0
7,20,0
8,10,0
9,20,0


In [32]:
dfA_left.shape

(12, 3)

In [33]:
dfA_right.shape

(12, 2)

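The point of horizontally combining is to bring together DIFFERENT columns that describe the SAME rows. Note that `pd.concat()` lines the rows up by their `.index` labels, not by position, so the two DataFrames should share the same index.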

In [35]:
pd.concat([dfA_left, dfA_right], axis=1)

Unnamed: 0,A,B,C,F,attempt
0,a,0,-100,10,0
1,b,1,-200,20,0
2,c,2,-300,10,0
3,d,3,-400,20,0
4,e,4,-500,10,0
5,f,5,-600,20,0
6,g,6,-700,10,0
7,h,7,-800,20,0
8,i,8,-900,10,0
9,j,9,-1000,20,0
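
The example above lines up cleanly because both pieces came from the same DataFrame and share the same `.index`. Here is a small sketch of what happens when they do NOT, using made-up DataFrames (`left` and `right` are hypothetical): rows are matched by label, and labels missing from one side are filled with missing values.

```python
import pandas as pd

# Hypothetical frames with the same number of rows but shifted index labels.
left = pd.DataFrame({"A": ["a", "b", "c"]}, index=[0, 1, 2])
right = pd.DataFrame({"F": [10, 20, 10]}, index=[1, 2, 3])

# Rows are matched on the index LABELS, so we get 4 rows and some NaN.
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# If the rows really are in matching order, reset both indexes first.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(aligned)
```

Resetting the index is only appropriate when you know the rows are in matching order. Otherwise the mismatch is a sign the data should be joined on a key instead.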


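BUT BE CAREFUL! IF YOU IGNORE THE INDEX WITH HORIZONTAL CONCATENATION, the COLUMN NAMES are thrown away and replaced by integers, because `ignore_index=True` resets the labels along the concatenation axis.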

In [36]:
pd.concat([dfA_left, dfA_right], axis=1, ignore_index=True)

Unnamed: 0,0,1,2,3,4
0,a,0,-100,10,0
1,b,1,-200,20,0
2,c,2,-300,10,0
3,d,3,-400,20,0
4,e,4,-500,10,0
5,f,5,-600,20,0
6,g,6,-700,10,0
7,h,7,-800,20,0
8,i,8,-900,10,0
9,j,9,-1000,20,0
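
A minimal sketch of this behavior on made-up DataFrames (`left` and `right` are hypothetical), including one possible way to put the names back if they were lost by accident:

```python
import pandas as pd

# Hypothetical stand-ins for dfA_left and dfA_right.
left = pd.DataFrame({"A": ["a", "b"], "B": [0, 1]})
right = pd.DataFrame({"F": [10, 20]})

# With axis=1, ignore_index=True renumbers the COLUMNS, not the rows.
wide = pd.concat([left, right], axis=1, ignore_index=True)
print(wide.columns.tolist())

# The names can be restored from the inputs, in the same order.
wide.columns = list(left.columns) + list(right.columns)
print(wide.columns.tolist())
```

The simpler choice is to leave `ignore_index` at its default when binding columns.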
