# Chapter 04 Data Assembly
Pandas for Everyone. See the author's [github page](https://github.com/chendaniely/pandas_for_everyone)

In [2]:
import pandas as pd

In [3]:
df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')

In [4]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [5]:
df2

Unnamed: 0,A,B,C,D
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7


In [6]:
df3

Unnamed: 0,A,B,C,D
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11


## Join Together by Rows
We can use the pd.concat() function to join multiple data frames together in the row direction.

In [25]:
df = pd.concat([df1, df2, df3])
df

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,a8,b8,c8,d8
1,a9,b9,c9,d9


### Surprise!
Try the code below.

In [8]:
df.loc[0] 

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
0,a4,b4,c4,d4
0,a8,b8,c8,d8


It retrieves not just the first row but every row whose index label is 0. In a DataFrame the index can contain duplicate labels, and .loc\[i\] retrieves all rows labelled i.
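
To make the label-versus-position distinction concrete, here is a minimal sketch. The two small frames are illustrative stand-ins built inline (not the CSV files used above), and the variable names are made up for this example.

```python
import pandas as pd

# Illustrative stand-ins for two of the frames above.
part1 = pd.DataFrame({'A': ['a0', 'a1']})
part2 = pd.DataFrame({'A': ['a4', 'a5']})
stacked = pd.concat([part1, part2])

# The index labels 0 and 1 each appear twice, so the index is not unique.
has_unique_index = stacked.index.is_unique

# .loc selects by label: every row labelled 0 comes back (a DataFrame).
by_label = stacked.loc[0]

# .iloc selects by position: only the very first row comes back (a Series).
by_position = stacked.iloc[0]
```

If only the first row is wanted, .iloc\[0\] is the safer choice whenever the index may contain duplicates.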

### Surprise 2!

In [9]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [28]:
new_series = pd.Series({'A': 'n1', 'B': 'n2', 'C': 'n3', 'D': 'n4'})
new_series

A    n1
B    n2
C    n3
D    n4
dtype: object

In [29]:
pd.concat([df1, new_series])

Unnamed: 0,A,B,C,D,0
0,a0,b0,c0,d0,
1,a1,b1,c1,d1,
2,a2,b2,c2,d2,
3,a3,b3,c3,d3,
A,,,,,n1
B,,,,,n2
C,,,,,n3
D,,,,,n4


Why is it so? Shouldn't it be something like 5 rows and the same 4 columns?

This is because concat() converts new_series to a DataFrame with 1 column (column label 0) and 4 rows (labelled A, B, C, D) before concatenation. None of these row or column labels match those of df1, so concat() appends 4 new rows and 1 new column to df1 to form a new DataFrame of 8 rows and 5 columns.
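
The shape of that intermediate frame can be inspected directly. This is only a sketch of what the conversion looks like from the outside, using Series.to_frame() on a shortened two-column stand-in for df1; it does not claim to reproduce how concat() is implemented internally.

```python
import pandas as pd

# Shortened stand-ins for df1 and new_series (two columns instead of four).
df1_like = pd.DataFrame({'A': ['a0', 'a1'], 'B': ['b0', 'b1']})
new_series = pd.Series({'A': 'n1', 'B': 'n2'})

# An unnamed Series viewed as a frame: rows labelled A and B, one column labelled 0.
as_column = new_series.to_frame()

# Concatenating by rows adds those rows and that extra column.
combined = pd.concat([df1_like, new_series])
```

Here `as_column` has 2 rows and 1 column, and `combined` has 4 rows and 3 columns, mirroring the 8 rows and 5 columns seen with the full data.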

To make concat output a new dataframe with 5 rows and the same 4 columns, we need to create a new DataFrame with just one row and the same columns (A, B, C, D).

In [23]:
new_df = pd.DataFrame([['n1', 'n2', 'n3', 'n4']], columns = ['A', 'B', 'C', 'D'])
pd.concat([df1, new_df])

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,n1,n2,n3,n4


### Ignoring the index

Pass ignore_index=True to discard the original index labels and generate a new index starting from 0.

In [24]:
pd.concat([df1, new_df], ignore_index=True)

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
4,n1,n2,n3,n4
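
As a point of comparison, the same renumbering can be done after the fact with reset_index(drop=True). The sketch below uses small inline frames as stand-ins for df1 and new_df.

```python
import pandas as pd

# Illustrative stand-ins for df1 and new_df.
top = pd.DataFrame({'A': ['a0', 'a1'], 'B': ['b0', 'b1']})
bottom = pd.DataFrame({'A': ['n1'], 'B': ['n2']})

# Renumber while concatenating.
renumbered = pd.concat([top, bottom], ignore_index=True)

# Concatenate first, then drop the old labels and renumber.
reset_later = pd.concat([top, bottom]).reset_index(drop=True)
```

Both frames end up with the index 0, 1, 2. Using ignore_index=True avoids ever creating the intermediate frame with duplicate labels.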


## Join Together by Columns
To join in the column direction, pass the axis=1 option.

In [34]:
new_series = pd.Series(['n0', 'n1', 'n2', 'n3'])
new_series

0    n0
1    n1
2    n2
3    n3
dtype: object

In [36]:
pd.concat([df1, new_series], axis=1) # join in the column direction

Unnamed: 0,A,B,C,D,0
0,a0,b0,c0,d0,n0
1,a1,b1,c1,d1,n1
2,a2,b2,c2,d2,n2
3,a3,b3,c3,d3,n3
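
The new column is labelled 0 because the Series has no name. A small sketch, with an inline stand-in for df1 and a made-up column name `new_col`, shows that naming the Series gives the column a readable label:

```python
import pandas as pd

# Illustrative stand-in for df1.
left = pd.DataFrame({'A': ['a0', 'a1'], 'B': ['b0', 'b1']})

# 'new_col' is a hypothetical name chosen for this example.
extra = pd.Series(['n0', 'n1'], name='new_col')

# The Series name becomes the column label in the result.
wide = pd.concat([left, extra], axis=1)
```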


### Concatenation With Different Columns
Now we change the column names of df2 and df3, then see what happens when joining them together.

In [39]:
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']

In [40]:
pd.concat([df1, df2, df3])

Unnamed: 0,A,B,C,D,E,F,G,H
0,a0,b0,c0,d0,,,,
1,a1,b1,c1,d1,,,,
2,a2,b2,c2,d2,,,,
3,a3,b3,c3,d3,,,,
0,,,,,a4,b4,c4,d4
1,,,,,a5,b5,c5,d5
2,,,,,a6,b6,c6,d6
3,,,,,a7,b7,c7,d7
0,a8,,b8,,,c8,,d8
1,a9,,b9,,,c9,,d9


If we don't want those NaN values, we can specify the join='inner' option (the default is 'outer'), which keeps only the columns that all the data frames have in common.

In [42]:
pd.concat([df1, df2, df3], join='inner')

0
1
2
3
0
1
2
3
0
1
2


We get an empty DataFrame object with just the index, because df1, df2, and df3 have no column in common.
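
When some columns are shared, the inner join keeps exactly those. A minimal sketch with one-row stand-ins for df1 and the renamed df3 (built inline rather than read from the CSV files):

```python
import pandas as pd

# One-row stand-ins: columns A, B, C, D versus A, C, F, H.
first = pd.DataFrame({'A': ['a0'], 'B': ['b0'], 'C': ['c0'], 'D': ['d0']})
third = pd.DataFrame({'A': ['a8'], 'C': ['b8'], 'F': ['c8'], 'H': ['d8']})

# Only the columns present in both frames (A and C) survive.
shared = pd.concat([first, third], join='inner', ignore_index=True)
```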

### Concatenation With Different Rows

In [43]:
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]
pd.concat([df1, df2, df3], axis=1)

Unnamed: 0,A,B,C,D,E,F,G,H,A.1,C.1,F.1,H.1
0,a0,b0,c0,d0,,,,,a8,b8,c8,d8
1,a1,b1,c1,d1,,,,,,,,
2,a2,b2,c2,d2,,,,,a9,b9,c9,d9
3,a3,b3,c3,d3,,,,,,,,
4,,,,,a4,b4,c4,d4,,,,
5,,,,,a5,b5,c5,d5,a10,b10,c10,d10
6,,,,,a6,b6,c6,d6,,,,
7,,,,,a7,b7,c7,d7,a11,b11,c11,d11
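
The join option applies to the index in the same way when axis=1. The following sketch uses single-column stand-ins with the same index labels as df1 and df3 above:

```python
import pandas as pd

# Stand-ins with the index labels of df1 and df3.
left = pd.DataFrame({'A': ['a0', 'a1', 'a2', 'a3']}, index=[0, 1, 2, 3])
right = pd.DataFrame({'H': ['d8', 'd9', 'd10', 'd11']}, index=[0, 2, 5, 7])

# Default outer join: every index label from either frame, NaN where missing.
outer = pd.concat([left, right], axis=1)

# Inner join: only the index labels present in both frames (0 and 2).
inner = pd.concat([left, right], axis=1, join='inner')
```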


## Merging Multiple Data Sets

At times, we want to combine related information from different data sets. In Pandas, we can use the merge() method to put related rows from two DataFrame objects together. Conceptually, it works like this:

1. We have two DataFrame objects: df1 and df2;
2. We use column A of df1 and column B of df2 as the join key. That means, for a row r1 in df1, we search for related rows r2 in df2 where r1.A == r2.B. If found, we join the columns of r1 and columns of r2 together to form a new row.
3. Repeat step 2 for all rows in df1 and df2 and join all new rows together to form a new DataFrame object.

The above is called a join operation. It is very similar to a join in an SQL database.
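
The three steps can be written out by hand and compared with merge(). This is a conceptual sketch only: the frames and the column names `left_val` and `right_val` are invented for the example, and the nested loop is far slower than what pandas actually does.

```python
import pandas as pd

# Invented example frames: join key A on the left, B on the right.
left = pd.DataFrame({'A': ['x', 'y', 'z'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'B': ['y', 'x', 'x'], 'right_val': [10, 20, 30]})

# Steps 2 and 3 by hand: pair every left row with every matching right row.
rows = []
for _, r1 in left.iterrows():
    for _, r2 in right.iterrows():
        if r1['A'] == r2['B']:
            rows.append({**r1.to_dict(), **r2.to_dict()})
by_hand = pd.DataFrame(rows)

# The same join expressed with merge() (inner join by default).
merged = left.merge(right, left_on='A', right_on='B')
```

Both results contain three rows: 'x' matches twice, 'y' matches once, and 'z' has no match and is dropped.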

### One-to-One Merge

First, let's look at the simplest use case, where each row in df1 has only one related row in df2. This is called a one-to-one merge.

In [44]:
site = pd.read_csv('data/survey_site.csv')
site

Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [45]:
visited = pd.read_csv('data/survey_visited.csv')
visited

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,
6,837,MSK-4,1932-01-14
7,844,DR-1,1932-03-22


To make the case simpler, we eliminate the duplicate site values in 'visited' by keeping one row per site.

In [46]:
visited_subset = visited.loc[[0, 2, 6]]
visited_subset

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
2,734,DR-3,1939-01-07
6,837,MSK-4,1932-01-14


In [48]:
o2o_merge = site.merge( visited_subset
                      , left_on='name'  # left means 'site', the caller of the merge() method, left_on is the join key of site
                      , right_on='site' # right means 'visited_subset', right_on means the join key of visited_subset
                      , how='inner'     # inner means keeps only rows where keys exist both on the left and right (default)
                      )
o2o_merge

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
2,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14
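
If a merge is expected to be one-to-one, merge() can check that assumption through its validate argument, which raises pandas.errors.MergeError when a key is duplicated. The sketch below uses shortened inline stand-ins for site and visited rather than the CSV files.

```python
import pandas as pd

# Shortened stand-ins for site and visited.
site = pd.DataFrame({'name': ['DR-1', 'DR-3'], 'lat': [-49.85, -47.15]})
visits = pd.DataFrame({'ident': [619, 622, 734],
                       'site': ['DR-1', 'DR-1', 'DR-3']})

# 'DR-1' appears twice on the right, so the one-to-one check fails.
try:
    site.merge(visits, left_on='name', right_on='site', validate='one_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# After keeping one visit per site, the same check passes.
visits_subset = visits.drop_duplicates('site')
o2o = site.merge(visits_subset, left_on='name', right_on='site',
                 validate='one_to_one')
```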


### Many-to-One Merge
If a row from df1 (the left) has more than one related row in df2 (the right), then it's called a many-to-one merge. Here is an example.

In [49]:
m2o_merge = site.merge(visited, left_on='name', right_on='site')
m2o_merge

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-1,-49.85,-128.57,622,DR-1,1927-02-10
2,DR-1,-49.85,-128.57,844,DR-1,1932-03-22
3,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
4,DR-3,-47.15,-126.72,735,DR-3,1930-01-12
5,DR-3,-47.15,-126.72,751,DR-3,1930-02-26
6,DR-3,-47.15,-126.72,752,DR-3,
7,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14
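
Each left row is repeated once for every matching right row. A small sketch with inline stand-ins shows the per-key row counts. Note that the validate argument of merge() names this relationship from the left frame's point of view, so the value to pass is 'one_to_many'.

```python
import pandas as pd

# Shortened stand-ins for site and visited.
site = pd.DataFrame({'name': ['DR-1', 'DR-3'], 'lat': [-49.85, -47.15]})
visits = pd.DataFrame({'ident': [619, 622, 734],
                       'site': ['DR-1', 'DR-1', 'DR-3']})

# Keys must be unique on the left but may repeat on the right.
merged = site.merge(visits, left_on='name', right_on='site',
                    validate='one_to_many')

# Number of result rows per site equals the number of visits to that site.
rows_per_site = merged['name'].value_counts().to_dict()
```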


### Many-to-Many Merge

There are times when we want to perform a match based on multiple columns. For example, we want to build a table of who (person) did what (survey), where and when (visit). We need to join data from the following 3 tables:

1. Person (who);
2. Survey (what);
3. Site (where and when).

Let's load the Person and Survey tables first.

In [50]:
person = pd.read_csv('data/survey_person.csv')
person

Unnamed: 0,ident,personal,family
0,dyer,William,Dyer
1,pb,Frank,Pabodie
2,lake,Anderson,Lake
3,roe,Valentina,Roerich
4,danforth,Frank,Danforth


In [51]:
survey = pd.read_csv('data/survey_survey.csv')
survey

Unnamed: 0,taken,person,quant,reading
0,619,dyer,rad,9.82
1,619,dyer,sal,0.13
2,622,dyer,rad,7.8
3,622,dyer,sal,0.09
4,734,pb,rad,8.41
5,734,lake,sal,0.05
6,734,pb,temp,-21.5
7,735,pb,rad,7.22
8,735,,sal,0.06
9,735,,temp,-26.0


#### Join Person and Survey
Who did what

In [52]:
ps = person.merge(survey, left_on='ident', right_on='person')
ps

Unnamed: 0,ident,personal,family,taken,person,quant,reading
0,dyer,William,Dyer,619,dyer,rad,9.82
1,dyer,William,Dyer,619,dyer,sal,0.13
2,dyer,William,Dyer,622,dyer,rad,7.8
3,dyer,William,Dyer,622,dyer,sal,0.09
4,pb,Frank,Pabodie,734,pb,rad,8.41
5,pb,Frank,Pabodie,734,pb,temp,-21.5
6,pb,Frank,Pabodie,735,pb,rad,7.22
7,pb,Frank,Pabodie,751,pb,rad,4.35
8,pb,Frank,Pabodie,751,pb,temp,-18.5
9,lake,Anderson,Lake,734,lake,sal,0.05


The merged table ps has 19 rows, fewer than the survey table, because survey has 2 rows with person = NaN and those rows match no row in person.
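
One way to see which survey rows were dropped is a right join with indicator=True, which adds a `_merge` column saying where each row came from. The frames below are shortened inline stand-ins for person and survey, not the full CSV data.

```python
import pandas as pd

# Shortened stand-ins; the last survey row has no person recorded.
person = pd.DataFrame({'ident': ['dyer', 'pb', 'danforth'],
                       'family': ['Dyer', 'Pabodie', 'Danforth']})
survey = pd.DataFrame({'taken': [619, 734, 735],
                       'person': ['dyer', 'pb', None],
                       'reading': [9.82, 8.41, 0.06]})

# The default inner join silently drops the survey row without a person.
inner_rows = len(person.merge(survey, left_on='ident', right_on='person'))

# A right join keeps every survey row; _merge flags the unmatched ones.
checked = person.merge(survey, left_on='ident', right_on='person',
                       how='right', indicator=True)
unmatched_surveys = checked[checked['_merge'] == 'right_only']
```

Here `inner_rows` is 2, while `checked` keeps all 3 survey rows and `unmatched_surveys` contains the single row with taken = 735.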

#### Join Site and Survey
What happened where

In [53]:
sv = survey.merge(visited, left_on='taken', right_on='ident')
sv

Unnamed: 0,taken,person,quant,reading,ident,site,dated
0,619,dyer,rad,9.82,619,DR-1,1927-02-08
1,619,dyer,sal,0.13,619,DR-1,1927-02-08
2,622,dyer,rad,7.8,622,DR-1,1927-02-10
3,622,dyer,sal,0.09,622,DR-1,1927-02-10
4,734,pb,rad,8.41,734,DR-3,1939-01-07
5,734,lake,sal,0.05,734,DR-3,1939-01-07
6,734,pb,temp,-21.5,734,DR-3,1939-01-07
7,735,pb,rad,7.22,735,DR-3,1930-01-12
8,735,,sal,0.06,735,DR-3,1930-01-12
9,735,,temp,-26.0,735,DR-3,1930-01-12


#### Join Together

In [54]:
ps_sv = ps.merge(sv, left_on=['taken', 'person', 'quant', 'reading'], right_on=['ident', 'person', 'quant', 'reading'])
ps_sv

Unnamed: 0,ident_x,personal,family,taken_x,person,quant,reading,taken_y,ident_y,site,dated
0,dyer,William,Dyer,619,dyer,rad,9.82,619,619,DR-1,1927-02-08
1,dyer,William,Dyer,619,dyer,sal,0.13,619,619,DR-1,1927-02-08
2,dyer,William,Dyer,622,dyer,rad,7.8,622,622,DR-1,1927-02-10
3,dyer,William,Dyer,622,dyer,sal,0.09,622,622,DR-1,1927-02-10
4,pb,Frank,Pabodie,734,pb,rad,8.41,734,734,DR-3,1939-01-07
5,pb,Frank,Pabodie,734,pb,temp,-21.5,734,734,DR-3,1939-01-07
6,pb,Frank,Pabodie,735,pb,rad,7.22,735,735,DR-3,1930-01-12
7,pb,Frank,Pabodie,751,pb,rad,4.35,751,751,DR-3,1930-02-26
8,pb,Frank,Pabodie,751,pb,temp,-18.5,751,751,DR-3,1930-02-26
9,lake,Anderson,Lake,734,lake,sal,0.05,734,734,DR-3,1939-01-07


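We can get the same rows (in a different order) with a single-key merge of ps and visited, which also avoids the duplicated taken_x and taken_y columns.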

In [57]:
ps2 = ps.merge(visited, left_on='taken', right_on='ident')
ps2

Unnamed: 0,ident_x,personal,family,taken,person,quant,reading,ident_y,site,dated
0,dyer,William,Dyer,619,dyer,rad,9.82,619,DR-1,1927-02-08
1,dyer,William,Dyer,619,dyer,sal,0.13,619,DR-1,1927-02-08
2,dyer,William,Dyer,622,dyer,rad,7.8,622,DR-1,1927-02-10
3,dyer,William,Dyer,622,dyer,sal,0.09,622,DR-1,1927-02-10
4,pb,Frank,Pabodie,734,pb,rad,8.41,734,DR-3,1939-01-07
5,pb,Frank,Pabodie,734,pb,temp,-21.5,734,DR-3,1939-01-07
6,lake,Anderson,Lake,734,lake,sal,0.05,734,DR-3,1939-01-07
7,pb,Frank,Pabodie,735,pb,rad,7.22,735,DR-3,1930-01-12
8,pb,Frank,Pabodie,751,pb,rad,4.35,751,DR-3,1930-02-26
9,pb,Frank,Pabodie,751,pb,temp,-18.5,751,DR-3,1930-02-26
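
The ident_x and ident_y columns appear because both frames have a column named ident that holds different things (a person in ps, a visit in visited). The suffixes argument of merge() can make that explicit. The sketch below uses one-row stand-ins, and the suffix names are chosen just for this example.

```python
import pandas as pd

# One-row stand-ins for ps and visited; both have an 'ident' column.
ps_like = pd.DataFrame({'ident': ['dyer'], 'taken': [619]})
visited_like = pd.DataFrame({'ident': [619], 'site': ['DR-1']})

# Replace the default _x/_y suffixes with descriptive ones.
labelled = ps_like.merge(visited_like, left_on='taken', right_on='ident',
                         suffixes=('_person', '_visit'))
```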
