# Data Assembly

This notebook focuses on assembling a dataset for analysis by combining several smaller datasets.

---

## Combine Datasets 

Think about what happens when we normalize our data to be stored in a database. We often end up having to break up our data into smaller tables to store relevant information together and to reduce redundancy. 

The same thing happens with our datasets. We may have a large dataset that we need to break apart into smaller datasets for various reasons. Once we need to perform any analysis, we'll then need to figure out how to recombine the relevant parts.

---

## Concatenation 

Concatenation can be thought of as appending rows or columns to our data. This is useful when the data was split into parts, or when you've performed some calculation that you want to append to your existing dataset.

In [17]:
import pandas as pd

df1 = pd.read_csv("./concat_1.csv")
df2 = pd.read_csv("./concat_2.csv")
df3 = pd.read_csv("./concat_3.csv")

In [18]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [19]:
df2

Unnamed: 0,A,B,C,D
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7


In [20]:
df3

Unnamed: 0,A,B,C,D
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11


Use the `pd.concat()` function to concatenate dataframes.

### Add Rows

Stacking dataframes on top of each other uses Pandas `concat()` with the dataframes passed in as a Python list.

In [21]:
row_concat = pd.concat([df1, df2, df3])
row_concat

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,a8,b8,c8,d8
1,a9,b9,c9,d9


Trying to append a Series object to a dataframe as a new row doesn't work the way we might expect. Observe...

In [22]:
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
new_row_series

0    n1
1    n2
2    n3
3    n4
dtype: object

In [23]:
# attempt to add this row to the dataframe 
pd.concat([df1, new_row_series])

Unnamed: 0,A,B,C,D,0
0,a0,b0,c0,d0,
1,a1,b1,c1,d1,
2,a2,b2,c2,d2,
3,a3,b3,c3,d3,
0,,,,,n1
1,,,,,n2
2,,,,,n3
3,,,,,n4


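You'll see that it just created a new column (named `0`) instead of a new row. The series has no column labels, so its values don't match any of the existing columns and every unmatched cell is filled in with `NaN`.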

What we want to do is convert that series into a dataframe and specify the column names each value will bind to.

In [24]:
new_row_df = pd.DataFrame(
    data=[["n1", "n2", "n3", "n4"]],
    columns=["A", "B", "C", "D"]
)

new_row_df

Unnamed: 0,A,B,C,D
0,n1,n2,n3,n4


In [25]:
# now we can concatenate properly
pd.concat([df1, new_row_df])

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,n1,n2,n3,n4


So, the takeaway here is that rows only concatenate cleanly when the pieces share column labels, which in practice means concatenating dataframes with dataframes. Like with like.
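
If the new row already exists as a Series, one possible shortcut is to label the series with the column names and flip it into a one-row dataframe. This is only a sketch on small stand-in dataframes (the names `df`, `new_row`, and `result` are made up for illustration), not the CSV data used above.

```python
import pandas as pd

# small stand-in dataframe (not the concat_*.csv data)
df = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})

# label each value with the column it belongs to
new_row = pd.Series(["n1", "n2"], index=["A", "B"])

# to_frame() gives a one-column dataframe; .T transposes it into a single row
new_row_df = new_row.to_frame().T

# now it is dataframe with dataframe, so the row lands where we expect
result = pd.concat([df, new_row_df], ignore_index=True)
print(result)
```

The key point is the same as above: the values need column labels before `concat()` can line them up.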

#### Ignore the index

I personally like this better: we can choose to ignore the row index whenever we combine dataframes. This resets the row indices so that they actually make sense.

This is accomplished using the `ignore_index` parameter.

In [26]:
# stack our dataframes and reset the row indices
row_concat = pd.concat([df1, df2, df3], ignore_index=True)
row_concat

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6
7,a7,b7,c7,d7
8,a8,b8,c8,d8
9,a9,b9,c9,d9
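
To see why the repeated row labels can be a nuisance, here is a small sketch with two stand-in dataframes (`top` and `bottom` are hypothetical names). Without `ignore_index`, a label lookup returns more than one row.

```python
import pandas as pd

# two stand-in dataframes, each with the default 0..1 index
top = pd.DataFrame({"A": ["a0", "a1"]})
bottom = pd.DataFrame({"A": ["a2", "a3"]})

# plain stacking keeps the original labels, so 0 and 1 each appear twice
stacked = pd.concat([top, bottom])
print(stacked.index.is_unique)
print(stacked.loc[0])  # one label, two rows

# ignore_index=True relabels the rows 0..n-1
reset = pd.concat([top, bottom], ignore_index=True)
print(reset.index.tolist())
```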


### Add Columns 

Concatenating columns of dataframes is similar to doing so with rows; we just specify the axis to concatenate along.

In [27]:
# concatenate along columns but does it even make sense to do so?...
col_concat = pd.concat([df1, df2, df3], axis="columns")
col_concat

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11


The issue with doing this, as you can see, is that we kept the original column names, which results in duplicate columns. In most cases, this won't make any sense. Observe...

In [28]:
# duplicate cols
col_concat['A']

Unnamed: 0,A,A.1,A.2
0,a0,a4,a8
1,a1,a5,a9
2,a2,a6,a10
3,a3,a7,a11
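
One option for keeping duplicated names apart is the `keys` parameter of `pd.concat()`, which adds an outer label for each input. The sketch below uses two small stand-in dataframes; the labels `"left"` and `"right"` are arbitrary choices for illustration.

```python
import pandas as pd

# stand-in dataframes that share the same column names
left = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
right = pd.DataFrame({"A": ["a4", "a5"], "B": ["b4", "b5"]})

# plain column concatenation: selecting "A" returns two columns
plain = pd.concat([left, right], axis="columns")
print(plain["A"].shape)

# keys adds an outer column level, so each "A" can be addressed on its own
labeled = pd.concat([left, right], axis="columns", keys=["left", "right"])
print(labeled[("right", "A")].tolist())
```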


Recall there are a few different ways to add a *single* column to a dataframe.

In [29]:
# by passing a new column name and assigning it a Python list
col_concat['new_col_list'] = ['n1', 'n2', 'n3', 'n4']
col_concat

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2,new_col_list
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8,n1
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9,n2
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10,n3
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11,n4


In [30]:
# same thing but with a Series object
col_concat['new_col_series'] = pd.Series(['n1', 'n2', 'n3', 'n4'])
col_concat

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2,new_col_list,new_col_series
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8,n1,n1
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9,n2,n2
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10,n3,n3
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11,n4,n4
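
The list and the Series give the same result here only because the dataframe's row index happens to be 0 through 3. A list is assigned by position, while a Series is aligned on its index. A small sketch with a stand-in dataframe whose index deliberately starts at 10:

```python
import pandas as pd

# stand-in dataframe with row labels 10, 11, 12
df = pd.DataFrame({"A": ["a0", "a1", "a2"]}, index=[10, 11, 12])

# a list is assigned by position
df["from_list"] = ["n1", "n2", "n3"]

# a Series is aligned on its index (0, 1, 2), which matches none of 10, 11, 12
df["from_series"] = pd.Series(["n1", "n2", "n3"])

print(df)
```

If the Series shares the dataframe's index (for example by passing `index=df.index` when building it), the values line up as expected.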


Here's an example of concatenating along columns while ignoring the existing column names.

In [None]:
# join on columns and reset column indices
pd.concat([df1, df2, df3], axis="columns", ignore_index=True)
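
Since the cell above has no recorded output, here is a sketch of the same call on two small stand-in dataframes. With `axis="columns"`, `ignore_index=True` replaces the column names with integer positions.

```python
import pandas as pd

# stand-in dataframes with identical column names
left = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
right = pd.DataFrame({"A": ["a4", "a5"], "B": ["b4", "b5"]})

# the duplicated names A, B, A, B are replaced by 0, 1, 2, 3
renumbered = pd.concat([left, right], axis="columns", ignore_index=True)
print(renumbered.columns.tolist())
print(renumbered)
```

The duplicates are gone, but so are the descriptive names, so this is mostly useful when the columns will be renamed afterwards.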

### Concatenate with Different Indices

The previous examples assumed that new rows had the same column names and new columns had the same row indices. This means that our dataframes were *aligned*.

We'll now address what happens when the row and column indices aren't aligned and how we can still perform concatenations in that case.

#### Concatenate Rows with Different Columns

In [31]:
# modify dataframes for this section 
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']

In [32]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [33]:
df2

Unnamed: 0,E,F,G,H
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7


In [34]:
df3

Unnamed: 0,A,C,F,H
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11


Default row concatenation behavior results in the rows automatically aligning themselves along columns and any missing areas will be filled in with `NaN`. Usually, this isn't what we want.

In [35]:
row_concat = pd.concat([df1, df2, df3])
row_concat

Unnamed: 0,A,B,C,D,E,F,G,H
0,a0,b0,c0,d0,,,,
1,a1,b1,c1,d1,,,,
2,a2,b2,c2,d2,,,,
3,a3,b3,c3,d3,,,,
0,,,,,a4,b4,c4,d4
1,,,,,a5,b5,c5,d5
2,,,,,a6,b6,c6,d6
3,,,,,a7,b7,c7,d7
0,a8,,b8,,,c8,,d8
1,a9,,b9,,,c9,,d9


The first way to prevent all the `NaN` values from being included is to keep only those columns that are shared by every object in the list to be concatenated. We can use the `join` parameter.

`outer` will keep all the columns (the default and what we're trying to avoid) and `inner` will keep only the columns that are shared among the datasets. 

Note that *all* of the datasets must share a column for it to be kept. If no column is common to every dataset, the result will be a dataframe with no columns at all.

In [36]:
# none of the columns are shared across all datasets
pd.concat([df1, df2, df3], join='inner')

0
1
2
3
0
1
2
3
0
1
2


In [37]:
# example of dataframes with shared columns 
pd.concat([df1, df3], join='inner')

Unnamed: 0,A,C
0,a0,c0
1,a1,c1
2,a2,c2
3,a3,c3
0,a8,b8
1,a9,b9
2,a10,b10
3,a11,b11
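
Before reaching for `join='inner'`, it can help to check which columns two dataframes actually share. The sketch below rebuilds small stand-ins for the first and third dataframes (two rows each, so not the full data) and compares the shared column names with what the inner join keeps.

```python
import pandas as pd

# stand-ins using the same column names as df1 and df3 in this section
first = pd.DataFrame(
    [["a0", "b0", "c0", "d0"], ["a1", "b1", "c1", "d1"]],
    columns=["A", "B", "C", "D"],
)
third = pd.DataFrame(
    [["a8", "b8", "c8", "d8"], ["a9", "b9", "c9", "d9"]],
    columns=["A", "C", "F", "H"],
)

# columns present in both dataframes
shared = first.columns.intersection(third.columns)
print(shared.tolist())

# join="inner" keeps exactly those columns
inner = pd.concat([first, third], join="inner")
print(inner)
```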


#### Concatenate Columns with Different Rows 

In [38]:
# modify row indices 
df1.index = [0, 1, 2, 3]
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]

In [39]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [40]:
df2

Unnamed: 0,E,F,G,H
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6
7,a7,b7,c7,d7


In [41]:
df3

Unnamed: 0,A,C,F,H
0,a8,b8,c8,d8
2,a9,b9,c9,d9
5,a10,b10,c10,d10
7,a11,b11,c11,d11


Again, the default behavior when concatenating columns is to place the new columns at the end and align them along matching row indices. This results in `NaN` values wherever a dataframe has no values for a given row index.

In [42]:
# potentially undesirable column concatenation 
pd.concat([df1, df2, df3], axis="columns")

Unnamed: 0,A,B,C,D,E,F,G,H,A.1,C.1,F.1,H.1
0,a0,b0,c0,d0,,,,,a8,b8,c8,d8
1,a1,b1,c1,d1,,,,,,,,
2,a2,b2,c2,d2,,,,,a9,b9,c9,d9
3,a3,b3,c3,d3,,,,,,,,
4,,,,,a4,b4,c4,d4,,,,
5,,,,,a5,b5,c5,d5,a10,b10,c10,d10
6,,,,,a6,b6,c6,d6,,,,
7,,,,,a7,b7,c7,d7,a11,b11,c11,d11


Really, the solution to this is to keep only the rows whose indices match, again using `join="inner"`.

In this example, only the first and third dataframes share row indices (0 and 2), so concatenating just those two won't give us an empty dataset.

In [43]:
pd.concat([df1, df3], axis="columns", join="inner")

Unnamed: 0,A,B,C,D,A.1,C.1,F,H
0,a0,b0,c0,d0,a8,b8,c8,d8
2,a2,b2,c2,d2,a9,b9,c9,d9
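
Put another way, the default `outer` join keeps the union of the row indices and `inner` keeps their intersection. A minimal sketch with single-column stand-ins that reuse the row indices from this section:

```python
import pandas as pd

# stand-ins with the same row indices as the first and third dataframes
left = pd.DataFrame({"A": ["a0", "a1", "a2", "a3"]}, index=[0, 1, 2, 3])
right = pd.DataFrame({"F": ["c8", "c9", "c10", "c11"]}, index=[0, 2, 5, 7])

# default (outer): every row index from either dataframe, NaN where missing
outer = pd.concat([left, right], axis="columns")
print(sorted(outer.index))

# inner: only the row indices found in both dataframes
inner = pd.concat([left, right], axis="columns", join="inner")
print(inner)
```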


So, this is clearly something that is going to take some actual practice executing. Find some datasets to mess with and practice in a low-stakes environment.

---

## Observational Units Across Multiple Tables

There may be times when our data is split across multiple files. This can occur due to the size of the files, the data collection process, or any other reason. The point is that our data won't always be together in one file and we'll need to assemble data from multiple sources.

To demonstrate this, we'll use billboard data that has observational units spread out over weeks. The first thing we'll need to do is to load all our data sources and assemble them together using Python's file manipulation facilities. 

**ISSUE** The data used for this section isn't presently available, so without it, it's basically useless. Moving on...
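
Even without the billboard files, the general pattern can be sketched with stand-in data. The example below writes three tiny CSV files to a temporary directory, finds them with `glob`, and stacks them. The file names, column names, and values are all made up for illustration and are not the billboard data.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # write three small stand-in files (hypothetical names and contents)
    for week in (1, 2, 3):
        part = pd.DataFrame(
            {"track": ["song_a", "song_b"], "week": week, "rank": [week, week + 1]}
        )
        part.to_csv(os.path.join(tmp_dir, f"part_week{week}.csv"), index=False)

    # find every matching file, sorted so the load order is predictable
    paths = sorted(glob.glob(os.path.join(tmp_dir, "part_week*.csv")))

    # load each file into its own dataframe, then stack them
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```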

---

## Merge Multiple Datasets

Sometimes we might want to merge dataframes based on common data values, meaning that rows from each dataframe are matched up wherever they hold the same value in a key column. If we want to do this based on columns, we can use the `.merge()` method.

Consider using the `.join()` method to merge dataframes by row index.

In [48]:

# import some data for use to merge
person = pd.read_csv("./survey_person.csv")
site = pd.read_csv("./survey_site.csv")
survey = pd.read_csv("./survey_survey.csv")
visited = pd.read_csv("./survey_visited.csv")

In [49]:
person

Unnamed: 0,ident,personal,family
0,dyer,William,Dyer
1,pb,Frank,Pabodie
2,lake,Anderson,Lake
3,roe,Valentina,Roerich
4,danforth,Frank,Danforth


In [50]:
site

Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [51]:
visited

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,
6,837,MSK-4,1932-01-14
7,844,DR-1,1932-03-22


In [52]:
survey

Unnamed: 0,taken,person,quant,reading
0,619,dyer,rad,9.82
1,619,dyer,sal,0.13
2,622,dyer,rad,7.8
3,622,dyer,sal,0.09
4,734,pb,rad,8.41
5,734,lake,sal,0.05
6,734,pb,temp,-21.5
7,735,pb,rad,7.22
8,735,,sal,0.06
9,735,,temp,-26.0


These datasets represent four different observational units. If we want to look at different aspects of one unit but some of that data is contained in another dataframe, we'll need to combine them. 

We can do so using Pandas' `merge()` method with the following syntax: `left.merge(right, how)`
* `left` refers to the dataframe the method is called on 
* `right` refers to the first argument of the method call
* `how` specifies the type of join used to merge them, such as `inner` (the default), `left`, `right`, or `outer`
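
To get a feel for what `how` changes, the sketch below merges two small stand-in dataframes with different join types. The site `XX-9` and visit `999` are invented so that each side has one key the other lacks.

```python
import pandas as pd

# stand-in data: MSK-4 has no visit, and visit 999 points at an unknown site
site = pd.DataFrame(
    {"name": ["DR-1", "DR-3", "MSK-4"], "lat": [-49.85, -47.15, -48.87]}
)
visited = pd.DataFrame(
    {"ident": [619, 734, 999], "site": ["DR-1", "DR-3", "XX-9"]}
)

# run the same merge with each join type and compare the row counts
for how in ("inner", "left", "right", "outer"):
    merged = site.merge(visited, how=how, left_on="name", right_on="site")
    print(how, len(merged))

inner = site.merge(visited, how="inner", left_on="name", right_on="site")
outer = site.merge(visited, how="outer", left_on="name", right_on="site")
```

`inner` keeps only keys found on both sides, `left` and `right` keep every key from one side, and `outer` keeps every key from both.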

### One-to-One Merge 

The simplest merge is when we want to merge two dataframes on a single shared column and there are no duplicate values.

One key piece of information is that, if we want to merge dataframes on a column with the same data but with different names, we can use the `left_on` and `right_on` parameters to specify the column names for each.

In [64]:
# modify our dataset to remove duplicate site values
visited_subset = visited.loc[[0, 2, 6], :]
visited_subset

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
2,734,DR-3,1939-01-07
6,837,MSK-4,1932-01-14


In [65]:
# get a count of the values in the site column 
visited["site"].value_counts()

site
DR-3     4
DR-1     3
MSK-4    1
Name: count, dtype: int64

In [66]:
visited_subset["site"].value_counts()

site
DR-1     1
DR-3     1
MSK-4    1
Name: count, dtype: int64

We'll now demonstrate merging datasets using an `inner` join which keeps only the keys that exist in both dataframes.

In [69]:
# an inner join is the default for `how`, so it doesn't need to be specified
one2one_merge = site.merge(
    visited_subset, left_on="name", right_on="site"
)

one2one_merge

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
2,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14


You can see from the output that we merged the two datasets by combining them where the column values for `name` and `site` are equal. 
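
Because `left_on` and `right_on` name different columns, both key columns survive the merge and carry identical values. If that redundancy is unwanted, one option is to drop one of them afterwards. A sketch with stand-in data shaped like `site` and `visited_subset`:

```python
import pandas as pd

# stand-ins shaped like site and visited_subset
site = pd.DataFrame(
    {
        "name": ["DR-1", "DR-3", "MSK-4"],
        "lat": [-49.85, -47.15, -48.87],
        "long": [-128.57, -126.72, -123.4],
    }
)
visited_subset = pd.DataFrame(
    {
        "ident": [619, 734, 837],
        "site": ["DR-1", "DR-3", "MSK-4"],
        "dated": ["1927-02-08", "1939-01-07", "1932-01-14"],
    }
)

# merge on the differently named key columns, then drop the duplicate key
merged = site.merge(visited_subset, left_on="name", right_on="site")
tidy = merged.drop(columns="site")
print(tidy)
```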

### Many-to-One Merge

This is similar to the previous merge, except that *one of the dataframes has key values that contain duplicates.*

Observe the `visited` dataframe's site column contains repeated values and this is the same column we'd like to join on.

In [77]:
# count how often each site appears in visited
visited['site'].value_counts()

site
DR-3     4
DR-1     3
MSK-4    1
Name: count, dtype: int64

What will happen is that each row of the dataframe holding a single observation per key (here `site`) gets repeated in the merge, once for every matching row in the other dataframe.

In [79]:
m2o_merge = site.merge(visited, left_on="name", right_on="site")
m2o_merge

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-1,-49.85,-128.57,622,DR-1,1927-02-10
2,DR-1,-49.85,-128.57,844,DR-1,1932-03-22
3,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
4,DR-3,-47.15,-126.72,735,DR-3,1930-01-12
5,DR-3,-47.15,-126.72,751,DR-3,1930-02-26
6,DR-3,-47.15,-126.72,752,DR-3,
7,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14
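
If we want pandas to confirm which kind of merge we are doing, `merge()` accepts a `validate` argument and raises `pandas.errors.MergeError` when the keys don't fit the stated relationship. A sketch on a trimmed-down stand-in for `site` and `visited`:

```python
import pandas as pd
from pandas.errors import MergeError

# stand-ins: one row per site, several visits per site
site = pd.DataFrame({"name": ["DR-1", "DR-3"], "lat": [-49.85, -47.15]})
visited = pd.DataFrame(
    {"ident": [619, 622, 734], "site": ["DR-1", "DR-1", "DR-3"]}
)

# keys are unique on the left and repeated on the right, so this passes
m2o = site.merge(
    visited, left_on="name", right_on="site", validate="one_to_many"
)
print(len(m2o))

# claiming a one-to-one relationship fails because DR-1 repeats in visited
try:
    site.merge(visited, left_on="name", right_on="site", validate="one_to_one")
    raised = False
except MergeError:
    raised = True
print(raised)
```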


### Many-to-Many Merge 

**Note:** In practice this produces a Cartesian product of the matching rows for each key value, so we're unlikely to have much use for it.

In this case we want to merge on one or more columns, and there are duplicates in the keys for both the left and right dataframes.

As an example, we'll first create two joined datasets and then attempt to join those.

In [84]:
# merge person and survey on 'ident' and 'person'
person_survey = person.merge(
    survey,
    left_on='ident', 
    right_on='person'
)

person_survey

Unnamed: 0,ident,personal,family,taken,person,quant,reading
0,dyer,William,Dyer,619,dyer,rad,9.82
1,dyer,William,Dyer,619,dyer,sal,0.13
2,dyer,William,Dyer,622,dyer,rad,7.8
3,dyer,William,Dyer,622,dyer,sal,0.09
4,pb,Frank,Pabodie,734,pb,rad,8.41
5,pb,Frank,Pabodie,734,pb,temp,-21.5
6,pb,Frank,Pabodie,735,pb,rad,7.22
7,pb,Frank,Pabodie,751,pb,rad,4.35
8,pb,Frank,Pabodie,751,pb,temp,-18.5
9,lake,Anderson,Lake,734,lake,sal,0.05


In [86]:
# join visited and survey
visited_survey = visited.merge(
    survey,
    left_on='ident',
    right_on='taken'
)

visited_survey

Unnamed: 0,ident,site,dated,taken,person,quant,reading
0,619,DR-1,1927-02-08,619,dyer,rad,9.82
1,619,DR-1,1927-02-08,619,dyer,sal,0.13
2,622,DR-1,1927-02-10,622,dyer,rad,7.8
3,622,DR-1,1927-02-10,622,dyer,sal,0.09
4,734,DR-3,1939-01-07,734,pb,rad,8.41
5,734,DR-3,1939-01-07,734,lake,sal,0.05
6,734,DR-3,1939-01-07,734,pb,temp,-21.5
7,735,DR-3,1930-01-12,735,pb,rad,7.22
8,735,DR-3,1930-01-12,735,,sal,0.06
9,735,DR-3,1930-01-12,735,,temp,-26.0


In [88]:
# merge the two 
ps_vs = person_survey.merge(
    visited_survey,
    on="quant"
)

ps_vs

Unnamed: 0,ident_x,personal,family,taken_x,person_x,quant,reading_x,ident_y,site,dated,taken_y,person_y,reading_y
0,dyer,William,Dyer,619,dyer,rad,9.82,619,DR-1,1927-02-08,619,dyer,9.82
1,dyer,William,Dyer,619,dyer,rad,9.82,622,DR-1,1927-02-10,622,dyer,7.80
2,dyer,William,Dyer,619,dyer,rad,9.82,734,DR-3,1939-01-07,734,pb,8.41
3,dyer,William,Dyer,619,dyer,rad,9.82,735,DR-3,1930-01-12,735,pb,7.22
4,dyer,William,Dyer,619,dyer,rad,9.82,751,DR-3,1930-02-26,751,pb,4.35
...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,roe,Valentina,Roerich,844,roe,rad,11.25,735,DR-3,1930-01-12,735,pb,7.22
144,roe,Valentina,Roerich,844,roe,rad,11.25,751,DR-3,1930-02-26,751,pb,4.35
145,roe,Valentina,Roerich,844,roe,rad,11.25,752,DR-3,,752,lake,2.19
146,roe,Valentina,Roerich,844,roe,rad,11.25,837,MSK-4,1932-01-14,837,lake,1.46
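
The row count of a many-to-many merge can be predicted: for each key value, the number of output rows is the count on the left multiplied by the count on the right. A small sketch with stand-in dataframes (the `left_val` and `right_val` columns are made up):

```python
import pandas as pd

# "rad" appears twice on the left and three times on the right
left = pd.DataFrame({"quant": ["rad", "rad", "sal"], "left_val": [1, 2, 3]})
right = pd.DataFrame(
    {"quant": ["rad", "rad", "rad", "sal"], "right_val": [10, 20, 30, 40]}
)

merged = left.merge(right, on="quant")

# per-key counts multiply: rad gives 2 * 3 rows, sal gives 1 * 1 row
left_counts = left["quant"].value_counts()
right_counts = right["quant"].value_counts()
expected_rows = int((left_counts * right_counts).dropna().sum())

print(len(merged), expected_rows)
```

With only a handful of distinct key values and many rows per key, this product grows quickly, which is why the merged output above is so much longer than either input.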


### Check Your Work with Assert 

A good way to check that we didn't make an error when merging is by looking at the number of rows of our data before and after the merge. If you end up with **more** rows than either dataframe, then a many-to-many merge has occurred and this is basically never what we'll want. 

In [89]:
# use assert to compare the number of rows in the merged dataset
# to the number of rows in one of the datasets that went into the merge
assert ps_vs.shape[0] <= visited_survey.shape[0]

AssertionError: 

As you can see, the assertion failed, so we know the condition doesn't hold: the merged dataset has more rows than the dataframe we compared it against, which tells us a many-to-many merge has occurred.
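
The same check can be wrapped in a small helper so that it runs on every merge. `merge_checked` is a hypothetical function written for this sketch, not part of pandas, and it uses stand-in dataframes rather than the survey data.

```python
import pandas as pd


def merge_checked(left, right, **merge_kwargs):
    """Merge two dataframes and assert the result did not grow past both inputs."""
    merged = left.merge(right, **merge_kwargs)
    limit = max(len(left), len(right))
    assert len(merged) <= limit, (
        f"merge produced {len(merged)} rows from inputs of "
        f"{len(left)} and {len(right)} rows"
    )
    return merged


# one-to-many: the key is unique on one side, so the check passes
one = pd.DataFrame({"key": ["k1", "k2"], "x": [1, 2]})
many = pd.DataFrame({"key": ["k1", "k1", "k2"], "y": [10, 20, 30]})
ok = merge_checked(one, many, on="key")

# many-to-many: "k1" repeats on both sides, so the check fails
left_dup = pd.DataFrame({"key": ["k1", "k1", "k2"], "x": [1, 2, 3]})
right_dup = pd.DataFrame({"key": ["k1", "k1", "k2"], "y": [10, 20, 30]})
try:
    merge_checked(left_dup, right_dup, on="key")
    failed = False
except AssertionError:
    failed = True

print(len(ok), failed)
```

Note that this only guards inner-style merges where growth signals a problem. An outer join can legitimately return more rows than either input.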