# Data Assembly

This notebook focuses on assembling a dataset for analysis by combining several smaller datasets.

---

## Combine Datasets 

Think about what happens when we normalize our data to be stored in a database. We often end up having to break up our data into smaller tables to store relevant information together and to reduce redundancy. 

The same thing happens with our datasets. We may have a large dataset that we need to break apart into smaller datasets for various reasons. When it comes time to perform an analysis, we then need to figure out how to recombine the relevant parts.

---

## Concatenation 

Concatenation can be thought of as appending rows or columns to our data. This is useful when the data was split into parts, or when you have performed some calculation that you want to append to your existing dataset.

In [1]:
import pandas as pd

df1 = pd.read_csv("./concat_1.csv")
df2 = pd.read_csv("./concat_2.csv")
df3 = pd.read_csv("./concat_3.csv")

In [2]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [3]:
df2

Unnamed: 0,A,B,C,D
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7


In [4]:
df3

Unnamed: 0,A,B,C,D
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11


Use the `pd.concat()` function to concatenate dataframes.

### Add Rows

Stacking dataframes on top of each other uses Pandas `concat()` with the dataframes passed in as a Python list.

In [5]:
row_concat = pd.concat([df1, df2, df3])
row_concat

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11
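
Notice that the stacked result keeps each dataframe's original row labels, so the labels repeat. The sketch below illustrates what that means for lookups. It uses two small hypothetical frames (`left` and `right`, built inline as stand-ins for the CSV files) rather than the real data, so treat it as an illustration under that assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the CSV-backed frames
# (assumption: same columns, default 0..n-1 row index).
left = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
right = pd.DataFrame({"A": ["a4", "a5"], "B": ["b4", "b5"]})

stacked = pd.concat([left, right])

# Each frame keeps its own row labels, so the labels repeat.
index_labels = stacked.index.tolist()
print(index_labels)

# Label-based lookup returns every row labelled 0 (one per input frame).
rows_labelled_zero = stacked.loc[0]
print(rows_labelled_zero)

# Position-based lookup returns exactly one row.
first_row = stacked.iloc[0]
print(first_row)
```

With repeated labels, `.loc` selects by label and may return several rows, while `.iloc` selects by position and returns one.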


Concatenating a Series object onto a dataframe doesn't add it as a new row the way we might expect. Observe...

In [6]:
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
new_row_series

0    n1
1    n2
2    n3
3    n4
dtype: object

In [7]:
# attempt to add this row to the dataframe 
pd.concat([df1, new_row_series])

Unnamed: 0,A,B,C,D,0
0,a0,b0,c0,d0,
1,a1,b1,c1,d1,
2,a2,b2,c2,d2,
3,a3,b3,c3,d3,
0,,,,,n1
1,,,,,n2
2,,,,,n3
3,,,,,n4


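You'll see that it created a new column instead of a new row. The Series has no name, so it becomes a column labeled `0`, which doesn't match any of the existing columns, and its values land in new rows that are empty everywhere else.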

What we want to do is convert that series into a dataframe and specify the column names each value will bind to.

In [10]:
new_row_df = pd.DataFrame(
    data=[["n1", "n2", "n3", "n4"]],
    columns=["A", "B", "C", "D"]
)

new_row_df

Unnamed: 0,A,B,C,D
0,n1,n2,n3,n4


In [11]:
# now we can concatenate properly
pd.concat([df1, new_row_df])

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,n1,n2,n3,n4


So the takeaway here is that, to add rows cleanly, we should concatenate dataframes with dataframes whose column names match. Like with like.
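
If the new row already lives in a Series, one possible shortcut is to give the Series an index made of the target column names and transpose it into a one-row dataframe. This is a minimal sketch on a small hypothetical frame (`df`, with only columns `A` and `B`), not the notebook's own data.

```python
import pandas as pd

# Hypothetical two-column frame standing in for df1 (assumption).
df = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})

# The Series index carries the column name each value should bind to.
new_row = pd.Series({"A": "n1", "B": "n2"})

# to_frame() gives a single-column frame; .T turns it into a single row.
new_row_frame = new_row.to_frame().T

result = pd.concat([df, new_row_frame])
print(result)
```

Because the transposed frame has columns `A` and `B`, the values line up with the existing columns instead of creating a new one.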

#### Ignore the index

I personally like this better: we can choose to ignore the existing row index whenever we combine dataframes. This resets the row labels to a clean sequence starting at 0, so they actually make sense.

This is accomplished using the `ignore_index` parameter.

In [12]:
# stack our dataframes and reset the row indices
row_concat = pd.concat([df1, df2, df3], ignore_index=True)
row_concat

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6
7,a7,b7,c7,d7
8,a8,b8,c8,d8
9,a9,b9,c9,d9
10,a10,b10,c10,d10
11,a11,b11,c11,d11
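
For comparison, the same renumbering can be done after the fact with `reset_index(drop=True)`. The sketch below checks that both routes give the same result on two small hypothetical frames (`top` and `bottom`, built inline as an assumption in place of the CSV files).

```python
import pandas as pd

# Hypothetical stand-ins for the CSV-backed frames (assumption).
top = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
bottom = pd.DataFrame({"A": ["a4", "a5"], "B": ["b4", "b5"]})

# Route 1: renumber the rows while concatenating.
with_flag = pd.concat([top, bottom], ignore_index=True)

# Route 2: concatenate first, then discard the old labels.
with_reset = pd.concat([top, bottom]).reset_index(drop=True)

new_labels = with_flag.index.tolist()
same_result = with_flag.equals(with_reset)
print(new_labels)
print(same_result)
```

Passing `ignore_index=True` is the shorter of the two; `reset_index(drop=True)` is handy when the concatenated frame already exists.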


### Add Columns 

TODO: Complete this later.