# Tutorial 7 B: Merging DataFrames

This part discusses the basic and essential practical steps for integrating data from various sources.

Working with data from several sources usually means integrating those sources to get a complete view of the information. For example, if you collect house price data separately for each suburb, you may need to "**concatenate**" the data to see house prices across Victoria, or even Australia. However, each data collection **might have different attributes**. How do we merge them into a complete yet precise (no duplication) representation of the whole collection? Consider another example from a sales department, where customer details are in one table and product details are in another. How can we use Python (the `pandas` library) to join the two `DataFrame` objects and manage the many-to-many relationship? In short, this is about applying database techniques for merging tables to general data, using Python.

The pandas library offers methods for the data integration task. In the following, we discuss each method with examples.

There are **four ways** to combine DataFrames in pandas: concatenating, appending, merging and joining. Each has its own use cases and best practices. We present **concatenating** and **appending** in this tutorial; the other two methods are discussed in the next tutorial. In all of these methods, we assume the data has been fetched or scraped from the web (as explained in Part A) or is stored locally on the machine. We also assume that the data from each source is stored in a DataFrame. We start with the easiest way to combine DataFrames: concatenating.

## Methods for integrating data with Pandas:

## 1. Concatenating:
This refers to gluing together data from different DataFrames by **stacking them either vertically or side by side**. Consider the following example.


### Create a dataframe

In [1]:
#Import module
import pandas as pd

# Create a dataframe
df1= pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                      'First_Name': ['A1','A2','A3','A4'],
                    'Last_Name': ['B1', 'B2', 'B3', 'B4']},
                  index=[1,2,3,4])
df1

Unnamed: 0,First_Name,Last_Name,Student_ID
1,A1,B1,1
2,A2,B2,2
3,A3,B3,3
4,A4,B4,4


### Create a second dataframe


In [2]:
df2= pd.DataFrame({'Student_ID': ['4', '5', '6', '7','8'],
                      'First_Name': ['A4','A5','A6','A7','A8'],
                    'Last_Name': ['B4', 'B5', 'B6', 'B7','b8']},
                 index= [4,5,6,7,8])
df2

Unnamed: 0,First_Name,Last_Name,Student_ID
4,A4,B4,4
5,A5,B5,5
6,A6,B6,6
7,A7,B7,7
8,A8,b8,8


### Create a third dataframe

In [5]:
df3= pd.DataFrame({'Student_ID': ['9', '10', '11', '12'],
                    'Last_Name': ['B9', 'B10', 'B11', 'B12'],
                  'address':['AD9','AD10','AD11','AD12']})
# without indexing
df3

Unnamed: 0,Last_Name,Student_ID,address
0,B9,9,AD9
1,B10,10,AD10
2,B11,11,AD11
3,B12,12,AD12


### Join the two dataframes by `concat()` 
#### Along rows (Vertically)

In [14]:
concat_1_2= pd.concat([df1,df2])
concat_1_2

Unnamed: 0,First_Name,Last_Name,Student_ID
1,A1,B1,1
2,A2,B2,2
3,A3,B3,3
4,A4,B4,4
4,A4,B4,4
5,A5,B5,5
6,A6,B6,6
7,A7,B7,7
8,A8,b8,8


#### Along columns (Horizontally)

In [12]:
concat_1_2= pd.concat([df1,df2], axis=1) # axis=1 places the frames side by side (columns); axis=0, the default, stacks rows
concat_1_2

Unnamed: 0,First_Name,Last_Name,Student_ID,First_Name.1,Last_Name.1,Student_ID.1
1,A1,B1,1.0,,,
2,A2,B2,2.0,,,
3,A3,B3,3.0,,,
4,A4,B4,4.0,A4,B4,4.0
5,,,,A5,B5,5.0
6,,,,A6,B6,6.0
7,,,,A7,B7,7.0
8,,,,A8,b8,8.0


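\* **Notice the data at row 4**: index 4 is the only label present in both DataFrames, so it is the only row with values from both.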

In [15]:
concat_1_2_3= pd.concat([df1,df2,df3])
concat_1_2_3

Unnamed: 0,First_Name,Last_Name,Student_ID,address
1,A1,B1,1,
2,A2,B2,2,
3,A3,B3,3,
4,A4,B4,4,
4,A4,B4,4,
5,A5,B5,5,
6,A6,B6,6,
7,A7,B7,7,
8,A8,b8,8,
0,,B9,9,AD9
1,,B10,10,AD10
2,,B11,11,AD11
3,,B12,12,AD12


As this example shows, concatenation glues DataFrames together **without checking the index of any of them** or **the duplication this might cause** (index 4 appears twice above). When concatenating vertically, ideally **all DataFrames have the same headers**. Otherwise, the result contains the union of the headers, and any column that **does not exist** in a given DataFrame is filled with `NaN` for that DataFrame's rows, as shown when df3 was concatenated with df1 and df2.
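Because concatenation does not remove repeated records, it can help to check for them afterwards. The following is a minimal sketch, assuming that a "duplicate" means a row whose column values all match an earlier row; the names `stacked`, `repeated_rows` and `deduplicated` are illustrative only.

```python
import pandas as pd

# Same data as df1 and df2 above, rebuilt so this block runs on its own.
df1 = pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                    'First_Name': ['A1', 'A2', 'A3', 'A4'],
                    'Last_Name': ['B1', 'B2', 'B3', 'B4']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'Student_ID': ['4', '5', '6', '7', '8'],
                    'First_Name': ['A4', 'A5', 'A6', 'A7', 'A8'],
                    'Last_Name': ['B4', 'B5', 'B6', 'B7', 'b8']},
                   index=[4, 5, 6, 7, 8])

stacked = pd.concat([df1, df2])

# Rows whose column values repeat an earlier row
# (student 4 is present in both sources).
repeated_rows = stacked[stacked.duplicated()]
print(repeated_rows)

# Keep the first occurrence of each row and drop the repeats.
deduplicated = stacked.drop_duplicates()
print(deduplicated)
```

Note that `duplicated()` and `drop_duplicates()` compare column values, not index labels, so they only catch records that are identical in every column.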

Note also that df3 was created without an explicit index, so it carries the **default index** (0, 1, 2, ...), and concatenation keeps those labels as they are.  
Use the `ignore_index` argument when you want to concatenate the DataFrames and **ignore the original indexes** (because they are not meaningful); the result is then renumbered from 0.

In [16]:
result = pd.concat([df1, df3], ignore_index=True)
print(df1)
print(df3)
print(result)

  First_Name Last_Name Student_ID
1         A1        B1          1
2         A2        B2          2
3         A3        B3          3
4         A4        B4          4
  Last_Name Student_ID address
0        B9          9     AD9
1       B10         10    AD10
2       B11         11    AD11
3       B12         12    AD12
  First_Name Last_Name Student_ID address
0         A1        B1          1     NaN
1         A2        B2          2     NaN
2         A3        B3          3     NaN
3         A4        B4          4     NaN
4        NaN        B9          9     AD9
5        NaN       B10         10    AD10
6        NaN       B11         11    AD11
7        NaN       B12         12    AD12
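If the index labels are meaningful and should not repeat, the opposite of `ignore_index` may be what you want: asking pandas to refuse overlapping labels. The sketch below uses the `verify_integrity` argument of `pd.concat` for this; the flag name `overlap_detected` is illustrative only.

```python
import pandas as pd

# Same data as df1 and df2 above; both contain the index label 4.
df1 = pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                    'First_Name': ['A1', 'A2', 'A3', 'A4'],
                    'Last_Name': ['B1', 'B2', 'B3', 'B4']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'Student_ID': ['4', '5', '6', '7', '8'],
                    'First_Name': ['A4', 'A5', 'A6', 'A7', 'A8'],
                    'Last_Name': ['B4', 'B5', 'B6', 'B7', 'b8']},
                   index=[4, 5, 6, 7, 8])

try:
    # verify_integrity=True makes concat raise if the resulting index
    # would contain duplicate labels.
    pd.concat([df1, df2], verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    overlap_detected = True
    print('Concatenation refused:', err)
```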


Concatenation is very useful when you have data (**with the same attributes**) coming **from different sources**, e.g., house prices collected from each suburb, which we glue together to get a view of house prices all over Victoria. We can also **add another** **key** (an outer index level rather than a regular column) **to indicate the source of each chunk of data in a hierarchical way**. The example below illustrates this:

In [18]:
concat_1_2= pd.concat([df1,df2], keys=['source1','source2'])
print(concat_1_2)

          First_Name Last_Name Student_ID
source1 1         A1        B1          1
        2         A2        B2          2
        3         A3        B3          3
        4         A4        B4          4
source2 4         A4        B4          4
        5         A5        B5          5
        6         A6        B6          6
        7         A7        B7          7
        8         A8        b8          8
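Once the keys are in place, they can be used to pull one source back out, or be turned into an ordinary column. The following is a rough sketch; the level names `source` and `original_index` passed through the `names` argument are assumptions chosen for illustration.

```python
import pandas as pd

# Same data as df1 and df2 above, rebuilt so this block runs on its own.
df1 = pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                    'First_Name': ['A1', 'A2', 'A3', 'A4'],
                    'Last_Name': ['B1', 'B2', 'B3', 'B4']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'Student_ID': ['4', '5', '6', '7', '8'],
                    'First_Name': ['A4', 'A5', 'A6', 'A7', 'A8'],
                    'Last_Name': ['B4', 'B5', 'B6', 'B7', 'b8']},
                   index=[4, 5, 6, 7, 8])

# keys labels each chunk; names gives the two index levels readable names.
concat_1_2 = pd.concat([df1, df2],
                       keys=['source1', 'source2'],
                       names=['source', 'original_index'])

# Select every row that came from the second source.
from_source2 = concat_1_2.loc['source2']
print(from_source2)

# Move the outer index level into a regular column named 'source'.
labelled = concat_1_2.reset_index(level='source')
print(labelled)
```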


An important use case for concatenation is when different perspectives of the same data are collected separately, and we need to bring all the information together in one DataFrame; for example, a DataFrame of landlord profiles and another DataFrame of property details. The following example illustrates this with the student names in df1 and the address and year details in df4. The concatenation in this case has to be done along the horizontal axis (`axis=1`) rather than the vertical one, which is the default (`axis=0`).  


In [19]:
df4=pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                      'Addres': ['AD1','AD2','AD3','AD4'],
                    'year': ['Y1', 'Y2', 'Y3', 'Y4']},
                 index=[1,2,3,4]  # a list keeps the row labels in a defined order (a set is unordered)
                 )
concat_1_4_horizontal= pd.concat([df1,df4],axis=1)
print(concat_1_4_horizontal)
concat_2_4_horizontal= pd.concat([df2,df4],axis=1)
print(concat_2_4_horizontal)

  First_Name Last_Name Student_ID Addres Student_ID year
1         A1        B1          1    AD1          1   Y1
2         A2        B2          2    AD2          2   Y2
3         A3        B3          3    AD3          3   Y3
4         A4        B4          4    AD4          4   Y4
  First_Name Last_Name Student_ID Addres Student_ID year
1        NaN       NaN        NaN    AD1          1   Y1
2        NaN       NaN        NaN    AD2          2   Y2
3        NaN       NaN        NaN    AD3          3   Y3
4         A4        B4          4    AD4          4   Y4
5         A5        B5          5    NaN        NaN  NaN
6         A6        B6          6    NaN        NaN  NaN
7         A7        B7          7    NaN        NaN  NaN
8         A8        b8          8    NaN        NaN  NaN


Concatenating df2 and df4 produces many `NaN` values because their **indexes barely overlap** (only label 4 is shared). Controlling how the other axis is joined is therefore very useful in this case. Concatenation gives you three options for handling the other axis (the one we are not concatenating along): the **outer** or **inner** join options, or alignment to a specific index.   
* The **outer** option is the default; it takes the union of the index labels and guarantees zero loss. The previous example shows the outer join of df2 and df4.  
* On the other hand, **inner** takes the **intersection of the two indexes**.
* Last but not least, the `join_axes` argument aligns the result to a specific index. Note that `join_axes` was removed in pandas 1.0; on newer versions, reindex one of the inputs (or the result) to the desired index instead.

In [24]:
joint_result= pd.concat([df2, df4], axis=1, join='inner') #get the intersection
print(joint_result)

# join_axes was removed in pandas 1.0; reindexing df4 to df2.index gives the same alignment
joint_result2= pd.concat([df2, df4.reindex(df2.index)], axis=1) # align on df2.index
print('\ndf2')
print(df2)
print('\ndf4')
print(df4)
print('\n')
print(joint_result2)

  First_Name Last_Name Student_ID Addres Student_ID year
4         A4        B4          4    AD4          4   Y4

df2
  First_Name Last_Name Student_ID
4         A4        B4          4
5         A5        B5          5
6         A6        B6          6
7         A7        B7          7
8         A8        b8          8

df4
  Addres Student_ID year
1    AD1          1   Y1
2    AD2          2   Y2
3    AD3          3   Y3
4    AD4          4   Y4


  First_Name Last_Name Student_ID Addres Student_ID year
4         A4        B4          4    AD4          4   Y4
5         A5        B5          5    NaN        NaN  NaN
6         A6        B6          6    NaN        NaN  NaN
7         A7        B7          7    NaN        NaN  NaN
8         A8        b8          8    NaN        NaN  NaN
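The `join` option always applies to the axis that is *not* being concatenated. The examples above concatenate along columns, so `join` acts on the row index. As a complementary sketch, when stacking rows (`axis=0`), `join='inner'` keeps only the columns that every DataFrame shares. The name `shared_columns_only` is illustrative.

```python
import pandas as pd

# Same data as df1 and df3 above; they share Student_ID and Last_Name only.
df1 = pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                    'First_Name': ['A1', 'A2', 'A3', 'A4'],
                    'Last_Name': ['B1', 'B2', 'B3', 'B4']},
                   index=[1, 2, 3, 4])
df3 = pd.DataFrame({'Student_ID': ['9', '10', '11', '12'],
                    'Last_Name': ['B9', 'B10', 'B11', 'B12'],
                    'address': ['AD9', 'AD10', 'AD11', 'AD12']})

# Stack rows, keeping only the columns present in both frames,
# so no NaN padding is introduced.
shared_columns_only = pd.concat([df1, df3], join='inner', ignore_index=True)
print(shared_columns_only)
```

The trade-off is the reverse of the outer join: there are no `NaN` values, but `First_Name` and `address` are dropped from the result.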


*** 
## 2. Append:
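The `append` method of Series and DataFrames is a **shortcut for concatenating**. It is easy to use **but not efficient in terms of performance**. Note that `append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below runs only on older versions; `pd.concat` is the replacement.
> When appending a DataFrame, the original DataFrame **remains in memory and a new appended one is created**. In the appending process, **the indexes should be disjoint but the columns do not need to be**; overlapping index labels are not rejected by default, as the duplicated label 4 in the output below shows.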

In [27]:
# Requires pandas < 2.0; on newer versions use pd.concat([df1, df2, df3]) instead
appended_df= df1.append([df2,df3])
print('\ndf1')
print(df1)
print('\ndf2')
print(df2)
print('\ndf3')
print(df3)
print('\n')
print(appended_df)


df1
  First_Name Last_Name Student_ID
1         A1        B1          1
2         A2        B2          2
3         A3        B3          3
4         A4        B4          4

df2
  First_Name Last_Name Student_ID
4         A4        B4          4
5         A5        B5          5
6         A6        B6          6
7         A7        B7          7
8         A8        b8          8

df3
  Last_Name Student_ID address
0        B9          9     AD9
1       B10         10    AD10
2       B11         11    AD11
3       B12         12    AD12


  First_Name Last_Name Student_ID address
1         A1        B1          1     NaN
2         A2        B2          2     NaN
3         A3        B3          3     NaN
4         A4        B4          4     NaN
4         A4        B4          4     NaN
5         A5        B5          5     NaN
6         A6        B6          6     NaN
7         A7        B7          7     NaN
8         A8        b8          8     NaN
0        NaN        B9          9     AD9
1        NaN       B10         10    AD10
2        NaN       B11         11    AD11
3        NaN       B12         12    AD12
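On pandas 2.0 or later, where `DataFrame.append` is no longer available, the same result can be produced with `pd.concat`. A minimal sketch, with `combined` as an illustrative name:

```python
import pandas as pd

# Same data as df1, df2 and df3 above, rebuilt so this block runs on its own.
df1 = pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                    'First_Name': ['A1', 'A2', 'A3', 'A4'],
                    'Last_Name': ['B1', 'B2', 'B3', 'B4']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'Student_ID': ['4', '5', '6', '7', '8'],
                    'First_Name': ['A4', 'A5', 'A6', 'A7', 'A8'],
                    'Last_Name': ['B4', 'B5', 'B6', 'B7', 'b8']},
                   index=[4, 5, 6, 7, 8])
df3 = pd.DataFrame({'Student_ID': ['9', '10', '11', '12'],
                    'Last_Name': ['B9', 'B10', 'B11', 'B12'],
                    'address': ['AD9', 'AD10', 'AD11', 'AD12']})

# Equivalent of df1.append([df2, df3]): stack all three frames row-wise,
# keeping each frame's original index labels.
combined = pd.concat([df1, df2, df3])
print(combined)
```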

You can also concatenate a mix of Series and DataFrames. Concat (and therefore append) makes a full copy of the data, so **calling it repeatedly can cause a significant performance hit**. Merge and join, which offer a **more efficient way of merging DataFrames**, are introduced in Tutorial 6.
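One common way to avoid the repeated copying is to collect the pieces in a plain Python list and call `pd.concat` once at the end. The sketch below assumes hypothetical per-suburb chunks; the suburb names and prices are made up purely for illustration.

```python
import pandas as pd

# Hypothetical suburbs standing in for data fetched from separate sources.
suburbs = ['Suburb_A', 'Suburb_B', 'Suburb_C']

chunks = []
for i, suburb in enumerate(suburbs):
    # In practice each chunk would be read from a file or scraped from the web.
    chunk = pd.DataFrame({'suburb': [suburb, suburb],
                          'price': [100 + i, 200 + i]})
    chunks.append(chunk)  # list.append only stores a reference, no data is copied

# A single concat copies the data once, instead of once per loop iteration.
all_prices = pd.concat(chunks, ignore_index=True)
print(all_prices)
```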