<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Data Joining
<br> 

Sometimes it is necessary to **combine** several data sets. To do so, the data sets need a common feature (a key column) on which the data from the various sources can be joined (merged).  
Let's look at the different ways your data can be joined together.

<table><tr><td><img src='./image/inner_join.png' width = 300></td><td><img src='./image/outer_join.png' width = 300></td></tr></table>
<table><tr><td><img src='./image/left_join.png' width = 300></td><td><img src='./image/right_join.png' width = 300></td></tr></table>

### Examples

In order to understand all these methods, take a look at the examples below and try to figure out how this **data joining** works. Hint: The key column on which you are merging the dataframes is column C. 

<br>
<center><img src="./image/inner_join_example.png" width = 800/></center>

<br>
<center><img src="./image/outer_join_example.png" width = 800/></center>

<br>
<center><img src="./image/left_join_example.png" width = 800 /></center>

<br>
<center><img src="./image/right_join_example.png" width = 800 /></center>
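The four join types can also be tried out on a tiny example. The following is a minimal sketch: the frames `left` and `right` and their values are made up for illustration (they are not the tables from the figures), and only the key column name `C` is borrowed from the hint above.

```python
import pandas as pd

# Toy frames with illustrative values; "C" is the shared key column
left = pd.DataFrame({"A": [1, 2, 3], "C": ["x", "y", "z"]})
right = pd.DataFrame({"B": [10, 20, 30], "C": ["y", "z", "w"]})

# Run the same merge with each join type and keep the results
results = {}
for how in ["inner", "outer", "left", "right"]:
    results[how] = pd.merge(left, right, on="C", how=how)
    print(f"--- {how} join: {len(results[how])} rows ---")
    print(results[how])
```

Comparing the row counts and the positions of the `NaN` values across the four results is a quick way to see which keys each join type keeps.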

## Merge - First Steps

Now it's your turn!
Let's load two data sets from the [Elia Open Data API](https://www.elia.be/en/grid-data/open-data). The first data set describes the measured and upscaled **total load on the Belgian grid** and presents data from 1st Jan 2019 **until 31st Jan 2019**, whereas the second one describes the measured and upscaled **load on the Elia grid** from 1st Jan 2019 **only until 15th Jan 2019**. <br>

1. Read in the two data sets

In [1]:
import pandas as pd

In [8]:
total_load = pd.read_csv("./data/energy/total_load_2019_01.csv", sep = ";", parse_dates = True)

In [9]:
elia_load = pd.read_csv("./data/energy/elia_load_2019_01_15.csv", sep = ";", parse_dates = True)
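A side note on `parse_dates = True`: used on its own, this option asks pandas to try parsing the index, so the datetime column is still read as plain text. If real timestamps are needed, one option is to convert the column explicitly. The sketch below uses a small hand-made frame `df` (an assumption for illustration, with two timestamps copied from this data set) rather than the CSV files.

```python
import pandas as pd

# Small stand-in frame; the real data comes from the CSV files above
df = pd.DataFrame({
    "DateTime": ["2019-01-31T23:45:00+01:00", "2019-01-31T23:30:00+01:00"],
    "Total Load": [11292.30, 11382.79],
})

# Convert the text column into timezone-aware timestamps (normalised to UTC)
df["DateTime"] = pd.to_datetime(df["DateTime"], utc=True)
print(df.dtypes)
```

The merges in this notebook work on the text values as well, as long as both data sets use exactly the same timestamp format.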

2. Take a quick look at them so you know what you are working with

In [4]:
print("This is the Total Load Data Set: ")
print(total_load.head())
print("Shape: ", total_load.shape)

This is the Total Load Data Set: 
                    DateTime Resolution code  Total Load
0  2019-01-31T23:45:00+01:00           PT15M    11292.30
1  2019-01-31T23:30:00+01:00           PT15M    11382.79
2  2019-01-31T23:15:00+01:00           PT15M    11520.47
3  2019-01-31T23:00:00+01:00           PT15M    11633.02
4  2019-01-31T22:45:00+01:00           PT15M    11546.67
Shape:  (2976, 3)


In [5]:
print("This is the Elia Load Data Set:")
print(elia_load.head())
print("Shape: ", elia_load.shape)

This is the Elia Load Data Set:
                    Datetime Resolution code  Elia Grid Load
0  2019-01-15T23:45:00+01:00           PT15M         9100.73
1  2019-01-15T23:30:00+01:00           PT15M         9479.93
2  2019-01-15T23:15:00+01:00           PT15M         9645.24
3  2019-01-15T23:00:00+01:00           PT15M         9691.24
4  2019-01-15T22:45:00+01:00           PT15M         9717.09
Shape:  (1440, 3)


In order to merge both data frames, you need to rename one of the datetime columns: their content is the same, but the columns are named differently (`DateTime` vs. `Datetime`):

In [7]:
elia_load = elia_load.rename(columns={"Datetime": "DateTime"})
elia_load.head()

Unnamed: 0,DateTime,Resolution code,Elia Grid Load
0,2019-01-15T23:45:00+01:00,PT15M,9100.73
1,2019-01-15T23:30:00+01:00,PT15M,9479.93
2,2019-01-15T23:15:00+01:00,PT15M,9645.24
3,2019-01-15T23:00:00+01:00,PT15M,9691.24
4,2019-01-15T22:45:00+01:00,PT15M,9717.09


In [8]:
df_merged = pd.merge(total_load, elia_load)
df_merged

Unnamed: 0,DateTime,Resolution code,Total Load,Elia Grid Load
0,2019-01-15T23:45:00+01:00,PT15M,10201.90,9100.73
1,2019-01-15T23:30:00+01:00,PT15M,10501.04,9479.93
2,2019-01-15T23:15:00+01:00,PT15M,10665.05,9645.24
3,2019-01-15T23:00:00+01:00,PT15M,10698.07,9691.24
4,2019-01-15T22:45:00+01:00,PT15M,10685.16,9717.09
...,...,...,...,...
1435,2019-01-01T01:00:00+01:00,PT15M,8728.57,7843.09
1436,2019-01-01T00:45:00+01:00,PT15M,8746.80,7945.00
1437,2019-01-01T00:30:00+01:00,PT15M,8964.21,8084.29
1438,2019-01-01T00:15:00+01:00,PT15M,9060.07,8229.64


**Question**: Look at the merged dataframe. Can you guess which kind of merge method you used? Since we did not specify any parameters within the merge function, this method is the default!

And now?

In [9]:
df2_merged = pd.merge(total_load, elia_load, how = "outer")
df2_merged

Unnamed: 0,DateTime,Resolution code,Total Load,Elia Grid Load
0,2019-01-31T23:45:00+01:00,PT15M,11292.30,
1,2019-01-31T23:30:00+01:00,PT15M,11382.79,
2,2019-01-31T23:15:00+01:00,PT15M,11520.47,
3,2019-01-31T23:00:00+01:00,PT15M,11633.02,
4,2019-01-31T22:45:00+01:00,PT15M,11546.67,
...,...,...,...,...
2971,2019-01-01T01:00:00+01:00,PT15M,8728.57,7843.09
2972,2019-01-01T00:45:00+01:00,PT15M,8746.80,7945.00
2973,2019-01-01T00:30:00+01:00,PT15M,8964.21,8084.29
2974,2019-01-01T00:15:00+01:00,PT15M,9060.07,8229.64
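To see where each row of a merged dataframe comes from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. A minimal sketch on toy data follows; the frames `left` and `right` and the placeholder keys `t1`, `t2`, `t3` are assumptions for illustration.

```python
import pandas as pd

# Toy frames: "right" covers only part of the keys found in "left"
left = pd.DataFrame({"DateTime": ["t1", "t2", "t3"],
                     "Total Load": [100.0, 110.0, 120.0]})
right = pd.DataFrame({"DateTime": ["t1", "t2"],
                      "Elia Grid Load": [90.0, 95.0]})

# indicator=True records the origin of every row in a "_merge" column
outer = pd.merge(left, right, how="outer", indicator=True)
print(outer)
print(outer["_merge"].value_counts())
```

Rows marked `left_only` are exactly the ones whose `Elia Grid Load` is missing.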


## Additional parameters of merge

Obviously, you can do way more than just "inner" or "outer" merging. There are lots of parameters with which you can specify things like the merging method or add suffixes to overlapping column names. Find out all about the different parameters in merge [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html). <br>
Let's check out some of them:

In [90]:
#help(pd.merge)

In [10]:
df3_merged = total_load.merge(elia_load, on = "DateTime", how = "left", suffixes=("_total", "_elia"))

df3_merged

Unnamed: 0,DateTime,Resolution code_total,Total Load,Resolution code_elia,Elia Grid Load
0,2019-01-31T23:45:00+01:00,PT15M,11292.30,,
1,2019-01-31T23:30:00+01:00,PT15M,11382.79,,
2,2019-01-31T23:15:00+01:00,PT15M,11520.47,,
3,2019-01-31T23:00:00+01:00,PT15M,11633.02,,
4,2019-01-31T22:45:00+01:00,PT15M,11546.67,,
...,...,...,...,...,...
2971,2019-01-01T01:00:00+01:00,PT15M,8728.57,PT15M,7843.09
2972,2019-01-01T00:45:00+01:00,PT15M,8746.80,PT15M,7945.00
2973,2019-01-01T00:30:00+01:00,PT15M,8964.21,PT15M,8084.29
2974,2019-01-01T00:15:00+01:00,PT15M,9060.07,PT15M,8229.64
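Another parameter worth knowing is `validate`, which lets the merge fail early when the keys are not as unique as expected. The sketch below uses made-up frames (`left`, `right_dup`) in which one key appears twice on the right side.

```python
import pandas as pd

left = pd.DataFrame({"DateTime": ["t1", "t2"], "Total Load": [100.0, 110.0]})
# "t1" appears twice here, so the relationship is not one-to-one
right_dup = pd.DataFrame({"DateTime": ["t1", "t1"], "Elia Grid Load": [90.0, 91.0]})

# Without validation the duplicated key silently produces an extra row
unchecked = left.merge(right_dup, on="DateTime", how="left")
print(len(unchecked), "rows without validation")

# With validate="one_to_one" pandas raises a MergeError instead
try:
    left.merge(right_dup, on="DateTime", how="left", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print("Merge rejected:", err)
```

This kind of check can be useful for time series, where every timestamp is usually expected to occur only once per data set.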


Do you remember how you renamed the datetime column of the `elia_load` dataframe in order to merge it afterwards? You don't need to do that. Instead, you can use the parameters **left_on** and **right_on**. To do so, let's import the data again, but store it into another variable called `elia_load_2`:

In [11]:
elia_load_2 = pd.read_csv("./data/energy/elia_load_2019_01_15.csv", sep = ";", parse_dates = True)

Check the column names of your two dataframes: 

In [12]:
elia_load_2.head(n=3)

Unnamed: 0,Datetime,Resolution code,Elia Grid Load
0,2019-01-15T23:45:00+01:00,PT15M,9100.73
1,2019-01-15T23:30:00+01:00,PT15M,9479.93
2,2019-01-15T23:15:00+01:00,PT15M,9645.24


In [13]:
total_load.head(n=3)

Unnamed: 0,DateTime,Resolution code,Total Load
0,2019-01-31T23:45:00+01:00,PT15M,11292.3
1,2019-01-31T23:30:00+01:00,PT15M,11382.79
2,2019-01-31T23:15:00+01:00,PT15M,11520.47


As you can see, the datetime columns contain basically the same values, **BUT** they are named differently. To still be able to merge these two dataframes based **on** this datetime column, you can use **"left_on"** and **"right_on"**:

In [14]:
df4_merged = elia_load_2.merge(total_load, left_on = "Datetime", right_on = "DateTime", suffixes=("_elia", "_total"))

df4_merged

Unnamed: 0,Datetime,Resolution code_elia,Elia Grid Load,DateTime,Resolution code_total,Total Load
0,2019-01-15T23:45:00+01:00,PT15M,9100.73,2019-01-15T23:45:00+01:00,PT15M,10201.90
1,2019-01-15T23:30:00+01:00,PT15M,9479.93,2019-01-15T23:30:00+01:00,PT15M,10501.04
2,2019-01-15T23:15:00+01:00,PT15M,9645.24,2019-01-15T23:15:00+01:00,PT15M,10665.05
3,2019-01-15T23:00:00+01:00,PT15M,9691.24,2019-01-15T23:00:00+01:00,PT15M,10698.07
4,2019-01-15T22:45:00+01:00,PT15M,9717.09,2019-01-15T22:45:00+01:00,PT15M,10685.16
...,...,...,...,...,...,...
1435,2019-01-01T01:00:00+01:00,PT15M,7843.09,2019-01-01T01:00:00+01:00,PT15M,8728.57
1436,2019-01-01T00:45:00+01:00,PT15M,7945.00,2019-01-01T00:45:00+01:00,PT15M,8746.80
1437,2019-01-01T00:30:00+01:00,PT15M,8084.29,2019-01-01T00:30:00+01:00,PT15M,8964.21
1438,2019-01-01T00:15:00+01:00,PT15M,8229.64,2019-01-01T00:15:00+01:00,PT15M,9060.07


&#128526; Cool, right?
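When merging with `left_on` and `right_on`, both key columns end up in the result, even though they hold the same values. If that is not wanted, one of them can be dropped afterwards. A minimal sketch on toy data (the frames `elia` and `total` and their values are assumptions for illustration):

```python
import pandas as pd

elia = pd.DataFrame({"Datetime": ["t1", "t2"], "Elia Grid Load": [90.0, 95.0]})
total = pd.DataFrame({"DateTime": ["t1", "t2", "t3"],
                      "Total Load": [100.0, 110.0, 120.0]})

# Default inner merge on differently named key columns
merged = elia.merge(total, left_on="Datetime", right_on="DateTime")

# Both key columns are kept, so drop the redundant one
merged = merged.drop(columns="DateTime")
print(merged)
```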

### Exercise

Now it's your turn! 

- Check out the following two data frames:

In [15]:
df1_job = pd.DataFrame({'employee': ['Bob', 'Jake', 'Ahmed', 'Sue'],
                    'group': ['Data Science', 'Data Engineering', 'Data Engineering', 'Data Analyst']})
df2_hired = pd.DataFrame({'employee': ['Ahmed', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2006, 2008, 2014, 2021]})

- Print out the first three rows of each DataFrame before merging
- Merge the two dataframes on the `employee` column and save the result in a new dataframe called `df_job_hired`
- Print out the first three rows of `df_job_hired`

### Advanced Exercise

In the DataFrame `df2_hired` somebody changed the name of the employee column. In addition, a column called `responsibility` was added. 

- See if both `responsibility` columns represent the same values and can be merged into one column.

- Merge the two DataFrames, save them in a new variable called `df_job_hired_2`.
- Add suffixes ("_job", "_hired") so you can still distinguish the columns after merging. 
- In the end, print out the DataFrame.

&#128161; <ins>Hint</ins>: Use the parameters left_on, right_on of the .merge() function. And don't forget, you can always check the documentation if needed.

In [16]:
df1_job = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                       'group': ['Data Science', 'Data Engineering', 'Data Engineering', 'Data Analyst'],
                    'responsibility': ["Energy Forecast", "Data pipelines", "Data pipelines", "Interpreting Results"]})

df2_hired = pd.DataFrame({'employee_names': ['Lisa', 'Bob', 'Jake', 'Sue'],
                         'hire_date': [2004, 2008, 2012, 2014],
                         'responsibility': ["yes", "no", "no", "yes"]})

## Concatenating two dataframes
<br>

You can also stitch two, three or more dataframes together with `pd.concat()`. Let's get right to it! <br>
For the following example, you need to import three new data sets. They all describe the physical energy flows on the interconnections between the Belgian bidding zone and the neighbouring bidding zones. A positive figure means export from Belgium, while a negative figure means import into Belgium.

In [17]:
# read in files
df_morning = pd.read_csv("data/energy/PhysicalFlow_0-8.csv", index_col=0)
df_noon = pd.read_csv("data/energy/PhysicalFlow_8-16.csv", index_col=0)
df_evening = pd.read_csv("data/energy/PhysicalFlow_16-24.csv", index_col=0)

Let's have a look at the data: 

In [18]:
df_morning.head(n=3)  # includes data from 00:00 to 08:00

Unnamed: 0,Datetime,Resolution code,Control area,Physical Flow Value
64,2021-12-01 07:45:00,PT15M,Germany,-537.172
65,2021-12-01 07:30:00,PT15M,Germany,-538.348
66,2021-12-01 07:15:00,PT15M,Germany,-522.304


In [19]:
df_noon.head(n=3)  # includes data from 08:00 to 16:00

Unnamed: 0,Datetime,Resolution code,Control area,Physical Flow Value
32,2021-12-01 15:45:00,PT15M,Germany,-519.664
33,2021-12-01 15:30:00,PT15M,Germany,-501.404
34,2021-12-01 15:15:00,PT15M,Germany,-499.612


In [20]:
df_evening.head(n=3)  # includes data from 16:00 to 24:00

Unnamed: 0,Datetime,Resolution code,Control area,Physical Flow Value
0,2021-12-01 23:45:00,PT15M,Germany,-1001.788
1,2021-12-01 23:30:00,PT15M,Germany,-1002.104
2,2021-12-01 23:15:00,PT15M,Germany,-1001.796


The shapes of your new data sets are the following: 

In [21]:
print("Shape of Morning Dataframe: " + str(df_morning.shape))
print("Shape of Noon Dataframe: " + str(df_noon.shape))
print("Shape of Evening Dataframe: " + str(df_evening.shape))

Shape of Morning Dataframe: (32, 4)
Shape of Noon Dataframe: (32, 4)
Shape of Evening Dataframe: (32, 4)


&#128077; Nice! Maybe you have already discovered that the data sets are from the same day but cover different times of day. So let's put them together with the new function `pd.concat()`:

In [23]:
df_concat = pd.concat([df_morning, df_noon, df_evening ]).sort_values("Datetime").reset_index(drop=True)
print("Shape of df_concat: " + str(df_concat.shape))

Shape of df_concat: (96, 4)
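What the chained calls do is easier to see on a very small example. In the sketch below, the frames `part_a` and `part_b` and their flow values are made up for illustration; only the column names follow the data sets above.

```python
import pandas as pd

# Two toy chunks with their own index labels, newest timestamp first
part_a = pd.DataFrame({"Datetime": ["2021-12-01 00:15:00", "2021-12-01 00:00:00"],
                       "Physical Flow Value": [-1.0, -2.0]}, index=[2, 3])
part_b = pd.DataFrame({"Datetime": ["2021-12-01 08:15:00", "2021-12-01 08:00:00"],
                       "Physical Flow Value": [-3.0, -4.0]}, index=[0, 1])

# concat stacks the rows and keeps the original index labels
stacked = pd.concat([part_b, part_a])
print(stacked)

# Sort chronologically and replace the index with a fresh 0..n-1 range
tidy = stacked.sort_values("Datetime").reset_index(drop=True)
print(tidy)
```

`pd.concat` only stacks the rows; sorting and re-indexing are separate steps.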


## Additional parameters in concat()
<br> 

As always, there are many more things you can do with concat(). [See for yourself](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) or use the help function: 

In [142]:
#help(pd.concat)

Let's redo the last example and verify integrity, which means checking for duplicate index values across the three dataframes: 

In [106]:
df_concat = pd.concat([df_morning, df_noon, df_evening], verify_integrity=True)

You might wonder why "nothing has happened". Well, let's create a data frame WITH duplicates. <br>
The following data set also covers the morning, from 0 am until 9 am, and therefore overlaps with the `df_noon` data: 

In [113]:
df_duplicates = pd.read_csv("data/energy/PhysicalFlow_0-9.csv", index_col = 0)

df_duplicates.head()

Unnamed: 0,Datetime,Resolution code,Control area,Physical Flow Value
59,2021-12-01 09:00:00,PT15M,Germany,-511.74
60,2021-12-01 08:45:00,PT15M,Germany,-614.928
61,2021-12-01 08:30:00,PT15M,Germany,-624.604
62,2021-12-01 08:15:00,PT15M,Germany,-624.016
63,2021-12-01 08:00:00,PT15M,Germany,-596.836


If you now concatenate them and check for duplicates, an error message will appear: 

In [115]:
df_concat_dupl = pd.concat([df_duplicates, df_noon, df_evening], verify_integrity=True)

ValueError: Indexes have overlapping values: Int64Index([59, 60, 61, 62, 63], dtype='int64')

See, you will get an error message if the indexes of your dataframes contain overlapping (duplicate) values. 
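Note that `verify_integrity` compares index labels only, not the row contents. The sketch below, on made-up frames `first` and `second`, catches the error and then shows one possible way to deal with repeated rows: ignoring the old index and dropping duplicated rows explicitly.

```python
import pandas as pd

# Toy frames that share the index label 1 and the row for "08:15"
first = pd.DataFrame({"Datetime": ["08:00", "08:15"], "value": [1.0, 2.0]}, index=[0, 1])
second = pd.DataFrame({"Datetime": ["08:15", "08:30"], "value": [2.0, 3.0]}, index=[1, 2])

# The overlapping index label triggers a ValueError
try:
    pd.concat([first, second], verify_integrity=True)
    overlap_found = False
except ValueError as err:
    overlap_found = True
    print("Concat rejected:", err)

# Alternative: build a fresh index, then remove rows with identical contents
combined = pd.concat([first, second], ignore_index=True).drop_duplicates()
print(combined)
```

Whether dropping rows is appropriate depends on the data; here the repeated `08:15` row is identical in both frames.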

### Exercise

In the following, two dataframes (`dp1_df`, and `dp2_df`) are given. They both describe a number of employees working on a data related project together. They share the same column names. 

- Concatenate the dataframes and save the result in a new dataframe called `data_project_df`.
- After concatenating the dataframes, have a look at the index: it is not continuous. 
- Add the parameter `ignore_index = True` to create a continuous index.

In [148]:
dp1_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Data Scientist', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Data Engineer', 'Location': 'Washington Avenue'}])
dp2_df = pd.DataFrame([{'Name': 'James', 'Role': 'Analyst', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'Role': 'Regulations', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'Role': 'MlOps', 'Location': '512 Wilson Crescent'}])