<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Data Cleaning & Manipulation
<br> 

Before you can work with a data set, you often need to prepare it. For instance, you might need to get rid of missing values, rename columns or recode values so that your data makes sense. This step is called **data cleaning**.
<br>
With pandas, you can do more than just select data that is already there. You can add new columns to your datasets, apply functions, iterate through each row in the data frame, and more. This is where you move from "pandas for exploring your data" to **"pandas for getting your data ready to fit into models"**.

<img src="./image/Icons/cleaning.png" width = 200>

First, let's import one package that you are going to need in this section.

In [2]:
import pandas as pd

As well as two data sets, one of which you already know: 

In [3]:
pv_power = pd.read_csv('data/energy/ods032.csv', parse_dates=True, sep = ";", index_col=0)
energy = pd.read_csv('data/energy/elia_load_2019_01_15.csv', parse_dates=True, sep = ";", index_col=0)

Before you start to clean them up, always take a look at them. This is super important in order to understand your data and how it is structured.

In [5]:
pv_power.head(n=3)

Unnamed: 0_level_0,Resolution code,Region,Measured & Upscaled,Most recent forecast,Most recent P10,Most recent P90,Day Ahead 11AM forecast,Day Ahead 11AM P10,Day Ahead 11AM P90,Day-ahead 6PM forecast,Day-ahead 6PM P10,Day-ahead 6PM P90,Week-ahead forecast,Week-ahead P10,Week-ahead P90,Monitored capacity,Load factor
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2022-12-12 23:45:00+01:00,PT15M,Flemish-Brabant,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,551.946,
2022-12-12 23:45:00+01:00,PT15M,Flanders,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4546.761,
2022-12-12 23:45:00+01:00,PT15M,Luxembourg,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,154.782,


In [6]:
energy.head(n=3)

Unnamed: 0_level_0,Resolution code,Elia Grid Load
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-15 23:45:00+01:00,PT15M,9100.73
2019-01-15 23:30:00+01:00,PT15M,9479.93
2019-01-15 23:15:00+01:00,PT15M,9645.24


## Dealing with Missing Values

The first thing you want to do when cleaning your data is to check whether it contains missing values. Missing values are blank cells as well as NaN or n/a values. In pandas, these entries are treated as null values. In Python, you can...

1. identify missing values
2. drop missing values
3. or replace or fill missing values

With `df_name.isnull().any().any()` you can check whether your data frame contains any missing values. <br> But first, let's go through the command step by step and see what happens:

In [7]:
pv_power.isnull()

Unnamed: 0_level_0,Resolution code,Region,Measured & Upscaled,Most recent forecast,Most recent P10,Most recent P90,Day Ahead 11AM forecast,Day Ahead 11AM P10,Day Ahead 11AM P90,Day-ahead 6PM forecast,Day-ahead 6PM P10,Day-ahead 6PM P90,Week-ahead forecast,Week-ahead P10,Week-ahead P90,Monitored capacity,Load factor
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [8]:
pv_power.isnull().any()

Resolution code            False
Region                     False
Measured & Upscaled         True
Most recent forecast       False
Most recent P10            False
Most recent P90            False
Day Ahead 11AM forecast    False
Day Ahead 11AM P10         False
Day Ahead 11AM P90         False
Day-ahead 6PM forecast     False
Day-ahead 6PM P10          False
Day-ahead 6PM P90          False
Week-ahead forecast        False
Week-ahead P10             False
Week-ahead P90             False
Monitored capacity         False
Load factor                 True
dtype: bool

In [9]:
pv_power.isnull().any().any()

True

In [12]:
print('Does the pv_power df contain nulls?:', 'y' if pv_power.isnull().any().any() else 'n')
print('Does the energy df contain nulls?:', 'y' if energy.isnull().any().any() else 'n')

Does the pv_power df contain nulls?: y
Does the energy df contain nulls?: n


**Question:** Why did we use `.any()` twice when asking this question?
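
One way to approach the question is to follow what each step returns. The sketch below uses a small made-up data frame called `toy` (it is not part of the course data) and stores every intermediate result in its own variable:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only used to illustrate the reduction steps
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

mask = toy.isnull()          # data frame of booleans, same shape as toy
per_column = mask.any()      # Series: one boolean per column
overall = per_column.any()   # a single boolean for the whole data frame

print(mask.shape)            # (3, 2)
print(per_column)
print(overall)
```

The first `.any()` collapses the rows of each column, the second one collapses the columns into one answer.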

As expected, `pv_power` contains some missing values. Let's check which columns contain them and how many:

In [13]:
pv_power.isnull().sum()

Resolution code              0
Region                       0
Measured & Upscaled        378
Most recent forecast         0
Most recent P10              0
Most recent P90              0
Day Ahead 11AM forecast      0
Day Ahead 11AM P10           0
Day Ahead 11AM P90           0
Day-ahead 6PM forecast       0
Day-ahead 6PM P10            0
Day-ahead 6PM P90            0
Week-ahead forecast          0
Week-ahead P10               0
Week-ahead P90               0
Monitored capacity           0
Load factor                378
dtype: int64

Now you need to deal with the null values in the `pv_power` table. How you deal with them depends on the data type of the column in which the null values are found and, of course, on how you intend to use the data. For instance, if your `pv_power` data doesn't have a `Resolution code` listed, you can still get a lot of information about it from things like its `Region`. But if you want to look at the photovoltaic power of Belgium and there is no `Monitored capacity`, it might be difficult to interpret your results. <br>

When you replace, delete or fill missing values, you change the dataset, which can affect your analysis at a later stage. This is why it is important, for others and for yourself, to document your underlying assumptions well.
You have different options for dealing with these missing values. You could:
- delete every row with a missing value: `df = df.dropna(subset=["column"])`
- copy the value from the previous row (the time stamp before, if the data is sorted chronologically): `df["column"] = df["column"].ffill()`
- copy the value from the next row (the time stamp after, if the data is sorted chronologically): `df["column"] = df["column"].bfill()`
- insert the overall mean: `df["column"] = df["column"].fillna(df["column"].mean())`
- and more.  <br>  

("column" needs to be replaced by the name of the column whose missing values you would like to modify.)  
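
To compare these options side by side, here is a minimal sketch on a made-up data frame. The name `df` and the column `value` are placeholders, not part of the course data. It uses the `.ffill()` and `.bfill()` methods for forward and backward filling, and every call returns a new object instead of modifying `df` in place:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "value" stands in for any numeric column
df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]})

dropped = df.dropna(subset=["value"])                  # remove rows where "value" is missing
forward = df["value"].ffill()                          # copy the previous row's value down
backward = df["value"].bfill()                         # copy the next row's value up
mean_filled = df["value"].fillna(df["value"].mean())   # mean of 1, 3 and 5 is 3.0

print(len(dropped))            # 3
print(forward.tolist())        # [1.0, 1.0, 3.0, 3.0, 5.0]
print(backward.tolist())       # [1.0, 3.0, 3.0, 5.0, 5.0]
print(mean_filled.tolist())    # [1.0, 3.0, 3.0, 3.0, 5.0]
```

Which option is appropriate depends on what the column means, so the result of each should be checked against your assumptions about the data.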
After analysing the data, you decide to **drop the rows with null values** in `Measured & Upscaled` and **replace the null values** in `Load factor` with the value from the time stamp before.

### Drop Null Values

Let's drop the rows with null values in the column `Measured & Upscaled`. To do so...

1. Look at the column of your choice and check its null values, as well as how many rows are affected:

In [62]:
pv_power[pv_power["Measured & Upscaled"].isnull()] # select rows of "Measured & Upscaled" with NaN values and display that subset of rows

Unnamed: 0_level_0,Resolution code,Region,Measured & Upscaled,Most recent forecast,Most recent P10,Most recent P90,Day Ahead 11AM forecast,Day Ahead 11AM P10,Day Ahead 11AM P90,Day-ahead 6PM forecast,Day-ahead 6PM P10,Day-ahead 6PM P90,Week-ahead forecast,Week-ahead P10,Week-ahead P90,Monitored capacity,Load factor
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2022-12-12 23:45:00+01:00,PT15M,Flemish-Brabant,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,551.946,
2022-12-12 23:45:00+01:00,PT15M,Flanders,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4546.761,
2022-12-12 23:45:00+01:00,PT15M,Luxembourg,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,154.782,
2022-12-12 23:45:00+01:00,PT15M,West-Flanders,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,975.362,
2022-12-12 23:45:00+01:00,PT15M,Wallonia,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1669.249,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-16 23:45:00+01:00,PT15M,Flanders,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4501.912,
2022-11-16 23:45:00+01:00,PT15M,East-Flanders,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1169.656,
2022-11-16 23:45:00+01:00,PT15M,Antwerp,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1169.619,
2022-11-16 23:45:00+01:00,PT15M,Liège,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,680.867,


In [20]:
isnull_rows = pv_power[pv_power['Measured & Upscaled'].isnull()].shape[0]
print(f"Number of is-null \"Measured & Upscaled\" rows: {isnull_rows}")

Number of is-null "Measured & Upscaled" rows: 378


When you look at the data frame, you can see that the missing values sit in the most recent part of the data (the rows shown range from 16 November to 12 December 2022).

2. Continue and check how many rows the original data frame has: 

In [21]:
pv_power["Measured & Upscaled"].shape[0]

56448

3. Now, out of the total of **56448** rows, drop the **378** rows with NaN values in `Measured & Upscaled` and save the result back into `pv_power`:

In [25]:
pv_power = pv_power.dropna(subset=['Measured & Upscaled']) # override dataframe with a new one that has NaN values removed from the "Measured & Upscaled" column

In [26]:
pv_power.shape[0]

56070

&#128077; Well done, now **378** rows where a NaN value existed for `Measured & Upscaled` have been dropped.
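
The same before/after comparison can be turned into a small automatic check. The sketch below runs on a made-up stand-in for `pv_power` (the frame `toy` and its values are hypothetical) and asserts that the number of removed rows equals the number of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pv_power with a few missing measurements
toy = pd.DataFrame({
    "Region": ["A", "B", "C", "D"],
    "Measured & Upscaled": [np.nan, 10.5, np.nan, 7.25],
})

n_before = len(toy)
n_missing = int(toy["Measured & Upscaled"].isnull().sum())

cleaned = toy.dropna(subset=["Measured & Upscaled"])

# The number of rows removed should equal the number of missing values
assert n_before - len(cleaned) == n_missing
print(n_before, n_missing, len(cleaned))  # 4 2 2
```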

### Replace Null Values

To avoid deleting rows and possibly ending up with too little data, you can also try to replace null values - **if it makes sense**. <br>
Let's first restore the original data of `pv_power` and check again for null values: 

In [63]:
pv_power = pd.read_csv('data/energy/ods032.csv', parse_dates=True, sep = ";", index_col=0)
pv_power.isnull()

Unnamed: 0_level_0,Resolution code,Region,Measured & Upscaled,Most recent forecast,Most recent P10,Most recent P90,Day Ahead 11AM forecast,Day Ahead 11AM P10,Day Ahead 11AM P90,Day-ahead 6PM forecast,Day-ahead 6PM P10,Day-ahead 6PM P90,Week-ahead forecast,Week-ahead P10,Week-ahead P90,Monitored capacity,Load factor
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2022-12-12 23:45:00+01:00,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2022-11-01 00:00:00+01:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [64]:
pv_power.isnull().sum()

Resolution code              0
Region                       0
Measured & Upscaled        378
Most recent forecast         0
Most recent P10              0
Most recent P90              0
Day Ahead 11AM forecast      0
Day Ahead 11AM P10           0
Day Ahead 11AM P90           0
Day-ahead 6PM forecast       0
Day-ahead 6PM P10            0
Day-ahead 6PM P90            0
Week-ahead forecast          0
Week-ahead P10               0
Week-ahead P90               0
Monitored capacity           0
Load factor                378
dtype: int64

So let's replace the null values in the column `Load factor` with the last non-missing value before them. First, look at the rows where `Load factor` is not null:

In [70]:
pv_power[~pv_power["Load factor"].isnull()]

Unnamed: 0_level_0,Resolution code,Region,Measured & Upscaled,Most recent forecast,Most recent P10,Most recent P90,Day Ahead 11AM forecast,Day Ahead 11AM P10,Day Ahead 11AM P90,Day-ahead 6PM forecast,Day-ahead 6PM P10,Day-ahead 6PM P90,Week-ahead forecast,Week-ahead P10,Week-ahead P90,Monitored capacity,Load factor
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2022-12-12 23:30:00+01:00,PT15M,Antwerp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1195.798,0.0
2022-12-12 23:30:00+01:00,PT15M,Liège,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,699.821,0.0
2022-12-12 23:30:00+01:00,PT15M,Flemish-Brabant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,551.946,0.0
2022-12-12 23:30:00+01:00,PT15M,West-Flanders,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,975.362,0.0
2022-12-12 23:30:00+01:00,PT15M,Flanders,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4546.761,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-01 00:00:00+01:00,PT15M,Walloon-Brabant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,129.512,0.0
2022-11-01 00:00:00+01:00,PT15M,West-Flanders,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,972.944,0.0
2022-11-01 00:00:00+01:00,PT15M,Hainaut,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,455.654,0.0
2022-11-01 00:00:00+01:00,PT15M,Limburg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,643.325,0.0


In [71]:
pv_power["Load factor"] = pv_power["Load factor"].ffill() # assign the filled column back; calling ffill(inplace=True) on a column selection triggers a warning and will stop working in pandas 3.0


In [52]:
pv_power.isnull().any()
pv_power[~pv_power["Load factor"].isnull()]

Unnamed: 0_level_0,Resolution code,Region,Measured & Upscaled,Most recent forecast,Most recent P10,Most recent P90,Day Ahead 11AM forecast,Day Ahead 11AM P10,Day Ahead 11AM P90,Day-ahead 6PM forecast,Day-ahead 6PM P10,Day-ahead 6PM P90,Week-ahead forecast,Week-ahead P10,Week-ahead P90,Monitored capacity,Load factor
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2022-12-12 23:30:00+01:00,PT15M,Antwerp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1195.798,0.0
2022-12-12 23:30:00+01:00,PT15M,Liège,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,699.821,0.0
2022-12-12 23:30:00+01:00,PT15M,Flemish-Brabant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,551.946,0.0
2022-12-12 23:30:00+01:00,PT15M,West-Flanders,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,975.362,0.0
2022-12-12 23:30:00+01:00,PT15M,Flanders,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4546.761,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-01 00:00:00+01:00,PT15M,Walloon-Brabant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,129.512,0.0
2022-11-01 00:00:00+01:00,PT15M,West-Flanders,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,972.944,0.0
2022-11-01 00:00:00+01:00,PT15M,Hainaut,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,455.654,0.0
2022-11-01 00:00:00+01:00,PT15M,Limburg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,643.325,0.0
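
One detail is worth checking after a forward fill: it follows the row order of the data frame, not the time order. The preview above lists the newest rows first, so "the row before" is a later time stamp, and missing values in the very first rows have nothing above them to copy from. The sketch below illustrates both effects on a made-up series (the index and values are hypothetical), and shows one possible workaround, sorting chronologically before filling:

```python
import numpy as np
import pandas as pd

# Hypothetical series, newest time stamp first, with the newest value missing
idx = pd.to_datetime(["2022-01-01 00:45", "2022-01-01 00:30",
                      "2022-01-01 00:15", "2022-01-01 00:00"])
load = pd.Series([np.nan, 0.4, np.nan, 0.1], index=idx, name="Load factor")

# Filling in the given (descending) order: the leading NaN stays,
# and the 00:15 gap receives 0.4, which belongs to the later 00:30 stamp
filled_as_is = load.ffill()
print(filled_as_is.tolist())    # [nan, 0.4, 0.4, 0.1]

# Sorting chronologically first makes ffill use the time stamp before
filled_sorted = load.sort_index().ffill().sort_index(ascending=False)
print(filled_sorted.tolist())   # [0.4, 0.4, 0.1, 0.1]
```

Running `isnull().sum()` again after filling is a quick way to see whether any gaps were left over.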


<br>
&#128515; Amazing, now you know how to deal with missing values!

## Creating new columns from existing ones
<br>

The last thing you need to know about data cleaning and manipulation is how to create new columns from existing ones. This comes in handy in many cases. For instance, you might need to calculate the mean load factor and save these values in a new column. Can you think of other use cases? <br>

In pandas, it's easy to make a new column from existing ones. Just look at the example below and check the syntax yourself: 

In [None]:
pv_power.loc[:, 'New_column'] = pv_power.loc[:, 'Resolution code'] + '_' + pv_power.loc[:, 'Region']

In [None]:
pv_power.head()

Or maybe you want to convert MW into kW for the column `Monitored capacity`: 

In [None]:
pv_power.loc[:, 'Monit_cap_kW'] = pv_power.loc[:, 'Monitored capacity'] * 1000

In [None]:
pv_power.head()
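
As an alternative sketch, both new columns can also be created in one step with `DataFrame.assign`, which returns a new data frame and leaves the original untouched. The frame `toy` below is a made-up miniature of `pv_power`, and the assumption that `Monitored capacity` is given in MW is taken from the text above:

```python
import pandas as pd

# Hypothetical miniature version of pv_power
toy = pd.DataFrame({
    "Resolution code": ["PT15M", "PT15M"],
    "Region": ["Flanders", "Wallonia"],
    "Monitored capacity": [4546.761, 1669.249],   # assumed to be in MW
})

# assign() returns a new data frame; `toy` itself is not modified
extended = toy.assign(
    New_column=toy["Resolution code"] + "_" + toy["Region"],
    Monit_cap_kW=toy["Monitored capacity"] * 1000,   # MW -> kW
)

print(extended["New_column"].tolist())   # ['PT15M_Flanders', 'PT15M_Wallonia']
print(list(toy.columns))                 # still only the three original columns
```

Keeping the original frame unchanged can make it easier to rerun a notebook cell without accidentally stacking up modifications.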

In [72]:
import numpy as np
import pandas as pd

nums = {'Load factor': [0.0, np.nan, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0]}
nums_df = pd.DataFrame(nums)

nums_df.head(n=3)

Unnamed: 0,Load factor
0,0.0
1,
2,0.0


<br>

## Recap, Tips & Takeaways &#128161;

<br>

<div class="alert alert-block alert-success">

**Let's recap what you have learned so far:**

- Data Cleaning is one of the most important steps when working with your data 
- You can drop missing values with: `df_name.dropna(subset=['column_name'])`
- You have different options to replace missing values, for example `df_name.fillna()`, `df_name.ffill()` and `df_name.bfill()`
- To check for missing values you can use `df_name.isnull().any().any()`
- **iloc** works **positionally** with numbers (in our case the index)
- **loc** searches for **labels**
        
</div>
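
As a reminder of the difference between the last two points, here is a minimal sketch on a made-up data frame (the frame `toy` and its index labels are hypothetical). Both lines select the same cell, once by label and once by position:

```python
import pandas as pd

# Hypothetical frame with string labels as index
toy = pd.DataFrame({"load": [9100.73, 9479.93, 9645.24]},
                   index=["late", "mid", "early"])

by_label = toy.loc["mid", "load"]   # label-based: row "mid", column "load"
by_position = toy.iloc[1, 0]        # position-based: second row, first column

print(by_label, by_position)
```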

## Extra resources
To learn more about the functions available on a data frame, read:
- [Dataframe documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)