<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Data Cleaning & Manipulation
<br> 

In order to work with your data set, you oftentimes need to prepare your data. For instance, you might need to get rid of missing values, rename columns or recode values so that your data makes sense. This step is called **data cleaning**.
<br>
With pandas, you can do more than just select data that is already there: you can add new columns to your data sets, apply functions, iterate through the rows of a data frame, and more. This is where you move from "pandas for exploring your data" to **"pandas for getting your data ready to fit into models"**.

<img src="./image/Icons/cleaning.png" width = 200>

First, let's import one package that you are going to need in this section.

In [1]:
import pandas as pd

You also need two data sets, one of which you already know:

In [66]:
pv_power = pd.read_csv('data/energy/ods032.csv', parse_dates=True, sep = ";", index_col=0)
energy = pd.read_csv('data/energy/elia_load_2019_01_15.csv', parse_dates=True, sep = ";", index_col=0)

Before you start to clean them up, always take a look at them. This is super important in order to understand your data and how it is structured.

In [3]:
pv_power.head()

Datetime,Resolution code,Region,Measured & Upscaled,Most recent forecast,Day Ahead 11AM forecast,Day-ahead 6PM forecast,Week-ahead forecast,Monitored capacity,Load factor
2022-06-21 23:45:00+02:00,PT15M,Liège,,0.0,0.0,0.0,0.0,634.933,0.0
2022-06-21 23:45:00+02:00,PT15M,Hainaut,,0.0,0.0,0.0,0.0,431.127,0.0
2022-06-21 23:45:00+02:00,PT15M,Flemish-Brabant,,0.0,0.0,0.0,0.0,514.584,0.0
2022-06-21 23:45:00+02:00,PT15M,East-Flanders,,0.0,0.0,0.0,0.0,1127.039,0.0
2022-06-21 23:45:00+02:00,PT15M,Namur,,0.0,0.0,0.0,0.0,195.154,0.0


In [4]:
energy.head()

Datetime,Resolution code,Elia Grid Load
2019-01-15 23:45:00+01:00,PT15M,9100.73
2019-01-15 23:30:00+01:00,PT15M,9479.93
2019-01-15 23:15:00+01:00,PT15M,9645.24
2019-01-15 23:00:00+01:00,PT15M,9691.24
2019-01-15 22:45:00+01:00,PT15M,9717.09


## Dealing with Missing Values

The first thing you want to do when cleaning your data is to check whether it contains missing values. Missing values are blank cells as well as NaN or n/a values. In pandas, these entries are treated as null values. With pandas, you can...

1. identify missing values
2. drop missing values
3. or replace or fill missing values

With the syntax: `df_name.isnull().any().any()` you can check your data frame and see whether there are missing values. <br> But first, let's go through the command step by step and see what happens:

In [47]:
pv_power.isnull()

Datetime,Resolution code,Region,Measured & Upscaled,Most recent forecast,Day Ahead 11AM forecast,Day-ahead 6PM forecast,Week-ahead forecast,Monitored capacity,Load factor
2022-06-21 23:45:00+02:00,False,False,True,False,False,False,False,False,False
2022-06-21 23:45:00+02:00,False,False,True,False,False,False,False,False,False
2022-06-21 23:45:00+02:00,False,False,True,False,False,False,False,False,False
2022-06-21 23:45:00+02:00,False,False,True,False,False,False,False,False,False
2022-06-21 23:45:00+02:00,False,False,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
2019-04-09 22:15:00+02:00,False,False,False,False,False,False,False,False,False
2019-04-09 22:15:00+02:00,False,False,False,False,False,False,False,False,False
2019-04-09 22:15:00+02:00,False,False,False,False,False,False,False,False,False
2019-04-09 22:15:00+02:00,False,False,False,False,False,False,False,False,False


In [48]:
pv_power.isnull().any()

Resolution code            False
Region                     False
Measured & Upscaled         True
Most recent forecast       False
Day Ahead 11AM forecast     True
Day-ahead 6PM forecast     False
Week-ahead forecast         True
Monitored capacity         False
Load factor                False
dtype: bool

In [7]:
pv_power.isnull().any().any()

True

In [8]:
print('Does the pv_power df contain nulls?:', pv_power.isnull().any().any())
print('Does the energy df contain nulls?:', energy.isnull().any().any())

Does the pv_power df contain nulls?: True
Does the energy df contain nulls?: False


**Question:** Why did we use `.any()` twice in this check?
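
If you want to explore this question yourself, a tiny example can help. The sketch below uses a made-up data frame called `toy` (not part of the course data) and stores each step of the chain in its own variable, so you can see how the result shrinks from a table to a single value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with one missing entry in column "b"
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

cell_mask = toy.isnull()     # data frame of booleans: one value per cell
col_mask = cell_mask.any()   # series of booleans: one value per column
has_nulls = col_mask.any()   # single boolean for the whole data frame

print(cell_mask)
print(col_mask)
print(has_nulls)
```

The first `.any()` collapses the rows of each column, the second one collapses the remaining columns.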

As expected, `pv_power` contains some missing values. Let's check which columns contain them and how many:

In [67]:
pv_power.isnull().sum()

Resolution code                0
Region                         0
Measured & Upscaled         1302
Most recent forecast           0
Day Ahead 11AM forecast    10752
Day-ahead 6PM forecast         0
Week-ahead forecast        16128
Monitored capacity             0
Load factor                    0
dtype: int64

Now you need to deal with the null values in the `pv_power` table. How you deal with them depends on the data type of the column in which null values are found and of course how you intend to use the data. For instance, if your `pv_power` data doesn't have a `Resolution code` listed, you can still get a lot of information about it from things like its `Region`. But if you want to look at the photovoltaic power of Belgium and there is no `Monitored capacity`, it might be difficult to interpret your results. <br>

When you replace, delete or fill missing values, you change the data set, which might affect your analysis at a later stage. This is why it is important, for others and for yourself, that you document the underlying assumptions well.
You have different options for dealing with these missing values. You could:
- delete every row with a missing value: `.dropna(subset = ["column"])`
- copy the value from the time stamp before: `.fillna(method = "ffill", inplace = True)`
- copy the value from the time stamp after: `.fillna(method = "bfill", inplace = True)`
- insert the overall mean: `.fillna(df["column"].mean(), inplace = True)`
- and more.  <br>  

("column" needs to be replaced by the name of the column in which you would like to modify the missing values.)  
After analysing the data, you decide to **drop the rows with null values** in `Measured & Upscaled` and **replace the null values** in `Day Ahead 11AM forecast` and `Week-ahead forecast` with the value from the time stamp before.
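
To get a feeling for how the options listed above differ, you can try them on a small series first. This is only a sketch with made-up numbers; the series `s` and the variable names are not part of the course data. It uses `.ffill()` and `.bfill()`, which do the same as `.fillna(method = "ffill")` and `.fillna(method = "bfill")` (newer pandas versions deprecate the `method` argument):

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with two gaps
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

dropped = s.dropna()              # remove the missing entries
forward = s.ffill()               # copy the value from the row before
backward = s.bfill()              # copy the value from the row after
mean_filled = s.fillna(s.mean())  # insert the mean of the non-missing values

# Put the results side by side to compare them
comparison = pd.DataFrame(
    {"original": s, "ffill": forward, "bfill": backward, "mean": mean_filled}
)
print(comparison)
print("Rows left after dropna:", len(dropped))
```

Note that none of these calls changes `s` itself. Each one returns a new series, so you decide which result to keep.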

### Drop Null Values

Let's drop the rows with null values in the column `Measured & Upscaled`. To do so...

1. Look at the column of your choice and check its null values, as well as how many rows are affected:

In [58]:
pv_power[pv_power["Measured & Upscaled"].isnull()]

Datetime,Resolution code,Region,Measured & Upscaled,Most recent forecast,Day Ahead 11AM forecast,Day-ahead 6PM forecast,Week-ahead forecast,Monitored capacity,Load factor
2022-06-21 23:45:00+02:00,PT15M,Liège,,0.0,0.0,0.0,0.0,634.933,0.0
2022-06-21 23:45:00+02:00,PT15M,Hainaut,,0.0,0.0,0.0,0.0,431.127,0.0
2022-06-21 23:45:00+02:00,PT15M,Flemish-Brabant,,0.0,0.0,0.0,0.0,514.584,0.0
2022-06-21 23:45:00+02:00,PT15M,East-Flanders,,0.0,0.0,0.0,0.0,1127.039,0.0
2022-06-21 23:45:00+02:00,PT15M,Namur,,0.0,0.0,0.0,0.0,195.154,0.0
...,...,...,...,...,...,...,...,...,...
2022-06-21 00:45:00+02:00,PT15M,Belgium,,0.0,0.0,0.0,0.0,6004.922,0.0
2022-06-21 00:45:00+02:00,PT15M,Flanders,,0.0,0.0,0.0,0.0,4339.614,0.0
2022-06-21 00:45:00+02:00,PT15M,Antwerp,,0.0,0.0,0.0,0.0,1129.311,0.0
2022-06-21 00:45:00+02:00,PT15M,Limburg,,0.0,0.0,0.0,0.0,624.636,0.0


In [59]:
pv_power[pv_power["Measured & Upscaled"].isnull()].shape[0]

1302

When you look at the data frame, you can see that it is the "newer" data that has missing values (data from the 21st of June 2022).

2. Continue and check how many rows the original data frame has: 

In [60]:
pv_power["Measured & Upscaled"].shape[0]

1569887

3. Now drop all 1302 rows with NaN values and assign the result back to `pv_power`:

In [68]:
pv_power = pv_power.dropna(subset=['Measured & Upscaled'])

In [62]:
pv_power.shape[0]

1568585

&#128077; Well done! The 1302 rows with a NaN value in `Measured & Upscaled` have been dropped.
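
The same three steps can be sketched on a small, made-up data frame. The values in `toy` are invented for illustration; only the column names are borrowed from `pv_power`. The example also shows that `subset` restricts the check to the listed column, so missing values in other columns stay untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from pv_power
toy = pd.DataFrame({
    "Region": ["Namur", "Liège", "Hainaut", "Antwerp"],
    "Measured & Upscaled": [np.nan, 12.5, np.nan, 7.0],
    "Week-ahead forecast": [1.0, np.nan, 3.0, 4.0],
})

rows_before = toy.shape[0]
n_missing = toy["Measured & Upscaled"].isnull().sum()

# Drop only the rows where "Measured & Upscaled" is missing
cleaned = toy.dropna(subset=["Measured & Upscaled"])
rows_after = cleaned.shape[0]

print("Rows before:", rows_before)
print("Missing in 'Measured & Upscaled':", n_missing)
print("Rows after:", rows_after)
print(cleaned)
```

Comparing the row counts before and after, as you did above, is a quick sanity check that exactly the expected rows were removed.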

### Replace Null Values

To avoid deleting so many rows that hardly any data is left, you can replace null values instead - **if it makes sense**. <br>
Let's check where null values are left: 

In [69]:
pv_power.isna().any()

Resolution code            False
Region                     False
Measured & Upscaled        False
Most recent forecast       False
Day Ahead 11AM forecast     True
Day-ahead 6PM forecast     False
Week-ahead forecast         True
Monitored capacity         False
Load factor                False
dtype: bool

In [70]:
pv_power.isna().sum()

Resolution code                0
Region                         0
Measured & Upscaled            0
Most recent forecast           0
Day Ahead 11AM forecast    10752
Day-ahead 6PM forecast         0
Week-ahead forecast        16128
Monitored capacity             0
Load factor                    0
dtype: int64

So let's replace every null value with the last non-missing value before it:

In [71]:
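# Forward-fill and assign the result back; .ffill() does the same as fillna(method="ffill"), which newer pandas versions deprecate
pv_power.loc[:, "Day Ahead 11AM forecast"] = pv_power["Day Ahead 11AM forecast"].ffill()
pv_power.loc[:, "Week-ahead forecast"] = pv_power["Week-ahead forecast"].ffill()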

In [72]:
pv_power.isnull().any()

Resolution code            False
Region                     False
Measured & Upscaled        False
Most recent forecast       False
Day Ahead 11AM forecast    False
Day-ahead 6PM forecast     False
Week-ahead forecast        False
Monitored capacity         False
Load factor                False
dtype: bool
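
One thing to keep in mind: a forward fill works on the order of the rows, not on the time stamps. In a table like `pv_power`, where several regions share the same time stamp, the row directly above may belong to a different region. Whether that matters depends on your analysis. As a sketch of one possible alternative (an assumption, not what was done above), you can fill within each region by combining `groupby` with `ffill`. The data frame `toy` and its `forecast` column are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two regions, gaps in the last two rows
toy = pd.DataFrame({
    "Region": ["Namur", "Liège", "Namur", "Liège"],
    "forecast": [1.0, 5.0, np.nan, np.nan],
})

# Plain forward fill: each gap takes the value of the row directly above it
plain = toy["forecast"].ffill()

# Group-wise forward fill: each gap takes the previous value of the same region
per_region = toy.groupby("Region")["forecast"].ffill()

print(plain.tolist())       # [1.0, 5.0, 5.0, 5.0]
print(per_region.tolist())  # [1.0, 5.0, 1.0, 5.0]
```

With the plain fill, the missing Namur value is taken from Liège. With the group-wise fill, it is taken from the previous Namur row.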

<br>
&#128515; Amazing, now you know how to deal with missing values!

## Creating new columns from existing ones
<br>

The last thing you need to know about data cleaning and manipulation is how to create new columns from existing ones. This can come in handy in many cases. For instance, you might need to calculate the mean load factor and save these values in a new column. Can you think of other use cases? <br>

In pandas, it's easy to make a new column from existing ones. Just look at the example below and check the syntax yourself: 

In [19]:
pv_power.loc[:, 'New_column'] = pv_power.loc[:, 'Resolution code'] + '_' + pv_power.loc[:, 'Region']

In [20]:
pv_power.head()

Datetime,Resolution code,Region,Measured & Upscaled,Most recent forecast,Day Ahead 11AM forecast,Day-ahead 6PM forecast,Week-ahead forecast,Monitored capacity,Load factor,New_column
2022-06-21 00:30:00+02:00,PT15M,Brussels,0.0,0.0,0,0.0,0,128.471,0.0,PT15M_Brussels
2022-06-21 00:30:00+02:00,PT15M,West-Flanders,0.0,0.0,0,0.0,0,944.044,0.0,PT15M_West-Flanders
2022-06-21 00:30:00+02:00,PT15M,Flemish-Brabant,0.0,0.0,0,0.0,0,514.584,0.0,PT15M_Flemish-Brabant
2022-06-21 00:30:00+02:00,PT15M,Antwerp,0.0,0.0,0,0.0,0,1129.311,0.0,PT15M_Antwerp
2022-06-21 00:30:00+02:00,PT15M,Liège,0.0,0.0,0,0.0,0,634.933,0.0,PT15M_Liège


Or maybe you want to convert MW into kW for the column `Monitored capacity`: 

In [21]:
pv_power.loc[:, 'Monit_cap_kW'] = pv_power.loc[:, 'Monitored capacity'] * 1000

In [22]:
pv_power.head()

Datetime,Resolution code,Region,Measured & Upscaled,Most recent forecast,Day Ahead 11AM forecast,Day-ahead 6PM forecast,Week-ahead forecast,Monitored capacity,Load factor,New_column,Monit_cap_kW
2022-06-21 00:30:00+02:00,PT15M,Brussels,0.0,0.0,0,0.0,0,128.471,0.0,PT15M_Brussels,128471.0
2022-06-21 00:30:00+02:00,PT15M,West-Flanders,0.0,0.0,0,0.0,0,944.044,0.0,PT15M_West-Flanders,944044.0
2022-06-21 00:30:00+02:00,PT15M,Flemish-Brabant,0.0,0.0,0,0.0,0,514.584,0.0,PT15M_Flemish-Brabant,514584.0
2022-06-21 00:30:00+02:00,PT15M,Antwerp,0.0,0.0,0,0.0,0,1129.311,0.0,PT15M_Antwerp,1129311.0
2022-06-21 00:30:00+02:00,PT15M,Liège,0.0,0.0,0,0.0,0,634.933,0.0,PT15M_Liège,634933.0
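
The mean load factor mentioned at the start of this section could be added in a similar way. The sketch below is one possible approach on made-up data: it assumes you want the mean per region, and the column names `Mean load factor` and `Deviation` are invented for this example. `groupby(...).transform("mean")` returns one value per row, so the result fits into a new column directly:

```python
import pandas as pd

# Hypothetical toy data; only "Region" and "Load factor" mirror pv_power columns
toy = pd.DataFrame({
    "Region": ["Namur", "Liège", "Namur", "Liège"],
    "Load factor": [10.0, 20.0, 30.0, 40.0],
})

# Mean load factor per region, repeated on every row of that region
toy.loc[:, "Mean load factor"] = toy.groupby("Region")["Load factor"].transform("mean")

# A second new column built from two existing ones
toy.loc[:, "Deviation"] = toy.loc[:, "Load factor"] - toy.loc[:, "Mean load factor"]

print(toy)
```

If you want the overall mean instead of the mean per region, `toy["Load factor"].mean()` gives you a single number that you can assign to a new column in the same way.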


<br>

## Recap, Tips & Takeaways &#128161;

<br>

<div class="alert alert-block alert-success">

**Let's recap what you have learned so far:**

- Data Cleaning is one of the most important steps when working with your data 
- You can drop missing values with: `df_name.dropna(subset=['column_name'])`
- You have different options to replace missing values with `df_name.fillna()`
- To check for missing values you can use `df_name.isnull().any().any()`
- **iloc** works **positionally** with numbers (in our case the index)
- **loc** searches for **labels**
        
</div>