<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Data Cleaning & Manipulation
<br> 

Before you can work with a data set, you often need to prepare it. For instance, you might need to get rid of missing values, rename columns or recode values so that your data makes sense. This step is called **data cleaning**.
<br>
But don't worry, with Pandas, you can do more than just select data that is already there. You can add new columns to your data sets, apply functions, iterate through each row in the DataFrame, and more. This is where you move from "Pandas for exploring your data" to **"Pandas for getting your data ready to fit into models"**.

<img src="./image/Icons/cleaning.png" width = 200>

First, let's load the data sets that we will need for this chapter!

In [None]:
import pandas as pd

pv_power = pd.read_csv('data/energy/ods032.csv', parse_dates=True, sep = ";", index_col=0)
energy = pd.read_csv('data/energy/elia_load_2019_01_15.csv', parse_dates=True, sep = ";", index_col=0)

Before you start to clean the data sets up, always take a look at them first. You need to understand your data and how it is structured before you start to manipulate it!

In [None]:
pv_power.head(n=3)

In [None]:
energy.head(n=3)

## Dealing with Missing Values

The first thing you want to do when cleaning your data is to check whether it contains missing values. Missing values are `blank cells` as well as `NaN` or `n/a` values. In Pandas, these entries will be treated as `null` values. 

With Pandas, you can...

1. identify missing values
2. drop missing values
3. or replace or fill missing values

With `<df_name>.isnull().any().any()` you can check whether your DataFrame contains any missing values. 
<br>

But first, let's go through the command step by step to see what will happen:

In [None]:
pv_power.isnull()

In [None]:
pv_power.isnull().any()

In [None]:
pv_power.isnull().any().any()

In [None]:
print('Does pv_power contain any null values?:', 'y' if pv_power.isnull().any().any() else 'n')
print('Does energy contain any null values?:', 'y' if energy.isnull().any().any() else 'n')

**Bonus question❓:** Why did we use `any()` twice when asking these questions?

As expected, `pv_power` contains **some missing values**. Let's check which columns contain them and how many with:

In [None]:
pv_power.isnull().sum()

Now you need to deal with these `null` values in the `pv_power` DataFrame. How you deal with them depends on the data type of the column in which `null` values are found and of course how you intend to use the data. 

For instance, if your data in `pv_power` doesn't have a `Resolution code` listed, you can still get a lot of information about it from things like its `Region`. But if you want to look at the **photovoltaic power of Belgium** and there is no value for `Monitored capacity`, then it might be difficult to interpret your results 😕. 
<br>

When you **replace, remove or fill** missing values, you change the original data set, which might affect your analysis at a later stage. This is why it is important, for others and for yourself, to document the assumptions you make when you deal with missing values!

You have different options on how to deal with missing values:
- replace missing values by propagating the last valid observation forward to the next valid one with [pandas.DataFrame.ffill](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html#pandas-dataframe-ffill)
- replace missing values by using the next valid observation to fill the gap with [pandas.DataFrame.bfill](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.bfill.html#pandas-dataframe-bfill)
- remove missing values with [pandas.DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html#pandas-dataframe-dropna)
- fill missing values using e.g. the overall mean with [pandas.DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas-dataframe-fillna) 
- and [more](https://pandas.pydata.org/docs/user_guide/missing_data.html#working-with-missing-data)
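To see how these options differ, here is a minimal sketch on a made-up toy DataFrame. The column name `load` and its values are assumptions for illustration only and do not come from the Elia files:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from the Elia files): one column with two gaps
toy = pd.DataFrame({"load": [1.0, np.nan, 3.0, np.nan, 5.0]})

forward = toy.ffill()                          # each gap takes the previous valid value
backward = toy.bfill()                         # each gap takes the next valid value
dropped = toy.dropna()                         # rows that contain a gap are removed
mean_filled = toy.fillna(toy["load"].mean())   # each gap takes the column mean

print(forward["load"].tolist())
print(backward["load"].tolist())
print(dropped["load"].tolist())
print(mean_filled["load"].tolist())
```

None of these calls changes `toy` itself, because `inplace=True` is not used here. Each one returns a new DataFrame that you can compare with the original.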

### [Dropping missing data](https://pandas.pydata.org/docs/user_guide/missing_data.html#dropping-missing-data)

Let's try dropping the **378 rows** of `pv_power` that have null values in the column `Measured & Upscaled`.

1. Check the null values in the column, as well as how many rows are affected:

In [None]:
pv_power[pv_power["Measured & Upscaled"].isnull()]

In [None]:
isnull_rows = pv_power[pv_power['Measured & Upscaled'].isnull()].shape[0]
print(f"Number of rows where \"Measured & Upscaled\" is null: {isnull_rows}")

2. Continue and check how many rows the **original** DataFrame has: 

In [None]:
pv_power["Measured & Upscaled"].shape[0]

3. Out of a total of **56448** rows, drop the **378** rows where the column `Measured & Upscaled` has null values, and do it in place:

In [None]:
pv_power.dropna(subset=['Measured & Upscaled'], inplace=True)
pv_power.shape[0]

&#128077; Well done, the **378** rows with null values in `Measured & Upscaled` have now been dropped from `pv_power`. Conveniently, `pv_power` now contains no other null values either: the only other column with null values was `Load factor`, and its null values sat in the same rows as those of `Measured & Upscaled`!

In [None]:
print('Does pv_power contain any null values?:', 'y' if pv_power.isnull().any().any() else 'n')

### [Filling missing data](https://pandas.pydata.org/docs/user_guide/missing_data.html#filling-missing-data)

Dropping rows can leave you with very little data, or none at all. Instead, you can also try to replace the null values - **if it makes sense**. 
<br>

Let's first restore the original data of `pv_power`, because dropping all rows with null values in `Measured & Upscaled` also removed every other null value!

In [None]:
pv_power = pd.read_csv('data/energy/ods032.csv', parse_dates=True, sep = ";", index_col=0)
pv_power.isnull()

In [None]:
pv_power.isnull().sum()

Now that we have restored `pv_power` with all its null values, let's fill the gaps forward by using the **last non-missing value before the missing values**. The DataFrame is quite big, so we slice it to check that the forward fill has indeed filled each gap with the last non-missing value before it.

In [None]:
# before the forward fill
pv_power.iloc[2687:2703,:]

In [None]:
pv_power.ffill(inplace=True)

In [None]:
# after the forward fill
pv_power.iloc[2687:2703,:]

In [None]:
pv_power.isnull().sum()


The 14 rows with missing values in `pv_power` remain because there was no non-missing value before them that `ffill()` could use. But there is also the option to fill the remaining missing values by using the next valid observation with `bfill()`.

In [None]:
pv_power.bfill(inplace=True)
pv_power.isnull().sum()
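If you want to see why `ffill()` alone can leave gaps behind, the following minimal sketch reproduces the effect on a made-up Series (the values are assumptions for illustration, not Elia data):

```python
import numpy as np
import pandas as pd

# Hypothetical Series that starts with two missing values
s = pd.Series([np.nan, np.nan, 2.0, np.nan, 4.0])

# ffill() has no earlier value to copy into the first two positions
after_ffill = s.ffill()
print(after_ffill.isnull().sum())

# bfill() then fills the leading gaps from the next valid observation
after_both = s.ffill().bfill()
print(after_both.tolist())
```

The order matters: running `bfill()` first would fill the gap at position 3 with `4.0` instead of `2.0`.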

<br>
&#128515; Amazing, now you know how to deal with missing values!

## Creating new columns from existing ones
<br>

The last thing you have to know about data cleaning and manipulation is how to **create new columns from existing ones**. This can come in handy in many cases. For instance, you might need to calculate the *mean load factor* and save these values in a new column. Can you think of other use cases? 
<br>

In Pandas, it's easy to make a new column from existing ones. Just look at the example below and check the syntax yourself: 

In [None]:
pv_power.loc[:, 'New_column'] = pv_power.loc[:, 'Resolution code'] + '_' + pv_power.loc[:, 'Region']

In [None]:
pv_power.head()

Or maybe you want to convert MW into kW for the column `Monitored capacity`: 

In [None]:
pv_power.loc[:, 'Monit_cap_kW'] = pv_power.loc[:, 'Monitored capacity'] * 1000

In [None]:
pv_power.head()
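The *mean load factor* mentioned earlier could be added as a new column in a similar way. This is only a sketch under assumptions: the toy values below are made up, the column name `Mean load factor` is hypothetical, and grouping by `Region` is just one possible choice:

```python
import pandas as pd

# Toy data (hypothetical values) using two column names from pv_power
toy = pd.DataFrame({
    "Region": ["Belgium", "Belgium", "Flanders", "Flanders"],
    "Load factor": [10.0, 20.0, 30.0, 50.0],
})

# transform("mean") returns one value per row, so the result
# lines up with the original rows and can be stored as a new column
toy["Mean load factor"] = toy.groupby("Region")["Load factor"].transform("mean")

print(toy)
```

Every row now carries the mean load factor of its own region, which makes it easy to compare a single measurement with the regional average.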

<br>

## Recap, Tips & Takeaways &#128161;

<br>

<div class="alert alert-block alert-success">

**Let's recap what you have learned so far:**

- Data Cleaning is one of the most important steps when working with your data 
- `dropna()` drops rows or columns with missing data
- You have different options to fill missing data with `fillna()`, `ffill()` or `bfill()`
- To check for missing values you can use `isnull().any().any()`
- **iloc** selects **positionally** with integer positions (row and column numbers, starting at 0)
- **loc** selects by **label** (index values and column names)
        
</div>
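The difference between `iloc` and `loc` from the recap can be sketched on a small made-up DataFrame with a date index. The values and dates are assumptions for illustration only:

```python
import pandas as pd

# Toy data (hypothetical values) with a DatetimeIndex, similar in shape to pv_power
toy = pd.DataFrame(
    {"Region": ["Belgium", "Flanders", "Wallonia"], "Load factor": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
)

# iloc: integer positions, the end of the slice is excluded
by_position = toy.iloc[0:2, 1]

# loc: labels, the end of the slice is included
by_label = toy.loc["2019-01-01":"2019-01-02", "Load factor"]

print(by_position.equals(by_label))
```

Both selections return the first two values of `Load factor`, but they get there differently: `iloc` counts rows and columns, while `loc` looks up index values and column names.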

## Extra resources
<br>

- Check out the [Pandas DataFrame documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for more functions