<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Conditional Selections: Filtering
<br>

One of the most beloved and useful functions in Excel is the filter function - and of course, you can do the same in Python. You can use conditional selections to select specific rows and narrow your analysis down. And to make things easier, you can save selections you plan to use often as their own variables. Let's get right to it!

<img src= "./image/conditional_selections.png" width = 300>

First, let's have a look at the dataset "physical flow" again: 

In [None]:
import pandas as pd

In [None]:
energy_flow = pd.read_csv("./data/energy/physical_flow_2021_1_01.csv", sep = ";", parse_dates = True, index_col = 0)
energy_flow.head()

As you can see, the data set describes different physical flow values from different countries (the neighbouring bidding zones of Belgium), measured at similar datetimes.
Let's check which countries are represented here:

In [None]:
energy_flow["Control area"].unique()

Imagine you want to take a closer look at France and Luxembourg. To do so, you select the column of interest and compare it with your area of choice. This conditional statement returns either True or False for each row, and the result is then used to filter your data set: only the rows marked True are kept.

In [None]:
france = energy_flow[energy_flow["Control area"] == "France"]

In [None]:
luxembourg = energy_flow[energy_flow["Control area"] == "Luxembourg"]

In [None]:
france.head(n=3)

In [None]:
luxembourg.head(n=3)
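To make the two steps behind a filter visible, here is a minimal sketch on a small toy DataFrame. The name `toy_flow` and its values are made up for illustration and are not part of the real data set; only the column names are taken from it.

```python
import pandas as pd

# Toy data standing in for the real file (values are invented for illustration)
toy_flow = pd.DataFrame({
    "Control area": ["France", "Luxembourg", "France", "Germany"],
    "Physical Flow Value": [120.0, -30.0, 80.0, 45.0],
})

# Step 1: the comparison yields one True/False value per row
is_france = toy_flow["Control area"] == "France"
print(is_france.tolist())  # [True, False, True, False]

# Step 2: indexing with that boolean Series keeps only the True rows
toy_france = toy_flow[is_france]
print(toy_france)
```

Writing `toy_flow[toy_flow["Control area"] == "France"]` simply does both steps in one line.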

### Exercise

1. Look at the dataframe `energy_flow` above and select the control area "Germany". 
2. Save your selection into a variable called `germany`.
3. Look at the first three rows to see if it worked. 

Since the data set includes hourly data from one day, you could calculate the mean physical flow of that day per selected country:

In [None]:
print('Mean Physical Flow of France in MW: ', round(france['Physical Flow Value'].mean()))

In [None]:
print('Mean Physical Flow of Luxembourg in MW: ', round(luxembourg['Physical Flow Value'].mean()))

Since a positive figure means export from Belgium, you can now use these to calculate more targeted metrics: 

In [None]:
print('On that day, Belgium exports {} MW on average more to France compared to Luxembourg.'\
      .format(round(france['Physical Flow Value'].mean() - luxembourg['Physical Flow Value'].mean(),2)))

## Advanced Conditionals: Using Masks 
<br>

Sure, it is nice to filter on just one thing. But what if you want **to filter on more than one criterion**? Then it can be easier to use a mask. No, not a face mask. Rather a boolean mask. <br>

Imagine you would like to select not just France or Luxembourg, but both countries as well as Germany. With a mask, you specify these multiple conditions. Your mask evaluates the conditions and returns either True or False for each row. In a second step, this mask is used as a filter, as you have already learned in Filtering.
The pipe operator `|` is used as an OR whereas the `&` is used as an AND. Each condition has to be wrapped in its own parentheses. But enough talking, let's try it out! 

1. Create a variable that stores all the conditions you would like to choose

In [None]:
countries_mask = (energy_flow["Control area"] == "France") | (energy_flow["Control area"] == "Luxembourg") | (energy_flow["Control area"] == "Germany")

2. Look at your mask. It returns whether your conditions have been met for each row or not

In [None]:
countries_mask

3. Now you can apply your mask to the DataFrame to keep only the matching rows: 

In [None]:
selected_countries = energy_flow[countries_mask]

In [None]:
selected_countries["Control area"].unique()

&#128526; nice, well done!
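As a side note, here is a hedged sketch of two related patterns on toy data: the pandas method `isin()` as a shorter way to write several OR conditions on the same column, and `&` to combine two different conditions. The DataFrame `toy_flow`, the list `wanted` and all values are assumptions made up for this example.

```python
import pandas as pd

# Toy data (invented values), not the real physical flow file
toy_flow = pd.DataFrame({
    "Control area": ["France", "Luxembourg", "Germany", "Netherlands", "France"],
    "Physical Flow Value": [120.0, -30.0, 45.0, 300.0, -80.0],
})

wanted = ["France", "Luxembourg", "Germany"]

# Chained OR conditions, written the same way as in the cells above
or_mask = (
    (toy_flow["Control area"] == "France")
    | (toy_flow["Control area"] == "Luxembourg")
    | (toy_flow["Control area"] == "Germany")
)

# The same selection expressed with isin()
isin_mask = toy_flow["Control area"].isin(wanted)
print(or_mask.equals(isin_mask))

# AND: one of the wanted countries *and* a positive flow value
positive_mask = isin_mask & (toy_flow["Physical Flow Value"] > 0)
print(toy_flow[positive_mask])
```

`isin()` tends to stay readable when the list of countries grows, while `|` and `&` are needed as soon as the conditions involve different columns.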

Let's create another mask just for fun. Now we want all the rows where the physical flow is higher than 1000 MW in either direction, so that we can then see whether it is import or export. For that, let's have a look at the original dataframe again: 

In [None]:
energy_flow.head()

Let's define that in this case, a high physical flow means a value below -1000 MW or above 1000 MW.

In [None]:
high_flow_mask = (energy_flow["Physical Flow Value"] < -1000) | (energy_flow["Physical Flow Value"] > 1000)

In [None]:
high_flow_mask

In [None]:
high_flow = energy_flow[high_flow_mask]

In [None]:
high_flow.head()
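To actually label each high flow as import or export, one possible approach is sketched below on toy data. It assumes, as stated earlier, that positive values are exports from Belgium and, by extension, that negative values are imports. The column name `Direction`, the variable names and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (invented values), not the real physical flow file
toy_flow = pd.DataFrame({
    "Control area": ["France", "Germany", "Netherlands", "Luxembourg"],
    "Physical Flow Value": [1500.0, -1200.0, 400.0, -50.0],
})

# abs() > 1000 is equivalent to (value < -1000) | (value > 1000)
toy_high_mask = toy_flow["Physical Flow Value"].abs() > 1000

# .copy() so that adding a column does not modify the original DataFrame
toy_high = toy_flow[toy_high_mask].copy()

# Assumption: positive = export from Belgium, negative = import
toy_high["Direction"] = np.where(
    toy_high["Physical Flow Value"] > 0, "export", "import"
)
print(toy_high)
```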

## Groupby
<br> 

One of the most flexible ways to group your data and aggregate in pandas is with `.groupby()`. So what does this actually mean? Let's have a look at the following example:

In [None]:
energy_flow.groupby("Control area").mean(numeric_only=True)

As you can see from the example above, `groupby()` groups your data by the column(s) that you hand over to the function, in this case "Control area". On its own, `groupby()` only returns a grouped object; **to get a result, you need an aggregator** such as `sum()` or `mean()` that tells pandas what to do with each group. Here, `mean()` calculates the mean per group. Also notice that the **column you grouped by becomes your new index**!

**Question:**
Do you know why only the column "Physical Flow Value" is displayed in our example?

You can do many more cool things. If you want to change the order in which the aggregated values are displayed, you can just chain `.sort_values()` to your groupby statement. In general, you can use `.sort_values()` to sort a DataFrame by any of its columns.

In [None]:
energy_flow.groupby("Control area").mean(numeric_only=True).sort_values(by=["Physical Flow Value"], ascending=False)
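The grouping and sorting steps can also be traced by hand on a tiny example. The sketch below uses a made-up DataFrame `toy_flow` with invented values, so that each group mean is easy to verify.

```python
import pandas as pd

# Toy data (invented values): two rows per control area
toy_flow = pd.DataFrame({
    "Control area": ["France", "France", "Germany", "Germany"],
    "Physical Flow Value": [100.0, 300.0, -50.0, 150.0],
})

# groupby() alone only prepares the groups, it does not compute anything yet
grouped = toy_flow.groupby("Control area")
print(type(grouped).__name__)

# The aggregator produces one row per group; the group key becomes the index
toy_means = grouped.mean()
print(toy_means.index.name)

# Chaining sort_values() orders the aggregated result
toy_sorted = toy_means.sort_values(by="Physical Flow Value", ascending=False)
print(toy_sorted)
```

France averages (100 + 300) / 2 = 200 and Germany (-50 + 150) / 2 = 50, so France ends up first in the sorted result.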

For our next example, let's have a look at a bigger and more complex data set. To do so, you first have to import the csv: 

In [None]:
pf_high_voltage = pd.read_csv("./data/energy/physical_flow_high_voltage_2022_may_30.csv", sep = ";", index_col = 0)

In [None]:
pf_high_voltage.head()

The data set above describes the physical flow on the Belgian 380-kV lines (high-voltage) and on the interconnections with the neighboring countries. The "Loading" indicates how heavily the line is loaded relative to the maximum possible line loading. For the purpose of this training, you look at the data from only one day - the 30th of May 2022. The "Physical Flow" is given in MW whereas the "Loading" is given in %. 

Let's use what we have learned so far: 

In [None]:
pf_high_voltage.groupby("Asset name").mean(numeric_only=True).head()

**Tip:** Sometimes, if you want to aggregate different columns in different ways and to make your code cleaner, it is best to move the aggregations out and store them as a dictionary. 

In [None]:
aggs = {
    'Physical Flow': 'mean',
    'Loading': 'max'
}

pf_high_voltage.groupby('Asset name').agg(aggs).sort_values(by=['Physical Flow','Loading']).tail()

If you need to aggregate a certain column in several different ways, you can store the column and the different aggregators as key-value pairs in a dictionary.

In [None]:
aggs2 = {
    'Loading': ['min', 'mean', 'max', 'std']
}

loading_stats = pf_high_voltage.groupby('Asset name').agg(aggs2)

In [None]:
loading_stats.head()

As you can see, this multi-aggregation creates a **multi-index** on the columns: the top level is the original column name and the second level is the aggregator. Multi-indexes can be difficult to work with. But no worries, there is an easy way to deal with it. For instance, you can **drop the top level**, in this case "Loading": 

In [None]:
loading_stats.columns = loading_stats.columns.droplevel(level = 0)

In [None]:
loading_stats.head()

It might be necessary to rename the new "columns", so you keep in mind that they are all stats of "loading".

In [None]:
loading_stats.columns = ["loading_min", "loading_mean", "loading_max", "loading_std"]

In [None]:
loading_stats.head()
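Instead of dropping a level and then typing the new names by hand, the two levels can also be joined programmatically. This is only one possible approach, sketched on toy data; `toy_lines`, `toy_stats` and the values are made up for illustration.

```python
import pandas as pd

# Toy data (invented values) with the same column names as the real data set
toy_lines = pd.DataFrame({
    "Asset name": ["Line A", "Line A", "Line B", "Line B"],
    "Loading": [20.0, 40.0, 55.0, 65.0],
})

toy_stats = toy_lines.groupby("Asset name").agg({"Loading": ["min", "mean", "max"]})

# Each column label is a tuple such as ("Loading", "min");
# join the two parts with "_" and lower-case the result
toy_stats.columns = ["_".join(levels).lower() for levels in toy_stats.columns]
print(toy_stats.columns.tolist())
```

This keeps the names in sync with the aggregators automatically, which helps when the list of aggregators changes later.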

And now, you can sort by any of the columns. Here, by average loading:

In [None]:
loading_stats.sort_values(by='loading_mean', ascending=False).head()
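pandas also offers so-called named aggregation, where each output column is given its name directly inside `.agg()`, so no multi-index is created in the first place. The following is a minimal sketch on toy data; `toy_lines`, `toy_named` and the values are assumptions for this example.

```python
import pandas as pd

# Toy data (invented values) with the same column names as the real data set
toy_lines = pd.DataFrame({
    "Asset name": ["Line A", "Line A", "Line B", "Line B"],
    "Loading": [20.0, 40.0, 55.0, 65.0],
})

# Named aggregation: new_column_name=(source_column, aggregator)
toy_named = toy_lines.groupby("Asset name").agg(
    loading_min=("Loading", "min"),
    loading_mean=("Loading", "mean"),
    loading_max=("Loading", "max"),
)

# The columns are already flat, so sorting works right away
print(toy_named.sort_values(by="loading_mean", ascending=False))
```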

## Selecting the max and min values with Index Max and Min
<br>

The last cool thing that is definitely worth learning in the beginning is `idxmin` and `idxmax`. In addition to `.max()` and `.min()`, which return the maximum or minimum values, you can use `.idxmax()` and `.idxmin()` to return the *index label* of the row that holds the maximum or minimum value. <br>

For example, let's use `.idxmax()` to find the "Asset name" with the highest standard deviation in its loading:

In [None]:
loading_stats["loading_std"].idxmax()
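The difference between the value and its index label is easiest to see side by side. The sketch below uses a small made-up table `toy_stats` with invented standard deviations.

```python
import pandas as pd

# Toy stats table (invented values), indexed by asset name
toy_stats = pd.DataFrame(
    {"loading_std": [2.5, 7.1, 4.0]},
    index=pd.Index(["Line A", "Line B", "Line C"], name="Asset name"),
)

largest_value = toy_stats["loading_std"].max()      # the value itself
largest_label = toy_stats["loading_std"].idxmax()   # the index label of that value
smallest_label = toy_stats["loading_std"].idxmin()
print(largest_value, largest_label, smallest_label)

# The label can be passed to .loc to retrieve the full row
print(toy_stats.loc[largest_label])
```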

<br>

## Recap, Tips & Takeaways &#128161;

<br>

<div class="alert alert-block alert-success">

**Let's see what might be cool to keep in mind:**

- there are two ways to combine dataframes: `pd.merge()` and `pd.concat()`
- you can get quick stats with `df_name.describe()`
- `df_name["column_name"].unique()` lists all the unique values within a column 
- filters in Python are a boolean conditional selection: `df_name[df_name["column_name"] == "target_value"]`
- multiple filters can be combined with the `|` (OR) and `&` (AND) operators, with each condition in its own parentheses
- you can define aggregators for several columns with a dictionary: <br>
    
    aggs = {
        'column_1': 'mean',
        'column_2': 'max'
    }
        
</div>