<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Conditional Selections: Filtering
<br>

One of the most popular and useful features in Microsoft Excel is the filter function, and of course you can do the same in Python. You can use conditional selections to select specific rows and narrow down your analysis. To make things easier, you can save selections you plan to reuse in their own variables. Let's get right to it!

<img src= "./image/conditional_selections.png" width = 300>

First, let's have a look at the dataset "physical flow" again: 

In [None]:
import pandas as pd

energy_flow = pd.read_csv("./data/energy/physical_flow_2021_1_01.csv", sep = ";", parse_dates = True, index_col = 0)
energy_flow.head()

As you can see, the data set describes physical flow values for different countries (the neighbouring bidding zones of Belgium), measured at the same dates and times.
Let's check which countries are represented here:

In [None]:
energy_flow["Control area"].unique()

Imagine you want to take a closer look at France and Luxembourg. To do so, select the column of interest and compare it with your area of choice. This conditional statement returns `True` or `False` for each row, and the result is then used to filter your data set.

In [None]:
france = energy_flow[energy_flow["Control area"] == "France"]
luxembourg = energy_flow[energy_flow["Control area"] == "Luxembourg"]

In [None]:
france.head(n=3)

In [None]:
luxembourg.head(n=3)
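
To see the two steps behind such a filter separately, here is a minimal sketch on a tiny hand-made DataFrame. The values and the `toy_` variable names are invented for illustration; only the column names are taken from the dataset above.

```python
import pandas as pd

# Toy data with invented values; only the column names mirror the real dataset
toy_flow = pd.DataFrame({
    "Control area": ["France", "Luxembourg", "France", "Germany"],
    "Physical Flow Value": [500.0, -120.0, 300.0, 80.0],
})

# Step 1: the comparison returns a boolean Series, one value per row
is_france = toy_flow["Control area"] == "France"
print(is_france.tolist())  # [True, False, True, False]

# Step 2: indexing with that Series keeps only the rows marked True
toy_france = toy_flow[is_france]
print(toy_france)
```

Writing `toy_flow[toy_flow["Control area"] == "France"]` simply does both steps in one line.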

### Exercise

1. Look at the DataFrame `energy_flow` above and select the **control area "Germany"**. 
2. Save your selection into a variable called `germany`.
3. Look at the **first three rows** to see if it worked. 

In [None]:
# delete this line and replace it with your solution

Since the data set includes hourly data from one day, you could calculate the **mean physical flow of that day per selected country**:

In [None]:
print('Mean Physical Flow of France in MW: ', round(france['Physical Flow Value'].mean()))

In [None]:
print('Mean Physical Flow of Luxembourg in MW: ', round(luxembourg['Physical Flow Value'].mean()))

Since a positive figure means export from Belgium, you can now use these to calculate more targeted metrics: 

In [None]:
print(f"On that day, Belgium exports {round(france['Physical Flow Value'].mean() - luxembourg['Physical Flow Value'].mean(),2)} MW on average more to France compared to Luxembourg.")

## Advanced Conditionals: Using Masks 
<br>

Sure, it is nice to filter on just one thing. But what if you want **to filter on more than one criterion**? Then it can be easier to use a mask. No, not a face mask 😜! Rather a **boolean mask**.
<br>

Imagine you would like to select not just France or Luxembourg, but **both** countries as well as Germany. With a mask, you specify these multiple conditions. The mask evaluates the conditions and returns either `True` or `False` for each row. In a second step, this mask is used as a filter, just as in the filtering section above.
The pipe operator `|` is used as an OR whereas `&` is used as an AND. Wrap each condition in parentheses, because `|` and `&` bind more tightly than comparisons such as `==`. But enough talking, let's try it out!

1. Create a variable that stores all the conditions you would like to choose

In [None]:
countries_mask = (energy_flow["Control area"] == "France") | (energy_flow["Control area"] == "Luxembourg") | (energy_flow["Control area"] == "Germany")

2. Look at your mask. It shows for each row whether your conditions have been met

In [None]:
countries_mask

3. Now you can apply your mask directly to the DataFrame:

In [None]:
selected_countries = energy_flow[countries_mask]

In [None]:
selected_countries["Control area"].unique()
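
As a side note, here is a small sketch of two variations on this pattern, again on invented toy data. It assumes you are free to use `Series.isin()`, a pandas method that is not used elsewhere in this notebook, as a shorter way to write several OR conditions on the same column. It also shows `&` combining conditions on two different columns.

```python
import pandas as pd

# Toy data with invented values; only the column names mirror the real dataset
toy_flow = pd.DataFrame({
    "Control area": ["France", "Luxembourg", "Germany", "Netherlands", "France"],
    "Physical Flow Value": [500.0, -120.0, 80.0, 950.0, -300.0],
})

# OR: chaining several == comparisons with | ...
or_mask = (
    (toy_flow["Control area"] == "France")
    | (toy_flow["Control area"] == "Luxembourg")
    | (toy_flow["Control area"] == "Germany")
)
# ... selects the same rows as a single .isin() call
isin_mask = toy_flow["Control area"].isin(["France", "Luxembourg", "Germany"])
print(or_mask.equals(isin_mask))  # True

# AND: both conditions must hold, here "France" and a positive flow value
france_positive_mask = (
    (toy_flow["Control area"] == "France")
    & (toy_flow["Physical Flow Value"] > 0)
)
print(toy_flow[france_positive_mask])
```

The explicit `|` version stays useful when the conditions involve different columns, where `.isin()` does not apply.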

Nice and well done! ✨ Let's create another mask just for fun. &#128526;

Now we want to have all the data related to a physical flow that is **higher than 1000 MW** and find out whether it is an import or an export. For that, let's have a look at the original DataFrame `energy_flow` again: 

In [None]:
energy_flow.head()

Let's define that in this case, a high physical flow means a value below -1000 MW or above 1000 MW.

In [None]:
high_flow_mask = (energy_flow["Physical Flow Value"] < -1000) | (energy_flow["Physical Flow Value"] > 1000)

In [None]:
high_flow_mask

In [None]:
high_flow = energy_flow[high_flow_mask]

In [None]:
high_flow.head()
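
To answer the import-or-export question, one possible approach is to label each row by the sign of its value, since a positive figure means export from Belgium. The sketch below uses invented toy values, and the `Direction` column is a hypothetical helper that does not exist in the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; only the column names mirror the real dataset
toy_flow = pd.DataFrame({
    "Control area": ["France", "Netherlands", "Germany", "France"],
    "Physical Flow Value": [1500.0, -1200.0, 300.0, -50.0],
})

# abs() > 1000 selects the same rows as (value < -1000) | (value > 1000)
toy_high_flow_mask = toy_flow["Physical Flow Value"].abs() > 1000
# .copy() so that adding a column does not touch toy_flow
toy_high_flow = toy_flow[toy_high_flow_mask].copy()

# "Direction" is a hypothetical helper column: positive means export from Belgium
toy_high_flow["Direction"] = np.where(
    toy_high_flow["Physical Flow Value"] > 0, "export", "import"
)
print(toy_high_flow)
```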

## Groupby
<br> 

One of the most flexible ways to group your data and aggregate them in Pandas is using `.groupby()`. So what does this actually mean? Let's have a look at the following examples:

In [None]:
# numeric_only=True sums only the numeric columns and skips text columns
energy_flow.groupby('Control area').sum(numeric_only=True)

In [None]:
energy_flow.groupby('Control area')['Physical Flow Value'].mean()

As you can see from the examples above, `.groupby()` groups your data by the column(s) that you hand over to the function, in this case "Control area". On its own, `.groupby()` only prepares the groups. To get a result you **need an aggregator** such as `.sum()` or `.mean()`, which tells pandas what to do with each group. Please note that the **column you grouped by becomes your new index**!
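
The index behaviour is easiest to see on a small example. The sketch below uses invented toy values and shows how `.reset_index()` turns the group labels back into a regular column if you prefer that.

```python
import pandas as pd

# Toy data with invented values; only the column names mirror the real dataset
toy_flow = pd.DataFrame({
    "Control area": ["France", "Luxembourg", "France", "Luxembourg"],
    "Physical Flow Value": [400.0, -100.0, 600.0, -300.0],
})

toy_mean = toy_flow.groupby("Control area")["Physical Flow Value"].mean()
print(toy_mean.index.tolist())  # ['France', 'Luxembourg']: the group labels are the index
print(toy_mean["France"])       # 500.0: look up a group by its label

# reset_index() moves the group labels back into a regular column
toy_mean_df = toy_mean.reset_index()
print(toy_mean_df.columns.tolist())  # ['Control area', 'Physical Flow Value']
```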

You can do many more cool things. If you want to change the order in which the aggregated values are displayed, you can just chain the command `.sort_values()` to your groupby statement. In general, you can use `.sort_values()` to sort a DataFrame by any of its columns.

In [None]:
energy_flow.groupby('Control area')['Physical Flow Value'].mean().sort_values(ascending=False)
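
On a full DataFrame, `.sort_values()` needs to know which column to sort by, which you pass with the `by` argument. A minimal sketch with invented toy values:

```python
import pandas as pd

# Toy data with invented values; only the column names mirror the real dataset
toy_flow = pd.DataFrame({
    "Control area": ["France", "Luxembourg", "Germany"],
    "Physical Flow Value": [500.0, -120.0, 80.0],
})

# Sort by one column, largest value first; toy_flow itself is left unchanged
toy_sorted = toy_flow.sort_values(by="Physical Flow Value", ascending=False)
print(toy_sorted["Control area"].tolist())  # ['France', 'Germany', 'Luxembourg']
```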

In [None]:
pf_high_voltage = pd.read_csv("./data/energy/physical_flow_high_voltage_2022_may_30.csv", sep = ";", index_col = 0)
pf_high_voltage.head()

<br>

## Recap, Tips & Takeaways &#128161;

<br>

<div class="alert alert-block alert-success">

**Let's see what might be cool to keep in mind:**

- There are two ways to combine DataFrames: [pandas.merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html#pandas-merge) and [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html#pandas-concat)
- You can get quick stats with [pandas.DataFrame.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas-dataframe-describe)
- `df_name["column_name"].unique()` lists all the unique values within a column
- Filters in pandas are boolean conditional selections: `df_name[df_name["column_name"] == "target_value"]`
- Multiple filters can be combined in one statement with `|` (OR) and `&` (AND); wrap each condition in parentheses
        
</div>