<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Advanced Manipulation
<br>

It is time to learn some advanced data manipulation. In this section, you will learn how to work through your data with loops. Looping lets you calculate new values based on your existing values. There are several ways to do this, such as `iterrows()`, `apply()`, or vectorization, and the main difference between them is speed. To keep things simple, you will first focus on `apply()` and then get to know vectorization. Don't worry: vectorization sounds like hardcore math, but it isn't.

<ins>You will learn:</ins>
1. How to access and manipulate data in Series / DataFrames with **apply()**.
2. How to speed up the process by using **vectorization** methods.

Let's get to it!

## Looping with apply()
<br>

A simple DataFrame consists of Series, which you know as columns. If you would now like to iterate and apply a function on your data, this would probably look a bit like this: 

`for row in range(len(df_short)): #  number of rows
    for col in range(len(df_short.columns)): # number of columns
        print(df_short.iloc[row, col])`
        
But there is a **much easier way** than using nested loops! With `apply()` you can apply a function along a row or a column. First, let's have a look at pandas Series before we move on to DataFrames:

### Accessing/Manipulating data in <ins>Series</ins> with **.apply()**

- `apply()` applies a function to each element in a Series e.g. calculate the string length of each value within a column
- see the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) for more details

<img src = "./image/Icons/apply_column_example.png" width = 200>

Let's do an example on how to apply the `len()` function using `apply()`.
<br>

&#128161; About `len()`: This function is a Python built-in function. It returns the length of an object. For instance, it returns the number of items in a list if applied on a list or the length of a string if applied on a string.

In [None]:
import pandas as pd

physical_flow = pd.read_csv("./data/energy/physical_flow_2021_1_01.csv", sep = ";")
physical_flow.head()

Let's shorten the df for this example:

In [None]:
df_short = physical_flow.loc[:5, :"Control area"]
df_short

And create a Series: 

In [None]:
series_short = df_short["Control area"]
series_short

Now let's apply the `len()` function using `apply()`:

In [None]:
area_name_length = series_short.apply(len)
area_name_length

See, it returns the length of the string (in this case the length of each control area name) for each element of the Series.
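The same pattern works with any function that accepts a single element, including a `lambda`. Here is a minimal sketch; the sample names are made up for illustration and are not taken from the course data:

```python
import pandas as pd

# Hypothetical sample values standing in for the "Control area" column
areas = pd.Series(["Elia", "TenneT", "RTE"])

# Built-in function: called once per element
name_lengths = areas.apply(len)

# Custom lambda: also called once per element
upper_names = areas.apply(lambda name: name.upper())

print(name_lengths.tolist())  # [4, 6, 3]
print(upper_names.tolist())   # ['ELIA', 'TENNET', 'RTE']
```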

### Accessing/Manipulating data in a <ins>DataFrame</ins> with **.apply()**

- you can use `apply()` along either axis of a DataFrame, which means you can loop through columns **or** rows:
    - `axis = 0` passes each column to the function
    - `axis = 1` passes each row to the function.
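To see the two axes side by side, here is a small sketch on a toy DataFrame (the column names `a` and `b` are hypothetical). With `axis=0` the function receives one column at a time, and with `axis=1` it receives one row at a time:

```python
import pandas as pd

# Hypothetical toy data, not from the course files
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the lambda receives each column as a Series
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the lambda receives each row as a Series
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```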

**axis = 0:**

<img src = "./image/loop_column.png" width = 600>

**axis = 1:**

<img src = "./image/loop_row.png" width = 600>

So let's check out our DataFrame again...

In [None]:
physical_flow.head()

... and shorten it, so that it is more manageable for this example

In [None]:
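# .copy() makes an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
ph_flow_short = physical_flow.loc[:10,"Resolution code":"Physical Flow Value"].copy()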
ph_flow_short

First, let's write a function that returns `True` if the "Physical Flow Value" is zero or positive and `False` otherwise. A positive figure means export from Belgium, while a negative figure means import into Belgium.

In [None]:
def is_export_from_BE(row):
    if row["Physical Flow Value"] >= 0:
        return True
    else:
        return False

Now create a new column called `BE Export Check` that applies this newly defined function `is_export_from_BE` on each row:

In [None]:
ph_flow_short.loc[:, "BE Export Check"] = ph_flow_short.apply(is_export_from_BE, axis = 1)
ph_flow_short.head()

&#128515; Can you spot the new column with the previously defined function applied? Amazing, well done!

### Exercise

Let's do another one. 

In [None]:
import pandas as pd

physical_flow = pd.read_csv("./data/energy/physical_flow_2021_1_01.csv", sep = ";")
ph_flow_exercise = physical_flow.loc[:,"Resolution code":"Physical Flow Value"]
ph_flow_exercise


1. First, create a new function called `flow_check` that returns **">200 MW"** if the `Physical Flow Value` is greater than 200 MW and **"<200 MW"** otherwise.


In [None]:
# delete this line and replace it with your solution

2. Create a new column called `Flow Size` that applies this function onto your DataFrame `ph_flow_exercise`

In [None]:
# delete this line and replace it with your solution

## Additional functions: Series.map() / DataFrame.map()
<br>

Since you now know how to apply functions in a loop, let's check out some additional functions that really come in handy!

- `Series.map()` is used to substitute each value with another value
- `DataFrame.map()` is used for element-wise operations across the whole DataFrame

So what does this mean...?

### Map

`map()` is a Series method. It allows us to map existing values of a Series to a different set of values. For instance, imagine you have a set of data with one column called "Gender" consisting of the string values "male", "female" and "diverse". You would like to translate those values, so that `male = 0`, `female = 1`, and `diverse = 2`. To do so, you can use `map()`. 

It...
- is a Series Method
- allows us to map an existing value of a Series to a different set of values

&#128526; Let's practice!

In the energy sector, you probably won't have to deal with gender data. However, `map()` still comes in handy. In the following example, you could e.g. replace `True` with `Export to` and `False` with `Import from`. 

In [None]:
ph_flow_short.head()

In [None]:
ph_flow_short.loc[:,"BE physical energy flow"] = ph_flow_short["BE Export Check"].map({True: "Export to", False: "Import from"})
ph_flow_short.head()

Does the syntax look familiar to you? Right, the input of the `map()` function here is a dictionary, so it is based on key-value pairs. 
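One detail is worth knowing when you pass a dictionary: values that have no matching key are turned into `NaN`. The following sketch uses made-up values to show both cases:

```python
import pandas as pd

# Every value has a matching key in the dictionary
flags = pd.Series([True, False, True])
labels = flags.map({True: "Export to", False: "Import from"})
print(labels.tolist())  # ['Export to', 'Import from', 'Export to']

# "c" has no matching key, so it becomes NaN
partial = pd.Series(["a", "b", "c"]).map({"a": 1, "b": 2})
print(partial.isna().tolist())  # [False, False, True]
```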

### DataFrame.map()

Another function that comes in handy is `DataFrame.map()`. It applies a function to **every element** of a DataFrame. In pandas versions before 2.1, this method was called `applymap()`. <br>

<img src = "./image/applymap_example.png" width = 600>
<br>

Let's check out an example: 

In [None]:
df_applymap = ph_flow_short.loc[:,["Physical Flow Value"]].map(int)
df_applymap.head()

As you can see, with `DataFrame.map()` you can e.g. select all the columns with numeric values in your DataFrame and convert their values from float to int. 
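Here is a small sketch of the same idea on toy data (the columns `x` and `y` are hypothetical). It falls back to `applymap()` on older pandas versions, and it also shows that `int()` cuts off the decimals instead of rounding:

```python
import pandas as pd

# Hypothetical toy data, not from the course files
df = pd.DataFrame({"x": [1.9, -2.5], "y": [3.0, 4.7]})

# DataFrame.map exists from pandas 2.1 on; older versions call it applymap
elementwise = df.map if hasattr(df, "map") else df.applymap

# int() is called on every single cell and truncates toward zero
as_int = elementwise(int)
print(as_int.values.tolist())  # [[1, 3], [-2, 4]]

# For a pure type conversion, astype() gives the same values here
as_int_fast = df.astype(int)
print(as_int_fast.values.tolist())  # [[1, 3], [-2, 4]]
```

For a plain change of data type, `astype()` is usually the simpler choice; `DataFrame.map()` is more useful when you need a custom function per cell.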

## Speed up the process - **Vectorization**

If you do simple data manipulation, then `apply()` works just fine. However, it is not the fastest way to access and manipulate your data. You can speed up the process by using vectorization. Many vectorized helper functions come from NumPy, so you will `import numpy` for the examples below. [NumPy](https://numpy.org/) is an open source Python library that’s used frequently in science and engineering. If you work with numeric data, you will definitely learn to love NumPy!

<ins>What are vectorization methods?</ins>
- Applying a manipulation to a whole array aka vector, instead of single values.
- You have been indirectly using vectorization when you used e.g. `groupby()`!

<ins>Why should we use it?</ins>
- You can use it to avoid looping row by row over the data set and hence save a lot of time if the data set is huge!
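To make the difference concrete, the sketch below computes the same result twice on made-up values (the column name `flow` is hypothetical): once row by row with `apply()`, and once with a single vectorized comparison on the whole column. It does not measure time, it only shows that both approaches give the same answer:

```python
import pandas as pd

# Hypothetical toy data, not from the course files
df = pd.DataFrame({"flow": [150.0, -320.0, 0.0, 410.5]})

# Row by row: Python calls the lambda once per row
looped = df.apply(lambda row: row["flow"] >= 0, axis=1)

# Vectorized: one comparison on the whole column at once
vectorized = df["flow"] >= 0

print(looped.tolist())      # [True, False, True, True]
print(vectorized.tolist())  # [True, False, True, True]
```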


Now imagine, you would like to add a new column to your dataframe called `new column`. This new column simply combines the two existing columns `BE physical energy flow` and `Control area`:

In [None]:
ph_flow_short.head()

In [None]:
ph_flow_short.loc[:,"new column"] = ph_flow_short["BE physical energy flow"] + " " + ph_flow_short["Control area"]
ph_flow_short

So let's turn one of the previous functions into a vectorized version. <br>

The first guess would be to pass the whole vector aka Series (instead of single rows as before) into the to-be-applied function and then make the calculation. But...

In [None]:
def flow_check(series):
    print(series)
    if series >200:
        return ">200 MW"
    else:
        return "<200 MW"

ph_flow_short.loc[:,"Flow Size"] = flow_check(ph_flow_short["Physical Flow Value"])

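**This throws us a ValueError** because Python does not know how to tell if a whole column is greater than 200. The comparison returns one `True` or `False` per row, but an `if` statement needs a single answer. This is where NumPy comes in handy.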
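The following sketch reproduces the problem on two made-up values, so you can see where the error comes from. The comparison itself works fine and returns a boolean Series; it is the `if` that fails:

```python
import pandas as pd

# Hypothetical toy values, not from the course files
flows = pd.Series([150.0, 320.0])

# Element-wise comparison: one True/False per row
mask = flows > 200
print(mask.tolist())  # [False, True]

raised = False
try:
    if mask:  # `if` needs a single True/False, not a whole Series
        pass
except ValueError as err:
    raised = True
    print(type(err).__name__)  # ValueError
```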

## np.where()
<br>

First, let's check out `np.where()`. This function works like the `IF` function in Excel.

- Syntax: 
           np.where(
                conditional statement -> bool array,
                series/array/function()/scalar if True,
                series/array/function()/scalar if False
           )

Let's try it out: 

In [None]:
import numpy as np

ph_flow_short.loc[:, "Flow Size by vec"] = np.where(
    ph_flow_short["Physical Flow Value"]>200, # <-- condition
    ">200 MW", # <-- return if true
    "<200 MW" #<-- return if false
    )

ph_flow_short.head(3)
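The outcomes do not have to be fixed strings. As the syntax above suggests, they can also be whole Series. Here is a minimal sketch on made-up values (the column names are hypothetical) that uses both variants:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the course files
df = pd.DataFrame({"flow": [150.0, -320.0, 0.0]})

# Scalars as outcomes
df["direction"] = np.where(df["flow"] >= 0, "Export", "Import")

# Series as outcomes: flip the sign of negative values only
df["abs_flow"] = np.where(df["flow"] < 0, -df["flow"], df["flow"])

print(df["direction"].tolist())  # ['Export', 'Import', 'Export']
print(df["abs_flow"].tolist())   # [150.0, 320.0, 0.0]
```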

## np.select()
<br>

If you have multiple conditions, you can simply use `np.select()`.
- Syntax:
```
conditions = [
    condition1,
    condition2,
    # etc.
]

choices = [
    value1,
    value2,
    # etc.
]

df["new column"] = np.select(conditions, choices, default="NA")
```

Let's try out another example in regard to the size of the physical flow: 

In [None]:
conditions = [
    ph_flow_short["Physical Flow Value"] == 0, # first condition to test: if true return choice1, if false check next condition
    ph_flow_short["Physical Flow Value"] < 200, # second condition to test: if true return choice2, if false check next condition
    ph_flow_short["Physical Flow Value"] < 1000 # third condition to test: if true return choice3, if false default value is returned
]

choices = [
    "None",  # choice1
    "Small", # choice2
    "Big",   # choice3
]

ph_flow_short.loc[:,"Flow_grouped"] = np.select(conditions, choices, default="Large") #  default value is the value if none of the conditions are true
print(ph_flow_short.Flow_grouped.value_counts())
ph_flow_short.tail(10)
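Because the first matching condition wins, the order of the conditions matters. The sketch below uses made-up flow values to show one consequence: a strongly negative value (an import) satisfies `< 200` and is therefore labelled "Small". Whether that is what you want depends on your analysis; comparing absolute values is one possible alternative, shown here only as an assumption about what "size" should mean:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not from the course files
flows = pd.Series([0.0, 150.0, 600.0, 1500.0, -900.0])
choices = ["None", "Small", "Big"]

# Same condition order as in the example above
conditions = [flows == 0, flows < 200, flows < 1000]
grouped = np.select(conditions, choices, default="Large")
print(grouped.tolist())
# ['None', 'Small', 'Big', 'Large', 'Small']  <- -900 counts as "Small"

# Alternative: compare absolute values to group by magnitude
size = flows.abs()
by_size = np.select([size == 0, size < 200, size < 1000], choices, default="Large")
print(by_size.tolist())
# ['None', 'Small', 'Big', 'Large', 'Big']
```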

<br>

## Recap, Tips & Takeaways &#128161;

<br>

<div class="alert alert-block alert-success">

**Let's recap what you have learned in this section:**

- You can use `apply()` to access and manipulate Series and DataFrames
- In a DataFrame, you can loop through rows and columns, you just need to specify the axis: 
    - axis = 0 for column 
    - axis = 1 for row
- You can speed up the process by using vectorization
    - For that you need to `import numpy as np`
    - `np.where()` works like the `IF` function in Excel
    - `np.select()` can be used to define multiple if statements with its conditions and choices 
 
</div>