<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Initial Data Exploration 
<br> 

In this section, you will learn how to **select** and **slice** data. Anyone who has worked with data knows how important it is to be able to pick out particular columns or rows. In data science, slicing means selecting a specific subset of the data by row, by column, or both.
<br/> 

Let's get started by importing Pandas and reading another data set, the *photovoltaic power production estimation and forecast on the Belgian grid*, from a `csv` file into a Pandas DataFrame:

In [None]:
import pandas as pd

pv_power = pd.read_csv("./data/energy/ods032.csv", sep = ';')
pv_power

## Selecting by column name
There are actually two ways to select data by column name. <br>

1. with square brackets and the column name in `''` quotes

In [None]:
pv_power['Region']

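2. with a `.` (dot) as separator (**NOTE - this ONLY works when the column name is a valid Python identifier, e.g. has no spaces, and does not clash with an existing DataFrame attribute or method!**)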

In [None]:
pv_power.Region
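
To make the difference between the two notations concrete, here is a minimal sketch. It uses a small hypothetical DataFrame called `toy` (made-up values, not the real contents of `ods032.csv`), and the `count` column is an assumption added only to illustrate a name clash.

```python
import pandas as pd

# Hypothetical toy data, NOT the real ods032.csv contents
toy = pd.DataFrame({
    "Region": ["Belgium", "Flanders", "Wallonia"],
    "Monitored capacity": [100.0, 60.0, 40.0],
    "count": [1, 2, 3],
})

# For a simple column name, both notations return the same Series
same = toy["Region"].equals(toy.Region)
print(same)

# A column name with a space can only be selected with brackets
capacity = toy["Monitored capacity"]
print(capacity.tolist())

# `count` is also a DataFrame method, so the dot notation returns
# the method instead of the column; brackets still return the column
print(callable(toy.count))
print(toy["count"].tolist())
```

Because the bracket notation works for every column name, it is usually the safer default.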


You want to select two columns and only see the first three rows? No problem! You can chain commands in Pandas. <br> 
Note that if you want to select two or more columns of your DataFrame, you need to pass them as a list: `df_name[["column_1", "column_2"]]`.

In [None]:
pv_power[["Region", "Monitored capacity"]].head(n=3)
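
The following sketch shows what the single and double brackets return. It again assumes a small hypothetical DataFrame `toy` with made-up values; only the column names are borrowed from the examples above.

```python
import pandas as pd

# Hypothetical toy data, NOT the real ods032.csv contents
toy = pd.DataFrame({
    "Datetime": ["2023-01-01 10:00", "2023-01-01 10:15", "2023-01-01 10:30"],
    "Region": ["Belgium", "Flanders", "Wallonia"],
    "Monitored capacity": [100.0, 60.0, 40.0],
})

# A single column name in brackets returns a Series
one_column = toy["Region"]

# A list with one column name returns a DataFrame with one column
one_column_df = toy[["Region"]]

# A list of column names, chained with head(), returns a smaller DataFrame
two_columns = toy[["Region", "Monitored capacity"]].head(n=2)

print(type(one_column).__name__)     # Series
print(type(one_column_df).__name__)  # DataFrame
print(two_columns.shape)             # (2, 2)
```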

### Exercise 📝

1. Display the **column names** of the DataFrame.


In [None]:
# delete this line and replace it with your solution

2. Select the columns `Resolution code` and `Datetime`.
3. Display the **first 5 rows** of the selected columns.

In [None]:
# delete this line and replace it with your solution

## Slicing - Precise Selection by Row and Column
- `df.loc` gets rows (or columns) with particular labels from the index.
- `df.iloc` gets rows (or columns) at particular positions in the index so it only takes **integers**.

<br>

<div class="alert alert-block alert-warning">

&#128161; **When using `loc` and `iloc`, the row selection ALWAYS comes first and the column selection ALWAYS comes second!**
    
</div>
    
<br>

The syntax for both is `[start_row : end_row, start_column : end_column]`. 

If you leave out `start_row` and `end_row` (or `start_column` and `end_column`) and write only a `:`, Pandas selects all rows (or all columns, respectively).
<br>

For example, if you want to select the first 5 rows of `pv_power` with `iloc` then you use the following syntax:

In [None]:
pv_power.iloc[:5,:]

Let's do the same thing with the `loc` syntax instead of `iloc` and check out what will happen: 

In [None]:
pv_power.loc[:5,:]

<br>
Please take a break to spot the differences &#9749;!
<br>

Although the syntax looks almost identical, there is a big difference between these two commands and hence their output: 

- `iloc` works **positionally** with integer positions (which, in our case, happen to match the index labels)
- `loc` searches for **labels**

What does this mean? When you used `iloc` on `pv_power`, it selected the first five rows by position (positions 0 through 4), because the end of a positional slice is excluded, just like in regular Python slicing. With `loc`, you asked for all rows up to and including the **label** 5, because label-based slices include both endpoints. This resulted in 6 rows, from index 0 through index 5!
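
The distinction between positions and labels becomes more visible once the index no longer starts at 0. The sketch below uses a hypothetical DataFrame `toy` with made-up values; `shifted` is an assumed copy whose index labels run from 10 to 17.

```python
import pandas as pd

# Hypothetical toy data with the default index 0..7
toy = pd.DataFrame({"Region": list("ABCDEFGH"), "Value": range(8)})

by_position = toy.iloc[:5, :]  # positions 0..4, end position excluded
by_label = toy.loc[:5, :]      # labels 0..5, end label included
print(len(by_position), len(by_label))  # 5 6

# Same data, but with index labels 10..17 instead of 0..7
shifted = toy.copy()
shifted.index = range(10, 18)

# iloc still counts positions: the first five rows
print(len(shifted.iloc[:5, :]))  # 5

# loc looks for labels: everything up to and including label 12
print(shifted.loc[:12, :].index.tolist())  # [10, 11, 12]
```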

<br>

<div class="alert alert-block alert-warning">

&#128161;
**If you want to select multiple rows/columns that are not consecutive, you need to pass a list with the respective names/indexes.** 
    
**loc:** [["row1", "row5", "row8"], ["column2", "column5", "column6"]]
    
**iloc:** [[1, 5, 8], [2, 5, 6]] 
</div>
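
As a minimal sketch of the list syntax from the box above, the example below selects the same non-consecutive rows and columns once with `loc` and once with `iloc`. The DataFrame `toy` and its values are hypothetical; only the column names resemble the data set used here.

```python
import pandas as pd

# Hypothetical toy data, NOT the real ods032.csv contents
toy = pd.DataFrame({
    "Datetime": ["10:00", "10:15", "10:30", "10:45", "11:00", "11:15"],
    "Region": ["Belgium", "Flanders", "Wallonia", "Belgium", "Flanders", "Wallonia"],
    "Measured": [5.0, 3.0, 2.0, 6.0, 3.5, 2.5],
    "Monitored capacity": [100.0, 60.0, 40.0, 100.0, 60.0, 40.0],
})

# loc: lists of row labels and column names
picked_by_label = toy.loc[[0, 2, 5], ["Datetime", "Monitored capacity"]]

# iloc: lists of row positions and column positions
picked_by_position = toy.iloc[[0, 2, 5], [0, 3]]

# With the default index, labels and positions coincide,
# so both selections contain the same data
print(picked_by_label.equals(picked_by_position))  # True
```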
    
<br>

Let's recap by checking out the following examples:

In [None]:
pv_power.iloc[2:5,:3]

In [None]:
pv_power.loc[2:5, :"Region"]

### Exercise 📝


In [None]:
import pandas as pd

pv_power = pd.read_csv("./data/energy/ods032.csv", sep = ';')
pv_power


1. Select the **first 7 rows**, and **first 2 columns** of `pv_power`, **positionally**.


In [None]:
# delete this line and replace it with your solution

2. Select **rows** `10`, `15`, and `30`, and the **columns** `Resolution code` and `Region`.


In [None]:
# delete this line and replace it with your solution


&#9200; Still time left? Then try this one:

3. Select the **first 5 rows** and **columns** `Datetime` and `Monitored capacity` of the `pv_power` DataFrame. 

In [None]:
# delete this line and replace it with your solution

## Extra resources

To learn more about `iloc` and `loc`, please check out:
- [iloc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)
- [loc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)