<br>

<img src="./image/Logo/logo_elia_group.png" width = 200>

<br>

# Python for Data Wrangling
<br> 

## Intro to Pandas 🐼

Welcome back! In this section you will learn about one of the most important packages for data scientists: [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html). Pandas is a [package](https://data36.com/python-import-built-in-modules-data-science/) in Python used for data formatting, analysis and manipulation. It helps you work with 2-D table-like data structures, which is what business people and other stakeholders are used to seeing.<br> 

Pandas gives you all the tools to:
- **clean up** and organize your data (e.g. resampling, filling missing values)
- "**automate the boring stuff**" and quickly add value to any team
- work with other packages and libraries like [matplotlib](https://matplotlib.org/) or [Scikit-Learn](https://scikit-learn.org/stable/)
- actually make sense of your data set!

The cool thing about Pandas is that it gives you far more flexibility than Excel. It also lets you work with massive data sets without getting lost or struggling to load them. <br>

Sounds good? Let's do this.

<img src="./image/Icons/Icon_2019_WEB_progression.png">

### When to use Pandas

Maybe you already have an idea of when Pandas comes in handy. The following figure shows you the entire data science pipeline. Can you spot Pandas?

<br>

<img src= "./image/ds_path.png">

<br>

[source](https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955)

### Importing Pandas

As mentioned, Pandas gives you a great way to deal with 2-D data structures (like SQL or Excel tables) in Python, which aren't "native" to the language. To add the features of Pandas to Python, you need to install it by running `pip install pandas`. [Pip](https://packaging.python.org/en/latest/tutorials/installing-packages/#use-pip-for-installing) is the recommended package installer for adding new libraries to your working environment.

After installation you can `import` the package just like the [ones that come pre-installed with Python](https://docs.python.org/3.11/py-modindex.html). By `import`-ing the package, you allow the current `*.py` or `*.ipynb` file you're working on to use the functionality of the package. By convention, pandas is imported under the short alias `pd`. <br>

In [None]:
import pandas as pd 

In [None]:
# check which version of pandas is installed
pd.__version__

### Additional Data Types in Pandas

Let's recap the datatypes you are already familiar with:

In [None]:
float(16)

In [None]:
int(16)

In [None]:
str(16)

In [None]:
bool(16)

In [None]:
list([1,2,3])

When working with Pandas, you will make use of additional datatypes: 

- **pd.Series**:
    - 1-D data structure; single columns.

- **pd.DataFrame**:
    - 2-D data structure (better known as a table with *named columns* & *numbered rows*).
    
Pandas adds these two datatypes to the ones that are native to Python. <br>
In this training, you will mainly **focus on pd.DataFrame**, since DataFrames are central to data science: most analyses and models need several columns (features) at once, not just a single array of data.
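
To make the relationship between the two datatypes concrete, here is a minimal sketch. The variable names and values are made up for illustration. It shows that selecting a single column of a DataFrame gives you a Series:

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration
table = pd.DataFrame({"month": ["Jan", "Feb"], "load": [1182.06, 1067.81]})

# Selecting one column of the 2-D table returns a 1-D Series
load_column = table["load"]

print(type(table).__name__)          # DataFrame
print(type(load_column).__name__)    # Series
print(table.ndim, load_column.ndim)  # 2 1
```

In other words, you can think of a DataFrame as a collection of Series that share the same index.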

### Series

By definition, [Pandas Series](https://pandas.pydata.org/docs/reference/series.html#series) are one-dimensional data structures which can hold data such as strings, integers and other [Python objects](https://docs.python.org/3/reference/datamodel.html), just like a column in a table. In Python, a Pandas Series can be easily created using `pandas.Series()`. 
<br>

Take a look at the following example: 

In [None]:
import pandas as pd

In [None]:
grid_load = [1182.06, 1067.81, 7538.02]

grid_load_series = pd.Series(grid_load)

In [None]:
print(grid_load_series)

In comparison, this is what a list looks like: 

In [None]:
print(grid_load)

You can access each item of a Series by using its index:

In [None]:
grid_load_series[0]

Or you can even define your own index labels: 

In [None]:
grid_load_index = pd.Series(grid_load, index = ["Jan", "Feb", "March"])

In [None]:
grid_load_index
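
Once a Series has custom labels, you can look up values either by label or by position. The following sketch reuses the values from above. `.loc` and `.iloc` are the standard pandas accessors for label-based and position-based lookup:

```python
import pandas as pd

grid_load = [1182.06, 1067.81, 7538.02]
grid_load_index = pd.Series(grid_load, index=["Jan", "Feb", "March"])

feb_by_label = grid_load_index.loc["Feb"]  # look up by index label
feb_by_position = grid_load_index.iloc[1]  # look up by position (0-based)

print(feb_by_label, feb_by_position)
```

Both lookups return the same value here, because "Feb" is the second entry of the Series.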

### DataFrame

So, the first question you probably ask yourself is "What is a DataFrame made of?". A DataFrame is a 2-D data structure (tabular data) and therefore consists of rows and columns.
<br/> 

Let's create a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/frame.html#dataframe) from a list of lists with `pandas.DataFrame()` and check its components:

In [None]:
list_of_lists = [["Wallonia", 4339.61],["Flanders",1536.84],["Flemish-Brabant",514.58]]
list_of_lists

In [None]:
df = pd.DataFrame(list_of_lists,columns = ["Region", "Monitored_Capacity"])
df

In the same way, you can create a DataFrame from a list of dictionaries or even from a dictionary of lists! 

*Note that you can omit the `columns` parameter here because the dictionary keys are used as the column labels of the resulting frames.*

In [None]:
list_of_dicts = [{"Region":"Wallonia", "Monitored_Capacity":4339.61}, {"Region":"Flanders", "Monitored_Capacity":1536.84}, {"Region":"Flemish-Brabant", "Monitored_Capacity":514.58}]
df_lodicts = pd.DataFrame(list_of_dicts)
df_lodicts

In [None]:
dict_of_lists = {"Region":["Wallonia", "Flanders", "Flemish-Brabant"], "Monitored_Capacity":[4339.61, 1536.84, 514.58]}
df_dictol = pd.DataFrame(dict_of_lists)
df_dictol
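
As a quick sanity check, the sketch below rebuilds the three frames from this section and compares them with `DataFrame.equals`, which tests whether two frames have the same shape, labels and values:

```python
import pandas as pd

regions = ["Wallonia", "Flanders", "Flemish-Brabant"]
capacities = [4339.61, 1536.84, 514.58]

# 1) list of lists: column names must be passed explicitly
df = pd.DataFrame(
    [[r, c] for r, c in zip(regions, capacities)],
    columns=["Region", "Monitored_Capacity"],
)

# 2) list of dictionaries: the keys become the column names
df_lodicts = pd.DataFrame(
    [{"Region": r, "Monitored_Capacity": c} for r, c in zip(regions, capacities)]
)

# 3) dictionary of lists: again, the keys become the column names
df_dictol = pd.DataFrame({"Region": regions, "Monitored_Capacity": capacities})

same_as_lodicts = df.equals(df_lodicts)
same_as_dictol = df.equals(df_dictol)
print(same_as_lodicts, same_as_dictol)
```

Which constructor input you choose mostly depends on the shape your raw data already has.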

As you can see, with `pd.DataFrame()` you can easily create frames from a list of lists or dictionaries. Now, let's have a look at the different components of such a DataFrame:

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.values[0]
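
The components above can also be turned into plain Python objects, which makes them easier to inspect. This is a minimal sketch using the same frame. Note that `df.iloc[0]` is used here as an alternative way to get the first row, as a labelled Series instead of a bare array:

```python
import pandas as pd

df = pd.DataFrame(
    [["Wallonia", 4339.61], ["Flanders", 1536.84], ["Flemish-Brabant", 514.58]],
    columns=["Region", "Monitored_Capacity"],
)

column_labels = list(df.columns)     # names of the columns
row_labels = list(df.index)          # default row labels: 0, 1, 2
first_row = df.iloc[0]               # first row as a Series, labelled by column name
capacity = df["Monitored_Capacity"]  # one column as a Series

print(column_labels)
print(row_labels)
print(first_row["Region"])
```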

## Pandas for Data Understanding and Analysis

### Importing a DataFrame

Let's have a look at a much bigger data set. You will use the original data set of *photovoltaic power production estimation and forecast on the Belgian grid*. This data set is originally from the [Elia Open Source API](https://opendata.elia.be/explore/dataset/ods087/information/). But no worries, the data set is already downloaded into your [data/energy](data/energy) folder for your convenience.
<br>

In [None]:
import pandas as pd

1. Use the `read_csv` function to load your data from a `csv` file into a DataFrame:

In [None]:
# since the csv is separated by a ';' delimiter this has to be clarified in the read_csv function
pv_power = pd.read_csv("./data/energy/ods087.csv", sep = ';') 
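
To see why the `sep` argument matters, here is a small self-contained sketch. It reads a tiny in-memory sample instead of the real file. The column names and values are hypothetical and are not taken from `ods087.csv`:

```python
import io

import pandas as pd

# Hypothetical two-row sample; column names and values are made up for illustration
sample_csv = "timestamp;measured_mw\n2023-01-01 12:00;105.2\n2023-01-01 12:15;110.8\n"

# With the correct delimiter, pandas splits each line into two columns
with_sep = pd.read_csv(io.StringIO(sample_csv), sep=";")

# With the default delimiter (a comma), each line ends up in one single column
without_sep = pd.read_csv(io.StringIO(sample_csv))

print(with_sep.shape)     # (2, 2)
print(without_sep.shape)  # (2, 1)
```

If a freshly loaded DataFrame has only one column, a wrong delimiter is a likely cause.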

2. Inspect `pv_power` to get a first overview of your data. There are several ways to do this. `<dataframe_name>.head()` shows the first n rows (5 by default) in the order in which they were imported. This matters because the data is not sorted on import, so the first row you see may not be the "actual" first row, or the first row you expect to see. Please keep that in mind!

In [None]:
pv_power.head()

- 2.a) If you, for instance, would like to see the first 10 rows, you need to set n to 10:

In [None]:
pv_power.head(n=10)

3. You can also look at the **last** n rows with `<dataframe_name>.tail()`, which shows the last n rows in the order in which they were imported (limited to three in this example):

In [None]:
pv_power.tail(n=3)
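
The point about row order can be illustrated with a tiny, deliberately unsorted frame. The values below are made up. `head()` and `tail()` return rows in stored order, so you need to sort explicitly (here with `sort_values`) if you want the smallest values first:

```python
import pandas as pd

# Hypothetical, deliberately unsorted values
readings = pd.DataFrame({"value": [30, 10, 50, 20, 40]})

first_two = readings.head(n=2)  # first two rows in stored order
last_two = readings.tail(n=2)   # last two rows in stored order
smallest_two = readings.sort_values("value").head(n=2)  # sort first, then take the top

print(first_two["value"].tolist())     # [30, 10]
print(last_two["value"].tolist())      # [20, 40]
print(smallest_two["value"].tolist())  # [10, 20]
```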

4. If you would like to know **how many rows or columns** your data set has in total, you can apply `<dataframe_name>.shape`:

In [None]:
# The first entry stands for the number of rows, the second entry for the number of columns
pv_power.shape 

In [None]:
rows = pv_power.shape[0] 
rows

5. You can also quickly look at the data type of each column:
- each column has a single data type
- Pandas infers the data types of the columns for you while loading the data

In [None]:
pv_power.dtypes
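
As an illustration of how Pandas infers data types, the sketch below builds a small frame with a text column, a decimal column and a whole-number column. The `sites` column and its values are hypothetical. The helper functions from `pd.api.types` check the inferred types:

```python
import pandas as pd

mixed = pd.DataFrame(
    {
        "region": ["Wallonia", "Flanders"],
        "capacity": [4339.61, 1536.84],
        "sites": [12, 7],  # hypothetical counts, for illustration only
    }
)

# One dtype per column; text columns typically show up as `object`
# (or as a dedicated string dtype, depending on your pandas version)
print(mixed.dtypes)

capacity_is_float = pd.api.types.is_float_dtype(mixed["capacity"])
sites_is_integer = pd.api.types.is_integer_dtype(mixed["sites"])

# A column can be converted explicitly with astype
sites_as_float = mixed["sites"].astype(float)
```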

6. Last but not least, get basic statistics about your DataFrame! `<dataframe_name>.describe()` is a pretty useful method which generates summary statistics of the numeric columns:

In [None]:
pv_power.describe()
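
The result of `describe()` is itself a DataFrame, so you can pick out individual statistics from it. Here is a minimal sketch on the small regional frame from earlier in this notebook:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["Wallonia", "Flanders", "Flemish-Brabant"],
        "Monitored_Capacity": [4339.61, 1536.84, 514.58],
    }
)

summary = df.describe()  # by default, only numeric columns are summarized
print(summary)

# Rows of the summary are labelled by statistic, columns by the original column name
row_count = summary.loc["count", "Monitored_Capacity"]
median = summary.loc["50%", "Monitored_Capacity"]
maximum = summary.loc["max", "Monitored_Capacity"]
```

The text column `Region` does not appear in the summary, since the default statistics only make sense for numbers.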

### Exercise
<br> 

Alright, let's practise! &#128512;

In [None]:
import pandas as pd

pv_power = pd.read_csv("./data/energy/ods087.csv", sep = ';')

1. Get the last **25 rows** of the DataFrame `pv_power`:

In [None]:
# replace this line with your solution

2. Get the **number of columns** in the DataFrame `pv_power`, save this as a new variable called `columns` and then print the result:

In [None]:
# replace this line with your solution

## Extra resources
<br/>

To know more about Pandas Series and DataFrame, please visit:
- [Pandas Series vs. DataFrame](https://composingprograms.com/pages/14-designing-functions.html)