# Session 6: Use Pandas to index, split, apply, and combine data
## Assignment 2

## [EAA - ARC Python Primer for Accounting Research](https://martien.netlify.app/book/example/)

#### EBA data and the evolution of [Fair-Value-Hierarchy](https://ifrscommunity.com/knowledge-base/fair-value-hierarchy/) data.

---

This assignment requires you to examine the evolution of Level 1, 2, and 3 assets of European banks.

The examination requires you to download and munge data from the EBA Risk Dashboard, which is part of the regular risk assessment conducted by the EBA and complements the Risk Assessment Report. 

The EBA Risk Dashboard summarizes the main risks and vulnerabilities in the banking sector in the European Union (EU) by looking at the evolution of Risk Indicators (RI) among a sample of banks across the EU.

The [EBA Risk Dashboard pdf](https://www.eba.europa.eu/sites/default/documents/files/document_library/Risk%20Analysis%20and%20Data/Risk%20dashboard/Q3%202021/1025829/EBA%20Dashboard%20-%20Q3%202021%20v2.pdf?retry=1) has lots of tables, but for research purposes it is better to get the data in machine-readable form.

Luckily, the EBA thought about us. Under the name [the interactive tool](https://www.eba.europa.eu/sites/default/documents/files/document_library/Risk%20Analysis%20and%20Data/Risk%20dashboard/Q3%202021/1025834/EBA%20Interactive%20Dashboard%20-%20Q3%202021%20-%20Protected.xlsm), they offer an Excel file with a treasure trove of data.

This time we need the data from the `Statistical annex` of the EBA Risk Dashboard.

---

**Required (1)**: From the [EBA Risk Dashboard website](https://www.eba.europa.eu/risk-analysis-and-data/risk-dashboard), **download the [interactive tool](https://www.eba.europa.eu/sites/default/documents/files/document_library/Risk%20Analysis%20and%20Data/Risk%20dashboard/Q3%202021/1025834/EBA%20Interactive%20Dashboard%20-%20Q3%202021%20-%20Protected.xlsm)**: `EBA Interactive Dashboard - Q3 2021 - Protected.xlsm`.


**Save** the file to a folder on your drive, e.g. `D:/users/my_user_name_here/EAA_python/data/`. See this [link](https://www.youtube.com/watch?v=hUW5MEKDtMM) and this [link](https://www.youtube.com/watch?v=7ABkcHLdG_A) for explanations of folders and directories.

**Open the file using Excel**, to quickly get an overview of the data, specifically the data in the statistical annex. See sheets `Annex database`, `Data Annex`, and `Mapping`.

In [None]:
# The usual preamble
import pandas as pd
import numpy as np
import os

if os.name=='nt':  # for Windows users
    os.chdir('D:/users/my_user_name_here/EAA_python/data/')  # note the forward slashes, change 'my_user_name_here' to your user name
else:
    os.chdir('/home/my_user_name_here/EAA_python/data/')  # For Linux or Mac

In [None]:
# Set the file name:
fn = 'EBA Interactive Dashboard - Q3 2021 - Protected.xlsm'

---

**Required (2)**: We know the data definitions are hard to understand. Therefore, create a data frame with definitions from the sheet `Mapping` in the Excel file.

The data frame should have the column names `['Label', 'Item']`; `Label` should become the index.

Use the `clean_text` function to eliminate line breaks from the `Item` variable.

In [None]:
def clean_text(s):
    return s.replace('\n', ' ').strip()  # Get rid of line breaks and trim leading and trailing spaces.

def annex_data_definitions(fn, sn):
    #... read_excel here
    #... name df.columns
    #... clean df['Item']
    #... set_index here
    return df

df_defs = annex_data_definitions(fn, 'Mapping')
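
To see what the cleaning and indexing steps do before applying them to the Excel sheet, here is a minimal sketch on a toy frame. The labels `X01` and `X02` and the item texts are invented for illustration and do not come from the `Mapping` sheet; the names `toy` and `toy_defs` are hypothetical as well.

```python
import pandas as pd

def clean_text(s):
    # Replace line breaks with spaces and trim surrounding whitespace
    return s.replace('\n', ' ').strip()

# Toy stand-in for a definitions table; labels and texts are made up
toy = pd.DataFrame({
    'Label': ['X01', 'X02'],
    'Item': ['Total\nassets ', ' Return on\nequity'],
})

toy['Item'] = toy['Item'].apply(clean_text)  # clean every cell of the column
toy_defs = toy.set_index('Label')            # 'Label' becomes the index

print(toy_defs.loc['X01', 'Item'])
```

Applying the function with `.apply` keeps `clean_text` a plain string function, so it can be tested on single strings first.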

**For Jupyter notebook users only**: To show all data frame rows in your notebook, use the following setting ([from Stack Overflow](https://stackoverflow.com/questions/47022070/display-all-dataframe-columns-in-a-jupyter-python-notebook)).

In [None]:
pd.options.display.max_rows = 200

In [None]:
df_defs

---

**Required (3a)**: Create a data frame from the sheet: `Data Annex`, columns `L:M,O:AQ`. Name that data frame `df`.

- While you are at it, include the parameter setting `na_values = 'n.a.'` in your `pd.read_excel` statement. This converts all `n.a.` cells into properly coded missing values.
- Using the Pandas [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) function, change the column names as follows: `lbl` to `Label` and `NSA` to `Country`.
- Eliminate the rows where `Country` is `EU`.
- Set the `Label` and `Country` columns as the index.
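
The steps listed above can be tried out on a small stand-in before touching the Excel file. The sketch below reads made-up CSV text instead of the `Data Annex` sheet, on the assumption that `na_values` behaves the same way in `pd.read_csv` as in `pd.read_excel`. The labels, dates, and numbers are invented, and the name `toy` is hypothetical.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the Excel sheet; all values are invented
raw = io.StringIO(
    "lbl,NSA,202106,202109\n"
    "X01,AT,1.5,n.a.\n"
    "X01,EU,2.0,2.1\n"
    "X02,AT,n.a.,0.3\n"
)

# Cells containing 'n.a.' are read as missing values (NaN)
toy = pd.read_csv(raw, na_values='n.a.')

# Rename the columns, drop the EU aggregate, and set the two-level index
toy = toy.rename(columns={'lbl': 'Label', 'NSA': 'Country'})
toy = toy[toy['Country'] != 'EU']
toy = toy.set_index(['Label', 'Country'])

print(toy)
```

Because the `n.a.` cells become NaN on reading, the date columns end up numeric rather than as text.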



In [None]:
df

---

**Required (3b)**: Use `melt` to create a long data frame from `df`. Name that data frame `dfm`.

- `id_vars` should be `['Label', 'Country']`.
- `value_vars` should be the list of column names of `df`.
- `var_name` should be `Date`.
- `value_name` should be `value`.

In the next steps:
- use `Label` and `Country` as index
- use `dropna()` to eliminate missing `value` observations from `dfm`.

Note: to make it work, apply `melt` to the reset data frame: `df.reset_index()`.
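
The wide-to-long reshaping described above can be sketched on a toy frame. The frame `toy`, its labels, and its values are invented for illustration. The sketch passes `subset=['value']` to `dropna()` so that only rows with a missing `value` are dropped, which is one possible reading of the instruction.

```python
import pandas as pd

# Toy wide frame: one column per date, indexed by (Label, Country); values are made up
toy = pd.DataFrame(
    {'202106': [1.5, None], '202109': [1.7, 0.3]},
    index=pd.MultiIndex.from_tuples(
        [('X01', 'AT'), ('X02', 'AT')], names=['Label', 'Country']
    ),
)

# melt needs the identifiers as ordinary columns, hence reset_index()
toy_long = toy.reset_index().melt(
    id_vars=['Label', 'Country'],
    value_vars=list(toy.columns),
    var_name='Date',
    value_name='value',
)

toy_long = toy_long.dropna(subset=['value'])         # drop rows without a value
toy_long = toy_long.set_index(['Label', 'Country'])  # restore the two-level index

print(toy_long)
```

Each date column of the wide frame becomes a set of rows in the long frame, with the former column name stored in `Date`.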

In [None]:
# Melt


In [None]:
dfm.head(3) 

In [None]:
# Dropna


In [None]:
dfm.tail(3) 

In [None]:
# Set index


In [None]:
dfm.head(3) 

In [None]:
dfm.tail(3) 

In [None]:
# Using dfm, check Return on Equity ('T28_1') for German banks (DE) - ignore the error warning
print(df_defs.loc[...])
dfm.loc[('.....', '..')]['value'].plot()
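
Selecting rows from a two-level index with a tuple, as in the cell above, can be sketched on toy data. The labels and values are invented and the names `toy_long` and `sel` are hypothetical. The sketch sorts the index first and names the column inside `.loc`, which is a design choice of this example and not something the assignment requires.

```python
import pandas as pd

# Toy long frame with several dates per (Label, Country) pair; values are made up
toy_long = pd.DataFrame({
    'Label':   ['X01', 'X02', 'X01', 'X02'],
    'Country': ['AT', 'AT', 'AT', 'AT'],
    'Date':    ['202106', '202106', '202109', '202109'],
    'value':   [1.5, 0.2, 1.7, 0.3],
}).set_index(['Label', 'Country'])

toy_long = toy_long.sort_index()  # a sorted MultiIndex allows efficient lookups

# The tuple addresses both index levels; 'value' picks the column
sel = toy_long.loc[('X01', 'AT'), 'value']
print(sel)
```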

---

**Required (3c)**: Using `pd.pivot_table`, create a pivoted data frame (`dfp`) from `dfm`.

- `values` should be `value`,
- `index` should be `Country` and `Date`,
- `columns` should be `Label`.



In [None]:
# Pivot table:
...
dfp

Check total assets for Dutch banks (`NL`).

In [None]:
dfp['T02_1'].loc['NL'].plot(kind='bar')

**Required (3d)**: Check Level 3 assets (`T14_3`) for Greek banks (`GR`). Note that entries with a value of zero should be removed first. Do so by using `.replace(0, np.nan).dropna()`.

In [None]:

dfp['.....'].loc['..']. ... .plot(kind='bar')
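
The pivot and zero-filtering steps above can be sketched on toy data. The labels, dates, and values are invented and the names `toy_long`, `toy_wide`, and `nonzero` are hypothetical. Note that `pd.pivot_table` aggregates with the mean by default, which changes nothing here because each cell holds a single observation.

```python
import numpy as np
import pandas as pd

# Toy long frame: one row per (Label, Country, Date); values are made up
toy_long = pd.DataFrame({
    'Label':   ['X01', 'X02', 'X01', 'X02'],
    'Country': ['AT', 'AT', 'AT', 'AT'],
    'Date':    ['202106', '202106', '202109', '202109'],
    'value':   [1.5, 0.0, 1.7, 0.3],
})

# Long to wide: one column per Label, rows indexed by (Country, Date)
toy_wide = pd.pivot_table(
    toy_long, values='value', index=['Country', 'Date'], columns='Label'
)

# Pick one variable and one country, then drop the zero entries
nonzero = toy_wide['X02'].loc['AT'].replace(0, np.nan).dropna()
print(nonzero)
```

Replacing zeros with NaN first lets `dropna()` remove them in the same way as genuinely missing observations.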

---

In [None]:
# Variables of interest:

print(df_defs.loc['T14_1'])
print(df_defs.loc['T14_2'])
print(df_defs.loc['T14_3'])

---

**Required (4a)**: Group the variables of the fair value hierarchy (`T14_1`, `T14_2`, `T14_3`) in `dfp` by `Country` and plot the means of these variables using a bar plot with `figsize=(18,6)`.

In [None]:
dfp[['T14_1','T14_2','T14_3']].groupby('Country').mean().plot(kind='bar', figsize=(18,6))

---

**Required (4b)**: Create a `groupby` object and use it to plot the means of the variables (`T14_1`, `T14_2`, `T14_3`) by `Date` using a bar-plot.

Follow these steps:

- first create the `groupby` object named `df_gp_date`,
- using `df_gp_date`, check the mean values of the three variables (`T14_1`, `T14_2`, `T14_3`),
- compute the same means again and assign the resulting frame to a new data frame named `df_levels`,
- replace all zero values in `df_levels` with NaNs,
- eliminate empty rows from the frame using `dropna()`,
- plot the resulting data frame.
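
The sequence of steps above can be sketched on a toy frame with the same index layout as `dfp`. The column names `X01` and `X02`, the values, and the names `toy_wide`, `gp_date`, `means`, and `toy_levels` are all invented for illustration. Plotting is left out to keep the sketch free of a Matplotlib dependency.

```python
import numpy as np
import pandas as pd

# Toy wide frame indexed by (Country, Date); values are made up
idx = pd.MultiIndex.from_product(
    [['AT', 'BE'], ['202106', '202109']], names=['Country', 'Date']
)
toy_wide = pd.DataFrame(
    {'X01': [0.0, 2.0, 0.0, 4.0], 'X02': [0.0, 1.0, 0.0, 3.0]}, index=idx
)

# Group by the 'Date' index level; the groupby object can be reused
gp_date = toy_wide.groupby('Date')

# Mean per date for the selected columns
means = gp_date[['X01', 'X02']].mean()

# Treat zeros as missing and drop the rows that contain them
toy_levels = means.replace(0, np.nan).dropna()
print(toy_levels)
```

Keeping the `groupby` object in its own variable makes it possible to inspect the means first and reuse the same grouping for the final frame.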

In [None]:
# Create the groupby object 


In [None]:
# Check its means for df_gp_date[['T14_1','T14_2','T14_3']]


In [None]:
# Create df_levels from the groupby object 


In [None]:
# plot the non-zero rows
