# Empirical Project 10

---
**Download the code**

To download the code used in this project as a notebook that can be run in Visual Studio Code, Google Colab, or Jupyter Notebook, right click [here]() and select 'Save Link As', then save it as a `.ipynb` file.

Don’t forget to also download the data into your working directory by following the steps in this project.

---

## Getting started in Python

For this project, you will need the following packages:

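- **pandas**
- **matplotlib**
- **numpy**
- **seaborn**
- **openpyxl** (used by **pandas** to read `.xlsx` files)
- **tabulate** (used by the **pandas** `to_markdown` method)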

You'll also be using the **warnings** and **pathlib** packages, but these come built-in with Python.

Remember, you can install packages in Visual Studio Code's integrated terminal (click "View > Terminal") by running `conda install packagename` (if using the Anaconda distribution of Python) or `pip install packagename` if not.

Once you have the Python packages installed, you will need to import them into your Python session—and configure any other initial settings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings
import matplotlib_inline.backend_inline

# Set the plot style for prettier charts:
plt.style.use("plot_style.txt")
# Make output charts in 'svg' format
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## Python Walkthrough 10.1

**Importing an Excel spreadsheet into Python**

Before loading an Excel spreadsheet into Python, it can be helpful to open it in Excel to understand the structure of the spreadsheet and the data it contains. In this case we can see that detailed descriptions of all variables are in the first tab ('Definitions and Sources'). Make sure to read the definitions for the indicators listed in Figure 10.1.

The spreadsheet contains a number of other worksheets, but the data that we need is in the tab called 'Data – June 2016'. You can see that the variable names are all in the first row and missing values are simply empty cells. We can therefore proceed to import the data into Python using the `pd.read_excel` function without any additional options.

We're going to assume here that you've downloaded the Excel file as "GlobalFinancialDevelopmentDatabaseJune2017.xlsx", and saved it in a subfolder of your working directory called "data".

In [None]:
gfdd = pd.read_excel(
    Path("data/GlobalFinancialDevelopmentDatabaseJune2017.xlsx"),
    sheet_name="Data - June 2016",
)
gfdd.head()
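
Once the data is loaded, a few quick checks can confirm that the import behaved as expected, for example that empty cells were read as missing values. The sketch below uses a small made-up dataframe (`toy`, with invented values) as a stand-in for `gfdd`; the same calls should work on `gfdd` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for gfdd: the values are invented for illustration
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "Year": [2000, 2001, 2000, 2001],
        "GFDD.DI.01": [10.0, np.nan, 55.0, 60.0],
    }
)

# Number of rows and columns
print(toy.shape)
# Data type of each column (numeric indicators should not show up as 'object')
print(toy.dtypes)
# Share of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)
```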

## Python Walkthrough 10.2

**Making box and whisker plots**

Box and whisker plots were introduced in Empirical Project 6. We can use the same process here, after ensuring that the data is in the correct format.

Some plotting libraries expect data like ours to be in ‘long’ (also known as tidy) format, where each row holds a single value for one country, year, and indicator. Our data is in ‘wide’ format: each row holds one country and year, with a separate column for every indicator. We transform the data from wide to long format using the `melt` method.

So for the Depth indicators:

In [None]:
# for convenience, create a list of the indicators we're interested in:
indicators = ["private_credit", "bank_assets"]

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(
    columns={"GFDD.DI.01": indicators[0], "GFDD.DI.02": indicators[1]}
)

# create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators, var_name="indicator"
).dropna()
gfdd_long
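
To see what `melt` does on a small scale, here is a minimal sketch using a made-up dataframe (`wide`, with invented values) rather than the real data. Each indicator column is stacked into a pair of columns: one holding the indicator name and one holding the value.

```python
import pandas as pd

# Hypothetical wide data: one row per country and year, one column per indicator
wide = pd.DataFrame(
    {
        "Country": ["A", "A", "B"],
        "Year": [2000, 2001, 2000],
        "private_credit": [10.0, 12.0, 50.0],
        "bank_assets": [15.0, None, 60.0],
    }
)

# melt stacks the indicator columns into (indicator, value) pairs
long_df = wide.melt(
    id_vars=["Country", "Year"],
    value_vars=["private_credit", "bank_assets"],
    var_name="indicator",
)
print(long_df)  # one row per country, year, and indicator

# dropna removes the row where bank_assets was missing
print(long_df.dropna())
```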

Now we can plot it, which we'll do using the **seaborn** package:

In [None]:
import seaborn as sns

sns.boxplot(data=gfdd_long, x="indicator", y="value");

**Figure 10.2 Box and whisker plot for ‘Private credit by deposit money banks to GDP (%)’ (`private_credit`) and ‘Deposit money banks’ assets to GDP (%)’ (`bank_assets`).**
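
If you want the numbers behind a box and whisker plot, you can compute them directly. The sketch below uses made-up long-format data (`toy_long`, with invented values) and assumes the plot's default whisker rule, under which points more than 1.5 times the interquartile range beyond the quartiles are drawn as individual outliers.

```python
import pandas as pd

# Hypothetical long-format data with invented values
toy_long = pd.DataFrame(
    {
        "indicator": ["private_credit"] * 5 + ["bank_assets"] * 5,
        "value": [10, 20, 30, 40, 200, 15, 25, 35, 45, 55],
    }
)

# Quartiles for each indicator: the edges and centre line of each box
quartiles = (
    toy_long.groupby("indicator")["value"].quantile([0.25, 0.5, 0.75]).unstack()
)
# Interquartile range: the length of each box
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
# Values above this limit would be drawn as individual outlier points
quartiles["upper_limit"] = quartiles[0.75] + 1.5 * quartiles["iqr"]
print(quartiles)
```

In this toy example the value 200 lies above the upper limit for `private_credit`, so it would appear as an outlier point rather than inside the whisker.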

We could repeat the process for each topic and plot all indicators together. However, the range for the `GFDD.AI.01` variable (Bank accounts per 1,000 adults) is far larger than the other variables in this group, so it makes sense to plot this separately. We use the same process as before, as shown below.

In [None]:
# for convenience, create a list of the indicators we're interested in:
indicators_big = ["bank_accounts"]

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(columns={"GFDD.AI.01": indicators_big[0]})

# create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators_big, var_name="indicator"
).dropna()
sns.boxplot(data=gfdd_long, x="indicator", y="value");

Now we'd like to do this for a bunch more cases. A key principle of coding is 'DRY: Don't Repeat Yourself'. We shouldn't have to type this out multiple times, and it's more likely that something could go wrong if we do that. Instead, we're going to list all of the indicators in one go:

In [None]:
# for convenience, create a list of the indicators we're interested in:
indicators = [
    "private_credit",
    "bank_assets",
    "bank_accounts",  # GFDD.AI.01 is 'Bank accounts per 1,000 adults'
    "firms_credit",
    "small_firms_credit",
]

# dictionary mapping old names to new names
old_names = ["GFDD.DI.01", "GFDD.DI.02", "GFDD.AI.01", "GFDD.AI.03", "GFDD.AI.04"]
dict_of_new_names = dict(zip(old_names, indicators))

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(columns=dict_of_new_names)

# create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators, var_name="indicator"
).dropna()

# put all of our indicators of interest into a box plot
fig, ax = plt.subplots()
sns.boxplot(data=gfdd_long, y="indicator", x="value", ax=ax)
ax.set_xlim(0, 1e3);

**Figure 10.4 Box and whisker plot for our indicators of interest**

You can repeat the process for the indicators on bank stability by copying the above code and adding indicator variable names accordingly.
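
In the spirit of DRY, one option is to wrap the rename-and-reshape steps in a small function instead of copying the code. This is a sketch only: `make_long` is a hypothetical helper name, and the dataframe and indicator codes (`X.01`, `X.02`) are invented for illustration.

```python
import pandas as pd


def make_long(df, name_map, id_vars=("Country", "Year")):
    """Rename indicator columns using name_map, then reshape to long format."""
    renamed = df.rename(columns=name_map)
    return renamed.melt(
        id_vars=list(id_vars),
        value_vars=list(name_map.values()),
        var_name="indicator",
    ).dropna()


# Hypothetical stand-in for gfdd; the codes "X.01" and "X.02" are invented
toy = pd.DataFrame(
    {
        "Country": ["A", "B"],
        "Year": [2000, 2000],
        "X.01": [1.0, 2.0],
        "X.02": [3.0, None],
    }
)
toy_long = make_long(toy, {"X.01": "first_indicator", "X.02": "second_indicator"})
print(toy_long)
```

With the real data, you would pass `gfdd` and a dictionary mapping the stability indicator codes listed in Figure 10.1 to names of your choice.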

## Python Walkthrough 10.3

**Tabulating and visualizing time trends**

In this walk-through we will use the indicators for ‘Deposit money banks’ assets to GDP (%)’ and ‘Bank accounts per 1,000 adults’ as examples (`bank_assets` and `bank_accounts` respectively).

Obtaining the average indicator value for each year and region is straightforward using the `groupby` and `agg` methods, but again we have to select the relevant years (using `.query`) and remove any observations that have a missing value for the indicator being analysed (using `dropna`). We save the final output as `deposit_region`.


In [None]:
deposit_region = (
    gfdd.rename(columns=dict_of_new_names)
    .query("Year > 1999 & Year < 2015")
    .dropna(subset=["bank_assets"])
    .groupby(["Year", "Region"])["bank_assets"]
    .agg(["mean", "count"])
)
deposit_region

At this stage the summary data is stored in long format. This format is useful for plotting the data, but to produce the required table (with `Region` as the column variable and `Year` as the row variable), we need to reshape the data into wide format. While we previously used `melt` to move from wide to long, we can use the `pivot` function to achieve the opposite and transform the data from long to wide.

There is a short-cut to `pivot` though: if we only wish to move one variable from a row to a column (and it is part of the index), we can simply use the `.unstack` method:

In [None]:
deposit_region.unstack()

Note how we get two sub-tables: one for mean, and one for count.
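
If you only want the table of means, you can select that column before reshaping. The sketch below uses made-up data (`toy`, with invented regions and values) to show that `.unstack` and the longer `pivot` route give the same wide table.

```python
import pandas as pd

# Hypothetical data with invented values
toy = pd.DataFrame(
    {
        "Year": [2000, 2000, 2001, 2001],
        "Region": ["North", "South", "North", "South"],
        "bank_assets": [10.0, 20.0, 30.0, 40.0],
    }
)
summary = toy.groupby(["Year", "Region"])["bank_assets"].agg(["mean", "count"])

# Option 1: pick one statistic, then move Region from the index to the columns
wide_unstack = summary["mean"].unstack()

# Option 2: turn the index back into columns, then pivot explicitly
wide_pivot = summary.reset_index().pivot(
    index="Year", columns="Region", values="mean"
)

print(wide_unstack)
print(wide_pivot)
```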

At this point you could just print or view the data. However, using one of the many **pandas** export functions produces output that is visually easier to read and can be copied and pasted into your analysis document (note that `to_markdown` relies on the **tabulate** package being installed). Here are some examples with just the first few rows and just the first few columns:

In [None]:
# to markdown, the popular text format
print(deposit_region.iloc[:5, :3].to_markdown())

In [None]:
# to latex, for writing papers
print(deposit_region.iloc[:5, :3].to_latex())

In [None]:
# to html, for the web
print(deposit_region.iloc[:5, :3].to_html())
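
Long decimals can clutter an exported table, so you may want to round the values first. Here is a minimal sketch using a made-up summary table (`toy_table`, with invented values); it exports to CSV text, which is another format that **pandas** supports.

```python
import pandas as pd

# Hypothetical summary table with invented values
toy_table = pd.DataFrame(
    {"mean": [12.3456, 78.9012], "count": [10, 12]},
    index=pd.Index([2000, 2001], name="Year"),
)

# Round the mean to one decimal place, leaving count untouched
rounded = toy_table.round({"mean": 1})

# With no file path given, to_csv returns the CSV text as a string
csv_text = rounded.to_csv()
print(csv_text)
```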

We can use the **seaborn** objects interface to plot a line chart using the long format data (`deposit_region`), with year on the horizontal axis. Setting `color="Region"` draws a separate line, in its own colour, for each region.

In [None]:
import seaborn.objects as so

(
    so.Plot(deposit_region, x="Year", y="mean", color="Region")
    .add(so.Line(linewidth=2))
    .label(
        y="Mean deposit, % of GDP",
        title="Deposit money banks' assets to GDP (%), 2000-2014, by region",
    )
)

**Figure 10.5 Line chart of ‘Deposit money banks’ assets to GDP (%)’, 2000–2014, by region.**

The process can be repeated for income group rather than region.

In [None]:
deposit_income = (
    gfdd_new_names.query("Year > 1999 & Year < 2015")
    .dropna(subset=["bank_assets"])
    .groupby(["Year", "Income Group"])["bank_assets"]
    .agg(["mean", "count"])
)

(
    so.Plot(deposit_income, x="Year", y="mean", color="Income Group")
    .add(so.Line(linewidth=2))
    .label(
        y="Mean deposit, % of GDP",
        title="Deposit money banks' assets to GDP (%), 2000-2014, by income group",
    )
)

**Figure 10.6 Line chart of ‘Deposit money banks’ assets to GDP (%)’, 2000–2014, by income group.**

You can repeat the process for the indicator 'Bank accounts per 1,000 adults' by replacing the variable name `bank_assets` with `bank_accounts` in the above code, again by region and then by income group.
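
Rather than editing the code by hand each time, you could wrap the steps in a function that takes the indicator and the grouping variable as arguments. This is a sketch under stated assumptions: `mean_by_group` is a hypothetical helper name, the dataframe is made up, and the year filter uses a boolean mask instead of `.query`.

```python
import pandas as pd


def mean_by_group(df, indicator, group, first_year=2000, last_year=2014):
    """Mean and count of `indicator` for each year and each value of `group`."""
    # Keep only the years from first_year to last_year (both included)
    in_range = df["Year"].between(first_year, last_year)
    return (
        df[in_range]
        .dropna(subset=[indicator])
        .groupby(["Year", group])[indicator]
        .agg(["mean", "count"])
    )


# Hypothetical data with invented values
toy = pd.DataFrame(
    {
        "Year": [1999, 2000, 2000, 2000, 2015],
        "Region": ["North", "North", "North", "South", "South"],
        "bank_accounts": [100.0, 200.0, 400.0, None, 500.0],
    }
)
accounts_region = mean_by_group(toy, "bank_accounts", "Region")
print(accounts_region)
```

With the real data, a call such as `mean_by_group(gfdd_new_names, "bank_accounts", "Income Group")` should give the equivalent summary by income group.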