# Empirical Project 10

## Getting Started in Python

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import seaborn as sns
import seaborn.objects as so  # installing seaborn installs this
import pingouin as pg
import warnings


### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use(
    "https://raw.githubusercontent.com/aeturrell/core_python/main/plot_style.txt"
)
# Make seaborn work consistently with this
so.Plot.config.theme.update(mpl.rcParams)
# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## Python Walkthrough 10.1

**Importing an Excel spreadsheet into Python**

Before loading an Excel spreadsheet into Python, it can be helpful to open it in Excel to understand the structure of the spreadsheet and the data it contains. In this case we can see that detailed descriptions of all variables are in the first tab ('Definitions and Sources'). Make sure to read the definitions for the indicators listed in Figure 10.1.

The spreadsheet contains a number of other worksheets, but the data that we need is in the tab called 'Data – June 2016'. You can see that the variable names are all in the first row and missing values are simply empty cells. We can therefore proceed to import the data into Python using the `pd.read_excel` function without any additional options.

We're going to assume here that you've downloaded the Excel file as "GlobalFinancialDevelopmentDatabaseJune2017.xlsx", and saved it in a subfolder of your working directory called "data".

In [None]:
gfdd = pd.read_excel(
    Path("data/GlobalFinancialDevelopmentDatabaseJune2017.xlsx"),
    sheet_name="Data - June 2016",
)
gfdd.head()

## Python Walkthrough 10.2

**Making box and whisker plots**

Box and whisker plots were introduced in Empirical Project 6. We can use the same process here, after ensuring that the data is in the correct format.

Some plotting libraries expect data like ours to be in ‘long’ (aka tidy) format (where each row holds one value for a single country, year, and variable), whereas our data is in ‘wide’ format (each row holds a single country and year, with a separate column for each variable). We transform the data from wide to long format using the `melt` method.
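
To see the reshape in isolation, here is a minimal sketch on a toy table. The countries and numbers are invented for illustration; only the column layout mirrors the data used in this project.

```python
import pandas as pd

# Toy 'wide' table: one row per country-year, one column per indicator.
# All values here are made up for illustration.
wide = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "Year": [2010, 2011, 2010, 2011],
        "private_credit": [10.0, 12.0, None, 55.0],
        "bank_assets": [20.0, 22.0, 60.0, 65.0],
    }
)

# melt stacks the indicator columns into (indicator, value) pairs;
# dropna then removes the one row with a missing value
long = wide.melt(
    id_vars=["Country", "Year"],
    value_vars=["private_credit", "bank_assets"],
    var_name="indicator",
).dropna()

print(long)
```

Each country-year now appears once per indicator, which is the shape that `sns.boxplot` uses when we pass `x="indicator"` and `y="value"`.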

So for the Depth indicators:

In [None]:
# for convenience, create a list of the indicators we're interested in:
indicators = ["private_credit", "bank_assets"]

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(
    columns={"GFDD.DI.01": indicators[0], "GFDD.DI.02": indicators[1]}
)

# create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators, var_name="indicator"
).dropna()
gfdd_long

Now we can plot it, which we'll do using the **seaborn** package:

In [None]:
import seaborn as sns

sns.boxplot(data=gfdd_long, x="indicator", y="value");

**Figure 10.2 Box and whisker plot for ‘Private credit by deposit money banks to GDP (%)’ (`private_credit`) and ‘Deposit money banks’ assets to GDP (%)’ (`bank_assets`).**

We could repeat the process for each topic and plot all indicators together. However, the range for the `GFDD.AI.01` variable (Bank accounts per 1,000 adults) is far larger than the other variables in this group, so it makes sense to plot this separately. We use the same process as before, as shown below.

In [None]:
# for convenience, create a list of the indicators we're interested in:
indicators_big = ["bank_accounts"]

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(columns={"GFDD.AI.01": indicators_big[0]})

# create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators_big, var_name="indicator"
).dropna()
sns.boxplot(data=gfdd_long, x="indicator", y="value");

Now we'd like to do this for a bunch more cases. A key principle of coding is 'DRY: Don't Repeat Yourself'. We shouldn't have to type this out multiple times, and it's more likely that something could go wrong if we do that. Instead, we're going to list all of the indicators in one go, and that's what the below code is going to do.

However, we do need a trick for this. To change the names of the variables to sensible names we will use a built-in type of object in Python called a *dictionary*. Dictionaries provide a map from one set of values to another. A simple one might look like this:

```python
fruit_dict = {
    "Jazz": "Apple",
    "Owari": "Satsuma",
    "Seto": "Satsuma",
    "Pink Lady": "Apple",
}
```

which maps varieties of fruit into types of fruit. A dictionary is super helpful here because we can use it to map the old column names to the new ones in a statement like `gfdd_new_names = gfdd.rename(columns=dict_of_new_names)`. When we're creating our dictionary below, we could do it like in the fruit example above, and there's nothing wrong with that. But for convenience, because we might use them later, we're going to instead create two lists (one for the new names and one for the old) and then bring those lists together to create our dictionary. We've already seen *list comprehensions* and the *zip* function; in the below we bring these ideas together to form a *dictionary comprehension* in the line:

```python
dict_of_new_names = {k: w for k, w in zip(old_names, indicators)}
```
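
As a small sketch of how this fits together with `rename`, the example below builds a mapping from two short lists and applies it to a toy dataframe. The column codes and names are hypothetical and not taken from the database.

```python
import pandas as pd

# Two parallel lists: cryptic codes and the readable names we want instead
old = ["x.01", "x.02"]  # hypothetical column codes
new = ["height", "weight"]  # hypothetical readable names

# zip pairs the lists element by element; the comprehension builds the dict
mapping = {k: w for k, w in zip(old, new)}

df = pd.DataFrame({"x.01": [1.7], "x.02": [65.0], "id": [1]})

# Columns that don't appear in the mapping (here "id") are left unchanged
renamed = df.rename(columns=mapping)
print(renamed.columns.tolist())
```

For a plain one-to-one mapping like this, `dict(zip(old, new))` gives the same dictionary; the comprehension form is handy when you also want to transform the keys or values along the way.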


In [None]:
# for convenience, create a list of the indicators we're interested in:
indicators = [
    "private_credit",
    "bank_assets",
    "bank_accounts",
    "bank_branches",
    "firms_credit",
    "small_firms_credit",
    "risk_weighted_assets",
]

# dictionary mapping old names to new names
old_names = [
    "GFDD.DI.01",
    "GFDD.DI.02",
    "GFDD.AI.01",
    "GFDD.AI.02",
    "GFDD.AI.03",
    "GFDD.AI.04",
    "GFDD.SI.05",
]
dict_of_new_names = {k: w for k, w in zip(old_names, indicators)}

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(columns=dict_of_new_names)

# create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators, var_name="indicator"
).dropna()

# put all of our indicators of interest into a box plot
fig, ax = plt.subplots()
sns.boxplot(data=gfdd_long, y="indicator", x="value", ax=ax)
ax.set_xlim(0, 1e3);

**Figure 10.4 Box and whisker plot for our indicators of interest**

You can repeat the process for the indicators on bank stability by copying the above code and adding indicator variable names accordingly.

## Python Walkthrough 10.3

**Tabulating and visualizing time trends**

In this walk-through we will use the indicators for ‘Deposit money banks’ assets to GDP (%)’ and ‘Bank accounts per 1,000 adults’ as examples (`bank_assets` and `bank_accounts` respectively).

Obtaining the average indicator value for each year and region is straightforward using the `groupby` and `agg` methods, but again we have to select the relevant years (using `.query`) and remove any observations that have a missing value for the indicator being analysed (using `dropna`). We save the final output as `deposit_region`.
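
Before running this on the full dataset, it may help to see the same chain of steps on a toy table. The regions and values below are invented; the method calls are the same ones described above.

```python
import numpy as np
import pandas as pd

# Toy data: made-up regions and values, with one missing observation
toy = pd.DataFrame(
    {
        "Year": [1999, 2000, 2000, 2000, 2001, 2001],
        "Region": ["North", "North", "North", "South", "North", "South"],
        "bank_assets": [5.0, 10.0, 30.0, np.nan, 40.0, 50.0],
    }
)

summary = (
    toy.query("Year > 1999 & Year < 2015")  # keep only the years of interest
    .dropna(subset=["bank_assets"])  # drop rows with no indicator value
    .groupby(["Year", "Region"])["bank_assets"]
    .agg(["mean", "count"])  # one row per year-region pair
)
print(summary)
```

Note that the year-region pair with only a missing value disappears from the result entirely, and `count` tells you how many observations each mean is based on.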


In [None]:
deposit_region = (
    gfdd.rename(columns=dict_of_new_names)
    .query("Year > 1999 & Year < 2015")
    .dropna(subset=["bank_assets"])
    .groupby(["Year", "Region"])["bank_assets"]
    .agg(["mean", "count"])
)
deposit_region

At this stage the summary data is stored in long format. This format is useful for plotting the data, but to produce the required table (with `Region` as the column variable and `Year` as the row variable), we need to reshape the data into wide format. While we previously used `melt` to move from wide to long, we can use the `pivot` function to achieve the opposite and transform the data from long to wide.

There is a short-cut to `pivot` though: if we only wish to move one variable from a row to a column (and it is part of the index), we can simply use the `.unstack` method:

In [None]:
deposit_region.unstack()

Note how we get two sub-tables: one for mean, and one for count.
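
To see how `pivot` and `.unstack` relate, here is a sketch on a toy table of means (the numbers are invented). Both routes should produce the same wide table.

```python
import pandas as pd

# Toy long-format summary: one row per year-region pair (made-up numbers)
long_means = pd.DataFrame(
    {
        "Year": [2000, 2000, 2001, 2001],
        "Region": ["North", "South", "North", "South"],
        "mean": [20.0, 35.0, 40.0, 50.0],
    }
)

# Route 1: pivot works on ordinary columns
wide_pivot = long_means.pivot(index="Year", columns="Region", values="mean")

# Route 2: with Year and Region in the index, unstack moves the last
# index level (Region) into the columns
wide_unstack = long_means.set_index(["Year", "Region"])["mean"].unstack()

print(wide_unstack)
```

Because `deposit_region` already has `Year` and `Region` in its index after the `groupby`, `.unstack` is the shorter route there.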

At this point you could just print or view the data. However, using one of the many **pandas** export functions produces output that is visually easier to read and can be copied and pasted into your analysis document. Here are some examples with just the first few rows and just the first few columns:

In [None]:
# to markdown, the popular text format
print(deposit_region.iloc[:5, :3].to_markdown())

In [None]:
# to latex, for writing papers
print(deposit_region.iloc[:5, :3].style.to_latex())

In [None]:
# to html, for the web
print(deposit_region.iloc[:5, :3].to_html())

We can use **seaborn** to plot a line chart using the long format data (`deposit_region`), with year on the horizontal axis. We specify `color = "Region"`.

In [None]:
(
    so.Plot(deposit_region, x="Year", y="mean", color="Region")
    .add(so.Line(linewidth=2))
    .label(
        y="Mean deposit, % of GDP",
        title="Deposit money banks' assets to GDP (%), 2000-2014, by region",
    )
    .show()
)

**Figure 10.5 Line chart of ‘Deposit money banks’ assets to GDP (%)’, 2000–2014, by region.**

The process can be repeated for income group rather than region.

In [None]:
deposit_income = (
    gfdd_new_names.query("Year > 1999 & Year < 2015")
    .dropna(subset=["bank_assets"])
    .groupby(["Year", "Income Group"])["bank_assets"]
    .agg(["mean", "count"])
)

(
    so.Plot(deposit_income, x="Year", y="mean", color="Income Group")
    .add(so.Line(linewidth=2))
    .label(
        y="Mean deposit, % of GDP",
        title="Deposit money banks' assets to GDP (%), 2000-2014, by income group",
    )
    .show()
)

**Figure 10.6 Line chart of ‘Deposit money banks’ assets to GDP (%)’, 2000–2014, by income group.**

You can repeat the process for the indicator 'Bank accounts per 1,000 adults' by replacing the variable name `bank_assets` with `bank_accounts` in the above code, again by region and then by income group.

## Python Walkthrough 10.4

**Creating weighted averages**



As we only require the weighted averages for the years 2004–2014, we will create a new dataframe (called `gfdd_weighted`) to save our results in. The weights are required for each country within each region for each year, but only if there is a value for the `GFDD.AI.01` (`bank_accounts`) indicator, so we:
 - filter results by years of interest using `.query`
 - select only columns of interest using `.loc`
 - drop any invalid entries for `bank_accounts` using `.dropna`

With our new dataframe, we group by year and then region (using a `.groupby`) and then generate the weight for each country by dividing the population of each country by the sum of populations of all countries within a region (and year). To return the results in the same shape (index) as the data we began with, we use the `.transform` method. Remember:

- Use `.agg` when using a groupby, but you want your groups to become the new index (here, this would give a year-region index)
- Use `.transform` when using a groupby, but you want to retain your original index (here, numbered entries)
- Use `.apply` when using a groupby, but you want to perform operations that will leave neither the original index nor an index of groups
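
The difference between the three is easiest to see from the index of the result. Below is a minimal sketch on a toy table (invented regions and populations), not the project data.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Region": ["North", "North", "South"],
        "pop": [1.0, 3.0, 4.0],
    }
)
grouped = toy.groupby("Region")["pop"]

# agg: one value per group, indexed by the group labels
totals = grouped.agg("sum")

# transform: one value per original row, indexed like `toy`
shares = grouped.transform(lambda x: x / x.sum())

# apply: whatever the function returns; here the largest row per group,
# so the index is neither the original one nor just the group labels
top = grouped.apply(lambda x: x.nlargest(1))

print(totals.index.tolist())
print(shares.tolist())
print(top)
```

Here `shares` lines up row by row with `toy`, which is why `.transform` is the natural choice when the result is going to be stored as a new column.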

In [None]:
gfdd_weighted = (
    gfdd_new_names.query("Year > 2003 & Year < 2015")
    .loc[:, ["Year", "Country", "Region", "bank_accounts", "SP.POP.TOTL"]]
    .dropna(subset=["bank_accounts"])
)

gfdd_weighted["weight"] = gfdd_weighted.groupby(["Year", "Region"])[
    "SP.POP.TOTL"
].transform(lambda x: x / x.sum())

gfdd_weighted

Of course we want to check that this actually works as a weight! If it does, then summing over each year-region will produce a value of unity. Let's see:

In [None]:
gfdd_weighted.groupby(["Year", "Region"])["weight"].sum()

This is correct, so we can proceed to calculate the required weighted indicator values by year and region. We start by creating a new variable with the weighted indicator value (`bank_accounts_weighted`), and then sum up the weighted indicator values by year and region. Recall that when calculating the weighted average, you sum all of the weighted observations rather than taking the mean (which would calculate the simple average instead).
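
As a quick sanity check of that last point, here is a sketch with three invented countries in a single region. It compares the sum of weighted observations with the simple mean, and with **numpy**'s built-in `np.average`.

```python
import numpy as np
import pandas as pd

# Three made-up countries in one region; country C is much larger
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "population": [1.0, 1.0, 8.0],
        "bank_accounts": [100.0, 200.0, 500.0],
    }
)

# Population share of each country: the weights sum to one
weight = toy["population"] / toy["population"].sum()

# Weighted average: SUM of (value * weight), not the mean
weighted_avg = (toy["bank_accounts"] * weight).sum()

# Simple average, for comparison
simple_avg = toy["bank_accounts"].mean()

print(weighted_avg, simple_avg)
```

The weighted average is pulled towards the value for the most populous country, and it should agree with `np.average(toy["bank_accounts"], weights=toy["population"])`.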

In [None]:
(
    gfdd_weighted.assign(
        bank_accounts_weighted=lambda x: x["bank_accounts"] * x["weight"]
    )
    .groupby(["Year", "Region"])["bank_accounts_weighted"]
    .sum()
    .round(2)
)

As ever, you can change this table by unstacking it, or just export it like this.

## Python Walkthrough 10.5

**Dealing with extreme values**

In this example we use ‘Bank accounts per 1,000 adults’ (`bank_accounts`). The 95th and 5th percentiles can be obtained using the `quantile` method. We save the output (a **pandas** series) so we can refer to the values in later calculations.


In [None]:
q_5_95 = (
    gfdd_new_names.query("Year == 2010")
    .dropna(subset=["bank_accounts"])["bank_accounts"]
    .quantile([0.05, 0.95])
)
q_5_95

We can compare the value of the indicator with these upper and lower bounds using the `np.where` function from the numerical library **numpy**. The way `np.where` works is that it has the following syntax:

```text
np.where(condition, value if condition is true, value if condition is false)
```

But the beauty of `np.where` is that we need not just pass it single values in any of its three arguments: we can pass vectors to all of its arguments, creating a vector-valued return column.

In the below, we make use of `np.where` to first replace all values below the 5th percentile with the value for the 5th percentile, and then to replace all values above the 95th percentile with the value for the 95th percentile.
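
Here is a minimal sketch of that nested `np.where` on a toy series (the values are invented, with one obvious outlier). As a cross-check, it also uses the **pandas** `.clip` method, which is an alternative way to get the same result and is not the approach used in this walkthrough.

```python
import numpy as np
import pandas as pd

# Made-up values with one extreme observation
values = pd.Series([1.0, 5.0, 10.0, 20.0, 1000.0])

# quantile returns a series indexed by the requested quantiles
q = values.quantile([0.05, 0.95])

# Nested np.where: first cap from below, then cap from above,
# otherwise keep the original value
winsorized = np.where(
    values < q[0.05],
    q[0.05],
    np.where(values > q[0.95], q[0.95], values),
)

# Alternative (an assumption, not the walkthrough's method): clip does both caps
clipped = values.clip(lower=q[0.05], upper=q[0.95])

print(winsorized)
```

Values between the two percentiles pass through unchanged; only the tails are pulled in.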

Gotcha warning! The index values for retrieving data from `q_5_95` are floating-point numbers (`0.05` and `0.95`), not strings. Another gotcha! Because we refer to "`bank_accounts`" multiple times, we need to define it first in a separate step (or, alternatively use *lambda expressions*).

In [None]:
gfdd_2010 = gfdd_new_names.query("Year == 2010").dropna(subset=["bank_accounts"])

bank_2010 = gfdd_2010.assign(
    bank_accounts=np.where(
        gfdd_2010["bank_accounts"] < q_5_95[0.05],
        q_5_95[0.05],
        np.where(
            gfdd_2010["bank_accounts"] > q_5_95[0.95],
            q_5_95[0.95],
            gfdd_2010["bank_accounts"],
        ),
    )
)

Next we can obtain our summary statistics and print out the ‘Winsorized’ averages (use `gfdd_2010` to see the original averages).

In [None]:
bank_2010.groupby("Income Group").agg(avg_2010=("bank_accounts", "mean")).round(2)

## Python Walkthrough 10.6

**Calculating confidence intervals**

In Python walkthroughs 3.6 and 8.10 we used the t-test function from the **pingouin** package to obtain differences in means and confidence intervals (CIs) for two groups of data. Here we need to obtain these statistics for the `GFDD.SI.05` indicator (renamed as `risk_weighted_assets`) between 2007 and 2014 for each region.

As we need to find the confidence intervals for a number of regions, we can use a *vectorised operation* to perform the same calculation for each region in turn. For this, we do need to reshape our data, however. We want to make it so that the 2007 and 2014 values of risk weighted assets appear as columns while regions (and countries) are rows. To do this, we're going to first select both the values and variables we're interested in using the `.loc` command. We'll then use `pivot` to re-order the data into the shape we want.


In [None]:
rwa_07_14 = gfdd_new_names.loc[
    gfdd_new_names["Year"].isin([2007, 2014]),
    ["Year", "Region", "Country", "risk_weighted_assets"],
].pivot(index=["Region", "Country"], columns=["Year"], values=["risk_weighted_assets"])

rwa_07_14

Now we have the shape, but we've unwittingly created quite a complex structure, especially of our index. To access columns in a hierarchical index (or multi-index), use a tuple (tuples are written with round brackets). You can see the precise name of the columns by running `.columns` like this:

In [None]:
rwa_07_14.columns
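

To illustrate the tuple-based column access, here is a sketch using a toy table with two invented countries. It applies the same `pivot` call as above and then selects one column by its tuple name.

```python
import pandas as pd

# Toy long-format data: two made-up countries, two years each
long = pd.DataFrame(
    {
        "Region": ["North", "North", "North", "North"],
        "Country": ["A", "A", "B", "B"],
        "Year": [2007, 2014, 2007, 2014],
        "risk_weighted_assets": [12.0, 15.0, 14.0, 18.0],
    }
)

wide = long.pivot(
    index=["Region", "Country"], columns=["Year"], values=["risk_weighted_assets"]
)

# Each column name is a tuple: (variable name, year)
print(wide.columns.tolist())

# Select a single column by passing the whole tuple
col_2014 = wide[("risk_weighted_assets", 2014)]
print(col_2014)
```

Note that the year inside the tuple is a number, not a string, matching the type of the original `Year` column.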

Now, we want to get a t-test for each region. We'll use the **pingouin** package for the t-test, we'll group by region, and we'll use the apply function, which allows us to run functions that combine different columns.

In [None]:
import pingouin as pg

rwa_07_14_ttest = rwa_07_14.groupby("Region").apply(
    # each `group` is the sub-dataframe of countries in one region
    lambda group: pg.ttest(
        group[("risk_weighted_assets", 2007)], group[("risk_weighted_assets", 2014)]
    )
)
rwa_07_14_ttest

Let's explode out the confidence interval into two separate columns (it's one column of lists of two entries here). We'll also add in some other stats based on the t-test confidence intervals: the mean, and the width (stored here as the half-width, that is, the distance from the midpoint to either bound). Note that the mean is just the `.mean` across each row (`axis=1`), and the difference can be computed using `.diff` and then taking the one valid entry by using `dropna`.

To assemble all of this, we're going to *concatenate* multiple dataframes using `pd.concat`. The syntax of this command is `pd.concat([list of dataframes], axis=<axis you want to stick dataframes together by>)`. In the below, the two dataframes in our list are the original `rwa_07_14_ttest` and a new, second dataframe that we create inline that has the confidence intervals as its values and two columns, "lower" and "upper".

Further below, when we wish to add in means and widths, we can do this by just declaring a new column (eg by writing `rwa_07_14_ttest["mean"] = ...`) and then applying a function to the two columns of interest.
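
The same pattern can be sketched on a toy table where each row holds a two-element list. The interval values are invented, and the column names `ci` and `half_width` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Toy stats table: each cell of "ci" holds [lower, upper] (made-up numbers)
stats = pd.DataFrame(
    {"ci": [[-2.0, 4.0], [1.0, 3.0]]},
    index=["North", "South"],
)

# Turn the column of lists into a two-column dataframe with the same index
bounds = pd.DataFrame(
    stats["ci"].tolist(), columns=["lower", "upper"], index=stats.index
)

# axis=1 sticks the dataframes together side by side, matching on the index
combined = pd.concat([stats, bounds], axis=1)

# New columns computed from the two bounds
combined["mean"] = combined[["lower", "upper"]].mean(axis=1)
combined["half_width"] = (combined["upper"] - combined["lower"]) / 2

print(combined)
```

Passing `index=stats.index` when building `bounds` matters: `pd.concat` with `axis=1` aligns rows on the index, so the two dataframes need matching labels.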

In [None]:
rwa_07_14_ttest = pd.concat(
    [
        rwa_07_14_ttest,
        pd.DataFrame(
            rwa_07_14_ttest["CI95%"].tolist(),
            columns=["lower", "upper"],
            index=rwa_07_14_ttest.index,
        ),
    ],
    axis=1,
)
rwa_07_14_ttest["mean"] = rwa_07_14_ttest[["lower", "upper"]].mean(axis=1)
rwa_07_14_ttest["width"] = (
    rwa_07_14_ttest[["lower", "upper"]].diff(axis=1).dropna(axis=1) / 2
)
rwa_07_14_ttest

The same process can be repeated for income groups and for the indicator "GFDD.SI.01" (Bank Z-score).

## Python Walkthrough 10.7

**Plotting column charts with error bars**

Again we use the `GFDD.SI.05` indicator (`risk_weighted_assets`) by region as an example. You can repeat the following steps for income groups and for the `GFDD.SI.01` indicator (Bank Z-score) by changing the variable name(s) in Python walkthrough 10.6 accordingly, then running the code below.

In [None]:
import seaborn.objects as so  # installing seaborn installs this

(
    so.Plot(
        rwa_07_14_ttest.reset_index(),
        y="Region",
        x="mean",
        color="Region",
        xmin="lower",
        xmax="upper",
    )
    .add(so.Bar(edgecolor="k"))
    .add(so.Range(), so.Est(errorbar="ci"))
    .add(so.Dash(width=0.3), x="lower")
    .add(so.Dash(width=0.3), x="upper")
    .label(
        title="Differences in risk weighted assets between 2007 and 2014",
        x="Difference",
    )
    .show()
)

**Figure 10.7 Column chart with error bars for ‘Bank regulatory capital to risk-weighted assets (%)’ (`risk_weighted_assets`).**

This chapter used the following packages:

In [None]:
%load_ext watermark
%watermark --iversions