# Empirical Project 5

## Python-specific learning objectives

In addition to the learning objectives for this project, in Part 5.1 you will learn how to use loops and list comprehensions to repeat specified tasks for a list of values.

## Getting started in Python

For this project, you will need the following packages:

- **pandas** for data analysis
- **matplotlib** for data visualisation
- **numpy** for numerical methods

You'll also be using the **warnings** and **pathlib** packages, but these come built in with Python.

Remember, you can install packages in Visual Studio Code's integrated terminal (click "View > Terminal") by running `conda install packagename` (if using the Anaconda distribution of Python) or `pip install packagename` if not.

Once you have the Python packages installed, you will need to import them into your Python session and configure any other initial settings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings

# Set the plot style for prettier charts:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## TODO

- Go to the Globalinc website and download the Excel file containing the data by clicking ‘xlsx’.
- Save it in a subfolder of the directory you are coding in such that the relative path is `data/GCIPrawdata.xlsx`.
- Import the data into Python as explained in Python Walkthrough 5.1.

## Python Walkthrough 5.1

**Importing the Excel file (`.xlsx` or `.xls`) into Python**


As we are importing an Excel file, we use the `pd.read_excel` function from the **pandas** package. The file is called "GCIPrawdata.xlsx". Before you import the file into Python, open the datafile in Excel to understand its structure. You will see that the data is all in one worksheet (which is convenient), and that the headings for the variables are in the third row. Hence we will use the `skiprows=2` option in the `pd.read_excel` function to skip the first two rows.

Now let's import the data, using the `Path` class from **pathlib** to create the path to the file, and look at the first few rows with `head()`:

In [None]:
df = pd.read_excel(Path("data/GCIPrawdata.xlsx"), skiprows=2)
df.head()

The data is now in a pandas dataframe, which is the primary object for data analysis in Python. You can always tell the type of object you are dealing with in Python by running `type` on it:

In [None]:
type(df)

In the data, each row represents a different country-year combination. The first row is for Afghanistan in 1980, and the first value (in the third column) is 206, for the variable Decile 1 Income. This value indicates that the mean annual income of the poorest 10% in Afghanistan was the equivalent of 206 USD (in 1980, adjusted using purchasing power parity). Looking at the next column, you can see that the mean income of the next richest 10% (those in the 11th to 20th percentiles for income) was 350.

To see the list of variables, we use the `df.info()` method.

In [None]:
df.info()

In addition to the country, year, and the ten income deciles, we have mean income and the population.
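
The walkthroughs below repeatedly pick out the ten decile columns by name using a list comprehension. As a minimal sketch of that idea, the snippet below applies it to a small made-up dataframe. The name `toy` and its values are illustrative assumptions, not the GCIP data; only the column-naming pattern follows the text.

```python
import pandas as pd

# Hypothetical two-row dataframe that mimics the GCIP column naming
toy = pd.DataFrame(
    {
        "Country": ["A", "B"],
        "Year": [1980, 1980],
        "Decile 1 Income": [100, 200],
        "Decile 2 Income": [150, 300],
        "Mean Income": [125, 250],
    }
)

# Keep every column whose name contains the word "Decile"
decile_cols = [col for col in toy.columns if "Decile" in col]
print(decile_cols)
print(toy[decile_cols])
```

The condition after `if` acts as a filter, so columns such as `"Mean Income"` are left out.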

## Python Walkthrough 5.2

**Calculating cumulative shares using the `cumsum` function**

Before we calculate cumulative income shares, we need to calculate the total income for each country-year combination using the mean income and the population size.

In [None]:
df["total_income"] = df["Mean Income"]*df["Population"]

Here we have chosen China (a country that recently underwent enormous economic changes) and the US (a developed country). We use `.loc` to create a new dataframe (called `xf`) containing only the countries and years we need.

In [None]:
# Create lists for the years and countries we'd like
sel_year = [1980, 2014]
sel_country = ["United States", "China"]

xf = df.loc[ (df["Year"].isin(sel_year)) & (df["Country"].isin(sel_country)), :]
xf

These numbers are very large, so for our purpose it is easier to assume that there is only one person in each decile, in other words the total income is 10 times the mean income. This simplification works because, by definition, each decile has exactly the same number of people (10% of the population).
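
To see why the population size drops out, write $N$ for the population, $\bar{y}$ for the mean income, and $y_d$ for the mean income of decile $d$ (these symbols are introduced here purely for illustration). Each decile contains $N/10$ people, so the income share of decile $d$ is

$$
\frac{y_d \cdot N/10}{\bar{y} \cdot N} = \frac{y_d}{10\,\bar{y}},
$$

which is the same for any $N$, including $N = 10$ (one person per decile).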

We will be using the very useful `cumsum` function (short for ‘cumulative sum’) to calculate the cumulative income. To see what this function does, look at this simple example.

In [None]:
test_series = pd.Series([2, 4, 10, 22])
test_series.cumsum()

You can see that each number in the output is the sum of all the preceding numbers in the sequence (including itself). For example, we got the third number, 16, by adding 2, 4, and 10. We now apply this function to calculate the cumulative income shares for China (1980) and save them as `cum_inc_share_c80`.

In [None]:
query = (xf["Year"] == 1980) & (xf["Country"] == "China")
decs_c80 = xf.loc[query, [x for x in xf.columns if "Decile" in x]]
# Give the total income, assuming a population of 10
total_inc = 10*xf.loc[query, "Mean Income"]
# axis=1 sums across the decile columns (the default would sum down the rows)
cum_inc_share_c80 = decs_c80.cumsum(axis=1) / total_inc.values[0]
cum_inc_share_c80
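
One detail worth keeping in mind is the direction in which `cumsum` runs on a dataframe. By default it sums down the rows of each column, whereas the deciles for one country-year sit side by side in a single row. The sketch below uses a hypothetical one-row dataframe (`one_row`, with made-up values) to show the difference.

```python
import pandas as pd

# Hypothetical single-row dataframe: one country-year, three income groups
one_row = pd.DataFrame({"g1": [2], "g2": [4], "g3": [10]})

down_rows = one_row.cumsum()          # default axis=0: sums down each column
across_cols = one_row.cumsum(axis=1)  # axis=1: sums across the columns of each row

print(down_rows)    # unchanged, because there is only one row
print(across_cols)  # running total from left to right
```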

Although this showed clearly what we were doing for China in 1980, what if we want to do it for all year-country combinations? We are able to do that by defining a function:

In [None]:
def create_cumulative_income_shares(data, year, country):
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    # Give the total income, assuming a population of 10
    total_inc = 10*data.loc[query, "Mean Income"]
    cum_inc_share = decs.cumsum(axis=1) / total_inc.values[0]
    cum_inc_share.index = [country + ", " + str(year)]
    cum_inc_share.columns = range(1, len(cum_inc_share.columns)+1)
    return cum_inc_share


Now we need to pass in all combinations of countries and years (this could be automated too, but it would only be worth it for many combinations so we'll just enter the different combinations manually):

In [None]:
cum_inc_share_c14 = create_cumulative_income_shares(xf, 2014, "China")
cum_inc_share_us80 = create_cumulative_income_shares(xf, 1980, "United States")
cum_inc_share_us14 = create_cumulative_income_shares(xf, 2014, "United States")
cum_inc_share_c80 = create_cumulative_income_shares(xf, 1980, "China")
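
For many combinations, the calls could be generated in a loop instead. The following is only a sketch under stated assumptions: it uses made-up data (`toy`), a copy of the function above under the hypothetical name `cumulative_income_shares`, and `itertools.product` to list every year-country pair.

```python
from itertools import product

import numpy as np
import pandas as pd


def cumulative_income_shares(data, year, country):
    """Cumulative decile income shares for one country-year (one person per decile)."""
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    total_inc = 10 * data.loc[query, "Mean Income"]
    shares = decs.cumsum(axis=1) / total_inc.values[0]
    shares.index = [f"{country}, {year}"]
    shares.columns = range(1, len(shares.columns) + 1)
    return shares


# Made-up data: two countries, two years, ten decile incomes each
rng = np.random.default_rng(0)
rows = []
for country, year in product(["A", "B"], [1980, 2014]):
    deciles = np.sort(rng.uniform(100, 1000, size=10))
    row = {"Country": country, "Year": year, "Mean Income": deciles.mean()}
    row.update({f"Decile {i} Income": d for i, d in enumerate(deciles, start=1)})
    rows.append(row)
toy = pd.DataFrame(rows)

# One call per (year, country) pair, stacked into a single dataframe
all_shares = pd.concat(
    [
        cumulative_income_shares(toy, year, country)
        for year, country in product([1980, 2014], ["A", "B"])
    ]
)
print(all_shares.round(2))
```

Each row of `all_shares` is one Lorenz curve, and the last column is 1 because the ten deciles together account for all income.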

## Python Walkthrough 5.3

**Drawing Lorenz curves**

Let us plot the cumulative income shares for China (1980), which we previously stored in the variable `cum_inc_share_c80`. We'll use the standard `fig, ax = plt.subplots` method of constructing an axis to plot the data on.

In [None]:
fig, ax = plt.subplots()
cum_inc_share_c80.T.plot(ax=ax)
ax.plot(cum_inc_share_c80.columns, [x/10 for x in cum_inc_share_c80], color="k")
ax.set_ylim(0, 1)
ax.set_xlim(1, 10)
ax.legend([])
ax.set_title("Lorenz curve, China, 1980")
ax.set_xlabel("Income Decile")
ax.set_ylabel("Cumulative income share")
plt.show();

**Figure 5.1** Lorenz curve, China, 1980.

The blue line is the Lorenz curve. The Gini coefficient is the ratio of the area between the two lines and the total area under the black line. We will calculate the Gini coefficient in Python Walkthrough 5.4.

Now we add the other Lorenz curves to the chart. We loop over the four sets of cumulative income shares, using `zip` to pair each one with a different line style (the `linestyle=` option), while **matplotlib** automatically assigns a different colour to each line. Because we no longer switch the legend off, each line is labelled in the chart legend with its country and year.

In [None]:
fig, ax = plt.subplots()
for line, style in zip([cum_inc_share_c80, cum_inc_share_us80, cum_inc_share_us14, cum_inc_share_c14], ["-", "-.", "dashed", ":"]):
    line.T.plot(ax=ax, linestyle=style)
ax.plot(cum_inc_share_c80.columns, [x/10 for x in cum_inc_share_c80], color="k")
ax.set_ylim(0, 1)
ax.set_xlim(1, 10)
ax.set_title("Lorenz curves, China and the US (1980 and 2014)")
ax.set_xlabel("Income Decile")
ax.set_ylabel("Cumulative income share")
plt.show();

**Figure 5.2** Lorenz curves, China and the US (1980 and 2014).

As the chart shows, the income distribution has changed more clearly for China (from the solid line to the dotted line) than for the US (from the dash-dotted line to the dashed line).

## Python Walkthrough 5.4

**Calculating Gini coefficients**

The Gini coefficient is graphically represented by dividing the area between the perfect equality line and the Lorenz curve by the total area under the perfect equality line (see [Section 5.9](https://www.core-econ.org/espp/book/text/05.html#59-measuring-economic-inequality) of *Economy, Society, and Public Policy* for further details). Let's first write a function that can compute Gini coefficients on input data. We'll call the function that calculates Gini coefficients from a vector of numbers `gini_coefficient`, and we apply it to the income deciles in our data (as seen in Python Walkthrough 5.3).

In [None]:
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    x = np.double(x.values)
    x = x / x.sum()
    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
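
The area definition can also be checked directly. The sketch below is not the method used in this walkthrough: it approximates the area under the Lorenz curve with trapezoids (an assumption chosen for illustration) and compares the result with the mean-absolute-difference formula, using made-up decile incomes. The function names are hypothetical.

```python
import numpy as np


def gini_from_lorenz_area(incomes):
    """Gini as 1 - 2 * (area under the Lorenz curve), using trapezoids."""
    x = np.sort(np.asarray(incomes, dtype=float))
    # Lorenz curve points, starting from the origin (0, 0)
    cum_share = np.concatenate([[0.0], np.cumsum(x) / x.sum()])
    width = 1 / len(x)  # each group is an equal share of the population
    # Trapezoid rule: average of neighbouring heights times the width
    area = np.sum((cum_share[:-1] + cum_share[1:]) / 2 * width)
    return 1 - 2 * area


def gini_from_mean_abs_diff(incomes):
    """Gini as half the relative mean absolute difference."""
    x = np.asarray(incomes, dtype=float)
    mad = np.abs(np.subtract.outer(x, x)).mean()
    return 0.5 * mad / x.mean()


toy_deciles = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up decile incomes
print(gini_from_lorenz_area(toy_deciles))
print(gini_from_mean_abs_diff(toy_deciles))
```

The area under the perfect equality line is 0.5, which is why the ratio of areas simplifies to one minus twice the area under the Lorenz curve.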


Let's now demonstrate using this on the four cases we saw earlier. As before, we'll get a helping hand by defining a function that just returns the income deciles for a given year-country pair.

In [None]:
def grab_deciles_for_year_country_pair(data, year, country):
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    return decs


gini_c14 = gini_coefficient(grab_deciles_for_year_country_pair(xf, 2014, "China"))
gini_us80 = gini_coefficient(grab_deciles_for_year_country_pair(xf, 1980, "United States"))
gini_us14 = gini_coefficient(grab_deciles_for_year_country_pair(xf, 2014, "United States"))
gini_c80 = gini_coefficient(grab_deciles_for_year_country_pair(xf, 1980, "China"))

Let's check one of the Gini coefficients:

In [None]:
print(f"The Gini coefficient for the US in 1980 is {gini_us80:.2f}")

Now we make the same line chart as in Python Walkthrough 5.3, but use the annotate function to label curves with their respective Gini coefficients.

In [None]:
fig, ax = plt.subplots()
for line, style in zip([cum_inc_share_c80, cum_inc_share_us80, cum_inc_share_us14, cum_inc_share_c14], ["-", "-.", "dashed", ":"]):
    line.T.plot(ax=ax, linestyle=style)
ax.plot(cum_inc_share_c80.columns, [x/10 for x in cum_inc_share_c80], color="k")
ax.set_ylim(0, 1)
ax.set_xlim(1, 10)
ax.set_title("Lorenz curves, China and the US (1980 and 2014)")
ax.set_xlabel("Income Decile")
ax.set_ylabel("Cumulative income share")
# Find four points along the lines to use for labels
no_points = len(ax.lines[0].get_ydata())
points_to_label = np.rint(np.linspace(0, no_points-1, num=4)).astype(int)
for line, name, point in zip(ax.lines, [gini_c80, gini_us80, gini_us14, gini_c14], points_to_label):
    y = line.get_ydata()[point]  # NB: each line is labelled at a different point along its length
    x = line.get_xdata()[point]
    text = ax.annotate(
        f"{name:.2f}",
        xy=(x, y),
        xytext=(x+1.5, y+0.2/x),
        color=line.get_color(),
        textcoords="data",
        fontweight="bold",
        backgroundcolor='white',
        arrowprops=dict(arrowstyle="->", connectionstyle="angle3"),
    )
plt.show();

The Gini coefficients for both countries have increased, confirming what we already saw from the Lorenz curves: in both countries, the income distribution has become more unequal.

## Extension Python Walkthrough 5.5

**Calculating Gini coefficients for all countries and all years**

In this extension walk-through, we show you how to calculate the Gini coefficient for all countries and years in your dataset.

This sounds like a tedious task, and indeed if we were to use the same method as before it would be mind-numbing. However, we have a powerful programming language at hand, and this is the time to use it.

Here we use a very useful programming tool you may not have come across yet: vectorised operations. These have some analogies with `for` loops, which iterate over the same code chunk while something changes.

As a reminder, this is what a for loop that prints the square of the numbers from 0 to 9 looks like:

In [None]:
for i in range(10):
    print(i**2)

In the above command, `range(10)` creates a sequence of numbers from 0 to 9 (0, 1, 2, …, 9). The command `for i in range(10):` defines the variable `i` initially as 0, then iterates over every value in the given range. Here our command prints the value of $i^2$ for each value of $i$. Check that you understand the syntax above by modifying it to print only the first 5 square numbers, or to add 2 to the numbers from 0 to 9 (instead of squaring these numbers).
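
When the results need to be stored rather than printed, the same loop can be written more compactly as a list comprehension. A minimal sketch (the variable names are illustrative):

```python
# A for loop that collects the squares in a list...
squares_loop = []
for i in range(10):
    squares_loop.append(i**2)

# ...and the equivalent list comprehension
squares_comp = [i**2 for i in range(10)]

print(squares_comp)
```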

We can achieve a similar feat using **pandas** series and *vectorised operations*:

In [None]:
number_series = pd.Series(range(10))
number_series.apply(lambda x: x**2)

`apply(lambda x: x**2)` tells Python to apply the operation $x^2$ to every element in the given series.

Note that, for this simple example, there is a shorter way to achieve the same effect, `number_series.pow(2)`, but for anything outside of a set of standard functions, you'll need to use `apply`.
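
As a quick check, the sketch below compares `apply` with two built-in alternatives, the `pow` method and the `**` operator, on the same series. All three are expected to give identical values.

```python
import pandas as pd

number_series = pd.Series(range(10))

via_apply = number_series.apply(lambda x: x**2)  # function applied element by element
via_pow = number_series.pow(2)                   # built-in series method
via_operator = number_series**2                  # arithmetic operator on the whole series

print((via_apply == via_pow).all() and (via_pow == via_operator).all())
```

The built-in versions are usually faster on large data, so `apply` is best kept for operations that have no built-in equivalent.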

Let's now move on to computing the Gini coefficient for all country-year pairs in the dataset.

In [None]:
df["gini"] = df.apply(lambda row: gini_coefficient(row[[x for x in df.columns if "Decile" in x]]), axis=1)
df.head()

Using this apply approach, we have 4,799 Gini coefficients in one line. We can even look at some summary statistics for the gini column across the year-country pairs:

In [None]:
df["gini"].describe().round(2)

The average Gini coefficient is 0.46, the maximum is 0.74, and the minimum 0.18. Let’s look at these extreme cases.

First we will look at the extremely equal income distributions (those with a Gini coefficient smaller than 0.20):

In [None]:
small_gini = df.loc[df["gini"] < 0.2, ["Country", "Year", "gini"]]
small_gini

These correspond to eastern European countries before the fall of communism.

Now the most unequal countries (those with a Gini coefficient larger than 0.73):

In [None]:
big_gini = df.loc[df["gini"] > 0.73, ["Country", "Year", "gini"]]
big_gini

## Extension Python Walkthrough 5.6

**Plotting time series of Gini coefficients**

In this extension walk-through, we show you how to make time series plots (time on the horizontal axis, the variable of interest on the vertical axis) with Gini coefficients for a list of countries of your choice.

There are many ways to plot data in Python, but the *imperative* plotting tool **matplotlib** is the most widely used (and extended). It is widely used in science and academia, most famously to help create the [first ever image of a black hole](https://numpy.org/case-studies/blackhole-image/). Although **matplotlib** is the core tool and can do almost any visualisation (if you know how), you may want to check out some other packages, with different strengths and weaknesses [here](https://aeturrell.github.io/coding-for-economists/vis-intro.html#libraries-for-data-visualisation).

First we use `.loc` together with the `isin` method to select a small list of countries and save their data. As an example, we have chosen four anglophone countries: the UK, the US, Ireland, and Australia.

In [None]:
countries = ["United Kingdom", "United States", "Ireland", "Australia"]
plot_df = df.loc[df["Country"].isin(countries), ["Country", "Year", "gini"]]
plot_df.head()

Let's now plot these as a time series.

In [None]:
fig, ax = plt.subplots()
for country, style in zip(countries, ["-", "-.", "dashed", ":"]):
    plot_df_c = plot_df.loc[plot_df["Country"] == country]
    ax.plot(plot_df_c["Year"], plot_df_c["gini"], label=country, linestyle=style)
ax.set_xlim(1970, None)
ax.set_title("Gini coefficients for anglophone countries")
ax.set_xlabel("Year")
ax.set_ylabel("Gini")
for line, country in zip(ax.lines, countries):
    y = line.get_ydata()[0]  # NB: to label the end of each line instead, change [0] to [-1] here and below
    x = line.get_xdata()[0]
    text = ax.annotate(
        country,
        xy=(x, y),
        fontsize=8,
        xytext=(-5, 0),
        color=line.get_color(),
        textcoords="offset points",
        fontweight="bold",
        ha="right",
    )
plt.show();

For each country, we asked **matplotlib** to plot Year on the horizontal axis (`plot_df_c["Year"]`) and gini on the vertical axis (`plot_df_c["gini"]`). The `linestyle=` option gives each line a different pattern so that the lines are easy to tell apart; **matplotlib** automatically cycles through colours unless we tell it not to. (Why don't you see what happens when you change the `xytext=` option?)

**matplotlib** is extremely powerful, and if you want to produce a variety of different charts, you may want to read more about that package and other packages for making different kinds of charts. You can find out more about **matplotlib** on the [official documentation](https://matplotlib.org/), and you can find a long [list of commonly used plots here](https://aeturrell.github.io/coding-for-economists/vis-common-plots.html).

## Python Walkthrough 5.7

**Importing `csv` files into Python**

Before importing, make sure the `.csv` file is saved in the `data` sub-folder of your current working directory. After importing (using the `pd.read_csv` function from **pandas**), use the `df.info()` method to check that the data was imported correctly.

In [None]:
df = pd.read_csv(Path("data/inequality-of-life-as-measured-by-mortality-gini-coefficient-1742-2002.csv"))
df.info()

The variable `"Entity"` is the country and the variable `"Gini coefficients for lifetime inequality (Peltzman (2009))"` is the health Gini. Let’s change these variable names (to `"country"` and `"health"`, respectively) to clarify what they actually refer to, which will help when writing code (and if we go back to read this code at a later date).

In [None]:
df = df.rename(columns={"Entity": "country", "Gini coefficients for lifetime inequality (Peltzman (2009))": "health"})
df.head()

There is another quirk in the data that you may not have noticed in this initial data inspection: All countries have a short code (`"Code"`), except for England and Wales (currently blank `nan` in the dataframe). Let's map those onto a new code, "ENW", using `.fillna`.

In [None]:
df["Code"] = df["Code"].fillna("ENW")

## Python Walkthrough 5.8

**Creating line graphs with *matplotlib***

As shown in Python Walkthrough 5.6, we can loop over the data to plot a line for each country (over time). Most of the code below is similar to our use of **matplotlib** in previous walk-throughs. As there are many lines close together, we will use a legend instead of labelling lines on the chart itself. We'll also ensure that there are enough line styles to go round by repeating the list of styles ten times (`*10`).



In [None]:
countries = df["country"].unique()
fig, ax = plt.subplots()
for country, style in zip(countries, ["-", "-.", "dashed", ":"]*10):
    plot_df_c = df.loc[df["country"] == country]
    ax.plot(plot_df_c["Year"], plot_df_c["health"], label=country, linestyle=style)
ax.set_xlim(1750, 2020)
ax.set_ylim(0, 0.6)
ax.set_title("Mortality inequality in Gini coefficient")
ax.set_xlabel("Year")
ax.set_ylabel("Gini")
ax.legend(bbox_to_anchor=(1, 1.05))
plt.show();

**Figure 5.6** Mortality inequality Gini coefficients.

## Python Walkthrough 5.9

**Drawing a column chart with sorted values**

*Plot a column chart for 1952*

First we use `.loc` to select the data for 1952 only and store it in a temporary dataframe called `df_subset`, and then we sort that by the health Gini.

In [None]:
year = 1952
df_subset = df.loc[df["Year"] == year]
df_subset = df_subset.sort_values(by="health")
df_subset

The rows are now ordered according to health, in ascending order. Let’s use **matplotlib** again for the chart.

In [None]:
fig, ax = plt.subplots()
ax.bar(df_subset["Code"], df_subset["health"], 0.35)
ax.set_xlabel("Country code")
ax.set_ylabel("Mortality inequality Gini coefficient")
ax.set_title(f"Mortality Gini ({year})")
plt.show()

**Figure 5.7** Mortality Gini coefficients (1952).

*Plot a column chart for 2002*

Now we'd like to do the same for 2002. Rather than re-specify everything, we can write a function that accepts a year, and our data, and does this for us for arbitrary years.

In [None]:
def plot_bar_chart_health_gini(data, year):
    df_subset = data.loc[data["Year"]==year]
    df_subset = df_subset.sort_values(by="health")
    fig, ax = plt.subplots()
    ax.bar(df_subset["Code"], df_subset["health"], 0.35)
    ax.set_xlabel("Country code")
    ax.set_ylabel("Mortality inequality Gini coefficient")
    ax.set_title(f"Mortality Gini ({year})")
    plt.show()

Now let's use it on 2002:

In [None]:
plot_bar_chart_health_gini(df, 2002)

**Figure 5.9** Mortality Gini coefficients (2002).

Let's now plot both years in a split bar chart design. To ensure we get both years in the same order, we'll use the 1952 order, declare that the country column is an ordered categorical variable, and then sort the values by that order.

In [None]:
countries_in_order = df.loc[df["Year"]==1952, :].sort_values(by="health")["country"]
df["country"] = df["country"].astype("category")
df["country"] = df["country"].cat.set_categories(countries_in_order)
df = df.sort_values(by="country")
df.head()
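
To see how the ordered categorical trick works, here is a minimal sketch on made-up data (the dataframe `toy`, its country names, and its values are all hypothetical). Sorting by a categorical column follows the order of its categories rather than the alphabet.

```python
import pandas as pd

# Made-up data: three countries observed in two years
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "A", "B", "C"],
        "Year": [1952, 1952, 1952, 2002, 2002, 2002],
        "health": [0.30, 0.10, 0.20, 0.12, 0.08, 0.11],
    }
)

# Order of countries according to their 1952 values (lowest first)
order_1952 = toy.loc[toy["Year"] == 1952].sort_values(by="health")["country"].tolist()

# Treat country as a categorical whose categories follow that order
toy["country"] = toy["country"].astype("category").cat.set_categories(order_1952)

# Sorting by a categorical column uses the category order, not the alphabet
toy = toy.sort_values(by=["country", "Year"])
print(toy)
```

Here country B comes first because it had the lowest value in 1952, and the 2002 rows follow the same country order.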

Now we can plot both years:

In [None]:
x = np.arange(len(df["Code"].unique()))  # the label locations
year1, year2 = 1952, 2002
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
ax.bar(x - width / 2, df.loc[df["Year"] == year1, "health"], width, label=str(year1))
ax.bar(x + width / 2, df.loc[df["Year"] == year2, "health"], width, label=str(year2))
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xticks(x)
ax.set_xticklabels(df["Code"].unique())
ax.legend(frameon=False)
ax.set_xlabel("Country code")
ax.set_ylabel("Mortality inequality Gini coefficient")
ax.set_title(f"Mortality Gini, {year1} vs {year2}")
ax.set_xlim(-width, len(x))
plt.show()

**Figure 5.10** Mortality Gini coefficients (1952 and 2002).

## Python Walkthrough 5.10

**Drawing a column chart with sorted values**

For this walkthrough, we downloaded the "Median availability of selected generic medicines" data, which you can find [here](https://apps.who.int/gho/data/view.main.660). We saved it as the default name "MDG_0000000010,WHS6_101.csv", in the data subdirectory of our working directory. Looking at the spreadsheet in Excel, Numbers, OpenOffice, or LibreOffice, you can see that the actual data starts in the third row, meaning that there are two header rows. So let’s skip the first row when opening it.

In [None]:
df = pd.read_csv(Path("data/MDG_0000000010,WHS6_101.csv"), skiprows=1)
df.head()
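
The effect of `skiprows` can be seen on a small in-memory example. The sketch below uses made-up file contents (a title row, then the header row, then the data), so it only mimics the layout described above and is not the WHO file itself.

```python
from io import StringIO

import pandas as pd

# Made-up file contents: a title row, then the header row, then the data
raw_text = """Some title that is not part of the table,,
country,private,public
A,80,60
B,55,40
"""

# skiprows=1 drops the first line so the second line becomes the header
toy = pd.read_csv(StringIO(raw_text), skiprows=1)
print(toy)
```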

Having inspected the dataset in a spreadsheet programme and opened it with **pandas**, we know that the 2nd and 3rd columns don't have particularly informative column names. From the spreadsheet, you know that they should relate to "Private access %" and "Public access %", respectively. So let's rename the columns to give them the right labels.

The columns of a dataframe, `df.columns`, are *immutable*, meaning we cannot change individual entries with an assignment statement (using `=`), but we can either use the `.rename` method or replace all the column names. Here, it's more convenient to replace all the column names:

In [None]:
df.columns = ["country", "private_access", "public_access"]
df["country"] = df["country"].astype("category")
df.head(2)
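
The sketch below illustrates the point about immutability on a made-up dataframe (`toy` is hypothetical): assigning to a single entry of `.columns` raises a `TypeError`, while `.rename` and replacing all the names both work.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # made-up data

# Assigning to a single entry of df.columns is not allowed
try:
    toy.columns[0] = "first"
except TypeError as err:
    print(f"TypeError: {err}")

# Option 1: rename selected columns (returns a new dataframe)
renamed = toy.rename(columns={"a": "first"})

# Option 2: replace all column names at once
toy.columns = ["first", "second"]

print(list(renamed.columns), list(toy.columns))
```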

To find details about these variables, click the column headers of the tables shown on the website. You can see, for example, that "Median availability of selected generic medicines (%)" has a method of measurement given by:

> A standard methodology has been  developed by WHO and Health Action International (HAI). Data on the availability of a specific list of medicines are collected in at least four geographic or administrative areas in a sample of medicine dispensing points. Availability is reported as the percentage of medicine outlets where a medicine was found on the day of the survey.

Before we produce charts of the data, let's look at some summary measures of the variables using the **skimpy** package. You may need to install this package to use it (you can do this by running `pip install skimpy` on your computer's command line).

In [None]:
from skimpy import skim

skim(df)

On average, private sector patients have better access to generic medication.

From the summary statistics for the "public_access" variable, you can see that there are two missing observations. Here, we will keep these observations because leaving them in doesn’t affect the following analysis.

There are a number of interesting aspects to look at. We shall produce a bar chart comparing the private and public access in countries, ordered according to values of private access (largest to smallest). First, we need to reformat the data into ‘long’ format (so there is a single variable containing all the values we want to plot), then use **matplotlib** to make the chart.

In [None]:
df = df.sort_values(by="private_access")
melt_df = pd.melt(df, id_vars="country", value_name="percent", var_name="access")
melt_df.head()
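
If the difference between 'wide' and 'long' format is unclear, the following sketch shows `pd.melt` on a made-up two-country dataframe (the names `wide` and `long` and the values are hypothetical).

```python
import pandas as pd

# Made-up 'wide' data: one row per country, one column per sector
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "private_access": [80.0, 55.0],
        "public_access": [60.0, 40.0],
    }
)

# 'Long' format: one row per country-sector pair
long = pd.melt(wide, id_vars="country", var_name="access", value_name="percent")
print(long)
```

The two value columns are stacked into a single `percent` column, with `access` recording which original column each value came from.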

In [None]:
y = np.arange(len(melt_df["country"].unique()))  # the label locations
width = 0.3  # the width of the bars

fig, ax = plt.subplots(figsize=(6, 10))
ax.barh(y - width / 2, melt_df.loc[melt_df["access"] == "private_access", "percent"], width, label="Private")
ax.barh(y + width / 2, melt_df.loc[melt_df["access"] == "public_access", "percent"], width, label="Public")
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel("Country")
ax.set_yticks(y)
ax.set_yticklabels(melt_df["country"].unique())
ax.legend(frameon=False)
ax.set_xlabel("Percent of patients with access to generic medication")
ax.set_title("Access to generic medication")
ax.set_ylim(-width, len(y))
ax.set_xlim(0, 100)
plt.show()

**Figure 5.11** Access to essential medication.

Let’s find the extreme values, starting with the two countries where public sector patients have access to all (100%) essential medications (which you can also see in the chart).

In [None]:
df.loc[df["public_access"] == 100, :]

Let’s see which countries provide 0% access to essential medication for people in the public sector.

In [None]:
df.loc[df["public_access"] == 0, :]

## Python Walkthrough 5.11

**Using line and bar charts to illustrate changes in time**

*Import data and plot a line chart*

First we download [data on the gender gap in primary education](https://ourworldindata.org/educational-mobility-inequality#in-primary-education) from Our World In Data and save it in a subdirectory of our working directory called "data/". To find the data on the Our World In Data website, click on the download button under the chart. Now let's import it into our Python session and check the structure:

In [None]:
# Open the csv file from the data directory

df = pd.read_csv(Path("data/gender-gap-in-primary-education.csv"))
df.info()

The data is now in the dataframe `df`. The variable of interest, `"Primary education, pupils (% female)"`, has a very long name so we will shorten it to `"PFE"`.

In [None]:
df = df.rename(columns={"Primary education, pupils (% female)": "PFE"})

As usual, ensure that you understand the definition of the variables you are using. In the Our World in Data website, look at the ‘Sources’ tab underneath the graph for a definition:

> Female pupils as a percentage of total pupils at primary level include enrollments in public and private schools...percentage of female enrollment is calculated by dividing the total number of female students at a given level of education by the total enrollment at the same level, and multiplying by 100.

This definition implies that if the primary-school-age population was 50% male and 50% female and all children were enrolled in school, the female enrolment would be 50%.

Before choosing ten countries, we check which countries (`"Entity"`) are in the dataset using the `unique` method. Here we also use the random choice function, `np.random.choice`, from **numpy** to show a random selection of ten entries rather than the full list.

In [None]:
# replace=False avoids showing the same entity more than once
np.random.choice(df["Entity"].unique(), 10, replace=False)

You can find nearly all the countries in the world in this list (plus some sub- and supra-country entities, like OECD countries, which explains why the variable wasn’t initially called ‘Country’).

*Plot a line chart for a selection of countries*

We now make a selection of ten countries. (You can of course make a different selection, but ensure that you get the spelling right!).

In [None]:
countries_to_select = ["Albania", "China", "France", "India", "South Korea", "Switzerland", "United Arab Emirates", "United Kingdom", "Zambia", "Norway"]

df_sub = df[df["Entity"].isin(countries_to_select)]

Now we plot the data, following similar steps to earlier in the chapter, in Python Walkthrough 5.8.

In [None]:
fig, ax = plt.subplots()
for country, style in zip(countries_to_select, ["-", "-.", "dashed", ":"]*10):
    plot_df_c = df_sub.loc[df_sub["Entity"] == country]
    ax.plot(plot_df_c["Year"], plot_df_c["PFE"], label=country, linestyle=style)
ax.set_xlim(df_sub["Year"].min(), df_sub["Year"].max())
ax.set_title("Female pupils\n(% enrolment in primary education)")
ax.set_xlabel("Year")
ax.set_ylabel("Percent")
ax.legend(bbox_to_anchor=(1, 1), frameon=False)
plt.show();

**Figure 5.12** Female pupils as a percentage of total enrolment in primary education.

*Plot a column chart with sorted values*

To calculate the change in the value of this measure between 1980 and 2010 for each country chosen, we have to manipulate the data so that we have one entry (row) for each entity (or country), but two different variables for the percentage of female enrolment `"PFE"` (one for each year).

We'll do this using the `pd.pivot` function to pivot years to columns; then we can subtract one year from another before filtering to just the columns we want.

In [None]:
df_sub_piv = pd.pivot(df_sub, index=["Entity", "Code"], columns="Year", values="PFE")
# Note that existing column titles are integers
df_sub_piv["2010—1980"] = df_sub_piv[2010] - df_sub_piv[1980]
# Filter to our new column and re-number index
df_sub_piv = df_sub_piv["2010—1980"].reset_index()
# Sort rows by size of gap
df_sub_piv = df_sub_piv.sort_values(by="2010—1980")
df_sub_piv.head()
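
The pivot step can be easier to follow on a tiny example. The sketch below uses made-up data (`toy`, with hypothetical entities, codes, and values) and the column name `change` instead of the name used above.

```python
import pandas as pd

# Made-up 'long' data: two entities observed in two years
toy = pd.DataFrame(
    {
        "Entity": ["A", "A", "B", "B"],
        "Code": ["AAA", "AAA", "BBB", "BBB"],
        "Year": [1980, 2010, 1980, 2010],
        "PFE": [40.0, 48.0, 49.0, 49.5],
    }
)

# One row per entity, one column per year
wide = toy.pivot(index=["Entity", "Code"], columns="Year", values="PFE")

# Change in percentage points between the two years
wide["change"] = wide[2010] - wide[1980]
print(wide.reset_index())
```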

Now we can plot this as a bar chart by country.

In [None]:
fig, ax = plt.subplots()
ax.bar(df_sub_piv["Code"], df_sub_piv["2010—1980"], 1, alpha=0.8)
ax.set_xlabel("Country code")
ax.set_ylabel("Percentage points")
ax.set_title("Change in female pupils’ share of total enrolment in\nprimary education", size=12)
ax.annotate('Source: https://ourworldindata.org/educational-mobility-inequality#in-primary-education', (0, 0), (-20, -40), fontsize=6, 
             xycoords='axes fraction', textcoords='offset points', va='top', ha="left")
plt.show()

**Figure 5.13** Change in percentage of female enrolment in primary school from 1980 to 2010.

It is apparent that some countries saw very little or no change (the countries that already had very high PFE). The countries with initially low female participation have significantly improved.