# Empirical Project 5

## Python-specific learning objectives

In addition to the learning objectives for this project, in Part 5.1 you will learn how to use loops to repeat specified tasks for a list of values.

## Getting started in Python

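You will need the following packages, which are imported in Python Walkthrough 5.1: **pandas**, **matplotlib**, **numpy**, **plotnine**, and **seaborn**. Reading `.xlsx` files with **pandas** typically also requires the **openpyxl** package to be installed.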


- Go to the Globalinc website and download the Excel file containing the data by clicking ‘xlsx’.
- Save it in a subfolder of the directory you are coding in, so that the relative path is `data/GCIPrawdata.xlsx`.
- Import the data into Python as explained in Python Walkthrough 5.1.

## Python Walkthrough 5.1

**Importing the Excel file (`.xlsx` or `.xls`) into Python**

First, we need to import the packages and settings we'll be using:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from plotnine import *
import seaborn as sns
from pathlib import Path
import warnings

# Set the plot style for prettier charts:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150
theme_set(theme_seaborn)

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

As we are importing an Excel file, we use the `pd.read_excel` function from the **pandas** package. The file is called "GCIPrawdata.xlsx". Before you import the file into Python, open the datafile in Excel to understand its structure. You will see that the data is all in one worksheet (which is convenient), and that the headings for the variables are in the third row. Hence we will use the `skiprows=2` option in the `pd.read_excel` function to skip the first two rows.
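To see what `skiprows` does without needing the Excel file, here is a minimal sketch. It uses `pd.read_csv` as a stand-in (it accepts the same `skiprows` option and can read from an in-memory string), and the file contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: two title rows, then the header row, then data
raw = io.StringIO(
    "A title for the spreadsheet\n"
    "A note about the data\n"
    "Country,Year,Mean Income\n"
    "Country A,1980,1000\n"
    "Country A,1981,1100\n"
)

# Skip the first two rows so that the third row is used for the column names
toy = pd.read_csv(raw, skiprows=2)
print(toy.columns.tolist())
```

Without `skiprows=2`, the title row would be treated as the header and the real variable names would end up in the data.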

Now let's import the data, using the `Path` class from the `pathlib` module to create the path to the data, and look at the first few rows with `head()`:

In [None]:
df = pd.read_excel(Path("data/GCIPrawdata.xlsx"), skiprows=2)
df.head()

The data is now in a pandas dataframe, which is the primary object for data analysis in Python. You can always tell the type of object you are dealing with in Python by running `type` on it:

In [None]:
type(df)

In the data, each row represents a different country-year combination. The first row is for Afghanistan in 1980, and the first value (in the third column) is 206, for the variable Decile 1 Income. This value indicates that the mean annual income of the poorest 10% in Afghanistan was the equivalent of 206 USD (in 1980, adjusted using purchasing power parity). Looking at the next column, you can see that the mean income of the next richest 10% (those in the 11th to 20th percentiles for income) was 350.

To see the list of variables, we use the `df.info()` method.

In [None]:
df.info()

In addition to the country, year, and the ten income deciles, we have mean income and the population.

## Python Walkthrough 5.2

**Calculating cumulative shares using the `cumsum` function**

Before we calculate cumulative income shares, we need to calculate the total income for each country-year combination using the mean income and the population size.

In [None]:
df["total_income"] = df["Mean Income"] * df["Population"]

For this example, we have chosen China (a country that recently underwent enormous economic changes) and the US (a developed country), in 1980 and 2014. We use the `.loc` indexer together with `isin` to create a new dataframe (called `xf`) containing only the countries and years we need.

In [None]:
# Create lists for the years and countries we'd like
sel_year = [1980, 2014]
sel_country = ["United States", "China"]

xf = df.loc[(df["Year"].isin(sel_year)) & (df["Country"].isin(sel_country)), :]
xf

These numbers are very large, so for our purpose it is easier to assume that there is only one person in each decile, in other words the total income is 10 times the mean income. This simplification works because, by definition, each decile has exactly the same number of people (10% of the population).
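The following sketch checks this simplification numerically. The decile incomes and population are hypothetical values chosen for illustration, not figures from the dataset.

```python
import numpy as np

# Hypothetical decile mean incomes and population, for illustration only
decile_means = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1500])
population = 1_000_000

# Full calculation: each decile contains population / 10 people
decile_totals = decile_means * population / 10
shares_full = decile_totals / decile_totals.sum()

# Simplified calculation: pretend there is one person per decile
shares_simple = decile_means / decile_means.sum()

# The income shares are identical either way
print(np.allclose(shares_full, shares_simple))

# And total income with one person per decile is 10 times the mean income
print(np.isclose(10 * decile_means.mean(), decile_means.sum()))
```

The population size cancels out when we take shares, which is why we can ignore it.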

We will be using the very useful `cumsum` function (short for ‘cumulative sum’) to calculate the cumulative income. To see what this function does, look at this simple example.

In [None]:
test_series = pd.Series([2, 4, 10, 22])
test_series.cumsum()

You can see that each number in the output is the sum of the corresponding number in the input and all the numbers before it. For example, we got the third number, 16, by adding 2, 4, and 10. We now apply this method to calculate the cumulative income shares for China (1980) and save them as `cum_inc_share_c80`.

In [None]:
query = (xf["Year"] == 1980) & (xf["Country"] == "China")
decs_c80 = xf.loc[query, [x for x in xf.columns if "Decile" in x]]
# Give the total income, assuming a population of 10
total_inc = 10*xf.loc[query, "Mean Income"]
# Sum across the columns (axis=1), as the deciles sit in a single row
cum_inc_share_c80 = decs_c80.cumsum(axis=1) / total_inc.values[0]
cum_inc_share_c80

Although this showed clearly what we were doing for China in 1980, what if we want to do it for all year-country combinations? We can do that by defining a function:

In [None]:
def create_cumulative_income_shares(data, year, country):
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    # Give the total income, assuming a population of 10
    total_inc = 10*data.loc[query, "Mean Income"]
    cum_inc_share = decs.cumsum(axis=1) / total_inc.values[0]
    cum_inc_share.index = [country + ", " + str(year)]
    cum_inc_share.columns = range(1, len(cum_inc_share.columns)+1)
    return cum_inc_share
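
The `axis=1` argument in the function above matters because the deciles for one country-year sit in a single row. This minimal sketch, using a hypothetical one-row dataframe, shows the difference between the default direction and `axis=1`.

```python
import pandas as pd

# Hypothetical one-row dataframe, laid out like a single country-year
row = pd.DataFrame({"Decile 1": [2], "Decile 2": [4], "Decile 3": [10]})

# Default (axis=0): accumulates down each column, so one row is unchanged
down_rows = row.cumsum()

# axis=1: accumulates across the columns, which is what we want here
across_cols = row.cumsum(axis=1)

print(down_rows.values.tolist())    # [[2, 4, 10]]
print(across_cols.values.tolist())  # [[2, 6, 16]]
```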


Now we need to pass in all combinations of countries and years (this could be automated too, but it would only be worth it for many combinations so we'll just enter the different combinations manually):

In [None]:
cum_inc_share_c14 = create_cumulative_income_shares(xf, 2014, "China")
cum_inc_share_us80 = create_cumulative_income_shares(xf, 1980, "United States")
cum_inc_share_us14 = create_cumulative_income_shares(xf, 2014, "United States")
cum_inc_share_c80 = create_cumulative_income_shares(xf, 1980, "China")
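
If you did want to automate this step, one possible approach is to loop over every country-year pair. The sketch below is only an illustration: it repeats the function so that it runs on its own, and it uses a small made-up dataframe (`toy`) with the same column layout as the real data rather than the GCIP file.

```python
from itertools import product

import numpy as np
import pandas as pd


def create_cumulative_income_shares(data, year, country):
    # Same function as in the walkthrough, repeated so this sketch runs alone
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    total_inc = 10 * data.loc[query, "Mean Income"]
    cum_inc_share = decs.cumsum(axis=1) / total_inc.values[0]
    cum_inc_share.index = [country + ", " + str(year)]
    cum_inc_share.columns = range(1, len(cum_inc_share.columns) + 1)
    return cum_inc_share


# Hypothetical countries, years, and decile incomes, for illustration only
countries = ["Country A", "Country B"]
years = [1980, 2014]
rows = []
for i, (country, year) in enumerate(product(countries, years)):
    deciles = np.arange(1, 11) ** (i + 1)  # made-up decile incomes
    row = {"Country": country, "Year": year, "Mean Income": deciles.mean()}
    row.update({f"Decile {d} Income": deciles[d - 1] for d in range(1, 11)})
    rows.append(row)
toy = pd.DataFrame(rows)

# Loop over every country-year combination and collect the results
results = []
for country, year in product(countries, years):
    results.append(create_cumulative_income_shares(toy, year, country))

# Stack the one-row dataframes into a single table
all_shares = pd.concat(results)
print(all_shares.round(2))
```

Each row of `all_shares` is one country-year combination, and the final column is always 1 because the top decile completes the total.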

## Python Walkthrough 5.3

**Drawing Lorenz curves**

Let us plot the cumulative income shares for China (1980), which we previously stored in the variable `cum_inc_share_c80`. We'll use the standard `fig, ax = plt.subplots` method of constructing an axis to plot the data on.

In [None]:
fig, ax = plt.subplots()
cum_inc_share_c80.T.plot(ax=ax)
# Add the line of perfect equality (decile number divided by 10)
ax.plot(cum_inc_share_c80.columns, cum_inc_share_c80.columns / 10, color="k")
ax.set_ylim(0, 1)
ax.set_xlim(1, 10)
ax.legend([])
ax.set_title("Lorenz curve, China, 1980")
ax.set_xlabel("Income Decile")
ax.set_ylabel("Cumulative income share")
plt.show();

**Figure 5.1** Lorenz curve, China, 1980.

The blue line is the Lorenz curve and the black line is the line of perfect equality. The Gini coefficient is the ratio of the area between the two lines to the total area under the black line. We will calculate the Gini coefficient in Python Walkthrough 5.4.
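
To make the area-ratio definition concrete, here is a rough sketch that approximates the areas with the trapezoid rule. The helper `gini_from_cumulative_shares` and the income figures are hypothetical, the curve is assumed to start at the origin, and this is not necessarily the method used in Python Walkthrough 5.4.

```python
import numpy as np


def gini_from_cumulative_shares(cum_shares):
    """Approximate the Gini coefficient from cumulative income shares.

    Hypothetical helper: assumes equally sized groups (such as deciles).
    """
    # Prepend the origin so the Lorenz curve runs from (0, 0) to (1, 1)
    lorenz = np.concatenate([[0.0], np.asarray(cum_shares, dtype=float)])
    width = 1 / (len(lorenz) - 1)
    # Trapezoid rule for the area under the Lorenz curve
    area_under_lorenz = np.sum((lorenz[:-1] + lorenz[1:]) / 2) * width
    # The area under the line of perfect equality is a triangle of area 0.5
    area_under_equality = 0.5
    return (area_under_equality - area_under_lorenz) / area_under_equality


# Perfect equality: every decile has 10% of income
equal = np.arange(1, 11) / 10

# Made-up unequal distribution: the top decile has 55% of income
unequal = np.cumsum([1, 2, 3, 4, 5, 6, 7, 8, 9, 55]) / 100

print(round(gini_from_cumulative_shares(equal), 3))
print(round(gini_from_cumulative_shares(unequal), 3))
```

A Lorenz curve that lies on the line of perfect equality gives a value of 0, and the value rises towards 1 as the curve bows further away from that line.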

Now we add the other Lorenz curves to the chart. We loop over the four sets of cumulative income shares and use `zip` to pair each one with its own line style, which we pass to the `linestyle` option so that the lines are easy to tell apart. Each line gets a different colour automatically, and the labels we set as the index in `create_cumulative_income_shares` appear in the chart legend.

In [None]:
fig, ax = plt.subplots()
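curves = [cum_inc_share_c80, cum_inc_share_us80, cum_inc_share_us14, cum_inc_share_c14]
styles = ["-", "-.", "dashed", ":"]
# Pair each set of cumulative shares with its own line style
for line, style in zip(curves, styles):
    line.T.plot(ax=ax, linestyle=style)
# Add the line of perfect equality (decile number divided by 10)
ax.plot(cum_inc_share_c80.columns, cum_inc_share_c80.columns / 10, color="k")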
ax.set_ylim(0, 1)
ax.set_xlim(1, 10)
ax.set_title("Lorenz curves, China and the US (1980 and 2014)")
ax.set_xlabel("Income Decile")
ax.set_ylabel("Cumulative income share")
plt.show();

**Figure 5.2** Lorenz curves, China and the US (1980 and 2014).

As the chart shows, the income distribution changed more clearly for China between 1980 and 2014 than it did for the US over the same period (use the legend to identify each line).