# Empirical Project 12

---
**Download the code**

To download the code used in this project as a notebook that can be run in Visual Studio Code, Google Colab, or Jupyter Notebook, right click [here]() and select 'Save Link As', then save it as a `.ipynb` file.

Don’t forget to also download the data into your working directory by following the steps in this project.

---

## Getting started in Python

For this project, you will need the following packages:

- **pandas**
- **pingouin**
- **matplotlib**
- **seaborn**
- **numpy**

You'll also be using the **warnings** and **pathlib** packages, but these come built-in with Python.

Remember, you can install packages in Visual Studio Code's integrated terminal (click "View > Terminal") by running `conda install packagename` (if using the Anaconda distribution of Python) or `pip install packagename` if not.

Once you have the Python packages installed, you will need to import them into your Python session and configure any other initial settings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings
import matplotlib_inline.backend_inline

# Set the plot style for prettier charts:
plt.style.use("plot_style.txt")
# Make output charts in 'svg' format
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## Python Walkthrough 12.1

**Importing a specified range of data from a spreadsheet**

We start by importing the data; you will need to download it and put it in a subfolder of your working directory called "data". The data provided in the Excel spreadsheet is not in the usual format of one variable per column (known as ‘long’ format). Instead, the first tab contains two separate tables, and we need the second one. We can therefore use the `skiprows=` keyword argument of **pandas**' `pd.read_excel` function to skip the rows above the second table, so that the import starts at the row containing the year headers.

In [None]:
income = pd.read_excel(
    Path("data/doing-economics-project-12-datafile.xlsx"), skiprows=12
)

income
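To see what `skiprows=` does in isolation, here is a minimal sketch on made-up data. It uses `pd.read_csv` on an in-memory text file as a stand-in for the spreadsheet (an assumption made purely so the example runs without the Excel file); the file contents and the `demo` name are hypothetical.

```python
import io

import pandas as pd

# Hypothetical file: three unwanted lines, then the table we actually want
raw = io.StringIO(
    "Sheet title\n"
    "First table (not needed)\n"
    "x,1,2\n"
    ",2009,2010\n"
    "15th,100,110\n"
    "25th,200,220\n"
)

# Skip the first 3 lines so that the 4th line is used as the header row
demo = pd.read_csv(raw, skiprows=3)

# The header cell above the percentile labels is empty, so pandas names
# that column "Unnamed: 0" (CSV year headers are read as strings)
print(demo.columns.tolist())
print(demo)
```

As in the real import, the column with no header text is given a placeholder name, which is why the `income` table has a column called `"Unnamed: 0"`.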

The imported `income` table still contains some rows that we don't want, so we need to get rid of them. There are multiple ways to do this depending on the context.

In this case, the unwanted rows contain missing values, so we can use the `.dropna()` method with `axis=0` to drop every row that has a missing entry. While we're at it, let's rename the first column to something more sensible.

In [None]:
income = income.dropna(axis=0).rename(columns={"Unnamed: 0": "Percentile"})
income
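As a small illustration of this cleaning step, the sketch below applies the same two methods to a made-up table. The `toy` data frame and its values are hypothetical and only mimic the shape of the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical table: rows 1 and 3 contain missing values
toy = pd.DataFrame(
    {
        "Unnamed: 0": ["15th", np.nan, "25th", "note"],
        2009: [100.0, np.nan, 200.0, np.nan],
        2010: [110.0, np.nan, 220.0, np.nan],
    }
)

# axis=0 drops rows (not columns) that contain at least one missing value
cleaned = toy.dropna(axis=0).rename(columns={"Unnamed: 0": "Percentile"})
print(cleaned)
```

Note that the surviving rows keep their original index labels (0 and 2 here), so the index is no longer consecutive after dropping rows.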

It would be good to have a quick plot of this data. To use the **seaborn** plotting package, we first need to reshape the data, as **seaborn** expects long-format data.

In [None]:
pd.melt(income, id_vars="Percentile", value_vars=range(2009, 2016), var_name="Year")
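The reshaping is easier to follow on a tiny example. The sketch below melts a made-up two-row table; the `wide` and `long` names and the values are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame(
    {"Percentile": ["15th", "25th"], 2009: [100, 200], 2010: [110, 220]}
)

# Each (Percentile, year) cell becomes its own row in the long table
long = pd.melt(wide, id_vars="Percentile", value_vars=[2009, 2010], var_name="Year")
print(long)
```

The result has one row per percentile and year, with the former column names in `Year` and the cell contents in a column that **pandas** calls `value` by default.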

In [None]:
import seaborn.objects as so

(
    so.Plot(
        pd.melt(
            income, id_vars="Percentile", value_vars=range(2009, 2016), var_name="Year"
        ),
        x="Year",
        y="value",
        color="Percentile",
    )
    .add(so.Line(linewidth=3))
    .label(y="Monthly real household income (HKD)")
)

**Figure 12.1 Monthly real household income over time.**

However, plotting all percentile groups using the same scale hides a lot of variation within each percentile group. So, we should create a separate chart for each percentile group. As an example, we will plot the chart for the 15th percentile.

First, we use the `pd.melt` function to reshape the data into the format that **seaborn** uses to plot charts. Then we use the `.loc` indexer with a boolean condition to select data for the 15th percentile only. Finally, we can make the line chart as before.

In [None]:
income_long = pd.melt(
    income, id_vars="Percentile", value_vars=range(2009, 2016), var_name="Year"
)

perct_to_use = "15th"

(
    so.Plot(
        income_long.loc[income_long["Percentile"] == perct_to_use, :],
        x="Year",
        y="value",
    )
    .add(so.Line())
    .add(so.Dot())
    .label(y=f"Monthly real household income for {perct_to_use} percentile")
)

**Figure 12.2 Monthly real household income for 15th percentile.**
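The row selection used for this chart can be sketched on made-up data as follows. The `toy_long` table is hypothetical and stands in for `income_long`.

```python
import pandas as pd

# Hypothetical long-format table
toy_long = pd.DataFrame(
    {
        "Percentile": ["15th", "25th", "15th", "25th"],
        "Year": [2009, 2009, 2010, 2010],
        "value": [100, 200, 110, 220],
    }
)

# The comparison gives a True/False value per row; .loc keeps the True rows
mask = toy_long["Percentile"] == "15th"
subset = toy_long.loc[mask, :]
print(subset)
```

Changing the string compared against (`perct_to_use` in the walkthrough) is all that is needed to produce the chart for a different percentile group.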

## Python Walkthrough 12.2

**Calculating cumulative income shares and plotting a Lorenz curve**

Questions 2(a)–(c) can be completed in one stage by method chaining, although there are a number of steps.

First, we use the `.loc` indexer to get the relevant variables. Then, for the next step in the chain, we remove the ‘th’ suffix in the percentile column values using `.str.split("th", expand=True)[0]`, which splits all the strings, expands the resulting lists into two columns and, with `[0]`, takes only the first part (before the "th"). This is cast to an integer type.
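The suffix-stripping step can be tried on its own. This is a minimal sketch using a made-up series of labels (`labels`, `parts` and `numbers` are hypothetical names).

```python
import pandas as pd

# Hypothetical percentile labels
labels = pd.Series(["15th", "25th", "50th"])

# Splitting on "th" gives two pieces per label: the number and an empty string
parts = labels.str.split("th", expand=True)

# Column 0 holds the text before "th"; convert it from string to integer
numbers = parts[0].astype("int")
print(numbers.tolist())
```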

In the next step, we add a row of three zeros (a zeroth percentile with zero entries for 2011 and 2012), and then sort by percentile.

We then do a number of operations on columns using `.assign`:

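- adding a `households` column representing the number of households in each percentile group (assuming 100 households in the economy)
- creating a new variable called `handout_2012` that adds $6,000 to each value for the year 2012
- calculating the economy-wide income for each percentile group by multiplying the income values (including the handout for 2012) by the number of households, and storing these in the variables `income_2011` and `income_2012`
- calculating the normalised cumulative income for each group as a percentage of total income, stored in `rel_share_2011` and `rel_share_2012`
- updating `Percentile` to the upper bound of each group by adding the group's number of households to it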

Finally, we tidy up the data by ensuring that the 0th percentile has no share of the income and the 100th percentile has 100% of the cumulative income. We do this by keeping only the percentile and cumulative share columns, adding a row of zeros, and sorting by percentile again so that this row comes first.

In [None]:
cols_of_interest = ["Percentile", 2011, 2012]
share_cols = ["Percentile", "rel_share_2011", "rel_share_2012"]

# Note: DataFrame.append was removed in pandas 2.0, so rows are added
# with pd.concat (called inside .pipe to keep the method chain intact)
income_shares = (
    income.loc[:, cols_of_interest]
    # Strip the "th" suffix and convert the percentiles to integers
    .assign(
        Percentile=lambda x: x["Percentile"]
        .str.split("th", expand=True)[0]
        .astype("int")
    )
    # Add a zeroth percentile row, then put the rows in percentile order
    .pipe(
        lambda df: pd.concat(
            [df, pd.DataFrame([[0, 0, 0]], columns=cols_of_interest, index=[5])]
        )
    )
    .sort_values(by="Percentile")
    .assign(
        households=[15, 10, 25, 25, 10, 15],
        handout_2012=lambda x: x[2012] + 6000,
        income_2011=lambda x: x[2011] * x["households"],
        income_2012=lambda x: x["handout_2012"] * x["households"],
        rel_share_2011=lambda x: 100
        * x["income_2011"].cumsum()
        / x["income_2011"].sum(),
        rel_share_2012=lambda x: 100
        * x["income_2012"].cumsum()
        / x["income_2012"].sum(),
        Percentile=lambda x: x["households"] + x["Percentile"],
    )
    .loc[:, share_cols]
    # Add the origin (0% of households, 0% of income) and sort again
    .pipe(
        lambda df: pd.concat(
            [df, pd.DataFrame([[0, 0, 0]], columns=share_cols, index=[6])]
        )
    )
    .sort_values(by="Percentile")
)

income_shares.round(2)
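To check the logic of the cumulative share calculation, it can help to work through a case small enough to verify by hand. The sketch below uses a made-up economy with two equally sized groups; all names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical economy: 50 households earning 1,000 each, 50 earning 3,000 each
toy = pd.DataFrame({"households": [50, 50], "income_per_household": [1000, 3000]})

toy = toy.assign(
    # Total income of each group: 50,000 and 150,000
    group_income=lambda x: x["income_per_household"] * x["households"],
    # Running total of group income as a percentage of all income (200,000)
    cum_share=lambda x: 100 * x["group_income"].cumsum() / x["group_income"].sum(),
)
print(toy)
```

Here the poorer half of households receives 50,000 out of 200,000, so the cumulative share is 25% after the first group and 100% after the second. The last entry is always 100 because the final running total equals the overall sum.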

Using the data from Questions 2(a)–(c) we can plot the Lorenz curve using the **seaborn** package. Note that we use the `Percentile` variable to draw the line of perfect equality.

We will need the data in long format for this, so we first reshape it with `pd.melt`.

In [None]:
fig, ax = plt.subplots()

long_incomes_shares = pd.melt(
    income_shares,
    id_vars="Percentile",
    var_name="year",
    value_name="Cumulative share of income, %",
)

p1 = (
    so.Plot(
        long_incomes_shares,
        x="Percentile",
        y="Cumulative share of income, %",
        color="year",
    )
    .add(so.Line())
    .add(so.Dot())
    .label(x="Cumulative share of population")
)

ax.plot(
    long_incomes_shares["Percentile"],
    long_incomes_shares["Percentile"],
    linewidth=1.5,
    alpha=0.5,
    linestyle=":",
    color="grey",
)
p1.on(ax).show()