# Empirical Project 12

## Getting Started in Python

Head to the "Getting Started in Python" page for help and advice on setting up a Python session to work with. Remember, you can run any page from this book as a *notebook* by downloading the relevant file from this [repository](https://github.com/aeturrell/core_python) and running it on your own computer. Alternatively, you can run pages online in your browser over at [Binder](https://mybinder.org/v2/gh/aeturrell/core_python/HEAD).

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html(no_js=True)

## Python Walkthrough 12.1

**Importing a specified range of data from a spreadsheet**

We start by importing the data; you will need to download it and put it in a subfolder of your working directory called "data". The data provided in the Excel spreadsheet is not in the usual format of one variable per column (known as ‘long’ format). Instead the first tab contains two separate tables, and we need the second table. We can therefore use the `skiprows=` keyword argument of **pandas**' `pd.read_excel` function to skip the rows above the second table, so that the import starts at its header row (note that variable headers for the years are included).

In [None]:
income = pd.read_excel(
    Path("data/doing-economics-project-12-datafile.xlsx"), skiprows=12
)

income

This still has some rows that we don't want in, so we need to get rid of them. There are multiple ways to do this depending on the context.

In this case, the two rows we don't want have a large number of invalid entries in, so we can make use of the row-wise `.dropna()` method to clean up the data. While we're at it, let's rename the first column to something more sensible.

In [None]:
income = income.dropna(axis=0).rename(columns={"Unnamed: 0": "Percentile"})
income
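
As a minimal sketch of what row-wise `.dropna()` and `.rename()` do, here is the same pattern applied to a small made-up table. The name `toy` and all of the values are invented for illustration and are not taken from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imported spreadsheet: one stray row of NaNs
toy = pd.DataFrame(
    {
        "Unnamed: 0": ["15th", np.nan, "25th"],
        2011: [4000.0, np.nan, 7000.0],
        2012: [4200.0, np.nan, 7300.0],
    }
)

# axis=0 drops any *row* containing a missing value; rename fixes the header
toy_clean = toy.dropna(axis=0).rename(columns={"Unnamed: 0": "Percentile"})
print(toy_clean)
```

Note that the original index labels of the surviving rows are kept, so there can be gaps in the index after dropping rows.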

It would be good to have a quick plot of this data. We'll need to re-orient if we want to make use of the **lets-plot** plotting package (as this expects long-format data). We'll use `pd.melt` to do this.

In [None]:
income_melted = pd.melt(
    income, id_vars="Percentile", value_vars=range(2009, 2016), var_name="Year"
)
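
If the effect of `pd.melt` is hard to picture, the following sketch may help. It reshapes a tiny made-up wide table (the name `wide` and its values are assumptions for illustration) so that each percentile-year pair gets its own row.

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame(
    {"Percentile": ["15th", "25th"], 2009: [10, 20], 2010: [11, 21]}
)

# Each (Percentile, year) pair becomes its own row; the numbers land in "value"
long = pd.melt(wide, id_vars="Percentile", value_vars=[2009, 2010], var_name="Year")
print(long)
```

The two year columns have been stacked into a single `Year` column, which is the shape that plotting libraries based on the grammar of graphics generally expect.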

In [None]:
(
    ggplot(income_melted, aes(x=as_discrete("Year"), y="value", color="Percentile"))
    + geom_line(size=2)
    + labs(y="Monthly real household income (HKD)")
    + scale_x_continuous(format="d")
)

**Figure 12.1 Monthly real household income over time.**

However, plotting all percentile groups using the same scale hides a lot of variation within each percentile group. So, we should create a separate chart for each percentile group. As an example, we will plot the chart for the 15th percentile.

We have already used the `pd.melt` function to reshape the data into the long format that **lets-plot** uses to plot charts. So we use `.loc` to select the data for the 15th percentile only, and then make the line chart as before.

In [None]:
perct_to_use = "15th"

(
    ggplot(
        income_melted.loc[income_melted["Percentile"] == perct_to_use, :],
        aes(x=as_discrete("Year"), y="value"),
    )
    + geom_line(size=1)
    + geom_point(size=3)
    + labs(
        title=f"Monthly real household income for {perct_to_use} percentile",
        y="Household income",
    )
    + scale_x_continuous(format="d")
)

**Figure 12.2 Monthly real household income for 15th percentile.**

## Python Walkthrough 12.2

**Calculating cumulative income shares and plotting a Lorenz curve**

Questions 2(a)–(c) can be completed in a few steps. To do them, though, we're going to make use of a bunch of techniques.

First, we use `.loc` to get the relevant variables. Remember that the syntax is `.loc[rows, columns]`, so we can create a list of just the columns we want to work with and use `:` to get all rows. Then we're going to use the `assign` method. This allows us to change dataframes *in-line*, i.e. without having to disrupt the flow and within a *chained set of methods*. As it's an important point, let's take a quick look at this idea in more detail.

Say we had a dataframe, `df`, with a column `number` that just had the numbers from 1 to 10. Imagine we wanted to create a second column called `number_add_one`, which has the numbers 2 to 11. There are two obvious ways to do this using `df`. The first is to explicitly create the new column:

```python
df["number_add_one"] = df["number"] + 1
```

The second is to use `assign`:

```python
df = (
    df.assign(
        number_add_one=lambda x: x["number"] + 1
    )
)
```

Neither of these ways is inherently better than the other; they're just different. However, if you're doing lots of steps in a row, it can make the code more readable to have multiple `assign`ment statements. As a reminder, `lambda` functions are functions that you don't have to give a name to. They work by defining a variable, by convention `x`, but you could call it anything. In the case of a dataframe assign statement, writing `lambda x:` means that `x` stands in for the dataframe itself, so that `lambda x: x["number"]` returns the "number" column.
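
One property of `assign` that is worth seeing in action is that, within a single call, a later keyword can use a column created by an earlier one. The sketch below assumes the same kind of `df` as in the example above; the column name `number_doubled` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"number": range(1, 11)})

# Each lambda receives the dataframe as it stands at that point, so a later
# keyword can refer to a column created by an earlier one in the same call
df = df.assign(
    number_add_one=lambda x: x["number"] + 1,
    number_doubled=lambda x: 2 * x["number_add_one"],
)
print(df.head(3))
```

This is what makes it possible to build up several dependent columns in one chained step.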

Back to the task at hand, we're first going to remove the ‘th’ suffix in the percentile column values using `.str.split("th", expand=True)[0]` to split all the strings, expand the resulting list into two columns and, with `[0]`, only take the first part (before the "th"). This is cast to an integer type.

In [None]:
cols_of_interest = ["Percentile", 2011, 2012]

percentiles = income.loc[:, cols_of_interest].assign(
    Percentile=lambda x: x["Percentile"].str.split("th", expand=True)[0].astype("int")
)
percentiles
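
To see the string-splitting step in isolation, here is a small sketch on a made-up series of labels (the names `labels`, `parts` and `numbers` are illustrative only).

```python
import pandas as pd

labels = pd.Series(["15th", "25th", "50th"])

# expand=True returns a dataframe: column 0 holds the text before "th",
# column 1 the (empty) text after it
parts = labels.str.split("th", expand=True)
numbers = parts[0].astype("int")
print(numbers.tolist())
```

Selecting column `0` and casting it leaves just the numeric part of each label.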

In the next step, we're going to concatenate a dataframe with three zeros to form the zeroth percentile entries for 2011 and 2012: we need this to ensure our data go through 0.

In [None]:
percentiles = pd.concat(
    [
        percentiles,
        pd.DataFrame([[0, 0, 0]], columns=cols_of_interest, index=[5], dtype="float"),
    ],
    axis=0,
).sort_values(by="Percentile")
percentiles

Next up, we're going to do a number of operations on columns using `.assign`, including:

- adding a column representing the number of households in each percentile group (assuming 100 households in the economy)
- creating a new variable called `handout_2012` that adds $6,000 to each value for the year 2012
- adding the economy-wide income for each percentile group, derived from multiplying the income values by the number of households and storing these in the variables `income_2011` and `income_2012`
- creating the normalised cumulative income for each group

In [None]:
percentiles = percentiles.assign(
    households=[15, 10, 25, 25, 10, 15],
    handout_2012=lambda x: x[2012] + 6000,
    income_2011=lambda x: x[2011] * x["households"],
    income_2012=lambda x: x["handout_2012"] * x["households"],
    rel_share_2011=lambda x: 100 * x["income_2011"].cumsum() / x["income_2011"].sum(),
    rel_share_2012=lambda x: 100 * x["income_2012"].cumsum() / x["income_2012"].sum(),
    Percentile=lambda x: x["households"] + x["Percentile"],
).loc[:, ["Percentile", "rel_share_2011", "rel_share_2012"]]
percentiles
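
The cumulative share calculation can be checked by hand on a much smaller example. The sketch below uses an invented three-group economy (the name `toy`, its columns and its numbers are all assumptions) and applies the same `cumsum() / sum()` pattern.

```python
import pandas as pd

# Hypothetical three-group economy (numbers invented for illustration)
toy = pd.DataFrame({"income": [0.0, 100.0, 300.0], "households": [2, 2, 1]})

toy = toy.assign(
    # total income earned by each group
    group_income=lambda x: x["income"] * x["households"],
    # running total of group income as a percentage of the overall total
    cum_share=lambda x: 100 * x["group_income"].cumsum() / x["group_income"].sum(),
)
print(toy)
```

Here the groups earn 0, 200 and 300 in total, so the cumulative shares are 0%, 40% and 100%: the last value is always 100 by construction.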

Finally, we tidy up the data by ensuring that the 0th percentile has no share of the income and the 100th percentile has 100% of the cumulative income. We do this by concatenating a one-row dataframe of zeros to `percentiles`, and then we sort by the values again.

In [None]:
income_shares = pd.concat(
    [
        percentiles,
        pd.DataFrame(
            [[0, 0, 0]],
            columns=["Percentile", "rel_share_2011", "rel_share_2012"],
            index=[6],
        ),
    ],
    axis=0,
).sort_values(by="Percentile")

income_shares.round(2)

Using the data from Questions 2(a)–(c) we can plot the Lorenz curve using the **lets-plot** package. Note that we use `geom_abline` with a slope of 1 to draw the line of perfect equality.

We will need to have data in long format for this, so we begin by doing a melt:

In [None]:
long_incomes_shares = pd.melt(
    income_shares,
    id_vars="Percentile",
    var_name="year",
    value_name="Cumulative share of income, %",
)

In [None]:
(
    ggplot(
        long_incomes_shares,
        aes(x="Percentile", y="Cumulative share of income, %", color="year"),
    )
    + geom_abline(slope=1, color="black", linetype=2, alpha=0.7)
    + geom_line(size=1)
    + geom_point(size=3)
    + labs(x="Cumulative share of population")
)

## Python Walkthrough 12.3

**Generating Gini coefficients**

### Create table containing percentiles

To create the 2011, 2012, and 2013 percentiles for every percentile in our 100 household economy we need to take the income for each percentile group and expand that for every household in the respective percentile group. For example, there are 15 households in the bottom percentile group having zero income for 2011 and 2013, and $6,000 in 2012. For the 15th percentile group there are 15 households that will share the same income value, and so on for the other percentile groups.

To achieve this expansion, there are a few steps. First, we're going to grab the years we're interested in and add in a row of zeros (for the zero percentiles). Note that the values in the dataframe are incomes.

In [None]:
years_of_interest = [2011, 2012, 2013]
raw_percentiles = income.loc[:, years_of_interest].sort_values(2011)
raw_percentiles = pd.concat(
    [
        pd.DataFrame(
            [[0, 0, 0]],
            columns=years_of_interest,
            index=[6],
        ),
        raw_percentiles,
    ],
).reset_index(drop=True)
raw_percentiles

Next, we create a list of the numbers of households in each group (`households = [15, 10, 25, 25, 10, 15]`) that will be matched with the six different percentiles we have available in the `raw_percentiles` dataframe. Then we do a few tricks with `for` loops:

- an inner `for` loop goes over each number in the households list above (the variable is called `num_hhds`) and takes the corresponding income value (`num_income`) and repeats `num_income` `num_hhds` times. This is achieved using `np.repeat` from the **numpy** package. Each one of these is concatenated until we have a dataframe with an income value for all 100 percentiles for a specific year
- an outer `for` loop that goes over the years

All of the results from both loops are combined together. We do a couple of other useful things too:

- `hh_percentiles_by_year.index % 100 + 1` creates the percentiles from 1 to 100. The `%` symbol represents the modulo operator, which gives the remainder after division.
- for 2012, we add in the extra $6,000 payment

Note that **pandas** uses `0` as its default column name, which is why it appears below.

In [None]:
households = [15, 10, 25, 25, 10, 15]

hh_percentiles = pd.DataFrame()

for year in raw_percentiles.columns:
    hh_percentiles_by_year = pd.DataFrame()
    for i, num_hhds in enumerate(households):
        num_income = raw_percentiles.loc[i, year]
        hh_percentiles_by_year = pd.concat(
            [hh_percentiles_by_year, pd.DataFrame(np.repeat(num_income, num_hhds))],
            axis=0,
        )
        hh_percentiles_by_year["year"] = year

    hh_percentiles_by_year = hh_percentiles_by_year.reset_index(drop=True)
    hh_percentiles_by_year["percentile"] = hh_percentiles_by_year.index % 100 + 1
    if year == 2012:
        hh_percentiles_by_year[0] = hh_percentiles_by_year[0] + 6000

    hh_percentiles = pd.concat([hh_percentiles, hh_percentiles_by_year], axis=0)

hh_percentiles.tail()
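
The expansion step can also be sketched without the inner loop, because `np.repeat` accepts a list of repeat counts as well as a single number. This is only an alternative illustration, not the method used in the rest of this walkthrough, and the incomes in `group_incomes` are invented.

```python
import numpy as np

households = [15, 10, 25, 25, 10, 15]
# Hypothetical incomes, one per percentile group (not the real data)
group_incomes = [0, 10, 20, 30, 40, 50]

# With a list of counts, each income is repeated as many times as there are
# households in its group
incomes_100 = np.repeat(group_incomes, households)

# Positions 0..99 become percentiles 1..100; the modulo would restart the
# count at 1 if the positions ran past 99
percentile = np.arange(len(incomes_100)) % 100 + 1
print(len(incomes_100), percentile[0], percentile[-1])
```

Because the household counts sum to 100, the result has exactly one income value per percentile.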

With all this, we can now `pivot` this to have columns as years and percentiles as the index:

In [None]:
hh_percentiles = hh_percentiles.pivot(columns="year", index=["percentile"], values=0)
hh_percentiles

Now we need only to compute the Gini coefficients for the different years. First we define a function that computes a Gini given a set of values.

In [None]:
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    x = np.double(x.values)
    x = x / x.sum()
    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad / np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g

Second, we `apply` it to each column separately (using `axis=0`) to get our Gini values!

In [None]:
hh_percentiles.apply(lambda x: gini_coefficient(x), axis=0)
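
A quick way to build confidence in a Gini function is to try it on cases where the answer is known. The sketch below uses a hypothetical helper, `gini_from_array`, which follows the same mean-absolute-difference formula but takes a plain list; it skips the normalisation step, which does not affect the result because the Gini coefficient does not depend on the scale of incomes.

```python
import numpy as np


def gini_from_array(values):
    """Gini coefficient as half the relative mean absolute difference."""
    x = np.asarray(values, dtype=float)
    # Mean of |x_i - x_j| over every pair of households
    mad = np.abs(np.subtract.outer(x, x)).mean()
    return 0.5 * mad / x.mean()


print(gini_from_array([1, 1, 1, 1]))  # perfectly equal incomes
print(gini_from_array([0, 0, 0, 1]))  # one household holds everything
```

Perfect equality gives 0, while one household out of four holding all the income gives 0.75. With this formula the maximum for `n` households is `(n - 1) / n`, which approaches 1 as `n` grows.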

## Extension Python Walkthrough 12.4

**Converting nominal incomes to real incomes**

To obtain the real income values, we need to divide the income for each percentile group by the inflation index created in Question 5(a). Recall that we have only imported the real income data from the Excel spreadsheet and not the nominal income data, so we will first import the nominal income data (`nom_income`). Note that, in the code below, the inflation index is entered as a list of numbers called `inflation` (with the same number of elements as the number of years in the data), and each row of the income data is divided by the corresponding element using the `.div` method.

In [None]:
# Import the nominal income data, drop everything beyond first table,
# set Percentile as the index, and transpose
nom_income = (
    pd.read_excel(
        Path("data/doing-economics-project-12-datafile.xlsx"),
        skiprows=2,
    )
    .loc[:4, :]
    .set_index("Percentile")
    .T
)

nom_income

Now we adjust for inflation, using the first element of the inflation data for the first row, the second for the second, and so on. Note that we need to divide by the inflation numbers, rather than multiply. Finally, we transpose back to the original data shape.

`nom_income` is a dataframe with the nominal income observations (years in rows and percentiles in columns). We wish to divide all 2009 data by the first price index number (`1.0`), all 2010 observations by the second (`1.024`), and so on. If your dataframe consists of numerical data only, then this can be achieved using the `.div` method. As we want to divide each of the eight rows by a different number, we need to feed in eight different numbers. This is done by passing in the inflation list and specifying that we want to apply one number to each row (`axis=0`).

In [None]:
inflation = [1, 1.024, 1.078, 1.122, 1.171, 1.222, 1.259, 1.289]
nom_income = nom_income.div(inflation, axis=0).T
nom_income.round(2)
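
To make the role of `axis=0` concrete, here is a small sketch with invented numbers (the names `nominal`, `price_index` and `real` are illustrative only).

```python
import pandas as pd

# Hypothetical nominal incomes: years in rows, percentile groups in columns
nominal = pd.DataFrame(
    {"15th": [100.0, 125.0], "50th": [200.0, 250.0]}, index=[2009, 2010]
)
price_index = [1.0, 1.25]  # made-up index, one value per row

# axis=0 lines the list up with the rows, so each year is divided by its own value
real = nominal.div(price_index, axis=0)
print(real)
```

In this made-up case, prices rose by 25% and nominal incomes rose by 25%, so real incomes are unchanged between the two years.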

## Python Walkthrough 12.5

**Importing data directly from a website, aka webscraping**

Originally, this exercise downloaded data directly from the HKU POP website. This is no longer possible using the original code because the HKU POP website has changed. However, to demonstrate the principles, we'll webscrape some data from a table on Wikipedia.

Webscraping is a way of grabbing information from the internet that was intended to be displayed in a browser. But it should only be used as a last resort, and only then when permitted by the terms and conditions of a website.

If you're getting data from the internet, it's much better to use an API (application programming interface) whenever you can: grabbing information in a structured way is exactly why APIs exist. APIs should also be more stable than websites, which may change frequently. Typically, if an organisation is happy for you to grab their data, they will have made an API expressly for that purpose. It's pretty rare for a major website to permit webscraping but not offer an API; if a website doesn't have an API, chances are that scraping is against its terms and conditions. Those terms and conditions may be enforceable by law. There are different rules in different countries here, and you really need legal advice if it's not unambiguous as to whether you can scrape or not.

In Python, you are spoiled for choice when it comes to webscraping. There are five very strong libraries that cover a real range of user styles and needs: **requests**, **lxml**, **beautifulsoup**, **selenium**, and **scrapy**.

For quick and simple webscraping, a combination of **requests** (which does little more than go and grab the HTML of a webpage) and **beautifulsoup** (which then helps you to navigate the structure of the HTML page) is a good choice. However, in *this* example, we're going to see an even easier way, using **pandas**!

We will read data from the first table on 'https://simple.wikipedia.org/wiki/FIFA_World_Cup' using **pandas**. Note that this method in **pandas** has a dependency on the **html5lib** package. The function we'll use is `read_html`, which returns a list of dataframes of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.

The example below shows how this works; looking at the website, we can see that the table we're interested in (of past world cup results), has a 'fourth place' column while other tables on the page do not. Therefore we run:

In [None]:
df_list = pd.read_html(
    "https://simple.wikipedia.org/wiki/FIFA_World_Cup", match="Fourth place"
)
# Retrieve first and only entry from list of dataframes
df = df_list[0]
df.head()
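
To give a feel for the kind of work that table-scraping tools take off our hands, here is a rough sketch that pulls rows out of a small hand-written HTML table using only Python's standard library. The snippet `PAGE` and the class `TableRows` are invented for illustration; real pages are far messier, which is why a dedicated library is preferable in practice.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded page
PAGE = """
<table>
  <tr><th>Year</th><th>Winner</th></tr>
  <tr><td>2014</td><td>Germany</td></tr>
  <tr><td>2018</td><td>France</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new row starts
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a header or data cell
        if self._in_cell:
            self.rows[-1].append(data.strip())


parser = TableRows()
parser.feed(PAGE)
print(parser.rows)
```

The result is a list of rows, with the header row first, which could then be passed to `pd.DataFrame`.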

To continue on with the analysis in this section, head to the [PORI website](https://www.pori.hk), navigate to the POP Polls page, and then to the page on "People’s Satisfaction with the HKSAR Government". You should see some time series of polls. Switch to the half-yearly average tab and then scroll down. You can get the data from the "Download CSV" button.


## Python Walkthrough 12.6

**Cleaning imported data**

Save the data in a subfolder called "data" with the filename ("datatables.csv" by default).

In [None]:
overall = pd.read_csv("data/datatables.csv")
overall.head()

In the next step, we will force the two date columns (which we will call `start_date` and `end_date`) into **pandas**' `datetime64[ns]` format. The conversion automatically turns each string into a date, but you may see a warning message, so when working with dates you should always double check that the conversion worked. (If it did, you can ignore the warning message.)

If you look at the data, the names of each column from the imported data are long and contain a mixture of alphabet sets. We also have many columns with percentage signs in, and (if you check using `overall.info()`) we have many columns with the wrong data types: "object" datatypes where we should have numeric datatypes, for example. We're going to clean up all of this in the next step.

We usually rename variables using the `rename` method. However, in this case we'll just do a wholesale replacement of column names using the "in-place" operation `overall.columns = `.

To remove all of those percentage and comma signs, we'll use a dataframe-wide `.replace` function with a dictionary mapping "%" and "," to "" (an empty string).

And, for the data types, we'll map the dates to datetimes and all the other columns (from position 2 onwards) to float type.


In [None]:
new_col_names = [
    "start_date",
    "end_date",
    "cases",
    "subsample",
    "response_rate",
    "very_satisfied",
    "quite_satisfied",
    "satisfied",
    "half_half",
    "quite_dissatisfied",
    "very_dissatisfied",
    "dissatisfied",
    "dkhs",
    "total",
    "netvalue",
    "meanvalue",
    "base",
    "meanerror",
]

overall.columns = new_col_names

overall = overall.replace({"%": "", ",": ""}, regex=True)

types_dict = {
    "start_date": "datetime64[ns]",
    "end_date": "datetime64[ns]",
}
types_dict.update({k: "float" for k in overall.columns[2:]})
overall = overall.astype(types_dict)
overall.head()
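
The same cleaning pattern can be tried out on a tiny made-up table before running it on the real file. In the sketch below, the dataframe `raw` and its values are invented, and the date column is converted with `pd.to_datetime` rather than `astype`.

```python
import pandas as pd

# Hypothetical raw poll data, everything read in as text
raw = pd.DataFrame(
    {
        "start_date": ["2020-01-01", "2020-07-01"],
        "cases": ["1,005", "998"],
        "response_rate": ["64.5%", "61.0%"],
    }
)

# regex=True makes replace match *within* each string rather than whole cells
clean = raw.replace({"%": "", ",": ""}, regex=True)
clean = clean.astype({"cases": "float", "response_rate": "float"})
clean["start_date"] = pd.to_datetime(clean["start_date"])
print(clean.dtypes)
```

Without `regex=True`, `.replace` would only swap cells whose entire content is "%" or ",", so the numbers would stay as text and the cast to float would fail.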

## Python Walkthrough 12.7

**Cleaning data and setting dates**


For Question 3(a), before we can plot the imported data, any date variables need to be suitably formatted. We actually already did this in the previous step when we implicitly ran:

```python
overall["start_date"] = overall["start_date"].astype("datetime64[ns]")
```

In general, the command for converting data in a column to datetime format, especially when they are less nice than in this case, is

```python
overall["start_date"] = pd.to_datetime(overall["start_date"])
```

For more gnarly cases, you should look at the documentation for `pd.to_datetime`. Note that the standard format for dates is YYYY-MM-DD.
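
As one example of a less tidy case, dates written day-first can be misread as month-first. The sketch below uses made-up date strings and passes an explicit `format=` to `pd.to_datetime` to remove the ambiguity.

```python
import pandas as pd

# Hypothetical day-first date strings, which are easy to misread as month-first
dates = pd.Series(["03/04/2019", "25/12/2019"])

# An explicit format removes the ambiguity between 3 April and 4 March
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Printing the result back out in YYYY-MM-DD form is a simple way to check that the days and months were read the right way round.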

Let's now plot this data:

In [None]:
(
    ggplot(overall.query("start_date > 2006"), aes(x="start_date", y="netvalue"))
    + geom_line(size=2)
    + labs(x="Date", y="Net satisfaction", title="Overall satisfaction with HKSARG")
    + scale_x_datetime()
)

**Figure 12.7 Net public satisfaction with the government’s performance over time.**

For Question 3(b), in this example we use the variable "Improving People's Livelihood". We repeat the import and cleaning processes from Python walkthroughs 12.5 and 12.6. In this case, we'll save the downloaded file as "satisfaction_datatables.csv" in a sub-directory called "data".

In [None]:
improvement = pd.read_csv("data/satisfaction_datatables.csv")
improvement = (
    improvement.replace({"%": "", ",": ""}, regex=True)
    .dropna(axis=1)
    .rename(columns={k: v for k, v in zip(improvement.columns, new_col_names)})
    .astype(types_dict)
)

In [None]:
(
    ggplot(improvement.query("start_date > 2006"), aes(x="start_date", y="netvalue"))
    + geom_line(size=2)
    + labs(
        x="Date",
        y="Net satisfaction",
        title="HKSARG's Performance in Improving Livelihood",
    )
    + scale_x_datetime()
)

**Figure 12.8 Net public satisfaction with the government’s ability to improve people’s livelihood over time.**

This chapter used the following packages where *sys* is the Python version:

In [None]:
%load_ext watermark
%watermark --iversions