# Empirical Project 12

## Getting Started in Python

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import seaborn as sns
import seaborn.objects as so
import pingouin as pg
import warnings


### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use("plot_style.txt")
# Make seaborn work consistently with this
so.Plot.config.theme.update(mpl.rcParams)
# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## Python Walkthrough 12.1

**Importing a specified range of data from a spreadsheet**

We start by importing the data; you will need to download it and put it in a subfolder of your working directory called "data". The data provided in the Excel spreadsheet are not in the usual format of one variable per column (known as ‘long’ format). Instead, the first tab contains two separate tables, and we need the second table. We can therefore use the `skiprows=` keyword argument of **pandas**' `pd.read_excel` function to skip the rows above the second table, so that the import starts at its header row (note that variable headers for the years are included).

In [None]:
income = pd.read_excel(
    Path("data/doing-economics-project-12-datafile.xlsx"), skiprows=12
)

income
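To see what `skiprows=` does in isolation, here is a minimal sketch. It uses `pd.read_csv` on an in-memory text file rather than `pd.read_excel` on the real spreadsheet, on the assumption that an integer `skiprows` behaves the same way in both functions (skipping that many rows from the top). The table contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: two tables stacked on top of each other
sheet = io.StringIO(
    "Table A\n"
    "x,y\n"
    "1,2\n"
    "Table B\n"
    "Percentile,2011,2012\n"
    "15th,100,110\n"
    "25th,200,220\n"
)

# Skip the four lines above the header row of the second table
table_b = pd.read_csv(sheet, skiprows=4)
print(table_b)
```

The first line that is not skipped is used as the header, which is why the value passed to `skiprows` must be the number of lines above the header row of the table you want.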

This still has some rows that we don't want, so we need to get rid of them. There are multiple ways to do this depending on the context.

In this case, the rows we don't want contain a large number of invalid (missing) entries, so we can make use of the row-wise `.dropna()` method to clean up the data. While we're at it, let's rename the first column to something more sensible.

In [None]:
income = income.dropna(axis=0).rename(columns={"Unnamed: 0": "Percentile"})
income
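As a small illustration of the same two steps, the sketch below applies `.dropna(axis=0)` and `.rename` to a made-up dataframe (the values and the "note" row are hypothetical, not taken from the spreadsheet).

```python
import numpy as np
import pandas as pd

# Toy data: the last row mimics a stray notes row with missing values
toy = pd.DataFrame(
    {
        "Unnamed: 0": ["15th", "25th", "note"],
        2011: [100.0, 200.0, np.nan],
        2012: [110.0, 220.0, np.nan],
    }
)

# axis=0 drops every row that contains at least one missing value
tidy = toy.dropna(axis=0).rename(columns={"Unnamed: 0": "Percentile"})
print(tidy)
```

Because `.dropna(axis=0)` removes a row if any of its entries is missing, it is worth checking the output to make sure no rows you wanted have been dropped.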

It would be good to have a quick plot of this data. We'll need to re-orient if we want to make use of the **seaborn** plotting package (as this expects long-format data). We'll use `pd.melt` to do this.

In [None]:
(
    so.Plot(
        pd.melt(
            income, id_vars="Percentile", value_vars=range(2009, 2016), var_name="Year"
        ),
        x="Year",
        y="value",
        color="Percentile",
    )
    .add(so.Line(linewidth=3))
    .label(y="Monthly real household income (HKD)")
    .show()
)

**Figure 12.1 Monthly real household income over time.**
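If the reshaping step is unfamiliar, the following minimal sketch shows what `pd.melt` does to a tiny wide-format table. The numbers are invented purely for illustration.

```python
import pandas as pd

# Wide format: one column per year
wide = pd.DataFrame(
    {"Percentile": ["15th", "85th"], 2011: [100, 900], 2012: [110, 950]}
)

# Long format: one row per (Percentile, Year) pair
long = pd.melt(wide, id_vars="Percentile", value_vars=[2011, 2012], var_name="Year")
print(long)
```

Each year column becomes a set of rows, with the old column label stored in `Year` and the cell contents stored in a column that **pandas** calls `value` by default.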

However, plotting all percentile groups using the same scale hides a lot of variation within each percentile group. So, we should create a separate chart for each percentile group. As an example, we will plot the chart for the 15th percentile.

First, we use the `pd.melt` function to reshape the data into the format that **seaborn** uses to plot charts. Then we use the `.loc` indexer to select data for the 15th percentile only. Finally, we can make the line chart as before.

In [None]:
income_long = pd.melt(
    income, id_vars="Percentile", value_vars=range(2009, 2016), var_name="Year"
)

perct_to_use = "15th"

(
    so.Plot(
        income_long.loc[income_long["Percentile"] == perct_to_use, :],
        x="Year",
        y="value",
    )
    .add(so.Line())
    .add(so.Dot())
    .label(y=f"Monthly real household income for {perct_to_use} percentile")
    .show()
)

**Figure 12.2 Monthly real household income for 15th percentile.**

## Python Walkthrough 12.2

**Calculating cumulative income shares and plotting a Lorenz curve**

Questions 2(a)–(c) can be completed in one stage by method chaining, although there are a number of steps.

First, we use the `.loc` indexer to select the relevant variables. Then, for the next step in the chain, we remove the ‘th’ suffix from the percentile column values: `.str.split("th", expand=True)[0]` splits each string on "th", expands the resulting pieces into two columns and, with `[0]`, takes only the first piece (the part before the "th"). This is then cast to an integer type.
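The string-splitting step can be tried on its own. This is a minimal sketch on a made-up series of labels:

```python
import pandas as pd

labels = pd.Series(["15th", "25th", "50th"])

# Splitting on "th" gives two pieces per label: the number and an empty string
parts = labels.str.split("th", expand=True)

# Keep the first piece and convert it from text to integers
numbers = parts[0].astype("int")
print(numbers.tolist())
```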

In the next step, we append a row of three zeros to serve as the zeroth percentile entries for 2011 and 2012, and then sort by percentile.

We then do a number of operations on columns using `.assign`:

- a column representing the number of households in each percentile group (assuming 100 households in the economy)
- a new variable called `handout_2012` that adds $6,000 to each value for the year 2012
- the economy-wide income for each percentile group, derived from multiplying the income values by the number of households and stored in the variables `income_2011` and `income_2012`
- the normalised cumulative income for each group, stored in `rel_share_2011` and `rel_share_2012`
- an updated `Percentile` column, shifted up by each group's number of households so that it marks the upper end of each group

Finally, we tidy up the data by ensuring that the 0th percentile has no share of the income and the 100th percentile has 100% of the cumulative income. We do this by appending a row of zeros and then sorting by percentile so that this row sits at the start.

In [None]:
cols_of_interest = ["Percentile", 2011, 2012]

income_shares = (
    income.loc[:, cols_of_interest]
    .assign(
        Percentile=lambda x: x["Percentile"]
        .str.split("th", expand=True)[0]
        .astype("int")
    )
    .pipe(
        # DataFrame.append was removed in pandas 2.0, so add the zero row with pd.concat
        lambda x: pd.concat(
            [x, pd.DataFrame([[0, 0, 0]], columns=cols_of_interest, index=[5])]
        )
    )
    .sort_values(by="Percentile")
    .assign(
        households=[15, 10, 25, 25, 10, 15],
        handout_2012=lambda x: x[2012] + 6000,
        income_2011=lambda x: x[2011] * x["households"],
        income_2012=lambda x: x["handout_2012"] * x["households"],
        rel_share_2011=lambda x: 100
        * x["income_2011"].cumsum()
        / x["income_2011"].sum(),
        rel_share_2012=lambda x: 100
        * x["income_2012"].cumsum()
        / x["income_2012"].sum(),
        Percentile=lambda x: x["households"] + x["Percentile"],
    )
    .loc[:, ["Percentile", "rel_share_2011", "rel_share_2012"]]
    .pipe(
        # Add the (0, 0) starting point of the Lorenz curve, again with pd.concat
        lambda x: pd.concat(
            [
                x,
                pd.DataFrame(
                    [[0, 0, 0]],
                    columns=["Percentile", "rel_share_2011", "rel_share_2012"],
                    index=[6],
                ),
            ]
        )
    )
    .sort_values(by="Percentile")
)

income_shares.round(2)
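To make the cumulative-share arithmetic easier to follow, here is a minimal sketch on a toy economy with three groups. The household counts and incomes are invented, and the column names `group_income`, `cum_share` and `cum_pop` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"households": [50, 30, 20], "income": [10, 20, 50]})

toy = toy.assign(
    # Total income earned by each group
    group_income=lambda x: x["income"] * x["households"],
    # Running total of income as a percentage of all income
    cum_share=lambda x: 100 * x["group_income"].cumsum() / x["group_income"].sum(),
    # Running total of households as a percentage of all households
    cum_pop=lambda x: 100 * x["households"].cumsum() / x["households"].sum(),
)
print(toy.round(2))
```

Plotting `cum_share` against `cum_pop` (with an extra point at the origin) gives the Lorenz curve for this toy economy. The poorest half of households here earns 500 of the 2,100 total, or roughly 23.8%.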

Using the data from Questions 2(a)–(c) we can plot the Lorenz curve using the **seaborn** package. Note that we use the `Percentile` variable to draw the line of perfect equality.

We will need the data in long format for this, so we first melt it:

In [None]:
fig, ax = plt.subplots()

long_incomes_shares = pd.melt(
    income_shares,
    id_vars="Percentile",
    var_name="year",
    value_name="Cumulative share of income, %",
)

p1 = (
    so.Plot(
        long_incomes_shares,
        x="Percentile",
        y="Cumulative share of income, %",
        color="year",
    )
    .add(so.Line())
    .add(so.Dot())
    .label(x="Cumulative share of population")
)

ax.plot(
    long_incomes_shares["Percentile"],
    long_incomes_shares["Percentile"],
    linewidth=1.5,
    alpha=0.5,
    linestyle=":",
    color="grey",
)
p1.on(ax).show()

## Extension Python Walkthrough 12.4

**Converting nominal incomes to real incomes**

To obtain the real income values, we need to divide the income for each percentile group by the inflation index created in Question 5(a). Recall that we have only imported the real income data from the Excel spreadsheet and not the nominal income data, so we will first import the nominal income data (`nom_income`). Note that, in the code below, the inflation index is entered as a vector (list of numbers) called `inflation` (with the same number of elements as the number of years in the data) and this is multiplied (element-wise) by each row of the income data using the `.mul` method.

In [None]:
# Import the nominal income data, drop everything beyond first table,
# transpose, and set Percentile as the index
nom_income = (
    pd.read_excel(
        Path("data/doing-economics-project-12-datafile.xlsx"),
        skiprows=2,
    )
    .loc[:4, :]
    .set_index("Percentile")
    .T
)

nom_income

Now we multiply through by the inflation data, taking the first element of the inflation data for the first row (excluding the year), the second for the second, and so on. Note that we actually need to divide by the inflation numbers, rather than multiply, which we do with a *list comprehension*. Finally, we transpose back to the original data shape.

In [None]:
inflation = [1, 1.024, 1.078, 1.122, 1.171, 1.222, 1.259, 1.289]
nom_income = nom_income.mul([1 / num for num in inflation], axis=0).T
nom_income.round(2)
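As an aside, **pandas** also has a `.div` method, which divides directly and avoids the need to take reciprocals first. The sketch below uses a made-up price index and made-up incomes to show that the two approaches agree.

```python
import pandas as pd

# Hypothetical nominal incomes, with years as the index (one row per year)
nominal = pd.DataFrame(
    {"15th": [100.0, 105.0, 110.0], "85th": [1000.0, 1050.0, 1100.0]},
    index=[2009, 2010, 2011],
)
# Hypothetical price index, one value per year
index_values = [1.0, 1.05, 1.10]

# Divide each row by that year's index value
real = nominal.div(index_values, axis=0)

# Equivalent: multiply each row by the reciprocal of the index value
same = nominal.mul([1 / v for v in index_values], axis=0)
print(real.round(2))
```

In this toy case nominal incomes grow exactly in line with prices, so real incomes stay constant over time.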

## Python Walkthrough 12.5

**Importing data directly from a website, aka webscraping**

Originally, this exercise downloaded data directly from the HKU POP website. This is no longer possible with the original code because the HKU POP website has changed. However, to demonstrate the principles, we'll webscrape some data from a table on Wikipedia.

Webscraping is a way of grabbing information from the internet that was intended to be displayed in a browser. But it should only be used as a last resort, and only then when permitted by the terms and conditions of a website.

If you're getting data from the internet, it's much better to use an API (application programming interface) whenever you can: grabbing information in a structured way is exactly why APIs exist. APIs should also be more stable than websites, which may change frequently. Typically, if an organisation is happy for you to grab their data, they will have made an API expressly for that purpose. It's pretty rare for a major website to permit webscraping without also offering an API; if a website doesn't have an API, chances are that scraping is against its terms and conditions. Those terms and conditions may be enforceable by law: the rules differ between countries, and you really need legal advice if it is not unambiguous whether you can scrape or not.

In Python, you are spoiled for choice when it comes to webscraping. There are five very strong libraries that cover a real range of user styles and needs: **requests**, **lxml**, **beautifulsoup**, **selenium**, and **scrapy**.

For quick and simple webscraping, a combination of **requests** (which does little more than go and grab the HTML of a webpage) and **beautifulsoup** (which then helps you to navigate the structure of the HTML page) works well. However, in *this* example, we're going to see an even easier way, using **pandas**!

We will read data from the first table on 'https://simple.wikipedia.org/wiki/FIFA_World_Cup' using **pandas**. Note that this method in **pandas** has a dependency on the **html5lib** package. The function we'll use is `read_html`, which returns a list of dataframes of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.

The example below shows how this works. Looking at the website, we can see that the table we're interested in (of past World Cup results) has a 'Fourth place' column, while other tables on the page do not. Therefore we run:

In [None]:
df_list = pd.read_html(
    "https://simple.wikipedia.org/wiki/FIFA_World_Cup", match="Fourth place"
)
# Retrieve first and only entry from list of dataframes
df = df_list[0]
df.head()
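To give a feel for what a table scraper has to do behind the scenes, here is a minimal sketch that turns an HTML table into a dataframe by hand. It deliberately uses only Python's standard-library `html.parser` rather than **pandas**' `read_html`, **requests** or **beautifulsoup**, so it is a stand-in for those tools and not how they are implemented. The HTML snippet, the team labels and the `TableParser` class are all made up for illustration.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Hypothetical helper: collect the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new table row starts
            self.rows.append([])
        elif tag in ("td", "th"):
            # A new cell starts; begin with empty text
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


html = """
<table>
  <tr><th>Year</th><th>Winner</th><th>Fourth place</th></tr>
  <tr><td>A</td><td>Team X</td><td>Team Y</td></tr>
  <tr><td>B</td><td>Team Z</td><td>Team W</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Treat the first row as the header and the rest as data
header, *body = parser.rows
toy_table = pd.DataFrame(body, columns=header)
print(toy_table)
```

Real webpages are much messier than this snippet (nested tags, merged cells, several tables per page), which is why a purpose-built function such as `read_html` is usually the better choice.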

To continue on with the analysis in this section, head to the [PORI website](https://www.pori.hk), navigate to the POP Polls page, and then to the page on "People’s Satisfaction with the HKSAR Government". You should see some time series of polls. Switch to the half-yearly average tab and then scroll down. You can get the data from the "Download CSV" button.


## Python Walkthrough 12.6

**Cleaning imported data**

Save the downloaded file in a subfolder called "data", keeping its default filename ("datatables.csv").

In [None]:
# Use the lower-case folder name "data" so the path also works on case-sensitive systems
overall = pd.read_csv("data/datatables.csv")
overall.head()

If you look at the data, the names of each column from the imported data are long and contain a mixture of alphabet sets. We also have many columns with percentage signs in, and (if you check using `overall.info()`) we have many columns with the wrong data types: "object" datatypes where we should have numeric datatypes, for example. We're going to clean up all of this in the next step.

We usually rename variables using the `rename` method. However, in this case we'll just do a wholesale replacement of column names using the "in-place" operation `overall.columns = `.

To remove all of those percentage and comma signs, we'll use a dataframe-wide `.replace` function with a dictionary mapping "%" and "," to "" (an empty string).

And, for the data types, we'll map the dates to datetimes and all the other columns (from position 2 onwards) to float type.


In [None]:
new_col_names = [
    "start_date",
    "end_date",
    "cases",
    "subsample",
    "response_rate",
    "very_satisfied",
    "quite_satisfied",
    "satisfied",
    "half_half",
    "quite_dissatisfied",
    "very_dissatisfied",
    "dissatisfied",
    "dkhs",
    "total",
    "netvalue",
    "meanvalue",
    "base",
    "meanerror",
]

overall.columns = new_col_names

overall = overall.replace({"%": "", ",": ""}, regex=True)

types_dict = {
    "start_date": "datetime64[ns]",
    "end_date": "datetime64[ns]",
}
types_dict.update({k: "float" for k in overall.columns[2:]})
overall = overall.astype(types_dict)
overall.head()
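The same cleaning recipe can be tried on a tiny made-up dataframe, which makes it easier to see what each step does. The column names and values below are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "start_date": ["2020-01-01", "2020-07-01"],
        "cases": ["1,005", "998"],
        "response_rate": ["64.5%", "58.0%"],
    }
)

# Strip percentage signs and thousands separators everywhere
clean = raw.replace({"%": "", ",": ""}, regex=True)

# Dates become datetimes; every other column becomes a float
types = {"start_date": "datetime64[ns]"}
types.update({col: "float" for col in clean.columns[1:]})
clean = clean.astype(types)

print(clean.dtypes)
```

Note that the text has to be cleaned before the type conversion: a string such as "1,005" cannot be converted to a float until the comma has been removed.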

## Python Walkthrough 12.7

**Cleaning data and setting dates**


For Question 3(a), before we can plot the imported data, any date variables need to be suitably formatted. We actually already did this in the previous step when we implicitly ran:

```python
overall["start_date"] = overall["start_date"].astype("datetime64[ns]")
```

In general, the command for converting data in a column to datetime format, especially when the dates are less tidy than in this case, is:

```python
overall["start_date"] = pd.to_datetime(overall["start_date"])
```

For more gnarly cases, you should look at the documentation for `pd.to_datetime`. Note that the standard format for dates is YYYY-MM-DD.
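As one example of a less tidy case, dates written day-first with slashes can be parsed by passing an explicit `format=` string to `pd.to_datetime`. The dates below are made up.

```python
import pandas as pd

raw_dates = pd.Series(["31/12/2019", "01/07/2020"])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")
print(parsed)
```

Stating the format explicitly removes any ambiguity about whether "01/07/2020" means 1 July or 7 January.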

Let's now plot this data:

In [None]:
(
    so.Plot(overall.query("start_date > 2006"), x="start_date", y="netvalue")
    .add(so.Line(linewidth=2))
    .label(x="Date", y="Net satisfaction", title="Overall satisfaction with HKSARG")
    .show()
)

**Figure 12.7 Net public satisfaction with the government’s performance over time.**

For Question 3(b), in this example we use the variable "Improving People's Livelihood". We repeat the import and cleaning steps from Python Walkthroughs 12.5 and 12.6, this time saving the downloaded file as "satisfaction_datatables.csv" in a subfolder called "data".

In [None]:
# Use the lower-case folder name "data" so the path also works on case-sensitive systems
improvement = pd.read_csv("data/satisfaction_datatables.csv")
improvement = (
    improvement.replace({"%": "", ",": ""}, regex=True)
    .dropna(axis=1)
    .rename(columns={k: v for k, v in zip(improvement.columns, new_col_names)})
    .astype(types_dict)
)

(
    so.Plot(improvement.query("start_date > 2006"), x="start_date", y="netvalue")
    .add(so.Line(linewidth=2))
    .label(
        x="Date",
        y="Net satisfaction",
        title="HKSARG's Performance in Improving Livelihood",
    )
    .show()
)

**Figure 12.8 Net public satisfaction with the government’s ability to improve people’s livelihood over time.**

This chapter used the following packages, where *sys* is the Python version:

In [None]:
%load_ext watermark
%watermark --iversions