# Empirical Project 4


## Getting Started in Python

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import seaborn as sns
import seaborn.objects as so
import pingouin as pg
import warnings


### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use("plot_style.txt")
# Make seaborn work consistently with this
so.Plot.config.theme.update(mpl.rcParams)
# Ignore warnings to make nice output
warnings.simplefilter("ignore")


- Go to the United Nations’ [National Accounts Main Aggregates Database website](https://tinyco.re/7226184). On the right-hand side of the page, under ‘Data Availability’, click ‘Downloads’.
- Under the subheading ‘GDP and its breakdown at constant 2015 prices in US Dollars’, select the Excel file ‘All countries for all years – sorted alphabetically’.
- Save it in a subfolder of the directory you are coding in, such that the relative path is `data/Download-GDPconstant-USD-all.xlsx`.

## Python Walkthrough 4.1

**Importing the Excel file (`.xlsx` or `.xls`) into Python**

First, make sure you move the saved data to a folder called `data` that is a subfolder of your working directory. The working directory is the folder that your code 'starts' in, and the one that you open when you start Visual Studio Code. Let's say you called it `core`, then the file and folder structure of your working directory would look like this:

```bash
📁 core
│──📁data
   └──Download-GDPconstant-USD-all.xlsx
│──empirical_project_4.py
```

This is similar to what you should see in Visual Studio Code under the explorer tab (although the working directory, `core`, won't appear). You can check your working directory by running

```python
import os
os.getcwd()
```

in Visual Studio Code.

Before importing the file into Python, open the file in Excel, OpenOffice, LibreOffice, or Numbers to see how the data is organized in the spreadsheet, and note that:

- There is a heading that we don’t need, followed by a blank row.
- The data we need starts on row three.

Armed with this knowledge, we can import the data, using the `Path` class from the `pathlib` module to create the path to the data:

In [None]:
df = pd.read_excel(Path("data/Download-GDPconstant-USD-all.xlsx"), skiprows=2)
df.head()
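
If `pd.read_excel` raises a `FileNotFoundError`, the usual cause is a mismatch between the working directory and the relative path. The following is a minimal sketch for checking this before importing; the variable name `data_path` is ours and is not used elsewhere in the walkthrough.

```python
from pathlib import Path

# Build the relative path piece by piece; this works on Windows, macOS and Linux
data_path = Path("data") / "Download-GDPconstant-USD-all.xlsx"

# Show where Python is looking and whether the file is actually there
print("Working directory:", Path.cwd())
print("Looking for:", data_path.resolve())
print("File found:", data_path.exists())
```

If the last line prints `False`, either move the file or open the correct folder in Visual Studio Code.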

## Python Walkthrough 4.2

**Making a frequency table**

We want to create a table showing how many years of `Final consumption expenditure` data are available for each country.

Looking at the dataset’s current format, you can see that countries and indicators (for example, `Afghanistan` and `Final consumption expenditure`) are row variables, while year is the column variable. This data is organized in ‘wide’ format (each individual’s information is in a single row).

For many data operations and for making charts it is more convenient to have indicators as column variables, so we would like `Final consumption expenditure` to be a column variable, and year to be the row variable. Each observation would represent the value of an indicator for a particular country and year. This data is organized in ‘long’ format (each individual’s information is in multiple rows). This is also called 'tidy' data, and it can be recognised by having one variable per column and one observation per row. Many data scientists consider keeping data in tidy format good practice.

To change data from wide to long format, we use the `pd.melt` method. The `melt` method is very powerful and useful, as you will find many large datasets are in wide format. In this case, `pd.melt` takes the data from all columns not specified as being `id_vars` (via a list of column names), and uses them to create two new columns: one contains the name of the row variable created from the former column names, which is the year here; we can set that new column's name with `var_name="year"`. The second new column contains the values that were in the columns we unpivoted and is automatically given the name `value`. (We could have set a new name for this column by passing `value_name=` too.)

Compare `df_long` to the wider `df` to understand how the melt command works. To learn more about organizing data in Python, see the [Working with Data](https://aeturrell.github.io/coding-for-economists/data-intro.html) section of 'Coding for Economists'. 

In [None]:
df_long = pd.melt(
    df, id_vars=["Area/CountryID", "Area/Country", "IndicatorName"], var_name="year"
)
df_long.head()
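
To see what `pd.melt` does without the distraction of a large dataset, here is a small sketch on made-up numbers. The dataframes `toy_wide` and `toy_long` and the column name `country` are illustrative only.

```python
import pandas as pd

# A tiny wide table: one row per country, one column per year (made-up numbers)
toy_wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        1970: [1.0, 3.0],
        1971: [2.0, 4.0],
    }
)

# Unpivot the year columns into rows, keeping "country" as the identifier
toy_long = pd.melt(toy_wide, id_vars=["country"], var_name="year")
print(toy_long)
```

The two rows and two year columns of `toy_wide` become four rows in `toy_long`, one per country-year pair, with the numbers collected in a column called `value`.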

To create the required table, we only need `Final consumption expenditure` of each country, which we extract using the `.loc` indexer. We'd like all columns, so we pass the condition in the first position of `.loc` and leave the second entry as `:` for all columns.

In [None]:
cons = df_long.loc[df_long["IndicatorName"] == "Final consumption expenditure", :]

Now let's create our table. 

In [None]:
year_count = cons.groupby("Area/Country").agg(available_years=("year", "count"))
year_count

Translating the code into words: take the variable `cons` and group the observations by area and country (`.groupby("Area/Country")`), then take this result and aggregate it (`.agg`) such that a new variable called available years (`available_years=`) is created that holds the count of the year column (`("year", "count")`).
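
One detail of `"count"` is worth knowing: it counts the non-missing entries of the column you name. The sketch below uses made-up data (the dataframe `toy` and its columns are ours) to show the difference between counting an identifier column and counting a column that contains a missing value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B", "B"],
        "year": [1970, 1971, 1972, 1970, 1971, 1972],
        "value": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
    }
)

toy_counts = toy.groupby("country").agg(
    rows=("year", "count"),  # number of year entries per country
    non_missing=("value", "count"),  # "count" skips NaN
)
print(toy_counts)
```

Whether this distinction matters for the UN data depends on how the spreadsheet records gaps: if missing observations appear as rows with an empty value rather than being left out, counting `value` instead of `year` would count the years that actually have data.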

Now we can establish how many of the 250 countries and areas in the dataset have complete information. A dataset is complete if it has the maximum number of available observations (given by `year_count["available_years"].max()`).

In [None]:
sum(year_count["available_years"] == year_count["available_years"].max())

In this case, the full set of data are available for all countries and areas.

## Python Walkthrough 4.3

**Creating new variables**

We will use Brazil, the US, and China as examples.

Before we select these three countries, we will calculate the net exports (exports minus imports) for all countries, as we need that information in Python walkthrough 4.4. We will also shorten the names of the variables we need, to make the code easier to read. We will use a dictionary to map names into shorter formats. A dictionary is a built-in object type in Python and always has the structure `{key1: value1, key2: value2, ...}` where the keys and values could have any type (eg string, int, dataframe). In our case, both keys and values will be strings. We will use a convention for our naming that is known as "snake case". This means all lower case with spaces replaced by underscores (it looks a bit like a snake!). There are packages that will auto-rename long variables for you, but let's see how to do it manually here.

In [None]:
short_names_dict = {
    "Final consumption expenditure": "final_expenditure",
    "Household consumption expenditure (including Non-profit institutions serving households)": "hh_expenditure",
    "General government final consumption expenditure": "gov_expenditure",
    "Gross capital formation": "capital",
    "Imports of goods and services": "imports",
    "Exports of goods and services": "exports",
}
# rename these values
df_long["IndicatorName"] = df_long["IndicatorName"].replace(short_names_dict)
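
For comparison, here is a rough sketch of how a label could be converted to snake case automatically. The helper `to_snake_case` is hypothetical (it is not part of **pandas** or of this project), and it simply lower-cases the text and replaces anything that is not a letter or digit with an underscore.

```python
import re


def to_snake_case(name: str) -> str:
    """Lower-case a label and replace runs of non-alphanumeric characters with underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


print(to_snake_case("Gross capital formation"))
print(to_snake_case("Imports of goods and services"))
```

Automatic names like these are valid but long, which is why the walkthrough uses hand-picked short names instead.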

`df_long` still has several rows for a particular country and year (one for each indicator). We will reshape this data using the `.pivot` method to ensure that we have only one row per country/area and per year. Note that `pivot` preserves the list of columns we pass as the `index=` and pivots the columns we pass to `columns=` out so that they are wide.

In [None]:
df_table = df_long.pivot(
    index=["Area/CountryID", "Area/Country", "year"], columns=["IndicatorName"]
)
df_table.head()

Now we create a `net_exports` column from the existing columns (exports minus imports). Because of the pivot, each row is a unique combination of country/area and year. First we need to drop the top level of the column index, which is currently called `value`, as we no longer need it. This allows direct access to the `exports` and `imports` columns. We'll also reset the index to row numbers rather than the three columns we used in the pivot, and remove the name of the column index as we won't need that any longer.

In [None]:
df_table.columns = df_table.columns.droplevel()
df_table = df_table.reset_index()
df_table.columns.name = ""
df_table["net_exports"] = df_table["exports"] - df_table["imports"]
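
The same sequence of steps (pivot, drop the top column level, reset the index, compute a new column) can be followed more easily on a toy example. The data and names below (`toy_long`, `toy_table`, `country`, `indicator`) are made up for illustration.

```python
import pandas as pd

toy_long = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2015, 2015, 2015, 2015],
        "indicator": ["exports", "imports", "exports", "imports"],
        "value": [10.0, 8.0, 5.0, 9.0],
    }
)

# Pivot: one row per country-year, one column per indicator
toy_table = toy_long.pivot(index=["country", "year"], columns=["indicator"])
# Columns are now two-level, e.g. ("value", "exports"); drop the top level
toy_table.columns = toy_table.columns.droplevel()
# Turn the index back into ordinary columns and clear the column index name
toy_table = toy_table.reset_index()
toy_table.columns.name = ""
toy_table["net_exports"] = toy_table["exports"] - toy_table["imports"]
print(toy_table)
```

Country A ends up with positive net exports and country B with negative net exports, which is easy to confirm by hand from the four input rows.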

Let us select our three chosen countries to check that we calculated net exports correctly.

In [None]:
sel_countries = ["Brazil", "United States", "China"]
cols_to_keep = ["Area/Country", "year", "exports", "imports", "net_exports"]

df_sel_un = df_table.loc[df_table["Area/Country"].isin(sel_countries), cols_to_keep]
df_sel_un.head()

## Python Walkthrough 4.4

**Plotting and annotating time series data**

*Extract the relevant data*

We will work with the `df_long` dataset, as the long format is well suited to produce charts with the **seaborn** package. In this example, we use the US and China (which we will now save as the dataframe `comp`).

In [None]:
# take a copy of df_long that is just US and China
comp = df_long.loc[df_long["Area/Country"].isin(["United States", "China"]), :].copy()
# Convert value to billion USD
comp["value"] = comp["value"] / 1.0e9
# Filter down to certain cols and values
comp = comp.loc[
    comp["IndicatorName"].isin(
        ["gov_expenditure", "hh_expenditure", "capital", "imports", "exports"]
    ),
    ["Area/Country", "year", "IndicatorName", "value"],
]
comp.head()

*Plot a line chart*

We can now plot this data using the **seaborn** data visualisation library. We'll subset the data to the US only.

In [None]:
(
    so.Plot(
        comp.loc[comp["Area/Country"] == "United States", :],
        x="year",
        y="value",
        color="IndicatorName",
        linestyle="IndicatorName",
    )
    .add(so.Line())
    .show()
)

***Figure 4.2** The US’s GDP components (expenditure approach).*

There are plenty of problems with this chart:

- the vertical axis label is uninformative
- there is no chart title
- the y-axis dips below zero
- the legend is uninformative.


To improve this chart, we add features to the figure by creating the axis, `ax`, explicitly. We'll also use a trick where we invert the dictionary from earlier and use this to supply full names to the legend via a new `Indicator` column.

In [None]:
# reverse the dictionary from earlier
rev_name_dict = {v: k.split("(")[0] for k, v in short_names_dict.items()}
# create a new col with the original names
comp["Indicator"] = comp["IndicatorName"].replace(rev_name_dict)
# plot data
fig, ax = plt.subplots()
ax.annotate("Great Recession", (2008, 0.05e4), size=8, ha="center")
(
    so.Plot(
        comp.loc[comp["Area/Country"] == "United States", :],
        x="year",
        y="value",
        color="Indicator",
        linestyle="Indicator",
    )
    .add(so.Line())
    .label(y="Billions USD", title="GDP components over time", color="Component")
    .scale(y=so.Continuous().label(like="{x:,g}"))
    .limit(y=(0, None))
    .on(ax)
    .show()
)

***Figure 4.3** The US’s GDP components (expenditure approach), amended chart.*

We can make a chart for more than one country simultaneously by adding a `.facet()` call to the **seaborn** objects plot, with otherwise the same settings.

In [None]:
(
    so.Plot(
        comp,
        x="year",
        y="value",
        color="Indicator",
        linestyle="Indicator",
    )
    .facet("Area/Country")
    .add(so.Line())
    .label(
        y="Billions USD",
        # title="GDP components over time",
        color="GDP Component",
    )
    .scale(y=so.Continuous().label(like="{x:,g}"))
    .limit(y=(0, None))
    .show()
)

***Figure 4.4** GDP components over time (China and the US).*

## Python Walkthrough 4.5

**Calculating new variables and plotting time series data**

*Calculate proportion of total GDP*

We will use the `comp` dataset created in Python Walkthrough 4.4. First we will calculate net exports, as that contributes to GDP. As the data is currently in long format, we will reshape the data into wide format so that the variables we need are in separate columns instead of separate rows (using the `pivot` method, as in Python Walkthrough 4.3), calculate net exports, then transform the data back into long format using the `melt` method.

Along the way, we drop the `Indicator` column and the top level of the column index, `value`.

In [None]:
comp_wide = comp.drop("Indicator", axis=1).pivot(
    index=["Area/Country", "year"], columns="IndicatorName"
)
comp_wide.columns = comp_wide.columns.droplevel()
comp_wide = comp_wide.reset_index()
comp_wide.head()

Add the new column for net exports (exports – imports):

In [None]:
comp_wide["net_exports"] = comp_wide["exports"] - comp_wide["imports"]
comp_wide.head()

Return to long format, keeping every variable except exports and imports (that is, household expenditure, government expenditure, capital, and net exports):

In [None]:
comp2_wide = comp_wide.loc[
    :, [x for x in comp_wide.columns if x not in ["exports", "imports"]]
]
comp2 = pd.melt(
    comp2_wide,
    id_vars=["year", "Area/Country"],
    var_name="indicator",
    value_name="2015_bn_usd",
)
comp2.head()

Now we will create a new dataframe (`props`) also containing the proportions for each GDP component (`proportion`), using method chaining to link functions together.

In [None]:
props = comp2.assign(
    proportion=comp2.groupby(["Area/Country", "year"])["2015_bn_usd"].transform(
        lambda x: x / x.sum()
    )
)

In words, we did the following: Take the `comp2` dataframe and add in a new column called `proportion` (this bit starts with `.assign(proportion=`) that, within area and year groups (`.groupby(["Area/Country", "year"])`) takes the value (`["2015_bn_usd"]`) and divides it by the total value for that group (`.transform(lambda x: x/x.sum())`). For example, the first row gives the proportion of capital for China in 1970.

The result is then saved in props. Look at the props dataframe to confirm that the above command has achieved the desired result. (You can check the answer with `props.groupby(["Area/Country", "year"])["proportion"].sum()`.)
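
The check suggested above can also be automated. The sketch below reproduces the `groupby` plus `transform` pattern on made-up numbers and confirms that the proportions sum to 1 within each group; the dataframe `toy` and the column `bn_usd` are illustrative names, not part of the project data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B", "B"],
        "year": [2015] * 6,
        "indicator": ["capital", "hh_expenditure", "net_exports"] * 2,
        "bn_usd": [20.0, 70.0, 10.0, 30.0, 90.0, -20.0],
    }
)

# Divide each value by the total of its country-year group
toy_props = toy.assign(
    proportion=toy.groupby(["country", "year"])["bn_usd"].transform(
        lambda x: x / x.sum()
    )
)

# Every group of proportions should add up to 1 (up to rounding error)
group_totals = toy_props.groupby(["country", "year"])["proportion"].sum()
print(toy_props)
print(np.allclose(group_totals, 1.0))
```

Country B has negative net exports in this toy data, so its other two proportions sum to more than 1 while the group total is still 1.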

*Plot a line chart*

Now we redo the line chart from Python Walkthrough 4.4 using the variable `props`.

In [None]:
# Update dictionary for net exports, which is new
rev_name_dict.update({"net_exports": "Net exports"})
props["Component"] = props["indicator"].map(rev_name_dict)
(
    so.Plot(props, x="year", y="proportion", color="Component", linestyle="Component")
    .facet("Area/Country")
    .add(so.Line())
    .limit(y=(0, 1))
)

***Figure 4.5** GDP component proportions over time (China and the US).*

## Python Walkthrough 4.6

**Creating stacked bar charts**

*Calculate proportion of total GDP*

This walk-through uses the following countries (chosen at random):

- developed countries: Germany, Japan, United States
- transition countries: Albania, Russian Federation, Ukraine
- developing countries: Brazil, China, India.

The relevant data are still in the `df_table` dataframe. Before we select these countries, we first calculate the required proportions for all countries for capital, final expenditure, and net exports (out of those columns).

In [None]:
columns_to_track = ["capital", "final_expenditure", "net_exports"]
countries_to_use = [
    "Germany",
    "Japan",
    "United States",
    "Albania",
    "Russian Federation",
    "Ukraine",
    "Brazil",
    "China",
    "India",
]

# Find the proportions for these columns and create new columns called "prop_" + original col name
for col in columns_to_track:
    df_table["prop_" + col] = df_table[col].divide(
        df_table[columns_to_track].sum(axis=1)
    )

# filter this down to 2015 for the countries and cols we want
cols_to_keep = ["prop_" + col for col in columns_to_track] + ["Area/Country", "year"]
df_2015 = df_table.loc[
    (df_table["Area/Country"].isin(countries_to_use)) & (df_table["year"] == 2015),
    cols_to_keep,
]
df_2015.head()

*Plot a stacked bar chart*

Now let’s create the bar chart. First we need to melt the data into a format where each row is an observation, each column a variable.

In [None]:
df_2015_melt = pd.melt(
    df_2015,
    id_vars=["Area/Country", "year"],
    value_name="Percent",
    var_name="Component",
)
df_2015_melt["Percent"] = df_2015_melt["Percent"] * 100

In [None]:
(
    so.Plot(df_2015_melt, y="Area/Country", x="Percent", color="Component")
    .add(so.Bar(), so.Stack())
    .label(
        color="Components of GDP",
    )
)

***Figure 4.6** GDP component proportions in 2015.*

Note that even when a country has a trade deficit (proportion of net exports < 0), the proportions will add up to 1, but the proportions of final expenditure and capital will add up to more than 1.

We have not yet ordered the countries so that they form the pre-specified groups. To achieve this, we need to explicitly impose an ordering on the Area/Country variable by converting this column to be of type category and setting the order of those categories.

In [None]:
df_2015_melt["Area/Country"] = pd.Categorical(
    df_2015_melt["Area/Country"], categories=countries_to_use
)

(
    so.Plot(df_2015_melt, y="Area/Country", x="Percent", color="Component")
    .add(so.Bar(), so.Stack())
    .label(
        color="Components of GDP",
    )
)
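
To see how the category order works in isolation, here is a small sketch with made-up data (the names `toy` and `desired_order` are ours). Sorting a plain text column is alphabetical, whereas sorting a categorical column follows the order in which the categories were listed.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["Brazil", "Germany", "Albania"], "value": [1, 2, 3]})
desired_order = ["Germany", "Albania", "Brazil"]

# Plain strings sort alphabetically
alphabetical = toy.sort_values("country")["country"].tolist()
print(alphabetical)

# After conversion, sorting follows the order given in `categories`
toy["country"] = pd.Categorical(toy["country"], categories=desired_order)
custom = toy.sort_values("country")["country"].tolist()
print(custom)
```

One thing to watch: any value that does not appear in `categories` (for example, a misspelt country name) is silently converted to a missing value, so it is worth checking for NaNs after the conversion.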

**FIND OUT MORE**

*The natural log: What it means, and how to calculate it in Python*

The natural log turns a linear variable into a concave variable, as shown in Figure 4.9. For any value of income on the horizontal axis, the natural log of that value on the vertical axis is smaller. At first, the difference between income and log income is not that big (for example, an income of 2 corresponds to a natural log of 0.7), but the difference becomes bigger as we move rightwards along the horizontal axis (for example, when income is 100,000, the natural log is only 11.5).

![](https://www.core-econ.org/doing-economics/book/images/web/figure-04-05.jpg)

***Figure 4.9** Comparing income with the natural logarithm of income.*

The reason why natural logs are useful in economics is because they can represent variables that have diminishing marginal returns: an additional unit of input results in a smaller increase in the total output than did the previous unit. (If you have studied production functions, then the shape of the natural log function might look familiar.)

When applied to the concept of wellbeing, the ‘input’ is income, and the ‘output’ is material wellbeing. It makes intuitive sense that a $100 increase in per capita income will have a much greater effect on wellbeing for a poor country compared to a rich country. Using the natural log of income incorporates this notion into the index we create. Conversely, the notion of diminishing marginal returns (the larger the value of the input, the smaller the contribution of an additional unit of input) is not captured by GDP per capita, which uses actual income and not its natural log. Doing so makes the assumption that a $100 increase in per capita income has the same effect on wellbeing for rich and poor countries.

The **numpy** `log` function in Python calculates the natural log of a value for you. To calculate the natural log of a value, `x`, type `np.log(x)`. If you have a scientific calculator, you can check that the calculation is correct by using the ln key (on most calculators the log key gives the base-10 logarithm instead).

Now that you know about the natural log, you might want to go back to Question 3(c) in Part 4.1, and create a new chart using the natural log scale. The natural log is used in economics because differences in logs approximate proportional changes: for small changes, log(x) – log(y) is a close approximation to the proportional change from y to x. So, using the natural log scale, you will be able to ‘read off’ the relative growth rates from the slopes of the different series you have plotted. For example, a 0.01 change in the vertical axis value corresponds to roughly a 1% change in that variable. This will allow you to compare the growth rates of the different components of GDP.
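
A quick numerical check shows how good the approximation is, and where it starts to break down. This is a sketch with made-up values; `old_value`, `new_value`, and `results` are illustrative names.

```python
import numpy as np

old_value = 100.0
results = []
for new_value in [101.0, 105.0, 150.0]:
    # Exact proportional change, e.g. 0.01 for a 1% rise
    exact_change = (new_value - old_value) / old_value
    # Difference in natural logs
    log_change = np.log(new_value) - np.log(old_value)
    results.append((exact_change, log_change))
    print(f"exact: {exact_change:.4f}  log difference: {log_change:.4f}")
```

For a 1% rise the two numbers are almost identical, while for a 50% rise the log difference (about 0.41) noticeably understates the exact change of 0.50. The approximation is therefore best suited to year-on-year growth rates rather than large jumps.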

## Python Walkthrough 4.7

**Calculating the HDI**

We will use `pd.read_excel` to import the data file, which we saved using its default name of `2020_Statistical_Annex_Table_1.xlsx` in the `data` folder within our working directory. Before importing, look at the Excel file so that you understand its structure and how it corresponds to the code options used below. It's a long way from being a neat and tidy dataset! We will save the imported data as the `df_hdr` dataframe. Having taken a look at the Excel file, we can see we should skip the first few rows and take row 1 as the header (which becomes our column names).



In [None]:
df_hdr = pd.read_excel(
    Path("data/2020_Statistical_Annex_Table_1.xlsx"), skiprows=3, header=1
)
df_hdr.head()

Looking at the `df_hdr` dataframe, there are rows that have information that isn’t data (for example, all the rows with an ‘NaN’ in), as well as variables/columns that do not contain data, or are a mix of 'NaN' and data.

Cleaning up the dataframe can be easier to do in Excel by deleting irrelevant rows and columns, but one advantage of doing it in Python is replicability. Suppose in a year’s time you carried out the analysis again with an updated spreadsheet containing new information. If you had done the cleaning in Excel, you would have to redo it from scratch, but if you had done it in Python, you can simply rerun the code below. (This works for new data that are in the same format too.)

Let's do some data cleaning.

First, we rename the last column by picking up the year entry below it in order to distinguish between different years of HDI rank. Next, we replace any columns that are "Unnamed" with entries from the first row of observations with a list comprehension. Then we eliminate rows that do not have any numbers in the `"HDI rank"` column.

In [None]:
df_hdr = df_hdr.rename(columns={"HDI rank": "HDI_rank_" + str(df_hdr.iloc[1, -1])})
df_hdr.columns = [
    df_hdr.columns[i] if "Unnamed" not in df_hdr.columns[i] else df_hdr.iloc[0, i]
    for i in range(len(df_hdr.columns))
]
df_hdr = df_hdr.loc[~pd.isna(df_hdr["HDI rank"]), :]
df_hdr.head()

Now we can eliminate rows that do not have a number in the `HDI_rank_2018` column and, following that, eliminate columns that contain NaNs.

In [None]:
df_hdr = df_hdr.loc[~pd.isna(df_hdr["HDI_rank_2018"]), :]
df_hdr = df_hdr.dropna(axis=1, how="any")
df_hdr.head()

Now let's switch to shorter column names and check what datatypes we have in our data:

In [None]:
new_column_names = [
    "hdi_rank",
    "country",
    "hdi",
    "life_exp",
    "exp_yrs_school",
    "mean_yrs_school",
    "gni_capita",
    "gni_hdi_rank",
    "hdi_rank_2018",
]
df_hdr.columns = new_column_names
df_hdr.info()

Looking at the structure of the data, we see that **pandas** thinks that all the data are objects because the original data file contained non-numerical entries (these rows have now been deleted). Apart from the `"country"` variable, which we want to be a categorical variable, all variables should be doubles or ints. Let's sort that out using a trick where we 'zip' two lists (the column names and datatypes) together into a dictionary that maps each column name to the datatype we'd like it to have.

In [None]:
new_column_datatypes = [
    "int",
    "category",
    "double",
    "double",
    "double",
    "double",
    "double",
    "int",
    "int",
]
df_hdr = df_hdr.astype({k: v for k, v in zip(new_column_names, new_column_datatypes)})
df_hdr.info()
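
The 'zip' trick is easier to follow on a small example. In the sketch below, the dataframe `toy` and the lists `names` and `dtypes` are made up for illustration; the pattern is the same as in the cell above.

```python
import pandas as pd

# Every column starts out as text, as happens after a messy import
toy = pd.DataFrame({"rank": ["1", "2"], "country": ["A", "B"], "score": ["0.9", "0.8"]})

names = ["rank", "country", "score"]
dtypes = ["int", "category", "float"]

# zip pairs the two lists element by element; dict() turns the pairs into a mapping
dtype_map = dict(zip(names, dtypes))
print(dtype_map)

# astype accepts a dictionary of column name -> datatype
toy = toy.astype(dtype_map)
print(toy.dtypes)
```

Here `dict(zip(names, dtypes))` is a shorter way of writing the dictionary comprehension `{k: v for k, v in zip(names, dtypes)}`; both produce the same mapping.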

Now we have a nice clean dataset that we can work with.

We start by calculating the three indices, using the information given. For the education index we calculate the index for expected and mean schooling separately, then take the arithmetic mean to get `i_education`. As some expected schooling observations exceed the specified ‘maximum’ value of 18, the calculated index values would be larger than 1. To avoid this, we first replace these observations with 18 to obtain an index value of 1.

In [None]:
df_hdr.loc[df_hdr["exp_yrs_school"] > 18, "exp_yrs_school"] = 18

# Now create the indices
df_hdr["i_health"] = (df_hdr["life_exp"] - 20) / (85 - 20)
df_hdr["i_education"] = (
    ((df_hdr["exp_yrs_school"] - 0) / (18 - 0))
    + (df_hdr["mean_yrs_school"] - 0) / (15 - 0)
) / 2
df_hdr["i_income"] = (np.log(df_hdr["gni_capita"]) - np.log(100)) / (
    np.log(75000) - np.log(100)
)
df_hdr["hdi_calc"] = np.power(
    df_hdr["i_health"] * df_hdr["i_education"] * df_hdr["i_income"], 1 / 3
)

Now we can compare the HDI given in the table (`hdi`) with our calculated HDI (`hdi_calc`).

In [None]:
df_hdr[["hdi", "hdi_calc"]]
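
To make the formula concrete, here is the same calculation for a single country, using the minimum and maximum values from the code above. The four input numbers are hypothetical and are not taken from the data file.

```python
import numpy as np

# Hypothetical inputs for one country (illustrative values only)
life_exp = 75.0
exp_yrs_school = 13.0
mean_yrs_school = 9.0
gni_capita = 15000.0

# Dimension indices: (actual - minimum) / (maximum - minimum)
i_health = (life_exp - 20) / (85 - 20)
i_education = (min(exp_yrs_school, 18) / 18 + mean_yrs_school / 15) / 2
i_income = (np.log(gni_capita) - np.log(100)) / (np.log(75000) - np.log(100))

# The HDI is the geometric mean of the three dimension indices
hdi_example = (i_health * i_education * i_income) ** (1 / 3)
print(round(hdi_example, 3))
```

Because the HDI is a geometric mean, it always lies between the smallest and the largest of the three dimension indices, and a low score in one dimension pulls the overall index down more than it would with an arithmetic mean.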

The HDI is one way to measure wellbeing, but you may think that it does not use the most appropriate measures for the non-material aspects of wellbeing (health and education).

Now we will use the same method to create our own index of non-material wellbeing (an ‘alternative HDI’), using different indicators. You can find alternative indicators to measure health and education on the [World Bank data bank](https://databank.worldbank.org).

https://databank.worldbank.org/Human-development-index/id/363d401b#

1. Create an alternative index of wellbeing. In particular, propose alternative Education and Health indices in (a) and (b), then combine these with the existing Income index in (c) to calculate an alternative HDI. Examine whether the changes caused substantial changes in country rankings in (d).

a) Choose two to three indicators to measure health, and two to three indicators to measure education. Explain why you have chosen these indicators.

b) Carefully merge the data into your existing data. Choose a reasonable maximum and minimum value for each indicator and justify your choices.

c) Calculate your alternative versions of the education and health dimension indices. Since you have chosen more than one indicator for this dimension, make sure to average the dimension indices as done in Question 3(b). Also ensure that higher indicator values always represent better outcomes. Now calculate the alternative HDI as done in Questions 3 and 4. Remember to combine your alternative education and health indices with the existing income index from Question 2.

d) Create a new variable showing each country’s rank according to your alternative HDI, where 1 is assigned to the country with the highest value. Compare your ranking to the HDI rank. Are the rankings generally similar, or very different? (See Python Walkthrough 4.8 on how to do this.)

## Python Walkthrough 4.8

**Creating your own HDI**

*Merge data and calculate alternative indices*

This example uses the following indicators:

- Education: Literacy rate, adult (% ages 15 and older); Gross enrolment ratio, tertiary (% of tertiary school-age population); Primary school teachers trained to teach (%)

- Health: Child malnutrition, stunting (moderate or severe) (% under age 5); Mortality rate, female adult (per 1,000 people); Mortality rate, male adult (per 1,000 people).

First, we import the data and check that it has been imported correctly. You can see that each row represents a different country, and each column represents a different year-indicator combination.

In [None]:
all_hdr = pd.read_csv(Path("data/HDR21-22_Composite_indices_complete_time_series.csv"))
all_hdr.head()

Then we follow the same process as in Python Walkthrough 4.7—getting the data for the indicators we want, reshaping it so that each indicator is in a different column (instead of a different row), and giving each indicator a shorter name. We will save this data as `hdr_wide`. Note that the variable 9999 refers to the latest year available, or the average taken over a range of years (the Excel file contains information on which year(s) were used).

We can use the convenience **pandas** function `wide_to_long`.