# Empirical Project 4

## Python-specific learning objectives

In addition to the learning objectives for this project, in this section you will learn how to convert (reshape) data from wide to long format and vice versa.

## Getting started in Python

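You will need the `pandas`, `matplotlib`, and `numpy` packages, plus `openpyxl`, which `pandas` uses to read `.xlsx` files. The other modules imported in this project (`os`, `pathlib`, and `warnings`) are part of the Python standard library.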

- Go to the United Nations’ [National Accounts Main Aggregates Database website](https://tinyco.re/7226184). On the right-hand side of the page, under ‘Data Availability’, click ‘Downloads’.
- Under the subheading ‘GDP and its breakdown at constant 2015 prices in US Dollars’, select the Excel file ‘All countries for all years – sorted alphabetically’.
- Save it in a subfolder of the directory you are coding in, such that the relative path is `data/Download-GDPconstant-USD-all.xlsx`.

## Python Walkthrough 4.1

**Importing the Excel file (`.xlsx` or `.xls`) into Python**

First, make sure you have moved the saved data to a folder called `data` that is a subfolder of your working directory. The working directory is the folder that your code 'starts' in, and the one that you open when you start Visual Studio Code. Let's say you called it `core`; then the file and folder structure of your working directory would look like this:

```bash
📁 core
├──📁 data
│   └──Download-GDPconstant-USD-all.xlsx
└──empirical_project_4.py
```

This is similar to what you should see in Visual Studio Code under the explorer tab (although the working directory, `core`, won't appear). You can check your working directory by running

```python
import os
os.getcwd()
```

in Visual Studio Code.
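
As a rough sketch, the same check can be done with `pathlib`, which also lets you confirm that the data file sits where the code expects it. The variable names `working_dir`, `data_file`, and `file_found` are chosen here for illustration only.

```python
from pathlib import Path

# Path.cwd() returns the current working directory, like os.getcwd()
working_dir = Path.cwd()

# Build the relative path used in this project and check whether the file exists
data_file = Path("data") / "Download-GDPconstant-USD-all.xlsx"
file_found = data_file.exists()

print(f"Working directory: {working_dir}")
print(f"Data file found: {file_found}")
```

If `file_found` is `False`, the most likely cause is that the working directory is not the folder that contains `data`.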

Next, we need to import the packages and settings we'll be using:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings

# Set the plot style for prettier charts:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

Before importing the file into Python, open the file in Excel, OpenOffice, LibreOffice, or Numbers to see how the data is organized in the spreadsheet, and note that:

- There is a heading that we don’t need, followed by a blank row.
- The data we need starts on row three.

Armed with this knowledge, we can import the data, using the `Path` class from the `pathlib` module to create the path to the data and `skiprows=2` to skip the two rows we don't need:

In [None]:
df = pd.read_excel(Path("data/Download-GDPconstant-USD-all.xlsx"), skiprows=2)
df.head()
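
To see what `skiprows` does without needing the Excel file, here is a minimal sketch that uses `pd.read_csv` on a made-up text file held in memory. This is a stand-in for `pd.read_excel`, which accepts a `skiprows` argument too; the file contents and the name `toy` are invented for illustration.

```python
import io

import pandas as pd

# Made-up file contents: two lines of preamble, then the real header row
raw_text = (
    "A title we do not need\n"
    "Units: made-up numbers\n"
    "Country,1970,1971\n"
    "A,1.0,2.0\n"
    "B,3.0,4.0\n"
)

# Skip the first two lines so that the third line is used as the header
toy = pd.read_csv(io.StringIO(raw_text), skiprows=2)
print(toy)
```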

## Python Walkthrough 4.2

**Making a frequency table**

We want to create a table showing how many years of `Final consumption expenditure` data are available for each country.

Looking at the dataset’s current format, you can see that countries and indicators (for example, `Afghanistan` and `Final consumption expenditure`) are row variables, while year is the column variable. This data is organized in ‘wide’ format (each individual’s information is in a single row).

For many data operations and for making charts, it is more convenient to have indicators as column variables, so we would like `Final consumption expenditure` to be a column variable, and year to be the row variable. Each observation would represent the value of an indicator for a particular country and year. This data is organized in ‘long’ format (each individual’s information is in multiple rows). This is also called 'tidy' data, and it can be recognised by having one variable per column and one observation per row. Many data scientists consider keeping data in tidy format good practice.

To change data from wide to long format, we use the `pd.melt` function. `melt` is very useful, as you will find that many large datasets are in wide format. In this case, `pd.melt` takes every column not listed in `id_vars` (a list of column names) and uses them to create two new columns:

- The first contains the former column names, which are the years here. We set this new column's name with `var_name="year"`.
- The second contains the values that were in the columns we unpivoted, and is automatically given the name `value`. (We could have chosen a different name for this column by passing `value_name=` too.)

Compare `df_long` to the wider `df` to understand how the melt command works. To learn more about organizing data in Python, see the [Working with Data](https://aeturrell.github.io/coding-for-economists/data-intro.html) section of 'Coding for Economists'. 

In [None]:
df_long = pd.melt(df, id_vars=["Area/CountryID", "Area/Country", "IndicatorName"], var_name="year")
df_long.head()
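
The reshaping step can be sketched on a tiny made-up dataset, which makes it easier to see where each value ends up. The column names (`country`, `indicator`, `amount`) and the numbers are invented for illustration and are not those of the UN file.

```python
import pandas as pd

# Wide format: one row per country and indicator, one column per year
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "indicator": ["consumption", "consumption"],
        1970: [1.0, 3.0],
        1971: [2.0, None],  # a missing value, to show that melt keeps it
    }
)

# Long format: one row per country, indicator, and year
long = pd.melt(
    wide,
    id_vars=["country", "indicator"],
    var_name="year",
    value_name="amount",
)
print(long)
```

Note that the row with the missing value is kept in the long data, so the long table has one row for every combination of country and year.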

To create the required table, we only need `Final consumption expenditure` for each country, which we extract using the `.loc` indexer. We'd like all columns, so we pass the row condition in the first position of `.loc` and leave the second entry as `:` to select every column.

In [None]:
cons = df_long.loc[df_long["IndicatorName"] == "Final consumption expenditure", :]
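
A short sketch of the two positions of `.loc` on made-up data: the first selects rows (here with a True/False condition), and the second selects columns. All names and numbers below are invented for illustration.

```python
import pandas as pd

toy_long = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "indicator": ["consumption", "imports", "consumption", "imports"],
        "value": [1.0, 5.0, 3.0, 7.0],
    }
)

# A boolean Series: True for the rows we want to keep
is_consumption = toy_long["indicator"] == "consumption"

# ':' in the second position keeps every column
all_columns = toy_long.loc[is_consumption, :]

# A list in the second position keeps only the named columns
two_columns = toy_long.loc[is_consumption, ["country", "value"]]
print(all_columns)
print(two_columns)
```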

Now let's create our table. 

In [None]:
year_count = (
    cons
    .groupby("Area/Country")
    .agg(available_years=("value", "count"))
    )
year_count

Translating the code into words: take the variable `cons` and group the observations by country or area (`.groupby("Area/Country")`), then aggregate the result (`.agg`) so that a new variable called `available_years` is created, holding the number of non-missing entries in the `value` column (`("value", "count")`). We count `value` rather than `year` because `pd.melt` keeps rows whose value is missing, so the `year` column is never empty and counting it would give the same number for every country.

Now we can establish how many of the 250 countries and areas in the dataset have complete information. A country's data is complete if it has the maximum number of available observations (given by `year_count["available_years"].max()`).

In [None]:
sum(year_count["available_years"] == year_count["available_years"].max())

The result is the number of countries and areas for which `Final consumption expenditure` is available in every year; any country or area with fewer available years than the maximum has missing data.
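
To illustrate what `"count"` does inside `.agg`, here is a minimal sketch on made-up data. It counts the non-missing entries of the named column within each group, so the choice of column matters. The names `toy_cons`, `rows`, and `non_missing` are invented for illustration.

```python
import numpy as np
import pandas as pd

toy_cons = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B", "B"],
        "year": [1970, 1971, 1972, 1970, 1971, 1972],
        "value": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],  # B is missing 1970
    }
)

counts = toy_cons.groupby("country").agg(
    rows=("year", "count"),          # year is never missing: counts all rows
    non_missing=("value", "count"),  # counts only rows with a value
)
print(counts)

# Number of countries with the maximum number of non-missing values
n_complete = int((counts["non_missing"] == counts["non_missing"].max()).sum())
print(n_complete)
```

In this toy example, both countries have three rows, but only country A has a value in every year.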

## Python Walkthrough 4.3

**Creating new variables**

We will use Brazil, the US, and China as examples.

Before we select these three countries, we will calculate net exports (exports minus imports) for all countries, as we need that information in Python Walkthrough 4.4. We will also shorten the names of the variables we need, to make the code easier to read. We will use a dictionary to map the long names to shorter ones. A dictionary is a built-in object type in Python and always has the structure `{key1: value1, key2: value2, ...}`, where the values can be of any type (e.g. string, int, dataframe) and the keys can be of any hashable type (e.g. string, int). In our case, both keys and values will be strings. We will use a naming convention known as "snake case": all lower case, with spaces replaced by underscores (it looks a bit like a snake!). There are packages that will auto-rename long variables for you, but let's see how to do it manually here.

In [None]:
short_names_dict = {
    "Final consumption expenditure": "final_expenditure",
    "Household consumption expenditure (including Non-profit institutions serving households)": "hh_expenditure",
    "General government final consumption expenditure": "gov_expenditure",
    "Gross capital formation": "capital",
    "Imports of goods and services": "imports",
    "Exports of goods and services": "exports",
    }
# rename these values
df_long["IndicatorName"] = df_long["IndicatorName"].replace(short_names_dict)
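
As a small sketch of how a dictionary drives `.replace`, the made-up example below renames two values and leaves a third untouched because it is not a key in the dictionary. The helper `to_snake_case` is hypothetical and only shows what the naming convention means; it is not part of `pandas`.

```python
import pandas as pd

indicators = pd.Series(
    [
        "Imports of goods and services",
        "Exports of goods and services",
        "Gross Domestic Product (GDP)",
    ]
)

# Keys are the existing values, values are their replacements
toy_names = {
    "Imports of goods and services": "imports",
    "Exports of goods and services": "exports",
}

# Entries that are not keys in the dictionary are left as they are
renamed = indicators.replace(toy_names)
print(renamed.tolist())


def to_snake_case(name: str) -> str:
    """Hypothetical helper: lower case, with spaces replaced by underscores."""
    return "_".join(name.lower().split())


print(to_snake_case("Gross capital formation"))
```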


`df_long` still has several rows for a particular country and year (one for each indicator). We will reshape this data using the `.pivot` method to ensure that we have only one row per country/area and per year. Note that `pivot` preserves the list of columns we pass as the `index=` and pivots the columns we pass to `columns=` out so that they are wide.

In [None]:
df_table = df_long.pivot(index=["Area/CountryID", "Area/Country", "year"], columns=["IndicatorName"])
df_table.head()
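
The sketch below, on made-up data, shows why the pivoted table has two levels of column names: when no `values=` argument is given, `pivot` keeps the name of the remaining column (`value`) as a top level. Passing `values="value"` is one possible alternative that gives a single level straight away. All names and numbers are invented for illustration.

```python
import pandas as pd

toy_long = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [1970, 1970, 1970, 1970],
        "indicator": ["exports", "imports", "exports", "imports"],
        "value": [10.0, 4.0, 6.0, 9.0],
    }
)

# Without values=, the columns have two levels: ('value', 'exports'), ...
nested = toy_long.pivot(index=["country", "year"], columns="indicator")
print(nested.columns.tolist())

# With values=, the columns have one level: 'exports', 'imports'
flat = toy_long.pivot(index=["country", "year"], columns="indicator", values="value")
print(flat.columns.tolist())
```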

Now we create a `net_exports` column from the existing columns (exports minus imports). Because each row is now a unique combination of country/area and year, the subtraction is done separately for each country and year. First we need to drop the top level of the column index, which is currently called `value`, as we don't need it anymore. This will allow for direct access to the `exports` and `imports` columns. We'll also reset the index to row numbers rather than the three columns we used in the pivot, and remove the name of the columns, as we won't need that any longer.

In [None]:
df_table.columns = df_table.columns.droplevel()
df_table = df_table.reset_index()
df_table.columns.name = ""
df_table["net_exports"] = df_table["exports"] - df_table["imports"]
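
These tidying steps can be followed one at a time on a made-up table with two countries and one year. The names and numbers are invented for illustration, and `None` is used here to clear the columns' name.

```python
import pandas as pd

toy_long = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [1970, 1970, 1970, 1970],
        "indicator": ["exports", "imports", "exports", "imports"],
        "value": [10.0, 4.0, 6.0, 9.0],
    }
)
toy_table = toy_long.pivot(index=["country", "year"], columns="indicator")

# Drop the top column level ('value') so columns are just 'exports', 'imports'
toy_table.columns = toy_table.columns.droplevel(0)

# Turn the index (country, year) back into ordinary columns
toy_table = toy_table.reset_index()

# Clear the leftover name of the column index ('indicator')
toy_table.columns.name = None

# One row per country and year, so this subtracts row by row
toy_table["net_exports"] = toy_table["exports"] - toy_table["imports"]
print(toy_table)
```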

Let us select our three chosen countries to check that we calculated net exports correctly.

In [None]:
sel_countries = ["Brazil", "United States", "China"]
cols_to_keep = ["Area/Country", "year", "exports", "imports", "net_exports"]

df_sel_un = df_table.loc[df_table["Area/Country"].isin(sel_countries), cols_to_keep]
df_sel_un.head()
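
As a brief sketch of the selection step, `.isin` returns `True` for each row whose entry appears in the list, and that True/False series can be passed to `.loc` together with a list of columns. The data below are made up for illustration.

```python
import pandas as pd

toy_table = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "exports": [10.0, 6.0, 8.0, 2.0],
        "imports": [4.0, 9.0, 8.0, 1.0],
    }
)

chosen = ["A", "C"]

# True where the country is one of the chosen ones
mask = toy_table["country"].isin(chosen)

# Keep the chosen rows and only the listed columns
selected = toy_table.loc[mask, ["country", "exports"]]
print(selected)
```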