# Empirical Project 8

---
**Download the code**

To download the code used in this project as a notebook that can be run in Visual Studio Code, Google Colab, or Jupyter Notebook, right click [here]() and select 'Save Link As', then save it as a `.ipynb` file.

Don’t forget to also download the data into your working directory by following the steps in this project.

---

## Getting started in Python

For this project, you will need the following packages:

- **pandas** for data analysis
- **matplotlib** for data visualisation
- **numpy** for numerical methods

You'll also be using the **warnings** and **pathlib** packages, but these come built-in with Python.

Remember, you can install packages in Visual Studio Code's integrated terminal (click "View > Terminal") by running `conda install packagename` (if using the Anaconda distribution of Python) or `pip install packagename` if not.

Once you have the Python packages installed, you will need to import them into your Python session and configure any other initial settings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings

# Set the plot style for prettier charts:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## Python Walkthrough 8.1

**Importing data into Python**

As we are importing an Excel file, we use the `pd.read_excel` function provided by the **pandas** package. The file is called Project-8-datafile.xlsx and is saved into a subfolder of our working directory called 'data'. The file contains four worksheets that contain the data, named ‘Wave 1’ through to ‘Wave 4’. We will load the worksheets one-by-one and add them to the previous worksheets using the `pd.concat` function, which combines dataframes together. 

The final output is called `lifesat_data`.

In [None]:
list_of_sheetnames = ["Wave " + str(i) for i in range(1, 5)]
list_of_dataframes = [
    pd.read_excel(Path("data/Project-8-datafile.xlsx"), sheet_name=x)
    for x in list_of_sheetnames
]
lifesat_data = pd.concat(list_of_dataframes, axis=0)
lifesat_data.head()
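
As a small aside on how `pd.concat` stacks dataframes, here is a minimal sketch using two made-up dataframes in place of the worksheets. The column names `wave` and `score` and all the values are invented for illustration. It shows that, with `axis=0`, each dataframe keeps its own row labels unless you pass `ignore_index=True`.

```python
import pandas as pd

# Two toy "waves" standing in for separate worksheets (made-up values)
wave_1 = pd.DataFrame({"wave": ["Wave 1", "Wave 1"], "score": [7, 9]})
wave_2 = pd.DataFrame({"wave": ["Wave 2", "Wave 2"], "score": [5, 8]})

# axis=0 stacks the rows; each dataframe keeps its own row labels
stacked = pd.concat([wave_1, wave_2], axis=0)

# ignore_index=True renumbers the rows from 0 upwards instead
stacked_renumbered = pd.concat([wave_1, wave_2], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(stacked_renumbered.index.tolist())
```

Repeated row labels are harmless for the steps in this project, but renumbering can be convenient if you later want to select rows by label.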

The variable names provided in the spreadsheet are not very specific (a combination of letters and numbers that don’t tell us what the variable measures). To make it easier to keep track we can use a multi-index for our columns; this is an index with more than one entry per column. We will create a multi-index that includes the original codes, then has labels, and then has a short description.

Using a multi-index for columns does come with some downsides, as we'll see later.

To create a multi-index, we're going to create a type of Python object called a `tuple`, which is like a list but has curvy brackets instead of square brackets. Tuples (curvy brackets) and lists (square brackets) are more similar than they are different. You can use a list comprehension of the form `[x+1 for x in lots_of_xs]` to generate a list, or `tuple(x+1 for x in lots_of_xs)` to generate a tuple (curvy brackets on their own, `(x+1 for x in lots_of_xs)`, create a generator rather than a tuple). The difference is that lists are mutable: you can change them once they've been created. Tuples are immutable: once they've been created, they're frozen. Both types have their uses but multi-indexes and multi-layered columns in **pandas** use tuples.
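
To make the list and tuple distinction concrete, here is a minimal sketch using a made-up list called `lots_of_xs`. One detail worth checking for yourself: a comprehension wrapped only in curvy brackets gives a generator, so `tuple(...)` is needed to get an actual tuple.

```python
lots_of_xs = [1, 2, 3]

a_list = [x + 1 for x in lots_of_xs]        # list comprehension
a_generator = (x + 1 for x in lots_of_xs)   # generator expression, not a tuple
a_tuple = tuple(x + 1 for x in lots_of_xs)  # tuple built from a generator

a_list[0] = 100  # fine: lists are mutable

try:
    a_tuple[0] = 100  # tuples are immutable, so this raises a TypeError
except TypeError as err:
    tuple_error = str(err)

print(a_list)
print(a_tuple)
print(type(a_generator).__name__)
print(tuple_error)
```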

We'll zip up (using the `zip` function) the three lists of details: codenames (from the columns), labels, and short descriptions into a tuple. Each entry will look, for example, like `("A009", "Health", "State of health (subjective)")`.

In [None]:
labels = [
    "EVS-wave",
    "Country/region",
    "Respondent number",
    "Health",
    "Life satisfaction",
    "Work Q1",
    "Work Q2",
    "Work Q3",
    "Work Q4",
    "Work Q5",
    "Sex",
    "Age",
    "Marital status",
    "Number of children",
    "Education",
    "Employment",
    "Monthly household income",
]

short_description = [
    "EVS-wave",
    "Country/region",
    "Original respondent number",
    "State of health (subjective)",
    "Satisfaction with your life",
    "To develop talents you need to have a job",
    "Humiliating to receive money w/o working for it",
    "People who don't work become lazy",
    "Work is a duty towards society",
    "Work comes first even if it means less spare time",
    "Sex",
    "Age",
    "Marital status",
    "How many living children do you have",
    "Educational level (ISCED-code one digit)",
    "Employment status",
    "Monthly household income (x 1,000s PPP euros)",
]

index = pd.MultiIndex.from_tuples(
    tuple(zip(lifesat_data.columns, labels, short_description)),
    names=["code", "label", "description"],
)

index

Now we can replace the original columns with this multi-index, which is more informative than the code names alone.

In [None]:
lifesat_data.columns = index
lifesat_data.head()

This is mostly for convenience, but we can still look at individual columns just as we did before (using the codes):

In [None]:
lifesat_data["S003"].head()

Throughout this project we will refer to the variables using their original names, but you can see the extra info at the top of the dataframe when you need to.
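
To see how selection works with multi-level columns, here is a minimal sketch on a toy dataframe with two columns and invented values. It is an illustration under those assumptions, not the project data. Selecting by the top-level code alone returns a dataframe that still carries the remaining two levels, while selecting with the full three-part tuple returns a single series.

```python
import pandas as pd

# Toy inputs: two codes with their labels and descriptions
codes = ["A009", "A170"]
toy_labels = ["Health", "Life satisfaction"]
toy_descriptions = ["State of health (subjective)", "Satisfaction with your life"]

toy_index = pd.MultiIndex.from_tuples(
    list(zip(codes, toy_labels, toy_descriptions)),
    names=["code", "label", "description"],
)

# Made-up rows purely for illustration
toy = pd.DataFrame([["Good", 8], ["Fair", 5]], columns=toy_index)

# Top-level code only: a dataframe whose columns keep the other two levels
by_code = toy["A170"]

# Full three-part tuple: a single series
by_tuple = toy[("A170", "Life satisfaction", "Satisfaction with your life")]

print(type(by_code).__name__, type(by_tuple).__name__)
print(by_code.columns[0])
```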

## Python Walkthrough 8.2

**Cleaning data and splitting variables**

*Inspect the data and recode missing values*

Python's **pandas** package stores variables as different types depending on the kind of information the variable represents. Categorical data, where, as the name suggests, data is divided into a number of groups such as country or occupation, can be stored as the `"category"` type. Numerical data (numbers that do not represent categories) can be stored as integers, `"int"`, or real numbers, usually `"float64"` (double precision). There are other datatypes too, for example `"datetime64[ns]"` for datetimes in nanosecond increments. Text is of type `"string"`. There's also a 'not quite sure' datatype, `"object"`, which is typically used for data that doesn't clearly fall into a bucket.

However, **pandas** is quite conservative about deciding on data types for you, so you do have to be careful to check the datatypes are what you want when they are read in. The classic example is of numbers being read in as type `"object"`.
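
The 'numbers read in as object' situation can be reproduced on a toy series. The sketch below uses made-up values, with one stray text entry mixed in among the numbers, and shows one possible way (among several) to recover a numeric column using `pd.to_numeric`.

```python
import pandas as pd

# Made-up column: mostly numbers, but one text entry forces the object dtype
raw = pd.Series([7, 3, ".a", 10])
print(raw.dtype)

# errors="coerce" turns anything that cannot be parsed into a missing value
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype)

# Optionally convert to a nullable integer type that can hold missing values
as_int = numeric.astype("Int32")
print(as_int.dtype)
```

Note that `errors="coerce"` silently discards anything it cannot parse, so it is worth checking which values were lost before relying on it.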

The `.info()` method tells us what data types are being used in a **pandas** dataframe:

In [None]:
lifesat_data.info()

We have a lot of `"object"` columns, so it's clear that a lot of the columns haven't been read in as what they should be.

Looking back at our data, we can see that there are a LOT of `".a"` values and, reading the original data source, it looks like these represent missing values. Let's replace those with the proper missing value indicator, `pd.NA`.

In [None]:
lifesat_data = lifesat_data.replace(".a", pd.NA)
lifesat_data.head()

This isn't the only way to deal with those pesky `".a"` values. When we read each file in, we could have replaced the value for missing data used in the file, `".a"`, with the built-in **pandas** representation of missing numbers. This is achieved via the `na_values=".a"` keyword in the `pd.read_excel` function.
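
Here is a minimal sketch of what the `na_values` keyword does. To keep it self-contained, it reads a small made-up CSV from memory with `pd.read_csv` rather than the project's Excel file; `pd.read_excel` accepts the same `na_values` keyword. The column names and values below are invented for illustration.

```python
import io

import pandas as pd

# A tiny in-memory file that uses ".a" to mark missing entries (made-up data)
csv_text = "col_one,col_two\n7,2\n.a,1\n9,.a\n"

# na_values tells pandas to treat ".a" as missing while reading
toy = pd.read_csv(io.StringIO(csv_text), na_values=".a")

print(toy.isna().sum())
print(toy.dtypes)
```

Because the missing-value marker is handled at read time, the columns arrive as numbers rather than as `"object"` columns containing a mix of numbers and text.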

*Recode the life satisfaction variable*

To recode the life satisfaction variable (`"A170"`), we can use a dictionary to map ‘Dissatisfied’ to 1 and ‘Satisfied’ to 10. This variable was imported as an object column. After changing the text into numerical values, we use the `astype("Int32")` method to convert the variable into a 32-bit integer (these can represent any integer between $-2^{31}$ and $2^{31} - 1$).

Note that when using `.astype` below, we need to specify the complete column information (all three levels). This is so that when there are multiple columns beneath a higher-level column, there isn't any ambiguity as to what the operation should be performed on. The extra complexity, which is avoided if you just have one layer of column names, is one disadvantage of multi-level column indexes.

In [None]:
col_name_tuple = ("A170", "Life satisfaction", "Satisfaction with your life")

lifesat_data[col_name_tuple] = (
    lifesat_data[col_name_tuple]
    .replace({"Satisfied": 10, "Dissatisfied": 1})
    .astype("Int32")
)
lifesat_data["A170"].info()
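
As a quick check on the `"Int32"` type used above, the sketch below looks up the range of a 32-bit integer with **numpy** and shows, on a made-up series, that the capitalised **pandas** type `"Int32"` can hold missing values alongside integers.

```python
import numpy as np
import pandas as pd

# Smallest and largest values a 32-bit signed integer can represent
limits = np.iinfo(np.int32)
print(limits.min, limits.max)

# The nullable "Int32" type can store pd.NA next to ordinary integers
scores = pd.Series([10, 1, pd.NA], dtype="Int32")
print(scores.dtype)
print(scores.isna().sum())
```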

*Recode the variable for number of children*

We repeat this process for the variable indicating the number of children (`"X011_01"`).

In [None]:
col_name_tuple = (
    "X011_01",
    "Number of children",
    "How many living children do you have",
)

lifesat_data[col_name_tuple] = (
    lifesat_data[col_name_tuple].replace({"No children": 0}).astype("Int32")
)

*Replace text with numbers for multiple variables*

When we have to recode multiple variables with the same mapping of text to numerical value, we can take a short-cut and recode several columns at once. We can even do this without writing out the full column names by hand because we know that, at least in our case, no column name is repeated at any level.

To walk you through the trick, we're going to first take a column code name (the highest level), for example `"C036"`, and use that to recover the tuple (like a list but with curvy brackets instead of square ones) for the other two parts of the multi-column index like so:

In [None]:
column_code = "C036"
(column_code,) + lifesat_data[column_code].columns[0]
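
The expression above relies on tuple concatenation. Here is a minimal sketch of that step in isolation, with the label and description typed in by hand (the pairing of `"C036"` with these two strings is assumed from the order of the lists defined earlier). The trailing comma in `(code,)` matters: without it, the brackets are just grouping and you get a plain string.

```python
code = "C036"
# Assumed lower two levels for this column, written out by hand
rest = ("Work Q1", "To develop talents you need to have a job")

one_item_tuple = (code,)  # trailing comma makes a tuple with one entry
just_a_string = (code)    # no comma: still a string

# Adding two tuples joins them end to end
full = one_item_tuple + rest

print(type(one_item_tuple).__name__, type(just_a_string).__name__)
print(full)
```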

Now we can easily pin-point every column, at all three levels, we wish to convert by creating a list of column-tuples:

In [None]:
col_codes = ["C036", "C037", "C038", "C039", "C041"]

all_cols_at_all_levels = [
    (column_code,) + lifesat_data[column_code].columns[0] for column_code in col_codes
]

lifesat_data[all_cols_at_all_levels] = (
    lifesat_data[all_cols_at_all_levels]
    .replace(
        {
            "Strongly disagree": 1,
            "Disagree": 2,
            "Neither agree nor disagree": 3,
            "Agree": 4,
            "Strongly agree": 5,
        }
    )
    .astype("Int32")
)

# This one needs a different mapping

health_code = "A009"
health_tuple = (health_code,) + lifesat_data[health_code].columns[0]
lifesat_data[health_tuple] = (
    lifesat_data[health_tuple]
    .replace({"Very poor": 1, "Poor": 2, "Fair": 3, "Good": 4, "Very good": 5})
    .astype("Int32")
)
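
One thing to watch with dictionary-based recoding is that `.replace` leaves any text it does not recognise unchanged, in which case the later `.astype("Int32")` step should fail. The sketch below, on a made-up series containing one unexpected answer, shows a simple way to list unmapped values first. It also uses `.map` as an alternative (a swapped-in technique, not the one used above), which turns unmapped values into missing values instead.

```python
import pandas as pd

mapping = {"Very poor": 1, "Poor": 2, "Fair": 3, "Good": 4, "Very good": 5}

# Made-up answers, including one that is not in the mapping
answers = pd.Series(["Good", "Fair", "Excellent"])

# Any values present in the data but absent from the mapping
unmapped = set(answers.dropna().unique()) - set(mapping)
print(unmapped)

# .map converts unmapped values to missing rather than leaving the text in place
recoded = answers.map(mapping).astype("Int32")
print(recoded)
```

Whether silently converting unexpected answers to missing values is acceptable depends on the analysis, so listing them first is the safer habit.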

*Split a variable containing numbers and text*

To split the education variable `"X025A"` into two new columns, we use the `.str.split` method with `expand=True`, which returns one column containing the numeric value and another containing the text description. We will store these under new column names ending in `_num` and `_school` respectively. Note that both parts come back as text, so the numeric part would still need converting before it can be used as a number.

Because we're still using a multi-layered column system, we'll need to specify precisely which combination of column names we're using in a tuple (as we just did above):

In [None]:
education_code = "X025A"
educ_tuple = (education_code,) + lifesat_data[education_code].columns[0]
new_col_a = educ_tuple
lifesat_data[educ_tuple].str.split(" : ", expand=True)

Let's do this again but save it back into our dataframe under two new tuples. We'll pass these back in a list. First the tuples:

In [None]:
ed_num_tuple = tuple(col + "_num" for col in educ_tuple)
ed_sch_tuple = tuple(col + "_school" for col in educ_tuple)

print(ed_num_tuple)
print(ed_sch_tuple)

Now pass them back in as a list (note the extra square brackets on the left-hand side) of tuples:

In [None]:
lifesat_data[[ed_num_tuple, ed_sch_tuple]] = (
    lifesat_data[educ_tuple]
    .str
    .split(" : ", expand=True)
)
lifesat_data
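
To show the split on its own, here is a minimal sketch on a made-up series in the same `number : text` format (the education descriptions below are invented, not taken from the data). It also converts the numeric part to a nullable integer, which is one possible way to finish the job since `.str.split` returns text.

```python
import numpy as np
import pandas as pd

# Made-up entries in "number : description" form, plus one missing value
education = pd.Series(["3 : some schooling", "5 : more schooling", np.nan])

# expand=True returns a dataframe with one column per piece
parts = education.str.split(" : ", expand=True)
print(parts)

# The first piece is still text, so convert it to a nullable integer
level = pd.to_numeric(parts[0], errors="coerce").astype("Int32")
print(level)
```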