# Empirical Project 8

---
**Download the code**

To download the code used in this project as a notebook that can be run in Visual Studio Code, Google Colab, or Jupyter Notebook, right-click [here]() and select 'Save Link As', then save it as an `.ipynb` file.

Don’t forget to also download the data into your working directory by following the steps in this project.

---

## Getting started in Python

For this project, you will need the following packages:

- **pandas** for data analysis
- **matplotlib** for data visualisation
- **numpy** for numerical methods

You'll also use the **warnings** and **pathlib** modules, which are part of Python's standard library and so need no separate installation.

Remember, you can install packages in Visual Studio Code's integrated terminal (click "View > Terminal") by running `conda install packagename` (if using the Anaconda distribution of Python) or `pip install packagename` if not.
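
Before moving on, it can help to check which packages Python can already find. The snippet below is a minimal sketch rather than part of the project code: the helper names `is_installed` and `missing_packages` are hypothetical, and including **openpyxl** in the list is our assumption (it is not listed above, but pandas typically relies on it to read `.xlsx` files).

```python
from importlib.util import find_spec

# Packages named in this project. "openpyxl" is an assumption: it is not
# listed in the project text, but pandas usually needs it to read .xlsx files.
REQUIRED = ["pandas", "matplotlib", "numpy", "openpyxl"]


def is_installed(package_name):
    """Return True if Python can locate the top-level package (without importing it)."""
    return find_spec(package_name) is not None


def missing_packages(names):
    """Return the subset of names that Python cannot locate."""
    return [name for name in names if not is_installed(name)]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Any package reported as missing can then be installed with `conda install packagename` or `pip install packagename`, as described above.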

Once the packages are installed, you will need to import them into your Python session and configure any other initial settings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings

# Set the plot style for prettier charts:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150

# Suppress warnings to keep the output tidy
warnings.simplefilter("ignore")

## Python Walkthrough 8.1

**Importing data into Python**

As we are importing an Excel file, we use the `pd.read_excel` function provided by the **pandas** package. The file is called `Project-8-datafile.xlsx` and is saved in a subfolder of our working directory called `data`. It contains four worksheets holding the data, named ‘Wave 1’ through to ‘Wave 4’. We will read the worksheets one by one into a list of dataframes, then stack them using the `pd.concat` function, which combines dataframes. The final output is called `lifesat_data`.

In [None]:
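# Worksheet names: "Wave 1" to "Wave 4"
list_of_sheetnames = ["Wave " + str(i) for i in range(1, 5)]
# Read each worksheet into its own dataframe
list_of_dataframes = [pd.read_excel(Path("data/Project-8-datafile.xlsx"), sheet_name=x) for x in list_of_sheetnames]
# Stack the dataframes on top of one another (row-wise)
lifesat_data = pd.concat(list_of_dataframes, axis=0)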
lifesat_data.head()
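
To see what `pd.concat` does with `axis=0`, here is a small sketch using two made-up dataframes in place of the Excel worksheets. The column names `wave` and `score` and all values are hypothetical, chosen only for illustration. Note that each dataframe keeps its own row labels when stacked, so the combined index contains repeats unless you pass `ignore_index=True`.

```python
import pandas as pd

# Two tiny made-up "waves" standing in for the Excel worksheets
# (hypothetical column names and values, for illustration only)
toy_wave_1 = pd.DataFrame({"wave": [1, 1], "score": [7, 8]})
toy_wave_2 = pd.DataFrame({"wave": [2, 2], "score": [6, 9]})

# axis=0 stacks the dataframes row-wise, keeping each one's row labels
toy_stacked = pd.concat([toy_wave_1, toy_wave_2], axis=0)
print(toy_stacked.shape)
print(toy_stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 upwards instead
toy_renumbered = pd.concat([toy_wave_1, toy_wave_2], axis=0, ignore_index=True)
print(toy_renumbered.index.tolist())
```

The walkthrough code above keeps the original row labels; whether to renumber them is a design choice that depends on how you plan to select rows later.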

The variable names provided in the spreadsheet are not very specific (a combination of letters and numbers that don’t tell us what the variable measures). To make it easier to keep track, we can use a multi-index for our columns; this is an index with more than one entry per column. We will create a multi-index with three levels: the original code, a label, and a short description.

To create a multi-index, we're going to use a type of Python object called a `tuple`, which is like a list but is written with round brackets (parentheses) instead of square brackets and cannot be changed once created. Using the `zip` function, we'll zip up the three lists of details (codenames from the columns, labels, and short descriptions) so that each column gets one tuple. Each entry will look, for example, like `("A009", "Health", "State of health (subjective)")`.
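
As a minimal sketch of how `zip` pairs up the lists, the example below uses three short lists. The variable names are hypothetical (chosen so they do not overwrite anything in the walkthrough), and `"CODE_SEX"` and `"CODE_AGE"` are placeholder codes, not the real ones from the data file.

```python
# Three short example lists; "CODE_SEX" and "CODE_AGE" are placeholders,
# not the real codes from the data file.
example_codes = ["A009", "CODE_SEX", "CODE_AGE"]
example_labels = ["Health", "Sex", "Age"]
example_descriptions = ["State of health (subjective)", "Sex", "Age"]

# zip stops at the shortest input, so check the lengths match first
assert len(example_codes) == len(example_labels) == len(example_descriptions)

# zip takes the first item of each list, then the second, and so on;
# wrapping it in tuple() gives a tuple of tuples
example_tuples = tuple(zip(example_codes, example_labels, example_descriptions))
print(example_tuples[0])

# Unlike lists, tuples cannot be modified in place
try:
    example_tuples[0][1] = "Wellbeing"
except TypeError as error:
    print("Cannot modify a tuple:", error)
```

The length check matters because `zip` silently drops any leftover items when one list is longer than the others, which would leave some columns without labels.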

In [None]:
labels = ["EVS-wave", "Country/region", "Respondent number", "Health", "Life satisfaction",
    "Work Q1", "Work Q2", "Work Q3", "Work Q4", "Work Q5", "Sex", "Age", "Marital status", "Number of children",
    "Education", "Employment", "Monthly household income"]

short_description = ["EVS-wave",
    "Country/region",
    "Original respondent number",
    "State of health (subjective)",
    "Satisfaction with your life",
    "To develop talents you need to have a job",
    "Humiliating to receive money w/o working for it",
    "People who don't work become lazy",
    "Work is a duty towards society",
    "Work comes first even if it means less spare time",
    "Sex",
    "Age",
    "Marital status",
    "How many living children do you have",
    "Educational level (ISCED-code one digit)",
    "Employment status",
    "Monthly household income (x 1,000s PPP euros)"]

# Combine code, label, and description into one tuple per column
index = pd.MultiIndex.from_tuples(
    tuple(zip(lifesat_data.columns, labels, short_description)),
    names=["code", "label", "description"],
)

index

Now we can replace the original columns with this multi-index, which is more informative than the code names alone.

In [None]:
lifesat_data.columns = index
lifesat_data.head()

The extra levels are mostly for convenience; we can still select individual columns using their codes alone:

In [None]:
lifesat_data["S003"].head()

Throughout this project we will refer to the variables using their original names (the codes), but the labels and descriptions appear at the top of the dataframe whenever you need them.
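
If you would rather select a column by its label, or drop the extra levels altogether, the sketch below shows one way to do it on a small made-up dataframe. The values, the `example_` names, and the placeholder code `"CODE_AGE"` are all hypothetical; the level names match those used in the walkthrough.

```python
import pandas as pd

# A small made-up dataframe with the same three column levels as above
# ("CODE_AGE" is a placeholder code; the values are invented)
example_columns = pd.MultiIndex.from_tuples(
    [
        ("A009", "Health", "State of health (subjective)"),
        ("CODE_AGE", "Age", "Age"),
    ],
    names=["code", "label", "description"],
)
example_df = pd.DataFrame([[3, 41], [2, 58]], columns=example_columns)

# Selecting by the first level (the code) works with plain square brackets
by_code = example_df["A009"]

# Selecting by another level needs xs, naming the level to search in
by_label = example_df.xs("Health", axis=1, level="label")

# droplevel removes the extra levels, leaving the codes as plain column names
codes_only = example_df.droplevel(["label", "description"], axis=1)
print(codes_only.columns.tolist())
```

Both `by_code` and `by_label` are dataframes with a single column, because the remaining levels of the multi-index are kept as that column's name.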

## Python Walkthrough 8.2

**Cleaning data and splitting variables**