# Empirical Project 8

## Getting Started in Python

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg


### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use(
    "https://raw.githubusercontent.com/aeturrell/core_python/main/plot_style.txt"
)

## Python Walkthrough 8.1

**Importing data into Python**

As we are importing an Excel file, we use the `pd.read_excel` function provided by the **pandas** package. The file is called `doing-economics-datafile-working-in-excel-project-8.xlsx` and needs to be saved into a subfolder of your working directory called 'data'. The file contains four worksheets of data, named ‘Wave 1’ through to ‘Wave 4’. We will load the worksheets one by one and combine them using the `pd.concat` function, which concatenates (combines) dataframes.

The final, combined data frame is called `lifesat_data`.

In [None]:
list_of_sheetnames = ["Wave " + str(i) for i in range(1, 5)]
list_of_dataframes = [
    pd.read_excel(
        Path("data/doing-economics-datafile-working-in-excel-project-8.xlsx"),
        sheet_name=x,
    )
    for x in list_of_sheetnames
]
lifesat_data = pd.concat(list_of_dataframes, axis=0)
lifesat_data.head()

Note that reading in data from lots of Excel worksheets is quite slow. For large datasets, [parquet is a very efficient format](https://ursalabs.org/blog/2019-10-columnar-perf/) and works across programming languages.

The variable names provided in the spreadsheet are not very specific (a combination of letters and numbers that don’t tell us what the variable measures). To make it easier to keep track, we could take one of two approaches:

1. Use a multi-index for our columns; this is an index with more than one entry per column, with multiple column names stacked on top of each other. We would create a multi-index that includes the original codes, then has labels, and then has a short description.

2. Work with what we have, but keep hold of an easy way to convert the codes into either labels or a short description should we need to.


Using a multi-index for columns (option 1) is convenient in some ways but it also has downsides. The main one is extra complexity when doing operations on columns, because for some operations we need to specify *all* of the names of a column. This removes any ambiguity in cases where a column name is repeated at one level of the multi-index. So, although using the syntax

```python
lifesat_data["A009"]
```

will work most of the time, for *some* operations we would have to use

```python
lifesat_data[("A009", "Health", "State of health (subjective)")]
```

instead to access the health column. As you can see, we remove any ambiguity about which data we refer to by specifying all three of its names in an object enclosed by round brackets (this object is called a `tuple` and behaves a lot like a list, except you can't modify individual values within it).

Option 2 has the downside that, on typical dataframe operations, we will only have the codes to go on and will have to look those codes up if we need to remind ourselves of what they represent. The easiest way to do this is using dictionaries.

In this tutorial, we'll go for option 2 as it's simpler, and there's a lot to be said for making life simpler. However, if you do want to go down the option 1 route, you can and the first step would be to create the multi-index column object like so:

```python
# option 1 only
index = pd.MultiIndex.from_tuples(
    tuple(zip(lifesat_data.columns, labels, short_descriptions)),
    names=["code", "label", "description"],
)
lifesat_data.columns = index
```

where `labels` and `short_descriptions` are lists of strings and the `zip` function pairs up the three lists (codes, labels, and short descriptions) element by element, giving one tuple of three names per column.
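To see how the two ways of selecting a column differ, here is a minimal sketch on a toy dataframe. The names `toy`, `toy_labels`, and `toy_descriptions` are made up for illustration and only cover two of the columns.

```python
import pandas as pd

# Toy data: two columns, using codes and names from the lists in this walkthrough
toy = pd.DataFrame({"A009": [3, 4], "A170": [7, 9]})
toy_labels = ["Health", "Life satisfaction"]
toy_descriptions = ["State of health (subjective)", "Satisfaction with your life"]

# Build a three-level column index: code, label, description
toy.columns = pd.MultiIndex.from_tuples(
    list(zip(toy.columns, toy_labels, toy_descriptions)),
    names=["code", "label", "description"],
)

# Selecting by the top-level code alone returns a dataframe,
# because the two lower levels are still attached
by_code = toy["A009"]

# Selecting by the full tuple returns a single series
by_tuple = toy[("A009", "Health", "State of health (subjective)")]

print(type(by_code).__name__, type(by_tuple).__name__)
```

The difference in return type is one reason why some operations need the full tuple.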

Going back to option 2: let's first create our neat mapping of codes into labels and codes into short descriptions using dictionaries:


In [None]:
labels = [
    "EVS-wave",
    "Country/region",
    "Respondent number",
    "Health",
    "Life satisfaction",
    "Work Q1",
    "Work Q2",
    "Work Q3",
    "Work Q4",
    "Work Q5",
    "Sex",
    "Age",
    "Marital status",
    "Number of children",
    "Education",
    "Employment",
    "Monthly household income",
]

short_descriptions = [
    "EVS-wave",
    "Country/region",
    "Original respondent number",
    "State of health (subjective)",
    "Satisfaction with your life",
    "To develop talents you need to have a job",
    "Humiliating to receive money w/o working for it",
    "People who don't work become lazy",
    "Work is a duty towards society",
    "Work comes first even if it means less spare time",
    "Sex",
    "Age",
    "Marital status",
    "How many living children do you have",
    "Educational level (ISCED-code one digit)",
    "Employment status",
    "Monthly household income (x 1,000s PPP euros)",
]

labels_dict = dict(zip(lifesat_data.columns, labels))
descrp_dict = dict(zip(lifesat_data.columns, short_descriptions))

Let's check that these work by looking at the example of health again, which has code `"A009"`:

In [None]:
print(labels_dict["A009"])
print(descrp_dict["A009"])

Throughout this project we will refer to the variables using their original names, the codes, but you can see the extra info when you need to by passing those codes into these dictionaries.
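As a small sketch of how such a dictionary can be used beyond single look-ups, the example below swaps codes for labels on a copy of a toy dataframe. The names `toy` and `toy_labels_dict` are stand-ins invented for this illustration.

```python
import pandas as pd

# Toy stand-ins for the real dataframe and the labels dictionary
toy = pd.DataFrame({"A009": [3, 4], "A170": [7, 9]})
toy_labels_dict = {"A009": "Health", "A170": "Life satisfaction"}

# Look up a single code
health_label = toy_labels_dict["A009"]

# Swap codes for labels on a copy, for display only; `toy` keeps its codes
display_copy = toy.rename(columns=toy_labels_dict)

print(health_label)
print(list(display_copy.columns))
```

Because `rename` returns a new dataframe, the working data keeps its original codes.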

## Python Walkthrough 8.2

**Cleaning data and splitting variables**

*Inspect the data and recode missing values*

Python's **pandas** package stores variables as different types depending on the kind of information the variable represents. For categorical data, where, as the name suggests, data is divided into a number of groups, such as country or occupation, the variables can be stored as type `"category"`. Numerical data (numbers that do not represent categories) can be stored as integers, `"int"`, or real numbers, usually `"float64"` (double-precision floating point). There are other datatypes too, for example `"datetime64[ns]"` for datetimes in nanosecond increments. Text is of type `"string"`. There's also a 'not quite sure' datatype, `"object"`, which is typically used for data that doesn't clearly fall into a bucket.

However, **pandas** is quite conservative about deciding on data types for you, so you do have to be careful to check the datatypes are what you want when they are read in. The classic example is of numbers being read in as type `"object"`.
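The following sketch, on a made-up series called `raw`, shows how a single text entry is enough to make a column of numbers come out as `"object"`, and one way to convert it afterwards.

```python
import pandas as pd

# A column that mixes numbers with a text code (here ".a") for a missing value
raw = pd.Series([1, 2, ".a", 4])
print(raw.dtype)  # object, because of the mixed contents

# errors="coerce" turns anything that cannot be parsed as a number into NaN
converted = pd.to_numeric(raw, errors="coerce")
print(converted.dtype)
```

Coercing is convenient but silent, so it is worth checking how many values became missing.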

The `.info()` method tells us what data types are being used in a **pandas** dataframe:

In [None]:
lifesat_data.info()

We have a lot of `"object"` columns, so it's clear that a lot of the columns haven't been read in as what they should be.

Looking back at our data, we can see that there are a LOT of `".a"` values and, reading the original data source, it looks like these represent missing values. Let's replace those with the proper missing value indicator, `pd.NA`.

In [None]:
lifesat_data = lifesat_data.replace(".a", pd.NA)
lifesat_data.head()

This isn't the only way to deal with those pesky `".a"` values. When we read each worksheet in, we could have replaced the value used for missing data in the file, `".a"`, with **pandas**' built-in representation of missing values. This is achieved via the `na_values=".a"` keyword in the `pd.read_excel` function.
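Here is a minimal sketch of the `na_values` keyword in action. To keep it self-contained, it reads a small made-up CSV from memory with `pd.read_csv` rather than an Excel file; `pd.read_excel` accepts the same keyword.

```python
import io

import pandas as pd

# In-memory CSV standing in for a worksheet; ".a" marks missing values
csv_text = "A009,X047D\nGood,1.5\n.a,.a\nFair,2.25\n"

# na_values tells the reader which extra strings to treat as missing
toy = pd.read_csv(io.StringIO(csv_text), na_values=".a")

print(toy.dtypes)
print(toy.isna().sum())
```

With the missing-value code handled at read time, the income column is read in as a number straight away.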

*Recode the life satisfaction variable*

To recode the life satisfaction variable (`"A170"`), we can use a dictionary to map ‘Dissatisfied’ or ‘Satisfied’ into 1 or 10 respectively. This variable was imported as an object column. After changing the text into numerical values, we use the `astype("Int32")` method to convert the variable into a 32-bit integer (these can represent any integer between $-2^{31}$ and $2^{31} - 1$).
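As a quick sketch of what the `"Int32"` type offers, the example below checks the range of a 32-bit integer and recodes a made-up series called `answers` that contains a missing value.

```python
import numpy as np
import pandas as pd

# Range of a 32-bit signed integer
int32_min = np.iinfo(np.int32).min  # -2**31
int32_max = np.iinfo(np.int32).max  # 2**31 - 1

# The capitalised "Int32" is the pandas nullable integer type,
# so it can hold pd.NA alongside whole numbers
answers = pd.Series(["Satisfied", "Dissatisfied", pd.NA, 7])
recoded = answers.replace({"Satisfied": 10, "Dissatisfied": 1}).astype("Int32")

print(int32_min, int32_max)
print(recoded)
```

The lower-case `"int32"` type from **numpy** cannot store missing values, which is why the capitalised version is used here.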

In [None]:
col_satisfaction = "A170"
lifesat_data[col_satisfaction] = (
    lifesat_data[col_satisfaction]
    .replace({"Satisfied": 10, "Dissatisfied": 1})
    .astype("Int32")
)
lifesat_data.info()

*Recode the variable for number of children*

We repeat this process for the variable indicating the number of children (`"X011_01"`).

In [None]:
col_num_children = "X011_01"

lifesat_data[col_num_children] = (
    lifesat_data[col_num_children].replace({"No children": 0}).astype("Int32")
)

*Replace text with numbers for multiple variables*

When we have to recode multiple variables with the same mapping of text to numerical value, we can take a bit of a short-cut to recode multiple columns at once.

In [None]:
col_codes = ["C036", "C037", "C038", "C039", "C041"]

lifesat_data[col_codes] = (
    lifesat_data[col_codes]
    .replace(
        {
            "Strongly disagree": 1,
            "Disagree": 2,
            "Neither agree nor disagree": 3,
            "Agree": 4,
            "Strongly agree": 5,
        }
    )
    .astype("Int32")
)

# This one needs a different mapping

health_code = "A009"
lifesat_data[health_code] = (
    lifesat_data[health_code]
    .replace({"Very poor": 1, "Poor": 2, "Fair": 3, "Good": 4, "Very good": 5})
    .astype("Int32")
)

*Split a variable containing numbers and text*

To split the education variable `"X025A"` into two new columns, we use the `.str.split` method with `expand=True`, which returns one column for each part of the split text. We will store these as two new variables called `X025A_num` and `X025A_sch`, containing the numeric value and the text description respectively. Then we will convert `X025A_num` into a numeric variable.


In [None]:
education_code = "X025A"
lifesat_data[education_code].str.split(" : ", expand=True)

Let's do this again but save it back into our dataframe under two new column names. We'll pass these back in a list.

In [None]:
ed_num, ed_sch = [education_code + suffix for suffix in ["_num", "_sch"]]

print(ed_num)
print(ed_sch)

Now pass them back in as a list (note the extra square brackets) so that they map up to the two new columns on the right hand side.

In [None]:
lifesat_data[[ed_num, ed_sch]] = lifesat_data[education_code].str.split(
    " : ", expand=True
)
lifesat_data[ed_num] = pd.to_numeric(lifesat_data[ed_num]).astype("Int32")
lifesat_data.sample(5, random_state=4)

You can see the two extra columns for education at the end of the dataframe.

There's just one more column to convert: monthly income, which is a real number rather than an integer. Let's do that, and then let's have a final look at our data types:

In [None]:
lifesat_data["X047D"] = pd.to_numeric(lifesat_data["X047D"])
lifesat_data.info()

## Python Walkthrough 8.3

**Dropping specific observations**

As not all questions were asked in all waves, we have to be careful when dropping observations with missing values for certain questions, to avoid accidentally dropping an entire wave of data. For example, information on self-reported health (`"A009"`) was not recorded in Wave 3, and questions on work attitudes (`"C036"` to `"C041"`) and information on household income were only asked in Waves 3 and 4. Furthermore, information on the number of children (`"X011_01"`) and education (`"X025A"`) was only collected in the final wave.

We will first use the `.dropna()` function to find only those observations with complete information on variables present in all waves (`"X003"`, `"A170"`, `"X028"`, `"X007"`, and `"X001"`, which we will store in a list named `include`). Combining with `.index` will enable us to find the index values for rows that have complete information. But we must also be wary that our index is currently not unique, so we'll do a reset of the index first to ensure that there is one and only one index value for each observation (this is generally good practice!) using `.reset_index()` with the keyword argument `drop=True` because we don't wish to keep the current index in the dataframe.
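To see why the index reset matters, here is a minimal sketch with two made-up frames, `wave_a` and `wave_b`, stacked in the same way as the worksheets.

```python
import pandas as pd

# Two toy "waves"; each has its own index running 0, 1
wave_a = pd.DataFrame({"A170": [7, None]})
wave_b = pd.DataFrame({"A170": [5, 9]})
stacked = pd.concat([wave_a, wave_b], axis=0)

# Index labels 0 and 1 each appear twice
index_is_unique_before = stacked.index.is_unique

# Selecting by label now matches rows from BOTH frames,
# so the row with a missing value sneaks back in
complete_labels = stacked[["A170"]].dropna().index
wrong = stacked.loc[complete_labels, :]

# After resetting, every row has its own label and the selection is exact
stacked = stacked.reset_index(drop=True)
right = stacked.loc[stacked[["A170"]].dropna().index, :]

print(len(wrong), len(right))
```

Without the reset, the label-based selection returns more rows than there are complete observations.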

In [None]:
include = ["X003", "A170", "X028", "X007", "X001"]
lifesat_data = lifesat_data.reset_index(drop=True)
lifesat_data = lifesat_data.loc[lifesat_data[include].dropna().index, :]

Next we will look at variables that were only present in some waves. For each variable or group of variables, we look only at the particular wave(s) in which the question was asked, then keep the observations with complete information on those variables. As before, we make lists of variables that only feature in Waves 1, 2, and 4 (`"A009"`, stored in `include_in_wave_1_2_4`), Waves 3 and 4 (`"C036"` to `"C041"` and `"X047D"`, stored in `include_in_wave_3_4`), and Wave 4 only (`"X011_01"` and `"X025A"`, stored in `include_in_wave_4`).

First, we put together some useful background info on what questions were only included in which waves.

In [None]:
# A009 is not in Wave 3.
# Note that even though it's just one entry, we use square brackets to make it a list
include_in_wave_1_2_4 = ["A009"]
# Work attitudes and income are in Waves 3 and 4.
include_in_wave_3_4 = ["C036", "C037", "C038", "C039", "C041", "X047D"]
# Number of children and education are in Wave 4.
include_in_wave_4 = ["X011_01", "X025A"]

Now we check the cases for these waves, successively refining the data to just those we wish to keep.

Again we will use the `.dropna()` method, but combine it with the logical OR operator, `|`, to include all observations for waves that did not ask the relevant question, along with the complete cases for that question in the other waves.

As a concrete example, in the first refinement of the data below we pick out any row for which every column in `include_in_wave_3_4` has an entry *or* (represented by `|`) `"S002EVS"` takes the values `"1981-1984"` or `"1990-1993"`. This keeps observations if they are in Wave 1 (1981-1984) or Wave 2 (1990-1993) or if they are in Waves 3 or 4 with complete information. As `lifesat_data[include_in_wave_3_4].notna()` will create six columns' worth of boolean values (one for each variable in `include_in_wave_3_4`), we then use the row-wise (`axis=1`) `.all()` method to create a single boolean value (`True` or `False`) for every row. This is then combined with the test of whether a row is part of Wave 1 or 2, `lifesat_data["S002EVS"].isin(["1981-1984", "1990-1993"])`, in an OR (`|`) test. As a result we get one value (`True` or `False`) for each row of data.
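The same logic can be sketched on a four-row toy dataframe. The values are made up, and only two of the six columns are used to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "S002EVS": ["1981-1984", "1999-2001", "1999-2001", "2008-2010"],
        "C036": [None, 4.0, None, 2.0],
        "X047D": [None, 1.2, 0.8, None],
    }
)
asked_later = ["C036", "X047D"]  # toy version of the Waves 3 and 4 list

# True only where every listed column has a value
complete = toy[asked_later].notna().all(axis=1)
# True where the questions were never asked, so missing values are expected
early_wave = toy["S002EVS"].isin(["1981-1984", "1990-1993"])

keep = complete | early_wave
kept = toy.loc[keep, :]
print(keep.tolist())
```

Only the Wave 1 row and the fully answered Wave 3 row survive; the two rows with a gap in a wave where the question was asked are dropped.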

In [None]:
# in Wave 1 (1981-1984) or Wave 2 (1990-1993), or they are in Waves 3 or 4 with complete information
condition_wave_3_4 = (lifesat_data[include_in_wave_3_4].notna()).all(axis=1) | (
    lifesat_data["S002EVS"].isin(["1981-1984", "1990-1993"])
)
lifesat_data = lifesat_data.loc[condition_wave_3_4, :]

# in Wave 4 with complete information on the questions specific to that wave or not in Wave 4
condition_wave_4 = (
    lifesat_data[include_in_wave_4].notna().all(axis=1)
) | ~lifesat_data["S002EVS"].isin(["2008-2010"])
lifesat_data = lifesat_data.loc[condition_wave_4, :]

# in Waves 1, 2, or 4 with complete information on the questions specific to those waves, or in Wave 3
condition_wave_1_2_4 = (lifesat_data[include_in_wave_1_2_4].notna().all(axis=1)) | (
    lifesat_data["S002EVS"].isin(["1999-2001"])
)
lifesat_data = lifesat_data.loc[condition_wave_1_2_4, :]

## Python Walkthrough 8.4

**Calculating averages and percentiles**

*Calculate average work ethic score*

We use the `.mean(axis=1)` method (remember it's `axis=0` to aggregate over the index and `axis=1` to aggregate over columns) to calculate the average work ethic score for each observation (`"work_ethic"`) based on the five survey questions related to working attitudes (`"C036"` to `"C041"`). As we chose to keep the original single-level column names (option 2), we can create the new column with a single name.

In [None]:
lifesat_data["work_ethic"] = lifesat_data.loc[
    :, ["C036", "C037", "C038", "C039", "C041"]
].mean(axis=1)
lifesat_data.sample(5, random_state=5)

*Calculate income percentiles*

The **statsmodels** package provides a handy class (`ECDF`, the empirical cumulative distribution function) that we can use to obtain an individual’s relative income as a percentile. We do this in the following steps:

- We create a new column "inc_percentile" and fill it with nans (`np.nan`) for now
- We then create a Boolean value (`condition_inc_percentile`) for the relevant years with information in the relevant column
- Then we use this to filter the rows in `lifesat_data` that we want to work on. The computation on the right-hand side is:
  - groupby the range of years (`"S002EVS"`)
  - select the income variable (`["X047D"]`)
  - use the transform method, which returns a column with the same dimensions as the input data (as opposed to apply, which returns data with only as many dimensions as there are categories in the grouped-by column)
  - use a lambda function to evaluate each wave's ECDF at every income value in that wave, convert the result to a percentage, and round it using `np.round`
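To make the idea of a within-wave percentile concrete, here is a sketch on made-up data. Note that it swaps in a pure **pandas** technique, `rank(method="max", pct=True)`, instead of `ECDF`; for data without missing values, both give the share of observations in the group that are less than or equal to each value. The column names `wave` and `income` are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "wave": ["w3", "w3", "w3", "w3", "w4", "w4"],
        "income": [10.0, 20.0, 20.0, 40.0, 5.0, 15.0],
    }
)

# Share of observations in the same wave with income <= own income, as a percentage
toy["inc_percentile"] = (
    toy.groupby("wave")["income"]
    .rank(method="max", pct=True)  # ties take the highest rank in the tie
    .mul(100)
    .round(1)
)
print(toy)
```

In the first toy wave, the two tied incomes of 20 both sit at the 75th percentile, because three of the four observations are at or below 20.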



In [None]:
from statsmodels.distributions.empirical_distribution import ECDF

# create empty col for new variable
lifesat_data["inc_percentile"] = np.nan

# fill it for waves 3 and 4 with relevant data
condition_inc_percentile = (
    lifesat_data["S002EVS"].isin(["1999-2001", "2008-2010"])
) & (lifesat_data["X047D"].notna())

lifesat_data.loc[condition_inc_percentile, "inc_percentile"] = (
    lifesat_data.loc[
        condition_inc_percentile, :
    ]  # Select waves 3 & 4 without missing income data
    .groupby("S002EVS")["X047D"]  # group by wave, then select the income variable
    .transform(
        lambda x: np.round(ECDF(x)(x) * 100, 1)
    )  # compute ecdf as % round to 1 decimal place
)

# see the dataframe with the new column
lifesat_data.sample(5, random_state=5)

## Python Walkthrough 8.5

**Calculating summary statistics**

*Create a table showing employment status, by country*

One of the most useful features of **pandas** is its composability. We can stack up multiple methods to create just the statistics we want. In this example, we're going to use a succession of methods to create a table showing the employment status (as a percentage) of each country's labour force. The steps are:

- Select the data for Wave 4
- Group it by employment type (`"X028"`) and country (`"S003"`). Order will matter later when we use `unstack`; whichever variable is last in the groupby command will be switched from the index to the columns when we call unstack.
- Select the column to take observations from. In this case, it makes sense to use employment again.
- Count the number of observations
- Unstack so that we have a table instead of a long list (with countries as columns)
- Transform the numbers into percentages that sum to 100 for each country
- Round the values in the table
- Because we have more countries than employment statuses, transpose the columns and index

Note that when we get to the `.transform` line, we are left with a table that has employment status in the *rows* (the index) and countries in the columns. This means that each value in the table represents the count of observations with a particular employment status in a particular country. The application of the lambda function, `lambda x: x * 100 / x.sum()`, then computes each employment type as a percentage of all observations in that particular country.
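The column-by-column behaviour of `.transform` can be seen in a small sketch with made-up counts for two hypothetical countries:

```python
import pandas as pd

# Toy counts: rows are employment statuses, columns are countries
counts = pd.DataFrame(
    {"Country A": [30, 10], "Country B": [5, 15]},
    index=["Full time", "Unemployed"],
)

# transform applies the function to each column in turn,
# so x.sum() is the total for one country
shares = counts.transform(lambda x: x * 100 / x.sum())
print(shares)
```

Each column of `shares` sums to 100, which is a useful check that the percentages are within-country.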

In [None]:
sum_table = (
    lifesat_data.loc[
        lifesat_data["S002EVS"] == "2008-2010", :
    ]  # Wave 4 only, all columns
    .groupby(["X028", "S003"])[  # Group by employment and country
        "X028"
    ]  # Select employment column
    .count()  # Count number of observations in each category (employment-country)
    .unstack()  # Turn countries from an index into columns (countries because they are the last groupby variable)
    .transform(lambda x: x * 100 / x.sum())  # Compute a percentage
    .round(2)  # Round to 2 decimal places
    .T  # Transpose so countries are the index, employment types the columns
)

sum_table

If we then wanted to export this table for further use elsewhere, we would export it with `sum_table.to_html(filename)`, `sum_table.to_excel(filename)`, `sum_table.to_string(filename)`, `sum_table.to_latex(filename)`, or many other options that you can find [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

*Calculate summary statistics by gender*

We can also obtain summary statistics on a number of variables at the same time by selecting several columns after the `groupby`. To obtain the mean value for each of the required variables, grouped by the gender variable (`"X001"`), we can compose methods again:

In [None]:
(
    lifesat_data.loc[
        lifesat_data["S002EVS"] == "2008-2010", :
    ]  # Wave 4 only, all columns
    .groupby(["X001"])[  # Group by gender
        ["A009", "A170", "work_ethic", "X003", "X025A_num", "X011_01"]
    ]  # Select columns
    .mean(numeric_only=True)
    .round(2)  # Round to 2 decimal places
)

Getting the standard deviation is as simple as replacing `mean()` with `std()`:

In [None]:
(
    lifesat_data.loc[
        lifesat_data["S002EVS"] == "2008-2010", :
    ]  # Wave 4 only, all columns
    .groupby(["X001"])[  # Group by gender
        ["A009", "A170", "work_ethic", "X003", "X025A_num", "X011_01"]
    ]  # Select columns
    .std()
    .round(2)  # Round to 2 decimal places
)

But what if we want *both*!? We can have that too, using the `agg` (short for aggregate) method and a list of functions.

In [None]:
(
    lifesat_data.loc[
        lifesat_data["S002EVS"] == "2008-2010", :
    ]  # Wave 4 only, all columns
    .groupby(["X001"])[  # Group by gender
        ["A009", "A170", "work_ethic", "X003", "X025A_num", "X011_01"]
    ]  # Select columns
    .agg(["mean", "std"])
    .round(2)  # Round to 2 decimal places
)

Note that, in this case, we have a multi-level column object in our dataframe. If you want to flip down the last level of columns to be an index, try using `.stack()`.

If you're exporting your results, you're not going to want to use the code names though. So you'll probably want to export your table with the nice names substituted in. You can do this using the dictionaries we created right at the start. Let's use the `labels_dict` with a stacked version of the table above.

In [None]:
tab = (
    lifesat_data.loc[
        lifesat_data["S002EVS"] == "2008-2010", :
    ]  # Wave 4 only, all columns
    .groupby(["X001"])[  # Group by gender
        ["A009", "A170", "work_ethic", "X003", "X025A_num", "X011_01"]
    ]  # Select columns
    .agg(["mean", "std"])
    .round(2)  # Round to 2 decimal places
    .stack()  # bring mean and std into the index
    .rename(labels_dict, axis=1)  # rename the columns
)

tab

We didn't rename `"work_ethic"` and `"X025A_num"`, because we created those variables *after* the labels dictionary was created (so there are no labels for them). And `"X001"` is not a column heading, it's the index name, so we'll have to change that separately.

Let's update our dictionary with a new name for `"X025A_num"` and a new index name.

In [None]:
labels_dict.update({"X025A_num": "Education Level"})
tab = tab.rename(labels_dict, axis=1)
tab.index.names = [
    "",
    "",
]  # Set both index level names to empty (the two levels are gender and statistic)

tab

## Python Walkthrough 8.6

**Calculating frequencies and percentages**

First we need to create a frequency table of the `work_ethic` variable for each wave. This variable only takes values from 1.0 to 5.0 in increments of 0.2 (since it is an average of five whole numbers), so we can group by each value and count the number of observations in each group using the `count` function. Once we have counted the number of observations that have each value (separately for each wave), we compute the percentages by dividing these numbers by the total number of observations *for that wave* using `transform`. For example, if there are 50 observations between 1 and 1.2, and 1,000 observations in that wave, the percentage would be 5%.


In [None]:
waves = ["1999-2001", "2008-2010"]
country = "Germany"
condition = (lifesat_data["S002EVS"].isin(waves)) & (  # Select Waves 3 and 4
    lifesat_data["S003"] == country
)  # Only select DE for this example

# Create a new dataframe with counts by wave and work ethic score
ethic_pct = (
    lifesat_data.loc[condition, :]
    .groupby(["S002EVS", "work_ethic"])["work_ethic"]
    .count()
    .reset_index(name="count")
)

In [None]:
# Turn the counts into a within-wave percentage using 'transform'
ethic_pct["percentage"] = (
    100 * ethic_pct["count"] / ethic_pct.groupby(["S002EVS"])["count"].transform("sum")
)
ethic_pct.head()

The frequencies and percentages are saved in a new dataframe called `ethic_pct`. If you want to look at it, you can type the name into an empty code cell in a Jupyter or Colab Notebook, or, in Visual Studio Code, use the 'Variables' button and navigate to `ethic_pct`.
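As an aside, within-group percentages can also be produced in one chain with `value_counts(normalize=True)`. The sketch below uses made-up scores in a dataframe called `toy`; it is an alternative route, not the method used in this walkthrough.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "S002EVS": ["1999-2001"] * 4 + ["2008-2010"] * 2,
        "work_ethic": [3.0, 3.0, 3.2, 4.0, 3.0, 4.0],
    }
)

pct = (
    toy.groupby("S002EVS")["work_ethic"]
    .value_counts(normalize=True)  # share of each score within its wave
    .mul(100)
    .rename("percentage")  # name the values before moving the index to columns
    .reset_index()
)
print(pct)
```

The two-step approach with `count` and `transform` keeps the raw counts as well, which is handy if you want to report both.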

Now that we have the percentages and frequency data, we use **matplotlib** to plot a column chart. To overlay the column charts for both waves and make sure that the plot for each wave is visible, we use the alpha option in the `ax.bar` function to set the transparency level (try changing the transparency to see how it affects your chart’s appearance).

In [None]:
fig, ax = plt.subplots()
for wave in waves:
    sub_df = ethic_pct.loc[
        ethic_pct["S002EVS"] == wave, :
    ]  # For convenience, subset the dataframe
    ax.bar(sub_df["work_ethic"], sub_df["percentage"], width=0.2, alpha=0.7, label=wave)
ax.legend()
ax.set_xlabel("Work ethic")
ax.set_ylabel("Percent")
ax.set_title(f"Distribution of work ethic for {country}", loc="left")
plt.show()

***Figure 8.2** Distribution of work ethic scores for Germany.*

## Python Walkthrough 8.7

**Plotting multiple lines on a chart**

*Calculate average life satisfaction, by wave and country*

Before we can plot the line charts, we have to calculate the average life satisfaction for each country in each wave.

In Python Walkthrough 8.5 we produced summary tables, grouped by country and employment status. We will follow the same process, but now we only require mean values. Countries that do not report the life satisfaction variable in a given wave will have a missing value for average life satisfaction in that wave. Since each country is represented by a row in the summary table (after transposing), we use `.dropna(how="any", axis="index")` to drop any countries that do not have a value for the average life satisfaction in all four waves.


In [None]:
avg_life_sat = (
    lifesat_data.groupby(["S002EVS", "S003"])["A170"]  # groupby wave and country
    .mean()  # take the mean of life satisfaction
    .unstack()
    .T.dropna(  # one row per country, one wave per column
        how="any", axis="index"
    )  # drop any rows with missing observations
)
avg_life_sat

*Create a line chart for average life satisfaction*

As the data are already in the format that **matplotlib** likes, namely as a matrix, we can almost use `dataframe.plot` directly. But as we want to make a few tweaks and add a few extras, we also need to create an overall axis (`ax`) and pass that to the dataframe plotting function (`dataframe.plot(ax=ax)` after `fig, ax = plt.subplots()`).

The other settings do things like setting the y-label (`ax.set_ylabel`) and adding a legend (`ax.legend`) that is outside of the graphed area (`bbox_to_anchor=(1, 1)`).

In [None]:
fig, ax = plt.subplots()
avg_life_sat.T.plot(ax=ax)
ax.set_title("Life satisfaction across countries and survey waves", loc="left")
ax.legend(bbox_to_anchor=(1, 1))
ax.set_ylabel("Mean")
plt.show()

***Figure 8.3** Line chart of average life satisfaction across countries and survey waves.*

## Python Walkthrough 8.8

**Creating dummy variables and calculating correlation coefficients**

To obtain the correlation coefficients between variables, we have to make sure that all variables are numeric. However, the data on gender and employment status are coded using text, so we need to create two new variables (`gender` and `employment`). (We choose to create a new variable rather than overwrite the original variable so that even if we make a mistake, the raw data is still preserved).

We can use the `np.where` function to make the value of the variable conditional on whether a logical statement (e.g. `x["X001"] == "Male"`) is satisfied or not. As shown below, we can nest `np.where` statements to create more complex conditions, which is useful if the variable contains more than two values (an alternative is to create a categorical column using `.astype("category")` and then use `.cat.codes` to turn discrete variables into numbers).

We use two `np.where` statements for the employment status variable (`"X028"`) so that the new variable will be 1 for full-time employed, 0 for unemployed, and NA if neither condition is satisfied.
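Here is a minimal sketch of single and nested `np.where` calls on a three-row toy dataframe with made-up respondents:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "X001": ["Male", "Female", "Female"],
        "X028": ["Full time", "Unemployed", "Retired"],
    }
)

toy = toy.assign(
    # One condition: 0 if male, 1 otherwise
    gender=lambda x: np.where(x["X001"] == "Male", 0, 1),
    # Nested conditions: 1 if full time, else 0 if unemployed, else missing
    employment=lambda x: np.where(
        x["X028"] == "Full time",
        1,
        np.where(x["X028"] == "Unemployed", 0, np.nan),
    ),
)
print(toy)
```

One thing to keep in mind is that the single `np.where` assigns 1 to every value that is not `"Male"`, which would include missing entries; this is only safe when rows with a missing gender have already been removed, as was done in Python Walkthrough 8.3.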

The first job is to ensure that all of the variables we'd like (`"X003"`, `"X025A_num"`, `"employment"`, `"gender"`, `"A009"`, `"X047D"`, `"X011_01"`, `"inc_percentile"`, `"A170"`, and `"work_ethic"`) are numeric. We can check this for the variables we already have with `.info()`:

In [None]:
lifesat_data.info()

Looks like we need to convert one variable from `object` to float.

In [None]:
lifesat_data["X003"] = lifesat_data["X003"].astype("float")

In [None]:
# the columns employment and gender don't exist yet; we'll create them soon

cols_to_select = [
    "X003",
    "X025A_num",
    "employment",
    "gender",
    "A009",
    "X047D",
    "X011_01",
    "inc_percentile",
    "A170",
    "work_ethic",
]

corr_matrix = (
    lifesat_data.loc[lifesat_data["S002EVS"] == "2008-2010", :]
    .assign(
        gender=lambda x: np.where(x["X001"] == "Male", 0, 1),
        employment=lambda x: np.where(
            x["X028"] == "Full time", 1, np.where(x["X028"] == "Unemployed", 0, np.nan)
        ),
    )
    .loc[:, cols_to_select]
    .corr()
)

We used the `assign` method here. It can get confusing when performing operations on columns and rows because there are several different methods you can use: `assign`, `apply`, `transform`, and `agg`. The last three (`agg`, `apply`, and `transform`) are most often used *after* a groupby operation, although they also exist as plain dataframe methods (as in the `.transform` call in Python Walkthrough 8.5).

Here's a quick guide on when to use each of the three that follow a groupby:

- Use `.agg` when you're using a groupby but you want your groups to become the new index (rows)
- Use `.transform` when you're using a groupby but you want to retain your original index
- Use `.apply` when you're using a groupby, but you want to perform operations that will leave neither the original index nor an index of groups

Let's see examples of all three on some dummy data. First, let's create the dummy data:

In [None]:
len_s = 1000
prng = np.random.default_rng(42)  # prng = pseudo-random number generator
s = pd.Series(
    index=pd.date_range("2000-01-01", periods=len_s, name="date", freq="D"),
    data=prng.integers(-10, 10, size=len_s),
)
s.head()

Now let's see the result of using each kind of the three:

In [None]:
print("\n`.agg` following `.groupby`: groups provide index")
print(s.groupby(s.index.to_period("M")).agg("skew").head())
print("\n`.transform` following `.groupby`: retain original index")
print(s.groupby(s.index.to_period("M")).transform("skew").head())
print("\n`.apply` following `.groupby`: index entries can be new")
print(s.groupby(s.index.to_period("M")).apply(lambda x: x[x > 0].cumsum()).head())

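`assign`, meanwhile, is used when you want to add new columns to a dataframe as part of a chain of methods: it returns a new dataframe with the extra columns rather than modifying the original. Its sister operation is direct assignment, which creates a new column on the existing dataframe. Let's see both of these on the dummy data: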

In [None]:
# make s a dataframe rather than just a series
s = pd.DataFrame(s, columns=["number"])
# creating data directly
s["new_column_directly"] = 10
s.head()

And now using assign:

In [None]:
# creating data using assign
s = s.assign(new_column_indirectly=11)
s.head()

Let's return to the correlation matrix we created:

In [None]:
# Only interested in two columns, so select those
corr_matrix.loc[:, ["A170", "work_ethic"]]

## Python Walkthrough 8.9

**Calculating group means**

*Calculate average life satisfaction and differences in average life satisfaction*

We can achieve the tasks in Question 4(a) and (b) in one go using an approach similar to that used in Python Walkthrough 8.5, although now we are interested in calculating the average life satisfaction by country and employment type. Once we have tabulated these means, we can compute the difference in the average values. We will create two new variables: `D1` for the difference between the average life satisfaction for full-time employed and unemployed, and `D2` for the difference in average life satisfaction for full-time employed and retired individuals.

In [None]:
# Set the employment types that we wish to report
employment_list = ["Full time", "Retired", "Unemployed"]

df_employment = (
    lifesat_data.loc[
        (lifesat_data["S002EVS"] == "2008-2010")
        & (lifesat_data["X028"].isin(employment_list)),  # row selection
        :,  # col selection—all columns
    ]  # select wave 4 and these specific emp types
    .groupby(["S003", "X028"])  # group by country and employment type
    .mean(numeric_only=True)[
        "A170"
    ]  # mean value of life satisfaction by country and employment
    .unstack()  # reshape to one row per country (employment type, the inner level, moves to the columns)
    .assign(  # create the differences in means
        D1=lambda x: x["Full time"] - x["Unemployed"],
        D2=lambda x: x["Full time"] - x["Retired"],
    )
)

df_employment.round(2)
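
If the chain above is hard to follow, it may help to run the same group, mean, unstack, and assign steps on a tiny made-up table. Everything in this sketch (the dataframe `toy`, its column names, and its values) is hypothetical. As a small variation, it selects the column of interest before taking the mean rather than after:

```python
import pandas as pd

# Hypothetical data: two countries, two employment types, two people in each
toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "employment": ["Full time", "Full time", "Unemployed", "Unemployed"] * 2,
        "satisfaction": [8, 6, 5, 3, 9, 7, 6, 6],
    }
)

toy_means = (
    toy.groupby(["country", "employment"])["satisfaction"]
    .mean()  # one mean per (country, employment) pair
    .unstack()  # employment types become columns, one row per country
    .assign(D1=lambda x: x["Full time"] - x["Unemployed"])  # difference in means
)

print(toy_means)
```

After `unstack`, each employment type is an ordinary column, which is what lets `assign` subtract one from the other.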

*Make a scatterplot sorted according to work ethic*

In order to plot the differences ordered by the average work ethic, we first need to get all data from Wave 4 (using `.loc`), summarise the `work_ethic` variable by country (`groupby` then take the `mean`), and store the results in a temporary dataframe (`df_work_ethic`).

In [None]:
df_work_ethic = (
    lifesat_data.loc[
        (lifesat_data["S002EVS"] == "2008-2010"), ["S003", "work_ethic"]
    ]  # select wave 4 and two columns only
    .groupby(["S003"])  # group by country
    .mean()  # mean value of work_ethic by country
)

df_work_ethic.head().round(2)

We can now combine the mean `work_ethic` data with the table containing the difference in means (using an inner join to match the data correctly by country) and make a scatterplot using **matplotlib**. This process can be repeated for the difference in means between full-time employed and retired individuals by changing `df_emp_ethic_comb["D1"]` to `df_emp_ethic_comb["D2"]` in the call to `ax.scatter` (and updating the title to match).

In [None]:
df_emp_ethic_comb = pd.merge(df_employment, df_work_ethic, on=["S003"], how="inner")

fig, ax = plt.subplots()
ax.scatter(df_emp_ethic_comb["work_ethic"], df_emp_ethic_comb["D1"])
ax.set_ylabel("Difference")
ax.set_xlabel("Work ethic")
ax.set_title(
    "Difference in wellbeing between the\nfull-time employed and the unemployed vs work ethic",
    size=14,
)
ax.set_ylim(0, None)
plt.show()

***Figure 8.5** Difference in life satisfaction (wellbeing) between the full-time employed and the unemployed vs average work ethic.*
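
The inner join used above keeps only the countries that appear in both tables. A minimal sketch with two made-up tables (the country codes and numbers here are invented for illustration) shows the effect:

```python
import pandas as pd

# Hypothetical country-level tables, both indexed by a country identifier
left = pd.DataFrame(
    {"D1": [0.5, 1.2, 0.8]},
    index=pd.Index(["AA", "BB", "CC"], name="S003"),
)
right = pd.DataFrame(
    {"work_ethic": [3.1, 3.9, 4.2]},
    index=pd.Index(["BB", "CC", "DD"], name="S003"),
)

# "AA" is missing from right and "DD" is missing from left, so both are dropped
combined = pd.merge(left, right, on=["S003"], how="inner")

print(combined)
```

Note that `on=` can refer to the name of an index level as well as to a column, which is why the merge works even though `S003` is the index of both tables.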

To calculate correlation coefficients, use the `corr` method of a dataframe. You can see that the correlation between average work ethic and difference in life satisfaction is quite weak for employed vs unemployed, but moderate and positive for employed vs retired.

In [None]:
df_emp_ethic_comb.corr().loc[["D1", "D2"], ["work_ethic"]]

## Python Walkthrough 8.10

**Calculating confidence intervals and adding error bars**

We will use Turkey, Spain, and Great Britain as example countries in the top, middle, and bottom third of work ethic scores respectively.

In the tasks in Questions 1(a) and (b) we will obtain the means, standard errors, and 95% confidence intervals step-by-step, then for Question 1(c) we show how to use a shortcut to obtain confidence intervals from a single function.

*Calculate confidence intervals manually*

We obtained the difference in means in Python Walkthrough 8.9 (`D1` and `D2`), so now we can calculate the standard error of the means for each country of interest. We'll do this the long way round, using the formula.
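
For reference, the calculations in this walkthrough correspond to the standard formulas for two independent groups. The notation is our own shorthand rather than the project's: $s_g$ is the sample standard deviation of life satisfaction in group $g$, $n_g$ is the number of observations in that group, and $\bar{x}_g$ is the group mean.

$$
\text{SE}(\bar{x}_g) = \frac{s_g}{\sqrt{n_g}}, \qquad
\text{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{\text{SE}(\bar{x}_1)^2 + \text{SE}(\bar{x}_2)^2}
$$

$$
95\%\ \text{CI} = (\bar{x}_1 - \bar{x}_2) \pm 1.96 \times \text{SE}(\bar{x}_1 - \bar{x}_2)
$$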

In [None]:
country_list = ["Turkey", "Spain", "Great Britain"]

df_emp_se = (
    lifesat_data.loc[
        (lifesat_data["S002EVS"] == "2008-2010")
        & (lifesat_data["X028"].isin(employment_list))
        & (lifesat_data["S003"].isin(country_list)),
        :,
    ]  # select the relevant employment types, countries, and wave 4
    .groupby(["S003", "X028"])  # groupby country and employment type
    # standard error of the mean: sample standard deviation divided by sqrt(n)
    .apply(lambda x: x["A170"].std() / np.sqrt(x["A170"].count()))
    .unstack()  # put the employment types along the columns
    .assign(  # calculate the standard errors of the differences
        D1_SE=lambda x: (x["Full time"].pow(2) + x["Unemployed"].pow(2)).pow(1 / 2),
        D2_SE=lambda x: (x["Full time"].pow(2) + x["Retired"].pow(2)).pow(1 / 2),
    )
)

df_emp_se.round(2)
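
As a sanity check on the manual formula, the sketch below computes the standard error of the mean by hand on made-up data and compares it with the built-in `sem` method in **pandas**. The dataframe `toy` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical life satisfaction scores for two employment groups
toy = pd.DataFrame(
    {
        "group": ["Full time"] * 4 + ["Unemployed"] * 4,
        "satisfaction": [8.0, 6.0, 7.0, 9.0, 5.0, 3.0, 6.0, 4.0],
    }
)

grouped = toy.groupby("group")["satisfaction"]
se_manual = grouped.std() / np.sqrt(grouped.count())  # the long way round
se_builtin = grouped.sem()  # built-in standard error of the mean

# standard error of the difference in means between two independent groups
se_diff = np.sqrt(se_manual["Full time"] ** 2 + se_manual["Unemployed"] ** 2)

print(np.allclose(se_manual, se_builtin))
print(round(se_diff, 3))
```

Both `std` and `sem` use the sample (n - 1) denominator by default, so the two approaches should agree.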

We can now combine the standard errors with the difference in means, and compute the half-width of the 95% confidence interval (1.96 times the standard error of the difference).

In [None]:
df_emp_subset = (
    df_employment.loc[df_employment.index.isin(country_list), ["D1", "D2"]]
    .join(df_emp_se.loc[:, ["D1_SE", "D2_SE"]], how="inner")
    .assign(CI_1=lambda x: 1.96 * x["D1_SE"], CI_2=lambda x: 1.96 * x["D2_SE"])
)

df_emp_subset.round(3)

We now have a table containing the difference in means, the standard error of the difference in means, and the half-width of the 95% confidence interval (`CI_1` and `CI_2`) for each of the two differences. The interval itself runs from the difference minus the half-width to the difference plus the half-width. (Recall that `D1` is the difference between the average life satisfaction for full-time employed and unemployed, and `D2` is the difference in average life satisfaction for full-time employed and retired individuals.)

*Calculate confidence intervals using a t-test function*

We could obtain the confidence intervals directly by using the t-test function from the **pingouin** package. We already imported it at the start of this chapter, but if you didn't already, run `import pingouin as pg`.

First we need to prepare the data in two groups. In the following example we go through the difference in average life satisfaction for full-time employed and unemployed individuals in Turkey, but the process can be repeated for the difference between full-time employed and retired individuals by changing the code appropriately (also for your other two chosen countries).

We start by selecting the data for full-time and unemployed people in Turkey, and storing it in two separate temporary matrices (arrays of rows and columns) called `turkey_full` and `turkey_unemployed` respectively, which is the format needed for the t-test function.

In [None]:
# create a boolean that's true for wave 4, Turkey, and full time
full_boolean = (
    (lifesat_data["S002EVS"] == "2008-2010")
    & (lifesat_data["S003"] == "Turkey")
    & (lifesat_data["X028"] == "Full time")
)

turkey_full = (
    lifesat_data.loc[full_boolean, ["A170"]]  # select the life satisfaction data
    .astype("double")  # ensure this is a floating point number
    .values  # grab the values--this is needed for the t-test
)

# do the same for the unemployed
unem_boolean = (
    (lifesat_data["S002EVS"] == "2008-2010")
    & (lifesat_data["S003"] == "Turkey")
    & (lifesat_data["X028"] == "Unemployed")
)
turkey_unemployed = lifesat_data.loc[unem_boolean, ["A170"]].astype("double").values

In the above, for the full-time employed, we created a Boolean row filter based on three different columns using:

```python
(
    (lifesat_data["S002EVS"] == "2008-2010")
    & (lifesat_data["S003"] == "Turkey")
    & (lifesat_data["X028"] == "Full time")
)
```

and put this into `.loc` using the `.loc[rows, columns]` syntax. `.loc` isn't the only way to select rows to operate on though: an alternative is to use `.query` and pass it a string (some text) that asks for particular column values. This is easiest to demonstrate with an example:

```python
turkey_full = (
    lifesat_data
    # select wave 4, Turkey, and full time
    .query("S002EVS == '2008-2010' and S003 == 'Turkey' and X028 == 'Full time'")
    .loc[:, ["A170"]]  # select the life satisfaction data
    .astype("double")  # ensure this is a floating point number
    .values  # grab the values--this is needed for the t-test
)
```

You can see we still use a `.loc` after the query line, but only to select columns (we select all rows from the previous step using `:`).

Sometimes `.query` can be shorter or clearer to write than a conditional statement; it varies depending on the case. Can you write the equivalent query for the unemployed? It's useful to know both but, if in doubt, simply use `.loc`.
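
To convince yourself that the two approaches select the same rows, you can compare them on a small made-up table. The dataframe `toy` and its column names below are hypothetical:

```python
import pandas as pd

# Hypothetical data with the same kind of structure as the survey
toy = pd.DataFrame(
    {
        "country": ["Turkey", "Turkey", "Spain", "Turkey"],
        "employment": ["Full time", "Unemployed", "Full time", "Full time"],
        "satisfaction": [7, 4, 8, 6],
    }
)

# Boolean mask: each comparison is wrapped in parentheses because
# & binds more tightly than == in Python
mask = (toy["country"] == "Turkey") & (toy["employment"] == "Full time")
via_loc = toy.loc[mask, ["satisfaction"]]

# The same selection written as a query string
via_query = toy.query("country == 'Turkey' and employment == 'Full time'").loc[
    :, ["satisfaction"]
]

print(via_loc.equals(via_query))
```

Inside a query string you can write `and` between conditions without extra parentheses, which is one reason the query form is sometimes easier to read.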

Let's move on now. We can use the `ttest` function from the **pingouin** package on the two newly created vectors. The default confidence interval level is 95% and can be changed via the `confidence=` keyword argument. Note that the `ravel` function turns an array like `[[1], [5], [0]]` into an array of the form `[1, 5, 0]`.

In [None]:
ttest = pg.ttest(turkey_full.ravel(), turkey_unemployed.ravel()).round(3)
ttest

We can then calculate the difference in means by finding the midpoint of the interval (`ttest["CI95%"].iloc[0].mean()` is 0.74), which should be the same as the figure obtained in Question 1(b) (`df_emp_subset.loc["Turkey", "D1"]` is 0.7374582).
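
The midpoint check works because the confidence interval is symmetric around the difference in means. The sketch below illustrates this on synthetic data, with two caveats: the samples are randomly generated rather than taken from the survey, and the interval uses the 1.96 normal approximation from earlier in this walkthrough rather than the t distribution that a t-test uses.

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(loc=7.0, scale=2.0, size=200)  # made-up "full time" scores
group_b = rng.normal(loc=6.0, scale=2.5, size=150)  # made-up "unemployed" scores

diff = group_a.mean() - group_b.mean()
se_diff = np.sqrt(
    group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size
)

# 1.96 is the normal-approximation critical value (an assumption of this sketch)
ci_lower, ci_upper = diff - 1.96 * se_diff, diff + 1.96 * se_diff
midpoint = (ci_lower + ci_upper) / 2

print(np.isclose(midpoint, diff))
```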

*Add error bars to the column charts*

We can now use these confidence intervals to add error bars to our column charts. To do so, we pass the half-widths of the confidence intervals to the `yerr=` keyword argument of `ax.bar`, which draws an error bar extending that distance above and below the top of each column. In this case it is easier to use the results from Questions 1(a) and (b), as we already have the values for the difference in means and the CI half-widths stored in `df_emp_subset`.

In [None]:
fig, ax = plt.subplots()
ax.bar(df_emp_subset.index, df_emp_subset["D1"], yerr=df_emp_subset["CI_1"])
ax.set_ylabel("Difference in means")
ax.set_xlabel("Country")
ax.set_title("Difference in well-being (full-time vs unemployed)")
plt.show()

***Figure 8.7** Difference in life satisfaction (well-being) between full-time employed and unemployed.*

Again, this can be repeated for the difference in life satisfaction between full-time employed and retired. Remember to change `df_emp_subset["D1"]` to `df_emp_subset["D2"]`, and change the error bars correspondingly too.

This chapter used the following packages, where *sys* is the Python version:

In [None]:
%load_ext watermark
%watermark --iversions