# Empirical Project 11

## Getting Started in Python

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import seaborn as sns
import seaborn.objects as so
import pingouin as pg
import warnings


### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use("plot_style.txt")
# Make seaborn work consistently with this
so.Plot.config.theme.update(mpl.rcParams)
# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## Python Walkthrough 11.1

**Importing data and recoding variables**

Before importing data in Excel or .csv format, open it to ensure you understand the structure of the data and check if any additional options are required for the `pd.read_excel` function in order to import the data correctly. In this case the data is in a worksheet called ‘Data’, there are no missing values to worry about, and the first row contains the variable names. This format should be straightforward to import, *but* this file is saved as a Strict Open XML Spreadsheet (.xlsx) rather than as an Excel Workbook (.xlsx), and loading it with `pd.read_excel` will fail. You'll need to re-save it in Excel Workbook (.xlsx) format and then import the data using the `pd.read_excel` function.

In [None]:
wtp = pd.read_excel(
    Path("data/doing-economics-datafile-working-in-excel-project-11.xlsx"),
    sheet_name="Data",
)
wtp.head(3)

### Reverse-code Variables

The first task is to recode variables related to the respondents’ views on certain aspects of government behaviour and attitudes about global warming (`cog_2`, `cog_5`, `scepticism_6`, and `scepticism_7`). This coding makes the interpretation of high and low values consistent across all questions, since the survey questions do not have this consistency.

To recode all of these values (across several variables) in one go, we use dictionary mapping: that is, we use the `map` method to convert specific values to new values.

In [None]:
map_dict = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
vars_of_interest = ["cog_2", "cog_5", "scepticism_6", "scepticism_7"]
for col in vars_of_interest:
    wtp[col] = wtp[col].map(map_dict)
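
As a quick check of how this recoding behaves, the sketch below applies the same dictionary to a small made-up Series (the values in `toy` are illustrative and not taken from the survey). On a 1 to 5 scale the mapping is equivalent to computing `6 - x`, and any value that is not a key in the dictionary becomes NaN.

```python
import pandas as pd

# Hypothetical responses on a 1-5 scale (not taken from the survey data)
toy = pd.Series([1, 2, 3, 4, 5, 2])
map_dict = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}

recoded = toy.map(map_dict)

# On a 1-5 scale, reverse-coding is the same as subtracting from 6
same_as_subtraction = (recoded == 6 - toy).all()

# Values that are not keys in the dictionary become NaN
unexpected = pd.Series([1, 9]).map(map_dict)

print(recoded.tolist())
print(same_as_subtraction)
print(unexpected.isna().tolist())
```

Because the mapping is its own inverse, running the recoding cell a second time would restore the original coding, so it is worth running it exactly once.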

### Create new variables containing WTP amounts

Although we could employ the same technique as above to recode the value for the minimum and maximum willingness to pay variables, an alternative is to use the `pd.merge` function. This function allows us to combine two dataframes via values given in a particular variable.

We start by creating a new dataframe (`category_amount`) that has two variables: the original category value and the corresponding new euro amount. We then apply the `pd.merge` function to the `wtp` dataframe and the new dataframe, specifying the variables that link the data in each dataframe together. We're going to match on the `WTP_plmin` and, separately, the `WTP_plmax` options, to put in new columns for the min WTP and max WTP.

In [None]:
# vector containing the euro amounts
wtp_euro_levels = [48, 72, 84, 108, 156, 192, 252, 324, 432, 540, 720, 960, 1200, 1440]

# create a dataframe from this vector
category_amount = pd.DataFrame({"original": range(1, 15), "new": wtp_euro_levels})

# creating a new column for min WTP
wtp = pd.merge(
    category_amount, wtp, how="right", left_on="original", right_on="WTP_plmin"
).rename(columns={"new": "WTP_plmin_euro"})

# creating a new column for max WTP
wtp = pd.merge(
    category_amount, wtp, how="right", left_on="original", right_on="WTP_plmax"
).rename(columns={"new": "WTP_plmax_euro"})
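
To see what the right merge does, here is a minimal sketch on a made-up three-row dataframe (`toy`, its `id` values, and its answers are hypothetical). It suggests that `how="right"` keeps every row of the right-hand dataframe, including a row with a missing category, and that the lookup key `original` comes along as an extra column, which can be dropped once the euro amount is attached.

```python
import numpy as np
import pandas as pd

wtp_euro_levels = [48, 72, 84, 108, 156, 192, 252, 324, 432, 540, 720, 960, 1200, 1440]
category_amount = pd.DataFrame({"original": range(1, 15), "new": wtp_euro_levels})

# Hypothetical respondents: categories 3 and 14, plus one missing answer
toy = pd.DataFrame({"id": [1, 2, 3], "WTP_plmin": [3, 14, np.nan]})

merged = (
    pd.merge(
        category_amount, toy, how="right", left_on="original", right_on="WTP_plmin"
    )
    .rename(columns={"new": "WTP_plmin_euro"})
    .drop(columns="original")  # drop the lookup key once it has done its job
)

print(merged)
```

Dropping `original` after each merge is a design choice for tidiness: without it, the second merge would find an `original` column on both sides and pandas would add `_x` and `_y` suffixes.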

## Python Walkthrough 11.2

**Creating indices from multiple columns**

We can create all of the required indices in three steps using a column-aggregation operation, here `.mean`. In each step we call the method with `axis=1`, i.e. we aggregate across columns within each row. We then insert the single, newly created column back into the dataframe.

In [None]:
wtp["climate"] = wtp[["scepticism_2", "scepticism_6", "scepticism_7"]].mean(axis=1)
wtp["gov_intervention"] = wtp[
    ["cog_1", "cog_2", "cog_3", "cog_4", "cog_5", "cog_6"]
].mean(axis=1)
wtp["pro_environment"] = wtp[["PN_1", "PN_2", "PN_3", "PN_4", "PN_6", "PN_7"]].mean(
    axis=1
)
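
The following sketch, using a made-up dataframe with hypothetical question columns `q1` to `q3`, illustrates what `mean(axis=1)` computes: one average per row. It also shows that a missing answer is skipped by default, so the index for that respondent is the mean of the remaining answers.

```python
import numpy as np
import pandas as pd

# Hypothetical answers to three questions; the third respondent skipped q2
toy = pd.DataFrame({"q1": [1, 4, 5], "q2": [3, 4, np.nan], "q3": [5, 4, 1]})

# axis=1 averages across the columns within each row
toy["index_score"] = toy[["q1", "q2", "q3"]].mean(axis=1)

print(toy)
```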

## Python Walkthrough 11.3

**Calculating correlation coefficients**

### Calculate correlation coefficients

We covered calculating correlation coefficients in Python walkthrough 10.1. In this case, since there are no missing values we can use the `.corr` method without any additional options.

For the questions on climate change:

In [None]:
wtp[["scepticism_2", "scepticism_6", "scepticism_7"]].corr().round(2)

For the questions on government behaviour:

In [None]:
wtp[["cog_1", "cog_2", "cog_3", "cog_4", "cog_5", "cog_6"]].corr().round(2)

And, finally, the questions on personal behaviour:

In [None]:
wtp[["PN_1", "PN_2", "PN_3", "PN_4", "PN_6", "PN_7"]].corr().round(2)

### Calculate Cronbach's alpha

The `cronbach_alpha` function from the **pingouin** package computes Cronbach’s alpha directly; it returns the estimate together with a 95% confidence interval. Let's look at it for these three sets of questions:

In [None]:
import pingouin as pg

pg.cronbach_alpha(data=wtp[["scepticism_2", "scepticism_6", "scepticism_7"]])

In [None]:
pg.cronbach_alpha(data=wtp[["cog_1", "cog_2", "cog_3", "cog_4", "cog_5", "cog_6"]])

In [None]:
pg.cronbach_alpha(data=wtp[["PN_1", "PN_2", "PN_3", "PN_4", "PN_6", "PN_7"]])
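
To make clear what is being calculated, here is a minimal sketch of the textbook formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of the total score), where k is the number of items. The helper name `cronbach_alpha_manual` and the toy data are hypothetical, and the helper drops incomplete rows, which may differ from how **pingouin** treats missing values.

```python
import pandas as pd


def cronbach_alpha_manual(items: pd.DataFrame) -> float:
    """Cronbach's alpha from the standard variance formula (complete rows only)."""
    items = items.dropna()
    k = items.shape[1]  # number of items (questions)
    item_variances = items.var(axis=0, ddof=1)  # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


# Hypothetical answers from five respondents to three questions
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 2, 3, 5, 5], "c": [1, 3, 3, 4, 4]})
alpha = cronbach_alpha_manual(toy)

# Two identical items are perfectly consistent, so alpha is 1
alpha_identical = cronbach_alpha_manual(pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}))

print(round(alpha, 3), round(alpha_identical, 3))
```

With complete data this should agree with the point estimate reported by `pg.cronbach_alpha`, which makes it a handy cross-check.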

## Python Walkthrough 11.4

**Using loops to obtain summary statistics**

The two different formats (DC and TWPL) are recorded in the variable `abst_format`, and take the values `ref` and `ladder` respectively. We will store all the variables of interest into a list called `variables`, and use a "for" loop to calculate summary statistics for each variable and present it in a table (using `pd.crosstab`).

In [None]:
variables = ["sex", "age", "kids_nr", "hhnetinc", "member", "education"]

for var in variables:
    print(pd.crosstab(wtp[var], wtp["abst_format"], normalize="columns").round(2))
    print("————\n")

The output above gives the required tables, but is not easy to read. You may want to tidy up the results, for example by translating (from German to English) and reordering the options in the household net income variable (`hhnetinc`).
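
To illustrate what `normalize="columns"` does, the sketch below builds a cross-tabulation from six made-up respondents (the data are hypothetical). Each column is divided by its own total, so the shares within each question format sum to one.

```python
import pandas as pd

# Hypothetical respondents, three in each question format
toy = pd.DataFrame(
    {
        "sex": ["female", "male", "female", "male", "female", "male"],
        "abst_format": ["ref", "ref", "ref", "ladder", "ladder", "ladder"],
    }
)

# Shares are computed within each column (question format)
shares = pd.crosstab(toy["sex"], toy["abst_format"], normalize="columns")

print(shares.round(2))
print(shares.sum(axis=0))
```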

## Python Walkthrough 11.5

**Calculating summary statistics**

The `agg` function can provide multiple statistics for a number of variables in one command. After the `groupby`, select the list of variables you want to summarise and then use `agg` to specify the summary statistics you need. Here, we need the mean, standard deviation, minimum, and maximum for the variables `climate`, `gov_intervention`, and `pro_environment`. Finally, `.stack` puts the data into a longer (and here more readable) format.

In [None]:
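# Select the numeric columns before aggregating, so that text columns
# in the dataframe are not passed to the mean and standard deviation
wtp.groupby("abst_format")[["climate", "gov_intervention", "pro_environment"]].agg(
    ["mean", "std", "min", "max"]
).stack()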
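
As a small illustration of the pattern, the sketch below groups a made-up dataframe (hypothetical values, with a text column `label` included on purpose) and aggregates only the selected numeric column. Selecting columns straight after the `groupby` restricts the aggregation to them, so the text column is never averaged.

```python
import pandas as pd

# Hypothetical data: one numeric index and one text column
toy = pd.DataFrame(
    {
        "abst_format": ["ref", "ref", "ladder", "ladder"],
        "climate": [2.0, 4.0, 1.0, 5.0],
        "label": ["a", "b", "c", "d"],
    }
)

# Only the selected column is aggregated; the result has two-level column names
summary = toy.groupby("abst_format")[["climate"]].agg(["mean", "min", "max"])

print(summary)
```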

## Python Walkthrough 11.6

**Summarising willingness to pay variables**

### Create column charts for minimum and maximum WTP

Before we can plot a column chart, we need to compute frequencies (number of observations) for each value of the willingness to pay (1–14). We do this separately for the minimum and maximum willingness to pay.

In each case we select the relevant variable and remove any observations with missing values using the `.dropna()` method. We then convert the euro amounts to integers and to a categorical type, and obtain a frequency count for each level (WTP amount) of the `WTP_plmin_euro` or `WTP_plmax_euro` variable using the `value_counts` method. Finally, `reset_index` turns the counts into a dataframe.

Once we have the frequency count stored as a dataframe, we can plot the column charts.

For the minimum willingness to pay:

In [None]:
import seaborn.objects as so

df_plmin = (
    wtp[["WTP_plmin_euro"]]
    .dropna()
    .astype("int")
    .astype("category")
    .value_counts()
    .sort_values()
    .reset_index(name="count")  # name the frequency column explicitly
)

(
    so.Plot(df_plmin, x="WTP_plmin_euro", y="count")
    .add(so.Bar())
    .label(x="Minimum WTP (euros)", y="Frequency")
    .show()
)

**Figure 11.4 Minimum WTP (euros).**

Let's now do the same for the maximum willingness to pay:

In [None]:
df_plmax = (
    wtp[["WTP_plmax_euro"]]
    .dropna()
    .astype("int")
    .astype("category")
    .value_counts()
    .sort_values()
    .reset_index(name="count")  # name the frequency column explicitly
)

(
    so.Plot(df_plmax, x="WTP_plmax_euro", y="count")
    .add(so.Bar())
    .label(x="Maximum WTP (euros)", y="Frequency")
    .show()
)

**Figure 11.5 Maximum WTP (euros).**
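
An equivalent way to build these frequency tables is with `groupby` and `size`. The sketch below uses a made-up Series of euro amounts (hypothetical values); passing `name="count"` to `reset_index` gives the frequency column an explicit name, so the plotting code does not depend on a default column label.

```python
import pandas as pd

# Hypothetical minimum WTP amounts in euros
toy = pd.DataFrame({"WTP_plmin_euro": [48, 72, 48, 108, 48, 72]})

# One row per distinct amount, with the number of respondents who chose it
freq = toy.groupby("WTP_plmin_euro").size().reset_index(name="count")

print(freq)
```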

### Calculate average WTP for each individual

We can use the `mean` function to obtain the average of the minimum and maximum willingness to pay (combining the two columns at each row using the `axis=1` keyword argument).

In [None]:
wtp["wtp_average"] = wtp[["WTP_plmin_euro", "WTP_plmax_euro"]].mean(axis=1)

### Calculate mean and median WTP across individuals

The mean and median of this average value can be obtained using the `mean` and `median` functions. Note that invalid entries, such as NaNs, are omitted by default. It's also worth noting that there is only a single dimension left to take the mean and median over here, so there is no need to specify `axis=0`, which is the default in any case.

In [None]:
wtp["wtp_average"].mean()

In [None]:
wtp["wtp_average"].median()

### Calculate correlation coefficients

We showed how to obtain a matrix of correlation coefficients for a number of variables in Python walkthrough 8.8. We use the same process here, storing the coefficients in an object called `M_corr`.

In [None]:
M_corr = (
    wtp.assign(gender=lambda x: np.where(x["sex"] == "female", 0, 1))
    .loc[
        :,
        [
            "wtp_average",
            "education",
            "gender",
            "climate",
            "gov_intervention",
            "pro_environment",
        ],
    ]
    .corr()
)

M_corr["wtp_average"].round(3)
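
Two details of this step are easy to miss, and the sketch below illustrates them on made-up data (all values hypothetical). First, `.corr()` uses every pair of rows where both variables are present, so a missing value in one column only removes that row for the pairs that involve it. Second, `np.where` codes anything that is not `"female"` as 1, which includes missing values of `sex`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in y and one in sex
toy = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, np.nan, 8, 10],
        "sex": ["female", "male", None, "female", "male"],
    }
).assign(gender=lambda d: np.where(d["sex"] == "female", 0, 1))

# Keep only numeric columns before computing correlations
M_toy = toy[["x", "y", "gender"]].corr()

print(toy["gender"].tolist())
print(M_toy.round(3))
```

If respondents with missing `sex` should not be counted as male, one option is to drop or mask those rows before creating the dummy variable.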

## Python Walkthrough 11.7

**Summarising Dichotomous Choice (DC) variables**

### Create frequency table for DC_ref_outcome

We can group by `costs` and `DC_ref_outcome` to obtain the number of observations for each combination of amount and vote response. We can also recode the voting options to ‘Yes’, ‘No’, and ‘Abstain’.

In [None]:
recoding_dict = {
    "do not support referendum and no pay": "No",
    "support referendum and pay": "Yes",
    "would not vote": "Abstain",
}

wtp_dc = (
    wtp.dropna(subset=["costs", "DC_ref_outcome"])
    .assign(DC_ref_outcome=lambda x: x["DC_ref_outcome"].map(recoding_dict))
    .groupby(["costs", "DC_ref_outcome"])["id"]
    .count()
    .unstack()
)

wtp_dc

### Add column showing proportion voting yes or no

We can extend the table from Question 2(a) to include the proportion voting yes or no (to obtain percentages, multiply the values by 100). We *chain* methods in the code below, using `assign` to create new columns and `round` to round the values in the float columns.

In [None]:
wtp_dc = wtp_dc.assign(
    total=lambda x: x["Abstain"] + x["No"] + x["Yes"],
    prop_no=lambda x: (x["Abstain"] + x["No"]) / x["total"],
    prop_yes=lambda x: x["Yes"] / x["total"],
).round(2)

wtp_dc

### Make a line chart of WTP

Using the dataframe generated for Questions 2(a) and (b) (`wtp_dc`), we can plot the ‘demand curve’ as a scatterplot with connected points using **seaborn**. Adding the extra option `x=so.Continuous().tick(every=200)` under the scale option changes the default labelling on the horizontal axis to display ticks at every 200 euros, enabling us to read the chart more easily.

In [None]:
import seaborn.objects as so

(
    so.Plot(wtp_dc, x="costs", y="prop_yes")
    .add(so.Line())
    .add(so.Dots())
    .label(
        x="Amount (Euros)",
        y="Fraction voting 'yes'",
    )
    .scale(
        x=so.Continuous().tick(every=200),
    )
    .show()
)

**Figure 11.6 Demand curve (in euros), DC method.**

### Calculate new proportions and add them to the table and chart

It is straightforward to calculate new proportions and add them to the existing dataframe:

In [None]:
wtp_dc = wtp_dc.assign(
    total_ex=lambda x: x["No"] + x["Yes"],
    prop_no_ex=lambda x: (x["No"]) / x["total_ex"],
    prop_yes_ex=lambda x: x["Yes"] / x["total_ex"],
).round(2)

And now we're going to plot them too. Because **seaborn** expects long-format data, we do need to re-orient the dataframe, however. We'll put the results of this long format data in a new dataframe called `demand_curve`. We'll clean this up a bit too: by renaming the ref outcome variable to `vote` and the entries in that variable to more relevant names for the chart. Finally, we'll plot the chart using lines and dots as before.

In [None]:
demand_curve = (
    pd.melt(
        wtp_dc.reset_index(), id_vars=["costs"], value_vars=["prop_yes", "prop_yes_ex"]
    )
    .rename(columns={"DC_ref_outcome": "vote"})
    .assign(
        vote=lambda x: x["vote"].map(
            {"prop_yes": "counted as no", "prop_yes_ex": "excluded"}
        )
    )
)

(
    so.Plot(demand_curve, x="costs", y="value", color="vote")
    .add(so.Line())
    .add(so.Dots())
    .label(
        x="Amount (Euros)",
        y="Fraction voting 'yes'",
    )
    .scale(
        x=so.Continuous().tick(every=200),
    )
    .show()
)

**Figure 11.7 Demand curve from DC respondents, under different treatments for 'Abstain' responses.**
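
For readers new to reshaping, this sketch shows what `pd.melt` does on a made-up two-row table (the proportions are hypothetical). Each value column becomes a set of rows, with the former column name stored in a label column. Here the `var_name` argument names that column directly, as an alternative to renaming it afterwards.

```python
import pandas as pd

# Hypothetical wide table: one row per cost level, one column per series
wide = pd.DataFrame(
    {"costs": [48, 72], "prop_yes": [0.6, 0.5], "prop_yes_ex": [0.7, 0.6]}
)

# Long format: one row per (cost level, series) pair
long = pd.melt(
    wide,
    id_vars=["costs"],
    value_vars=["prop_yes", "prop_yes_ex"],
    var_name="vote",
)

print(long)
```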

## Python Walkthrough 11.8

**Calculating confidence intervals for differences in means**

### Calculate the difference in means, standard deviations, and number of observations

We first create two vectors that will contain the WTP values for each of the two question methods. For the DC format, willingness to pay is recorded in the `costs` variable, so we select all observations where the `DC_ref_outcome` variable indicates the individual voted ‘yes’ and drop any missing observations. For the TWPL format we use the `wtp_average` variable that we created in Python walkthrough 11.6.

In [None]:
dc_wtp = wtp.loc[
    wtp["DC_ref_outcome"] == "support referendum and pay", "costs"
].dropna()

dc_wtp.agg(["mean", "std", "count"]).round(2)

In [None]:
twpl_wt = wtp.loc[~wtp["wtp_average"].isna(), "wtp_average"].dropna()

twpl_wt.agg(["mean", "std", "count"]).round(2)

### Calculate 95% confidence intervals

Using the `ttest` function from the **pingouin** package to obtain 95% confidence intervals was covered in Python walkthroughs 8.10 and 10.6. As we have already separated the data for the two different question formats in Question 3(a), we can obtain the confidence interval directly.

In [None]:
# `confidence` is the confidence level, so 0.95 gives a 95% interval
pg.ttest(dc_wtp, twpl_wt, confidence=0.95)
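
To show where such an interval comes from, here is a rough sketch using only the standard library. It assumes unequal variances and uses a large-sample normal approximation, whereas **pingouin** uses the t distribution, so with small samples its interval will be somewhat wider. The function name `diff_in_means_ci` and the two toy samples are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def diff_in_means_ci(a, b, confidence=0.95):
    """Approximate CI for mean(a) - mean(b), using a normal critical value."""
    diff = mean(a) - mean(b)
    # Standard error of the difference, allowing unequal variances
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    # Two-sided critical value for the requested confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se


# Hypothetical samples (not the survey data)
group_a = [10, 12, 14, 16, 18]
group_b = [8, 9, 10, 11, 12]

low, high = diff_in_means_ci(group_a, group_b)
print(round(low, 2), round(high, 2))
```

Note that `confidence` here is the coverage level (0.95 for a 95% interval), not the significance level; passing 0.05 would give a much narrower 5% interval.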

### Calculate median WTP for the DC format

In Python walkthrough 11.6 we obtained the median WTP for the TWPL format. We now obtain the median WTP for the DC format:

In [None]:
dc_wtp.median()

This chapter used the following packages where *sys* is the Python version:

In [None]:
%load_ext watermark
%watermark --iversions