# Empirical Project 2 - Working in Python

## Getting started in Python

TODO

## Part 2.1 Collecting data by playing a public goods game

### Python Walkthrough 2.1

**Plotting a line chart with multiple variables**

Use the data from your own experiment to answer Question 1. As an example, we will use the data for the first three cities of the dataset that will be introduced in Part 2.2.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings

# Set the plot style for prettier charts:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

# Create a dictionary with the data in
data = {"Copenhagen": [14.1, 14.1, 13.7, 12.9, 12.3, 11.7, 10.8, 10.6, 9.8, 5.3],
        "Dniprop": [11.0, 12.6, 12.1, 11.2, 11.3, 10.5, 9.5, 10.3, 9.0, 8.7],
        "Minsk": [12.8, 12.3, 12.6, 12.3, 11.8, 9.9, 9.9, 8.4, 8.3, 6.9]}

df = pd.DataFrame.from_dict(data)
df.head()

In [None]:
# Plot the data
fig, ax = plt.subplots()
df.plot(ax=ax)
ax.set_title("Average contribution to public goods game: without punishment")
ax.set_ylabel("Average contribution")
ax.set_xlabel("Round");

**Figure 2.1** Average contributions in different locations.

## Part 2.2 Describing the data

### Python Walkthrough 2.2

Both the tables you need are in a single Excel worksheet. Note down the cell ranges of each table, in this case A2:Q12 for the without punishment data and A16:Q26 for the punishment data. We will now use this range information to import the data into two dataframes (data_n and data_p respectively).

In [None]:
data_np = pd.read_excel("data/Public-goods-experimental-data.xlsx", usecols="A:Q", header=1, index_col="Period")
data_n = data_np.iloc[:10, :].copy()
data_p = data_np.iloc[14:24, :].copy()

Look at the data by typing `data_n` or `data_p` into the interactive Python window, or by opening the dataframes in your editor's variable explorer if it has one.

You can see that in each row, the average contribution varies across cities; in other words, there is a distribution of average contributions in each period.
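
Because the two tables sit one above the other in the same worksheet, the rows between them (blank cells and the second header row) are read in as well, which can leave the columns with a generic `object` type rather than a numeric one. The sketch below uses a small made-up dataframe (the names `stacked`, `CityA` and `CityB` are hypothetical and not part of the real dataset) to show one way of slicing out each table by position and converting it to numbers. Whether the conversion is needed for the real file is an assumption worth checking with `data_n.dtypes`.

```python
import pandas as pd

# Made-up data imitating two tables stacked in one worksheet,
# separated by a blank row and a repeated header row
stacked = pd.DataFrame(
    {
        "Period": [1, 2, None, "Period", 1, 2],
        "CityA": [10.0, 9.0, None, "CityA", 11.0, 12.0],
        "CityB": [8.0, 7.5, None, "CityB", 9.0, 9.5],
    }
).set_index("Period")

# Slice each table by row position, then convert the values to floats
first_table = stacked.iloc[:2, :].astype(float)
second_table = stacked.iloc[4:6, :].astype(float)

print(stacked.dtypes)      # object, because text and numbers are mixed
print(first_table.dtypes)  # float64 after the conversion
```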

### Python Walkthrough 2.3

**Calculating the mean using different methods**

We calculate the mean using two different methods, to illustrate that there are usually many ways of achieving the same thing. We apply the first method to `data_n`, using the built-in `.mean()` method with `axis=1` to calculate the average across cities separately for each period (that is, for each row). The `Period` column is the index, so it is not included in the average. We use the second method (the `agg` function) on `data_p`.

In [None]:
mean_n_c = data_n.mean(axis=1)
mean_p_c = data_p.agg(np.mean, axis=1)

As the name suggests, the `agg` function applies an aggregation function (the mean function in this case) to all rows or columns in a dataframe. The second input, `axis=1`, applies the specified function to all rows in data_p. Typing 0 would have calculated column means instead (check and see for yourself). Type `help(pd.DataFrame.agg)` in your interactive Python window for more details, or see Python Walkthrough 2.5 for further practice.
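
To see the effect of the `axis` argument without needing the Excel file, here is a minimal sketch using the three-city example data from Python Walkthrough 2.1. The names `example`, `mean_by_period` and `mean_by_city` are chosen for illustration only.

```python
import pandas as pd

# Example data from Python Walkthrough 2.1 (first three cities)
data = {
    "Copenhagen": [14.1, 14.1, 13.7, 12.9, 12.3, 11.7, 10.8, 10.6, 9.8, 5.3],
    "Dniprop": [11.0, 12.6, 12.1, 11.2, 11.3, 10.5, 9.5, 10.3, 9.0, 8.7],
    "Minsk": [12.8, 12.3, 12.6, 12.3, 11.8, 9.9, 9.9, 8.4, 8.3, 6.9],
}
# Label the rows 1 to 10 so that they match the period numbers
example = pd.DataFrame(data, index=pd.RangeIndex(1, 11, name="Period"))

# axis=1: one mean per row, so one value for each period
mean_by_period = example.mean(axis=1)
# axis=0: one mean per column, so one value for each city
mean_by_city = example.agg("mean", axis=0)

print(mean_by_period.round(2).head(3))
print(mean_by_city.round(2))
```

The first result has ten entries (one per period) and the second has three (one per city), which is a quick way to confirm which direction the function was applied in.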

**Plot the mean contribution**

Now we will produce a line chart showing the mean contributions.

In [None]:
fig, ax = plt.subplots()
mean_n_c.plot(ax=ax, label="Without punishment")
mean_p_c.plot(ax=ax, label="With punishment")
ax.set_title("Average contribution to public goods game")
ax.set_ylabel("Average contribution")
ax.legend();

**Figure 2.2** Average contribution to public goods game, with and without punishment.

The difference between experiments is stark, as the contributions increase and then stabilize at around \$13 when there is punishment, but decrease consistently from around \$11 to \$4 across the rounds when there is no punishment.

### Python Walkthrough 2.4

**Drawing a column chart to compare two groups**

To make a column chart, we will use the `.plot.bar()` function. We first extract the four data points we need (Periods 1 and 10, with and without punishment) and place them into another dataframe.

In [None]:
# Create new dataframe with bars in
compare_grps = pd.DataFrame([mean_n_c.loc[[1, 10]], mean_p_c.loc[[1, 10]]], index=["Without punishment", "With punishment"])
# Rename columns to have "round" in them
compare_grps.columns = ["Round " + str(i) for i in compare_grps.columns]
# flip cols and index round ready for plotting (.T is transpose)
compare_grps = compare_grps.T
# Make a bar chart
compare_grps.plot.bar(rot=0);

**Figure 2.3** Mean contributions in a public goods game.

*Tip*: Experimenting with these charts will help you to learn how to use Python and its packages. Try using `.plot.bar(stacked=True)` or using `rot=45` as *keyword arguments*, or using `.plot.barh()` instead.

### Python Walkthrough 2.5

**Calculating and understanding the standard deviation**

In order to calculate these standard deviations and variances, we will use the agg function, which we introduced in Python Walkthrough 2.3. As we saw, `agg` is a command asking **pandas** to aggregate a set of rows or columns of the dataframe using a particular aggregation function. The basic structure is as follows: `df.loc[conditions or rows, columns].agg([function1, function2, ...])`. So to calculate the variances and more, we use the following command:

In [None]:
n_c = data_n.agg(["std", "var", "mean"], 1)
n_c

Here we take `data_n` and apply the `"std"`, `"var"`, and `"mean"` functions to each row (recall that the second input 1 does this; 0 would indicate columns). Note that the index column, which contains the period numbers, is excluded from the calculation. The result is saved as a new variable called `n_c`.

We then apply the same principle to the `data_p` dataframe.

In [None]:
p_c = data_p.agg(["std", "var", "mean"], 1)
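
As a quick check on how these statistics relate to each other, the sketch below uses the first three periods of the example data from Python Walkthrough 2.1 (the name `example` is chosen for illustration). It confirms that the variance is the square of the standard deviation, and shows that **pandas** computes the sample standard deviation (dividing by n − 1), whereas NumPy's `np.std` divides by n unless you pass `ddof=1`.

```python
import numpy as np
import pandas as pd

# First three periods of the example data from Python Walkthrough 2.1
example = pd.DataFrame(
    {
        "Copenhagen": [14.1, 14.1, 13.7],
        "Dniprop": [11.0, 12.6, 12.1],
        "Minsk": [12.8, 12.3, 12.6],
    },
    index=pd.Index([1, 2, 3], name="Period"),
)
stats = example.agg(["std", "var", "mean"], axis=1)

# The variance is the square of the standard deviation
var_matches = np.allclose(stats["var"], stats["std"] ** 2)

# pandas divides by n - 1 (ddof=1); NumPy's default divides by n (ddof=0)
period_1 = example.loc[1].to_numpy()
sample_sd = np.std(period_1, ddof=1)
population_sd = np.std(period_1)

print(var_matches, sample_sd, population_sd)
```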

To determine whether 95% of the observations fall within two standard deviations of the mean, we can use a line chart. As we have 16 cities in every period, we would expect about one observation (0.05 × 16 = 0.8) to fall outside this interval.

In [None]:
fig, ax = plt.subplots()
n_c["mean"].plot(ax=ax, label="mean")
# mean + 2 sd
(n_c["mean"] + 2*n_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="±2 s.d.")
# mean - 2 sd
(n_c["mean"] - 2*n_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="")
for i in range(len(data_n.columns)):
    ax.scatter(x=data_n.index, y=data_n.iloc[:, i], color="k", alpha=0.3)
ax.legend()
ax.set_ylabel("Average contribution")
ax.set_title("Contribution to public goods game without punishment")
plt.show();

**Figure 2.4** Contribution to public goods game without punishment.

None of the observations fall outside the mean ± two standard deviations interval for the public goods game without punishment. Let’s see the equivalent chart for the version with punishment.

In [None]:
fig, ax = plt.subplots()
p_c["mean"].plot(ax=ax, label="mean")
# mean + 2 sd
(p_c["mean"] + 2*p_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="±2 s.d.")
# mean - 2 sd
(p_c["mean"] - 2*p_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="")
for i in range(len(data_p.columns)):
    ax.scatter(x=data_p.index, y=data_p.iloc[:, i], color="k", alpha=0.3)
ax.legend()
ax.set_ylabel("Average contribution")
ax.set_title("Contribution to public goods game with punishment")
plt.show();

**Figure 2.5** Contribution to public goods game with punishment.

Here it looks as if we only have one observation outside the interval (in Period 8). In that respect the two experiments look similar. However, from comparing these two charts, we see that the game with punishment displays a greater variation of responses than the game without punishment. In other words, there is a larger standard deviation and variance for the observations coming from the game with punishment.
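
Reading points off a chart can be error-prone, so it may help to count the observations outside the interval directly. The following is a minimal sketch rather than part of the original walkthrough: the helper `count_outside_band` and the dataframe `toy` are hypothetical, and the numbers in `toy` are made up purely so that the result is easy to verify by hand.

```python
import pandas as pd


def count_outside_band(df, k=2):
    """Count, for each row, the values lying outside mean ± k standard deviations."""
    mean = df.mean(axis=1)
    sd = df.std(axis=1)
    lower = mean - k * sd
    upper = mean + k * sd
    # Compare every value in a row with that row's own limits
    outside = df.lt(lower, axis=0) | df.gt(upper, axis=0)
    return outside.sum(axis=1)


# Made-up numbers: the first row has one clear outlier, the second has none
toy = pd.DataFrame(
    [[10, 10, 10, 10, 10, 10, 10, 20], [10, 11, 12, 13, 10, 11, 12, 13]],
    index=pd.Index([1, 2], name="Period"),
    columns=list("ABCDEFGH"),
)
outside_counts = count_outside_band(toy)
print(outside_counts)
```

With the real data, the same helper could be called as `count_outside_band(data_n)` or `count_outside_band(data_p)`, assuming those dataframes hold numeric values.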

### Python Walkthrough 2.6

**Finding the minimum, maximum, and range of a variable**

To calculate the range for both experiments and for all periods, we will use the `apply` method, which runs a function of our choosing on each row. A first version looks like this:

In [None]:
data_p.apply(lambda x: x.max() - x.min(), axis=1)

This is a *lambda* function, an idea in programming (and mathematics) that has a long and interesting history. Here, it's saying take the difference between the maximum and minimum of each row.

Because making code re-usable is good programming practice, we will instead define this as a separate function:

In [None]:
range_function = lambda x: x.max() - x.min()
range_p = data_p.apply(range_function, axis=1)
range_n = data_n.apply(range_function, axis=1)
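
An alternative to assigning a lambda to a name is to write an ordinary named function with `def`, which leaves room for a docstring. The sketch below assumes a hypothetical function name, `contribution_range`, and uses the first three periods of the example data from Python Walkthrough 2.1. It also shows that the same ranges can be obtained without `apply`, by subtracting the row minimum from the row maximum.

```python
import pandas as pd

# First three periods of the example data from Python Walkthrough 2.1
example = pd.DataFrame(
    {
        "Copenhagen": [14.1, 14.1, 13.7],
        "Dniprop": [11.0, 12.6, 12.1],
        "Minsk": [12.8, 12.3, 12.6],
    },
    index=pd.Index([1, 2, 3], name="Period"),
)


def contribution_range(row):
    """Return the gap between the largest and smallest value in a row."""
    return row.max() - row.min()


# Apply the named function to each row
range_by_apply = example.apply(contribution_range, axis=1)
# The same calculation using built-in row-wise max and min
range_vectorized = example.max(axis=1) - example.min(axis=1)

print(range_by_apply.round(1))
```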

Let’s create a chart of the ranges for both experiments for all periods in order to compare them.

In [None]:
fig, ax = plt.subplots()
range_p.plot(ax=ax, label="With punishment")
range_n.plot(ax=ax, label="Without punishment")
ax.set_ylim(0, None)
ax.legend()
ax.set_title("Range of contributions to the public goods game")
plt.show();

**Figure 2.6** Range of contributions to public goods game.

This chart confirms what we found in Python Walkthrough 2.5, which is that there is a greater spread (variation) of contributions in the game with punishment.

### Python Walkthrough 2.7

**Creating a table of summary statistics**

We have already done most of the work for creating this summary table in Python Walkthrough 2.6. Since we also want to display the minimum and maximum values, we should create these too. It is also convenient to include the standard deviation and mean using the same syntax (even though we calculated the mean separately earlier).

In [None]:
funcs_to_apply = [range_function, "max", "min", "std", "mean"]
summ_p = data_p.apply(funcs_to_apply, axis=1).rename(columns={"<lambda>": "range"})
summ_n = data_n.apply(funcs_to_apply, axis=1).rename(columns={"<lambda>": "range"})

Now we display the summary statistics in a table. We use the `round` method, which reduces the number of digits displayed after the decimal point (`2` in our case) and makes the table easier to read.

In [None]:
summ_n.loc[[1, 10], :].round(2)

And we can do the same for the version with punishment.

In [None]:
summ_p.loc[[1, 10], :].round(2)
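
Another way to build a similar table, sketched here under the assumption that only built-in statistics are needed, is to aggregate with string names and then add the range as a derived column. This avoids having to rename a `<lambda>` column. The names `example` and `summary` are chosen for illustration, and the data are the first three periods from Python Walkthrough 2.1.

```python
import pandas as pd

# First three periods of the example data from Python Walkthrough 2.1
example = pd.DataFrame(
    {
        "Copenhagen": [14.1, 14.1, 13.7],
        "Dniprop": [11.0, 12.6, 12.1],
        "Minsk": [12.8, 12.3, 12.6],
    },
    index=pd.Index([1, 2, 3], name="Period"),
)

# Row-wise summary statistics using built-in aggregation names
summary = example.agg(["max", "min", "std", "mean"], axis=1)
# Add the range as the difference between the max and min columns
summary = summary.assign(range=summary["max"] - summary["min"])

print(summary.round(2))
```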

## Part 2.3 How did changing the rules of the game affect behaviour?

### Python Walkthrough 2.8

**Calculating the p-value for the difference in means**

We need to extract the Period 1 observations from both the with-punishment and without-punishment data, and then feed them into a function that performs a t-test. We'll use the statistics package **pingouin** for this, which you will need to install on the command line using `pip install pingouin`. Once installed, import it using `import pingouin as pg`.

**pingouin**'s t-test function is called `ttest`. The `ttest` function is extremely flexible: if you input two variables (x and y) as shown below, it will automatically test whether the difference in means is due to chance or not (formally speaking, it tests the null hypothesis that the means of both variables are equal).

In [None]:
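# Select Period 1 by its label: .iloc[1, :-1] would instead pick the second
# row (Period 2) and drop the last city. Converting to float guards against
# columns that were read in with object dtype.

import pingouin as pg

pg.ttest(x=data_n.loc[1, :].astype(float).values, y=data_p.loc[1, :].astype(float).values)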
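
To make the idea of testing whether two means are equal more concrete, here is a minimal sketch of the classic two-sample t-statistic with a pooled variance, which assumes both groups have the same variance. It is an illustration of the formula rather than a description of what **pingouin** does internally. The function name `pooled_t_statistic` is hypothetical and the two samples are made-up numbers, not data from the experiment.

```python
import numpy as np

# Made-up samples, used only to illustrate the calculation
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])


def pooled_t_statistic(a, b):
    """Two-sample t-statistic assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (
        n_a + n_b - 2
    )
    # Standard error of the difference in means
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / standard_error


t_stat = pooled_t_statistic(x, y)
degrees_of_freedom = len(x) + len(y) - 2
print(t_stat, degrees_of_freedom)
```

The further the t-statistic is from zero, the smaller the p-value, and the less plausible it is that the observed difference in means arose by chance alone.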