# Empirical Project 6

## Getting Started in Python

Head to the "Getting Started in Python" page for help and advice on setting up a Python session to work with. Remember, you can run any page from this book as a *notebook* by downloading the relevant file from this [repository](https://github.com/aeturrell/core_python) and running it on your own computer. Alternatively, you can run pages online in your browser over at [Binder](https://mybinder.org/v2/gh/aeturrell/core_python/HEAD).

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *

LetsPlot.setup_html(no_js=True)

### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use(
    "https://raw.githubusercontent.com/aeturrell/core_python/main/plot_style.txt"
)

## Part 6.1: Looking for Patterns in Survey Data

### Learning objectives for this part

- explain how survey data is collected, and describe measures that can increase the reliability and validity of survey data
- use column charts and box and whisker plots to compare distributions
- calculate conditional means for one or more conditions, and compare them on a bar chart
- use line charts to describe the behaviour of real-world variables over time.

### Downloading the Data

First download the data used in the paper to understand how this information was collected. The data is publicly available and free of charge, but you will need to create a user account in order to access it.

- Go to the [World Management Survey pages](https://worldmanagementsurvey.org/).
- Click download survey data, then register, and fill in the form.
- An account activation link will be sent to the email you provided. Click on it to activate your account.
- Now go to the World Management Survey data download page.
- In the subsection ‘Download the public WMS data now’, click the ‘Download Now’ button.
- In the ‘Login’ section, enter your account’s email and password, then click ‘Login’.
- Under the heading ‘Manufacturing: 2004–2010 combined survey data (AMP)’, click the ‘Download’ button.
- Unzip the files in the downloaded zip folder into the data/ folder within your working directory.
- You may also find it helpful to download the Bloom et al. paper [‘Management practices across firms and countries’](https://tinyco.re/6438551).

## Python Walkthrough 6.1

**Importing data into Python and creating tables and charts**

Before opening an Excel or CSV file using Python, you can open the file in spreadsheet software (such as Excel) to understand how it's structured. From looking at the file, we learn that:

* the variable names are in the first row (no need to use the `skiprows` keyword argument)
* missing values are represented by empty cells
* the last variable is in Column S, with short variable descriptions in Column U: it is easier to import everything first and remove the unnecessary data afterwards.

We will call our imported data `df`.

In [None]:
df = pd.read_csv(Path("data/AMP_graph_manufacturing.csv"))
df.info()

You can see that the penultimate column, Column T, was imported as `"Unnamed: 19"` and only contains `NaN`s. The final column imported into **pandas** comes from Column U in the spreadsheet and contains information about the variables (named `"storage display value"`).

Let's extract the information about the variables in a new **pandas** series called `man_varinfo` and then remove both of these columns from the dataset. To make it easier to see the `man_varinfo`, we'll temporarily override **pandas** column width limits.

In [None]:
man_varinfo = df.iloc[:, -1].dropna()

with pd.option_context("display.max_colwidth", 80):
    print(man_varinfo)

And now to drop the last two columns using slicing. The syntax is `.iloc[:, :-i]` where `i` is the number of columns we wish to drop from the end.

In [None]:
df = df.iloc[:, :-2]
df.head()
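
To see how this slicing behaves in isolation, here is a minimal sketch on a hypothetical toy dataframe (the column names and values are invented for illustration and are not part of the survey data). The first `:` keeps every row and `:-2` keeps every column except the last two.

```python
import pandas as pd

# Hypothetical toy dataframe: two data columns followed by two unwanted columns
toy = pd.DataFrame(
    {
        "country": ["Chile", "Brazil"],
        "management": [2.8, 2.7],
        "Unnamed: 2": [None, None],
        "notes": ["text a", "text b"],
    }
)

# Keep all rows (:) and every column except the last two (:-2)
trimmed = toy.iloc[:, :-2]
print(trimmed.columns.tolist())
```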

A few of the variables that have been imported as numbers are actually categorical variables: `"mne_f"` , `"mne_d"`, and `"competition2004"`. **pandas** doesn't automatically know what datatypes different variables should have. However, we can set the type of these variables as categorical and we can use labels to define what each of the numbers in the variables represents.

The first two have quite clear labels. For the third, we'll use some string manipulation tools to grab the labels directly from the `man_varinfo` variable.

In [None]:
lab_mne_f = ["No MNE_f", "MNE_f"]
lab_mne_d = ["No MNE_d", "MNE_d"]
lab_comp2004 = man_varinfo.iloc[16].split("  ")[-1].split(",")
print(lab_comp2004)

The third line of code above is doing a lot of work here. When you do coding you will often use someone else's code as a starting point for your code, and trying to figure out what some code does is therefore a very important skill. Below you can see a dissection of what each part of this line does.

In [None]:
explain_1 = man_varinfo.iloc[16]
print(explain_1)
print("\n")  # prints new line
explain_2 = man_varinfo.iloc[16].split("  ")
print(explain_2)
print("\n")  # prints new line
explain_3 = man_varinfo.iloc[16].split("  ")[-1]
print(explain_3)
print("\n")  # prints new line
explain_4 = man_varinfo.iloc[16].split("  ")[-1].split(",")
print(explain_4)

Let's now encode these variables as categoricals, with suitable names.

In [None]:
for col in ["mne_f", "mne_d", "competition2004"]:
    df[col] = pd.Categorical(df[col])

df["mne_f"] = df["mne_f"].cat.rename_categories(lab_mne_f)
df["mne_d"] = df["mne_d"].cat.rename_categories(lab_mne_d)
df["competition2004"] = df["competition2004"].cat.rename_categories(lab_comp2004)

When you create new labels, check that they have been attached to the correct entries. Although we set them as a list here using, for example, `["No MNE_f", "MNE_f"]`, we could have also used a dictionary, for example `{0: "No MNE_f", 1: "MNE_f"}`, mapping old values into new.
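
As a rough illustration of the two labelling approaches and of checking the result, the sketch below uses a small made-up series (not the survey data). The list version relies on the order of the existing categories, while the dictionary version states the mapping explicitly.

```python
import pandas as pd

# Hypothetical 0/1 indicator stored as a categorical
toy = pd.Series([0, 1, 1, 0], dtype="category")

# List form: labels are matched to the existing categories by position
by_list = toy.cat.rename_categories(["No MNE_f", "MNE_f"])

# Dictionary form: each old value is mapped explicitly to its new label
by_dict = toy.cat.rename_categories({0: "No MNE_f", 1: "MNE_f"})

# Check the labels against the original codes, one row per distinct pairing
check = pd.DataFrame(
    {"code": toy.astype(int), "label": by_dict.astype(str)}
).drop_duplicates()
print(check)
```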

To create the tables, we can use a technique called method chaining. Rather than have a series of separate assignment commands using `=`, method chaining "chains" together a series of methods (these are preceded by `.`). This approach has pros and cons: it can be easier to read, but harder to debug in case of errors.

First, we will group data by country (using `.groupby`), then calculate the required aggregate statistics for each of these groups (using `.agg`), then order the countries according to their overall score (highest to lowest) (`.sort_values`).

`.groupby` groups the data according to a given column.

`.agg` aggregates data, and returns it with a different index. In combination with `.groupby`, it will return an index based on what column(s) was/were passed to the groupby operation. There are many ways to use `.agg`, including passing a single function name, as in `.agg("mean")`, or other functions such as `"count"`, `"median"`, `"sum"`, and `"std"`. To choose which columns are aggregated and how, you can pass a dictionary, say `{"columnname": "mean"}`. To also set the names of the returned columns, use a keyword argument whose name is the new column name and whose value is an object called a tuple that specifies the old column name and the aggregation function, for example `newcolumnname=("columnname", "count")`.

In [None]:
table_mean = (
    df.groupby("country")
    .agg(
        obs=("management", "count"),
        m_overall=("management", "mean"),
        m_monitor=("monitor", "mean"),
        m_target=("target", "mean"),
        m_incentives=("people", "mean"),
    )
    .sort_values(by="m_overall", ascending=False)
)

table_mean.round(2)
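
To compare the different ways of calling `.agg` side by side, here is a small sketch on a hypothetical dataframe with two made-up countries. It is only meant to show how the output columns are named in each case.

```python
import pandas as pd

# Hypothetical data: two countries with a handful of scores each
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B", "B"],
        "management": [2.0, 3.0, 3.0, 4.0, 5.0],
    }
)

# A single function name applied to one grouped column (returns a series)
simple = toy.groupby("country")["management"].agg("mean")

# Dictionary form: keys are existing columns, values are functions;
# the output column keeps its original name
by_dict = toy.groupby("country").agg({"management": "mean"})

# Named aggregation: keyword = new column name,
# tuple = (existing column, aggregation function)
named = toy.groupby("country").agg(
    obs=("management", "count"),
    m_overall=("management", "mean"),
)
print(named)
```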

Let’s make the table showing the ranks. We can use the `.agg` function again, but in this case it keeps the same index as before because we're not using a groupby.

We will also drop the `"obs"` column, and rename all of the columns so that they begin with `"r"` (for rank) rather than `"m"`.

In [None]:
table_rank = table_mean.agg("rank", ascending=False).drop("obs", axis=1)
table_rank.columns = [x.replace("m_", "r_") for x in table_rank.columns]
table_rank
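
If it is unclear what ranking does to each column, the following sketch applies the `rank` method directly to a hypothetical three-row table (values invented for illustration). With `ascending=False`, the largest value in each column receives rank 1.

```python
import pandas as pd

# Hypothetical mean scores for three made-up countries
toy = pd.DataFrame(
    {"m_overall": [3.3, 2.7, 3.0], "m_monitor": [3.5, 3.1, 2.9]},
    index=["A", "B", "C"],
)

# Rank each column separately; the highest score gets rank 1
ranks = toy.rank(ascending=False)
print(ranks)
```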

Now we use **lets-plot** to create a bar chart of the `"m_overall"` value in `table_mean`. To present countries in order of their management score, we will first re-order the dataframe using `sort_values`.

In [None]:
table_mean = table_mean.sort_values(by="m_overall")

In [None]:
(
    ggplot(table_mean.reset_index(), aes(y="country", x="m_overall"))
    + geom_bar(stat="identity", orientation="y")
    + labs(
        x="Average management practice score",
        y="Country",
        title="Management practices in manufacturing firms",
    )
)

**Figure 6.3** *Management practices in manufacturing firms around the world.*


If you want to switch the order of the bars, use `table_mean.sort_values(by="m_overall", ascending=False)` before creating the plot.

## Python Walkthrough 6.2

**Obtaining frequency counts and plotting overlapping histograms**

To get frequency counts, start with the `pd.cut` function. This assigns each observation to one of the intervals defined by a list of bin edges, which we create with **numpy**'s `arange` function; this accepts arguments of the form `arange(start, stop, step)`, where `stop` itself is not included in the result.

We store this information in the **pandas** series `chile_intervals`. This gives the appropriate interval for each entry in the series. To return the counts, we need to aggregate this information to the total counts per bin, which we can do with `value_counts`. To keep the order of the intervals, we'll specify `value_counts(sort=False)` too.

In [None]:
chile_intervals = pd.cut(
    df.loc[df["country"] == "Chile", "management"], bins=np.arange(0, 5, 0.2)
)
chile_counts = chile_intervals.value_counts(sort=False)
chile_counts
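
The two steps (assigning intervals, then counting) can be seen more easily on a handful of made-up numbers. In this sketch, which assumes nothing about the survey data, note that `pd.cut` uses right-closed intervals by default and that any value outside the bin edges is assigned `NaN` and so is left out of the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the last one lies beyond the final bin edge
scores = pd.Series([1.1, 1.3, 1.9, 2.5, 2.6, 3.5])

# Bin edges 1, 2, 3 (the stop value, 4, is excluded by arange)
bins = np.arange(1, 4, 1)

# Step 1: assign each score to an interval, (1, 2] or (2, 3]
intervals = pd.cut(scores, bins=bins)

# Step 2: count observations per interval, keeping the interval order
counts = intervals.value_counts(sort=False)
print(counts)
print("Scores outside the bins:", intervals.isna().sum())
```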

That's how to get hold of a histogram in the form of data. If we want to jump straight to plotting a histogram, we have a few options.

If we just want a quick look, we can use the built-in **pandas** histogram function, `.plot.hist()`.

In [None]:
country_to_hist = "Chile"
df.loc[df["country"] == country_to_hist, "management"].plot.hist();

You can of course put this on an existing (**matplotlib**) plot if you like, and turn it into a neater figure. Here's an example of how you'd do just that:

In [None]:
min_val, max_val, step = 0, 5, 0.2

fig, ax = plt.subplots()
df.loc[df["country"] == country_to_hist, "management"].hist(
    bins=np.arange(min_val, max_val, step), ax=ax
)
ax.set_xlabel("Management score")
ax.set_ylabel("Count")
ax.set_xlim(min_val, max_val)
ax.set_axisbelow(True)
ax.set_title(f"Histogram of management scores for {country_to_hist}")
plt.show()

We created a figure and chart axes (called `ax`) first, then drew the histogram on them by passing `ax=ax` to `.hist()`. Then we added contextual information such as a title and axis labels.

An alternative way of generating histograms is using **lets-plot**. **matplotlib** is great for when you need lots of flexibility and customisation, but as histograms are such common chart types, they're covered by **lets-plot**.

In [None]:
(
    ggplot(df.loc[df["country"] == country_to_hist, :], aes(x="management"))
    + geom_histogram(color="black", alpha=0.5)
    + labs(
        x="Management Score",
        y="Counts",
        title=f"Histogram of management scores for {country_to_hist}",
    )
)

Using **matplotlib**, it is possible to add a second country to the same chart too, but in this case we're going to use **lets-plot**. If you did want to do it in **matplotlib**, you could try looking up how on the internet. Everyone, no matter how expert they are at coding, uses the internet to grab snippets that help them achieve what they want. In this case, you could try searching for "Plot two histograms on the same graph matplotlib".
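
For reference, one possible **matplotlib** version is sketched below. It uses randomly generated scores for two hypothetical countries rather than the survey data, and is just one of several ways to do this: each call to `ax.hist` draws onto the same axes, and a shared set of bin edges plus some transparency keeps the two distributions comparable.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical scores for two made-up countries (not the survey data)
rng = np.random.default_rng(0)
scores_a = rng.normal(2.8, 0.6, 300).clip(1, 5)
scores_b = rng.normal(3.3, 0.6, 600).clip(1, 5)

# Shared bin edges so that the bars of both histograms line up
bins = np.linspace(1, 5, 21)

fig, ax = plt.subplots()
# Each call draws on the same axes; alpha makes overlapping bars visible
ax.hist(scores_a, bins=bins, alpha=0.6, edgecolor="black", label="Country A")
ax.hist(scores_b, bins=bins, alpha=0.6, edgecolor="black", label="Country B")
ax.set_xlabel("Management score")
ax.set_ylabel("Count")
ax.legend()
plt.show()
```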

**lets-plot** will allow us to have two histograms on the same chart too.

We're going to demonstrate this by using similar code to above to plot a second country too, the United States.

In [None]:
country_to_hist = "Chile"
scnd_country_hist = "United States"

(
    ggplot(
        df.loc[df["country"].isin([country_to_hist, scnd_country_hist]), :],
        aes(x="management", fill="country"),
    )
    + geom_histogram(alpha=0.6, color="black", position="identity")
    + labs(
        x="Management score",
        title=f"Management scores for {country_to_hist} and {scnd_country_hist}",
        caption="Source: World Management Survey",
    )
)

Ideally we would want to put these two countries on the same chart but *normalised* because we're more interested in the distribution of scores than the absolute numbers of surveys. Unfortunately though, **lets-plot** doesn't yet have a `geom_histogram` option that will normalise the data.

**Figure 6.6** *Comparing the distribution of management scores for the US and Chile.*
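
To illustrate what normalising means here, the sketch below computes density-scaled histograms with **numpy** for two hypothetical samples of very different sizes (randomly generated, not the survey data). With `density=True`, bar heights are rescaled so that the total area of each histogram is 1, which makes samples of different sizes directly comparable.

```python
import numpy as np

# Two hypothetical samples with very different numbers of observations
rng = np.random.default_rng(1)
small_sample = rng.uniform(1, 5, 100)
large_sample = rng.uniform(1, 5, 1000)

bins = np.linspace(1, 5, 9)  # eight equal-width bins between 1 and 5
widths = np.diff(bins)

# density=True rescales the counts so that each histogram has total area 1
dens_small, _ = np.histogram(small_sample, bins=bins, density=True)
dens_large, _ = np.histogram(large_sample, bins=bins, density=True)

area_small = (dens_small * widths).sum()
area_large = (dens_large * widths).sum()
print(area_small, area_large)
```

The `hist` method in **matplotlib** accepts the same `density=True` argument if you want to draw the normalised version directly.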

## Python Walkthrough 6.3

**Creating box plots**

We will use a somewhat similar code structure as we did for the overlapping histograms, this time plotting countries on the horizontal axis and management scores on the vertical axis. Instead of laboriously specifying a different box and whisker plot for each country independently, though, we will filter the dataframe to every country in a list using `df["country"].isin()`, and then map `"country"` to the horizontal axis in `aes` so that `geom_boxplot()` draws one box per country.

In [None]:
countries_to_include = ["Chile", "United States", "Brazil", "Germany", "UK"]

(
    ggplot(
        df.loc[df["country"].isin(countries_to_include), :],
        aes(x="country", y="management"),
    )
    + geom_boxplot()
    + labs(
        y="Management score", x="Country", title="Management scores grouped by country"
    )
)

## Python Walkthrough 6.4

**Calculating confidence intervals and adding them to a chart**

As in Python Walkthrough 6.1, we use method chaining to do this. First, we take the management dataframe, `df`, and extract the countries we need using `df.loc`. Then, we group the data by country (using `groupby`) and calculate the required aggregate measures (using `agg`), mapping the management score into three new variables: the mean, the standard deviation, and the number of observations (using the `len` function).

The final step is to compute the half-width of the error bars, `"m_err"`, from the new columns (using `assign`) and, finally, to sort the data according to the values of `"mean_m"`. We save the final result in `table_stats`.



In [None]:
countries_to_include = ["Chile", "United States", "Brazil", "Germany", "UK"]

table_stats = (
    df.loc[df["country"].isin(countries_to_include), :]
    .groupby("country")
    .agg(
        mean_m=("management", "mean"),
        sd_m=("management", "std"),
        obs=("management", len),
    )
    .assign(m_err=lambda x: 1.96 * np.sqrt(x["sd_m"] ** 2 / (x["obs"] - 1)))
    .sort_values("mean_m", ascending=False)
)

table_stats.round(2)
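
To make the arithmetic behind `"m_err"` concrete, here is the same calculation written out step by step for a hypothetical set of four scores (invented for illustration). It follows the formula used in the walkthrough's `assign` step, and the interval is the mean plus or minus that error.

```python
import numpy as np
import pandas as pd

# Hypothetical management scores for a single made-up country
scores = pd.Series([2.0, 3.0, 3.0, 4.0])

mean_m = scores.mean()  # sample mean
sd_m = scores.std()  # sample standard deviation (pandas uses ddof=1)
obs = len(scores)  # number of observations

# Same formula as the walkthrough: 1.96 * sqrt(sd^2 / (n - 1))
m_err = 1.96 * np.sqrt(sd_m**2 / (obs - 1))

# Interval drawn by the error bars: mean minus/plus the error
lower, upper = mean_m - m_err, mean_m + m_err
print(round(lower, 2), round(upper, 2))
```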

We can use this as the basis of a bar chart in **matplotlib** to quickly see mean management scores.

In [None]:
fig, ax = plt.subplots()
table_stats.plot.bar(ax=ax, y="mean_m", yerr="m_err", rot=0, capsize=5)
ax.set_ylim(2, 4)
ax.legend([])  # Turns legend off
ax.set_title("Mean management score across selected countries")
plt.show()

**Figure 6.10** *Bar chart of mean management score in manufacturing firms for a selection of countries, with 95% confidence intervals.*

## Python Walkthrough 6.5

**Calculating and adding conditional summary statistics and confidence intervals to a chart**

To do this, we will use many techniques encountered previously, but first we have to create a new variable that indicates whether a firm is larger or smaller than a certain threshold; we'll call this `"size"`. A firm with a value of `"lemp_firm"` greater than 5.8 is considered larger. We use a new categorical column (based on a boolean check on the condition) to make this distinction, and rename the falses to be `"smaller"` and the trues to be `"larger"`.

In [None]:
df["size"] = pd.Categorical(df["lemp_firm"] > 5.8).rename_categories(
    {False: "smaller", True: "larger"}
)
df.head()
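
The sketch below repeats this step on a few made-up values of `"lemp_firm"` so the outcome for edge cases is visible. Two things to be aware of: a value exactly equal to 5.8 is not greater than 5.8, and a missing value also compares as `False`, so both end up labelled `"smaller"`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including the threshold itself and a missing value
toy = pd.DataFrame({"lemp_firm": [4.9, 5.8, 6.3, np.nan]})

# Boolean check -> categorical with categories False/True -> readable labels
toy["size"] = pd.Categorical(toy["lemp_firm"] > 5.8).rename_categories(
    {False: "smaller", True: "larger"}
)
print(toy)
```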

Let's look at Canada, Brazil, and the United States. Again, we use method chaining to make the table. In the `groupby` command, we group the data by country, ownership, and size.

In [None]:
countries_to_include = ["Canada", "United States", "Brazil"]

table_stats2 = (
    df.loc[df["country"].isin(countries_to_include), :]
    .groupby(["country", "ownership", "size"])
    .agg(
        mean_m=("management", "mean"),
        sd_m=("management", "std"),
        obs=("management", len),
    )
)

table_stats2.round(2)

Now we use the variable `"size"` as a column variable, so that we can see the summary statistics in two blocks of columns (separately for larger and smaller firms). Because `"size"` is the innermost index column of our three index columns (they are the three columns we passed to the `groupby` method), we can use the `unstack` method to split the main table by the unique values in the `"size"` index (we also round the values to 2 decimal places using `.round(2)`). The 'smaller' and 'larger' categories now appear under the three headings `mean_m`, `sd_m`, and `obs` separately.

In [None]:
table_stats2.unstack().round(2)
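
To see what `unstack` does to the shape of a table, here is a minimal sketch on a hypothetical four-row dataset (the ownership labels and scores are invented). By default, `unstack` moves the innermost index level into the columns.

```python
import pandas as pd

# Hypothetical data: one made-up country, two ownership types, two sizes
toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "A"],
        "ownership": ["Family", "Family", "Private", "Private"],
        "size": ["smaller", "larger", "smaller", "larger"],
        "management": [2.5, 3.0, 2.8, 3.4],
    }
)

# Long format: one row per (country, ownership, size) combination
long = toy.groupby(["country", "ownership", "size"]).agg(
    mean_m=("management", "mean")
)

# Wide format: the innermost index level ("size") becomes a column level
wide = long.unstack()
print(wide)
```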

This chapter used the following package versions:

In [None]:
%load_ext watermark
%watermark --iversions