# Empirical Project 7

## Getting Started in Python

Head to the "Getting Started in Python" page for help and advice on setting up a Python session to work with. Remember, you can run any page from this book as a *notebook* by downloading the relevant file from this [repository](https://github.com/aeturrell/core_python) and running it on your own computer. Alternatively, you can run pages online in your browser over at [Binder](https://mybinder.org/v2/gh/aeturrell/core_python/HEAD).

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html(no_js=True)

## Python Walkthrough 7.1

**Importing data into Python and creating tables and charts**

First ensure the data, contained in `Project-7-datafile.xlsx`, are stored within a subfolder of your working directory called `data`. The following code, using Python's built-in `glob` library, will list the file if you're in the right place:

In [None]:
import glob

glob.glob("data/*.xlsx")

If you're in the wrong place, you can change the working directory with `import os` followed by `os.chdir("path/to/your/working/directory")`, but it's better practice to open the project folder directly in an editor such as Visual Studio Code.
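If the list comes back empty, it can help to print the working directory and check for the file explicitly. The following is a minimal sketch using the standard library's `pathlib`; the helper name `data_file_exists` is hypothetical and introduced here only for illustration.

```python
from pathlib import Path


def data_file_exists(filename: str, folder: str = "data") -> bool:
    """Return True if folder/filename exists relative to the working directory."""
    # data_file_exists is a hypothetical helper, not part of any library
    return (Path.cwd() / folder / filename).is_file()


print("Working directory:", Path.cwd())
print("Data file found:", data_file_exists("Project-7-datafile.xlsx"))
```

If the second line prints `False`, the working directory shown on the first line is not the folder that contains `data`.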

Now let's read in the data using the **pandas** `pd.read_excel` function.

In [None]:
df = pd.read_excel(Path("data/Project-7-datafile.xlsx"), sheet_name="Sheet1")
df.head()

We're going to use the `np.exp` function to create the variables `"p"` (price), `"q"` (quantity), and `"h"` (harvest) from their log counterparts.

The column names are awkward to work with because they contain whitespace, including the new line character (`"\n"`), so we'll clean them up first. We'll use a *regular expression* to replace each run of whitespace with a single underscore.

In [None]:
# Use a raw string so that Python does not treat "\s" as an invalid escape sequence
df.columns = df.columns.str.replace(r"\s+", "_", regex=True)
df.head()
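To see what the pattern `\s+` does in isolation, here is a small sketch using Python's `re` module on plain strings. The raw names below are assumptions made up for illustration (check `df.columns` for the real ones); the point is that any run of spaces or new lines collapses to a single underscore.

```python
import re

# Hypothetical raw headers, for illustration only
raw_names = ["Year", "log q\n(Q)", "log  p\n(P)"]

# \s+ matches one or more whitespace characters (spaces, tabs, new lines)
clean_names = [re.sub(r"\s+", "_", name) for name in raw_names]
print(clean_names)  # ['Year', 'log_q_(Q)', 'log_p_(P)']
```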

Now we can transform some of the columns with `np.exp`. As we're applying the same function multiple times, we can use a loop.

In [None]:
cols_to_convert = {"log_q_(Q)": "q", "log_p_(P)": "p", "log_h_(X)": "h"}
for key, value in cols_to_convert.items():
    df[value] = np.exp(df[key])

df.head()
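As a quick sanity check on why `np.exp` recovers the original variables, the sketch below takes natural logs of some made-up numbers and exponentiates them again. It assumes the logged columns are natural logs, which is the transformation that `np.exp` inverts; the values are illustrative and not taken from the data file.

```python
import numpy as np

# Made-up values for illustration, not taken from the watermelon data
levels = np.array([1.0, 10.0, 250.0])
logs = np.log(levels)      # natural logarithm of each value
recovered = np.exp(logs)   # exponentiating returns the original levels

print(np.allclose(recovered, levels))  # True
```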

Let's plot the chart for the prices, with year as the horizontal axis variable and price (`"p"`) as the vertical axis variable.

In [None]:
(
    ggplot(df, aes(x=as_discrete("Year"), y="p"))
    + geom_line(size=2)
    + labs(x="Year", y="Price")
    + theme(axis_text=element_text(angle=0))
    + scale_x_continuous(format="d")
)

**Figure 7.2** *Line chart for prices of watermelons.*

Now we create the line chart for harvest and crop quantities (the variables `"h"` and `"q"`, respectively). These data are not "tidy", so it's harder to use **lets-plot** and the "grammar of graphics" approach. We could use **matplotlib** instead, but as we've used **lets-plot** already, we're going to pursue our other option, which is to transform the data.

To turn them into a tidy format, we use `pd.melt`. We're going to keep `"Year"` as an identifier column and stack `"h"` and `"q"` (renamed to `"Harvest"` and `"Crop"` so that the legend is readable). Their names will go into a new column called `"variable"`, and their values into another new column called `"quantity"`.

In [None]:
tidy_df = pd.melt(
    df.rename(columns={"h": "Harvest", "q": "Crop"}),
    id_vars="Year",
    value_vars=["Harvest", "Crop"],
    value_name="quantity",
)
tidy_df.head()
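To make the reshaping concrete, here is a minimal sketch of the same `pd.melt` call on a tiny table. The numbers are invented for illustration and are not the watermelon data.

```python
import pandas as pd

# Made-up numbers for illustration, not the watermelon data
wide = pd.DataFrame(
    {"Year": [1930, 1931], "Harvest": [10.0, 12.0], "Crop": [9.0, 11.0]}
)

# Each (Year, column) pair in the wide table becomes one row in the long table
long = pd.melt(
    wide, id_vars="Year", value_vars=["Harvest", "Crop"], value_name="quantity"
)
print(long)
```

The two-row wide table becomes a four-row long table with columns `"Year"`, `"variable"`, and `"quantity"`, which is the shape **lets-plot** needs in order to map `"variable"` to colour and line type.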

In [None]:
(
    ggplot(
        tidy_df,
        aes(x=as_discrete("Year"), y="quantity", linetype="variable", color="variable"),
    )
    + geom_line(size=2)
    + labs(y="Quantity")
    + theme(axis_text_x=element_text(angle=0))
    + scale_x_continuous(format="d")
)

**Figure 7.3** *Line chart for harvest and crop for watermelons.*