# Measuring the effect of a sugar tax

[Chapter 3](https://www.core-econ.org/espp/book/text/03.html) of 
[Economy, Society, and Public Policy](https://www.core-econ.org/espp/index.html)
suggests the project [measuring the effect of a sugar tax](http://www.core-econ.org/doing-economics/book/text/03-01.html) 
to deepen the reader's knowledge of the topic. This notebook analyses the data provided in the project using Python. In particular, it uses the project to
provide an introduction to [pandas](https://pandas.pydata.org/), which its website describes as:

> a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

To run this notebook in Google Colab, click on the following badge: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ceedee666/international-teaching-week-2023/blob/main/sugar_tax.ipynb)

The data for this project has already been downloaded and stored in the [dataset_sugar_tax.xlsx](./data/dataset_sugar_tax.xlsx) file located in the [data](./data/) directory. 

### Installation

- Install the required libraries by executing `pip3 install pandas openpyxl`
- In a Jupyter notebook, the libraries can be installed using `!pip3 install pandas openpyxl`

In [None]:
!pip3 install pandas openpyxl

## Loading the dataset

In [pandas](https://pandas.pydata.org/), data is stored in
[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame)s. A 
[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) represents tabular data,
for example data from a spreadsheet or a database. 
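
As a small illustration, a DataFrame can also be built by hand from a dictionary of columns. The column names and values below are made up for this sketch and are not taken from the project dataset.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column,
# each list holds the values of that column, one entry per row.
toy = pd.DataFrame(
    {
        "product": ["cola", "water", "juice"],
        "size_oz": [12.0, 16.9, 10.0],
        "price": [1.20, 0.99, 1.50],
    }
)

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # one data type per column
```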

The `read_excel` function can be used to read data from MS Excel files into
a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame).

In [None]:
import pandas as pd

df = pd.read_excel(
    "https://github.com/ceedee666/international-teaching-week-2023/raw/main/data/dataset_sugar_tax.xlsx",
    sheet_name="Data",
)

A pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) provides several 
methods that give an overview of the data. A nice feature of Jupyter notebooks is that a
pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) is automatically 
rendered as a formatted table when it is the last expression in a cell. Compare the results of the following two cells.

In [None]:
df.head()

In [None]:
print(df.head())

It is also possible to show just the content of selected columns or just the rows with a certain value. 

In [None]:
df[["price_per_oz", "price"]]

In [None]:
df[df["store_id"] == 16]

Plots give a quick overview of the data in the DataFrame, for example a histogram of one column or a scatter plot of two columns. 

In [None]:
df["price_per_oz"].plot.hist()

In [None]:
df.plot.scatter(x="size", y="price")

The `nunique` method can be used to count the distinct values in a column.

In [None]:
num_stores = df["store_id"].nunique()
num_products = df["product_id"].nunique()

print(f"Number of unique stores: {num_stores}")
print(f"Number of unique products: {num_products}")

## Updating the DataFrame 

Most of the data in the DataFrame are numbers. However, some of these numbers represent categorical data. For example,
the value `0` in the `taxed` column represents beverages that have not been taxed, while `1` represents taxed beverages. 

The `map` method can be used to replace the existing values with a textual representation. Note that `map`, like most 
pandas methods, does not modify the data in place but returns a new object, here a new `Series`. The result is therefore
assigned back to the column to update the existing values. 

It is also possible to create new columns. This is done by simply assigning the result to a new column name. 

In [None]:
df["taxed"] = df["taxed"].map({0: "not taxed", 1: "taxed"})
df
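
The behaviour of `map` described above can be checked on a few made-up values. This is a minimal sketch on hypothetical toy data, not the project dataset; the column name `taxed_str` is invented for the illustration. It also shows that a value without an entry in the dictionary becomes a missing value.

```python
import pandas as pd

# Hypothetical toy data, not the project dataset.
toy = pd.DataFrame({"taxed": [0, 1, 1, 2]})

# map returns a new Series; toy["taxed"] itself is unchanged by this call.
labels = toy["taxed"].map({0: "not taxed", 1: "taxed"})

# Assigning to a new column name adds a column and keeps the original one.
toy["taxed_str"] = labels

print(toy)
# The value 2 has no entry in the dictionary, so its label is missing (NaN).
print(toy["taxed_str"].isna().sum())
```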

### Exercise 
For the different columns the following mappings exist: 

- `supp` column: 
    - `0` represents `Standard`
    - `1` represents `Supplemental`
- `store_type` column: 
    - `1` represents `Large Supermarket`
    - `2` represents `Small Supermarket`
    - `3` represents `Pharmacy`
    - `4` represents `Gas Station`

Replace the values in the `supp` column with the textual values. Add a new column named `store_type_str` containing the 
textual representation of the `store_type` values. 

Furthermore, the value `MAR2015` in the `time` column is not correct. It needs to be changed to `MAR2016`. The `replace` 
method can be used for this purpose. 

In [None]:
# Implement the update here
df["supp"] = df["supp"].map({0: "Standard", 1: "Supplemental"})
df["store_type_str"] = df["store_type"].map(
    {1: "Large Supermarket", 2: "Small Supermarket", 3: "Pharmacy", 4: "Gas Station"}
)
df["time"] = df["time"].replace({"MAR2015": "MAR2016"})
df
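
Why `replace` rather than `map` for correcting a single value can be seen on a small made-up `Series`. This is only a sketch on hypothetical toy values: `replace` leaves every value that is not listed untouched, whereas `map` with the same dictionary would turn the unlisted values into missing values.

```python
import pandas as pd

# Hypothetical toy data, not the project dataset.
periods = pd.Series(["DEC2014", "JUN2015", "MAR2015"])

# replace changes only the listed value; all other values stay as they are.
replaced = periods.replace({"MAR2015": "MAR2016"})

# map with the same dictionary turns every unlisted value into NaN.
mapped = periods.map({"MAR2015": "MAR2016"})

print(replaced.tolist())
print(mapped.tolist())
```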

In [None]:
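# Frequency table: number of price observations per store type and period.
# margins=True adds an "All" row and column containing the totals.
freq_table_store_type = df.pivot_table(
    index=["store_type_str"],
    columns="time",
    values="price",
    aggfunc="count",
    margins=True,
)
# Put the rows in a fixed store-type order instead of alphabetical order.
freq_table_store_type = freq_table_store_type.reindex(
    ["Large Supermarket", "Small Supermarket", "Pharmacy", "Gas Station", "All"],
    level=0,
)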

print(freq_table_store_type)

In [None]:
# Same frequency table, additionally split by tax status within each period.
freq_table_taxed = df.pivot_table(
    index=["store_type_str"],
    columns=["time", "taxed"],
    values="price",
    aggfunc="count",
    margins=True,
).reindex(
    ["Large Supermarket", "Small Supermarket", "Pharmacy", "Gas Station", "All"],
    level=0,
)
print(freq_table_taxed)

In [None]:
# Number of observations per product type and period.
freq_table_products = df.pivot_table(
    index=["type"], columns="time", values="price_per_oz", aggfunc="count", margins=True
)
print(freq_table_products)

In [None]:
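# Count the observations per (store, product) pair in each period.
fpt = df.pivot_table(
    index=["store_id", "product_id"], columns="time", values="price", aggfunc="count"
)
# Keep only the pairs observed in every period
# (a missing period is NaN, and NaN > 0 evaluates to False).
mask = (fpt > 0).all(axis=1)
fpt = fpt[mask]
print(fpt.head())
# Index the data by (store, product) so that rows can be looked up by pair.
testdf = df.reset_index()
testdf = testdf.set_index(["store_id", "product_id"])
testdf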

In [None]:
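# Restrict the data to the (store, product) pairs observed in every period.
testdf = testdf.loc[fpt.index.values]
# Drop the rows marked as "Supplemental" in the supp column.
testdf = testdf[testdf["supp"] != "Supplemental"]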
testdf
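
The filtering step in the cells above can be traced on a small made-up table. The sketch below uses hypothetical toy data and, as a variation on the `.loc` lookup, selects the rows with a boolean mask built by `isin`; it is meant as an illustration, not as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data, not the project dataset.
toy = pd.DataFrame(
    {
        "store_id": [1, 1, 1, 1, 2, 2],
        "product_id": [10, 10, 11, 11, 10, 10],
        "time": ["DEC2014", "JUN2015", "DEC2014", "DEC2014", "DEC2014", "JUN2015"],
        "price": [1.0, 1.1, 2.0, 2.1, 1.2, 1.3],
    }
)

# Count the observations per (store, product) pair and period.
counts = toy.pivot_table(
    index=["store_id", "product_id"], columns="time", values="price", aggfunc="count"
)

# Periods without an observation are NaN, and NaN > 0 evaluates to False,
# so only pairs observed in every period survive.
complete = counts[(counts > 0).all(axis=1)]

# Keep only the rows whose (store, product) pair is complete.
indexed = toy.set_index(["store_id", "product_id"])
filtered = indexed.loc[indexed.index.isin(complete.index)].reset_index()
print(filtered)
```

In this toy table the pair (store 1, product 11) is only observed in `DEC2014`, so its rows are removed.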

In [None]:
# Turn store_id and product_id back into ordinary columns.
testdf = testdf.reset_index()
testdf

In [None]:
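# Number of distinct products left after filtering
# (printed explicitly, because only the last expression of a cell is displayed).
print(testdf["product_id"].nunique())
# Distribution of the price per ounce for each product.
testdf.boxplot(column="price_per_oz", by="product_id")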

In [None]:
# Mean price per ounce by tax status and store type in each period.
new_pivot = testdf.pivot_table(
    index=["taxed", "store_type_str"],
    columns="time",
    values="price_per_oz",
    aggfunc="mean",
).round(3)

new_pivot

In [None]:
# Change in the mean price per ounce relative to DEC2014.
new_pivot["d1"] = new_pivot["JUN2015"] - new_pivot["DEC2014"]
new_pivot["d2"] = new_pivot["MAR2016"] - new_pivot["DEC2014"]
new_pivot

In [None]:
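# Reshape d1 (JUN2015 minus DEC2014): one row per store type,
# one column per tax status, so that the bars are grouped by store type.
df123 = new_pivot
df123 = df123.reset_index()
df123 = df123.pivot(index="store_type_str", columns="taxed", values="d1")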
df123.plot.bar()

In [None]:
# Same reshaping for d2 (MAR2016 minus DEC2014).
df123 = new_pivot
df123 = df123.reset_index()
df123 = df123.pivot(index="store_type_str", columns="taxed", values="d2")
df123.plot.bar()
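
The reshaping used before the bar charts can be sketched on a few made-up numbers. The values below are hypothetical and are not results from the project dataset; the sketch only shows how `pivot` turns a long table into a wide one.

```python
import pandas as pd

# Hypothetical toy values, not results from the project dataset.
long_table = pd.DataFrame(
    {
        "taxed": ["not taxed", "not taxed", "taxed", "taxed"],
        "store_type_str": ["Pharmacy", "Gas Station", "Pharmacy", "Gas Station"],
        "d1": [0.01, 0.02, 0.03, 0.04],
    }
)

# pivot turns the long table into a wide one: one row per store type,
# one column per value of "taxed".
wide_table = long_table.pivot(index="store_type_str", columns="taxed", values="d1")
print(wide_table)
```

Calling `plot.bar()` on such a wide table draws one group of bars per row and one bar per column within each group.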

## References

- [pandas Documentation](https://pandas.pydata.org/docs/)
- [pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)