# Empirical Project 5

---
**Download the code**

To download the code used in this project as a notebook that can be run in Visual Studio Code or Google Colab, right-click [here]() and select 'Save Link As', then save it as a `.ipynb` file.

Don’t forget to also download the data into your working directory by following the steps in this project.

---

## Getting started in Python

For this project, you will need the following packages:

- **pandas** for data analysis
- **matplotlib** for data visualisation
- **numpy** for numerical methods

You'll also be using the **warnings** and **pathlib** packages, but these come built-in with Python.

Remember, you can install packages in Visual Studio Code's integrated terminal (click "View > Terminal") by running `conda install packagename` (if using the Anaconda distribution of Python) or `pip install packagename` if not.
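
If you are not sure whether the packages are already installed, a quick check like the one below can tell you before you try to import them. This is only a sketch: it uses `find_spec` from Python's built-in **importlib** module, and the list `required` is simply the three packages named above.

```python
from importlib.util import find_spec

# Packages this project relies on that are not part of the standard library
required = ["pandas", "matplotlib", "numpy"]

# find_spec returns None when a package cannot be found in the current environment
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required packages are installed.")
```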

Once you have the Python packages installed, you will need to import them into your Python session—and configure any other initial settings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings

# Set the plot style for prettier charts:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## TODO

Download the zip and extract the files .. and .. to `data/` within your working directory.
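
If you prefer to extract the files with Python rather than by hand, the built-in **zipfile** module can do it. The sketch below assumes the zip has been saved in your working directory; the file name `project_data.zip` is a hypothetical placeholder, so replace it with the name of the file you actually downloaded.

```python
from pathlib import Path
import zipfile

# Hypothetical file name: replace with the name of the zip file you downloaded
zip_path = Path("project_data.zip")
data_dir = Path("data")

# Create the data/ folder if it does not already exist
data_dir.mkdir(exist_ok=True)

if zip_path.exists():
    # Extract every file in the archive into data/
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    print(sorted(p.name for p in data_dir.iterdir()))
else:
    print(f"Could not find {zip_path}; check the file name and location.")
```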

## Python Walkthrough 6.1

**Importing data into Python and creating tables and charts**

Before opening an Excel or CSV file using Python, you can open the file in spreadsheet software (such as Excel) to understand how it's structured. From looking at the file, we learn that:

* the variable names are in the first row (no need to use the `skiprows` keyword argument)
* missing values are represented by empty cells
* the last variable is in Column S, with short variable descriptions in Column U: it is easier to import everything first and remove the unnecessary data afterwards.
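
To see what the first two points mean in practice, here is a small illustration using made-up CSV text (the column names and values are invented and are not from the project data). With its default settings, `pd.read_csv` takes the first row as the variable names and reads empty cells as `NaN`.

```python
import io
import pandas as pd

# Made-up CSV text: names in the first row, and an empty cell in the second data row
csv_text = "firm,score,notes\nA,3.5,first\nB,,second\n"

example = pd.read_csv(io.StringIO(csv_text))

# The empty cell is read as NaN, so isna() flags it
print(example)
print(example["score"].isna().sum())
```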

We will call our imported data `df`.

In [None]:
df = pd.read_csv(Path("data/AMP_graph_manufacturing.csv"))
df.info()

You can see that the penultimate column, Column T in the spreadsheet, was imported as `"Unnamed: 19"` and contains only `NaN`s. The final column imported into **pandas** comes from Column U in the spreadsheet, is named `"storage display value"`, and contains information about the variables.
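
Rather than reading through the output of `df.info()` by eye, you could also ask **pandas** which columns are entirely empty. The sketch below uses a small made-up dataframe standing in for the imported file (the values are invented); the same `isna().all()` pattern should work on `df`.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the imported file
example = pd.DataFrame(
    {
        "management": [2.5, 3.1, np.nan],
        "Unnamed: 19": [np.nan, np.nan, np.nan],
        "storage display value": ["description a", "description b", np.nan],
    }
)

# isna().all() is True for columns in which every value is missing
empty_cols = example.columns[example.isna().all()].tolist()
print(empty_cols)
```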

Let's extract the information about the variables into a new **pandas** series called `man_varinfo` and then remove both of these columns from the dataset. To make `man_varinfo` easier to read, we'll temporarily override the **pandas** column width limit.

In [None]:
man_varinfo = df.iloc[:, -1].dropna()

with pd.option_context('display.max_colwidth', 80):
    print(man_varinfo)

And now to drop the last two columns:

In [None]:
df = df.iloc[:, :-2]
df.head()

A few of the variables that have been imported as numbers are actually categorical variables: these are `"mne_f"`, `"mne_d"`, and `"competition2004"`. **pandas** doesn't automatically know what datatypes different variables should have. However, we can set the type of these variables to categorical and use labels to define what each number in the variables represents.
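
As a small illustration of the idea on made-up data, the sketch below converts a column of 0/1 codes to a categorical type and then attaches a label to each code. The values are invented, and this is just one possible approach, using `astype("category")` followed by `cat.rename_categories`.

```python
import pandas as pd

# Made-up data: 0/1 codes standing in for a variable such as "mne_f"
example = pd.DataFrame({"mne_f": [0, 1, 1, 0]})
labels = ["No MNE_f", "MNE_f"]

# Convert to a categorical type, then replace each code with its label
example["mne_f"] = example["mne_f"].astype("category")
example["mne_f"] = example["mne_f"].cat.rename_categories(dict(zip([0, 1], labels)))

print(example["mne_f"].value_counts())
```

Mapping codes to labels with a dictionary makes the correspondence explicit, so the labels are not attached in the wrong order.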



In [None]:
lab_mne_f = ["No MNE_f", "MNE_f"]
lab_mne_d = ["No MNE_d", "MNE_d"]
lab_comp2004 = man_varinfo.iloc[16].split("  ")[-1].split(",")
print(lab_comp2004)