In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("final-data-task.ipynb")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("seaborn-muted")
import statsmodels.api as sm

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

pd.set_option('display.max_columns', None)

# Econ 148 Final Data Task: Labor Compensation

## Introduction

Can local markets affect the labor compensation of public sector workers? In Rebecca Diamond's paper "Housing Supply Elasticity and Rent Extraction by State and Local Governments", the author studies how the slope of a city's housing supply curve could influence the city government's taxation and expenditure decisions. The paper presents evidence that the slope of a city's housing supply curve, which measures the responsiveness of housing supply to changes in demand, plays a significant role in determining the government's ability to extract rent from its citizens. If you're curious about the motivation behind the data analysis below, see [here](http://www.stanford.edu/~diamondr/gov_housing_supply.pdf). Note that the analysis below is NOT the same as that found in the paper. The estimates you find will NOT match. 

In this data task, we wish to study how the wages of workers employed by local governments differ from the wages paid to similarly qualified workers employed in the private sector. The data you will use to measure workers' wages comes from the Current Population Survey Merged Outgoing Rotation Groups (CPS-MORG). This is a nationwide survey of US households conducted by the Bureau of Labor Statistics which asks a variety of questions, including information about wage earnings, employment, and many demographics (age, race, ...). This is the same dataset as in Project 2, but with different variables and for a different time range. 

## Datasets
In the files provided, `morg96_sample.csv` is a dataset containing the CPS-MORG data from the year 1996. The full documentation of the data can be found at: http://www.nber.org/morg/docs/cpsx.pdf. (This should look familiar!) The documentation describes in detail all the variables in the data and exactly how each one is defined and coded. You will need to reference it in order to understand the data in `morg96_sample.csv` and to do the coding challenge below. We are using a sample here because of the memory restrictions on DataHub. 

The other dataset we will be using is `housing_supply.dta`. For each metropolitan area (`msafips`), it records a z-scored measure of land unavailability (`unaval`), which we will use as a proxy for housing market elasticity. A higher value (z-score) of land unavailability implies that the housing market is less elastic, since there is not much land on which to build new housing. 

**A Note on Collaboration:**  
No collaboration is allowed for this final data task. Any collaboration on this assignment is an act of academic dishonesty and will result in an F for the class and a report sent to the [Center for Student Conduct](https://conduct.berkeley.edu/). However, you are allowed to refer to resources including the course materials in our class and documentation and tutorials found online. 

**A Note on Grading:**  
In this data task, the autograder only provides sanity checks. You will need to double check your answers even if you pass the tests. 

---
## Part 0: Honor Code

As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the Honor Code. 

**Signature**: _Type in your name here_

---
## Part 1: Data Cleaning

**Question 1.1:** Read in the dataset `data/morg96_sample.csv` and store it as `morg`. 

_Points:_ 1

In [None]:
morg = ...
morg.head()

In [None]:
grader.check("q1_1")

**Question 1.2:** We are only interested in workers ages 25 to 55 (inclusive) who reported working at least 35 hours during their last week of work. Do not include any worker who reports being self-employed. Filter `morg` based on these conditions.  

_Points:_ 2
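
As a general reminder of how several row conditions can be combined in pandas, here is a minimal sketch on a toy table. The column names (`years`, `hours`, `status`) and the values are made up for illustration and are not the CPS-MORG variables; you still need to look up the real variable names and codes in the documentation.

```python
import pandas as pd

# Toy data with hypothetical columns (not the CPS-MORG variables).
toy = pd.DataFrame({
    "years": [22, 25, 40, 55, 60],
    "hours": [40, 35, 20, 45, 50],
    "status": ["employee", "employee", "employee", "owner", "employee"],
})

# Combine conditions with & and wrap each comparison in parentheses.
mask = (
    toy["years"].between(25, 55)   # between() includes both endpoints by default
    & (toy["hours"] >= 35)
    & (toy["status"] != "owner")
)
filtered = toy[mask]
print(filtered)
```

Only rows that satisfy every condition survive the filter. Note that `&` binds more tightly than comparison operators in Python, which is why each comparison needs its own parentheses.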

In [None]:
...
morg.head()

In [None]:
grader.check("q1_2")

**Question 1.3:** We want to include education as an explanatory variable. However, the only variable in our dataset that is related to education is `grade92`, which reports the highest grade completed by an individual. Using the dictionary `grade2eduyear` provided below, make a new column `eduyear` that contains estimates of the workers' years of schooling completed. 

Hint: You did this in Lab 7 and Lab 8. 

_Points:_ 1
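
One common way to recode a column with a dictionary is `Series.map`. The sketch below uses a made-up dictionary and a made-up column name (`code`), so treat it as an illustration of the technique rather than the expected solution.

```python
import pandas as pd

# Hypothetical lookup table and column, for illustration only.
code_to_value = {1: 0.5, 2: 1.5, 3: 3.0}
toy = pd.DataFrame({"code": [1, 3, 2, 9]})

# map() looks up each entry in the dictionary; keys that are missing
# from the dictionary (here, 9) become NaN.
toy["value"] = toy["code"].map(code_to_value)
print(toy)
```

Because unmatched codes turn into NaN, it is worth checking for missing values in the new column after mapping.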

In [None]:
# grade92 -> eduyear
grade2eduyear = {
    31: 0,  32: 2.5, 33: 5.5, 34: 7.5, 35: 9,  36: 10, 37: 11, 38: 12, 
    39: 12, 40: 13,  41: 14,  42: 14,  43: 16, 44: 18, 45: 20, 46: 22
}
...
morg.head()

In [None]:
grader.check("q1_3")

**Question 1.4:** Since we are concerned about the difference between public sector workers and private sector workers, let's make a new column `workertype`. `workertype` takes on the following values:

| Type | `workertype` |
| ----------- | ----------- |
| federal government | "federal" |
| state government | "state" |
| local government | "local" |
| private sector | "private" |

Then drop all rows where `workertype` is not one of the values above or is NaN. 

Hint: `class94` may be useful. One way to do this is similar to question 1.3. 

_Points:_ 2

In [None]:
...
morg["workertype"] = ...
morg = morg[morg["workertype"].isin([...])]
morg = morg.dropna(subset=["workertype"]) # drop rows if workertype is NaN
morg["workertype"] = morg["workertype"].astype(str) # convert workertype to str
morg.head()

In [None]:
grader.check("q1_4")

---
## Part 2: EDA

**Question 2.1:** Report the mean, standard deviation, min value, and max value of workers' usual weekly earnings. Report the summary statistics separately for private sector workers, local government workers, state government workers, and federal government workers. 

The table should look like this:

| workertype | mean | std | min | max |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| private | 605.993831 | 373.231066 | 0.0 | 1923.0 |
|  |  |  |  | (rows omitted) |

Hint: [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) may be helpful as an aggregate function in `groupby`, but you may also hardcode the results. 

_Points:_ 2
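
For reference, a grouped summary table of this shape can be built by passing a list of aggregate names to `agg` after a `groupby`. This is a sketch on toy data; the column names `group` and `pay` are hypothetical.

```python
import pandas as pd

# Toy data with hypothetical columns.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "pay": [100.0, 300.0, 50.0, 150.0, 250.0],
})

# One row per group, one column per requested statistic.
stats = toy.groupby("group")["pay"].agg(["mean", "std", "min", "max"])
print(stats)
```

Calling `describe` on the grouped column returns the same statistics along with a few extra columns (count and quartiles) that would then need to be dropped.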

In [None]:
summary_stats = ...
summary_stats

In [None]:
grader.check("q2_1")

<!-- BEGIN QUESTION -->

**Question 2.2:** Now we want to explore the relationship between years of education and weekly earnings (for all individuals). Make a scatterplot. Label your plot properly. This question will be graded manually. 

_Points:_ 1

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.3:** We also want to compare the level of education for private sector workers, local government workers, state government workers, and federal government workers. Replicate [this data visualization](https://www.econ148.org/sp23/resources/final/q2_3.png). This question will be graded manually.

Note: A fully correct answer should replicate everything about the provided visualization except the figure size and the color scheme. 

_Points:_ 2

In [None]:
...

<!-- END QUESTION -->

**Question 2.4:** Let's try using a classification model to predict whether a worker is a private sector worker, a local government worker, a state government worker, or a federal government worker based on the demographic variables. This is the most time-consuming question in the entire data task, so feel free to work on other parts first and then return to this question later. 

Relevant demographic variables: `grade92`, `age`, `marital`, `race`, `ethnic`, `penatvty`, `prcitshp`. 

**Use a classification model of your choice that is available in sklearn and classify `workertype` (private, local, state, or federal) using the demographic variables listed above. We will test if your classification model can achieve accuracy greater than 50\%.** 

Note: There's no need to convert `workertype` column to numeric values since it is the label. 

Hint: Use `.values` when assigning x and y if you get a warning that looks like `UserWarning: X does not have valid feature names, but ... was fitted with feature names`. For example, `df["x"].values` or `df[["a", "b"]].values`. 

_Points:_ 3
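
If the overall workflow is unfamiliar, the sketch below walks through split, scale, fit, and score on synthetic data. Everything in it is an assumption made for illustration: the features (`age`, `years_school`), the two made-up labels, and the choice of logistic regression, which is only one of many sklearn classifiers and not necessarily the best fit for this question. It uses `.to_numpy()`, which plays the same role as `.values` in the hint above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical features (not the CPS-MORG variables).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(25, 56, size=n),
    "years_school": rng.integers(8, 21, size=n),
})
# Made-up label that depends on one feature, so there is a pattern to learn.
y = np.where(toy["years_school"].to_numpy() > 13, "group_a", "group_b")

# Plain NumPy arrays avoid the feature-name warning mentioned in the hint.
x = toy[["age", "years_school"]].to_numpy()
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(xtrain)
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(xtrain), ytrain)

ypred = model.predict(scaler.transform(xtest))
test_accuracy = accuracy_score(ytest, ypred)
print(f"Testing accuracy: {test_accuracy}")
```

Fitting the scaler on the training split alone keeps information from the test rows out of the preprocessing step.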

In [None]:
morg_class = morg.copy().dropna(subset=["workertype", "grade92", "age", "marital", "race", "ethnic", "penatvty", "prcitshp"])

# preprocessing
...

# train-test split
x = ... # see hint
y = ... # see hint
xtrain, xtest, ytrain, ytest = ...

# import sklearn model of your choice
...

# fit the model
model = ...
...

# make prediction on `xtest`
ypred = ...

# get training and testing accuracies
train_accuracy = model.score(xtrain, ytrain)
test_accuracy = model.score(xtest, ytest)
print("Classification Report: \n", classification_report(ytest, ypred))
print(f'Training Accuracy: {train_accuracy}\nTesting Accuracy: {test_accuracy}')

In [None]:
grader.check("q2_4")

---
## Part 3: Regression

**Question 3.1:** Now you need to merge in data on the amount of land within an MSA that is unavailable for real-estate development due to the presence of topographical constraints (water, steep grades, swamps, ...). This data is stored in `data/housing_supply.dta`. Inner merge it with `morg` on MSA, and call the merged dataset `morg_housing`. 

_Points:_ 1
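
As a refresher on inner merges, here is a small sketch with two toy tables. The key name `area` and the other columns are hypothetical. For the actual file, note that pandas can read Stata `.dta` files with `pd.read_stata`.

```python
import pandas as pd

# Toy tables with a hypothetical shared key column, "area".
left = pd.DataFrame({"area": [1, 1, 2, 3], "pay": [500, 600, 700, 800]})
right = pd.DataFrame({"area": [1, 2, 4], "score": [0.3, -1.2, 0.8]})

# An inner merge keeps only keys that appear in both tables:
# area 3 (left only) and area 4 (right only) are dropped.
merged = left.merge(right, on="area", how="inner")
print(merged)
```

If the key column has a different name in each table, `left_on=` and `right_on=` can be used in place of `on=`.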

In [None]:
morg_housing = ...
morg_housing

In [None]:
grader.check("q3_1")

**Question 3.2:** Create a variable for ln(weekly earnings) for each worker and name it `lnearnwke`. We now want to analyze the wage gap between local government workers and private sector workers. Run a regression of `lnearnwke` (with a constant term) on a 0/1 binary variable of whether the worker is employed by the local government or not. Do not include federal government workers or state government workers in this regression.

Note: You need to use `missing="drop"` in `sm.OLS` for all regressions in this notebook. 

_Points:_ 3

In [None]:
# generate ln(weekly earnings)
morg_housing["lnearnwke"] = np.log(morg_housing["earnwke"] + 10e-8) # adding 10e-8 to avoid log(0)

q3_2 = ... # filter `morg_housing`

# generate 0/1 binary variable of whether the worker is employed by the local government or not
# name this new column `local`
q3_2["local"] = ...

# run regression
...
model_1 = ...
model_1.summary()

In [None]:
grader.check("q3_2")

Following the analysis done by Rebecca Diamond, you will now add controls for workers' demographics to the regression. Add controls for whether the worker is male, whether the worker is Black, whether the worker is Hispanic, the age of the worker, the square of the age of the worker, and the worker's years of education. 

**Question 3.3:** Generate the required columns and run the regression adding the above variables as controls. Name the new columns `male`, `black`, `hispanic`, `age_sq`. 

Hint: relevant columns are `sex`, `race`, `ethnic`, and `age`. 

_Points:_ 3

In [None]:
# generate relevant variables
q3_3 = q3_2.copy()
...

# run regression
...
model_2 = ...
model_2.summary()

In [None]:
grader.check("q3_3")

Finally, we want to analyze how the local government worker-private sector wage gap differs across metropolitan statistical areas based on the amount of land available for real estate development in that area. 

**Question 3.4:** Run the previous regression with the worker controls, adding a control for the amount of land unavailable for real estate development (`unaval`) and a variable measuring the interaction between `unaval` and the 0/1 binary variable of whether the worker is employed by the local government. 

Note: The interaction term of variables $x_1$ and $x_2$ is their element-wise product: $$ x_1 \times x_2 = x_1 \cdot x_2$$

_Points:_ 2
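
In pandas, an interaction column is just the row-by-row product of two existing columns. A minimal sketch with hypothetical columns `d` (a 0/1 indicator) and `z` (a continuous, z-scored variable):

```python
import pandas as pd

# Toy data with hypothetical columns.
toy = pd.DataFrame({
    "d": [0, 1, 0, 1],
    "z": [-1.0, -0.5, 0.5, 2.0],
})

# The interaction equals z when d is 1 and is zero when d is 0.
toy["z_x_d"] = toy["z"] * toy["d"]
print(toy)
```

In a model of the form $y = \beta_0 + \beta_1 d + \beta_2 z + \beta_3 (z \cdot d) + \varepsilon$, the predicted gap between the $d = 1$ and $d = 0$ groups at a given $z$ is $\beta_1 + \beta_3 z$, so $\beta_3$ measures how that gap changes as $z$ increases by one unit.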

In [None]:
# generate relevant variables
q3_4 = q3_3.copy()
q3_4["unaval_x_local"] = ... # interaction term

# run regression
...
model_3 = ...
model_3.summary()

In [None]:
grader.check("q3_4")

In Rebecca Diamond's paper, she claims that 

> "While the decision to regulate real-estate development and population expansion has many costs and benefits not studied in this paper, decreasing a city's housing supply elasticity through regulation gives the local government more taxation market power. Thus, the rise in land-use regulations since the 1970s may have had an unintended consequence of increasing rent seeking by governments and **leading to overpaid government workers and more corruption**."

Additional context: Imposing more land-use regulations may increase the `unaval` variable in our data. 

**Question 3.5:** Looking at the regression output from question 3.4, does our (overly simplified) analysis support the claim made by Diamond? If so, what is the value of the coefficient that supports your reasoning? Use 90% as the confidence level. 

Assign True or False to `q3_5_1` for your answer to the first part. Assign a number to `q3_5_2` for your answer to the second part. If you don't think our analysis supports the paper's claim, assign 0 to `q3_5_2`. 

_Points:_ 2

In [None]:
q3_5_1 = ...
q3_5_2 = ...

In [None]:
grader.check("q3_5")

---
**Congratulations!** You're done with the final data task!

**NOW SAVE THE NOTEBOOK BEFORE YOU CONTINUE!**

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()