# Problem Set #2: Proxy Variable Regression
## Ec 143, Spring 2026
_Bryan S. Graham_
January 2026 (last updated Jan 2026)

**Due by 5PM on February 25th.** The GSI, Jinglin Yang (jinglin.yang@berkeley.edu), will handle the logistics of problem set collection.    

Working with your classmates on the problem set is actively encouraged, but everyone needs to turn in their own Jupyter Notebook and any other accompanying materials. You must list all study partners on your turned-in problem set. If you used AI to assist you in any way, please briefly describe which tool you used and how you used it.

This problem set provides empirical practice with the proxy variable regression material discussed in lecture, as well as with the Bayesian bootstrap.
#### Code citation:
<br>
Graham, Bryan S. (2023). "Proxy Variable Regression: Python Jupyter Notebook," (Version 1.0) [Computer program]. Available at http://bryangraham.github.io/econometrics/ (Accessed 18 March 2024)

In [1]:
# Load libraries
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
import statsmodels.api as sm

In this problem set you will work with an extract from the [National Longitudinal Survey of Youth](https://www.nlsinfo.org/) (NLSY) 1979 cohort. Specifically, you will explore how the earnings of this cohort around the year 2000 vary with their years of completed schooling and other background variables. This cohort was approximately 40 years old in 2000.

In [2]:
# Directory where nlsy79extract.csv file is located
workdir =  '/Users/bgraham/Dropbox/Teaching/Berkeley_Courses/Ec143/Ec143_Spring2026/Data/'

In [3]:
# Read in NLSY79 Extract as a pandas dataframe
nlsy79 = pd.read_csv(workdir+'nlsy79extract.csv') # Reading .csv file as DataFrame

# Hierarchical index: household, then individual; keep indices as columns too
nlsy79 = nlsy79.set_index(['HHID_79','PID_79'], drop=False) # set_index returns a new DataFrame, so assign it
nlsy79.rename(columns = {'AFQT_Adj':'AFQT'}, inplace=True) # Renaming AFQT

# Display the first few rows of the dataframe
nlsy79.head()

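| | PID_79 | HHID_79 | core_sample | sample_wgts | month_born | year_born | live_with_mom_at_14 | live_with_dad_at_14 | single_mom_at_14 | usborn | ... | weeks_worked_2001 | weeks_worked_2003 | weeks_worked_2005 | weeks_worked_2007 | weeks_worked_2009 | weeks_worked_2011 | NORTH_EAST_79 | NORTH_CENTRAL_79 | SOUTH_79 | WEST_79 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 602156.31 | 9 | 58 | 1.0 | 1.0 | 0.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | 2 | 1 | 816100.38 | 1 | 59 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 18.0 | 52.0 | 52.0 | 52.0 | 52.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 3 | 1 | 572996.38 | 8 | 61 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | NaN | 43.0 | 0.0 | NaN | 52.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4 | 3 | 1 | 604567.88 | 8 | 62 | 1.0 | 0.0 | 0.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5 | 5 | 1 | 764753.0 | 7 | 59 | 1.0 | 1.0 | 0.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |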


We will work with a subsample of respondents who belong to the core sample, are male, and have complete cases for all required variables.

In [4]:
# Only retain male NLSY79 respondents belonging to core sample
# (.copy() avoids a SettingWithCopyWarning when the Earnings column is added below)
nlsy79 = nlsy79[(nlsy79.core_sample != 0) & (nlsy79.male != 0)].copy()

# Calculate average earnings across the 1997, 1999, 2001 and 2003 calendar years
# NOTE: This is an average over non-missing earnings values
nlsy79['Earnings'] = nlsy79[["real_earnings_1997", "real_earnings_1999",
                             "real_earnings_2001", "real_earnings_2003"]].mean(axis=1)

# Only retain complete cases of year of birth, earnings, schooling, AFQT and family background
nlsy79 = nlsy79[["PID_79", "HHID_79", "year_born", "usborn", "hispanic", "black",
                 "Earnings", "HGC_Age28", "AFQT",
                 "live_with_mom_at_14", "live_with_dad_at_14", "HGC_FATH79r", "HGC_MOTH79r"]]
nlsy79 = nlsy79.dropna()

# Summary statistics
nlsy79.describe()

| | PID_79 | HHID_79 | year_born | usborn | hispanic | black | Earnings | HGC_Age28 | AFQT | live_with_mom_at_14 | live_with_dad_at_14 | HGC_FATH79r | HGC_MOTH79r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 | 3435.0 |
| mean | 5077.071616 | 5069.208151 | 60.830277 | 0.932751 | 0.184279 | 0.268705 | 53082.003979 | 12.923726 | 43.913141 | 0.94818 | 0.786026 | 10.821252 | 10.9377 |
| std | 3258.080955 | 3252.758283 | 2.190681 | 0.250489 | 0.387768 | 0.44335 | 49435.239705 | 2.409577 | 29.835416 | 0.221695 | 0.410168 | 4.150287 | 3.309359 |
| min | 6.0 | 5.0 | 57.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2392.5 | 2391.5 | 59.0 | 1.0 | 0.0 | 0.0 | 23553.124625 | 12.0 | 16.698 | 1.0 | 1.0 | 8.0 | 10.0 |
| 50% | 4695.0 | 4693.0 | 61.0 | 1.0 | 0.0 | 0.0 | 42477.26775 | 12.0 | 40.937 | 1.0 | 1.0 | 12.0 | 12.0 |
| 75% | 7589.5 | 7571.5 | 63.0 | 1.0 | 0.0 | 1.0 | 66368.8665 | 14.0 | 69.5415 | 1.0 | 1.0 | 12.0 | 12.0 |
| max | 12517.0 | 12514.0 | 64.0 | 1.0 | 1.0 | 1.0 | 315153.72 | 20.0 | 100.0 | 1.0 | 1.0 | 20.0 | 20.0 |


### Data Analysis
1. Compute the least squares fit of log(Earnings) onto a constant and HGC_Age28. You may use Python's StatsModels OLS implementation for computation and standard error construction (use the cov_type='HC3' option for heteroscedasticity-robust standard errors).
2. Compute the least squares fit of log(Earnings) onto a constant, HGC_Age28, usborn, hispanic, black, AFQT, live_with_mom_at_14, live_with_dad_at_14, HGC_FATH79r, and HGC_MOTH79r.
3. Compute the least squares fit of HGC_Age28 onto a constant, usborn, hispanic, black, AFQT, live_with_mom_at_14, live_with_dad_at_14, HGC_FATH79r, and HGC_MOTH79r.
4. Using your regression output from questions 2 and 3 only, (re-)compute the coefficient on HGC_Age28 in question 1.
5. Compute the fitted residuals, say _V_hat_, associated with the auxiliary regression calculated in question 3.
6. Compute the least squares fit of log(Earnings) onto the residuals computed in question 5. 
7. Provide a narrative summary of your analysis referencing the results you developed in the pencil and paper portion of the problem set. **[5-8 paragraphs]**.
8. Use the Bayesian Bootstrap to simulate draws from the posterior distribution for the coefficient on HGC_Age28 in the long regression calculated in question 2. Using at least 1000 posterior draws, summarize this posterior distribution with a histogram. Compute the posterior mean and standard deviation. How do these quantities compare to the least squares point estimate and standard error computed in question 2?
