# In-Class Assignment: The Regression Analyst 📈

**Topic:** Statsmodels, OLS, and Robust Inference  
**Objective:** Estimate a wage equation, interpret the results, and correct for heteroskedasticity.

---

### Context
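You are analyzing a dataset to estimate the returns to education. However, you suspect that the variance of wages increases with education (heteroskedasticity). The OLS coefficient estimates remain unbiased in that case, but the default standard errors, and the t-statistics and p-values built on them, are no longer reliable.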

### Part 0: Data Generation
Run the cell below to generate the dataset. Notice how the "noise" (epsilon) depends on Education.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

np.random.seed(101)
n = 500

# Predictors
education = np.random.randint(10, 20, n)
experience = np.random.randint(0, 30, n)
female = np.random.randint(0, 2, n)

# Heteroskedastic noise: the standard deviation is education/2, so the variance grows with education
noise = np.random.normal(0, education/2, n)

# Outcome Variable (Wage)
wage = 10 + 2.5 * education + 0.5 * experience - 1.5 * female + noise

df = pd.DataFrame({
    'wage': wage,
    'education': education,
    'experience': experience,
    'female': female
})

print(df.head())
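
Before fitting anything, it can help to confirm that the noise really does spread out as education rises. The sketch below rebuilds the Part 0 draws so that it runs on its own, then compares the sample standard deviation of the noise across education levels. The names `spread`, `low_sd`, and `high_sd` are illustrative and not part of the assignment.

```python
import numpy as np
import pandas as pd

# Rebuild the Part 0 draws (same seed, same order) so this check is self-contained
np.random.seed(101)
n = 500
education = np.random.randint(10, 20, n)
experience = np.random.randint(0, 30, n)
female = np.random.randint(0, 2, n)
noise = np.random.normal(0, education / 2, n)

# Sample standard deviation of the noise at each education level
spread = pd.Series(noise).groupby(education).std()
print(spread.round(2))

# Coarser comparison: fewer than 15 years vs. 15 years or more
low_sd = noise[education < 15].std()
high_sd = noise[education >= 15].std()
print(f"SD of noise (education < 15):  {low_sd:.2f}")
print(f"SD of noise (education >= 15): {high_sd:.2f}")
```

With real data the noise is not observed, so the same idea would be applied to the OLS residuals instead.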

### Part 1: Basic OLS

**Task:**
1.  Use `smf.ols` to define the model: `wage ~ education + experience + female`.
2.  Fit the model.
3.  Print the summary.
4.  **Reflection:** Is the coefficient for `female` statistically significant at the 5% level? (Look at $P>|t|$).

In [None]:
# Define and fit OLS model
model = smf.ols("wage ~ education + experience + female", data=df)
results = model.fit()

# Print summary
print(results.summary())
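
For the reflection question, the p-value can also be read directly from the results object instead of the printed table. This is a minimal sketch that regenerates the Part 0 data so it runs standalone; `p_female` and `significant` are illustrative names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate the Part 0 data (same seed, same order of draws)
np.random.seed(101)
n = 500
education = np.random.randint(10, 20, n)
experience = np.random.randint(0, 30, n)
female = np.random.randint(0, 2, n)
noise = np.random.normal(0, education / 2, n)
wage = 10 + 2.5 * education + 0.5 * experience - 1.5 * female + noise
df = pd.DataFrame({"wage": wage, "education": education,
                   "experience": experience, "female": female})

results = smf.ols("wage ~ education + experience + female", data=df).fit()

# p-value for the coefficient on `female` (the P>|t| column of the summary)
p_female = results.pvalues["female"]
significant = bool(p_female < 0.05)
print(f"p-value for female: {p_female:.4f}")
print(f"Significant at the 5% level: {significant}")

# The matching 95% confidence interval; it excludes 0 exactly when p < 0.05
print(results.conf_int(alpha=0.05).loc["female"])
```

Keep in mind that these p-values rest on the default (non-robust) standard errors.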

### Part 2: Robust Standard Errors

Because we generated heteroskedastic data (the noise depends on education), the default OLS standard errors, which assume constant error variance, may be unreliable.
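
For reference, the two variance estimators can be written side by side. The notation is an assumption introduced here: $X$ is the $n \times k$ design matrix with rows $x_i^\top$, $\hat{e}_i$ are the OLS residuals, and $k$ is the number of estimated coefficients (4 in this model, counting the intercept).

$$
\widehat{\operatorname{Var}}_{\text{OLS}}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1},
\qquad
\hat{\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^{n} \hat{e}_i^{2}
$$

$$
\widehat{\operatorname{Var}}_{\text{HC1}}(\hat{\beta}) = \frac{n}{n-k}\,(X^\top X)^{-1}\left(\sum_{i=1}^{n} \hat{e}_i^{2}\, x_i x_i^\top\right)(X^\top X)^{-1}
$$

The first formula uses a single variance $\hat{\sigma}^2$ for every observation, while the second lets each observation contribute its own squared residual. The coefficient estimates themselves are the same in both cases.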

**Task:**
1.  Use the `get_robustcov_results()` method on your results object.
2.  Use `cov_type='HC1'` (Heteroskedasticity Consistent).
3.  Print the new summary.
4.  Compare the standard error for `education` here vs in Part 1. Did it get larger or smaller?

In [None]:
# Get robust results
robust_results = results.get_robustcov_results(cov_type='HC1')

print(robust_results.summary())
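
To make the comparison in step 4 easier, the two sets of standard errors can be placed in one table. This is a sketch that regenerates the Part 0 data so it runs standalone; `comparison` is an illustrative name. The standard errors are converted with `np.asarray` because the robust results object may return plain arrays rather than labeled Series.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate the Part 0 data (same seed, same order of draws)
np.random.seed(101)
n = 500
education = np.random.randint(10, 20, n)
experience = np.random.randint(0, 30, n)
female = np.random.randint(0, 2, n)
noise = np.random.normal(0, education / 2, n)
wage = 10 + 2.5 * education + 0.5 * experience - 1.5 * female + noise
df = pd.DataFrame({"wage": wage, "education": education,
                   "experience": experience, "female": female})

results = smf.ols("wage ~ education + experience + female", data=df).fit()
robust_results = results.get_robustcov_results(cov_type="HC1")

# Side-by-side standard errors, labeled with the model's coefficient names
comparison = pd.DataFrame(
    {
        "ols_se": np.asarray(results.bse),
        "hc1_se": np.asarray(robust_results.bse),
    },
    index=results.model.exog_names,
)
comparison["ratio"] = comparison["hc1_se"] / comparison["ols_se"]
print(comparison.round(4))
```

A ratio above 1 means the robust standard error is larger than the default one for that coefficient. The point estimates are identical in both summaries; only the standard errors and the inference based on them change.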

### Part 3: Instrumental Variables (Bonus/Optional)

Suppose `education` is endogenous (correlated with unobserved ability). You have an instrument, distance to college, which the cell below simulates as the column `distance`.

**Task:**
1.  Import `IV2SLS` from `linearmodels`.
2.  Define the dependent variable (`wage`) and the exogenous variables (`const`, `experience`, `female`).
3.  Define the endogenous variable (`education`) and the instrument (`distance`).
4.  Fit the model.

In [None]:
from linearmodels import IV2SLS

# Generate a fake instrument for the sake of the exercise
df['const'] = 1
df['distance'] = np.random.normal(0, 1, n) + df['education'] * 0.5 # Correlated with education

# Your IV code here
