Assignment 1
---
*Empirical Methods in Economics and Business Studies*

*Spring 2025*

*Instructor: 陈志远*

Due: 2025-4-18


**Notes**: 
- This assignment requires you to use both `Python` and `Stata`.
- Please submit your assignment as a Jupyter notebook.
- You are strongly encouraged to use LLMs to help with your coding. 

### 0. Background Information
Here I describe how I generate (or simulate) the data that you will be using in this assignment. 

The data is generated using the following model:
$$
Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \epsilon_i \tag{1}
$$
* $\epsilon_i$ is the error term, $\epsilon_i \sim \mathbb{N}(0, 1)$
* $\beta_0 = 1$ is the constant term
* $X_i$ is a continuous covariate, $X_i \sim \mathbb{N}(1, 2^2)$, and $\beta_2 = 0.5$ is the coefficient of $X_i$
* $D_i$ is a binary treatment variable (1 if treated, 0 if not). Assignment to treatment is determined by the following rule:
$$
D_i = \begin{cases}
1 & \text{if } X_i \geq v_i \\
0 & \text{otherwise}
\end{cases} \tag{2}
$$
where $v_i$ is a random variable, $v_i \sim \mathbb{N}(1.5, 0.5^2)$, and $\beta_1 = 0.2$ is the coefficient of $D_i$.
* $Y_i$ is the dependent variable (outcome variable).
* The sample size is $N = 1000$.

The Python code below generates the data according to the model above.

```python
# Simulate the data set 
import numpy as np 
import pandas as pd 

#define the parameters 
beta0 = 1
beta1 = 0.2
beta2 = 0.5
# generate random variables 
np.random.seed(1000)
n = 1000
epsilon = np.random.normal(0, 1, n)
x = np.random.normal(1, 2, n)
v = np.random.normal(1.5, 0.5, n)  # scale is the standard deviation, so v ~ N(1.5, 0.5^2)
# generate the treatment variable as an array of 0s and 1s
D = (x >= v).astype(float)
# generate the outcome variable
y = beta0 + beta1 * D + beta2 * x + epsilon 
# store the data in a csv file 
data = pd.DataFrame({'y': y, 'D': D, 'x': x})
data.to_csv('estimation_sample.csv', index=False)
```
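
As a quick sanity check on a simulated sample, the share of treated units can be compared with its theoretical value. The sketch below assumes the distributions stated in (1) and (2), with $X_i$ and $v_i$ independent, so that $X_i - v_i \sim \mathbb{N}(-0.5, 2^2 + 0.5^2)$ and $P(D_i = 1) = \Phi(-0.5/\sqrt{4.25})$, roughly 0.40. The helper names `simulate` and `normal_cdf` are hypothetical, and the sketch uses NumPy's `default_rng` generator, so its draws will not match the file produced by the code above.

```python
import math

import numpy as np
import pandas as pd


def simulate(n=1000, seed=1000, beta0=1.0, beta1=0.2, beta2=0.5):
    """Draw one sample from models (1)-(2) (hypothetical helper).

    The second argument of rng.normal is the standard deviation, not the variance.
    """
    # Assumption: Generator API, so the draws differ from np.random.seed(1000).
    rng = np.random.default_rng(seed)
    epsilon = rng.normal(0, 1, n)
    x = rng.normal(1, 2, n)
    v = rng.normal(1.5, 0.5, n)
    D = (x >= v).astype(float)
    y = beta0 + beta1 * D + beta2 * x + epsilon
    return pd.DataFrame({"y": y, "D": D, "x": x})


def normal_cdf(z):
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# P(D = 1) = P(X - v >= 0) with X - v ~ N(1 - 1.5, 2^2 + 0.5^2)
p_treated = normal_cdf((1 - 1.5) / math.sqrt(2**2 + 0.5**2))

sample = simulate()
print(f"theoretical P(D=1):   {p_treated:.3f}")
print(f"sample share treated: {sample['D'].mean():.3f}")
print(f"sample mean of x:     {sample['x'].mean():.3f} (theory: 1)")
```

With $N = 1000$ the sample share should be close to, but not exactly equal to, the theoretical value.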

### Problems 

1. Load the data and provide a table of summary statistics for the variables in the data set.
2. Estimate the following model in different ways:
   $$
Y_i = \alpha_0 + \alpha_1 D_i + \eta_i \tag{3}
   $$
where $\eta_i$ is the error term. 
    * Without using any OLS regression package, compute the OLS coefficient estimates and their heteroskedasticity-robust standard errors directly in `Python`.
    * Use the package `statsmodels` to estimate the model and report the estimates and standard errors.
    * Use the package `pystata` to call `Stata` from `Python` to estimate the model and report the estimates and standard errors. 
3. Estimate the following model in different ways and explain the differences in the coefficient estimates for $D_i$: 
   $$
  Y_i = \alpha_0 + \alpha_1 D_i + \alpha_2 X_i +  \eta_i \tag{4}
   $$
    where $\eta_i$ is the error term. 
    * Without using any OLS regression package, compute the OLS coefficient estimates and their heteroskedasticity-robust standard errors directly in `Python`.
    * Use the package `statsmodels` to estimate the model and report the estimates and standard errors.
    * Use the package `pystata` to call `Stata` from `Python` to estimate the model and report the estimates and standard errors. 
4. **Resampling**: 
    * Randomly draw a sample of size 500 from the data set and calculate the OLS estimates and standard errors of $\alpha_1$ for model (4) using the sample. 
    * Repeat the above step 1000 times and store the estimates and standard errors. Then:
      *  Plot the distribution of your estimates and the distribution of your standard errors, and mark on each plot the corresponding estimate or standard error of $\alpha_1$ obtained in question 3.
      *  Calculate the standard deviation of your estimates and compare it with the standard error of $\alpha_1$ obtained from question 3.
    *  Wrap your code in a function called `ols_resampling` that takes the data set and the number of resamples as arguments.
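
One possible starting point for the hand-coded parts of problems 2 to 4 is sketched below; it is not the only valid approach. It computes $\hat\alpha = (X'X)^{-1}X'Y$ and the sandwich covariance $(X'X)^{-1}\big(\sum_i \hat\eta_i^2 x_i x_i'\big)(X'X)^{-1}$, scaled by $n/(n-k)$. That scaling is the HC1 correction, which is what Stata's `robust` option and `statsmodels` with `cov_type="HC1"` report; other variants (HC0, HC2, HC3) are equally legitimate. The helper `ols_robust`, the extra arguments `sample_size` and `seed`, and the choice to draw subsamples without replacement are assumptions made for this sketch, since the problem statement leaves them open.

```python
import numpy as np
import pandas as pd


def ols_robust(y, X):
    """OLS estimates and HC1 robust standard errors (hypothetical helper).

    X must already contain a column of ones for the intercept.
    """
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: sum_i resid_i^2 * x_i x_i'
    meat = (X * resid[:, None] ** 2).T @ X
    # Assumption: HC1 degrees-of-freedom scaling n / (n - k)
    cov = n / (n - k) * XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))


def ols_resampling(data, n_resamples, sample_size=500, seed=0):
    """Estimate alpha_1 of model (4) on repeated random subsamples.

    Returns a DataFrame with one row per resample.
    Assumption: subsamples are drawn without replacement.
    """
    rng = np.random.default_rng(seed)
    results = np.empty((n_resamples, 2))
    for r in range(n_resamples):
        idx = rng.choice(len(data), size=sample_size, replace=False)
        sub = data.iloc[idx]
        X = np.column_stack(
            [np.ones(sample_size), sub["D"].to_numpy(), sub["x"].to_numpy()]
        )
        beta, se = ols_robust(sub["y"].to_numpy(), X)
        results[r] = beta[1], se[1]  # position 1 is the coefficient on D
    return pd.DataFrame(results, columns=["alpha1_hat", "alpha1_se"])
```

A call such as `ols_resampling(pd.read_csv('estimation_sample.csv'), 1000)` would then return 1000 rows whose columns can be plotted as histograms. For model (3), passing only the constant and `D` to `ols_robust` gives a coefficient on `D` equal to the difference in mean outcomes between treated and untreated units.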