---
title: "QTM 385 - Experimental Methods"
subtitle: Assignment 03
---

# Instructions

This assignment covers the last two lectures of the course. As usual, it consists of 10 questions, each worth one point. You can answer the questions in any format you prefer, but I recommend using Jupyter Notebooks and converting the answers to PDF or HTML, as they are easier to read on Canvas. Please write at least one or two paragraphs for each written question.

If you have any questions about the assignment, feel free to email me at <danilo.freire@emory.edu>.

Good luck!

# Questions 

# 1. Compare and contrast Type I and Type II errors. In causal inference experiments, why might a researcher be more concerned with one type of error over the other?

A Type I error occurs when we reject a null hypothesis that is in fact true (a false positive), while a Type II error occurs when we fail to reject a null hypothesis that is false (a false negative).
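The two error rates can be seen in a small simulation. This is only a sketch under assumed values: the helper `rejection_rate()`, the sample size of 50 per arm, and the effect size of 0.3 are hypothetical choices made for illustration, not numbers from the lectures.

```r
# Simulate many two-arm experiments and record how often a t-test
# rejects the null of no difference at the 5% level.
set.seed(385)

# Hypothetical helper: share of simulated experiments with p < alpha
rejection_rate <- function(true_effect, n_per_arm = 50, sims = 2000, alpha = 0.05) {
  p_values <- replicate(sims, {
    control <- rnorm(n_per_arm, mean = 0, sd = 1)
    treated <- rnorm(n_per_arm, mean = true_effect, sd = 1)
    t.test(treated, control)$p.value
  })
  mean(p_values < alpha)
}

# Null is true (effect = 0): every rejection is a Type I error
type1_rate <- rejection_rate(true_effect = 0)

# Null is false (assumed effect = 0.3): every failure to reject is a Type II error
type2_rate <- 1 - rejection_rate(true_effect = 0.3)

type1_rate  # should sit near alpha = 0.05
type2_rate  # depends on the effect size and the sample size
```

The Type I error rate is fixed by the significance level we choose, while the Type II error rate falls as the sample or the true effect gets larger.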

A false positive is generally the more serious scenario, especially for researchers looking for evidence that a treatment is effective (like a new medicine or policy), and especially if it is something that can affect everyone. According to Wikipedia, "A vaccine is generally considered effective if the estimate is ≥50% with a >30% lower limit of the 95% confidence interval." When COVID vaccines came out, people all over the world took them (and boosters) on the assumption that it was then safe to unmask and stop social distancing. If the evidence for one of those vaccines had been a false positive, so that its true efficacy fell below that threshold, millions of people would have been relying on protection they did not actually have.

However, a false negative can also be dangerous in a medical setting. Telling someone they don't have a life-threatening disease like cancer when they actually do, and not catching the error until it is too late to save them, would be a big concern. Similarly, as in the pregnancy picture in the slides, telling a woman she's not pregnant when she actually is could be extremely problematic, especially if she didn't want the baby and, well, lived in America.

However, these errors aren't always bad. A fun fact I learned in AP Statistics is that meteorologists often overestimate the chance of rain in their forecasts. Suppose the null hypothesis is that it won't rain on Tuesday, and that there is a cutoff percentage above which the app predicts rain. The weather app might tell us it will probably rain (e.g., 60%) when in reality it probably won't (30%). That is technically a Type I error, since we are falsely told there is a meaningful chance of rain. But it is done all the time, because most people would be pleasantly surprised if it didn't rain when they were prepared for it, but upset if it rained when the app said it wouldn't!


# 2. Explain the concept of randomisation inference and outline its advantages over traditional parametric tests, especially in the context of testing the sharp null hypothesis.

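Randomisation inference calculates p-values from the set of all possible random assignments. Under the sharp null hypothesis that the treatment has no effect on any unit, every unit's outcome would be the same under any assignment, so we can recompute the test statistic for each possible randomisation and see how unusual the observed one is. It applies to experiments of any size, whether a handful of units or an entire country. Unlike traditional parametric tests, it is non-parametric and does not rely on the assumption of normality or on large-sample approximations. When testing the sharp null hypothesis, randomisation inference therefore gives exact p-values rather than approximate ones.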
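To make the idea concrete, here is a minimal sketch of randomisation inference done by hand in base R. The eight outcomes and the helper `diff_in_means()` are made up for illustration, and the experiment is kept tiny so that every possible assignment can be enumerated.

```r
# Hypothetical small experiment: 8 units, 4 of them treated
outcome <- c(12, 15, 14, 17, 9, 10, 11, 8)
treat   <- c(1, 1, 1, 1, 0, 0, 0, 0)

# Test statistic: difference in means between treated and control
diff_in_means <- function(y, z) mean(y[z == 1]) - mean(y[z == 0])
observed <- diff_in_means(outcome, treat)

# Enumerate all choose(8, 4) = 70 ways to treat 4 of the 8 units
treated_sets <- combn(length(outcome), 4)

# Under the sharp null the outcomes are fixed; only the assignment changes
null_distribution <- apply(treated_sets, 2, function(idx) {
  z <- rep(0, length(outcome))
  z[idx] <- 1
  diff_in_means(outcome, z)
})

# Two-sided exact p-value: share of assignments at least as extreme as observed
p_value <- mean(abs(null_distribution) >= abs(observed))
p_value  # 2 of the 70 assignments are this extreme, so p = 2/70
```

With larger experiments the number of possible assignments explodes, so in practice a random sample of assignments is used instead of the full enumeration.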


# 3. Compare Neyman’s hypothesis testing framework with Fisher’s sharp null hypothesis approach. What are the main advantages and disadvantages of each method in experimental settings?

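Neyman's approach focuses on estimating the average treatment effect and constructing confidence intervals around it. Its null hypothesis is that the average effect is zero, which allows individual effects to differ across units. The advantage is that we get an estimate of how large the effect is, not just a test. The disadvantage is that it relies on probabilities rather than certainty: we have to work with a range of plausible values for the effect across the possible randomisations (as discussed in class, trillions of randomisations can happen), and the standard errors rest on large-sample approximations.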

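Fisher's approach, in comparison, starts from the sharp null hypothesis that the treatment has zero effect on every unit. Its advantage is that it yields exact p-values without distributional assumptions, which makes it attractive in small experiments. Its disadvantage is that the null is essentially all or nothing: rejecting it tells us that the treatment affected at least some units, but not how large the average effect is.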
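The difference between the two null hypotheses can be written compactly. The notation below is the standard potential-outcomes notation and is assumed here rather than copied from the slides: $Y_i(1)$ and $Y_i(0)$ are unit $i$'s outcomes with and without treatment, $N$ is the number of units, $n_1$ and $n_0$ are the group sizes, and $s_1^2$ and $s_0^2$ are the sample variances of the outcome in each group.

$$
H_0^{\text{Fisher}}: \; Y_i(1) = Y_i(0) \quad \text{for every unit } i
$$

$$
H_0^{\text{Neyman}}: \; \frac{1}{N}\sum_{i=1}^{N} \left[ Y_i(1) - Y_i(0) \right] = 0
$$

$$
\widehat{\mathrm{Var}}(\hat{\tau}) = \frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}
$$

Fisher's null implies Neyman's, but not the other way round. The third expression is Neyman's variance estimator for the difference in means, which is conservative (it tends to overstate the true variance) when treatment effects differ across units.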

# 4. Critically evaluate the use of p-values in hypothesis testing. What alternatives are suggested (or implied) in the lectures, and what are the potential benefits of these alternatives?

P-values are used all the time, but they aren't always used or understood correctly. I feel like many people think that the commonly used cutoff of 0.05 means there is a 95% chance the null isn't true. What it actually means is that, if the null is true, we would see a result at least as extreme as the observed one no more than 5% of the time. Additionally, people might try different outcomes, specifications, or subsamples until the p-value drops below the cutoff: the notorious p-hacking! Randomization inference and confidence intervals are viable alternatives. Confidence intervals give a range of plausible values for the effect instead of just "significant" or "not significant", while randomization inference fills in the potential outcomes we can't observe under the sharp null and gives exact p-values that don't depend on distributional assumptions.
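One way to see why p-hacking is a problem is to simulate a researcher who tests many outcomes and reports whichever one is significant. This is a sketch under assumed values: the ten outcomes, the 50 units per arm, and the variable names are hypothetical, and the treatment truly has no effect on anything.

```r
set.seed(385)

n_outcomes <- 10    # hypothetical number of outcomes the researcher tests
n_per_arm  <- 50
sims       <- 1000

any_significant <- replicate(sims, {
  treat <- rep(c(0, 1), each = n_per_arm)
  # Each outcome is pure noise: the treatment affects none of them
  p_values <- replicate(n_outcomes, {
    y <- rnorm(2 * n_per_arm)
    t.test(y[treat == 1], y[treat == 0])$p.value
  })
  # "Success" if at least one outcome crosses the 0.05 cutoff
  min(p_values) < 0.05
})

false_positive_rate <- mean(any_significant)
false_positive_rate  # theory for 10 independent tests: 1 - 0.95^10, about 0.40
```

Even though each individual test keeps its 5% error rate, the chance of finding at least one "significant" result is far higher, which is why a single p-value below 0.05 says little without knowing how many tests were run.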

# 5. The code below simulates a dataset. Modify the code so that it adds a new variable called `treat` with 500 treated individuals and 500 control individuals (complete random assignment). Also include a binary covariate called `gender` (0 = male, 1 = female; with equal probability) and update the outcome (`interviews`) by adding 2 points if the individual is female.



```r
## Set seed for reproducibility
set.seed(385)

# Load packages
# install.packages("fabricatr")
# install.packages("randomizr") # if you haven't installed them yet
library(fabricatr)
library(randomizr)

## Simulate data
data <- fabricate(
  N = 1000,
  interviews = round(rnorm(1000, mean = 10, sd = 2) + 5 * treat, digits = 0)
)
head(data)
```


My version:

```r
set.seed(385)
library(fabricatr)
library(randomizr)

data <- fabricate(
  N = 1000,
  treat = complete_ra(N = 1000, m = 500),  # Complete random assignment: 500 treated, 500 control
  gender = rbinom(1000, 1, 0.5),           # 0 = male, 1 = female
  interviews = round(rnorm(1000, mean = 10, sd = 2) + 5 * treat + 2 * gender, digits = 0)
)
head(data)
```

This adds complete random assignment to treatment and a binary gender variable, and increases `interviews` by 2 for women.

# 6. Using the dataset created in the previous question, estimate the average treatment effect on the outcome `interviews` using the `lm_robust()` function from the `estimatr` package. Interpret the results.

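```r
library(estimatr)

# Difference in means via OLS with heteroskedasticity-robust standard errors
model <- lm_robust(interviews ~ treat, data = data)

summary(model)
```

The coefficient on `treat` is the estimated average treatment effect. Because treatment was randomly assigned, it is an unbiased estimate, and since the data were simulated with `5 * treat`, it should come out close to 5 additional interviews. A positive coefficient with a small p-value and a confidence interval that excludes zero indicates that the treatment increases the number of interviews.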

# 7. Using the same dataset, estimate the average treatment effect of the treatment on the outcome interviews using randomisation inference. Interpret the results.

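```r
library(ri2)

# Declare the design actually used: 500 of the 1,000 units are treated
declaration <- declare_ra(N = 1000, m = 500)

ri_result <- conduct_ri(
  formula = interviews ~ treat,
  assignment = "treat",
  declaration = declaration,
  sharp_hypothesis = 0,
  data = data
)

summary(ri_result)
```

Randomization inference asks how often an estimate at least as large as the observed one would arise across re-randomizations if the sharp null of no effect for any individual were true. With 1,000 units there are far too many possible assignments to enumerate, so `conduct_ri()` works from a random sample of them. A small p-value means an estimate this large would be very unlikely by chance alone, so we reject the sharp null and conclude that the treatment had an effect.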

# 8. Explain how including covariates in an experimental regression model can increase the precision of the treatment effect estimate. Under what conditions might this adjustment lead to biased estimates?

Including covariates in an experimental regression model can improve the precision of the treatment effect estimate by accounting for variability in the outcome that isn't due to the treatment. By controlling for variables that predict the outcome, we reduce the residual variance, which leads to smaller standard errors and narrower confidence intervals for the treatment effect. However, this adjustment can lead to biased estimates if the covariates are post-treatment variables, that is, if they are themselves affected by the treatment, because controlling for them can absorb part of the effect we are trying to measure. Therefore, it's crucial to include only pre-treatment covariates that cannot be influenced by the treatment, such as race or other baseline characteristics.
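The precision gain can be illustrated with a short simulation. This is a sketch rather than part of the assigned analysis: it uses base R's `lm()` instead of `lm_robust()` to stay dependency-free, and the gender effect of 6 is deliberately exaggerated (an assumed value) so that the difference in standard errors is easy to see.

```r
set.seed(385)

n <- 1000
treat  <- sample(rep(c(0, 1), each = n / 2))  # complete random assignment
gender <- rbinom(n, 1, 0.5)                   # pre-treatment covariate

# Assumed data-generating process: true treatment effect of 5
interviews <- 10 + 5 * treat + 6 * gender + rnorm(n, sd = 2)

unadjusted <- lm(interviews ~ treat)
adjusted   <- lm(interviews ~ treat + gender)

# Standard error of the treatment coefficient in each model
se_unadjusted <- summary(unadjusted)$coefficients["treat", "Std. Error"]
se_adjusted   <- summary(adjusted)$coefficients["treat", "Std. Error"]

c(unadjusted = se_unadjusted, adjusted = se_adjusted)
```

Both models target the same treatment effect, but the adjusted one has a smaller standard error because `gender` soaks up outcome variance that would otherwise sit in the residual.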

# 9. Simulate a dataset with heterogeneous treatment effects (e.g., the treatment effect is larger for individuals with higher education). Estimate the treatment effect for different subgroups using an interaction term.


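```r
set.seed(385)

data <- fabricate(
  N = 1000,
  treat = complete_ra(N, m = 500),
  education = rnorm(1000, mean = 12, sd = 2),  # Years of education
  # Treatment effect is 5 + 2 * education, so it grows with education
  interviews = round(rnorm(1000, mean = 10, sd = 2) + 5 * treat + 0.5 * education + 2 * treat * education, 0)
)

# treat * education expands to treat + education + treat:education
model_het <- lm_robust(interviews ~ treat * education, data = data)
summary(model_het)
```

The coefficient on the interaction term (`treat:education`) shows how the treatment effect changes with each additional year of education; a positive coefficient means the effect is larger for more educated individuals. The coefficient on `treat` alone is the effect for someone with zero years of education. In this simulation the true values are 5 for `treat` and 2 for the interaction, so the implied effect at 12 years of education is about 5 + 2 × 12 = 29 interviews.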

# 10. Why is the publication of null results important in experimental research? What are the main challenges in publishing null results, and how can the scientific community address these challenges?

Publishing null results is important because it counteracts publication bias, where only experiments with "meaningful" (statistically significant) results get published. This bias inflates the share of false positives in the literature and leaves us with findings that cannot be replicated. The challenge is that journals would rather publish experiments that found something than ones that "failed", which makes it harder to publish studies that find no effect. Solutions include pre-registration, where researchers publicly register their hypotheses and analysis plan before running the study, and journals that highlight null results. Both help us put more trust in the findings we see.