## PS3 Discussion: Statistical Significance 

### PS 3 Week 6 Discussion - Clara Hu

Today, we're going to reinforce our understanding of uncertainty and statistical significance.
_______

Let's take a look at an experiment where students in a large course were randomly assigned to attend section. 

Our dataset shows a hypothetical scenario in which we can observe *both* the potential outcome under treatment (assigned to attend section) and the potential outcome under control (assigned not to attend section) for every student.

In [None]:
section_experiment <- read.csv("data/section_attd_experiment.csv")
head(section_experiment)

Let's see what the distributions of potential grades under section attendance and section non-attendance look like.

In [None]:
hist(section_experiment$po_control) # histogram of potential outcomes under control (all students)
abline(v = mean(section_experiment$po_control), col = "red", lwd = 3) # mean shown with red line
hist(section_experiment$po_treatment) # histogram of potential outcomes under treatment (all students)
abline(v = mean(section_experiment$po_treatment), col = "red", lwd = 3) # mean shown with red line

Whoo, it seems like attending section makes people get higher grades! But, let's not jump to any conclusions yet.

Let's examine the distribution of the potential outcome under treatment more closely.

**Questions:**

1. Is the distribution of the potential outcome under treatment normal?

2. Is the distribution here a sampling distribution?

Let's get a better idea of the potential outcomes under treatment for all the students in the class.  
Find the standard deviation and the mean of the potential outcome under treatment among all students in the class. 

In [None]:
# Type your code here
## For SD:
sd(...)
## For mean:
mean(...)

Now, in reality, we can't observe both the potential outcome under treatment and the potential outcome under control for the same student. That's why running *randomized* experiments is important. For the following parts, we randomly assign 150 students to attend section and 150 students to not attend section.

By running the cell below, you randomly select 150 people to attend section. The code reports the standard deviation and the mean of the grades of the people in the sample who were assigned to attend section, and the cell after it plots the distribution of those grades.

In [None]:
# Each time you re-run this cell, you will select a different sample!

section_experiment_subset <- section_experiment$po_treatment[sample(x = 1:300, size = 150, replace = FALSE)]

# The SD is:
sd(section_experiment_subset)
# The mean is:
mean(section_experiment_subset)

In [None]:
# The distribution looks like:
hist(section_experiment_subset)
abline(v = mean(section_experiment_subset), col = "red", lwd = 3)

**Questions:**

1. Is the distribution of the grades of people in the sample normal?

2. Is the distribution here a sampling distribution?

By running the code below, you once again randomly select 150 people to attend section. However, rather than doing this once, you are now doing this 1000 times. You save their mean grade in each sample, and then plot the mean grades from the 1000 draws.

In [None]:
# This code repeatedly selects random samples of 150 people to attend section.
# The mean grade (for those attending section) from each of the 1000 draws is saved in rep_draw.
rep_draw <- numeric(1000)
for (i in 1:1000) {
    section_experiment_subset <- section_experiment$po_treatment[sample(x = 1:300, size = 150, replace = FALSE)]
    rep_draw[i] <- mean(section_experiment_subset)
}
hist(rep_draw)
abline(v = mean(rep_draw), col = "red", lwd = 3)
mean(rep_draw)

**Questions:**

1. Is the distribution normal?

2. Is the distribution here a sampling distribution?

Let's find the standard error. Recall that the standard error is the standard deviation of a sampling distribution.
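
As a rough cross-check on the simulated value, sampling theory gives a closed form for the standard error of a sample mean when $n$ units are drawn without replacement from a finite population of $N$ units. The symbols are notation introduced here, not names from the dataset: $S$ is the standard deviation of `po_treatment` across all students (as computed by R's `sd()`), $n = 150$, and $N = 300$.

$$
\mathrm{SE}(\bar{y}) = \sqrt{\frac{S^2}{n}\left(1 - \frac{n}{N}\right)} = \frac{S}{\sqrt{300}} \quad \text{when } n = 150,\ N = 300
$$

With only 1000 simulated draws, the standard deviation of `rep_draw` should be close to this value but will not match it exactly.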

In [None]:
# Find the SE of the distribution below
sd(...)

In the experiment, however, the quantity we care about isn't the mean grade in the treatment group. It is the difference in mean grade between the treatment group (section) and the control group (no section).

The code below runs the hypothetical experiment once, randomly assigning 150 to attend section and 150 not to attend section.
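
In potential-outcomes notation, the observed grade that the `ifelse()` call in the next cell constructs can be written as a single "switching" equation. The symbols are notation introduced here for illustration, not names from the dataset: $D_i = 1$ if student $i$ is assigned to section and $0$ otherwise, $Y_i(1)$ corresponds to `po_treatment`, and $Y_i(0)$ corresponds to `po_control`.

$$
Y_i^{\text{obs}} = D_i \, Y_i(1) + (1 - D_i) \, Y_i(0)
$$

Each student reveals only one of the two potential outcomes, which is why the cell then deletes `po_treatment` and `po_control`.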

In [None]:
section_experiment_draw1 <- section_experiment
# We randomly assign people to the two groups here
section_experiment_draw1$assignment <- sample(c(rep("section", 150), rep("nosection", 150)), replace = FALSE)
# Create a new variable representing what we would observe
section_experiment_draw1$po_observed <- ifelse(section_experiment_draw1$assignment == "section", section_experiment_draw1$po_treatment,
                                        section_experiment_draw1$po_control)

# Get rid of the potential outcome variables that we can't actually observe
section_experiment_draw1$po_treatment <- NULL
section_experiment_draw1$po_control <- NULL

# This is what we would actually observe in reality (TM)
head(section_experiment_draw1)

**Creating Hypotheses**

1. Null Hypothesis: ...
2. Alternative Hypothesis: ...

Let's examine our treatment and control groups more carefully by separating them into subsets.

In [None]:
# Let's make two subsets, one for treatment and one for control
section_exp_section <- subset(section_experiment_draw1, assignment == "section")
section_exp_nosection <- subset(section_experiment_draw1, assignment == "nosection")
# For the treatment group
print("Section group: sample mean and sd")
mean(section_exp_section$po_observed)
sd(section_exp_section$po_observed)


# For the control group
print("No section group: sample mean and sd")
mean(section_exp_nosection$po_observed)
sd(section_exp_nosection$po_observed)


**Treatment Effect**  
In this particular draw, the treatment effect is: 

In [None]:
# Find the treatment effect
treatment_effect <- mean(section_exp_section$po_observed) - mean(section_exp_nosection$po_observed)
treatment_effect

**Is this an estimate, an estimand, or an estimator?**

Let's check to see if our treatment effect is statistically significant!

**Using the `difference_in_means()` function**, find the t-statistic and p-value of the experiment in `section_experiment_draw1`. Assume a 0.05 level of significance. Provide an interpretation of both values.

In [None]:
# Type your code here
library(estimatr)
difference_in_means(...)

**Question:**  
What does the output tell us?   
*Bonus Question:* How should we interpret the confidence interval?

*Type your interpretation here*
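
One way to sanity-check a reported t-statistic is to rebuild it by hand and compare it with base R's Welch two-sample test. The sketch below uses simulated data, not the course dataset: the data frame `sim`, its column `grade`, and the group means and SDs are all hypothetical. It is meant as an illustration of how the t-statistic is assembled, under the assumption that the two groups may have unequal variances; the output of `difference_in_means()` on a simple two-group design should be similar in spirit, but that is an assumption to verify against the `estimatr` documentation.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical data: 150 students per arm; the means and SDs are made up.
sim <- data.frame(
  assignment = rep(c("section", "nosection"), each = 150),
  grade = c(rnorm(150, mean = 75, sd = 10), rnorm(150, mean = 70, sd = 10))
)

treated <- sim$grade[sim$assignment == "section"]
control <- sim$grade[sim$assignment == "nosection"]

# Difference in sample means and its estimated standard error
# (each group's variance divided by its own sample size)
diff_means <- mean(treated) - mean(control)
se_diff <- sqrt(var(treated) / length(treated) + var(control) / length(control))
t_manual <- diff_means / se_diff

# Base R's Welch test on the same data (treated first, then control)
welch <- t.test(treated, control)

t_manual                 # hand-built t-statistic
unname(welch$statistic)  # t-statistic reported by t.test()
welch$p.value            # two-sided p-value
welch$conf.int           # 95% confidence interval for the difference
```

The hand-built statistic and the one from `t.test()` agree because both divide the difference in means by the same unequal-variance standard error.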

**Visualizing our results**

In [None]:
# Run this cell
treatment_means <- as.numeric()
control_means <- as.numeric()

# We run this 1000 times
for (i in 1:1000){
    section_experiment_draw_1000 <- section_experiment
    # We randomly assign people to the two groups here
    section_experiment_draw_1000$assignment <- sample(c(rep("section", 150), rep("nosection", 150)), replace = FALSE)
    # Create a new variable representing what we would observe
    section_experiment_draw_1000$po_observed <- ifelse(section_experiment_draw_1000$assignment == "section", section_experiment_draw_1000$po_treatment,
                                        section_experiment_draw_1000$po_control)
    # Get rid of the potential outcome variables that we can't actually observe
    section_experiment_draw_1000$po_treatment <- NULL
    section_experiment_draw_1000$po_control <- NULL
    section_exp_section_tmp <- subset(section_experiment_draw_1000, assignment == "section")
    section_exp_nosection_tmp <- subset(section_experiment_draw_1000, assignment == "nosection")
    treatment_means[i] <- mean(section_exp_section_tmp$po_observed)
    control_means[i] <- mean(section_exp_nosection_tmp$po_observed)

}

hist(control_means, xlim = c(25, 90), main = "Sampling Distributions", xlab = "Grades", col = "red")
hist(treatment_means, col = "blue", add = TRUE)
legend("top", inset=.02, c("control","treatment"), fill=c("red", "blue"), horiz=TRUE, cex=0.8)