### Analyzing distributions

- So far, we have made bar charts showing the relative rates of COVID in the vaccinated (treatment) and unvaccinated (control) groups.

- It seems like the treatment group is far less likely to get COVID. 

- But how can we be sure? Can we make a mathematical argument that these groups are different? 

We will explore this question throughout the semester. Today we will build some intuition for it.

In [None]:
### Let's start by reviewing code from last time

In [6]:
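import pandas as pd
import numpy as np  # needed by sample() and make_data_frame() below
import altair as alt

# df is the trial data from last time (assumed: one row per person, with group and covid columns)
df3 = pd.DataFrame(df.value_counts()).reset_index()
# older pandas names the counts column 0; pandas 2.0+ already names it "count"
df3 = df3.rename(columns={0: "count"})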

treat = df3[df3["group"] == "treatment"]
control = df3[df3["group"] == "control"] 

treat_chart = alt.Chart(treat).mark_bar().encode(
    x='covid',
    y='count'
)

control_chart = alt.Chart(control).mark_bar().encode(
    x='covid',
    y='count'
)

treat_chart | control_chart

One way to explore whether the groups' distributions are actually different is to simulate data with a known COVID probability for each group and examine what these plots would look like.

In [None]:
def sample(p_covid, N):
    '''
    Return a binary vector of length N where a 1 indicates the person got COVID 
    
    Each person should be randomly assigned to get COVID with probability p_covid
    '''
    return np.random.choice([0, 1], N, p=[1-p_covid, p_covid])
    
def make_data_frame(p_covid_treat, p_covid_control, totalN):
    '''
    Make a data frame like df3 with columns group, covid and count indicating the counts 
    of covid in the treatment and control groups. 
    
    You should assign the counts at random based on the p_covid_treat, p_covid_control
    and totalN parameters. 
    
    Note that each group should have N = totalN/2 people
    
    You will have to use your sample function for this
    '''
    
    # sample from the treatment and control groups
    treatment = sample(p_covid_treat, int(totalN/2))
    control = sample(p_covid_control, int(totalN/2))

    # count up how many people get/don't get covid in the control group
    N_covid_control = np.sum(control) # np.sum([0, 1, 0, 0, 1, 1]) = 3
    N_no_covid_control = control.size - N_covid_control

    # count up how many people get/don't get covid in the treatment group
    N_covid_treatment = np.sum(treatment)  # np.sum([0, 1, 0, 1, 1, 1]) = 4
    N_no_covid_treatment = treatment.size - N_covid_treatment
    
    group = ["treatment", "treatment", "control", "control"]
    covid = [True, False, True, False]
    count = [N_covid_treatment, N_no_covid_treatment, N_covid_control, N_no_covid_control]
    return pd.DataFrame({"group": group, "count": count, "covid": covid})


def make_plot(df):
    '''
    Write a function to make a plot that shows the number of patients 
    who did and did not get covid in the treatment and control groups
    
    You can assume the dataframe comes from make_data_frame
    '''
    treat = df[df["group"] == "treatment"]
    control = df[df["group"] == "control"] 
    
    treat_chart = alt.Chart(treat).mark_bar().encode(
        x='covid',
        y='count'
    ).properties(title="treatment")

    control_chart = alt.Chart(control).mark_bar().encode(
        x='covid',
        y='count'
    ).properties(title="control")

    return treat_chart | control_chart

df = make_data_frame(p_covid_treat=.01, p_covid_control=.2, totalN=50)

make_plot(df)
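Because the data are assigned at random, the counts change every time the cell above is rerun. The sketch below illustrates this without any plotting. It is a minimal, hedged example: `count_cases` and `repeated_counts` are hypothetical helper names that are not part of this notebook, and it uses Python's standard-library `random` module instead of the NumPy-based `sample` function so that it runs on its own. The values `p_covid=0.2` and `n=25` mirror the control group in the call above (`totalN=50` split into two groups).

```python
import random

def count_cases(p_covid, n, rng):
    """Count how many of n simulated people get COVID, each with probability p_covid."""
    return sum(rng.random() < p_covid for _ in range(n))

def repeated_counts(p_covid, n, repeats, seed=0):
    """Run the simulation `repeats` times and return the list of case counts."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is reproducible
    return [count_cases(p_covid, n, rng) for _ in range(repeats)]

# Control group from the example above: 25 people, p_covid = 0.2
counts = repeated_counts(p_covid=0.2, n=25, repeats=5)
print(counts)
```

Changing `seed` gives a different set of counts, which is the same run-to-run variation you will see in the plots.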

1. Add comments to the code above (`sample`, `make_data_frame`, and `make_plot`) explaining what each line does. Export an HTML version of your notebook and turn it in on Canvas.

In [None]:
2. Make 5 plots with `totalN`=100, `p_covid_treat`=.02, and `p_covid_control`=.10.

[Your code here]

In [None]:
3. What do you observe? Do the plots look the same each time? 

In [None]:
[Your answer here]

4. Make 5 plots with `totalN`=1000 (10x more), with `p_covid_treat`=.02 and `p_covid_control`=.10
What do you observe? Do the plots look more similar than when N is 100? Why might that be the case?

In [None]:
[Your answer here]

5. Make 5 plots with `totalN`=1000, `p_covid_treat`=.02, and `p_covid_control`=.15. Notice that the difference between the probability parameters is bigger this time. How would observing such an outcome affect your confidence in the efficacy of the vaccine?

In [None]:
[Your answer here]

6. Make 5 plots with `totalN`=100000, `p_covid_treat`=.02, and `p_covid_control`=.04.
The difference between the two probabilities of getting COVID is very small. Do your plots nonetheless give you confidence that the vaccine lowers the probability of getting COVID?

[Your answer here]

In [None]:
7. What do you think would happen if you repeated the above experiment with `totalN`=10?

8. Can you state any general conclusions about how the size of your dataset, the similarity of your plots, and the difference between $p(\text{covid} \mid \text{treatment})$ and $p(\text{covid} \mid \text{control})$ relate to one another?
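If you want to check your answer to question 8 with numbers rather than by eye, one possible approach is to measure how much the observed COVID rate moves around across repeated simulations at different group sizes. The sketch below is an illustration under stated assumptions, not part of the assignment: `observed_rate` and `spread_of_rates` are hypothetical helpers, the spread is measured simply as the largest observed rate minus the smallest, and the standard-library `random` module stands in for the notebook's `sample` function.

```python
import random

def observed_rate(p_covid, n, rng):
    """Simulate n people and return the observed fraction who got COVID."""
    cases = sum(rng.random() < p_covid for _ in range(n))
    return cases / n

def spread_of_rates(p_covid, n, repeats=20, seed=0):
    """Return max minus min of the observed rate across repeated simulations."""
    rng = random.Random(seed)  # fixed seed (an assumption) so results are reproducible
    rates = [observed_rate(p_covid, n, rng) for _ in range(repeats)]
    return max(rates) - min(rates)

# Group sizes of 50, 500, and 5000 correspond to totalN = 100, 1000, and 10000
for n in (50, 500, 5000):
    print(n, round(spread_of_rates(p_covid=0.10, n=n), 4))
```

Compare the printed spreads with the gap between `p_covid_treat` and `p_covid_control` in each question to judge when two groups could plausibly be told apart.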