### Concepts 

- Small multiples

- Visual communication 

- Adding columns/'features' to a data frame

- Effect size 

- Sample size

In [1]:
import pandas as pd 
import altair as alt

df = pd.read_csv("clinical_trial.csv")

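# Count how often each distinct row occurs and store the counts in a "count" column.
# Naming the column here avoids depending on whether pandas labels the counts 0 or "count".
df3 = df.value_counts().reset_index(name="count")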
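
To see what the counting step produces, here is a minimal sketch on a toy data frame. It assumes the CSV holds one row per patient with `group` and `covid` columns; the real `clinical_trial.csv` may look different, and `toy` and `toy_counts` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical per-patient data (an assumption, not the real clinical_trial.csv)
toy = pd.DataFrame({
    "group": ["treatment", "treatment", "treatment", "control", "control", "control"],
    "covid": [False, False, True, True, True, False],
})

# Count each distinct (group, covid) combination, then flatten into a table
# with columns group, covid and count
toy_counts = toy.value_counts().reset_index(name="count")
print(toy_counts)
```

Each row of the result is one (group, covid) combination together with the number of patients in it, which is the shape the plotting functions in this notebook expect.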

One way to explore if the distributions are actually different is to examine what these plots would look like if the data were drawn from different distributions.

In [83]:
import numpy as np

def sample(p_covid, N):
    '''
    Return a binary vector of length N where a 1 indicates the person got COVID 
    
    Each person should be randomly assigned to get COVID with probability p_covid
    '''
    return np.random.choice([0, 1], N, p=[1-p_covid, p_covid])
    
def make_data_frame(p_covid_treat, p_covid_control, totalN):
    '''
    Make a data frame like df3 with columns group, covid and count indicating the counts 
    of covid in the treatment and control groups. 
    
    You should assign the counts at random based on the p_covid_treat, p_covid_control
    and totalN parameters. 
    
    Note that each group (treatment and control) should get N = totalN/2 people
    
    You will have to use your sample function for this
    '''
    
    # sample from the treatment and control groups
    treatment = sample(p_covid_treat, totalN // 2)
    control = sample(p_covid_control, totalN // 2)

    # count up how many people get/don't get covid in the control group
    N_covid_control = np.sum(control) # np.sum([0, 1, 0, 0, 1, 1]) = 3
    N_no_covid_control = control.size - N_covid_control

    # count up how many people get/don't get covid in the treatment group
    N_covid_treatment = np.sum(treatment)  # np.sum([0, 1, 0, 1, 1, 1]) = 4
    N_no_covid_treatment = treatment.size - N_covid_treatment
    
    group = ["treatment", "treatment", "control", "control"]
    covid = [True, False, True, False]
    count = [N_covid_treatment, N_no_covid_treatment, N_covid_control, N_no_covid_control]
    return pd.DataFrame({"group": group, "count": count, "covid": covid})

def make_plot(df):
    '''
    Write a function to make a plot that shows the number of patients 
    who did and did not get covid in the treatment and control groups
    
    You can assume the dataframe comes from make_data_frame
    '''
    treat = df[df["group"] == "treatment"]
    control = df[df["group"] == "control"] 
    
    treat_chart = alt.Chart(treat).mark_bar().encode(
        x='covid',
        y='count'
    ).properties(title="treatment")

    control_chart = alt.Chart(control).mark_bar().encode(
        x='covid',
        y='count'
    ).properties(title="control")

    return treat_chart | control_chart

def make_plot_2(df):
    '''
    Write a function to make a small bar chart that shows the number of patients 
    who got covid in the treatment and control groups
    
    You can assume the dataframe comes from make_data_frame
    '''
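    # keep only the rows that count covid cases; copy so that adding a column
    # below does not modify (or warn about) a slice of the original data frame
    plot2 = df[df["covid"] == True][["group", "count"]].copy()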
    
    max_ = max(plot2["count"])
    plot2["is_max"] = plot2["count"] == max_
    
    _chart = alt.Chart(plot2).mark_bar().encode(
        x='group',
        y='count',
        color="is_max",
        tooltip="count"
    ).properties(title="# covid", width=20, height=50)
    return _chart
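
Summing N independent 0/1 draws gives a binomially distributed count, so the number of cases in a group can also be drawn in a single step. The sketch below is an alternative way to get the same kind of count, not the method this notebook uses; `draw_case_count` and the seed value are hypothetical and chosen only for illustration.

```python
import numpy as np

def draw_case_count(p_covid, N, rng):
    """Draw the number of COVID cases among N people, each infected with probability p_covid."""
    return int(rng.binomial(N, p_covid))

rng = np.random.default_rng(0)  # seeded only so the example is repeatable
n_cases = draw_case_count(0.1, 50, rng)
print(n_cases, 50 - n_cases)  # cases, non-cases in one simulated group of 50
```

Passing the generator in explicitly makes it possible to repeat a simulation exactly, which can help when comparing runs at different sample sizes.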

In [88]:
all_ = []

for j in range(8):
    df1 = make_data_frame(p_covid_treat=.1, p_covid_control=.11, totalN=100)
    plot1 = make_plot_2(df1)
    all_.append(plot1)

alt.hconcat(*all_)

- What does orange mean in the plots above?

[your answer here]

- What do you notice about which plots are orange?

[your answer here]

In [94]:
all_ = []

for j in range(8):
    df1 = make_data_frame(p_covid_treat=.1, p_covid_control=.11, totalN=10000)
    plot1 = make_plot_2(df1)
    all_.append(plot1)

alt.hconcat(*all_)

- What happens as you increase totalN (the sample size) to 10000 above? 

[your answer here]

- What happens if you vary the effect size (by increasing the distance between p_covid_treat and p_covid_control)?

[your answer here]

### Challenge 1

How small an effect size can you detect if you make N as large as you want? For the purposes of this notebook, we say an effect is "detected" if all 8 experiments show the same result (i.e. more COVID in the control group). We will define things like this more precisely later in the semester. For now, we are just building intuition.


### Challenge 2

How big an effect size do you need in order to detect it with the smallest possible N? For the purposes of this notebook, we say an effect is "detected" if all 8 experiments show the same result (i.e. more COVID in the control group). We will define things like this more precisely later in the semester. For now, we are just building intuition.
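
The "detected" rule used in both challenges can also be checked numerically instead of by eye. The following is a minimal sketch under stated assumptions: `all_experiments_agree` is a hypothetical helper, it draws counts directly from a binomial distribution rather than calling `make_data_frame`, and it treats a tie between the groups as "not detected", which the notebook does not specify.

```python
import numpy as np

def all_experiments_agree(p_covid_treat, p_covid_control, totalN, n_experiments=8, seed=None):
    """Return True if every simulated experiment has more COVID cases in control than in treatment."""
    rng = np.random.default_rng(seed)
    n_per_group = totalN // 2  # half of the participants go to each group
    for _ in range(n_experiments):
        treat_cases = rng.binomial(n_per_group, p_covid_treat)
        control_cases = rng.binomial(n_per_group, p_covid_control)
        # assumption: a tie counts as a disagreement
        if control_cases <= treat_cases:
            return False
    return True

print(all_experiments_agree(0.1, 0.11, totalN=100))
```

Because the result is random, calling the helper several times with the same settings gives a better feel for how reliably an effect shows up than a single call does.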