# Producing Data: Designing Studies

## 1. Experiments with One Explanatory Variable
A local internet service provider (ISP) created two new versions of its software, with alternative ways of implementing a new feature. To find the product that would lead to the highest satisfaction among customers, the ISP conducted an experiment comparing users' preferences for the two new versions versus the existing software.

The ISP ideally wants to find out which of the three software products causes the highest user satisfaction. It has identified three major potential lurking variables that might affect user satisfaction—gender, age, and hours per week of computer use.

In this activity, we will use adults in a hypothetical city as the population of interest to the ISP. We will:

1. create a simple random sample as the basis for the experimental study of the population,
2. use randomization to assign individuals to treatment groups, and
3. verify that randomization prevented the three treatment groups from being different with respect to the most obvious lurking variables.

In [37]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

init_notebook_mode(True)

In [38]:
customers = pd.read_excel('files/computers.xls')
customers.head()

,Age,Gender,Comp
0,46,Female,2
1,76,Female,1
2,51,Female,6
3,62,Female,6
4,24,Female,12


In [39]:
customers.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
Age,20783.0,44.273108,17.070678,18.0,31.0,41.0,55.0,107.0
Comp,20783.0,11.099937,7.533505,0.0,5.0,11.0,17.0,37.0


In [40]:
customers.groupby('Gender').describe().transpose()

,Gender,Female,Male
Age,count,10368.0,10415.0
Age,mean,44.279225,44.267019
Age,std,17.315714,16.824032
Age,min,18.0,18.0
Age,25%,30.0,31.0
Age,50%,41.0,42.0
Age,75%,55.0,55.0
Age,max,107.0,99.0
Comp,count,10368.0,10415.0
Comp,mean,11.153549,11.046567


In [41]:
rand_sample = customers.sample(n=450)
rand_sample.head()

,Age,Gender,Comp
3958,85,Female,0
2362,38,Male,9
19243,41,Female,15
12246,29,Male,8
1616,55,Female,4


In [42]:
rand_sample.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
Age,450.0,44.113333,17.56713,18.0,30.0,41.0,57.0,92.0
Comp,450.0,11.248889,7.688264,0.0,4.25,11.0,18.0,29.0


In [43]:
# Draw a label 1, 2, or 3 for each subject (randint's upper bound, 4, is exclusive)
group = np.random.randint(1, 4, size=len(rand_sample))

In [44]:
rand_sample['group'] = group
rand_sample.head()

,Age,Gender,Comp,group
3958,85,Female,0,1
2362,38,Male,9,1
19243,41,Female,15,2
12246,29,Male,8,2
1616,55,Female,4,2
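
Independent random draws like the ones above give group sizes that vary by chance. If equal group sizes are wanted, one alternative is to shuffle a balanced list of labels. This is a minimal sketch of that alternative, not the method used in this activity; the DataFrame `sample` is a hypothetical stand-in for `rand_sample`, and the seed is only there to make the sketch repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rand_sample: 450 sampled customers.
rng = np.random.default_rng(seed=0)  # seed chosen only for repeatability
sample = pd.DataFrame({
    "Age": rng.integers(18, 90, size=450),
    "Comp": rng.integers(0, 30, size=450),
})

# Repeat the labels 1, 2, 3 until there is one per subject, then shuffle them.
labels = np.tile([1, 2, 3], len(sample) // 3 + 1)[:len(sample)]
sample["group"] = rng.permutation(labels)

# Each group now has the same number of subjects.
print(sample["group"].value_counts().sort_index())
```

Both approaches are random assignments; the shuffled version simply fixes the group sizes in advance.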


We will now examine whether the randomization was successful in making our three treatment groups similar with respect to the variables age, gender, and comp. In other words, we will now examine whether the distributions of these variables in the three groups are similar or not.

##### To compare the distribution of age among the three treatment groups, we'll create side-by-side boxplots of age by treatment.

In [45]:
g1 = go.Box(y=rand_sample[rand_sample.group == 1].Age, name="Group 1")
g2 = go.Box(y=rand_sample[rand_sample.group == 2].Age, name="Group 2")
g3 = go.Box(y=rand_sample[rand_sample.group == 3].Age, name="Group 3")

In [46]:
boxlayout = {
    'xaxis': {'title': 'Group'},
    'yaxis': {'title': 'Age (years)'}
}

In [47]:
iplot(go.Figure(data=[g1, g2, g3], layout=boxlayout))

##### To compare the distribution of gender among the three treatment groups, we'll look at a two-way table of counts (a conditional percent is a count divided by its row total)

In [48]:
pd.crosstab(rand_sample.group, rand_sample.Gender)

group,Female,Male
1,62,85
2,61,83
3,82,77
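
The table above holds counts, and groups of different sizes are easier to compare as row percents. The sketch below rebuilds one row per subject from those counts (they belong to this particular random draw, so a rerun of the notebook would give different numbers) and asks `pd.crosstab` for row-normalized values. The name `demo` is a stand-in introduced only for this illustration.

```python
import pandas as pd

# Counts copied from the two-way table above (one particular random draw).
counts = {(1, "Female"): 62, (1, "Male"): 85,
          (2, "Female"): 61, (2, "Male"): 83,
          (3, "Female"): 82, (3, "Male"): 77}

# Expand the counts back into one (group, Gender) row per subject.
rows = [(g, s) for (g, s), n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["group", "Gender"])

# normalize="index" divides each row by its row total; * 100 gives percents.
pct = pd.crosstab(demo["group"], demo["Gender"], normalize="index") * 100
print(pct.round(1))
```

On the real sample the same call would be `pd.crosstab(rand_sample.group, rand_sample.Gender, normalize='index')`.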


##### To compare the distribution of comp (the hours per week of computer use) among the three treatment groups, we'll create side-by-side boxplots of comp by treatment

In [49]:
g1 = go.Box(y=rand_sample[rand_sample.group == 1].Comp, name="Group 1")
g2 = go.Box(y=rand_sample[rand_sample.group == 2].Comp, name="Group 2")
g3 = go.Box(y=rand_sample[rand_sample.group == 3].Comp, name="Group 3")

boxlayout = {
    'xaxis': {'title': 'Group'},
    'yaxis': {'title': 'Computer Usage (hrs/week)'}
}

iplot(go.Figure(data=[g1, g2, g3], layout=boxlayout))

##### Are the distributions of age, gender, and comp in the three treatment groups similar?

Everyone will get slightly different displays here, because the sample and the group assignment are both random, but they should all look about the same. Based on the side-by-side boxplots, the distributions of age and of hours per week of computer use appear similar across the three treatment groups. Similarly, the two-way table suggests that the distribution of gender is roughly the same in all three treatment groups.
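
The visual comparison can be backed by a numerical one. This is a minimal sketch, assuming a synthetic stand-in `demo` for `rand_sample` (the values are generated, not taken from the customer file), that summarizes each quantitative variable within each treatment group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # seed only so the sketch is repeatable

# Synthetic stand-in for rand_sample with a random 1-3 group label.
demo = pd.DataFrame({
    "Age": rng.integers(18, 90, size=450),
    "Comp": rng.integers(0, 30, size=450),
    "group": rng.integers(1, 4, size=450),
})

# Mean, median, and standard deviation of Age and Comp per group.
summary = demo.groupby("group")[["Age", "Comp"]].agg(["mean", "median", "std"])
print(summary.round(2))
```

If randomization worked as intended, the rows of this summary should be close to one another, with differences no larger than chance would produce.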

## StatTutor Lab: Treating Depression: A Randomized Clinical Trial

**Background:** Clinical depression is a recurrent illness requiring treatment and often hospitalization. Nearly 50% of people who have an episode of major depression will have a recurrence within 2-3 years. Being able to prevent the recurrence of depression in people who are at risk for the disease would go a long way to alleviate the pain and suffering of patients and would also save society many thousands of dollars in medical expenses and lost wages due to an inability to work.

**The Study:** During the 1980s the federal government, through the National Institutes of Health (NIH), sponsored a multi-centered, randomized, controlled clinical trial to evaluate two drugs for preventing the recurrence of depression in patients who had had at least one previous episode of the illness (Prien et al., Archives of General Psychiatry, 1984).

**The Study Design:** The study was multi-centered. There were 5 medical clinics in major metropolitan areas across the country that participated in this trial. Using many clinics enabled the investigators to enroll many more patients into the study and allowed for more diversity in the patients who participated. There were 3 treatment groups. Patients received either Imipramine (Imip), Lithium (Li), or a Placebo (Pl), where Imip and Li are active drugs. Patients were randomly assigned to one of the 3 treatment groups. As in most medical studies where new, unexplored treatments are evaluated, patients volunteered to participate in the study by signing a consent form. Patients were followed for 2-4 years to see whether or not they had a recurrence of depression. If they did not have a recurrence within this time frame, their treatment was considered a Success. If they did have a recurrence, it was considered a Failure. The study was double-blinded. A number of additional background variables were measured for each patient.

In [50]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

init_notebook_mode(True)

### Understand Data

#### Check Data Format

In [51]:
depression = pd.read_excel('files/depression.xls')
depression.head()

,Hospt,Treat,Outcome,Time,AcuteT,Age,Gender
0,1,0,1,36.143002,211,33,1
1,1,1,0,105.142998,176,49,1
2,1,1,0,74.570999,191,50,1
3,1,0,1,49.714001,206,29,2
4,1,0,0,14.429,63,29,1


**Hospt:** Which hospital the patient was from: Labeled 1, 2, 3, 5 or 6  
**Treat:** 0=Lithium; 1=Imipramine; 2=Placebo  
**Outcome:** 0=Success 1=Failure (recurrence of depression)  
**Time:** Number of weeks until a recurrence (if outcome=1) or until study ended (if outcome=0)  
**AcuteT:** How long the patient was depressed before the start of the current study, measured in days  
**Age:** Age in years  
**Gender:** 1=Female 2=Male  
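
The numeric codes listed above are easy to misread in tables and plots. A small sketch of decoding them with `Series.map` follows; the five rows are copied from the `depression.head()` output, and the names `demo`, `labeled`, and the three dictionaries are introduced here only for illustration.

```python
import pandas as pd

# Treat, Outcome, and Gender for the first five rows shown in depression.head().
demo = pd.DataFrame({
    "Treat":   [0, 1, 1, 0, 0],
    "Outcome": [1, 0, 0, 1, 0],
    "Gender":  [1, 1, 1, 2, 1],
})

# Code-to-label dictionaries, following the variable descriptions above.
treat_labels = {0: "Lithium", 1: "Imipramine", 2: "Placebo"}
outcome_labels = {0: "Success", 1: "Failure"}
gender_labels = {1: "Female", 2: "Male"}

# assign() returns a new DataFrame and leaves the coded columns in demo intact.
labeled = demo.assign(
    Treat=demo["Treat"].map(treat_labels),
    Outcome=demo["Outcome"].map(outcome_labels),
    Gender=demo["Gender"].map(gender_labels),
)
print(labeled)
```

Keeping the original codes in place and writing the labels to a copy means the filters used elsewhere in this notebook (for example `depression.Treat == 0`) would still work unchanged.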

Out of the first ten individuals in the datafile, how many had a recurrence of depression during the study?

In [52]:
depression.Outcome[:10].sum()

5

Out of the first ten individuals in the datafile who were assigned to the Lithium treatment group, how many had a recurrence of depression during the study?

In [53]:
depression[depression.Treat == 0].Outcome[:10].sum()

5

How many days was the first male in the datafile depressed before the start of the study?

In [54]:
depression[depression.Gender == 2].iloc[0].AcuteT

206.0

#### Consider Study Design
The next step in understanding the problem is addressing the issues of sampling and study design, which have implications on the generalizability of the results and the type of conclusions you can draw from them.

This study is an **experiment**.

Sampling:
The patients that were recruited to this study all had at least one prior episode of depression and volunteered to participate in the study by signing a consent form.


##### Thought Question
Subjects were randomly assigned to the different treatments. Randomization is supposed to assign approximately equal numbers of subjects to each treatment group as well as "balance" variables that we did not control for (such as AcuteT, Age, Gender) among the 3 treatment groups. Now we'll check whether the randomization in this case was effective in achieving these goals.

_a) Was the randomization effective in assigning an approximately equal number of patients to each treatment group? Explain._

In [55]:
depression.groupby('Treat').Treat.count()

Treat
0    37
1    38
2    34
Name: Treat, dtype: int64

Ans: The three groups contain 37, 38, and 34 patients, so the randomization produced approximately equal group sizes.

_b) Was the randomization successful in balancing other variables such as AcuteT? Answer by comparing the distributions (boxplots) of AcuteT in the three treatment groups._

In [56]:
tt0 = go.Box(x=depression[depression.Treat == 0].AcuteT, name="Lithium")
tt1 = go.Box(x=depression[depression.Treat == 1].AcuteT, name="Imipramine")
tt2 = go.Box(x=depression[depression.Treat == 2].AcuteT, name="Placebo")


iplot(go.Figure(data=[tt0, tt1, tt2]))

print("The side-by-side boxplots reveal that the distributions of the variable AcuteT within each of the three treatment groups are very similar. All have a median of roughly 170, and no unusual or systematic differences in spread. Thus, the randomization was effective in balancing AcuteT among the three treatment groups.")

The side-by-side boxplots reveal that the distributions of the variable AcuteT within each of the three treatment groups are very similar. All have a median of roughly 170, and no unusual or systematic differences in spread. Thus, the randomization was effective in balancing AcuteT among the three treatment groups.


### Question 1:
**Which of the drugs (if either) was more successful in preventing the recurrence of depression relative to the placebo?**

#### Reflect on the Question
Before analyzing the data and discovering which of the drugs was more effective, how do you think the results could be used in practice once they are obtained?

Ans: The results can be used by the FDA (Food and Drug Administration) to make an informed decision about the usage and safety of these medicines.

#### Analyze Data

_Which variable(s) among those listed below is/are particularly relevant to the current question?_
Treat and Outcome

The variable **Treat** is the **explanatory** variable and is **categorical**.  
The variable **Outcome** is the **response** variable and is **categorical**.

A meaningful display is: **Two-way Table** (C -> C)

A meaningful numerical summary to supplement the above display is **Conditional Percentages**

Using this display and numerical summary, I will **examine the relationship between two categorical variables**

In [75]:
table = pd.crosstab(depression.Treat, depression.Outcome, margins=True, margins_name='Total')
table

Treat,Outcome 0,Outcome 1,Total
0,14,23,37
1,27,11,38
2,11,23,34
Total,52,57,109


In [78]:
table.div(table['Total'], axis=0) * 100

Treat,Outcome 0,Outcome 1,Total
0,37.837838,62.162162,100.0
1,71.052632,28.947368,100.0
2,32.352941,67.647059,100.0
Total,47.706422,52.293578,100.0


The percentage of patients that had a recurrence of depression is: 62.16% in the Lithium treatment group, 28.95% in the Imipramine treatment group, and 67.65% in the Placebo group.

**Answer:** The results indicate that Imipramine is far more effective than Lithium in preventing the recurrence of depression. 62.16% of the patients taking Lithium had a recurrence of depression, which is more than twice the rate among patients who took Imipramine (28.95%).

The results also call into question the effectiveness of Lithium in general, since the percentage of patients who had a recurrence of depression in the Placebo group (67.65%) is only slightly higher than that in the Lithium group (62.16%).
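
The two comparisons above ("more than twice" and "only slightly higher") can be checked directly from the counts in the two-way table. The sketch below does that arithmetic; the dictionary names are illustrative only.

```python
# Recurrence counts (Outcome = 1) and group sizes from the two-way table above.
recurrences = {"Lithium": 23, "Imipramine": 11, "Placebo": 23}
group_sizes = {"Lithium": 37, "Imipramine": 38, "Placebo": 34}

# Conditional percent of recurrence within each treatment group.
rates = {drug: 100 * recurrences[drug] / group_sizes[drug] for drug in recurrences}
for drug, rate in rates.items():
    print(f"{drug}: {rate:.2f}% recurrence")

ratio = rates["Lithium"] / rates["Imipramine"]  # relative comparison
gap = rates["Placebo"] - rates["Lithium"]       # difference in percentage points
print(f"Lithium / Imipramine = {ratio:.2f}")
print(f"Placebo - Lithium = {gap:.2f} percentage points")
```

The ratio comes out at about 2.15 and the gap at about 5.5 percentage points. These are descriptive comparisons only; they do not by themselves say whether the differences exceed what chance would produce.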

Imipramine should be considered for treating depression. Lithium should be avoided, as its success rate is only marginally higher than that of the placebo.

### Question 2:
**Which of the drugs (if either) delayed the recurrence of depression longer relative to the placebo?**

#### Reflect on the Question
Before analyzing the data and discovering which of the drugs was more effective in delaying the recurrence of depression, try to predict what the data will show.

Ans: Given the results of the first question, we can expect that Imipramine would be the most effective drug in delaying the recurrence of depression.

#### Analyze Data

_Which variable(s) among those listed below is/are particularly relevant to the current question?_
Treat and Time

The variable **Treat** is the **explanatory** variable and is **categorical**.  
The variable **Time** is the **response** variable and is **quantitative**.

A meaningful display is: **Side-by-Side Boxplots**

A meaningful numerical summary to supplement the above display is **descriptive statistics**

Using this display and numerical summary, I will **compare the distribution of one quantitative variable across multiple groups**

In [81]:
tt0 = go.Box(x=depression[depression.Treat == 0].Time, name="Lithium")
tt1 = go.Box(x=depression[depression.Treat == 1].Time, name="Imipramine")
tt2 = go.Box(x=depression[depression.Treat == 2].Time, name="Placebo")

iplot(go.Figure(data=[tt0, tt1, tt2]))

Center: The distribution of Time in the Imipramine group (coded as 1) has a substantially larger median (70.71) than the other two groups, whose medians are not very different (22 and 17.79). In fact, the median of the Imipramine group (70.71) is even larger than the third quartiles of the two other groups (67 and 63.7).

Spread: The distribution of Time in the Imipramine group displays the largest spread (IQR is roughly 80 vs. 62 and 59 in the two other groups). Note that the larger spread in the Imipramine group is in the "positive" direction.
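
The medians, quartiles, and IQRs quoted above are read off the boxplots, but they can also be computed with a grouped quantile. This is a minimal sketch on synthetic data: the values in `demo` are randomly generated and are not the trial results, so the printed numbers will not match those in the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)  # synthetic data, NOT the trial results

# Stand-in for the depression data: 35 patients in each of 3 treatment groups.
demo = pd.DataFrame({
    "Treat": np.repeat([0, 1, 2], 35),
    "Time": rng.uniform(1, 160, size=105),
})

# Quartiles of Time within each treatment group, one row per group.
q = demo.groupby("Treat")["Time"].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ["Q1", "Median", "Q3"]
q["IQR"] = q["Q3"] - q["Q1"]  # spread of the middle half of each group
print(q.round(2))
```

Applied to the real data, the same pattern would be `depression.groupby('Treat').Time.quantile([0.25, 0.5, 0.75])`.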

**Answer:** The results show that overall, Imipramine is more effective in delaying the recurrence of depression compared to Lithium. It should be noted, however, that the large spread in the distribution of the variable "Time" among patients who were treated with Imipramine indicates that being treated with this drug does not guarantee a long delay in the recurrence of depression. In addition, the similarity in the distribution of time until recurrence of depression between the Lithium and Placebo groups suggests that Lithium may not really be an effective treatment for depression.

Imipramine is superior to Lithium as a treatment for depression so doctors should consider prescribing it instead of Lithium.


### Summary

In this exercise we assessed the effectiveness of Lithium and Imipramine as treatments for depression. The two measures of effectiveness were whether or not recurrence of depression occurred, and the time until recurrence of depression. Using both measures we found that Imipramine is more effective than Lithium. More specifically, patients who were treated with Imipramine were less likely to have a recurrence of depression, and if they did, it took longer for recurrence to occur compared to patients who were treated with Lithium.

The results we found also called into question the effectiveness of Lithium, since the patients who were treated with it did not display any substantial differences compared to patients who received the placebo, both in terms of whether a recurrence of depression occurred and in terms of the time until a recurrence.