# Producing Data: Designing Studies

## 1. Experiments with One Explanatory Variable
A local internet service provider (ISP) created two new versions of its software, with alternative ways of implementing a new feature. To find the product that would lead to the highest satisfaction among customers, the ISP conducted an experiment comparing users' preferences for the two new versions versus the existing software.

Ideally, the ISP wants to determine which of the three software products causes the highest user satisfaction. It has identified three potential lurking variables that might affect user satisfaction: gender, age, and hours per week of computer use.

In this activity, we will use adults in a hypothetical city as the population of interest to the ISP. We will:

1. create a simple random sample as the basis for the experimental study of the population,
2. use randomization to assign individuals to treatment groups, and
3. verify that randomization prevented the three treatment groups from being different with respect to the most obvious lurking variables.

In [14]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

# Load plotly.js from the CDN so figures render inline in the notebook
init_notebook_mode(connected=True)

In [2]:
customers = pd.read_excel('files/computers.xls')
customers.head()

| | Age | Gender | Comp |
|---|---|---|---|
| 0 | 46 | Female | 2 |
| 1 | 76 | Female | 1 |
| 2 | 51 | Female | 6 |
| 3 | 62 | Female | 6 |
| 4 | 24 | Female | 12 |


In [9]:
customers.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 20783.0 | 44.273108 | 17.070678 | 18.0 | 31.0 | 41.0 | 55.0 | 107.0 |
| Comp | 20783.0 | 11.099937 | 7.533505 | 0.0 | 5.0 | 11.0 | 17.0 | 37.0 |


In [10]:
customers.groupby('Gender').describe().transpose()

| Variable | Statistic | Female | Male |
|---|---|---|---|
| Age | count | 10368.0 | 10415.0 |
| Age | mean | 44.279225 | 44.267019 |
| Age | std | 17.315714 | 16.824032 |
| Age | min | 18.0 | 18.0 |
| Age | 25% | 30.0 | 31.0 |
| Age | 50% | 41.0 | 42.0 |
| Age | 75% | 55.0 | 55.0 |
| Age | max | 107.0 | 99.0 |
| Comp | count | 10368.0 | 10415.0 |
| Comp | mean | 11.153549 | 11.046567 |


In [8]:
# Simple random sample of 450 customers, drawn without replacement (the pandas default)
rand_sample = customers.sample(n=450)
rand_sample.head()

| | Age | Gender | Comp |
|---|---|---|---|
| 12398 | 22 | Male | 19 |
| 16744 | 58 | Male | 2 |
| 16697 | 56 | Female | 4 |
| 17635 | 47 | Male | 5 |
| 8639 | 18 | Male | 23 |
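The sampling step above gives a different sample on every run. As a minimal sketch, assuming a repeatable draw is wanted, `DataFrame.sample` accepts a `random_state` seed. The `population` frame and the `draw_srs` helper below are hypothetical stand-ins built from synthetic values, not the `customers` data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the customers table (synthetic values, not the real data).
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "Age": rng.integers(18, 100, size=2000),
    "Gender": rng.choice(["Female", "Male"], size=2000),
    "Comp": rng.integers(0, 38, size=2000),
})

def draw_srs(frame, n, seed):
    """Draw a simple random sample of n rows without replacement."""
    return frame.sample(n=n, replace=False, random_state=seed)

sample_a = draw_srs(population, 450, seed=42)
sample_b = draw_srs(population, 450, seed=42)

# The same seed selects the same rows, and no individual is drawn twice.
print(sample_a.index.equals(sample_b.index))
print(sample_a.index.is_unique)
```

Leaving the seed out, as the notebook does, is equally valid; it simply means each reader sees a different sample.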


In [11]:
rand_sample.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 450.0 | 43.824444 | 17.211713 | 18.0 | 30.0 | 41.0 | 55.0 | 99.0 |
| Comp | 450.0 | 10.671111 | 7.17446 | 0.0 | 4.0 | 10.0 | 16.0 | 30.0 |


In [17]:
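# Assign each sampled individual to treatment group 1, 2, or 3 uniformly at random
# (the upper bound 4 is exclusive)
group = np.random.randint(1, 4, size=len(rand_sample))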

In [18]:
rand_sample['group'] = group
rand_sample.head()

| | Age | Gender | Comp | group |
|---|---|---|---|---|
| 12398 | 22 | Male | 19 | 1 |
| 16744 | 58 | Male | 2 | 1 |
| 16697 | 56 | Female | 4 | 2 |
| 17635 | 47 | Male | 5 | 1 |
| 8639 | 18 | Male | 23 | 3 |
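Because `np.random.randint` draws each label independently, the three groups are not guaranteed to be the same size. If equal group sizes are wanted, one alternative is to shuffle a balanced list of labels. This is a sketch of that variation, not what the notebook does, and `assign_balanced_groups` is a hypothetical helper name.

```python
import numpy as np

def assign_balanced_groups(n_subjects, n_groups=3, seed=None):
    """Randomly assign subjects to groups whose sizes differ by at most one."""
    rng = np.random.default_rng(seed)
    # Repeat the labels 1..n_groups until there is one per subject, then shuffle.
    labels = np.resize(np.arange(1, n_groups + 1), n_subjects)
    return rng.permutation(labels)

groups = assign_balanced_groups(450, n_groups=3, seed=1)
print(np.bincount(groups)[1:])  # subjects in groups 1, 2, 3 -> [150 150 150]
```

Both schemes are valid randomizations; the balanced version only removes the chance variation in group size.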


We will now examine whether randomization succeeded in making the three treatment groups similar with respect to age, gender, and comp (hours per week of computer use); that is, whether the distributions of these variables are similar across the groups.

##### To compare the distribution of age among the three treatment groups, we'll create side-by-side boxplots of age by treatment.

In [27]:
g1 = go.Box(y=rand_sample[rand_sample.group == 1].Age, name="Group 1")
g2 = go.Box(y=rand_sample[rand_sample.group == 2].Age, name="Group 2")
g3 = go.Box(y=rand_sample[rand_sample.group == 3].Age, name="Group 3")

In [28]:
boxlayout = {
    'xaxis': {'title': 'Group'},
    'yaxis': {'title': 'Age (years)'}
}

In [29]:
iplot(go.Figure(data=[g1, g2, g3], layout=boxlayout))

##### To compare the distribution of gender among the three treatment groups, we'll look at a two-way table of counts, from which the conditional percents within each group can be read off.

In [30]:
pd.crosstab(rand_sample.group, rand_sample.Gender)

| group | Female | Male |
|---|---|---|
| 1 | 76 | 72 |
| 2 | 72 | 66 |
| 3 | 82 | 82 |
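The table above holds counts. To turn them into conditional percents within each group, `pd.crosstab` can normalize each row by its total. The sketch below rebuilds one row per subject from the counts shown, so it runs without the original file; the `assigned` frame is a stand-in for `rand_sample`.

```python
import pandas as pd

# Rebuild one row per subject from the counts in the table above.
counts = {(1, "Female"): 76, (1, "Male"): 72,
          (2, "Female"): 72, (2, "Male"): 66,
          (3, "Female"): 82, (3, "Male"): 82}
rows = [(g, s) for (g, s), n in counts.items() for _ in range(n)]
assigned = pd.DataFrame(rows, columns=["group", "Gender"])

# normalize="index" divides each row by its total, giving percents within each group.
percents = pd.crosstab(assigned["group"], assigned["Gender"], normalize="index") * 100
print(percents.round(1))
```

For this run the groups come out at about 51.4%, 52.2%, and 50.0% female. On the real data, the same result should follow from adding `normalize="index"` to the notebook's `pd.crosstab` call.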


##### To compare the distribution of comp (the hours per week of computer use) among the three treatment groups, we'll create side-by-side boxplots of comp by treatment.

In [33]:
g1 = go.Box(y=rand_sample[rand_sample.group == 1].Comp, name="Group 1")
g2 = go.Box(y=rand_sample[rand_sample.group == 2].Comp, name="Group 2")
g3 = go.Box(y=rand_sample[rand_sample.group == 3].Comp, name="Group 3")

boxlayout = {
    'xaxis': {'title': 'Group'},
    'yaxis': {'title': 'Computer Usage (hrs/week)'}
}

iplot(go.Figure(data=[g1, g2, g3], layout=boxlayout))
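The side-by-side boxplots above can be backed up with numbers by computing a five-number summary per group. This is a rough sketch on synthetic data: the `demo` frame is a hypothetical stand-in for `rand_sample`, and plotly's quartile and whisker conventions may differ slightly from the pandas quantiles shown here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for rand_sample (hypothetical values, not the notebook's data).
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "Age": rng.integers(18, 100, size=450),
    "Comp": rng.integers(0, 38, size=450),
    "group": rng.integers(1, 4, size=450),
})

# Minimum, quartiles, median, and maximum of each variable within each group.
# The quartiles and median correspond to the box; the plotted whiskers may stop
# short of the min/max when plotly marks extreme points as outliers.
five_num = (
    demo.groupby("group")[["Age", "Comp"]]
        .quantile([0, 0.25, 0.5, 0.75, 1])
        .unstack()
)
print(five_num["Age"])
print(five_num["Comp"])
```

Similar rows across the three groups tell the same story as boxes that line up in the plots.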

##### Are the distributions of age, gender, and comp in the three treatment groups similar?

Because both the sample and the group assignments are drawn at random, everyone will get slightly different displays here, but they should all "look" about the same. Based on the side-by-side boxplots, the distributions of age and of hours per week of computer use appear similar across the three treatment groups. Similarly, the two-way table suggests that the distribution of gender is about the same in all three groups: in the run shown, roughly 51%, 52%, and 50% of groups 1, 2, and 3 are female.
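To get a feel for how much the displays can differ from run to run, the randomization can be repeated many times. The sketch below assumes the gender split from the two-way table (230 females among the 450 sampled subjects) and records, for each repetition, the gap between the highest and lowest percent female across the three groups. The repetition count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_subjects, n_female = 450, 230  # 76 + 72 + 82 females in the two-way table

is_female = np.zeros(n_subjects, dtype=bool)
is_female[:n_female] = True

gaps = []
for _ in range(1000):
    # Same assignment scheme as the notebook: independent labels 1..3.
    group = rng.integers(1, 4, size=n_subjects)
    pct = [100 * is_female[group == g].mean() for g in (1, 2, 3)]
    gaps.append(max(pct) - min(pct))

gaps = np.array(gaps)
print(f"median gap: {np.median(gaps):.1f} percentage points")
print(f"95th percentile gap: {np.percentile(gaps, 95):.1f} percentage points")
```

The gap is rarely zero but stays modest for groups of this size, which is the sense in which randomization balances the groups: approximately, not exactly.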