## Assignment #1 - Details

In the [video](https://www.youtube.com/watch?v=5Dnw46eC-0o&feature=youtu.be) we watched in class, we saw how a random permutation test was used in place of a t-test to assign significance to a difference between two groups (in this case, subjects given either beer or water and measured for attractiveness to mosquitos).

For the assignment, you are asked to:

1. Create side-by-side boxplots for the number of mosquitos in each group (beer vs water)
2. Answer the question: What does the graph reveal about the data for both groups? Is there an association between beer consumption and attractiveness to mosquitos?
3. Calculate basic statistics measures for each group: the mean, median, standard deviation
4. Explain the numbers
5. Write the code to implement the data simulation demonstrated in the video

Some additional details are provided for each of these questions below.

By: Gabe Musso
Date: January 20th, 2018

In [1]:
# Packages we'll need for this assignment
import pandas as pd  # Pandas can be used to read in the data from the CSV file, and also for some calculations
import numpy as np  # Numpy is a Python math library, it can be used for resampling and associated calculations

# Plotly is a graphing service
import plotly
import plotly.plotly as py
import plotly.graph_objs as go

### 1. Create side-by-side boxplots for the number of mosquitos in each group (beer vs water)
There are several tools you can use to generate boxplots within Jupyter/IPython notebooks. In class I used [Plotly](https://plot.ly/python/box-plots/), but you can also use [matplotlib](https://matplotlib.org/examples/pylab_examples/boxplot_demo.html), or [Seaborn](https://seaborn.pydata.org/generated/seaborn.boxplot.html). Each of the above hyperlinks will demonstrate how to generate these plots. You should end up with a plot that contains:

1. Labeled Axes
2. Units (if applicable)

Note: if using matplotlib, you may have to run the `%matplotlib inline` magic command ([explained here](https://stackoverflow.com/questions/19410042/how-to-make-ipython-notebook-matplotlib-plot-inline)) to make the graphs appear in the notebook.
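For comparison, here is a minimal matplotlib sketch of the same kind of side-by-side boxplot. The Beer values are the first ten Beer rows from the dataframe preview in this notebook; the Water values are made up for illustration and are not the real data.

```python
import matplotlib.pyplot as plt

# First ten Beer responses from the dataframe preview in this notebook
beer = [27, 20, 21, 26, 27, 31, 24, 21, 20, 19]
# Made-up Water responses, for illustration only (not the real data)
water = [21, 22, 15, 12, 21, 16, 19, 15, 24, 19]

fig, ax = plt.subplots()
ax.boxplot([beer, water])               # one box per group, side by side
ax.set_xticklabels(['Beer', 'Water'])   # label each box with its treatment
ax.set_xlabel('Treatment')
ax.set_ylabel('Response - Number of Attracted Mosquitoes')
ax.set_title('Attracted Mosquitoes by Treatment')
```

In a notebook with inline plotting enabled, the figure should render when the cell finishes; in a plain script you would typically call `plt.show()` afterwards.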

In [2]:
df = pd.read_csv('mosquitos_data.csv')
df

Unnamed: 0,Response,Treatment
0,27,Beer
1,20,Beer
2,21,Beer
3,26,Beer
4,27,Beer
5,31,Beer
6,24,Beer
7,21,Beer
8,20,Beer
9,19,Beer


In [3]:
plotly.offline.init_notebook_mode(connected=True)

y0 = df[df.Treatment == 'Beer'].Response
y1 = df[df.Treatment == 'Water'].Response

trace0 = go.Box(
    y=y0,
    name = 'Beer',
    marker = dict(
        color = 'rgb(214, 12, 140)',
    )
)

trace1 = go.Box(
    y=y1,
    name = 'Water',
    marker = dict(
        color = 'rgb(0, 128, 128)',
    )
)

data = [trace0, trace1]

layout = go.Layout(
    title = "Attracted Mosquitoes by Treatment",
    yaxis = dict(title = 'Response - Number of Attracted Mosquitoes'),
    xaxis = dict(title = 'Treatment')
)

fig = go.Figure(data=data,layout=layout)

# Offline mode was initialized above, so render with the offline plotting function
plotly.offline.iplot(fig)

### 2. Answer the question: What does the graph reveal about the data for both groups? Is there an association between beer consumption and attractiveness to mosquitos?
<i>Some things to consider when answering this question:

1. What can you state about the medians and interquartile ranges of the two groups?
2. What do these properties tell us about the two data sets?
3. Based on the boxplot alone, can we say if the difference between the two groups is significant?</i>


For each treatment group, the graph explicitly illustrates the minimum and maximum number of attracted mosquitoes, the median number of attracted mosquitoes, and the first and third quartiles. The graph visually depicts both the full range and the likely range of variability in the data for each treatment group. The likely range of variability is illustrated by the box and is referred to as the inter-quartile range (IQR). There are no outliers in either group; outliers would be depicted as dots placed outside the "whiskers" that extend from either end of each box.

For the Beer group, the median number of attracted mosquitoes is 24 and the inter-quartile range is 7 mosquitoes.
For the Water group, the median number of attracted mosquitoes is 20 and the inter-quartile range is 6 mosquitoes.
The median tells us where the "centre" of the dataset occurs, so roughly 50% of the observations are below the median and 50% are above.
The IQR shows us that even though the median is much higher in the Beer group, the likely range of variability is about the same in each group.

Based on the boxplots alone, we cannot say whether the difference between the two groups is significant. The boxplot depicts descriptive statistics (features of the data), which summarize the sample but don't say anything about the population that the data represents. A statistical test must be used to assess how likely a difference this large would be under random chance alone, and so to determine its significance.
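The quantities a boxplot displays can also be computed directly. The sketch below uses only the first ten Beer rows from the dataframe preview (not the full group), and assumes the common convention that points more than 1.5 × IQR beyond the quartiles are drawn as outliers. Quartile conventions differ between libraries, so the numbers may not match Plotly's hover values exactly.

```python
import numpy as np

# First ten Beer responses from the dataframe preview (a subset, not the full group)
beer = np.array([27, 20, 21, 26, 27, 31, 24, 21, 20, 19])

# Quartiles using numpy's default (linear interpolation) method
q1, median, q3 = np.percentile(beer, [25, 50, 75])
iqr = q3 - q1

# Assumed outlier rule: anything beyond 1.5 * IQR from the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = beer[(beer < lower_fence) | (beer > upper_fence)]

print("Q1 = {}, median = {}, Q3 = {}, IQR = {}".format(q1, median, q3, iqr))
print("Outliers:", outliers)
```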

### 3. Calculate basic statistics measures for each group: the mean, median, standard deviation
There are multiple ways to calculate mean, median and standard deviation using Python. Below are three examples of how to calculate mean. Note that for this question you'll need to calculate the mean, median, and standard deviation for both the 'beer' and 'water' subjects.

In [4]:
print("Mean number of mosquitoes attracted to the beer subjects is {:03.2f}".format(df[df.Treatment == 'Beer'].Response.mean()))
print("Median number of mosquitoes attracted to the beer subjects is {:03.2f}".format(df[df.Treatment == 'Beer'].Response.median()))
print("The standard deviation for the beer subjects is {:03.2f}".format(df[df.Treatment == 'Beer'].Response.std()))
print("")
print("Mean number of mosquitoes attracted to the water subjects is {:03.2f}".format(df[df.Treatment == 'Water'].Response.mean()))
print("Median number of mosquitoes attracted to the water subjects is {:03.2f}".format(df[df.Treatment == 'Water'].Response.median()))
print("The standard deviation for the water subjects is {:03.2f}".format(df[df.Treatment == 'Water'].Response.std()))

Mean number of mosquitoes attracted to the beer subjects is 23.60
Median number of mosquitoes attracted to the beer subjects is 24.00
The standard deviation for the beer subjects is 4.13

Mean number of mosquitoes attracted to the water subjects is 19.22
Median number of mosquitoes attracted to the water subjects is 20.00
The standard deviation for the water subjects is 3.67
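As a sketch of the "multiple ways" mentioned in the prompt, the block below computes a mean with the standard library, numpy, and pandas, and then summarizes both groups in one `groupby` call. The Beer values are the first ten rows of the dataframe preview; the Water values are made up for illustration. One detail worth knowing: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `np.std()` defaults to the population version (`ddof=0`).

```python
import statistics
import numpy as np
import pandas as pd

# First ten Beer responses from the dataframe preview (a subset of the group)
beer = [27, 20, 21, 26, 27, 31, 24, 21, 20, 19]
# Made-up Water responses, for illustration only (not the real data)
water = [21, 22, 15, 12, 21, 16, 19, 15, 24, 19]

# Three ways to compute the same mean
mean_builtin = statistics.mean(beer)
mean_numpy = np.mean(beer)
mean_pandas = pd.Series(beer).mean()

# Standard deviation: note the different defaults
std_pandas = pd.Series(beer).std()        # sample std (ddof=1)
std_numpy_population = np.std(beer)       # population std (ddof=0)
std_numpy_sample = np.std(beer, ddof=1)   # matches pandas

# All statistics for both groups in one call
demo = pd.DataFrame({
    'Response': beer + water,
    'Treatment': ['Beer'] * len(beer) + ['Water'] * len(water),
})
summary = demo.groupby('Treatment')['Response'].agg(['mean', 'median', 'std'])
print(summary)
```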


### 4. Explain the numbers
<i>After calculating the mean, median, and standard deviation for the beer and water groups, what can we learn from these numbers? Is the difference in means consistent with the difference in medians? Why might means and medians differ? Could the standard deviation hold some clues?</i>

Since the mean and median are close within each group, we can infer that the data in each group is roughly symmetric, without strong skew or extreme outliers. (This alone does not show that the data is normally distributed.) We can see from our standard deviation values that the typical observation in the beer group deviates from its mean more than the typical observation in the water group deviates from its mean. This tells us that the data points in the beer group are spread out over a wider range of values, which we can confirm by looking at the box plot for the 2 groups.

The difference in means (4.38) seems fairly consistent with the difference in medians (4.0).

Means are heavily influenced by values at the top or bottom of the range, or any outliers in the data, whereas medians are largely unaffected by these extreme values. So outliers or skewed data can account for differences between means and medians. The standard deviation describes how far the typical observation is from the mean. If the standard deviation is small, the measurements all lie relatively close to the mean. A higher standard deviation indicates that the values are spread over a wider range. A mean on its own doesn't give a complete picture: without the standard deviation, we can't tell whether typical observations lie close to the mean or vary greatly from it.
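A small made-up example illustrates the point: adding one extreme value pulls the mean up sharply and inflates the standard deviation, while the median barely moves. The numbers below are hypothetical and are not taken from the mosquito data.

```python
import numpy as np

# Hypothetical sample with no extreme values
sample = np.array([19, 20, 20, 21, 21])
# The same sample with one extreme observation added
with_outlier = np.append(sample, 60)

for name, data in [("without outlier", sample), ("with outlier", with_outlier)]:
    print("{}: mean = {:.2f}, median = {:.2f}, std = {:.2f}".format(
        name, np.mean(data), np.median(data), np.std(data, ddof=1)))
```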

### 5. Write the code to implement the data simulation demonstrated in the video
To do this simulation we'll need a few elements:

1. We'll have to iterate over 50,000 simulations
2. For each simulation we'll have to shuffle the beer/water labels.
3. Once shuffled we'll have to calculate the difference in means between the two groups, and store this measurement in a list or numpy array.

Iteration can be handled using a standard 'for' loop. To shuffle the labels, we have a few options. Below is an example of how to randomize labels using Python's [randomization library](https://docs.python.org/2/library/random.html); however, both [pandas](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) and [numpy](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html) have methods for random sampling. In class I used a histogram to show the results of the permutations. That's not necessary for this assignment, but please make sure you retain the results of the mean comparisons at each step.
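The three shuffling options mentioned above might look like the following sketch. The six labels are a toy example, and the seeds are arbitrary choices made only so the run is repeatable.

```python
import random
import numpy as np
import pandas as pd

# Toy label list, for illustration only
labels = ['Beer'] * 3 + ['Water'] * 3

# 1. Standard library: random.shuffle reorders a list in place
shuffled_py = labels.copy()
random.shuffle(shuffled_py)

# 2. numpy: Generator.permutation returns a shuffled copy
rng = np.random.default_rng(seed=0)
shuffled_np = rng.permutation(labels)

# 3. pandas: sample(frac=1) returns every row in random order
shuffled_pd = pd.Series(labels).sample(frac=1, random_state=0).to_numpy()

print(shuffled_py)
print(shuffled_np)
print(shuffled_pd)
```

Each approach keeps the same number of 'Beer' and 'Water' labels and changes only their order, which is what the permutation test requires.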

In [5]:
# Load the data file into a pandas dataframe
df = pd.read_csv('mosquitos_data.csv')

# Number of simulations to run
n_sims = 50000

# Initialize an array to store mean comparisons
comp = np.empty(n_sims)

# Create an array from each column in the dataframe.
# Copy the labels so that shuffling them does not modify the dataframe itself.
response = df.Response.values
treatment = df.Treatment.values.copy()

# The simulation will run 50,000 times
for n in range(n_sims):

    # Shuffle the labels first, so that every stored comparison comes from
    # a random relabelling rather than from the original (observed) labels
    np.random.shuffle(treatment)

    # Calculate the mean number of attracted mosquitoes for each treatment.
    # The boolean mask (treatment == 'Beer') filters the responses by label.
    meanBeer = np.mean(response[treatment == 'Beer'])
    meanWater = np.mean(response[treatment == 'Water'])

    # Calculate and store the difference between the means
    comp[n] = meanBeer - meanWater
    
# At this point, comp contains 50,000 values computed by taking the difference between the means.
# Analyze the comparisons to determine how many times we wound up with a number >= 4.4

In [6]:
# This counts the number of comparisons which validate the skeptic's argument.
comp[np.where(comp >= 4.4)].size

23

In [None]:
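# Only 23 of the 50,000 comparisons (about 0.05%) reached a difference of 4.4 or more,
# so a difference this large is very unlikely to arise by chance alone and we can reject the null hypothesis.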
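The count above can be turned into an estimated p-value by dividing by the number of simulations (here 23 / 50,000 = 0.00046). The sketch below wraps the whole procedure in a hypothetical helper, `permutation_p_value`, which is not part of the assignment code. It assumes a one-sided test and uses the first ten Beer rows from the preview together with made-up Water values, so its result will not match the 23 reported above.

```python
import numpy as np

def permutation_p_value(response, labels, group_a, group_b, n_sims=5000, seed=0):
    """Hypothetical helper: one-sided permutation test on a difference in means."""
    rng = np.random.default_rng(seed)
    response = np.asarray(response, dtype=float)
    labels = np.asarray(labels)

    # Observed difference with the original labels
    observed = response[labels == group_a].mean() - response[labels == group_b].mean()

    # Count shuffles whose difference is at least as large as the observed one
    count = 0
    for _ in range(n_sims):
        shuffled = rng.permutation(labels)
        diff = response[shuffled == group_a].mean() - response[shuffled == group_b].mean()
        if diff >= observed:
            count += 1

    return observed, count / n_sims

# First ten Beer rows from the preview, plus made-up Water values (illustration only)
beer = [27, 20, 21, 26, 27, 31, 24, 21, 20, 19]
water = [21, 22, 15, 12, 21, 16, 19, 15, 24, 19]
labels = ['Beer'] * len(beer) + ['Water'] * len(water)

observed, p_value = permutation_p_value(beer + water, labels, 'Beer', 'Water')
print("Observed difference: {:.2f}, estimated p-value: {:.4f}".format(observed, p_value))
```

Computing the observed difference from the data, rather than hard-coding a threshold such as 4.4, keeps the comparison tied to whatever dataset is passed in.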