<a href="https://colab.research.google.com/github/albert-h-wong/DS-Unit-1-Sprint-4-Statistical-Tests-and-Experiments/blob/master/LS_DS_142_Sampling_Confidence_Intervals_and_Hypothesis_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School Data Science Module 142
## Sampling, Confidence Intervals, and Hypothesis Testing

## Prepare - examine other available hypothesis tests

If you had to pick a single hypothesis test for your toolbox, the t-test would probably be the best choice - but the good news is you don't have to pick just one! Here are some of the others to be aware of:

In [1]:
import numpy as np
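from scipy.stats import chisquare  # One-way chi square (goodness-of-fit) test

# With axis=None, chisquare treats the whole table as one set of counts and
# tests it against equal expected frequencies in every cell
# The null hypothesis is that all cells are equally likely -> low chi square
# The alternative is that some cells are over/under-represented -> high chi square
# Be aware! This is *not* a test of row/col independence (that is what
# scipy.stats.chi2_contingency does), and it does *not* tell you direction/causation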

ind_obs = np.array([[1, 1], [2, 2]]).T
print(ind_obs)
print(chisquare(ind_obs, axis=None))

dep_obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T
print(dep_obs)
print(chisquare(dep_obs, axis=None))

[[1 2]
 [1 2]]
Power_divergenceResult(statistic=0.6666666666666666, pvalue=0.8810148425137847)
[[16 32]
 [18 24]
 [16 16]
 [14 28]
 [12 20]
 [12 24]]
Power_divergenceResult(statistic=23.31034482758621, pvalue=0.015975692534127565)
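
The comments in the cell above talk about independence of rows and columns, which `chisquare` does not test directly. A minimal sketch of an independence test on the same two tables, assuming `scipy.stats.chi2_contingency` is the intended tool (the `results` dictionary is just an illustrative name):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same two tables as in the cell above
ind_obs = np.array([[1, 1], [2, 2]]).T
dep_obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T

results = {}
for name, table in [("ind_obs", ind_obs), ("dep_obs", dep_obs)]:
    # Returns the statistic, p-value, degrees of freedom and expected counts
    stat, p, dof, expected = chi2_contingency(table)
    results[name] = (stat, p, dof)
    print(name, "statistic:", stat, "p-value:", p, "dof:", dof)
```

For `ind_obs` the observed counts match the expected counts under independence exactly, so the statistic is zero. The degrees of freedom are (rows - 1) * (cols - 1), unlike the one-way test, which uses the number of cells minus one.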


In [2]:
# Distribution tests:
# We often assume that something is normal, but it can be important to *check*

# For example, later on with predictive modeling, a typical assumption is that
# residuals (prediction errors) are normal - checking is a good diagnostic

from scipy.stats import normaltest
# Poisson models the number of arrivals in a fixed interval and is related
# to the binomial (coinflip)
sample = np.random.poisson(5, 1000)
print(normaltest(sample))  # Pretty clearly not normal

NormaltestResult(statistic=25.410256316254348, pvalue=3.0355189544762573e-06)
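
To see what the test output looks like when the data really are normal, it can help to run both cases side by side. This is only a sketch: the seed and the choice of a normal sample with the same mean and variance as Poisson(5) are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)  # arbitrary fixed seed for repeatability
poisson_sample = rng.poisson(5, 1000)
normal_sample = rng.normal(5, np.sqrt(5), 1000)  # same mean/variance as Poisson(5)

poisson_result = normaltest(poisson_sample)
normal_result = normaltest(normal_sample)

# A small p-value is evidence against normality; a large one is merely
# a failure to reject it, not proof that the data are normal
print("poisson p-value:", poisson_result.pvalue)
print("normal  p-value:", normal_result.pvalue)
```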


In [3]:
# Kruskal-Wallis H-test - compare the median rank between 2+ groups
# Can be applied to ranking decisions/outcomes/recommendations
# The underlying math comes from chi-square distribution, and is best for n>5
from scipy.stats import kruskal

x1 = [1, 3, 5, 7, 9]
y1 = [2, 4, 6, 8, 10]
print(kruskal(x1, y1))  # x1 is a little better, but not "significantly" so

x2 = [1, 1, 1]
y2 = [2, 2, 2]
z = [2, 2]  # Hey, a third group, and of different size!
print(kruskal(x2, y2, z))  # x2 clearly dominates

KruskalResult(statistic=0.2727272727272734, pvalue=0.6015081344405895)
KruskalResult(statistic=7.0, pvalue=0.0301973834223185)


And there are many more! `scipy.stats` is fairly comprehensive, though there are even more available if you delve into the extended world of statistics packages. As tests get increasingly obscure and specialized, the importance of knowing them by heart becomes small - but being able to look them up and figure them out when they *are* relevant is still important.

## Live Lecture - let's explore some more of scipy.stats

In [0]:
# Taking requests! Come to lecture with a topic or problem and we'll try it.

## Assignment - Build a confidence interval

A confidence interval refers to a neighborhood around some point estimate, the size of which is determined by the desired confidence level. For instance, we might say that 52% of Americans prefer tacos to burritos, with a 95% confidence interval of +/- 5%.

52% (0.52) is the point estimate, and +/- 5% (the interval $[0.47, 0.57]$) is the confidence interval. "95% confidence" corresponds to a significance level of $\alpha = 1 - 0.95 = 0.05$: a two-sided test rejects the null hypothesis (p-value $\leq 0.05$) exactly when the null value falls outside the interval.

In this case, the confidence interval includes $0.5$ - which is the natural null hypothesis (that half of Americans prefer tacos and half burritos, thus there is no clear favorite). So in this case, we could use the confidence interval to report that we've failed to reject the null hypothesis.
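
The taco example can be reproduced with the normal approximation for a proportion. The text does not give a sample size, so `n = 384` below is a hypothetical value chosen because it yields a margin of roughly 5%; treat this as a sketch rather than the survey's actual calculation.

```python
import math

p_hat = 0.52   # point estimate from the text
n = 384        # hypothetical sample size (not given in the text)
z = 1.96       # two-sided 95% critical value of the standard normal

# Standard error of a sample proportion under the normal approximation
std_err = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * std_err
lower, upper = p_hat - margin, p_hat + margin

print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
print("contains the null value 0.5:", lower <= 0.5 <= upper)
```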

But providing the full analysis with a confidence interval, including a graphical representation of it, can be a helpful and powerful way to tell your story. Done well, it is also more intuitive to a layperson than simply saying "fail to reject the null hypothesis" - it shows that in fact the data does *not* give a single clear result (the point estimate) but a whole range of possibilities.

How is a confidence interval built, and how should it be interpreted? It does *not* mean that 95% of the data lies in that interval - instead, the frequentist interpretation is "if we were to repeat this experiment 100 times and build an interval each time, we would expect ~95 of those intervals to contain the true population value."
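
The repeated-experiment interpretation can be checked by simulation. The sketch below assumes normally distributed data with a known true mean (all parameter values are made up) and counts how often a 95% t-interval built from each sample contains that mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
true_mean, n, repeats = 10.0, 30, 1000

hits = 0
for _ in range(repeats):
    sample = rng.normal(true_mean, 2.0, n)
    # Half-width of the 95% interval for this particular sample
    margin = stats.sem(sample) * stats.t.ppf(0.975, n - 1)
    if abs(sample.mean() - true_mean) <= margin:
        hits += 1

coverage = hits / repeats
print("fraction of intervals containing the true mean:", coverage)
```

The printed fraction should land near 0.95. Each individual interval either contains the true mean or it does not; the 95% describes the procedure, not any single interval.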

For a 95% confidence interval and a normal(-ish) sampling distribution, you can simply remember that +/- 2 standard deviations contains 95% of the probability mass, and so the 95% confidence interval based on a given sample is centered at the sample mean (point estimate) and has a range of +/- 2 (or technically 1.96) standard errors, where the standard error is the sample standard deviation divided by $\sqrt{n}$.
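
A short numerical check of the "+/- 2" rule, using simulated data (the sample below is generated for illustration and is not from the notebook). It also shows how much narrower the standard error of the mean is than the standard deviation of the observations.

```python
import numpy as np
from scipy import stats

mass_within_2 = stats.norm.cdf(2) - stats.norm.cdf(-2)  # about 0.954
z_95 = stats.norm.ppf(0.975)                            # about 1.96

rng = np.random.default_rng(1)       # arbitrary seed
sample = rng.normal(50, 10, 100)     # simulated observations

std_dev = sample.std(ddof=1)                # spread of individual observations
std_err = std_dev / np.sqrt(len(sample))    # spread of the sample mean
ci = (sample.mean() - z_95 * std_err, sample.mean() + z_95 * std_err)

print("mass within +/- 2:", mass_within_2)
print("95% multiplier:", z_95)
print("standard deviation:", std_dev, "standard error:", std_err)
print("95% CI for the mean:", ci)
```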

Different confidence levels (90%, 99%) or distributional assumptions (for example, the t-distribution for small samples) change the multiplier, but the overall process and interpretation (with a frequentist approach) stay the same.
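
As a rough illustration of how the multiplier moves with the confidence level, the sketch below prints normal and t critical values side by side. The sample size `n = 30` is a hypothetical choice, and `crit` is just an illustrative name.

```python
from scipy import stats

n = 30  # hypothetical sample size
crit = {}
for confidence in (0.90, 0.95, 0.99):
    tail = (1 + confidence) / 2          # upper quantile for a two-sided interval
    z_crit = stats.norm.ppf(tail)
    t_crit = stats.t.ppf(tail, n - 1)
    crit[confidence] = (z_crit, t_crit)
    print(f"{confidence:.0%}: z = {z_crit:.3f}, t(df={n - 1}) = {t_crit:.3f}")
```

Higher confidence means a larger multiplier and a wider interval, and the t multiplier is always somewhat larger than the normal one for a finite sample.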

Your assignment - using the data from the prior module ([congressional voting records](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)):

1. Generate and numerically represent a confidence interval
2. Graphically (with a plot) represent the confidence interval
3. Interpret the confidence interval - what does it tell you about the data and its distribution?

Stretch goals:

1. Write a summary of your findings, mixing prose and math/code/results. *Note* - yes, this is by definition a political topic. It is challenging but important to keep your writing voice *neutral* and stick to the facts of the data. Data science often involves considering controversial issues, so it's important to be sensitive about them (especially if you want to publish).
2. Apply the techniques you learned today to your project data or other data of your choice, and write/discuss your findings here.

In [6]:
# Getting started with drug data
# http://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29

!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip

--2018-12-05 05:19:32--  http://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42989872 (41M) [application/zip]
Saving to: ‘drugsCom_raw.zip.2’


2018-12-05 05:19:37 (8.63 MB/s) - ‘drugsCom_raw.zip.2’ saved [42989872/42989872]



In [0]:
!unzip drugsCom_raw.zip

In [0]:
!head drugsComTrain_raw.tsv

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import random
import scipy

In [3]:
drugsdf = pd.read_table('drugsComTrain_raw.tsv')
drugsdf.head(15)

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37
5,155963,Cialis,Benign Prostatic Hyperplasia,"""2nd day on 5mg started to work with rock hard...",2.0,"November 28, 2015",43
6,165907,Levonorgestrel,Emergency Contraception,"""He pulled out, but he cummed a bit in me. I t...",1.0,"March 7, 2017",5
7,102654,Aripiprazole,Bipolar Disorde,"""Abilify changed my life. There is hope. I was...",10.0,"March 14, 2015",32
8,74811,Keppra,Epilepsy,""" I Ve had nothing but problems with the Kepp...",1.0,"August 9, 2016",11
9,48928,Ethinyl estradiol / levonorgestrel,Birth Control,"""I had been on the pill for many years. When m...",8.0,"December 8, 2016",1


In [4]:
print(drugsdf.shape)
print(drugsdf.dtypes)

(161297, 7)
Unnamed: 0       int64
drugName        object
condition       object
review          object
rating         float64
date            object
usefulCount      int64
dtype: object


In [5]:
drugsdf.isnull().sum()

Unnamed: 0       0
drugName         0
condition      899
review           0
rating           0
date             0
usefulCount      0
dtype: int64

In [6]:
drugsdf[drugsdf.isnull().any(axis=1)]

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
30,51452,Azithromycin,,"""Very good response. It is so useful for me. """,10.0,"August 18, 2010",1
148,61073,Urea,,"""Accurate information.""",10.0,"July 2, 2011",13
488,132651,Doxepin,,"""So far so good. Good for me and I can take it...",10.0,"October 20, 2010",25
733,44297,Ethinyl estradiol / norgestimate,,"""I haven&#039;t been on it for a long time and...",8.0,"January 24, 2011",1
851,68697,Medroxyprogesterone,,"""I started the shot in July 2015 and ended in ...",6.0,"March 23, 2017",1
1014,182050,Acetaminophen / caffeine,,"""I get migraine and have found out by taking e...",10.0,"February 19, 2012",7
1124,154412,Tavaborole,,"""I have struggled with nail for 8 or ten years...",10.0,"May 21, 2016",6
1163,110945,Acetaminophen / butalbital / caffeine / codeine,,"""I found that while this medicine does relieve...",5.0,"December 11, 2011",3
1253,74242,Ethinyl estradiol / norethindrone,,"""I started Loestrin and within two months I ex...",4.0,"April 28, 2011",0
1267,58340,Conjugated estrogens,,"""I had to have a total hysterectomy in 2009 in...",10.0,"June 11, 2016",27


In [13]:
drugsdf2 = drugsdf.dropna(axis='index')
drugsdf2.isnull().sum()

Unnamed: 0     0
drugName       0
condition      0
review         0
rating         0
date           0
usefulCount    0
dtype: int64

In [17]:
rating_pivot = pd.pivot_table(drugsdf2,index=["drugName"],values=["rating"], aggfunc=[np.mean,np.median,len])
print(rating_pivot)

                                                        mean median    len
                                                      rating rating rating
drugName                                                                  
A + D Cracked Skin Relief                          10.000000   10.0    1.0
A / B Otic                                         10.000000   10.0    1.0
Abacavir / dolutegravir / lamivudine                8.211538    9.5   52.0
Abacavir / lamivudine / zidovudine                  9.000000    9.0    1.0
Abatacept                                           7.157895    9.0   19.0
Abilify                                             6.540359    8.0  446.0
Abilify Discmelt                                    8.000000    8.0    2.0
Abilify Maintena                                    7.750000   10.0    4.0
Abiraterone                                         9.111111   10.0    9.0
AbobotulinumtoxinA                                  7.000000    8.0    3.0
Abraxane                 

In [32]:
rating_pivot.columns
list(rating_pivot.columns)

[('mean', 'rating'), ('median', 'rating'), ('len', 'rating')]

In [34]:
rating_pivot[('len','rating')]

drugName
A + D Cracked Skin Relief                              1.0
A / B Otic                                             1.0
Abacavir / dolutegravir / lamivudine                  52.0
Abacavir / lamivudine / zidovudine                     1.0
Abatacept                                             19.0
Abilify                                              446.0
Abilify Discmelt                                       2.0
Abilify Maintena                                       4.0
Abiraterone                                            9.0
AbobotulinumtoxinA                                     3.0
Abraxane                                               4.0
Abreva                                               158.0
Absorbine Jr.                                          1.0
Absorica                                               2.0
Acamprosate                                          109.0
Acanya                                                45.0
Acarbose                                       

In [35]:
rating_pivot[('len','rating')].value_counts()

1.0      805
2.0      397
3.0      232
4.0      154
5.0      135
6.0      103
8.0       85
7.0       83
10.0      61
9.0       59
11.0      58
14.0      49
13.0      42
12.0      36
17.0      34
15.0      33
18.0      29
19.0      27
16.0      23
27.0      23
30.0      23
21.0      22
22.0      21
20.0      21
24.0      21
26.0      20
23.0      20
25.0      19
31.0      18
33.0      17
        ... 
805.0      1
511.0      1
125.0      1
148.0      1
164.0      1
180.0      1
752.0      1
106.0      1
196.0      1
250.0      1
178.0      1
264.0      1
276.0      1
214.0      1
680.0      1
408.0      1
763.0      1
548.0      1
151.0      1
115.0      1
249.0      1
87.0       1
495.0      1
525.0      1
823.0      1
375.0      1
626.0      1
998.0      1
261.0      1
572.0      1
Name: (len, rating), Length: 332, dtype: int64

In [38]:
min_sample_df = rating_pivot.drop(rating_pivot[rating_pivot[('len','rating')] < 30].index)

min_sample_df


Unnamed: 0_level_0,mean,median,len
Unnamed: 0_level_1,rating,rating,rating
drugName,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Abacavir / dolutegravir / lamivudine,8.211538,9.5,52.0
Abilify,6.540359,8.0,446.0
Abreva,5.727848,7.0,158.0
Acamprosate,8.899083,10.0,109.0
Acanya,7.422222,9.0,45.0
Accutane,8.427273,9.5,330.0
Acetaminophen,8.033333,10.0,30.0
Acetaminophen / aspirin / caffeine,7.643678,9.0,87.0
Acetaminophen / butalbital / caffeine,8.898438,10.0,128.0
Acetaminophen / codeine,5.252427,5.0,103.0


In [40]:
list(min_sample_df.columns)

[('mean', 'rating'), ('median', 'rating'), ('len', 'rating')]

In [44]:
if isinstance(min_sample_df.columns, pd.MultiIndex):  # guard so the cell is safe to re-run
  min_sample_df.columns = min_sample_df.columns.droplevel(1)

In [46]:
list(min_sample_df.columns)

['mean', 'median', 'len']
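
One possible way to avoid the two-level columns in the first place is to aggregate a single column with `groupby(...).agg(...)`, which yields flat column names. The sketch uses a tiny made-up frame (`toy`) that borrows the notebook's column names.

```python
import pandas as pd

# Made-up data with the same column names as the drug review table
toy = pd.DataFrame({
    "drugName": ["A", "A", "A", "B", "B"],
    "rating": [10.0, 8.0, 9.0, 2.0, 4.0],
})

# Selecting one column before agg gives a plain (single-level) column index
summary = toy.groupby("drugName")["rating"].agg(["mean", "median", "count"])
print(list(summary.columns))
print(summary)
```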

In [47]:
min_sample_df

Unnamed: 0_level_0,mean,median,len
drugName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Abacavir / dolutegravir / lamivudine,8.211538,9.5,52.0
Abilify,6.540359,8.0,446.0
Abreva,5.727848,7.0,158.0
Acamprosate,8.899083,10.0,109.0
Acanya,7.422222,9.0,45.0
Accutane,8.427273,9.5,330.0
Acetaminophen,8.033333,10.0,30.0
Acetaminophen / aspirin / caffeine,7.643678,9.0,87.0
Acetaminophen / butalbital / caffeine,8.898438,10.0,128.0
Acetaminophen / codeine,5.252427,5.0,103.0


In [0]:
def confidence_interval(data, confidence=0.95):
  """
  Calculate the margin of error of a confidence interval around a sample mean.
  Uses the t-distribution and a two-tailed test, default 95% confidence.

  Arguments:
    data - iterable (list or numpy array) of sample observations
    confidence - level of confidence for the interval

  Returns:
    margin of error (half-width); the interval is mean +/- this value
  """
  data = np.array(data)
  n = len(data)
  stderr = stats.sem(data)  # standard error of the mean
  interval = stderr * stats.t.ppf((1 + confidence) / 2., n - 1)
  return interval
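
A usage sketch for the calculation above, applied to the ten ratings visible in the `head` output earlier. The function is restated under the hypothetical name `margin_of_error` so the block runs on its own, and the result is cross-checked against `scipy.stats.t.interval`.

```python
import numpy as np
from scipy import stats

def margin_of_error(data, confidence=0.95):
    """Half-width of a two-sided t-interval around the sample mean."""
    data = np.array(data)
    n = len(data)
    return stats.sem(data) * stats.t.ppf((1 + confidence) / 2., n - 1)

# Ratings from the first ten rows displayed earlier in the notebook
ratings = [9.0, 8.0, 5.0, 8.0, 9.0, 2.0, 1.0, 10.0, 1.0, 8.0]

mean = np.mean(ratings)
margin = margin_of_error(ratings)
lower, upper = mean - margin, mean + margin

# Cross-check with scipy's built-in interval (confidence passed positionally)
ref_lower, ref_upper = stats.t.interval(
    0.95, len(ratings) - 1, loc=mean, scale=stats.sem(ratings))

print(f"mean = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Ten reviews of different drugs are not a meaningful sample; the point is only to show how the margin turns into a lower and upper bound.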


In [49]:
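# min_sample_df only holds per-drug summaries, so aggregate the raw ratings
# of the drugs that kept at least 30 reviews instead
keep = drugsdf2['drugName'].isin(min_sample_df.index)
drugsdf2[keep].groupby('drugName')['rating'].agg(confidence_interval)
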

In [0]:
# Use the cleaned frame, and a name that does not shadow the scipy `stats` module
rating_stats = drugsdf2.groupby(['drugName'])['rating'].agg(['mean', 'count', 'std'])
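
Once the mean, count and standard deviation per group are available, the interval can be computed for every group at once without a custom aggregation function. A minimal sketch on made-up data (`toy`, `summary` and the `st` alias are illustrative names):

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Made-up ratings for two hypothetical drugs
toy = pd.DataFrame({
    "drugName": ["A"] * 5 + ["B"] * 4,
    "rating": [10.0, 8.0, 9.0, 7.0, 10.0, 2.0, 4.0, 5.0, 3.0],
})

summary = toy.groupby("drugName")["rating"].agg(["mean", "count", "std"])

# Standard error and t multiplier, computed column-wise for all groups
std_err = summary["std"] / np.sqrt(summary["count"])
t_crit = st.t.ppf(0.975, summary["count"] - 1)

summary["margin"] = t_crit * std_err
summary["lower"] = summary["mean"] - summary["margin"]
summary["upper"] = summary["mean"] + summary["margin"]
print(summary)
```

pandas' `std` uses the sample standard deviation (`ddof=1`) by default, which matches what `scipy.stats.sem` uses.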


In [41]:
# Pivot the raw ratings (not the summary table) so each aggfunc sees the observations
pd.pivot_table(drugsdf2,index=["drugName"],values=["rating"], aggfunc=[np.mean,confidence_interval,np.median,len])

In [0]:
pd.pivot_table(drugsdf,index=["drugName","condition"],values=["rating"])
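
For the graphical part of the assignment, one option is a horizontal error-bar plot with one row per drug. The numbers below are placeholders invented for the sketch, not results from the dataset; in practice they would come from a per-drug summary with a mean and a margin of error.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only
summary = pd.DataFrame(
    {"mean": [8.2, 6.5, 5.7], "margin": [0.8, 0.3, 0.5]},
    index=["Drug A", "Drug B", "Drug C"],
)

positions = list(range(len(summary)))
fig, ax = plt.subplots(figsize=(6, 3))

# Dots mark the point estimates, horizontal bars the confidence intervals
ax.errorbar(summary["mean"], positions, xerr=summary["margin"],
            fmt="o", capsize=4)
ax.set_yticks(positions)
ax.set_yticklabels(summary.index)
ax.set_xlabel("Mean rating with 95% confidence interval")
fig.tight_layout()
```

Intervals that do not overlap suggest a real difference in mean rating, while overlapping intervals call for a proper test before drawing conclusions.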