# Class 7 activities

For today's in-class activities we will run a hypothesis test for a single proportion and a hypothesis test for comparing two means. While these activities do not need to be turned in, they will be good practice for lab 6.



## Overview

**All hypothesis tests have the same logic**. Namely: 

1. One starts with a *null hypothesis* that states that a parameter has a particular value. 
2. The parameter value stated in the null hypothesis implies a particular distribution of statistics called the *null distribution*.
3. If the observed statistic is unlikely to come from the null distribution (i.e., if one has a small p-value), then one can reject the null hypothesis; i.e., we can state that the parameter value in the null hypothesis is likely not true.


**All hypothesis tests can be run by following these 5 steps**: 

1. State the null and alternative hypotheses.
2. Calculate the observed statistic from the data we have.
3. Create a null distribution that is a distribution of statistics that is consistent with the null hypothesis.
4. Calculate the p-value, which is the probability of getting a statistic in the null distribution that is as extreme or more extreme than the observed statistic.
5. Decide whether to reject the null hypothesis (i.e., are the results "statistically significant"?). 
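
The five steps can be sketched end to end on a made-up example. This is only an illustration under stated assumptions: the coin data (16 heads in 20 flips), the 0.05 cutoff, and all variable names are hypothetical, and the sketch uses NumPy's random generator rather than the `datascience` helpers used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Step 1: state the hypotheses (hypothetical coin example)
#   H0: pi = 0.5    HA: pi > 0.5
null_pi = 0.5
n_flips = 20

# Step 2: observed statistic (assumed data: 16 heads in 20 flips)
obs_stat = 16 / n_flips

# Step 3: null distribution of 10,000 simulated proportions of heads
null_dist = rng.binomial(n_flips, null_pi, size=10_000) / n_flips

# Step 4: p-value = proportion of null statistics at least as large as observed
p_value = np.count_nonzero(null_dist >= obs_stat) / len(null_dist)

# Step 5: decision, using an assumed conventional cutoff of 0.05
reject_null = p_value < 0.05
print(p_value, reject_null)
```

The one-sided comparison `>=` matches the one-sided alternative in Step 1; a different alternative hypothesis would need a different definition of "extreme".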

Let's go through a few examples of this now.



In [2]:
from datascience import *
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

%matplotlib inline



# Part 1: Hypothesis test for a single proportion 

Let's start by running a hypothesis test for a single proportion. Namely, we will try to assess whether a parameter π is equal to a particular value, based on a statistic p̂ that is calculated from data.



## Hypothesis test to examine whether Paul the Octopus was psychic 

Paul the Octopus was an octopus who was believed to have psychic abilities. In particular, it was claimed that Paul had the ability to predict the winner of soccer games. 

To test Paul’s psychic abilities, before each soccer game, two containers of food (mussels) were lowered into the octopus’ tank. The containers were identical, except for country flags of the opposing teams, one on each container. Whichever container Paul opened was deemed his predicted winner. In the 2010 World Cup, Paul (in a German aquarium) became famous for correctly predicting 11 out of 13 soccer games. We will use hypothesis testing to determine the probability that Paul would get 11 out of 13 correct if he was merely guessing.

Let's run a hypothesis test to examine whether there is sufficient evidence to conclude that Paul is psychic. 


<img src="https://raw.githubusercontent.com/emeyers/SDS173/master/images/Paul.jpg">



## 1.1: Stating the null and alternative hypotheses 

Please write down the null and alternative hypotheses in symbols and words to assess whether Paul is psychic. 


**Answer**







## 1.2: Calculate the statistic of interest

Please calculate the observed statistic p̂ and assign the value of the observed statistic to a name `paul_stat`


## 1.3a: Create a statistic consistent with the null hypothesis

A *null distribution* is a distribution of statistics that we would expect to get if the null hypothesis was true; i.e., if the parameter specified in the null hypothesis was correct. When running a hypothesis test for a single proportion π, we can simulate observed statistics p̂ that are consistent with the null hypothesis by flipping coins. 

To generate data that is consistent with the null hypothesis, we set the probability of getting a "head" on each coin flip to the parameter value specified in the null hypothesis. If the actual data set we have collected contains *n* values, then we can flip the coin *n* times to simulate a "fake data set" that is consistent with the null hypothesis. We can then calculate one statistic (p̂) from this simulated data. If we repeat this process many times, we generate a null distribution of statistics that are consistent with the null hypothesis. 
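
As a rough illustration of the coin-flipping idea, the sketch below simulates one p̂ from *n* flips. The helper name `simulate_one_proportion`, the values *n* = 50 and π = 0.3, and the use of NumPy's random generator are all assumptions made for this example; the exercise below asks for a version built on the `datascience` package instead.

```python
import numpy as np

def simulate_one_proportion(n, null_pi, rng):
    """Flip a coin n times with P(heads) = null_pi; return the proportion of heads."""
    flips = rng.random(n) < null_pi   # True marks a "head"
    return np.count_nonzero(flips) / n

rng = np.random.default_rng(1)        # seeded so the sketch is reproducible
one_p_hat = simulate_one_proportion(50, 0.3, rng)   # hypothetical n and pi
print(one_p_hat)
```

Each call returns one simulated statistic; calling the helper repeatedly and collecting the results would give a null distribution.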

Let's start this process of creating a null distribution by writing a function `generate_flip_proportion_heads(num_flips, prob_heads)` that takes a number of coin flips (`num_flips`) and a probability of getting heads on each flip (`prob_heads`), and returns a statistic (p̂) which is the proportion of heads obtained from flipping the coin `num_flips` times. To do this, the `sample_proportions(num_flips, [prob_heads, 1 - prob_heads])` function in the `datascience` package will be useful.

Once you have created the `generate_flip_proportion_heads()` function, run it once to generate one p̂ that is consistent with the null hypothesis that Paul is just guessing.


In [58]:
def generate_flip_proportion_heads(num_flips, prob_heads):
    pass



0.07692307692307693

## 1.3b: Create the null distribution

Now that you have a function to create a statistic consistent with the null hypothesis, use a for loop to create a null distribution that has 10,000 points in it. Then visualize this null distribution as a histogram.



In [3]:
null_distribution = make_array()




## 1.4: Calculate the p-value

Now that you have a null distribution, calculate the p-value which is the proportion of points in the null distribution that are as great or greater than the observed statistic (`paul_stat`). To do this, the `np.count_nonzero()` function will be useful.
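
To show how `np.count_nonzero()` turns a comparison into a proportion, here is a small sketch on a made-up array. The values in `toy_null` and `toy_obs` are hypothetical and chosen only so the count is easy to check by hand.

```python
import numpy as np

# Hypothetical null statistics and observed statistic (not Paul's data)
toy_null = np.array([0.2, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9])
toy_obs = 0.7

# The comparison gives a boolean array; count_nonzero counts the True entries
num_as_extreme = np.count_nonzero(toy_null >= toy_obs)

# Dividing by the number of null statistics gives the proportion (the p-value)
toy_p_value = num_as_extreme / len(toy_null)
print(num_as_extreme, toy_p_value)   # prints: 3 0.375
```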




## 1.5: Make a decision

Based on the results you have, is Paul psychic? 



# Part 2: Hypothesis test for comparing two means

Let's now run a hypothesis test for comparing two means. Like all hypothesis tests, this test follows the same logic and can be run using the 5-step procedure we have discussed. Here, however, we are trying to assess whether two parameters μ1 and μ2 are equal, based on two observed statistics x̅1 and x̅2.



## Hypothesis test to examine whether baseball games have gotten longer

Let's examine whether the average length of baseball games has increased. To do this we will compare the length of baseball games in 1964 to the length of baseball games in 2014. 

The code below loads data that has the length of baseball games. Please create a name `game_info2` that reduces this data table to only the columns that contain the year and the length of the games in minutes, and only the rows for the years 1964 and 2014.



In [4]:
game_info = Table.from_df(pd.read_csv('https://yale.box.com/shared/static/un9xq5rm2ackorbphpuo9xvjhsg8o8r3.gz'))

game_info.show(3)




Date,DoubleHeader,DayOfWeek,VisitingTeam,VisitingTeamLeague,VisitingTeamGameNumber,HomeTeam,HomeTeamLeague,HomeTeamGameNumber,VisitorRunsScored,HomeRunsScore,LengthInOuts,DayNight,CompletionInfo,ForfeitInfo,ProtestInfo,ParkID,Attendence,Duration,VisitorLineScore,HomeLineScore,VisitorAB,VisitorH,VisitorD,VisitorT,VisitorHR,VisitorRBI,VisitorSH,VisitorSF,VisitorHBP,VisitorBB,VisitorIBB,VisitorK,VisitorSB,VisitorCS,VisitorGDP,VisitorCI,VisitorLOB,VisitorPitchers,VisitorER,VisitorTER,VisitorWP,VisitorBalks,VisitorPO,VisitorA,VisitorE,VisitorPassed,VisitorDB,VisitorTP,HomeAB,HomeH,HomeD,HomeT,HomeHR,HomeRBI,HomeSH,HomeSF,HomeHBP,HomeBB,HomeIBB,HomeK,HomeSB,HomeCS,HomeGDP,HomeCI,HomeLOB,HomePitchers,HomeER,HomeTER,HomeWP,HomeBalks,HomePO,HomeA,HomeE,HomePassed,HomeDB,HomeTP,UmpireHID,UmpireHName,Umpire1BID,Umpire1BName,Umpire2BID,Umpire2BName,Umpire3BID,Umpire3BName,UmpireLFID,UmpireLFName,UmpireRFID,UmpireRFName,VisitorManagerID,VisitorManagerName,HomeManagerID,HomeManagerName,WinningPitcherID,WinningPitcherName,LosingPitcherID,LosingPitcherNAme,SavingPitcherID,SavingPitcherName,GameWinningRBIID,GameWinningRBIName,VisitorStartingPitcherID,VisitorStartingPitcherName,HomeStartingPitcherID,HomeStartingPitcherName,VisitorBatting1PlayerID,VisitorBatting1Name,VisitorBatting1Position,VisitorBatting2PlayerID,VisitorBatting2Name,VisitorBatting2Position,VisitorBatting3PlayerID,VisitorBatting3Name,VisitorBatting3Position,VisitorBatting4PlayerID,VisitorBatting4Name,VisitorBatting4Position,VisitorBatting5PlayerID,VisitorBatting5Name,VisitorBatting5Position,VisitorBatting6PlayerID,VisitorBatting6Name,VisitorBatting6Position,VisitorBatting7PlayerID,VisitorBatting7Name,VisitorBatting7Position,VisitorBatting8PlayerID,VisitorBatting8Name,VisitorBatting8Position,VisitorBatting9PlayerID,VisitorBatting9Name,VisitorBatting9Position,HomeBatting1PlayerID,HomeBatting1Name,HomeBatting1Position,HomeBatting2PlayerID,HomeBatting2Name,HomeBatting2Position,HomeBatting3PlayerID,HomeBatting3Name,HomeBatting3Position,HomeBatting4PlayerID,HomeBatting4Name,HomeBatting4Position,HomeBatting5PlayerID,HomeBatting5Name,HomeBatting5Position,HomeBatting6PlayerID,HomeBatting6Name,HomeBatting6Position,HomeBatting7PlayerID,HomeBatting7Name,HomeBatting7Position,HomeBatting8PlayerID,HomeBatting8Name,HomeBatting8Position,HomeBatting9PlayerID,HomeBatting9Name,HomeBatting9Position,AdditionalInfo,AcquisitionInfo,Year,Month,Day
2014-03-22,0,Sat,LAN,NL,1,ARI,NL,1,3,1,54,N,,,,SYD01,38266,169,10200000.0,000001000,33,5,2,0,1,3,0,0,1,3,0,11,0,0,0,0,7,4,1,1,1,0,27,13,1,0,0,0,33,5,1,0,0,1,0,0,0,2,0,10,0,0,0,0,7,5,3,3,1,0,27,10,1,0,0,0,welkt901,Tim Welke,scotd901,Dale Scott,diazl901,Laz Diaz,carlm901,Mark Carlson,,(none),,(none),mattd001,Don Mattingly,gibsk001,Kirk Gibson,kersc001,Clayton Kershaw,milew001,Wade Miley,jansk001,Kenley Jansen,ethia001,Andre Ethier,kersc001,Clayton Kershaw,milew001,Wade Miley,puigy001,Yasiel Puig,9,turnj001,Justin Turner,4,ramih003,Hanley Ramirez,6,gonza003,Adrian Gonzalez,3,vanss001,Scott Van Slyke,7,uribj002,Juan Uribe,5,ethia001,Andre Ethier,8,ellia001,A.J. Ellis,2,kersc001,Clayton Kershaw,1,polla001,A.J. Pollock,8,hilla001,Aaron Hill,4,goldp001,Paul Goldschmidt,3,pradm001,Martin Prado,5,trumm001,Mark Trumbo,7,montm001,Miguel Montero,2,owinc001,Chris Owings,6,parrg001,Gerardo Parra,9,milew001,Wade Miley,1,,Y,2014,3,22
2014-03-23,0,Sun,LAN,NL,2,ARI,NL,2,7,5,54,D,,,,SYD01,38079,241,102021100.0,000000014,34,13,3,0,0,6,1,2,2,8,0,7,1,0,1,0,13,8,5,5,0,0,27,4,1,0,2,0,35,8,0,0,1,5,0,0,0,8,0,8,0,0,2,0,11,6,6,6,1,0,27,15,3,0,1,0,scotd901,Dale Scott,diazl901,Laz Diaz,carlm901,Mark Carlson,welkt901,Tim Welke,,(none),,(none),mattd001,Don Mattingly,gibsk001,Kirk Gibson,ryu-h001,Hyun-Jin Ryu,cahit001,Trevor Cahill,,(none),ethia001,Andre Ethier,ryu-h001,Hyun-Jin Ryu,cahit001,Trevor Cahill,gordd002,Dee Gordon,4,puigy001,Yasiel Puig,9,ramih003,Hanley Ramirez,6,gonza003,Adrian Gonzalez,3,ethia001,Andre Ethier,8,ellia001,A.J. Ellis,2,baxtm001,Mike Baxter,7,uribj002,Juan Uribe,5,ryu-h001,Hyun-Jin Ryu,1,polla001,A.J. Pollock,8,hilla001,Aaron Hill,4,goldp001,Paul Goldschmidt,3,pradm001,Martin Prado,5,montm001,Miguel Montero,2,trumm001,Mark Trumbo,7,parrg001,Gerardo Parra,9,gregd001,Didi Gregorius,6,cahit001,Trevor Cahill,1,,Y,2014,3,23
2014-03-30,0,Sun,LAN,NL,3,SDN,NL,1,1,3,51,N,,,,SAN02,45567,169,10000.0,00000003x,31,4,0,0,0,1,0,0,0,3,0,9,0,0,0,0,6,4,2,2,0,0,24,12,2,0,2,0,27,5,0,0,1,3,2,0,0,4,0,10,1,0,2,0,6,5,1,1,1,0,27,10,0,0,0,0,culbf901,Fieldin Culbreth,gonzm901,Manny Gonzalez,reynj901,Jim Reynolds,barbs901,Sean Barber,,(none),,(none),mattd001,Don Mattingly,blacb001,Buddy Black,thayd001,Dale Thayer,wilsb001,Brian Wilson,streh001,Huston Street,denoc001,Chris Denorfia,ryu-h001,Hyun-Jin Ryu,casha001,Andrew Cashner,crawc002,Carl Crawford,7,puigy001,Yasiel Puig,9,ramih003,Hanley Ramirez,6,gonza003,Adrian Gonzalez,3,ethia001,Andre Ethier,8,uribj002,Juan Uribe,5,ellia001,A.J. Ellis,2,gordd002,Dee Gordon,4,ryu-h001,Hyun-Jin Ryu,1,cabre001,Everth Cabrera,6,denoc001,Chris Denorfia,9,headc001,Chase Headley,5,gyorj001,Jedd Gyorko,4,alony001,Yonder Alonso,3,medit001,Tommy Medica,7,venaw001,Will Venable,8,river003,Rene Rivera,2,casha001,Andrew Cashner,1,,Y,2014,3,30


## 2.1: Stating the null and alternative hypotheses 

Please write down the null and alternative hypotheses in symbols and words to assess whether the average length of a game is the same in 1964 and 2014.


**Answer**

In words: The null hypothesis is that on average the length of games is the same in 1964 as it is in 2014. The alternative hypothesis is that games are longer in 2014 than in 1964.

In symbols we would write this as: 

$H_0: \mu_{2014} - \mu_{1964} = 0$

$H_A: \mu_{2014} - \mu_{1964} > 0$




## 2.2: Calculate the statistic of interest

Please calculate the observed statistic of interest and assign the value of the observed statistic to a name `obs_stat`.

## 2.3a: Create a statistic consistent with the null hypothesis

Let's start building the null distribution by creating one statistic that is consistent with the null hypothesis. This can be done using the following steps:

1. Use the `tb.sample(with_replacement = False)` method to create an ndarray of shuffled year labels. Save this to the name `shuffled_labels`.
2. Create a name `original_and_shuffled` that has the original data in the `game_info2` table along with a new column called `Shuffled Year` that has the shuffled labels.
3. Create a name `shuffled_group_means` that has the mean game lengths for the shuffled "2014" and "1964" data. 
4. Create a name called `obs_stat_shuff` that has one statistic that is consistent with the null hypothesis and print this statistic value.
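
The shuffling logic in the steps above can be sketched on a tiny made-up data set. The arrays `labels` and `values` below are hypothetical toy data, and the sketch uses NumPy's `permutation` rather than the `tb.sample()` method from the `datascience` package, so it shows the idea rather than the notebook's intended solution.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded so the sketch is reproducible

# Hypothetical toy data: a year label and a game length (minutes) per row
labels = np.array([1964, 1964, 1964, 2014, 2014, 2014])
values = np.array([150.0, 160.0, 155.0, 180.0, 175.0, 185.0])

# Shuffle the labels without replacement, breaking any link to the values
shuffled = rng.permutation(labels)

# One statistic consistent with the null hypothesis:
# difference of group means under the shuffled labels
shuff_stat = values[shuffled == 2014].mean() - values[shuffled == 1964].mean()
print(shuff_stat)
```

Because the labels are only rearranged, each group keeps its original size; only which values fall into which group changes.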


In [57]:
# shuffle the Year labels and store the shuffled labels in an ndarray


# create a table original_and_shuffled that has an additional column called 'Shuffled Year'


# calculate the mean durations for the shuffled "1964" and "2014" groups in a table


# calculate one statistic consistent with the null hypothesis (obs_stat_shuff)




0.8642931984875304

## 2.3b: Create the null distribution

Now use a for loop to create a null distribution that has 1,000 points in it. Then visualize this null distribution as a histogram.
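
As a hedged sketch of the loop structure, the example below repeats a label shuffle 1,000 times on made-up data and collects the statistics in an array. The toy arrays `labels` and `minutes`, the name `toy_null_dist`, and the use of NumPy in place of the `datascience` table methods are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded so the sketch is reproducible

# Hypothetical toy data (not the real baseball table)
labels = np.array([1964, 1964, 1964, 2014, 2014, 2014])
minutes = np.array([150.0, 160.0, 155.0, 180.0, 175.0, 185.0])

toy_null_dist = np.array([])    # start empty, then append one statistic per pass
for _ in range(1000):
    shuffled = rng.permutation(labels)
    stat = minutes[shuffled == 2014].mean() - minutes[shuffled == 1964].mean()
    toy_null_dist = np.append(toy_null_dist, stat)

print(len(toy_null_dist), toy_null_dist.mean())
```

Since shuffling removes any real difference between the groups, the collected statistics should be centered near zero.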



In [6]:
null_dist = make_array()





## 2.4: Calculate the p-value

Now that you have a null distribution, calculate the p-value which is the proportion of points in the null distribution that are as great or greater than the observed statistic (`obs_stat`). 



## 2.5: Make a decision

Based on the results, have baseball games gotten longer? 

