# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appeared in the sample just by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose the mean of `search_count`.

Please write code to compute **the difference between the `search_count` means of Interface A and Interface B.**

In [91]:
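import pandas as pd

# Each line of the file is one JSON record (one user)
df = pd.read_json('searchlog.json', lines=True)

# Mean search_count per interface; only the column we need is averaged
means = df.groupby('search_ui')['search_count'].mean()

# Difference of means, Interface B minus Interface A
print("difference of means is " + str(means['B'] - means['A']))
df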

difference of means is 0.13500569535052287


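| | uid | is_instructor | search_ui | search_count |
|---|---|---|---|---|
| 0 | 6061521 | True | A | 2 |
| 1 | 11986457 | False | A | 0 |
| 2 | 15995765 | False | A | 0 |
| 3 | 9106912 | True | B | 0 |
| 4 | 9882383 | False | A | 0 |
| ... | ... | ... | ... | ... |
| 676 | 16768212 | False | B | 0 |
| 677 | 7643715 | True | A | 0 |
| 678 | 14838641 | False | A | 0 |
| 679 | 6454817 | False | A | 0 |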


Suppose we find that the mean increased by 0.135 from Interface A to Interface B. We then wonder whether this result is just caused by random variation.

We define the Null Hypothesis as
 * The difference in the `search_count` mean between Interface A and Interface B is caused by random variation.
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in the `search_count` mean between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., < 0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [20]:
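import numpy as np

obs_diff = 0.135  # observed difference of means (B - A), rounded
numSamples = 10000
result = 0

# Work on plain arrays so the original df is never overwritten
counts = df['search_count'].values
is_a = (df['search_ui'] == 'A').values
is_b = (df['search_ui'] == 'B').values

for i in range(numSamples):
    # Shuffle the search counts across users, keeping the group sizes fixed
    shuffled = np.random.permutation(counts)
    diff = shuffled[is_b].mean() - shuffled[is_a].mean()
    # One-sided test: count shuffles with a larger difference than observed
    if diff > obs_diff:
        result += 1

p_value = result / numSamples
p_value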

0.1291
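
The cell above counts only shuffles where B exceeds A by more than the observed amount, which makes it a one-sided test. Since the question asks whether the number of searches is *different*, a two-sided variant may be a closer match. The following is a minimal sketch under stated assumptions: the function name `permutation_test`, the seed handling, and the synthetic Poisson data are all illustrative and are not part of the assignment or the real search log.

```python
import numpy as np


def permutation_test(values, labels, num_samples=10000, seed=0):
    """Two-sided permutation test for a difference in group means.

    `values` holds one measurement per user, `labels` holds 'A' or 'B'.
    Returns (observed difference B - A, estimated p-value).
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_b = np.asarray(labels) == 'B'

    obs_diff = values[is_b].mean() - values[~is_b].mean()

    extreme = 0
    for _ in range(num_samples):
        # rng.permutation returns a shuffled copy; `values` is left untouched
        shuffled = rng.permutation(values)
        diff = shuffled[is_b].mean() - shuffled[~is_b].mean()
        # Two-sided: a difference this large in either direction counts
        if abs(diff) >= abs(obs_diff):
            extreme += 1
    return obs_diff, extreme / num_samples


# Synthetic stand-in for the search log (NOT the real data)
data_rng = np.random.default_rng(42)
labels = np.array(['A'] * 300 + ['B'] * 300)
values = data_rng.poisson(lam=0.6, size=600)

obs_diff, p_value = permutation_test(values, labels, num_samples=2000)
print(obs_diff, p_value)
```

On the real data, `values` would be `df['search_count']` and `labels` would be `df['search_ui']`. A two-sided p-value is typically larger than the one-sided one, so it would not change the conclusion here.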

Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

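**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

Yes, this is p-hacking: we keep running new analyses on the same data, and each extra test raises the chance of a false positive. To compensate, we can lower the significance level, for example by dividing it by the number of tests performed.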
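
One common way to lower the significance level when several tests run on the same data is the Bonferroni correction, which divides the threshold by the number of tests. Below is a minimal sketch: the helper name `significant_after_bonferroni` is made up for illustration, and the second p-value is hypothetical (only 0.1291 comes from the analysis above).

```python
def significant_after_bonferroni(p_values, alpha=0.01):
    """Return (adjusted threshold, list of booleans) for a family of tests.

    Each p-value is compared against alpha divided by the number of tests.
    """
    adjusted_alpha = alpha / len(p_values)
    return adjusted_alpha, [p < adjusted_alpha for p in p_values]


# First value: p-value from the all-users test above.
# Second value: HYPOTHETICAL p-value for an instructors-only test.
p_values = [0.1291, 0.004]

adjusted_alpha, decisions = significant_after_bonferroni(p_values, alpha=0.01)
print(adjusted_alpha)  # threshold each test must beat
print(decisions)       # which tests remain significant
```

With two tests and a starting threshold of 0.01, each test has to reach p < 0.005 to count as significant.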

## Task 2. Chi-squared Test 

There are dozens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as the <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 

In [85]:
True_A = df[(df['is_instructor']==True) & (df['search_ui']=='A')].count().values[0]
True_B = df[(df['is_instructor']==True) & (df['search_ui']=='B')].count().values[0]
False_A = df[(df['is_instructor']==False) & (df['search_ui']=='A')].count().values[0]
False_B = df[(df['is_instructor']==False) & (df['search_ui']=='B')].count().values[0]

data = {'Search_UI_A':[True_A, False_A], 'Search_UI_B':[True_B, False_B]}
df_chi = pd.DataFrame(data, index=['instructor_T', 'instructor_F'])
df_chi

| | Search_UI_A | Search_UI_B |
|---|---|---|
| instructor_T | 115 | 120 |
| instructor_F | 233 | 213 |


In [87]:
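# Marginal totals of the contingency table (.iloc for positional access)
row_totals = df_chi.sum(axis=1)
col_totals = df_chi.sum(axis=0)
Obs_True, Obs_False = row_totals.iloc[0], row_totals.iloc[1]
Obs_A, Obs_B = col_totals.iloc[0], col_totals.iloc[1]
total = df_chi.values.sum()

# Expected counts under independence: row total * column total / grand total
expt_True_A = (Obs_True*Obs_A)/total
expt_True_B = (Obs_True*Obs_B)/total
expt_False_A = (Obs_False*Obs_A)/total
expt_False_B = (Obs_False*Obs_B)/total

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi_square = (
    (True_A - expt_True_A)**2 / expt_True_A
    + (True_B - expt_True_B)**2 / expt_True_B
    + (False_A - expt_False_A)**2 / expt_False_A
    + (False_B - expt_False_B)**2 / expt_False_B
)
chi_square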


0.6731740891275046
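
The same statistic can be computed for a contingency table of any size by building the expected counts from the marginal totals. This is a sketch only, not the required solution: the function name `chi_squared_stat` is an assumption, and the table is hard-coded from the counts shown above.

```python
import numpy as np


def chi_squared_stat(observed):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)  # shape (r, 1)
    col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, c)
    total = observed.sum()

    # Expected count per cell under independence
    expected = row_totals @ col_totals / total

    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof


# Rows: instructor True / False; columns: Search UI A / B
table = [[115, 120],
         [233, 213]]

stat, dof = chi_squared_stat(table)
print(stat, dof)
```

The result should agree with the cell-by-cell calculation above, up to floating-point rounding.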

Please explain how to use the Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

After computing the Chi-squared statistic, we look up the critical value in a Chi-squared table using the degrees of freedom and the significance level. Here, dof = (#rows - 1)(#cols - 1) = 1, and we set the significance level to 0.05, which gives a critical value of 3.841. Since 0.67 < 3.841, we cannot reject the null hypothesis. In other words, the data give no evidence that the two columns are related.
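
Instead of a table lookup, the p-value for one degree of freedom can be computed directly, because a Chi-squared variable with dof = 1 is the square of a standard normal variable. The sketch below relies on that identity and works only for dof = 1; the function name `chi2_pvalue_1dof` is an assumption made for illustration.

```python
import math


def chi2_pvalue_1dof(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    Uses P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    return math.erfc(math.sqrt(stat / 2.0))


chi_square = 0.6731740891275046  # statistic computed earlier in this task
alpha = 0.05

p_value = chi2_pvalue_1dof(chi_square)
reject_null = p_value < alpha

print(p_value, reject_null)

# Sanity check: the tabulated critical value 3.841 corresponds to p close to 0.05
print(chi2_pvalue_1dof(3.841))
```

Comparing the p-value with the significance level leads to the same decision as comparing the statistic with the critical value.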

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.