# Permutation Testing

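Permutation testing is a way of checking whether a label is significantly associated with an outcome, by comparing an observed statistic with the values it takes when the labels are shuffled at random. For example, in the Titanic data, suppose we wanted to check whether `class` has an effect on survival rate. We could do something like this: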

1. Find a statistic we're interested in, say 
$$
T = \frac{P(\text{survived}|\text{first class})}{P(\text{survived}|\text{third class})}  
$$
2. Compute this for the actual data
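3. Recompute it for many random permutations (shuffles) of the `class` label. These values approximate the distribution of $T$ under the null hypothesis that class has no effect on survival.
4. See where our observed value falls in that null distribution. The fraction of permuted values at least as extreme as the observation is the p-value.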
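
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names (`survival_ratio`, `permutation_test`), the number of shuffles, and the synthetic data with made-up survival rates are all assumptions introduced here so the block runs without the Titanic CSV.

```python
import numpy as np

def survival_ratio(survived, labels, group_a, group_b):
    """T = P(survived | group_a) / P(survived | group_b)."""
    return survived[labels == group_a].mean() / survived[labels == group_b].mean()

def permutation_test(survived, labels, group_a, group_b, n_perm=2000, seed=0):
    """One-sided permutation test: is T larger than label shuffling would explain?"""
    rng = np.random.default_rng(seed)
    survived = np.asarray(survived)
    labels = np.asarray(labels)

    # Step 2: the statistic on the real labelling
    observed = survival_ratio(survived, labels, group_a, group_b)

    # Step 3: the same statistic on shuffled labels approximates the null distribution
    null = np.array([
        survival_ratio(survived, rng.permutation(labels), group_a, group_b)
        for _ in range(n_perm)
    ])

    # Step 4: fraction of shuffles at least as extreme as the observation.
    # The +1 terms count the observed labelling as one of the permutations.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p_value

# Synthetic stand-in for the Titanic columns (made-up survival rates, not real data)
rng = np.random.default_rng(42)
pclass = rng.choice([1, 2, 3], size=600)
true_rates = np.array([0.60, 0.45, 0.25])  # hypothetical rates for classes 1, 2, 3
survived = (rng.random(600) < true_rates[pclass - 1]).astype(int)

observed, null, p_value = permutation_test(survived, pclass, group_a=1, group_b=3)
print(f"T = {observed:.2f}, p = {p_value:.4f}")
```

With the real data, `survived` and `pclass` would presumably be `df['survived'].values` and `df['pclass'].values`. Plotting a histogram of `null` with a vertical line at `observed` is a quick way to see where the observation falls.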

In [1]:
# imports a library 'pandas', names it as 'pd'
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns

# enables inline plots, without it plots don't show up in the notebook
%matplotlib inline
# %config InlineBackend.figure_format = 'svg'
%config InlineBackend.figure_format = 'png'
# mpl.rcParams['figure.dpi']= 300

In [2]:
df = pd.read_csv("../../week01-benson/02-git_viz/data/titanic.csv")

In [3]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Exercises

1. Implement a permutation test to answer the question above
2. Time your algorithm, and see how fast you can get it running
3. Explore some other Titanic hypotheses (in real life, testing many hypotheses means you'd need to watch out for the false discovery rate)

In [7]:
from itertools import permutations

In [29]:
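def test_titantic_class(classes, df):
    # Materialize the ordered pairs of distinct classes. permutations() returns
    # a one-shot iterator, so calling list(perms) just to print it would exhaust
    # it and leave the loop below with nothing to iterate over.
    # Note: these are permutations of the class values, not shuffles of the
    # `pclass` column, so the ratios are observed statistics, not a null distribution.
    perms = list(permutations(classes, 2))
    print(perms)

    Tlist = []
    for pair in perms:
        # Survival rate of the first class in the pair, divided by that of the second
        numerator   = df[df['pclass'] == pair[0]]['survived'].value_counts()[1] / df[df['pclass'] == pair[0]]['survived'].count()
        denominator = df[df['pclass'] == pair[1]]['survived'].value_counts()[1] / df[df['pclass'] == pair[1]]['survived'].count()
        stat = numerator / denominator
        Tlist.append(stat)
    return Tlist
test_titantic_class([1, 2, 3], df)

[(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]


[1.3316304810557684,
 2.597883597883598,
 0.7509590792838874,
 1.9509042747533796,
 0.38492871690427694,
 0.5125828124634221]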

In [23]:
%timeit test_titantic_class([1,2,3], df)

29.5 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
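
For the permutation test itself, most of the running time tends to go into the Python-level loop over shuffles. One possible way to speed that up is to build all the shuffles as rows of a single NumPy array and compute the group rates with array operations. The sketch below assumes NumPy 1.20 or later (for `Generator.permuted`); the function name `null_ratios_vectorized` and the synthetic data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def null_ratios_vectorized(survived, labels, group_a, group_b, n_perm=2000, seed=0):
    """Null distribution of rate(group_a) / rate(group_b), without a Python loop."""
    rng = np.random.default_rng(seed)
    survived = np.asarray(survived)
    labels = np.asarray(labels)

    # One row per permutation; each row is an independent shuffle of the outcomes.
    # Shuffling outcomes against fixed labels is equivalent to shuffling the labels.
    shuffled = rng.permuted(np.tile(survived, (n_perm, 1)), axis=1)

    # Boolean column masks pick out each group in every shuffled row at once
    rate_a = shuffled[:, labels == group_a].mean(axis=1)
    rate_b = shuffled[:, labels == group_b].mean(axis=1)
    return rate_a / rate_b

# Synthetic stand-in for the Titanic columns (made-up survival rates, not real data)
rng = np.random.default_rng(42)
pclass = rng.choice([1, 2, 3], size=600)
true_rates = np.array([0.60, 0.45, 0.25])  # hypothetical rates for classes 1, 2, 3
survived = (rng.random(600) < true_rates[pclass - 1]).astype(int)

null = null_ratios_vectorized(survived, pclass, group_a=1, group_b=3)
print(null.shape, round(null.mean(), 3))
```

The trade-off is memory: the shuffled array holds `n_perm * len(survived)` values, which is small for a dataset of this size but would need batching for much larger ones. In the notebook, `%timeit null_ratios_vectorized(...)` can be compared against a looped version.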


In [18]:
sum((test_titantic_class([1, 2, 3], df))) / 6

1.2548148270573887

In [26]:
from scipy.stats import ttest_ind

In [28]:
Ts = test_titantic_class([1,2,3], df)

In [32]:
Ts

[1.3316304810557684,
 2.597883597883598,
 0.7509590792838874,
 1.9509042747533796,
 0.38492871690427694,
 0.5125828124634221]

In [35]:
ttest_ind(df[df['pclass'] == 1]['survived'], df[df['pclass'] == 3]['survived'])

Ttest_indResult(statistic=10.623796623966948, pvalue=1.4803959119909571e-24)

In [37]:
def test_titantic_class_stats(classes, df):
    # Ordered pairs of distinct classes: (1, 2), (1, 3), (2, 1), ...
    perms = permutations(classes, 2)

    Tlist = []
    for pair in perms:
        # Survival outcomes for the numerator and denominator classes
        num_survived = df[df['pclass'] == pair[0]]['survived']
        den_survived = df[df['pclass'] == pair[1]]['survived']

        # Survival rate, group size and standard deviation for the first class
        numerator = num_survived.value_counts()[1] / num_survived.count()
        num_count = num_survived.count()
        num_stddev = np.std(num_survived)

        # Same quantities for the second class (pair[1], not pair[0])
        denominator = den_survived.value_counts()[1] / den_survived.count()
        den_count = den_survived.count()
        den_stddev = np.std(den_survived)

        # The counts and standard deviations are computed but not used yet
        stat = numerator / denominator
        Tlist.append(stat)

    return Tlist
test_titantic_class_stats([1, 2, 3], df)

[1.3316304810557684,
 2.597883597883598,
 0.7509590792838874,
 1.9509042747533796,
 0.38492871690427694,
 0.5125828124634221]
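
The group sizes and standard deviations collected in `test_titantic_class_stats` are the ingredients of a two-sample t statistic. As a rough sketch, assuming the intent was to reproduce by hand the pooled-variance statistic that `ttest_ind` reports by default, the calculation could look like this. The function name `pooled_t_statistic` and the tiny example arrays are assumptions made for illustration. Note that the pooled formula uses sample variances (`ddof=1`), whereas `np.std` on a plain array defaults to `ddof=0`.

```python
import numpy as np

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled (equal) variance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)

    # Sample variances (ddof=1) of each group
    var_a = a.var(ddof=1)
    var_b = b.var(ddof=1)

    # Pooled variance weights each group's variance by its degrees of freedom
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_err

# Tiny made-up example: survival indicators for two groups of four passengers
t_small = pooled_t_statistic([1, 1, 1, 0], [0, 0, 0, 1])
print(t_small)  # about 1.414, i.e. sqrt(2)
```

On the notebook's data, passing the two `survived` columns for classes 1 and 3 should give a value comparable to the `statistic` field of the `ttest_ind` result shown earlier.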