# Permutation Tests

In [None]:
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

## 1. Load and explore maternal smoker data 

The first stage of our data science pipeline is exploration: let's look at the data and see whether we find something interesting.

You can read more about this data [here](https://www.stat.berkeley.edu/~statlabs/labs.html#babiesI).

In [None]:
births = Table.read_table('data/baby.csv')

In [None]:
births.show(4)

In [None]:
smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')

In [None]:
smoking_and_birthweight.group('Maternal Smoker')

In [None]:
smoking_and_birthweight.hist('Birth Weight', group='Maternal Smoker')

Interesting! It looks like babies of non-smoking mothers have higher birth weights on average. But could this difference be due to chance alone? Let's use hypothesis testing to find out.

## 2. Test Statistic


In [None]:
means_table = smoking_and_birthweight.group('Maternal Smoker', np.mean)
means_table

In [None]:
means = means_table.column('Birth Weight mean')
observed_difference = means.item(0) - means.item(1)
observed_difference

In keeping with the approach we laid out last lecture, we'll focus only on the absolute difference between the two group means.

In [None]:
observed_difference = abs(means.item(0) - means.item(1))
observed_difference

In [None]:
def abs_difference_of_means(table, group_label, value_label):   
    # table containing group means
    means_table = table.group(group_label, np.mean)
    
    # array of group means
    means = means_table.column(value_label + ' mean')
    
    return abs(means.item(0) - means.item(1))
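
To make the test statistic concrete, here is a minimal sketch of the same computation in plain Python, so it runs without the course libraries. The function name and the six data values are made up for illustration; they are not taken from `baby.csv`.

```python
from statistics import mean

def abs_difference_of_group_means(labels, values):
    """Absolute difference between the mean of the values labeled True
    and the mean of the values labeled False."""
    group_true = [v for lab, v in zip(labels, values) if lab]
    group_false = [v for lab, v in zip(labels, values) if not lab]
    return abs(mean(group_true) - mean(group_false))

# Made-up data for illustration only (not from the real table)
smoker = [False, False, True, True, False, True]
weight = [120, 128, 110, 114, 124, 106]

statistic = abs_difference_of_group_means(smoker, weight)
print(statistic)
```

Taking the absolute value means the statistic does not depend on which group is listed first, so large values count as evidence against the null hypothesis in either direction.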

**Our observed difference**

In [None]:
observed_difference = abs_difference_of_means(births, 'Maternal Smoker', "Birth Weight")
observed_difference

We can use this function on lots of columns!

In [None]:
abs_difference_of_means(births, 'Maternal Smoker', "Maternal Age")

In [None]:
abs_difference_of_means(births, 'Maternal Smoker', "Maternal Height")

## 3. Simulation Under Null Hypothesis

### Creating Permutations of Labels

We'll use a tiny table to illustrate our approach.

In [None]:
tiny_smoking_and_birthweight = smoking_and_birthweight.take(np.arange(0,6))
tiny_smoking_and_birthweight

We'll use `.sample(with_replacement=False)` to shuffle the rows of a table. 

In [None]:
shuffled_labels = tiny_smoking_and_birthweight.sample(with_replacement=False).column('Maternal Smoker')
shuffled_labels

In [None]:
original_and_shuffled = tiny_smoking_and_birthweight.with_columns('Shuffled Label', 
                                                                 shuffled_labels)
original_and_shuffled

A function to make a permutation!

In [None]:
def permutation_sample(table, group_label):
    """
    Returns: The table with a new "Shuffled Label" column containing
    the shuffled values of the group_label.
    """
    
    # array of shuffled labels
    shuffled_labels = table.sample(with_replacement=False).column(group_label)
    
    # table of numerical variable and shuffled labels
    shuffled_table = table.with_columns('Shuffled Label', shuffled_labels)
    
    return shuffled_table
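
The key property of a permutation is that the labels are reordered, not redrawn: each group keeps exactly the same size. A small sketch in plain Python, assuming made-up labels and a hypothetical helper `shuffle_labels`, shows this:

```python
import random

def shuffle_labels(labels, rng):
    """Return a new list holding the same labels in a random order.
    Sampling all items without replacement is a permutation."""
    return rng.sample(labels, k=len(labels))

rng = random.Random(0)  # seeded only so the sketch is repeatable
labels = [False, False, True, True, False, True]  # made-up labels
shuffled = shuffle_labels(labels, rng)

print(shuffled)
print(shuffled.count(True), shuffled.count(False))  # group sizes are preserved
```

Because the values stay in place while the labels move, any association between label and value is broken, which is exactly what the null hypothesis claims.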

In [None]:
original_and_shuffled = permutation_sample(tiny_smoking_and_birthweight, 
                                           "Maternal Smoker")
original_and_shuffled

We'll calculate the statistic for the shuffled groups. 

In [None]:
abs_difference_of_means(original_and_shuffled, "Shuffled Label", "Birth Weight")

And now the full table...

In [None]:
smoking_and_birthweight

In [None]:
original_and_shuffled = permutation_sample(smoking_and_birthweight, 
                                           "Maternal Smoker")
original_and_shuffled

This is the statistic for one sample simulated under the null hypothesis.

In [None]:
abs_difference_of_means(original_and_shuffled, 'Shuffled Label', 'Birth Weight')

### Permutation Test

Our `simulate_permutation_statistic` function is in the library.  Here's the full code.  It's just a minor variation on our usual simulation code!

In [None]:
def simulate_permutation_statistic(table, group_label, value_label, num_trials):
    sample_statistics = make_array()
    for i in np.arange(num_trials):
        one_sample = permutation_sample(table, group_label)
        sample_statistic = abs_difference_of_means(one_sample, 
                                                   "Shuffled Label", 
                                                   value_label)
        sample_statistics = np.append(sample_statistics, sample_statistic)
    return sample_statistics
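
The same loop can be sketched without the course libraries. This version is only an illustration under stated assumptions: the helper names and the six data points are hypothetical, and it uses plain lists rather than tables.

```python
import random
from statistics import mean

def abs_diff_of_means(labels, values):
    """Absolute difference between the means of the True and False groups."""
    group_true = [v for lab, v in zip(labels, values) if lab]
    group_false = [v for lab, v in zip(labels, values) if not lab]
    return abs(mean(group_true) - mean(group_false))

def simulate_statistics(labels, values, num_trials, rng):
    """Shuffle the labels num_trials times and collect the statistic
    computed on each shuffled labeling."""
    statistics = []
    for _ in range(num_trials):
        shuffled = rng.sample(labels, k=len(labels))  # one permutation
        statistics.append(abs_diff_of_means(shuffled, values))
    return statistics

# Made-up data for illustration only
smoker = [False, False, True, True, False, True]
weight = [120, 128, 110, 114, 124, 106]

rng = random.Random(104)  # seeded only so the sketch is repeatable
simulated = simulate_statistics(smoker, weight, 500, rng)
print(len(simulated), min(simulated), max(simulated))
```

Each trial produces one value of the statistic under the null hypothesis; the collection of all trials is the empirical distribution that the histogram displays.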

In [None]:
simulated_birth_weight_diffs = simulate_permutation_statistic(smoking_and_birthweight, 
                                                              'Maternal Smoker', 
                                                              'Birth Weight', 
                                                              1000)

In [None]:
results = Table().with_columns('abs(Group A Mean - Group B Mean)', 
                               simulated_birth_weight_diffs)

In [None]:
plot = results.hist()
plot.set_title("Null hypothesis empirical distribution")
plot.dot(observed_difference)

Let's calculate the p-value (even if we can easily guess what it is here)...

In [None]:
np.count_nonzero(simulated_birth_weight_diffs >= observed_difference) / len(simulated_birth_weight_diffs)

Or, even better... Use our function!

In [None]:
empirical_pvalue(simulated_birth_weight_diffs, observed_difference)
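
The library code for `empirical_pvalue` is not shown here, so the following is only a sketch of what such a function might do, based on the manual calculation above: count the simulated statistics at least as large as the observed one and divide by the number of trials. The function name and the numbers are hypothetical.

```python
def proportion_at_least(simulated_statistics, observed_statistic):
    """Fraction of simulated statistics that are greater than or equal
    to the observed statistic."""
    count = sum(1 for s in simulated_statistics if s >= observed_statistic)
    return count / len(simulated_statistics)

# Made-up simulated statistics for illustration only
simulated_stats = [0.5, 1.2, 3.0, 0.1, 2.4, 9.5, 0.8, 1.9, 4.2, 0.3]
p_value = proportion_at_least(simulated_stats, 4.0)
print(p_value)
```

A small p-value means that few shuffled labelings produced a difference as large as the observed one, which is evidence against the null hypothesis.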

## 4. A second hypothesis test

Is the maternal age of smokers different from that of non-smokers?

In [None]:
observed_difference = abs_difference_of_means(births, 'Maternal Smoker', "Maternal Age")

# simulated differences in mean maternal age under the null hypothesis
simulated_age_diffs = simulate_permutation_statistic(births, 
                                                     'Maternal Smoker', 
                                                     'Maternal Age', 
                                                     1000)

In [None]:
results = Table().with_columns('abs(Group A Mean Age - Group B Mean Age)', 
                               simulated_age_diffs)

In [None]:
plot = results.hist(left_end=observed_difference)
plot.set_title("Null hypothesis empirical distribution")
plot.dot(observed_difference)

In [None]:
empirical_pvalue(simulated_age_diffs, observed_difference)