# P-Hacking

The term "p-hacking" refers to the process by which less-than-honorable researchers rework their analysis (swapping the outcome variable, changing how missing values are handled, adjusting test parameters) until they get a "statistically significant" result. By convention, a result is called statistically significant if its p-value is below 5%. In a permutation test, the p-value is the share of randomly shuffled datasets whose test statistic is at least as extreme as the one observed in the real data.
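
To make that definition concrete, here is a minimal sketch of a permutation p-value on made-up numbers. The function name `permutation_p_value`, its parameters, and the toy data are assumptions for illustration only; they are not part of the notebook's code.

```python
import numpy as np

def permutation_p_value(values, labels, n_shuffles=1000, seed=0):
    """Share of label shuffles whose |difference in group means| is
    at least as large as the one observed with the real labels."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    def abs_mean_diff(lab):
        return abs(values[lab].mean() - values[~lab].mean())

    observed = abs_mean_diff(labels)
    # Shuffling the labels breaks any real link between group and value.
    shuffled = np.array(
        [abs_mean_diff(rng.permutation(labels)) for _ in range(n_shuffles)]
    )
    return float(np.mean(shuffled >= observed))

# Toy data (hypothetical): the first group is clearly larger than the second.
values = [10, 11, 12, 13, 14, 1, 2, 3, 4, 5]
labels = [True] * 5 + [False] * 5
p = permutation_p_value(values, labels)
print(p)
```

Because the two toy groups do not overlap at all, very few shuffles reproduce a gap this large, so the estimated p-value comes out well below 5%.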

---

## The Situation

You have been hired to determine whether there is a link between how a state voted in the 2016 presidential election and several state-level outcomes, including hate-crime rates. You decide to run an A/B test (here, a permutation test) on your variable of interest to determine whether the two groups of states come from the same underlying distribution.

See if you can get a significant result below by toying with the test parameters.

In [111]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import ipywidgets as widgets
from IPython.display import display, HTML
import warnings
warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = [15, 10]

In [100]:
crimes = pd.read_csv("../data/hate-crimes.csv")

In [101]:
crimes["voted_repub"] = crimes["share_voters_voted_trump"] > .5

In [102]:
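def test_stat(df, outcome):
    """
    Computes the test statistic: the absolute difference in the mean of
    `outcome` between states with voted_repub=False and voted_repub=True.
    Returns a single non-negative float.
    """
    # Select the outcome column before averaging so that non-numeric
    # columns (e.g. "state") are never passed to mean().
    rates = df.groupby("voted_repub")[outcome].mean().values
    return abs(rates[0] - rates[1])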

original = test_stat(crimes, "hate_crimes_per_100k_splc")

In [103]:
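def shuffle_table():
    """
    Returns a copy of crimes with the voted_repub labels randomly reassigned.
    """
    # sample(frac=1) shuffles the rows; drop=True discards the old index so
    # the shuffled labels line up with the original row order by position.
    repub = crimes.sample(frac=1).reset_index(drop=True)["voted_repub"]
    df = crimes.copy()
    df["voted_repub"] = repub
    return df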

In [104]:
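def run_ab_test_drop_na(n, outcome):
    """
    Runs a permutation (A/B) test on `outcome` with n reshuffles, dropping
    every row that has a missing value in any column. Returns the p-value.
    """
    # Compare against the observed statistic for this outcome, computed
    # with the same missing-value handling as the shuffled tables.
    observed = test_stat(crimes.dropna(), outcome)
    test_stats = []
    for _ in range(n):
        df = shuffle_table().dropna()
        test_stats.append(test_stat(df, outcome))

    # Share of shuffles at least as extreme as the observed difference.
    p_val = np.mean(np.array(test_stats) >= observed)
    return p_val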
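
The p-value returned by a test like the one above is an estimate, and its precision depends on the number of reshuffles. The sketch below uses synthetic values (not the hate-crimes data) chosen to sit near the 5% threshold, and compares how much the estimate varies across repeated runs with 50 versus 1000 reshuffles. The helper `estimate_p` and all of the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the hate-crimes dataset): two groups of ten values
# whose means differ by 5, picked to give a borderline p-value.
group_a = np.array([0, 1, 2, 3, 4, 5, 6, 12, 18, 19], dtype=float)
group_b = np.array([7, 8, 9, 10, 11, 13, 14, 15, 16, 17], dtype=float)
values = np.concatenate([group_a, group_b])
labels = np.array([True] * 10 + [False] * 10)
observed = abs(values[labels].mean() - values[~labels].mean())

def estimate_p(n_shuffles):
    """One run of the permutation test with n_shuffles reshuffles."""
    hits = 0
    for _ in range(n_shuffles):
        shuffled = rng.permutation(labels)
        stat = abs(values[shuffled].mean() - values[~shuffled].mean())
        hits += stat >= observed
    return hits / n_shuffles

# Repeat the whole test 40 times at each setting and compare the spread.
runs_50 = np.array([estimate_p(50) for _ in range(40)])
runs_1000 = np.array([estimate_p(1000) for _ in range(40)])
spread_50, spread_1000 = runs_50.std(), runs_1000.std()

print(f"std of p-value estimates: n=50 -> {spread_50:.3f}, n=1000 -> {spread_1000:.3f}")
print(f"share of n=50 runs at or below 0.05: {np.mean(runs_50 <= 0.05):.2f}")
```

With few reshuffles the estimate is noisy, so rerunning the same test until it dips under the threshold is itself a form of p-hacking.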

In [105]:
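def run_ab_test_fill_na_mean(n, outcome):
    """
    Runs a permutation (A/B) test on `outcome` with n reshuffles, replacing
    missing values of `outcome` with its mean. Returns the p-value.
    """
    # Compare against the observed statistic for this outcome, computed
    # with the same missing-value handling as the shuffled tables.
    filled = crimes[outcome].fillna(crimes[outcome].mean())
    observed = test_stat(crimes.assign(**{outcome: filled}), outcome)
    test_stats = []
    for _ in range(n):
        df = shuffle_table()
        df[outcome] = df[outcome].fillna(df[outcome].mean())
        test_stats.append(test_stat(df, outcome))

    # Share of shuffles at least as extreme as the observed difference.
    p_val = np.mean(np.array(test_stats) >= observed)
    return p_val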
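
The two functions above differ only in how they treat missing values, and that choice changes which rows enter the test. The toy table below (hypothetical column names and values) is a small sketch of the difference between dropping rows with any missing value, dropping only rows where the outcome is missing, and filling the outcome with its mean.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical): one missing outcome, one missing unrelated value.
toy = pd.DataFrame({
    "group": [True, True, False, False],
    "outcome": [1.0, 3.0, 5.0, np.nan],
    "other": [np.nan, 2.0, 2.0, 2.0],
})

# dropna() with no arguments removes a row if ANY column is missing,
# including columns that play no part in the test.
dropped = toy.dropna()

# Restricting to the outcome column keeps rows that are only missing "other".
subset_dropped = toy.dropna(subset=["outcome"])

# Mean-filling keeps every row and replaces the gap with the column mean.
filled = toy.assign(outcome=toy["outcome"].fillna(toy["outcome"].mean()))

print(len(dropped), len(subset_dropped), len(filled))
print(filled["outcome"].tolist())
```

Each option yields a different sample and therefore a potentially different p-value, which is why the "Fill NA Values" checkbox acts as a lever in this demo.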

In [106]:
cols = {
    "State" : "state",
    "Median Household Income" : "median_household_income",
    "Share Seasonally Unemployed" : "share_unemployed_seasonal",
    "Share in Metro Areas" : "share_population_in_metro_areas",
    "Share with High School Degree" : "share_population_with_high_school_degree",
    "Share Non-Citizens" : "share_non_citizen",
    "Share White Poverty" : "share_white_poverty",
    "Gini Index" : "gini_index",
    "Share Non-White" : "share_non_white",
    "Hate Crimes - FBI" : "avg_hatecrimes_per_100k_fbi",
    "Hate Crimes - SPLC" : "hate_crimes_per_100k_splc"
}

In [107]:
def barplot(outcome):
    col = cols[outcome]
    sns.barplot(x="voted_repub", y=col, data=crimes)
    plt.xlabel("Red State in 2016 Election")
    plt.ylabel(outcome);

In [108]:
def get_html(percent, good=True):
    if good:
        html = """
        <div style="font-family: sans-serif; font-size: 14pt;">
            <h2>Congratulations!</h2>
            <p>Your p-value was <strong>{:.2f}%</strong>! You can publish your results!</p>
        </div>
        """.format(percent)
    else:
        html = """
        <div style="font-family: sans-serif; font-size: 14pt;">
            <h2>Sorry!</h2>
            <p>Your p-value was <strong>{:.2f}%</strong>. You can't publish <em>these</em> findings!</p>
        </div>
        """.format(percent)
    return html

In [109]:
def run_test(outcome, n, fillna=False):
    col = cols[outcome]
    if fillna:
        p = run_ab_test_fill_na_mean(n, col)
    else:
        p = run_ab_test_drop_na(n, col)
    barplot(outcome)
    return HTML(get_html(p * 100, p < .05))

In [112]:
widgets.interact(
    run_test,
    outcome = widgets.Dropdown(options=list(cols.keys())[1:], description="Outcome"),
    n = widgets.IntSlider(value=50, min=50, max=1000, step=50, description="Reshuffles"),
    fillna = widgets.Checkbox(value=False, description="Fill NA Values")
);

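
Trying one outcome after another until something clears the threshold works more often than intuition suggests. As a rough sketch, assuming the tests are independent and that no real effect exists (neither is claimed for this dataset, where the outcomes are measured on the same states and are likely correlated), the chance of at least one p-value below 5% grows quickly with the number of tests. The function name `chance_of_false_positive` is an assumption for illustration.

```python
def chance_of_false_positive(k, alpha=0.05):
    """Probability that at least one of k independent tests of a true
    null hypothesis produces a p-value below alpha."""
    return 1 - (1 - alpha) ** k

# The dropdown above offers ten outcomes; other values of k are for comparison.
for k in (1, 5, 10, 20):
    print(k, round(chance_of_false_positive(k), 3))
```

Under these assumptions, ten tests give roughly a 40% chance of at least one "publishable" result by luck alone.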

---

App designed by [Chris Pyles](https://github.com/chrispyles). Data from [FiveThirtyEight on GitHub](https://github.com/fivethirtyeight/data/blob/master/hate-crimes/hate_crimes.csv).