# T-Tests and P-Values

Let's say we're running an A/B test. We'll fabricate some data that randomly assigns order amounts from customers in sets A and B, with B being a little bit higher:

In [3]:
import numpy as np
from scipy import stats

A = np.random.normal(25.0, 5.0, 10000)
B = np.random.normal(26.0, 5.0, 10000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-13.27210025707641, pvalue=4.963098258445146e-40)

The t-statistic measures the difference between the means of the two sets, expressed in units of standard error. Put differently, it's the size of the difference relative to the variability in the data. A t value with a large magnitude means there's probably a real difference between the two sets; you have "significance". (It is negative here simply because A's mean is lower than B's.) The p-value is the probability of seeing a t-statistic at least this extreme if there were actually no real difference between the two sets; so a low p-value also implies "significance." If you're looking for a "statistically significant" result, you want to see a very low p-value and a high absolute value of the t-statistic. In the real world, statisticians seem to put more weight on the p-value.
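To make "difference in units of standard error" concrete, here is a minimal sketch that computes the t-statistic and two-sided p-value by hand and compares them with `stats.ttest_ind`. It assumes the equal-variance (pooled) form of the test, which is the `ttest_ind` default; the helper name `pooled_t_statistic` and the seed are made up for illustration.

```python
import numpy as np
from scipy import stats

# Seed chosen arbitrarily so the sketch is reproducible.
rng = np.random.default_rng(0)
A = rng.normal(25.0, 5.0, 10000)
B = rng.normal(26.0, 5.0, 10000)


def pooled_t_statistic(a, b):
    """Difference in means divided by the pooled standard error."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, assuming both groups share one variance.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (a.mean() - b.mean()) / std_err


t_manual = pooled_t_statistic(A, B)

# Two-sided p-value: probability of a t at least this extreme
# under the assumption of no real difference.
df = len(A) + len(B) - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(A, B)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

The two lines should agree, which shows there is nothing more to the test than a difference in means, a standard error, and a tail probability.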

Let's change things up so both A and B are just random, generated under the same parameters. So there's no "real" difference between the two:

In [8]:
B = np.random.normal(25.0, 5.0, 10000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-1.205787672621694, pvalue=0.22789965234863954)

This time the p-value is nowhere near low enough to suggest a real difference. Let's see whether ten times as many samples changes that:

In [9]:
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 5.0, 100000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-0.6538788585418803, pvalue=0.513190607378698)

With ten times as many samples, our p-value actually went up and the magnitude of the t-statistic went down, so there is still no evidence of a real difference. You could have reached the same decision with just 10,000 samples instead of 100,000. Even a million samples doesn't help, so if we were to keep running this A/B test for years, we'd never achieve the result we were hoping for:

In [10]:
A = np.random.normal(25.0, 5.0, 1000000)
B = np.random.normal(25.0, 5.0, 1000000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-0.06921154481545617, pvalue=0.9448212406310315)

If we compare the same set to itself, by definition we get a t-statistic of 0 and p-value of 1:

In [11]:
stats.ttest_ind(A, A)

Ttest_indResult(statistic=0.0, pvalue=1.0)

The significance threshold for the p-value is really just a judgment call. As everything is a matter of probabilities, you can never definitively say that an experiment's results are "significant". But you can use the t-statistic and p-value as a measure of significance, and look at trends in these metrics as the experiment runs to see if there might be something real happening between the two.
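One way to see why the threshold is a judgment call is to look at how often a test on identical distributions still falls below it. The sketch below is an illustration under assumed values: the threshold `alpha = 0.05`, the number of simulated experiments, the sample size, and the seed are all arbitrary choices, not recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

alpha = 0.05          # assumed significance threshold
n_experiments = 2000  # number of simulated A/B tests
sample_size = 500     # customers per group in each test

p_values = np.empty(n_experiments)
for i in range(n_experiments):
    # Both groups come from the same distribution: no real difference.
    a = rng.normal(25.0, 5.0, sample_size)
    b = rng.normal(25.0, 5.0, sample_size)
    p_values[i] = stats.ttest_ind(a, b).pvalue

# Fraction of experiments that look "significant" purely by chance.
false_positive_rate = np.mean(p_values < alpha)
print(f"Fraction with p < {alpha}: {false_positive_rate:.3f}")
```

The fraction should land close to `alpha` itself, so a stricter threshold buys fewer false alarms at the cost of needing stronger evidence to call a real difference.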

## Activity

Experiment with different distributions for A and B, and see what effect they have on the t-test.

In [23]:
import matplotlib.pyplot as plt

n = 10000000

# Normal Distribution
A = np.random.normal(loc=25.0, scale=5.0, size=n)
B = np.random.normal(loc=25.0, scale=5.0, size=n)

normal_d = stats.ttest_ind(A, B)

# Weibull Distribution
A = np.random.weibull(5, n)
B = np.random.weibull(5, n)

weibull_d = stats.ttest_ind(A, B)

# Poisson Distribution
A = np.random.poisson(2, n)
B = np.random.poisson(2, n)

poisson_d = stats.ttest_ind(A, B)

# Binomial Distribution
A = np.random.binomial(36, 1/6, n)
B = np.random.binomial(36, 1/6, n)

binomial_d = stats.ttest_ind(A, B)

# Uniform Distribution
A = np.random.uniform(-1, 0, n)
B = np.random.uniform(-1, 0, n)

uniform_d = stats.ttest_ind(A, B)

dists = [normal_d, weibull_d, poisson_d, binomial_d, uniform_d]

for i in dists:
    print(f"Distribution {i}")

Distribution Ttest_indResult(statistic=0.16786800127898543, pvalue=0.8666871258346134)
Distribution Ttest_indResult(statistic=0.9811093394432884, pvalue=0.3265388371794339)
Distribution Ttest_indResult(statistic=-1.0230031314418226, pvalue=0.3063063776055588)
Distribution Ttest_indResult(statistic=0.9720908827194316, pvalue=0.33100534922788794)
Distribution Ttest_indResult(statistic=-0.417207724561009, pvalue=0.6765264800783413)
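
Every pair above was drawn under identical parameters, so none of the p-values is low. As a possible next step, the sketch below gives B a slightly higher mean in a few of the distributions and labels each result by name. The size of each shift, the sample size, and the seed are assumptions picked only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed for reproducibility
n = 100_000

# Each entry pairs a baseline sample A with a sample B whose mean is
# slightly higher (illustrative shifts only).
pairs = {
    "normal": (rng.normal(25.0, 5.0, n), rng.normal(25.2, 5.0, n)),
    "poisson": (rng.poisson(2.0, n), rng.poisson(2.05, n)),
    "uniform": (rng.uniform(-1.0, 0.0, n), rng.uniform(-0.98, 0.02, n)),
}

results = {}
for name, (a, b) in pairs.items():
    results[name] = stats.ttest_ind(a, b)
    print(f"{name:>8}: t = {results[name].statistic:.3f}, "
          f"p = {results[name].pvalue:.3g}")
```

With this many samples, even these small shifts should produce negative t-statistics and very low p-values, whatever the shape of the underlying distribution.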
