# A/B test
## T-Tests and P-Values

Let's say we're running an A/B test. We'll fabricate some data that randomly assigns order amounts from customers in sets A and B, with B being a little bit higher:

In [1]:
import numpy as np
from scipy import stats

# A - test group
A = np.random.normal(25.0, 5.0, 10000)
# B - control group
B = np.random.normal(26.0, 5.0, 10000)

# Run an independent two-sample t-test on A and B
stats.ttest_ind(A, B)
# A negative statistic means the mean of A (test) is lower than the mean of B (control)
# A low p-value means a difference this large would be unlikely if it were only random variation

Ttest_indResult(statistic=-14.401489079766488, pvalue=8.683098950371365e-47)

The t-statistic is a measure of the difference between the two sets expressed in units of standard error. Put differently, it's the size of the difference relative to the variability in the data. A high absolute t value means there's probably a real difference between the two sets; you have "significance". The p-value is the probability, assuming there is no real difference (the null hypothesis), of seeing a t-statistic at least as extreme as the one observed; so a low p-value also implies "significance." If you're looking for a "statistically significant" result, you want to see a very low p-value and a high t-statistic (well, a high absolute value of the t-statistic more precisely). In the real world, statisticians seem to put more weight on the p-value result.
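
To make the definition concrete, here is a minimal sketch that computes the pooled-variance t-statistic and the two-sided p-value by hand and compares them with `stats.ttest_ind`. The helper name `manual_ttest` and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats

def manual_ttest(a, b):
    """Student's two-sample t-test with pooled variance (hypothetical helper)."""
    n_a, n_b = len(a), len(b)
    # Sample variances use ddof=1 (unbiased estimator)
    var_a, var_b = np.var(a, ddof=1), np.var(b, ddof=1)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    # Standard error of the difference between the two means
    std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t = (np.mean(a) - np.mean(b)) / std_err
    # Two-sided p-value: probability of |T| >= |t| under the null hypothesis
    p = 2.0 * stats.t.sf(abs(t), df)
    return t, p

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
a = rng.normal(25.0, 5.0, 10000)
b = rng.normal(26.0, 5.0, 10000)

t_manual, p_manual = manual_ttest(a, b)
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Both pairs of numbers should agree up to floating-point error, which shows that the t-statistic is just the difference in means divided by its standard error.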

Let's change things up so both A and B are just random, generated under the same parameters. So there's no "real" difference between the two:

In [2]:
B = np.random.normal(25.0, 5.0, 10000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-0.5228944122460242, pvalue=0.6010535263082546)

Now, our t-statistic is much lower in magnitude and our p-value is really high. This is consistent with the null hypothesis (that there is no real difference in behavior between these two sets), so we cannot reject it.
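
In practice the p-value is usually compared against a threshold chosen in advance. The sketch below assumes the conventional threshold of 0.05 (the notebook itself does not fix one) and uses a hypothetical helper `is_significant` to turn the test result into a decision.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # conventional significance threshold (an assumption, not from the text)

def is_significant(a, b, alpha=ALPHA):
    """Return (decision, statistic, pvalue) for a two-sample t-test."""
    result = stats.ttest_ind(a, b)
    return bool(result.pvalue < alpha), float(result.statistic), float(result.pvalue)

rng = np.random.default_rng(1)  # arbitrary seed
a = rng.normal(25.0, 5.0, 10000)
b_shifted = rng.normal(26.0, 5.0, 10000)  # true difference in means
b_same = rng.normal(25.0, 5.0, 10000)     # no true difference

print(is_significant(a, b_shifted))
print(is_significant(a, b_same))
```

With a true difference of 1.0 and 10,000 samples per group, the first call should reject the null hypothesis. The second call will usually not reject it, but about 5% of random draws would cross the threshold by chance.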

Let's do the same thing - where the null hypothesis is accurate - but with 10X as many samples:

In [3]:
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 5.0, 100000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=1.4805328626808123, pvalue=0.1387326725970262)

Our p-value actually got lower, and the t-statistic larger in magnitude, but still not enough to declare a real difference.

In [4]:
A = np.random.normal(25.0, 5.0, 1000000)
B = np.random.normal(25.0, 5.0, 1000000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=0.5599442298496917, pvalue=0.5755175413543256)

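When we increase the sample size again, the t-statistic is still small and the p-value is still high. When the null hypothesis is true, both values simply fluctuate at random from run to run, whatever the sample size.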

So, you could have reached the right decision with just 10,000 samples instead of 100,000. Even a million samples doesn't help.

There is no significant difference between A and B.
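
To see how much the p-value moves around when there is no real difference, we can repeat the experiment many times. This is a minimal sketch with assumed settings (1,000 simulated experiments of 1,000 samples each, threshold 0.05); it relies on the `axis` argument of `stats.ttest_ind` to run one test per row.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
n_experiments = 1000
n_samples = 1000

# Each row is one simulated experiment in which the null hypothesis is true
a = rng.normal(25.0, 5.0, size=(n_experiments, n_samples))
b = rng.normal(25.0, 5.0, size=(n_experiments, n_samples))

# axis=1 runs one t-test per row
_, p_values = stats.ttest_ind(a, b, axis=1)

# Fraction of experiments that look "significant" purely by chance
false_positive_rate = np.mean(p_values < 0.05)
print(false_positive_rate)
```

The printed rate should land near 0.05: under the null hypothesis the p-value is roughly uniform between 0 and 1, so a single run showing 0.14 or 0.58 tells us nothing about the sample size.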

If we compare the same set to itself, by definition we get a t-statistic of 0 and p-value of 1:

In [5]:
stats.ttest_ind(A, A)

Ttest_indResult(statistic=0.0, pvalue=1.0)

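We run an A/A test to check for bugs in the experiment setup. If two groups that get the same experience show no significant difference, we have more reason to trust the results of the A/B test.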
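
Comparing an array with itself is the degenerate case. A more realistic A/A check splits one population at random into two groups and tests them against each other. The sketch below uses fabricated data and assumed names (`orders`, `group_1`, `group_2`).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
orders = rng.normal(25.0, 5.0, 20000)  # one population, one experience

# Randomly assign every order to one of two equally sized groups
shuffled = rng.permutation(orders)
group_1, group_2 = shuffled[:10000], shuffled[10000:]

aa_result = stats.ttest_ind(group_1, group_2)
print(aa_result.statistic, aa_result.pvalue)
```

If the assignment logic is sound, the p-value should be below 0.05 only about one time in twenty. A/A tests that come out significant much more often than that point to a problem in the setup.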

Let's see the influence of the standard deviation (std) on the test:

In [6]:
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 6.0, 100000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-0.6387370726739928, pvalue=0.5229947225536924)

Our p-value is high and the t-statistic is low: a difference in std alone does not produce a significant result.

Let's increase the std difference:

In [9]:
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 10.0, 100000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=0.9886650911938444, pvalue=0.3228282238988791)

Our p-value got lower, and the t-statistic a bit larger in magnitude, but not enough to declare a real difference.

Let's increase the std difference even more:

In [14]:
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 20.0, 100000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=1.0366643223315852, pvalue=0.2998935782223358)

Our p-value got a bit lower, and the t-statistic a bit larger in magnitude.
We see that with the same center and different std we get small t-statistics that vary randomly from run to run.
It means there is no serious change in the result.
A larger difference in std does not make the result more significant, because the t-test compares the means of the two sets.

In our case we still cannot declare a significant difference.
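
By default `stats.ttest_ind` assumes both groups have the same variance. When the std clearly differs, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below, with arbitrary seeds and sample sizes, compares the two variants.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
a = rng.normal(25.0, 5.0, 100000)
b = rng.normal(25.0, 20.0, 100000)

student = stats.ttest_ind(a, b)                 # assumes equal variances
welch = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test

print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)

# With unequal group sizes the two tests can diverge noticeably
b_small = rng.normal(25.0, 20.0, 1000)
student_small = stats.ttest_ind(a, b_small)
welch_small = stats.ttest_ind(a, b_small, equal_var=False)
print(student_small.statistic, welch_small.statistic)
```

When both groups have the same size, the two statistics coincide and only the degrees of freedom differ, which is why the default test behaved reasonably in the cells above. With unequal sizes and unequal std, the pooled estimate can badly misstate the standard error, so Welch's variant is the safer choice.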

Let's try increasing the std in the very first case, where the means differ:

In [22]:
A = np.random.normal(25.0, 10.0, 10000)
B = np.random.normal(26.0, 10.0, 10000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-5.825832749506706, pvalue=5.770018341440199e-09)

In [23]:
A = np.random.normal(25.0, 5.0, 10000)
B = np.random.normal(26.0, 20.0, 10000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-4.345954711297113, pvalue=1.3935513465654028e-05)

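When there is a difference in the centers, we see it in the t-statistic.
With a higher std, or a difference in std, the same difference in means gives a t-statistic of smaller magnitude and a larger p-value, so the evidence is weaker.
Still, the p-value in both cases is very small. It means that in both cases we can declare a significant negative impact (the mean of A, the test group, is lower than the mean of B).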
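
A rough way to see why a higher std weakens the result is to compute the t-statistic we would expect if the sample means and variances exactly matched the true parameters: the difference in means divided by its standard error. The helper `expected_t` is a hypothetical name used only for this sketch.

```python
import math

def expected_t(mean_diff, sd_a, sd_b, n):
    """Approximate |t| for two groups of size n each (hypothetical helper)."""
    # Standard error of the difference between two independent sample means
    std_err = math.sqrt(sd_a**2 / n + sd_b**2 / n)
    return abs(mean_diff) / std_err

print(round(expected_t(1.0, 5.0, 5.0, 10000), 2))    # 14.14
print(round(expected_t(1.0, 10.0, 10.0, 10000), 2))  # 7.07
print(round(expected_t(1.0, 5.0, 20.0, 10000), 2))   # 4.85
```

These values are in the same range as the magnitudes observed in the cells above (about 14.4, 5.8 and 4.3). The remaining gap is sampling noise, since the observed t-statistic varies by roughly one unit from run to run.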