This notebook simulates an A/B test, based on a study of the incidence of malignant lymphoma among tattooed and non-tattooed people.

As a first step, we import the libraries we will use.

In [2]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

Then we generate two arrays with the simulated samples for the non-tattooed and tattooed groups, using NumPy's random.choice() with lymphoma probabilities of 0.18 and 0.21 respectively, as indicated in:

https://www.sciencedirect.com/science/article/pii/S2589537024002281#sec6

In [3]:
tattoo_sample = np.random.choice(['limphoma', 'no_limphoma'], size = 5000, p = [0.21,  0.79])
no_tattoo_sample = np.random.choice(['limphoma', 'no_limphoma'], size = 5000, p = [0.18,  0.82])
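Because the samples are random, every run of the cell above gives different counts. A minimal sketch of a reproducible variant follows, assuming NumPy's Generator API is acceptable here; the seed value 42 is an arbitrary choice and not part of the original notebook.

```python
import numpy as np

# Assumption: a fixed seed (42 is arbitrary) so the simulation can be reproduced.
rng = np.random.default_rng(42)

outcomes = ['limphoma', 'no_limphoma']
tattoo_sample = rng.choice(outcomes, size=5000, p=[0.21, 0.79])
no_tattoo_sample = rng.choice(outcomes, size=5000, p=[0.18, 0.82])

# Observed incidence in each simulated group; it should be close to 0.21 and 0.18.
tattoo_rate = (tattoo_sample == 'limphoma').mean()
no_tattoo_rate = (no_tattoo_sample == 'limphoma').mean()
print(tattoo_rate, no_tattoo_rate)
```

The observed rates differ slightly from the nominal probabilities because each group is a finite random sample.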

Then we concatenate both simulated samples into a single DataFrame:

In [4]:
group = ['tattoo']*5000 + ['no_tattoo']*5000
outcome = list(tattoo_sample) + list(no_tattoo_sample)
sim_data = {'has_tattoo': group, 'has_limphoma': outcome}
sim_data = pd.DataFrame(sim_data)
print(sim_data.head())
print(sim_data.tail())

  has_tattoo has_limphoma
0     tattoo  no_limphoma
1     tattoo  no_limphoma
2     tattoo  no_limphoma
3     tattoo  no_limphoma
4     tattoo  no_limphoma
     has_tattoo has_limphoma
9995  no_tattoo     limphoma
9996  no_tattoo  no_limphoma
9997  no_tattoo  no_limphoma
9998  no_tattoo  no_limphoma
9999  no_tattoo  no_limphoma


Then we create our contingency table, counting the outcomes within each group:

In [9]:
ab_contingency = pd.crosstab(sim_data.has_tattoo, sim_data.has_limphoma)
ab_contingency

has_limphoma  limphoma  no_limphoma
has_tattoo
no_tattoo          914         4086
tattoo            1024         3976
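The raw counts are easier to compare as proportions. As a small sketch, the table above can be rebuilt by hand and each row divided by its total; the counts are copied from the output shown, and the variable name `proportions` is only illustrative.

```python
import pandas as pd

# Counts copied from the contingency table shown above.
ab_contingency = pd.DataFrame(
    {'limphoma': [914, 1024], 'no_limphoma': [4086, 3976]},
    index=pd.Index(['no_tattoo', 'tattoo'], name='has_tattoo'),
)

# Divide each row by its total to get the observed incidence per group.
proportions = ab_contingency.div(ab_contingency.sum(axis=1), axis=0)
print(proportions)
```

On the full simulated data, pd.crosstab(sim_data.has_tattoo, sim_data.has_limphoma, normalize='index') should give the same kind of row-wise proportions directly.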


Then we calculate the p-value of the chi-square test to see whether there is a significant difference between the groups:

In [6]:
chi2, pval, dof, expected = chi2_contingency(ab_contingency)

result = ('significant' if pval < 0.05 else 'not significant')
print(f'For this simulation, the difference in lymphoma incidence between groups is {result}, because the p-value is {pval}.')

For this simulation, the difference in lymphoma incidence between groups is significant, because the p-value is 8.661628192324232e-07.
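To make the test less of a black box, here is a sketch that computes the chi-square statistic by hand from the expected counts and compares it with SciPy. Two caveats: chi2_contingency applies Yates' continuity correction by default on 2x2 tables, so the comparison passes correction=False; and the counts are copied from the contingency table printed earlier, which may come from a different random draw than the p-value printed above, so the two need not match.

```python
import numpy as np
from scipy import stats

# Counts copied from the printed contingency table (rows: no_tattoo, tattoo).
observed = np.array([[914, 4086],
                     [1024, 3976]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and its p-value, without continuity correction.
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
pval = stats.chi2.sf(stat, dof)

# Same computation through SciPy, with Yates' correction switched off.
stat_scipy, pval_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)
print(stat, pval)
print(stat_scipy, pval_scipy)
```

With the default correction=True, SciPy reports a slightly smaller statistic and a slightly larger p-value for the same table.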


After that, we rerun the test 100 times and record whether each result is 'significant' or 'not significant', which lets us estimate the power of the test.

In [12]:
results = []
for i in range(100):
    # Randomly generate the two groups (tattoo and no_tattoo) with their respective lymphoma probabilities
    tattoo_sample = np.random.choice(['limphoma', 'no_limphoma'], size = 5000, p = [0.21,  0.79])
    no_tattoo_sample = np.random.choice(['limphoma', 'no_limphoma'], size = 5000, p = [0.18,  0.82])
    group = ['tattoo']*5000 + ['no_tattoo']*5000
    # We put our simulated data into a Pandas DataFrame
    outcome = list(tattoo_sample) + list(no_tattoo_sample)
    sim_data = {'has_tattoo': group, 'has_limphoma': outcome}
    sim_data = pd.DataFrame(sim_data)
    # Build the contingency table
    ab_contingency = pd.crosstab(sim_data.has_tattoo, sim_data.has_limphoma)
#     print(ab_contingency)
    
    # Run the chi-squared test on the contingency table
    chi2, pval, dof, expected = chi2_contingency(ab_contingency)
    result = ('significant' if pval < 0.05 else 'not significant')
    results.append(result)
# print(results)

In [13]:
# Count the simulations in which the test failed to detect the difference
count = results.count('not significant')
print(count)

6


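In conclusion, for this sample size (5,000 people per group), the estimated power of the test is about 94%: 94 of the 100 simulated tests detected the difference, and the 6 'not significant' results are false negatives (Type II errors). The false positive (Type I error) rate is a separate quantity, set by the 0.05 significance threshold.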
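The power estimate can also be wrapped in a function so that other sample sizes are easy to try. The following is only a sketch under stated assumptions: the function name `estimate_power`, its defaults, and the seed are hypothetical, and it draws the number of cases per group directly from a binomial distribution instead of building a DataFrame, which should be statistically equivalent to the loop above.

```python
import numpy as np
from scipy.stats import chi2_contingency


def estimate_power(n_per_group, p_tattoo=0.21, p_no_tattoo=0.18,
                   n_sims=1000, alpha=0.05, seed=0):
    """Estimate power as the share of simulated tests with p-value < alpha.

    Hypothetical helper. Assumes both groups are large enough that no cell
    of the 2x2 table is zero (chi2_contingency raises an error otherwise).
    """
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(n_sims):
        # Number of lymphoma cases in each simulated group.
        cases_tattoo = rng.binomial(n_per_group, p_tattoo)
        cases_no_tattoo = rng.binomial(n_per_group, p_no_tattoo)
        table = [[cases_tattoo, n_per_group - cases_tattoo],
                 [cases_no_tattoo, n_per_group - cases_no_tattoo]]
        _, pval, _, _ = chi2_contingency(table)
        significant += int(pval < alpha)
    return significant / n_sims


n_sims = 1000
power = estimate_power(5000, n_sims=n_sims)
# Standard error of the estimated proportion, to show how noisy the estimate is.
std_error = np.sqrt(power * (1 - power) / n_sims)
print(f'Estimated power: {power:.3f} (standard error {std_error:.3f})')
```

With only 100 simulations, as in the loop above, the standard error of the estimate is roughly 2 to 3 percentage points, so more repetitions give a more stable figure.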