# Serology Testing Sample Size Estimation


### Questions:
1. Q1: How large does sample size need to be to detect prevalence levels using a perfect test?
2. Q2: How large does sample size need to be to detect prevalence levels using an imperfect test?
3. Q3: For a set prevalence, how do sensitivity and specificity affect the required sample size?


### Resources
1. Humphry 2004 https://www.sciencedirect.com/science/article/pii/S0167587704001412
2. Arya 2012 https://link.springer.com/article/10.1007/s12098-012-0763-3


### Q1: How large does sample size need to be to detect prevalence levels using a perfect test?

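Use the normal approximation to the binomial distribution to determine the minimum sample size $n$ needed to estimate prevalence $p$ within tolerance (absolute allowable error) $d$, where $z$ is the z-score for the desired confidence level (e.g. $z = 1.96$ for 95%).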

$n = \left(\frac{z}{d}\right)^2 p (1 - p)$ (Eq. 1)
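
As a quick check of the formula, here is a minimal sketch that evaluates it for a single prevalence. The function name `min_n_perfect` and the example inputs (4% prevalence, tolerance of half the prevalence) are illustrative assumptions, not values taken from the notebook's data.

```python
import math


def min_n_perfect(p, d, z=1.96):
    """Minimum sample size for a perfect test (normal approximation).

    p: expected prevalence, strictly between 0 and 1
    d: absolute tolerance (allowable error), > 0
    z: z-score for the desired confidence level (1.96 for 95%)
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if d <= 0:
        raise ValueError("d must be positive")
    return (z / d) ** 2 * p * (1 - p)


# Illustrative inputs only: 4% prevalence, tolerance d = p / 2
n_example = min_n_perfect(p=0.04, d=0.02)
print(math.ceil(n_example))  # 369
```

Rounding up with `math.ceil` gives the smallest whole number of participants that satisfies the formula.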

In [27]:
# Set up data

import pandas as pd
import numpy as np

ref_data_file = "/Users/margaretantonio/Documents/projects/serology_testing/data/reported_data.csv"

# Read in the ref data file and add prev, sens, spec
def get_ref_data():
    df = pd.read_csv(ref_data_file,
                     header = 0,
                    sep = ",")
    df['raw_prevalence'] = df['n_positive']/df['n']
    df['sensitivity'] = df['test_pos_test']/df['test_pos_known']
    df['specificity'] = df['test_neg_test']/df['test_neg_known']
    
    return df


studies_df = get_ref_data()

studies_df

Unnamed: 0,study_name,location,location_population,study_date,n,n_positive,test_name,test_pos_known,test_pos_test,test_neg_known,test_neg_test,test_data_source,raw_prevalence,sensitivity,specificity
0,NY State Shoppers,New York,8399000,5/12/20,15000,1845,Wadsworth,334,265,256,255,https://www.fda.gov/media/137541/download,0.123,0.793413,0.996094
1,LA Study,LA County,10040000,5/18/20,865,35,Premier Biotech,197,178,401,399,manufacturer + stanford,0.040462,0.903553,0.995012
2,Santa Clara Study,Santa Clara County,1928000,4/27/20,3300,50,Premier Biotech,197,178,401,399,manufacturer + stanford,0.015152,0.903553,0.995012


In [118]:
# Z score for 95% CI
z = 1.96

# PREVALENCE RANGE
# based on reported prevalences
# Start just above zero: at p = 0 the tolerance p/2 is 0 and the formula divides by zero
min_prev = 0.001
max_prev = studies_df['raw_prevalence'].max() + 0.1

# TOLERANCE LEVEL (ALLOWABLE ERROR)
# Conventionally, an ‘absolute’ allowable error margin d of 5 % is chosen, 
# but, as is common in clinical practice, 
# if expected prevalence P is <10 %, 
# the 95 % confidence boundaries may cross 0, which is impractical.
# A common recommendation is to set d = P/2 for rare 
# and d = (1-P)/2 for very common conditions


# Data frame of prevalence, tolerance, and min n
# Use the simple formula for a perfect test
# min_n: minimum sample size needed to estimate the prevalence within the
# tolerance at the confidence level given by z

# Build a grid of prevalence and z-score values

z_range = [1.645, 1.96, 2.326, 2.576]

prev_range = np.linspace(float(min_prev), float(max_prev), 100)
tolerance = prev_range/2

result_df = pd.DataFrame([(x, z) for x in prev_range for z in z_range],
                            columns = ["raw_prevalence", "z"])

result_df['tolerance'] = result_df['raw_prevalence']/2

result_df['min_n'] = ((result_df['z']/result_df['tolerance'])**2) * result_df['raw_prevalence'] * (1 - result_df['raw_prevalence'])
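
The tolerance comments in the cell above describe a rule of thumb (a fixed 5% margin, `P/2` for rare conditions, `(1-P)/2` for very common ones), while the cell itself applies `P/2` across the whole range. A minimal sketch of the full rule follows. The function name `allowable_error` is hypothetical, and the 90% cutoff for "very common" is an assumption: the comments only state the 10% cutoff for rare conditions.

```python
def allowable_error(p, rare_cutoff=0.10, common_cutoff=0.90, default_d=0.05):
    """Pick an absolute tolerance d for expected prevalence p.

    rare_cutoff comes from the comments above (<10%).
    common_cutoff is an assumed mirror image of it and is not stated in the notebook.
    """
    if p < rare_cutoff:
        return p / 2          # keeps the lower confidence bound above 0
    if p > common_cutoff:
        return (1 - p) / 2    # keeps the upper confidence bound below 1
    return default_d          # conventional 5% absolute margin


for p in (0.04, 0.5, 0.96):
    print(p, allowable_error(p))
```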

In [121]:
# Plot estimates of prevalence vs min n

import plotly.express as px

# One animation frame per z-score (confidence level); error bars show the tolerance
fig = px.scatter(result_df, x = "raw_prevalence", y = "min_n",
                 animation_frame = "z",
                 error_x = "tolerance",
                 title = 'Minimum sample size at different prevalence and confidence levels')

fig.update_layout(yaxis_type="log")

fig.show()

### Q2: How large does sample size need to be to detect prevalence levels using an imperfect test?

Same as Eq. 1, but incorporating the sensitivity and specificity of the test. See Humphry 2004, Eq. 2.

Here $x$ = sensitivity and $y$ = specificity.

$n = \left(\frac{z}{d}\right)^2 \frac{[xp + (1-y)(1-p)]\,[1 - xp - (1-y)(1-p)]}{(x + y - 1)^2}$
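
A minimal sketch of the imperfect-test calculation as a standalone function, mirroring the expression evaluated in the code cell below: apparent prevalence times its complement, divided by `(sens + spec - 1)` squared. The function name `min_n_imperfect` and the example inputs are illustrative assumptions.

```python
def min_n_imperfect(p, d, sens, spec, z=1.96):
    """Minimum sample size when the test has imperfect sensitivity/specificity."""
    youden = sens + spec - 1  # Youden's J; must be > 0 for an informative test
    if youden <= 0:
        raise ValueError("sens + spec must exceed 1")
    # Probability that a randomly sampled person tests positive
    apparent = sens * p + (1 - spec) * (1 - p)
    return (z / d) ** 2 * apparent * (1 - apparent) / youden ** 2


# Illustrative inputs only: 4% prevalence, tolerance 2%
n_perfect = min_n_imperfect(p=0.04, d=0.02, sens=1.0, spec=1.0)
n_imperfect = min_n_imperfect(p=0.04, d=0.02, sens=0.90, spec=0.99)
print(n_perfect, n_imperfect)
```

With `sens = spec = 1` the expression reduces to the perfect-test formula, which is a convenient sanity check on any implementation.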

In [166]:

# Imperfect test - with sensitivity and specificity
# Rogan and Gladen 1978
# Assume normal approximation to binomial distribution

# Vary sensitivity and specificity
# For now hardcode values around Marson paper estimates
sens = np.round(np.linspace(0.8,1., 5), decimals = 2)
spec = sens
prev = np.linspace(0.01,0.3, 100)
tol_fracs = np.linspace(0.5, 1, 5)

# Use the prevalence grid defined in this cell (prev), not prev_range from the Q1 cell
result2_df = pd.DataFrame([(x, s1, s2, z, t)
                           for x in prev
                           for s1 in sens
                           for s2 in spec
                           for z in z_range
                           for t in tol_fracs],
                          columns = ["raw_prevalence", "sens", "spec", "z", "tol_frac"])

result2_df['tolerance'] = result2_df['raw_prevalence'] * result2_df['tol_frac']
# Apparent (test-positive) prevalence: sens * p + (1 - spec) * (1 - p)
apparent_prev = (result2_df['sens'] * result2_df['raw_prevalence'] +
                 (1 - result2_df['spec']) * (1 - result2_df['raw_prevalence']))
# Youden's J: sens + spec - 1
youden = result2_df['sens'] + result2_df['spec'] - 1
result2_df['min_n'] = ((result2_df['z'] / result2_df['tolerance'])**2 *
                       apparent_prev * (1 - apparent_prev) / youden**2)

result2_df.to_csv("/Users/margaretantonio/Documents/projects/serology_testing/new_dat.csv")
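
The comment at the top of this cell cites Rogan and Gladen (1978). Their estimator converts an apparent (raw) prevalence into a test-adjusted one, and the sample size expression above is based on its variance. A minimal sketch, applied to the LA Study row of the reference table; the function name `rogan_gladen` and the clipping to [0, 1] are choices made here, not part of the notebook.

```python
def rogan_gladen(apparent_prev, sens, spec):
    """Rogan-Gladen adjusted prevalence: (apparent + spec - 1) / (sens + spec - 1)."""
    youden = sens + spec - 1
    if youden <= 0:
        raise ValueError("sens + spec must exceed 1")
    adjusted = (apparent_prev + spec - 1) / youden
    # The raw estimate can fall outside [0, 1] for small samples; clip it
    return min(max(adjusted, 0.0), 1.0)


# Counts from the LA Study row of the reference table above
apparent = 35 / 865      # n_positive / n
la_sens = 178 / 197      # test_pos_test / test_pos_known
la_spec = 399 / 401      # test_neg_test / test_neg_known

adjusted = rogan_gladen(apparent, la_sens, la_spec)
print(round(apparent, 4), round(adjusted, 4))
```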

In [165]:
# Plot estimates of prevalence vs min n

import plotly.express as px

# One animation frame per z-score; columns = sensitivity, rows = specificity,
# colour = tolerance as a fraction of prevalence
fig = px.line(result2_df, x = "raw_prevalence", y = "min_n",
              animation_frame = "z",
              facet_col = "sens",
              facet_row = "spec",
              color = "tol_frac",
              title = 'Minimum sample size at different prevalence levels, by sensitivity and specificity')
fig.update_xaxes(matches = 'x')
# Apply the log scale to the y-axis of every facet, not only the first one
fig.update_yaxes(type = "log")


fig.show()
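
Q3 from the question list can be read off the Q2 expression by holding prevalence fixed. The sketch below tabulates the minimum sample size over a sensitivity by specificity grid. The fixed prevalence of 4%, the tolerance of half the prevalence, and the names `p_fixed`, `d_fixed`, and `q3_table` are assumptions chosen for illustration; the grid matches the `sens` and `spec` values used above.

```python
import numpy as np
import pandas as pd

p_fixed = 0.04           # assumed prevalence, for illustration only
d_fixed = p_fixed / 2    # tolerance of half the prevalence
z95 = 1.96               # z-score for 95% confidence

sens_grid = np.round(np.linspace(0.8, 1.0, 5), 2)
spec_grid = sens_grid.copy()

rows = []
for se in sens_grid:
    for sp in spec_grid:
        # Probability of a positive test result at this sensitivity/specificity
        apparent = se * p_fixed + (1 - sp) * (1 - p_fixed)
        n = (z95 / d_fixed) ** 2 * apparent * (1 - apparent) / (se + sp - 1) ** 2
        rows.append({"sens": se, "spec": sp, "min_n": int(np.ceil(n))})

# Rows = sensitivity, columns = specificity, cells = minimum sample size
q3_table = pd.DataFrame(rows).pivot(index="sens", columns="spec", values="min_n")
print(q3_table)
```

At low prevalence most of the sample is truly negative, so false positives inflate the apparent prevalence and the required sample size falls steadily as specificity rises along each row of the table.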