In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Sometimes we can reassure ourselves that our code is correct by checking it against a known answer. These checks can be implemented in Python using [assert](https://docs.python.org/3/reference/simple_stmts.html#the-assert-statement) statements, which raise an AssertionError when a given expression is not true and run silently otherwise.

A trivial example of an assert is:

In [2]:
assert 3 == 2 + 1
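
To see what a failing check looks like, the sketch below attaches a message to an assert and catches the resulting AssertionError. The helper name check_sum is hypothetical and is used only for illustration.

```python
def check_sum(a, b, expected):
    # Hypothetical helper: the text after the comma is reported if the check fails.
    assert a + b == expected, f"{a} + {b} is {a + b}, not {expected}"

check_sum(2, 1, 3)  # passes silently

try:
    check_sum(2, 2, 5)
except AssertionError as err:
    message = str(err)
    print("Check failed:", message)
```

In a notebook we would normally let the AssertionError propagate so that the cell stops with an error; it is caught here only to show the message.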

As a slightly more interesting example, suppose that we are unsure what the numpy [max](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html) function does. We can reassure ourselves by checking its result against an example where the truth is known:

In [3]:
assert np.max([2, 5, 3]) == 5
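
The same idea extends to array-valued results, with one caveat: comparing arrays with == gives an elementwise array, and asserting on an array with more than one element raises a ValueError. A minimal sketch, using a small made-up array, compares whole arrays with np.array_equal instead:

```python
import numpy as np

a = np.array([[1, 5], [7, 2]])

# Maximum over all elements is a scalar, so == works directly.
assert np.max(a) == 7

# Column-wise and row-wise maxima are arrays, so compare them as a whole.
assert np.array_equal(np.max(a, axis=0), [7, 5])
assert np.array_equal(np.max(a, axis=1), [5, 7])
```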

## Testing statistical procedures with simulation

A more interesting example uses simulation to confirm that a procedure involving random sampling gives results close to what is expected. Specifically, suppose we wish to confirm that the numpy [random.normal](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html) function, called with its default location and scale, generates data that follow a standard normal distribution. One way to do this is to generate a large random sample and check whether the sample standard deviation is close to 1 (we could also check whether the sample mean is close to zero). Since the procedure is random, we cannot expect perfect agreement, so we use the numpy [allclose](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html) function to check that the sample standard deviation is close to 1. A call allclose(a, b) passes when |a - b| <= atol + rtol * |b|; see the 'allclose' documentation for details on the relative and absolute tolerance arguments (rtol and atol). We fix the seed so that the check is reproducible, since there is always some chance that a simulation-based test will fail purely because of sampling variation.

In [4]:
np.random.seed(123)
assert np.allclose(np.random.normal(size=1000).std(), 1, rtol=1e-2, atol=1e-2)
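
The check on the sample mean mentioned above can be sketched in the same way. This version is an illustration under a few assumptions: it uses numpy's default_rng generator instead of np.random.seed, a larger sample than the cell above, and an arbitrarily chosen seed and tolerance. Note that when the target value is 0, the relative tolerance contributes nothing, so only atol matters.

```python
import numpy as np

rng = np.random.default_rng(123)  # seed chosen arbitrarily
z = rng.normal(size=100_000)      # larger sample, so estimates are less noisy

# Target is 1, so the allowed error is atol + rtol * 1 = 0.02.
assert np.allclose(z.std(), 1, rtol=1e-2, atol=1e-2)

# Target is 0, so rtol * |0| is zero and only atol sets the allowed error.
assert np.allclose(z.mean(), 0, rtol=1e-2, atol=2e-2)

# The same mean check written out explicitly.
assert abs(z.mean() - 0) <= 2e-2 + 1e-2 * abs(0)
```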

A known property of the Poisson distribution is that its population mean and variance are identical. We can check that the sample mean and variance of a random sample from a Poisson distribution are similar.

In [5]:
np.random.seed(123)
x = np.random.poisson(2, size=100000)
assert np.allclose(x.mean(), x.var(), rtol=1e-2, atol=1e-3)
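
To get a feel for how often a simulation-based check like this one fails by chance, we can repeat it over many seeds. The sketch below is illustrative only: poisson_check is a hypothetical helper, the range of seeds is arbitrary, and it uses default_rng instead of np.random.seed.

```python
import numpy as np

def poisson_check(seed, n=100_000, lam=2):
    # Hypothetical helper: run the mean/variance comparison for one seed.
    rng = np.random.default_rng(seed)
    x = rng.poisson(lam, size=n)
    return bool(np.allclose(x.mean(), x.var(), rtol=1e-2, atol=1e-3))

results = [poisson_check(seed) for seed in range(100)]
pass_rate = sum(results) / len(results)
print(f"Check passed for {pass_rate:.0%} of seeds")
```

Most seeds should pass, but not necessarily all of them. If the pass rate is low, the tolerances are too tight for the chosen sample size.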

## Testing data manipulation code

Next we consider some ways to reassure ourselves that code involving data manipulation is operating correctly. Suppose we are analyzing the NHANES 2015-2016 data. First we will load the data.

In [6]:
# Load the NHANES 2015-2016 data (assumes the CSV file is in the working directory)
df = pd.read_csv("nhanes_2015_2016.csv")

After loading the data, we may wish to confirm that the sequence variable [SEQN](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SEQN) (which is supposed to be a unique identifier for each subject) is never missing. This can be done as follows:

In [7]:
assert pd.notnull(df["SEQN"]).all()

Suppose further that we wish to check that the SEQN variable is indeed unique:

In [8]:
assert len(df["SEQN"].unique()) == df.shape[0]
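
When one of these checks fails, it helps to see which rows are responsible. The sketch below uses a small made-up data frame (not NHANES) with one missing and one duplicated identifier, and uses the pandas is_unique attribute and duplicated method as an alternative way to examine uniqueness.

```python
import numpy as np
import pandas as pd

# Toy data, not NHANES: one missing and one duplicated identifier.
toy = pd.DataFrame({"SEQN": [101.0, 102.0, 102.0, np.nan]})

ids = toy["SEQN"]
has_missing = ids.isnull().any()
is_unique = ids.dropna().is_unique

# Rows that would make the two checks fail.
missing_rows = toy[ids.isnull()]
duplicate_rows = toy[ids.duplicated(keep=False)]
print(missing_rows)
print(duplicate_rows)
```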

We can also use _range checks_ to confirm that the data are coded as expected. For example, the age variable is _top coded_ at 80, so there should be no values greater than 80 in the data:

In [9]:
assert df["RIDAGEYR"].max() <= 80
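
A range check can also cover both ends of the allowed interval and report the offending values. The following sketch uses made-up ages (not NHANES data) and assumes that valid ages lie between 0 and 80. It relies on the pandas between method, which includes both endpoints by default and treats missing values as out of range.

```python
import pandas as pd

# Toy data: the value 85 violates the top coding at 80.
toy = pd.DataFrame({"RIDAGEYR": [0, 34, 80, 85]})

in_range = toy["RIDAGEYR"].between(0, 80)
out_of_range = toy[~in_range]
print(out_of_range)

try:
    assert in_range.all(), f"{len(out_of_range)} age value(s) outside [0, 80]"
except AssertionError as err:
    print("Range check failed:", err)
```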

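Suppose we have been told that whenever ALQ101 is missing, ALQ110 must also be missing. This rule is violated only in rows where ALQ101 is missing but ALQ110 is not, so it holds if every row has either a non-missing ALQ101 or a missing ALQ110. The following assert confirms this: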

In [10]:
assert (pd.notnull(df["ALQ101"]) | pd.isnull(df["ALQ110"])).all()
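
The logic of this check can be traced on a small made-up example (the values below are not NHANES data). The rule "ALQ101 missing implies ALQ110 missing" is rewritten as "ALQ101 not missing, or ALQ110 missing", and the violating rows are those where ALQ101 is missing but ALQ110 is present.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ALQ101": [1.0, 2.0, np.nan, np.nan],
    "ALQ110": [np.nan, 1.0, np.nan, 2.0],  # the last row breaks the rule
})

a_missing = toy["ALQ101"].isnull()
b_missing = toy["ALQ110"].isnull()

# "A missing implies B missing" is the same as "(A not missing) or (B missing)".
rule_holds = ~a_missing | b_missing

# Rows that violate the rule: A missing but B present.
violations = toy[a_missing & ~b_missing]
print(violations)
```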

Now suppose that we wish to confirm that all columns of the dataframe contain numbers. This can be done as follows:

In [12]:
assert all([np.issubdtype(x, np.number) for x in df.dtypes])
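
If this check fails, we will want to know which columns are responsible. The sketch below uses a made-up data frame with hypothetical column names. It swaps in the pandas helper pd.api.types.is_numeric_dtype in place of np.issubdtype, since the pandas helper also accepts pandas-specific column types.

```python
import pandas as pd

# Toy data with hypothetical column names; "label" holds text.
toy = pd.DataFrame({
    "x": [1, 2, 3],
    "y": [70.5, 82.1, 64.0],
    "label": ["a", "b", "c"],
})

numeric_flags = {col: pd.api.types.is_numeric_dtype(toy[col]) for col in toy.columns}
non_numeric_cols = [col for col, ok in numeric_flags.items() if not ok]
print(non_numeric_cols)
```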