# NHANES Demographics data

To spark discussion about the responsible use of machine learning, especially as it applies to health data, it makes sense to look first at what the data says about us, even after it has been anonymized.

Using the NHANES demographic data, is it possible to find small pockets of unique observations that could 'remove the veil' if we know enough about someone? This can highlight just how easy it is to uniquely describe a person, given enough data.

In [40]:
import pandas as pd

In [41]:
data = pd.read_sas('data/P_DEMO.XPT')
data

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,DMDBORN4,...,FIAINTRP,MIALANG,MIAPROXY,MIAINTRP,AIALANGA,WTINTPRP,WTMECPRP,SDMVPSU,SDMVSTRA,INDFMPIR
0,109263.0,66.0,2.0,1.0,2.0,,5.0,6.0,2.0,1.0,...,2.0,,,,,7891.762435,8.951816e+03,3.0,156.0,4.66
1,109264.0,66.0,2.0,2.0,13.0,,1.0,1.0,2.0,1.0,...,2.0,1.0,2.0,2.0,1.0,11689.747264,1.227116e+04,1.0,155.0,0.83
2,109265.0,66.0,2.0,1.0,2.0,,3.0,3.0,2.0,1.0,...,2.0,,,,,16273.825939,1.665876e+04,1.0,157.0,3.06
3,109266.0,66.0,2.0,2.0,29.0,,5.0,6.0,2.0,2.0,...,2.0,1.0,2.0,2.0,1.0,7825.646112,8.154968e+03,2.0,168.0,5.00
4,109267.0,66.0,1.0,2.0,21.0,,2.0,2.0,,2.0,...,2.0,,,,,26379.991724,5.397605e-79,1.0,156.0,5.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15555,124818.0,66.0,2.0,1.0,40.0,,4.0,4.0,1.0,1.0,...,2.0,1.0,2.0,2.0,1.0,21586.596728,2.166689e+04,1.0,166.0,3.82
15556,124819.0,66.0,2.0,1.0,2.0,,4.0,4.0,2.0,1.0,...,2.0,,,,,1664.919253,1.838170e+03,2.0,171.0,0.07
15557,124820.0,66.0,2.0,2.0,7.0,,3.0,3.0,2.0,1.0,...,2.0,,,,,14819.783161,1.649781e+04,1.0,157.0,1.22
15558,124821.0,66.0,2.0,1.0,63.0,,4.0,4.0,1.0,1.0,...,2.0,1.0,2.0,2.0,1.0,4666.817952,4.853430e+03,1.0,158.0,3.71


In [42]:
# Remove administrative fields and statistical weights to isolate the demographic answers
data = data.drop(
    columns=[
        'SEQN',
        'SDDSRVYR',
        'WTINTPRP',
        'WTMECPRP',
        'SDMVPSU',
        'SDMVSTRA',
    ]
)
data

Unnamed: 0,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,DMDBORN4,DMDYRUSZ,DMDEDUC2,...,SIAPROXY,SIAINTRP,FIALANG,FIAPROXY,FIAINTRP,MIALANG,MIAPROXY,MIAINTRP,AIALANGA,INDFMPIR
0,2.0,1.0,2.0,,5.0,6.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,4.66
1,2.0,2.0,13.0,,1.0,1.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,0.83
2,2.0,1.0,2.0,,3.0,3.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,3.06
3,2.0,2.0,29.0,,5.0,6.0,2.0,2.0,2.0,5.0,...,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,5.00
4,1.0,2.0,21.0,,2.0,2.0,,2.0,3.0,4.0,...,2.0,2.0,1.0,2.0,2.0,,,,,5.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15555,2.0,1.0,40.0,,4.0,4.0,1.0,1.0,,5.0,...,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,3.82
15556,2.0,1.0,2.0,,4.0,4.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,0.07
15557,2.0,2.0,7.0,,3.0,3.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,1.22
15558,2.0,1.0,63.0,,4.0,4.0,1.0,1.0,,2.0,...,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,3.71


In [43]:
# Count distinct rows in the dataset (rows that differ in at least one column)
uniques = len(data.drop_duplicates())
print(uniques)
print(uniques/len(data))

14748
0.947814910025707


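In the demographic dataset above, the 15,560 participants interviewed gave 14,748 distinct sets of responses (~95% of the row count). That figure counts each group of identical responses once, so the number of participants whose answers match no one else's is somewhat lower. Even so, the 812 surplus rows can form at most 812 groups of duplicates, which leaves no fewer than 13,936 participants (~90%) with a one-of-a-kind response. Put another way, if you had complete knowledge of a participant's answers to these questions, you could very likely pick them out of this dataset with little risk of selecting the wrong person.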
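To count the participants whose answers match no one else's, rather than the distinct patterns, pandas offers `DataFrame.duplicated(keep=False)`, which flags every row that has at least one identical twin. A minimal sketch on a small made-up frame follows; the `toy` data and the `count_singletons` helper are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def count_singletons(df: pd.DataFrame) -> int:
    """Number of rows whose full combination of values appears exactly once."""
    # keep=False marks every member of a duplicated group, so the
    # complement is the set of rows that are truly one of a kind.
    return int((~df.duplicated(keep=False)).sum())


# Hypothetical toy data; missing answers are NaN, as in the NHANES file.
toy = pd.DataFrame({
    "gender": [1.0, 1.0, 2.0, 2.0, 2.0],
    "age": [2.0, 2.0, 13.0, 29.0, np.nan],
    "educ": [np.nan, np.nan, np.nan, 5.0, 4.0],
})

distinct = len(toy.drop_duplicates())  # one row kept per group of identical rows
singletons = count_singletons(toy)     # rows with no identical twin
print(distinct, singletons)
```

Here the first two rows are identical (pandas treats missing values in the same position as equal when looking for duplicates), so the distinct count is one higher than the singleton count. Running `count_singletons(data)` on the real frame should give the exact figure for NHANES.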

For exactly that reason, any information that directly links an observation to an actual person has been stripped from this dataset (leaving it in *any* health dataset would be a cardinal sin). That said, these are fairly high-level demographic questions, and on the surface they don't seem descriptive enough to single anyone out.
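One way to see why coarse questions can still be identifying is to measure how the share of one-of-a-kind rows grows as columns are combined. The sketch below uses randomly generated stand-in data, not NHANES; the column names, value ranges, and the `singleton_fraction` helper are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def singleton_fraction(df: pd.DataFrame, cols: list) -> float:
    """Share of rows that are the only one with their values in `cols`."""
    return float((~df.duplicated(subset=cols, keep=False)).mean())


# Synthetic stand-in for coarse demographic answers (NOT real survey data).
rng = np.random.default_rng(0)
n = 2_000
fake = pd.DataFrame({
    "gender": rng.integers(1, 3, n),
    "age_years": rng.integers(0, 81, n),
    "ethnicity": rng.integers(1, 6, n),
    "education": rng.integers(1, 6, n),
    "income_ratio": rng.integers(0, 501, n) / 100,
})

# Add one column at a time and track how many rows become one of a kind.
fractions = []
for k in range(1, len(fake.columns) + 1):
    cols = list(fake.columns[:k])
    fractions.append(singleton_fraction(fake, cols))
    print(f"{k} column(s): {fractions[-1]:.1%} of rows are one of a kind")
```

No single column here singles anyone out, but the fraction can only rise as columns are added, because a row that is unique on some set of columns stays unique on any larger set. The same helper could be pointed at subsets of the real `data` frame to see which combinations of questions do the most identifying.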

Let's look at another dataset and see if we can do the same analysis: