# NHANES Demographics data

CC-1089

With the goal of sparking discussion surrounding responsible use of machine learning, especially as it applies to health data, it might make sense to first look at what the data says about us, even when it's been anonymized.

Using the NHANES demographic data, is it possible to find small pockets of unique sets of observations that could potentially lead to 'removing the veil' if we know enough information about someone? This can highlight just how easy it is to uniquely describe a person, given enough data. An important part of data literacy is understanding the value of privacy, and just how easily we give up our data without thinking about what it reveals about us.

The first dataset is the NHANES data on **Demographics**, found at https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&Cycle=2017-2020

Data description:
https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_DEMO.htm

In [2]:
import pandas as pd

In [3]:
demdata = pd.read_sas('data/P_DEMO.XPT')
demcols = {
    'SEQN' : 'Respondent sequence number',
    'SDDSRVYR' : 'Data release cycle',
    'RIDSTATR' : 'Interview/Examination status',
    'RIAGENDR' : 'Gender',
    'RIDAGEYR' : 'Age in years at screening',
    'RIDAGEMN' : 'Age in months at screening - 0 to 24 mos',
    'RIDRETH1' : 'Race/Hispanic origin',
    'RIDRETH3' : 'Race/Hispanic origin w/ NH Asian',
    'RIDEXMON' : 'Six-month time period',
    'DMDBORN4' : 'Country of birth',
    'DMDYRUSZ' : 'Length of time in US',
    'DMDEDUC2' : 'Education level - Adults 20+',
    'DMDMARTZ' : 'Marital status',
    'RIDEXPRG' : 'Pregnancy status at exam',
    'SIALANG' : 'Language of SP Interview',
    'SIAPROXY' : 'Proxy used in SP Interview?',
    'SIAINTRP' : 'Interpreter used in SP Interview?',
    'FIALANG' : 'Language of Family Interview',
    'FIAPROXY' : 'Proxy used in Family Interview?',
    'FIAINTRP' : 'Interpreter used in Family Interview?',
    'MIALANG' : 'Language of MEC Interview',
    'MIAPROXY' : 'Proxy used in MEC Interview?',
    'MIAINTRP' : 'Interpreter used in MEC Interview?',
    'AIALANGA' : 'Language of ACASI Interview',
    'WTINTPRP' : 'Full sample interview weight',
    'WTMECPRP' : 'Full sample MEC exam weight',
    'SDMVPSU' : 'Masked variance pseudo-PSU',
    'SDMVSTRA' : 'Masked variance pseudo-stratum',
    'INDFMPIR' : 'Ratio of family income to poverty'
}

demdata = demdata.rename(demcols, axis=1)
demdata

Unnamed: 0,Respondent sequence number,Data release cycle,Interview/Examination status,Gender,Age in years at screening,Age in months at screening - 0 to 24 mos,Race/Hispanic origin,Race/Hispanic origin w/ NH Asian,Six-month time period,Country of birth,...,Interpreter used in Family Interview?,Language of MEC Interview,Proxy used in MEC Interview?,Interpreter used in MEC Interview?,Language of ACASI Interview,Full sample interview weight,Full sample MEC exam weight,Masked variance pseudo-PSU,Masked variance pseudo-stratum,Ratio of family income to poverty
0,109263.0,66.0,2.0,1.0,2.0,,5.0,6.0,2.0,1.0,...,2.0,,,,,7891.762435,8.951816e+03,3.0,156.0,4.66
1,109264.0,66.0,2.0,2.0,13.0,,1.0,1.0,2.0,1.0,...,2.0,1.0,2.0,2.0,1.0,11689.747264,1.227116e+04,1.0,155.0,0.83
2,109265.0,66.0,2.0,1.0,2.0,,3.0,3.0,2.0,1.0,...,2.0,,,,,16273.825939,1.665876e+04,1.0,157.0,3.06
3,109266.0,66.0,2.0,2.0,29.0,,5.0,6.0,2.0,2.0,...,2.0,1.0,2.0,2.0,1.0,7825.646112,8.154968e+03,2.0,168.0,5.00
4,109267.0,66.0,1.0,2.0,21.0,,2.0,2.0,,2.0,...,2.0,,,,,26379.991724,5.397605e-79,1.0,156.0,5.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15555,124818.0,66.0,2.0,1.0,40.0,,4.0,4.0,1.0,1.0,...,2.0,1.0,2.0,2.0,1.0,21586.596728,2.166689e+04,1.0,166.0,3.82
15556,124819.0,66.0,2.0,1.0,2.0,,4.0,4.0,2.0,1.0,...,2.0,,,,,1664.919253,1.838170e+03,2.0,171.0,0.07
15557,124820.0,66.0,2.0,2.0,7.0,,3.0,3.0,2.0,1.0,...,2.0,,,,,14819.783161,1.649781e+04,1.0,157.0,1.22
15558,124821.0,66.0,2.0,1.0,63.0,,4.0,4.0,1.0,1.0,...,2.0,1.0,2.0,2.0,1.0,4666.817952,4.853430e+03,1.0,158.0,3.71


In [4]:
# Remove administrative data, statistical weightings to isolate demographic answers
demdata = demdata.drop(['Respondent sequence number', 
                       'Data release cycle',
                       'Full sample interview weight',
                       'Full sample MEC exam weight',
                       'Masked variance pseudo-PSU',
                       'Masked variance pseudo-stratum',
                       ],
                     axis=1)
demdata

Unnamed: 0,Interview/Examination status,Gender,Age in years at screening,Age in months at screening - 0 to 24 mos,Race/Hispanic origin,Race/Hispanic origin w/ NH Asian,Six-month time period,Country of birth,Length of time in US,Education level - Adults 20+,...,Proxy used in SP Interview?,Interpreter used in SP Interview?,Language of Family Interview,Proxy used in Family Interview?,Interpreter used in Family Interview?,Language of MEC Interview,Proxy used in MEC Interview?,Interpreter used in MEC Interview?,Language of ACASI Interview,Ratio of family income to poverty
0,2.0,1.0,2.0,,5.0,6.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,4.66
1,2.0,2.0,13.0,,1.0,1.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,0.83
2,2.0,1.0,2.0,,3.0,3.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,3.06
3,2.0,2.0,29.0,,5.0,6.0,2.0,2.0,2.0,5.0,...,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,5.00
4,1.0,2.0,21.0,,2.0,2.0,,2.0,3.0,4.0,...,2.0,2.0,1.0,2.0,2.0,,,,,5.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15555,2.0,1.0,40.0,,4.0,4.0,1.0,1.0,,5.0,...,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,3.82
15556,2.0,1.0,2.0,,4.0,4.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,0.07
15557,2.0,2.0,7.0,,3.0,3.0,2.0,1.0,,,...,1.0,2.0,1.0,2.0,2.0,,,,,1.22
15558,2.0,1.0,63.0,,4.0,4.0,1.0,1.0,,2.0,...,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,3.71


In [10]:
# Count unique rows in dataset
dem_uniques = len(demdata.drop_duplicates())
print(dem_uniques)
print(dem_uniques/len(demdata))

14748
0.947814910025707


In the demographic dataset above, the 15,560 participants interviewed produce 14,748 distinct response patterns (~95% of the row count), so the large majority of participants gave a combination of answers that no one else shares. Put another way, if you had complete knowledge of a participant's answers to these questions, you could almost certainly pick them out of this dataset with little risk of selecting the wrong person.
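One caveat about the count: `drop_duplicates()` reports the number of distinct response patterns, which is not quite the same as the number of participants whose pattern appears exactly once. A minimal sketch of the difference, using a small made-up table (the column names and values are illustrative and are not taken from NHANES):

```python
import pandas as pd

# Toy data, invented for illustration only (not NHANES values)
toy = pd.DataFrame({
    "gender": [1, 2, 2, 1, 2],
    "age":    [34, 34, 34, 61, 7],
})

# Number of distinct response patterns (what drop_duplicates counts)
n_distinct = len(toy.drop_duplicates())

# Number of rows whose pattern is shared with no other row
n_singletons = int((~toy.duplicated(keep=False)).sum())

print(n_distinct, n_singletons)
```

Applied to the real data, `(~demdata.duplicated(keep=False)).sum()` would count the participants who are one of a kind. That count can never exceed the `drop_duplicates()` count.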

For exactly that reason, any information that directly links an observation to an actual person has been stripped from this dataset (and would be a cardinal sin to leave in *any* health dataset). That being said, these are fairly high-level demographic questions, and on the surface they do not seem descriptive enough to identify anyone.
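To see why a handful of high-level questions can still single people out, it helps to watch the share of distinct rows grow as columns are added one at a time. The sketch below uses a small synthetic table; the helper `distinct_fraction` and all of the values are hypothetical and are not part of the original analysis.

```python
import pandas as pd

def distinct_fraction(df, columns):
    """Share of rows left after dropping duplicates on the given columns."""
    return len(df.drop_duplicates(subset=columns)) / len(df)

# Synthetic stand-in for a demographic table (invented values)
people = pd.DataFrame({
    "gender":    [1, 1, 2, 2, 1, 2, 1, 2],
    "age":       [30, 30, 30, 45, 45, 45, 30, 45],
    "education": [3, 4, 3, 3, 4, 4, 3, 3],
    "marital":   [1, 1, 2, 1, 2, 2, 2, 1],
})

# Add one column at a time and record how distinctive the rows become
fractions = []
cols = []
for col in people.columns:
    cols.append(col)
    fractions.append(distinct_fraction(people, cols))
    print(cols, fractions[-1])
```

Each individual column is coarse, but the fraction can only stay the same or rise as more columns are combined. The same loop could be run over the columns of `demdata`.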

Let's look at another dataset and see if we can do the same analysis. This second dataset covers **Medical Conditions**.

Dataset description:
https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_MCQ.htm

In [6]:
meddata = pd.read_sas('data/P_MCQ.XPT')
medcols = {
    'SEQN' : 'Respondent sequence number',
    'MCQ010' : 'Ever been told you have asthma',
    'MCQ025' : 'Age when first had asthma',
    'MCQ035' : 'Still have asthma',
    'MCQ040' : 'Had asthma attack in past year',
    'MCQ050' : 'Emergency care visit for asthma/past yr',
    'AGQ030' : 'Did SP have episode of hay fever/past yr',
    'MCQ053' : 'Taking treatment for anemia/past 3 mos',
    'MCQ080' : 'Doctor ever said you were overweight',
    'MCQ092' : 'Ever receive blood transfusion',
    'MCD093' : 'Year receive blood transfusion',
    'MCQ145' : 'CHECK ITEM',
    'MCQ149' : 'Menstrual periods started yet?',
    'MCQ151' : 'Age in years at first menstrual period',
    'RHD018' : 'Estimated age in months at menarche',
    'MCQ160a' : 'Doctor ever said you had arthritis',
    'MCQ195' : 'Which type of arthritis was it?',
    'MCQ160B' : 'Ever told had congestive heart failure',
    'MCD180B' : 'Age when told you had heart failure',
    'MCQ160C' : 'Ever told you had coronary heart disease',
    'MCD180C' : 'Age when told had coronary heart disease',
    'MCQ160D' : 'Ever told you had angina/angina pectoris',
    'MCD180D' : 'Age when told you had angina pectoris',
    'MCQ160E' : 'Ever told you had heart attack',
    'MCD180E' : 'Age when told you had heart attack',
    'MCQ160F' : 'Ever told you had a stroke',
    'MCD180F' : 'Age when told you had a stroke',
    'MCQ160M' : 'Ever told you had thyroid problem',
    'MCQ170M' : 'Do you still have thyroid problem',
    'MCD180M' : 'Age when told you had thyroid problem',
    'MCQ160P' : 'Ever told you had COPD, emphysema, ChB',
    'MCQ160L' : 'Ever told you had any liver condition',
    'MCQ170L' : 'Do you still have a liver condition',
    'MCD180L' : 'Age when told you had a liver condition',
    'MCQ500' : 'Ever told you had any liver condition',
    'MCQ510A' : 'Liver condition: Fatty liver',
    'MCQ510B' : 'Liver condition: Liver fibrosis',
    'MCQ510C' : 'Liver condition: Liver cirrhosis',
    'MCQ510D' : 'Liver condition: Viral hepatitis',
    'MCQ510E' : 'Liver condition: Autoimmune hepatitis',
    'MCQ510F' : 'Liver condition: Other liver disease',
    'MCQ515' : 'CHECK ITEM',
    'MCQ520' : 'Abdominal pain during past 12 months?',
    'MCQ530' : 'Where was the most uncomfortable pain',
    'MCQ540' : 'Ever seen a DR about this pain',
    'MCQ550' : 'Has DR ever said you have gallstones',
    'MCQ560' : 'Ever had gallbladder surgery?',
    'MCQ570' : 'Age when 1st had gallbladder surgery?',
    'MCQ220' : 'Ever told you had cancer or malignancy',
    'MCQ230A' : '1st cancer - what kind was it?',
    'MCQ230B' : '2nd cancer - what kind was it?',
    'MCQ230C' : '3rd cancer - what kind was it?',
    'MCQ230D' : 'More than 3 kinds of cancer',
    'MCQ300B' : 'Close relative had asthma?',
    'MCQ300C' : 'Close relative had diabetes?',
    'MCQ300A' : 'Close relative had heart attack?',
    'MCQ366A' : 'Doctor told you to control/lose weight',
    'MCQ366B' : 'Doctor told you to exercise',
    'MCQ366C' : 'Doctor told you to reduce salt in diet',
    'MCQ366D' : 'Doctor told you to reduce fat/calories',
    'MCQ371A' : 'Are you now controlling or losing weight',
    'MCQ371B' : 'Are you now increasing exercise',
    'MCQ371C' : 'Are you now reducing salt in diet',
    'MCQ371D' : 'Are you now reducing fat in diet',
    'OSQ230' : 'Any metal objects inside your body?'
}

meddata = meddata.rename(medcols, axis=1)
meddata

Unnamed: 0,Respondent sequence number,Ever been told you have asthma,Age when first had asthma,Still have asthma,Had asthma attack in past year,Emergency care visit for asthma/past yr,Did SP have episode of hay fever/past yr,Taking treatment for anemia/past 3 mos,Doctor ever said you were overweight,Ever receive blood transfusion,...,Close relative had heart attack?,Doctor told you to control/lose weight,Doctor told you to exercise,Doctor told you to reduce salt in diet,Doctor told you to reduce fat/calories,Are you now controlling or losing weight,Are you now increasing exercise,Are you now reducing salt in diet,Are you now reducing fat in diet,Any metal objects inside your body?
0,109263.0,2.0,,,,,2.0,2.0,,,...,,,,,,,,,,
1,109264.0,2.0,,,,,,2.0,,2.0,...,,,,,,,,,,
2,109265.0,2.0,,,,,,2.0,,,...,,,,,,,,,,
3,109266.0,2.0,,,,,2.0,2.0,1.0,9.0,...,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,
4,109267.0,2.0,,,,,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14981,124818.0,2.0,,,,,,2.0,1.0,2.0,...,2.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0
14982,124819.0,2.0,,,,,2.0,2.0,,,...,,,,,,,,,,
14983,124820.0,2.0,,,,,,2.0,,2.0,...,,,,,,,,,,
14984,124821.0,1.0,58.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
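# Remove administrative and error-checking variables.
# 'MCQ160A' is dropped under its raw name: the rename key above is spelled
# 'MCQ160a', so that column appears to have kept its original name.
meddata = meddata.drop(['Respondent sequence number',
                        'MCQ160A'],
                       axis=1)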

In [9]:
# Count unique rows in dataset
med_uniques = len(meddata.drop_duplicates())
print(med_uniques)
print(med_uniques/len(meddata))

8272
0.5519818497264113


In this dataset, the number of distinct response patterns is about 55% of the number of rows, so up to a little over half of the people in the dataset can be uniquely described by their responses to these questions, which are more personal than the first set. Again, there is no way to link the data here to names or other personal details, but it highlights how, even with relatively small amounts of data, enough information can be connected to produce unique observations.
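The notebook analyzes the two files separately, but both carry the respondent sequence number (`SEQN`) before it is dropped, so one could in principle join them and repeat the count on the combined answers. The following is only a sketch of that idea on invented data; the `asthma` and `arthritis` columns and all values are placeholders, and the join is an assumption rather than a step the analysis above performs.

```python
import pandas as pd

def distinct_fraction(df):
    """Share of rows left after dropping exact duplicates."""
    return len(df.drop_duplicates()) / len(df)

# Invented stand-ins for the two files, sharing the SEQN key
demo = pd.DataFrame({
    "SEQN":   [1, 2, 3, 4],
    "gender": [1, 1, 2, 2],
    "age":    [40, 40, 40, 40],
})
med = pd.DataFrame({
    "SEQN":      [1, 2, 3, 4],
    "asthma":    [1, 2, 2, 2],
    "arthritis": [2, 2, 1, 2],
})

# Join on the shared key, then discard the key before counting
combined = demo.merge(med, on="SEQN", how="inner").drop(columns="SEQN")

frac_demo = distinct_fraction(demo.drop(columns="SEQN"))
frac_med = distinct_fraction(med.drop(columns="SEQN"))
frac_combined = distinct_fraction(combined)

print(frac_demo, frac_med, frac_combined)
```

In this toy case neither table alone separates everyone, yet the joined table does, which is the kind of connection between datasets the paragraph above warns about.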