# Parsing NHANES Data with SAS Labels

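## Purpose
The NHANES dataset has so many variables that it is hard to find the one you want, or to explain what each variable means after feature selection. We therefore use **Beautiful Soup** to parse the HTML documentation pages.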

Reference: https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey/discussion/47796


This notebook extracts the SAS labels from the following component pages:
- Demographics: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013
- Dietary: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013
- Examination: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013
- Laboratory: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2013
- Questionnaire: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2013


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import urllib.parse

In [3]:
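def parse_main(URL, links, category):
    """Collect the documentation (.htm) links from one component page."""
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    if table is None:  # no table on the page: nothing to collect
        return

    for link in table.find_all('a'):
        href = link.get('href')
        # keep only documentation pages; anchors without an href are skipped
        if href and href.endswith('.htm'):
            link_j = urllib.parse.urljoin('https://wwwn.cdc.gov/', href)
            links[category].append(link_j)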


urls = {'DM':'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013',
        'DIET':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013',
        'EXAM':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013',
        'LAB':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2013',
        'QUES':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2013'}

links = {category: [] for category in urls}
for c, URL in urls.items():
    parse_main(URL, links, c)
links

{'DM': ['https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm'],
 'DIET': ['https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DR1IFF_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DR2IFF_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DR1TOT_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DR2TOT_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DRXFCD_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DSBI.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DSII.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DSPI.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DS1IDS_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DS2IDS_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DS1TOT_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DS2TOT_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DSQIDS_H.htm',
  'https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DSQTOT_H.htm'],
 'EXAM': ['https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/BPX_H.htm',
  'https://wwwn.c
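
The link-collection step can be tried without any network access. The sketch below is only an illustration: the `hrefs` values are hypothetical, shaped like the root-relative paths that the output above suggests, and it shows how the `.htm` filter and `urljoin` combine to produce absolute documentation URLs.

```python
from urllib.parse import urljoin

BASE = "https://wwwn.cdc.gov/"

# Hypothetical href values (assumed shape, not scraped from the site).
hrefs = [
    "/Nchs/Nhanes/2013-2014/DEMO_H.htm",  # documentation page: keep
    "/Nchs/Nhanes/2013-2014/DEMO_H.XPT",  # data file: skip
    None,                                 # anchor without an href: skip
]

# Keep only .htm links and resolve them against the site root.
doc_links = [urljoin(BASE, h) for h in hrefs if h and h.endswith(".htm")]
print(doc_links)
# ['https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm']
```

Because the hrefs start with `/`, `urljoin` replaces the whole path of the base URL, so the site root is a sufficient base.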

In [4]:
def parse_nhanes(links, codes):
    """Append (category, variable name, SAS label) for every variable found."""
    for c, URLs in links.items():
        for URL in URLs:
            # fetch the documentation page
            page = requests.get(URL)

            # look for a variable name and SAS label in every <dl> element
            soup = BeautifulSoup(page.content, 'html.parser')
            containers = soup.find_all('dl')
            for i in containers:
                try:
                    varname = i.find("dt", string="Variable Name: ").find_next("dd").text
                    saslabel = i.find("dt", string="SAS Label: ").find_next("dd").text
                except AttributeError:
                    # this <dl> does not describe a variable; skip it
                    continue
                codes['category'].append(c)
                codes['variable'].append(varname.strip())
                codes['label'].append(saslabel.strip())
    return codes

codes = {"category": [], "variable": [], "label": []}


parse_nhanes(links, codes)


codebook = pd.DataFrame(codes)
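
To see what `parse_nhanes` looks for, here is a minimal offline sketch run on a hand-written HTML fragment. The markup is an assumption modelled on the `dt`/`dd` lookups in the function above, not a copy of a real NHANES page, and the second `<dl>` is included only to show that blocks without a variable name are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one variable block and one unrelated block.
html = """
<dl>
  <dt>Variable Name: </dt><dd>SEQN</dd>
  <dt>SAS Label: </dt><dd>Respondent sequence number</dd>
</dl>
<dl>
  <dt>Unrelated: </dt><dd>ignored</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for block in soup.find_all("dl"):
    name_dt = block.find("dt", string="Variable Name: ")
    label_dt = block.find("dt", string="SAS Label: ")
    if name_dt is None or label_dt is None:
        continue  # not a variable block
    rows.append((
        name_dt.find_next("dd").get_text(strip=True),
        label_dt.find_next("dd").get_text(strip=True),
    ))

print(rows)
# [('SEQN', 'Respondent sequence number')]
```

Note that `string="Variable Name: "` is an exact match, including the trailing space, so a page that formats the `<dt>` text differently would yield no rows.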


In [5]:
codebook.value_counts()

category  variable  label                             
LAB       SEQN      Respondent sequence number            79
QUES      SEQN      Respondent sequence number            43
EXAM      SEQN      Respondent sequence number            19
          DXXPT76Y  y-coordinates of outline points 77    13
          DXXPT71Y  y-coordinates of outline points 72    13
                                                          ..
          OHX08PCP  LOA: Max R(CI) DL FGM-sulcus(mm)       1
          OHX08PCS  LOA: Max R(CI) MF FGM-sulcus(mm)       1
          OHX08TC   Tooth Count:  #8                       1
          OHX09CJA  LOA: Max L(CI) ML FGM-CEJ(mm)          1
QUES      WTSVOC2Y  VOC Subsample Weight                   1
Length: 3905, dtype: int64

From the value_counts() output above, you can see that several variables are repeated because of how the NHANES dataset is designed (for example, SEQN appears 79 times in the LAB category alone). To make each variable easy to look up, I list the unique variables separately.

In [6]:
code_unique = codebook[['variable', 'label']].drop_duplicates(subset=['variable'])
print(code_unique)

      variable                                   label
0         SEQN              Respondent sequence number
1     SDDSRVYR                      Data release cycle
2     RIDSTATR            Interview/Examination status
3     RIAGENDR                                  Gender
4     RIDAGEYR               Age in years at screening
...        ...                                     ...
7097    WHD140  Self-reported greatest weight (pounds)
7098    WHQ150                Age when heaviest weight
7100   WHQ030M         How do you consider your weight
7101    WHQ500               Trying to do about weight
7102    WHQ520          How often tried to lose weight

[3847 rows x 2 columns]
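
`drop_duplicates(subset=['variable'])` keeps the first label seen for each variable, so it can be worth checking whether any variable carries more than one distinct label before relying on the result. The sketch below uses a small hand-made `codebook`; the conflicting label "Greatest weight" is invented purely for illustration.

```python
import pandas as pd

# Toy codebook (hypothetical rows; the last label is made up to force a conflict).
codebook = pd.DataFrame({
    "category": ["DM", "LAB", "QUES", "QUES"],
    "variable": ["SEQN", "SEQN", "WHD140", "WHD140"],
    "label": [
        "Respondent sequence number",
        "Respondent sequence number",
        "Self-reported greatest weight (pounds)",
        "Greatest weight",
    ],
})

# Count distinct labels per variable and list the ones with more than one.
n_labels = codebook.groupby("variable")["label"].nunique()
conflicts = n_labels[n_labels > 1].index.tolist()
print(conflicts)
# ['WHD140']
```

An empty list would mean every repeated variable has a single label, in which case keeping the first occurrence loses nothing.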


In [7]:
code_unique.to_csv('nhanes_2013_2014_codebook.csv', index=False)
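
Once the codebook is saved, it can be used to attach readable labels to a data table, for example after feature selection. This is a minimal sketch under stated assumptions: `code_unique` is rebuilt inline with three rows from the output above (in practice it could be loaded with `pd.read_csv('nhanes_2013_2014_codebook.csv')`), and the `data` frame, its values, and the `EXTRA` column are hypothetical.

```python
import pandas as pd

# A few codebook rows, copied from the output above.
code_unique = pd.DataFrame({
    "variable": ["SEQN", "RIAGENDR", "RIDAGEYR"],
    "label": ["Respondent sequence number", "Gender", "Age in years at screening"],
})

# Build a variable -> label lookup.
label_map = code_unique.set_index("variable")["label"].to_dict()

# Hypothetical data table; EXTRA is not in the codebook.
data = pd.DataFrame({"SEQN": [1, 2], "RIDAGEYR": [69, 54], "EXTRA": [0, 1]})

# Columns found in the codebook are renamed; unknown columns are left as-is.
labelled = data.rename(columns=label_map)
print(list(labelled.columns))
# ['Respondent sequence number', 'Age in years at screening', 'EXTRA']
```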