# HTML Parsing NHANES Data Codebook with SAS label
### Li-Chia Chen
### 27Feb2021

## Purpose
While working on a project with the NHANES dataset, I found it hard to locate the variables I needed and difficult to tell what each variable meant after feature selection. I therefore decided to parse this information from the NHANES website using **Beautiful Soup**.

There is already one codebook with a detailed description here: https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey/discussion/47796


The main purpose of this notebook is to extract the SAS labels from the data documentation pages in the five main categories:
- Demographics: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013
- Dietary: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013
- Examination: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013
- Laboratory: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2013
- Questionnaire: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2013
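
The five pages above share one address and differ only in the `Component` query parameter. As a minimal sketch, the URLs could be generated rather than typed out. The names `BASE`, `COMPONENTS`, and `component_url` are illustrative and not part of the original notebook, and the sketch assumes the `Search/DataPage.aspx` spelling of the path that four of the five links use.

```python
from urllib.parse import urlencode

# Assumed base address, copied from the links listed above
BASE = "https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx"
COMPONENTS = ["Demographics", "Dietary", "Examination", "Laboratory", "Questionnaire"]


def component_url(component, cycle_begin_year=2013):
    """Build the data page URL for one component (hypothetical helper)."""
    query = urlencode({"Component": component, "CycleBeginYear": cycle_begin_year})
    return f"{BASE}?{query}"


component_urls = {name: component_url(name) for name in COMPONENTS}
for name, url in component_urls.items():
    print(name, url)
```

Changing `cycle_begin_year` would point the same helper at another survey cycle, assuming the site keeps the same URL scheme for it.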


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import urllib.parse  # urljoin lives in the urllib.parse submodule

In [None]:
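def parse_main(URL, links, category):
    """Collect the documentation (.htm) links listed on one component page."""
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    if table is None:
        # No table on the page (for example, the request failed): nothing to collect
        return

    for link in table.find_all('a'):
        href = link.get('href')
        # Keep only links to documentation pages; skip anchors without an href
        if href and href.endswith('.htm'):
            link_j = urllib.parse.urljoin('https://wwwn.cdc.gov/', href)
            links[category].append(link_j)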


urls = {'DM':'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013',
        'DIET':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013',
        'EXAM':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013',
        'LAB':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2013',
        'QUES':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2013'}

links = {v:[] for v in urls.keys()}

for c, URL in urls.items():
    print(c, URL)
    parse_main(URL, links, c)


In [None]:
def parse_nhanes(links, codes):
    """Append the category, variable name, and SAS label found on each documentation page."""
    for c, URLs in links.items():
        for URL in URLs:
            # Access the documentation page
            page = requests.get(URL)

            # Parse the page; the code expects each variable in its own <dl> block
            soup = BeautifulSoup(page.content, 'html.parser')
            containers = soup.find_all('dl')
            for i in containers:
                try:
                    # The string match is exact, including the trailing space
                    varname = i.find("dt", string="Variable Name: ").find_next("dd").text
                    saslabel = i.find("dt", string="SAS Label: ").find_next("dd").text
                except AttributeError:
                    # find() returned None: this <dl> lacks one of the two fields, so skip it
                    continue
                codes['category'].append(c)
                codes['variable'].append(varname.strip())
                codes['label'].append(saslabel.strip())
    return codes

codes = {"category": [], "variable": [], "label": []}


parse_nhanes(links, codes)


codebook = pd.DataFrame(codes)


In [None]:
codebook.value_counts()

The value_counts() output above shows that several variables are repeated, which is a consequence of how the NHANES dataset is designed. To make each variable easy to match, I list the unique variables separately.
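
To illustrate the deduplication step on something small, here is a sketch with a toy table. The three rows are illustrative stand-ins rather than scraped output, and the names `toy` and `toy_unique` are not part of the notebook.

```python
import pandas as pd

# Toy codebook: the same variable is listed under two categories (illustrative rows)
toy = pd.DataFrame({
    "category": ["DM", "DIET", "DIET"],
    "variable": ["SEQN", "SEQN", "WTDRD1"],
    "label": [
        "Respondent sequence number",
        "Respondent sequence number",
        "Dietary day one sample weight",
    ],
})

# Keep the first row seen for each variable name and drop the category column
toy_unique = toy[["variable", "label"]].drop_duplicates(subset=["variable"])
print(toy_unique)
```

Because `drop_duplicates` keeps the first occurrence by default, a variable whose label differed between pages would retain only the label from the first page on which it appeared.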

In [None]:
code_unique = codebook[['variable', 'label']].drop_duplicates(subset=['variable'])
print(code_unique)

In [None]:
code_unique.to_csv('nhanes_2013_2014_codebook.csv', index=False)
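
Once the file is saved, it can serve as a lookup table for the variables that survive feature selection. The sketch below is one possible way to do that. It reads a small in-memory stand-in instead of `nhanes_2013_2014_codebook.csv`, and the rows, the `selected` list, and the names `label_of` and `described` are all illustrative assumptions.

```python
import io

import pandas as pd

# Stand-in for the saved CSV; in practice, pass the file name to pd.read_csv
csv_text = (
    "variable,label\n"
    "SEQN,Respondent sequence number\n"
    "RIAGENDR,Gender\n"
)
saved_codebook = pd.read_csv(io.StringIO(csv_text))

# Map each variable name to its SAS label
label_of = dict(zip(saved_codebook["variable"], saved_codebook["label"]))

# Hypothetical list of variables kept by feature selection
selected = ["RIAGENDR", "SEQN", "NOT_IN_CODEBOOK"]
described = {name: label_of.get(name, "(no label found)") for name in selected}

for name, label in described.items():
    print(f"{name}: {label}")
```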