# Notebook Intentions

Once we have the raw data files we also want to associate each file with a codebook. The codebook will be used to map data values to their meanings. In this notebook we will explore the most efficent way to read these codebooks. We will start with selenium, as a simple test of pdf readers demostrates that the time required for processing is quite long.

Notes:
Unfortunately there is not a clear structure to the codebook documentation. The variable name, description, and format can all be used but the mapping of keys to values is difficult. In some cases a key is an integer, or a character which maps to a string. In other cases there is not value and a float value is only available for the key. Occasionally these three formats are combined. The processing time and number of exceptions required for a single codebook would likely slow down progress in the project for several weeks with only a small increase in clarity. For now we will encode data straight from the ssp files and reference a codebook when doing analysis.

In [195]:
import os
import time

from os.path import expanduser
from selenium import webdriver

import re

In [196]:
executable_path = os.path.join(os.path.join(expanduser("~"), "meps", "meps_dev", "chromedriver"))
driver = webdriver.Chrome(executable_path=executable_path)

In [197]:
driver.get("https://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H209")
table_css = driver.find_element_by_css_selector(
    "body > table:nth-child(24) > tbody > tr:nth-child(2) > td:nth-child(3) > table:nth-child(5) > tbody > tr > td > table:nth-child(3)"
)

# generate codebook
table_str = table_css.text
table_text = table_str.split("\n")
last_header_ind = table_text.index("Description") + 1

keys = table_text[:last_header_ind]
vals = table_text[last_header_ind:]

variables = []
for field_num in range(int(len(vals)/len(keys))):
    variables.append(vals[field_num*last_header_ind+0])
time.sleep(5)

In [206]:
codebook = {}
for variable in variables:
    driver.get(
    f"https://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H209"
    f"&varName={variable}"
    )
    table_css = driver.find_element_by_css_selector(
        "body > table:nth-child(24) > tbody > tr:nth-child(2) > td:nth-child(3) > table:nth-child(5) > tbody > tr > td > table:nth-child(3)"
    )
    var_table = table_css.text
    variable_dict = {}
    for row in var_table.split("\n"):
        key_val = row.split(":")
        variable_dict.update({key_val[0]: key_val[1].strip(" ")})
    
    # get value to description map
    table_css = driver.find_element_by_css_selector(
        "body > table:nth-child(24) > tbody > tr:nth-child(2) > td:nth-child(3) > table:nth-child(5) > tbody > tr > td > table:nth-child(5)"
    )

    table_text = table_css.text.split("\n")
    if "WEIGHTED" in table_text:
        last_header_ind = table_text.index("WEIGHTED") + 1
    else:
        last_header_ind = table_text.index("UNWEIGHTED") + 1
    last_val_ind = table_text.index("TOTAL")

    keys = table_text[:last_header_ind]
    vals = table_text[last_header_ind:last_val_ind]
    map_vals = [val for num, val in enumerate(vals) if num%len(keys)==0]
    if float(variable_dict["Format"]) > 3:
        variable_dict["values_map"] = {}
    else:
        values_map = {}
        for map_val in map_vals:
            # handle case where key is an Integer
            key_val = [item for item in re.split(r"(-?\d*\.?\d+)", map_val, 1) if item != ""]
            # handle case where key is a Character
            if len(key_val) == 1:
                key_val = [item for item in re.split(r"([A-Z])", map_val, 1) if item != ""]
            values_map.update({key_val[0]: key_val[1].strip(" ")})
        variable_dict["values_map"] = values_map
    
    if any(val[0] == "-" for val in variable_dict["values_map"].values()):
        variable_dict["values_map"] = {}
    
    codebook[variable] = variable_dict


IndexError: list index out of range