# Overview

Today we'll dig into how to get started with a web scraping task and how to structure your thinking about it. For the remainder of the boot camp, we'll work on scraping state-level health insurance premium values from the KFF [Health Insurance Marketplace Calculator](https://www.kff.org/interactive/subsidy-calculator/). In this example, the project team needs the Silver Plan premium in each county for people aged 14, 20, 40, and 60. The final output should look something like this:

| State | County    | Age 14 | Age 20 | Age 40 | Age 60 |
|-------|-----------|--------|--------|--------|--------|
| AL    | St. Clair | 281    | 434    | 566    | 1824   |
| AL    | Jefferson | 294    | 420    | 540    | 1830   |
| AL    | Shelby    | 273    | 451    | 589    | 1801   |
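
One way to think about this target is as a reshape: each scrape yields one (state, county, age, premium) record, and the final table has one row per county with one column per age. Here is a minimal sketch of that reshape using only the standard library. The function name `to_wide_rows` and the record layout are assumptions for illustration, and the values are the ones from the example table above.

```python
# Hypothetical long-format records: one tuple per scrape (state, county, age, premium).
# Values are copied from the illustrative table above.
records = [
    ("AL", "St. Clair", 14, 281),
    ("AL", "St. Clair", 20, 434),
    ("AL", "St. Clair", 40, 566),
    ("AL", "St. Clair", 60, 1824),
    ("AL", "Jefferson", 14, 294),
    ("AL", "Jefferson", 20, 420),
    ("AL", "Jefferson", 40, 540),
    ("AL", "Jefferson", 60, 1830),
]

AGES = [14, 20, 40, 60]


def to_wide_rows(records, ages=AGES):
    """Reshape (state, county, age, premium) records into one dict per county."""
    by_county = {}
    for state, county, age, premium in records:
        # Dicts preserve insertion order, so counties stay in first-seen order
        row = by_county.setdefault((state, county), {"State": state, "County": county})
        row[f"Age {age}"] = premium
    # Fill any age that was never scraped with None so every row has the same columns
    for row in by_county.values():
        for age in ages:
            row.setdefault(f"Age {age}", None)
    return list(by_county.values())


wide_rows = to_wide_rows(records)
for row in wide_rows:
    print(row)
```

Keeping the scraped values in long format until the end means a failed scrape for one age leaves a visible `None` rather than silently shifting columns.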

## What packages do we need? 

Looking through the website, there are dropdown menus, select buttons, and text inputs that we'll need to navigate. Following the rule of thumb "if a human needs to click something, use `selenium`," we'll need the `selenium` package. Luckily, this isn't Urban's first web scraping rodeo, and we have sample functions for completing each of these types of actions.

Below we go over a few simple functions that will be the basis for our scraping tasks. 

### Click Button

Use this function when you need to click a button on the page.


In [None]:
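# Imports used by the helper functions and scraping code in this notebook
import re

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def click_button(identifier, driver, by=By.XPATH, timeout=15):
    '''
    This function waits until a button is clickable and then clicks on it.

    Inputs:
        identifier (string): The ID, XPath, or other way of identifying the element to be clicked on
        driver (WebDriver): The selenium driver controlling the browser
        by (By object): How to interpret the identifier (options include By.XPATH, By.ID, By.NAME, and others).
            Make sure 'by' and 'identifier' correspond to one another, as they are used as a tuple pair below.
        timeout (int): How many seconds to wait for the element to be clickable

    Returns:
        None (just clicks on the button)
    '''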

    element_clickable = EC.element_to_be_clickable((by, identifier))
    element = WebDriverWait(driver, timeout=timeout).until(element_clickable)
    driver.execute_script("arguments[0].click();", element)

### Select a Dropdown

Use this function to select a value in a dropdown menu.


In [None]:
def select_dropdown(identifier, driver, by=By.XPATH, value=None, option=None, index=None):
    '''
    This function selects an option in a dropdown menu.
    It first waits until the dropdown element is clickable, then selects the requested option.
    If the element is not clickable within 15 seconds, selenium raises a TimeoutException.

    Inputs:
        identifier (string): The XPath, ID, or other locator for the dropdown menu,
            found through inspecting the web page.
        driver (WebDriver): The selenium driver controlling the browser
        by (By object): How to interpret the identifier (By.XPATH by default).
        value (string): The HTML 'value' attribute of the option to select.
        option (string): The visible text of the option to select. Used only if value is None.
        index (int): The position of the option to select. Used only if value and option are both None.

    Returns:
        None (just selects the right item in the dropdown menu)
    '''
    element_clickable = EC.element_to_be_clickable((by, identifier))
    element = WebDriverWait(driver, timeout=15).until(element_clickable)
    if value is not None:
        Select(element).select_by_value(value)
    elif option is not None: 
        Select(element).select_by_visible_text(option)
    else:
        Select(element).select_by_index(index)

### Enter Text 

Use this function to enter text in a text box. The `enter_text` function is accompanied by the `is_textbox_empty` function, which tests whether there is already a value in the text box. Later in the boot camp, when we start to loop through variables, we'll sometimes want to skip the text box if it already contains text, and other times we'll want to clear the existing value before entering something else.


In [None]:
def enter_text(identifier, text, driver, by=By.XPATH):
    '''
    This function waits until a text box is clickable, clears it, and types in the given text.
    '''
    element_clickable = EC.element_to_be_clickable((by, identifier))
    element = WebDriverWait(driver, timeout=15).until(element_clickable)
    # Clear any existing text first (the zip code wasn't being overwritten otherwise)
    element.clear()
    element.send_keys(text)

In [None]:
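def is_textbox_empty(driver, textbox_id):
    '''
    This function checks whether a text box is empty.
    Use this for the income variable so that we don't rewrite it
    on every loop.

    Inputs:
        driver (WebDriver): The selenium driver controlling the browser
        textbox_id (string): The XPath of the text box to check

    Returns:
        True if the text box has no value, False otherwise
    '''
    textbox = driver.find_element(By.XPATH, textbox_id)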
    textbox_value = textbox.get_attribute("value")

    return not bool(textbox_value)
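
To see what `is_textbox_empty` returns without opening a browser, here is a small sketch that runs the same logic against stand-in objects. `FakeTextbox` and `FakeDriver` are hypothetical stubs written for this illustration (they are not part of `selenium`) and only mimic the two calls the function makes. The XPath is also made up.

```python
class FakeTextbox:
    """Hypothetical stand-in for a selenium WebElement."""

    def __init__(self, value):
        self._value = value

    def get_attribute(self, name):
        # Only the "value" attribute is modelled here
        return self._value if name == "value" else None


class FakeDriver:
    """Hypothetical stand-in for a selenium driver holding one text box."""

    def __init__(self, value):
        self._textbox = FakeTextbox(value)

    def find_element(self, by, identifier):
        return self._textbox


def is_textbox_empty(driver, textbox_id):
    # Same logic as the function above, repeated so this example runs on its own
    textbox = driver.find_element('xpath', textbox_id)
    textbox_value = textbox.get_attribute("value")
    return not bool(textbox_value)


income_xpath = '//*[@id="income"]'  # hypothetical XPath, for illustration only
empty_result = is_textbox_empty(FakeDriver(""), income_xpath)
filled_result = is_textbox_empty(FakeDriver("100000"), income_xpath)
print(empty_result, filled_result)
```

In a loop, a check like `if is_textbox_empty(driver, income_xpath):` placed before `enter_text(...)` would let the income be typed only once.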

## Drivers

To start, we need to launch a web browser that our Python code will control; `selenium` calls this a driver. First, we specify the URL that we want the driver to navigate to. The following chunk of code sets the URL of the Health Insurance Marketplace Calculator, then opens a Chrome browser and navigates to the page.


In [None]:
url = "https://www.kff.org/interactive/subsidy-calculator/"
service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get(url)
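
Because the browser stays open until it is explicitly closed, it can help to guarantee that `driver.quit()` runs even when a later step raises an error. Below is one possible sketch using a context manager. `managed_driver` is a hypothetical helper name, and the `RecordingDriver` class exists only so the example runs without launching Chrome.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver with make_driver() and always call quit() afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs whether the body finished normally or raised an exception
        driver.quit()


class RecordingDriver:
    """Hypothetical stand-in that records calls instead of opening a browser."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


fake = RecordingDriver()
try:
    with managed_driver(lambda: fake) as d:
        d.get("https://www.kff.org/interactive/subsidy-calculator/")
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.closed)
```

With the real browser, the factory passed in would be something like `lambda: webdriver.Chrome(service=service)`.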

### TASK 1

Now that we have a web driver running, we want to write code that selects each of the options we want on the page and then displays the premium value. **Create a list of each of the actions that you need `selenium` to perform to get the value to display.**


In [None]:
# Select state dropdown
select_dropdown(identifier='//*[@id="state-dd"]', driver=driver, value='il')
# Enter zip code
enter_text(identifier='//*[@id="zip-wrapper"]/div/input', driver=driver, text='62401')
# Select county when given the option
select_dropdown(identifier='//*[@id="locale-inner"]/select', driver=driver, option="Shelby")
# Enter yearly household income
enter_text(identifier='//*[@id="subsidy-form"]/div[2]/div[1]/div[2]/div[2]/input', driver=driver, text='100000')

textbox = driver.find_element('xpath', '//*[@id="subsidy-form"]/div[2]/div[1]/div[2]/div[2]/input')

# Is coverage available from your or your spouse's job?
click_button(identifier='//*[@id="employer-coverage-0"]', driver=driver)
# Number of adults (21 to 64) enrolled in Marketplace coverage?
select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[1]/div/select', value="1", driver=driver)
# Age? (index is age - 21, so index 39 selects age 60)
select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[2]/div/div[1]/select', index=39, driver=driver)
# Number of children (20 and younger) enrolling in Marketplace coverage
select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/div/select', index=0, driver=driver)

# Submit
click_button(identifier='//*[@id="subsidy-form"]/p/input[2]', driver=driver)

###--- Beautiful Soup ---###
# Parse the rendered page source (after the form has been submitted)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')  # Python's built-in HTML parser; no extra install needed

try:
    # Select the element at index 4 (the fifth 'bold-blue' span), which holds the value we want
    premium_val = str(soup.find_all('span', class_="bold-blue")[4])
except IndexError:
    # The page did not contain the expected element, so close the browser and record no value
    driver.quit()
    premium_val = None

numeric_value = None
if premium_val is not None:
    # Find a dollar amount, with optional commas and cents
    extracted_number = re.search(r'\$([\d,]+(?:\.\d{1,2})?)', premium_val)

    if extracted_number:
        # Take the captured group (the number without the $) and remove commas
        clean_number = extracted_number.group(1).replace(',', '')

        # Convert the cleaned string to a numeric value
        numeric_value = float(clean_number)
        print(numeric_value)
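
When this cell is later turned into a loop, two small pieces of it are easy to isolate and check without a browser: converting an adult's age to the dropdown index (the comment above notes that the index is age minus 21) and turning the scraped text into a number. A minimal sketch follows. The function names `adult_age_to_index` and `parse_premium` and the sample HTML string are assumptions for illustration, and the 21 to 64 range comes from the form's adult-age question.

```python
import re

# Same pattern as above: a dollar sign, digits with optional commas, optional cents
PREMIUM_PATTERN = re.compile(r'\$([\d,]+(?:\.\d{1,2})?)')


def adult_age_to_index(age):
    """Map an adult age (21 to 64) to its position in the age dropdown."""
    if not 21 <= age <= 64:
        raise ValueError(f"adult age must be between 21 and 64, got {age}")
    return age - 21


def parse_premium(text):
    """Return the first dollar amount in text as a float, or None if absent."""
    if text is None:
        return None
    match = PREMIUM_PATTERN.search(text)
    if match is None:
        return None
    # group(1) is the number without the leading $; drop thousands separators
    return float(match.group(1).replace(',', ''))


sample = '<span class="bold-blue">$1,824</span>'  # made-up snippet, not real page output
print(adult_age_to_index(60))          # 39
print(parse_premium(sample))           # 1824.0
print(parse_premium('no price here'))  # None
```

Returning `None` instead of raising lets a loop record a missing premium and move on to the next county.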