# Building a Web Scraping Workflow

## Loops, Lists, and other Pythonic things!
Now that we know how to set up the page the way we want it and scrape the text we're interested in, we need to work out how to loop through each combination of variables. This is where things get interesting! I like to think of it as a puzzle: we need to figure out the right order and combination of pieces to get it to run.

Let's start with a quick review of some Python objects and syntax - and a reminder that deeper dives on all these concepts can be found in our Intro to Python series: https://ui-research.github.io/python-at-urban/content/intro-to-python.html

In [1]:
# Lists contain a collection of values.
my_list = [3, 5, 7, 9, 11]
# Dictionaries contain a collection of key-value pairs. Keys must be hashable (typically strings or numbers), while values can be any data type, even lists or other dictionaries.
my_dict = {'a': 3, 'b': 5, 'c': 7, 'd': 9, 'e': 11}
# In a web scraping context, these will be helpful objects for storing data as we loop through various web page objects and append the text we extract. We can add new values like this:
my_list.append(13)
my_dict['f'] = 13

print(my_list)
print(my_dict)

[3, 5, 7, 9, 11, 13]
{'a': 3, 'b': 5, 'c': 7, 'd': 9, 'e': 11, 'f': 13}
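Dictionaries can also hold other dictionaries, which is handy when scraped values need to be grouped by more than one key. The sketch below uses invented states, counties, and numbers (not real scraped values) to show how `dict.setdefault` returns the inner dictionary for a key, creating it first if it is missing:

```python
# Toy records: (state, county, value). The values are invented for illustration.
records = [("il", "Marion", 1.0), ("il", "Kane", 2.0), ("co", "Kit Carson", 3.0)]

nested = {}
for state, county, value in records:
    # setdefault returns nested[state] if it exists;
    # otherwise it stores an empty dict under that key and returns it
    state_dict = nested.setdefault(state, {})
    state_dict[county] = value

print(nested)
# {'il': {'Marion': 1.0, 'Kane': 2.0}, 'co': {'Kit Carson': 3.0}}
```

Because `setdefault` hands back the same inner dictionary each time, both Illinois counties end up under one `'il'` key.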


Loops allow you to iterate over all values in some sort of collection (e.g. a list). There are two main types of loops in Python: for loops and while loops.

For loops are probably more common for this sort of application, because we have a fixed number of iterations, but while loops can be nice when you have an indefinite number. Just be careful that the condition eventually evaluates to False, or you'll have an infinite loop!

Finally, as you'll see below, loops can be nested within each other, meaning that you can have a loop within a loop. In this case, for *every* iteration of the outer loop, the inner loop would run through *all* of its iterations.

In [2]:
# Example for loop
for item in my_list:
    print(item + 1)

# Example while loop
i = 0
while i < len(my_list):
    print(my_list[i] + 1)
    i += 1

# Nested for loop
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']

for i in list1:
    for j in list2:
        print(i, j)

4
6
8
10
12
14
4
6
8
10
12
14
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
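One way to guard against the infinite-loop risk mentioned above is to cap the number of iterations. This is a minimal sketch: `check_ready` and `max_attempts` are hypothetical names made up for the example, standing in for whatever condition you are actually waiting on.

```python
max_attempts = 5  # hypothetical safety cap on the number of iterations


def check_ready(attempt):
    """Hypothetical stand-in for a real check; succeeds on the third attempt."""
    return attempt >= 3


attempts = 0
found = False

# Stop as soon as the check succeeds OR we run out of attempts
while not found and attempts < max_attempts:
    attempts += 1
    found = check_ready(attempts)

print(found, attempts)  # True 3
```

Even if `check_ready` never returned `True`, the loop would end after `max_attempts` iterations.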



### Iterating

Using the list above as a guide for what we want to accomplish, we need to think through what the most efficient way to set everything up is. Some values only need to be set when the county changes, while others need to be set every time we want to get a new premium value for different ages. Therefore, we can split the `selenium` code we wrote yesterday into two functions. 

First, try to break the problem down into which pieces we want to change at the same time, and which we want to stay consistent. As a reminder, the project team we're working with needs the cost of the Silver Plan Premium for each county for people aged 14, 20, 40, and 60. 

### TASK 1 

**Looking at the web page, what values do we need to set when the county changes? What values do we need to set when the age changes? Do some fields reset to blank? Do some default to a value?**


![Full page](images/full-page.png)


### TASK 1 SOLUTION

<details>

County Set-up
- State
- Zip code
- Yearly household income
- Is coverage available for spouse's job?

Age Set-up
- Number of adults enrolled in Marketplace coverage
- Number of children enrolled in Marketplace coverage
- Age

Now that we've thought through this hierarchy, we can start to break up these tasks into functions!! 
</details>


In [3]:
# Import functions saved in utils.py
from utils import *

## Iterating over options on the website

Let's read in a CSV file with the states, counties, and zip codes that we want to scrape. This will just be a set of 10 to keep things simple. We can then convert it to a list of lists (aka a nested list) to make it even easier to iterate over.

In [4]:
import pandas as pd
# Read in CSV with states, counties, zips and sort by state
state_counties_zipcodes = pd.read_csv('data/zip_data_small.csv').sort_values('state_abbr')
# Convert df to list of lists
state_counties_zipcodes = state_counties_zipcodes.values.tolist()
state_counties_zipcodes

[['co', 'Kit Carson', 80805],
 ['ga', 'Dooly', 31007],
 ['il', 'Marion', 62801],
 ['il', 'Kane', 60109],
 ['ks', 'Harvey', 67020],
 ['ky', 'Bourbon', 40348],
 ['md', 'Wicomico', 21801],
 ['mo', 'Buchanan', 64401],
 ['ok', 'Pawnee', 74020],
 ['sc', 'Abbeville', 29620]]

In [5]:
# Print the first state/county/zip combo
print(state_counties_zipcodes[0])
# Print just the first state in the list
print(state_counties_zipcodes[0][0])

['co', 'Kit Carson', 80805]
co
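As an alternative to indexing with `[0]`, `[1]`, and `[2]`, Python can unpack each inner list into named variables directly in the `for` statement. The sketch below uses the first two rows from the list above; the `labels` list is only there to show the result.

```python
rows = [['co', 'Kit Carson', 80805], ['ga', 'Dooly', 31007]]

labels = []
for state, county, zipcode in rows:
    # Each inner list has exactly three items, so it unpacks into three names
    labels.append(f"{county}, {state.upper()} ({zipcode})")

print(labels)
# ['Kit Carson, CO (80805)', 'Dooly, GA (31007)']
```

Unpacking only works when every inner list has the same number of items as there are names on the left-hand side.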


### Initialize a driver 

In [6]:
# Launch driver
url = "https://www.kff.org/interactive/subsidy-calculator/"
service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get(url)

# Dropdown indexes for ages 14, 20, 40, and 60
age_values = [14, 20, 19, 39]  # 14 and 20 are used as-is; 19 and 39 are the indexes for 40 and 60 years old (age minus 21)
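The mapping from ages to dropdown indexes can be written out explicitly. This is a sketch under two assumptions taken from this tutorial rather than from the website's documentation: the child dropdown (20 and younger) uses an index equal to the age, and the adult dropdown (21 to 64) uses an index of age minus 21. `age_to_dropdown_index` is a hypothetical helper, not part of `utils.py`.

```python
def age_to_dropdown_index(age):
    """Hypothetical helper: convert an age in years to a dropdown index.

    Assumes children (20 and younger) use index == age and adults
    (21 to 64) use index == age - 21.
    """
    if age < 0 or age > 64:
        raise ValueError(f"age {age} is outside the assumed 0 to 64 range")
    return age if age <= 20 else age - 21


ages = [14, 20, 40, 60]
age_values = [age_to_dropdown_index(a) for a in ages]
print(age_values)  # [14, 20, 19, 39]
```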

### Nested `for` loops

We know that we want to iterate through each of the zip codes in our list of lists, and that for each of those zip codes we want to iterate through 4 age values. To do this, we'll set up a nested for loop. This structure is going to guide how we set up our helper functions. 

For now, we just have `pass` inside the loops, which is a placeholder that does nothing. We'll replace this with actual code next.

In [8]:
# for each state/county/zip combo in the list of lists
for i in range(len(state_counties_zipcodes)):
    pass #Code we want to run for each state/county combo (but not each age)
    # loop through ages we want to get premium values for
    for age in age_values:
        pass #Code we want to run for each age WITHIN a given state/county combo
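To see how often each level of the nested loop runs, the `pass` placeholders can be swapped for counters. This sketch hardcodes 10 combinations to match the size of the CSV above; `outer_runs` and `inner_runs` are names made up for the example.

```python
n_combos = 10                   # number of state/county/zip rows
age_values = [14, 20, 19, 39]   # the four age indexes

outer_runs = 0
inner_runs = 0

for i in range(n_combos):
    outer_runs += 1             # runs once per state/county combo
    for age in age_values:
        inner_runs += 1         # runs once per age WITHIN each combo

print(outer_runs, inner_runs)  # 10 40
```

Anything placed in the outer loop runs 10 times, while anything placed in the inner loop runs 40 times, which is why page setup that doesn't depend on age belongs in the outer loop.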

### Functional Programming

In the following two tasks, we're going to create helper functions that use the `selenium` code we wrote in session 3 to select each of the options on the page based on the zip code and age we're on in our loops, plus some `BeautifulSoup` code to scrape the result.

1) `setup_page()`
    - This function will set the values for: 
        - State
        - Zip code
        - Yearly income
        - Is coverage available for spouse's job?
    - We have this code in its own function so that we only have to set the values once for each zip code that we're on rather than 4 times 

2) `scrape_data()`
    - This function will set the values for: 
        - Number of kids
        - Number of adults
        - Age
    - Once these values are set, it also runs `BeautifulSoup` code to scrape the premium plan value that we're interested in.

### TASK 2

**Fill in the following function with `selenium` code from last week for the sections of the page that only need to be updated when the zipcode/county changes**

In [None]:
def setup_page(state, driver, zipcode, county):
    '''
    Sets up the page conditions that don't need to change with each iteration through the age groups.

    Inputs:
        state: state abbreviation to select in the state dropdown
        driver: the driver with the web page we are navigating
        zipcode: zip code to enter in the zip code text box
        county: county to select if the page offers a county dropdown for the zip code

    Returns:
        Nothing, just sets up the page
    '''
    # Set state, zip code, county, and income, which don't change when looping through ages
    # Print the state and zip code to the console
    print(f'State = {state}, zip code = {zipcode}')
    
     # Select state dropdown
    select_dropdown(identifier= [FILL IN XPATH],driver = driver, value= [FILL IN])
    # Enter zip code
    enter_text(identifier=[FILL IN XPATH] , driver = driver, text = [FILL IN] )
    # if there's an option to select a county, select the county associated with the zip code
    # Attempt to locate and click the dropdown element
    try:
    # select county when given the option
        select_dropdown(identifier= [FILL IN XPATH],  driver = driver, option = county)
    except:
    # The dropdown element is not present, proceed with the next steps
        pass
    
     # Enter yearly household income
    if is_textbox_empty(driver=driver, textbox_id=[FILL IN XPATH]):
        # enter income value if the text box does not already have text 
        enter_text(identifier=[FILL IN XPATH], driver = driver, text = [FILL IN INCOME VALUE]) # this stays the same each time
    
    # Is coverage available from your or your spouse's job? 
    click_button(identifier=[FILL IN XPATH] , driver = driver) # this stays the same each time


### TASK 2 SOLUTION

In [9]:
def setup_page(state, driver, zipcode, county):
    '''
    Sets up the page conditions that don't need to change with each iteration through the age groups.

    Inputs:
        state: state abbreviation to select in the state dropdown
        driver: the driver with the web page we are navigating
        zipcode: zip code to enter in the zip code text box
        county: county to select if the page offers a county dropdown for the zip code

    Returns:
        Nothing, just sets up the page
    '''
    # Print the state and zip code to the console
    print(f'State = {state}, zip code = {zipcode}')

    # Select state dropdown
    select_dropdown(identifier='//*[@id="state-dd"]', driver=driver, value=state)
    # Enter zip code
    enter_text(identifier='//*[@id="zip-wrapper"]/div/input', driver=driver, text=zipcode)
    # If the page offers a county dropdown, select the county associated with the zip code
    try:
        select_dropdown(identifier='//*[@id="locale-inner"]/select', driver=driver, option=county)
    except Exception:
        # The county dropdown is not present, so proceed with the next steps
        pass

    # Enter yearly household income
    if is_textbox_empty(driver=driver, textbox_id='//*[@id="subsidy-form"]/div[2]/div[1]/div[2]/div[2]/input'):
        # Enter the income value only if the text box does not already have text
        enter_text(identifier='//*[@id="subsidy-form"]/div[2]/div[1]/div[2]/div[2]/input', driver=driver, text='100000')  # this stays the same each time

    # Is coverage available from your or your spouse's job?
    click_button(identifier='//*[@id="employer-coverage-0"]', driver=driver)  # this stays the same each time


In the second function, we want to set up the correct values for the age value that we're on in the loop. Note that when the age is 14 or 20, we need to set the number of adults to 0 and number of children to 1 and then select the correct age. When the age is 40 or 60, we need to do the opposite. 

Once all of the specifics on the page are set up, we will talk through the `BeautifulSoup` code at the bottom of this function that actually extracts the data we're interested in. Again, a reminder that all of the Selenium functions navigate us to the correct place by choosing menu options, entering text, and clicking buttons, and then BeautifulSoup is used to pull the specific values.

### TASK 3

Fill in the function below with the appropriate  `selenium` code from the last session.

In [None]:
def scrape_data(age, driver):
     '''
     Scrapes the price of the Silver Plan Premium (without financial help) for one age group
     (non-smoking 14, 20, 40, or 60 year olds), using the county currently set on the page.

     Inputs:
        age: the dropdown index for the age we want (see age_values).
            Each time this function is called below, it is for a different cut of the data.
        driver: the driver with the web page we are navigating

     Returns:
        The Silver Plan Premium for the specified age, as a float
     '''
     # print the age that we are scraping to the console
     print(f'age = {age}')

     # Number of adults (21 to 64) enrolled in Marketplace coverage? (changes based on input age)
     if age in [[FILL IN 40YO INDEX], [FILL IN 60YO INDEX]]: # fill in with the indexes for 40yo and 60yo
      # set number of kids to 0
        select_dropdown(identifier=[FILL IN XPATH],driver = driver, index = [FILL IN INDEX]) # fill in with the index for 0 kids
        # set number of adults to 1
        select_dropdown(identifier=[FILL IN XPATH],driver = driver, value = [FILL IN VALUE]) # fill in with the value for 1 adult
        # Age? (index is age minus 21)
        select_dropdown(identifier= [FILL IN XPATH], driver = driver,index = [FILL IN AGE]) # fill in with the age variable
     else: 
        # set number of adults to 0
        select_dropdown(identifier= [FILL IN XPATH],driver = driver, value = [FILL IN VALUE]) # fill in with the value for 0 adults
        # Number of children (20 and younger) enrolling in Marketplace coverage
        select_dropdown(identifier= [FILL IN XPATH],driver = driver, index = [FILL IN INDEX]) # fill in with the index for 1 kid
        select_dropdown(identifier= [FILL IN XPATH],driver = driver, index = [FILL IN AGE]) # fill in with the age variable
        
     # print update to console    
     print('Found data download button now clicking it')
     # Submit
     click_button(identifier=[FILL IN XPATH], driver = driver,)

     time.sleep(1)       
     print('Now scraping data')

     ##--- Beautiful Soup ---###
     # Beautiful Soup setup using the desired URL
     html = driver.page_source
     soup = BeautifulSoup(html, 'html.parser') #Another parsing option is 'lxml' which we talked about in week 2
     
     premium_val = str(soup.find_all('span', class_ = "bold-blue")[4]) # select index 4 (the fifth match), which has the value we want

     # This is Regular Expression (regex) to extract the number from the string
     extracted_number = re.search(r'\$([\d,]+(?:\.\d{1,2})?)', premium_val)
     
    # Extract the matched group (number with $ and commas)
     matched_string = extracted_number.group(0)
    # Remove $ and commas from the matched string
     clean_number = matched_string.replace('$', '').replace(',', '')
    # Convert the cleaned string to a numeric value (float or int)
     number = float(clean_number) 
     print(number)
        
     return (number)


### TASK 3 SOLUTION

In [11]:
def scrape_data(age, driver):
    '''
    Scrapes the price of the Silver Plan Premium (without financial help) for one age group
    (non-smoking 14, 20, 40, or 60 year olds), using the county currently set on the page.

    Inputs:
        age: the dropdown index for the age we want (see age_values).
            Each time this function is called below, it is for a different cut of the data.
        driver: the driver with the web page we are navigating

    Returns:
        The Silver Plan Premium for the specified age, as a float
    '''
    # print the age index that we are scraping to the console
    print(f'age = {age}')

    # Number of adults (21 to 64) enrolled in Marketplace coverage? (changes based on input age)
    if age in [19, 39]:  # 19 and 39 are the indexes for 40yo and 60yo
        # set number of kids to 0
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/div/select', driver=driver, index='0')
        # set number of adults to 1
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[1]/div/select', driver=driver, value="1")
        # Age? (index is age minus 21)
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[2]/div/div[1]/select', driver=driver, index=age)
    else:
        # set number of adults to 0
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[1]/div/select', driver=driver, value="0")
        # Number of children (20 and younger) enrolling in Marketplace coverage: set to 1
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/div/select', driver=driver, index='1')
        # Age of the child
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[4]/div/div[1]/select', driver=driver, index=age)

    # print update to console
    print('Submitting the form')
    # Submit
    click_button(identifier='//*[@id="subsidy-form"]/p/input[2]', driver=driver)

    # give the page a moment to update before reading it
    time.sleep(1)
    print('Now scraping data')

    ##--- Beautiful Soup ---###
    # Parse the HTML of the page as it currently appears in the driver
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')  # Another parsing option is 'lxml', which we talked about in week 2

    # select index 4 (the fifth match), which has the value we want
    premium_val = str(soup.find_all('span', class_="bold-blue")[4])

    # Regular expression (regex) to extract the dollar amount from the string
    extracted_number = re.search(r'\$([\d,]+(?:\.\d{1,2})?)', premium_val)
    if extracted_number is None:
        raise ValueError(f'No dollar amount found in: {premium_val}')

    # Extract the full match (number with $ and commas)
    matched_string = extracted_number.group(0)
    # Remove $ and commas from the matched string
    clean_number = matched_string.replace('$', '').replace(',', '')
    # Convert the cleaned string to a float
    number = float(clean_number)
    print(number)

    return number
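The regex step can be tried on its own without opening a browser. The sketch below uses the same pattern on two invented strings (they are not copied from the real page), and `parse_dollar_amount` is a hypothetical helper that returns `None` when no dollar amount is present instead of raising an error.

```python
import re


def parse_dollar_amount(text):
    """Return the first dollar amount in text as a float, or None if none is found."""
    match = re.search(r'\$([\d,]+(?:\.\d{1,2})?)', text)
    if match is None:
        return None
    # group(1) is the part inside the parentheses, i.e. without the leading $
    return float(match.group(1).replace(',', ''))


# Invented sample strings for illustration only
with_amount = '<span class="bold-blue">$1,234.56</span>'
without_amount = '<span class="bold-blue">N/A</span>'

print(parse_dollar_amount(with_amount))     # 1234.56
print(parse_dollar_amount(without_amount))  # None
```

Using `group(1)` rather than `group(0)` skips the `$` sign, so only the commas need to be removed before converting to a float.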


### TASK 4
Going back to our nested `for` loops, where would you place each of the functions we defined above? What values would you pass in as the arguments? 

In [None]:
# for each state/county/zip combo in the list of lists
for i in range(len(state_counties_zipcodes)):
    
    # loop through ages we want to get premium values for
    for age in age_values:

### TASK 4 SOLUTION

In [None]:
# for each state, county, zip combo in the list of lists
for i in range(len(state_counties_zipcodes)):
    # call the setup_page function to set top dropdown values on page
    setup_page(state = state_counties_zipcodes[i][0], county= state_counties_zipcodes[i][1], zipcode = state_counties_zipcodes[i][2], driver = driver)
    # loop through ages we want to get premium values for
    for age in age_values:
        # call the scrape_data function to get the premium value 
        number = scrape_data(age = age, driver = driver)

### TASK 5

Almost there!!! Now that we have everything in place to scrape the data, we need to think about how to store the premium plan values that we did all of this work to get! To do that, we again need to think about what is happening at each step in our nested for loops and where the best place to save the data is. Within the age for loop, we want to capture all of the scraped values in a list called `premium_val_list`. Then we want to save that list, keyed by the state and county it came from, in the dictionary initialized at the top of the cell, called `premium_val_dict`.

Try to fill in those items where they belong in the skeleton below.

In [12]:
# initialize empty dictionary to capture the scraped values
premium_val_dict = {} 

# for each state, county, zip combo in the list of lists
for i in range(len(state_counties_zipcodes)):
    # set the state as the top key in the dictionary
    state_dict = premium_val_dict.setdefault(state_counties_zipcodes[i][0], {})
    # FILL IN: initialize empty list to store premium plan values for 
    # the county that we're on in the loop 
    premium_val_list =    # Initialize empty list
    
    # call the setup_page function to set top dropdown values on page
    setup_page(state = state_counties_zipcodes[i][0], county = state_counties_zipcodes[i][1], zipcode = state_counties_zipcodes[i][2], driver = driver)
        
    # loop through ages we want to get premium values for
    for age in age_values:

        # call the scrape_data function to get the premium value 
        number = scrape_data(age = age, driver = driver)

        # FILL IN: append the value that we just scraped (number) to the premium_val_list
        
                  
    # FILL IN: assign premium_val_list to the county key in the state_dict
    # note that this step is outside of the age for loop 
    state_dict[state_counties_zipcodes[i][1]] = 

# Save the dictionary as a JSON file once the loops have finished
output_filename = 'data/output.json'
with open(output_filename, 'w') as json_file:
    json.dump(premium_val_dict, json_file, indent=2)  # 'indent' for pretty formatting (optional)

### TASK 5 SOLUTION

In [None]:
# initialize empty dictionary to capture the scraped values
premium_val_dict = {} 

# for each state, county, zip combo in the list of lists
for i in range(len(state_counties_zipcodes)):
    # set the state as the top key in the dictionary
    state_dict = premium_val_dict.setdefault(state_counties_zipcodes[i][0], {})
    # initialize empty list to store premium plan values for 
    # the county that we're on in the loop 
    premium_val_list = []
    
    # call the setup_page function to set top dropdown values on page
    setup_page(state = state_counties_zipcodes[i][0], county = state_counties_zipcodes[i][1], zipcode = state_counties_zipcodes[i][2], driver = driver)
        
    # loop through ages we want to get premium values for
    for age in age_values:

        # call the scrape_data function to get the premium value 
        number = scrape_data(age = age, driver = driver)

        # collect the premium plan cost for each age in a list for this county
        premium_val_list.append(number)
                  
    # after looping through all ages for this county, add the premium values to the dictionary
    # note that this step is outside of the age for loop 
    state_dict[state_counties_zipcodes[i][1]] = premium_val_list

# Save the dictionary as a JSON file once the loops have finished
output_filename = 'data/output.json'
with open(output_filename, 'w') as json_file:
    json.dump(premium_val_dict, json_file, indent=2)  # 'indent' for pretty formatting (optional)
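The nested state, county, and list structure is compact, but a flat table with one row per state, county, and age is often easier to inspect. The sketch below assumes the structure built above and uses invented premium values; it round-trips through a JSON string in memory rather than reading `data/output.json`, so it runs without the scraper.

```python
import json

# Invented values in the same shape as premium_val_dict: state -> county -> list of premiums
sample = {"il": {"Marion": [1.0, 2.0, 3.0, 4.0], "Kane": [5.0, 6.0, 7.0, 8.0]}}
ages = [14, 20, 40, 60]  # same order as age_values

# Round-trip through JSON text to mimic saving and reloading the file
loaded = json.loads(json.dumps(sample, indent=2))

rows = []
for state, counties in loaded.items():
    for county, premiums in counties.items():
        # Pair each premium with the age it was scraped for
        for age, premium in zip(ages, premiums):
            rows.append((state, county, age, premium))

print(rows[0])    # ('il', 'Marion', 14, 1.0)
print(len(rows))  # 8
```

Note that `zip` stops at the shorter of the two lists, so a county with fewer than four premiums would silently produce fewer rows.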


## Quality Assurance
We'd like to end on a note about quality assurance. Unlike in a traditional code review process at Urban, a reviewer can't feasibly replicate everything you've done if the web scraping task runs for a very long time. Of course, errors in the code can still be caught, but we want to emphasize that, even more than normal, it's crucial to quality check your own results. A few questions to ask yourself after you finish putting together the type of workflow we just walked through:

- **Have I gone through *every* iteration?** Do the math on how many files or values or iterations you expect to run in total and make sure you have that many outputs.
- **Are there edge cases that I might be missing?** We will see more examples of this next week, and you will see how much exceptions vary by use case! It's not feasible to check every single combination of options, but this is where the project team's domain knowledge can be key (and also probably the hardest thing for them to catch without your help).
- **Does the resulting text or data output seem sensible?** Another place where project team input can be helpful. 
- **Have I commented my code clearly to outline my assumptions?** This one is as much for future-you as it is for anyone else. Relatedly...
- **Have I added print statements within my workflow?** This isn't always necessary but can be incredibly helpful for tracking which iteration you're on, seeing any logic your code might be skipping over or incorrectly processing, and just generally making your code easier to understand and debug.
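The first and third questions above can be partly automated. This is a minimal sketch, assuming results are stored in the nested state, county, and list structure used in this session; `check_results` is a hypothetical helper, and the sample data is invented to trigger each kind of problem.

```python
def check_results(results, expected_counties, n_ages=4):
    """Return a list of human-readable problems found in the nested results."""
    problems = []
    for state, county in expected_counties:
        values = results.get(state, {}).get(county)
        if values is None:
            problems.append(f"missing: {county}, {state}")
        elif len(values) != n_ages:
            problems.append(f"{county}, {state}: expected {n_ages} values, got {len(values)}")
        elif any(v <= 0 for v in values):
            problems.append(f"{county}, {state}: non-positive premium")
    return problems


# Invented example: Kane is short two values and Kit Carson was never scraped
expected = [("il", "Marion"), ("il", "Kane"), ("co", "Kit Carson")]
results = {"il": {"Marion": [1.0, 2.0, 3.0, 4.0], "Kane": [5.0, 6.0]}}

for problem in check_results(results, expected):
    print(problem)
# Kane, il: expected 4 values, got 2
# missing: Kit Carson, co
```

A check like this only confirms that the counts and signs look right; whether the premium values themselves are sensible is still a question for the project team.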