# Loops, Lists, and other Pythonic things!
Now that we know how to set up the page how we want it and scrape the text that we're interested in, we need to think through how to loop through each combination of the variables we care about. This is where things get interesting! I like to think of it as a puzzle: we need to figure out the right order and combination of pieces to get it to run.

Let's start with a quick review of some Python objects and syntax - and a reminder that deeper dives on all these concepts can be found in our Intro to Python series: https://ui-research.github.io/python-at-urban/content/intro-to-python.html

In [1]:
# Lists contain a collection of values.
my_list = [3, 5, 7, 9, 11]
# Dictionaries contain a collection of key, value pairs. The keys are usually strings or numbers, while the values can be any data type, even lists or dictionaries themselves.
my_dict = {'a': 3, 'b': 5, 'c': 7, 'd': 9, 'e': 11}
# In a web scraping context, these will be helpful objects for storing data as we loop through various web page objects and append the text we extract. We can add new values like this:
my_list.append(13)
my_dict['f'] = 13

print(my_list)
print(my_dict)

[3, 5, 7, 9, 11, 13]
{'a': 3, 'b': 5, 'c': 7, 'd': 9, 'e': 11, 'f': 13}


Loops allow you to iterate over all values in some sort of collection (e.g. a list). There are two main types of loops in Python: for loops and while loops.

For loops are probably more common for this sort of application, because we have a fixed number of iterations, but while loops can be nice when you have an indefinite number. Just be careful that the condition eventually evaluates to False, or you'll have an infinite loop!

Finally, as you'll see below, loops can be nested within each other, meaning that you can have a loop within a loop. In this case, for *every* iteration of the outer loop, the inner loop would run through *all* of its iterations.

In [2]:
# Example for loop
for item in my_list:
    print(item + 1)

# Example while loop
i = 0
while i < len(my_list):
    print(my_list[i] + 1)
    i += 1

# Nested for loop
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']

for i in list1:
    for j in list2:
        print(i, j)

4
6
8
10
12
14
4
6
8
10
12
14
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c


Last week, we learned how to set up the page how we want it and scrape the text that we're interested in. Now we need to work out how to loop through every combination of the variables we care about.

At this stage in the web scraping process, rather than diving straight into the code, we're going to focus on pseudocode to learn how to ***think*** about approaching the problem. Once you know what you want to do conceptually, the syntax is something you can look up in past examples or on Google/Stack Overflow. As we introduced in week 1, concept > syntax.

First, try to break the problem down into which pieces we want to change at the same time, and which we want to stay consistent. As a reminder, the project team we're working with needs the cost of the Silver Plan Premium for each county for people aged 14, 20, 40, and 60. 

### TASK 1 

**Looking at the web page, what values need to change *every* time we want to scrape a new value, and what values only need to be set *some* of the times we want to scrape the page?**


![Full page](images/full-page.png)



**ANSWER**

STATE only needs to change once we've gone through every ZIP CODE/COUNTY pair in that state, and AGE changes four times for each ZIP CODE/COUNTY pair.

Now that we've thought through this hierarchy, we have the three levels of nested `for` loops that we need!! 
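
To make the hierarchy concrete, here is a minimal sketch of the three nested loops. The `toy_states` dictionary and its values are made-up placeholders, not the real data, and nothing is scraped: the loops only record which combinations would be visited.

```python
# Hypothetical toy data: state -> {county: zip code}
toy_states = {
    "state_a": {"County 1": "00001", "County 2": "00002"},
    "state_b": {"County 3": "00003"},
}
ages = [14, 20, 40, 60]

visited = []
for state, counties in toy_states.items():      # outer loop: changes least often
    for county, zip_code in counties.items():   # middle loop: once per county/zip pair
        for age in ages:                        # inner loop: four times per county
            visited.append((state, county, zip_code, age))

print(len(visited))  # 3 counties x 4 ages
print(visited[0])
```

The outer variable changes least often and the inner variable changes most often, which mirrors how the values on the page need to be updated.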


### Functional Programming


We'll use functions to apply the code we wrote in Session 3 to our nested for loops. Some values only need to be set when the county changes, while others need to be set every time we want to get a new premium value for a different age. Therefore, we can split the `selenium` code we wrote yesterday into two functions.


### TASK 2

**Which values on the page should be set only when the county changes? Which should be set every time we change the age? What should each function return (if anything)?**


In [3]:
# Import functions saved in utils.py
from utils import *

### TASK 3

**Fill in the following function with `selenium` code from yesterday for the sections of the page that only need to be updated when the county changes**

In [4]:
def setup_page(state, driver, zipcode):
    '''
    This function sets up the initial page conditions that don't need to change with each iteration through the age groups.

    Inputs:
        state: the state to select in the state dropdown menu on the website.
        driver: the driver with the web page we are navigating.
        zipcode: the zip code to enter on the website.
            Each time this function is called below, it is for a different cut of the data.

    Returns:
        Nothing, just sets up the page
    '''
    # set state, zip code, and income, which don't change when looping through the ages for a county
    # print the state and zip code to the console
    print(f'State = {state}, zip Code ={zipcode}')

    # Select state dropdown

    # Enter zip code

    # Enter yearly household income

    # use the is_textbox_empty() function to check if there is already a value in the income textbox

        # if there isn't, enter income value of 100000

    # Is coverage available from your or your spouse's job?


#### Solution

In [22]:
def setup_page(state, driver, zipcode):
    '''
    This function sets up the initial page conditions that don't need to change with each iteration through the age groups.

    Inputs:
        state: the state to select in the state dropdown menu on the website.
        driver: the driver with the web page we are navigating.
        zipcode: the zip code to enter on the website.
            Each time this function is called below, it is for a different cut of the data.

    Returns:
        Nothing, just sets up the page
    '''
    # set state, zip code, and income, which don't change when looping through the ages for a county
    # print the state and zip code to the console
    print(f'State = {state}, zip Code ={zipcode}')

    # Select state dropdown
    select_dropdown(identifier='//*[@id="state-dd"]', driver=driver, value=state)
    # Enter zip code
    enter_text(identifier='//*[@id="zip-wrapper"]/div/input', driver=driver, text=zipcode)
    # if there's an option to select a county, select the county associated with the zip code
    # Attempt to locate and click the dropdown element

    # Enter yearly household income
    if is_textbox_empty(driver=driver, textbox_id='//*[@id="subsidy-form"]/div[2]/div[1]/div[2]/div[2]/input'):
        # enter income value if the text box does not already have text
        enter_text(identifier='//*[@id="subsidy-form"]/div[2]/div[1]/div[2]/div[2]/input', driver=driver, text='100000')  # this stays the same each time

    # Is coverage available from your or your spouse's job?
    click_button(identifier='//*[@id="employer-coverage-0"]', driver=driver)  # this stays the same each time


In the second function, we want to set up the correct values for the age value that we're on in the loop. Note that when the age is 14 or 20, we need to set the number of adults to 0 and number of children to 1 and then select the correct age. When the age is 40 or 60, we need to do the opposite. Once all of the specifics on the page are set up, we want to use the `BeautifulSoup` code that we wrote last session to actually scrape the data. 
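
The age logic above can be sketched as a small helper. This is an illustration only and not part of `utils.py`: the name `age_to_dropdown` is made up, and it assumes the child dropdown's index equals the child's age, which is what the `age_values` list used later in this session implies.

```python
def age_to_dropdown(age):
    """Return (group, dropdown_index) for a person's actual age.

    Hypothetical helper for illustration only.
    """
    if age >= 21:
        # the adult dropdown starts at 21, so the index is age minus 21
        return "adult", age - 21
    # assumption: the child dropdown index matches the age itself
    return "child", age


for real_age in [14, 20, 40, 60]:
    print(real_age, age_to_dropdown(real_age))
```

This is why the ages 40 and 60 show up as 19 and 39 in the code that follows.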

### TASK 4

**Fill in the function below with the appropriate `selenium` code from the last session.**

In [None]:
def scrape_data(age, driver):
    '''
    This function scrapes the price of the silver plan premium (without financial help) for a non-smoking person of the given age, for the county currently set up on the page.

    Inputs:
        age: an input to the dropdown menus on the website.
            Each time this function is called below, it is for a different cut of the data.
        driver: the driver with the web page we are navigating.

    Returns:
        The Silver Plan Premium for the specified age
    '''
    # print the age that we are scraping to the console
    print(f'age = {age}')

    # Number of adults (21 to 64) enrolled in Marketplace coverage? (changes based on input age)
    if age in [ , ]: # fill in the dropdown indexes for the 40- and 60-year-olds
        # set number of kids to 0

        # set number of adults to 1

        # Age? (index is age minus 21)

    else:
        # set number of adults to 0

        # set number of kids to 1

        # Age

    # print update to console
    print('Found data download button now clicking it')

    # Click submit button

    print('Now scraping data')

    ##--- Beautiful Soup ---###
    # Beautiful Soup setup using the page source from the driver


    # return scraped value


#### Solution

In [21]:
# Explicit imports for the names used below, in case utils.py does not already provide them
import re
import time
from bs4 import BeautifulSoup

def scrape_data(age, driver):
    '''
    This function scrapes the price of the silver plan premium (without financial help) for a non-smoking person of the given age, for the county currently set up on the page.

    Inputs:
        age: an input to the dropdown menus on the website.
            Each time this function is called below, it is for a different cut of the data.
        driver: the driver with the web page we are navigating.

    Returns:
        The Silver Plan Premium for the specified age
    '''
    # print the age that we are scraping to the console
    print(f'age = {age}')

    # Number of adults (21 to 64) enrolled in Marketplace coverage? (changes based on input age)
    if age in [19, 39]:  # 19 and 39 are the dropdown indexes for the 40- and 60-year-olds
        # set number of kids to 0
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/div/select', driver=driver, index='0')
        # set number of adults to 1
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[1]/div/select', driver=driver, value="1")
        # Age? (index is age minus 21)
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[2]/div/div[1]/select', driver=driver, index=age)
    else:
        # set number of adults to 0
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[1]/div/select', driver=driver, value="0")
        # Number of children (20 and younger) enrolling in Marketplace coverage: set to 1
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/div/select', driver=driver, index='1')
        # Age of the child
        select_dropdown(identifier='//*[@id="subsidy-form"]/div[2]/div[3]/div[4]/div/div[1]/select', driver=driver, index=age)

    # print update to console
    print('Found data download button now clicking it')
    # Submit
    click_button(identifier='//*[@id="subsidy-form"]/p/input[2]', driver=driver)

    # give the page a moment to update before reading the results
    time.sleep(1)
    print('Now scraping data')

    ##--- Beautiful Soup ---###
    # Parse the current page source with Python's built-in 'html.parser'
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # select the element at index 4 (the fifth match), which has the value we want
    premium_val = str(soup.find_all('span', class_="bold-blue")[4])

    # Find the dollar amount inside the element
    extracted_number = re.search(r'\$([\d,]+(?:\.\d{1,2})?)', premium_val)

    # Extract the full match (number with $ and commas)
    matched_string = extracted_number.group(0)
    # Remove $ and commas from the matched string
    clean_number = matched_string.replace('$', '').replace(',', '')
    # Convert the cleaned string to a numeric value
    number = float(clean_number)
    print(number)

    return number
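
The cleaning step at the end of `scrape_data` can be tried on its own without a browser. Below is a minimal sketch; the `parse_premium` name and the sample HTML string are made up for illustration, and unlike the function above it returns `None` when no dollar amount is found instead of raising an error.

```python
import re


def parse_premium(text):
    """Pull the first dollar amount out of a string and return it as a float."""
    match = re.search(r'\$([\d,]+(?:\.\d{1,2})?)', text)
    if match is None:
        return None
    # group(1) is the number without the leading "$"; strip the commas
    return float(match.group(1).replace(',', ''))


sample = '<span class="bold-blue">$1,211</span>'  # hypothetical snippet
print(parse_premium(sample))
print(parse_premium('no price here'))
```

Testing the parsing separately like this makes it easier to tell whether a failure comes from the page navigation or from the text cleaning.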


## Nested Dictionaries

Now that we've figured out how to loop through all of the combinations of variables we need, and how to turn our web scraping code into functions, we need to figure out how to actually save the values we're scraping! Just like nested loops, we can have nested dictionaries in Python. JSON files are essentially nested dictionaries, and can be really useful for storing data in a structured way. We've written code in `Reference.qmd` that takes a random set of 10 counties and zip codes and outputs them into a JSON file which we'll work with here. 

Loading in the file below, we can see that we have a dictionary with the state as the key and another dictionary as the value. This set of inner dictionaries contains the counties for that state and the zip codes for that county (note that we need both counties and zips accounted for in this dictionary because zip codes can cross county lines).

In [16]:
# Import JSON file
import json
with open('data/zip_data_small.json') as f:
    state_counties_zipcodes = json.load(f)

state_counties_zipcodes

{'ky': {'Bourbon': '40348'},
 'il': {'Marion': '62801', 'Kane': '60109'},
 'ok': {'Pawnee': '74020'},
 'ks': {'Harvey': '67020'},
 'mo': {'Buchanan': '64401'},
 'ga': {'Dooly': '31007'},
 'sc': {'Abbeville': '29620'},
 'co': {'Kit Carson': '80805'},
 'md': {'Wicomico': '21801'}}

Stepping through this dictionary is pretty straightforward. To access the value corresponding to Illinois, we can write:

In [17]:
state_counties_zipcodes['il']

{'Marion': '62801', 'Kane': '60109'}

If we then wanted to access the zip code for a specific county, we could do so with the first line of code below. Note that we can continue to step through nested dictionaries with the same syntax we would use to access a single dictionary, the `[]` brackets. The `items()` method, shown in the second line, is also useful as a way to obtain the full set of keys and values for a dictionary.

In [18]:
print(state_counties_zipcodes['il']['Marion'])
state_counties_zipcodes['il'].items()

62801


dict_items([('Marion', '62801'), ('Kane', '60109')])
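
Because `items()` returns key/value pairs, it can be unpacked directly in a `for` loop. A minimal sketch, using a two-state excerpt of the dictionary shown above (copied inline under the made-up name `excerpt` so the snippet runs on its own):

```python
excerpt = {
    'ky': {'Bourbon': '40348'},
    'il': {'Marion': '62801', 'Kane': '60109'},
}

rows = []
for state, counties in excerpt.items():
    # counties is the inner dictionary for this state
    for county, zip_code in counties.items():
        rows.append((state, county, zip_code))

print(rows)
```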

### TASK 5 - Putting it all together

Now that we have functions to 1) set up sections of the page that need to update for every county, and 2) set up what needs to change for different ages and scrape the value, we can put all of the pieces together. As a reminder, at a high-level, we want to:

1) Initialize a web driver
2) Initialize an empty dictionary to capture the values that we want to scrape
3) Specify the ages and state/counties that we want to loop over
4) Loop over each state
5) Loop over each county
6) Loop over each age we need
7) Save the output 


In [19]:
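# Imports for launching the browser, in case utils.py does not already provide them
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Launch driver
url = "https://www.kff.org/interactive/subsidy-calculator/"
service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get(url)

# initialize empty dictionary to capture the scraped values
premium_val_dict = {}
# dropdown indexes for ages 14, 20, 40, and 60 (adult ages are indexed as age minus 21)
age_values = [14, 20, 19, 39]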
    

Now that we have the nested for loops, where would you place each of the functions we defined above? 

In [20]:
# for each state and county combo in the state nested dict
for state, counties in state_counties_zipcodes.items():

    # loop through county, zip pairs
    for county, zip_code in counties.items():

        # loop through ages we want to get premium values for
        for age in age_values:
            # print each combination to confirm the loops visit what we expect
            print(age)
            print(zip_code)
            print(county)

14
40348
Bourbon
20
40348
Bourbon
19
40348
Bourbon
39
40348
Bourbon
14
62801
Marion
20
62801
Marion
19
62801
Marion
39
62801
Marion
14
60109
Kane
20
60109
Kane
19
60109
Kane
39
60109
Kane
14
74020
Pawnee
20
74020
Pawnee
19
74020
Pawnee
39
74020
Pawnee
14
67020
Harvey
20
67020
Harvey
19
67020
Harvey
39
67020
Harvey
14
64401
Buchanan
20
64401
Buchanan
19
64401
Buchanan
39
64401
Buchanan
14
31007
Dooly
20
31007
Dooly
19
31007
Dooly
39
31007
Dooly
14
29620
Abbeville
20
29620
Abbeville
19
29620
Abbeville
39
29620
Abbeville
14
80805
Kit Carson
20
80805
Kit Carson
19
80805
Kit Carson
39
80805
Kit Carson
14
21801
Wicomico
20
21801
Wicomico
19
21801
Wicomico
39
21801
Wicomico


#### Solution

In [24]:
# for each state and county combo in the state nested dict
for state, counties in state_counties_zipcodes.items():

    # loop through county, zip pairs
    for county, zip_code in counties.items():

        # call the setup_page function to set top dropdown values on page
        setup_page(state=state, driver = driver, zipcode=zip_code)
            
        # loop through ages we want to get premium values for
        for age in age_values:

            # call the scrape_data function to get the premium value 
            number = scrape_data(age = age, driver = driver)

State = ky, zip Code =40348
age = 14
Found data download button now clicking it
Now scraping data
248.0
age = 20
Found data download button now clicking it
Now scraping data
315.0
age = 19
Found data download button now clicking it
Now scraping data
415.0
age = 39
Found data download button now clicking it
Now scraping data
881.0
State = il, zip Code =62801
age = 14
Found data download button now clicking it
Now scraping data
341.0
age = 20
Found data download button now clicking it
Now scraping data
433.0
age = 19
Found data download button now clicking it
Now scraping data
570.0
age = 39
Found data download button now clicking it
Now scraping data
1211.0
State = il, zip Code =60109
age = 14
Found data download button now clicking it
Now scraping data
249.0
age = 20
Found data download button now clicking it
Now scraping data
316.0
age = 19
Found data download button now clicking it
Now scraping data
416.0
age = 39
Found data download button now clicking it
Now scraping data
883.0
Sta

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=123.0.6312.107)


### TASK 6

Almost there!!! Now that we have everything in place to scrape the data, we need to think about how to store the premium plan values that we did all of this work to get! To do that, we again need to think about what is happening at each step in our nested for loops and where the best place to save the data is. Within the age for loop, we want to capture all of the scraped values in a list. Then we want to save that list with the county and state that it came from.
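
One way to build this kind of nested structure is `dict.setdefault`, which returns the value for a key and first inserts a default if the key is missing. A minimal sketch, using the Illinois values printed in the solution output above (the `results` name is just for illustration):

```python
results = {}

# first time we see 'il': setdefault inserts {} and returns it
il_dict = results.setdefault('il', {})
il_dict['Marion'] = [341.0, 433.0, 570.0, 1211.0]

# second time: the existing inner dictionary is returned, not replaced
il_dict_again = results.setdefault('il', {})
il_dict_again['Kane'] = [249.0, 316.0, 416.0, 883.0]

print(results)
```

Since the inner dictionary is returned by reference, adding a county to it also updates the outer dictionary.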

In [5]:
# for each state and county combo in the state nested dict
for state, counties in state_counties_zipcodes.items():
    # set the state as the top key in the output dictionary
    state_dict = premium_val_dict.setdefault(state, {})
    # loop through county, zip pairs
    for county, zip_code in counties.items():
        # initialize empty list to store premium plan values for
        # the county that we're on in the loop
        premium_val_list = []

        # call the setup_page function to set top dropdown values on page
        # (setup_page takes state, driver, and zipcode; it has no county argument)
        setup_page(state=state, driver=driver, zipcode=zip_code)

        # loop through ages we want to get premium values for
        for age in age_values:

            # call the scrape_data function to get the premium value
            number = scrape_data(age=age, driver=driver)

            # for each county, build a list of the premium plan costs, one per age
            # this will be saved under the county key in the dictionary
            premium_val_list.append(number)

        # after looping through all ages for the county, add premium values to dictionary
        # note that this step is outside of the age for loop
        state_dict[county] = premium_val_list
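
The last step in the outline, saving the output, is not shown above. One possible sketch follows. The file name `premium_values.json` is an assumption, and a small sample dictionary with values from the earlier output stands in for `premium_val_dict` so the snippet runs on its own. It writes to a temporary directory to avoid overwriting anything.

```python
import json
import os
import tempfile

# Stand-in for premium_val_dict, using values printed earlier in this session
sample_premiums = {
    'ky': {'Bourbon': [248.0, 315.0, 415.0, 881.0]},
    'il': {'Marion': [341.0, 433.0, 570.0, 1211.0]},
}

# hypothetical output location
out_path = os.path.join(tempfile.mkdtemp(), 'premium_values.json')

# write the nested dictionary to disk as JSON
with open(out_path, 'w') as f:
    json.dump(sample_premiums, f, indent=2)

# read it back to confirm nothing was lost
with open(out_path) as f:
    reloaded = json.load(f)

print(reloaded == sample_premiums)
```

In the real script you would pass `premium_val_dict` instead, and call `driver.quit()` once the loops finish to close the browser.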
