# Welcome to the Real World 

Okay, but what happens when it's not a nice, clean training example? In today's lesson, we'll go over some of the ways that web scraping can get messy and how to work around them.

**TO ADD**
    - time.sleep()
    - skip_county function get to work (flatten json)
    - get run_entire_loop to work 

In [2]:
from utils_session5 import *

## Changes between loops

In an ideal world, each time we loop through our variables, the web page will stay the same. However, sometimes new elements show up on the page based on the state/county/variable you want. In our KFF premium plan example, when a zip code covers multiple counties, an additional county dropdown menu will show up so you can select which county you want data for.

![IL counties](images/il-counties.png)

Since there are ~40,000 zip codes in the US, we only have one zip code in our data for each county. This means that when there are multiple counties associated with a zip code, we need to make sure we select the one we want, or else we'll have errors in our data! This is where the trusty `try: except:` statement comes in handy.


## `Try` your best
The `try` statement in Python is used for catching "exceptions" (i.e. errors). It is a way to prevent expected errors from crashing your code. The way it is structured is as follows:

```python
try:
    # code that might raise an error
except:
    # what to do if an error is raised
```

Here are a couple of simple examples. In the first, we set a value for x and then try to divide 10 by x. In cases where division is possible, we simply want to print out the result. But if x is 0, we want to print out a message saying that division by zero is not possible. There is a specific kind of exception that is raised when you try to divide by zero, called `ZeroDivisionError`. We can catch this specific error by adding it to the `except` statement.

In [3]:
x = 1

try:
    print(10 / x)
except ZeroDivisionError:
    print("You can't divide by zero!")

10.0


Other common types of errors in Python that could be used in this structure include NameError, TypeError, and ValueError. You can catch these errors by specifying the type of error in the except block.

- NameError: Raised when a variable name is not found
- TypeError: Raised when an operation or function is applied to an object of an inappropriate type
- ValueError: Raised when a built-in operation or function receives an argument that has the right type but an inappropriate value. (For instance, `int("abc")` receives a string, which is an acceptable type, but not one that can be read as an integer.)
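
As a quick illustration of the three error types listed above, here is a minimal sketch. The helper name `describe_error` is made up for this example; it simply runs a small piece of code and reports which `except` branch caught it.

```python
def describe_error(action):
    """Run `action` (a function with no arguments) and name the error it raises, if any."""
    try:
        action()
    except NameError:
        return "NameError"
    except TypeError:
        return "TypeError"
    except ValueError:
        return "ValueError"
    return "no error"


# A variable that was never defined
print(describe_error(lambda: undefined_variable))
# Adding a string and an integer: the types don't fit the operation
print(describe_error(lambda: "10" + 5))
# A string is an acceptable type for int(), but this value can't be converted
print(describe_error(lambda: int("abc")))
# Nothing goes wrong here
print(describe_error(lambda: int("10")))
```

Each `except` branch only handles its own error type, so an error you did not list would still stop the program and show up in the traceback.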

One other way we might use try and except is when we want to ignore *any* kind of error. This is logic you should think twice about, as it can hide errors that you did not anticipate. But it is powerful in cases where you are expecting a certain kind of error, know it won't crash your program, and want to specifically ignore it and move on.

Let's come back to our county example from above. In cases where there is a county dropdown because the zip code might correspond to multiple counties, we want to run the `select_dropdown` function. However, if there is no county dropdown, this would raise an exception that we want to simply ignore and move on from. The code below shows how we can do that with a `pass` statement, which literally means "do nothing".

In [8]:
# 1. Select state dropdown
select_dropdown(identifier='//*[@id="state-dd"]', driver = driver,value='il')
# 2. Enter zip code
enter_text(identifier='//*[@id="zip-wrapper"]/div/input',  driver = driver,text = '62401') # Try 60022 to see example with no county dropdown!
county='Shelby'
try:
    # select county when given the option
    select_dropdown(identifier='//*[@id="locale-inner"]/select',  driver = driver, option = county)

except Exception:  # broad on purpose; unlike a bare `except:`, Ctrl+C can still stop the run
    # The dropdown element is not present, proceed with the next steps
    pass


Additionally, a bigger curveball was that New York and Vermont each have completely different pages, where you can only select which members of the family are enrolling in the marketplace. I only figured this out as the code was running, when I noticed that it was getting stuck on those states because the options I was telling `selenium` to select weren't available. In this case, I checked in with the project team to see how they wanted me to proceed and ended up creating a separate, slightly tailored script that would scrape the available values for those two states.

![NY example](images/ny-example.png)

## ABORT MISSION

Oh no! Our code stopped running! 

How can we create safety nets within the code so that when something stops running (say your computer went to sleep), the code can keep running from where we left off?

A few important things that we need to handle are: 

1) keeping track of how many iterations we've already done to know where to start if the code gets interrupted
2) skipping to the correct spot in the list that we're iterating over when we start again (in this case the right county)
3) continuing to add values to the dictionary on top of what we've already scraped


To address the first point, we'll create a counter variable that we set to 0 at the start of running the code. We'll then increment the value of the counter each time we go through a county in our dataset.

To address the second point, we need to think through how to get back to the state/county/zip code that we were on. To do this, we'll write a function that skips ahead `n` items in our state/county/zip code JSON file. We can use this to start where we left off if the counter is > 0, otherwise it'll just read in the whole file. 

In [42]:
import json  # harmless if utils_session5 already imported it

# Helper function that returns the flat list of [state, county, zip code]
# entries, skipping the first `counter` entries that were already scraped
def skip_counties(counter):
    # open JSON with state, county, and zip data
    with open('data/zip_data_small.json') as file:
        state_counties_zipcodes_temp = json.load(file)

    # turn the nested {state: {county: zip code}} dictionary into a flat list
    state_counties_zipcodes = []
    for state, counties in state_counties_zipcodes_temp.items():
        for county, zipcode in counties.items():
            state_counties_zipcodes.append([state, county, zipcode])

    # skip to the relevant county; when counter is 0 (the first run),
    # this slice returns the full list without skipping anything
    return state_counties_zipcodes[counter:]

[['il', 'Marion', '62801'],
 ['il', 'Kane', '60109'],
 ['ok', 'Pawnee', '74020'],
 ['ks', 'Harvey', '67020'],
 ['mo', 'Buchanan', '64401'],
 ['ga', 'Dooly', '31007'],
 ['sc', 'Abbeville', '29620'],
 ['co', 'Kit Carson', '80805'],
 ['md', 'Wicomico', '21801']]
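
To see what the flattening and skipping does without needing the data file, here is a small sketch on an in-memory dictionary. The function name `flatten_and_skip` and the three-entry `sample` are made up for illustration (the entries are borrowed from the output above), and the nested `{state: {county: zip code}}` layout is assumed from the loop in `skip_counties`.

```python
def flatten_and_skip(nested, counter):
    """Flatten {state: {county: zip}} into [state, county, zip] rows, then drop the first `counter` rows."""
    flat = [
        [state, county, zipcode]
        for state, counties in nested.items()
        for county, zipcode in counties.items()
    ]
    return flat[counter:]


sample = {
    "il": {"Marion": "62801", "Kane": "60109"},
    "ok": {"Pawnee": "74020"},
}

# counter = 0: nothing has been scraped yet, so we get every row
print(flatten_and_skip(sample, 0))
# counter = 2: the two Illinois counties are done, so we resume at Oklahoma
print(flatten_and_skip(sample, 2))
```

Because dictionaries keep their insertion order in Python 3.7+, the rows come out in the same order on every run, which is what makes "skip the first `counter` rows" a reliable way to resume.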


Now we're going to take the nested loop code that we wrote in session 4 and turn it into a function. The benefit of turning it into a function is that it can take a counter value as its input.

### TASK 2

1) Turn the code below into a function called `run_entire_loop` that takes a counter value as the input and returns the updated counter

2) Within the function, if the counter isn't at 0 (meaning that we're not at the beginning of the loop), we'll want to read in the `output.json` file and assign it to `premium_val_dict`. This lets us keep adding to the values that we've already scraped. Write code (or pseudocode) where you think this belongs.

3) Call the `skip_counties` function that we defined above to skip to the correct row from where we left off

4) Increase the value of counter when we loop through each county and print the value that we're on to the console

*NOTE: The code below is already indented to work if you add the function definition to the top row*

In [None]:
    
    url = "https://www.kff.org/interactive/subsidy-calculator/"
    service = Service(executable_path=ChromeDriverManager().install())
   
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    
    # NOTE: STILL NEED TO SAVE VALUES TO DICT
    premium_val_dict = {} # initialize empty dictionary to capture the scraped values
    # Set THRESHOLD at number of values you already have scraped + 1
    age_values = [14, 17, 20, 19, 39] # indexes for 14, 20, 40, and 60 years

 
    
    

 
 
    for i in range(len(state_counties_zipcodes)):
        # set the state as the top key in the dictionary
        state_dict = premium_val_dict.setdefault(state_counties_zipcodes[i][0], {})
        # initialize empty list to store premium plan values 
        premium_val_list = []
        # increment counter
    
        # print counter
      
        # setup top half of page
        setup_page(state = state_counties_zipcodes[i][0], zipcode = state_counties_zipcodes[i][2], driver = driver)
        
        for age in age_values:
                # scrape plan value
                number = scrape_data(age = age, driver = driver)
                # for each zipcode, create a list of all of the premium plan costs for each age
                # this will be saved with the zipcode key in the dictionary
                premium_val_list.append(number)
               
        # at the end of looping through all ages in the zip code add premium values to dictionary
        state_dict[state_counties_zipcodes[i][1]] = premium_val_list

        # Save the dictionary as a JSON file at the end of each loop
        output_filename = f'output.json'
        with open(output_filename, 'w') as json_file:
            json.dump(premium_val_dict, json_file, indent=2)  # 'indent' for pretty formatting (optional)
   

 

### Solution

In [52]:
def run_entire_loop(counter):
    
    url = "https://www.kff.org/interactive/subsidy-calculator/"
    service = Service(executable_path=ChromeDriverManager().install())
   
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    
    # NOTE: STILL NEED TO SAVE VALUES TO DICT
    premium_val_dict = {} # initialize empty dictionary to capture the scraped values
    # Set THRESHOLD at number of values you already have scraped + 1
    age_values = [14, 17, 20, 19, 39] # indexes for 14, 20, 40, and 60 years

    # If we're resuming (counter is not 0), load the values scraped so far
    if counter != 0:
        with open('output.json', 'r') as file:
            premium_val_dict = json.load(file)

    # Skip to the correct spot in the flat state/county/zip code list
    # (when counter is 0, this returns the full list)
    state_counties_zipcodes = skip_counties(counter)
 
 
    for i in range(len(state_counties_zipcodes)):
        # set the state as the top key in the dictionary
        state_dict = premium_val_dict.setdefault(state_counties_zipcodes[i][0], {})
        # initialize empty list to store premium plan values 
        premium_val_list = []
        # increment counter
        counter += 1
        # print counter
        print(counter)
        # setup top half of page
        setup_page(state = state_counties_zipcodes[i][0], zipcode = state_counties_zipcodes[i][2], driver = driver)
        for age in age_values:
                
                # scrape plan value
                number = scrape_data(age = age, driver = driver)

                # for each zipcode, create a list of all of the premium plan costs for each age
                # this will be saved with the county key in the dictionary
                premium_val_list.append(number)
               
        # at the end of looping through all ages in the zip code add premium values to dictionary
        state_dict[state_counties_zipcodes[i][1]] = premium_val_list

        # Save the dictionary as a JSON file at the end of each loop
        output_filename = 'output.json'
        with open(output_filename, 'w') as json_file:
            json.dump(premium_val_dict, json_file, indent=2)  # 'indent' for pretty formatting (optional)

    return counter
   

## Putting it all together 

In [54]:
if __name__ == '__main__':
    
   counter = 9
   
   run_entire_loop(counter)
      

10
State = md, zip Code =21801
age = 14
Found data download button now clicking it
Now scraping data
207.0
age = 17
Found data download button now clicking it
Now scraping data
239.0
age = 20
Found data download button now clicking it
Now scraping data
262.0
age = 19
Found data download button now clicking it
Now scraping data
346.0
age = 39
Found data download button now clicking it
Now scraping data
734.0
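
In the run above we typed the counter in by hand. One possible way to make the restart automatic is sketched below. Treat it as an illustration under assumptions, not the code used for the project: `count_scraped`, `run_with_retries`, and `toy_loop` are made-up names, and `toy_loop` is a stand-in for `run_entire_loop` that "crashes" once on purpose so the example runs without a browser. The sketch also assumes each state/county pair appears only once, so the number of counties saved in the output file equals the number of rows already finished.

```python
import json
import os
import tempfile
import time


def count_scraped(output_path):
    """Count the counties already saved in a {state: {county: values}} JSON file."""
    if not os.path.exists(output_path):
        return 0
    with open(output_path) as f:
        saved = json.load(f)
    return sum(len(counties) for counties in saved.values())


def run_with_retries(run_loop, output_path, max_attempts=3, wait_seconds=0):
    """Call run_loop(counter) until it finishes, resuming from the saved output each time."""
    for attempt in range(1, max_attempts + 1):
        counter = count_scraped(output_path)  # where did we leave off?
        try:
            return run_loop(counter)
        except Exception as err:
            print(f"Attempt {attempt} stopped: {err}")
            time.sleep(wait_seconds)  # pause before trying again
    raise RuntimeError(f"Gave up after {max_attempts} attempts")


# --- Toy stand-in for run_entire_loop (no browser needed) ---
rows = [["il", "Marion", "62801"], ["il", "Kane", "60109"], ["ok", "Pawnee", "74020"]]
output_path = os.path.join(tempfile.mkdtemp(), "output.json")
failures = {"remaining": 1}  # crash once to simulate an interruption


def toy_loop(counter):
    results = {}
    if counter != 0:
        with open(output_path) as f:
            results = json.load(f)
    for state, county, zipcode in rows[counter:]:
        if county == "Kane" and failures["remaining"] > 0:
            failures["remaining"] -= 1
            raise RuntimeError("simulated crash")
        results.setdefault(state, {})[county] = [zipcode]
        counter += 1
        with open(output_path, "w") as f:
            json.dump(results, f)
    return counter


final_counter = run_with_retries(toy_loop, output_path)
print(final_counter)
```

With the real scraper, the equivalent call would be something like `run_with_retries(run_entire_loop, 'output.json', wait_seconds=30)`, keeping in mind that `run_entire_loop` opens a new browser window each time it is called.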


## So Long, Farewell 

Hopefully you had fun these past few weeks and learned some new skills. It's been an absolute delight to have you all in class!

Remember: 

1) Understanding how to **think** about web scraping challenges is the most important step. It's all about the logic! 

2) When in doubt, break the task down into small steps.

3) It's going to take longer than you think, and there are going to be more bugs than you expect!

4) Do quality checks throughout and at the end! 
    - I didn't realize that when I was scraping values that had a comma, the first digit was getting dropped.
    - I assigned one zip code to every county. In about 30 counties, the zip code I assigned didn't have data but a different one did, so I had to manually assign those values to the state data frame.
    - Make sure to clear text boxes between iterations!
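
As an example of the first quality check, here is a minimal sketch of turning scraped text into a number while handling commas. The function name `parse_dollar_amount` and the sample strings are assumptions for illustration, since the exact text format returned by the page is not shown in this lesson.

```python
def parse_dollar_amount(text):
    """Convert text such as '$1,234' or '734' to a float, dropping '$' and thousands commas."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


print(parse_dollar_amount("$1,234"))  # 1234.0
print(parse_dollar_amount("734"))     # 734.0
```

If the text can't be read as a number at all, `float()` raises a `ValueError`, which is a helpful signal that the page returned something unexpected.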