# Web Scraping the MSA through Browser Automation

This notebook shows how we scraped historical data from the Maryland State Archives (MSA) Legacy of Slavery website, a set of online databases that present and describe records documenting the lives of enslaved people in Maryland.

## Browser Automation through Selenium

Selenium is a free browser automation and website testing toolkit with client libraries for several programming languages, including Python. Selenium automates the interactions between a real browser and a website. Web application developers often use it for automated testing of websites. In our case we used it to automate the steps required to gather information from an MSA database.

[Selenium](https://www.selenium.dev/documentation/getting_started/) requires three tools to be installed before we can use it from Python code:

* A browser, such as Firefox or Chrome
* A Selenium "browser driver" for the browser above
* The Selenium Python library

If you are used to running notebooks on a remotely hosted web server, this notebook is different. You will want to run it locally so that it can access your browser and the required "browser driver". When you run the browser automation script in this notebook, you will actually see your browser open the website and take the actions that you have scripted. This makes for an interesting learning experience and can make it easier to spot problems in your own scripts.

## Website "Hooks" and "Anchors"

To be able to navigate the web through a script, we need to describe desired behavior in a way that the computer can clearly interpret. In the case of the MSA site, the behaviors needed consist of finding and clicking on links. If you can tell Selenium how to find a link within a page, then clicking on it is as simple as calling a "click()" method. So the main thing to understand first is how Selenium finds the components on a web page. For details on the subject, please refer to the [Selenium documentation](https://www.selenium.dev/documentation/webdriver/locating_elements/).

In our script we used three methods to get to the first page of data:

1. ```driver.find_element(By.NAME, 'ctl00$main$btnTabCollections').click()```
1. ```driver.find_element(By.LINK_TEXT, collection).click()```
1. ```driver.find_element(By.ID, 'main_rblDisplayMode_1').click()```

The locator strategies above are fairly descriptive: they use the element name, link text, and ID to find the right part of the page to click on. "collection" above is a variable, which is set to the name of the database we wish to scrape. In this project it was set to "Domestic Traffic Ads". After these three steps, the browser will reach the "details" view for the first record in that collection, as seen below.

![Screen capture of browser showing a database details page for the first record.](./first_detail_page.png "First record details view")
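
To make the three locator strategies concrete, the sketch below applies the same matching rules (the `name` attribute, the exact link text, and the `id` attribute) to a small piece of made-up markup, using only Python's standard library. The markup, the `LocatorDemo` class, and the `/collection/1` link are assumptions for illustration; the real MSA page is more complex, and Selenium performs this matching inside the browser.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modeled on the three locators used above.
SAMPLE_HTML = """
<input type="submit" name="ctl00$main$btnTabCollections" value="Collections">
<a href="/collection/1">Domestic Traffic Ads</a>
<input type="radio" id="main_rblDisplayMode_1" value="Details">
"""

class LocatorDemo(HTMLParser):
    """Records which tags carry a given name, id, or link text."""

    def __init__(self):
        super().__init__()
        self.by_name = {}       # name attribute -> tag
        self.by_id = {}         # id attribute -> tag
        self.by_link_text = {}  # visible link text -> href
        self._open_link = None  # href of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'name' in attrs:
            self.by_name[attrs['name']] = tag
        if 'id' in attrs:
            self.by_id[attrs['id']] = tag
        if tag == 'a':
            self._open_link = attrs.get('href')

    def handle_data(self, data):
        # Text found while an <a> is open is that link's visible text.
        if self._open_link is not None and data.strip():
            self.by_link_text[data.strip()] = self._open_link

    def handle_endtag(self, tag):
        if tag == 'a':
            self._open_link = None

demo = LocatorDemo()
demo.feed(SAMPLE_HTML)
print(demo.by_name, demo.by_link_text, demo.by_id)
```

Each dictionary is keyed by the value you would pass to the corresponding Selenium locator.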

## Capturing Page Data

Now that we have our first record displayed on a page in the browser, we need to read the data in and then write it out to a comma-separated file. Fortunately, the MSA site displays the data in a structured way: a table in which each row holds one label and its value. We can use that structure to capture the data cleanly, row by row.

First we need to find the table itself within the page:

    table = driver.find_element(By.CSS_SELECTOR, 'span#main_lblDetails + table')


Then we need to get a list of all of the table row elements:

    trs = table.find_elements(By.TAG_NAME, 'tr')

The table rows are sent to a helper function that encapsulates the job of writing each record to CSV.

In [2]:
def save_record(pageno, trs):
    global headers_written, writer  # Both are defined in the main script.
    if not headers_written:
        # On the first call, write a header row. Labels are in <b> tags.
        writer.writerow([x.find_element(By.TAG_NAME, 'b').text for x in trs])
        headers_written = True
    # Values are in the second <td> tag on each row.
    data = [x.find_elements(By.TAG_NAME, 'td')[1].text for x in trs]
    try:
        # The seventh row holds a hyperlink, so we keep only its URL.
        data[6] = trs[6].find_element(By.TAG_NAME, 'a').get_attribute('href')
    except NoSuchElementException:
        pass  # No link on this record; keep the plain text value.
    writer.writerow(data)

The save_record() function uses a Python feature called a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to unpack the data within the table rows. A list comprehension creates a new list from an existing one, while doing some processing on each element of the original list.

```
data = [x.find_elements(By.TAG_NAME, 'td')[1].text for x in trs]
```

The code above will make a list of the text values of the second "td" tag in each row (x) in the list of rows (trs). This syntax is much more compact and less error-prone than looping through the list in other ways.
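
As a minimal illustration of that equivalence, the sketch below builds the same list twice, once with an explicit loop and once with a list comprehension. The `rows` data is a made-up stand-in: plain (label, value) tuples rather than Selenium row elements.

```python
# Hypothetical stand-in for table rows: (label, value) pairs, not Selenium elements.
rows = [("Field A", " alpha "), ("Field B", "beta"), ("Field C", " gamma")]

# Explicit loop: create an empty list, then append one processed value per row.
values_loop = []
for row in rows:
    values_loop.append(row[1].strip())

# List comprehension: the same result in a single expression.
values_comp = [row[1].strip() for row in rows]

print(values_loop == values_comp)
```

Both versions take the second item of each row and strip the surrounding whitespace; the comprehension simply removes the bookkeeping.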

The "writer" variable in the code above is the CSV file writer. It writes any list to the next line in a CSV file. In the first pass through this function it will also write a line of headers to CSV, based on the bolded text on each table row.
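
The header-once pattern can be tried without a browser. The sketch below is an assumption-laden stand-in, not the notebook's code: it writes to an in-memory buffer instead of a file, and the `write_record` helper, the `state` dictionary, and the field values are all hypothetical.

```python
import csv
import io

def write_record(writer, labels, values, state):
    """Write the header row on the first call only, then one data row per call."""
    if not state["headers_written"]:
        writer.writerow(labels)
        state["headers_written"] = True
    writer.writerow(values)

buffer = io.StringIO()  # In-memory stand-in for the CSV file.
writer = csv.writer(buffer)
state = {"headers_written": False}

labels = ["Field A", "Field B"]
write_record(writer, labels, ["alpha", "beta, with comma"], state)
write_record(writer, labels, ["gamma", "delta"], state)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the CSV writer quotes the value containing a comma, which is one reason to use the `csv` module rather than joining strings by hand.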

## Reaching the Next Page

After writing out the first record, we need to click on a "next" link to reach the next record page. This is straightforward enough with another find and click operation.

```        
try:
    pageno += 1
    driver.find_element(By.ID, "main_imgButtonNext").click()
except NoSuchElementException:
    break
```

On the very last record page, the next button is grayed out and has a different element ID. The find call above therefore raises a "NoSuchElementException", meaning that the requested element was not found. The "break" statement then stops the loop that pages through the records.
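
The control flow of that loop can be sketched without Selenium. In the example below, `FakePager` and `NoNextPage` are hypothetical stand-ins for the browser and for `NoSuchElementException`; only the shape of the loop mirrors the script.

```python
class NoNextPage(Exception):
    """Hypothetical stand-in for Selenium's NoSuchElementException."""

class FakePager:
    """Hypothetical pager over a fixed number of records."""

    def __init__(self, total):
        self.total = total
        self.current = 1

    def click_next(self):
        # On the last record there is no usable "next" button to find.
        if self.current >= self.total:
            raise NoNextPage("no next button on the last record")
        self.current += 1

def visit_all(pager):
    """Process each record, then move on until the next button is missing."""
    visited = []
    while True:
        visited.append(pager.current)  # Stand-in for saving the record.
        try:
            pager.click_next()
        except NoNextPage:
            break
    return visited

pages = visit_all(FakePager(total=3))
print(pages)
```

Because the record is saved before the click is attempted, the last record is still processed before the loop ends.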

## Waiting for Data

The MSA database website loads its data dynamically with JavaScript: the page may not reload completely, and even when it does, the data is fetched and inserted into the table separately. This means that if we try to read the table immediately after clicking the "next" link, the data may not have arrived yet. Scripts run very fast and websites sometimes respond slowly, so the script may need to wait until the data has loaded. The Selenium documentation includes many examples of waiting for pages based on certain conditions. The code below waits for a particular page element to be present in the page:

````
element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "main_lblDetails")))
````

The script pauses until the element with the ID "main_lblDetails" is present, then continues. This is the label element that the data table immediately follows on the MSA site. If the element does not appear within 10 seconds, the wait function raises a "TimeoutException" and the script stops. (A timeout is unusual and may indicate that the target website is responding very slowly.)
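
Conceptually, an explicit wait is a polling loop: check a condition, sleep briefly, and give up after a deadline. The sketch below is a simplified illustration of that idea in plain Python, not Selenium's implementation; `wait_until`, `WaitTimeout`, and the simulated `data_has_arrived` check are hypothetical names.

```python
import time

class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Simulated page: the data only "arrives" on the third check.
checks = {"count": 0}

def data_has_arrived():
    checks["count"] += 1
    return "record" if checks["count"] >= 3 else None

found = wait_until(data_has_arrived, timeout=2.0, poll_interval=0.01)
print(found, checks["count"])
```

The wait returns as soon as the condition succeeds, so a generous timeout costs nothing when the site responds quickly.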

We also use another wait condition to ensure that we are looking at a new page record and not the same record we just processed:

````
element = WebDriverWait(driver, 10).until(
                EC.staleness_of(staleness_check))
````

Here the script will wait until the first table row element from the previous record becomes "stale" before processing the data for the next record. "Stale" is a condition that is defined in Selenium and it means that a DOM element is no longer a part of the current browser page.
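
The reason a presence check alone is not enough can be shown with a toy model. In the sketch below, `FakeRow`, `FakePage`, and `is_stale` are hypothetical stand-ins: the page always has a table, both before and after navigation, but the row object we kept from the old record is no longer part of the page once the new record loads.

```python
class FakeRow:
    """Hypothetical stand-in for a table row element."""

    def __init__(self, text):
        self.text = text

class FakePage:
    """Hypothetical page whose rows are replaced when a new record loads."""

    def __init__(self):
        self.rows = [FakeRow("record 1")]

    def load_next(self):
        # New objects replace the old ones, as when the table is rebuilt.
        self.rows = [FakeRow("record 2")]

def is_stale(page, row):
    """True once the given row object is no longer part of the page."""
    return all(row is not current for current in page.rows)

page = FakePage()
old_row = page.rows[0]
table_present_before = len(page.rows) > 0
stale_before = is_stale(page, old_row)

page.load_next()
table_present_after = len(page.rows) > 0
stale_after = is_stale(page, old_row)

print(table_present_before, table_present_after, stale_before, stale_after)
```

The "table present" check passes in both states, while the staleness check only passes after the new record has replaced the old one.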

## Putting it all Together

The complete script is included below and as a separate Python script file "scraper.py". If you have Selenium set up correctly, you should be able to watch as this program gathers MSA site data. You might try changing the collection variable to gather data from a different MSA collection.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
import csv

collection = 'Domestic Traffic Ads'
csvfile = open(collection+'.csv', 'w', newline='')
writer = csv.writer(csvfile)
headers_written = False

def save_record(pageno, trs):
    global headers_written, writer
    if not headers_written:
        writer.writerow([x.find_element(By.TAG_NAME, 'b').text for x in trs])
        headers_written = True
    data = [x.find_elements(By.TAG_NAME, 'td')[1].text for x in trs]
    try:
        data[6] = trs[6].find_element(By.TAG_NAME, 'a').get_attribute('href')
    except NoSuchElementException:
        pass
    writer.writerow(data)

driver = webdriver.Firefox()
driver.get("http://slavery2.msa.maryland.gov/pages/Search.aspx")
driver.find_element(By.NAME, 'ctl00$main$btnTabCollections').click()
driver.find_element(By.LINK_TEXT, collection).click()
driver.find_element(By.ID, 'main_rblDisplayMode_1').click()

pageno = 1
staleness_check = None
while True:
    # Uncomment the next two lines to stop after the first record while testing.
    #if pageno > 1:
    #    break
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "main_lblDetails")))
        if staleness_check is not None:
            element = WebDriverWait(driver, 10).until(
                EC.staleness_of(staleness_check))
        table = driver.find_element(By.CSS_SELECTOR, 'span#main_lblDetails + table')
        trs = table.find_elements(By.TAG_NAME, 'tr')
        staleness_check = trs[0]
        save_record(pageno, trs)
        try:
            pageno += 1
            driver.find_element(By.ID, "main_imgButtonNext").click()
        except NoSuchElementException:
            break
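    except TimeoutException:
        # pageno is an integer, so convert it before joining it to the message.
        print('timeout waiting for page ' + str(pageno))
        break  # Leave the loop so the file and browser are closed below.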

csvfile.close()
driver.quit()  # End the browser session and stop the driver process.

## Video of this Notebook Running

I've made a video of this notebook running on my own computer, since it will not run on many common hosted Jupyter environments and requires installation of the Selenium software as well. In the video you can see how it looks when the browser is automated by a script.

[Video Demonstration](https://youtu.be/-QWvWrclfgk)