# Scraping and Crawling Webpages using the Selenium Library

Brad Marx


- Introduction to Selenium
- Web Scraping Considerations
- Selenium Setup
- Components of a Selenium Web Scraper
- Demo

## Introduction to Selenium

### What is Selenium?
[Selenium](https://www.selenium.dev/) is a tool used to automate interactions with a web browser. 

While the library was originally built to automate web application testing, its functionality is comprehensive enough for almost any web scraping or browser interaction task.

### How Does Selenium Work?

Selenium opens a web page in a real web browser (like Chrome, Firefox, IE) and controls it programmatically. A developer may then interact with the web page as it would be displayed to a user browsing the internet.

Examples of interactions include (non-exhaustive):
- Clicking on buttons
- Executing JavaScript
- Filling out embedded web forms
- **Expanding drop-down menus**
- **Scraping rendered HTML**
- **Following hyperlinks to other pages**

The focus of this demo will be on the last three general uses.


## Web Scraping Considerations

### Ethical Considerations

One should consider the effect of their web scraping use case on the websites to be scraped and the data to be collected: 
- Make sure the web scraper is not going to tax the server(s) of any target websites.
    - Rapidly requesting resources (HTML, images, or other documents) from a site may strain the capabilities of their server and slow down (or even crash!) the website.
    - This is a greater risk for smaller sites with fewer resources available.
- Only scrape data intended for the public. Try to avoid collecting personally identifiable information from sites unless given explicit permission for the use case.
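
One simple way to avoid taxing a server is to enforce a minimum delay between consecutive requests. The sketch below is one hypothetical helper (the name `Throttle` and the one-second default are assumptions, not something the demo uses). It takes the clock and sleep functions as parameters so that it can be exercised without actually waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep just long enough to respect min_interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            elapsed = now - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record when the next request is actually allowed to go out.
        self._last_call = now + delay
        return delay
```

Calling `throttle.wait()` right before each `driver.get(...)` would then space out page loads by at least `min_interval` seconds.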


### Implementation Considerations

Before doing **any** web scraping, check if the data you are looking to collect is available in a public database or API endpoint! 

The functionality and sophistication of Selenium also make it a very **heavy-weight** library for web scraping. Opening and controlling a browser incurs additional costs in time and resources that simpler web scraping implementations would avoid. 

If one only needs to retrieve data embedded in the HTML of *simple* web sites, they should see if the BeautifulSoup and urllib/requests libraries alone would work for their use case. 
- This approach retrieves the page's HTML source (the markup from which the browser builds the main DOM, or [Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model)) directly from a web server as a string for parsing in BeautifulSoup.
- Skipping the overhead of first rendering a web page in a browser makes web scraping MUCH FASTER!
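
As a rough illustration of the browser-free approach, the sketch below extracts links from an HTML string without rendering anything. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the HTML snippet is a made-up sample rather than a page fetched with urllib/requests. The class name `LinkCollector` is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> in an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []
        self._href = None        # href of the <a> tag currently open, if any
        self._text_parts = []    # text seen inside that tag so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((text, self._href))
            self._href = None


# Made-up sample markup, standing in for a downloaded page.
sample_html = """
<html><body>
  <a href="/forecast">Forecast</a>
  <a href="/safety"> Safety </a>
  <p>No link here</p>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Because no browser is started, this kind of parsing only sees what is in the downloaded markup; content that JavaScript adds later will not be there.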

#### So, why even bother with Selenium for web scraping? 

Many modern websites have more sophisticated HTML/CSS/JavaScript that may render information in ways that cannot be accessed directly from the main DOM. For example: 
- Links from dropdown menus (or any HTML tag set to `[aria-expanded='false']`)
- Data in [Iframes](https://www.w3schools.com/html/html_iframe.asp) (embedded web pages in another web page)
- Dynamic text, or data that depends on the browser window size to appear

Any interaction with elements on a web page will also require Selenium or a similar browser automation tool.

## Selenium Setup

1. Install the Selenium library into your environment. Run `conda env create -f selenium.yml` in the root of this repository to create such an environment.
     

2. Install your browser of choice. I am using Chrome.

3. Install the required web driver.
   - Selenium requires a web driver executable to control the browser.
   - Managing the proper browser and web driver versions is a pain! We can use webdriver_manager to install and use the correct driver for our browser version.
   - *Note*: More recent versions of Selenium have a built-in 'Selenium Manager' that manages the driver for you behind the scenes. However, the functionality is a little finicky, so I am opting to use the webdriver_manager object in the demo to be more explicit.

In [1]:
# Import libraries
import numpy as np
import time

from bs4 import BeautifulSoup, NavigableString, CData, Tag

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

from webdriver_manager.chrome import ChromeDriverManager


In [2]:
# define service object that contains info on web driver
service = Service(ChromeDriverManager().install())

# Define options object that contains metadata on the browser. This can be customized in a number of ways (more on that later) 
options = webdriver.ChromeOptions()

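# Uncomment to run without displaying a browser.
# (The older `options.headless = True` setter was deprecated and later removed in Selenium 4.)
# options.add_argument("--headless=new")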

In [3]:

# Create the driver. This is the 'browser simulation' object! If this cell runs correctly, you should see a new empty window appear using your browser of choice.
driver = webdriver.Chrome(service=service, options=options)
driver.set_window_size(1600, 1600)

actions = ActionChains(driver)
wait = WebDriverWait(driver, 50)

## Components of a Selenium Web Scraper

#### Driver

#### ActionChains

ActionChains are a way to automate low-level interactions such as mouse movements, mouse button actions, key presses, and context menu interactions.
This is useful for more complex actions like hovering and drag and drop.

**Generate user actions.**

When you call methods for actions on the ActionChains object, the actions are stored in a queue in the ActionChains object.

When you call `perform()`, the events are fired in the order they were queued.

ActionChains can be used in a chain pattern:

```
menu = driver.find_element(By.CSS_SELECTOR, ".nav")

hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")

ActionChains(driver).move_to_element(menu).click(hidden_submenu).perform()
```

...Or actions can be queued up one by one, then performed:

```
menu = driver.find_element(By.CSS_SELECTOR, ".nav")
hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")

actions = ActionChains(driver)
actions.move_to_element(menu)
actions.click(hidden_submenu)
actions.perform()

```
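
The queue-then-perform behavior described above can be mimicked in plain Python. The toy class below is not Selenium code (`ToyActionQueue` and its methods are hypothetical). It only illustrates that chained calls record steps and that nothing runs until `perform()` is called. Emptying the queue after `perform()` is a choice made for this toy.

```python
class ToyActionQueue:
    """Toy stand-in for the ActionChains pattern: queue now, run on perform()."""

    def __init__(self) -> None:
        self._queue = []  # callables waiting to be run
        self.log = []     # record of the steps that have actually run

    def move_to(self, name: str) -> "ToyActionQueue":
        self._queue.append(lambda: self.log.append(f"move to {name}"))
        return self  # returning self is what allows chaining

    def click(self, name: str) -> "ToyActionQueue":
        self._queue.append(lambda: self.log.append(f"click {name}"))
        return self

    def perform(self) -> None:
        # Fire the queued steps in insertion order, then empty the queue.
        for step in self._queue:
            step()
        self._queue.clear()


chain = ToyActionQueue().move_to("menu").click("submenu")
print(chain.log)  # still empty: nothing has run yet
chain.perform()
print(chain.log)  # both steps, in the order they were queued
```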

#### Wait

## Selenium Demo 1: Navigating Between Pages through Links and Accessing Data from HTML

### Replace the string with a URL of your choosing! Remember the ethical considerations.

In [4]:
url_demo_1 = 'https://www.weather.gov/'
driver.get(url_demo_1)

### Take a look at the HTML of your page using driver.page_source

Pretty messy, huh? Good thing Selenium has methods designed to parse this efficiently! 

In [5]:
driver.page_source[:3000]


'<html xmlns="http://www.w3.org/1999/xhtml"><head>\n        <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/"><title>National Weather Service</title><meta name="DC.title" content="National Weather Service"><meta name="DC.description" content="NOAA National Weather Service National Weather Service"><meta name="DC.creator" content="US Department of Commerce, NOAA, National Weather Service"><meta name="DC.date.created" scheme="ISO8601" content=""><meta name="DC.language" scheme="DCTERMS.RFC1766" content="EN-US"><meta name="DC.keywords" content="weather, National Weather Service"><meta name="DC.publisher" content="NOAA\'s National Weather Service"><meta name="DC.contributor" content="National Weather Service"><meta name="DC.rights" content="http://www.weather.gov/disclaimer.php"><meta name="rating" content="General"><meta name="robots" content="index,follow">\n\n        <link href="/css/weatherstyle.css" rel="stylesheet" type="text/css">\n        <link href="/css/template.css" 

### First, let's find all links on this web page.

We access the web page via the driver. The WebElement objects returned don't do much on their own...

*Note*: 
- `find_elements` returns a list of `WebElement` objects, each representing an HTML element that matches the query. In this case, these are the elements with the anchor (link) tag, `a`. If nothing matches, the list is empty.
- There is another method called `find_element` (without the 's') that takes the same arguments but returns only the first matching WebElement, and raises a `NoSuchElementException` if there is no match.

In [6]:
links = driver.find_elements(By.TAG_NAME, "a")

links[:5]

[<selenium.webdriver.remote.webelement.WebElement (session="ec45f5d7ea95b99b68e80d829b661d87", element="331191365C9D6BE4B35A82B98A3975A9_element_36")>,
 <selenium.webdriver.remote.webelement.WebElement (session="ec45f5d7ea95b99b68e80d829b661d87", element="331191365C9D6BE4B35A82B98A3975A9_element_37")>,
 <selenium.webdriver.remote.webelement.WebElement (session="ec45f5d7ea95b99b68e80d829b661d87", element="331191365C9D6BE4B35A82B98A3975A9_element_38")>,
 <selenium.webdriver.remote.webelement.WebElement (session="ec45f5d7ea95b99b68e80d829b661d87", element="331191365C9D6BE4B35A82B98A3975A9_element_39")>,
 <selenium.webdriver.remote.webelement.WebElement (session="ec45f5d7ea95b99b68e80d829b661d87", element="331191365C9D6BE4B35A82B98A3975A9_element_40")>]

### But WebElements have helpful attributes we can access! 

Attributes can be accessed with the method `get_attribute`, and text can be accessed directly with `.text`

In [7]:

# Double quotes inside the f-string avoid a SyntaxError on Python versions before 3.12
print(f'URL of the first link: {links[0].get_attribute("href")}')
print(f'Text of the first link: {links[0].text}')

URL of the first link: http://www.weather.gov/
Text of the first link: 


### Let's only consider links with corresponding text

You can print out the link names and URLs if you want (it could be a long list, so let's only print 10).

In [8]:

visual_links = [elem for elem in links if len(elem.text) > 0]

print([(viz_lnk.text, viz_lnk.get_attribute('href')) for viz_lnk in visual_links][:10])

[('HOME', 'https://www.weather.gov/#'), ('FORECAST', 'http://www.weather.gov/forecastmaps'), ('PAST WEATHER', 'https://www.weather.gov/wrh/climate/'), ('SAFETY', 'http://www.weather.gov/safety'), ('INFORMATION', 'http://www.weather.gov/informationcenter'), ('EDUCATION', 'http://www.weather.gov/education'), ('NEWS', 'http://www.weather.gov/news/'), ('SEARCH', 'http://www.weather.gov/search'), ('ABOUT', 'http://www.weather.gov/about'), ('Location Help', "javascript:void(window.open('http://weather.gov/ForecastSearchHelp.html','locsearchhelp','status=0,toolbar=0,location=0,menubar=0,directories=0,resizable=1,scrollbars=1,height=500,width=530').focus());")]


### Now, let's pick a link we want to follow from the list above

First, index or hardcode your new URL here. Remember to specify the href string, not the WebElement object itself.

Then we call `driver.get` again with the new link and check the Selenium browser to see the new page!

In [10]:
link_to_follow = visual_links[1].get_attribute('href')

driver.get(link_to_follow)
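
Not every `href` is something `driver.get` can usefully load. The list printed above includes a `javascript:` link, for example. One possible way to filter candidates before following them is sketched below using only the standard library. The helper name `is_followable` is hypothetical, and the `javascript:` sample is a shortened stand-in.

```python
from typing import Optional
from urllib.parse import urlparse


def is_followable(href: Optional[str]) -> bool:
    """Return True for absolute http(s) URLs, False for anything else."""
    if not href:
        return False  # get_attribute('href') returns None when there is no href
    parsed = urlparse(href)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


candidates = [
    "http://www.weather.gov/forecastmaps",
    "javascript:void(0);",
    None,
]
print([href for href in candidates if is_followable(href)])
```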

### Let's look at the text from the new HTML

We can do this with BeautifulSoup.

In [11]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Remove the replace() method to see how the text looks with newlines
print(soup.get_text().replace('\n', ''))
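
Replacing newlines with an empty string can glue together words that were on separate lines. If that is a problem, one alternative is to collapse every run of whitespace into a single space, as in this small sketch (the helper name and the sample string are made up for illustration).

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs, and newlines with one space."""
    return " ".join(text.split())


raw = "Forecast\nMaps\n\n   National   Outlook\n"
print(raw.replace("\n", ""))     # words on adjacent lines get fused
print(collapse_whitespace(raw))  # words stay separated by single spaces
```

The same function could be applied to the string returned by `soup.get_text()`.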



## Selenium Demo 2: Scrape Class Info from CAB! 

**GOAL**: Look up a series of course codes in CAB and extract a dictionary of course information for each. Return a list of class dictionaries at the end. 

We can use Selenium functionality to interact with a dynamic web page, and scrape data that would not originally be displayed in the source HTML.

**WARNING**: When we work with elements on a web page, we need to make sure they are *loaded* before we try accessing them. Trying to access an element that has not loaded yet is a very common error that arises when using Selenium.

Possible solutions:
- Use the `wait` object to make the program stop until a certain element appears: `wait.until(EC.presence_of_element_located((By.<choose a selector to use>, "<target tag,class,etc.>")))`
- Use the `time` library to force the program to pause for a fixed number of seconds at a given point in the code: `time.sleep(<time in seconds>)` (This does NOT stop the browser from loading the elements while the program waits!)
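
Conceptually, an explicit wait polls a condition until it returns something truthy or a timeout expires. The plain-Python sketch below imitates that idea so it can be run without a browser. It is a simplified stand-in, not Selenium's implementation, and `poll_until` and its parameters are hypothetical names.

```python
import time


def poll_until(condition, timeout: float = 5.0, poll_interval: float = 0.1):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate an element that only "appears" on the third check.
attempts = {"count": 0}


def element_loaded():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(poll_until(element_loaded, timeout=2.0, poll_interval=0.01))
```

Unlike a fixed `time.sleep`, a polling wait returns as soon as the condition holds, so it only costs as much time as the page actually needs.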

In [14]:

def lookup_class(class_code: str, driver, actions, wait) -> None:
    ### Moving to the search text box and selecting it ###
    # Find the text box we will use to type in our search query
    search_box = driver.find_element(By.CSS_SELECTOR, "[id='crit-keyword']")
    
    # Define the ActionChain. This will simulate moving your cursor to the search box and clicking on it
    mouse = actions.move_to_element(search_box)
    mouse.click(on_element=search_box)
    search_box.clear()
    ### Typing in a query and clicking enter ###
    mouse.send_keys(class_code)
    mouse.send_keys(Keys.ENTER)
    mouse.perform()
    
    # Make sure the web page was updated before moving on
    time.sleep(2)

    ### Moving to the first result and clicking it ###
    first_result = driver.find_element(By.CSS_SELECTOR, "[class='result result--group-start']")
    mouse.move_to_element(first_result)
    mouse.click(on_element=first_result)
    mouse.perform()
    

def scrape_class_info(driver, wait) -> dict:
    ### Scraping class information of interest ###
    class_detail_dict = {}

    # Make sure the web page was updated before moving on
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[class*='course-code']")))
    # Collect the course code and course name
    class_detail_dict['Course Code'] = driver.find_element(By.CSS_SELECTOR,  "[class*='course-code']").text
    class_detail_dict['Course Name'] = driver.find_element(By.CSS_SELECTOR,  "[class*='title text']").text
    
    # Iterate through each text section to extract course details
    # The syntax [class^='section section'] with the '^' means to find every HTML element that has a class attribute STARTING with 'section section'
    for course_detail_sect in driver.find_elements(By.CSS_SELECTOR,  "[class^='section section']"):
        
        sect_title = course_detail_sect.find_element(By.CSS_SELECTOR, "[class='section__title']")
        sect_text = course_detail_sect.find_element(By.CSS_SELECTOR, "[class='section__content']")
        
        class_detail_dict[sect_title.text] = sect_text.text
    
    return class_detail_dict

def reset_search(driver, actions) -> None:
    ### Find and click the button that resets the search ###
    reset_box = driver.find_element(By.CSS_SELECTOR, "[data-action='reset-search']")
    mouse = actions.move_to_element(reset_box)
    mouse.click()
    mouse.perform()
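
The attribute selectors used in the functions above differ only in how they compare the attribute value: `=` requires an exact match, `^=` a prefix match, and `*=` a substring match. The plain-Python sketch below mirrors those three rules on strings. The function name and the sample class values are hypothetical, and this illustrates the matching rules only, not how a browser evaluates selectors.

```python
def attr_matches(value: str, operator: str, target: str) -> bool:
    """Mimic CSS attribute-selector matching on a plain string."""
    if operator == "=":   # [attr='target']: exact match on the whole value
        return value == target
    if operator == "^=":  # [attr^='target']: value starts with target
        return bool(target) and value.startswith(target)
    if operator == "*=":  # [attr*='target']: value contains target
        return bool(target) and target in value
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical class attribute values, for illustration only.
print(attr_matches("section section--description", "^=", "section section"))
print(attr_matches("detail-course-code", "*=", "course-code"))
print(attr_matches("result result--group-start", "=", "result"))
```

Note that `=` compares against the entire attribute string, so `[class='result result--group-start']` matches only elements whose `class` attribute is exactly that text.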



In [15]:
url = 'https://cab.brown.edu/'
# Load the URL on the Selenium browser instance
driver.get(url)

schedule = ['DATA2020', 'CS2470', 'DATA2050', 'baddata']
class_info_list = []

for class_code in schedule:
    try:
        lookup_class(class_code=class_code, driver=driver, actions=actions, wait=wait)
    except Exception:  # e.g. NoSuchElementException when the search returns no results
        print(f'Unable to find class {class_code}!')
        continue
    class_data = scrape_class_info(driver=driver, wait=wait)

    reset_search(driver=driver, actions=actions)

    class_info_list.append(class_data)

driver.quit()
class_info_list

Unable to find class baddata!


[{'Course Code': 'DATA 2020',
  'Course Name': 'Statistical Learning',
  'Course Description': 'A modern introduction to inferential methods for regression analysis and statistical learning, with an emphasis on application in practical settings in the context of learning relationships from observed data. Topics will include basics of linear regression, variable selection and dimension reduction, and approaches to nonlinear regression. Extensions to other data structures such as longitudinal data and the fundamentals of causal inference will also be introduced.',
  'Registration Restrictions': 'Prerequisites: DATA 1010 or 1030.\nEnrollment limited to students in the Data Science (SCM) program.\nEnrollment is limited to Graduate level students.\nStudents in the Remote Study / Study Abroad cohort may not enroll.',
  'Final Exam': "No final exam has been scheduled for this course by the department through the registrar's office. Please consult syllabus or contact instructor.\nIf an exam we