# Scraping dynamic websites with Selenium

To install Selenium for Python with conda, run `conda install -c conda-forge selenium`.

The driver for Firefox (geckodriver) is available on Mozilla's GitHub page: https://github.com/mozilla/geckodriver/releases

Some helpful documentation: https://pypi.org/project/selenium/ (also links to other drivers) and https://selenium-python.readthedocs.io/ (not the official documentation)

#### Scraping available programmes at the LSE as an example

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
import time

In [None]:
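# Headless mode: set to True to run Firefox without a visible window
headless = False
options = Options()
if headless:
    options.add_argument("-headless") # replaces the deprecated options.headless attribute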

In [None]:
# Linking the driver (file in the same folder here)
service = Service("./geckodriver")
driver = webdriver.Firefox(options=options, service=service)

The following code scrapes all programmes that the website associates with a search term. Ideally it would be wrapped in a function, but it is left as a simple code block in this notebook to make exploring the objects easier. Also note that for larger scraping projects the XPaths and other locators can be stored once in a dictionary and looked up from there.
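As a rough illustration of the dictionary idea, the sketch below collects the three locators used in this notebook in one place. The names `LOCATORS` and `locate` are hypothetical helpers introduced here, not part of Selenium. The strategy strings `"xpath"` and `"class name"` are the values of Selenium's `By.XPATH` and `By.CLASS_NAME` constants, so the sketch runs without importing Selenium.

```python
# Hypothetical central store for the locators used in this notebook.
# "xpath" and "class name" are the string values of By.XPATH and By.CLASS_NAME.
LOCATORS = {
    "search_box": ("xpath", '//*[@id="coursesSearch"]'),
    "programme_title": (
        "xpath",
        "/html/body/form/main/div/div/div/div[2]/div[1]/article[{i}]/a/header",
    ),
    "next_button": ("class name", "pagination__link--next"),
}


def locate(name, **params):
    """Return a (strategy, selector) tuple, filling placeholders such as {i}."""
    by, selector = LOCATORS[name]
    return by, selector.format(**params)
```

With a running driver, the tuple can be unpacked directly into the lookup call, for example `driver.find_elements(*locate("programme_title", i=3))`. If the website changes its layout, only the dictionary needs to be updated.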

In [None]:
# Adjust for other terms
search_term = "Mathematics"


# List to store programme names
all_programme_names = []

# Go to the website
driver.get("https://www.lse.ac.uk/Programmes/Search-Courses")
time.sleep(1)

# Search for programmes
search_box = driver.find_element(By.XPATH, '//*[@id="coursesSearch"]')
search_box.clear() # clear box first
time.sleep(1)
search_box.send_keys(search_term) # enter term
time.sleep(1)
search_box.send_keys(Keys.RETURN) # press enter
time.sleep(4) # must be long enough for the results page to load,
# otherwise no elements may be found

# Loop over result pages
continue_flag = True
while continue_flag:

    # Up to 10 programme names per page
    for i in range(1, 11):

        # The method find_elements() returns an empty list if no element is found
        # instead of raising an error; this matters here because the final page
        # will often have fewer than 10 elements
        elements_found = driver.find_elements(By.XPATH, f"/html/body/form/main/div/div/div/div[2]/div[1]/article[{i}]/a/header")
        if len(elements_found) == 0:
            break # end of final page, break loop over programme titles
        elif len(elements_found) > 1:
            raise ValueError("XPath returned more than one programme; check XPaths")
        else:
            all_programme_names.append(elements_found[0].text)

    # Scroll to end of page (where the 'next' button is located); once per page is enough
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    next_button_list = driver.find_elements(By.CLASS_NAME, "pagination__link--next") # XPath does not locate this button as consistently
    if len(next_button_list) == 0: # no next button anymore / last page
        continue_flag = False
    else:
        next_button_list[0].click()
        time.sleep(4) # give the next page time to load before reading it

# Remove duplicates while keeping the order in which the programmes appeared
all_programme_names = list(dict.fromkeys(all_programme_names))
all_programme_names
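The fixed `time.sleep()` calls above either waste time or turn out too short when the page loads slowly. A common alternative is to poll until the wanted elements appear. The sketch below shows this idea in plain Python so that it can be tried without a browser; `wait_until` and its parameters are hypothetical names introduced here. Selenium itself ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Because `find_elements()` returns an empty list while nothing matches, it can serve directly as the condition, for example `wait_until(lambda: driver.find_elements(By.XPATH, '//*[@id="coursesSearch"]'))`.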

In [None]:
# Quit driver at the end
driver.quit()
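If the scraping code raises an error before this last cell is run, the browser window stays open. Once the scraping code is wrapped in a function, one possible pattern is to quit the driver in a `finally` clause. The sketch below assumes a hypothetical helper `run_and_quit` and a hypothetical `task` function that takes the driver as its only argument.

```python
def run_and_quit(driver, task):
    """Run task(driver) and always quit the driver afterwards, even on errors."""
    try:
        return task(driver)
    finally:
        # Executed whether task() returned normally or raised an exception
        driver.quit()
```

The return value of `task` is passed through, so a scraping function that returns the list of programme names can be called as `names = run_and_quit(driver, scrape)`, where `scrape` stands for that function.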