# Scraping and Crawling Webpages using the Selenium Library

Brad Marx


- Introduction to Selenium
- Web Scraping Considerations
- Selenium Setup
- Components of a Selenium Web Scraper
- Demo

## Introduction to Selenium

###  What is Selenium?
[Selenium](https://www.selenium.dev/) is a tool used to automate interactions with a web browser. 

While the library's initial purpose was to automate web application testing, the functionality it offers is comprehensive enough to find use in almost any web scraping/interaction use case.

### How Does Selenium Work?

Selenium opens a web page in a web browser (like Chrome, Firefox, or IE) that it controls programmatically. A developer can then interact with the page as it would be displayed to a user browsing the internet.

Examples of interactions include (non-exhaustive):
- Clicking on buttons
- Executing JavaScript
- Filling out embedded web forms
- **Expanding drop-down menus**
- **Scraping rendered HTML**
- **Following hyperlinks to other pages**

The focus of this demo will be on the last three general uses.


## Web Scraping Considerations

### Ethical Considerations

One should consider the effect of their web scraping use case on the websites to be scraped and the data to be collected: 
- Make sure the web scraper is not going to tax the server(s) of any target websites.
    - Rapidly requesting resources (HTML, images, or other documents) from a site may strain the capabilities of their server and slow down (or even crash!) the website.
    - This is more common for smaller sites with fewer resources available.
- Only scrape data intended for the public. Try to avoid collecting personally identifiable information from sites unless given explicit permission for the use case.
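One simple way to avoid taxing a server is to enforce a minimum pause between requests to the same host. The sketch below is only one possible approach: the class name `PoliteThrottle` and the two-second default are assumptions for illustration, not recommendations from any particular site.

```python
import time
from urllib.parse import urlsplit


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds  # assumed default; tune per site
        self._clock = clock      # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until min_delay_seconds have passed since the last request to this host."""
        host = urlsplit(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_delay_seconds - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request[host] = self._clock()
```

Calling `throttle.wait(url)` right before each `driver.get(url)` keeps the request rate bounded without changing the rest of the scraper.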


### Implementation Considerations

The functionality and sophistication of Selenium also make it a very **heavy-weight** library for web scraping. Opening and controlling a browser incurs additional costs in time and resources that simpler web scraping implementations would avoid.

If one only needs to retrieve data embedded in the HTML of *simple* web sites, they should see if the BeautifulSoup and `urllib`/`requests` libraries alone would work for their use case.
- This approach simply retrieves the page source (the HTML and JavaScript from which a browser builds the main DOM, or [Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model)) directly from a web server as a string for parsing in BeautifulSoup.
- Skipping the overhead of first rendering a web page in a browser makes web scraping MUCH FASTER!
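A minimal sketch of this browser-free approach might look like the following. It uses `urllib` from the standard library for the download and BeautifulSoup for the parsing. The helper names `fetch_html` and `extract_links`, the user agent string, and the inline sample page are all invented for illustration.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download the raw HTML of a page as a string, without rendering it in a browser."""
    request = Request(url, headers={"User-Agent": "selenium-demo-example"})  # hypothetical user agent
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


# Parsing works on any HTML string; a small inline sample stands in for a downloaded page.
sample_html = """
<html><body>
  <h1>Example</h1>
  <a href="/about">About</a>
  <a href="https://example.com/contact">Contact</a>
  <a>No link here</a>
</body></html>
"""
links = extract_links(sample_html)
```

In real use, `extract_links(fetch_html(url))` would replace the inline sample. Anything the page builds with JavaScript after loading will be missing from the result, which is where Selenium comes in.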

#### So, why even bother with Selenium for web scraping? 

Many modern websites have more sophisticated HTML/CSS/JavaScript that may render information in ways that cannot be accessed directly from the main DOM. For example: 
- Links from dropdown menus (or any HTML tag set to `[aria-expanded='false']`)
- Data in [Iframes](https://www.w3schools.com/html/html_iframe.asp) (embedded web pages in another web page)
- Dynamic text, or data that depends on the browser window size to appear

Any interaction with elements on a web page will also require Selenium or a similar browser automation tool.

## Selenium Setup

1. Install the Selenium library into your environment. Run `conda env create -f selenium.yml` in the root of this repository to create such an environment.
     

2. Install your browser of choice. I am using Chrome.

3. Install the required web driver.
   - Selenium requires a web driver executable to use in simulating a browser.
   - Managing the proper browser and webdriver versions is a pain! We can use webdriver_manager to install and use the correct driver for our browser version.
   - *Note*: More recent versions of Selenium have a built-in 'Selenium Manager' that manages the driver for you behind the scenes. However, the functionality is a little finicky, so I am opting to use the webdriver_manager object in the demo to be more explicit.

In [1]:
# Import libraries
import numpy as np

from bs4 import BeautifulSoup, NavigableString, CData, Tag

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

from webdriver_manager.chrome import ChromeDriverManager


In [5]:
# Define the service object, which holds information on the web driver executable
service = Service(ChromeDriverManager().install())
# Define the options object, which holds settings for the browser. This can be customized in a number of ways (more on that later)
options = webdriver.ChromeOptions()

# Create the driver. This is the 'browser simulation' object! If this cell runs correctly, you should see a new empty window appear using your browser of choice.
driver = webdriver.Chrome(service=service, options=options)

## Components of a Selenium Web Scraper


## Selenium Demo: 

In [7]:
driver.quit()