# Lecture 5.7 - Web Scraping with Selenium

In [1]:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get('https://wsu-datascience.github.io/binomial_simulation/index.html')
binom_sim = BeautifulSoup(r.content, 'html.parser')

In [2]:
binom_sim

<!DOCTYPE html>

<html lang="en">
<head>
<title>Binomial Simulation</title>
</head>
<link crossorigin="anonymous" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css" integrity="sha384-wITovz90syo1dJWVh32uuETPVEtGigN07tkttEqPv+uR2SE/mbQcG7ATL28aI9H0" rel="stylesheet"/>
<style>
    /* LaTeX display environment will effect the LaTeX characters but not the layout on the page */
    span.katex-display {
      display: inherit; /* You may comment this out if you want the default behavior */
    }
  </style>
<script crossorigin="anonymous" integrity="sha384-/y1Nn9+QQAipbNQWU65krzJralCnuOasHncUFXGkdwntGeSvQicrYkiUBwsgUqc1" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script crossorigin="anonymous" integrity="sha384-dq1/gEHSxPZQ7DdrM82ID4YVol9BYyU7GbWlIwnwyPzotpoc57wDw/guX8EaYGPx" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<!-- These scripts link to the Vega-Lite runtime -->
<script src="h

## <font color="red"> Exercise 1 </font>

Load the [binomial simulation app](https://wsu-datascience.github.io/binomial_simulation/index.html) in a browser and inspect some elements.  Verify that our usual technique (`requests` plus `BeautifulSoup`, as in the cells above) fails to load the elements you find on the page.

## Why use `selenium`

`selenium` allows us to interact with the web page by

1. Clicking on non-HTML elements on the page
2. Filling in forms
3. Waiting for elements to load
4. Taking screenshots of the current state
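
Item 3 matters because JavaScript-driven pages fill in their content after the initial load.  Selenium provides `WebDriverWait` for this purpose; the sketch below is not that class but a hand-rolled polling loop that shows the underlying idea.  The helper name `wait_until` and its parameters are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With a live `driver`, a call such as `wait_until(lambda: driver.find_elements_by_class_name('some-class'))` (where `some-class` is a placeholder) would block until at least one matching element exists, because the plural finders return an empty, falsy list when nothing matches.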

## Installation -- Installing the browser driver

First, we need to install a driver that allows Python to control your browser.  Here we will be using Google Chrome and the [Chrome Driver](https://chromedriver.chromium.org/getting-started).  Be sure to install this driver (or the [one for your favorite browser](https://selenium-python.readthedocs.io/installation.html#drivers)).

## Installation -- Installing `selenium`

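Next, we will use `pip` to install the `selenium` package.  The examples in this lecture were run with version 3.141.0 (see the output below); Selenium 4 replaced the `find_element_by_*` helpers used here with `find_element(By.ID, ...)`-style calls.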

In [2]:
!pip install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


## Starting a `selenium` session

Note that this will pop up a Chrome window that is controlled by your Python `driver` object.  **Don't close this window.**

In [3]:
from selenium import webdriver

DRIVER_PATH = '/mnt/c/Users/le7858ey/Desktop/chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://duckduckgo.com/')

In [5]:
driver.page_source

--ad:focus {border-color: #eeeeee;}.badge-link__title,.badge-link__bullets {color: #eeeeee;}.faq__title,.faq__question,.faq__answer,.is-atb-fullbleed .badge-link__title,.is-atb-fullbleed .badge-link__subtitle,.is-atb-fullbleed .badge-link__bullet,.is-atb-fullbleed .badge-link__rating,.is-atb-fullbleed .content-info__title,.is-atb-fullbleed .content-info__item,.is-atb-fullbleed .badge-link__rating__star--third::before ,.is-atb-fullbleed .badge-link__rating__star--half::before {color: #eeeeee !important;}.is-atb-fullbleed.dark-bg .badge-link__rating__star--empty {color: #161616;}.is-atb-fullbleed.is-bb-exp.dark-bg .badge-link__rating__star--empty {color: #444 !important;}.is-atb-fullbleed .content-info__item__link {color: #678FFF;}.settings-page-wrapper .set-side a,.settings-page-wrapper .set-side a:hover,.settings-page-wrapper .set-side a:active {color: #eeeeee;text-decoration: underline;}.result--sep--hr:before,.result--more.is-loading,.feedback--button.btn--wire,.place-list-item:hover

In [6]:
driver.title

'DuckDuckGo — Privacy, simplified.'

In [7]:
driver.current_url

'https://duckduckgo.com/'

## Locating the First Element

Elements can be located by

* Tag name
* Class name
* IDs
* XPath
* CSS selectors

**Note:** The singular `find_element_by_...` methods return only the first match, just like `find` in `bs4`.

In [8]:
p = driver.find_element_by_tag_name('p')
p

<selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="fede2b91-365e-42d4-bf03-38740b2c156f")>

In [9]:
faq_button = driver.find_element_by_class_name('faq__button')
faq_button

<selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="08ad00b7-f1a2-4111-8b8e-03b006b51fac")>

In [10]:
error_div = driver.find_element_by_id('error_homepage')
error_div

<selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="04d9987d-31ff-42fb-9538-ecd1434f7f34")>
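
The `find_element_by_*` helpers above are thin wrappers around the generic `driver.find_element(by, value)` method, where `by` is a strategy string such as `"id"` or `"css selector"` (the values of the constants on `selenium.webdriver.common.by.By`).  The sketch below covers the five strategies listed in this section, including XPath and CSS selectors, which the cells above do not demonstrate.  The `locate` helper and the short strategy names are assumptions made for illustration, not part of `selenium`.

```python
# Strategy strings mirror the constants on selenium.webdriver.common.by.By
# (By.ID == "id", By.CLASS_NAME == "class name", and so on).
STRATEGIES = {
    "tag": "tag name",
    "class": "class name",
    "id": "id",
    "xpath": "xpath",
    "css": "css selector",
}


def locate(driver, kind, value):
    """Find the first element matching `value` using the strategy named `kind`."""
    if kind not in STRATEGIES:
        raise ValueError(f"unknown strategy {kind!r}; choose from {sorted(STRATEGIES)}")
    return driver.find_element(STRATEGIES[kind], value)
```

For example, `locate(driver, "css", ".faq__button")` should find the same element as `driver.find_element_by_class_name('faq__button')`, and `locate(driver, "xpath", "//p")` the same element as `driver.find_element_by_tag_name('p')`.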

## Locating all the Elements

Use the plural form (`find_elements_by_...`) to get every matching tag, like `find_all` in `bs4`.

In [11]:
a_tags = driver.find_elements_by_tag_name('a')
a_tags

[<selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="1434c731-989a-4d3c-992c-42794035285f")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="3fb97c2a-f203-4fa1-9090-3e03dc413743")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="dd823bdf-9e60-40df-b375-ab7a049054c6")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="fd393dab-04f6-4052-af00-21e08e284e92")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="580afe96-1fcd-4bfe-b112-9c7edbaf9d42")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="4b5feb46-12c7-473f-9243-c8607b1b96f4")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f49eb581e7d62c16a2f9ba063513019", element="f8bb3653-7f76-4463-b7e9-33
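
One practical difference between the two forms: the singular methods raise `NoSuchElementException` when nothing matches, while the plural methods return an empty list.  That makes the plural form convenient for optional elements.  A minimal sketch, assuming the Selenium 3 method names used in this lecture; the helper name `first_or_none` is hypothetical.

```python
def first_or_none(driver, class_name):
    """Return the first element with `class_name`, or None if the page has none."""
    matches = driver.find_elements_by_class_name(class_name)
    return matches[0] if matches else None
```

A caller can then write `if first_or_none(driver, 'faq__button') is not None:` instead of wrapping the lookup in `try`/`except`.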

## The `WebElement` object

A `WebElement` is the `selenium` object that represents an HTML tag.

In [12]:
p.text

''

In [13]:
faq_button.click()

In [14]:
faq_button.get_attribute('class')

'faq__button'

In [15]:
[a.get_attribute('href') for a in a_tags]

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://chrome.google.com/webstore/detail/bkdgflcldnnnapblkhphbgpggdiikppg/reviews',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/settings#theme',
 'https://duckduckgo.com/sett

In [16]:
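from composable import pipeable

# get_attribute('href') fixes the attribute name; the tag is supplied by `>>`,
# so `tag >> get_attribute('href')` calls tag.get_attribute('href').
get_attribute = pipeable(lambda attr, tag: tag.get_attribute(attr))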

(p
>> get_attribute('class')
)

'showcase__subheading'
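
To make the `>>` syntax less mysterious, here is a minimal sketch of how such a pipeable wrapper could be built with Python's `__rrshift__` hook.  This is an illustration only, not the actual implementation in the `composable` package; the class name `Pipeable` and the helpers `get_key` and `upper` are hypothetical.

```python
class Pipeable:
    """Wrap `func` so that `value >> wrapped(*args)` calls `func(*args, value)`."""

    def __init__(self, func, args=()):
        self.func = func
        self.args = args

    def __call__(self, *args):
        # Supplying arguments returns a new object that remembers them.
        return Pipeable(self.func, self.args + args)

    def __rrshift__(self, value):
        # Python falls back to this hook for `value >> self` when `value`
        # does not implement `>>` itself; the piped value becomes the last argument.
        return self.func(*self.args, value)


get_key = Pipeable(lambda key, mapping: mapping[key])
upper = Pipeable(str.upper)

result = {"class": "showcase__subheading"} >> get_key("class") >> upper
print(result)
```

Each `>>` hands the value on its left to the function on its right, so a chain reads top to bottom in the order the steps run.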

In [17]:
from composable.strict import map

(a_tags
>> map(get_attribute('href'))
)

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://chrome.google.com/webstore/detail/bkdgflcldnnnapblkhphbgpggdiikppg/reviews',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/settings#theme',
 'https://duckduckgo.com/sett

## Anatomy of a Search

<img src="./img/duckduck_search.png" width="400"/>

In [18]:
input_field = driver.find_element_by_id('search_form_input_homepage')
input_button = driver.find_element_by_id('search_button_homepage')

In [19]:
input_field.send_keys('Silas Bergen')

In [20]:
input_button.click()
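
The three steps above (locate the field and button, type the query, click) can be bundled into one function so that a search is easy to repeat.  A minimal sketch, assuming the Selenium 3 method names and the DuckDuckGo homepage ids used in this lecture; the function name `run_search` and its parameters are hypothetical.

```python
def run_search(driver, query,
               input_id="search_form_input_homepage",
               button_id="search_button_homepage"):
    """Type `query` into the search box and click the search button."""
    input_field = driver.find_element_by_id(input_id)
    input_field.clear()            # remove any text left in the box
    input_field.send_keys(query)   # type the query as a user would
    driver.find_element_by_id(button_id).click()
```

With the session from this lecture, `run_search(driver, 'Silas Bergen')` performs the same search as the cells above.  The ids are parameters because they are specific to one page and may differ elsewhere.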

## IMPORTANT -- The page will change!

In [21]:
input_field = driver.find_element_by_id('search_form_input_homepage')

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="search_form_input_homepage"]"}
  (Session info: chrome=86.0.4240.183)

Clicking the search button navigates to the results page, which has different markup, so the homepage's `search_form_input_homepage` element no longer exists.  Elements must be located again after every page change.


## <font color="red"> Exercise 2 </font> 

Use `selenium` to get the links for all of the results shown on the page.

In [28]:
# Your code here
find_elements_by_tag_name = pipeable(lambda tag, driver: driver.find_elements_by_tag_name(tag))
find_element_by_tag_name = pipeable(lambda tag, driver: driver.find_element_by_tag_name(tag))
find_elements_by_class_name = pipeable(lambda class_, driver: driver.find_elements_by_class_name(class_))
find_element_by_class_name = pipeable(lambda class_, driver: driver.find_element_by_class_name(class_))
find_elements_by_id = pipeable(lambda id_, driver: driver.find_elements_by_id(id_))

In [36]:
(driver
>> find_elements_by_class_name('results_links_deep')
>> map(find_element_by_tag_name('a'))
>> map(get_attribute('href'))
)

['http://driftlessdata.space/',
 'https://www.linkedin.com/in/silas-bergen-3039b785',
 'https://www.ratemyprofessors.com/ShowRatings.jsp?tid=1946369',
 'https://www.instantcheckmate.com/people/silas-bergen/',
 'https://github.com/silasbergen/',
 'https://public.tableau.com/profile/silas.bergen#!',
 'https://myspace.com/287027346',
 'http://driftlessdata.space/courses/dsci310/midterm/',
 'https://silasbergen.github.io/DSCI310-F17/Syllabus.html',
 'https://public.tableau.com/profile/silas.bergen#!/vizhome/Uncertainty_0/Dashboard1',
 'https://jimmyjhickey.github.io/old-site/silas.html',
 'https://silasbergen.github.io/DSCI310-F17/Midterm.html',
 'https://www.researchgate.net/scientific-contributions/Silas-Bergen-2009213336',
 'https://es-la.facebook.com/public/Silas-Bergen',
 'https://www.goodreads.com/user/show/55807579-silas-bergen',
 'https://www.findagrave.com/memorial/71758675/laverne-silas-dewees',
 'https://www.intelius.com/people-search/Silas-Bergen',
 'https://vimeo.com/user37861

## <font color="red"> Exercise 3 </font> 

Suppose that we want more than one page of results.  Inspect the page and find the "more" button. **Hint:** Right-clicking won't work here; you will need to inspect a nearby element and then navigate to the button in the developer console.

Once you have found this button, click it at least three times, then get all of the search result links.

In [35]:
# Your code here
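import time
for _ in range(3):  # the exercise asks for at least three clicks
    driver.find_element_by_class_name('result--more').click()  # re-locate the button before each click
    time.sleep(1)  # pause so the next batch of results can load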

In [37]:
all_links = (driver
>> find_elements_by_class_name('results_links_deep')
>> map(find_element_by_tag_name('a'))
>> map(get_attribute('href'))
)
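
The link lists shown earlier in this lecture contain repeats (for example, `'https://duckduckgo.com/app'` appears several times), and an anchor without a usable `href` can yield an empty value.  Before using `all_links`, it may help to clean the list.  A minimal sketch; the helper name `unique_links` is hypothetical.

```python
def unique_links(links):
    """Drop empty hrefs and repeats while keeping first-seen order."""
    cleaned = (link for link in links if link)
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(cleaned))
```

Calling `unique_links(all_links)` returns each URL once, in the order the results appeared on the page.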

## Headless operation

While the extra window is useful for exploring results, it gets annoying when rerunning common searches.  In this case, we want to run the search in headless mode, where Chrome runs without a visible window.

In [38]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.nintendo.com/")
print(driver.page_source)
driver.quit()

cdn.cookielaw.org/logos/50b45312-ab08-436b-adc7-79156d8394f7/fbf3549c-4010-4bc2-8e8e-7077219d886e/a9e434c0-45f5-46cf-8f9c-c267f3130104/grey_racetrack.png&quot;)"></div></div><h3 id="pc-title">Cookie Collection Practices</h3><div id="pc-policy-text">When you visit our websites or utilize our services, we store cookies on your browser to collect information. The information collected might relate to you, your preferences or your device, and is used to make the sites or services work as you expect it to and to provide a more personalized experience when using our websites and services. However, you can choose not to allow certain types of cookies, which may impact your experience of the site and the services we are able to offer. Click on the different category headings to find out more and change our default settings according to your preference. Please note that you cannot opt-out of our Strictly Necessary Cookies as they are deployed in order to ensure the proper functioning of our web
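
In headless mode there is no window to remind you that a browser is still running, so it is easy to leave one behind when a cell raises an error before `driver.quit()`.  One way to guard against that is a small context manager.  This is a sketch under the assumption that the driver object has a `quit()` method, as `webdriver.Chrome` does; the name `browser` and the `make_driver` parameter are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def browser(make_driver):
    """Yield a driver built by `make_driver()` and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body raises, so no browser is left behind
```

With the objects defined in the cell above, usage would look like `with browser(lambda: webdriver.Chrome(options=options, executable_path=DRIVER_PATH)) as driver:` followed by an indented `driver.get(...)` and any scraping code.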