# Lecture 5.7 - Web Scraping with Selenium

In [3]:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get('https://wsu-datascience.github.io/binomial_simulation/index.html')
binom_sim = BeautifulSoup(r.content, 'html.parser')

In [6]:
binom_sim

<!DOCTYPE html>

<html lang="en">
<head>
<title>Binomial Simulation</title>
</head>
<link crossorigin="anonymous" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css" integrity="sha384-wITovz90syo1dJWVh32uuETPVEtGigN07tkttEqPv+uR2SE/mbQcG7ATL28aI9H0" rel="stylesheet"/>
<style>
    /* LaTeX display environment will effect the LaTeX characters but not the layout on the page */
    span.katex-display {
      display: inherit; /* You may comment this out if you want the default behavior */
    }
  </style>
<script crossorigin="anonymous" integrity="sha384-/y1Nn9+QQAipbNQWU65krzJralCnuOasHncUFXGkdwntGeSvQicrYkiUBwsgUqc1" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script crossorigin="anonymous" integrity="sha384-dq1/gEHSxPZQ7DdrM82ID4YVol9BYyU7GbWlIwnwyPzotpoc57wDw/guX8EaYGPx" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<!-- These scripts link to the Vega-Lite runtime -->
<script src="h

## <font color="red"> Exercise 1 </font>

Load the [binomial simulation app](https://wsu-datascience.github.io/binomial_simulation/index.html) in a browser and inspect some elements.  Verify that our usual technique (`requests` plus `BeautifulSoup`) fails to load the elements you find on the page.

## Why use `selenium`

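`selenium` drives a real browser, so the page's JavaScript actually runs.  This allows us to interact with the web page by

1. Clicking on elements of the page, including those generated by JavaScript rather than written in the HTML source
2. Filling in forms
3. Waiting for elements to load
4. Taking screenshots of the current state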
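
Item 3 in the list above matters because dynamically rendered content may not exist yet when the page first loads. The sketch below shows the underlying polling idea in plain Python. `wait_until` is a hypothetical helper written for illustration, not part of `selenium`, which ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # A truthy result (for example a non-empty list of elements) ends the wait.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With a live session, `condition` could be any call that returns an empty list until the element appears, such as a `find_elements_by_...` lookup wrapped in a `lambda`.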

## Installation -- Installing the browser driver

First, we need to install a special driver that allows Python to interact with your browser.  Here we will be using Google Chrome and the [Chrome Driver](https://chromedriver.chromium.org/getting-started).  Be sure to install this driver (or the [one for your favorite browser](https://selenium-python.readthedocs.io/installation.html#drivers)).

## Installation -- Installing `selenium`

Next we will use `pip` to install the `selenium` package.

In [59]:
!pip install selenium



## Starting a `selenium` session

Running this cell opens a Chrome window that is controlled by your Python `driver` object.  **Don't close this window.**

In [143]:
from selenium import webdriver

DRIVER_PATH = '/Users/bn8210wy/Downloads/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://duckduckgo.com/')
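
The `executable_path` keyword above matches the `selenium` 3 API that this lecture was recorded with. In `selenium` 4 the driver location is passed through a `Service` object instead, and recent 4.x releases can usually locate a driver on their own when no path is given. The following is a minimal sketch, assuming `selenium` 4 is installed. `start_chrome` is a hypothetical helper name, and the imports sit inside the function only so that defining it does not require `selenium`.

```python
def start_chrome(driver_path=None):
    """Start a Chrome session using the selenium 4 way of passing the driver path."""
    # Imported here so the sketch can be defined even where selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        # No path given: let selenium look for a suitable driver itself.
        return webdriver.Chrome()
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Under these assumptions, `driver = start_chrome(DRIVER_PATH)` would take the place of the `webdriver.Chrome(executable_path=DRIVER_PATH)` call.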

In [144]:
driver.page_source

'<html class="js no-touch opacity csstransforms3d csstransitions svg cssfilters is-not-mobile-device full-urls is-atb-fullbleed" lang="en_US"><!--<![endif]--><head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n<meta http-equiv="content-type" content="text/html; charset=UTF-8;charset=utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1">\n<meta name="HandheldFriendly" content="true">\n\n<link rel="canonical" href="https://duckduckgo.com/">\n\n<link rel="stylesheet" href="/s1934.css" type="text/css">\n\n<link rel="stylesheet" href="/o1934.css" type="text/css">\n\n\n\n<link rel="preload" href="/font/ProximaNova-Reg-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous">\n<link rel="preload" href="/font/ProximaNova-Sbold-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous">\n<link rel="preload" href="/font/ProximaNova-ExtraBold-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous">\n\n<link rel="sho

In [145]:
driver.title

'DuckDuckGo — Privacy, simplified.'

In [146]:
driver.current_url

'https://duckduckgo.com/'

## Locating the First Element

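Elements can be located by

* Tag name
* Class name
* ID
* XPath
* CSS selector

**Note:** These methods play the same role as `find` in `bs4` and return only the first match.  Unlike `find`, which returns `None` when nothing matches, they raise a `NoSuchElementException`.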

In [152]:
p = driver.find_element_by_tag_name('p')
p

<selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="6dbaa18f-fb93-41d9-8a92-540a4ab6fc0b")>

In [153]:
faq_button = driver.find_element_by_class_name('faq__button')
faq_button

<selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="21452793-4a80-47f4-bc51-e6c7185e05c8")>

In [154]:
error_div = driver.find_element_by_id('error_homepage')
error_div

<selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="0acfd863-9aa7-4c76-92dc-bd3c56da09bd")>
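
The examples above cover tag name, class name, and ID, but not XPath or CSS selectors. All five strategies go through the generic `find_element(by, value)` method, which exists alongside the `find_element_by_*` shortcuts and is the form kept in recent `selenium` 4 releases. The sketch below is one way to wrap it. `STRATEGIES` and `find_first` are hypothetical names, and the strategy strings are the values of the constants on `selenium.webdriver.common.by.By`.

```python
# Strategy strings as defined on selenium's `By` class
# (for example, By.CSS_SELECTOR == "css selector").
STRATEGIES = {
    "tag": "tag name",
    "class": "class name",
    "id": "id",
    "xpath": "xpath",
    "css": "css selector",
}


def find_first(driver, kind, value):
    """Return the first element matching `value` under the named strategy."""
    if kind not in STRATEGIES:
        raise ValueError(f"unknown locator kind: {kind!r}")
    return driver.find_element(STRATEGIES[kind], value)
```

For instance, `find_first(driver, 'xpath', '//*[@id="error_homepage"]')` should locate the same element as the ID lookup above, and `find_first(driver, 'css', 'a')` selects the first link by CSS selector.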

## Locating all the Elements

Use the plural form (`find_elements_by_...`) to get a list of every matching element, like `find_all` in `bs4`.

In [155]:
a_tags = driver.find_elements_by_tag_name('a')
a_tags

[<selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="4ffc5a66-d34f-4633-a614-8b01c4c96bef")>,
 <selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="acbe5c87-4052-47e3-9bc2-23943f974c1c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="773370aa-d569-4ff4-a54c-cd2b78d3ae5c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="a5ef8aa0-c0f0-4a5a-a6ee-1d71b16fb04d")>,
 <selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="f971a854-9539-40ed-992e-cf44f125c605")>,
 <selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="88ec178e-2bf9-4358-a89e-35460307791a")>,
 <selenium.webdriver.remote.webelement.WebElement (session="c12b854d755399f3923baa7697dc713e", element="7311579b-5ba7-44b1-a69f-98

## The `WebElement` object

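A `WebElement` is the `selenium` object that represents an HTML element on the page.  It exposes the element's text and attributes and lets us interact with it, for example by clicking.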

In [156]:
p.text

''

In [157]:
faq_button.click()

In [158]:
faq_button.get_attribute('class')

'faq__button'

In [159]:
[a.get_attribute('href') for a in a_tags]

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://chrome.google.com/webstore/detail/bkdgflcldnnnapblkhphbgpggdiikppg/reviews',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/settings#theme',
 'https://duckduckgo.com/sett
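
The output above lists many URLs twice, presumably because the page links to them from more than one place. If each address is needed only once, the hrefs can be de-duplicated while keeping their page order. This is a small sketch. `unique_hrefs` is a hypothetical helper, and it also skips anchors for which `get_attribute('href')` returns `None`.

```python
def unique_hrefs(elements):
    """Collect each element's href once, in order of first appearance."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute('href')
        # Skip anchors without an href and addresses already collected.
        if href is None or href in seen:
            continue
        seen.add(href)
        hrefs.append(href)
    return hrefs
```

Calling `unique_hrefs(a_tags)` on the list found earlier would return the same addresses without the repeats.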

In [160]:
from composable import pipeable

# Curried helper: get_attribute('class') waits for a tag supplied through >>
get_attribute = pipeable(lambda attr, tag: tag.get_attribute(attr))

(p
>> get_attribute('class')
)

'showcase__subheading'
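
For readers unfamiliar with `composable`, the cell above relies on two behaviors: calling `get_attribute('class')` fixes the first argument, and `>>` passes the left-hand value in as the last one. The class below is a simplified stand-in that mimics this with Python's `__rrshift__` hook. It is an illustration only, not the library's actual implementation, and `SimplePipeable` is a hypothetical name.

```python
class SimplePipeable:
    """Wrap a function so that `value >> wrapped(arg)` calls `func(arg, value)`."""

    def __init__(self, func, *bound):
        self.func = func
        self.bound = bound

    def __call__(self, *args):
        # Bind the leading arguments and keep waiting for the piped value.
        return SimplePipeable(self.func, *self.bound, *args)

    def __rrshift__(self, value):
        # Runs for `value >> self` when `value` does not define `>>` itself.
        return self.func(*self.bound, value)


# A plain dict stands in for a WebElement in this demonstration.
lookup = SimplePipeable(lambda key, mapping: mapping[key])
result = {'class': 'showcase__subheading'} >> lookup('class')
```

Here `result` holds `'showcase__subheading'`, mirroring the notebook output above.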

In [161]:
from composable.strict import map

(a_tags
>> map(get_attribute('href'))
)

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://chrome.google.com/webstore/detail/bkdgflcldnnnapblkhphbgpggdiikppg/reviews',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/settings#theme',
 'https://duckduckgo.com/sett

## Anatomy of a Search

<img src="./img/duckduck_search.png" width="400"/>

In [162]:
input_field = driver.find_element_by_id('search_form_input_homepage')
input_button = driver.find_element_by_id('search_button_homepage')

In [163]:
input_field.send_keys('Silas Bergen')

In [164]:
input_button.click()
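
The three cells above can be gathered into one function, which is handy when the same kind of search is run repeatedly. This sketch uses the element IDs found above, which reflect the DuckDuckGo homepage at the time of the lecture. `run_search` is a hypothetical helper, and it uses the generic `find_element(by, value)` call with the `'id'` strategy string (the value of `By.ID`) rather than the `find_element_by_id` shortcut.

```python
def run_search(driver, query):
    """Type `query` into the homepage search box and submit it."""
    input_field = driver.find_element('id', 'search_form_input_homepage')
    input_button = driver.find_element('id', 'search_button_homepage')
    # Clear any leftover text so repeated calls do not append to an old query.
    input_field.clear()
    input_field.send_keys(query)
    input_button.click()
```

`run_search(driver, 'Silas Bergen')` would reproduce the steps shown in the cells above.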

## IMPORTANT -- The page will change!

In [165]:
input_field = driver.find_element_by_id('search_form_input_homepage')

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="search_form_input_homepage"]"}
  (Session info: chrome=86.0.4240.111)
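
The exception arises because the click navigated the browser to the results page, where no element carries the homepage ID, so elements must be located again on each new page. When a missing element is an expected outcome rather than an error, the plural `find_elements` form is convenient because it returns an empty list instead of raising. The sketch below uses a hypothetical helper, `first_or_none`, built on the generic `find_elements(by, value)` method.

```python
def first_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches."""
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None
```

On the results page, `first_or_none(driver, 'id', 'search_form_input_homepage')` would return `None` rather than raising `NoSuchElementException`.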


## <font color="red"> Exercise 2 </font> 

Use `selenium` to get the links for all of the results shown on the page.

In [166]:
# Your code here

## <font color="red"> Exercise 3 </font> 

Suppose that we want more than one page of results.  Inspect the page and find the button that loads more results. **Hint:** Right-clicking won't work here; you will need to inspect a nearby element and then navigate to the button in the browser's developer tools.

Once you have found this button, click it at least three times, then get all of the search result links.

In [113]:
# Your code here

## Headless operation

While the extra window is useful for exploring results, it gets annoying when rerunning common searches.  In that case, we can run the search in headless mode, where the browser runs without opening a visible window.

In [29]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.nintendo.com/")
print(driver.page_source)
driver.quit()

<html lang="en-US" class="js flexbox canvas canvastext webgl no-touch geolocation postmessage websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers no-applicationcache svg inlinesvg smil svgclippaths no-isios flexboxlegacy supports alps-os-mac" data-triggered="true"><head><style class="vjs-styles-defaults">
      .video-js {
        width: 300px;
        height: 150px;
      }

      .vjs-fluid {
        padding-top: 56.25%
      }
    </style>
    <meta charset="UTF-8">
    <title>
    Nintendo - Official Site - Video Game Consoles, Games
</title>
    
    <meta name="description" content="Discover Nintendo Switch, the video game system you can play at home or on the go. Plus, get the latest games and news on t
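
If `driver.get` or any later step raises, the `driver.quit()` call in the cell above is never reached and the headless Chrome process keeps running in the background. One way to guard against that is a small context manager. `quitting` is a hypothetical helper name, and the sketch assumes only that the object it wraps has a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield `driver`, then call `driver.quit()` even if the body raised."""
    try:
        yield driver
    finally:
        driver.quit()
```

The headless cell could then be written as `with quitting(webdriver.Chrome(options=options, executable_path=DRIVER_PATH)) as driver:` followed by the indented `get` and `print` calls. If your `selenium` version rejects `options.headless = True`, `options.add_argument("--headless")` is the usual replacement.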