# Lecture 5.7 - Web Scraping with Selenium

In [None]:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get('https://wsu-datascience.github.io/binomial_simulation/index.html')
binom_sim = BeautifulSoup(r.content, 'html.parser')

In [None]:
binom_sim

<!DOCTYPE html>

<html lang="en">
<head>
<title>Binomial Simulation</title>
</head>
<link crossorigin="anonymous" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css" integrity="sha384-wITovz90syo1dJWVh32uuETPVEtGigN07tkttEqPv+uR2SE/mbQcG7ATL28aI9H0" rel="stylesheet"/>
<style>
    /* LaTeX display environment will effect the LaTeX characters but not the layout on the page */
    span.katex-display {
      display: inherit; /* You may comment this out if you want the default behavior */
    }
  </style>
<script crossorigin="anonymous" integrity="sha384-/y1Nn9+QQAipbNQWU65krzJralCnuOasHncUFXGkdwntGeSvQicrYkiUBwsgUqc1" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script crossorigin="anonymous" integrity="sha384-dq1/gEHSxPZQ7DdrM82ID4YVol9BYyU7GbWlIwnwyPzotpoc57wDw/guX8EaYGPx" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<!-- These scripts link to the Vega-Lite runtime -->
<script src="h

## <font color="red"> Exercise 1 </font>

Load the [binomial simulation app](https://wsu-datascience.github.io/binomial_simulation/index.html) in a browser and inspect some elements. Verify that our usual technique (`requests` plus `BeautifulSoup`) fails to load the elements you find on the page.
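As a rough illustration of why a plain request can miss content, the sketch below parses a small, made-up HTML snippet (not the actual app page) with the standard-library `html.parser`, used here as a stand-in for `BeautifulSoup` so the example has no dependencies. The button only exists after a browser runs the script, so a static parser never sees it.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a JavaScript-driven page: the <div> is empty in the
# served HTML, and the button is only created when a browser runs the script.
STATIC_HTML = """
<html>
  <body>
    <div id="vis"></div>
    <script>
      var button = document.createElement("button");
      button.id = "run";
      document.getElementById("vis").appendChild(button);
    </script>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(STATIC_HTML)

print(collector.tags)              # tags present in the served HTML
print('button' in collector.tags)  # False: the script was never executed
```

A browser-driving tool such as `selenium` executes the script first, so the button would be part of the page it reports.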

## Why use `selenium`

`selenium` allows us to interact with the web page by

1. Clicking on non-HTML elements on the page
2. Filling in forms
3. Waiting for elements to load
4. Taking screenshots of the current state

## Installation -- Installing the browser driver

First, we need to install a driver that lets Python control your browser. Here we use Google Chrome and the [Chrome Driver](https://chromedriver.chromium.org/getting-started). Be sure to install this driver (or the [one for your favorite browser](https://selenium-python.readthedocs.io/installation.html#drivers)).

## Installation -- Installing `selenium`

Next we will use `pip` to install the `selenium` package.

In [None]:
# For local machine
!pip install selenium


In [2]:
# For running in colab
!pip install kora -q

## Starting a `selenium` session

Note that this opens a Chrome window controlled by your Python `driver` object. **Don't close this window.**

In [None]:
# For running locally (with a pop up browser)
from selenium import webdriver

DRIVER_PATH = '/Users/bn8210wy/Downloads/chromedriver'
url = 'https://duckduckgo.com/'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(url)

In [18]:
# For running in colab
from kora.selenium import wd as driver

url = 'https://duckduckgo.com/'  # defined here because the local cell is skipped in colab
driver.get(url)

## Note on working in Google colab

Colab has no pop-up browser, so open the page in a separate browser window and mirror the steps you perform with the `selenium` driver.
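Because a Colab session has no visible browser, a screenshot is one way to check what the driver currently sees. The helper below is a minimal sketch: `snapshot` and the default file name are made up for illustration, while `save_screenshot` is the driver method that writes a PNG and returns `False` if the file could not be written.

```python
from pathlib import Path


def snapshot(driver, path="page.png"):
    """Save a PNG of the driver's current view and return the file's path."""
    path = Path(path)
    # save_screenshot reports failure by returning False rather than raising
    if not driver.save_screenshot(str(path)):
        raise RuntimeError(f"could not write screenshot to {path}")
    return path
```

In a notebook, the saved file can then be shown inline, for example with `from IPython.display import Image` followed by `Image(filename=str(snapshot(driver)))`.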

In [19]:
driver.page_source

'<html class="js no-touch opacity csstransforms3d csstransitions svg cssfilters is-not-mobile-device full-urls is-atb-fullbleed has-search-focus" lang="en_US"><!--<![endif]--><head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n<meta http-equiv="content-type" content="text/html; charset=UTF-8;charset=utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1">\n<meta name="HandheldFriendly" content="true">\n\n<link rel="canonical" href="https://duckduckgo.com/">\n\n<link rel="stylesheet" href="/s1936.css" type="text/css">\n\n<link rel="stylesheet" href="/o1936.css" type="text/css">\n\n\n\n<link rel="preload" href="/font/ProximaNova-Reg-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous">\n<link rel="preload" href="/font/ProximaNova-Sbold-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous">\n<link rel="preload" href="/font/ProximaNova-ExtraBold-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous">\

In [20]:
driver.title

'DuckDuckGo — Privacy, simplified.'

In [21]:
driver.current_url

'https://duckduckgo.com/'

## Locating the First Element

`selenium` can locate an element by

* Tag name
* Class name
* ID
* XPath
* CSS selector

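**Note:** The `find_element_by_*` methods play the same role as `find` in `bs4`: each returns the first match. Unlike `find`, they raise `NoSuchElementException` rather than returning `None` when nothing matches.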
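The strategies listed above can also be passed to the generic `find_element(strategy, value)` method, which newer Selenium releases use in place of the `find_element_by_*` helpers. The sketch below looks up one element in several ways. The class name and ID come from the FAQ button as it appears in this lecture's captured page, the CSS selector and XPath are written to match them, and `find_by_each` is a hypothetical helper name.

```python
# Strategy strings understood by find_element. selenium also exposes them as
# constants on selenium.webdriver.common.by.By (By.CLASS_NAME == "class name",
# By.ID == "id", By.CSS_SELECTOR == "css selector", By.XPATH == "xpath").
FAQ_BUTTON_LOCATORS = [
    ("class name", "faq__button"),
    ("id", "faq-btn-0"),
    ("css selector", "button.faq__button"),
    ("xpath", "//button[@id='faq-btn-0']"),
]


def find_by_each(driver, locators):
    """Look up one element per (strategy, value) pair, keyed by strategy."""
    return {
        strategy: driver.find_element(strategy, value)
        for strategy, value in locators
    }
```

With a live session, `find_by_each(driver, FAQ_BUTTON_LOCATORS)` should return the same button four times, assuming the page still has the structure shown in this lecture.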

In [22]:
p = driver.find_element_by_tag_name('p')
p

<p class="showcase__subheading">Protect your data on every device.</p>

In [23]:
faq_button = driver.find_element_by_class_name('faq__button')
faq_button

<button class="faq__button" aria-expanded="true" aria-disabled="true" aria-controls="faq-answer-0" id="faq-btn-0"><svg width="20" height="21" viewBox="0 0 20 21" fill="none" xmlns="http://www.w3.org/2000/svg"><circle cx="10" cy="10.5" r="10" transform="rotate(-180 10 10.5)" fill="#E5E5E5"></circle><path d="M9.94454 12.8483L13.5355 9.25736C13.7308 9.0621 14.0474 9.0621 14.2426 9.25736C14.4379 9.45262 14.4379 9.7692 14.2426 9.96447L10.3536 13.8536C10.2418 13.9653 10.0903 14.0131 9.94454 13.9969C9.79879 14.0131 9.64729 13.9653 9.53553 13.8536L5.64645 9.96447C5.45118 9.7692 5.45118 9.45262 5.64645 9.25736C5.84171 9.0621 6.15829 9.0621 6.35355 9.25736L9.94454 12.8483Z" fill="#353748"></path></svg></button>

In [24]:
error_div = driver.find_element_by_id('error_homepage')
error_div

<div id="error_homepage"></div>

## Locating all the Elements

Use the plural form (`find_elements_by_*`) to get every match, like `find_all` in `bs4`.
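The plural lookup returns an empty list when nothing matches, which makes it a convenient base for `bs4`-style behavior (a `None` result instead of an exception). A minimal sketch, where `first_or_none` is a hypothetical helper and the generic `find_elements(strategy, value)` method is used for the lookup:

```python
def first_or_none(driver, strategy, value):
    """Return the first matching element, or None when nothing matches."""
    matches = driver.find_elements(strategy, value)
    return matches[0] if matches else None
```

For example, `first_or_none(driver, "id", "error_homepage")` would give the element if it is on the page and `None` otherwise.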

In [26]:
a_tags = driver.find_elements_by_tag_name('a')
a_tags

[<a class="header__button--menu  js-side-menu-open" href="#">⇶</a>,
 <a href="/app" class="eighteen js-hl-item" aria-hidden="true" data-type="showcase" data-id="app"><div class="woman-icon"></div></a>,
 <a href="/app" class="text-left showcase__link eighty js-hl-item" data-type="showcase" data-id="app"><h1 class="showcase__heading">Get Our App &amp; Extension</h1><p class="showcase__subheading">Protect your data on every device.</p></a>,
 <a href="/newsletter" class="eighteen js-hl-item" aria-hidden="true" data-type="showcase" data-id="newsletter"><div class="mailbox-icon"></div></a>,
 <a href="/newsletter" class="text-left showcase__link eighty js-hl-item" data-type="showcase" data-id="newsletter"><h1 class="showcase__heading">Privacy in Your Inbox</h1><p class="showcase__subheading">Stay protected and informed with our privacy newsletters.</p></a>,
 <a href="https://spreadprivacy.com/tag/device-privacy-tips/" class="eighteen js-hl-item" aria-hidden="true" data-type="showcase" data-id

## The `WebElement` object

A `WebElement` is the `selenium` object that represents an HTML tag on the live page.

In [27]:
p.text

''

In [28]:
faq_button.click()

In [29]:
faq_button.get_attribute('class')

'faq__button'

In [30]:
[a.get_attribute('href') for a in a_tags]

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/install?t=h_',
 'https://chrome.google.com/webstore/detail/bkdgflcldnnnapblkhphbgpggdiikppg/reviews',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/settin

In [33]:
! pip install composable

Collecting composable
  Downloading https://files.pythonhosted.org/packages/16/c0/c4de24bae8543d2f313783e98f3d6c92fc3393db2719d28a219c3ce2fbf9/composable-0.2.5-py3-none-any.whl
Collecting python-forge<19.0,>=18.6
  Downloading https://files.pythonhosted.org/packages/41/d6/e9af8e22d153ebbf584833c1c96d590046f522ae2a86978d4efe496b4aac/python_forge-18.6.0-py35-none-any.whl
Installing collected packages: python-forge, composable
Successfully installed composable-0.2.5 python-forge-18.6.0


In [34]:
from composable import pipeable
get_attribute = pipeable(lambda attr, tag: tag.get_attribute(attr))

(p
>> get_attribute('class')
)

'showcase__subheading'

In [35]:
from composable.strict import map

(a_tags
>> map(get_attribute('href'))
)

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/install?t=h_',
 'https://chrome.google.com/webstore/detail/bkdgflcldnnnapblkhphbgpggdiikppg/reviews',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/settin
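For readers who prefer to avoid the extra dependency, the same pattern can be approximated in plain Python. This is only a sketch: it assumes `pipeable` behaves as the cells above suggest (the first call fixes the attribute name and the piped value arrives last), and `attribute_getter` is a hypothetical name.

```python
def attribute_getter(attr):
    """Return a function that reads `attr` from whatever element it is given."""
    def read(tag):
        return tag.get_attribute(attr)
    return read


def attribute_values(tags, attr):
    """Apply the getter to every element, like `tags >> map(get_attribute(attr))`."""
    return list(map(attribute_getter(attr), tags))
```

Here `attribute_getter('class')(p)` corresponds to `p >> get_attribute('class')`, and `attribute_values(a_tags, 'href')` corresponds to the mapped pipeline.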

## Anatomy of a Search

<img src="https://github.com/wsu-DSCI330/module_5_lectures/blob/main/img/duckduck_search.png?raw=1" width="400"/>

In [None]:
input_field = driver.find_element_by_id('search_form_input_homepage')
input_button = driver.find_element_by_id('search_button_homepage')

In [None]:
input_field.send_keys('Silas Bergen')

In [None]:
input_button.click()

## IMPORTANT -- The page will change!

In [None]:
input_field = driver.find_element_by_id('search_form_input_homepage')

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="search_form_input_homepage"]"}
  (Session info: chrome=86.0.4240.111)
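After a click that loads a new page, elements located earlier no longer exist, and the new ones may take a moment to appear. Selenium ships `WebDriverWait` and `expected_conditions` for this purpose. The plain-Python sketch below only illustrates the underlying idea (poll until something matches or a deadline passes), and `wait_for_elements` with its parameters is a hypothetical helper.

```python
import time


def wait_for_elements(driver, strategy, value, timeout=10.0, poll=0.5):
    """Poll the page until at least one element matches, then return the matches."""
    deadline = time.monotonic() + timeout
    while True:
        matches = driver.find_elements(strategy, value)
        if matches:
            return matches
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no element matched {strategy}={value!r} within {timeout} s"
            )
        time.sleep(poll)  # give the page time to finish loading
```

A call such as `wait_for_elements(driver, "id", "some-results-id")` would be made right after `input_button.click()`. The ID here is a placeholder, so inspect the results page to find a real one.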


## <font color="red"> Exercise 2 </font> 

Use `selenium` to get the links for all of the results shown on the page.

In [None]:
# Your code here

## <font color="red"> Exercise 3 </font> 

Suppose that we want more than one page of results. Inspect the page and find the "more" button. **Hint:** Right-clicking won't work here; inspect a nearby element, then navigate to the button in the developer tools.

Once you have found this button, click it at least three times, then get all of the search result links.

In [None]:
# Your code here

## Headless operation

While the extra window is useful for exploring results, it gets annoying when rerunning common searches. In that case, we can run the browser in headless mode, that is, without a visible window.
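Recent Selenium 4 releases pass the driver location through a `Service` object and the headless setting as a Chrome command-line flag, rather than through `executable_path` and `options.headless`. The sketch below shows that style under a few assumptions: the helper names are hypothetical, `--headless=new` needs a reasonably recent Chrome (older versions take plain `--headless`), and the `try`/`finally` is there so the browser process is closed even if the page load fails.

```python
def headless_chrome_args(width=1920, height=1200):
    """Command-line flags for a headless Chrome window of the given size."""
    return ["--headless=new", f"--window-size={width},{height}"]


def fetch_page_source(url, driver_path=None):
    """Open `url` in headless Chrome and return the rendered page source."""
    # Imported here so headless_chrome_args works without selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for arg in headless_chrome_args():
        options.add_argument(arg)

    # With no explicit path, selenium looks for a suitable driver on its own
    service = Service(executable_path=driver_path) if driver_path else Service()
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

For example, `fetch_page_source("https://www.nintendo.com/", driver_path=DRIVER_PATH)` would reuse the driver path defined earlier in this lecture.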

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.nintendo.com/")
print(driver.page_source)
driver.quit()

<html lang="en-US" class="js flexbox canvas canvastext webgl no-touch geolocation postmessage websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers no-applicationcache svg inlinesvg smil svgclippaths no-isios flexboxlegacy supports alps-os-mac" data-triggered="true"><head><style class="vjs-styles-defaults">
      .video-js {
        width: 300px;
        height: 150px;
      }

      .vjs-fluid {
        padding-top: 56.25%
      }
    </style>
    <meta charset="UTF-8">
    <title>
    Nintendo - Official Site - Video Game Consoles, Games
</title>
    
    <meta name="description" content="Discover Nintendo Switch, the video game system you can play at home or on the go. Plus, get the latest games and news on t