# Web Scraping: Selenium
_Automate your browser._ <br>
_Collect data from dynamically generated web pages or those requiring user interaction._

### Docs

- [Selenium homepage](https://www.seleniumhq.org/) 
- [Selenium documentation](https://selenium-python.readthedocs.io/) - unofficial, but helpful

### Installation

With conda:
- `conda install -c conda-forge selenium`

With pip:
- `pip install -U selenium`

#### ChromeDriver

You will also need to install a web driver to use Selenium.  ChromeDriver is recommended but others are also available.

1. Check your browser's version _(Chrome > About Google Chrome)_
![Browser Version](images/browser_version.png) 
<br>
2. Navigate to the [ChromeDriver downloads page](https://sites.google.com/a/chromium.org/chromedriver/downloads).
<br><br>
3. Download the build that matches your browser's version and your OS.
![Download ChromeDriver zip file](images/chromedriver_options.png)

4. Unzip the driver.
<br><br>
5. Move the unzipped driver to your Applications folder (or wherever your Chrome application is).

In [6]:
from bs4 import BeautifulSoup
import requests
import time, os

In [7]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

## Example 1 - YouTube

### Dynamic Pages

Some pages serve their content dynamically: much of the HTML is generated in the browser at access time, so the page can look different each time it is loaded.  HTML that you see by inspecting elements in your browser might therefore be missing from what `requests` downloads and `BeautifulSoup` parses.

In [15]:
query = "data science"
youtube_search = "https://www.youtube.com/results?search_query="
youtube_query = youtube_search + query.replace(' ', '+')
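
Replacing spaces by hand works for simple queries, but characters such as `&` or `#` in a query would break the URL. A minimal sketch using the standard library's `urllib.parse.urlencode`, which escapes those as well; the helper name `build_search_url` is made up for illustration:

```python
from urllib.parse import urlencode

YOUTUBE_SEARCH = "https://www.youtube.com/results"

def build_search_url(query):
    """Return a YouTube search URL with the query safely encoded."""
    # urlencode turns spaces into '+' and escapes reserved characters like '&'
    params = urlencode({"search_query": query})
    return f"{YOUTUBE_SEARCH}?{params}"

youtube_query = build_search_url("data science")
print(youtube_query)
```

For the plain "data science" query this produces the same URL as the string replacement above.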

In [16]:
page = requests.get(youtube_query).text
soup = BeautifulSoup(page, 'html5lib')

In [17]:
soup.find('div', id='contents')

Uh oh.  The video links should be under the contents div, but that div is missing from the HTML that `requests` returned.

> **QUESTION**: Why do you think this happened?

One option is to first load the page with Selenium THEN parse the page's HTML with BeautifulSoup.

First we launch the YouTube search page through ChromeDriver.  A new browser window should pop up.  **To continue using Selenium, keep this window open!**

In [18]:
driver = webdriver.Chrome(chromedriver)
driver.get(youtube_query)
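
Passing the driver path as the first argument matches the Selenium release this notebook was written against. Newer releases (Selenium 4) expect the path to be wrapped in a `Service` object instead, so if the cell above raises an error, something like the following sketch may help. The helper name `launch_chrome` is an illustration, not part of Selenium.

```python
def launch_chrome(driver_path):
    """Start Chrome through Selenium 4's Service-based API."""
    # Imported locally so the helper carries everything it needs
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Service tells Selenium where the chromedriver executable lives
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service)

# Usage (opens a browser window):
# driver = launch_chrome("/Applications/chromedriver")
# driver.get("https://www.youtube.com/results?search_query=data+science")
```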

We can access the page's HTML through the driver:

In [12]:
driver.page_source[:1000]

'<html style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" lang="en-US" dir="ltr" gl="US"><head><script data-original-src="/s/player/54668ca9/player_ias.vflset/en_US/miniplayer.js" src="/s/player/54668ca9/player_ias.vflset/en_US/miniplayer.js"></script><script data-original-src="/s/player/54668ca9/player_ias.vflset/en_US/remote.js" src="/s/player/54668ca9/player_ias.vflset/en_US/remote.js"></script><meta http-equiv="origin-trial" data-feature="Web Components V0" data-expires="2020-10-23" content="AhbmRDASY7NuOZD9cFMgQihZ+mQpCwa8WTGdTx82vSar9ddBQbziBfZXZg+ScofvEZDdHQNCEwz4yM7HjBS9RgkAAABneyJvcmlnaW4iOiJodHRwczovL3lvdXR1YmUuY29tOjQ0MyIsImZlYXR1cmUiOiJXZWJDb21wb25lbnRzVjAiLCJleHBpcnkiOjE2MDM0ODY4NTYsImlzU3ViZG9tYWluIjp0cnVlfQ=="><meta http-equiv="origin-trial" data-feature="Web Components V0" data-expires="2020-10-27" content="Av2+1qfUp3MwEfAFcCccykS1qFmvLiCrMZ//pHQKnRZWG9dldVo8HYuJmGj2wZ7nDg+xE4RQMQ+Ku1zKM3PvYAIAAABmeyJvcmlnaW4iOiJodHRwczovL2dvb2dsZS5jb206NDQzIiwiZmVhdHVyZSI6

Now we parse this with `BeautifulSoup` and the video information appears!

In [13]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [14]:
soup.find('div', id='contents')

<div class="style-scope ytd-section-list-renderer" id="contents"><ytd-item-section-renderer class="style-scope ytd-section-list-renderer" use-height-hack=""><!--css-build:shady-->
<div class="style-scope ytd-item-section-renderer" id="header"></div>
<div class="style-scope ytd-item-section-renderer" id="spinner-container">
<paper-spinner-lite aria-hidden="true" class="style-scope ytd-item-section-renderer"><!--css-build:shady--><div class="style-scope paper-spinner-lite" id="spinnerContainer"><div class="spinner-layer style-scope paper-spinner-lite"><div class="circle-clipper left style-scope paper-spinner-lite"><div class="circle style-scope paper-spinner-lite"></div></div><div class="circle-clipper right style-scope paper-spinner-lite"><div class="circle style-scope paper-spinner-lite"></div></div></div></div></paper-spinner-lite>
</div>
<div class="style-scope ytd-item-section-renderer" id="contents"><ytd-promoted-sparkles-text-search-renderer class="style-scope ytd-item-section-ren

In [40]:
contents_div = soup.find('div', id='contents')

for title in contents_div.find_all('a', id='video-title'):
    print(title.text.strip())

What REALLY is Data Science? Told by a Data Scientist
Data Science Training Videos 1 DataScience Tutorial for beginners Must Watch +91 8886552866
Learn Data Science Tutorial - Full Course for Beginners
How I Would Learn Data Science (If I Had to Start Over)
Data Science In 5 Minutes | Data Science For Beginners | What Is Data Science? | Simplilearn
Data Science Full Course - Learn Data Science in 10 Hours | Data Science For Beginners | Edureka
3 Reasons You Should NOT Become a Data Scientist
Is Data Science Really a Rising Career in 2020 ($100,000+ Salary)
Demystifying Data Science | Mr.Asitang Mishra | TEDxOakLawn
Intro to Data Science - Crash Course for Beginners
A Day In The Life Of A Data Scientist
Learn Python - Full Course for Beginners [Tutorial]
Python Tutorial - Python for Beginners [Full Course]
Data Analytics for Beginners
Real Talk with Instagram Data Scientist
Data Scientists vs Data Engineers: Which one is for you?
What Do You Need to Become a Data Scientist in 2020?
Stat

> **QUESTION**: We only got about 20 video titles -- surely there are more videos about data science.  What do you think is happening?

### Interacting with Pages

We can also interact with pages using Selenium.  For example, we can 
- click
- type into input fields
- scroll
- drag and drop, etc.

If we want more data science video titles, we need to scroll down to the bottom of the screen for more videos to populate.

In [41]:
for i in range(5):
    #Scroll
    driver.execute_script(
        "window.scrollTo(0, document.documentElement.scrollHeight);" #Alternatively, document.body.scrollHeight
    )
    
    #Wait for page to load
    time.sleep(1)

In [42]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [43]:
contents_div = soup.find('div', id='contents')

len(contents_div.find_all('a', id='video-title'))

116

Awesome!  Now we have several more videos to analyze and we could continue scrolling if we wanted even more.
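
Five scrolls is an arbitrary cutoff. A sketch of an alternative, assuming the page height stops growing once no more results load: keep scrolling until `scrollHeight` is unchanged or a safety cap is hit. The helper name `scroll_until_stable` and its parameters are made up for illustration; `driver.execute_script` is the same call used above.

```python
import time

def scroll_until_stable(driver, pause=1.0, max_scrolls=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    get_height = "return document.documentElement.scrollHeight;"
    last_height = driver.execute_script(get_height)
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        scrolls += 1
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the end
        last_height = new_height
    return scrolls

# Usage with a live driver:
# scroll_until_stable(driver, pause=1.0, max_scrolls=30)
```

The `max_scrolls` cap matters on pages that never run out of content; without it the loop could run indefinitely.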

What if we want to perform a new search for machine learning?

In [44]:
search_box = driver.find_element_by_xpath("//input[@id='search']")

#clear the current search
search_box.clear()

#input new search
search_box.send_keys("machine learning")

#hit enter
search_box.send_keys(Keys.RETURN)  

time.sleep(1)
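
`time.sleep(1)` is a guess at how long the page needs. Selenium also provides explicit waits, which poll until a condition holds or a timeout expires. A sketch, assuming the results are ready once an `a` element with id `video-title` is present (the same selector used elsewhere in this notebook); the helper name `wait_for_results` and the 10-second timeout are illustrative choices:

```python
def wait_for_results(driver, timeout=10):
    """Block until a video title link is present, then return it."""
    # Imported locally so the helper carries everything it needs
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, "//a[@id='video-title']")
    # Raises TimeoutException if no matching element appears in time
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(locator)
    )

# Usage with a live driver:
# first_link = wait_for_results(driver)
```

On a page that is already showing results, elements from the previous search may satisfy the condition, so treat this as a starting point rather than a guarantee.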

And can we filter to short videos (< 4 minutes) only?

In [45]:
filter_button = driver.find_element_by_xpath(
    '//a[contains(@class, "ytd-toggle-button")]'
)
filter_button.click()

In [46]:
short_link = driver.find_element_by_xpath(
    '//div[contains(@title, "Search for Short")]'
)
short_link.click()

Now we can either parse the page source with Beautiful Soup like before or pull text directly.  

For example, the title of the first short ML video (that isn't an ad!) can be found with:

In [47]:
first_title = driver.find_element_by_xpath("//a[@id='video-title']")
first_title.text

'What is Machine Learning?'

In [48]:
first_author = driver.find_element_by_xpath(
    "//ytd-video-renderer//ytd-channel-name//a"
)
first_author.text

'OxfordSparks'

#### Notes

- Check [here](https://www.w3schools.com/xml/xpath_syntax.asp) for additional help writing XPath selectors.

- To select multiple elements, just switch to `driver.find_elements_by_xpath(...)`, which will return a list of matching elements.

- You can also access elements by id, name, etc.  Check [the docs](https://selenium-python.readthedocs.io/locating-elements.html) for more options.
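
As a sketch of the multiple-element case, the helper below collects the text of every matching element. It uses the generic `find_elements(by, value)` call, which newer Selenium releases expect in place of the `find_elements_by_*` shortcuts; the string `"xpath"` is the value of `By.XPATH`. The helper name `texts_by_xpath` is hypothetical.

```python
def texts_by_xpath(driver, xpath):
    """Return the stripped, non-empty text of every element matching an XPath."""
    elements = driver.find_elements("xpath", xpath)  # "xpath" == By.XPATH
    return [el.text.strip() for el in elements if el.text.strip()]

# Usage with a live driver:
# titles = texts_by_xpath(driver, "//a[@id='video-title']")
```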

Finally, when you are finished with the driver, be sure to close it.

In [32]:
driver.close()

## Example 2 - Open Table  _(Optional)_

Let's try one more example: gathering information from Open Table about restaurants with available reservation slots.

In [20]:
driver = webdriver.Chrome(chromedriver)
driver.get('http://www.opentable.com/')
time.sleep(1)  #pause to be sure page has loaded

Inspecting this page, we see that the drop-down for picking the number of people has the `aria-label` `Party size selector`. Let's set the reservation for 4 people:

In [21]:
people_dropdown = driver.find_element_by_xpath('//select[@aria-label="Party size selector"]')
people_dropdown.send_keys("4 people")
time.sleep(1)

Now select the reservation date: 3 days from now.

In [22]:
from datetime import datetime, timedelta

In [23]:
today = datetime.today()
# Format the date three days out the way the calendar labels its days
res_date = (today + timedelta(days=3)).strftime('%a, %b %d, %Y')
res_date

'Sat, Apr 18, 2020'
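
Note that `%d` zero-pads the day, so the 8th is rendered as `08`. If the calendar's `aria-label` uses unpadded days (an assumption worth checking in the inspector), the XPath lookup would miss on single-digit dates. A small sketch of a formatter that handles both cases; `reservation_label` is a made-up helper name:

```python
from datetime import date, timedelta

def reservation_label(start, days_ahead=3, pad_day=True):
    """Format the date `days_ahead` days after `start` like 'Sat, Apr 18, 2020'."""
    target = start + timedelta(days=days_ahead)
    # Build the day number by hand so padding does not depend on the platform
    day = f"{target.day:02d}" if pad_day else str(target.day)
    return f"{target:%a, %b} {day}, {target.year}"

print(reservation_label(date(2020, 4, 15)))                 # padded day
print(reservation_label(date(2020, 4, 5), pad_day=False))   # unpadded day
```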

In [24]:
#Expand the calendar
date_picker = driver.find_element_by_xpath('//div[@aria-label="Date selector"]')
date_picker.click()
time.sleep(1)

In [25]:
#Select the date three days from now
date_element = driver.find_element_by_xpath(f'//div[@aria-label="{res_date}"]')
date_element.click()
time.sleep(1)

Set our reservation time for 8 PM.

In [26]:
time_dropdown = driver.find_element_by_xpath('//select[@aria-label="Time selector"]')
time_dropdown.send_keys("8:00 PM")
time.sleep(1)

And search!

In [27]:
search_button = driver.find_element_by_xpath("//*[contains(text(), 'Let’s go')]")
search_button.click()
time.sleep(1)

On this new page we find a long list of restaurants with available reservations for 4 people at roughly our desired day/time.  At this point we could grab the HTML (`driver.page_source`) and parse with BeautifulSoup.  

In [28]:
soup = BeautifulSoup(driver.page_source, 'html.parser')  # name the parser explicitly

In [29]:
for rest in soup.find_all('div', class_='rest-row-header')[:20]:
    print(rest.find('a').text)

Fratelli's Trattoria  Late Night Find 
3 Westerly Bar and Grill  Outdoor Dining 
Brothers Fish & Chips 
Restaurant X & Bully Boy Bar  Great for Brunch 
Aji Limo/Peruvian Cuisine 
Melike Turkish Cuisine  Neighborhood Gem 
Flames Bar and Grill 
The Station Kitchen and Bar  Best Service 
Le Jardin du Roi  Great for Brunch 
Vela Kitchen  Contemporary American 
Crabtree's Kittle House  Notable Wine List 
Mediterraneo Ristorante & Caffe 
RiverMarket Bar & Kitchen  Great for Brunch 
Fin & Brew  Outdoor Dining 
Lexington Square Cafe 
Bistro 146  Neighborhood Gem 
8 North Broadway 
Guadalajara Mexican Restaurant 
Velo  Notable Wine List 
The Hudson Room  Hot Spot 


Or we could click into an individual restaurant to learn more.

In [30]:
first_rest = driver.find_element_by_xpath('//div[@class="rest-row-header"]//a')
first_rest.click()

In [31]:
rest_soup = BeautifulSoup(driver.page_source, 'html.parser')

print(rest_soup.title.text)

Restaurant Reservation Availability


> **QUESTION**:  Why doesn't the title of this page match up with this individual restaurant?

In [32]:
#Switch windows!
driver.switch_to.window(driver.window_handles[1])

In [33]:
rest_soup = BeautifulSoup(driver.page_source, 'html.parser')

print(rest_soup.title.text)

Fratelli's Trattoria Restaurant - Croton-on-Hudson, NY | OpenTable
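
Indexing `window_handles[1]` assumes exactly one extra window opened. A slightly more defensive sketch records the handles before the click and switches to whichever handle is new; `switch_to_new_window` is a hypothetical helper, while `window_handles` and `switch_to.window` are the same driver attributes used above.

```python
def switch_to_new_window(driver, handles_before):
    """Switch to the first window handle that was not open before; return it."""
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("no new window was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]

# Usage with a live driver:
# before = list(driver.window_handles)
# first_rest.click()
# switch_to_new_window(driver, before)
```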


As usual when working with Selenium, make sure to close your browser.  Since we have two windows up, we use `driver.quit()` to close the entire browser session.

In [34]:
driver.quit()
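
If a cell raises an exception partway through, the browser can be left running. One way to guard against that, sketched here with the standard library's `contextlib`, is to wrap the session so `quit()` always runs; `managed_driver` is an illustrative name and `factory` is any zero-argument callable that returns a driver.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    """Yield a driver from factory() and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # closes every window and ends the session

# Usage with a live driver:
# with managed_driver(lambda: webdriver.Chrome(chromedriver)) as driver:
#     driver.get("http://www.opentable.com/")
```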