# Selenium

- The Selenium Python bindings provide a simple API for writing functional/acceptance tests with Selenium WebDriver. Through this API you can access all the functionality of Selenium WebDriver in an intuitive way.

- Selenium was originally developed for automated website testing. It is also a powerful tool for scraping dynamic web pages.

- Selenium can be controlled from various programming languages, such as Java, C#, PHP, and Python.
- Reference page: https://selenium-python.readthedocs.io/installation.html#introduction

## Configuration of Selenium
### Step 1: install Selenium

```
pip install -U selenium

# or, if you use conda:
conda install selenium
```
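To confirm that the install worked, you can ask Python which version it sees. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of Selenium.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the Selenium version string, or None if the package is not installed
print(installed_version("selenium"))
```

If this prints `None`, the package was probably installed into a different environment than the one the notebook kernel uses.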


In [1]:
%%cmd 
# pip install -U selenium

Microsoft Windows [Version 10.0.19044.1706]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Yuxiao Luo\Documents\python3\Analytics_Python\Web_Scraping>pip install -U selenium
Collecting selenium
  Downloading selenium-4.2.0-py3-none-any.whl (983 kB)
     ------------------------------------- 983.2/983.2 kB 12.5 MB/s eta 0:00:00
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.21.0-py3-none-any.whl (358 kB)
     ------------------------------------- 359.0/359.0 kB 11.3 MB/s eta 0:00:00
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting pyOpenSSL>=0.14
  Downloading pyOpenSSL-22.0.0-py2.py3-none-any

### Step 2: download a webdriver file matching your platform (Windows, Mac, or Linux) and your browser.

Webdriver: a compact object-oriented API that drives the browser effectively

- For Chrome users
    - Head to https://sites.google.com/chromium.org/driver/downloads?authuser=0 and download the latest ChromeDriver release that matches your Chrome version.
    - The ZIP file you download contains an executable called “chromedriver.exe” on Windows, or just “chromedriver” otherwise. The easiest way to make sure Selenium can find this executable is to put it in the same directory as your Python scripts.
- For Safari users
    - Safari has provided native support for the WebDriver API since Safari 10 on OS X El Capitan and macOS Sierra.
    - Enable Remote Automation via Develop > Allow Remote Automation in the menu bar.
    - Reference: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
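Before launching a browser, it can help to check whether the driver executable is actually findable. The sketch below looks next to the script first and then on `PATH`; `locate_driver` is a hypothetical helper, and the lookup order is an assumption that mirrors the advice above rather than Selenium's own search logic.

```python
import shutil
from pathlib import Path


def locate_driver(name="chromedriver", script_dir="."):
    """Look for the driver next to the script first, then on PATH."""
    for candidate in (name, name + ".exe"):
        local = Path(script_dir) / candidate
        if local.is_file():
            return str(local.resolve())
    # shutil.which returns None when the driver is not on PATH either
    return shutil.which(name)


print(locate_driver())
```

A `None` result means the driver is in neither location, which is a common cause of start-up errors in `webdriver.Chrome()`.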

In [93]:
#test

from selenium import webdriver

#url = 'http://www.google.com'
url = 'https://the-internet.herokuapp.com/dynamic_loading/2'
#url = 'http://www.webscrapingfordatascience.com/postform2/'

# open Chrome using ChromeDriver
driver = webdriver.Chrome()

# If you use Safari
# driver = webdriver.Safari()

driver.get(url)
#input('Press ENTER to close the automated browser')
#driver.quit()
print(driver.page_source)

<html class="no-js" lang="en"><!--<![endif]--><head>
    <script src="/js/vendor/298279967.js"></script>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width">
    <title>The Internet</title>
    <link href="/css/app.css" rel="stylesheet">
    <link href="/css/font-awesome.css" rel="stylesheet">
    <script src="/js/vendor/jquery-1.11.3.min.js"></script>
    <script src="/js/vendor/jquery-ui-1.11.4/jquery-ui.js"></script>
    <script src="/js/foundation/foundation.js"></script><meta class="foundation-mq-small"><meta class="foundation-mq-medium"><meta class="foundation-mq-large"><style></style>
    <script src="/js/foundation/foundation.alerts.js"></script>
    <script>
      $(document).foundation();
    </script>
  </head>
  <body>
    <div class="row">
      <div id="flash-messages" class="large-12 columns">
      
        
      
        
      
        
      
      </div>
    </div>
    <div class="row">
      <a href="https://github.com/tourdedave/the-

### Google Colab

In [None]:
#method 1
#for Google Colab, run the following code to set up Selenium and the Chrome webdriver
#step 1
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

In [5]:
#step 2
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

In [7]:
#step 3: test

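# Recent Selenium 4 releases no longer accept the driver path as the first positional
# argument; passing 'chromedriver' there produces the AttributeError recorded below.
driver = webdriver.Chrome(options=options)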
url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
driver.get(url)
print(driver.page_source) # results

AttributeError: 'str' object has no attribute '_ignore_local_proxy'

In [None]:
#method 2
#use kora module in google colab
!pip install kora -q


In [37]:
#test
from kora.selenium import wd
url = 'http://www.webscrapingfordatascience.com/postform2/'
wd.get(url)
print(wd.page_source)

ModuleNotFoundError: No module named 'kora'

## Common `selenium` methods

In [65]:
#find_element(By.ID)
#find_element(By.NAME)
#find_element(By.XPATH)
#find_element(By.LINK_TEXT)
#find_element(By.PARTIAL_LINK_TEXT)
#find_element(By.TAG_NAME)
#find_element(By.CLASS_NAME)
#find_element(By.CSS_SELECTOR)

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.webscrapingfordatascience.com/postform2/'
driver = webdriver.Chrome()
driver.get(url)
# print(driver.page_source)
driver.find_element(By.TAG_NAME, 'tr')

<selenium.webdriver.remote.webelement.WebElement (session="34f27ffa3eabbba6a9c336dc3d2498b8", element="0373c6a5-b8e2-42c9-b5fe-3d1621234e5a")>

In [95]:
#dynamic web form example
from selenium import webdriver
url = 'https://www.w3schools.com/html/html_forms.asp'
driver = webdriver.Chrome()
driver.get(url)
#print(driver.page_source)

In [101]:
#url = 'http://www.webscrapingfordatascience.com/postform2/'
url = 'https://the-internet.herokuapp.com/dynamic_loading/2'
driver = webdriver.Chrome()
driver.get(url)

# trigger a click remotely; the button can be located by XPath or by CSS selector
# (uncomment one of the lines below)

# driver.find_element(By.XPATH, '//*[@id="start"]/button').click()

# driver.find_element(By.CSS_SELECTOR, '#start > button').click()

In [52]:
#send data to a web form with the HTTP POST method

import requests
url = 'http://www.webscrapingfordatascience.com/postform2/'
# First perform a GET request
r = requests.get(url)
# Followed by a POST request
formdata = {
    'name': 'YL',
    'gender': 'F',  # matches the recorded output below
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': 'no comment'
}
r = requests.post(url, data=formdata)
print(r.text)

<html>
	<body>


<h2>Thanks for submitting your information</h2>

<p>Here's a dump of the form data that was submitted:</p>

<pre>array(5) {
  ["name"]=>
  string(2) "YL"
  ["gender"]=>
  string(1) "F"
  ["pizza"]=>
  string(4) "like"
  ["haircolor"]=>
  string(5) "brown"
  ["comments"]=>
  string(10) "no comment"
}
</pre>


	</body>
</html>
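To see what a POST like this actually sends, you can form-encode a dictionary yourself. This is a small standard-library sketch with a subset of the fields above; it only illustrates the encoding and does not contact the site.

```python
from urllib.parse import urlencode

# A subset of the form fields used above
formdata = {"name": "YL", "pizza": "like", "comments": "no comment"}

# Form encoding joins key=value pairs with '&' and turns spaces into '+'
body = urlencode(formdata)
print(body)  # name=YL&pizza=like&comments=no+comment
```

The server decodes this body back into the key/value pairs that appear in the form dump.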



In [72]:
#example 2
#remote control a web form
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.w3schools.com/html/html_forms.asp'
driver = webdriver.Chrome()
driver.get(url)

# locate the two text inputs by their id attributes:
#id="fname"
#id="lname"
fname = driver.find_element(By.ID, 'fname')
lname = driver.find_element(By.ID, 'lname')
btn = driver.find_element(By.XPATH, '//*[@id="main"]/div[3]/div/form/input[3]')

# clear the text of the elements first
# then type into the element
fname.clear()
fname.send_keys('Yuxiao')
time.sleep(1)
lname.clear()
lname.send_keys('Luo')
time.sleep(1)

# submit the form that the element belongs to (alternative: btn.click())
lname.submit()



In [102]:
#fetch information from a dynamic website: YouTube
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.youtube.com/c/PlayStation/videos'
driver = webdriver.Chrome()
driver.get(url)

#class="style-scope ytd-grid-video-renderer"
#//*[@id="video-title"]
#//*[@id="metadata-line"]/span[1]
#//*[@id="metadata-line"]/span[2]

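# implicitly wait up to 10 seconds for elements to be present,
# since the video site takes longer to load
driver.implicitly_wait(10) # seconds

# By.CLASS_NAME expects a single class name, so write the compound lookup as a CSS selector
videos = driver.find_elements(By.CSS_SELECTOR, '.style-scope ytd-grid-video-renderer')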

newlist =[]
for video in videos:
    title = video.find_element(By.XPATH, './/*[@id="video-title"]').text
    views = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[1]').text
    # named `uploaded` rather than `time` to avoid shadowing the time module
    uploaded = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[2]').text
    #print(title,views,uploaded)
    vd = {
        'Title':title,
        'Number of views':views,
        'Time uploaded':uploaded
    }
    newlist.append(vd)

newlist


[{'Title': 'Exoprimal - Gameplay Trailer | PS5 & PS4 Games',
  'Number of views': '21K views',
  'Time uploaded': '4 hours ago'},
 {'Title': 'Hot Wheels Unleashed - Jurassic World Racing Season Trailer | PS5 & PS4 Games',
  'Number of views': '8K views',
  'Time uploaded': '4 hours ago'},
 {'Title': 'Redout 2 - Launch Trailer | PS5 & PS4 Games',
  'Number of views': '8.8K views',
  'Time uploaded': '4 hours ago'},
 {'Title': 'Destruction All Stars 101 Series - Gameplay Breakdown | PS5 Games',
  'Number of views': '19K views',
  'Time uploaded': '6 hours ago'},
 {'Title': 'Labyrinth of Galleria: The Moon Society - Announcement Trailer | PS5 & PS4 Games',
  'Number of views': '14K views',
  'Time uploaded': '7 hours ago'},
 {'Title': 'Sword and Fairy: Together Forever - Exploration Trailer | PS5 & PS4 Games',
  'Number of views': '36K views',
  'Time uploaded': '7 hours ago'},
 {'Title': "Teenage Mutant Ninja Turtles: Shredder's Revenge - Launch Trailer | PS4 Games",
  'Number of views':

In [76]:
#fetch information from a dynamic website: YouTube

url = 'https://www.youtube.com/user/DrakeOfficial/videos'
driver = webdriver.Chrome()
driver.get(url)

#class="style-scope ytd-grid-video-renderer"
#//*[@id="video-title"]
#//*[@id="metadata-line"]/span[1]
#//*[@id="metadata-line"]/span[2]

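# implicitly wait up to 10 seconds for elements to be present,
# since the video site takes longer to load
driver.implicitly_wait(10) # seconds

# By.CLASS_NAME expects a single class name, so write the compound lookup as a CSS selector
videos = driver.find_elements(By.CSS_SELECTOR, '.style-scope ytd-grid-video-renderer')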

newlist =[]
for video in videos:
    title = video.find_element(By.XPATH, './/*[@id="video-title"]').text
    views = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[1]').text
    # named `uploaded` rather than `time` to avoid shadowing the time module
    uploaded = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[2]').text
    #print(title,views,uploaded)
    vd = {
        'Title':title,
        'Number of views':views,
        'Time uploaded':uploaded
    }
    newlist.append(vd)

newlist

[{'Title': 'Drake - Knife Talk (Official Video) ft. 21 Savage, Project Pat',
  'Number of views': '47M views',
  'Time uploaded': '7 months ago'},
 {'Title': 'Drake - Fair Trade (Audio) ft. Travis Scott',
  'Number of views': '57M views',
  'Time uploaded': '9 months ago'},
 {'Title': 'Drake - TSU (Audio)',
  'Number of views': '11M views',
  'Time uploaded': '9 months ago'},
 {'Title': 'Drake - Love All (Audio) ft. JAY-Z',
  'Number of views': '14M views',
  'Time uploaded': '9 months ago'},
 {'Title': 'Drake - No Friends In The Industry',
  'Number of views': '13M views',
  'Time uploaded': '9 months ago'},
 {'Title': 'Drake - Get Along Better (Audio) ft. Ty Dolla $ign',
  'Number of views': '4.5M views',
  'Time uploaded': '9 months ago'},
 {'Title': 'Drake - In The Bible (Audio) ft. Lil Durk, Giveon',
  'Number of views': '18M views',
  'Time uploaded': '9 months ago'},
 {'Title': 'Drake - N 2 Deep (Audio) ft. Future',
  'Number of views': '11M views',
  'Time uploaded': '9 months 
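The view counts scraped above are display strings such as `'47M views'`. If you want numbers, one possible approach is a small parser like the sketch below. `parse_views` is a hypothetical helper, and the K/M/B suffix table is an assumption based on the labels in the output above.

```python
import re

# Assumed suffixes, based on labels such as '21K views' and '47M views'
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_views(text):
    """Convert a label such as '8.8K views' to an integer (8800)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMB]?)\s+views?\s*", text)
    if match is None:
        raise ValueError(f"unrecognised view label: {text!r}")
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[suffix])


print(parse_views("8.8K views"))  # 8800
print(parse_views("47M views"))   # 47000000
```

Because the labels are rounded for display, the parsed numbers are approximations of the real counts.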

### Selenium Wait 
Waits are how you deal with `NoSuchElementException` errors raised when an element has not loaded yet.
Reference: https://selenium-python.readthedocs.io/waits.html

Implicit wait:
- Makes WebDriver poll the page for up to a given amount of time whenever it tries to locate an element that is not immediately available
- This is useful when certain elements on the webpage are not available immediately and need some time to load
- Example: `driver.implicitly_wait(10)` (wait up to 10 seconds)

Explicit wait:
- Makes WebDriver wait for a given condition to return a non-False value before proceeding with execution
- `presence_of_element_located(locator)`: checks whether at least one element matching a locator, i.e. a `(By.<strategy>, value)` tuple, is present on the page. If found, the condition returns the first matching element.
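Conceptually, an explicit wait is a polling loop: call a condition repeatedly until it returns something truthy or time runs out. The sketch below shows that idea in plain Python. It is an illustration only, not Selenium's implementation; `wait_until` and `element_ready` are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for an element that only "appears" on the third check
attempts = []


def element_ready():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else False


print(wait_until(element_ready, timeout=2, poll=0.01))  # element
```

The value returned by the condition is passed back to the caller, which is why an explicit wait can hand you the located element directly.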

In [86]:
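#Syntax (the URL and element ID are placeholders, so running this cell as-is fails with the error shown below)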

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

WebDriverException: Message: unknown error: net::ERR_NAME_NOT_RESOLVED
  (Session info: chrome=102.0.5005.115)


In [78]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
import time


driver = webdriver.Chrome()
# Go to www.google.com
driver.get("https://www.google.com")

try:
    # Wait up to 5 seconds for the element to appear
    # If successful, retrieves the element
    element = WebDriverWait(driver,5).until(
         EC.presence_of_element_located((By.NAME, "q"))) # the search box has name="q"

    # Type "selenium"
    element.send_keys("selenium")
    
    #Type Enter
    element.send_keys(Keys.ENTER)
    time.sleep(3)
    print(driver.page_source.encode('utf-8'))


except TimeoutException:
    print("Failed to load search bar at www.google.com")
#finally:
#    driver.quit() 



b'<html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>selenium - Google Search</title><script src="https://apis.google.com/_/scs/abc-static/_/js/k=gapi.gapi.en.9VzcbxpRKHk.O/m=gapi_iframes,googleapis_client/rt=j/sv=1/d=1/ed=1/rs=AHpOoo_aUoPPaITb9EEzSW7K7ij6VHBgCQ/cb=gapi.loaded_0" nonce="" async=""></script><script nonce="">(function(){\nvar b=window.addEventListener;window.addEventListener=function(a,c,d){"unload"!==a&&b(a,c,d)};}).call(this);(function(){window.google={kEI:\'Z1urYpLBBZD_tAaJyqPgCA\',kEXPI:\'31\',kBL:\'Je5W\'};google.sn=\'web\';google.kHL=\'en\';})();(function(){\nvar f=this||self;var h,k=[];function l(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||h}function m(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentN

### Ethical use of web scraping

- LinkedIn sued anonymous data scrapers: https://techcrunch.com/2016/08/15/linkedin-sues-scrapers/

- Check a website's `robots.txt` file before scraping it.
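Python's standard library can read `robots.txt` rules for you. The sketch below parses a hypothetical rule set offline; the rules, the `my-scraper` user agent, and the `example.com` URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("my-scraper", "https://example.com/private/data.html"))  # False
print(parser.can_fetch("my-scraper", "https://example.com/public/page.html"))   # True
print(parser.crawl_delay("my-scraper"))  # 5
```

For a live site you would call `parser.set_url(...)` with the site's `robots.txt` address and then `parser.read()` instead of `parse`.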

### Key principles for web scraping
- **Get written permission**: The best way to avoid legal issues is to get written permission from a website’s owner covering which data you can scrape and to what extent. 

- **Check the terms of use**: These will often include explicit provisions against automated extraction of data. Oftentimes, a site’s API will come with its own terms of use regarding usage, which you should check as well.

- **Public information only**: If a site exposes information publicly, without explicitly requiring acceptance of terms and conditions, moderate scraping is most likely fine. Sites that require you to log in are another story, however.

- **Don’t cause damage**: Be nice when scraping! Don’t hammer websites with lots of requests, overloading their network and blocking normal usage. Stay away from protected computers, and do not try to access servers you’re not given access to.

- **Copyright and fair use**: Copyright law seems to provide the strongest means for plaintiffs to argue their case. Check carefully whether your scraping case would fall under fair use, and do not use copyrighted works in commercial projects.
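One simple way to follow the "don't cause damage" principle is to enforce a minimum delay between requests. The sketch below is one possible approach; the `Throttle` class and the 1-second interval are hypothetical choices, not a recommendation for any particular site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep requests `min_interval` seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)  # hypothetical value; choose one the site can tolerate
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait()
    # fetch `url` here, e.g. with driver.get(url)
```

The first call returns immediately. Later calls sleep only for the time still missing, so slow page loads are not penalised twice.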