# Selenium

- The Selenium Python bindings provide a simple API for writing functional/acceptance tests with Selenium WebDriver. Through the Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way.

- Selenium was originally developed for the purpose of automated website testing. It's also a powerful tool for web scraping with dynamic pages.

- Selenium can be controlled from various programming languages, such as Java, C#, PHP, and Python.
- Reference page: https://selenium-python.readthedocs.io/installation.html#introduction


## Configuration of Selenium
### Step 1: install Selenium

Use `pip` or `conda` to install the Python module. 

```
pip install -U selenium

conda install selenium
```

Mac users can run it in Jupyter Notebook with the magic command:

In [3]:
# %%bash
# pip install -U selenium

Windows users can run it in Jupyter Notebook with the magic command:

In [2]:
# %%cmd 
# pip install -U selenium

### Step 2: download a webdriver file matching your platform (Windows, Mac, or Linux) and your browser.

Webdriver: a compact object-oriented API that drives the browser effectively

- For Chrome users
    - Head to https://sites.google.com/chromium.org/driver/downloads?authuser=0 and download the latest version matching your Chrome version.
    - The ZIP file you downloaded will contain an executable called “chromedriver.exe” on Windows or just “chromedriver” otherwise. The easiest way to make sure Selenium can see this executable is to put it in the same directory as your Python scripts.
- For Safari users
    - Safari provides native support for the WebDriver API starting with Safari 10 on OS X El Capitan and macOS Sierra. This might be outdated; check the up-to-date information online.
    - Enable Remote Automation in the Develop menu: choose Develop > Allow Remote Automation in the menu bar.
    - Reference: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
    
**In the following section, we will use Chrome Webdriver for all the demonstrations**.
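
Before starting a browser session, it can help to confirm that Python can actually find the driver executable. The sketch below is one way to do that with the standard library only. The helper name `find_driver` and its search order (a given folder first, then the `PATH`) are assumptions made for illustration and are not part of Selenium.

```python
import shutil
from pathlib import Path


def find_driver(name="chromedriver", folder="."):
    """Return the path to a driver executable, or None if it is not found.

    Looks in `folder` first (for example the directory of your scripts),
    then in the directories listed in the PATH environment variable.
    """
    # On Windows the file is chromedriver.exe, elsewhere just chromedriver
    for candidate in (name, name + ".exe"):
        local = Path(folder) / candidate
        if local.is_file():
            return str(local.resolve())
    # shutil.which searches PATH and returns None when nothing matches
    return shutil.which(name)


print(find_driver() or "chromedriver not found; see Step 2")
```

If the helper prints a path, `webdriver.Chrome()` should be able to start without further configuration.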

In [7]:
#test

from selenium import webdriver

#url = 'http://www.google.com'
url = 'https://the-internet.herokuapp.com/dynamic_loading/2'
#url = 'http://www.webscrapingfordatascience.com/postform2/'

# open Chrome using ChromeDriver
driver = webdriver.Chrome()

# If you use Safari
# driver = webdriver.Safari()

driver.get(url)

# input('Press ENTER to close the automated browser')
# driver.quit()

# print(driver.page_source)

### Google Colab

The following methods for running a webdriver on Google Colab are collected here for reference. We will stick to Jupyter Notebook for the rest of the tutorial.

In [None]:
#method 1
#for google colab run following codes to set up selenium and webdriver (Chrome)
#step 1
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

In [5]:
#step 2
from selenium import webdriver
options = webdriver.ChromeOptions()
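options.add_argument('--headless')               # run Chrome without a visible window
options.add_argument('--no-sandbox')             # needed when running as root, as on Colab
options.add_argument('--disable-dev-shm-usage')  # avoid problems with a small /dev/shm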

In [11]:
#step 3: test

# recent Selenium 4 releases no longer accept the driver path as a positional argument
# driver = webdriver.Chrome(options=options)
# url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
# driver.get(url)
# print(driver.page_source) # results

In [None]:
#method 2
#use kora module in google colab
!pip install kora -q


kora is a collection of tools to make programming on Google Colab easier. GitHub: https://github.com/korakot/kora.

In [12]:
#test
# from kora.selenium import wd
# url = 'http://www.webscrapingfordatascience.com/postform2/'
# wd.get(url)
# print(wd.page_source)

### WebDriver Location
If you prefer to keep the WebDriver executable somewhere else, you can pass its location when you construct the Selenium webdriver object in Python, like so (however, we’ll assume that you keep the executable in the same directory for the examples that follow, to keep the code a bit shorter):

In [13]:
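# Selenium 4 takes the driver location through a Service object
# from selenium.webdriver.chrome.service import Service
# driver_exe = 'C:/Users/Henry/Desktop/chromedriver.exe'
# driver = webdriver.Chrome(service=Service(executable_path=driver_exe))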

## Common `selenium` methods

We can use Selenium methods to locate and handle HTML elements, much as we did with Beautiful Soup. Each call takes a locator strategy from the `By` class and a value to match:
- `find_element(By.ID, value)`
    - selects elements based on the HTML `id` attribute.
- `find_element(By.NAME, value)`
    - selects elements based on the HTML `name` attribute.
- `find_element(By.XPATH, value)`
    - selects elements using XPath.
- `find_element(By.LINK_TEXT, value)`
    - selects elements (e.g., a link) by matching the inner text.
- `find_element(By.PARTIAL_LINK_TEXT, value)`
    - does the same, but partially matches the inner text.
- `find_element(By.TAG_NAME, value)`
    - selects elements using the actual tag name.
- `find_element(By.CLASS_NAME, value)`
    - selects elements based on a single name in the HTML `class` attribute.
- `find_element(By.CSS_SELECTOR, value)`
    - is similar to Beautiful Soup’s select method, but comes with a more robust CSS selector rule parser.

To find multiple elements, use `find_elements` with the same locators. It returns a list, which is empty when nothing matches.
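
Several of these strategies are shortcuts for selectors you could write by hand. As a rough illustration, the sketch below maps the ID, name, class and tag strategies to equivalent CSS selectors. `to_css` is a hypothetical helper, not a Selenium function, and the strategy names are the plain strings that the `By` constants hold (for example, `By.ID` is the string `"id"`).

```python
def to_css(strategy, value):
    """Translate a simple locator into an equivalent CSS selector string."""
    if strategy == "id":
        return f'[id="{value}"]'
    if strategy == "name":
        return f'[name="{value}"]'
    if strategy == "class name":
        return f".{value}"  # expects a single class name, not several
    if strategy in ("tag name", "css selector"):
        return value        # a tag name is already a valid CSS selector
    raise ValueError(f"no CSS equivalent sketched for strategy {strategy!r}")


examples = [
    ("id", "fname"),
    ("name", "q"),
    ("class name", "mw-page-title-main"),
    ("tag name", "h1"),
]
for strategy, value in examples:
    print(f"{strategy:12} {value:20} -> {to_css(strategy, value)}")
```

Under this mapping, `driver.find_element(By.ID, 'fname')` and `driver.find_element(By.CSS_SELECTOR, '[id="fname"]')` should select the same element. Link text and XPath locators have no direct CSS counterpart, which is why the helper rejects them.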

### XPath and XML
XPath is a language used for locating nodes in XML documents. As HTML can be regarded as an implementation of XML (also referred to as XHTML in this case), Selenium can use the XPath language to select elements. XPath extends beyond the simple methods of locating by id or name attributes.

A quick overview of common XPath expressions is given below:
- `nodename` selects all nodes with the name “nodename”;
- `/` selects from the root node;
- `//` can be used to skip multiple levels of nodes and search through all descendants to perform a selection;
- `.` selects the current node;
- `..` selects the parent of the current node;
- `@` selects attributes.

Examples:

- `/html/body/form[1]`: get the first form element inside the `<body>` tag inside the `<html>` tag.
- `//form[1]`: get the first form element inside the document.
- `//form[@id='my_form']`: get the form element with an `id` attribute set to `my_form`.
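
These expressions can be tried without a browser. The sketch below uses Python's built-in `xml.etree.ElementTree`, which is not what Selenium uses: it supports only a subset of XPath, needs well-formed markup, and evaluates paths relative to an element, so each path starts with `.`. The small page in `PAGE` is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page used only for this illustration
PAGE = """
<html>
  <body>
    <form id="login"><input name="user"/></form>
    <form id="my_form"><input name="q"/></form>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# ./body/form[1] : the first <form> child of <body>
first_form = root.find("./body/form[1]")

# .//form[@id='my_form'] : a <form> at any depth, filtered by attribute
my_form = root.find(".//form[@id='my_form']")

# .//input : every <input> at any depth
input_names = [node.get("name") for node in root.findall(".//input")]

# .. : step from the matched <input> up to its parent <form>
parent = root.find(".//input[@name='q']/..")

print(first_form.get("id"), my_form.get("id"), input_names, parent.get("id"))
```

In Selenium, the corresponding calls would pass the full XPath string, for example `driver.find_element(By.XPATH, "//form[@id='my_form']")`.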

### Types of XPath

XPath can be used to locate any element on a page based on its tag name, ID, CSS class, and so on. There are two types of XPath in Selenium.

1. Absolute XPath: 
    - Absolute XPath is the simplest form of XPath in Selenium. It starts with a single slash `/` and gives the complete path of an element from the root of the DOM.
    - Although it is simple, its biggest disadvantage is that it is very vulnerable to changes in the DOM structure: a small change in the page layout breaks the path and causes automation failures.
    
    ```
    /html/body/div/div/div/div[1]/div/a/img
    ```

2. Relative XPath
    - A relative XPath expression can start from anywhere in the DOM structure. It begins with a double slash `//`, which searches all descendants of the document root instead of following a fixed path from it.
    - It is generally preferred over an absolute XPath because it does not depend on the complete path from the root element.
    
    ```
    //img[@alt='LambdaTest']
    ```

### How to write XPath in Selenium

1. XPath using `contains()`: `contains()` is a very useful function in XPath. It suits web elements whose attribute value changes dynamically, because it matches on a fixed part of the value. The syntax is
    ```
    //tagname[contains(@attribute, 'constantvalue')]
    ```

2. XPath using the logical operators `or` and `and`: we can combine conditions on attributes with logical operators, which are written in lowercase in XPath. With `or`, at least one of the conditions must be true; with `and`, both conditions must be true.
    ```
    //tagname[@attribute1='value1' or @attribute2='value2']

    //tagname[@attribute1='value1' and @attribute2='value2']
    ```

3. XPath using `text()`: `text()` is used when an HTML tag contains text and we wish to identify the element by that text. This comes in handy when the attribute values change dynamically and offer no stable part to match with `starts-with()` or `contains()`.
    ```
    //tagname[text()='Text of the Web Element']
    ```
    
4. XPath using `starts-with()`: `starts-with()` is similar to `contains()`. It is helpful for web elements whose attribute value changes dynamically but keeps a fixed beginning. The starting part of the attribute value is used to locate the element, as in `//tagname[starts-with(@attribute, 'value')]`.
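
As a small illustration, the patterns above can be assembled as strings and then passed to `find_element(By.XPATH, ...)`. The helper names below are hypothetical. They wrap values in single quotes, so they assume the value itself contains no single quote.

```python
def xpath_contains(tag, attribute, fragment):
    """Match elements whose attribute value contains `fragment`."""
    return f"//{tag}[contains(@{attribute}, '{fragment}')]"


def xpath_starts_with(tag, attribute, prefix):
    """Match elements whose attribute value begins with `prefix`."""
    return f"//{tag}[starts-with(@{attribute}, '{prefix}')]"


def xpath_text(tag, text):
    """Match elements whose text is exactly `text`."""
    return f"//{tag}[text()='{text}']"


def xpath_attrs(tag, operator="and", **attributes):
    """Combine attribute conditions with 'and' or 'or' (lowercase in XPath)."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    conditions = f" {operator} ".join(
        f"@{name}='{value}'" for name, value in attributes.items()
    )
    return f"//{tag}[{conditions}]"


print(xpath_contains("input", "id", "name"))
print(xpath_starts_with("input", "id", "f"))
print(xpath_text("button", "Start"))
print(xpath_attrs("input", "and", type="text", name="fname"))
```

A call such as `driver.find_element(By.XPATH, xpath_text("button", "Start"))` would then look for a button whose text is exactly "Start".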


[Reference](https://www.lambdatest.com/blog/complete-guide-for-using-xpath-in-selenium-with-examples/)

#### Special Keys 

The `Keys` class in `selenium` lets us send special keyboard keys, such as ENTER or TAB, to the browser. The available keys are listed in the table (source: https://www.geeksforgeeks.org/special-keys-in-selenium-python/#).

|            |              |            |
|------------|--------------|------------|
| ADD        | ALT          | ARROW_DOWN |
| ARROW_LEFT | ARROW_RIGHT  | ARROW_UP   |
| BACKSPACE  | BACK_SPACE   | CANCEL     |
| CLEAR      | COMMAND      | CONTROL    |
| DECIMAL    | DELETE       | DIVIDE     |
| DOWN       | END          | ENTER      |
| EQUALS     | ESCAPE       | F1         |
| F10        | F11          | F12        |
| F2         | F3           | F4         |
| F5         | F6           | F7         |
| F8         | F9           | HELP       |
| HOME       | INSERT       | LEFT       |
| LEFT_ALT   | LEFT_CONTROL | LEFT_SHIFT |
| META       | MULTIPLY     | NULL       |
| NUMPAD0    | NUMPAD1      | NUMPAD2    |
| NUMPAD3    | NUMPAD4      | NUMPAD5    |
| NUMPAD6    | NUMPAD7      | NUMPAD8    |
| NUMPAD9    | PAGE_DOWN    | PAGE_UP    |
| PAUSE      | RETURN       | RIGHT      |
| SEMICOLON  | SEPARATOR    | SHIFT      |
| SPACE      | SUBTRACT     | TAB        |

### Code Demonstration

- Import the `webdriver` module.
- `Keys` and `By` are two common classes in the `selenium` package. Import them as needed.
    - The `Keys` class provides the special keys of the keyboard
    - The `By` class is used to specify how to locate elements

In [None]:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

Let's use `selenium` to scrape the Wikipedia page of Information System: https://en.wikipedia.org/wiki/Information_system. 

In [77]:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://en.wikipedia.org/wiki/Information_system'
driver = webdriver.Chrome()
driver.get(url)
# print(driver.page_source)
driver.find_element(By.TAG_NAME, 'h1')

<selenium.webdriver.remote.webelement.WebElement (session="15d2161d8051ed4551b6fc73c2b416ca", element="A5955F550FF65897A4A0096D3EEEE74A_element_30")>

In [80]:
# find the page title with tag name
driver.find_element(By.TAG_NAME, 'h1').text

'Information system'

In [24]:
# find the page title with class attribute
driver.find_element(By.CLASS_NAME,'mw-page-title-main').text

'Information system'

In [30]:
# find a link based on textual value 
print(driver.find_element(By.PARTIAL_LINK_TEXT, "Information system").text)
driver.find_element(By.PARTIAL_LINK_TEXT, "Information system")

Information systems for managers: with cases


<selenium.webdriver.remote.webelement.WebElement (session="7afa77372a487b0266c571623e6f685d", element="C08460BC0092BE8A39F1AECE18737377_element_222")>

In [34]:
#dynamic web form example
from selenium import webdriver

url = 'https://www.w3schools.com/html/html_forms.asp'
# driver = webdriver.Chrome()
driver.get(url)
#print(driver.page_source)

Perform operations on websites with `selenium`:
- Click the "Start" button on this website: https://the-internet.herokuapp.com/dynamic_loading/2
- Send data to a web form, locating the fields by XPath or CSS selector, and submit it with a click action

In [38]:
#url = 'http://www.webscrapingfordatascience.com/postform2/'
url = 'https://the-internet.herokuapp.com/dynamic_loading/2'
# driver = webdriver.Chrome()
driver.get(url)

# Using XPATH
# driver.find_element(By.XPATH, '//*[@id="start"]/button').click()

# Using CSS Selector
# driver.find_element(By.CSS_SELECTOR, '#start > button').click()

Remember what we did in the first chapter with `requests.post()`: we sent data to a web form with the HTTP POST method. 

The example: 

In [52]:
import requests
url = 'http://www.webscrapingfordatascience.com/postform2/'
# First perform a GET request
r = requests.get(url)
# Followed by a POST request
formdata = {
    'name': 'YL',
    'gender': 'F',
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': 'no comment'
}
r = requests.post(url, data=formdata)
print(r.text)

<html>
	<body>


<h2>Thanks for submitting your information</h2>

<p>Here's a dump of the form data that was submitted:</p>

<pre>array(5) {
  ["name"]=>
  string(2) "YL"
  ["gender"]=>
  string(1) "F"
  ["pizza"]=>
  string(4) "like"
  ["haircolor"]=>
  string(5) "brown"
  ["comments"]=>
  string(10) "no comment"
}
</pre>


	</body>
</html>



Now, let's try to remotely control a web form with `selenium` webdriver.

#### Example 1 
Let's submit our first name and last name to a web form on this page: https://www.w3schools.com/html/html_forms.asp

In [41]:
# from selenium import webdriver
url = 'https://www.w3schools.com/html/html_forms.asp'
# driver = webdriver.Chrome()
driver.get(url)

import time

#find_element_by_id
#id="fname"
#id="lname"

# find first name
fname = driver.find_element(By.ID, 'fname')
# find last name
lname = driver.find_element(By.ID, 'lname')
# find the button
btn = driver.find_element(By.XPATH, '//*[@id="main"]/div[3]/div/form/input[3]')  

# clear the text of the elements first
# then type into the element
fname.clear()
fname.send_keys('Yuxiao')
time.sleep(3)
lname.clear()
lname.send_keys('Luo')
time.sleep(3)

# submit the answer in the web form
# btn.click()
lname.submit()

To retrieve the HTML code of the page:

In [None]:
# print(driver.page_source)

In [51]:
from bs4 import BeautifulSoup
BeautifulSoup(driver.page_source, 'html.parser').find('form')

<form action="/action_page.php" data-gtm-form-interact-id="0" target="_blank">
<label for="fname">First name:</label><br/>
<input data-gtm-form-interact-field-id="0" id="fname" name="fname" type="text" value="John"/><br/>
<label for="lname">Last name:</label><br/>
<input data-gtm-form-interact-field-id="1" id="lname" name="lname" type="text" value="Doe"/><br/><br/>
<input type="submit" value="Submit"/>
</form>

#### Example 2
Let's fetch the video information from a dynamic website: the YouTube page of PlayStation, https://www.youtube.com/c/PlayStation/videos.

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.youtube.com/c/PlayStation/videos'
driver = webdriver.Chrome()
driver.get(url)

#class="style-scope ytd-grid-video-renderer"
#//*[@id="video-title"]
#//*[@id="metadata-line"]/span[1]
#//*[@id="metadata-line"]/span[2]

#video-title

# waits maximum 10 seconds for element's presence 
# since video site takes longer to load
driver.implicitly_wait(10) #10 second 

# videos = driver.find_elements(By.CLASS_NAME,'style-scope ytd-rich-grid-media')
videos = driver.find_elements(By.TAG_NAME, 'ytd-rich-grid-media')

newlist =[]
for video in videos:
    title = video.find_element(By.XPATH, './/*[@id="video-title"]').text
    views = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[1]').text
    # named `uploaded` so that the `time` module is not shadowed
    uploaded = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[2]').text
    #print(title,views,uploaded)
    vd = {
        'Title':title,
        'Number of views':views,
        'Time uploaded':uploaded
    }
    newlist.append(vd)

newlist


[{'Title': 'Minecraft - Teenage Mutant Ninja Turtles Launch Trailer | PS4 & PS VR Games',
  'Number of views': '13K views',
  'Time uploaded': '5 hours ago'},
 {'Title': 'Call of Duty: Modern Warfare II & Warzone - Season 05 BlackCell Battle Pass Upgrade | PS5 & PS4',
  'Number of views': '24K views',
  'Time uploaded': '6 hours ago'},
 {'Title': 'The Jackbox Party Pack 10 - Dodo Re Mi Reveal Trailer | PS5 & PS4 Games',
  'Number of views': '19K views',
  'Time uploaded': '8 hours ago'},
 {'Title': 'Pistol Whip - Overdrive: Majesty Available Now | PS VR2 Games',
  'Number of views': '12K views',
  'Time uploaded': '10 hours ago'},
 {'Title': 'Atlas Fallen - Lord of the Sands | PS5 Games',
  'Number of views': '25K views',
  'Time uploaded': '11 hours ago'},
 {'Title': 'Bad Dreams - Launch Trailer | PS4 & PSVR Games',
  'Number of views': '26K views',
  'Time uploaded': '11 hours ago'},
 {'Title': 'Tower of Fantasy - Pre-Order Exclusives | PS5 & PS4 Games',
  'Number of views': '36K vie

In [4]:
# You can also use CSS selector
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.youtube.com/c/PlayStation/videos'
driver = webdriver.Chrome()
driver.get(url)

#class="style-scope ytd-grid-video-renderer"
# CSS selector:
#video-title
#metadata-line > span:nth-child(3)
#metadata-line > span:nth-child(4)

# waits maximum 10 seconds for element's presence 
# since video site takes longer to load
driver.implicitly_wait(10) #10 second 

# videos = driver.find_elements(By.CLASS_NAME,'style-scope ytd-rich-grid-media')
videos = driver.find_elements(By.TAG_NAME, 'ytd-rich-grid-media')

newlist =[]
for video in videos:
    title = video.find_element(By.CSS_SELECTOR, '#video-title').text
    views = video.find_element(By.CSS_SELECTOR, '#metadata-line > span:nth-child(3)').text
    # named `uploaded` so that the `time` module is not shadowed
    uploaded = video.find_element(By.CSS_SELECTOR, '#metadata-line > span:nth-child(4)').text
    #print(title,views,uploaded)
    vd = {
        'Title':title,
        'Number of views':views,
        'Time uploaded':uploaded
    }
    newlist.append(vd)

newlist


[{'Title': 'Minecraft - Teenage Mutant Ninja Turtles Launch Trailer | PS4 & PS VR Games',
  'Number of views': '13K views',
  'Time uploaded': '5 hours ago'},
 {'Title': 'Call of Duty: Modern Warfare II & Warzone - Season 05 BlackCell Battle Pass Upgrade | PS5 & PS4',
  'Number of views': '24K views',
  'Time uploaded': '6 hours ago'},
 {'Title': 'The Jackbox Party Pack 10 - Dodo Re Mi Reveal Trailer | PS5 & PS4 Games',
  'Number of views': '19K views',
  'Time uploaded': '8 hours ago'},
 {'Title': 'Pistol Whip - Overdrive: Majesty Available Now | PS VR2 Games',
  'Number of views': '12K views',
  'Time uploaded': '10 hours ago'},
 {'Title': 'Atlas Fallen - Lord of the Sands | PS5 Games',
  'Number of views': '25K views',
  'Time uploaded': '11 hours ago'},
 {'Title': 'Bad Dreams - Launch Trailer | PS4 & PSVR Games',
  'Number of views': '26K views',
  'Time uploaded': '11 hours ago'},
 {'Title': 'Tower of Fantasy - Pre-Order Exclusives | PS5 & PS4 Games',
  'Number of views': '36K vie

#### Example 3
Let's fetch the video information from the YouTube page of Drake.

In [69]:
from selenium import webdriver
from selenium.webdriver.common.by import By  # needed for the By.* locators below

url = 'https://www.youtube.com/user/DrakeOfficial/videos'
driver = webdriver.Chrome()
driver.get(url)

#class="style-scope ytd-grid-video-renderer"
#//*[@id="video-title"]
#//*[@id="metadata-line"]/span[1]
#//*[@id="metadata-line"]/span[2]

# waits maximum 10 seconds for element's presence 
# since video site takes longer to load
driver.implicitly_wait(10) #10 second 

# videos = driver.find_elements(By.CLASS_NAME,'style-scope ytd-grid-video-renderer')
videos = driver.find_elements(By.TAG_NAME, 'ytd-rich-item-renderer')

newlist =[]
for video in videos:
    title = video.find_element(By.XPATH, './/*[@id="video-title"]').text
    views = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[1]').text
    # named `uploaded` so that the `time` module is not shadowed
    uploaded = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[2]').text
    #print(title,views,uploaded)
    vd = {
        'Title':title,
        'Number of views':views,
        'Time uploaded':uploaded
    }
    newlist.append(vd)

newlist

[{'Title': 'Drake - Search & Rescue (Official Visualizer)',
  'Number of views': '20M views',
  'Time uploaded': '3 months ago'},
 {'Title': 'Drake, 21 Savage - Spin Bout U',
  'Number of views': '32M views',
  'Time uploaded': '5 months ago'},
 {'Title': 'Drake - Jumbotron Shit Poppin',
  'Number of views': '14M views',
  'Time uploaded': '6 months ago'},
 {'Title': 'Drake and 21 Savage - Rich Flex Her Loss Recap',
  'Number of views': '28M views',
  'Time uploaded': '8 months ago'},
 {'Title': 'Drake & 21 Savage - Privileged Rappers | A COLORS SHOW',
  'Number of views': '14M views',
  'Time uploaded': '8 months ago'},
 {'Title': 'Drake and 21 Savage performing “On BS” live on SNL',
  'Number of views': '9.1M views',
  'Time uploaded': '8 months ago'},
 {'Title': 'Drake - Middle of the Ocean (Audio)',
  'Number of views': '6M views',
  'Time uploaded': '8 months ago'},
 {'Title': 'Drake, 21 Savage - On BS (Audio)',
  'Number of views': '12M views',
  'Time uploaded': '8 months ago'},
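
The scraped view counts are strings such as `'20M views'`, which cannot be sorted or summed directly. The sketch below shows one way to turn them into numbers. It assumes the English-locale `K`/`M`/`B` suffix format seen in the output above, and `parse_views` is a hypothetical helper name. Because YouTube rounds the labels, the results are approximate.

```python
from decimal import Decimal

# Suffixes as they appear in the scraped labels (assumed format)
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_views(label):
    """Convert a label such as '9.1M views' into an approximate integer."""
    number = label.split()[0].replace(",", "")  # '9.1M views' -> '9.1M'
    suffix = number[-1].upper()
    if suffix in MULTIPLIERS:
        # Decimal keeps 9.1 * 1_000_000 exact, unlike binary floats
        return int(Decimal(number[:-1]) * MULTIPLIERS[suffix])
    return int(Decimal(number))


videos = [
    {"Title": "Drake - Search & Rescue (Official Visualizer)", "Number of views": "20M views"},
    {"Title": "Drake - Middle of the Ocean (Audio)", "Number of views": "6M views"},
]
for video in videos:
    video["Views (approx.)"] = parse_views(video["Number of views"])
    print(video["Title"], video["Views (approx.)"])
```

Labels without a leading number, such as "No views", would raise an error in this sketch and would need their own branch.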

### Selenium Wait 
Waits help to deal with `NoSuchElementException`, which is raised when we look for an element that has not loaded yet. Reference: https://selenium-python.readthedocs.io/waits.html

Implicit wait:
- Makes WebDriver poll the page for up to a given amount of time whenever it tries to locate an element
- This can be useful when certain elements on the webpage are not available immediately and need some time to load
- Example: `driver.implicitly_wait(10)` waits up to 10 seconds

Explicit wait:
- Makes WebDriver wait until a given condition returns a non-False value before execution continues
- `presence_of_element_located(locator)`: checks whether at least one element matching the locator is present on the page. The locator is a tuple such as `(By.ID, "myDynamicElement")`. If found, the condition returns the first matching element. It comes from the `expected_conditions` module, usually imported as `EC`. More details are included in the tutorial "More_About_Selenium".

Syntax:

```
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://somedomain.com/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
```
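
To make the idea of an explicit wait concrete, here is a simplified stand-in for the polling that `WebDriverWait(...).until(...)` performs. It is a sketch, not Selenium's implementation: the names `wait_until` and `WaitTimeout` are made up, the condition takes no arguments, and the demo uses a counter instead of a browser. The real class passes the driver to the condition, ignores `NoSuchElementException` while polling, and raises `TimeoutException` when time runs out.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never returned a truthy value in time."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition found
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)   # pause before checking again


# Demo without a browser: the "element" only appears on the third check
attempts = []


def element_appears():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else False


print(wait_until(element_appears, timeout=2, poll=0.1))
print("checks needed:", len(attempts))
```

The loop returns as soon as the condition succeeds, which is why an explicit wait is usually faster than a fixed `time.sleep()` of the same length.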

In [76]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
import time


driver = webdriver.Chrome()
# Go to www.google.com
driver.get("https://www.google.com")

try:
    # Wait as long as required, or maximum of 5 sec for element to appear
    # If successful, retrieves the element
    element = WebDriverWait(driver,5).until(
         EC.presence_of_element_located((By.NAME, "q"))) # "q" is the name attribute of the search box

    # Type "selenium"
    element.send_keys("selenium")
    
    #Type Enter
    element.send_keys(Keys.ENTER)
    # time.sleep(3)
    # print(driver.page_source.encode('utf-8'))


except TimeoutException:
    print("Failed to load search bar at www.google.com")
    
# The finally block will always be executed, no matter if the try block raises an error or not
#finally:
#    driver.quit() 

### Ethical use of web scraping

- LinkedIn sued anonymous data scrapers: https://techcrunch.com/2016/08/15/linkedin-sues-scrapers/.

- Check the `robots.txt` file of a website before scraping it. This file states which paths automated clients are asked not to visit. 
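
Checking `robots.txt` can be automated with Python's standard `urllib.robotparser` module. The sketch below parses an in-memory example so that it runs without network access. The rules in `ROBOTS_TXT`, the `example.com` URLs and the agent name `course-scraper` are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for this illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "course-scraper"  # hypothetical name of our scraper
urls = [
    "https://example.com/public/index.html",
    "https://example.com/private/data.html",
]
for url in urls:
    verdict = "allowed" if parser.can_fetch(AGENT, url) else "disallowed"
    print(url, "->", verdict)

print("requested delay between requests:", parser.crawl_delay(AGENT))
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads the file instead of parsing a string.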

### Key principles for web scraping
- **Get written permission**: The best way to avoid legal issues is to get written permission from a website’s owner covering which data you can scrape and to what extent. 

- **Check the terms of use**: These will often include explicit provisions against automated extraction of data. Oftentimes, a site’s API will come with its own terms of use regarding usage, which you should check as well.

- **Public information only**: If a site exposes information publicly, without explicitly requiring acceptance of terms and conditions, moderate scraping is most likely fine. Sites that require you to log in are another story, however.

- **Don’t cause damage**: Be nice when scraping! Don’t hammer websites with lots of requests, overloading their network and blocking normal usage. Stay away from protected computers, and do not try to access servers you’re not given access to.

- **Copyright and fair use**: Copyright law seems to provide the strongest means for plaintiffs to argue their case. Check carefully whether your scraping case would fall under fair use, and do not use copyrighted works in commercial projects.
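
The "don't cause damage" principle can be put into practice by spacing out requests. The generator below is a minimal sketch of that idea. The name `polite`, the default delay of 2 seconds and the `example.com` URLs are assumptions, and a suitable delay depends on the site.

```python
import time


def polite(urls, delay=2.0, sleep=time.sleep):
    """Yield URLs one at a time, pausing `delay` seconds between them."""
    for index, url in enumerate(urls):
        if index:          # no pause before the first request
            sleep(delay)
        yield url


pages = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]
for url in polite(pages, delay=0.1):
    # replace the print with driver.get(url) or requests.get(url)
    print("fetching", url)
```

The `sleep` parameter exists only so that the pause can be replaced in tests. In normal use the default `time.sleep` is what you want.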