# 3.14.35 Web Scraping and Web Interaction

### Web Interactions

In our last class, we saw how we can scrape data from a web page by referencing specific tags, ids, names or classes from the HTML source code of the page itself. Now, in some cases it may be necessary to **interact** with the page we're on, just like when we use our mouse to click on a link or select an option or when we use our keyboard to insert some text. 

To perform these actions, we introduce a new Python library called `selenium`. Selenium is not shipped with the Anaconda distribution, so you will have to install the package manually with the `pip install` command in your terminal. As always, refer to the [official documentation](https://selenium-python.readthedocs.io/) for more information as well as for an introduction to the library.

*Note: if you're using a Windows computer, you will need to run the following command from within a Conda Terminal*

In [1]:
# run this only if you haven't installed selenium yet:
!pip install selenium

Collecting selenium
  Downloading selenium-4.3.0-py3-none-any.whl (981 kB)
Collecting trio~=0.17
  Downloading trio-0.21.0-py3-none-any.whl (358 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.13.0-py3-none-any.whl (58 kB)
Installing collected packages: outcome, h11, async-generator, wsproto, trio, trio-websocket, selenium
Successfully installed async-generator-

Before we import the library, there is one more thing to do. In order to interact with the browser, we'll need a **WebDriver**: an open source tool for automated testing that provides capabilities for navigating to web pages, user input, JavaScript execution, and more. 

We will be using Google Chrome as our default browser (we recommend that you use Chrome as well, but this can be done with other browsers too), so you'll need to download the latest stable release of **ChromeDriver** that matches your installed Chrome version from [this website](https://sites.google.com/chromium.org/driver/) (make sure you select the .zip file appropriate for your operating system). 

Once you've downloaded this file, unzip it and copy the `chromedriver` file to the folder of this class lesson. 

### A first simple example

At this point, we can import the necessary libraries: 

In [7]:
from selenium import webdriver
import os

The next step is to activate the WebDriver using the `webdriver.Chrome()` function, which requires the path to the `chromedriver` file as argument. We assign the result to an object that we call `driver` and that we'll use later on to interact with the browser instance that appears. 

Note: if you're using a Mac and the command below fails with the error *“chromedriver” cannot be opened because the developer cannot be verified*: 

- open a terminal at the folder where the `chromedriver` file is saved
- run the following command in the terminal: `xattr -d com.apple.quarantine chromedriver`

In [5]:
os.getcwd()

'/Users/carmidellamonica/Documents/dataAnalytics/git/3.14.35-Web-Scraping-and-Web-Interaction'

In [9]:
driver = webdriver.Chrome(os.getcwd() + '/chromedriver')



Notice that the command above, if successfully executed, will open a new Chrome window that should say something like *Chrome is being controlled by automated test software*.
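The call above passes the driver path as a positional argument. Selenium 4 marks that form as deprecated in favour of a `Service` object, and more recent releases may no longer accept it. Below is a minimal sketch of the `Service` variant, assuming the `chromedriver` file sits in the current working directory. The helper names `local_driver_path` and `start_chrome` are hypothetical, and Selenium is imported inside the function only so that the path helper can be tried without a browser.

```python
from pathlib import Path


def local_driver_path(folder=None):
    """Return where we expect the chromedriver file to be (hypothetical helper)."""
    # Assumption: the file is called "chromedriver"; on Windows it is
    # typically "chromedriver.exe", so adjust the name there.
    base = Path(folder) if folder is not None else Path.cwd()
    return base / "chromedriver"


def start_chrome(driver_path):
    """Open a Chrome window controlled by the given chromedriver file."""
    if not Path(driver_path).is_file():
        raise FileNotFoundError(f"chromedriver not found at {driver_path}")

    # Imported here so that local_driver_path() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(str(driver_path)))


# Usage (opens a real browser window, so it is left commented out):
# driver = start_chrome(local_driver_path())
```

Checking that the file exists before starting the driver gives a clearer error message than the one Selenium produces when the path is wrong.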

At this point we can use the `driver` object we just created to (physically) navigate to a new web page. 

In [10]:
driver.get("https://www.scrapethissite.com/")

Executing the last cell should have taken you to the homepage of the [scrapethissite](https://www.scrapethissite.com/) website we've already seen in the last class and that we're going to use for today's lesson as well. 

Similarly to `BeautifulSoup`, Selenium has some functions that allow us to retrieve content from the website by referencing HTML elements in the source code of the web page; specifically, there are two methods of particular interest: 

- `.find_element()` to retrieve (and return) a single element
- `.find_elements()` to retrieve multiple elements and return them in a list

As you can read in the [official documentation](https://selenium-python.readthedocs.io/locating-elements.html), these functions take two arguments: 

- a `by` parameter that specifies the locator strategy, that is, which kind of identifier (id, name, class name, ...) to search by 
- a `value` parameter that specifies the value of that identifier (the id, the name, the class name, ...)

The `by` clause requires you to import the `By` class from the library using the following code: 

In [15]:
from selenium.webdriver.common.by import By

Check out [this page](https://selenium-python.readthedocs.io/api.html#locate-elements-by) to see all the options available for the `by` clause. 

Let's say you wanted to select the "Explore Sandbox" element in the webpage. You could use the `.find_element()` method and specify a `By.LINK_TEXT` clause with a `value` parameter set to the text itself, that is, `Explore Sandbox`. 

In [12]:
driver.find_element(By.LINK_TEXT, "Explore Sandbox")

<selenium.webdriver.remote.webelement.WebElement (session="99122741d1125d2434ec6c99c7877fa1", element="6c5bef6a-751d-4413-9aa3-8d538c92f297")>

Since this element is a button, we can click it using the `.click()` method: 

In [16]:
driver.find_element(By.LINK_TEXT, "Explore Sandbox").click()

We can go back to the previous page using the `.back()` method on the `driver` object: 

In [14]:
driver.back()

Let's see another method to reference that "Explore Sandbox" element; in this case, we'll use the `By.XPATH` clause. 

As can be found in [this documentation page](https://selenium-python.readthedocs.io/locating-elements.html#locating-by-xpath), XPath is the language used to locate nodes in an XML document and, since an HTML document can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications. 

Basically, when we use XPath to locate an element, we give Selenium the element's path in the tree-like structure of the HTML document, so that it can locate the element for us. You can read more in the documentation, but in our specific case, if we want to locate the "Explore Sandbox" button with XPath, we can: 

- specify to look in the entire document with the special character `//`
- specify the `a` tag
- specify the `@class` attribute (inside `[ ]`)
- and its `'btn btn-lg btn-default'` value

All of this can be written in the following special syntax: `//a[@class='btn btn-lg btn-default']`.
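To see how such an expression selects a node, the same pattern can be tried outside the browser with Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. This is only a sketch: the `PAGE` markup is a hypothetical, simplified stand-in for the home page, not its real source, and ElementTree needs a leading `.` because its paths are relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the home page markup.
PAGE = """
<html>
  <body>
    <div id="hero">
      <a class="btn btn-lg btn-default" href="/pages/">Explore Sandbox</a>
      <a class="nav-link" href="/lessons/">Lessons</a>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# "//a" means: any <a> tag at any depth; the part in [ ] filters on the class attribute.
matches = root.findall(".//a[@class='btn btn-lg btn-default']")
for link in matches:
    print(link.text, link.get("href"))
```

Note that `@class='...'` compares against the whole attribute string, so the class names must appear exactly as written, in the same order.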

In [17]:
driver.find_element(By.XPATH, "//a[@class='btn btn-lg btn-default']").click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@class='btn btn-lg btn-default']"}
  (Session info: chrome=103.0.5060.53)
Stacktrace:
0   chromedriver                        0x0000000101b360f9 chromedriver + 4448505
1   chromedriver                        0x0000000101ac1f73 chromedriver + 3972979
2   chromedriver                        0x000000010175c268 chromedriver + 410216
3   chromedriver                        0x000000010179252d chromedriver + 632109
4   chromedriver                        0x00000001017927b1 chromedriver + 632753
5   chromedriver                        0x00000001017c4784 chromedriver + 837508
6   chromedriver                        0x00000001017afa3d chromedriver + 752189
7   chromedriver                        0x00000001017c24a1 chromedriver + 828577
8   chromedriver                        0x00000001017af903 chromedriver + 751875
9   chromedriver                        0x0000000101785c55 chromedriver + 580693
10  chromedriver                        0x0000000101786ca5 chromedriver + 584869
11  chromedriver                        0x0000000101b0745d chromedriver + 4256861
12  chromedriver                        0x0000000101b0c3dd chromedriver + 4277213
13  chromedriver                        0x0000000101b1135f chromedriver + 4297567
14  chromedriver                        0x0000000101b0d03a chromedriver + 4280378
15  chromedriver                        0x0000000101ae686f chromedriver + 4122735
16  chromedriver                        0x0000000101b27448 chromedriver + 4387912
17  chromedriver                        0x0000000101b275cf chromedriver + 4388303
18  chromedriver                        0x0000000101b3d545 chromedriver + 4478277
19  libsystem_pthread.dylib             0x00007ff819a2a4e1 _pthread_start + 125
20  libsystem_pthread.dylib             0x00007ff819a25f6b thread_start + 15


To make things simpler, I recommend you install [this Chrome plugin](https://chrome.google.com/webstore/detail/selectorshub/ndgimibanhlabgdgjcpbbndiehljcpfh/related?hl=en) that helps you find a specific element's identifier, including its relative XPath. *Note: make sure you restart your browser after adding the plugin to Chrome.*

<img src="img/selectors-hub.png" width="600">

Let's move on to the Python script `script1_selenium.py` to see more about Selenium in action. 

### List Comprehension

At the beginning of this module, we learned how to cycle through an iterable object in Python using **for loops**. If we wanted to cycle through the first 10 positive integers and save them in a list, we would write something like this:  

In [None]:
numbers = []
for i in range(1, 11):  # the upper bound is excluded, so 11 gives us 1 to 10
    numbers.append(i)
numbers

Introducing **list comprehension**: 

In [None]:
[i for i in range(1, 11)]

List comprehension offers a **more compact syntax** and is specifically used when you want to create a new list based on the values of an existing list, or some other iterable object. 

You can also include **conditional statements** in it. In a traditional for loop, if you wanted to keep only the odd numbers between 1 and 10, you would write something like: 

In [None]:
odd = []
for i in range(1, 11):
    if i % 2 != 0:
        odd.append(i)
odd

The equivalent list comprehension puts the condition at the end: 

In [None]:
[i for i in range(1, 11) if i % 2 != 0]

List comprehensions are a very handy and powerful tool. Make sure you understand them and practice with them so that they become easy to read and to write in your everyday scripts. 
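List comprehensions pair naturally with `.find_elements()`, which returns a list of elements. The sketch below extracts the visible text of each element and skips the empty ones. `FakeElement` is a hypothetical stand-in for Selenium's element objects (only the `.text` attribute is modelled), so the example runs without a browser.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""
    text: str


# Pretend this list came from driver.find_elements(...)
links = [FakeElement("  Sandbox "), FakeElement(""), FakeElement("Lessons"), FakeElement("FAQ")]

# Keep the cleaned-up text of every element that actually has some text.
labels = [link.text.strip() for link in links if link.text.strip()]
print(labels)  # ['Sandbox', 'Lessons', 'FAQ']
```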

Check out [this tutorial](https://realpython.com/list-comprehension-python/#using-list-comprehensions) to find out more. 

### References

- ChromeDriver [website](https://sites.google.com/chromium.org/driver/)
- Selenium [Official Documentation](https://selenium-python.readthedocs.io/)
- YouTube [Tutorial Series](https://www.youtube.com/watch?v=Xjv1sY630Uc)
- Introductory [Online Tutorial](https://www.browserstack.com/guide/python-selenium-to-run-web-automation-test)