# CCSS Webscraping Workshop- Fall 2022
*CCSS Data Science Fellow Remy Stewart*

Welcome to the coding demonstrations of the CCSS Web Scraping Fall 2022 workshop! 

The first module will introduce the fundamentals around interacting with websites to retrieve data by highlighting the Beautiful Soup library. The second module will explore how to navigate dynamically generated website content through interactive scraping powered by Selenium.

We'll be using WebScraper.io's [webscraping test site](https://webscraper.io/test-sites/e-commerce/static), which is structured as an electronics e-commerce site. This webpage serves as a great testing ground to learn about webscraping without having to worry about site blockages when building our initial scraper. We'll be toggling back and forth between this notebook and the website, using the browser's Inspect feature to learn more about the site's underlying structure throughout the demos. 


# Module 1- Fundamentals with Beautiful Soup 

Let's start by importing in the libraries we'll be using within the first module: 

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import pandas as pd


Our initial goal will be to retrieve the name, price, description, and number of reviews of the three items featured on the site home page. `urllib.request` is a library that goes hand-in-hand with Beautiful Soup by establishing the HTTP connection we'll need with the e-commerce site to parse the site's HTML. Let's start by inspecting the full HTML of the site itself by directly using the `BeautifulSoup` constructor with a designated `html.parser`:

In [None]:
html = urlopen('https://webscraper.io/test-sites/e-commerce/static')
bs_html = BeautifulSoup(html, 'html.parser')
print(bs_html)

<!DOCTYPE html>

<html lang="en">
<head>
<!-- Anti-flicker snippet (recommended)  -->
<style>.async-hide {
		opacity: 0 !important
	} </style>
<script>(function (a, s, y, n, c, h, i, d, e) {
		s.className += ' ' + y;
		h.start = 1 * new Date;
		h.end = i = function () {
			s.className = s.className.replace(RegExp(' ?' + y), '')
		};
		(a[n] = a[n] || []).hide = h;
		setTimeout(function () {
			i();
			h.end = null
		}, c);
		h.timeout = c;
	})(window, document.documentElement, 'async-hide', 'dataLayer', 4000,
		{'GTM-NVFPDWB': true});</script>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB'

### HTML Tags & Attributes

Let's break down some of the key ideas of what's otherwise an intimidatingly long and complex jumble of HTML.

- **Tags** are used to represent different elements such as `<title>`, `<div>` for a section within a document, and `<h1>` for the first header. 
- HTML tags usually require a closing tag, such as `</p>` to close `<p>`, and can include additional information within the opening tag known as **attributes**.
- Attributes are marked by key words such as `class`, `id`, or `href` (referring to hyperlinks) followed by an equal sign and a description such as `<div class='container'>`.
- Our parsed Beautiful Soup HTML object offers a variety of methods that allow us to look at specific tags within the HTML. The following retrieves the first instance of the `<p>` paragraph tag in the HTML file:

In [None]:
print(bs_html.p)

<p>Web Scraper</p>


Attributes are used to distinguish between instances of the same HTML tag, which makes it easy to retrieve distinct HTML elements that would otherwise share the same structure. For example, the e-commerce test site uses `<div class="caption">` and `<div class="ratings">` to differentiate between the product descriptions and the ratings of the items displayed on the website. 
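To make the tag and attribute vocabulary concrete, here's a minimal sketch that parses a small hand-written HTML snippet rather than the live site. The snippet, its `id` value, and the variable names are made up for illustration and only loosely mirror the test site's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet written for this example (not copied from the test site)
sample_html = """
<div class="caption" id="item-1">
  <h4 class="pull-right price">$24.99</h4>
  <a class="title" href="/product/1">Example Phone</a>
  <p class="description">7 day battery</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
div_tag = sample_soup.div

tag_name = div_tag.name                  # the tag itself: 'div'
div_attrs = div_tag.attrs                # all attributes as a dictionary
link_href = sample_soup.a['href']        # a single attribute looked up by key
price_classes = sample_soup.h4['class']  # class can hold several values, so it comes back as a list

print(tag_name, div_attrs, link_href, price_classes)
```

Note that `class` is returned as a list because a single tag can belong to several classes at once, as with `pull-right price` here.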

We can use the `find_all` method to identify any of the `div` tags with their `class` attribute specified as a `caption`:

In [None]:
bs_html.find_all('div', {'class':'caption'})

[<div class="caption">
 <h4 class="pull-right price">$24.99</h4>
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/486" title="Nokia 123">Nokia 123</a>
 </h4>
 <p class="description">7 day battery</p>
 </div>, <div class="caption">
 <h4 class="pull-right price">$409.63</h4>
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/559" title="Lenovo V110-15ISK">Lenovo V110-15IS...</a>
 </h4>
 <p class="description">Lenovo V110-15ISK, 15.6" HD, Core i3-6006U, 8GB, 128GB SSD, Windows 10 Home</p>
 </div>, <div class="caption">
 <h4 class="pull-right price">$488.64</h4>
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/575" title="Acer Swift 1 SF113-31 Silver">Acer Swift 1 SF1...</a>
 </h4>
 <p class="description">Acer Swift 1 SF113-31 Silver, 13.3" FHD, Pentium N4200, 4GB, 128GB SSD, Windows 10 Home</p>
 </div>]

The `get_text` method strips away the HTML tags to just express the text itself:

In [None]:
captions = bs_html.find_all('div', {'class':'caption'})
for name in captions:
    print(name.get_text())


$103.99

Amazon Kindle

6" screen, wifi


$485.90

Acer Aspire ES1-...

Acer Aspire ES1-572 Black, 15.6" HD, Core i5-7200U, 4GB, 128GB SSD, Linux


$1144.20

Dell Inspiron 15...

Dell Inspiron 15 (7567) Black, 15.6" FHD, Core i7-7700HQ, 8GB, 1TB, GeForce GTX 1050 Ti 4GB, Linux + Windows 10 Home
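The blank lines in the output above come from the whitespace between tags in the HTML. As a small sketch on a made-up snippet (not the live site), `get_text` also accepts `separator` and `strip` arguments that tidy this up:

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modeled on one product caption
sample_html = """
<div class="caption">
  <h4 class="pull-right price">$24.99</h4>
  <h4><a class="title" href="/product/1">Example Phone</a></h4>
  <p class="description">7 day battery</p>
</div>
"""
sample_caption = BeautifulSoup(sample_html, 'html.parser').find('div', {'class': 'caption'})

# strip=True trims each text fragment and drops whitespace-only ones;
# separator controls what is placed between the remaining fragments
clean_text = sample_caption.get_text(separator=' | ', strip=True)
print(clean_text)
```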



The basic structure behind using Beautiful Soup's `find` and `find_all` methods is to first specify the tag you're interested in within the HTML document, and then the associated attributes that further narrow down the tag you're interested in. You can build from this to retrieve multiple tag types at once, such as all of the headers set to the 1, 2, and 4 subheader sizes as follows:  

In [None]:
bs_html.find_all(['h1','h2','h4'])

[<h1>Test Sites</h1>,
 <h1>E-commerce training site</h1>,
 <h2>Top items being scraped right now</h2>,
 <h4 class="pull-right price">$103.99</h4>,
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/498" title="Amazon Kindle">Amazon Kindle</a>
 </h4>,
 <h4 class="pull-right price">$485.90</h4>,
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/573" title="Acer Aspire ES1-572 Black">Acer Aspire ES1-...</a>
 </h4>,
 <h4 class="pull-right price">$1144.20</h4>,
 <h4>
 <a class="title" href="/test-sites/e-commerce/static/product/596" title="Dell Inspiron 15 (7567) Black">Dell Inspiron 15...</a>
 </h4>]

You can pass both the `caption` and `ratings` values together under the `class` key of the attribute dictionary to match multiple classes in one call. 

In [None]:
caption_reviews = bs_html.find_all('div', {'class':{'caption', 'ratings'}})
for name in caption_reviews:
    print(name.get_text())


$103.99

Amazon Kindle

6" screen, wifi


3 reviews








$485.90

Acer Aspire ES1-...

Acer Aspire ES1-572 Black, 15.6" HD, Core i5-7200U, 4GB, 128GB SSD, Linux


6 reviews







$1144.20

Dell Inspiron 15...

Dell Inspiron 15 (7567) Black, 15.6" FHD, Core i7-7700HQ, 8GB, 1TB, GeForce GTX 1050 Ti 4GB, Linux + Windows 10 Home


2 reviews






If you look closely, you'll notice how the values for the electronic goods returned from Beautiful Soup don't actually align with what you currently see when looking at the site directly via Inspect. That's a sign that we're dealing with dynamically loaded content from JavaScript, as it appears that the site is updating the items shown on the homepage outside of default values set within the HTML. We don't have the tools in our web scraping kit to handle this yet, but we'll be getting there shortly! 

### Saving Scraped Data 

After identifying the tags and attributes that will allow us to match with the specific data we're interested in collecting from the website's HTML, we'd then proceed with the actual data collection within our scraper. The `pandas` library naturally integrates with Beautiful Soup HTML parsing by structuring different tag retrievals into designated data columns. 

Let's therefore create a function that retrieves the product data we're interested in via the e-commerce site's URL and stores it within a Pandas DataFrame for us. `html_to_pandas` uses a dictionary that specifies the column names as keys and the collected data from the webpage as row values stored within lists. 

Creating each of the columns involves a different Beautiful Soup pattern match between the HTML tag that has the data we're interested in and the tag's unique attributes. I've gone ahead and identified within the HTML itself what exactly said tag and attributes are for the title, description, price, and reviews on the site's home page. 

In [None]:
def html_to_pandas(url): 
    df_dict = {}

    html = urlopen(url)
    bs_html = BeautifulSoup(html, 'html.parser')

    df_dict['Title'] = [values.get_text() for values in bs_html.find_all('a', {'class': 'title'})]
    df_dict['Description'] = [values.get_text() for values in bs_html.find_all('p', {'class': 'description'})]
    df_dict['Price'] = [values.get_text() for values in bs_html.find_all('h4', {'class': 'pull-right price'})]
    df_dict['Reviews'] = [values.get_text() for values in bs_html.find_all('p', {'class': 'pull-right'})]

    product_df = pd.DataFrame(df_dict)
    return product_df

In [None]:
product_df = html_to_pandas('https://webscraper.io/test-sites/e-commerce/static')
product_df.head()

| | Title | Description | Price | Reviews |
|---|---|---|---|---|
| 0 | Dell Latitude 55... | Dell Latitude 5580, 15.6" FHD, Core i5-7300U, ... | $1178.19 | 6 reviews |
| 1 | Asus VivoBook E5... | Asus VivoBook E502NA-GO022T Dark Blue, 15.6" H... | $399.99 | 3 reviews |
| 2 | MSI GL62VR 7RFX | MSI GL62VR 7RFX, 15.6" FHD, Core i7-7700HQ, 8G... | $1299.00 | 1 reviews |


We've successfully gone from a whole jumble of HTML and website text to an ordered DataFrame with the exact information we're interested in via Beautiful Soup. 
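One thing to keep in mind with the column-by-column approach in `html_to_pandas` is that `pd.DataFrame` raises an error when the lists in the dictionary have different lengths, which would happen if one product were missing a field. As an alternative sketch, we can loop over each product container and build one row per product. The HTML below is made up for illustration, the `thumbnail` container class is assumed to wrap each product, and `text_or_none` is a hypothetical helper rather than part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup for two products; the second one has no description
sample_html = """
<div class="thumbnail">
  <h4 class="pull-right price">$24.99</h4>
  <a class="title" title="Example Phone">Example Phone</a>
  <p class="description">7 day battery</p>
  <p class="pull-right">11 reviews</p>
</div>
<div class="thumbnail">
  <h4 class="pull-right price">$57.99</h4>
  <a class="title" title="Another Phone">Another Phone</a>
  <p class="pull-right">3 reviews</p>
</div>
"""


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching child tag, or None if absent."""
    match = parent.find(tag, {'class': class_name})
    return match.get_text(strip=True) if match else None


sample_soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for product in sample_soup.find_all('div', {'class': 'thumbnail'}):
    # Each dictionary is one row, so a missing field can't shift the other columns
    rows.append({
        'Title': text_or_none(product, 'a', 'title'),
        'Description': text_or_none(product, 'p', 'description'),
        'Price': text_or_none(product, 'h4', 'price'),
        'Reviews': text_or_none(product, 'p', 'pull-right'),
    })

sample_df = pd.DataFrame(rows)
print(sample_df)
```

The trade-off is a slightly longer function in exchange for rows that stay aligned even when the page's markup is uneven.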

### Regular Expressions (Regexes)
We retrieved all of the product prices by identifying headers with the 'pull-right price' class attribute as follows:

In [None]:
bs_html.find_all('h4', {'class': 'pull-right price'})

[<h4 class="pull-right price">$103.99</h4>,
 <h4 class="pull-right price">$485.90</h4>,
 <h4 class="pull-right price">$1144.20</h4>]

Let's say we were instead in a situation where the prices weren't clearly marked by a specific class attribute. We can use regular expressions (**regexes**) to identify variable text that has some degree of consistency in its formatting, such as all of the numbers that follow the '$' symbol.

Regexes are particularly helpful for identifying data such as emails, phone numbers, dates, or specific file types such as '.pdf' or '.jpg'. Although a complex topic in their own right that's beyond the scope of this workshop, they're very commonly used within web scraping projects and therefore important to familiarize ourselves with. 

I highly recommend test running any code you write that uses regexes while developing your scraper to think through edge cases in the data that a current regex version may miss. I personally use [Regex 101](https://regex101.com/) as a website to check the behavior of regexes whenever I'm using them within my own work.  

We can use Python's built-in `re` library to combine regexes with Beautiful Soup. A regex that will successfully identify all instances of prices specified in dollars is `"\$\d+(?:\.\d+)?"`. Regex syntax is convoluted, which is partially why regexes can be challenging to structure correctly. Let's break down what the individual components of this regex are achieving for us:
- `\$` starts the match based on the dollar sign symbol as our core anchor to find the product prices.
- `\d+` will match with one or more digits that follow the dollar sign.
- `(?:\.\d+)?` specifies an optional decimal point and any digits that follow it. This structure effectively handles cases where the price is listed either with or without decimal points.
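Before pointing the regex at parsed HTML, it can help to try it on a few plain strings, in the same spirit as testing on Regex 101. The sample strings below are made up to cover some edge cases, including one the pattern handles only partially.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged
price_pattern = re.compile(r"\$\d+(?:\.\d+)?")

# Made-up strings covering a few edge cases
samples = [
    "Now only $24.99!",        # price with decimals
    "Flat fee: $5",            # price without decimals
    "Was $1144.20, now $999",  # two prices in one string
    "No price listed",         # no dollar sign, so no match
    "Sale: $1,299.00",         # thousands separator: only '$1' is matched
]

matches = [price_pattern.findall(text) for text in samples]
for text, found in zip(samples, matches):
    print(f"{text!r} -> {found}")
```

The last sample shows the kind of edge case worth checking for: the pattern stops at the comma, so a site that formats prices with thousands separators would need an adjusted regex.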

In [None]:
bs_html.find_all(string=re.compile(r"\$\d+(?:\.\d+)?"))

['$103.99', '$485.90', '$1144.20']

# Module 2- Dynamic Sites with Selenium 
The first module of the coding demo provided an overview of how web scraping programs are fundamentally a process of identifying how data is represented within site HTML and retrieving said information through libraries such as Beautiful Soup. However, Beautiful Soup can't interact with the page itself, and in particular with the JavaScript-backed dynamic features that change the data featured on the page. We were therefore limited to the content we initially obtained through a URL request to the e-commerce test site. 

This is a very common situation individuals interested in building web scrapers find themselves in, and is why we'll therefore bring in Selenium as our next core tool to automate our site interactions. Throughout this module we'll be exploring how to properly initialize a Selenium web driver browser session, how to find HTML elements within the site with Selenium's particular methods, how to interact with site content through implementing wait times and addressing site exceptions, and how to piece all of these components together to obtain a final data set of product records collected over multiple web pages. 

We'll need to `pip install selenium` directly since it doesn't come pre-installed in our Google Colab virtual Python environment, as well as download and install the [Chrome Web Driver](https://chromedriver.chromium.org/home) that we'll be using to run our own Chrome browser through Selenium into a local directory path our program will have access to.

In [None]:
!pip install -q selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

     |████████████████████████████████| 995 kB 2.1 MB/s 
     |████████████████████████████████| 384 kB 58.3 MB/s 
     |████████████████████████████████| 140 kB 71.8 MB/s 
     |████████████████████████████████| 58 kB 4.5 MB/s 
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
requests 2.23.0 requires urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you have urllib3 1.26.12 which is incompatible.
Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1,581 B]
Hit:5 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:6 https://developer.download.

We'll use the `sys` library to explicitly set the Chrome driver's location as well as import in a suite of methods from Selenium we'll be using to build our scraper. 

In [None]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

import re
import pandas as pd

Let's create a designated `Options` instance to configure our Chrome driver. Selenium won't actually run within Google Colab unless the browser is set to `--headless`, while the additional options are designed to optimize resource use and prevent browser crashes. 

In [None]:
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

We'll then instantiate our Selenium-driven Chrome web browser by referring to the chromedriver executable package we've downloaded into our local directory, pass our set option parameters, and then connect to the URL of the e-commerce webscraping test site. This time we'll be collecting product information specific to the touch phone sublisting of the site, as the page features the dynamic component of having multiple pages of listed products that we toggle through by clicking on page buttons. 

In [None]:
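# The chromedriver executable is located via the system PATH; recent Selenium
# releases no longer accept its path as the first positional argument
driver = webdriver.Chrome(options=options)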
url = 'https://www.webscraper.io/test-sites/e-commerce/static/phones/touch?page=1'
driver.get(url)

### Finding Elements 

Selenium offers a range of methods to locate elements within the website's HTML. Beautiful Soup's strategy to identify specific data was through the `find` and `find_all` methods, while Selenium's equivalent is based upon the combination of the `find_elements` method of our web driver with a designated parameter set via the `By` locator class. 

For example, we can retrieve all of the elements on the web page with a specified class value of `caption` within the `div` HTML tag as follows: 

In [None]:
for vals in driver.find_elements(By.CLASS_NAME, 'caption'):
    print(vals.text)

$24.99
Nokia 123
7 day battery
$57.99
LG Optimus
3.2" screen
$93.99
Samsung Galaxy
5 mpx. Android 5.0
$109.99
Nokia X
Andoid, Jolla dualboot
$118.99
Sony Xperia
GPS, waterproof
$499.99
Ubuntu Edge
Sapphire glass


There are many different attributes we can combine with `By` to find elements, such as `ID`, `NAME`, and `TAG_NAME`. [Here's a helpful guide](https://selenium-python.readthedocs.io/locating-elements.html) that reviews the available options and example HTML.  

A location method that's particularly important to highlight is Selenium's support of **XPath** locators. The XPath language is similar to the regexes we learned about in the previous module in that it provides a greater degree of customization regarding what you'd like to match, but XPath is designed to traverse HTML documents, whereas regexes operate on strings.   

XPaths allow us to be more specific regarding what we'd like to identify in the web page compared to most other locators. They're particularly helpful when you don't have an easy name or id attribute to retrieve the element you're interested in.  

The equivalent XPath syntax to retrieve the exact same information we obtained from the product captions by matching with their class name is:

In [None]:
for vals in driver.find_elements(By.XPATH, "//div[@class='caption']"):
    print(vals.text)

$24.99
Nokia 123
7 day battery
$57.99
LG Optimus
3.2" screen
$93.99
Samsung Galaxy
5 mpx. Android 5.0
$109.99
Nokia X
Andoid, Jolla dualboot
$118.99
Sony Xperia
GPS, waterproof
$499.99
Ubuntu Edge
Sapphire glass


We can get much more specific regarding exactly what we're interested in retrieving by using XPaths, such as the price of the third phone as follows: 

In [None]:
driver.find_elements(By.XPATH, "//div[@class='caption']//h4[@class='pull-right price']")[2].text

'$93.99'

We're only briefly considering the broad topic of XPath query writing similar to regexes given our limited time today, but it's an important concept to introduce ourselves to should you find yourself in the common situation of needing more complex element location strategies when building out a Selenium-powered web scraper. 
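If you'd like to experiment with XPath expressions like the ones above without a running browser session, one option is Python's built-in `xml.etree.ElementTree`. This is a stand-in for practice rather than how Selenium evaluates XPaths: it only supports a limited subset of XPath, it needs well-formed markup, and the snippet below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup loosely modeled on the product captions
sample_markup = """
<html><body>
  <div class="caption">
    <h4 class="pull-right price">$24.99</h4>
    <h4><a class="title" href="/product/1">Example Phone</a></h4>
  </div>
  <div class="caption">
    <h4 class="pull-right price">$57.99</h4>
    <h4><a class="title" href="/product/2">Another Phone</a></h4>
  </div>
  <div class="ratings"><p>11 reviews</p></div>
</body></html>
"""

root = ET.fromstring(sample_markup)

# ElementTree paths are relative to the root element, hence the leading '.'
captions = root.findall(".//div[@class='caption']")

# Narrow down to the price header that sits directly inside each caption
price_path = ".//div[@class='caption']/h4[@class='pull-right price']"
prices = [h4.text for h4 in root.findall(price_path)]

# Match on an attribute regardless of where the tag sits in the document
titles = [a.text for a in root.findall(".//a[@class='title']")]

print(len(captions), prices, titles)
```

Each expression reads left to right as a path through the document: `//` means "anywhere below", `/` means "directly inside", and the square brackets filter on an attribute value.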

### Interacting with Elements 

The above examples were only collecting the data of the first page of the site that features 6 phones, but there's actually a second page that has an additional 3 phones listed that you can access by clicking either the 2 or next arrow button on the website. This is the exact circumstance we've turned to Selenium to handle for us to coordinate clicking said buttons via our headless Chrome browser.

We can use the `page-link` class name to identify the button selection options found within the HTML: 

In [None]:
for vals in driver.find_elements(By.CLASS_NAME, 'page-link'):
    print(vals.text)

‹
1
2
›


The most robust technique for building our scraper would be clicking on the 'Next' button rather than '2', since that'll allow us to reuse our code for sections of the website with more than 2 pages worth of listed products.

Specifically referencing the `rel="next"` relationship parameter that distinguishes the "next" button is a great use case for an XPath locator:  

In [None]:
driver.find_element(By.XPATH, "//a[@rel='next']").text

'›'

Now that we've found a way to identify the particular button we're interested in, Selenium offers an intuitive method to actually facilitate a user click through our headless browser. Quick heads up- you're intentionally about to see an error!

In [None]:
driver.find_element(By.XPATH, "//a[@rel='next']").click()

ElementClickInterceptedException: ignored

We're encountering a raised exception here, which is liable to occur whenever an interaction with a website via Selenium doesn't achieve the behavior the developer was aiming for. Exception messages are often quite informative for identifying the underlying issue. The `ElementClickInterceptedException` message states that "Other element would receive the click", and then references an `acceptContainer` div class attribute that's also within the website's HTML structure. 

If we switch over to the website's Inspect tab and identify what exactly the `acceptContainer` is on the website itself, we see it's equivalent to the "We use cookies to make your Web Scraper experience better" pop-up that appears at the bottom of the site the first time you visit the page. 

Although it's easy enough for us to ignore the pop-up and click on other components of the site as manual users, the pop-up is intercepting our attempted click on the next arrow button when we try to interact with the site via Selenium. The issue is simply resolved by the following:

In [None]:
driver.find_element(By.CLASS_NAME, 'acceptContainer').click()

In [None]:
driver.find_element(By.XPATH, "//a[@rel='next']").click()

This is a great example of a common situation with building scrapers for dynamic sites, in that you often can't tell what may raise issues within your code until you're actually testing with a web driver.

What happens if we click the "next" button a second time? Remember that we only have two pages worth of listed phones, so we're about to see a different type of exception come up: 

In [None]:
driver.find_element(By.XPATH, "//a[@rel='next']").click()

NoSuchElementException: ignored

The raised `NoSuchElementException` implies that our program doesn't think that the next arrow button exists. If you dig into the HTML on the website itself via Inspect following clicking the next arrow button, you can see that the website updates to list the button as a member of the class attribute `"page-item disabled"` subtree within the pagination container of the site. This means that the site is recognizing that there are no more pages to click the next arrow following one click to the second page. 

Receiving this error reasserts for us that our Selenium-powered web driver session is indeed similar to us manually interacting with the site itself, as our past behaviors during our entire scraping workflow dictate how the website changes and therefore how we should adapt our code to accommodate the new site structure. 

We can therefore adjust by prompting our program to click the previous arrow button instead, which has switched from `"page-item disabled"` class status to now a live page link after we guided our driver to move from the first to second page: 

In [None]:
driver.find_element(By.XPATH, "//a[@rel='prev']").click()

This puts us back to where we started on the first page of the listed phones for sale. 

### Waits

Now that we've covered how to interact directly with the site through clicking elements as well as a few common complications within the click-through process to keep an eye out for, we're ready to move towards using our program to properly time the transitions between site interactions, website updates, and then correctly identifying newly displayed data.  

Waits, website element checks, and managing exceptions go hand-in-hand with each other. By implementing waits, we avoid errors in our scraper by allotting the time needed for the website to properly update. Selenium offers two types of waits, both of which we'll consider towards the goal of retrieving all of the phone names on the e-commerce site across both of the pages. 

**Implicit waits** set a maximum amount of time that the web driver will keep looking for an element before raising an exception. They're designed to broadly estimate how long a website will take to update following clicking on a page. 

The following function combines multiple of our previous steps within the module to collect the phone names from the first page of the site, click on the cookie pop-up and next arrow button, use a 5-second wait to give the site time to update following the next arrow click, and then print out all 9 phone names that we're expecting across the two pages. It also initializes a new connection with the site via the web driver each time it's called so we won't have to worry about exceptions caused by prior button clicks each time we run the function: 


In [None]:
def implicit_wait(url): 
    driver = webdriver.Chrome(options=options)  # chromedriver is found via the PATH
    driver.get(url)

    element_list = []

    for vals in driver.find_elements(By.XPATH, "//a[@class='title']"):
        element_list.append(vals.text)
    driver.find_element(By.CLASS_NAME, 'acceptContainer').click()
    driver.find_element(By.XPATH, "//a[@rel='next']").click()
    driver.implicitly_wait(5)
    for vals in driver.find_elements(By.XPATH, "//a[@class='title']"):
        element_list.append(vals.text)

    return print(*element_list, sep='\n')

implicit_wait(url)

Nokia 123
LG Optimus
Samsung Galaxy
Nokia X
Sony Xperia
Ubuntu Edge
Iphone
Iphone
Iphone


Success! While implicit waits are a solid technique and certainly important to be familiar with when using Selenium, I'd overall recommend using **explicit waits** over implicit waits when given the option between the two. Explicit waits combine the maximum wait time of implicit waits with the additional feature of designating conditions that Selenium checks for after interacting with the website. They therefore provide more information regarding what you're expecting to occur following interacting with the page beyond just placing a brief wait without any additional context. 

Explicit waits are conducted via the `WebDriverWait` class in combination with `expected_conditions` (commonly shortened to `EC`). The one-line adaptation of our implicit wait code block via the `explicit_wait` function that follows below designates what the browser will need to locate before proceeding forward with collecting data from the second page. 

`EC.presence_of_element_located` tells our web driver to pause until the titles of the phones on the second page are present in the website's HTML. The number you pass along with `WebDriverWait` sets the amount of time before raising an exception similar to implicit waits, but this time it's based on whether the expected condition is met within the updated website HTML. I often use `EC.element_to_be_clickable` as my expected condition as well to ensure that a new element is fully loaded for interaction, and many more conditions are available [as listed here](https://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions). 
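Conceptually, an explicit wait is a polling loop: check a condition, and if it isn't met yet, pause briefly and check again until a time limit runs out. The following is a simplified illustration of that idea in plain Python, not Selenium's actual implementation. `wait_until` and `titles_loaded` are hypothetical names, and the "page" here is just a counter that pretends to finish loading on the third check.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that only "finishes loading" on the third check
checks = {'count': 0}


def titles_loaded():
    checks['count'] += 1
    return ['Iphone'] if checks['count'] >= 3 else []


loaded_titles = wait_until(titles_loaded, timeout=2.0, poll_interval=0.01)
print(loaded_titles, checks['count'])
```

The benefit over a fixed pause is that the loop returns as soon as the condition holds, so a fast page isn't slowed down and a slow page still gets the full time limit.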

In [None]:
def explicit_wait(url): 
    driver = webdriver.Chrome(options=options)  # chromedriver is found via the PATH
    driver.get(url)

    element_list = []

    for vals in driver.find_elements(By.XPATH, "//a[@class='title']"):
        element_list.append(vals.text)
    driver.find_element(By.CLASS_NAME, 'acceptContainer').click()
    driver.find_element(By.XPATH, "//a[@rel='next']").click()
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH , "//a[@class='title']")))
    for vals in driver.find_elements(By.XPATH, "//a[@class='title']"):
        element_list.append(vals.text)

    return print(*element_list, sep='\n')

explicit_wait(url)

Nokia 123
LG Optimus
Samsung Galaxy
Nokia X
Sony Xperia
Ubuntu Edge
Iphone
Iphone
Iphone
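Both `implicit_wait` and `explicit_wait` open a new browser session on every call and never close it, which can leave headless Chrome processes running in the background. One possible way to tidy this up is a small context manager that always calls the driver's `quit` method when the work is done. `managed_driver` is a hypothetical helper introduced for this sketch; it takes a function that creates the driver so the same pattern works with any configuration.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver with `make_driver`, hand it to the caller, and always quit it."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() closes the browser and ends the session, even if scraping raised an error
        driver.quit()


# Intended use in the notebook (assumes `webdriver`, `options`, and `url` from earlier cells):
# with managed_driver(lambda: webdriver.Chrome(options=options)) as driver:
#     driver.get(url)
#     ...collect data here...
```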


### Combining into a Pandas DataFrame

Let's finish by combining all of the core ideas we've covered to collect the prices, titles, descriptions, and number of reviews of all of the phones on the two pages of the e-commerce site. We'll know that we've succeeded if we obtain a DataFrame with 9 unique records as that's the total count of unique items listed across the two pages. 

The following `phone_dataframe` function merges all of our previously covered topics into one code block. We first initialize our web driver, connect to the e-commerce site URL, and take care of the cookie pop-up that would otherwise conflict with our page button clicks. We additionally create a data list to collect the record values for each product. 

The `max_pages` variable is designed to serve as a stopping condition so Selenium will know when we've reached the last page and it can therefore break out of the data collection portion of the function. The complete workflow of data collection is based on the condition of `max_pages` still being greater than zero, where for each round of data scraping we subsequently reduce the value by one. The idea behind this approach is that you can change `max_pages` to represent the number of pages you'd like to collect data from as needed. 

The text associated with the `thumbnail` class parameter in the HTML contains the exact four data points for each record that we're interested in, so we use an XPath based on identifying the `thumbnail` attribute as our element finder. You'll also note that our clicking of the next arrow button and the explicit wait that follows are conditioned on the `max_pages` parameter not being at 1. This is to avoid the raised exception scenario we reviewed earlier regarding the next arrow being disabled by the site once we've reached the last page, which is equivalent to 1 in our `max_pages` condition. 

In [None]:
def phone_dataframe(url): 
    driver = webdriver.Chrome(options=options)  # chromedriver is found via the PATH
    driver.get(url)
    driver.find_element(By.CLASS_NAME, 'acceptContainer').click()

    data_list = []
    max_pages = 2

    while max_pages: 
        for vals in driver.find_elements(By.XPATH, "//div[@class='thumbnail']"):
            vals = re.split(r'\n', vals.text)
            data_list.append(vals)
        
        if max_pages != 1: 
            driver.find_element(By.XPATH, "//a[@rel='next']").click()
            # Wait until the 'prev' link is clickable, which signals that the next page has loaded
            WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH , "//a[@rel='prev']")))
        max_pages -= 1

    phone_df = pd.DataFrame(data_list)
    phone_df.columns = ['Price', 'Title', 'Description', 'Reviews']
    return phone_df

In [None]:
phone_df = phone_dataframe(url)
phone_df

| | Price | Title | Description | Reviews |
|---|---|---|---|---|
| 0 | $24.99 | Nokia 123 | 7 day battery | 11 reviews |
| 1 | $57.99 | LG Optimus | 3.2" screen | 11 reviews |
| 2 | $93.99 | Samsung Galaxy | 5 mpx. Android 5.0 | 3 reviews |
| 3 | $109.99 | Nokia X | Andoid, Jolla dualboot | 4 reviews |
| 4 | $118.99 | Sony Xperia | GPS, waterproof | 6 reviews |
| 5 | $499.99 | Ubuntu Edge | Sapphire glass | 2 reviews |
| 6 | $899.99 | Iphone | White | 10 reviews |
| 7 | $899.99 | Iphone | Silver | 8 reviews |
| 8 | $899.99 | Iphone | Black | 1 reviews |


We can see that our final DataFrame does indeed feature 9 records, confirming that our complete function successfully handled clicking through multiple pages worth of product listings. 
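The `Price` and `Reviews` columns are still stored as text at this point. If you'd like to sort or summarize them, one possible follow-up step is converting them to numbers. The sketch below rebuilds three rows from the output above by hand so that it runs without a browser session; `sample_df` is a name introduced just for this example.

```python
import pandas as pd

# Three rows copied by hand from the scraped output above
sample_df = pd.DataFrame({
    'Price': ['$24.99', '$57.99', '$899.99'],
    'Title': ['Nokia 123', 'LG Optimus', 'Iphone'],
    'Reviews': ['11 reviews', '11 reviews', '1 reviews'],
})

# Drop the leading dollar sign (as a literal character, not a regex) and convert to float
sample_df['Price'] = sample_df['Price'].str.replace('$', '', regex=False).astype(float)

# Pull out the digits of the review count and convert to int
sample_df['Reviews'] = sample_df['Reviews'].str.extract(r'(\d+)', expand=False).astype(int)

print(sample_df.dtypes)
print(sample_df.sort_values('Price', ascending=False))
```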

Congratulations on completing the second module of our webscraping coding demos! We started with the very basics of connecting to a website and inspecting HTML code and advanced all the way to interacting directly with the site to retrieve dynamically generated data. Let's now return to the slides to conclude our web scraping workshop. 