---

### 🎓 **Professor**: Apostolos Filippas w/ Reina Chehayeb's help!

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Using Selenium to Parse Web Content

### 🔗 **Link**: https://bit.ly/AF_WA_LEC6

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---


## 1. Static content 🔒 vs. Dynamic content 🍃

<br>

<img src="https://www.cloudflare.com/resources/images/slt3lc6tev37/6ijRQV6QxiyG4zyidpgJmi/23088f026f5b01cd671274b9b994096f/caching-dynamic-content.svg" width="800" height="480">
<br>

> 🔒 **Static content** is any file that is stored on a server and is the same every time it is delivered to users. Unless the developer makes changes themselves, the web page always remains the same. It is like a newspaper: once an issue of a newspaper is published, it features the same articles and photos all day for everyone who picks up a copy.

<div style="text-align: center;">
        <img src="https://i.imgur.com/mxfom84.png" title="source: imgur.com" width="350" height="350" />
</div>

> 🍃 **Dynamic content** is content that changes based on factors specific to the user such as time of visit, location, and device. A dynamic webpage will not look the same for everybody, and it can change as users interact with it – like if a newspaper could rewrite itself as someone is reading it. This makes webpages more personalized and more interactive.

<div style="text-align: center;">
        <img src="https://i.imgur.com/TjkGFKx.png" title="source: imgur.com" width="350" height="350" />
</div>

---
## 2. How do dynamic websites work?

There are many external services that dynamic webpages interact with. Here we cover 3 common services:


### **🌐 Server-side scripting** 

When a user requests a webpage, the server processes the script, and interacts with databases or external services. Then, it sends the dynamically generated HTML back to the user's browser.

**Used for**: detecting that you are logging in from a certain geographic location and showing you information relevant to that location

<a href="https://imgur.com/VR0xInc"><img src="https://i.imgur.com/VR0xInc.png" title="source: imgur.com" width="400" height="300" /></a>
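
As a rough illustration of the idea (a plain-Python sketch, not a real web server), the function below plays the role of a server-side script: it looks up data for the visitor's location and returns freshly generated HTML. The `EVENTS_BY_CITY` dictionary and the `render_homepage` function are hypothetical names invented for this example.

```python
from html import escape

# Hypothetical stand-in for a database the server would query
EVENTS_BY_CITY = {
    "New York": ["Broadway show", "Central Park tour"],
    "Athens": ["Acropolis visit"],
}

def render_homepage(city):
    """Build the HTML the server would send back for a visitor from `city`."""
    events = EVENTS_BY_CITY.get(city, [])
    items = "".join(f"<li>{escape(event)}</li>" for event in events)
    return f"<h1>Things to do in {escape(city)}</h1><ul>{items}</ul>"

# Two visitors request the same URL but receive different HTML
print(render_homepage("New York"))
print(render_homepage("Athens"))
```

The browser only ever receives the finished HTML string; the lookup itself happens on the server.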

### **💻 Client-side scripting** 

This involves using **_JavaScript_** to manipulate the content and behavior of web pages directly within the user's browser, in response to actions such as clicking, scrolling, playing, and pausing. It allows for interactive features, such as real-time updates and dynamic animations.

**Used for**: Scrolling and clicking, form validation (when submitting a form), real-time chat and messaging, image carousels

<a href="https://imgur.com/N8VXk0G"><img src="https://i.imgur.com/N8VXk0G.png" title="source: imgur.com" width="400" height="300" /></a>

### **☎️ Application Programming Interfaces (APIs)** 

APIs enable different systems to communicate and share data. In dynamic websites, APIs can connect to external services or retrieve data from other sources, such as social media platforms, weather services, or payment gateways.

**Used for**: Allowing users to view and interact with external services without having to leave the website (e.g., live Twitter feeds, PayPal).

<a href="https://imgur.com/HJl6RSl"><img src="https://i.imgur.com/HJl6RSl.png" title="source: imgur.com" width="375" height="300" /></a>

### Why is Beautiful Soup alone not enough to scrape web content?
- **Beautiful Soup**: It only does static scraping, which does not take JavaScript into consideration. Beautiful Soup parses the HTML it is given; it does not run scripts or interact with a browser, so content that appears only after a click or a scroll never reaches it.

- **Selenium**: In many cases, you need data that are hidden in components which get rendered on clicking JavaScript links. For example, for long reviews on many websites, you often need to click "read more" to view the full content. If you scraped a website using BeautifulSoup without clicking the "read more" button, you would only get part of that review.
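
A small sketch of this limitation, assuming a made-up snippet of HTML (the class names and the `expand()` handler are hypothetical): Beautiful Soup sees only the markup it is given, so the text that JavaScript would add after a click is simply not there.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML as the server first sends it: the review is truncated,
# and the rest would only be added by JavaScript after clicking the button.
html = """
<div class="review">
  <span class="review-text">Great trip, the crew was</span>
  <button onclick="expand()">Read more</button>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
visible_text = soup.find("span", class_="review-text").get_text()

# Beautiful Soup cannot click the button, so we only get the truncated text
print(visible_text)
```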

---
## 3. Getting started with Selenium

As always, we start by importing all the useful packages. 

Note that to be able to follow along, you should have followed the steps in the "Before class" portion of Lecture 6.

We will be using Chrome throughout.

In [1]:
from selenium import webdriver
#Service is an object that manages the starting and stopping of the ChromeDriver
from selenium.webdriver.chrome.service import Service
#Some packages have their own types of errors -- we will use the No Such Element in our code later
from selenium.common.exceptions import NoSuchElementException
#The By class is used to locate elements within a document
from selenium.webdriver.common.by import By

# and some other packages -- time, pandas, and bs4
import time 
import pandas as pd
from bs4 import BeautifulSoup

Above:
- **time** will allow us to build "breaks" into our code to slow it down.
- **pandas** will allow us to store data that we parsed from the website (you will learn more about this next week)


Below:
- The following scripts will open an instance of the Chrome browser. The instance of Chrome that opened will indicate that "Chrome is now being controlled by automated test software".

In [7]:
# create a browser object-- this should open a chrome browser using selenium
browser = webdriver.Chrome()

In [8]:
# go to a website
browser.get("http://www.newyorktimes.com/")
time.sleep(2)
browser.get("http://www.crunchbase.com/")

In [9]:
# close the browser
browser.quit()

# take a look at other browser methods
# browser.

That's it! Now you're ready to use this powerful tool! 

---
## 4. Navigating around a website with Selenium 

In [10]:
# initialize our browser
browser = webdriver.Chrome()

time.sleep(1)
browser.maximize_window()
time.sleep(1)

# let's go to the review page of BBQ Sailing Trip on TripAdvisor
link = "https://www.tripadvisor.com/Attraction_Review-g1187810-d10127777-Reviews-BBQ_Sailing_Trip-Skopelos_Town_Skopelos_Sporades.html"
browser.get(link)

time.sleep(2)



Just like HTML is used to find static content, we can use __**XML**__ (Extensible Markup Language) to find and interact with dynamic content.

**What is XML?** XML (Extensible Markup Language) is a markup language designed to store and transmit data; its focus is the content of the data. It is a powerful way to keep data in a format that can be stored, searched, and shared.


**XPath in Selenium** is an XML path used to navigate through a web page. It is a syntax for finding any element on a web page using an XML path expression. You can find the XPath by right-clicking the web page, clicking **Inspect**, and selecting the element you want to check. It is very similar to accessing content by id, class, and other attributes, but gives us even more freedom.

The format of the XPath is shown below: 



<br>
<img src="https://www.guru99.com/images/3-2016/032816_0758_XPathinSele1.png" width="800" height="280">
<br>

#### Let's access some content using XPath. How can we click the filters box with Selenium, so that we can see only summer reviews?

<a href="https://imgur.com/F13dxbp"><img src="https://i.imgur.com/F13dxbp.png" title="source: imgur.com" width="350" height="200" /></a>

<a href="https://imgur.com/2QZRoBu"><img src="https://i.imgur.com/2QZRoBu.png" title="source: imgur.com" /></a>

#### Now we can build the XPath.

By inspecting the elements on this web page, we found that:

- **tagname** = button
- **attribute** = aria-label
- **value** = "Click to open the filter"


From here we can fill in those parameters into our xpath:

//button[@aria-label='Click to open the filter']
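
To see the tagname / attribute / value pattern at work without opening a browser, here is a minimal sketch that uses Python's standard-library `xml.etree.ElementTree`, which supports a subset of XPath. The page snippet is made up and `build_xpath` is a hypothetical helper; real pages are often not well-formed XML, so this is only meant to illustrate the syntax.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet standing in for the real page
page = """
<html><body>
  <button aria-label="Click to open the filter">Filters</button>
  <button aria-label="Enable filter: June">June</button>
</body></html>
"""

def build_xpath(tagname, attribute, value):
    """Fill the //tagname[@attribute='value'] template."""
    return f"//{tagname}[@{attribute}='{value}']"

xp = build_xpath("button", "aria-label", "Click to open the filter")
print(xp)

# ElementTree expects a leading "." for searches relative to the root element
root = ET.fromstring(page)
match = root.find("." + xp)
print(match.text)
```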

In [11]:
# use xpath to click the Filter button

#get xpath for "Filters"
xp = "//button[@aria-label='Click to open the filter']"

#click on that object
time.sleep(1)
browser.find_element(By.XPATH,xp).click()
time.sleep(2)

# finding elements other ways
# By.

In [12]:
#get xpath of months to select:
xpath_june = "//button[@aria-label='Enable filter: June']"
xpath_july = "//button[@aria-label='Enable filter: July']"
xpath_august = "//button[@aria-label='Enable filter: August']"

#select months
time.sleep(1)
browser.find_element(By.XPATH, xpath_june).click()
time.sleep(1)
browser.find_element(By.XPATH, xpath_july).click()
time.sleep(1)
browser.find_element(By.XPATH, xpath_august).click()
time.sleep(2)
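
The three XPaths above differ only in the month name, so they can be generated instead of typed out. A minimal sketch, assuming the `aria-label` text keeps the `Enable filter: <Month>` pattern seen above (`month_filter_xpaths` is a hypothetical helper):

```python
def month_filter_xpaths(months):
    """Return one XPath per month, following the aria-label pattern used above."""
    return [f"//button[@aria-label='Enable filter: {month}']" for month in months]

summer_xpaths = month_filter_xpaths(["June", "July", "August"])
for xp in summer_xpaths:
    print(xp)
    # With an open browser, each one could be clicked as before:
    # browser.find_element(By.XPATH, xp).click()
    # time.sleep(1)
```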

In [13]:
# and then, we click the "Apply" button to see the resulting reviews

time.sleep(1)
xpath_apply = '//button[@class="rmyCe _G B- z _S c Wc wSSLS AeLHi sOtnj"]'
browser.find_element(By.XPATH, xpath_apply).click()
time.sleep(2)

## Other ways to identify the button

**Alternatively, we can right click the element we want > Copy > Copy XPath**

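This will give you a long XPath that walks through the page structure step by step (Chrome's **Copy full XPath** option gives the absolute version, starting from the page root)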

<a href="https://imgur.com/JkXWlcn"><img src="https://i.imgur.com/JkXWlcn.png" title="source: imgur.com" width = "500" height="400" /></a>

For the filter button, it looks like this:

//*[@id="tab-data-qa-reviews-0"]/div/div[1]/div/div/div[2]/div/div/div[1]/div/button


**Why don't we do this instead?**

- It makes your code longer
- Your code will be more likely to break if anything changes on the web page (ex. a review gets deleted)
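
The second point can be sketched with the standard-library `xml.etree.ElementTree` and two made-up versions of a page (both snippets and the `button_found` helper are hypothetical): a path that counts positions stops working as soon as an extra block is added, while a path that describes the button itself keeps working.

```python
import xml.etree.ElementTree as ET

# Made-up page before and after the site adds a banner at the top
page_before = (
    "<html><body>"
    "<div><p>Intro</p></div>"
    "<div><button aria-label='Filters'>Filters</button></div>"
    "</body></html>"
)
page_after = (
    "<html><body>"
    "<div><p>New banner</p></div>"
    "<div><p>Intro</p></div>"
    "<div><button aria-label='Filters'>Filters</button></div>"
    "</body></html>"
)

positional_path = "./body/div[2]/button"             # depends on the exact layout
attribute_path = ".//button[@aria-label='Filters']"  # depends only on the button itself

def button_found(page, path):
    """Return True if `path` locates an element in `page`."""
    return ET.fromstring(page).find(path) is not None

for name, page in [("before", page_before), ("after", page_after)]:
    print(name, button_found(page, positional_path), button_found(page, attribute_path))
```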

### What if there are multiple elements with that XPath? How do we find them all?

There are multiple buttons on the page. Let's say we want to click on the button that scrolls down to the reviews.

Here is the XPath for some clickable buttons on the page:

//button[@type='button']

In [14]:
# solving multiple buttons with the same parameter
# we use .find_elements() with an s to get all elements on a page with a given XPath

xpath_buttons = "//button[@type='button']"

time.sleep(2)
elements = browser.find_elements(By.XPATH, xpath_buttons)

print(type(elements))
print(len(elements))
elements[:10]

<class 'list'>
176


[<selenium.webdriver.remote.webelement.WebElement (session="1bf449224ed55b14f8d1775e21cc89da", element="f.9C4EB05520232CCF79CD99EA1AE20EFC.d.3D1FE9BCC36F534A256014B3BAEEACFA.e.76")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1bf449224ed55b14f8d1775e21cc89da", element="f.9C4EB05520232CCF79CD99EA1AE20EFC.d.3D1FE9BCC36F534A256014B3BAEEACFA.e.77")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1bf449224ed55b14f8d1775e21cc89da", element="f.9C4EB05520232CCF79CD99EA1AE20EFC.d.3D1FE9BCC36F534A256014B3BAEEACFA.e.78")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1bf449224ed55b14f8d1775e21cc89da", element="f.9C4EB05520232CCF79CD99EA1AE20EFC.d.3D1FE9BCC36F534A256014B3BAEEACFA.e.79")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1bf449224ed55b14f8d1775e21cc89da", element="f.9C4EB05520232CCF79CD99EA1AE20EFC.d.3D1FE9BCC36F534A256014B3BAEEACFA.e.80")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1bf449224ed55b14f8d1775e2

**How do I know which element is which??**

**We can extract the innerHTML or property or text information by .get_attribute('innerHTML'), .get_property() or .text methods** 

In [15]:
# we can extract the inner HTML as such:
for ind, element in enumerate(elements):
    html = element.get_attribute('innerHTML')
    print(f'{ind}. {html}')
    

0. <span class="biGQs _P XWJSj Wb">Skip to main content</span>
1. <span class="biGQs _P ttuOS">Discover</span>
2. <span class="biGQs _P ttuOS">Trips</span>
3. <span class="biGQs _P ttuOS">Review</span>
4. <span class="biGQs _P ttuOS"><span class="q"><svg viewBox="0 0 24 24" width="20px" height="20px" class="d Vb UmNoP"><path fill-rule="evenodd" clip-rule="evenodd" d="M9.31 9.82h4.178c-.069-1.591-.356-2.993-.766-4.017-.237-.593-.501-1.023-.756-1.293-.211-.223-.38-.3-.5-.32h-.133c-.12.02-.289.097-.5.32-.255.27-.519.7-.756 1.293-.41 1.024-.697 2.426-.767 4.017m-.374-5.14q-.135.272-.252.566C8.194 6.472 7.88 8.07 7.81 9.82H5.055a6.39 6.39 0 0 1 3.88-5.14m2.301-1.989a7.883 7.883 0 0 0-7.726 7.88 7.88 7.88 0 0 0 7.884 7.885c.584 0 .871-.014 1.11-.074.124-.031.172-.049.213-.064.058-.02.099-.036.312-.073l-.26-1.477a4 4 0 0 0-.628.159c-.031.007-.132.029-.743.029-.121 0-.313-.06-.566-.327-.255-.27-.519-.699-.756-1.292-.41-1.025-.697-2.426-.767-4.017h4.203a4.7 4.7 0 0 1-.113.843 6 6 0 0 1-.112.413

Is there **another** way of identifying the button?

In [16]:
search_term = 'read more'
for ind, element in enumerate(elements):
    # compare in lowercase so "Read more" and "read more" both match
    if search_term in element.text.lower():
        print(ind)

28
72
80
90
108


---
## 5. Using the CSS selector to find elements

We can use CSS Selectors instead of XPath to find and interact with elements

##### **What is a CSS Selector?**

A CSS selector is like a set of instructions that tells a web browser how to find and style elements on a web page. Remember that elements have different attributes; we can use CSS to look for elements with specific attribute values.


##### **Examples of CSS Selectors:**
1. Element Selector

    `button` or `a` will point to all elements with that tag name

2. Class Selector --> .
    
    '.expand' points to all elements with class='expand'

3. ID Selector--> #
    
     #submit-button points to the element with id="submit-button"

4. Attribute Selector --> [type=" "]

    eg. [ type="text" ] points to all elements with type="text"
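
The four selector types can be tried out without a browser, since Beautiful Soup's `select()` method also accepts CSS selectors. The HTML below is a made-up form used only for illustration; the same selector strings could be passed to Selenium with `By.CSS_SELECTOR`.

```python
from bs4 import BeautifulSoup

# Made-up HTML with one element for each kind of selector
html = """
<form>
  <input type="text" name="city">
  <button id="submit-button" class="expand">Search</button>
  <a class="expand" href="#more">Read more</a>
</form>
"""
soup = BeautifulSoup(html, "html.parser")

by_element = soup.select("button")            # element selector
by_class = soup.select(".expand")             # class selector
by_id = soup.select("#submit-button")         # ID selector
by_attribute = soup.select('[type="text"]')   # attribute selector

print(len(by_element), len(by_class), len(by_id), len(by_attribute))
```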

In [17]:
#locating by CSS selector

example_button = "#tab-data-qa-reviews-0 > div > div.LbPSX > div > div:nth-child(1) > div > div > div._T.FKffI.bmUTE > div.lszDU > button"

time.sleep(2)
browser.find_element(By.CSS_SELECTOR,example_button).click()
time.sleep(2)

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"#tab-data-qa-reviews-0 > div > div.LbPSX > div > div:nth-child(1) > div > div > div._T.FKffI.bmUTE > div.lszDU > button"}
  (Session info: chrome=130.0.6723.58); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception


**How did this work?** 

**'#tab-data-qa-reviews-0':** find the element whose id attribute is tab-data-qa-reviews-0;

**'> div':** find an element with tag 'div' among the direct children of the previous element;

**'> div.LbPSX':** target a child 'div' with class = LbPSX;

**'> div:nth-child(1)':** target a 'div' that is the nth (in this case, 1st) child of its parent;

**'> div':** again, going further into the child nodes of the previous element;

**'div._T.FKffI.bmUTE':** target a 'div' with class = "_T", class = "FKffI", AND class = "bmUTE";

**'button':** finally, target the button element
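
The building blocks described above (`>`, `:nth-child()`, and stacked classes) can be checked on a small made-up page using Beautiful Soup's `select()` and `select_one()`, which accept CSS selectors. The ids and class names here are hypothetical and much shorter than TripAdvisor's.

```python
from bs4 import BeautifulSoup

# Made-up page: two review cards, only the first has a "long" body
html = """
<div id="reviews">
  <div class="card"><div class="body long"><button>Read more</button></div></div>
  <div class="card"><div class="body"><button>Helpful</button></div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Walk down one level at a time, as in the selector explained above
first_button = soup.select_one("#reviews > div:nth-child(1) > div.body.long > button")

# ">" means direct child only; a space means any descendant
direct_children = soup.select("#reviews > button")
descendants = soup.select("#reviews button")

print(first_button.get_text(), len(direct_children), len(descendants))
```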

There are multiple ways to locate elements using selenium. For the full list, you can refer to: https://selenium-python.readthedocs.io/locating-elements.html

Because the webpage is dynamically generated, we often need to click the **read more** button to reveal content. 
- For example, some long reviews are only partially available on TripAdvisor-- you would need to click on "Read More" to access the full review. Other short reviews are fully available.

## Trying to click a button that doesn't exist

In [18]:
# we have already expanded this review. Since we interacted with a dynamic page, the button is no longer on it now.
# what will happen if we try to click the button that doesn't exist?

browser.find_element(By.CSS_SELECTOR,example_button).click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"#tab-data-qa-reviews-0 > div > div.LbPSX > div > div:nth-child(1) > div > div > div._T.FKffI.bmUTE > div.lszDU > button"}
  (Session info: chrome=130.0.6723.58); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception


In [19]:
# a helper function to check if the button exists on the page
def check_exists_by_css(selector):
    try:
        browser.find_element(By.CSS_SELECTOR,selector)
    except NoSuchElementException:
        return False
    return True

def check_exists_by_xpath(xp):
    try:
        browser.find_element(By.XPATH,xp)
    except NoSuchElementException:
        return False
    return True

In [20]:
# so if we try to find a nonsense element, it will just return False instead of breaking the code

check_exists_by_css('hahaha')

False
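
Another way to write such a check, sketched here as an alternative rather than as the approach used in this notebook: Selenium's `find_elements` (with an s) returns an empty list when nothing matches instead of raising `NoSuchElementException`, so the length of the result can be tested directly. `FakeBrowser` is a hypothetical stand-in so that the sketch runs without Chrome, and the plain string `"css selector"` is used in place of `By.CSS_SELECTOR`.

```python
class FakeBrowser:
    """Hypothetical stand-in for a Selenium browser, so the sketch runs without Chrome."""

    def __init__(self, known_selectors):
        self.known_selectors = set(known_selectors)

    def find_elements(self, by, selector):
        # Mimics Selenium: a list of matches, or an empty list when nothing matches
        return ["<element>"] if selector in self.known_selectors else []


def element_exists(driver, by, selector):
    """Return True if at least one element matches; no exception handling needed."""
    return len(driver.find_elements(by, selector)) > 0


fake = FakeBrowser(["#submit-button"])
print(element_exists(fake, "css selector", "#submit-button"))
print(element_exists(fake, "css selector", "hahaha"))
```

With a real session, the call would look like `element_exists(browser, By.CSS_SELECTOR, readmore_css)`.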

In [21]:
# check if the read more button exists
# note that the selector may be different on other pages

readmore_css = "#tab-data-qa-reviews-0 > div > div.LbPSX > div > div:nth-child(4) > div > div > div._T.FKffI.bmUTE > div.lszDU > button"

read_more_exists = check_exists_by_css(readmore_css)

# if the review is expandable, expand it
if read_more_exists:
    browser.find_element(By.CSS_SELECTOR,readmore_css).click()
    time.sleep(5)
else:
    print('No element found!')

No element found!


In [24]:
# let's make the previous code a helper function too

def click_read_more():
    
    # set the xpath to the buttons and the search term "read more"
    readmore_xpath = "//button[@class='UikNM _G B- _S _W _T c G_ wSSLS wnNQG']"
    search_term = 'read more'
    # find all buttons on the page and save to a list
    buttons = browser.find_elements(By.XPATH, readmore_xpath)

    # for each button, check if the text is "read more"
    for button in buttons:
        try:
            if search_term in button.text.lower():
                button.click() # if so, click it
                time.sleep(1) # pause only after an actual click
        except Exception: # the button may have gone stale or be covered; skip it
            continue

In [25]:
click_read_more()

KeyboardInterrupt: 

In [26]:
# close the browser
browser.quit()

---
## 6. Combining Selenium and BeautifulSoup

Although we can use Selenium alone to scrape content, we prefer to use it together with BeautifulSoup to make our lives easier. 

The complete workflow looks like this:
1. Use Selenium to automate web browser interaction, so that hidden content is revealed through actions such as button clicks, scrolling, and so on
2. After all the content we want to parse is revealed, we'll use BeautifulSoup to parse it like we did in Lecture 5
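
The hand-off in step 2 can be sketched as a function that takes an HTML string (such as `browser.page_source`) and returns the review texts. The `review-text` class name, the sample HTML, and the `parse_reviews` function are all made up for illustration; TripAdvisor's real class names are different.

```python
from bs4 import BeautifulSoup

def parse_reviews(page_source):
    """Extract review texts from an HTML string such as browser.page_source."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("div", class_="review-text")]

# Made-up HTML standing in for browser.page_source after step 1
sample_source = """
<div class="review-text">Great trip, the crew was friendly and the food was excellent.</div>
<div class="review-text">Beautiful beaches.</div>
"""

full_reviews = parse_reviews(sample_source)
print(full_reviews)
```

Keeping the parsing in its own function means it can be tested on saved HTML without opening a browser.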

---
## 7. Challenge

1. Use Selenium to click the "Read More" buttons 
2. Use BeautifulSoup to parse the content of the reviews

In [28]:
# set the number of pages you want to parse, and the link you want to parse from
page_num = 5
link = "https://www.tripadvisor.com/Attraction_Review-g1187810-d10127777-Reviews-BBQ_Sailing_Trip-Skopelos_Town_Skopelos_Sporades.html"

reviews = []
ratings = []
authors = []

# Open browser
browser = webdriver.Chrome()
# Go to the link
browser.get(link)
time.sleep(2)


for i in range(0, page_num):
    
    # expand the reviews
    click_read_more() 
    time.sleep(1)

    
    # parse to a soup
    page_source     = browser.page_source
    soup            = BeautifulSoup(page_source, 'html.parser')
    reviews_content = soup.find_all('div', class_="_c")
    
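    # extract the author, review_text and rating 
    for review in reviews_content:
        author_tag = review.find('a', class_='BMQDV _F Gv wSSLS SwZTJ FGwzt ukgoS')
        text_tag   = review.find('div', class_='biGQs _P pZUbB KxBGd')
        rating_tag = review.find('svg', class_='UctUV d H0')
        title_tag  = rating_tag.find('title') if rating_tag is not None else None

        # find() returns None when nothing matches, so skip incomplete blocks
        if author_tag is None or text_tag is None or title_tag is None:
            continue

        # append to our accumulative lists
        reviews.append(text_tag.text)
        ratings.append(title_tag.text[0:3])
        authors.append(author_tag.text)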
        
    # use selenium to go to the next page
    if (check_exists_by_xpath('//a[@aria-label="Next page"]')):
        browser.find_element(By.XPATH,'//a[@aria-label="Next page"]').click()
        time.sleep(3)
    else:
        break

browser.quit()
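
Note that the `[0:3]` slice stores each rating as a string. A possible follow-up, assuming the `<title>` text starts with the score (for example `5.0 of 5 bubbles`; this format is an assumption, and `parse_rating` is a hypothetical helper), is to convert it to a number:

```python
import re

def parse_rating(title_text):
    """Return the leading number in a rating title as a float, or None if absent."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", title_text)
    return float(match.group(1)) if match else None

print(parse_rating("5.0 of 5 bubbles"))
print(parse_rating("4 of 5 bubbles"))
print(parse_rating("no rating"))
```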


---
## 8. Dataframes

A nicer way to deal with data like the reviews we're scraping is to put them into a dataframe. 
You'll learn the basics of dataframes in the "before class" module of the next lecture!

In [29]:
reviews

[]

In the last step, we will store the results into a dataframe.

In [None]:
data = {
    'reviews': reviews,
    'ratings': ratings,
    'authors': authors
}

df = pd.DataFrame(data)

# check the shape of our dataframe
df.shape

In [None]:
df
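
To get a feel for what the dataframe looks like when the lists are not empty, here is a minimal sketch with made-up placeholder rows (these are not real scraped reviews). It also converts the ratings from strings to numbers so they can be averaged.

```python
import pandas as pd

# Made-up placeholder rows in the same shape as the scraped lists
sample = {
    'reviews': ['Amazing day at sea', 'Good food, crowded boat', 'Loved it'],
    'ratings': ['5.0', '4.0', '5.0'],
    'authors': ['Author A', 'Author B', 'Author C'],
}

sample_df = pd.DataFrame(sample)

# ratings were stored as strings, so convert them before doing arithmetic
sample_df['ratings'] = sample_df['ratings'].astype(float)

print(sample_df.shape)
print(sample_df['ratings'].mean())
```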

---
## 9. Extra Practice

If you'd like to get some more practice in using Selenium, you can try the following exercise:
1. Open a browser with Selenium, go to the review page of [Wall Street](https://www.tripadvisor.com/Attraction_Review-g60763-d136051-Reviews-Wall_Street-New_York_City_New_York.html) on TripAdvisor (or any other attraction of your choice)
2. Select only **Business trip** in the filter.
3. Scrape the full reviews from the first 5 pages. Can you also include the date of the review?
4. Can you see any relationship between the review months and rating scores?