---

### 🎓 **Professor**: Apostolos Filippas w/ Reina Chehayeb's help!

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Using Selenium to Parse Web Content

### 🔗 **Link**: https://www.bit.ly/WA_selenium

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---


## 1. Static content 🔒 vs. Dynamic content 🍃

<div style="text-align: center;">
<br>
<img src="https://www.cloudflare.com/resources/images/slt3lc6tev37/6ijRQV6QxiyG4zyidpgJmi/23088f026f5b01cd671274b9b994096f/caching-dynamic-content.svg" width="800" height="480">
<br>
</div>

> 🔒 **Static content**: files that are stored on the server and are the same every time they are delivered to users.
> Unless the developer makes changes, the web page always remains the same. It is like a newspaper: once an issue of a newspaper is published, it features the same articles and photos all day for everyone who picks up a copy.

<div style="text-align: center;">
        <img src="https://i.imgur.com/mxfom84.png" title="source: imgur.com" width="350" height="350" />
</div>

> 🍃 **Dynamic content** is content that changes based on factors specific to the user such as time of visit, location, and device. A dynamic webpage will not look the same for everybody, and it can change as users interact with it – like if a newspaper could rewrite itself as someone is reading it. This makes webpages more personalized and more interactive.

<div style="text-align: center;">
        <img src="https://i.imgur.com/TjkGFKx.png" title="source: imgur.com" width="350" height="350" />
</div>

---
## 2. How do dynamic websites work?

Dynamic webpages rely on several mechanisms to generate and update their content. Here we cover three common ones:


### **🌐 Server-side scripting** 

When a user requests a webpage, the server processes the script, and interacts with databases or external services. Then, it sends the dynamically generated HTML back to the user's browser.

**Used for**: Detecting that you are logging in from a certain geographic location and showing you information relevant to that location.

<div style="text-align: center;">
<a href="https://imgur.com/VR0xInc"><img src="https://i.imgur.com/VR0xInc.png" title="source: imgur.com" width="400" height="300" /></a>
</div>

### **💻 Client-side scripting** 
#### This involves using **_JavaScript_** to manipulate the content and behavior of web pages directly within the user's browser (i.e., click, scroll, play, pause, and more). It allows for interactive features, such as real-time updates and dynamic animations.
**Used for**: Scrolling and clicking, form validation (submitting a form), real-time chat and messaging, image carousels

<div style="text-align: center;">
<a href="https://imgur.com/N8VXk0G"><img src="https://i.imgur.com/N8VXk0G.png" title="source: imgur.com" width="400" height="300" /></a>
</div>

### **☎️ Application Programming Interfaces (APIs)** 
#### APIs enable different systems to communicate and share data. In dynamic websites, APIs can connect to external services or retrieve data from other sources, such as social media platforms, weather services, or payment gateways.
**Used for**: Allowing users to view and interact with external services without having to leave the website (ex. live Twitter feeds, PayPal).

<div style="text-align: center;">
<a href="https://imgur.com/HJl6RSl"><img src="https://i.imgur.com/HJl6RSl.png" title="source: imgur.com" width="375" height="300" /></a>
</div>

### Why can't we just use Beautiful Soup to scrape dynamic content?
- **Beautiful Soup**: It only does static scraping. Beautiful Soup parses the HTML that the server sends back; it does not run JavaScript or interact with a browser, so content that appears only after a script runs or a button is clicked is missing.

- **Selenium**: It drives a real browser, so it can click, scroll, and let JavaScript render content. In many cases, the data you need are hidden in components that are rendered only after you click a JavaScript link. For example, for long reviews on many websites, you often need to click "read more" to view the full content. If you scraped a website using BeautifulSoup without clicking the "read more" button, you would only get part of that review.
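To make the difference concrete, here is a minimal sketch that parses a made-up page source with BeautifulSoup. The HTML, the class names `preview` and `full`, and the `showFull()` script are all hypothetical; they stand in for a review whose full text is only filled in by JavaScript after a click.

```python
from bs4 import BeautifulSoup

# Hypothetical page source exactly as the server sends it, before any JavaScript runs
raw_html = """
<div class="review">
  <span class="preview">Great trip, the food was</span>
  <button onclick="showFull()">Read more</button>
  <span class="full"></span>
</div>
<script>
  function showFull() {
    document.querySelector('.full').textContent = ' amazing and the crew was friendly.';
  }
</script>
"""

soup = BeautifulSoup(raw_html, "html.parser")
review = soup.find("div", class_="review")

# BeautifulSoup only sees the markup; it never executes showFull()
preview_text = review.find("span", class_="preview").get_text(strip=True)
full_text = review.find("span", class_="full").get_text(strip=True)

print(repr(preview_text))  # the truncated review
print(repr(full_text))     # empty, because the script never ran
```

The second `print` shows an empty string: the rest of the review exists only after a browser runs the script, which is the part Selenium takes care of.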

---
## 3. Getting started with Selenium

As always, we start by importing all the useful packages. 

Note that to be able to follow along, you should have followed the steps in the "Before class" portion of Lecture 6.

We will be using Chrome throughout.

In [None]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

import time 
import pandas as pd
from bs4 import BeautifulSoup

Above:
- **time** will allow us to build "breaks" into our code to slow it down.
- **pandas** will allow us to store data that we parsed from the website (you will learn more about this next week)


Below:
- The following cells will open an instance of the Chrome browser. The window that opens will show a banner saying that "Chrome is being controlled by automated test software".

In [None]:
# create a browser object-- this should open a chrome browser using selenium
browser = webdriver.Chrome()

In [None]:
# go to a website
browser.get("http://www.newyorktimes.com/")
time.sleep(2)
browser.get("http://www.crunchbase.com/")

In [None]:
# close the browser
browser.quit()

# take a look at other browser methods
# browser.

That's it! Now you're ready to use this powerful tool! 

---
## 4. Navigating around a website with Selenium 

In [None]:
# initialize our browser
browser = webdriver.Chrome()

time.sleep(1)
browser.maximize_window()
time.sleep(1)

# let's go to the review page of BBQ Sailing Trip on TripAdvisor
link = "https://www.tripadvisor.com/Attraction_Review-g1187810-d10127777-Reviews-BBQ_Sailing_Trip-Skopelos_Town_Skopelos_Sporades.html"
browser.get(link)

time.sleep(2)



Web pages are written in HTML, and to locate elements on them with Selenium we can use XPath, a query language that was originally designed for __**XML**__ (Extensible Markup Language). Let's look at XML first.

## XML
- XML is a markup language designed to store, structure, and transport data or information.
- It focuses on representing the content of data rather than specifying how it should be displayed (unlike HTML).
- XML is widely used for data interchange and storage, providing a standardized way to format and organize data.
- XML documents use tags to enclose data elements, creating a hierarchical structure. For example:

```xml
<bookstore>
  <book>
    <title lang="en">Harry Potter and the mid-life crisis</title>
    <description>Go on a journey with Harry Potter and his friends as they navigate the challenges of middle age by buying a Porsche.</description>
    <price>29.99</price>
  </book>
</bookstore>
```

## XPath
- XPath (XML Path Language) is a query language used in Selenium to navigate and locate elements within a web page's DOM (Document Object Model).
- XPath was originally designed for XML documents, but it works on HTML documents as well, which makes it useful for tasks like web scraping and automated testing.
- XPath lets you traverse the hierarchy of elements in a document and select elements by their tag name, attributes, or position.
- You can find the XPath of an element by right-clicking it on the web page, clicking **Inspect**, and selecting the element you want to check. This is very similar to accessing content by id, class, and other attributes, but gives us even more freedom.

A sample XPath is shown below: 

<div style="text-align: center;">
<br>
<img src="https://www.guru99.com/images/3-2016/032816_0758_XPathinSele1.png" width="800" height="280">
<br>
</div>

- `//`:  This selects all elements in the document that match the criteria that follow, regardless of their location within the document's hierarchy.
- `tagname`: This selects all elements with the specified tag name.
- `[@attribute='value']`: This selects all elements that have the specified attribute with a value equal to the specified value.
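As a quick illustration of the `//tagname[@attribute='value']` pattern, the sketch below runs two queries against a bookstore document like the one in the XML section, using Python's built-in `xml.etree.ElementTree`. Two assumptions to keep in mind: the second book is made up so that the attribute filter has something to exclude, and ElementTree only supports a subset of XPath and expects a leading `.`, so `.//title` plays the role of `//title` here.

```python
import xml.etree.ElementTree as ET

# Bookstore example; the second <book> is hypothetical, added for illustration
xml_text = """
<bookstore>
  <book>
    <title lang="en">Harry Potter and the mid-life crisis</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="el">Hypothetical second book</title>
    <price>12.50</price>
  </book>
</bookstore>
"""

root = ET.fromstring(xml_text)

# .//title  -> every <title> element, wherever it sits in the hierarchy
all_titles = [title.text for title in root.findall(".//title")]

# .//title[@lang='en']  -> only <title> elements whose lang attribute equals 'en'
english_titles = [title.text for title in root.findall(".//title[@lang='en']")]

print(all_titles)
print(english_titles)
```

The first query returns both titles, the second only the English one. Selenium accepts the same kind of expression (without the leading `.`) through `By.XPATH`.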

## Manipulating content using XPath 

Now let's apply some filters using Selenium, so that we only see summer reviews.

<div style="text-align: center;">
<a href="https://imgur.com/F13dxbp"><img src="https://i.imgur.com/F13dxbp.png" title="source: imgur.com" width="350" height="200" /></a>
</div>

<div style="text-align: center;">
<a href="https://imgur.com/2QZRoBu"><img src="https://i.imgur.com/2QZRoBu.png" title="source: imgur.com" /></a>
</div>


By inspecting the elements on this web page, we can see that:

- **tagname** = button
- **attribute** = aria-label
- **value** = "Click to open the filter"

From here we can fill in those parameters into our xpath:

`//button[@aria-label='Click to open the filter']`

In [None]:
# use xpath to click the Filter button

#get xpath for "Filters"
xp = "//button[@aria-label='Click to open the filter']"

#click on that object
time.sleep(1)
browser.find_element(By.XPATH,xp).click()
time.sleep(2)

# finding elements other ways
# By.

In [None]:
#get xpath of months to select:
xpath_june = "//button[@aria-label='Enable filter: June']"
xpath_july = "//button[@aria-label='Enable filter: July']"
xpath_august = "//button[@aria-label='Enable filter: August']"

#select months
time.sleep(1)
browser.find_element(By.XPATH, xpath_june).click()
time.sleep(1)
browser.find_element(By.XPATH, xpath_july).click()
time.sleep(1)
browser.find_element(By.XPATH, xpath_august).click()
time.sleep(2)

In [None]:
# and then, we click the "Apply" button to see the resulting reviews

time.sleep(1)
xpath_apply = '//button[@class="rmyCe _G B- z _S c Wc wSSLS AeLHi sOtnj"]'
browser.find_element(By.XPATH, xpath_apply).click()
time.sleep(2)

## Other ways to locate elements

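Alternatively, we can right-click the element we want in the inspector and choose Copy > Copy XPath.

- This will give you a long, position-based XPath (Chrome's "Copy full XPath" option gives the fully absolute path, starting from the root of the page)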
<div style="text-align: center;">
<a href="https://imgur.com/JkXWlcn"><img src="https://i.imgur.com/JkXWlcn.png" title="source: imgur.com" width = "500" height="400" /></a>
</div>

For the filter button, it looks like this:
- `//*[@id="tab-data-qa-reviews-0"]/div/div[1]/div/div/div[2]/div/div/div[1]/div/button`


**Why don't we do this instead?**

- It makes your code longer
- Your code will be more likely to break if anything changes on the web page (ex. a review gets deleted)
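The second point can be demonstrated without a browser. The sketch below uses `xml.etree.ElementTree` on two tiny, made-up versions of a page: one before and one after a new block is added at the top. The markup and the "Next page" button are hypothetical; only the filter button's `aria-label` is taken from the page we inspected.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified page before a change...
page_before = ET.fromstring("""
<div id="reviews">
  <div><button aria-label="Click to open the filter">Filters</button></div>
  <div><button aria-label="Next page">Next</button></div>
</div>
""")

# ...and after a new block is inserted above the filter button
page_after = ET.fromstring("""
<div id="reviews">
  <div><p>New banner</p></div>
  <div><button aria-label="Click to open the filter">Filters</button></div>
  <div><button aria-label="Next page">Next</button></div>
</div>
""")

positional = "./div[1]/button"  # "the button inside the first div"
by_attribute = ".//button[@aria-label='Click to open the filter']"


def button_text(page, path):
    """Return the text of the first match, or None if the path matches nothing."""
    match = page.find(path)
    return None if match is None else match.text


for name, page in [("before", page_before), ("after", page_after)]:
    print(name, button_text(page, positional), button_text(page, by_attribute))
```

Before the change both paths find the "Filters" button. After the change the positional path finds nothing, while the attribute-based path still works.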

## Selecting multiple elements by XPath

There are multiple buttons on the page. Let's say we want to click on the button that scrolls down to the reviews.

Here is the XPath for some clickable buttons on the page:

`//button[@type='button']`

In [None]:
# finding multiple buttons with the same parameter
# we use .find_elements() with an s to get all elements on a page with a given XPath

xpath_buttons = "//button[@type='button']"

time.sleep(2)
elements = browser.find_elements(By.XPATH, xpath_buttons)

print(type(elements))
print(len(elements))
elements[:10]

**How do I know which element is which??**

**We can extract the inner HTML, a property, or the visible text of an element with `.get_attribute('innerHTML')`, `.get_property()`, or `.text`.**

In [None]:
# we can extract the inner HTML as such:
for ind, element in enumerate(elements):
    html = element.get_attribute('innerHTML')
    print(f'{ind}: \t {html}')

Is there **another** way of identifying the button?

In [None]:
search_term = 'read more'
for ind, element in enumerate(elements):
    if search_term in element.text.lower():
        print(f"{ind}: \t {element.get_attribute('innerHTML')}")
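If you need this kind of text search more than once, it could be wrapped in a small function. The sketch below is one possible version; `find_by_text` is a name chosen here, and it only assumes that each item in the list has a `.text` attribute, as Selenium elements do.

```python
def find_by_text(elements, search_term):
    """Return (index, element) pairs whose visible text contains search_term, ignoring case."""
    term = search_term.lower()
    return [
        (index, element)
        for index, element in enumerate(elements)
        if term in element.text.lower()
    ]


# Usage, assuming `elements` is the list returned by browser.find_elements(...):
# for index, element in find_by_text(elements, "read more"):
#     print(index, element.get_attribute("innerHTML"))
```

Returning the index together with the element makes it easy to go back to `elements[index]` later.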

---
## 5. Using the CSS selector to find elements

We can use CSS Selectors instead of XPath to find and interact with elements

##### **What is a CSS Selector?**

A CSS selector is like a set of instructions that tells a web browser how to find and style elements on a web page. Remember that elements have different attributes; we can use CSS to look for elements with specific attribute values.


##### **Examples of CSS Selectors:**
1. Element Selector --> tag name

    `button` or `a` points to all elements with that tag

2. Class Selector --> .
    
    `.expand` points to all elements with class="expand"

3. ID Selector --> #
    
    `#submit-button` points to the element with id="submit-button"

4. Attribute Selector --> [attribute="value"]

    e.g. `[type="text"]` points to all elements with type="text"
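The four selector types can be tried out without opening a browser, because BeautifulSoup's `select()` method accepts CSS selectors too. The form below is a made-up snippet, written only so that each selector has something to match.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only to try out the selectors
html = """
<form>
  <input type="text" class="expand" name="q">
  <input type="text" name="city">
  <button id="submit-button" class="expand" type="submit">Go</button>
  <a href="/help">Help</a>
</form>
"""

soup = BeautifulSoup(html, "html.parser")

selectors = {
    "element": "button",           # all <button> elements
    "class": ".expand",            # all elements with class="expand"
    "id": "#submit-button",        # the element with id="submit-button"
    "attribute": '[type="text"]',  # all elements with type="text"
}

# count how many elements each selector matches
matches = {name: len(soup.select(css)) for name, css in selectors.items()}
print(matches)
```

The same selector strings can be passed to Selenium with `By.CSS_SELECTOR`.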

In [None]:
# locating by CSS selector
example_button = "#tab-data-qa-reviews-0 > div > div.LbPSX > div > div:nth-child(1) > div > div > div._T.FKffI.bmUTE > div.lszDU > button"
# inspect -> "Copy Selector"

time.sleep(2)
browser.find_element(By.CSS_SELECTOR,example_button).click()
time.sleep(2)

**How did this work?** 

- `#tab-data-qa-reviews-0`: find the element whose id is tab-data-qa-reviews-0

- `> div`: move to a direct child of the previous element with the tag `div`

- `> div.LbPSX`: move to a child `div` with the class LbPSX

- `> div > div:nth-child(1)`: go one `div` deeper, then to its child `div` that is the 1st child of its parent

- `> div > div`: keep going further down into the child nodes

- `> div._T.FKffI.bmUTE`: move to a child `div` that has all three classes `_T`, `FKffI`, AND `bmUTE`

- `> div.lszDU`: move to a child `div` with the class lszDU

- `> button`: finally, target the button element

There are multiple ways to locate elements using selenium. For the full list, you can refer to: https://selenium-python.readthedocs.io/locating-elements.html

Because the webpage is dynamically generated, we often need to click the **read more** button to reveal content. 
- For example, some long reviews are only partially available on TripAdvisor-- you would need to click on "Read More" to access the full review. Other short reviews are fully available.

## Trying to click a button that doesn't exist

In [None]:
# we have already expanded this review. Since we interacted with a dynamic page, the button is no longer on it now.
# what will happen if we try to click the button that doesn't exist?

browser.find_element(By.CSS_SELECTOR,example_button).click()

In [None]:
# a helper function to check if the button exists on the page
def check_exists_by_css(selector):
    try:
        browser.find_element(By.CSS_SELECTOR,selector)
    except NoSuchElementException:
        return False
    return True

def check_exists_by_xpath(xp):
    try:
        browser.find_element(By.XPATH,xp)
    except NoSuchElementException:
        return False
    return True

In [None]:
# so if we try to find a nonsense element, it will just return False instead of breaking the code

check_exists_by_css('hahaha')

In [None]:
# check if the read more button exists
# note: the selector may be different if you look at other pages

readmore_css = "#tab-data-qa-reviews-0 > div > div.LbPSX > div > div:nth-child(4) > div > div > div._T.FKffI.bmUTE > div.lszDU > button"

read_more_exists = check_exists_by_css(readmore_css)

# if the reviews are expandable, expand the reviews
if read_more_exists:
    browser.find_element(By.CSS_SELECTOR,readmore_css).click()
    time.sleep(5)
else:
    print('No element found!')

In [None]:
# let's make the previous code a helper function too

def click_read_more():
    
    # set the xpath to the buttons and the search term "read more"
    readmore_xpath = "//button[@class='UikNM _G B- _S _T c G_ y wSSLS wnNQG']"
    search_term = 'read more'
    # find all buttons on the page and save to a list
    buttons = browser.find_elements(By.XPATH, readmore_xpath)

    # for each button, check if the text is "read more"
    for button in buttons:
        if search_term in button.text.lower():
            try:
                button.click() # if so, click it
            except Exception: # if the click fails (e.g. the button is covered), skip this button
                continue
        time.sleep(1)

In [None]:
click_read_more()

In [None]:
# close the browser
browser.quit()

---
## 6. Combining Selenium and BeautifulSoup

Whereas we can use Selenium alone to scrape content, we prefer to use it together with BeautifulSoup to make our lives easier. 

The complete workflow would be like:
1. Using Selenium to automate browser interaction, so that hidden content is revealed through actions such as button clicks and scrolling
2. After all the content we want to parse is revealed, we'll use BeautifulSoup to parse it like we did in Lecture 5
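The second step of this workflow might look like the sketch below. Everything about the markup is an assumption made for illustration: the class names `review-card`, `author`, and `body` are invented and do not match TripAdvisor's real page, and `sample_source` stands in for `browser.page_source` after Selenium has expanded the reviews.

```python
from bs4 import BeautifulSoup


def parse_reviews(page_source):
    """Turn a page source string into a list of {'author', 'text'} dictionaries."""
    soup = BeautifulSoup(page_source, "html.parser")
    rows = []
    # 'review-card', 'author' and 'body' are hypothetical class names
    for card in soup.find_all("div", class_="review-card"):
        rows.append({
            "author": card.find("span", class_="author").get_text(strip=True),
            "text": card.find("p", class_="body").get_text(strip=True),
        })
    return rows


# Stand-in for browser.page_source once every "Read more" button has been clicked
sample_source = """
<div class="review-card">
  <span class="author">Maria</span>
  <p class="body">Lovely day at sea, great food.</p>
</div>
<div class="review-card">
  <span class="author">Jon</span>
  <p class="body">Friendly crew and beautiful beaches.</p>
</div>
"""

rows = parse_reviews(sample_source)
print(rows)
```

Keeping the parsing in a function that takes a plain string means it can be tested on saved HTML, without opening a browser every time.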

---
## 7. Challenge

1. Use Selenium to click all the "Read More" buttons 
2. Use BeautifulSoup to parse the content of the reviews
3. Locate and click the "next page" button, if it exists and if you're scraping more pages

In [None]:
# set the number of pages you want to parse, and the link you want to parse from
page_num = 1
link = "https://www.tripadvisor.com/Attraction_Review-g1187810-d10127777-Reviews-BBQ_Sailing_Trip-Skopelos_Town_Skopelos_Sporades.html"

# create some empty lists to store the reviews, ratings, and authors as you go
reviews = []
ratings = []
authors = []

# Open browser
browser = webdriver.Chrome()
# Go to the link
browser.get(link)
time.sleep(2)

# loop through the pages you want to scrape
for i in range(0, page_num):
    
    # expand the reviews on that page
    click_read_more() 
    time.sleep(1)

    
    # parse the page to a soup
    page_source     = browser.page_source
    soup            = BeautifulSoup(page_source, 'html.parser')
    reviews_content = soup.find_all('div', class_="_c")
    
    # extract the author, review_text and rating for each review on the page
    # You need to fill in the elements for the text, author, and rating (hint: use inspect)
    for review in reviews_content:
        review_text = None # <<< Get the text content of the review >>>
        author      = None # <<< Get the author name >>>
        rating      = None # <<< how many stars out of 5? >>>
        
        # append to our accumulative lists
        reviews.append(review_text)
        ratings.append(rating)
        authors.append(author)
        
    # use selenium to go to the next page
    next_page = None 
    # << Find the xpath or CSS selector for the next page button >>
    # << check if the next page button exists for the page you are currently on. If it does, click it. If not, exit the loop.>>
    # << FILL IN YOUR CODE HERE! >>


---
## 8. Dataframes

A nicer way to deal with data like the data we're scraping is to put it into a dataframe. 
You'll learn the basics of data frames in the "before class" module of the next lecture!

In [None]:
reviews

In the last step, we will store the results into a dataframe.

In [None]:
data = {
    'reviews': reviews,
    'ratings': ratings,
    'authors': authors
}

df = pd.DataFrame(data)

# check the shape of our dataframe
df.shape

In [None]:
df

---
## 9. Extra Practice

If you'd like to get some more practice in using Selenium, you can try the following exercise:
1. Open a browser with Selenium, go to the review page of [Wall Street](https://www.tripadvisor.com/Attraction_Review-g60763-d136051-Reviews-Wall_Street-New_York_City_New_York.html) on TripAdvisor (or any other attraction of your choice)
2. Select only **Business trip** in the filter.
3. Scrape the full reviews from the first 5 pages. Can you also include the date of the review?
4. Can you see any relationship between the review months and rating scores?