---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Using Selenium to Parse Web Content

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---


## 1. Static content 🔒 vs. Dynamic content 🍃

<div style="text-align: center;">
<br>
<img src="https://raw.githubusercontent.com/apostolosfilippas/wa/e60f312c3435e31da72af81baed7194daf98cf11/assets/selenium-1.svg" width="800" height="480">
<br>
</div>

> 🔒 **Static content** is any file that is stored in a server and is the same every time it is delivered to users. Unless the developer makes changes themselves, the web page always remains the same. It is like a newspaper: once an issue of a newspaper is published, it features the same articles and photos all day for everyone who picks up a copy.

<div style="text-align: center;">
        <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-2.png?raw=true" title="source: imgur.com" width="350" height="350" />
</div>

> 🍃 **Dynamic content** is content that changes based on factors specific to the user such as time of visit, location, and device. A dynamic webpage will not look the same for everybody, and it can change as users interact with it – like if a newspaper could rewrite itself as someone is reading it. This makes webpages more personalized and more interactive.

<div style="text-align: center;">
        <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-3.png?raw=true" title="source: imgur.com" width="350" height="350" />
</div>

---
## 2. How do dynamic websites work?

Dynamic webpages rely on several mechanisms to generate and update their content. Here we cover three common ones:


### **🌐 Server-side scripting** 

When a user requests a webpage, the server processes the script, and interacts with databases or external services. Then, it sends the dynamically generated HTML back to the user's browser.

**Used for**: detecting that you are logging in from a certain geographic location and showing you relevant information for that location.

<div style="text-align: center;">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-4.png?raw=true" title="source: imgur.com" width="400" height="300" />
</div>

### **💻 Client-side scripting** 
This involves using **_JavaScript_** to manipulate the content and behavior of web pages directly within the user's browser, in response to actions such as clicking, scrolling, playing, and pausing. It allows for interactive features, such as real-time updates and dynamic animations.

**Used for**: Scrolling and clicking, form validation (checking a form before it is submitted), real-time chat and messaging, image carousels.

<div style="text-align: center;">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-5.png?raw=true" title="source: imgur.com" width="400" height="300" />
</div>

### **☎️ Application Programming Interfaces (APIs)** 
APIs enable different systems to communicate and share data. In dynamic websites, APIs can connect to external services or retrieve data from other sources, such as social media platforms, weather services, or payment gateways.

**Used for**: Allowing users to view and interact with external services without having to leave the website (e.g., live Twitter feeds, PayPal).

<div style="text-align: center;">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-6.png?raw=true" title="source: imgur.com" width="375" height="300" />
</div>

### Why is Beautiful Soup alone not enough to scrape web content?
- **Beautiful Soup**: It only does static scraping. It parses the HTML that the server sends back, but it does not run JavaScript or interact with a browser, so content that appears only after a click or a scroll never reaches it.

- **Selenium**: In many cases, the data you need are hidden in components that are rendered only after you click a JavaScript-driven link or button. For example, for long reviews on many websites, you need to click "read more" to view the full content. If you scraped such a website using Beautiful Soup without clicking the "read more" button, you would only get part of that review. Selenium drives a real browser, so it can perform that click first.
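
To make the difference concrete, here is a minimal sketch that uses only Python's standard library. The page, the review text, and the `showFull` function are made up for illustration, and `html.parser` stands in for any static parser such as Beautiful Soup. The parser sees only the truncated preview, because the full text is produced by JavaScript that a static parser never runs.

```python
from html.parser import HTMLParser

# Hypothetical page: the full review only appears after JavaScript runs.
PAGE = """
<div class="review">
  <p id="preview">Great location, friendly staff...</p>
  <button onclick="showFull()">Read more</button>
</div>
<script>
  function showFull() {
    document.getElementById("preview").textContent =
      "Great location, friendly staff, and a quiet room.";
  }
</script>
"""


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> tags, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data.strip())


parser = ParagraphText()
parser.feed(PAGE)
parser.close()

# The script is never executed, so only the preview text is available.
static_text = " ".join(parser.chunks)
print(static_text)
```

A browser driven by Selenium could click the button first, which would run the script and reveal the full review.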

---
## 3. Getting started with Selenium

As always, we start by importing all the useful packages. 

Note that to be able to follow along, you should have followed the steps in the "Before class" portion of Lecture 6.

We will be using Chrome throughout.

In [None]:
# !pip install selenium webdriver-manager beautifulsoup4 pandas

In [None]:
from selenium import webdriver # type: ignore
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


import time 
import pandas as pd
from bs4 import BeautifulSoup

Above:
- **time** will allow us to build "breaks" into our code to slow it down.
- **pandas** will allow us to store data that we parsed from the website (you will learn more about this next week)


Below:
- The following scripts will open an instance of the Chrome browser. The instance of Chrome that opened will indicate that "Chrome is now being controlled by automated test software".

In [None]:
# create a browser object-- this should open a chrome browser using selenium
browser = webdriver.Chrome()

In [None]:
# go to a website
browser.get("http://www.newyorktimes.com/")
time.sleep(2)
browser.get("https://www.openai.com/")


In [None]:
# close the browser
browser.quit()

# take a look at other browser methods
# browser.

That's it! Now you're ready to use this powerful tool! 

---
## 4. Navigating around a website with Selenium 

In [None]:
#  initialize our browser
browser = webdriver.Chrome()

time.sleep(1)
browser.maximize_window()
time.sleep(2)

# let's go to the review page of Brooklyn Hotel on Booking.com
link = "https://www.booking.com/hotel/us/bklyn-house-new-york-brooklyn.html"

browser.get(link)

time.sleep(2)



Just like we used HTML tags to find static content, we can use **XPath**, a query language originally designed for __**XML**__ (Extensible Markup Language), to find and interact with dynamic content. We start with a quick look at XML.

## XML
- XML is a markup language designed to store, structure, and transport data or information.
- It focuses on representing the content of data rather than specifying how it should be displayed (unlike HTML)
- XML is widely used for data interchange and storage, providing a standardized way to format and organize data.
- XML documents use tags to enclose data elements, creating a hierarchical structure. For example:

```xml
<bookstore>
  <book>
    <title lang="en">Harry Potter and the mid-life crisis</title>
    <description>Go on a journey with Harry Potter and his friends as they navigate the challenges of middle age by buying a Porsche.</description>
    <price>29.99</price>
  </book>
</bookstore>
```
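
As a minimal sketch of how that hierarchy can be read programmatically, the snippet below parses a shortened version of the bookstore example with Python's built-in `xml.etree.ElementTree` module (chosen here for illustration; it is not needed for Selenium). Tags become nested elements, and attributes such as `lang` are available as a dictionary on each element.

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book>
    <title lang="en">Harry Potter and the mid-life crisis</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

# Parse the text into a tree; `root` is the outermost <bookstore> element.
root = ET.fromstring(xml_text.strip())

# Walk down the hierarchy one level at a time.
book = root.find("book")
title = book.find("title")
price = float(book.find("price").text)

print(root.tag)
print(title.text, title.attrib["lang"], price)
```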

## XPath
- XPath (XML Path Language) is a query language used in Selenium to navigate and locate elements within a web page's DOM (Document Object Model).
- Although XPath was originally designed for XML documents, it works equally well on HTML, which makes it particularly useful for tasks like web scraping and automated testing.
- XPath lets you traverse the hierarchy of elements in a document, so you can select an element by its position, its tag, its attributes, or its text.
- You can find an element's XPath by right-clicking the web page, clicking **Inspect**, and selecting the element you want to check. This is very similar to accessing content by id, class, and other attributes, but it gives us even more freedom.

A sample XPath is shown below: 



<br>
<img src="https://raw.githubusercontent.com/apostolosfilippas/wa/refs/heads/main/assets/selenium-7.webp" width="800" height="280">
<br>

- `//`:  This selects all elements in the document that match the criteria that follow, regardless of their location within the document's hierarchy.
- `tagname`: This selects all elements with the specified tag name.
- `[@attribute='value']`: This selects all elements that have the specified attribute with a value equal to the specified value.
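
These building blocks can be tried out without a browser. The sketch below runs them against the bookstore example using Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath and expects paths to start with `.//` rather than `//`. The second book is added purely for illustration. In Selenium, the same kind of expression would be passed to `find_element(By.XPATH, ...)`.

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book>
    <title lang="en">Harry Potter and the mid-life crisis</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="fr">Le Petit Prince</title>
    <price>9.99</price>
  </book>
</bookstore>
"""
root = ET.fromstring(xml_text.strip())

# .//tagname : every <title>, no matter how deeply it is nested
all_titles = root.findall(".//title")

# .//tagname[@attribute='value'] : only the titles whose lang is "en"
english_titles = root.findall(".//title[@lang='en']")

print(len(all_titles))
print([t.text for t in english_titles])
```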

## Manipulating content using XPath 

Now let's apply some filters using Selenium, so that we only see summer reviews.

<a href="https://imgur.com/a/3Aq34yG"><img src="https://i.imgur.com/z2DEFYU.png" title="source: imgur.com" width="250" height="250" /></a>

<a href="https://imgur.com/a/EHpkA6G"><img src="https://i.imgur.com/GElF4on.png" title="source: imgur.com" /></a>

#### Now we can build the XPath.

By inspecting the elements on this web page, we found that:

- **tagname** = a (it is a link/anchor element)
- **attribute** = data-testid
- **value** = "Property-Header-Nav-Tab-Trigger-reviews"


From here we can fill those parameters into our XPath:

`//a[@data-testid='Property-Header-Nav-Tab-Trigger-reviews']`

However, website structures can change frequently, so we can also match on the visible text of the link:

`//a[contains(., 'Guest reviews')]`

In [None]:
# use XPath to click the "Guest reviews" tab (it may be rendered as a link or a button)
xp = ("//a[contains(.,'Guest reviews')]"
      " | //button[contains(.,'Guest reviews')]"
      " | //a[.//span[contains(.,'Guest reviews')]]"
      " | //button[.//span[contains(.,'Guest reviews')]]")

time.sleep(1)
el = browser.find_element(By.XPATH, xp)
# scroll so it's on-screen (helps avoid hidden/covered element issues)
browser.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
time.sleep(1)
# Re-find the element after scrolling to avoid stale reference
el = browser.find_element(By.XPATH, xp)
el.click()
time.sleep(2)

In [None]:
time.sleep(2)

# Close any popup that might be blocking
close_xp = "//button[contains(@aria-label, 'Close')]"
try:
    el = browser.find_element(By.XPATH, close_xp)
    el.click()
    time.sleep(1)
except Exception:
    pass  # No popup to close (or it could not be clicked)

# Click "Show more" button
show_more_xp = "//button[contains(., 'Show more')]"
el = browser.find_element(By.XPATH, show_more_xp)
browser.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
time.sleep(1)
el = browser.find_element(By.XPATH, show_more_xp)
el.click()
time.sleep(2)

# Click "Show less" button  
show_less_xp = "//button[contains(., 'Show less')]"
el = browser.find_element(By.XPATH, show_less_xp)
el.click()
time.sleep(2)     


## Other ways to identify the button

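**Alternatively, we can right-click the element we want > Copy > Copy XPath**

This gives you a long, position-based XPath that walks down the page one element at a time (Chrome's **Copy full XPath** option gives the fully absolute version).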

<a href="https://imgur.com/a/9nbYj0y"><img src="https://i.imgur.com/Nb2Jczu.png" title="source: imgur.com" width = "600" height="200" /></a>

For the "Show more" button, it looks like this:

`//*[@id="b2hotelPage"]/div[26]/div/div/div/div/div[2]/div/div[4]/div/div[2]/div/button[2]/span`



**Why don't we do this instead?**

- It makes your code longer
- Your code will be more likely to break if anything changes on the web page
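
The second point can be sketched with a toy page. Both snippets of markup below are made up, and Python's built-in `xml.etree.ElementTree` (which supports a limited XPath subset) stands in for the browser. The position-based path stops working as soon as a new `div` is added at the top, while the attribute-based path still finds the button.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a small redesign.
ORIGINAL = """
<body>
  <div><p>Header</p></div>
  <div><button data-testid="show-more">Show more</button></div>
</body>
"""

REDESIGNED = """
<body>
  <div><p>Banner</p></div>
  <div><p>Header</p></div>
  <div><button data-testid="show-more">Show more</button></div>
</body>
"""

POSITIONAL = "./div[2]/button"                       # "the button in the 2nd div"
BY_ATTRIBUTE = ".//button[@data-testid='show-more']"  # "the button with this test id"


def find_text(markup, path):
    """Return the text of the first element matching `path`, or None."""
    node = ET.fromstring(markup.strip()).find(path)
    return node.text if node is not None else None


for name, markup in [("original", ORIGINAL), ("redesigned", REDESIGNED)]:
    print(name, find_text(markup, POSITIONAL), find_text(markup, BY_ATTRIBUTE))
```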

### What if there are multiple elements with that XPath? How do we find them all?

There are multiple buttons on the page. Let's find all of them and inspect what they contain.

Here is an XPath that matches buttons on the page:

`//button[@type='button']`

In [None]:
# Grabbing all the buttons on a page
xpath_buttons = "//button[@type='button']"

time.sleep(2)
elements = browser.find_elements(By.XPATH, xpath_buttons)

print(type(elements))
print(len(elements))
elements[:10]

**How do I know which element is which?**

**We can inspect each element with the `.text` attribute, or with the `.get_attribute('innerHTML')` and `.get_property()` methods.**

In [None]:
# grab all buttons in the page (or popup)
elements = browser.find_elements(By.XPATH, "//button")

print(len(elements))   # how many buttons total?
for ind, element in enumerate(elements[:15]):  # just show first 15
    print(f"{ind}. {element.text} | {element.get_attribute('innerHTML')}")

Is there **another** way of identifying the button?

In [None]:
print("\nChecking for reviews content...")

# First check main page content
found_in_main = False
try:
    # Look for review-related elements in main content
    main_review_selectors = [
        "//*[contains(text(), 'Select topics')]",
        "//*[contains(text(), 'Filter reviews')]", 
        "//*[contains(text(), 'All reviews')]",
        "//div[contains(@class, 'review')]",
        "//*[contains(text(), 'Guest review')]"
    ]
    
    for selector in main_review_selectors:
        try:
            element = browser.find_element(By.XPATH, selector)
            print(f"Found reviews in main content with selector: {selector}")
            found_in_main = True
            break
        except NoSuchElementException:
            continue
            
    if not found_in_main:
        print("Reviews not found in main content")
        
except Exception as e:
    print(f"Error checking main content: {e}")

# If not in main content, check iframes
if not found_in_main:
    print("\nChecking iframes for reviews...")
    
    # List iframes
    iframes = browser.find_elements(By.TAG_NAME, "iframe")
    print(f"iframes: {len(iframes)}")

    # Try each iframe until we see the reviews content
    found = False
    for i, f in enumerate(iframes):
        browser.switch_to.frame(f)
        try:
            # Look for various review-related text/elements
            review_indicators = [
                "//*[contains(text(), 'Select topics to read reviews')]",
                "//*[contains(text(), 'Filter reviews')]",
                "//*[contains(text(), 'All reviews')]", 
                "//div[contains(@class, 'review')]",
                "//*[contains(text(), 'Guest review')]"
            ]
            
            for indicator in review_indicators:
                try:
                    browser.find_element(By.XPATH, indicator)
                    print(f"found reviews frame: {i} with indicator: {indicator}")
                    found = True
                    break
                except NoSuchElementException:
                    continue
            
            if found:
                break
                
        except Exception as e:
            print(f"Error in iframe {i}: {e}")
        
        browser.switch_to.default_content()

    if not found:
        browser.switch_to.default_content()
        print("reviews frame not found")
        
        # Debug: Let's see what's actually in each iframe
        print("\nDEBUG - Content preview of each iframe:")
        for i, f in enumerate(iframes):
            browser.switch_to.frame(f)
            try:
                body_text = browser.find_element(By.TAG_NAME, "body").text[:100]
                print(f"iframe {i}: {body_text}")
            except Exception:
                print(f"iframe {i}: could not read content")
            browser.switch_to.default_content()
    else:
        print("Reviews content successfully found!")
        # You're now in the correct frame (if it was in an iframe)
        # Continue with your reviews scraping code here     

---
## 5. Using the CSS selector to find elements

We can use CSS Selectors instead of XPath to find and interact with elements

##### **What is a CSS Selector?**

A CSS selector is like a set of instructions that tells a web browser how to find and style elements on a web page. Remember that elements have different attributes; we can use CSS to look for elements with specific attribute values.


##### **Examples of CSS Selectors:**
1. Element Selector

    `button` or `a` points to all `<button>` or `<a>` elements, respectively

2. Class Selector --> `.`
    
    `.expand` points to all elements with class="expand"

3. ID Selector --> `#`
    
    `#submit-button` points to the element with id="submit-button"

4. Attribute Selector --> `[attribute="value"]`

    e.g. `[type="text"]` points to all elements with type="text"
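
Here is a minimal sketch of the four selector types, run against a made-up HTML snippet with BeautifulSoup's `select` method so that it works without a browser. Selenium's `By.CSS_SELECTOR` accepts the same selector strings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only to exercise each selector type.
html = """
<form>
  <input type="text" name="query">
  <a class="expand" href="#more">Read more</a>
  <button id="submit-button" type="submit">Search</button>
  <button class="expand" type="button">Show all</button>
</form>
"""
soup = BeautifulSoup(html, "html.parser")

by_element = soup.select("button")          # 1. element selector
by_class = soup.select(".expand")           # 2. class selector
by_id = soup.select("#submit-button")       # 3. ID selector
by_attribute = soup.select('[type="text"]')  # 4. attribute selector

print(len(by_element))
print([el.name for el in by_class])
print(by_id[0].get_text())
print(by_attribute[0]["name"])
```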

In [None]:
example_button = "button, a.bui-button, a[data-testid*='review'], *[data-testid*='review']"

time.sleep(2)
cands = browser.find_elements(By.CSS_SELECTOR, example_button)

print("buttons found:", len(cands))
for i, el in enumerate(cands[:10]):  # show the first 10
    # Try getting text from the element or its children
    text = el.text.strip()
    if not text:  # If no direct text, try getting from child elements
        try:
            text = el.find_element(By.XPATH, ".//span | .//div").text.strip()
        except NoSuchElementException:
            text = "no text found"
    print(i, repr(text))

**How did this work?** 

`button`: selects all button elements

`a.bui-button`: selects all `a` elements that have class `bui-button`

`a[data-testid*='review']`: selects all `a` elements whose `data-testid` attribute contains the text 'review'

`*[data-testid*='review']`: selects any element whose `data-testid` attribute contains 'review'


There are multiple ways to locate elements using Selenium. For the full list, you can refer to: https://selenium-python.readthedocs.io/locating-elements.html

Because the webpage is dynamically generated, we often need to click the **read more** button to reveal content. 
- For example, some long reviews are only partially shown on TripAdvisor: you would need to click on "Read More" to access the full review. Other, shorter reviews are shown in full.

## Trying to click a button that doesn't exist

In [None]:
# we have already expanded this review. Since we interacted with a dynamic page, the button is no longer on it now.
# what will happen if we try to click the button that doesn't exist?

browser.find_element(By.CSS_SELECTOR,example_button).click()

In [None]:
# a helper function to check if the button exists on the page
def check_exists_by_css(selector):
    try:
        browser.find_element(By.CSS_SELECTOR,selector)
    except NoSuchElementException:
        return False
    return True

def check_exists_by_xpath(xp):
    try:
        browser.find_element(By.XPATH,xp)
    except NoSuchElementException:
        return False
    return True

In [None]:
# so if we try to find a nonsense element, it will just return False instead of breaking the code

check_exists_by_css('hahaha')
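
A related idea is waiting for an element to appear instead of pausing for a fixed time with `time.sleep`. Selenium provides `WebDriverWait` for this; the sketch below only illustrates the underlying polling idea in plain Python so that it runs without a browser. The helper name `wait_until` and the fake condition are hypothetical.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.1):
    """Call `condition()` repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the condition became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


# Fake condition standing in for "is the element on the page yet?":
# it only becomes true on the third check.
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return calls["n"] >= 3


became_ready = wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, poll=0.01)

print(became_ready, calls["n"], timed_out)
```

With a live browser, the condition could be something like `lambda: check_exists_by_xpath(show_more_xp)`, so the code continues as soon as the button appears rather than after a fixed pause.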

In [None]:
# Sometimes hotel responses are collapsed - let's expand them
# Look for "Continue reading" buttons in hotel responses

continue_reading_xp = "//button[contains(., 'Continue reading')]"

# Find all "Continue reading" buttons
continue_buttons = browser.find_elements(By.XPATH, continue_reading_xp)

print(f"Found {len(continue_buttons)} 'Continue reading' buttons")

# Click each one to expand the hotel responses
for button in continue_buttons:
    try:
        # Scroll to button
        browser.execute_script("arguments[0].scrollIntoView({block:'center'});", button)
        time.sleep(0.5)
        # Click it
        button.click()
        time.sleep(0.5)
    except Exception:
        # If it fails (maybe button disappeared), just continue
        continue

print("Expanded all hotel responses")


In [None]:
# close the browser
browser.quit()

---
## 6. Combining Selenium and BeautifulSoup

While we can use Selenium alone to scrape content, we prefer to use it together with BeautifulSoup to make our lives easier. 

The complete workflow looks like this:
1. Use Selenium to automate the browser, so that hidden content is revealed through actions such as button clicks and scrolling.
2. Once all the content we want is revealed, use BeautifulSoup to parse it like we did in Lecture 5.
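
Step 2 can be sketched without a browser by feeding BeautifulSoup a string in place of `browser.page_source`. The HTML below is a made-up stand-in, the `data-testid` values are the ones used in this notebook's challenge code and may not match the live site, and `parse_reviews` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Stand-in for browser.page_source after Selenium has revealed the content.
page_source = """
<div data-testid="review-card">
  <h4 data-testid="review-title">Great stay</h4>
  <div data-testid="review-positive-text">Clean room and friendly staff.</div>
</div>
<div data-testid="review-card">
  <h4 data-testid="review-title">Just okay</h4>
</div>
"""


def parse_reviews(html):
    """Return one dict per review card, with defaults for missing parts."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for card in soup.find_all("div", {"data-testid": "review-card"}):
        title = card.find("h4", {"data-testid": "review-title"})
        positive = card.find("div", {"data-testid": "review-positive-text"})
        reviews.append({
            "title": title.get_text(strip=True) if title else "No title",
            "positive": positive.get_text(strip=True) if positive else "",
        })
    return reviews


reviews = parse_reviews(page_source)
for review in reviews:
    print(review)
```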

---
## 7. Challenge

1. Use Selenium to go through the reviews 
2. Use BeautifulSoup to parse the content of the reviews

In [None]:
import re

# Initialize lists
authors = []
ratings = []
review_titles = []
positive_parts = []
negative_parts = []
room_types = []

# Open browser and go to reviews
link = "https://www.booking.com/hotel/us/bklyn-house-new-york-brooklyn.html#tab-reviews"
browser = webdriver.Chrome()
browser.get(link)
print("Opened browser, waiting for page to load...")
time.sleep(5)

# Make sure we're on the reviews section - click it if needed
try:
    reviews_tab_xp = "//a[contains(.,'Guest reviews')] | //button[contains(.,'Guest reviews')]"
    reviews_tab = browser.find_element(By.XPATH, reviews_tab_xp)
    browser.execute_script("arguments[0].scrollIntoView({block:'center'});", reviews_tab)
    time.sleep(1)
    reviews_tab.click()
    print("Clicked Guest reviews tab")
    time.sleep(3)
except Exception:
    print("Guest reviews already open or couldn't find tab")

# Loop through 10 pages
for page in range(10):
    print(f"\n{'='*50}")
    print(f"SCRAPING PAGE {page + 1}")
    print(f"{'='*50}\n")
    
    # Wait a bit for any dynamic content to load
    time.sleep(2)
    
    # Get page source and parse
    page_source = browser.page_source
    soup = BeautifulSoup(page_source, "html.parser")
    
    # Find review cards
    cards = soup.find_all("div", {"data-testid": "review-card"})
    print(f"Found {len(cards)} review cards on this page")
    
    # If no cards found, try to debug
    if len(cards) == 0:
        print("WARNING: No review cards found!")
        print("Looking for any review-related elements...")
        review_divs = soup.find_all("div", {"data-testid": lambda x: x and "review" in x.lower() if x else False})
        print(f"Found {len(review_divs)} divs with 'review' in data-testid")
        if len(review_divs) > 0:
            print("First few:")
            for div in review_divs[:3]:
                print(f"  - {div.get('data-testid')}")
        break
    
    # Extract data from each card
    for i, card in enumerate(cards):
        print(f"Processing review {i+1}...")
        
        # Get author name
        author_elem = card.find("div", class_="b08850ce41 f546354b44")
        author = author_elem.get_text(strip=True) if author_elem else "No author"
        authors.append(author)
        
        # Get rating
        rating_elem = card.find("div", class_="bc946a29db")
        if rating_elem:
            rating_text = rating_elem.get_text(strip=True)
            try:
                numbers = re.findall(r'\d+\.?\d*', rating_text)
                rating = float(numbers[0]) if numbers else None
            except ValueError:
                rating = None
        else:
            rating = None
        ratings.append(rating)
        
        # Get review title
        title_elem = card.find("h4", {"data-testid": "review-title"})
        title = title_elem.get_text(strip=True) if title_elem else "No title"
        review_titles.append(title)
        
        # Get room type
        room_elem = card.find(string=re.compile(r"Room"))
        room_type = room_elem.strip() if room_elem else "No room type"
        room_types.append(room_type)
        
        # Get positive review
        positive_elem = card.find("div", {"data-testid": "review-positive-text"})
        positive_text = positive_elem.get_text(strip=True) if positive_elem else ""
        positive_parts.append(positive_text)
        
        # Get negative review
        negative_elem = card.find("div", {"data-testid": "review-negative-text"})
        negative_text = negative_elem.get_text(strip=True) if negative_elem else ""
        negative_parts.append(negative_text)
    
    print(f"Completed page {page + 1}: scraped {len(cards)} reviews")
    
    # Try to go to next page (if not on last page)
    if page < 9:
        try:
            # Look for "Next page" button
            next_button_xp = "//button[@aria-label='Next page']"
            next_button = browser.find_element(By.XPATH, next_button_xp)
            browser.execute_script("arguments[0].scrollIntoView({block:'center'});", next_button)
            time.sleep(1)
            next_button.click()
            print(f"Clicked 'Next page' button, waiting for page {page + 2} to load...")
            time.sleep(4)  # Increased wait time
        except Exception as e:
            print(f"Could not find next page button - stopping at page {page + 1}")
            print(f"Error: {e}")
            break

# Close browser
browser.quit()
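
The rating extraction used in the loop above can be pulled out into a small helper and checked on its own. This is only a sketch: `parse_rating` is a hypothetical name, and the sample strings are made up rather than copied from the site.

```python
import re


def parse_rating(text):
    """Return the first number found in `text` as a float, or None if there is none."""
    numbers = re.findall(r"\d+\.?\d*", text)
    return float(numbers[0]) if numbers else None


print(parse_rating("Scored 8.0"))
print(parse_rating("10"))
print(parse_rating("No score"))
```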


In [None]:
# let's take a look at some of the reviews
print(f"The first author is: {authors[0]}")
print(f"The first rating is: {ratings[0]}")
print(f"The first title is: {review_titles[0]}")
print(f"The first room type is: {room_types[0]}")
print(f"The first positive review is: {positive_parts[0]}")
print(f"The first negative review is: {negative_parts[0]}")


---
## 8. Dataframes

A nicer way to handle data like the reviews we are scraping is to put them into a dataframe. 
You'll learn the basics of data frames in the "before class" module of the next lecture!

In the last step, we will store the results into a dataframe.

In [None]:
df = pd.DataFrame({
    'Author': authors,
    'Rating': ratings,
    'Title': review_titles,
    'Room_Type': room_types,
    'Positive': positive_parts,
    'Negative': negative_parts
})

print(f"\n\n{'='*50}")
print(f"SCRAPING COMPLETE")
print(f"{'='*50}")
print(f"Successfully scraped {len(df)} total reviews")
print("\nDataFrame preview:")
print(df.head(10))
print(f"\nFull dataset shape: {df.shape}")