---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Using Selenium to Parse Web Content

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---


## 1. Static content 🔒 vs. Dynamic content 🍃

<div style="text-align: center;">
<br>
<img src="https://raw.githubusercontent.com/apostolosfilippas/wa/e60f312c3435e31da72af81baed7194daf98cf11/assets/selenium-1.svg" width="800" height="480">
<br>
</div>

> 🔒 **Static content** is any file that is stored in a server and is the same every time it is delivered to users. Unless the developer makes changes themselves, the web page always remains the same. It is like a newspaper: once an issue of a newspaper is published, it features the same articles and photos all day for everyone who picks up a copy.

<div style="text-align: center;">
        <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-2.png?raw=true" title="source: imgur.com" width="350" height="350" />
</div>

> 🍃 **Dynamic content** is content that changes based on factors specific to the user such as time of visit, location, and device. A dynamic webpage will not look the same for everybody, and it can change as users interact with it – like if a newspaper could rewrite itself as someone is reading it. This makes webpages more personalized and more interactive.

<div style="text-align: center;">
        <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-3.png?raw=true" title="source: imgur.com" width="350" height="350" />
</div>

---
## 2. How do dynamic websites work?

Dynamic webpages rely on several mechanisms to generate their content. Here we cover three common ones:


### **🌐 Server-side scripting** 

When a user requests a webpage, the server processes the script, and interacts with databases or external services. Then, it sends the dynamically generated HTML back to the user's browser.

**Used for**: detecting that you are logging in from a certain geographic location and showing you information relevant to that location.

<div style="text-align: center;">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-4.png?raw=true" title="source: imgur.com" width="400" height="300" />
</div>

### **💻 Client-side scripting** 

This involves using **_JavaScript_** to manipulate the content and behavior of web pages directly within the user's browser, in response to actions such as clicking, scrolling, playing, and pausing. It allows for interactive features, such as real-time updates and dynamic animations.

**Used for**: Scrolling and clicking, form validation (when submitting a form), real-time chat and messaging, image carousels.

<div style="text-align: center;">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-5.png?raw=true" title="source: imgur.com" width="400" height="300" />
</div>

### **☎️ Application Programming Interfaces (APIs)** 

APIs enable different systems to communicate and share data. In dynamic websites, APIs can connect to external services or retrieve data from other sources, such as social media platforms, weather services, or payment gateways.

**Used for**: Allowing users to view and interact with external services without having to leave the website (e.g., live Twitter feeds, PayPal).

<div style="text-align: center;">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/selenium-6.png?raw=true" title="source: imgur.com" width="375" height="300" />
</div>

### Beautiful Soup alone is not enough to scrape web content
- **Beautiful Soup only helps us extract content from an HTML file.** So far, we used `requests` to get the HTML code of the page, and we assumed that all the information we need is already in the HTML we received from the server, so that we never need to interact with the browser.

- As we described above, in many cases the data we need are hidden in components which get rendered on clicking JavaScript links, browsing around the website, and so on. For example, for long reviews on many websites, you often need to click "read more" to view the full content. In such cases, **we need to use Selenium to interact with the browser and get the full content.**

---
## 3. Getting started with Selenium

As always, we start by importing all the useful packages. 

Note that to be able to follow along, you should have followed the steps in the "Before class" portion of Lecture 6.

We will be using Chrome throughout.

In [None]:
# !pip install selenium webdriver-manager beautifulsoup4 pandas

In [None]:
from selenium import webdriver # type: ignore
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time 
import pandas as pd
from bs4 import BeautifulSoup

Above:
- **time** will allow us to build "breaks" into our code to slow it down.
- **pandas** will allow us to store data that we parsed from the website (you will learn more about it in the "Before class" reading for next week).


Below:
- The following cells will open an instance of the Chrome browser. The window that opens will indicate that "Chrome is now being controlled by automated test software".

In [None]:
# create a browser object-- this should open a chrome browser using selenium
browser = webdriver.Chrome()

In [None]:
# go to a website
browser.get("http://www.newyorktimes.com/")
time.sleep(5)
browser.get("https://www.openai.com/")


In [None]:
# close the browser
browser.quit()

# take a look at other browser methods
# browser.

That's it! Now you're ready to use this powerful tool! 

---
## 4. Navigating around a website with Selenium 

In [None]:
#  initialize our browser
browser = webdriver.Chrome()

time.sleep(3)
browser.maximize_window()
time.sleep(3)

# let's go to the review page of Brooklyn Hotel on Booking.com
link = "https://www.booking.com/hotel/us/bklyn-house-new-york-brooklyn.html"
browser.get(link)

time.sleep(3)

## 4.1 XML
Just like HTML is used to find static content, we can use __**XML**__ (Extensible Markup Language) to find and interact with dynamic content.

- XML is a markup language designed to store, structure, and transport data or information.
- It focuses on representing the content of data rather than specifying how it should be displayed (unlike HTML)
- XML is widely used for data interchange and storage, providing a standardized way to format and organize data.
- XML documents use tags to enclose data elements, creating a hierarchical structure. 


Here's an XML example:

```xml
<bookstore>
  <book>
    <title lang="en">Harry Potter and the mid-life crisis</title>
    <description>Go on a journey with Harry Potter and his friends as they navigate the challenges of middle age by buying a Porsche.</description>
    <price>29.99</price>
  </book>
</bookstore>
```
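As a quick check that the example above is well-formed, the sketch below parses a shortened copy of it with Python's built-in `xml.etree.ElementTree` module. Using this module is an assumption made for illustration; the rest of the notebook does not rely on it. The last part shows that an XML parser refuses input with unclosed tags instead of trying to repair it.

```python
import xml.etree.ElementTree as ET

xml_doc = """
<bookstore>
  <book>
    <title lang="en">Harry Potter and the mid-life crisis</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

# parse the XML string; `root` is the <bookstore> element
root = ET.fromstring(xml_doc)

title = root.find("book/title")
print(title.text)                            # the text between the tags
print(title.get("lang"))                     # the value of the lang attribute
print(float(root.find("book/price").text))   # text is a string until we convert it

# XML is strict: an unclosed tag is an error, not something the parser fixes
try:
    ET.fromstring("<p><b>Hi")
    parsed_sloppy = True
except ET.ParseError as err:
    parsed_sloppy = False
    print("not well-formed XML:", err)
```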

In a nutshell:

| Feature          | **HTML (HyperText Markup Language)**                 | **XML (eXtensible Markup Language)**                     |
| ---------------- | ---------------------------------------------------- | -------------------------------------------------------- |
| **Goal**         | Display data — defines how data looks on a web page. | Store and transport data — defines what data *is*.       |
| **Design focus** | Presentation (how information is shown).             | Structure and meaning (what the information represents). |
| **Tags**         | HTML uses predefined tags like `<p>`, `<div>`, ... that have meaning built into browsers.   | XML has no predefined tags — you invent your own, like `<book>` or `<price>`. XML is extensible — that’s what the “X” stands for. |
| **Syntax**       | HTML is lenient: browsers can interpret sloppy or missing tags (`<p><b>Hi` will still render).   | XML is strict: every tag must close, nesting must be perfect, and case matters (`<Tag>` is not the same as `<tag>`).|

## 4.2 XPath

XPath lets you “walk” the XML tree, and select or extract specific nodes.
- XPath can be used to locate and interact with elements in both XML and HTML documents, making it particularly useful for tasks like web scraping and automated testing.
- You can find an element's XPath by right-clicking the web page, clicking "Inspect", and selecting the element you want to check. It is very similar to accessing content by id, class, and other attributes, but gives us even more freedom.

A sample XPath is shown below: 



<br>
<img src="https://raw.githubusercontent.com/apostolosfilippas/wa/refs/heads/main/assets/selenium-7.webp" width="800" height="280">
<br>

- `//`:  This selects all elements in the document that match the criteria that follow, regardless of their location within the document's hierarchy.
- `tagname`: This selects all elements with the specified tag name.
- `[@attribute='value']`: This selects all elements that have the specified attribute with a value equal to the specified value.

You do not have to remember everything, but here's some basic syntax:

| Example                       | Meaning                                                                  |
| ----------------------------- | ------------------------------------------------------------------------ |
| `/bookstore/book/title`       | Selects all `<title>` elements inside `<book>` inside `<bookstore>`.     |
| `//title`                     | Selects **all** `<title>` elements anywhere in the document.             |
| `//book[@category="fiction"]` | Selects `<book>` elements whose `category` attribute equals `"fiction"`. |
| `//book/title/text()`         | Returns the *text content* of all `<title>` elements.                    |
| `(//book)[1]`                 | Selects the *first* `<book>` element.                                    |

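One rough way to try these expressions without a browser is Python's built-in `xml.etree.ElementTree`, which understands a limited subset of XPath. The sketch below uses a made-up two-book bookstore (the titles and `category` values are assumptions for illustration). Because ElementTree searches relative to the root element and does not support `text()`, the paths start with `.` and the text is read with `.text` instead.

```python
import xml.etree.ElementTree as ET

# hypothetical sample data, not taken from a real site
root = ET.fromstring("""
<bookstore>
  <book category="fiction"><title>Harry Potter and the mid-life crisis</title></book>
  <book category="cooking"><title>Pasta for Wizards</title></book>
</bookstore>
""")

# /bookstore/book/title -> root is <bookstore>, so the path is written relative to it
titles = [t.text for t in root.findall("./book/title")]

# //title -> every <title> anywhere below the root
all_titles = [t.text for t in root.findall(".//title")]

# //book[@category="fiction"] -> filter on an attribute value
fiction = root.findall(".//book[@category='fiction']")

# (//book)[1] -> find() returns only the first match
first_book = root.find(".//book")

print(titles)
print(len(fiction), first_book.get("category"))
```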

## Manipulating content using XPath 

Let's try to go to the reviews tab.



<br>
<img src="https://raw.githubusercontent.com/apostolosfilippas/wa/refs/heads/main/assets/selenium-8.png" >
<br>

#### Now we can build the XPath to click the reviews tab.

Inspecting the elements on the home page, we can see that:

- The tagname is `a` -- the thing we are looking for is a link (anchor) element
- There is an `attribute` called `data-testid` which has value `Property-Header-Nav-Tab-Trigger-reviews`

We can construct the XPath as follows:
```
//a[@data-testid='Property-Header-Nav-Tab-Trigger-reviews']
```

### Here is another way to do it
Because website structures can change frequently, we can also use the text of the element to find it.

```
//a[contains(., 'Guest reviews')]
```



In [None]:
# use xpath to click the "Guest Reviews"
xp = ("//a[contains(.,'Guest reviews')]"
      " | //button[contains(.,'Guest reviews')]"
      " | //a[.//span[contains(.,'Guest reviews')]]"
      " | //button[.//span[contains(.,'Guest reviews')]]")

time.sleep(5)

el = browser.find_element(By.XPATH, xp)

# scroll so it's on-screen (helps avoid hidden/covered element issues)
browser.execute_script("arguments[0].scrollIntoView({block:'center'});", el)

time.sleep(3)

# Re-find the element after scrolling to avoid stale reference
el = browser.find_element(By.XPATH, xp)
el.click()
time.sleep(3)

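The union XPath in the cell above is written out by hand. As a small sketch, a helper can build the same kind of expression for any visible text. `text_xpath` is a hypothetical name (not a Selenium function), and it assumes the text contains no single quotes. Since `contains(., ...)` looks at all the text inside an element, including text in nested `<span>` tags, one branch per tag is likely enough.

```python
def text_xpath(text, tags=("a", "button")):
    """Build an XPath matching any of `tags` whose text contains `text`."""
    if "'" in text:
        raise ValueError("this simple helper does not handle single quotes")
    # one branch per tag, joined with the XPath union operator |
    return " | ".join(f"//{tag}[contains(., '{text}')]" for tag in tags)


guest_reviews_xp = text_xpath("Guest reviews")
print(guest_reviews_xp)
# the result can be passed to browser.find_element(By.XPATH, guest_reviews_xp)
```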
Now let's click on the "Show more" button to reveal more filters, and then click on the "Show less" button to hide them again.

In [None]:
time.sleep(5)

# Close any popup that might be blocking
close_xp = "//button[contains(@aria-label, 'Close')]"
try:
    el = browser.find_element(By.XPATH, close_xp)
    el.click()
    time.sleep(5)
except Exception:
    pass  # No popup to close

# Click "Show more" button
show_more_xp = "//button[contains(., 'Show more')]"
el = browser.find_element(By.XPATH, show_more_xp)

# Scroll to the element
browser.execute_script("arguments[0].scrollIntoView({block:'center'});", el)
time.sleep(5)
el = browser.find_element(By.XPATH, show_more_xp)
el.click()
time.sleep(5)

# Click "Show less" button  
show_less_xp = "//button[contains(., 'Show less')]"
el = browser.find_element(By.XPATH, show_less_xp)
el.click()
time.sleep(5)     

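The fixed `time.sleep(...)` pauses above wait the full time even when the page is ready sooner, and they can still be too short on a slow connection. Selenium provides explicit waits for this purpose (`WebDriverWait` in `selenium.webdriver.support.ui`). To show the idea without needing a browser, here is a minimal polling sketch; `wait_until` is a hypothetical helper written for this example, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Usage with the browser from this notebook (assumes a "Show more" button exists):
# buttons = wait_until(lambda: browser.find_elements(By.XPATH, show_more_xp))
# buttons[0].click()
```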

## Identifying elements through their full XPath


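**Alternatively, we can right-click the element we want > Copy > Copy XPath.**

This gives you a long, position-based XPath that spells out the path to the element step by step. (Chrome's "Copy full XPath" option gives the absolute XPath, starting from `/html`.)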

<a href="https://imgur.com/a/9nbYj0y"><img src="https://i.imgur.com/Nb2Jczu.png" title="source: imgur.com" width = "600" height="200" /></a>

For the "Show more" button, it looks like this:
```
//*[@id="b2hotelPage"]/div[26]/div/div/div/div/div[2]/div/div[4]/div/div[2]/div/button[2]/span
```


**Why don't we do this instead?**

- It makes your code longer
- Your code will be more likely to break if anything changes on the web page

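To make the second point concrete, the sketch below compares a position-based path with an attribute-based one on two versions of a toy page. The markup and the `data-testid` value are invented for illustration, and Python's built-in `xml.etree.ElementTree` stands in for the browser.

```python
import xml.etree.ElementTree as ET

# version 1 of a made-up page
page_v1 = ET.fromstring("""
<body>
  <div>header</div>
  <div><button data-testid="show-more">Show more</button></div>
</body>
""")

# version 2: the site adds a banner, shifting every later <div> by one
page_v2 = ET.fromstring("""
<body>
  <div>header</div>
  <div>cookie banner</div>
  <div><button data-testid="show-more">Show more</button></div>
</body>
""")

by_position = "./div[2]/button"                       # like a copied XPath
by_attribute = ".//button[@data-testid='show-more']"  # like the XPaths we wrote by hand

for name, page in [("v1", page_v1), ("v2", page_v2)]:
    print(name,
          "position:", page.find(by_position) is not None,
          "attribute:", page.find(by_attribute) is not None)
```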
## Dealing with multiple elements with an XPath

There are multiple buttons on the page. Let's find all of them and inspect what they contain.

Here is an XPath that matches buttons on the page: `//button[@type='button']`

In [None]:
# Grabbing all the buttons on a page
xpath_buttons = "//button[@type='button']"

time.sleep(2)
elements = browser.find_elements(By.XPATH, xpath_buttons)

print(type(elements))
print(len(elements))
elements[:10]

**How do we know which element is which?**
- We can inspect each element's `innerHTML`, one of its properties, or its visible text using `.get_attribute('innerHTML')`, `.get_property(...)`, or `.text`, respectively.

In [None]:
# grab all buttons in the page (or popup)
elements = browser.find_elements(By.XPATH, "//button")

print(len(elements))   # how many buttons total?
for ind, element in enumerate(elements[:15]):  # just show first 15
    print(f"{ind}. {element.text} | {element.get_attribute('innerHTML')}")

Is there **another** way of identifying the button?

In [None]:
print("\nChecking for reviews content...")

# First check main page content
found_in_main = False
try:
    # Look for review-related elements in main content
    main_review_selectors = [
        "//*[contains(text(), 'Select topics')]",
        "//*[contains(text(), 'Filter reviews')]", 
        "//*[contains(text(), 'All reviews')]",
        "//div[contains(@class, 'review')]",
        "//*[contains(text(), 'Guest review')]"
    ]
    
    for selector in main_review_selectors:
        try:
            element = browser.find_element(By.XPATH, selector)
            print(f"Found reviews in main content with selector: {selector}")
            found_in_main = True
            break
        except NoSuchElementException:
            continue
            
    if not found_in_main:
        print("Reviews not found in main content")
        
except Exception as e:
    print(f"Error checking main content: {e}")

# If not in main content, check iframes
if not found_in_main:
    print("\nChecking iframes for reviews...")
    
    # List iframes
    iframes = browser.find_elements(By.TAG_NAME, "iframe")
    print(f"iframes: {len(iframes)}")

    # Try each iframe until we see the reviews content
    found = False
    for i, f in enumerate(iframes):
        browser.switch_to.frame(f)
        try:
            # Look for various review-related text/elements
            review_indicators = [
                "//*[contains(text(), 'Select topics to read reviews')]",
                "//*[contains(text(), 'Filter reviews')]",
                "//*[contains(text(), 'All reviews')]", 
                "//div[contains(@class, 'review')]",
                "//*[contains(text(), 'Guest review')]"
            ]
            
            for indicator in review_indicators:
                try:
                    browser.find_element(By.XPATH, indicator)
                    print(f"found reviews frame: {i} with indicator: {indicator}")
                    found = True
                    break
                except NoSuchElementException:
                    continue
            
            if found:
                break
                
        except Exception as e:
            print(f"Error in iframe {i}: {e}")
        
        browser.switch_to.default_content()

    if not found:
        browser.switch_to.default_content()
        print("reviews frame not found")
        
        # Debug: Let's see what's actually in each iframe
        print("\nDEBUG - Content preview of each iframe:")
        for i, f in enumerate(iframes):
            browser.switch_to.frame(f)
            try:
                body_text = browser.find_element(By.TAG_NAME, "body").text[:100]
                print(f"iframe {i}: {body_text}")
            except Exception:
                print(f"iframe {i}: could not read content")
            browser.switch_to.default_content()
    else:
        print("Reviews content successfully found!")
        # You're now in the correct frame (if it was in an iframe)
        # Continue with your reviews scraping code here     

---
## (self-study) 
## 5. Using the CSS selector to find elements

We can use CSS selectors instead of XPath to find and interact with elements.

##### **What is a CSS Selector?**

A CSS selector is like a set of instructions that tells a web browser how to find and style elements on a web page. Remember that elements have different attributes; we can use CSS to look for elements with specific attribute values.


##### **Examples of CSS Selectors:**
1. Element selector

    `button` or `a` points to all elements with that tag name

2. Class selector --> `.`

    `.expand` points to all elements with `class="expand"`

3. ID selector --> `#`

    `#submit-button` points to the element with `id="submit-button"`

4. Attribute selector --> `[attribute="value"]`

    e.g., `[type="text"]` points to all elements with `type="text"`

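Since we already know XPath, it can help to see each selector type next to a roughly equivalent XPath. The helper below is a teaching sketch only: `css_to_xpath` is a hypothetical function that handles just the four simple forms listed above, and it is not part of Selenium.

```python
import re


def css_to_xpath(selector):
    """Translate one simple CSS selector into a roughly equivalent XPath."""
    selector = selector.strip()
    if selector.startswith("#"):   # ID selector, e.g. #submit-button
        return f"//*[@id='{selector[1:]}']"
    if selector.startswith("."):   # class selector, e.g. .expand
        # an element can carry several classes, so match one whole word in @class
        return ("//*[contains(concat(' ', normalize-space(@class), ' '), "
                f"' {selector[1:]} ')]")
    attr = re.fullmatch(r"\[\s*([\w-]+)\s*=\s*[\"']([^\"']*)[\"']\s*\]", selector)
    if attr:                       # attribute selector, e.g. [type="text"]
        return f"//*[@{attr.group(1)}='{attr.group(2)}']"
    if re.fullmatch(r"[A-Za-z][\w-]*", selector):   # element selector, e.g. button
        return f"//{selector}"
    raise ValueError(f"unsupported selector: {selector!r}")


for css in ["button", ".expand", "#submit-button", '[type="text"]']:
    print(f"{css:18} -> {css_to_xpath(css)}")
```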
In [None]:
example_button = "button, a.bui-button, a[data-testid*='review'], *[data-testid*='review']"

time.sleep(2)
cands = browser.find_elements(By.CSS_SELECTOR, example_button)

print("buttons found:", len(cands))
for i, el in enumerate(cands[:10]):  # show the first 10
    # Try getting text from the element or its children
    text = el.text.strip()
    if not text:  # If no direct text, try getting from child elements
        try:
            text = el.find_element(By.XPATH, ".//span | .//div").text.strip()
        except NoSuchElementException:
            text = "no text found"
    print(i, repr(text))

**How did this work?** 

The commas combine four selectors, and an element that matches any one of them is returned:

`button`: selects all `button` elements

`a.bui-button`: selects all `a` elements that have the class `bui-button`

`a[data-testid*='review']`: selects all `a` elements whose `data-testid` attribute contains the text `review`

`*[data-testid*='review']`: selects any element whose `data-testid` attribute contains `review`


There are multiple ways to locate elements using Selenium. For the full list, you can refer to: https://selenium-python.readthedocs.io/locating-elements.html

Because the webpage is dynamically generated, we often need to click the **read more** button to reveal content. 
- For example, some long reviews are only partially available on TripAdvisor-- you would need to click on "Read More" to access the full review. Other short reviews are fully available.

## Trying to click a button that doesn't exist

In [None]:
# we have already expanded this review. Since we interacted with a dynamic page, the button is no longer on it now.
# what will happen if we try to click the button that doesn't exist?

browser.find_element(By.CSS_SELECTOR,example_button).click()

In [None]:
# a helper function to check if the button exists on the page
def check_exists_by_css(selector):
    try:
        browser.find_element(By.CSS_SELECTOR,selector)
    except NoSuchElementException:
        return False
    return True

def check_exists_by_xpath(xp):
    try:
        browser.find_element(By.XPATH,xp)
    except NoSuchElementException:
        return False
    return True

In [None]:
# so if we try to find a nonsense element, it will just return False instead of breaking the code

check_exists_by_css('hahaha')

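Another option, sketched below, relies on the fact that `find_elements` (plural) returns an empty list when nothing matches, instead of raising `NoSuchElementException`. The name `element_exists` is made up for this example, and `browser` is passed in as an argument so that the function does not depend on a global variable.

```python
def element_exists(browser, by, value):
    """Return True if at least one element on the page matches the locator."""
    # find_elements does not raise when nothing matches; it returns []
    return len(browser.find_elements(by, value)) > 0


# Usage with the browser from this notebook:
# element_exists(browser, By.CSS_SELECTOR, "hahaha")
# element_exists(browser, By.XPATH, "//button[contains(., 'Show more')]")
```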
## Expanding hotel responses

In [None]:
# Sometimes hotel responses are collapsed - let's expand them
# Look for "Continue reading" buttons in hotel responses

continue_reading_xp = "//button[contains(., 'Continue reading')]"

# Find all "Continue reading" buttons
continue_buttons = browser.find_elements(By.XPATH, continue_reading_xp)

print(f"Found {len(continue_buttons)} 'Continue reading' buttons")

# Click each one to expand the hotel responses
for button in continue_buttons:
    try:
        # Scroll to button
        browser.execute_script("arguments[0].scrollIntoView({block:'center'});", button)
        time.sleep(2)
        # Click it
        button.click()
        time.sleep(2)
    except Exception:
        # If it fails (maybe button disappeared), just continue
        continue

print("Expanded all hotel responses")


In [None]:
# close the browser
browser.quit()

---
## 6. Combining Selenium and BeautifulSoup

Although we can use Selenium alone to scrape content, we prefer to use it together with BeautifulSoup to make our lives easier.

The complete workflow looks like this:
1. Use Selenium to automate the browser so that hidden content becomes available (by clicking buttons, scrolling the page, and so on).
2. Once all the content we want is revealed, use BeautifulSoup to parse it like we did in Lecture 5.

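As a minimal sketch of step 2: once Selenium has revealed the content, `browser.page_source` holds the HTML of the page as it currently looks, and we can hand that string to BeautifulSoup. The function name `parse_reviews` and the `data-testid` values below are placeholders, not confirmed Booking.com markup; inspect the page to find the real ones.

```python
from bs4 import BeautifulSoup


def parse_reviews(html):
    """Extract review titles from rendered HTML (selectors are hypothetical)."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for card in soup.find_all("div", attrs={"data-testid": "review-card"}):
        title = card.find("h4", attrs={"data-testid": "review-title"})
        # some cards may have no title, so store None rather than crash
        titles.append(title.get_text(strip=True) if title else None)
    return titles


# With a live browser we would call parse_reviews(browser.page_source);
# here a small hand-written snippet stands in for the rendered page.
sample_html = """
<div data-testid="review-card"><h4 data-testid="review-title">Great stay</h4></div>
<div data-testid="review-card"><p>No title here</p></div>
"""
print(parse_reviews(sample_html))
```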
---
## 7. Challenge

Scrape the first 10 pages of reviews for the hotel with the link: https://www.booking.com/hotel/us/bklyn-house-new-york-brooklyn.html#tab-reviews

For each review, extract the following information:
- Author name
- Room type
- Rating
- Review title
- Positive text
- Negative text

In [None]:
import re

# Initialize lists
authors = []
ratings = []
review_titles = []
positive_parts = []
negative_parts = []
room_types = []

# Open browser and go to reviews
link = "https://www.booking.com/hotel/us/bklyn-house-new-york-brooklyn.html#tab-reviews"
browser = webdriver.Chrome()
browser.get(link)
print("Opened browser, waiting for page to load...")
time.sleep(5)

##
## YOUR CODE HERE
## 




