# Scraping with Selenium

Many modern websites rely on JavaScript to load and navigate their content dynamically. However, the usual Python scraping tools (like `requests`) cannot execute JavaScript, so they struggle to retrieve the content of dynamic web pages.
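To make this limitation concrete, here is a minimal sketch that uses only the standard library. The HTML string is a hypothetical page (not taken from any real site) whose `<div>` is meant to be filled in by a script; a parser that does not run JavaScript only ever sees the empty element.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is only filled in once a browser runs the script.
STATIC_HTML = """
<html><body>
  <div id="content"></div>
  <script>document.getElementById("content").textContent = "Hello";</script>
</body></html>
"""


class ContentDivText(HTMLParser):
    """Collect the text found inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


parser = ContentDivText()
parser.feed(STATIC_HTML)
div_text = "".join(parser.chunks).strip()

# The script was never executed, so the <div> is still empty.
print(repr(div_text))
```

A real browser driven by Selenium would execute the script first, so the same element would contain text by the time you read it.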

Selenium is the go-to solution for this problem. It was originally created to automate website testing. It opens your browser _for real_ and lets you simulate human interactions with a website through Python commands.

For example, it is useful when information only becomes accessible after clicking a button (which is not possible with `requests` and `beautifulsoup`).

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium/bin

*NB: On Linux, put your `geckodriver` (the downloaded driver executable) in `/home/<YOUR_NAME>/.local/bin/`, or in the equivalent directory on your machine.*
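A quick way to check that the driver can be found is to look it up on your `PATH`. This is a small sketch using the standard library; `find_driver` is a hypothetical helper name, and it assumes the directory you chose is already on your `PATH`.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the absolute path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print(f"geckodriver found at {path}")
```

If the lookup fails, you can still pass the full path explicitly to `Service(...)`, as the cells in this notebook do.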

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import logging

# Set up service with log
service = Service('/snap/bin/geckodriver')
service.log_level = logging.DEBUG

# Set up options
options = Options()
options.add_argument('-headless')


### Search engine simulation

We will simulate a query on the official Python website by using the search bar.

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import logging

# Set up service with log
service = Service('/snap/bin/geckodriver')
service.log_level = logging.DEBUG

# Set up options
options = Options()
options.add_argument('-headless')

# Create an instance of the Firefox WebDriver.
driver = webdriver.Firefox(service=service, options=options)

# The driver.get method navigates to the page at the given URL. WebDriver waits until the page is fully
# loaded (i.e. the "onload" event has fired) before returning control to your script.
# Note that if the page makes many AJAX calls while loading, WebDriver may not know
# when it has completely loaded.
driver.get("https://www.python.org/")

# The following assertion checks that the title contains the word "Python".
assert "Python" in driver.title
print(driver.title)

# WebDriver offers a `find_element` method that locates an element based on its attributes.
# For example, the text input element can be located through its `name` attribute,
# which has the value `q` on this page.
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to typing on your keyboard.
# Special keys can be sent using the `Keys` class imported at the top of this cell
# (from selenium.webdriver.common.keys import Keys).
# To be safe, we first clear any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After submitting the query, you should get results if there are any. To check that some results
# were found, assert that the page source does not contain the text "No results found.".
assert "No results found." not in driver.page_source
# quit() closes every window and also stops the driver process.
driver.quit()

Welcome to Python.org
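
As the comments above mention, WebDriver cannot always tell when an AJAX-heavy page has finished loading. The usual answer is to poll for a condition until it becomes true or a timeout expires, which is roughly what Selenium's own `WebDriverWait` does. The sketch below only illustrates the idea in plain Python; `wait_until` is a hypothetical helper, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value, then return that value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {timeout} seconds")
        time.sleep(poll)


# Usage with a driver (the selector is a placeholder, adapt it to the page):
# results = wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, "<your selector>"))
```

In real scripts, prefer `WebDriverWait` together with `expected_conditions`, which also handles Selenium-specific exceptions while polling.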


### Getting the title of all the articles from the homepage of _The New York Times_

First let's open the homepage of the newspaper's website.


In [3]:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import logging

# Set up service with log
service = Service('/snap/bin/geckodriver')
service.log_level = logging.DEBUG

# Set up options
options = Options()
options.add_argument('-headless')

url = "https://www.nytimes.com/"

driver = webdriver.Firefox(service=service, options=options)
driver.get(url)

As you can see, you are facing the famous GDPR banner. Let's accept it in order to access the page!

In [4]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options

# Set up the geckodriver service and Firefox options for headless mode
service = Service('/snap/bin/geckodriver')
options = Options()
options.add_argument('-headless')

# Initialize the WebDriver
driver = webdriver.Firefox(service=service, options=options)

# Open the target webpage
driver.get('https://www.nytimes.com/')

# Wait up to 10 seconds for the cookie button to be clickable
try:
    wait = WebDriverWait(driver, 10)
    cookie_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@data-testid='GDPR-accept']")))

    # Click the button
    cookie_button.click()
    print("Cookie button clicked successfully.")
except Exception as e:
    print(f"An error occurred: {e}")

# Quit the driver
driver.quit()

Error terminating service process.
Traceback (most recent call last):
  File "/home/siegfried2021/Bureau/BeCode_AI/LGG-Thomas4-Mathieu/.venv/lib/python3.12/site-packages/selenium/webdriver/common/service.py", line 170, in _terminate_process
    self.process.terminate()
  File "/usr/lib/python3.12/subprocess.py", line 2211, in terminate
    self.send_signal(signal.SIGTERM)
  File "/usr/lib/python3.12/subprocess.py", line 2203, in send_signal
    os.kill(self.pid, sig)
PermissionError: [Errno 13] Permission denied


An error occurred: Message: 
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:511:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:136:16




Now let's get the titles of the articles by locating elements by class name, and let's store them in a list.


In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import logging

# Set up the geckodriver service and Firefox options for headless mode
service = Service('/snap/bin/geckodriver')
service.log_level = logging.DEBUG
options = Options()
options.add_argument('-headless')

# Initialize the WebDriver
driver = webdriver.Firefox(service=service, options=options)

# Open the target webpage
driver.get('https://www.nytimes.com/')
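# Caution: By.CLASS_NAME expects a single class name. Passing two space-separated
# classes matches nothing here, which is why the output of this cell is an empty list.
article_titles = driver.find_elements(By.CLASS_NAME, "summary-class css-8h5y1w")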
all_titles = []
for title in article_titles:
    all_titles.append(title.text)

all_titles

[]
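
If you really need to match on several classes at once, one option is to build a CSS selector and use `By.CSS_SELECTOR` instead of `By.CLASS_NAME`. The helper below is a hypothetical sketch; it assumes plain class names that need no CSS escaping.

```python
def classes_to_css(class_attr):
    """Turn a space-separated class attribute into a CSS selector matching all the classes."""
    return "".join(f".{name}" for name in class_attr.split())


selector = classes_to_css("summary-class css-8h5y1w")
print(selector)  # .summary-class.css-8h5y1w

# Usage with a driver:
# article_titles = driver.find_elements(By.CSS_SELECTOR, selector)
```

Keep in mind that generated class names such as `css-8h5y1w` tend to change when a site is redeployed, so the more stable class alone is often the safer choice.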

In [5]:
# Set up the geckodriver service and Firefox options for headless mode
geckodriver_path = '/snap/bin/geckodriver'  # Replace with your geckodriver path
service = Service(geckodriver_path)
service.log_level = logging.DEBUG
options = Options()
options.add_argument('-headless')  # Use headless mode, as in the previous cells

# Initialize the WebDriver
driver = None
try:
    driver = webdriver.Firefox(service=service, options=options)

    # Open the target webpage
    driver.get('https://www.nytimes.com/')

    # Find all elements with the class name 'summary-class'
    article_titles = driver.find_elements(By.CLASS_NAME, "summary-class")

    # Extract text from each element
    all_titles = []
    for title in article_titles:
        all_titles.append(title.text)
    print(all_titles)

    # Print or do further processing with the extracted titles
    print("Article Titles:")
    for title in all_titles:
        print(title)

except Exception as e:
    print(f"An error occurred: {e}")

['Federal Reserve officials had initially predicted three rate cuts for the year. Earlier, inflation data for May came in cooler than expected.', 'The vote was an indication that ordinary evangelicals are increasingly open to arguments that equate embryos with human life.', 'Democrats have leads in battleground states, but strategists aligned with both parties caution that the fight for Senate control is just starting.', 'He was a sharpshooting, high-scoring Hall of Fame guard for the Lakers and later an executive with the team. His silhouette became the N.B.A.’s logo.', 'Karine Jean-Pierre, the White House press secretary, said she had not spoken with President Biden about the matter yet.', 'The Biden administration is taking new measures to stop China from helping Russia sustain the war. U.S. officials hope European nations will follow.', 'President Emmanuel Macron called on people of good will to come together to defend the Republic in the snap election he decided to call.', 'Chance
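
Before processing the extracted texts further, it can help to tidy them up. The sketch below assumes that some matched elements may be empty or repeated on the page (an assumption, not something shown above); `clean_texts` is a hypothetical helper that works on any objects exposing a `.text` attribute, such as the elements returned by `find_elements`.

```python
def clean_texts(elements):
    """Return the stripped, non-empty, de-duplicated .text of each element, in page order."""
    seen = set()
    result = []
    for element in elements:
        text = element.text.strip()
        if text and text not in seen:
            seen.add(text)
            result.append(text)
    return result


# Usage with the list from the cell above:
# all_titles = clean_texts(article_titles)
```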

Here we are! Let's close the browser.

In [6]:
driver.close()

WebDriverException: Message: Failed to decode response from marionette


### Exercise

1. Use Selenium to open the homepage of your favourite newspaper (not The New York Times, too easy)
2. Close the cookie banner (if it appears)
3. Get the link of the first article of the page and open it
4. Print the title and the content of the article

**tip:** [Newspaper3k](https://pypi.org/project/newspaper3k/) is a powerful library for scraping articles from newspapers. Have a look at the `fulltext` method.