# Scraping with Selenium

Many modern websites rely on JavaScript to load and navigate their content dynamically. However, the usual Python scraping tools (like `requests`) cannot execute JavaScript, so they struggle to retrieve the content of dynamic web pages.

Selenium is a popular solution to this problem. It was originally created to automate tests on websites. It opens your browser _for real_ and lets you simulate human interactions with a website through Python commands.

For example, it is useful when information only becomes accessible after clicking a button (which is not possible with `requests` and `beautifulsoup`).
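To make the limitation concrete, here is a minimal sketch using only the standard library. The page in `STATIC_HTML` is a made-up example (an assumption, not a real site): its headline is injected by a script, so a parser that only reads the raw HTML, as a `requests`-based scraper would, finds no `<h2>` at all.

```python
from html.parser import HTMLParser

# Hypothetical page: the <h2> headline only exists once a browser runs the script.
STATIC_HTML = """
<html><body>
  <div id="articles"></div>
  <script>
    document.getElementById("articles").innerHTML = "<h2>Breaking news</h2>";
  </script>
</body></html>
"""


class HeadingCollector(HTMLParser):
    """Collect the text of every <h2> element found in the markup."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1] += data


def static_headings(html):
    """Return the <h2> texts visible without executing any JavaScript."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# The script body is treated as plain text, so no heading is found.
print(static_headings(STATIC_HTML))
```

A real browser driven by Selenium would execute the script first, so the headline would be part of the page it exposes.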

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium/bin

*NB: On Linux, put your `geckodriver` (the downloaded driver executable) in a directory that is on your `PATH`, for example `/home/<YOUR_NAME>/.local/bin/`.*
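A quick way to check that the driver ended up somewhere your system can find it is to look it up on the `PATH`. The sketch below uses only the standard library; the helper name `find_driver` is made up for this example.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


driver_path = find_driver()
if driver_path is None:
    print("geckodriver was not found on your PATH")
else:
    print(f"geckodriver found at {driver_path}")
```

The same check works for Chrome by calling `find_driver("chromedriver")`.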

In [41]:
# from selenium import webdriver
# from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.common.by import By
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options
# import logging

# # Set up service with log
# service = Service('/snap/bin/geckodriver')
# service.log_level = logging.DEBUG

# # Set up options
# options = Options()
# options.add_argument('-headless')

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [42]:

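# Selenium looks for chromedriver on your PATH. If it lives elsewhere, pass its location explicitly
# with webdriver.Chrome(service=Service(path)), where Service comes from selenium.webdriver.chrome.service.
path = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome()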
# The driver.get method will lead to a page given by the URL. WebDriver will wait until the page is fully
# loaded (i.e. the "onload" event has been triggered) before returning the control to your script.
# It should be noted that if your page uses a lot of AJAX calls when loading, WebDriver may not know
# when it was fully loaded.
driver.get("http://www.python.org")

# The following line is a statement confirming that the title contains the word "Python".
assert "Python" in driver.title

# WebDriver offers a method `find_element` that aims to search for item based on attributes
# For example, the input text element can be located by its name attribute by
# using the attribute `name` with the value `q`
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to entering keys using your keyboard.
# Special keys can be sent using the `Keys` class imported above (from selenium.webdriver.common.keys import Keys).
# To be safe, we first clear any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After submitting the page, you should get the result if there is one. To ensure that certain results
# are found, make an assertion that ensures that the source page does not contain the word "No results found".
assert "No results found." not in driver.page_source
driver.close()

### Search engine simulation

We will simulate a query on the official Python website by using the search bar.

In [43]:
# from selenium import webdriver
# from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.common.by import By
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options
# import logging

# # Set up service with log
# service = Service('/snap/bin/geckodriver')
# service.log_level = logging.DEBUG

# # Set up options
# options = Options()
# options.add_argument('-headless')

# # Here, we create instance of Firefox WebDriver.
# driver = webdriver.Firefox(service=service, options=options)

# # The driver.get method will lead to a page given by the URL. WebDriver will wait until the page is fully
# # loaded (i.e. the "onload" event has been triggered) before returning the control to your script.
# # It should be noted that if your page uses a lot of AJAX calls when loading, WebDriver may not know
# # when it was fully loaded.
# driver.get("https://www.python.org/")

# # The following line is a statement confirming that the title contains the word "Python".
# assert "Python" in driver.title
# print(driver.title)

# # WebDriver offers a method `find_element` that aims to search for item based on attributes
# # For example, the input text element can be located by its name attribute by
# # using the attribute `name` with the value `q`
# elem = driver.find_element(By.NAME, "q")

# # Then we send keys. This is similar to entering keys using your keyboard.
# # Special keys can be sent using the `Keys` class imported above (from selenium.webdriver.common.keys import Keys).
# # To be safe, we first clear any pre-filled text in the input field
# # (for example, "Search") so that it does not affect our search results:
# elem.clear()
# elem.send_keys("pycon")
# elem.send_keys(Keys.RETURN)

# # After submitting the page, you should get the result if there is one. To ensure that certain results
# # are found, make an assertion that ensures that the source page does not contain the word "No results found".
# assert "No results found." not in driver.page_source
# driver.close()



### Getting the titles of all the articles from the homepage of _The New York Times_

First let's open the homepage of the newspaper's website.


In [49]:
# from selenium import webdriver
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options
# import logging

# # Set up service with log
# service = Service('/snap/bin/geckodriver')
# service.log_level = logging.DEBUG

# # Set up options
# options = Options()
# options.add_argument('-headless')

# url = "https://www.nytimes.com/"

# driver = webdriver.Firefox(service=service, options=options)
# driver.get(url)

url = "https://www.nytimes.com/"

driver = webdriver.Chrome()
driver.get(url)

As you can see, you are facing the famous GDPR banner. Let's accept it in order to access the page!

In [47]:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options

# # Set up the geckodriver service and Firefox options for headless mode
# service = Service('/snap/bin/geckodriver')
# options = Options()
# options.add_argument('-headless')

# # Initialize the WebDriver
# driver = webdriver.Firefox(service=service, options=options)

# # Open the target webpage
# driver.get('https://www.nytimes.com/')

# # Wait up to 10 seconds for the cookie button to be clickable
# try:
#     wait = WebDriverWait(driver, 10)
#     cookie_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@data-testid='GDPR-accept']")))

#     # Click the button
#     cookie_button.click()
#     print("Cookie button clicked successfully.")
# except Exception as e:
#     print(f"An error occurred: {e}")

# # Quit the driver
# driver.quit()
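The cell above relies on an explicit wait: `WebDriverWait(driver, 10).until(...)` keeps checking a condition until it holds or 10 seconds have passed. The sketch below illustrates that polling idea in plain Python. It is not Selenium's implementation, and `wait_until` and `button_is_ready` are hypothetical names used only for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose cookie button only shows up on the third check.
attempts = {"count": 0}


def button_is_ready():
    attempts["count"] += 1
    return "cookie button" if attempts["count"] >= 3 else None


print(wait_until(button_is_ready, timeout=2.0, poll_interval=0.01))
```

Polling like this is usually preferable to a fixed `time.sleep(10)`: the script continues as soon as the element is ready and fails with a clear error when it never appears.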

Now let's get the titles of all the articles and store them in a list. The cells below locate them by class name (`By.CLASS_NAME`); an XPath expression would work as well.


In [52]:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options
# import logging

# # Set up the geckodriver service and Firefox options for headless mode
# service = Service('/snap/bin/geckodriver')
# service.log_level = logging.DEBUG
# options = Options()
# options.add_argument('-headless')

# # Initialize the WebDriver
# driver = webdriver.Firefox(service=service, options=options)

# # Open the target webpage
# driver.get('https://www.nytimes.com/')
# article_titles = driver.find_elements(By.CLASS_NAME, "summary-class css-8h5y1w")
# all_titles = []
# for title in article_titles:
#     all_titles.append(title.text)

# all_titles
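A note on the locator used above: `By.CLASS_NAME` expects a single class name, so a value with a space in it such as `"summary-class css-8h5y1w"` will most likely match nothing or raise an error. One option is to turn the class list into a CSS selector, as in this small sketch (the helper name `classes_to_css_selector` is made up for the example).

```python
def classes_to_css_selector(class_attribute):
    """Turn a space-separated class attribute into a CSS selector matching all of its classes."""
    class_names = class_attribute.split()
    if not class_names:
        raise ValueError("expected at least one class name")
    return "".join(f".{name}" for name in class_names)


selector = classes_to_css_selector("summary-class css-8h5y1w")
print(selector)  # .summary-class.css-8h5y1w

# With a live driver, the selector could then be used as:
# driver.find_elements(By.CSS_SELECTOR, selector)
```

Class names that look auto-generated, like `css-8h5y1w`, may change whenever the site is rebuilt, so relying on the stable-looking `summary-class` alone is probably the safer choice.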

In [59]:
# Set up the geckodriver service and Firefox options for headless mode
# geckodriver_path = '/snap/bin/geckodriver'  # Replace with your geckodriver path
# service = Service(geckodriver_path)
# service.log_level = logging.DEBUG
# options = Options()
# options.headless = True  # Use headless mode

# # Initialize the WebDriver
# driver = None
# try:
#     driver = webdriver.Firefox(service=service, options=options)

#     # Open the target webpage
#     driver.get('https://www.nytimes.com/')

#     # Find all elements with class name 'summary-class'
#     article_titles = driver.find_elements(By.CLASS_NAME, "summary-class")

#     # Extract text from each element
#     all_titles = []
#     for title in article_titles:
#         all_titles.append(title.text)
#     print(all_titles)

#     # Print or do further processing with the extracted titles
#     print("Article Titles:")
#     for title in all_titles:
#         print(title)

# except Exception as e:
#     print(f"An error occurred: {e}")

from selenium import webdriver
# from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.nytimes.com/"

driver = webdriver.Chrome()
driver.get(url)

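try:
    # Wait up to 10 seconds for the article summaries to be present in the DOM
    article_titles = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "summary-class"))
    )
    all_titles = [title.text for title in article_titles]
    print(all_titles)
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Close the browser whether or not the lookup succeeded
    driver.quit()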

['Just two months ago, President Biden appeared to have a daunting financial advantage. After Donald Trump was convicted of 34 felonies, Republicans rallied.', 'Pharmacy benefit managers are driving up drug costs for millions of Americans, employers and the government.', 'The Israeli leader’s quarrels with the White House, his military and his coalition partners have escalated at a pivotal time in the war in Gaza.', 'Robert Winnett will stay at The Daily Telegraph after reports raised questions about his ties to unethical news gathering practices.', 'Economists are debating what effect the singer’s sweep through Europe will have this summer as swarms of fans increase demand for hotels and services.', 'Judge Aileen Cannon has repeatedly proved willing to hear out even far-fetched arguments from former President Trump’s legal team.', 'As the Supreme Court enters the final weeks of its term, it is poised to issue a series of blockbuster decisions.', 'People all over the world are facing s

Here we are! The browser should always be closed once the scraping is done.
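One way to make sure the browser is closed even when the scraping code fails is to wrap the driver in a context manager. This is a minimal sketch: `browser_session` is a hypothetical helper, and `FakeDriver` is a stand-in so that the example runs without opening a browser. With Selenium you would pass a real driver such as `webdriver.Chrome()` instead.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(driver):
    """Yield the driver and always call its quit() method afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a real WebDriver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with browser_session(fake) as driver:
        # Simulate a scraping step that goes wrong
        raise RuntimeError("scraping failed")
except RuntimeError:
    pass

print(fake.closed)  # True: quit() ran despite the error
```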

### Exercise

1. Use Selenium to open the homepage of your favourite newspaper (not the New York Times, too easy)
2. Close the cookie banner (if it appears)
3. Get the link of the first article of the page and open it
4. Print the title and the content of the article

**tip:** [Newspaper3k](https://pypi.org/project/newspaper3k/) is a powerful library for scraping articles from newspapers. Have a look at the `fulltext` function.

In [62]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.lalibre.be/"

driver = webdriver.Chrome()
driver.get(url)

try:
    # Wait for the consent button to become clickable, then accept the cookies
    cookie_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//*[@id='didomi-notice-agree-button']"))
    )
    cookie_button.click()
    print("Cookie button clicked successfully.")
except Exception as e:
    print(f"An error occurred: {e}")

try:
    # Select, anywhere under an <h2>, each <span> that is the second <span> child of its parent
    article_titles = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, "//h2//span[2]"))
    )
    all_titles = [title.text for title in article_titles]
    print(all_titles)
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Close the browser even if the lookup failed
    driver.quit()

Cookie button clicked successfully.
['Jacques Attali: "Je n’ai aucune raison d’être moins en colère contre les acteurs de cette gigantesque farce. Macron a agi de la pire des façons"', 'Barré dans la course à la coprésidence de Défi, Julien Lemoine quitte le parti et rejoint les Engagés: "Défi, en Wallonie, est voué à s’éteindre"', 'Les provinces wallonnes coûtent plus cher que les provinces flamandes : “Parce que nous faisons plus de choses”', 'Des négos rapides, oui. Mais pas trop.', 'A 84 ans, la reine Elisabeth de Belgique apprend la posture "tête au sol"', 'Bientôt éjecté du “16”, son destin européen compromis : quelle sera la porte de sortie pour Alexander De Croo ?', '', 'Alin Stoica, ex-star à Anderlecht et… fan des Diables contre sa Roumanie, a grandi avec le T1 Iordanescu : "Vertonghen est traité d’arrogant"', 'L’incroyable histoire de Vernon de Marco, le serveur argentin devenu international slovaque par hasard : "J’ai dû regarder où c’était sur la carte"', "Les Diables en t
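The list printed above contains at least one empty string (`''`), which happens when a matched element has no visible text. A small clean-up step can drop such entries. The sketch below is one possible approach: `clean_titles` is a hypothetical helper, `FakeElement` stands in for Selenium's elements (only their `.text` attribute is used), and the second sample title is a placeholder.

```python
def clean_titles(elements):
    """Return the stripped, non-empty text of each element, keeping the original order."""
    titles = []
    for element in elements:
        text = element.text.strip()
        if text:
            titles.append(text)
    return titles


class FakeElement:
    """Stand-in for a Selenium element: only the .text attribute is needed here."""

    def __init__(self, text):
        self.text = text


sample = [
    FakeElement("Des négos rapides, oui. Mais pas trop."),
    FakeElement(""),
    FakeElement("  Placeholder title \n"),
]
print(clean_titles(sample))
```

In the cell above, the same helper could be applied directly to `article_titles` in place of the list comprehension.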