# Dynamic scraping

On some sites, the information we want is rendered client-side with JavaScript. This rendering mechanism makes static scraping of the rendered elements impossible, because the HTML file returned by a plain request is almost empty.

Demonstration:

In [6]:
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/js/"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid bs4's parser-guessing warning

soup.find_all("div", class_="quote")

[]


Unsurprisingly, nothing is found on the page! Yet the `class="quote"` elements are clearly there when you inspect the page in the browser.
That is because all the content is loaded with JavaScript.

## Selenium

We will use Firefox in this demonstration. You therefore need Firefox installed on your machine, which is probably already the case, as well as the WebDriver for Firefox, `geckodriver`, which may not be there by default.


Installation on macOS:
```bash
brew install geckodriver
```
Installation on Linux:
```bash
sudo apt install firefox
```
or
```bash
sudo apt install firefox-geckodriver
```

Or manually:

[Download geckodriver](https://github.com/mozilla/geckodriver/releases)

Extract the executable, then place `geckodriver` in `/usr/local/bin/`.
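
Before launching Selenium, it can help to confirm that both executables are discoverable. The following is a minimal sketch using only the standard library; the helper name `check_tools` is hypothetical, and it only inspects the `PATH`, so a driver stored elsewhere would be reported as missing:

```python
import shutil


def check_tools(names):
    """Map each executable name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


# Report where (or whether) Firefox and geckodriver were found.
for name, path in check_tools(["firefox", "geckodriver"]).items():
    print(f"{name}: {path or 'not found on PATH'}")
```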

In [1]:
import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Let's start by initializing our crawler.

In [7]:
MAX_WAIT = 10

options = Options()
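# options.add_argument("--headless")  # run Firefox in the background, without a window
# Proxy settings (needs: from selenium.webdriver.common.proxy import Proxy, ProxyType,
# and a proxy_address string that you define yourself)
# proxy = Proxy({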
#             'proxyType': ProxyType.MANUAL,
#             'httpProxy': proxy_address,
#             'sslProxy': proxy_address,
#             'noProxy': ''
#         })
# options.proxy = proxy
driver = webdriver.Firefox(options=options)

Great, we just launched an empty Firefox window!

Let's open a page!

In [8]:
url = "http://quotes.toscrape.com/js/"
driver.get(url)

# Implicit wait: applies globally to every element lookup.
# Note: Selenium's documentation advises against mixing implicit and explicit waits.
driver.implicitly_wait(3)

all_quotes = []

In [9]:
while True:
    # Explicit wait: blocks until at least one quote is present in the page (at most MAX_WAIT seconds)
    WebDriverWait(driver, MAX_WAIT).until(
        EC.presence_of_element_located((By.CLASS_NAME, "quote"))
    )

    quotes = driver.find_elements(By.CLASS_NAME, "quote")

    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, "text").text
        author = quote.find_element(By.CLASS_NAME, "author").text
        tags = [tag.text for tag in quote.find_elements(By.CLASS_NAME, "tag")]
        all_quotes.append({"text": text, "author": author, "tags": ", ".join(tags)})

    # Wait a little longer before moving on to the next page
    time.sleep(0.5)

    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, ".next > a")
        next_btn.click()
    except NoSuchElementException:
        break  # No more pages

driver.quit()
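
Conceptually, the explicit wait used in the loop above is a polling loop: `WebDriverWait` re-evaluates its condition at a fixed interval (0.5 seconds by default) and raises a `TimeoutException` once the deadline passes. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not Selenium's implementation, and `wait_until` and `element_appeared` are hypothetical names:

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical condition that only succeeds on the third check,
# like an element that appears once the page's JavaScript has run.
attempts = {"count": 0}


def element_appeared():
    attempts["count"] += 1
    return "quote element" if attempts["count"] >= 3 else None


found = wait_until(element_appeared, timeout=2.0, poll=0.01)
print(found, "after", attempts["count"], "checks")
```

Unlike a fixed `time.sleep`, this returns as soon as the condition holds, which is why explicit waits tend to be both faster and more reliable than hard-coded pauses.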

In [10]:
# Output CSV (pandas does not create the "data" directory itself)
import os

os.makedirs("data", exist_ok=True)
df = pd.DataFrame(all_quotes)
df.to_csv("data/quotes.csv", index=False)
print(df.head())

                                                text           author  \
0  “The world as we have created it is a process ...  Albert Einstein   
1  “It is our choices, Harry, that show what we t...     J.K. Rowling   
2  “There are only two ways to live your life. On...  Albert Einstein   
3  “The person, be it gentleman or lady, who has ...      Jane Austen   
4  “Imperfection is beauty, madness is genius and...   Marilyn Monroe   

                                           tags  
0        change, deep-thoughts, thinking, world  
1                            abilities, choices  
2  inspirational, life, live, miracle, miracles  
3              aliteracy, books, classic, humor  
4                    be-yourself, inspirational  


## Bonus

In this particular case, a look at the page source quickly shows that the data is hard-coded in a `<script>` tag in the HTML.
One interesting thing we can do with that is to execute JavaScript in the page via Selenium to retrieve this data:

In [None]:
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get("http://quotes.toscrape.com/js/")

WebDriverWait(driver, MAX_WAIT).until(
    EC.presence_of_element_located((By.CLASS_NAME, "quote"))
)

js_data = driver.execute_script("return data;")
for item in js_data:
    print(item["text"], "-", item["author"]["name"])

driver.quit()

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” - Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” - J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” - Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” - Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” - Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.” - Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.” - André Gide
“I have not failed. I've just found 10,000 ways that won't work.” - Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.” - Eleanor Roos

This is "almost" the equivalent of hitting an API directly.
In general, it is preferable to scrape APIs rather than web pages!
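
Because the data sits in an inline script, it can in principle also be read from the static HTML without a browser. The sketch below assumes the array assigned to `var data` is valid JSON, which is not guaranteed for arbitrary JavaScript literals. `SAMPLE_HTML` is an illustrative stand-in for the page source and `extract_embedded_data` is a hypothetical helper:

```python
import json
import re

# Illustrative stand-in for the page source; the real page embeds more fields.
SAMPLE_HTML = """
<html><body>
<script>
    var data = [
        {"tags": ["change", "world"], "author": {"name": "Albert Einstein"}, "text": "Quote one"},
        {"tags": ["choices"], "author": {"name": "J.K. Rowling"}, "text": "Quote two"}
    ];
    for (var i in data) { /* builds the div.quote elements */ }
</script>
</body></html>
"""


def extract_embedded_data(html):
    """Return the JSON array assigned to `var data` in an inline script."""
    match = re.search(r"var\s+data\s*=\s*", html)
    if match is None:
        raise ValueError("no 'var data = ...' assignment found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever JavaScript follows it.
    parsed, _end = json.JSONDecoder().raw_decode(html, match.end())
    return parsed


for item in extract_embedded_data(SAMPLE_HTML):
    print(item["text"], "-", item["author"]["name"])
```

With the real site, the same function could be fed `requests.get(url).text`, under the same valid-JSON assumption.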

## Bonus: cookie authentication


In [15]:
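def read_cookies(cookie_file: str) -> list[dict]:
    """Parse a Netscape-format cookie file (7 tab-separated fields per line) into Selenium cookie dicts."""
    cookies = []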
    with open(cookie_file, "r") as f:
        for line in f:
            line = line.strip()
            if line.startswith("#") or not line:
                continue

            try:
                domain, flag, path, secure, expiry, name, value = line.split("\t")

                cookie = {
                    "domain": domain,
                    "name": name,
                    "value": value,
                    "path": path,
                    "secure": secure.lower() == "true",
                }

                if expiry != "0":
                    cookie["expiry"] = int(expiry)

                cookies.append(cookie)

            except ValueError:
                print(f"Skipping malformed cookie line: {line}")
                continue

    print(f"Loaded {len(cookies)} cookies")
    return cookies

In [16]:
cookies = read_cookies("cookies-quotes-toscrape-com.txt")

Loaded 1 cookies
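
As an alternative to parsing the file by hand, the standard library's `http.cookiejar.MozillaCookieJar` reads the same Netscape format. This is a sketch of a different technique, not the notebook's method: `load_selenium_cookies` is a hypothetical helper, the sample file content is made up, and `MozillaCookieJar` rejects files that lack the header line shown below:

```python
import os
import tempfile
from http.cookiejar import MozillaCookieJar


def load_selenium_cookies(cookie_file):
    """Read a Netscape-format cookie file into Selenium-style dicts."""
    jar = MozillaCookieJar()
    # ignore_expires keeps entries whose expiry is 0 or in the past;
    # without it they would be skipped while loading.
    jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    cookies = []
    for c in jar:
        cookie = {
            "domain": c.domain,
            "name": c.name,
            "value": c.value,
            "path": c.path,
            "secure": c.secure,
        }
        if c.expires:  # omit the key when expiry is 0, as read_cookies does
            cookie["expiry"] = c.expires
        cookies.append(cookie)
    return cookies


# Made-up sample file; the first line is the header MozillaCookieJar requires.
sample = (
    "# Netscape HTTP Cookie File\n"
    "quotes.toscrape.com\tFALSE\t/\tFALSE\t0\tsession\tdummy-value\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
try:
    demo_cookies = load_selenium_cookies(f.name)
finally:
    os.remove(f.name)
print(demo_cookies)
```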


Add the cookies to our Firefox instance:

In [27]:
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

options = Options()
firefox_profile = FirefoxProfile()
options.profile = firefox_profile
driver = webdriver.Firefox(options=options)

driver.get("https://quotes.toscrape.com/js/")

for cookie in cookies:
    print(cookie)
    try:
        driver.add_cookie(cookie)
    except Exception as e:
        print(f"Failed to add cookie {cookie['name']}: {str(e)}")

driver.implicitly_wait(3)


driver.refresh()

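# Wait for the refreshed page to render (a bare WebDriverWait(...) without .until() does not wait)
WebDriverWait(driver, MAX_WAIT).until(
    EC.presence_of_element_located((By.CLASS_NAME, "quote"))
)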

driver.quit()

{'domain': 'quotes.toscrape.com', 'name': 'session', 'value': 'eyJjc3JmX3Rva2VuIjoidUJTcElWYXlaaVdNRGVPZkdFQWJQZGprUWx6SktnbXNVUlljdENGblRxWEhOb3Z3aEx4ciIsInVzZXJuYW1lIjoidXNlciJ9.aAAlOg.4pYuLKVmAY7I9O5bB4lcwHMCpgg', 'path': '/', 'secure': False}


Aha. But that doesn't work...

That is because the server uses a CSRF token generated at login time and verified server-side. This forces us to replicate the login flow:

In [34]:
# Setup
options = Options()
driver = webdriver.Firefox(options=options)

driver.get("https://quotes.toscrape.com/login")

# Wait for page to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, "csrf_token")))

# Extract CSRF token (for inspection only: the browser submits this form field itself)
csrf_token = driver.find_element(By.NAME, "csrf_token").get_attribute("value")

# Fill in the form
driver.find_element(By.ID, "username").send_keys("admin")
driver.find_element(By.ID, "password").send_keys("password")

# Submit the form
driver.find_element(By.CSS_SELECTOR, "input[value='Login']").click()

# Wait for login to process (e.g., look for logout or profile link)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.LINK_TEXT, "Logout"))
)

print("Login successful!")

# Now go to the quotes page
driver.get("https://quotes.toscrape.com/js")
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "quote"))
)

# You're in.
print("I'm in.")


driver.quit()

Login successful!
I'm in.
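
The login cell never has to send `csrf_token` explicitly because the browser submits every field of the form, including non-visible ones. The sketch below assumes the token is delivered as a hidden `<input>`, which the `By.NAME` lookup suggests but does not prove. `SAMPLE_FORM` is illustrative markup rather than a copy of the site's, and `HiddenFieldParser` is a hypothetical helper:

```python
from html.parser import HTMLParser

# Illustrative login form; the real page's markup may differ.
SAMPLE_FORM = """
<form action="/login" method="post">
    <input type="hidden" name="csrf_token" value="abc123"/>
    <input type="text" id="username" name="username"/>
    <input type="password" id="password" name="password"/>
    <input type="submit" value="Login"/>
</form>
"""


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden":
            self.hidden_fields[attributes.get("name")] = attributes.get("value")


parser = HiddenFieldParser()
parser.feed(SAMPLE_FORM)
print(parser.hidden_fields)
```

A script that logs in without a browser would need to fetch this value first and send it back with the credentials in the same session.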
