## Class 01 - Web Scraping

In [2]:
import requests
#!pip install bs4
from bs4 import BeautifulSoup

### 1) Requesting data from URLs and filtering based on tags

* Pass a list to find_all to match several tags at once: ["tag1", "tag2", ..., "tagN"]

In [16]:
url = "https://www.thomas-renault.com/side/ex1.html"
x = requests.get(url)
my_html_file = BeautifulSoup(x.content, "html.parser")  # name the parser explicitly to avoid a warning

for element in my_html_file.find_all(["p", "h1"]):
    print(element.text)

I want to get this text
But not this one
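
The same filter can be tried offline. The sketch below runs `find_all` with a list of tags on a small inline HTML string. The markup is a made-up sample (an assumption, not the actual content of ex1.html), and the names `sample_html`, `sample_soup` and `matched` are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page (an assumption, not the real ex1.html)
sample_html = """
<html><body>
<h1>Main title</h1>
<p>First paragraph</p>
<div>Ignored block</div>
<p>Second paragraph</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A list of tag names matches any of them, in document order
matched = [element.text for element in sample_soup.find_all(["p", "h1"])]
print(matched)
```

The `<div>` is skipped because its tag name is not in the list, while the `<h1>` and both `<p>` tags are returned in the order they appear in the page.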


### 2) Extract url from an "a" tag

The objective here is to extract all the article URLs, but not the contact URL.

Filter a tag based on its attributes:
* soup.find_all("tag", {"class" : "value"})
* element["href"]

In [39]:
url = "https://www.thomas-renault.com/side/ex2.html"
x = requests.get(url)
my_html_file = BeautifulSoup(x.content, "html.parser")
print(my_html_file)

<html>
<head>
<meta charset="utf-8"/>
<title>Second example</title>
</head>
<body>
<main>
<p><a class="art" href="article-1-xyz.html">Article 1</a></p>
<p><a class="art" href="article-2-ert.html">Article 2</a></p>
<p><a class="art" href="article-3-sdf.html">Article 3</a></p>
<p><a class="art" href="article-4-vbn.html">Article 4</a></p>
</main>
<footer>
<p><a class="menu" href="contact.html">Contact us</a></p>
</footer>
</body>
</html>



#### Alternative 1: select based on class value

In [32]:
for element in my_html_file.find_all("a", {"class" : "art"}):
    print(element["href"])


article-1-xyz.html
article-2-ert.html
article-3-sdf.html
article-4-vbn.html


#### Alternative 2: Nested loop to select just the desired section

In [37]:
l = []
for element in my_html_file.find_all("main"):
    for link in element.find_all("a"):
        l.append(link["href"])

l

['article-1-xyz.html',
 'article-2-ert.html',
 'article-3-sdf.html',
 'article-4-vbn.html']

#### Alternative 3: based on conditions

In [42]:
a = []
# href=True skips <a> tags without an href, which would otherwise raise a KeyError
for element in my_html_file.find_all("a", href=True):
    if "article-" in element["href"]:
        a.append(element["href"])

a

['article-1-xyz.html',
 'article-2-ert.html',
 'article-3-sdf.html',
 'article-4-vbn.html']
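
A possible fourth alternative, not part of the original exercise, is a CSS selector. The sketch below uses BeautifulSoup's `select` method on a trimmed copy of the ex2.html markup printed earlier. The variable names are illustrative, and `.get("href")` is used so that a link without an `href` would yield `None` instead of raising an error.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the ex2.html markup (two of the four articles kept)
sample_html = """
<main>
<p><a class="art" href="article-1-xyz.html">Article 1</a></p>
<p><a class="art" href="article-2-ert.html">Article 2</a></p>
</main>
<footer>
<p><a class="menu" href="contact.html">Contact us</a></p>
</footer>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "main a.art" = every <a class="art"> nested anywhere inside <main>
css_links = [link.get("href") for link in sample_soup.select("main a.art")]
print(css_links)
```

The selector combines the ideas of Alternatives 1 and 2 in one expression: it restricts the search to the `main` section and to the `art` class at the same time.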

### Exercise on The Guardian

Extract the following from every article on the main page:
* Title
* URL
* Description
* Date of publication
* Text of the article


In [50]:
url = "https://www.theguardian.com/europe"
raw = requests.get(url)
front_page = BeautifulSoup(raw.content, "html.parser")

links = []
# article links are <a> tags with this class
for element in front_page.find_all("a", {"class" : "dcr-2yd10d"}):
    links.append(element["href"])
print(links)

['/lifeandstyle/ng-interactive/2025/nov/03/i-knew-i-needed-help-i-knew-it-was-over-anthony-hopkins-on-alcoholism-anger-academy-awards-and-50-years-of-sobriety', '/commentisfree/2025/nov/03/blood-spilled-sudan-el-fasher-space-rsf-uae-darfur', '/culture/2025/nov/03/big-trouble-in-little-berlin-the-tiny-hamlet-split-in-two-by-the-cold-war', '/technology/2025/nov/03/grokipedia-academics-assess-elon-musk-ai-powered-encyclopedia', '/music/2025/nov/03/smiths-drummer-mike-joyce-marr-morrissey-i-know-its-over', '/books/2025/nov/03/book-of-lives-margaret-atwood-autobiography-review', '/world/live/2025/nov/03/valencia-spain-europe-ukraine-serbia-czech-republic-latest-live-news-updates', '/uk-news/2025/nov/03/man-charged-after-mass-stabbing-on-cambridgeshire-train', '/world/2025/nov/02/louvre-jewel-heist-petty-criminals-paris-prosecutor', '/education/2025/nov/03/uk-university-halted-human-rights-research-after-pressure-from-china', '/world/2025/nov/03/pregnant-uk-teenager-bella-may-culley-accused-

In [51]:
title = []

for element in front_page.find_all("a", {"class" : "dcr-2yd10d"}):
    title.append(element["aria-label"])
print(title)

['Anthony Hopkins on alcoholism, anger and stardom', 'Nobody can feign ignorance about  Sudan', 'The hamlet split in two by the cold war', 'Musk’s AI-encyclopedia gets assessed ', 'Mike Joyce on wild gigs, Marr’s jim-jams and Morrissey', 'Margaret Atwood reveals her hidden side', '‘I can’t go on anymore’: Mazón resigns as Valencia leader and acknowledges mistakes during deadly 2024 floods', 'Man charged after mass stabbing on train in Cambridgeshire', 'Louvre jewel heist by petty criminals, not organised professionals, says Paris prosecutor', 'UK university halted human rights research after pressure from China', 'Pregnant UK teenager Bella May Culley freed from Georgian jail', 'Ukrainian computer game-style drone attack system goes ‘viral’', 'Israel receives remains of three more hostages from Gaza', 'Opponents and loyalists of Serbia’s autocratic president clash in Belgrade', 'Rare white Iberian lynx captured on film in Spain by amateur photographer', 'The Nord Stream riddle: echoes 
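
The two loops above can also be merged so that each title stays paired with its URL. The following is a minimal sketch under stated assumptions: `sample_html` is a hypothetical stand-in for the front page (the paths and labels in it are invented), and `urljoin` from the standard library is used to turn the relative `href` values into absolute URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.theguardian.com/europe"

# Hypothetical stand-in for the front page; paths and labels are invented
sample_html = """
<a class="dcr-2yd10d" href="/world/2025/nov/03/example-one" aria-label="Example one"></a>
<a class="dcr-2yd10d" href="/books/2025/nov/03/example-two"></a>
"""

sample_page = BeautifulSoup(sample_html, "html.parser")

articles = []
for element in sample_page.find_all("a", {"class": "dcr-2yd10d"}):
    articles.append({
        # .get returns None instead of raising KeyError when the attribute is missing
        "title": element.get("aria-label"),
        # urljoin resolves the relative href against the page URL
        "url": urljoin(base_url, element["href"]),
    })
print(articles)
```

Keeping title and URL in the same dictionary avoids the two lists drifting out of sync if one link lacks an `aria-label`.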

## Class 02 - Web Scraping with `Selenium`

### A more integrated approach

In [8]:
import requests
from bs4 import BeautifulSoup
url = 'https://www.theguardian.com'
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- Hello there, HTML enthusiast! -->
  <!-- DCR commit hash 10fa90639cbb8711dfcf8ec629d2967f97ca15c6 -->
  <title>
   Latest news, sport and opinion from the Guardian
  </title>
  <meta content="Latest news, sport, business, comment, analysis and reviews from the Guardian, the world's leading liberal voice" name="description"/>
  <meta charset="utf-8"/>
  <link href="https://www.theguardian.com" rel="canonical"/>
  <meta content="width=device-width,minimum-scale=1,initial-scale=1" name="viewport"/>
  <meta content="#052962" name="theme-color"/>
  <link href="https://assets.guim.co.uk/static/frontend/manifest.json" rel="manifest"/>
  <link href="https://assets.guim.co.uk/static/frontend/icons/homescreen/apple-touch-icon.svg" rel="apple-touch-icon" sizes="any"/>
  <link href="https://assets.guim.co.uk/static/frontend/icons/homescreen/apple-touch-icon-512.png" rel="apple-touch-icon" sizes="512x512"/>
  <link href="https://assets.guim.co.uk/stat

### The next step is to find the useful tags and classes

In [9]:
link_objects = soup.find_all("a", {"class" : "dcr-2yd10d"})
link_objects = [link["href"] for link in link_objects]  # hrefs start with "/", so the site root must be prepended
link_objects = [url + link for link in link_objects]
link_objects

['https://www.theguardian.com/tv-and-radio/2025/nov/17/russell-tovey-pride-sexual-power-politics-green-party-interview',
 'https://www.theguardian.com/commentisfree/2025/nov/17/sweden-vikings-chaos-sacrifice-ritual-norse-pagan',
 'https://www.theguardian.com/music/2025/nov/17/robin-blades-percussionist-broomsticks-olivier-britten-hitchcock',
 'https://www.theguardian.com/books/2025/nov/17/david-szalay-booker-prize-novel-crisis-masculinity-debate',
 'https://www.theguardian.com/food/2025/nov/17/prawn-tomato-stew-fregola-herby-pickled-vegetable-salad-recipes-sami-tamimi',
 'https://www.theguardian.com/news/audio/2025/nov/17/why-labour-is-going-danish-on-immigration-podcast',
 'https://www.theguardian.com/us-news/2025/nov/17/trump-tells-republicans-to-vote-to-release-epstein-files-saying-we-have-nothing-to-hide',
 'https://www.theguardian.com/world/2025/nov/17/at-least-98-palestinians-have-died-in-custody-since-october-2023-israeli-data-shows',
 'https://www.theguardian.com/world/2025/nov

### Loop to fetch the title, date of publication and content of `The Guardian` articles

In [14]:
import pandas as pd

results = []

for link in link_objects:

    req = requests.get(link)
    soup = BeautifulSoup(req.content, "html.parser")

    # retrieve title
    title = soup.title.string

    # retrieve publication date (None if the page has no such meta tag)
    date_tag = soup.find("meta", {"property" : "article:published_time"})
    date = date_tag["content"] if date_tag else None

    # retrieve content (None if the page has no main content block)
    article = soup.find("div", {"id" : "maincontent"})
    text = article.text if article else None

    results.append([title, date, text])

# keep the table in a variable so it can be reused later
df = pd.DataFrame(results, columns=["title", "date", "text"])

results

[['Russell Tovey on pride, sexual power and politics: ‘The Green party slogan – make hope normal again – is what we need’  | Russell Tovey | The Guardian',
  '2025-11-17T05:00:05.000Z',
  'Russell Tovey’s best characters often seem to have it all together, typically as a barrier to further interrogation. Take his recent projects: in surreal BBC sitcom Juice, Tovey plays Guy, a buttoned-up therapist with a seemingly perfect life, hobbled by an aversion to recklessness. Then there’s the closeted Andrew Waters in award-winning American indie film Plainclothes, a well-respected married man of faith who secretly cruises New York shopping mall toilets. Even in the forthcoming Doctor Who spin-off, The War Between the Land and the Sea, Tovey’s character, Barclay, is an ordinary office clerk who is swept up into a planet-saving mission while trying to keep his family from falling apart. In each performance, Tovey anchors his characters with a beguiling mix of strength, empathy and vulnerability
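
The exercise in Class 01 also asks for a description, which the loop above does not collect. One way to add it is sketched below. It assumes that article pages carry a `<meta name="description">` tag like the one visible in the home page source, which is not verified here. `parse_article`, `meta_content` and `sample_html` are hypothetical names, and the sample page is invented so the function can be run without a network call.

```python
from bs4 import BeautifulSoup

def parse_article(html):
    """Return title, description, date and text from one article page."""
    page = BeautifulSoup(html, "html.parser")

    def meta_content(attrs):
        # Return the content attribute of the first matching <meta>, or None
        tag = page.find("meta", attrs)
        return tag.get("content") if tag else None

    body = page.find("div", {"id": "maincontent"})
    return {
        "title": page.title.string if page.title else None,
        "description": meta_content({"name": "description"}),
        "date": meta_content({"property": "article:published_time"}),
        "text": body.get_text(" ", strip=True) if body else None,
    }

# Hypothetical miniature article page used only to exercise the function
sample_html = """
<html><head>
<title>Sample title</title>
<meta name="description" content="Sample description"/>
<meta property="article:published_time" content="2025-11-17T05:00:05.000Z"/>
</head><body>
<div id="maincontent"><p>First sentence.</p><p>Second sentence.</p></div>
</body></html>
"""

record = parse_article(sample_html)
print(record)
```

In the scraping loop, the function would be called as `parse_article(req.content)` for each link, and the resulting dictionaries can be passed directly to `pd.DataFrame`.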

In [1]:
!pip install selenium

In [3]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.theguardian.com")
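
Once the browser has loaded the page, its HTML can be handed to BeautifulSoup and parsed as in the previous sections. The sketch below is one possible way to connect the two, not a definitive recipe. `fetch_rendered_html` and `links_from_html` are hypothetical helper names, and the class `dcr-2yd10d` is the one used earlier in this notebook, which may change on the live site.

```python
from bs4 import BeautifulSoup

def links_from_html(html, base="https://www.theguardian.com"):
    """Collect absolute article URLs from a page's HTML."""
    page = BeautifulSoup(html, "html.parser")
    anchors = page.find_all("a", {"class": "dcr-2yd10d"}, href=True)
    return [base + a["href"] for a in anchors]

def fetch_rendered_html(url):
    """Open the page in Chrome and return the HTML currently loaded in the browser."""
    # Imported here so the parsing helper above works without Selenium installed
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails

# Usage (opens a real browser window, so it is left commented out):
# html = fetch_rendered_html("https://www.theguardian.com")
# print(links_from_html(html))
```

Separating the browser step from the parsing step means the parsing logic can be checked on a saved HTML string without launching Chrome each time.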
