
In [4]:
!pip install bs4 requests

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
# Import the necessary libraries
from bs4 import BeautifulSoup

# Define the HTML string
html = """<html><head><title>My Simple HTML Page</title></head><body><div class="my-class"><span>This is some text in a span tag</span><p>This is some text in a paragraph tag</p><p>And yet again another paragraph</p></div></body></html>"""

In [None]:
# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(html, 'html.parser')

In [None]:
print(soup.prettify())

<html>
 <head>
  <title>
   My Simple HTML Page
  </title>
 </head>
 <body>
  <div class="my-class">
   <span>
    This is some text in a span tag
   </span>
   <p>
    This is some text in a paragraph tag
   </p>
   <p>
    And yet again another paragraph
   </p>
  </div>
 </body>
</html>


In [None]:
# Example 1: Find the first occurrence of a tag
first_span = soup.find('span')
print(first_span)

<span>This is some text in a span tag</span>


In [None]:
# Example 2: Find all occurrences of a tag
all_divs = soup.find_all('div')
print(all_divs)

[<div class="my-class"><span>This is some text in a span tag</span><p>This is some text in a paragraph tag</p><p>And yet again another paragraph</p></div>]
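
`find_all` also accepts filters. A minimal sketch, on a fresh copy of the HTML string used above (the `demo_` names are introduced here only to avoid touching the variables of the notebook), that filters by CSS class with the `class_` keyword and caps the number of results with `limit`:

```python
from bs4 import BeautifulSoup

demo_html = """<html><head><title>My Simple HTML Page</title></head><body><div class="my-class"><span>This is some text in a span tag</span><p>This is some text in a paragraph tag</p><p>And yet again another paragraph</p></div></body></html>"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Filter by CSS class (class_ has a trailing underscore because
# `class` is a reserved word in Python)
by_class = demo_soup.find_all("div", class_="my-class")
print(len(by_class))

# Stop after the first match; the result is still a list
first_p_only = demo_soup.find_all("p", limit=1)
print(first_p_only)
```

Even with `limit=1` the return value is a list, so it still has to be indexed to reach the tag itself.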


In [None]:
# Example 3: Navigate to the parent tag
parent_div = first_span.parent
print(parent_div)



<div class="my-class"><span>This is some text in a span tag</span><p>This is some text in a paragraph tag</p><p>And yet again another paragraph</p></div>


In [None]:
# Example 4: Navigate to the next sibling tag
next_sibling = first_span.next_sibling
print(next_sibling)

<p>This is some text in a paragraph tag</p>
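
Sibling navigation works this cleanly only because the HTML string has no whitespace between its tags. In indented HTML, `next_sibling` typically returns the whitespace text node instead of the next tag. A small sketch on a made-up, indented snippet that contrasts it with `find_next_sibling`, which only considers tags:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with newlines and indentation between the tags
spaced_html = """<div>
  <span>first</span>
  <p>second</p>
</div>"""
spaced_soup = BeautifulSoup(spaced_html, "html.parser")
spaced_span = spaced_soup.find("span")

# The node directly after </span> is a whitespace-only string
print(repr(spaced_span.next_sibling))

# find_next_sibling() skips text nodes and reaches the <p>
print(spaced_span.find_next_sibling())
```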


In [None]:
# Example 5: Navigate to the previous sibling tag
previous_sibling = next_sibling.previous_sibling
print(previous_sibling)



<span>This is some text in a span tag</span>


In [None]:
# Example 6: Get the tag name
tag_name = first_span.name
print(tag_name)



span


In [None]:
# Example 7: Get a tag attribute (class can hold several values, so a list is returned)
class_name = parent_div['class']
print(class_name)



['my-class']
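
Indexing with `tag['name']` raises a `KeyError` when the attribute is missing. A short sketch of the more forgiving `.get()` lookup and of the `.attrs` dictionary; the markup, including the `id` attribute, is made up for illustration:

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup('<div class="my-class other" id="main">x</div>', "html.parser")
attr_div = attr_soup.div

print(attr_div["class"])      # multi-valued attribute: a list of class names
print(attr_div["id"])         # single-valued attribute: a plain string
print(attr_div.get("title"))  # missing attribute: None instead of a KeyError
print(attr_div.attrs)         # all attributes as a dict
```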


In [None]:
# Example 8: Get the text content of a tag
span_text = first_span.text
print(span_text)



This is some text in a span tag


In [None]:
# Example 9: Get the combined text content of a tag and all its descendants
div_text = parent_div.get_text()
print(div_text)



This is some text in a span tagThis is some text in a paragraph tagAnd yet again another paragraph
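
The three texts run together because `get_text()` joins them without a separator by default. A minimal sketch of the `separator` and `strip` arguments, and of `stripped_strings`, on a copy of the same `<div>`:

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup(
    "<div><span>This is some text in a span tag</span>"
    "<p>This is some text in a paragraph tag</p>"
    "<p>And yet again another paragraph</p></div>",
    "html.parser",
)

# Join the individual text pieces with " | " and trim whitespace around each
joined = text_soup.div.get_text(separator=" | ", strip=True)
print(joined)

# stripped_strings yields each piece of text separately
pieces = list(text_soup.div.stripped_strings)
print(pieces)
```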


In [None]:
# Example 10: Modify the HTML
parent_div['class'] = 'new-class'
print(parent_div)

<div class="new-class"><span>This is some text in a span tag</span><p>This is some text in a paragraph tag</p><p>And yet again another paragraph</p></div>
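
Attributes are not the only thing that can be changed. A short sketch, on a reduced and made-up copy of the markup, that replaces a tag's text through `.string` and adds a new element with `new_tag` and `append`:

```python
from bs4 import BeautifulSoup

edit_soup = BeautifulSoup('<div class="my-class"><span>old text</span></div>', "html.parser")

# Replace the text inside the <span>
edit_soup.span.string = "new text"

# Create a <p> element and attach it as the last child of the <div>
new_p = edit_soup.new_tag("p")
new_p.string = "An added paragraph"
edit_soup.div.append(new_p)

print(edit_soup)
```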


In [None]:
# Example 11: Iterate over all <p> elements in <body>
for i, p in enumerate(soup.body.find_all('p')):
  print(f"{i+1}. tag: {p.name}, text:", p.get_text())

1. tag: p, text: This is some text in a paragraph tag
2. tag: p, text: And yet again another paragraph


# Wikipedia

Real-world example!

We will scrape the main page of Wikipedia. In doing so, we will:

1. Get the featured article content
2. Get all the `did you know` content along with some additional data
3. Do the same as in step 2, but for the `in the news` section
4. Repeat this process for the `on this day` section

In [2]:
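# Note: the alias `re` is also the name of Python's regex module.
# Do not `import re` in this notebook, or it will replace this alias.
import requests as re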

In [6]:
# Get the main page content
req = re.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(req.content, "html.parser")

In [7]:
# Get the featured article content
featured_article_text = soup.find(class_="MainPageBG mp-box").p.get_text()


In [9]:
featured_article_text[:50] + '[...]'

'Louis H. Bean (April\xa015, 1896\xa0– August\xa05, 1994) wa[...]'

In [10]:
# Get all "did you know" facts
did_you_know = soup.find(id="mp-dyk").find_all("li")

# For each fact, build a dict that stores the text and the first link
did_you_know_list = []
for li in did_you_know:
  text = li.get_text()
  link = li.find("a")["href"]
  did_you_know_list.append({"text": text, "link": link})

# For each link, fetch the article and store its intro paragraph
for dyk in did_you_know_list:
  # The link is site-relative and already starts with a slash: '/wiki/foo_bar'
  link = dyk['link']
  url = f"https://en.wikipedia.org{link}"
  # Name the parser explicitly, as in the cells above
  intro_text_soup = BeautifulSoup(re.get(url).content, "html.parser")
  # Index 1 assumes the second <p> on the page holds the intro text
  intro = intro_text_soup.find_all("p")[1].get_text()
  dyk['intro'] = intro

In [14]:
did_you_know_list[0]

{'text': '... that precursors to the killer toy include ventriloquist dummies such as Otto (pictured) in the 1929 film The Great Gabbo?',
 'link': '/wiki/Killer_toy',
 'intro': 'A killer toy is a stock character in horror fiction. They include toys, such as dolls and ventriloquist dummies, that come to life and seek to kill or otherwise carry out violence. The killer toy subverts the associations of childhood with innocence and lack of agency while invoking the uncanny nature of a lifelike toy. Killer toy fiction often invokes ideas of companionship and the corruption of children, sometimes taking place in dysfunctional or single parent homes. They have historically been associated with occultism and spirit possession, though artificial intelligence became more common in later works.\n'}
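
The links stored here are site-relative (`/wiki/Killer_toy`). Rather than gluing strings together, the standard library's `urljoin` can build the absolute URL and handles the leading slash. A minimal sketch; `BASE_URL`, `example_link` and `absolute_url` are names introduced here for illustration:

```python
from urllib.parse import urljoin

# Assumed base, matching the site scraped in this section
BASE_URL = "https://en.wikipedia.org/"

example_link = "/wiki/Killer_toy"
absolute_url = urljoin(BASE_URL, example_link)
print(absolute_url)
```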

In [15]:
# Let's do it again for the "in the news" section
in_the_news = soup.find(id="mp-itn").find_all("li")

in_the_news_list = []
for itn in in_the_news:
  text = itn.get_text()
  link = itn.find("a")["href"]

  # Notice that this part is different from above: the article is
  # fetched inside the same loop instead of in a second pass.
  # Which do you think makes the most sense performance-wise?
  url = f"https://en.wikipedia.org{link}"
  intro_text_soup = BeautifulSoup(re.get(url).content, "html.parser")
  intro = intro_text_soup.find_all("p")[1].get_text()
  in_the_news_list.append({
      "text": text,
      "link": link,
      "intro": intro
  })


In [17]:
in_the_news_list[0]

{'text': 'The European Space Agency launches the Jupiter Icy Moons Explorer (JUICE) to study Ganymede, Europa and  Callisto (trajectory pictured).',
 'link': '/wiki/European_Space_Agency',
 'intro': 'The European Space Agency[a] is an intergovernmental organisation of 22 member states[7] dedicated to the exploration of space. Established in 1975 and headquartered in Paris, ESA has a worldwide staff of about 2,200 in 2018[8] and an annual budget of about €4.9\xa0billion in 2023.[4]\n'}
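
Taking `find_all("p")[1]` assumes the intro is always the second paragraph, which depends on the layout of each article. A more defensive sketch, run here on a made-up HTML fragment rather than a live page, returns the first paragraph that actually contains text; `first_nonempty_paragraph` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

def first_nonempty_paragraph(page_soup):
    """Return the text of the first <p> that is not blank, or "" if there is none."""
    for p in page_soup.find_all("p"):
        text = p.get_text(strip=True)
        if text:
            return text
    return ""

# Made-up fragment: an empty paragraph followed by a real one
sample_page = BeautifulSoup(
    "<p>\n</p><p>The European Space Agency is an intergovernmental organisation.</p>",
    "html.parser",
)
print(first_nonempty_paragraph(sample_page))
```

The helper could replace the indexed lookup in the loops above, at the cost of sometimes picking up a short non-intro paragraph if one precedes the intro.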

In [29]:
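# Your turn! Do it for the "On this day" section!
# Take into account that the first item might not be in a <li> tag
on_this_day = soup.find(id="mp-otd").find_all("li")

on_this_day_list = []
for otd in on_this_day:
  text = otd.get_text()
  # Note: the first link in each item is the year page, e.g. '/wiki/1632'
  link = otd.find("a")["href"]

  # Fetch the linked article and take its intro paragraph, as above
  url = f"https://en.wikipedia.org{link}"
  intro_text_soup = BeautifulSoup(re.get(url).content, "html.parser")
  intro = intro_text_soup.find_all("p")[1].get_text()
  on_this_day_list.append({
      "text": text,
      "link": link,
      "intro": intro
  })


print(on_this_day)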

[<li><a href="/wiki/1632" title="1632">1632</a> – <a href="/wiki/Thirty_Years%27_War" title="Thirty Years' War">Thirty Years' War</a>: A Swedish–German army defeated the forces of the <a href="/wiki/Catholic_League_(German)" title="Catholic League (German)">Catholic League</a> at the <b><a href="/wiki/Battle_of_Rain" title="Battle of Rain">Battle of Rain</a></b>, mortally wounding their commander <a href="/wiki/Johann_Tserclaes,_Count_of_Tilly" title="Johann Tserclaes, Count of Tilly">Johann Tserclaes, Count of Tilly</a>.</li>, <li><a href="/wiki/1923" title="1923">1923</a> – Ten Japanese-American children were killed in <b><a href="/wiki/Nihon_Sh%C5%8Dgakk%C5%8D_fire" title="Nihon Shōgakkō fire">a racially motivated arson attack on a school</a></b> in <a href="/wiki/Sacramento,_California" title="Sacramento, California">Sacramento, California</a>.</li>, <li><a href="/wiki/1936" title="1936">1936</a> – <b><a href="/wiki/1936_Tulkarm_shooting" title="1936 Tulkarm shooting">Two Jews were
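
One way to cover a first item that is not inside a `<li>` is to pass a list of tag names to `find_all`, which matches any of them in document order. A sketch on a made-up fragment; the real layout of the section is an assumption here, and `Example_date` is a placeholder:

```python
from bs4 import BeautifulSoup

# Made-up fragment: a leading <p> followed by a list of events
otd_html = (
    '<div id="mp-otd">'
    '<p><a href="/wiki/Example_date">Example date</a></p>'
    '<ul><li><a href="/wiki/1632">1632</a> – first event</li>'
    '<li><a href="/wiki/1923">1923</a> – second event</li></ul>'
    '</div>'
)
otd_soup = BeautifulSoup(otd_html, "html.parser")

# A list of names matches either tag
otd_items = otd_soup.find(id="mp-otd").find_all(["p", "li"])

otd_entries = []
for item in otd_items:
    first_link = item.find("a")
    otd_entries.append({
        "text": item.get_text(),
        # Guard against items that contain no link at all
        "link": first_link["href"] if first_link is not None else None,
    })

print(otd_entries[0])
```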

# NY Times

Exercise:

1. Get all article titles from the main page `https://www.nytimes.com/international/`
2. For each, get:
   - title
   - summary of the article
   - reading time
   - link to it

What would be the follow-up to this?


In [None]:
req = re.get("https://www.nytimes.com/international/")
soup = BeautifulSoup(req.content, "html.parser")


In [None]:
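story_wrappers = soup.find_all("section", class_="story-wrapper")
stories = []
for story in story_wrappers:
  # find() returns None when nothing matches, so .get_text() raises
  # AttributeError; fall back to an empty string in that case
  try:
    title = story.find(class_="indicate-hover").get_text()
  except AttributeError:
    title = ""
  try:
    summary = story.find("p", class_="summary-class").get_text()
  except AttributeError:
    summary = ""
  try:
    reading_time = story.find("p", class_="css-1esztn").get_text()
  except AttributeError:
    reading_time = ""
  try:
    link = story.find("a")["href"]
  except (TypeError, KeyError):
    # TypeError: no <a> was found; KeyError: the <a> has no href
    link = ""
  stories.append((title, summary, reading_time, link))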
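
The four `try` blocks above repeat one pattern: look something up and fall back to an empty string. A sketch of a small helper that captures it, exercised on a made-up story fragment. The class names are copied from the cell above and may not match the live site, and `text_or_empty` is a hypothetical name:

```python
from bs4 import BeautifulSoup

def text_or_empty(parent, *args, **kwargs):
    """Return the text of the first match of parent.find(...), or "" if nothing matches."""
    found = parent.find(*args, **kwargs)
    return found.get_text() if found is not None else ""

# Made-up fragment with a title and a summary but no reading time
story_html = (
    '<section class="story-wrapper">'
    '<a href="https://example.com/a"><h3 class="indicate-hover">A title</h3></a>'
    '<p class="summary-class">A summary</p>'
    '</section>'
)
demo_story = BeautifulSoup(story_html, "html.parser").section

demo_title = text_or_empty(demo_story, class_="indicate-hover")
demo_summary = text_or_empty(demo_story, "p", class_="summary-class")
demo_reading_time = text_or_empty(demo_story, "p", class_="css-1esztn")  # absent here
print((demo_title, demo_summary, demo_reading_time))
```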


In [None]:
stories[:10]

[('F.B.I. Is Searching Home of Leader of Online Group Where Secrets Appeared',
  'Officials are looking for the person who shared classified documents about the Ukraine war.',
  '',
  'https://www.nytimes.com/live/2023/04/13/us/documents-leak-pentagon'),
 ('Leader of Chat Group That Leaked Documents Is Air National Guardsman',
  'The national guardsman oversaw a private online group named Thug Shaker Central, according to interviews and documents reviewed by The Times.',
  '3 min read',
  'https://www.nytimes.com/2023/04/13/world/documents-leak-leaker-identity.html'),
 ('The leaked documents show broad infighting among Russian officials.',
  '',
  '4 min read',
  'https://www.nytimes.com/2023/04/13/world/europe/russia-intelligence-leaks.html'),
 ('U.S. allies appear to be mostly shrugging off the latest examples of apparent spying.',
  '',
  '6 min read',
  'https://www.nytimes.com/2023/04/13/us/politics/us-spying-allies.html'),
 ('What do leaked U.S. intelligence reports say? Here is 