We split this task into two parts: Fiia and Isabel would figure out how to fetch the website and look into its structure, and Manuel and Aaro would then do the actual parsing. The first few code blocks were therefore written by Fiia and Isabel, but in a different file, so they do not appear in the commit history of this notebook.

In [8]:
# Imports needed for the project
import requests
from bs4 import BeautifulSoup
import pandas as pd
import html
import json

In [None]:
# Website url
url = "https://www.helsinki.fi/en/news/news-and-press-releases"

response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the request failed

soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

The goal is to collect the information shown in the news overview: title, description, and link. After searching the soup output for the title of the first news item with `Ctrl + F`, we found that the relevant element is called `hy-general-list` and that the data are stored in its `data-items` attribute. The value of `data-items` appears to be a JSON array containing the news items as objects.
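To make the decoding step concrete, here is a minimal sketch that uses only the standard library. The attribute value, the title, and the URL are made up for illustration, and `decode_items` is a hypothetical helper name, not part of the notebook; the sketch assumes the attribute holds an HTML-escaped JSON array, as described above.

```python
import html
import json

# Hypothetical attribute value, shaped like the `data-items` attribute described above.
# The title, description, and URL are invented for illustration.
raw_attribute = (
    "[{&quot;title&quot;: &quot;Example news&quot;, "
    "&quot;description&quot;: &quot;A short summary.&quot;, "
    "&quot;url&quot;: &quot;https://example.org/news/1&quot;}]"
)


def decode_items(raw: str) -> list[dict]:
    """Unescape HTML entities, then parse the result as a JSON array."""
    decoded = html.unescape(raw)  # &quot; becomes "
    items = json.loads(decoded)   # JSON text becomes Python objects
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of news items")
    return items


sample_items = decode_items(raw_attribute)
print(sample_items[0]["title"])  # prints "Example news"
```

In the notebook itself, BeautifulSoup will typically have decoded the entities in attribute values already, so the `html.unescape` call there likely acts as a safeguard rather than a required step.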

In [None]:
from typing import TypedDict

block = soup.find("hy-general-list")
assert block is not None  # Narrows the type for IntelliSense; in practice we know that `block` exists
raw_data = block.get("data-items")
decoded = html.unescape(str(raw_data)) # convert HTML-encoded text into readable text
items = json.loads(decoded) # convert the JSON-formatted string into python objects (list of dictionaries)

# Type for intellisense; we really don't *need* to use a class for anything 
# (and we really aren't using it) but VS Code complains otherwise
class NewsItem(TypedDict):
    title: str
    description: str
    url: str

# Parse the news item objects inside the JSON array
news_items: list[NewsItem] = []
for item in items:
    title = item.get("title")
    description = item.get("description")
    link = item.get("url")  # named `link` so the page `url` defined earlier is not overwritten
    news_items.append({
        "title": title,
        "description": description,
        "url": link
    })

# Print formatted as a table
pd.DataFrame(news_items)

index,title,description,url
0,Private funding boosts transformative research...,A private foundation has made a significant do...,https://www.helsinki.fi/en/news/brain/private-...
1,“Anyone can have sisu. It is available to ever...,The University of Helsinki has joined the inte...,https://www.helsinki.fi/en/news/university/any...
2,Childhood exposure to air pollution linked to ...,A new study shows that childhood exposure to i...,https://www.helsinki.fi/en/news/fair-society/c...
3,University of Helsinki to recruit top internat...,The University is strengthening its research i...,https://www.helsinki.fi/en/news/university/uni...
4,University of Helsinki launches top research a...,The University of Helsinki has defined four to...,https://www.helsinki.fi/en/news/university/uni...
5,Number and timing of children linked to biolog...,A study based on Finnish twins shows that repr...,https://www.helsinki.fi/en/news/public-health/...
6,Artificial intelligence can identify bird soun...,In a study conducted at the University of Hels...,https://www.helsinki.fi/en/news/artificial-int...
7,Efficient method to capture carbon dioxide fro...,The method is based on a recyclable filtration...,https://www.helsinki.fi/en/news/innovations/ef...
8,Statistical methods uncover meaningful signals...,Professor Klaus Nordhausen develops modern mul...,https://www.helsinki.fi/en/news/mathematics-an...
9,A number theorist investigates connections bet...,Anne-Maria Ernvall-Hytönen is the Professor of...,https://www.helsinki.fi/en/news/mathematics-an...
