# Web scraping with Python
This notebook scrapes the websites [srf.ch](https://www.srf.ch/) and [blick.ch](https://www.blick.ch/). To retrieve the current articles, the RSS feeds of both websites are read, the article links are extracted from the feeds, and the content of each article is then downloaded. The Python libraries Feedparser, Requests, and BeautifulSoup are used for this purpose.

This notebook documents the code and the process used for web scraping and is meant to make the procedure easier to follow. In it, the contents of the websites are extracted, structured, stored in DataFrames, and saved as CSV files. Based on the notebook, the script **main.py** was subsequently created, which downloads the contents of the websites automatically.

In [1]:
import feedparser
import pandas as pd
import requests
from bs4 import BeautifulSoup
import datetime as dt

## Experiments with [SRF.ch](https://www.srf.ch/)
On the SRF website, the article text is contained in `<p>` elements. However, `<p>` elements are also used for general information such as newsletter teasers. It was observed that the article text consistently resides in `<p>` elements inside a `<section>` element with the class `article-content`. The search therefore first locates this element and only then collects its `<p>` elements, so that only the article text is extracted and all other information is excluded.
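
The selection strategy described above can be tried without a network connection. The following is a minimal sketch on a hypothetical HTML snippet (the markup and its text are invented for illustration and only mimic the structure observed on SRF); it shows how narrowing the search to the article section leaves out unrelated `<p>` elements.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet that mimics the structure described above
html = """
<html><body>
  <p>Subscribe to our newsletter and never miss a story.</p>
  <section class="article-content" itemprop="articleBody">
    <p>First paragraph of the article text.</p>
    <p>Second paragraph of the article text.</p>
  </section>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Restrict the search to the article section before collecting <p> elements
section = soup.find('section', class_='article-content', itemprop='articleBody')

# find() returns None if the section is missing, so fall back to an empty list
paragraphs = [p.get_text() for p in section.find_all('p')] if section else []

article_text = ' '.join(paragraphs)
print(article_text)
```

The newsletter paragraph sits outside the section and is therefore not part of `article_text`.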


In [2]:
rss_url = 'https://www.srf.ch/news/bnf/rss/1646'
# Read the feed XML data
news_feed = feedparser.parse(rss_url)
# Flatten the feed entries into a DataFrame
df_news_feed = pd.json_normalize(news_feed.entries)

In [20]:
df_news_feed.head(1)

Unnamed: 0,title,summary,links,link,id,guidislink,authors,author,published,published_parsed,title_detail.type,title_detail.language,title_detail.base,title_detail.value,summary_detail.type,summary_detail.language,summary_detail.base,summary_detail.value,author_detail.name
0,Einschätzung von US-Behörden: 2023 wärmstes Ja...,"<img alt=""2023 wärmstes Jahr seit Aufzeichnung...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://www.blick.ch/news/einschaetzung-von-us...,https://www.blick.ch/news/einschaetzung-von-us...,False,[{'name': 'SDA Import'}],SDA Import,"Thu, 16 Nov 2023 09:03:11 GMT","(2023, 11, 16, 9, 3, 11, 3, 320, 0)",text/plain,,https://www.blick.ch/news/rss.xml,Einschätzung von US-Behörden: 2023 wärmstes Ja...,text/html,,https://www.blick.ch/news/rss.xml,"<img alt=""2023 wärmstes Jahr seit Aufzeichnung...",SDA Import


In [3]:
data_list = []
for number in range(len(df_news_feed)):
    titel = df_news_feed['title'][number]
    link = df_news_feed['link'][number]
    published = df_news_feed['published'][number]
    published_parsed = df_news_feed['published_parsed'][number]
    
    # Only take articles published today (compare year, month, and day)
    if dt.date(*published_parsed[:3]) != dt.date.today():
        continue
    
    #Request the article url to get the web page content
    article = requests.get(link)

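    # Parse the page and locate the section that holds the article text
    articles = BeautifulSoup(article.content, 'html.parser')
    article_content = articles.find('section', class_='article-content', itemprop='articleBody')
    # Skip pages without an article section (find() returns None in that case)
    if article_content is None:
        continue
    p_blocks = article_content.find_all('p')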

    long_blocks = []
    for block in p_blocks:
        if len(block.text) > 10:
            long_blocks.append(block.text)
    long_blocks_str = ' '.join(long_blocks)
    long_blocks_str = long_blocks_str.replace('\n', ' ')

    data_list.append({
        'title': titel,
        'link': link,
        'published': published,
        'published_parsed': published_parsed,
        'article_content': long_blocks_str
    })
    
# Convert the list of dictionaries to a DataFrame
df_srf = pd.DataFrame(data_list)

    

In [4]:
df_srf.head(5)

Unnamed: 0,title,link,published,published_parsed,article_content
0,«Netanjahus Verhalten ist weder mutig noch auf...,https://www.srf.ch/news/international/gewaltes...,"Thu, 16 Nov 2023 16:57:32 +0100","(2023, 11, 16, 15, 57, 32, 3, 320, 0)","Der bekannte israelische Historiker, Tom Segev..."
1,Sturm «Frederico» zieht über die Schweiz,https://www.srf.ch/news/schweiz/nacht-auf-frei...,"Thu, 16 Nov 2023 16:15:33 +0100","(2023, 11, 16, 15, 15, 33, 3, 320, 0)",In der Nacht auf Freitag wird die Schweiz mit ...
2,Geheimes Treffen der Reichsbürger in Basel,https://www.srf.ch/news/schweiz/umstrittene-st...,"Thu, 16 Nov 2023 14:25:05 +0100","(2023, 11, 16, 13, 25, 5, 3, 320, 0)","Ein stürmischer Sonntag, Anfang November. In e..."
3,Rede in Lausanne: Macron verteidigt seine Posi...,https://www.srf.ch/news/schweiz/franzoesischer...,"Thu, 16 Nov 2023 15:49:13 +0100","(2023, 11, 16, 14, 49, 13, 3, 320, 0)","Rund 1400 Menschen, hauptsächlich Studierende,..."
4,Der Papst befürwortet ein Schweizer Kirchenger...,https://www.srf.ch/kultur/gesellschaft-religio...,"Thu, 16 Nov 2023 16:49:59 +0100","(2023, 11, 16, 15, 49, 59, 3, 320, 0)",Der Papst persönlich hat am Mittwoch die beide...


In [10]:
len(df_srf)

39
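
The "only articles from today" check used in the loop above can be isolated into a small helper. This is a sketch, not part of the original notebook: the function name `published_today` and the optional `today` parameter are hypothetical and exist only to make the check testable. It compares the full date, including the year, against the first three fields of `published_parsed`.

```python
import datetime as dt
import time


def published_today(published_parsed, today=None):
    """Return True if the parsed publication time falls on `today`."""
    # Default to the current local date when no reference date is given
    today = today or dt.date.today()
    # The first three fields of a struct_time are year, month, and day
    return dt.date(*published_parsed[:3]) == today


# A struct_time like the one feedparser stores in `published_parsed`
parsed = time.struct_time((2023, 11, 16, 15, 57, 32, 3, 320, 0))

same_day = published_today(parsed, today=dt.date(2023, 11, 16))
other_year = published_today(parsed, today=dt.date(2022, 11, 16))
print(same_day, other_year)
```

Note that feedparser normalizes `published_parsed` to UTC, while `dt.date.today()` uses the local time zone, so articles published close to midnight may be assigned to the neighbouring day.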

## Experiments with [Blick.ch](https://www.blick.ch/)
Similar to the SRF website, the texts on Blick.ch are also contained within `<p>` elements. However, the structure is not identical; there isn't a specific class that marks the `<p>` elements containing the article content. Nevertheless, it was observed that on Blick.ch, only the article texts are stored in `<p>` elements, without general information. As a result, the code had to be adjusted to extract all `<p>` elements and then concatenate them into a single string.

Furthermore, it was noticed that the "News" RSS feed from Blick.ch covers relatively few articles. In contrast, under "News" on SRF, articles for current events, Switzerland, international news, economy, and panorama are included. On Blick.ch, the "News" section only encompasses articles related to current events. Therefore, it was decided to use not only the RSS feed for "News" but also those for "Switzerland," "International," "Economy," and "Politics" from Blick.ch.
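
When several feeds are combined, the same article may appear in more than one of them. Whether this actually happens on Blick.ch was not checked in this notebook, so the following is only a sketch under that assumption. It uses hypothetical records and removes repeated entries by their `link` value.

```python
import pandas as pd

# Hypothetical records as they might be collected from two overlapping feeds
records = [
    {'title': 'Article A', 'link': 'https://example.org/a', 'feed': 'news'},
    {'title': 'Article B', 'link': 'https://example.org/b', 'feed': 'news'},
    {'title': 'Article A', 'link': 'https://example.org/a', 'feed': 'schweiz'},
]

df = pd.DataFrame(records)

# Keep the first occurrence of each link and renumber the rows
df_unique = df.drop_duplicates(subset='link', keep='first').reset_index(drop=True)
print(len(df), len(df_unique))
```

Deduplicating on the link rather than the title avoids dropping distinct articles that happen to share a headline.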

In [5]:
blick_rss_url = ['https://www.blick.ch/news/rss.xml', 'https://www.blick.ch/schweiz/rss.xml',
                 'https://www.blick.ch/ausland/rss.xml', 'https://www.blick.ch/wirtschaft/rss.xml',
                 'https://www.blick.ch/politik/rss.xml']

In [6]:
data_list = []

for rss_url in blick_rss_url:
    news_feed = feedparser.parse(rss_url)
    #Flatten data
    df_news_feed=pd.json_normalize(news_feed.entries)
    
    for number in range(len(df_news_feed)):
        titel = df_news_feed['title'][number]
        link = df_news_feed['link'][number]
        published = df_news_feed['published'][number]
        published_parsed = df_news_feed['published_parsed'][number]
        
        # Only take articles published today (compare year, month, and day)
        if dt.date(*published_parsed[:3]) != dt.date.today():
            continue
        
        #Request the article url to get the web page content
        article = requests.get(link)

        # Extract all paragraph elements inside the page body
        articles = BeautifulSoup(article.content, 'html.parser')
        article_content = articles.find_all('body')
        p_blocks = article_content[0].find_all('p')

        long_blocks = []
        for block in p_blocks:
            if len(block.text) > 10:
                long_blocks.append(block.text)
        long_blocks_str = ' '.join(long_blocks)
        long_blocks_str = long_blocks_str.replace('\n', ' ')

        data_list.append({
            'title': titel,
            'link': link,
            'published': published,
            'published_parsed': published_parsed,
            'article_content': long_blocks_str
        })
        
# Convert the list of dictionaries to a DataFrame
df_blick = pd.DataFrame(data_list)

    

In [8]:
df_blick.head(5)

Unnamed: 0,title,link,published,published_parsed,article_content
0,Stadt will weiterhin ein Züri Fäscht: «Wann es...,https://www.blick.ch/news/stadt-will-weiterhin...,"Thu, 16 Nov 2023 14:37:43 GMT","(2023, 11, 16, 14, 37, 43, 3, 320, 0)",
1,Trägerin des Grand Prix Literatur: Autorin Ann...,https://www.blick.ch/news/traegerin-des-grand-...,"Thu, 16 Nov 2023 13:54:05 GMT","(2023, 11, 16, 13, 54, 5, 3, 320, 0)",Beim Schreiben ist Anna Felder stets sehr bedä...
2,Einschätzung von US-Behörden: 2023 wärmstes Ja...,https://www.blick.ch/news/einschaetzung-von-us...,"Thu, 16 Nov 2023 09:03:11 GMT","(2023, 11, 16, 9, 3, 11, 3, 320, 0)",Der Oktober sei weltweit im Durchschnitt der w...
3,Verein hat Vertrag mit der Stadt gekündigt – j...,https://www.blick.ch/news/volksfest-hammer-zue...,"Thu, 16 Nov 2023 08:47:47 GMT","(2023, 11, 16, 8, 47, 47, 3, 320, 0)",Das Züri Fäscht 2023 ist das letzte in dieser ...
4,Sie sollen Ardit N. in Geuensee LU getötet hab...,https://www.blick.ch/news/im-monster-prozess-u...,"Thu, 16 Nov 2023 06:50:00 GMT","(2023, 11, 16, 6, 50, 0, 3, 320, 0)",«Ardit N. ist durch grobes eigenes Verschulden...


In [9]:
len(df_blick)

64

## Save DataFrames as CSV

In [17]:
# Get the current timestamp as year_month_day_hour (two-digit year)
timestamp = dt.datetime.now().strftime('%y_%m_%d_%H')

# Output directory (must already exist)
path = './articles/'

# Save each DataFrame as <timestamp>_<source>.csv
df_srf.to_csv(f'{path}{timestamp}_srf.csv', index=False)
df_blick.to_csv(f'{path}{timestamp}_blick.csv', index=False)
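
To check what ends up in such a file, a saved CSV can be read back with `pd.read_csv`. The following sketch uses a hypothetical one-row DataFrame and a temporary directory instead of `./articles/`; the row content is invented, and the plain tuple stands in for the parsed date only for illustration. It shows that `index=False` prevents an extra index column and that tuple-like values come back as strings.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row DataFrame with the same kind of columns as above
df = pd.DataFrame([{
    'title': 'Example title',
    'link': 'https://example.org/a',
    'published_parsed': (2023, 11, 16, 9, 3, 11, 3, 320, 0),
}])

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example.csv')
    df.to_csv(csv_path, index=False)
    df_loaded = pd.read_csv(csv_path)

# The tuple was written as text, so it is a plain string after loading
restored_value = df_loaded['published_parsed'][0]
print(list(df_loaded.columns))
print(type(restored_value).__name__)
```

If the parsed date is needed again after loading, it has to be converted back explicitly; the human-readable `published` column is usually the simpler basis for that.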
