# RSS Scraping

This notebook scrapes RSS news feeds related to the Ukraine-Russia war to build a dataset for our project.

## Setup

A list of RSS feeds related to the Ukraine-Russia war can be found at https://blog.feedspot.com/ukraine_war_rss_feeds/. We first scrape that page for the links to the RSS feeds, and then scrape each feed in turn.

In [1]:
# From https://www.geeksforgeeks.org/extract-all-the-urls-from-the-webpage-using-python/
import requests
from bs4 import BeautifulSoup
 
master = 'https://blog.feedspot.com/ukraine_war_rss_feeds/'
reqs = requests.get(master)
soup = BeautifulSoup(reqs.text, 'html.parser')

print("RSS Links: ")

rss_links = []
# Keep the href of every <a> tag whose first class is "ext" to get just the RSS feed links. Skip missing or empty hrefs
for elem in soup.find_all("a", class_="ext"):
    if elem.get('class')[0] == 'ext':
        href = elem.get('href')
        if href:
            rss_links.append(href)
            print(href)

RSS Links: 
https://blogs.prio.org/category/ukraine-war/feed/
https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/news-event/ukraine-russia/rss.xml
https://www.ejiltalk.org/category/ukraine/feed/
https://www.atlanticcouncil.org/issue/conflict/feed/
https://theconversation.com/global/topics/ukraine-invasion-2022-117045/articles.atom
https://ukukraine.blogspot.com/feeds/posts/default
https://www.uavarta.org/en/category/news-en/feed/
https://www.sandboxx.us/blog/category/mozart-group-ukraine/feed/
https://www.brookings.edu/author/steven-pifer/feed/
https://www.ft.com/war-in-ukraine?format=rss
https://www.airandspaceforces.com/category/russia-ukraine/feed/
https://www.foreignaffairs.com/feeds/tag/War%20in%20Ukraine/rss.xml
https://worldcrunch.com/feeds/focus.rss
https://euromaidanpress.com/category/russian-aggression/russian-ukrainian-war-news/feed/
https://www.politico.eu/tag/war-in-ukraine/feed/
https://www.theguardian.com/world/series/ukraine-live/rss
https://blog
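The class filter used in the cell above can be checked offline on a small HTML snippet. The following is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup; the `ExtLinkParser` class and the sample HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ExtLinkParser(HTMLParser):
    """Collect href values from <a> tags whose first class is 'ext' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        # Mirror the notebook's filter: first class must be "ext" and href must be non-empty
        if classes and classes[0] == "ext" and href:
            self.links.append(href)


# Made-up HTML standing in for the directory page
sample_html = """
<a class="ext" href="https://example.com/feed/">Feed</a>
<a class="ext" href="">Empty link</a>
<a class="nav" href="https://example.com/about">About</a>
<a class="ext">No href</a>
"""

parser = ExtLinkParser()
parser.feed(sample_html)
print(parser.links)
```

Only the first anchor passes the filter: the others have an empty href, a different class, or no href at all.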

In [2]:
# Append some more links found manually which also have good data on the war.
# Note: the Atlantic Council feed is already in the scraped list above, so it is fetched twice
rss_links.extend(['https://www.atlanticcouncil.org/issue/conflict/feed/', 'https://www.pravda.com.ua/eng/rss/view_news/'])
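One of the manually added links, the Atlantic Council feed, also appears in the scraped list printed earlier, so the combined list contains it twice. If that is not intended, the list can be deduplicated while preserving order. This is a minimal sketch; `unique_in_order` is a hypothetical helper and `example_links` is a shortened stand-in for `rss_links`.

```python
def unique_in_order(links):
    """Return links with duplicates removed, keeping first occurrences (hypothetical helper)."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(links))


# Shortened stand-in for rss_links after the manual append
example_links = [
    "https://www.atlanticcouncil.org/issue/conflict/feed/",
    "https://www.ft.com/war-in-ukraine?format=rss",
    "https://www.atlanticcouncil.org/issue/conflict/feed/",
    "https://www.pravda.com.ua/eng/rss/view_news/",
]

deduped = unique_in_order(example_links)
print(len(example_links), "->", len(deduped))
```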

## RSS Scraping

Parse each RSS feed with `feedparser` and download the full text of each linked article with `newspaper3k`, adapting https://github.com/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb

In [3]:
!pip install -q feedparser
!pip install -q newspaper3k

In [4]:
import feedparser
import pandas as pd
from newspaper import Article
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Farhan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
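# Define some functions to parse the RSS feeds (adapted from https://github.com/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb)

def get_RSS_fields(rss_link):
    """Return the field names of the first entry in the feed, or None if the feed has no entries."""
    d = feedparser.parse(rss_link)

    # A feed that could not be parsed or is empty has no entries
    if not d.entries:
        return None

    # The field list is taken from the first entry only
    return list(d.entries[0].keys())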

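def df_from_RSS(rss_link, fields):
    """Build a dataframe with one row per feed entry, plus the full article text for each entry."""
    d = feedparser.parse(rss_link)

    # Create a list of lists to hold the required RSS data from each entry
    data = []
    for entry in d.entries:
        row = []
        for field in fields:
            try:
                row.append(entry[field])
            except KeyError:
                # This entry lacks a field that the first entry had
                row.append("")
        data.append(row)

    # Convert the list of lists to a df
    df = pd.DataFrame(data, columns=fields)

    # Download and parse each linked article. Collect the text in a list (one item per row)
    # so that repeated links within a feed cannot cause a length mismatch
    article_texts = []
    for link in df["link"]:
        try:
            article = Article(link)
            article.download()
            article.parse()
            article.nlp()
            article_texts.append(article.text)
        except Exception:
            # Keep the row but leave the text empty if the article cannot be fetched or parsed
            article_texts.append("")

    df['text'] = article_texts

    return df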
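The two functions above take the field list from the first entry of a feed and pad any entry that lacks one of those fields with an empty string. The padding logic can be tried without network access; in this minimal sketch, plain dicts stand in for `feedparser` entries, and `rows_from_entries` and all sample values are hypothetical.

```python
def rows_from_entries(entries, fields):
    """Build one row per entry, using "" for any field an entry lacks (hypothetical helper)."""
    return [[entry.get(field, "") for field in fields] for entry in entries]


# Plain dicts standing in for feedparser entries; the values are made up
entries = [
    {"title": "First post", "link": "https://example.com/1", "author": "A. Writer"},
    {"title": "Second post", "link": "https://example.com/2"},  # no author
    {"title": "Third post", "link": "https://example.com/3", "summary": "Extra field"},
]

# As in get_RSS_fields, the field list comes from the first entry only
fields = list(entries[0])
rows = rows_from_entries(entries, fields)

for row in rows:
    print(row)
```

Note the side effect of this design: a field that appears only in later entries (here `summary`) never makes it into the table, because it is absent from the first entry.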

In [6]:
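# Work around SSL certificate verification errors by turning verification off for standard-library HTTPS connections
# (https://stackoverflow.com/questions/50236117/scraping-ssl-certificate-verify-failed-error-for-http-en-wikipedia-org).
# This trusts every server, so it is only appropriate for a one-off scrape like this one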
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
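To make the effect of this workaround concrete, the sketch below compares the standard verifying context with the unverified one that the cell above installs as the default for HTTPS connections made through the standard library. It only inspects the two context objects and makes no network calls; the variable names are illustrative.

```python
import ssl

# Default context: verifies the certificate chain and the hostname
verified = ssl.create_default_context()

# The kind of context the workaround installs as the default: no verification at all
unverified = ssl._create_unverified_context()

print("verified:  ", verified.verify_mode, verified.check_hostname)
print("unverified:", unverified.verify_mode, unverified.check_hostname)
```

With verification off, a connection succeeds even if the server's certificate is invalid or belongs to a different host, which is why the workaround is best limited to a throwaway scraping session.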

In [7]:
# Scrape from the RSS feeds

from copy import deepcopy

dfs = []

for link in rss_links:
    fields = get_RSS_fields(link)
    if fields is not None:
        df = df_from_RSS(link, fields)
        df['rss_source_link'] = link
        dfs.append(deepcopy(df))
        print("Appended data from", link)
    else:
        print("Errors appending data from", link)

Appended data from https://blogs.prio.org/category/ukraine-war/feed/
Appended data from https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/news-event/ukraine-russia/rss.xml
Appended data from https://www.ejiltalk.org/category/ukraine/feed/
Appended data from https://www.atlanticcouncil.org/issue/conflict/feed/
Appended data from https://theconversation.com/global/topics/ukraine-invasion-2022-117045/articles.atom
Appended data from https://ukukraine.blogspot.com/feeds/posts/default
Appended data from https://www.uavarta.org/en/category/news-en/feed/
Appended data from https://www.sandboxx.us/blog/category/mozart-group-ukraine/feed/
Appended data from https://www.brookings.edu/author/steven-pifer/feed/
Appended data from https://www.ft.com/war-in-ukraine?format=rss
Appended data from https://www.airandspaceforces.com/category/russia-ukraine/feed/
Appended data from https://www.foreignaffairs.com/feeds/tag/War%20in%20Ukraine/rss.xml
Appended data from https://world

In [8]:
# Append results into a single df (https://stackoverflow.com/questions/28097222/pandas-merge-two-dataframes-with-different-columns)
# pd.concat accepts the whole list at once; columns missing from a feed are filled with NaN
master_df = pd.concat(dfs, axis=0, ignore_index=True)
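Because each feed exposes a different set of fields, concatenating the per-feed dataframes produces the union of all columns, with missing values wherever a feed lacks a field. A minimal sketch with two made-up frames (`feed_a` and `feed_b` are hypothetical stand-ins for entries of `dfs`):

```python
import pandas as pd

# Two made-up frames standing in for feeds that expose different fields
feed_a = pd.DataFrame({"title": ["A1", "A2"], "author": ["Ann", "Ann"]})
feed_b = pd.DataFrame({"title": ["B1"], "media_credit": ["Photo desk"]})

combined = pd.concat([feed_a, feed_b], axis=0, ignore_index=True)

print(combined.shape)
print(combined.columns.tolist())
print(combined.isna().sum().to_dict())
```

This is why the combined dataframe in this notebook ends up with many sparsely populated columns.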

In [9]:
print("The shape of the entire dataframe is", master_df.shape)

The shape of the entire dataframe is (476, 33)
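Since the Atlantic Council feed appears twice in `rss_links`, the row count above may include repeated articles. One way to check is to drop rows that share the same `link`. This is a sketch on made-up data, not a step the notebook performs; passing `subset="link"` also avoids comparing the list- and dict-valued columns, which `drop_duplicates` cannot hash.

```python
import pandas as pd

# Made-up frame with one repeated article; "authors" holds lists, as in the scraped data
frame = pd.DataFrame(
    {
        "title": ["Post one", "Post two", "Post one"],
        "link": ["https://example.com/1", "https://example.com/2", "https://example.com/1"],
        "authors": [[{"name": "A"}], [{"name": "B"}], [{"name": "A"}]],
    }
)

# Keep the first row for each distinct link and renumber the index
deduped = frame.drop_duplicates(subset="link", keep="first").reset_index(drop=True)
print(frame.shape, "->", deduped.shape)
```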


In [10]:
# View the df
master_df

Unnamed: 0,title,title_detail,links,link,comments,published,published_parsed,authors,author,author_detail,...,foaf_homepage,rights,rights_detail,href,gd_image,media_thumbnail,thr_total,media_content,media_credit,credit
0,"As NATO Gains New Strength, Moscow Resorts to ...","{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://blogs.prio.org/2023/03/as-nato-gains-n...,https://blogs.prio.org/2023/03/as-nato-gains-n...,"Tue, 28 Mar 2023 09:21:54 +0000","(2023, 3, 28, 9, 21, 54, 1, 87, 0)",[{'name': 'Pavel Baev'}],Pavel Baev,{'name': 'Pavel Baev'},...,,,,,,,,,,
1,Four Complications for the Rushed Putin-Xi Summit,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://blogs.prio.org/2023/03/four-complicati...,https://blogs.prio.org/2023/03/four-complicati...,"Tue, 21 Mar 2023 08:30:55 +0000","(2023, 3, 21, 8, 30, 55, 1, 80, 0)",[{'name': 'Pavel Baev'}],Pavel Baev,{'name': 'Pavel Baev'},...,,,,,,,,,,
2,Taiwan Is Feeling the Pressure from Russian an...,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://blogs.prio.org/2023/03/taiwan-is-feeli...,https://blogs.prio.org/2023/03/taiwan-is-feeli...,"Mon, 20 Mar 2023 12:39:59 +0000","(2023, 3, 20, 12, 39, 59, 0, 79, 0)",[{'name': 'Pavel Baev'}],Pavel Baev,{'name': 'Pavel Baev'},...,,,,,,,,,,
3,China Adjusts Limits on Partnership With Russia,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://blogs.prio.org/2023/03/china-adjusts-l...,https://blogs.prio.org/2023/03/china-adjusts-l...,"Tue, 14 Mar 2023 13:48:34 +0000","(2023, 3, 14, 13, 48, 34, 1, 73, 0)",[{'name': 'Pavel Baev'}],Pavel Baev,{'name': 'Pavel Baev'},...,,,,,,,,,,
4,Russia-Ukraine War Compels Japan to Reassess C...,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://blogs.prio.org/2023/03/russia-ukraine-...,https://blogs.prio.org/2023/03/russia-ukraine-...,"Mon, 06 Mar 2023 14:33:02 +0000","(2023, 3, 6, 14, 33, 2, 0, 65, 0)",[{'name': 'Pavel Baev'}],Pavel Baev,{'name': 'Pavel Baev'},...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
471,Mine danger on Black Sea coast increases due t...,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://www.pravda.com.ua/eng/news/2023/03/28/...,,"Tue, 28 Mar 2023 17:08:47 +0300","(2023, 3, 28, 14, 8, 47, 1, 87, 0)",[{'name': 'Ukrainska Pravda'}],Ukrainska Pravda,{'name': 'Ukrainska Pravda'},...,,,,,,,,,,
472,Electrical substation blown up in occupied Mel...,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://www.pravda.com.ua/eng/news/2023/03/28/...,,"Tue, 28 Mar 2023 17:08:14 +0300","(2023, 3, 28, 14, 8, 14, 1, 87, 0)",[{'name': 'Ukrainska Pravda'}],Ukrainska Pravda,{'name': 'Ukrainska Pravda'},...,,,,,,,,,,
473,Zelenskyy visits border with Russia and inspec...,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://www.pravda.com.ua/eng/news/2023/03/28/...,,"Tue, 28 Mar 2023 16:44:01 +0300","(2023, 3, 28, 13, 44, 1, 1, 87, 0)",[{'name': 'Ukrainska Pravda'}],Ukrainska Pravda,{'name': 'Ukrainska Pravda'},...,,,,,,,,,,
474,Border Guard Service uncovers traitor who work...,"{'type': 'text/plain', 'language': None, 'base...","[{'rel': 'alternate', 'type': 'text/html', 'hr...",https://www.pravda.com.ua/eng/news/2023/03/28/...,,"Tue, 28 Mar 2023 16:41:47 +0300","(2023, 3, 28, 13, 41, 47, 1, 87, 0)",[{'name': 'Ukrainska Pravda'}],Ukrainska Pravda,{'name': 'Ukrainska Pravda'},...,,,,,,,,,,


In [11]:
# Export results to csv
master_df.to_csv('russia_ukraine_rss_data.csv', index=False)
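One thing to keep in mind when reading this CSV back: columns that hold dicts or lists (such as `author_detail` or `links`) are written as their string representation, so they come back as plain strings. The sketch below shows the round trip on made-up data using an in-memory buffer, and one possible way to restore such a column with `ast.literal_eval`; the helper name `parse_literal` is hypothetical.

```python
import ast
import io

import pandas as pd

frame = pd.DataFrame(
    {
        "title": ["Post one", "Post two"],
        "author_detail": [{"name": "A. Writer"}, None],  # second row has no value
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded.loc[0, "author_detail"]))  # a str after the round trip


def parse_literal(value):
    """Parse a Python-literal string; pass through missing values (hypothetical helper)."""
    return ast.literal_eval(value) if isinstance(value, str) else value


loaded["author_detail"] = loaded["author_detail"].apply(parse_literal)
print(loaded.loc[0, "author_detail"]["name"])
```

Empty cells are read back as NaN rather than strings, which is why the helper only parses string values.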