<a href="https://colab.research.google.com/github/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping with RSS Feeds

Prepared by Group 14

This notebook shows how to scrape news articles from the RSS feeds of several news sources and how to apply text processing techniques to the scraped articles. It can be accessed at https://github.com/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb

## 1. Installation

In [1]:
# !git clone https://github.com/farhanwadia/MIE1624.git

In [2]:
# %cd MIE1624
# %cd 'Course Presentation'

In [3]:
# !ls

In [4]:
!pip install feedparser
!pip install newspaper3k

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting feedparser
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting sgmllib3k
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6066 sha256=de3a62d687d634653f6776905047bfd13ee24610529d07bc6d42859d0ca34ea1
  Stored in directory: /root/.cache/pip/wheels/83/63/2f/117884c3b19d46b64d3d61690333aa80c88dc14050e269c546
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.10 sgmllib3k-1.0.0
Looking in indexes: https://pypi.org/simple, https://us-python

## 2. Working with RSS Feeds

### New York Times

A list of all RSS feeds from the New York Times can be accessed at https://www.nytimes.com/rss.

Let's use the World feed from https://rss.nytimes.com/services/xml/rss/nyt/World.xml as an example:

#### Form the dataframe

In [5]:
import feedparser

d = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/World.xml')

In [6]:
# Get a list of all available fields from the first RSS entry
all_fields = list(d.entries[0].keys())

print(all_fields)

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'summary', 'summary_detail', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'media_content', 'media_credit', 'credit']


In [7]:
# Define the fields of interest that we want to obtain from the RSS
fields = ['title', 'published', 'summary', 'author', 'link']

In [8]:
import pandas as pd

# Create a list of lists to hold the required RSS data from each entry
data = []
for entry in d.entries:
    row = [entry[field] for field in fields]
    data.append(row)

# Convert the list of lists to a DataFrame
df = pd.DataFrame(data, columns=fields)
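Not every feed entry carries every field, and indexing an entry with a missing key raises a `KeyError`. The following is a minimal sketch of a more tolerant row builder, assuming the entries support `.get()` as dictionaries do. The name `rows_from_entries` and its `default` parameter are hypothetical, and plain dictionaries stand in for feed entries so the example runs offline.

```python
def rows_from_entries(entries, fields, default=""):
    """Build one row per entry, substituting `default` for any missing field."""
    return [[entry.get(field, default) for field in fields] for entry in entries]


# Plain dicts standing in for feed entries (illustrative data only)
sample_entries = [
    {"title": "First story", "author": "A. Writer", "link": "https://example.com/1"},
    {"title": "Second story", "link": "https://example.com/2"},  # no author field
]

sample_rows = rows_from_entries(sample_entries, ["title", "author", "link"])
print(sample_rows)
```

The resulting list of lists can be passed to `pd.DataFrame(..., columns=fields)` in the same way as `data` above.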

In [9]:
df.head()

| | title | published | summary | author | link |
|---|---|---|---|---|---|
| 0 | Belarus Sentences Nobel Peace Laureate to 10 Y... | Fri, 03 Mar 2023 20:47:53 +0000 | Ales Bialiatski was awarded the Nobel Peace Pr... | The New York Times | https://www.nytimes.com/live/2023/03/03/world/... |
| 1 | The E.U. Offered to Embrace Ukraine, but Now W... | Fri, 03 Mar 2023 10:03:08 +0000 | The European Union and NATO have promised a pa... | Steven Erlanger | https://www.nytimes.com/2023/03/03/world/europ... |
| 2 | Rules to Curb Illicit Dollar Flows Create Unin... | Fri, 03 Mar 2023 19:40:10 +0000 | The regulations were meant to prevent dollar t... | Alissa J. Rubin | https://www.nytimes.com/2023/03/03/world/middl... |
| 3 | They Sneaked Into a Derelict Arms Plant: Insta... | Fri, 03 Mar 2023 18:08:49 +0000 | Three people, including two Russians, arrested... | Andrew Higgins | https://www.nytimes.com/2023/03/03/world/europ... |
| 4 | Search of Train Crash Site in Greece Nears an End | Fri, 03 Mar 2023 17:46:57 +0000 | The authorities were planning to start clearin... | Niki Kitsantonis | https://www.nytimes.com/2023/03/03/world/europ... |


In [10]:
print("The shape of the dataframe is", df.shape)

The shape of the dataframe is (59, 5)
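The `published` column holds date strings such as `Fri, 03 Mar 2023 20:47:53 +0000`, which sort alphabetically rather than chronologically. As a sketch, assuming the strings follow the RFC 822 style shown in the table, the standard library can convert them to timezone-aware datetimes. The helper name `parse_published` is hypothetical.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_published(value):
    """Parse an RFC 822-style date string and convert it to UTC."""
    return parsedate_to_datetime(value).astimezone(timezone.utc)


parsed = parse_published("Fri, 03 Mar 2023 20:47:53 +0000")
print(parsed.isoformat())
```

Applied to the dataframe, this could look like `df['published'].map(parse_published)`. The feed also lists a `published_parsed` field, which may be a simpler starting point if it is present for every entry.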


#### Add full texts for the corresponding articles to the dataframe

In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [12]:
from newspaper import Article

links = df["link"]

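# Collect one text per row, in row order. A list (rather than a dict keyed
# by link) keeps the lengths aligned even if a link appears more than once.
article_texts = []
for link in links:
    article = Article(link)
    article.download()
    article.parse()
    article.nlp()  # computes keywords and a summary; the text itself comes from parse()
    article_texts.append(article.text)

df['text'] = article_texts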

In [13]:
df.head()

| | title | published | summary | author | link | text |
|---|---|---|---|---|---|---|
| 0 | Belarus Sentences Nobel Peace Laureate to 10 Y... | Fri, 03 Mar 2023 20:47:53 +0000 | Ales Bialiatski was awarded the Nobel Peace Pr... | The New York Times | https://www.nytimes.com/live/2023/03/03/world/... | Ales Bialiatski in the defendants’ cage in a M... |
| 1 | The E.U. Offered to Embrace Ukraine, but Now W... | Fri, 03 Mar 2023 10:03:08 +0000 | The European Union and NATO have promised a pa... | Steven Erlanger | https://www.nytimes.com/2023/03/03/world/europ... | BRUSSELS — When the European Union offered Ukr... |
| 2 | Rules to Curb Illicit Dollar Flows Create Unin... | Fri, 03 Mar 2023 19:40:10 +0000 | The regulations were meant to prevent dollar t... | Alissa J. Rubin | https://www.nytimes.com/2023/03/03/world/middl... | Separately, a sum in cash is sent to the Iraqi... |
| 3 | They Sneaked Into a Derelict Arms Plant: Insta... | Fri, 03 Mar 2023 18:08:49 +0000 | Three people, including two Russians, arrested... | Andrew Higgins | https://www.nytimes.com/2023/03/03/world/europ... | Spiro Lasi, a construction worker whose house ... |
| 4 | Search of Train Crash Site in Greece Nears an End | Fri, 03 Mar 2023 17:46:57 +0000 | The authorities were planning to start clearin... | Niki Kitsantonis | https://www.nytimes.com/2023/03/03/world/europ... | The crash left rail cars strewn about the trac... |
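Downloading dozens of articles in a loop means that a single unreachable or unparsable page can raise an exception and abort the whole run. The sketch below, under the assumption that skipping failed articles is acceptable, records `None` for any link that fails. The names `fetch_texts`, `newspaper_fetch`, and `fake_fetch` are hypothetical, and the stand-in fetcher lets the example run without network access.

```python
def fetch_texts(links, fetch):
    """Return one text per link, in order; links that fail become None."""
    texts = []
    for link in links:
        try:
            texts.append(fetch(link))
        except Exception as exc:  # broad on purpose: one bad page should not stop the loop
            print(f"Skipping {link}: {exc}")
            texts.append(None)
    return texts


def newspaper_fetch(link):
    """Download and parse one article with newspaper3k."""
    from newspaper import Article  # imported lazily so the sketch loads without the package

    article = Article(link)
    article.download()
    article.parse()
    return article.text


def fake_fetch(link):
    """Stand-in fetcher for an offline demonstration (hypothetical data)."""
    if "broken" in link:
        raise ValueError("download failed")
    return f"text of {link}"


demo = fetch_texts(["https://example.com/a", "https://example.com/broken"], fake_fetch)
print(demo)
```

With the real fetcher, the column could be filled with `df['text'] = fetch_texts(df['link'], newspaper_fetch)`.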


In [14]:
df.to_csv("new_york_times.csv", encoding='utf-8', index=False)

### Toronto Star

#### Function Development
Create helper functions to assist with the scraping process.

In [15]:
def print_RSS_fields(rss_link):
    """Print the fields available in the first entry of an RSS feed."""
    d = feedparser.parse(rss_link)
    all_fields = list(d.entries[0].keys())
    print(all_fields)

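def df_from_RSS(rss_link, fields):
    """Build a DataFrame of the given fields from an RSS feed, plus each article's full text.

    `fields` must include 'link', which is used to download the articles.
    """
    d = feedparser.parse(rss_link)

    # Create a list of lists to hold the required RSS data from each entry
    data = []
    for entry in d.entries:
        row = [entry[field] for field in fields]
        data.append(row)

    # Convert the list of lists to a DataFrame
    df = pd.DataFrame(data, columns=fields)

    # Download each linked article and collect its text in row order.
    # A list (rather than a dict keyed by link) keeps the lengths aligned
    # even if the same link appears more than once.
    article_texts = []
    for link in df["link"]:
        article = Article(link)
        article.download()
        article.parse()
        article.nlp()  # computes keywords and a summary; the text itself comes from parse()
        article_texts.append(article.text)

    df['text'] = article_texts

    return df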

A list of RSS feeds for the Toronto Star can be found here: https://www.thestar.com/about/rssfeeds.html

Let's use the Top Stories RSS feed.

In [16]:
print_RSS_fields('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'summary', 'summary_detail', 'media_content', 'media_thumbnail', 'href', 'content', 'media_credit', 'credit']


In [17]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss', fields)

df.head()

| | title | published | author | link | text |
|---|---|---|---|---|---|
| 0 | Canada’s biggest banks set aside $2.5 billion ... | Fri, 3 Mar 2023 05:00:00 EST | Christine Dobby - Business Reporter | https://www.thestar.com/business/2023/03/03/ca... | Canada’s largest banks have set aside almost $... |
| 1 | One year after the housing peak, a record drop... | Fri, 3 Mar 2023 05:00:00 EST | Tess Kalinowski - Real Estate Reporter | https://www.thestar.com/news/gta/2023/03/03/on... | Exactly one year after the real estate market ... |
| 2 | ‘We were not equipped to handle a mass shootin... | Fri, 3 Mar 2023 05:00:00 EST | Noor Javed - Staff Reporter | https://www.thestar.com/news/gta/2023/03/03/do... | Jack Rozdilsky vividly remembers the night one... |
| 3 | Canadians should be cautious about claims of f... | Thu, 2 Mar 2023 10:50:00 EST | Stephanie Levitz - Ottawa Bureau | https://www.thestar.com/politics/federal/2023/... | OTTAWA — No criminal charges laid. No diplomat... |
| 4 | Snow predicted to hit Toronto around 5 or 6 p.... | Fri, 3 Mar 2023 07:50:00 EST | Star staff,wire services | https://www.thestar.com/news/gta/2023/03/03/to... | Toronto can expect to see a dumping of snow Fr... |


In [18]:
df.to_csv("toronto_star.csv", encoding='utf-8', index=False)

### Le Devoir

A list of RSS feeds for Le Devoir can be found here: https://www.ledevoir.com/flux-rss

Let's use the World (le Monde) RSS feed.

In [19]:
print_RSS_fields('https://www.ledevoir.com/rss/section/monde.xml?id=76')

['surtitle', 'title', 'title_detail', 'published', 'published_parsed', 'links', 'link', 'id', 'guidislink', 'tags', 'summary', 'summary_detail', 'authors', 'author', 'author_detail']


In [26]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.ledevoir.com/rss/section/monde.xml?id=76', fields)

df.head()

| | title | published | author | link | text |
|---|---|---|---|---|---|
| 0 | Au congrès des conservateurs américains, Nikki... | Fri, 03 Mar 2023 15:10:07 -0500 | webmestre@ledevoir.com (Agence France-Presse) | https://www.ledevoir.com/monde/etats-unis/7840... | Sans jamais le nommer directement, la candidat... |
| 1 | En Grèce, la colère engendrée par la catastrop... | Fri, 03 Mar 2023 14:35:58 -0500 | webmestre@ledevoir.com (Vassilis Kyriakoulis) | https://www.ledevoir.com/monde/europe/783964/-... | « Assassins », « crime » : des milliers de per... |
| 2 | Au Liban, l’inquiétant effondrement des instit... | Fri, 03 Mar 2023 12:28:29 -0500 | webmestre@ledevoir.com (Acil Tabbara) | https://www.ledevoir.com/monde/moyen-orient/78... | Le chef de l’un des principaux organes de sécu... |
| 3 | Bakhmout «pratiquement encerclée», selon le gr... | Fri, 03 Mar 2023 08:03:34 -0500 | webmestre@ledevoir.com (Daria Andriievska) | https://www.ledevoir.com/monde/europe/783954/-... | Le groupe paramilitaire russe Wagner, dont les... |
| 4 | Ales Bialiatski, colauréat du Nobel de la paix... | Fri, 03 Mar 2023 07:15:43 -0500 | webmestre@ledevoir.com (Agence France-Presse) | https://www.ledevoir.com/monde/europe/783952/-... | Un tribunal de Minsk a condamné vendredi à 10 ... |
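In this feed the `author` values take the form `webmestre@ledevoir.com (Name)`. If only the name is wanted, a regular expression can strip the address prefix. This is a minimal sketch that assumes every value either follows that pattern or should be left untouched; `clean_author` and `AUTHOR_PATTERN` are hypothetical names.

```python
import re

# An address, whitespace, then the name in parentheses
AUTHOR_PATTERN = re.compile(r"^\S+@\S+\s+\((.+)\)$")


def clean_author(raw):
    """Return the name inside the parentheses, or the input unchanged if there is no match."""
    match = AUTHOR_PATTERN.match(raw)
    return match.group(1) if match else raw


print(clean_author("webmestre@ledevoir.com (Agence France-Presse)"))
print(clean_author("Karen Pauls"))
```

On the dataframe this could be applied as `df['author'].map(clean_author)`.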


In [21]:
df.to_csv("le_devoir.csv", encoding='utf-8-sig', index=False)
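This cell writes the file with `utf-8-sig` rather than `utf-8`. The `utf-8-sig` codec prepends a byte order mark (BOM), which some spreadsheet programs use to detect UTF-8 and display accented characters correctly; that this was the motivation here is an assumption. The sketch below shows the difference between the two encodings using only the standard library.

```python
import codecs

sample = "En Grèce, la colère"
plain = sample.encode("utf-8")
with_bom = sample.encode("utf-8-sig")

# The BOM is prepended to the encoded bytes
has_bom = with_bom.startswith(codecs.BOM_UTF8)
# Apart from the BOM, the bytes are identical
same_payload = with_bom[len(codecs.BOM_UTF8):] == plain
# Decoding with utf-8-sig strips the BOM again
round_trip = with_bom.decode("utf-8-sig") == sample

print(has_bom, same_payload, round_trip)
```

When reading such a file back, passing `encoding='utf-8-sig'` avoids the BOM ending up in the first column name.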

### CBC

A list of RSS feeds for the CBC can be found here: https://www.cbc.ca/rss/

Let's use the World RSS feed.

In [22]:
print_RSS_fields('https://rss.cbc.ca/lineup/world.xml')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'authors', 'author', 'author_detail', 'tags', 'summary', 'summary_detail']
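The four feeds expose different sets of fields, so a column list that works for one feed may not work for another. As a small sketch, the field lists printed in this notebook can be intersected to find the fields that are safe to request from all four feeds. The variable names are illustrative only.

```python
# Field lists copied from the print_RSS_fields outputs in this notebook
feed_fields = {
    "New York Times": ['title', 'title_detail', 'links', 'link', 'id', 'guidislink',
                       'summary', 'summary_detail', 'authors', 'author', 'author_detail',
                       'published', 'published_parsed', 'media_content', 'media_credit',
                       'credit'],
    "Toronto Star": ['title', 'title_detail', 'links', 'link', 'id', 'guidislink',
                     'authors', 'author', 'author_detail', 'published', 'published_parsed',
                     'summary', 'summary_detail', 'media_content', 'media_thumbnail',
                     'href', 'content', 'media_credit', 'credit'],
    "Le Devoir": ['surtitle', 'title', 'title_detail', 'published', 'published_parsed',
                  'links', 'link', 'id', 'guidislink', 'tags', 'summary', 'summary_detail',
                  'authors', 'author', 'author_detail'],
    "CBC": ['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published',
            'published_parsed', 'authors', 'author', 'author_detail', 'tags', 'summary',
            'summary_detail'],
}

# Fields present in the first entry of every feed
common = set.intersection(*(set(names) for names in feed_fields.values()))

# Check the columns of interest against the common set
wanted = ['title', 'published', 'summary', 'author', 'link']
missing = [name for name in wanted if name not in common]

print(sorted(common))
print(missing)
```

Because each list reflects only the first entry of a feed, later entries may still lack a field.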


In [23]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://rss.cbc.ca/lineup/world.xml', fields)

df.head()

| | title | published | author | link | text |
|---|---|---|---|---|---|
| 0 | Nobel Peace Prize winner Ales Bialiatski sente... | Fri, 3 Mar 2023 06:41:51 EST | The Associated Press | https://www.cbc.ca/news/world/belarus-bialiats... | The UN Human Rights Office said it was 'very d... |
| 1 | South Carolina lawyer Alex Murdaugh sentenced ... | Fri, 3 Mar 2023 10:25:01 EST | The Associated Press | https://www.cbc.ca/news/world/us-sc-murdaugh-s... | Disgraced South Carolina lawyer Alex Murdaugh ... |
| 2 | King Charles to try to thaw post-Brexit relati... | Fri, 3 Mar 2023 09:45:35 EST | The Associated Press | https://www.cbc.ca/news/world/king-charles-fir... | King Charles will travel to France and Germany... |
| 3 | Mexican man who died on U.S. border struggled ... | Fri, 10 Feb 2023 13:58:40 EST | | https://www.cbc.ca/news/canada/montreal/mexica... | The Mexican man who died Feb. 19 shortly after... |
| 4 | 'Stop the bleeding,' Philippines health offici... | Fri, 3 Mar 2023 04:00:00 EST | Karen Pauls | https://www.cbc.ca/news/canada/manitoba/philip... | Rhea Patulay saw the shortage of Filipino nurs... |


In [24]:
df.to_csv("cbc.csv", encoding='utf-8', index=False)

## 3. Text Processing