<a href="https://colab.research.google.com/github/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping with RSS Feeds

Prepared by Group 14

The purpose of this notebook is to show how to scrape news articles from the RSS feeds of various news sources, and perform text processing techniques. This notebook can be accessed at https://github.com/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb

## 1. Installation

In [1]:
# !git clone https://github.com/farhanwadia/MIE1624.git

In [2]:
# %cd MIE1624
# %cd 'Course Presentation'

In [3]:
# !ls

In [4]:
!pip install feedparser
!pip install newspaper3k

Collecting feedparser
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.1/81.1 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sgmllib3k
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hBuilding wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... [?25ldone
[?25h  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6065 sha256=8c3d4a1175cb160a05ed1774546aec727f3105f15b9957278de672b8111cd0a8
  Stored in directory: /Users/dhairyaparmar/Library/Caches/pip/wheels/00/a1/4a/3aaa30857be3b96a4a11fccfa1336686def7d898b8be2509dd
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.10 sgmllib3k-1.0.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.0.1[

## 2. Working with RSS Feeds

### New York Times

A list of all RSS feeds from the New York Times can be accessed at https://www.nytimes.com/rss.

Let's use the World feed from https://rss.nytimes.com/services/xml/rss/nyt/World.xml as an example:

#### Form the dataframe

In [5]:
import feedparser

d = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/World.xml')

In [6]:
# Get a list of all possible fields from the RSS
all_fields = []
for field in d.entries[0]:
    all_fields.append(field)

print(all_fields)

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'summary', 'summary_detail', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'tags']


In [7]:
# Define the fields of interest that we want to obtain from the RSS
fields = ['title', 'published', 'summary', 'author', 'link']

In [8]:
import pandas as pd

# Create a list of lists to hold the required RSS data from each entry
data = []
for i, entry in enumerate(d.entries):
    row = []
    for field in fields:
        row.append(d.entries[i][field])
    data.append(row)

# Convert the list of lists to a df
df = pd.DataFrame(data, columns = fields)

In [9]:
df.head()

Unnamed: 0,title,published,summary,author,link
0,Whirring Into Action in Ukraine’s Skies,"Sat, 04 Mar 2023 16:55:04 +0000","Against the odds, Ukraine’s helicopter brigade...",Carlotta Gall and Daniel Berehulak,https://www.nytimes.com/2023/03/04/world/europ...
1,They Shared Erotic Images in a Group Chat. The...,"Sat, 04 Mar 2023 18:55:16 +0000",A couple in Singapore created a Telegram accou...,Sui-Lee Wee,https://www.nytimes.com/2023/03/04/world/asia/...
2,"In West Bank, New Armed Groups Emerge, and Dor...","Sat, 04 Mar 2023 15:33:25 +0000",The small but influential Lions’ Den network h...,Patrick Kingsley and Hiba Yazbek,https://www.nytimes.com/2023/03/04/world/middl...
3,A Massacre That Rippled Through Generations in...,"Sat, 04 Mar 2023 13:35:31 +0000",For the parents and the grandparents of the ch...,Ryn Jirenuwat and Sui-Lee Wee,https://www.nytimes.com/2023/03/04/world/asia/...
4,"The Woman Shaking Up Italian Politics (No, Not...","Sat, 04 Mar 2023 08:00:11 +0000",Daughter of Italian and Jewish American parent...,Jason Horowitz,https://www.nytimes.com/2023/03/04/world/europ...


In [10]:
print("The shape of the dataframe is", df.shape)

The shape of the dataframe is (61, 5)


#### Add full texts for the corresponding articles to the dataframe

In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dhairyaparmar/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [12]:
from newspaper import Article

links = df["link"]

article_text_dict = {}
for link in links:
  article = Article(link)
  article.download()
  article.parse()
  article.nlp()
  article_text_dict[link] = article.text
  
df['text'] = list(article_text_dict.values())

In [13]:
df.head()

Unnamed: 0,title,published,summary,author,link,text
0,Whirring Into Action in Ukraine’s Skies,"Sat, 04 Mar 2023 16:55:04 +0000","Against the odds, Ukraine’s helicopter brigade...",Carlotta Gall and Daniel Berehulak,https://www.nytimes.com/2023/03/04/world/europ...,"On a snowbound field, three Soviet-era helicop..."
1,They Shared Erotic Images in a Group Chat. The...,"Sat, 04 Mar 2023 18:55:16 +0000",A couple in Singapore created a Telegram accou...,Sui-Lee Wee,https://www.nytimes.com/2023/03/04/world/asia/...,The video shows the woman in a spaghetti strap...
2,"In West Bank, New Armed Groups Emerge, and Dor...","Sat, 04 Mar 2023 15:33:25 +0000",The small but influential Lions’ Den network h...,Patrick Kingsley and Hiba Yazbek,https://www.nytimes.com/2023/03/04/world/middl...,After a violent uprising by Palestinians again...
3,A Massacre That Rippled Through Generations in...,"Sat, 04 Mar 2023 13:35:31 +0000",For the parents and the grandparents of the ch...,Ryn Jirenuwat and Sui-Lee Wee,https://www.nytimes.com/2023/03/04/world/asia/...,When a former police officer in rural Thailand...
4,"The Woman Shaking Up Italian Politics (No, Not...","Sat, 04 Mar 2023 08:00:11 +0000",Daughter of Italian and Jewish American parent...,Jason Horowitz,https://www.nytimes.com/2023/03/04/world/europ...,"ROME — Growing up in Switzerland, Elly Schlein..."


In [14]:
df.to_csv("new_york_times.csv", encoding='utf-8', index=False)

### Toronto Star

#### Function Development
Create a function to assist with the scraping process

In [15]:
def print_RSS_fields(rss_link):
    
    d = feedparser.parse(rss_link)

    all_fields = []
    for field in d.entries[0]:
        all_fields.append(field)
    print(all_fields)

def df_from_RSS(rss_link, fields):
    
    d = feedparser.parse(rss_link)
    
    # Create a list of lists to hold the required RSS data from each entry
    data = []
    for i, entry in enumerate(d.entries):
        row = []
        for field in fields:
            row.append(d.entries[i][field])
        data.append(row)

    # Convert the list of lists to a df
    df = pd.DataFrame(data, columns = fields)

    links = df["link"]

    article_text_dict = {}
    for link in links:
        article = Article(link)
        article.download()
        article.parse()
        article.nlp()
        article_text_dict[link] = article.text
    
    df['text'] = list(article_text_dict.values())

    return df

A list of RSS feeds for the Toronto Star can be found here: https://www.thestar.com/about/rssfeeds.html

Let's use the Top Stories RSS feed.

In [16]:
print_RSS_fields('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'summary', 'summary_detail', 'media_content', 'media_thumbnail', 'href', 'content', 'media_credit', 'credit']


In [17]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss', fields)

df.head()

Unnamed: 0,title,published,author,link,text
0,Toronto to ban parking on major snow routes as...,"Sat, 4 Mar 2023 10:02:00 EST",Jennifer Pagliaro - Crime Reporter,https://www.thestar.com/news/gta/2023/03/04/to...,Toronto crews are working to dig out the city ...
1,Winter storm updates: Toronto declares ‘major ...,"Sat, 4 Mar 2023 07:27:00 EST",Star staff,https://www.thestar.com/news/gta/2023/03/04/wi...,KEY FACTS Snow removal process to go into next...
2,What will the weather be like this spring? Ont...,"Fri, 3 Mar 2023 11:09:31 EST",Joshua Chong - Staff Reporter,https://www.thestar.com/news/gta/2023/03/03/wh...,As southern Ontario braces for yet another sno...
3,‘Gotta love nature.’ Toronto residents react t...,"Sat, 4 Mar 2023 07:56:00 EST",Thea Gribilas - Staff Reporter,https://www.thestar.com/news/gta/2023/03/04/go...,As Toronto was dumped with 20 to 30 cm of snow...
4,Why this town is ground zero of Canada’s risin...,"Sat, 4 Mar 2023 06:00:00 EST",May Warren - Housing Reporter,https://www.thestar.com/news/gta/2023/03/04/wh...,"Turning down a yet-to-be-paved road, local cou..."


In [18]:
df.to_csv("toronto_star.csv", encoding='utf-8', index=False)

### Le Devoir

A list of RSS feeds for Le Devoir can be found here: https://www.ledevoir.com/flux-rss

Let's use the World (le Monde) RSS feed.

In [19]:
print_RSS_fields('https://www.ledevoir.com/rss/section/monde.xml?id=76')

['surtitle', 'title', 'title_detail', 'published', 'published_parsed', 'links', 'link', 'id', 'guidislink', 'tags', 'summary', 'summary_detail', 'authors', 'author', 'author_detail']


In [20]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.ledevoir.com/rss/section/monde.xml?id=76', fields)

df.head()

Unnamed: 0,title,published,author,link,text
0,Des histoires de sécheresse,"Sat, 04 Mar 2023 15:15:23 -0500",aprovost@ledevoir.com (Anne-Marie Provost),https://www.ledevoir.com/monde/afrique/784006/...,"6 Shake Guyo, 72 ans, Malabot | Les éleveurs p..."
1,Dernière ligne droite pour éviter un naufrage ...,"Sat, 04 Mar 2023 15:01:55 -0500",webmestre@ledevoir.com (Amélie Bottollier-Depois),https://www.ledevoir.com/monde/784115/-dernier...,Les États membres de l’ONU tentaient toujours ...
2,Le Canada interpellé alors que le nombre de ré...,"Sat, 04 Mar 2023 12:29:47 -0500",webmestre@ledevoir.com (Dylan Robertson),https://www.ledevoir.com/monde/784107/-le-cana...,Les Nations Unies se préparent à une nouvelle ...
3,"Combats acharnés à Bakhmout, visite du ministr...","Sat, 04 Mar 2023 10:33:26 -0500",webmestre@ledevoir.com (Agence France-Presse),https://www.ledevoir.com/monde/europe/784100/-...,Le ministre russe de la Défense a mené une ins...
4,La solidarité de l’OTAN mise à l’épreuve,"Sat, 04 Mar 2023 04:09:41 -0500",mvastel@ledevoir.com (Marie Vastel),https://www.ledevoir.com/monde/784053/un-an-de...,Un an après que la Russie a lancé son invasion...


In [21]:
df.to_csv("le_devoir.csv", encoding='utf-8-sig', index=False)

### CBC

A list of RSS feeds for the CBCs can be found here: https://www.cbc.ca/rss/

Let's use the World RSS feed.

In [22]:
print_RSS_fields('https://rss.cbc.ca/lineup/world.xml')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'authors', 'author', 'author_detail', 'tags', 'summary', 'summary_detail']


In [23]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://rss.cbc.ca/lineup/world.xml', fields)

df.head()

ValueError: Length of values (19) does not match length of index (20)

In [None]:
df.to_csv("cbc.csv", encoding='utf-8', index=False)

## 3. Text Processing