# > **GOOGLE NEWS SCRAPER SCRIPT 🍏 🐿**

Duygu Ider, Nov. 30th, 2021 ☕ 

MSc Thesis: Cryptocurrency Price Forecasting Using BERT-Based Sentiment Analysis

*   More information about RSS feed pulling: https://www.jcchouinard.com/read-rss-feed-with-python/
*   More information about Google News RSS: https://newscatcherapi.com/blog/google-news-rss-search-parameters-the-missing-documentaiton
*   URL template for date, source and topic filtering: link = Template('https://news.google.com/rss/search?q=inurl%3Acoindesk%20OR%20inurl%3Acointelegraph+Bitcoin%20OR%20BTC+after:$early_date+before:$late_date&ceid=US:en&hl=en-US&gl=US')


# Setup

Import packages

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import string
import datetime
import time
import numpy as np
import copy
from google.colab import files, drive

Define headers with browser user agent:

In [2]:
headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 OPR/79.0.4143.72'
        }

Define link as a template string, in order to filter Google News posts by date and topic later.

Keep in mind that news articles from CoinDesk and CoinTelegraph are scraped only. All other sources are filtered out.

In [3]:
link = string.Template('https://news.google.com/rss/search?q=inurl%3Acoindesk%20OR%20inurl%3Acointelegraph+$currency%20OR%20$symbol+after:$early_date+before:$late_date&ceid=US:en&hl=en-US&gl=US')

In [4]:
#link = string.Template('https://news.google.com/rss/search?q=inurl%3Abloomberg%20OR%20inurl%3Areuters+$currency%20OR%20$symbol+after:$early_date+before:$late_date&ceid=US:en&hl=en-US&gl=US')

RSS reader class definition:

In [5]:
class ReadRss:
 
    def __init__(self, rss_url, headers):
 
        self.url = rss_url
        self.headers = headers
        try:
            self.r = requests.get(rss_url, headers=self.headers)
            self.status_code = self.r.status_code
        except Exception as e:
            print('Error fetching the URL: ', rss_url)
            print(e)
        try:    
            self.soup = BeautifulSoup(self.r.text, 'lxml')
        except Exception as e:
            print('Could not parse the xml: ', self.url)
            print(e)
        self.articles = self.soup.findAll('item')
        self.articles_dicts = [{'title':a.find('title').text,'link':a.link.next_sibling.replace('\n','').replace('\t',''),'description':a.find('description').text,'pubdate':a.find('pubdate').text} for a in self.articles]
        self.urls = [d['link'] for d in self.articles_dicts if 'link' in d]
        self.titles = [d['title'] for d in self.articles_dicts if 'title' in d]
        self.descriptions = [d['description'] for d in self.articles_dicts if 'description' in d]
        self.pub_dates = [d['pubdate'] for d in self.articles_dicts if 'pubdate' in d]

# Initial Test

Initialize URL using link template for testing:

In [None]:
URL = link.substitute(currency = "Bitcoin", symbol = "BTC", early_date = start_date, late_date = end_date)

Create RSS feed reader object of ReadRSS class:

In [None]:
feed = ReadRss(URL, headers)

Get article data as a list of dictionaries:

In [None]:
feed.articles_dicts

[{'description': '<a href="https://www.coindesk.com/markets/2021/02/12/first-mover-bullish-1-million-bitcoin-forecast-as-year-of-ox-begins/" target="_blank">First Mover: Bullish ($1 Million) Bitcoin Forecast as Year of Ox Begins</a>&nbsp;&nbsp;<font color="#6f6f6f">Coindesk</font>',
  'link': 'https://www.coindesk.com/markets/2021/02/12/first-mover-bullish-1-million-bitcoin-forecast-as-year-of-ox-begins/',
  'pubdate': 'Fri, 12 Feb 2021 08:00:00 GMT',
  'title': 'First Mover: Bullish ($1 Million) Bitcoin Forecast as Year of Ox Begins - Coindesk'},
 {'description': '<a href="https://www.coindesk.com/markets/2021/02/11/first-mover-bitcoin-at-center-stage-and-record-high-as-mastercard-bny-go-crypto/" target="_blank">First Mover: Bitcoin at Center Stage (and Record High) as Mastercard, BNY Go Crypto</a>&nbsp;&nbsp;<font color="#6f6f6f">Coindesk</font>',
  'link': 'https://www.coindesk.com/markets/2021/02/11/first-mover-bitcoin-at-center-stage-and-record-high-as-mastercard-bny-go-crypto/',


Get list of news article URLs in the feed:

In [None]:
feed.urls

['https://www.coindesk.com/podcasts/the-breakdown-with-nlw/ukraine-legalizes-bitcoin/',
 'https://cointelegraph.com/news/here-are-the-btc-price-levels-to-watch-as-38k-emerges-as-bulls-line-in-the-sand',
 'https://www.coindesk.com/business/2021/09/11/on-having-fun-staying-poor/',
 'https://www.coindesk.com/business/2021/09/11/ark-investment-management-opens-door-for-fund-to-invest-in-canadian-etfs/',
 'https://cointelegraph.com/news/stanford-researcher-led-pledge-raises-3m-for-decentralized-lending-protocol']

Show list of article titles in the feed:

In [None]:
feed.titles

['First Mover: Bullish ($1 Million) Bitcoin Forecast as Year of Ox Begins - Coindesk',
 'First Mover: Bitcoin at Center Stage (and Record High) as Mastercard, BNY Go Crypto - Coindesk',
 'Bitcoin Back Above $40K as Institutions Lead the Way - Coindesk',
 'Bitcoin hits $43K all-time high as Tesla invests $1.5 billion in BTC - Cointelegraph',
 'Crypto Long & Short: Could Scalable Payments for Bitcoin Undermine Its Value? - Coindesk',
 'Bitcoin bulls eye $50K as data show BTC’s liquid supply in steady decline - Cointelegraph',
 "Blockchain Bites: Will Bitcoin See 'Reflexive' Buys After Tesla? - Coindesk",
 "Launching in 3, 2... Here's why Bitcoin breaking $40,000 is different than last time - Cointelegraph",
 'Bitcoin Miners Earn Record Hourly Revenue of $4M - Coindesk',
 'More People HODLing Bitcoin Hurts Case for Buying, Selling With It, Says Morgan Stanley - Coindesk',
 'Blockchain Bites: Canada Approves Bitcoin ETF, Options Markets Not Pricing for $100K BTC - Coindesk',
 'Bitcoin goes

Show feed descriptions:

In [None]:
feed.descriptions

['<a href="https://www.coindesk.com/markets/2022/02/09/market-wrap-bitcoin-and-stocks-rise-signaling-greater-investor-appetite-for-risk/" target="_blank">Market Wrap: Bitcoin and Stocks Rise, Signaling Greater Investor Appetite for Risk</a>&nbsp;&nbsp;<font color="#6f6f6f">CoinDesk</font>',
 '<a href="https://www.coindesk.com/markets/2022/02/09/bitcoin-miners-offloaded-holdings-as-prices-dropped-to-33k/" target="_blank">Bitcoin Miners Offloaded Holdings as Prices Dropped to $33K</a>&nbsp;&nbsp;<font color="#6f6f6f">CoinDesk</font>',
 '<a href="https://www.coindesk.com/markets/2022/02/03/long-term-buyers-unfazed-by-bitcoins-recent-drop-to-33k/" target="_blank">Long-Term Buyers Unfazed by Bitcoin\'s Recent Drop to $33K</a>&nbsp;&nbsp;<font color="#6f6f6f">CoinDesk</font>',
 '<a href="https://www.coindesk.com/markets/2022/02/06/the-corporate-argument-for-bitcoin/" target="_blank">The Corporate Argument for Bitcoin</a>&nbsp;&nbsp;<font color="#6f6f6f">CoinDesk</font>',
 '<a href="https://w

Show article publication dates:

In [None]:
feed.pub_dates

['Wed, 09 Feb 2022 21:17:00 GMT',
 'Wed, 09 Feb 2022 10:17:00 GMT',
 'Thu, 03 Feb 2022 08:00:00 GMT',
 'Sun, 06 Feb 2022 15:50:00 GMT',
 'Tue, 08 Feb 2022 16:46:00 GMT',
 'Tue, 01 Feb 2022 08:00:00 GMT',
 'Fri, 04 Feb 2022 08:00:00 GMT',
 'Tue, 08 Feb 2022 12:25:00 GMT',
 'Mon, 07 Feb 2022 12:33:00 GMT',
 'Mon, 07 Feb 2022 11:32:00 GMT',
 'Wed, 09 Feb 2022 20:01:12 GMT',
 'Mon, 07 Feb 2022 19:46:00 GMT',
 'Wed, 09 Feb 2022 22:42:00 GMT',
 'Wed, 09 Feb 2022 11:07:00 GMT',
 'Thu, 10 Feb 2022 05:39:00 GMT',
 'Wed, 09 Feb 2022 23:44:00 GMT',
 'Wed, 09 Feb 2022 18:20:00 GMT',
 'Tue, 08 Feb 2022 16:47:07 GMT',
 'Wed, 09 Feb 2022 18:14:16 GMT',
 'Sun, 06 Feb 2022 19:40:29 GMT',
 'Wed, 09 Feb 2022 18:34:00 GMT',
 'Fri, 04 Feb 2022 08:00:00 GMT',
 'Tue, 08 Feb 2022 18:23:00 GMT',
 'Wed, 09 Feb 2022 11:39:21 GMT',
 'Thu, 03 Feb 2022 08:00:00 GMT',
 'Tue, 01 Feb 2022 08:00:00 GMT',
 'Wed, 09 Feb 2022 23:28:00 GMT',
 'Thu, 03 Feb 2022 08:00:00 GMT',
 'Mon, 07 Feb 2022 17:31:44 GMT',
 'Mon, 07 Feb 

In [None]:
pd.DataFrame(feed.articles_dicts)

Unnamed: 0,title,link,description,pubdate
0,"Bitcoin Price Continues to Fall, Breaking Thro...",https://www.coindesk.com/markets/2015/01/14/bi...,"<a href=""https://www.coindesk.com/markets/2015...","Wed, 14 Jan 2015 08:00:00 GMT"
1,State of Bitcoin 2015: Ecosystem Grows Despite...,https://www.coindesk.com/markets/2015/01/07/st...,"<a href=""https://www.coindesk.com/markets/2015...","Wed, 07 Jan 2015 08:00:00 GMT"
2,Bitcoin Price Crashes Through $250 Mark - Coin...,https://www.coindesk.com/markets/2015/01/13/bi...,"<a href=""https://www.coindesk.com/markets/2015...","Tue, 13 Jan 2015 08:00:00 GMT"
3,How Anonymous is Bitcoin? A Backgrounder for P...,https://www.coindesk.com/markets/2015/01/25/ho...,"<a href=""https://www.coindesk.com/markets/2015...","Sun, 25 Jan 2015 08:00:00 GMT"
4,'Bitcoin Box' Can Process Payments With No Web...,https://www.coindesk.com/markets/2015/01/18/bi...,"<a href=""https://www.coindesk.com/markets/2015...","Sun, 18 Jan 2015 08:00:00 GMT"
...,...,...,...,...
78,No More Burners: Using BitSMS and ChiliPhone T...,https://cointelegraph.com/news/no-more-burners...,"<a href=""https://cointelegraph.com/news/no-mor...","Thu, 08 Jan 2015 08:00:00 GMT"
79,"Dutch Anycoin Direct Announces €500k Funding, ...",https://cointelegraph.com/news/dutch-anycoin-d...,"<a href=""https://cointelegraph.com/news/dutch-...","Thu, 22 Jan 2015 08:00:00 GMT"
80,Silk Road Reloaded Ditches Tor and Accepts Mul...,https://cointelegraph.com/news/silk-road-reloa...,"<a href=""https://cointelegraph.com/news/silk-r...","Tue, 13 Jan 2015 08:00:00 GMT"
81,Charlie Shrem: ‘Mark Karpeles Wanted to Take t...,https://cointelegraph.com/news/charlie-shrem-m...,"<a href=""https://cointelegraph.com/news/charli...","Tue, 27 Jan 2015 08:00:00 GMT"


# Entire Dataset Scraping

Define all currencies of interest:

In [6]:
all_currencies = {"BTC": ["Bitcoin", "BTC"],
                  "ETH": ["Ethereum", "ETH"]}

Select date and current currency parameters:

In [13]:
start_date = "2021-11-01"
end_date = "2022-02-22"
currency = "ETH"

Create function to scrape daily news data from Google News RSS feed in bulk, for a selected currency:

In [8]:
def scrape_RSS_feed(cur_currency):
    print("Current cryptocurrency of interest is", cur_currency)

    #news_data = pd.DataFrame(columns = ['cryptocurrency', 'date', 'title', 'url'])
    news_data = pd.DataFrame(columns = ['title', 'link', 'description', 'pubdate'])
    
    start_date_obj = datetime.datetime.strptime(start_date, '%Y-%m-%d').date()
    end_date_obj = datetime.datetime.strptime(end_date, '%Y-%m-%d').date()
    
    cur_date_obj = copy.copy(start_date_obj)
    
    
    while cur_date_obj <= end_date_obj:
        cur_date = cur_date_obj.strftime('%Y-%m-%d')
        if cur_date_obj.strftime("%d")=="01":
          print("Started news data scraping for", cur_date_obj.strftime("%Y-%m"))
        
        next_date_obj = cur_date_obj + datetime.timedelta(days=1)
        next_date = next_date_obj.strftime('%Y-%m-%d')
        
        #URL = link.substitute(early_date = cur_date, late_date = next_date)
        URL = link.substitute(currency = all_currencies[currency][0],
                              symbol = all_currencies[currency][1],
                              early_date = cur_date,
                              late_date = next_date)
    
        feed = ReadRss(URL, headers)
        cur_date_data = pd.DataFrame(feed.articles_dicts)
        cur_date_data = pd.DataFrame([feed.pub_dates, feed.titles, feed.urls]).T
        cur_date_data.columns = ['date', 'title', 'url']
    
        news_data = news_data.append(cur_date_data)
        
        time.sleep(np.random.randint(2, 5))
    
        cur_date_obj += datetime.timedelta(days=1)
    
    news_data['cryptocurrency'] = cur_currency

    return news_data

Run the scraper for the current currency of interest.

In [14]:
start_time = time.time()

news_data = scrape_RSS_feed(currency)
news_data['date'] = pd.to_datetime(news_data['date']).dt.date
#news_data = news_data.sort_values(by='date')

end_time = time.time()

Current cryptocurrency of interest is ETH
Started news data scraping for 2021-11
Started news data scraping for 2021-12
Started news data scraping for 2022-01
Started news data scraping for 2022-02


In [None]:
news_data.shape

(195, 5)

In [None]:
link.substitute(currency = all_currencies[currency][0],
                              symbol = all_currencies[currency][1],
                              early_date = start_date,
                              late_date = end_date)

'https://news.google.com/rss/search?q=inurl%3Abloomberg%20OR%20inurl%3Areuters+Bitcoin%20OR%20BTC+after:2019-01-01+before:2022-02-10&ceid=US:en&hl=en-US&gl=US'

View the news data output:

In [10]:
news_data.head(3)

Unnamed: 0,title,link,description,pubdate,date,url,cryptocurrency
0,Market Wrap: Bitcoin Rangebound as Traders Exp...,,,,2021-11-01,https://www.coindesk.com/markets/2021/11/01/ma...,BTC
1,Institutional managers bought $2B worth of Bit...,,,,2021-11-01,https://cointelegraph.com/news/institutional-m...,BTC
2,NFL Star Aaron Rodgers Gives Ringing Endorseme...,,,,2021-11-01,https://www.coindesk.com/business/2021/11/01/n...,BTC


In [None]:
news_data.pubdate

0    Thu, 08 Jan 2015 08:00:00 GMT
1    Thu, 08 Jan 2015 08:00:00 GMT
2    Thu, 08 Jan 2015 08:00:00 GMT
0    Fri, 09 Jan 2015 08:00:00 GMT
0    Sat, 10 Jan 2015 08:00:00 GMT
1    Sat, 10 Jan 2015 08:00:00 GMT
2    Sat, 10 Jan 2015 08:00:00 GMT
Name: pubdate, dtype: object

In [None]:
news_data.to_csv("btc_hourly_news.csv", header=True, index=False, columns=list(news_data.columns))
files.download("btc_hourly_news.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
news_data.to_csv("/drive/My Drive/Colab Notebooks/thesis_all_datasets/btc_news_reuters_bloomberg.csv", header=True, index=False, columns=list(news_data.axes[1]))

# Output Organization

Print out column names in output data:

In [None]:
print("Column names:", news_data.axes[1])

Column names: Index(['cryptocurrency', 'date', 'title', 'url'], dtype='object')


Runtime and execution results:

In [None]:
print("Execution runtime: %s minutes" % round((end_time - start_time)/60, 2))
print("Number of posts scraped: %s samples" % len(news_data))

Execution runtime: 142.92 minutes
Number of posts scraped: 23870 samples


Download output dataframes as .csv:

In [None]:
filename = "news_"+str.lower(currency)+"_"+str(start_date)+"_"+str(end_date)
#filename_raw = filename+"_raw.csv"
#filename_filtered = filename+"_filtered.csv"

# save as csv and download file
news_data.to_csv(filename+".csv", header=True, index=False, columns=list(news_data.axes[1]))
files.download(filename+".csv")

#posts_selected.to_csv(filename_raw, header=True, index=False, columns=list(posts_selected.axes[1]))
#files.download(filename_raw)

#posts_filtered.to_csv(filename_filtered, header=True, index=False, columns=list(posts_filtered.axes[1]))
#files.download(filename_filtered)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [11]:
drive.mount('/drive')

Mounted at /drive


In [15]:
news_data.to_csv("/drive/My Drive/Colab Notebooks/thesis_all_datasets/news_eth_test.csv", header=True, index=False, columns=list(news_data.axes[1]))

In [None]:
news_data.to_csv("/drive/My Drive/Colab Notebooks/thesis_all_datasets/"+filename+".csv", header=True, index=False, columns=list(news_data.axes[1]))

# Addition: Shortcut

Shortcut to scrape news data for all currencies of interest at once. Keep in mind that it makes debugging and analyzing intermediate steps more complicated. Nonetheless, a quicker and more automized option 🙊

In [None]:
# map RSS feed scraper to all cryptocurrencies
scrape_all_currencies = map(scrape_RSS_feed, list(all_currencies.keys()))

# run scraper for all
dataset_all = list(scrape_all_currencies)

# concatenate to single data frame
dataset_concat = pd.concat(dataset_all)

BTC
2021-09-11
2021-09-12
ETH
2021-09-11
2021-09-12
