#### Author
Yury Kashnitsky

#### Reference
[Notion ticket](https://www.notion.so/a74951e4e815480584dea7d61ddce6cc?v=dbfdb1207d0e451b827d3c5041ed0cfd&p=d5d0948bde5f43c7b77c5e9329d52980)

#### Idea
Parse RSS feeds from the list `../data/crypto_rss_feeds.txt` and perform an EDA to come up with filtering rules.

#### Results
Derived filtering rules based on publication timestamp, title length, presence of a verb, absence of a question mark, and language (English only).

About 1,200 titles (1,183 after all filters) are written to `../data/20220530_rss_feed.csv`.

## Scraping RSS feeds

In [1]:
import re
import feedparser
from tqdm import tqdm
import pandas as pd
from mmh3 import hash as mmh3_hash
from dateutil import parser as dateutil_parser
from delorean import parse as delorean_date_parse
import datetime
import pytz
import langid
import langdetect

In [2]:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

# https://stackoverflow.com/questions/66712753/how-to-use-languagedetector-from-spacy-langdetect-package

def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load("en_core_web_sm")
Language.factory("language_detector", func=get_lang_detector)
nlp.add_pipe('language_detector', last=True)

<spacy_langdetect.spacy_langdetect.LanguageDetector at 0x149115c60>

In [3]:
with open('../data/crypto_rss_feeds.txt') as f:
    urls = [el.strip() for el in f.readlines()]
    
feeds = [feedparser.parse(url) for url in tqdm(urls)]

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 84/84 [01:32<00:00,  1.10s/it]


In [4]:
def parse_feeds(feeds) -> pd.DataFrame:
    ids, parsed_titles, sources, pub_times = [], [], [], []
    for feed in feeds:
        for title_metadata in feed['entries']:
            parsed_titles.append(title_metadata.title)
            ids.append(mmh3_hash(title_metadata.title, seed=17))
            if "summary_detail" in title_metadata.keys():
                sources.append(title_metadata.summary_detail["base"])
            else:
                sources.append(title_metadata.title_detail["base"])
            if "published" in title_metadata.keys():
                pub_times.append(title_metadata.published)
            else:
                pub_times.append(None)
    df = pd.DataFrame(
        {
            "title_id": ids,
            "title": parsed_titles,
            "source": sources,
            "pub_time": pub_times,
        }
    ).drop_duplicates(
        subset="title_id"
    )  # TODO resolve duplications better

    df.set_index("title_id", inplace=True)
    
    return df
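One possible way to address the TODO above is to normalize each title before hashing, so that copies differing only in case or whitespace receive the same ID. The sketch below is an assumption rather than the method used in this notebook: `normalize_title` and `stable_title_id` are hypothetical helpers, and it uses `hashlib` from the standard library in place of `mmh3`.

```python
import hashlib
import re


def normalize_title(title: str) -> str:
    # Hypothetical helper: lowercase and collapse runs of whitespace
    return re.sub(r"\s+", " ", title).strip().lower()


def stable_title_id(title: str) -> int:
    # Hypothetical helper: first 8 bytes of a SHA-256 digest read as a
    # signed 64-bit integer, so the value fits a pandas int64 index
    digest = hashlib.sha256(normalize_title(title).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


a = stable_title_id("Purse now supports Amazon Germany")
b = stable_title_id("  purse now supports  Amazon Germany ")
c = stable_title_id("10 Top Defi Coins Today")
print(a == b, a == c)
```

Titles that are reworded rather than re-cased would still need a fuzzier comparison than this.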

In [5]:
df = parse_feeds(feeds).dropna()

In [6]:
len(df)

2317

In [7]:
df.head()

title_id,title,source,pub_time
1385300947,BITMAIN Invited Key Customers to Visit North A...,https://blog.bitmain.com/en/feed/?x=1,"Mon, 25 Apr 2022 13:47:46 +0000"
750874296,BITMAIN Successfully Held the 2022 Customer Ov...,https://blog.bitmain.com/en/feed/?x=1,"Tue, 15 Mar 2022 11:30:04 +0000"
1442203391,New Batches of ANTMINER Z15 Officially Go on S...,https://blog.bitmain.com/en/feed/?x=1,"Mon, 14 Feb 2022 08:01:04 +0000"
158395930,ANTMINER 2021 Customer Satisfaction Survey,https://blog.bitmain.com/en/feed/?x=1,"Fri, 28 Jan 2022 11:57:20 +0000"
-1332721541,Customer Compensation Notice for L7 and other ...,https://blog.bitmain.com/en/feed/?x=1,"Wed, 12 Jan 2022 13:58:28 +0000"


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2317 entries, 1385300947 to -1415782506
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     2317 non-null   object
 1   source    2317 non-null   object
 2   pub_time  2317 non-null   object
dtypes: object(3)
memory usage: 72.4+ KB


Because the timestamps mix named timezones (EST, GMT) with numeric offsets (+0000, +0100), we need a custom date-parsing function.

In [9]:
df['pub_time'].apply(lambda s: s[-5:]).value_counts()

+0000    1523
0 EST     619
0 GMT      36
+0100      30
1 GMT      19
7 GMT      18
2 GMT      16
6 GMT      13
9 GMT      11
4 GMT       9
3 GMT       9
8 GMT       8
5 GMT       6
Name: pub_time, dtype: int64
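Taking the last five characters mixes a digit of the seconds field into the named-zone suffixes (for example `0 EST`). A regular expression that captures only the trailing timezone token gives a cleaner tally. This is a minimal sketch on a few sample strings in the same format as the feed timestamps; `TZ_SUFFIX` and `tz_token` are hypothetical names.

```python
import re
from collections import Counter

# Either a numeric offset such as +0100 or a short zone name such as EST,
# anchored at the end of the string
TZ_SUFFIX = re.compile(r"([+-]\d{4}|[A-Za-z]{2,5})$")


def tz_token(ts: str) -> str:
    # Hypothetical helper: return the trailing timezone token, or "unknown"
    match = TZ_SUFFIX.search(ts.strip())
    return match.group(1) if match else "unknown"


samples = [
    "Mon, 25 Apr 2022 13:47:46 +0000",
    "Mon, 30 May 2022 06:50:09 +0100",
    "Mon, 11 Apr 2022 09:20:00 EST",
    "Mon, 27 Mar 2017 22:34:01 GMT",
]
counts = Counter(tz_token(s) for s in samples)
print(counts)
```

On the real data the same function could be passed to `df['pub_time'].apply(...)` before calling `value_counts()`.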

In [10]:
def parse_time(ts: str, named_timezones=('EST', 'GMT', 'UTC')):
    # map named timezones to tzinfo objects so that dateutil can resolve them
    tzinfos = {tz: pytz.timezone(tz) for tz in named_timezones}
    if ts[-3:] in named_timezones:
        return dateutil_parser.parse(ts, tzinfos=tzinfos)

    # if instead an offset is specified like +0100, we use delorean parser
    # (raw string avoids invalid-escape warnings for \d)
    elif re.match(pattern=r"[+-]\d{4}", string=ts[-5:]):
        return delorean_date_parse(ts).datetime

    # otherwise we failed to parse
    else:
        return pd.NaT

In [11]:
df['pub_time'] = df['pub_time'].apply(lambda ts: parse_time(ts=ts))

In [12]:
df['pub_time'].min(), df['pub_time'].max()

(datetime.datetime(2017, 3, 27, 22, 34, 1, tzinfo=<StaticTzInfo 'GMT'>),
 datetime.datetime(2022, 5, 30, 6, 59, 37, tzinfo=<UTC>))
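The parsed values above carry different `tzinfo` objects. The raw feed timestamps follow the RFC 2822 date format, which Python's standard library can also parse. As an alternative sketch (not the approach used in this notebook), `email.utils.parsedate_to_datetime` understands numeric offsets as well as a few zone names such as `EST` and `GMT`, and converting the result to UTC makes sorting and comparison unambiguous. `parse_rfc2822_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_rfc2822_utc(ts: str) -> datetime:
    # Hypothetical helper: parse an RFC 2822 date and convert it to UTC.
    # Assumes the string carries a timezone; a naive result would be
    # interpreted as local time by astimezone().
    return parsedate_to_datetime(ts).astimezone(timezone.utc)


samples = [
    "Mon, 30 May 2022 06:50:09 +0100",
    "Mon, 30 May 2022 01:00:00 EST",
    "Mon, 27 Mar 2017 22:34:01 GMT",
]
parsed = [parse_rfc2822_utc(s) for s in samples]
for original, converted in zip(samples, parsed):
    print(original, "->", converted.isoformat())
```

Strings that do not follow this format raise an exception, so a real pipeline would need a fallback such as the `pd.NaT` branch in `parse_time`.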

In [13]:
df.sort_values(by='pub_time', ascending=False).head(50)

title_id,title,source,pub_time
99246378,Why Did Francophone Bitcoiners Met With Centra...,http://bitcoinist.com/feed/,2022-05-30 06:59:37+00:00
209551122,Nigerian Presidential Hopeful Says He Supports...,https://news.bitcoin.com/feed/,2022-05-30 06:30:01+00:00
-564316595,TRON TVL Soars to $6 Billion as USDD Algorithi...,https://cryptopotato.com/feed,2022-05-30 06:14:24+00:00
-248330369,Staking On Pancakeswap – Earn Up To 122% APY |...,http://feeds.feedburner.com/nigeriabitcoincomm...,2022-05-30 06:00:00+00:00
154347339,Tim Draper: Women will drive the next Bitcoin ...,https://cointelegraph.com/feed,2022-05-30 06:50:09+01:00
-1722184652,10 Top Metaverse Cryptos Today,https://insidebitcoins.com/feed,2022-05-30 05:48:52+00:00
-1063029047,"Nifty News: ‘Blue-chips’ halve in value, free-...",https://cointelegraph.com/feed,2022-05-30 06:39:30+01:00
1391086857,10 Top Defi Coins Today,https://insidebitcoins.com/feed,2022-05-30 05:37:57+00:00
-189404569,"TA: Ethereum Reaches Crucial Juncture, Can The...",https://www.helenabitcoinmining.com/feed/,2022-05-30 05:08:42+00:00
600910363,Investors dumping on Terra as LUNA 2 tanks 70%...,https://cointelegraph.com/feed,2022-05-30 06:06:25+01:00


## Basic EDA and Filtering

### Selecting only news with a verb

In [14]:
def tag_verb_spacy(sent: str) -> bool:
    # https://ashutoshtripathi.com/2020/04/13/parts-of-speech-tagging-and-dependency-parsing-using-spacy-nlp/
    doc = nlp(sent)

    # True if spaCy tags at least one token as a verb
    return any(t.pos_ == 'VERB' for t in doc)

In [15]:
df['has_verb'] = df['title'].apply(lambda x: tag_verb_spacy(x))

In [16]:
df[df['has_verb']].head()

title_id,title,source,pub_time,has_verb
1385300947,BITMAIN Invited Key Customers to Visit North A...,https://blog.bitmain.com/en/feed/?x=1,2022-04-25 13:47:46+00:00,True
750874296,BITMAIN Successfully Held the 2022 Customer Ov...,https://blog.bitmain.com/en/feed/?x=1,2022-03-15 11:30:04+00:00,True
1442203391,New Batches of ANTMINER Z15 Officially Go on S...,https://blog.bitmain.com/en/feed/?x=1,2022-02-14 08:01:04+00:00,True
-838502866,The World Digital Mining Summit 2021 Gathers T...,https://blog.bitmain.com/en/feed/?x=1,2021-10-29 15:30:08+00:00,True
104331920,Bitmain Will Hold the World Digital Mining Sum...,https://blog.bitmain.com/en/feed/?x=1,2021-09-27 15:24:19+00:00,True


In [17]:
df[~df['has_verb']].head()

title_id,title,source,pub_time,has_verb
158395930,ANTMINER 2021 Customer Satisfaction Survey,https://blog.bitmain.com/en/feed/?x=1,2022-01-28 11:57:20+00:00,False
-1332721541,Customer Compensation Notice for L7 and other ...,https://blog.bitmain.com/en/feed/?x=1,2022-01-12 13:58:28+00:00,False
-2027763596,Siam Commercial Bank Enters The DeFi Space Via...,https://www.dcforecasts.com/feed/,2022-05-29 18:21:13+00:00,False
1666604457,"Cardano Slips Below $0.50, More Danger Ahead: ...",https://www.dcforecasts.com/feed/,2022-05-29 18:16:06+00:00,False
-1678680154,Bit.Com Exchange Rolls Out New Toncoin Savings...,https://www.dcforecasts.com/feed/,2022-05-29 16:17:03+00:00,False


Titles with at least one verb read more like "news", so from here on we work only with those.

In [18]:
df['has_verb'].sum()

1343

In [19]:
# .copy() avoids SettingWithCopyWarning when new columns are added later
df = df[df['has_verb']].copy()

### Ignore questions

In [20]:
def has_question_mark(string):
    return "?" in string

In [21]:
df['has_quest_mark'] = df['title'].apply(has_question_mark)

In [22]:
df['has_quest_mark'].value_counts()

False    1256
True       87
Name: has_quest_mark, dtype: int64

In [23]:
df.loc[df['has_quest_mark'], 'title'].head(10).tolist()

['Ethereum Slipped Again, What’s The Next Trading Level?',
 'What Do We Know About India’s CBDC And Its Graded Approach?',
 'What Is ApeCoin (APE) And How Does It Work?',
 'Do GM Wagmi (GM), Parody Coin (PARO) and Vita Inu (VINU) Have What it Takes to Become a Record Breaking Meme Coin?',
 'How can I protect myself against Bitcoin scams?',
 'What are people using bitcoin for?',
 'How do I get started with mining?',
 'What are altcoins and why should I care?',
 'With SpaceX Poised To Accept Dogecoin Payments For Merch Soon, Will DOGE Get Its Glory Back?',
 'DeFi Version of Telegram is Launching Market, What can Defigram Bring us?']

In [24]:
# .copy() avoids SettingWithCopyWarning when new columns are added later
df = df[~df['has_quest_mark']].copy()

### Non-English titles

In [25]:
def text_in_english_spacy(text: str):
    doc = nlp(text, disable=['ner', 'tok2vec'])
    lang, confidence = doc._.language['language'], doc._.language['score']
    return (lang == 'en') and (confidence >= 0.8)

In [26]:
df['lang_eng'] = df['title'].apply(text_in_english_spacy)

In [27]:
df['lang_eng'].value_counts()

True     1213
False      43
Name: lang_eng, dtype: int64

Many of the titles flagged as non-English are false positives: they are in fact English.

In [28]:
df.loc[~df['lang_eng'], 'title'].tolist()

['Bitcoin Latinum Backers Get Boost',
 'Flyp.me implements Monero’s sub-addresses',
 'Long Term Update: Weekly average broken down!',
 'Volume on BitMEX Spot Exchange surpasses $10M in 24 hours since launch',
 'What Warren Buffett Gets Right About Bitcoin',
 'Hashing It Out #91-Paradigm Dan Robinson',
 'Hashing It Out #89-Optimism Karl Floersch',
 'Hashing It Out #86- Ethereum Founder Vitalik Buterin',
 'Hashing It Out #83-Panvala Niran Babalola',
 'Hashing It Out #74 - SKALE Labs- Stan Kladko',
 'Hashing It Out-#69 - OpenLIbra - Lucas Geiger and Zaki Manian',
 'Hashing It Out #62 - Bolt Labs - Ayo Akinyele',
 'Hashing It Out #59 - Truebit - Jason Teutsch',
 'Hashing It Out #54 - Ethereum Geth 1.9.0 - Péter Szilágyi & Martin Swende',
 'Hashing It Out #50 - AVA Labs - Emin Gün Sirer',
 'Hashing It Out #48 - MakerDAO pt. 2 - Nik Kunkel',
 'Hashing It Out #47 - MakerDAO pt. 1 - Nik Kunkel',
 'Hashing It Out #46 - Ren - Loong Wang',
 "Hashing It Out #31: Pokt Network - Michael O'Rourke &  
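Statistical language detectors tend to be unreliable on short, name-heavy strings like these titles. One hedged option is to pair the detector with a cheap script check and drop only the titles that fail both. The sketch below shows such a check; `ascii_letter_ratio`, `looks_latin_ascii`, and the 0.9 threshold are illustrative assumptions, not values tuned on this dataset.

```python
def ascii_letter_ratio(text: str) -> float:
    # Share of alphabetic characters that are plain ASCII letters
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)


def looks_latin_ascii(text: str, threshold: float = 0.9) -> bool:
    # Hypothetical fallback check; the threshold is an assumption
    return ascii_letter_ratio(text) >= threshold


titles = [
    "Hashing It Out #54 - Ethereum Geth 1.9.0 - Péter Szilágyi & Martin Swende",
    "Bitcoin Latinum Backers Get Boost",
    "Биткоин растёт",
]
for title in titles:
    print(looks_latin_ascii(title), title)
```

This only separates mostly-ASCII Latin text from other scripts. A Spanish or German title would still pass, so the check complements the detector rather than replacing it.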

In [29]:
df.loc[df['lang_eng'], 'title'].sample(30).tolist()

['Hashing It Out #85- Gauntlet.Network Tarun Chitra',
 'Hashing It Out #64 - Plasma - Kelvin Fichter',
 'Creating a Humanist Blockchain Future #40',
 'Terra’s UST Burn Proposal Gets the Green Light',
 'SneakMart Announces A Collaboration With StockX On It’s First Metakicks Drop',
 'Purse now supports Amazon Germany',
 'Why crypto bookmakers are ideal for in play betting',
 'Cannabis Stock News: Nextleaf (CSE: OILS) Completes First Shipment to Ontario and Provides Update on British Columbia Market',
 'gm Podcast, Episode 9: Jesse Powell Releases the Kraken',
 'Water Stock News: Eco Wave Power (Nasdaq: $WAVE) Signs Agreement to Construct 2MW Wave Energy Power Station in Spain',
 'Why meme-stock speculators ditching Bitcoin is good for crypto',
 'Hashing It Out #33: Loom Network / Plasma Cash - Georgios Konstantopoulos',
 'Introducing the Coinbase Institute: Advancing the policy debate around crypto and the future of…',
 'Parity Technologies receives 5 million dollars from Ethereum Founda

In [30]:
# .copy() avoids SettingWithCopyWarning when new columns are added later
df = df[df['lang_eng']].copy()

### Sources

In [31]:
df['source'].nunique()

78

In [32]:
df['source'].value_counts().head()

https://www.investorideas.com/rss/feeds/iimain.xml    455
https://feeds.simplecast.com/xCQr3ykc                 173
https://decrypt.co/feed                                34
https://bitcoinmagazine.com/.rss/full/                 27
https://www.cryptoninjas.net/feed/                     22
Name: source, dtype: int64

### Exact dupes

There are no exact duplicates: even the most frequent title occurs only once.

In [33]:
df['title'].value_counts().head()

BITMAIN Invited Key Customers to Visit North America’s First Fully Operational Hydro Cooling Data Center                                                         1
Cannabis Stock News: Craftport Cannabis (CSE: CFT) Provides Corporate Update                                                                                     1
One-fifth of US listed SPACs averse to Chinese targets amid evolving regulatory landscape, finds GlobalData                                                      1
Biotech Stock News: Celularity (NASDAQ: CELU) Reports Fourth Quarter and Full Year 2021 Financial Results and Provides Corporate Update                          1
Pharma Stock News: Pharmadrug (CSE: PHRX) (OTCQB: LMLLF) Advances PD-001, Its Patented, Improved Version of Cepharanthine for Oncology and Infectious Disease    1
Name: title, dtype: int64

### Short news

In [34]:
df['title_len_words'] = df['title'].apply(lambda s: len(s.split()))

In [35]:
df['title_len_words'].value_counts().head()

12    145
10    139
9     128
11    117
8      98
Name: title_len_words, dtype: int64

In [36]:
len(df[df['title_len_words'] <= 5])

30

Most titles under 4 words in length make no sense. There are exceptions, of course, such as "Forget Bitcoin", "Go buy Bitcoin", and "Bitcoin crosses $50,000", but not many.

Most of the 4-word titles make sense.

In [37]:
df.loc[df['has_verb'] & (df['title_len_words']  == 4)]

title_id,title,source,pub_time,has_verb,has_quest_mark,lang_eng,title_len_words
-442349478,Methods for storing Ether!,https://newsbitcoin247.com/feed/,2022-05-26 17:47:04+00:00,True,False,True,4
343444067,Smart SEO Shopify 101,https://joeytrend.com/feed/,2021-03-22 06:30:10+00:00,True,False,True,4
-1495029528,Firstrade Adds Crypto Trading,https://www.investorideas.com/rss/feeds/iimain...,2022-04-11 09:20:00-05:00,True,False,True,4
-964189157,Optimizing for Network Success,https://vinnylingham.com/feed,2017-06-20 06:46:29+00:00,True,False,True,4
613496416,Offchain: Green with Envy,https://blog.coinjar.com/rss/,2022-04-14 01:08:49+00:00,True,False,True,4
-682059815,Perfomance of privacy-focused cryptocurrencies,https://cryptoinsider.media/feed/,2019-04-21 14:54:00+00:00,True,False,True,4


Most of the 5-word titles are not very valuable, so we keep only titles with at least 6 words.

In [38]:
df.loc[df['has_verb'] & (df['title_len_words']  == 5)]

title_id,title,source,pub_time,has_verb,has_quest_mark,lang_eng,title_len_words
-464610260,Bitcoin Latinum Pre-listed on CoinMarketCap,https://blog.coinspectator.com/feed/,2021-06-16 14:50:30+00:00,True,False,True,5
-1890267142,How to Overcome Sales Objections,https://about.crunchbase.com/blog/feed/,2022-05-26 21:37:50+00:00,True,False,True,5
-1619680382,How to Buy Cannabis Seeds,https://smartereum.com/feed/,2022-05-26 19:19:01+00:00,True,False,True,5
-1073875228,Why Do People Buy NFTs,https://www.mycryptopedia.com/feed/,2022-05-25 04:18:26+00:00,True,False,True,5
-12816746,Dose of Ether #38-Toasting Mate,https://feeds.simplecast.com/xCQr3ykc,2020-02-04 10:00:22+00:00,True,False,True,5
-1809669745,Hashing It Out #72-Althea-Jehan Tremback,https://feeds.simplecast.com/xCQr3ykc,2020-01-31 00:06:45+00:00,True,False,True,5
-362259115,Dose of Ether #13: Scalegate,https://feeds.simplecast.com/xCQr3ykc,2018-12-14 14:41:11+00:00,True,False,True,5
-1642715568,Spotify Plans To Integrate NFTs,https://www.cryptocynews.com/feed/,2022-05-28 20:12:26+00:00,True,False,True,5
-344358989,ETH struggles to find direction,https://bitcoingoldprice.blogspot.com/feeds/po...,2021-09-08 20:06:00+00:00,True,False,True,5
-1130173434,Dogecoin price approaches crucial threshold,https://bitcoingoldprice.blogspot.com/feeds/po...,2021-07-19 04:27:00+00:00,True,False,True,5


In [39]:
df = df[df['title_len_words'] >= 6]

In [40]:
len(df)

1183
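The two purely string-based rules from this section (no question mark, at least 6 words) can be collected into a single predicate, which may be handy when applying the same filter to freshly scraped titles. This is a sketch: `passes_basic_rules` is a hypothetical helper, and the verb and language checks are left out because they need the spaCy pipeline.

```python
def passes_basic_rules(title: str, min_words: int = 6) -> bool:
    # Rule 1: drop questions
    if "?" in title:
        return False
    # Rule 2: drop short titles (whitespace-separated word count)
    return len(title.split()) >= min_words


examples = [
    "What are altcoins and why should I care?",
    "Spotify Plans To Integrate NFTs",
    "Terra’s UST Burn Proposal Gets the Green Light",
]
kept = [t for t in examples if passes_basic_rules(t)]
print(kept)
```

Only the last example survives: the first is a question and the second has five words.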

In [41]:
df.columns

Index(['title', 'source', 'pub_time', 'has_verb', 'has_quest_mark', 'lang_eng',
       'title_len_words'],
      dtype='object')

In [42]:
df[['title', 'source', 'pub_time']].to_csv('../data/20220530_rss_feed.csv')