#### Author
Yury Kashnitsky

#### Reference
[Notion ticket](https://www.notion.so/a74951e4e815480584dea7d61ddce6cc?v=dbfdb1207d0e451b827d3c5041ed0cfd&p=9be8fa739ceb4eecbcd0000a6ae12fbc)

#### Idea
Scrape https://bitcointicker.co/news/ to get more unlabeled data. Perform basic EDA to come up with good rules for filtering. 

#### Results
 - ~27k news titles are saved locally and to the `cryptonews.bitcointicker_news` Mongo collection
 - Filtering based on presence of at least one verb seems to make sense
 - Most titles shorter than 4 words make no sense, so keeping only titles with 4 or more words is a reasonable cutoff
 - Placeholder titles need to be removed. E.g. "Bitcoin daily chart alert" is seen >90 times.
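
The three rules above can be combined into a single filter. The following is a minimal sketch, not the code used to produce the results: the helper name `filter_titles` and the `max_repeats` threshold are assumptions made for illustration, and the `has_verb` column is expected to be computed beforehand (as done in the EDA section).

```python
import pandas as pd


def filter_titles(df, min_words=4, max_repeats=5):
    """Hypothetical helper: keep titles that have a verb, are long enough,
    and are not repeated so often that they look like placeholders.

    `max_repeats` is an assumed threshold and should be tuned on the data.
    """
    # Number of whitespace-separated words in each title
    n_words = df["Title"].str.split().str.len()
    # How many times each exact title occurs in the whole frame
    n_repeats = df["Title"].map(df["Title"].value_counts())
    mask = df["has_verb"] & (n_words >= min_words) & (n_repeats <= max_repeats)
    return df.loc[mask]


# Toy data for illustration only
toy = pd.DataFrame({
    "Title": ["Ukraine Has Legalized Bitcoin", "Buy Bitcoin Now"]
             + ["Bitcoin daily chart alert"] * 6,
    "has_verb": [True] * 8,
})
filtered = filter_titles(toy)
```

A count-based rule avoids hardcoding individual placeholder strings, at the cost of possibly dropping a genuine headline that was syndicated many times.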

## Scraping bitcointicker.co – the easy way (only 50 titles)

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)
import requests
from bs4 import BeautifulSoup

In [2]:
URL = 'https://bitcointicker.co/news/'

In [3]:
def parse_bitcointicker(soup_object):
    """Extract titles, sources and publication times from a parsed news page."""
    parsed_titles, sources, pub_times = [], [], []

    # getting titles and sources; each entry reads "<title> - <source>"
    for el in soup_object.find_all('div', attrs={'style': "overflow:hidden;"}):
        elem_text = el.get_text()
        title = elem_text.split(' - ')[0].strip()
        source = elem_text.split(' - ')[-1].strip()
        parsed_titles.append(title)
        sources.append(source)

    # getting publication times
    for el in soup_object.find_all('div', attrs={'style': "margin-left:30px"}):
        elem_text = el.get_text()
        if "Posted" in elem_text:   # e.g. "Posted: 2022-04-15 06:00:19\n0 Comments"
            pub_time = elem_text.strip().split('\n')[0].replace('Posted:', '').strip()
            pub_times.append(pub_time)

    # the three lists must have equal length, otherwise pandas raises a ValueError
    df = pd.DataFrame({'Title': parsed_titles, 'Source': sources, 'Pubtime': pub_times})

    return df

In [4]:
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
df = parse_bitcointicker(soup_object=soup)

In [6]:
len(df)

50

In [7]:
df.head()

Unnamed: 0,Title,Source,Pubtime
0,Binance and Coinbase silent on Bitcoin Lightning: Community tries to understand why,Cointelegraph,2022-04-15 06:43:53
1,Fumb Games Mobile App Bitcoin Miner Integrates Real BTC Rewards via Zebedee – Bitcoin News,Bitcoin News,2022-04-15 06:00:19
2,US Links North Korean Hacker Groups To Axie Infinity Crypto Heist; Bitcoin Down By 3%,Outlook India,2022-04-15 04:24:20
3,"Cryptocurrency Prices Today April 15: Bitcoin, Ethereum edge lower; Dogecoin bucks trend",Moneycontrol,2022-04-15 03:50:41
4,"Crypto Taxes: Online Tax Software to Report Bitcoin, NFTs and Digital Currencies",CNET,2022-04-15 01:15:03


## Scraping bitcointicker.co – the harder way

[Tutorial](https://medium.com/analytics-vidhya/using-python-and-selenium-to-scrape-infinite-scroll-web-pages-825d12c24ec7)

Prereqs:
 - [chromedriver](https://chromedriver.chromium.org/downloads) (for Mac - `brew install chromedriver`)
 - `pip install selenium`

In [8]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

CHROMEDRIVER_PATH = "/usr/local/bin/chromedriver"
TOTAL_PARSING_TIME = 3600  # seconds to keep scrolling; TODO: stop after a target number of parsed titles instead

# Web scraper for an infinite-scroll page
# Selenium 4: the driver path is passed via Service (executable_path is deprecated)
driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))
driver.get(URL)
time.sleep(2)  # allow 2 seconds for the web page to open
scroll_pause_time = 1  # pause between scrolls, in seconds
screen_height = driver.execute_script("return window.screen.height;")  # screen height in pixels
i = 1
start_time = time.time()

while True:
    # scroll down by one screen height each iteration
    driver.execute_script(f"window.scrollTo(0, {screen_height * i});")
    i += 1
    time.sleep(scroll_pause_time)
    # the scroll height can grow after each scroll as more news is loaded
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    # Alternative stop condition: break once we have scrolled past the end of the page
    # if screen_height * i > scroll_height * 1.1:
    #     break
    if time.time() - start_time > TOTAL_PARSING_TIME:
        break

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser once the page source is captured
df = parse_bitcointicker(soup_object=soup)
print(f"{len(df)} titles parsed from {URL} in {TOTAL_PARSING_TIME} secs.")


27550 titles parsed from https://bitcointicker.co/news/ in 3600 secs.
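
The loop above stops on elapsed time only. One way to stop on the number of loaded titles instead is sketched below. This is an untested-on-the-site sketch under stated assumptions: `scroll_until` and its parameters are hypothetical names, and the two callbacks are injected so the loop logic can be checked without a browser.

```python
import time


def scroll_until(scroll_once, count_items, target_items,
                 max_seconds=3600, pause=1.0, max_stalled=5, sleep=time.sleep):
    """Hypothetical helper: scroll until `target_items` are loaded, the page
    stops growing for `max_stalled` scrolls in a row, or time runs out.

    Returns the number of items counted when the loop ends.
    """
    start = time.monotonic()
    count = count_items()
    stalled = 0
    while count < target_items and time.monotonic() - start < max_seconds:
        scroll_once()
        sleep(pause)  # give the page time to load new entries
        new_count = count_items()
        # Track consecutive scrolls that loaded nothing new
        stalled = stalled + 1 if new_count == count else 0
        count = new_count
        if stalled >= max_stalled:
            break
    return count


# With Selenium the callbacks could look like this (assumption: the title
# divs keep the exact style attribute used in parse_bitcointicker):
#   from selenium.webdriver.common.by import By
#   scroll_once = lambda: driver.execute_script(
#       "window.scrollTo(0, document.body.scrollHeight);")
#   count_items = lambda: len(driver.find_elements(
#       By.CSS_SELECTOR, 'div[style="overflow:hidden;"]'))
#   scroll_until(scroll_once, count_items, target_items=27000)
```

Counting elements through the driver on every iteration gets slower as the page grows, so checking the count only every few scrolls may be preferable for long runs.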


In [9]:
df.head()

Unnamed: 0,Title,Source,Pubtime
0,Binance and Coinbase silent on Bitcoin Lightning: Community tries to understand why,Cointelegraph,2022-04-15 06:43:53
1,Fumb Games Mobile App Bitcoin Miner Integrates Real BTC Rewards via Zebedee – Bitcoin News,Bitcoin News,2022-04-15 06:00:19
2,US Links North Korean Hacker Groups To Axie Infinity Crypto Heist; Bitcoin Down By 3%,Outlook India,2022-04-15 04:24:20
3,"Cryptocurrency Prices Today April 15: Bitcoin, Ethereum edge lower; Dogecoin bucks trend",Moneycontrol,2022-04-15 03:50:41
4,"Crypto Taxes: Online Tax Software to Report Bitcoin, NFTs and Digital Currencies",CNET,2022-04-15 01:15:03


## Basic EDA and Filtering

### Selecting only news with a verb

In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')

def tag_verb_spacy(sent):
    """Return True if spaCy tags at least one token of `sent` as a verb."""
    # https://ashutoshtripathi.com/2020/04/13/parts-of-speech-tagging-and-dependency-parsing-using-spacy-nlp/
    doc = nlp(sent)
    return any(t.pos_ == 'VERB' for t in doc)

In [12]:
df['has_verb'] = df['Title'].apply(lambda x: tag_verb_spacy(x))
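
Calling `nlp` once per title is slow for ~27k titles. A possible speed-up is to stream the titles through spaCy's `nlp.pipe`, which processes texts in batches. The sketch below is an illustration rather than the notebook's method: `has_verb_batch` and the `batch_size` value are assumptions, and the loaded pipeline is passed in as an argument.

```python
def has_verb_batch(titles, nlp, batch_size=256):
    """Hypothetical helper: return one boolean per title, True if spaCy
    tags at least one token as VERB. `nlp` is a loaded spaCy pipeline."""
    return [
        any(tok.pos_ == "VERB" for tok in doc)
        for doc in nlp.pipe(titles, batch_size=batch_size)
    ]


# Usage with the objects defined in this notebook:
#   import spacy
#   nlp = spacy.load('en_core_web_sm')
#   df['has_verb'] = has_verb_batch(df['Title'].tolist(), nlp)
```

The result should match the per-title version, since the same tagger is applied to each text; only the batching differs.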

In [13]:
df[df['has_verb']].head()

Unnamed: 0,Title,Source,Pubtime,has_verb
0,Binance and Coinbase silent on Bitcoin Lightning: Community tries to understand why,Cointelegraph,2022-04-15 06:43:53,True
1,Fumb Games Mobile App Bitcoin Miner Integrates Real BTC Rewards via Zebedee – Bitcoin News,Bitcoin News,2022-04-15 06:00:19,True
3,"Cryptocurrency Prices Today April 15: Bitcoin, Ethereum edge lower; Dogecoin bucks trend",Moneycontrol,2022-04-15 03:50:41,True
4,"Crypto Taxes: Online Tax Software to Report Bitcoin, NFTs and Digital Currencies",CNET,2022-04-15 01:15:03,True
5,"Saylor Calls MicroStrategy’s Bitcoin Play ‘Tremendous Success,’ Promises More BTC Buys",CoinDesk,2022-04-14 21:09:00,True


In [14]:
df[~df['has_verb']].head()

Unnamed: 0,Title,Source,Pubtime,has_verb
2,US Links North Korean Hacker Groups To Axie Infinity Crypto Heist; Bitcoin Down By 3%,Outlook India,2022-04-15 04:24:20,False
6,Clean Bitcoin: North Dakota To Double Its $3B Green Crypto Mining Efforts,Benzinga,2022-04-14 21:08:50,False
11,Bitcoin Well CEO and CRO to Present at Alberta Technology Symposium,GlobeNewswire,2022-04-14 20:30:00,False
12,Escambia Co. Sheriff: Beware of Bitcoin scams,WKRG News 5,2022-04-14 20:28:26,False
18,Report: Arcane Crypto Releases Bitcoin Lightning Network Research,Bitcoin Magazine,2022-04-14 18:59:39,False


In [15]:
df.to_csv('../data/20220415_parsed_27k_bitcointicker_news_feb2021_apr2022.csv')

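Titles with at least one verb read more like actual news, so the rest of the analysis uses only those. The tagger is not perfect, though: "US Links North Korean Hacker Groups To Axie Infinity Crypto Heist; Bitcoin Down By 3%" is flagged as verb-less above, presumably because the title-cased "Links" is read as a noun.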

In [16]:
df['has_verb'].sum()

21070

### Sources

In [17]:
df['Source'].nunique()

1511

In [18]:
df['Source'].value_counts().head()

Yahoo Finance       2225
Bitcoin News        1991
Cointelegraph       1683
Bitcoin Magazine    1580
CoinDesk            1517
Name: Source, dtype: int64

### Exact dupes

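Some titles are placeholders repeated many times, e.g. "Bitcoin daily chart alert" appears 91 times. They carry no news content and should be filtered out.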

In [20]:
df['Title'].value_counts().head(10)

Bitcoin daily chart alert                                                                                            91
Crypto Chartbook                                                                                                     17
Bitcoin and Ethereum                                                                                                 15
Bitcoin                                                                                                              7 
Bitcoin and Ripple's XRP                                                                                             6 
Bitcoin millionaire puts money on Greens in German election                                                          4 
Bitcoin Conference coming to College Station virtually, talking technology, business, finance, law and regulation    3 
Breakingviews                                                                                                        3 
Bitcoin Fails to Break $60K; Short-Term 
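
Placeholder candidates can be listed automatically by counting exact repeats. The sketch below uses only the standard library; the helper name `find_repeated_titles` and the `min_count` threshold are assumptions for illustration, and the returned titles still need a manual look, since a genuine headline can also be syndicated several times.

```python
from collections import Counter


def find_repeated_titles(titles, min_count=5):
    """Hypothetical helper: return {title: count} for titles occurring at
    least `min_count` times, most frequent first.

    `min_count` is an assumed threshold, not a value from the analysis.
    """
    counts = Counter(t.strip() for t in titles)
    return {title: n for title, n in counts.most_common() if n >= min_count}


# Usage with the notebook's dataframe:
#   placeholders = find_repeated_titles(df['Title'])
#   df = df[~df['Title'].isin(list(placeholders))]
```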

### Short news

In [21]:
df['title_len_words'] = df['Title'].apply(lambda s: len(s.split()))

In [24]:
df['title_len_words'].value_counts().head()

10    3664
11    3330
9     3203
12    2840
8     2548
Name: title_len_words, dtype: int64

In [25]:
len(df[df['title_len_words'] < 4])

160

Most titles shorter than 4 words make no sense. There are exceptions, of course, such as "Forget Bitcoin", "Go buy Bitcoin", and "Bitcoin crosses $50,000", but they are few.

In [35]:
df.loc[df['has_verb'] & (df['title_len_words'] < 4)]

Unnamed: 0,Title,Source,Pubtime,has_verb,title_len_words
231,De-anonymizing Bitcoin,Security Boulevard,2022-04-11 12:53:26,True,2
478,Bitcoin Up Review,Business 2 Community,2022-04-07 08:06:02,True,3
841,UPDATE,GlobeNewswire,2022-04-01 10:00:00,True,1
2918,Busting bitcoin bribers,The Star Online,2022-02-27 23:17:45,True,3
3379,Bitcoin Versions Compared,iExpats.com,2022-02-21 00:46:18,True,3
4299,Buy Bitcoin Now,Forbes,2022-02-06 17:15:00,True,3
4534,What to Buy,Analytics Insight,2022-02-02 15:16:34,True,3
6096,Bitcoin is misbehaving,FXStreet,2022-01-10 06:56:23,True,3
10608,Do you Bitcoin?,Business Insider India,2021-10-29 10:36:32,True,3
11413,Bitcoin Is Undervalued,Seeking Alpha,2021-10-18 13:51:00,True,3


Most of the 4-word titles make sense.

In [33]:
df.loc[df['has_verb'] & (df['title_len_words'] == 4) &
       (df['Title'] != 'Bitcoin daily chart alert')].sample(30)

Unnamed: 0,Title,Source,Pubtime,has_verb,title_len_words
2629,Understanding Bitcoin's Energy Consumption,Bitcoin Magazine,2022-03-04 04:30:00,True,4
25111,Analyzing Bitcoin's Network Effect,Seeking Alpha,2021-03-11 16:41:00,True,4
14032,Ukraine Has Legalized Bitcoin,Bitcoin Magazine,2021-09-08 22:25:44,True,4
7929,Watch: The Satoshi Mystery,RTE.ie,2021-12-08 11:16:34,True,4
1073,Bitcoin: Buy the Dip?,The Motley Fool,2022-03-29 11:15:00,True,4
26042,Bitcoin: Buying The Pullback?,FX Empire,2021-02-25 07:54:13,True,4
5227,Why bitcoin just crashed,New Zealand Herald,2022-01-22 20:00:00,True,4
963,Bitcoin Exposes Monetary Aristocrats,Bitcoin Magazine,2022-03-30 19:00:00,True,4
26842,"Bitcoin $50,000 milestone nears",Livemint,2021-02-15 08:33:29,True,4
26915,"Bitcoin, Tesla And Gamestop",Forbes,2021-02-13 14:41:37,True,4


Most of the 5-word titles make sense.

In [30]:
df.loc[df['has_verb'] & (df['title_len_words'] == 5)].sample(30)

Unnamed: 0,Title,Source,Pubtime,has_verb,title_len_words
27,How can Bitcoin affect Prada?,FingerLakes1.com,2022-04-14 16:43:03,True,5
12269,Bitcoin rises 5% to $51711,Reuters,2021-10-05 22:24:00,True,5
10027,USU bitcoin club challenges stereotypes,Utahstatesman,2021-11-07 15:00:00,True,5
20768,DarkSide Hackers’ Bitcoin Stash Tracked,Yahoo Finance,2021-05-17 20:21:47,True,5
8913,Bitcoin plunge destroys gold proposition,NEWS.com.au,2021-11-23 03:40:25,True,5
8069,Bitcoin scam claims another victim,GuelphToday,2021-12-06 15:17:36,True,5
12602,Bitcoin rises 5.2% to $43717,Reuters,2021-09-30 22:18:00,True,5
1830,"Bitcoin price holds above $40,000",Fox Business,2022-03-17 06:54:39,True,5
26881,Bitcoin’s rise reflects America’s decline,Financial Times,2021-02-14 14:00:00,True,5
27203,Why did Tesla buy bitcoin?,MarketWatch,2021-02-09 20:23:00,True,5
