#### Author
Yury Kashnitsky

#### Reference
[Notion ticket](https://www.notion.so/a74951e4e815480584dea7d61ddce6cc?v=dbfdb1207d0e451b827d3c5041ed0cfd&p=4cc8d88f4a1349d7bcdf9be55724d4cf)

#### Idea
Scrape https://bitcointicker.co/news/ and run the model on the scraped titles to select diverse examples (both those where the model is very confident and those where it is lost) for annotation with Amazon Mechanical Turk.

#### Results
 - 3250 news titles are scraped and saved locally. From these, 2 batches of 200 titles each are selected for annotation.
 - Keeping only titles with at least one verb seems to make sense (2452 of the 3250 titles pass this filter).
 - Prediction entropy works well for telling easy examples from hard ones, but some of the hard examples turn out to be rubbish.

## Scraping bitcointicker.co – the easy way (only 50 titles)

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)
import requests
from bs4 import BeautifulSoup

In [2]:
URL = 'https://bitcointicker.co/news/'

In [3]:
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

In [4]:
parsed_titles = []

In [5]:
for el in soup.find_all('div', attrs={'style': "overflow:hidden;"}):
    news = el.get_text().split(' - ')[0].strip()
    parsed_titles.append(news)

In [6]:
len(parsed_titles)

50

In [7]:
parsed_titles[:5]

['Green Bitcoin mining proof of concept using Tesla solar and battery storage',
 'Why bitcoin ETFs are grounded',
 'Steps On To Open A Bitcoin Account In Nigeria',
 "Twitter Founder Jack Dorsey's Bitcoin Obsession, Explained",
 'Cryptocurrency Prices Today April 13: Bitcoin edges down, Binance Coin biggest gainer']

## Scraping bitcointicker.co – the harder way

[Tutorial](https://medium.com/analytics-vidhya/using-python-and-selenium-to-scrape-infinite-scroll-web-pages-825d12c24ec7)

Prereqs:
 - [chromedriver](https://chromedriver.chromium.org/downloads) (for Mac - `brew install chromedriver`)
 - `pip install selenium`

In [8]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

CHROMEDRIVER_PATH = "/usr/local/bin/chromedriver"
TOTAL_PARSING_TIME = 300  # seconds to keep scrolling; TODO: stop on the number of parsed elements instead

# Web scraper for an infinite-scroll page
# Selenium 4: `executable_path` is deprecated, so the driver path is passed via a Service object
driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))
driver.get(URL)
time.sleep(2)  # allow 2 seconds for the web page to open
scroll_pause_time = 1  # pause between scrolls, in seconds
screen_height = driver.execute_script("return window.screen.height;")  # screen height in pixels
i = 1
start_time = time.time()

while True:
    # scroll down by one screen height each iteration
    driver.execute_script(f"window.scrollTo(0, {screen_height * i});")
    i += 1
    time.sleep(scroll_pause_time)
    # Alternative stop condition (disabled): break once the scroll target passes the page height,
    # which can change after every scroll
    # scroll_height = driver.execute_script("return document.body.scrollHeight;")
    # if screen_height * i > scroll_height * 1.1:
    #     break
    if time.time() - start_time > TOTAL_PARSING_TIME:
        break

parsed_titles = []
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser once the page source is captured
for el in soup.find_all('div', attrs={'style': "overflow:hidden;"}):
    title = el.get_text().split(' - ')[0].strip()
    parsed_titles.append(title)

print(f"{len(parsed_titles)} titles parsed from {URL} in {TOTAL_PARSING_TIME} secs.")


3250 titles parsed from https://bitcointicker.co/news/ in 300 secs.
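
The loop above stops after a fixed time budget, and the comment on `TOTAL_PARSING_TIME` leaves stopping on a number of parsed elements as an open question. One possible approach, sketched here, is to separate the stop condition from the browser: the hypothetical helper `scroll_until` takes two callables (one that scrolls, one that counts the items loaded so far) and stops when a target count is reached, when the time budget runs out, or when several scrolls in a row load nothing new. None of these names come from the original notebook, and `FakePage` only stands in for a real browser session.

```python
import time
from typing import Callable


def scroll_until(
    scroll_once: Callable[[], None],
    count_items: Callable[[], int],
    target_count: int,
    max_seconds: float = 300.0,
    pause: float = 1.0,
    max_idle_scrolls: int = 5,
) -> int:
    """Scroll until `target_count` items are loaded (hypothetical helper).

    Also stops when `max_seconds` have elapsed or when `max_idle_scrolls`
    consecutive scrolls load no new items. Returns the final item count.
    """
    start = time.monotonic()
    count = count_items()
    idle = 0
    while count < target_count:
        if time.monotonic() - start > max_seconds or idle >= max_idle_scrolls:
            break
        scroll_once()
        time.sleep(pause)
        new_count = count_items()
        # count consecutive scrolls that did not load anything new
        idle = idle + 1 if new_count == count else 0
        count = new_count
    return count


class FakePage:
    """Stand-in for a browser page: each scroll loads 50 more items, up to 320."""

    def __init__(self) -> None:
        self.items = 50

    def scroll(self) -> None:
        self.items = min(self.items + 50, 320)


# Case 1: the target is reachable, so scrolling stops as soon as it is met
page = FakePage()
reached = scroll_until(page.scroll, lambda: page.items, target_count=200, pause=0)

# Case 2: the page runs out of items, so the idle-scroll guard ends the loop
exhausted_page = FakePage()
exhausted = scroll_until(
    exhausted_page.scroll, lambda: exhausted_page.items, target_count=1000, pause=0
)
```

With Selenium, `scroll_once` could wrap the `driver.execute_script("window.scrollTo(...)")` call from the cell above, and `count_items` could return `len(driver.find_elements(By.CSS_SELECTOR, 'div[style="overflow:hidden;"]'))`, with `By` imported from `selenium.webdriver.common.by`. That selector is an assumption derived from the `style` attribute used in the parsing code and has not been checked against the site.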


## Filtering the news: Selecting only news with a verb

In [10]:
test_df = pd.DataFrame({'Title': parsed_titles})

In [11]:
import nltk

def tag_verb_nltk(sent):
    # https://stackoverflow.com/questions/52324004/how-to-check-presense-of-verb-using-spacy-and-pandas
    # The exact match `tag == 'VB'` didn't work well: it only catches base-form verbs.
    # Penn Treebank verb tags all start with 'VB' (VB, VBD, VBG, VBN, VBP, VBZ),
    # so match on the prefix instead.
    words = nltk.word_tokenize(sent)
    tags = nltk.pos_tag(words)

    return any(tag.startswith('VB') for _, tag in tags)

In [12]:
import spacy
nlp = spacy.load('en_core_web_sm')

def tag_verb_spacy(sent):
    # https://ashutoshtripathi.com/2020/04/13/parts-of-speech-tagging-and-dependency-parsing-using-spacy-nlp/
    # True if spaCy tags at least one token as a verb
    doc = nlp(sent)
    return any(t.pos_ == 'VERB' for t in doc)

In [13]:
test_df['has_verb'] = test_df['Title'].apply(tag_verb_spacy)

In [14]:
test_df[test_df['has_verb']].head(10)

Unnamed: 0,Title,has_verb
0,Green Bitcoin mining proof of concept using Tesla solar and battery storage,True
1,Why bitcoin ETFs are grounded,True
2,Steps On To Open A Bitcoin Account In Nigeria,True
3,"Twitter Founder Jack Dorsey's Bitcoin Obsession, Explained",True
4,"Cryptocurrency Prices Today April 13: Bitcoin edges down, Binance Coin biggest gainer",True
6,Here's where LFG's sustained acquisition of Bitcoin is leaving LUNA,True
7,ETF bigwigs gather as assets explode and a bitcoin fund remains a no-go,True
8,North Dakota aims to draw Bitcoin miners with promise of world’s ‘cleanest crypto’,True
9,No Bitcoin: Mozilla will only accept Proof of Stake crypto donations,True
10,"This bullish setup puts Bitcoin price at $50,000 in the next week",True


In [54]:
test_df[~test_df['has_verb']].head(10)

Unnamed: 0,Title,has_verb
5,"Blackrock, Fidelity to Invest in Crypto Firm Circle's $400 Million Funding Round – Finance Bitcoin News",False
27,Bitcoin’s price history: 2009 to 2022,False
29,Best Bitcoin Mining Sites for 2022,False
35,Best Bitcoin Mining Software in 2022,False
37,Robinhood Lists Shiba Inu and 3 More Cryptocurrencies — SHIB Price Soars – Altcoins Bitcoin News,False
43,Bitcoin: Problems and Prospects | Cato at Liberty Blog,False
46,8 Bitcoin Facts: Why is This Cryptocurrency Bad for The Environment?,False
47,Bitcoin Is Venice: Capitalism Without Capitalists,False
48,90% Vote for SHIB vs. BTC in Recent Poll by Bitcoin of America ATM Chain,False
54,Bearish Investors: Bitcoin Records $132M Weekly Outflow,False


In [16]:
test_df.to_csv('../data/20220413_parsed_3250_bitcointicker_titles.csv')

Titles with at least one verb make more sense as "news", so from here on we work only with those.

In [27]:
test_df_filtered = test_df[test_df['has_verb']]

In [28]:
len(test_df_filtered)

2452

## Loading the model
We'd like to select examples for annotation based on prediction entropy, in particular examples that the model struggles to classify. The model is trained in the notebook `20220411_btc_4500_titles_fix_folds_for_validations.ipynb` or in [this repo](https://github.com/crypto-sentiment/crypto_sentiment_model_fast_api).

In [19]:
import joblib

model = joblib.load('../models/tf-idf-logreg-baseline.joblib')

## Making predictions for the crawled titles

In [20]:
from scipy.stats import entropy

In [34]:
# work on an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
test_df_filtered = test_df_filtered.copy()
test_df_filtered['predicted_sentiment'] = model.predict(test_df_filtered["Title"])
pred_probs = model.predict_proba(test_df_filtered["Title"])

# prediction entropy (natural log), computed per row over the class probabilities
pred_entropy = entropy(pred_probs, axis=1)

# merging test data with pred probs and entropy
test_df_filtered = pd.concat([test_df_filtered, 
                              pd.DataFrame(pred_probs, columns=model.named_steps['logit'].classes_,
                                          index=test_df_filtered.index)],
                            axis=1)
test_df_filtered['pred_entropy'] = pred_entropy


In [35]:
test_df_filtered['predicted_sentiment'].value_counts()

Positive    1491
Negative    713 
Neutral     248 
Name: predicted_sentiment, dtype: int64

In [36]:
test_df_filtered.head()

Unnamed: 0,Title,has_verb,predicted_sentiment,Negative,Neutral,Positive,pred_entropy
0,Green Bitcoin mining proof of concept using Tesla solar and battery storage,True,Negative,0.396809,0.220477,0.382714,1.067708
1,Why bitcoin ETFs are grounded,True,Positive,0.192555,0.180867,0.626578,0.919406
2,Steps On To Open A Bitcoin Account In Nigeria,True,Positive,0.061937,0.033662,0.9044,0.377326
3,"Twitter Founder Jack Dorsey's Bitcoin Obsession, Explained",True,Neutral,0.381594,0.436436,0.181971,1.039544
4,"Cryptocurrency Prices Today April 13: Bitcoin edges down, Binance Coin biggest gainer",True,Positive,0.423398,0.031271,0.545332,0.802909


Let's look at the examples with the highest prediction entropy; these are the cases the model is least certain about.

Observations:
 - some titles are indeed hard, e.g. "Bitcoin, Ethereum transaction fees drop 90% becoming cheaper to use": both "fees" and "drop" look negative on their own, but "fees drop" is actually positive. The model would benefit a lot from labels for such cases
 - some examples are rubbish, either not really "news" or neutral news, e.g. "Intel's Second-Gen Bitcoin Miner's Performance and Pricing Listed", "Trust Through Verification: The Establishment’s Nightmare"

In [37]:
test_df_filtered.sort_values(by='pred_entropy', ascending=False).head(10)

Unnamed: 0,Title,has_verb,predicted_sentiment,Negative,Neutral,Positive,pred_entropy
2845,Intel's Second-Gen Bitcoin Miner's Performance and Pricing Listed,True,Positive,0.332262,0.329662,0.338076,1.098557
3140,"LIVE BLOG: Stocks, Bitcoin volatile amid Russia-Ukraine tensions, oil, natural gas jumps",True,Negative,0.34523,0.32069,0.33408,1.098159
226,"""There Will Be Many Things You Did Not Predict"": Jordan Peterson Critiques Bitcoin at Bitcoin 2022",True,Positive,0.330857,0.321662,0.347481,1.0981
1822,"Bitcoin, Ethereum transaction fees drop 90% becoming cheaper to use",True,Positive,0.322409,0.328323,0.349268,1.098018
3143,"Cardano is 47,000x more energy-efficient than Bitcoin, data shows",True,Neutral,0.317704,0.348848,0.333448,1.097884
2618,"CryptoWorldCon, the Largest Conference Focused on Blockchain, Crypto, NFT, Metaverse, Bitcoin, Will Be Happening in Miami",True,Neutral,0.336844,0.349672,0.313485,1.097597
1766,"Bitcoin could form the backbone for CBDCs, according to a new report",True,Positive,0.315996,0.329612,0.354392,1.097481
1869,Trust Through Verification: The Establishment’s Nightmare,True,Positive,0.329217,0.315983,0.3548,1.09745
2170,Why Bitcoin- And Ethereum-Related Stock Marqeta Is Skyrocketing During Wednesday's After-Hours Session,True,Neutral,0.316709,0.356757,0.326534,1.097316
3220,Bitcoin and Stock Futures Feel Choppy Following Heightened Russia-Ukraine Dispute,True,Neutral,0.3114,0.35529,0.33331,1.097167


Now let's look at the examples with the lowest prediction entropy; these are the cases where the model is very confident in its prediction.

In [39]:
test_df_filtered.sort_values(by='pred_entropy', ascending=True).head(10)

Unnamed: 0,Title,has_verb,predicted_sentiment,Negative,Neutral,Positive,pred_entropy
2999,Bitcoin drops 8% and other cryptocurrencies tumble after Russia attacks Ukraine,True,Negative,0.988761,0.003519,0.007721,0.068608
2369,"Market Watch: Bitcoin Dips Below $38K, Fantom (FTM) Tumbles 15% Following Andre Cronje’s Leave",True,Negative,0.984598,0.008657,0.006745,0.090116
1667,NFT Weekly Sales Volume Improves Jumping 17% Higher Than the Week Prior – Bitcoin News,True,Positive,0.007534,0.009584,0.982882,0.098345
376,4 Cryptos to Buy as Bitcoin Surges Higher,True,Positive,0.012881,0.005468,0.981651,0.102719
3006,Bitcoin price falls after Russia attacks Ukraine,True,Negative,0.980997,0.007802,0.011201,0.107001
2078,Bitcoin cashpoints forced to shut down after being declared illegal,True,Negative,0.979825,0.006901,0.013274,0.111682
2764,Kazakhstan government shuts down 13 illegal Bitcoin mines,True,Negative,0.97989,0.010627,0.009484,0.112375
1037,Bitcoin climbs to highest value in 2022 and sees market cap hit $900bn,True,Positive,0.01284,0.007531,0.979628,0.112904
1714,"Bitcoin price holds above $40,000",True,Positive,0.01444,0.006229,0.979331,0.11328
3000,Bitcoin's price falls after Russia attacks Ukraine,True,Negative,0.977968,0.011486,0.010546,0.121097


In [42]:
len(test_df_filtered[test_df_filtered['pred_entropy'] > 1])

521

## Selecting data for annotation

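For annotation, we select 300 "easy" cases (entropy <= 0.6, confident predictions) and 100 "hard" cases (entropy > 1, uncertain predictions), then shuffle them.

In [48]:
# confident predictions ("easy" cases)
easy_df = test_df_filtered[test_df_filtered['pred_entropy'] <= 0.6].sample(300, random_state=17)
# uncertain predictions ("hard" cases)
hard_df = test_df_filtered[test_df_filtered['pred_entropy'] > 1].sample(100, random_state=17)
# concatenate and shuffle so that each batch mixes easy and hard cases
sample_df = pd.concat([easy_df, hard_df]).sample(frac=1, random_state=17).reset_index(drop=True)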

In [49]:
len(sample_df)

400
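
`DataFrame.sample(n)` raises an error when a bucket holds fewer than `n` rows, which could happen if the thresholds or the scraped data change. The following is a minimal standard-library sketch of the same selection logic with the sample sizes capped at the bucket sizes. `select_for_annotation` and its parameter names are illustrative, the toy entropies are made up for the example, and the random draws will not match the ones pandas produces with `random_state=17`.

```python
import random


def select_for_annotation(records, n_easy, n_hard, easy_max=0.6, hard_min=1.0, seed=17):
    """Pick confident ("easy") and uncertain ("hard") records, then shuffle them.

    `records` is a list of (title, entropy) pairs. Sample sizes are capped at
    the number of records available in each bucket.
    """
    rng = random.Random(seed)
    easy = [r for r in records if r[1] <= easy_max]
    hard = [r for r in records if r[1] > hard_min]
    picked = rng.sample(easy, min(n_easy, len(easy)))
    picked += rng.sample(hard, min(n_hard, len(hard)))
    rng.shuffle(picked)  # mix easy and hard cases before batching
    return picked


# Toy data: 10 confident titles, 5 in-between ones, 4 uncertain ones
toy = (
    [(f"easy title {i}", 0.1 + 0.05 * i) for i in range(10)]
    + [(f"medium title {i}", 0.8) for i in range(5)]
    + [(f"hard title {i}", 1.05) for i in range(4)]
)

# Ask for more hard cases than exist: the request is capped instead of failing
selection = select_for_annotation(toy, n_easy=6, n_hard=10)
n_easy_picked = sum(1 for _, e in selection if e <= 0.6)
n_hard_picked = sum(1 for _, e in selection if e > 1.0)
print(len(selection), n_easy_picked, n_hard_picked)
```

Titles whose entropy falls between the two thresholds are left out on purpose, mirroring the gap between `<= 0.6` and `> 1` in the cell above.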

In [50]:
sample_df.head()

Unnamed: 0,Title,has_verb,predicted_sentiment,Negative,Neutral,Positive,pred_entropy
0,Luna Foundation Guard raises $1 billion to form bitcoin reserve for UST stablecoin,True,Positive,0.041531,0.033519,0.92495,0.318102
1,Mexico Is Going to Accept Bitcoin,True,Positive,0.019635,0.014969,0.965396,0.174069
2,Titans become first NFL team to accept Bitcoin as payment,True,Positive,0.037177,0.00985,0.952973,0.213803
3,Warrant arrest Bitcoin scam targeting registered sex offenders in Mat-Su,True,Negative,0.940806,0.017296,0.041898,0.260503
4,Celsius Unveils Wrapped Bitcoin at Bitcoin 2022,True,Positive,0.189047,0.304545,0.506407,1.021559


In [53]:
sample_df.head(200)['Title'].to_csv('../data/20220413_sample_to_annotate_batch_1.csv', index=False)
sample_df.tail(200)['Title'].to_csv('../data/20220413_sample_to_annotate_batch_2.csv', index=False)