# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?
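
As a minimal sketch of how descriptive statistics answer these questions, the block below summarizes per-article scores using only the standard library. The score values and the `summarize` and `best_coin` helpers are hypothetical, chosen purely for illustration; in the notebook the same figures would come from calling `describe()` on each coin's sentiment DataFrame.

```python
from statistics import mean

# Hypothetical per-article VADER-style scores (illustrative values, not real results)
bitcoin_scores = [
    {"compound": 0.40, "positive": 0.10, "negative": 0.00},
    {"compound": -0.30, "positive": 0.00, "negative": 0.30},
    {"compound": 0.00, "positive": 0.05, "negative": 0.05},
]
ethereum_scores = [
    {"compound": 0.60, "positive": 0.20, "negative": 0.00},
    {"compound": 0.10, "positive": 0.04, "negative": 0.02},
    {"compound": -0.50, "positive": 0.00, "negative": 0.25},
]

def summarize(rows):
    """Return the mean and max of each score column (hypothetical helper)."""
    columns = rows[0].keys()
    return {
        col: {"mean": mean(r[col] for r in rows), "max": max(r[col] for r in rows)}
        for col in columns
    }

summary = {"Bitcoin": summarize(bitcoin_scores), "Ethereum": summarize(ethereum_scores)}

def best_coin(column, stat):
    """Pick the coin with the largest value of the given statistic."""
    return max(summary, key=lambda coin: summary[coin][column][stat])

print("Highest mean positive:", best_coin("positive", "mean"))
print("Highest negative:", best_coin("negative", "max"))
print("Highest positive:", best_coin("positive", "max"))
```

Comparing the `mean` and `max` rows side by side is usually enough to answer all three questions without further computation.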

In [1]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
from datetime import datetime, timedelta
import nltk
nltk.download('vader_lexicon')
from newsapi.newsapi_client import NewsApiClient
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
import string
from collections import Counter
from nltk import ngrams
from nltk.corpus import stopwords, reuters
%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ananthigokul/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# Load the .env environment variables and read the News API key
load_dotenv()

# Keep the key out of the notebook output: it is a secret, so do not print it
news_api_key = os.getenv("NEWS_API_KEY")


In [3]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=news_api_key) 

In [38]:
# Fetch the Bitcoin news articles

# Set the current date and the date three days ago using the ISO format
current_date = pd.Timestamp(datetime.now(), tz="America/New_York").isoformat()
past_date = pd.Timestamp(datetime.now() - timedelta(days=3), tz="America/New_York").isoformat()

# For each of the past three days, fetch up to 20 of the most relevant headlines published since that day
def get_news_articles(keyword):
    all_headlines = []
    all_dates = []    
    date = datetime.strptime(current_date[:10], "%Y-%m-%d")
    end_date = datetime.strptime(past_date[:10], "%Y-%m-%d")
    print(f"Fetching news about '{keyword}'")
    print("*" * 30)
    while date > end_date:
        print(f"retrieving news from: {date}")
        articles = newsapi.get_everything(
            q=keyword,
            from_param=str(date)[:10],
            #to=str(date)[:10],
            language="en",
            sort_by="relevancy",
            page=1,
        )
        # Collect the title of every article returned for this date
        headlines = [article["title"] for article in articles["articles"]]
        all_headlines.append(headlines)
        all_dates.append(date)
        date = date - timedelta(days=1)
    return all_headlines, all_dates

news = get_news_articles('Bitcoin')


Fetching news about 'Bitcoin'
******************************
retrieving news from: 2021-08-09 00:00:00
retrieving news from: 2021-08-08 00:00:00
retrieving news from: 2021-08-07 00:00:00
([[], ["Why a Waste-Coal Power Plant is 'Burning for Bitcoin'", 'Blockchain could change the world. This $20 course package could show you how.', '$300 Billion Crypto Price Boom: Bitcoin Is Suddenly Soaring Toward $50,000 As Ethereum, BNB, Cardano, XRP, Dogecoin And Uniswap Surge', 'Crypto Price Prediction: Dogecoin ‘Pump And Dump’ Cycle Could Send The Memecoin Soaring By The End Of 2021', 'The Crypto Daily – Movers and Shakers – August 8th, 2021', 'Bitcoin Long-Term Buy Indicator Just Flashed as BTC Faces Critical Resistance (Price Analysis) - CryptoPotato', 'Bitcoin Trader’s Quietly Using In-cloud Apps for Steady Profits!', 'Stablecoins: Risks & regulatory imperatives', "Why a Waste-Coal Power Plant is 'Burning for Bitcoin'", 'What is ethereum’s London hard fork & how it will impact the crypto world?

In [39]:
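# get_news_articles returns a (headlines, dates) tuple, so build a DataFrame before calling head()
headlines, dates = news
news_df = pd.DataFrame({"date": dates, "headlines": headlines})
news_df.head()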

In [5]:
# Fetch the Ethereum news articles
#get_news_articles('Ethereum')
ethereum_headlines = newsapi.get_everything(q="Ethereum", language="en")
#print(ethereum_headlines)
print(ethereum_headlines['articles'][1]['description'])

# Fetch the Bitcoin news articles
bitcoin_headlines = newsapi.get_everything(q="Bitcoin", language="en")

Blockchain infrastructure startups are heating up as industry fervor brings more developers and users to a space that still feels extremely young despite a heavy institutional embrace of the crypto space in 2021. The latest crypto startup to court the attenti…


In [34]:
# Create the Bitcoin sentiment scores DataFrame
def headline_sentiment_summarizer_avg(headlines):
    """Return the mean VADER compound score for each day's list of headlines."""
    sentiment = []
    for day in headlines:
        # Score every headline that is not missing
        day_score = [
            analyzer.polarity_scores(h)["compound"] for h in day if h is not None
        ]
        # Guard against days with no headlines to avoid dividing by zero
        sentiment.append(sum(day_score) / len(day_score) if day_score else 0.0)
    return sentiment

#bitcoin_headlines = newsapi.get_top_headlines(q="bitcoin", language="en", country="ca")
bitcoin_headlines = newsapi.get_everything(q="Bitcoin", language="en")
#print(bitcoin_headlines)
bitcoin_df = pd.DataFrame.from_dict(bitcoin_headlines["articles"])


bitcoin_df_first = bitcoin_df.head(1)  # equivalent to bitcoin_df.iloc[:1]
print(bitcoin_df_first['content'])

0    When my wife started a little garden in our ur...
Name: content, dtype: object



In [7]:
# Create the Ethereum sentiment scores DataFrame
# YOUR CODE HERE!

In [8]:
# Describe the Bitcoin Sentiment
# YOUR CODE HERE!

In [9]:
# Describe the Ethereum Sentiment
# YOUR CODE HERE!

### Questions:

Q: Which coin had the highest mean positive score?

A: 

Q: Which coin had the highest compound score?

A: 

Q: Which coin had the highest positive score?

A: 

---

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.
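
The three steps above can be sketched without NLTK to show what each one does. This is a simplified illustration, not the notebook's tokenizer: `simple_tokenize` and the tiny `STOP_WORDS` set are hypothetical stand-ins for `word_tokenize` and NLTK's English stop-word list.

```python
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

print(simple_tokenize("The price of Bitcoin is rising, and Ethereum follows!"))
# ['price', 'bitcoin', 'rising', 'ethereum', 'follows']
```

Lowercasing before the stop-word check matters: a capitalized "The" would otherwise slip past a lowercase stop-word list.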

In [10]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [11]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
stop_words = set(stopwords.words('english')) 
print (len(stop_words))

# Expand the default stopwords list if necessary
# YOUR CODE HERE!

179


In [12]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text: strips punctuation, lemmatizes, lowercases, and drops stop words."""

    # Remove punctuation and digits (anything that is not a letter or a space)
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Create a tokenized list of the words
    word_tokens = word_tokenize(re_clean)

    # Lemmatize words into root words
    lem = [lemmatizer.lemmatize(word) for word in word_tokens]

    # Convert the words to lowercase and remove the stop words
    tokens = [word.lower() for word in lem if word.lower() not in stop_words]

    # Debug view: raw tokens with stop words removed (case-sensitive, not lemmatized)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    print("\n\nOriginal Sentence \n\n")
    print(" ".join(word_tokens))

    print("\n\nFiltered Sentence \n\n")
    print(" ".join(filtered_sentence))

    return tokens

In [13]:
# Create a new tokens column for Bitcoin
text_df = bitcoin_df_first['content'].to_frame()
text_bitcoin = text_df['content'].astype(str)
print (text_bitcoin)
#tokens = tokenizer(text) 

0    When my wife started a little garden in our ur...
Name: content, dtype: object


In [14]:
# Create a new tokens column for Ethereum
text_ethereum = ethereum_headlines['articles'][1]['description']
print (text_ethereum)
tokens = tokenizer(text_ethereum) 
#print (news[0][1])

Blockchain infrastructure startups are heating up as industry fervor brings more developers and users to a space that still feels extremely young despite a heavy institutional embrace of the crypto space in 2021. The latest crypto startup to court the attenti…


Original Sentence 


Blockchain infrastructure startups are heating up as industry fervor brings more developers and users to a space that still feels extremely young despite a heavy institutional embrace of the crypto space in The latest crypto startup to court the attenti


Filtered Sentence 


Blockchain infrastructure startups heating industry fervor brings developers users space still feels extremely young despite heavy institutional embrace crypto space The latest crypto startup court attenti


---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
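
A bigram simply pairs each token with the one that follows it, and a frequency count over those pairs (or over the single tokens) gives the "top N" lists. The sketch below shows the idea with the standard library only; the `bigrams` helper and the sample tokens are hypothetical, and in the notebook `nltk.ngrams(tokens, n=2)` plays the same role.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with its successor, like nltk.ngrams(tokens, n=2)."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical, already-cleaned tokens
tokens = ["crypto", "space", "latest", "crypto", "space", "startup"]

bigram_counts = Counter(bigrams(tokens))
word_counts = Counter(tokens)

print(bigram_counts.most_common(1))  # [(('crypto', 'space'), 2)]
print(word_counts.most_common(2))    # [('crypto', 2), ('space', 2)]
```

With a single short description every bigram tends to occur once, so the counts only become informative once all of a coin's articles are combined.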

In [15]:
def process_text(doc):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', doc)
    words = word_tokenize(re_clean)
    lem = [lemmatizer.lemmatize(word) for word in words]
    output = [word.lower() for word in lem if word.lower() not in sw]
    return output

In [16]:
# Generate the Bitcoin N-grams where N=2
# text_bitcoin is a one-row Series, so pass its first element (a string) to process_text
processed_text_bitcoin = process_text(text_bitcoin.iloc[0])
print(processed_text_bitcoin)

bigram_counts_bitcoin = Counter(ngrams(processed_text_bitcoin, n=2))
print(dict(bigram_counts_bitcoin))


In [17]:
# Generate the Ethereum N-grams where N=2
processed_text_ethereum = process_text(text_ethereum)
print(processed_text_ethereum)

bigram_counts = Counter(ngrams(processed_text_ethereum, n=2))
print(dict(bigram_counts))

['blockchain', 'infrastructure', 'startup', 'heating', 'industry', 'fervor', 'brings', 'developer', 'user', 'space', 'still', 'feel', 'extremely', 'young', 'despite', 'heavy', 'institutional', 'embrace', 'crypto', 'space', 'latest', 'crypto', 'startup', 'court', 'attenti']
{('blockchain', 'infrastructure'): 1, ('infrastructure', 'startup'): 1, ('startup', 'heating'): 1, ('heating', 'industry'): 1, ('industry', 'fervor'): 1, ('fervor', 'brings'): 1, ('brings', 'developer'): 1, ('developer', 'user'): 1, ('user', 'space'): 1, ('space', 'still'): 1, ('still', 'feel'): 1, ('feel', 'extremely'): 1, ('extremely', 'young'): 1, ('young', 'despite'): 1, ('despite', 'heavy'): 1, ('heavy', 'institutional'): 1, ('institutional', 'embrace'): 1, ('embrace', 'crypto'): 1, ('crypto', 'space'): 1, ('space', 'latest'): 1, ('latest', 'crypto'): 1, ('crypto', 'startup'): 1, ('startup', 'court'): 1, ('court', 'attenti'): 1}


In [18]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [19]:
# Use token_count to get the top 10 words for Bitcoin
# YOUR CODE HERE!

In [20]:
# Get the 10 most common Ethereum bigrams (token_count on the tokens would give the top 10 words)
print(dict(bigram_counts.most_common(10)))

{('blockchain', 'infrastructure'): 1, ('infrastructure', 'startup'): 1, ('startup', 'heating'): 1, ('heating', 'industry'): 1, ('industry', 'fervor'): 1, ('fervor', 'brings'): 1, ('brings', 'developer'): 1, ('developer', 'user'): 1, ('user', 'space'): 1, ('space', 'still'): 1}


---

### Word Clouds

In this section, you will generate word clouds for each coin to summarize the news for each coin.
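
A word cloud draws frequent words larger than rare ones. As a rough sketch of the information it works from, the block below turns a token list into relative weights. The `relative_frequencies` helper and the sample tokens are hypothetical; the `wordcloud` library computes its own weights internally when `generate` is called on a text string.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Scale each token's count by the most frequent token's count."""
    counts = Counter(tokens)
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

# Hypothetical, already-cleaned tokens
tokens = ["crypto", "space", "crypto", "startup", "crypto", "space"]

weights = relative_frequencies(tokens)
print(weights)
```

Here "crypto" gets weight 1.0 and would be drawn largest, while "startup" gets one third of that.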

In [21]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [22]:
# Generate the Bitcoin word cloud
# YOUR CODE HERE!

In [33]:
# Generate the Ethereum word cloud
#https://monash.bootcampcontent.com/monash-coding-bootcamp/monu-mel-virt-fin-pt-05-2021-u-c/-/blob/master/Activities/Week%2012/2/07-Ins_Tone_Analysis/Solved/tone_analysis.ipynb
# WordCloud.generate expects a single string, so join the processed Ethereum tokens
input_text = ' '.join(processed_text_ethereum)

wc = WordCloud().generate(input_text)
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using spaCy.
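
Once spaCy has tagged a document, each entity carries a text span and a label such as `GPE` or `ORG`. The sketch below shows one way to group entities by label using plain tuples. The entity pairs are hypothetical examples shaped like `(ent.text, ent.label_)` from `doc.ents`, not actual model output, and `group_by_label` is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical (text, label) pairs, shaped like (ent.text, ent.label_) from doc.ents
entities = [
    ("Bitcoin", "ORG"),
    ("El Salvador", "GPE"),
    ("2021", "DATE"),
    ("Ethereum", "ORG"),
    ("London", "GPE"),
]

def group_by_label(pairs):
    """Collect entity texts under their label, preserving first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

grouped = group_by_label(entities)
print(grouped["GPE"])  # ['El Salvador', 'London']
```

Grouping by label makes it easy to see when a filter such as `ent.label_ == 'GPE'` returns nothing because the text contains no entities of that type.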

In [26]:
import spacy
from spacy import displacy

In [27]:
# Download the language model for spaCy
# !python -m spacy download en_core_web_sm

In [28]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [29]:
# Concatenate all of the Bitcoin text together
# Set article to be analyzed with spaCy
doc = nlp(text_ethereum)  # TODO: replace the Ethereum placeholder with the Bitcoin text

In [30]:
# Run the NER processor on all of the text
displacy.render(doc, style='ent')

# Add a title to the document
# YOUR CODE HERE!

In [31]:
# Render the visualization
# YOUR CODE HERE!

In [32]:
# List all geopolitical (GPE) entities
print([ent.text for ent in doc.ents if ent.label_ == 'GPE'])

[]


---

### Ethereum NER

In [None]:
# Concatenate all of the Ethereum text together
# YOUR CODE HERE!

In [None]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [None]:
# Render the visualization
# YOUR CODE HERE!

In [None]:
# List all Entities
# YOUR CODE HERE!

---