# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.
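
For orientation, VADER's compound score ranges from -1 to 1, and the VADER documentation suggests ±0.05 as conventional cutoffs for calling a text positive or negative. The sketch below shows that bucketing; the function name `label_compound` is hypothetical and not part of NLTK, and the input scores are sample values.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label.

    The 0.05 default follows the cutoff suggested in the VADER documentation;
    treat it as a convention, not a hard rule.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Sample compound scores, used only for illustration
for score in (0.6908, 0.128, -0.5994, 0.0):
    print(score, label_compound(score))
```

The labels are only a reading aid; the descriptive statistics below work on the raw scores.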

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?
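
One possible way to answer these questions is to place the two `describe()` tables side by side. The sketch below assumes two small, made-up DataFrames (`coin_a` and `coin_b` are hypothetical stand-ins, not the notebook's data) and reads the `mean` and `max` rows from the combined table.

```python
import pandas as pd

# Hypothetical sentiment scores, used only to illustrate the comparison
coin_a = pd.DataFrame({"Positive": [0.10, 0.00, 0.20], "Negative": [0.00, 0.30, 0.10]})
coin_b = pd.DataFrame({"Positive": [0.05, 0.15, 0.25], "Negative": [0.20, 0.00, 0.05]})

# Stack the two describe() tables side by side under a coin-level column key
summary = pd.concat(
    {"coin_a": coin_a.describe(), "coin_b": coin_b.describe()}, axis=1
)

# Pull out the rows that answer the questions: mean and max per score column
print(summary.loc[["mean", "max"]])

# Name of the coin with the highest mean positive score
mean_positive = summary.loc["mean"].xs("Positive", level=1)
print(mean_positive.idxmax())
```

The same pattern should apply to the real DataFrames once they exist, by swapping in their `describe()` outputs.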

In [1]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/christydain/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# Load the .env file, then read your api key environment variable
load_dotenv()
api_key = os.getenv("NEWS_API_KEY")

In [3]:
# Create a newsapi client
from newsapi import NewsApiClient

# Pass the key read from the environment instead of hard-coding it
newsapi = NewsApiClient(api_key=api_key)



In [4]:
# Fetch the Bitcoin news articles

bitcoin_headlines = newsapi.get_everything(
    q="bitcoin AND BTC",
    language="en",
#    page_size=100
)


In [5]:
# Fetch the Ethereum news articles

Ethereum_headlines = newsapi.get_everything(
    q="ethereum AND ETH",
    language="en",
#    page_size=100
)

In [6]:
# Create the Bitcoin sentiment scores DataFrame

bitcoin_sentiments = []

for article in bitcoin_headlines["articles"]:
    try:
        sentiment = analyzer.polarity_scores(article['content'])
      
        bitcoin_sentiments.append({
            "Text": article["content"],
            "Compound": sentiment["compound"],
            "Positive": sentiment["pos"],
            "Negative": sentiment["neg"],
            "Neutral": sentiment["neu"]
            
        })
        
    except AttributeError:
        pass
    
btc_df = pd.DataFrame(bitcoin_sentiments)


cols =["Compound", "Negative", "Neutral", "Positive", "Text"]
btc_df = btc_df[cols]

btc_df.head()

Unnamed: 0,Compound,Negative,Neutral,Positive,Text
0,0.6908,0.0,0.831,0.169,"It's all about clean energy, it seems. \r\nElo..."
1,0.5574,0.0,0.893,0.107,"Several crypto fans that descended on Miami, F..."
2,0.128,0.0,0.957,0.043,El Salvador has become the first country in th...
3,0.5859,0.0,0.866,0.134,By Reuters Staff\r\nJune 13 (Reuters) - Tesla ...
4,-0.5994,0.126,0.874,0.0,"Bitcoin hit a two-week peak just shy of $40,00..."


In [7]:
# Create the Ethereum sentiment scores DataFrame

Ethereum_sentiments = []

for article in Ethereum_headlines["articles"]:
    try:
        sentiment = analyzer.polarity_scores(article['content'])
      
        Ethereum_sentiments.append({
            "Text": article["content"],
            "Compound": sentiment["compound"],
            "Positive": sentiment["pos"],
            "Negative": sentiment["neg"],
            "Neutral": sentiment["neu"]
            
        })
        
    except AttributeError:
        pass
    
# Create DataFrame
eth_df = pd.DataFrame(Ethereum_sentiments)

# Reorder DataFrame columns
cols =["Compound", "Negative", "Neutral", "Positive", "Text"]
eth_df = eth_df[cols]

eth_df.head()

Unnamed: 0,Compound,Negative,Neutral,Positive,Text
0,-0.34,0.066,0.934,0.0,This article was translated from our Spanish e...
1,0.3612,0.0,0.935,0.065,"Sir Tim Berners-Lee, credited as the inventor ..."
2,0.3182,0.0,0.927,0.073,"Neither the author, Kai Morris, nor this websi..."
3,0.0,0.0,1.0,0.0,ENS stands for Ethereum Name Service and it is...
4,0.0,0.0,1.0,0.0,"In February 2021, Figma CEO Dylan Fields sold ..."


In [8]:
# Describe the Bitcoin Sentiment
btc_df.describe()

Unnamed: 0,Compound,Negative,Neutral,Positive
count,20.0,20.0,20.0,20.0
mean,-0.04509,0.0486,0.91175,0.03965
std,0.417796,0.058344,0.055107,0.05261
min,-0.7184,0.0,0.831,0.0
25%,-0.34,0.0,0.8655,0.0
50%,0.0,0.0305,0.9205,0.0
75%,0.17,0.0665,0.943,0.06775
max,0.6908,0.162,1.0,0.169


In [9]:
# Describe the Ethereum Sentiment
eth_df.describe()

Unnamed: 0,Compound,Negative,Neutral,Positive
count,20.0,20.0,20.0,20.0
mean,0.13575,0.0297,0.91195,0.05835
std,0.312333,0.050573,0.080361,0.056367
min,-0.4404,0.0,0.775,0.0
25%,0.0,0.0,0.85225,0.0
50%,0.05135,0.0,0.931,0.066
75%,0.3612,0.06225,1.0,0.0755
max,0.7531,0.151,1.0,0.211


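### Questions:

Q: Which coin had the highest mean positive score?

A: Ethereum, with a mean positive score of 0.05835 versus 0.03965 for Bitcoin.

Q: Which coin had the highest compound score?

A: Ethereum, with a maximum compound score of 0.7531 versus 0.6908 for Bitcoin.

Q: Which coin had the highest positive score?

A: Ethereum, with a maximum positive score of 0.211 versus 0.169 for Bitcoin.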

---

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove punctuation.
3. Remove stop words.
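
The three steps above can be sketched without NLTK to make the order of operations visible. This is a simplified stand-in, not the notebook's tokenizer: `simple_tokenize` and the tiny `STOP_WORDS` set are hypothetical, and whitespace splitting replaces `word_tokenize`.

```python
import re

# Tiny hypothetical stop word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def simple_tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, strip non-letters, split on whitespace, drop stop words."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z ]", "", lowered)
    return [word for word in letters_only.split() if word not in stop_words]


print(simple_tokenize("The price of Bitcoin is rising, and fast!"))
# ['price', 'bitcoin', 'rising', 'fast']
```

Lowercasing first keeps the stop word check case-insensitive, which is why it comes before the filter.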

In [10]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [11]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Helper that strips non-letters, tokenizes, lowercases, and removes NLTK's English stop words
def clean_text(article):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    
    re_clean = regex.sub('', article)
    words = word_tokenize(re_clean)
    output = [word.lower() for word in words if word.lower() not in sw]
    return output


# Expand the default stopwords list if necessary
addl_stopwords = [',', '', 'https', 'http', 'btc', 'bitcoin', 'eth', 'ethereum']


In [12]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""

    # Build the stop word set: NLTK's English list plus the custom additions
    sw = set(stopwords.words('english') + addl_stopwords)

    # Remove the punctuation from text
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Create a tokenized list of the words
    words = word_tokenize(re_clean)

    # Lemmatize words into root words
    lem = [lemmatizer.lemmatize(word) for word in words]

    # Convert the words to lowercase and remove the stop words
    tokens = [word.lower() for word in lem if word.lower() not in sw]

    return tokens


In [13]:
# Create a new tokens column for Bitcoin
btc_df['Tokens'] = btc_df['Text'].apply(tokenizer)
btc_df.head()

In [None]:
# Create a new tokens column for Ethereum
eth_df['Tokens'] = eth_df['Text'].apply(tokenizer)
eth_df.head()

---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequency for each coin.

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
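
As a point of reference, a bigram count can be sketched with the standard library alone: pairing each token with its successor gives the same pairs that `nltk.ngrams(tokens, 2)` yields. The helper name `count_bigrams` and the sample tokens are hypothetical.

```python
from collections import Counter


def count_bigrams(tokens):
    """Count adjacent token pairs (n-grams with N = 2)."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical token list, used only for illustration
tokens = ["bitcoin", "price", "rose", "bitcoin", "price", "fell"]
bigram_counts = count_bigrams(tokens)

# Most frequent bigram and the two most frequent single words
print(bigram_counts.most_common(1))
print(Counter(tokens).most_common(2))
```

`Counter.most_common(N)` covers both tasks: it ranks bigram tuples and single words the same way.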

In [None]:
from collections import Counter
from nltk import ngrams

In [None]:
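# Generate the Bitcoin N-grams where N=2
# Bigrams are built per article so pairs never span two articles
btc_bigrams = Counter(bg for tokens in btc_df['Tokens'] for bg in ngrams(tokens, 2))
btc_bigrams.most_common(10)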

In [None]:
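# Generate the Ethereum N-grams where N=2
# Bigrams are built per article so pairs never span two articles
eth_bigrams = Counter(bg for tokens in eth_df['Tokens'] for bg in ngrams(tokens, 2))
eth_bigrams.most_common(10)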

In [None]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [None]:
# Use token_count to get the top 10 words for Bitcoin
btc_all_tokens = [token for tokens in btc_df['Tokens'] for token in tokens]
token_count(btc_all_tokens, N=10)

In [None]:
# Use token_count to get the top 10 words for Ethereum
eth_all_tokens = [token for tokens in eth_df['Tokens'] for token in tokens]
token_count(eth_all_tokens, N=10)


---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
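# Newer matplotlib releases renamed the seaborn styles; fall back to the old name if needed
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)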

import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]


In [None]:
# Generate the Bitcoin word cloud
corpus = btc_df["Text"]

def process_text(doc):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', doc)
    words = word_tokenize(re_clean)
    lem = [lemmatizer.lemmatize(word) for word in words]
    output = [word.lower() for word in lem if word.lower() not in sw]
    return ' '.join(output)

big_string = ' '.join(corpus)
input_text = process_text(big_string)

wc = WordCloud().generate(input_text)
plt.imshow(wc)

In [None]:
# Generate the Ethereum word cloud
corpus = eth_df["Text"]

big_string = ' '.join(corpus)
input_text = process_text(big_string)

wc = WordCloud().generate(input_text)
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using spaCy.
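
Once entities are extracted, it can help to group them by label before reading the visualization. The sketch below assumes the entities are already available as `(text, label)` pairs, which with spaCy could be built as `[(ent.text, ent.label_) for ent in doc.ents]`; the `group_entities` helper and the sample pairs are hypothetical.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, label) pairs by label, keeping unique texts in first-seen order."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Hypothetical pairs; with spaCy these would come from
# [(ent.text, ent.label_) for ent in doc.ents]
sample = [("El Salvador", "GPE"), ("Tesla", "ORG"), ("Miami", "GPE"), ("Tesla", "ORG")]
print(group_entities(sample))
```

Grouping by label makes it easier to scan, for example, all places (`GPE`) or all organizations (`ORG`) mentioned in the coverage.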

In [None]:
import spacy
from spacy import displacy

In [None]:
# Download the language model for SpaCy
!python -m spacy download en_core_web_sm

In [None]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [None]:
# Concatenate all of the Bitcoin text together
btc_corpus = btc_df["Text"].str.cat(sep=" ")


In [None]:
# Run the NER processor on all of the text
btc_ner = nlp(btc_corpus)

# Add a title to the document
btc_ner.user_data["title"] = "Bitcoin NER"

In [None]:
# Render the visualization
displacy.render(btc_ner, style='ent')

In [None]:
# List all Entities
btc_ents = set(ent.text for ent in btc_ner.ents)
btc_ents

---

### Ethereum NER

In [None]:
# Concatenate all of the Ethereum text together
eth_corpus = eth_df["Text"].str.cat(sep=" ")

In [None]:
# Run the NER processor on all of the text
eth_ner = nlp(eth_corpus)

# Add a title to the document
eth_ner.user_data["title"] = "Ethereum NER"

In [None]:
# Render the visualization
displacy.render(eth_ner, style='ent')


In [None]:
# List all Entities
eth_ents = set(ent.text for ent in eth_ner.ents)
eth_ents


---