# Sentiment Analysis of Top Google News Articles for keyword bitcoin
---

This notebook outlines how I analyzed the sentiment of each article in the daily article lists for the keyword **bitcoin**, covering every date from January 7, 2014 through December 12, 2017.
> You can view how I scraped these articles in this [notebook](Google News Scraper.ipynb).

To compute the sentiment scores I used Google's [Cloud Natural Language](https://cloud.google.com/natural-language/) API.
> **Note**: This notebook was run locally after configuring the local system according to the steps described [here](https://cloud.google.com/natural-language/docs/quickstart).

## Introduction

Sentiment analysis attempts to determine the overall attitude (positive or negative) expressed within the text. Sentiment is represented by numerical score and magnitude values.

### Sentiment Analysis Response Fields

A sample analyzeSentiment response to the [Gettysburg Address](https://en.wikipedia.org/wiki/Gettysburg_Address) is shown below:

![alt text](analyzeSentiment JSON response.png "analyzeSentiment JSON response")

These field values are described below:

* **documentSentiment** contains the overall sentiment of the document, which consists of the following fields:
    * **score** of the sentiment ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text.
    * **magnitude** indicates the overall strength of emotion (both positive and negative) within the given text, between **0.0** and **+inf**. Unlike **score**, **magnitude** is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's **magnitude** (so longer text blocks may have greater magnitudes).
* **language** contains the language of the document, either passed in the initial request, or automatically detected if absent.
* **sentences** contains a list of the sentences extracted from the original document, which contains:
    * **sentiment** contains the sentence level sentiment values attached to each sentence, which contain **score** and **magnitude** values as described above.
    
For the Gettysburg Address, a **score** of **0.2** indicates a document that is slightly positive in emotion, while a **magnitude** of **3.6** indicates a relatively emotional document given its small size (about a paragraph). Note that the first sentence of the Gettysburg Address has a very high positive **score** of **0.8**.
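
To make the field layout concrete, here is a minimal sketch that reads those fields from a response-shaped Python dictionary. The dictionary is hand-written for illustration, not real API output: the document score (0.2), document magnitude (3.6) and first-sentence score (0.8) come from the text above, while the sentence magnitude, the truncated sentence text and the `summarize_response` helper are assumptions.

```python
# Hand-written stand-in for an analyzeSentiment JSON response (not real API output).
sample_response = {
    "documentSentiment": {"score": 0.2, "magnitude": 3.6},
    "language": "en",
    "sentences": [
        {
            "text": {"content": "Four score and seven years ago ..."},
            # score is taken from the text above; magnitude is a placeholder assumption
            "sentiment": {"score": 0.8, "magnitude": 0.8},
        },
    ],
}


def summarize_response(response):
    """Return (document score, document magnitude, list of sentence scores)."""
    doc = response["documentSentiment"]
    sentence_scores = [s["sentiment"]["score"] for s in response.get("sentences", [])]
    return doc["score"], doc["magnitude"], sentence_scores


score, magnitude, sentence_scores = summarize_response(sample_response)
print(score, magnitude, sentence_scores)
```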

### Interpreting sentiment analysis values

The **score** of a document's sentiment indicates the overall emotion of a document. The **magnitude** of a document's sentiment indicates how much emotional content is present within the document, and this value is often proportional to the length of the document.

It is important to note that the Natural Language API indicates differences between positive and negative emotion in a document, but does not identify specific positive and negative emotions. For example, "angry" and "sad" are both considered negative emotions. However, when the Natural Language API analyzes text that is considered "angry", or text that is considered "sad", the response only indicates that the sentiment in the text is negative, not "sad" or "angry".

A document with a neutral score (around **0.0**) may indicate a low-emotion document, or may indicate mixed emotions, with both high positive and negative values which cancel each other out. Generally, you can use **magnitude** values to disambiguate these cases, as truly neutral documents will have a low **magnitude** value, while mixed documents will have higher **magnitude** values.

When comparing documents to each other (especially documents of different length), make sure to use the **magnitude** values to calibrate your **scores**, as they can help you gauge the relevant amount of emotional content.
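
The interpretation rules above can be sketched as a small helper. This is only an illustration: the function name `describe_sentiment` and both cutoff values are assumptions chosen for the example, not values defined by the Natural Language API.

```python
def describe_sentiment(score, magnitude, score_cutoff=0.25, magnitude_cutoff=2.0):
    """Label a (score, magnitude) pair. Both cutoffs are illustrative assumptions."""
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # Near-zero score: magnitude separates low-emotion text from mixed text.
    return "mixed" if magnitude >= magnitude_cutoff else "neutral"


for pair in [(0.8, 3.0), (-0.6, 4.0), (0.0, 0.1), (0.0, 4.0)]:
    print(pair, describe_sentiment(*pair))
```

Because magnitude tends to grow with document length, a fixed magnitude cutoff like this one is only reasonable when the documents being compared are of similar size.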

## Read the Data

All article CSVs were renamed so that they sort alphabetically, which makes it easier to load the individual monthly CSV files for 2015-2017 in a predictable order.
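
As a rough illustration of why the naming matters, the sketch below compares how `sorted` orders file names with and without zero-padded month numbers. The file names are hypothetical; the actual names in the `Articles` folders may follow a different scheme.

```python
# Hypothetical file names; the real ones in the Articles folders may differ.
unpadded = ["articles_2015_2.csv", "articles_2015_10.csv", "articles_2015_1.csv"]
padded = ["articles_2015_02.csv", "articles_2015_10.csv", "articles_2015_01.csv"]

sorted_unpadded = sorted(unpadded)  # month 10 lands before month 2
sorted_padded = sorted(padded)      # months come out in calendar order

print(sorted_unpadded)
print(sorted_padded)
```

The later cells slice the sorted file list by position (for example `all_files[0:7]`), so a stable, predictable order is what makes those slices meaningful.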

### 2014

In [1]:
import pandas as pd
from ast import literal_eval

# pd.set_option('display.max_rows', 500) ##Uncomment if you want to view entire dataframe

# Year 2014 Articles
df_2014 = pd.read_csv('Articles/2014/articles_2014.csv', converters={"Articles": literal_eval})

# Function to strip '\n\n' from each article and drop empty list items
def clean_article_list(articles_list):
    articles_list[:] = [article.replace('\n\n','') for article in articles_list if article != ""]
    return articles_list

# Applying above function    
df_2014['Articles'] = df_2014['Articles'].apply(clean_article_list)

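# Keep dates after 01/06/2014 to match the date range of the historical bitcoin price data
# (string comparison: relies on every Date being a zero-padded MM/DD/2014 string)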
df_2014 = df_2014[df_2014['Date']>'01/06/2014']

df_2014.head()

Unnamed: 0,Date,Articles
6,01/07/2014,[Move over Dogecoin: the Herncoin is here. But...
7,01/08/2014,[Pop star Lily Allen has voiced her regret at ...
8,01/09/2014,[The world's first insured bitcoin storage ser...
9,01/10/2014,[Less than 30 minutes after online retailer Ov...
10,01/11/2014,"[The price of bitcoin reached $1,000 again tod..."
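
The date filter in the cell above compares `Date` values as strings. That is fine as long as every row is a zero-padded MM/DD/YYYY string from the same year, which appears to be the case for this file, but it would misbehave across years. A minimal standard-library sketch with made-up dates shows the difference; in pandas, parsing the column with `pd.to_datetime` before comparing would be the analogous step.

```python
from datetime import datetime

cutoff = "01/06/2014"
dates = ["12/31/2013", "01/06/2014", "01/07/2014"]  # illustrative values

# String comparison looks only at characters, so a 2013 date passes the filter.
as_strings = [d for d in dates if d > cutoff]

# Parsing first compares actual calendar dates.
fmt = "%m/%d/%Y"
cutoff_date = datetime.strptime(cutoff, fmt)
as_dates = [d for d in dates if datetime.strptime(d, fmt) > cutoff_date]

print(as_strings)
print(as_dates)
```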


### 2015

In [2]:
import os, glob

path = 'Articles/2015'
all_files = sorted(glob.glob(os.path.join(path, "*.csv")))

# Year 2015 Articles
df_2015 = pd.concat((pd.read_csv(f, converters={"Articles": literal_eval}, encoding='windows-1252', index_col=None, header=0) for f in all_files))

df_2015['Articles'] = df_2015['Articles'].apply(clean_article_list)

df_2015.head()

Unnamed: 0,Date,Articles
0,01/01/2015,[The disappearance of 99% of the bitcoins miss...
1,01/02/2015,[[Editor's note: The original version of this ...
2,01/03/2015,"[So, it happened. After some time moving in a ..."
3,01/04/2015,[2014 was a bad year for people who own bitcoi...
4,01/05/2015,[UPDATE (5th January 09:43 GMT): Bitstamp has ...


### 2016

In [3]:
import os, glob
path = 'Articles/2016'
all_files = sorted(glob.glob(os.path.join(path, "*.csv")))

# Year 2016 Articles
df_2016_1 = pd.concat((pd.read_csv(f, converters={"Articles": literal_eval}, encoding='windows-1252', index_col=None, header=0) for f in all_files[0:7]))
df_2016_2 = pd.concat((pd.read_csv(f, converters={"Articles": literal_eval}, index_col=None, header=0) for f in all_files[7:]))
df_2016 = pd.concat([df_2016_1,df_2016_2])

df_2016['Articles'] = df_2016['Articles'].apply(clean_article_list)

df_2016.head()

Unnamed: 0,Date,Articles
0,01/01/2016,[Bitcoin Weekly Recap 1-1-2016Happy New Year f...
1,01/02/2016,[The most epochal financial transaction of thi...
2,01/03/2016,"[Like most coffee shops, Java Express in Dougl..."
3,01/04/2016,[Mike Tyson’s Bitcoin ambitions don’t stop wit...
4,01/05/2016,[Movie star and former heavyweight champion of...


### 2017

In [4]:
import os, glob
path = 'Articles/2017'
all_files = sorted(glob.glob(os.path.join(path, "*.csv")))

# Year 2017 Articles
df_2017_1 = pd.concat((pd.read_csv(f, converters={"Articles": literal_eval}, index_col=None, header=0) for f in all_files[0:2]))
df_2017_2 = pd.concat((pd.read_csv(f, converters={"Articles": literal_eval}, encoding='windows-1252', index_col=None, header=0) for f in all_files[2:]))
df_2017 = pd.concat([df_2017_1,df_2017_2])

df_2017['Articles'] = df_2017['Articles'].apply(clean_article_list)

df_2017.head()

Unnamed: 0,Date,Articles
0,01/01/2017,[The price of bitcoin inched upward over the c...
1,01/02/2017,[LONDON (Reuters) - Digital currency bitcoin k...
2,01/03/2017,[Image copyright Getty ImagesDigital currency ...
3,01/04/2017,"[Bitcoin hit an all-time high Wednesday, accor..."
4,01/05/2017,[Bitcoin is off the lowest levels of its plung...


### Final Dataframe

In [5]:
df_bitcoin_articles = pd.concat([df_2014, df_2015, df_2016, df_2017])
df_bitcoin_articles['Num of Articles'] = df_bitcoin_articles['Articles'].map(lambda x: len(x))
df_bitcoin_articles.head()

Unnamed: 0,Date,Articles,Num of Articles
6,01/07/2014,[Move over Dogecoin: the Herncoin is here. But...,10
7,01/08/2014,[Pop star Lily Allen has voiced her regret at ...,10
8,01/09/2014,[The world's first insured bitcoin storage ser...,10
9,01/10/2014,[Less than 30 minutes after online retailer Ov...,10
10,01/11/2014,"[The price of bitcoin reached $1,000 again tod...",8


### Writing final dataframe to CSV

In [6]:
# Output has been computed and stored in the Articles/Combined folder
# Uncomment if you want to compute output again

# df_bitcoin_articles.to_csv('bitcoin_news_articles_2014_2017.csv', index=False)

## Compute Sentiment Scores and Classify News Articles

The cell below processes the 2016 articles; apply the same logic to the 2014, 2015 and 2017 dataframes.

In [7]:
# Imports the Google Cloud client library
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="My-First-Project-3377f41be4cf.json"

# Instantiates a client
client = language.LanguageServiceClient()

def compute_sentiment_scores(articles_list):
    # Replaces the list contents in place with [score, magnitude] pairs
    try:
        scored = []
        for article in articles_list:
            # One API call per article; score and magnitude come from the same response
            sentiment = client.analyze_sentiment(document=language.types.Document(content=article,type='PLAIN_TEXT',), encoding_type='UTF32',).document_sentiment
            scored.append([sentiment.score, sentiment.magnitude])
        articles_list[:] = scored
    except Exception:
        # Caution: removing while iterating skips items, so some raw articles stay in the list
        for article in articles_list:
            articles_list.remove(article)
    return articles_list

df_2016['Article[Sentiment,Magnitude]'] = df_2016['Articles'].apply(compute_sentiment_scores)
df_2016.head()

Unnamed: 0,Date,Articles,"Article[Sentiment,Magnitude]"
0,01/01/2016,"[[-0.10000000149011612, 7.900000095367432], [0...","[[-0.10000000149011612, 7.900000095367432], [0..."
1,01/02/2016,"[[0.0, 16.700000762939453], [0.0, 13.800000190...","[[0.0, 16.700000762939453], [0.0, 13.800000190..."
2,01/03/2016,"[[0.10000000149011612, 9.0], [0.10000000149011...","[[0.10000000149011612, 9.0], [0.10000000149011..."
3,01/04/2016,"[[0.0, 1.899999976158142], [0.0, 13.3000001907...","[[0.0, 1.899999976158142], [0.0, 13.3000001907..."
4,01/05/2016,"[[0.20000000298023224, 8.600000381469727], [0....","[[0.20000000298023224, 8.600000381469727], [0...."
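
Note that in the output above the `Articles` column shows the same score pairs as the new column, because the function rewrites each list in place. As an alternative sketch (not the approach used in this notebook), the helper below scores articles one at a time, returns a new list, and keeps track of articles whose request failed. `score_articles`, `analyze` and `fake_analyze` are hypothetical names; in practice `analyze` would wrap `client.analyze_sentiment`.

```python
def score_articles(articles, analyze):
    """Return ([score, magnitude] pairs, failed articles) without mutating the input.

    `analyze` is a hypothetical callable returning a (score, magnitude) tuple.
    """
    scored, failed = [], []
    for article in articles:
        try:
            score, magnitude = analyze(article)  # one call per article
        except Exception:
            failed.append(article)  # keep the text so it can be retried later
        else:
            scored.append([score, magnitude])
    return scored, failed


def fake_analyze(text):
    """Stand-in for the API so the sketch runs offline."""
    if not text.strip():
        raise ValueError("empty document")
    return (0.1, 1.5)


original = ["bitcoin rallies", "   "]
scored, failed = score_articles(original, fake_analyze)
print(scored, failed)
```

Handling failures per article means one bad document does not discard the scores already computed for the rest of that day.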


In [8]:
df_2016.to_csv('articles_sentiments_2016.csv', index=False)

### Read and merge CSVs for article sentiments

In [94]:
import pandas as pd
from ast import literal_eval

articles_sentiments_2014 = pd.read_csv('Article Sentiments/articles_sentiments_2014.csv', converters={"Article[Sentiment,Magnitude]": literal_eval})
articles_sentiments_2014 = articles_sentiments_2014.drop(['Articles'],axis=1)
articles_sentiments_2015 = pd.read_csv('Article Sentiments/articles_sentiments_2015.csv', converters={"Article[Sentiment,Magnitude]": literal_eval})
articles_sentiments_2015 = articles_sentiments_2015.drop(['Articles'],axis=1)
articles_sentiments_2016 = pd.read_csv('Article Sentiments/articles_sentiments_2016.csv', converters={"Article[Sentiment,Magnitude]": literal_eval})
articles_sentiments_2016 = articles_sentiments_2016.drop(['Articles'],axis=1)
articles_sentiments_2017 = pd.read_csv('Article Sentiments/articles_sentiments_2017.csv', converters={"Article[Sentiment,Magnitude]": literal_eval})
articles_sentiments_2017 = articles_sentiments_2017.drop(['Articles'],axis=1)

In [95]:
articles_sentiments_1 = pd.concat([articles_sentiments_2014,articles_sentiments_2015])
articles_sentiments_2 = pd.concat([articles_sentiments_2016,articles_sentiments_2017])

articles_sentiments = pd.concat([articles_sentiments_1,articles_sentiments_2])
        
articles_sentiments.head(7)

Unnamed: 0,Date,"Article[Sentiment,Magnitude]"
0,01/07/2014,"[[0.0, 17.5], [0.0, 10.100000381469727], [-0.3..."
1,01/08/2014,"[[-0.30000001192092896, 1.399999976158142], [0..."
2,01/09/2014,"[[0.0, 7.199999809265137], [-0.400000005960464..."
3,01/10/2014,"[[0.20000000298023224, 10.199999809265137], [-..."
4,01/11/2014,"[[0.10000000149011612, 9.399999618530273], [0...."
5,01/12/2014,"[This past week, online retailer Overstock.com..."
6,01/13/2014,"[[-0.20000000298023224, 2.200000047683716], [0..."
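
Row 5 (01/12/2014) still contains raw article text rather than score pairs. One plausible explanation, assuming an API call raised an exception for that date, is the fallback loop in `compute_sentiment_scores`: removing items from a list while iterating over it skips every other element, so part of the list survives. A minimal illustration with placeholder strings:

```python
articles = ["a", "b", "c", "d"]  # placeholder article texts

# Same pattern as the fallback loop: mutate the list that is being iterated.
for article in articles:
    articles.remove(article)

print(articles)  # → ['b', 'd']

# Clearing the list explicitly removes everything.
leftover = ["a", "b", "c", "d"]
leftover.clear()
print(leftover)  # → []
```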


### Clean the Combined Articles Sentiment CSV

In [96]:
# Imports the Google Cloud client library
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="My First Project-c0f75529e52c.json"

# Instantiates a client
client = language.LanguageServiceClient()


def compute_sentiment_scores(articles_list):
    # Replaces the list contents in place with [score, magnitude] pairs
    try:
        scored = []
        for article in articles_list:
            # One API call per article; score and magnitude come from the same response
            sentiment = client.analyze_sentiment(document=language.types.Document(content=article,type='PLAIN_TEXT',), encoding_type='UTF32',).document_sentiment
            scored.append([sentiment.score, sentiment.magnitude])
        articles_list[:] = scored
    except Exception:
        # Caution: removing while iterating skips items, so some raw articles stay in the list
        for article in articles_list:
            articles_list.remove(article)
    return articles_list

for dt in ['01/12/2014','01/19/2014','01/25/2014','02/01/2014','02/09/2014','02/16/2014','03/02/2014','03/15/2014','03/22/2014','03/29/2014','04/05/2014','04/11/2014','04/13/2014','04/19/2014','04/26/2014','04/27/2014','06/14/2014','06/22/2014','07/12/2014','07/20/2014','07/26/2014','08/25/2014','09/07/2014','09/13/2014','09/27/2014','10/11/2014','11/15/2014','11/16/2014','12/25/2014', '01/25/2015','02/06/2015','02/08/2015','02/14/2015','02/22/2015','03/28/2015','05/10/2015','05/17/2015','06/28/2015','08/02/2015','08/08/2015','08/23/2015','09/13/2015','10/09/2015','10/18/2015','10/24/2015','12/27/2015','12/31/2015', '01/10/2016','01/23/2016','04/02/2016','05/28/2016','06/12/2016','06/18/2016','08/20/2016','10/09/2016','01/25/2017']:
    articles_sentiments['Article[Sentiment,Magnitude]'][articles_sentiments['Date'] == dt] = articles_sentiments['Article[Sentiment,Magnitude]'][articles_sentiments['Date'] == dt].apply(compute_sentiment_scores)

articles_sentiments.head(20)

Unnamed: 0,Date,"Article[Sentiment,Magnitude]"
0,01/07/2014,"[[0.0, 17.5], [0.0, 10.100000381469727], [-0.3..."
1,01/08/2014,"[[-0.30000001192092896, 1.399999976158142], [0..."
2,01/09/2014,"[[0.0, 7.199999809265137], [-0.400000005960464..."
3,01/10/2014,"[[0.20000000298023224, 10.199999809265137], [-..."
4,01/11/2014,"[[0.10000000149011612, 9.399999618530273], [0...."
5,01/12/2014,"[[0.0, 1.2999999523162842], [0.0, 2.2000000476..."
6,01/13/2014,"[[-0.20000000298023224, 2.200000047683716], [0..."
7,01/14/2014,"[[-0.10000000149011612, 1.600000023841858], [0..."
8,01/15/2014,"[[0.0, 1.5], [0.0, 7.400000095367432], [0.0, 3..."
9,01/16/2014,"[[0.0, 19.0], [-0.10000000149011612, 4.5999999..."


### Write Article Sentiment Dataframe to CSV

In [100]:
articles_sentiments.to_csv('bitcoin_news_articles_sentiments_2014_2017.csv', index=False)

## Classifying the news articles

In [188]:
import pandas as pd
from ast import literal_eval

articles_sentiments_final = pd.read_csv('bitcoin_news_articles_sentiments_2014_2017.csv', converters={"Article[Sentiment,Magnitude]": literal_eval})
articles_sentiments_final.head()

Unnamed: 0,Date,"Article[Sentiment,Magnitude]"
0,1/7/14,"[[0.0, 17.5], [0.0, 10.100000381469727], [-0.3..."
1,1/8/14,"[[-0.30000001192092896, 1.399999976158142], [0..."
2,1/9/14,"[[0.0, 7.199999809265137], [-0.400000005960464..."
3,1/10/14,"[[0.20000000298023224, 10.199999809265137], [-..."
4,1/11/14,"[[0.10000000149011612, 9.399999618530273], [0...."


For classification, I compute the average sentiment score for each day. The chart below shows some sample values and how to interpret them:

![alt text](Sentiment_Score_Sample_Chart.png "Sentiment_Score_Sample_Chart")

The bar below shows the polarity scale for my scores. I chose a threshold of 0.25 and above for Clearly Positive sentiment and -0.25 and below for Clearly Negative sentiment, with neutral at 0.

![alt text](Sentiment_Score_Range.png "Sentiment Score Range")
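
The thresholds above can be sketched as a small labelling function. The function name is hypothetical, and treating every score strictly between the two cutoffs as "Neutral" is an assumption, since the text only pins neutral at 0.

```python
def classify_average_score(avg_score, threshold=0.25):
    """Map a daily average score to a label using the 0.25 cutoffs described above."""
    if avg_score >= threshold:
        return "Clearly Positive"
    if avg_score <= -threshold:
        return "Clearly Negative"
    # Assumption: everything strictly between the cutoffs is treated as neutral.
    return "Neutral"


for value in [0.3, -0.4, 0.0]:
    print(value, classify_average_score(value))
```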

In [189]:
# Helper function to compute the average sentiment score for each day from 1/7/2014 to 12/12/2017
def avg_sentiment_score(sent_score_list):
    # Each item is a [score, magnitude] pair; only the score is averaged
    sent_list = [sent_score[0] for sent_score in sent_score_list]
    if not sent_list:
        return float('nan')  # no scored articles for this day
    avg_sent_score = sum(sent_list)/len(sent_list)
    return avg_sent_score

articles_sentiments_final['Average Sentiment Score'] = articles_sentiments_final['Article[Sentiment,Magnitude]'].apply(avg_sentiment_score)
articles_sentiments_final = articles_sentiments_final.drop(['Article[Sentiment,Magnitude]'], axis=1)
articles_sentiments_final.head()

Unnamed: 0,Date,Average Sentiment Score
0,1/7/14,-0.07
1,1/8/14,-0.15
2,1/9/14,-0.18
3,1/10/14,0.11
4,1/11/14,-0.025


## Write Secondary Feature Set Dataframe to CSV

In [190]:
articles_sentiments_final.to_csv('bitcoin_news_average_sentiments_2014_2017.csv', index=False)