# Enzo Yamamura

### Reddit BTC Sentiments

About 5 million Reddit comments exported from BigQuery are processed: a sentiment polarity is computed for each comment with VADER, and the results are then aggregated per day.


***
## Necessary libs

In [None]:
import glob
import pandas as pd # we will use pandas with modin's parallel processing
import numpy as np
import fasttext
import unicodedata
import plotly.express as px
import re
import string
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.preprocessing import MinMaxScaler
import tqdm
from datetime import datetime
import swifter

***
## Functions

In [None]:
# fastText is Facebook's NLP library.

# Loading the pretrained fastText language-identification model:
model = fasttext.load_model('lid.176.ftz')

# Function to identify the language (VADER only works for English):

def fast_detect(msg):
    try:
        # predict returns labels such as '__label__en'; keep only the language code
        ln = model.predict(msg)[0][0].split("__")[2]
    except Exception:
        # Defaults to None when the prediction fails
        ln = None
    return ln
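
To make the label parsing in `fast_detect` concrete, here is a minimal sketch that does not need the model file. It assumes the shape fastText returns for a single prediction, a tuple of labels and probabilities with labels such as `__label__en`. The helper name `parse_language` and the hard-coded prediction are illustrative only.

```python
def parse_language(prediction):
    """Extract the language code from a fastText-style prediction tuple."""
    labels, _probabilities = prediction
    # '__label__en'.split('__') -> ['', 'label', 'en']
    return labels[0].split("__")[2]


# Hard-coded stand-in for the result of model.predict(...) on English text
fake_prediction = (("__label__en",), [0.98])
language = parse_language(fake_prediction)
print(language)
```

The index `2` works because the label starts with a double underscore, so the first element of the split is an empty string.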



In [None]:
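# Function to remove special characters from comments
def convert_text(text):
    # Collapse every run of whitespace (newlines, tabs, spaces) into one space
    text = re.sub(r'\s+', ' ', str(text))
    text = re.sub(' +', ' ', text)

    # Unescape apostrophes, then drop any remaining backslashes
    text = text.replace("\\'", "'")
    text = text.replace('\\', '')

    # Strip markdown emphasis markers
    text = text.replace('*', '')
    text = text.replace('_', '')

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    return text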
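
As a quick check of what the cleaning does, the sketch below applies the same substitutions to a made-up comment. `clean_comment` is a compact stand-in written for this example, and the sample text and URL are hypothetical.

```python
import re


def clean_comment(text):
    """Compact stand-in mirroring the cleaning steps used in this notebook."""
    text = re.sub(r'\s+', ' ', str(text))                  # collapse whitespace
    text = text.replace("\\'", "'").replace('\\', '')     # unescape, drop backslashes
    text = text.replace('*', '').replace('_', '')         # strip markdown emphasis
    text = re.sub(r'http\S+', '', text)                    # remove URLs
    return text


sample = "I **love** BTC!\n\nSee https://example.com/some_page now"
cleaned = clean_comment(sample)
print(repr(cleaned))
```

Because the URL is removed last, the spaces on either side of it remain, so a double space is left where the link used to be.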

In [None]:
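# Function to use VADER.
# The analyzer is created once and reused: building it inside the function
# would reload the lexicon on every call.
sent_an = SentimentIntensityAnalyzer()

def VADER(text):
    sentiment_dict = sent_an.polarity_scores(text)

    # only the compound polarity score is used:
    return sentiment_dict['compound']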

***
## Importing downloaded data

As the query retrieved a huge dataset, BigQuery partitioned it into multiple files to export.

In [None]:

# DataFrame.append was removed in pandas 2.0; collect the parts and concatenate once
frames = []
for file in glob.glob('reddit_btc_comments*'):
    df = pd.read_csv(file)
    print(file,df.shape)
    frames.append(df)
reddit = pd.concat(frames)

reddit_btc_comments000000000000 (1027994, 4)
reddit_btc_comments000000000001 (1027941, 4)
reddit_btc_comments000000000002 (1028673, 4)
reddit_btc_comments000000000003 (1026929, 4)
reddit_btc_comments000000000004 (1026232, 4)
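
The partitions can be combined with a single `pd.concat` call over a list of frames. A minimal sketch, with two small in-memory frames standing in for the exported files (the data is made up):

```python
import pandas as pd

# Two tiny stand-ins for the exported partitions
part_a = pd.DataFrame({"subreddit": ["Bitcoin"], "score": [2.0]})
part_b = pd.DataFrame({"subreddit": ["btc", "Bitcoin"], "score": [0.0, 5.0]})

frames = [part_a, part_b]
combined = pd.concat(frames)  # each partition keeps its own row index

print(combined.shape)
print(combined.index.tolist())
```

Since every partition keeps its own index, index values repeat across files. Passing `ignore_index=True` to `pd.concat` would give a unique running index instead.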


Checking the dataset size:

In [None]:
reddit.shape #5137769

(5137769, 4)

The dataset comprises 5,137,769 comments mentioning either bitcoin or btc.

In [None]:
reddit.sample(1)

Unnamed: 0,subreddit,created_utc,body,score
556656,Showerthoughts,1544381824,I think what OP means is that if OP owned a la...,3.0


The table returned is structured as follows:

* `subreddit`: the subreddit where the comment was posted

* `created_utc`: creation time as a Unix timestamp (UTC)

* `body`: the comment text

* `score`: upvotes - downvotes (net score)

***
## Data processing

In [None]:
# Converting Unix timestamps (seconds) to UTC datetimes

reddit['Created Date'] = pd.to_datetime(reddit['created_utc'],unit = 's')
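
The same conversion can be checked by hand with the standard library. The sketch below converts one `created_utc` value from this dataset; the variable names are illustrative.

```python
from datetime import datetime, timezone

created_utc = 1420070443  # seconds since 1970-01-01 00:00:00 UTC

converted = datetime.fromtimestamp(created_utc, tz=timezone.utc)
formatted = converted.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
```

Passing `tz=timezone.utc` matters here: without it, `fromtimestamp` converts to the local time zone of the machine.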

In [None]:
# Ordering by created date

reddit.sort_values('Created Date',inplace=True)

The previously defined function is used to remove special characters.

Because of the dataset size, **Swifter** is used to speed up the row-wise processing.

In [None]:
reddit.head()

Unnamed: 0,subreddit,created_utc,body,score,Created Date
734279,millionairemakers,1420070443,Starting out is tough. Strict Regulations in t...,2.0,2015-01-01 00:00:43
998976,changetip,1420070482,To quit looking at the price of btc every 15 m...,0.0,2015-01-01 00:01:22
717883,sportsbook,1420070482,That's the reason I dislike betting using bitc...,2.0,2015-01-01 00:01:22
900356,Bitcoin,1420070488,"Yup, you probably already installed bitcoin-se...",2.0,2015-01-01 00:01:28
254346,Bitcoin,1420070506,I really like Coinbase for its iPhone app. I h...,0.0,2015-01-01 00:01:46


After removing special characters, the language of each comment is identified with fastText:

In [None]:
# Removing special characters
reddit['Comment'] = reddit['body'].swifter.apply(convert_text)


# Inferring language
reddit['Language'] = reddit['Comment'].swifter.apply(fast_detect)

Pandas Apply: 100%|██████████| 5137769/5137769 [03:47<00:00, 22625.71it/s]
Pandas Apply: 100%|██████████| 5137769/5137769 [06:04<00:00, 14084.16it/s]


In [None]:
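# Languages: English vs. all others ('contagem' means 'count')
cont = reddit['Language'].value_counts().to_frame('contagem')
# value_counts sorts in descending order, so 'en' is the first row here
cont2 = pd.DataFrame({'english':cont.loc['en'], 'others':cont.iloc[1:].sum(axis=0)})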
cont2


Unnamed: 0,english,others
contagem,5077054,60715
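
The English/others split can also be written so that it does not depend on English being the most frequent language. A minimal sketch on a made-up language column (the values and variable names are hypothetical):

```python
import pandas as pd

languages = pd.Series(["en", "en", "pt", "en", "es", "de"])
counts = languages.value_counts()

summary = pd.DataFrame(
    {
        "english": [counts.get("en", 0)],
        # drop the 'en' row by label instead of by position
        "others": [counts.drop("en", errors="ignore").sum()],
    },
    index=["count"],
)
print(summary)
```

Dropping the row by label keeps the result correct even if another language ever outnumbers English.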


As VADER only works for English text, comments in other languages are removed.

In [None]:
reddit.query('Language == "en"',inplace=True)

VADER is then used to obtain the compound sentiment polarity score for each comment.

In [None]:
# Once again using swifter to accelerate processing:
reddit['Sentiment'] = reddit['Comment'].swifter.apply(VADER)



Pandas Apply: 100%|██████████| 5077054/5077054 [15:35:10<00:00, 90.48it/s]   


In [None]:
reddit.to_csv('bup1.csv') #saving a backup


The comment scores are normalized to the [0, 1] range with MinMaxScaler (raw scores can be negative) and then used to weight the compound polarity.

In [None]:

reddit = pd.read_csv('bup1.csv')
reddit.head()

Unnamed: 0.1,Unnamed: 0,subreddit,created_utc,body,score,Created Date,Comment,Language,Sentiment
0,734279,millionairemakers,1420070443,Starting out is tough. Strict Regulations in t...,2.0,2015-01-01 00:00:43,Starting out is tough. Strict Regulations in t...,en,0.8163
1,998976,changetip,1420070482,To quit looking at the price of btc every 15 m...,0.0,2015-01-01 00:01:22,To quit looking at the price of btc every 15 m...,en,0.0
2,717883,sportsbook,1420070482,That's the reason I dislike betting using bitc...,2.0,2015-01-01 00:01:22,That's the reason I dislike betting using bitc...,en,0.1441
3,900356,Bitcoin,1420070488,"Yup, you probably already installed bitcoin-se...",2.0,2015-01-01 00:01:28,"Yup, you probably already installed bitcoin-se...",en,0.5927
4,254346,Bitcoin,1420070506,I really like Coinbase for its iPhone app. I h...,0.0,2015-01-01 00:01:46,I really like Coinbase for its iPhone app. I h...,en,0.9657


In [None]:
# Necessary Columns:
reddit = reddit[['subreddit','score','Created Date','Comment','Sentiment']]
reddit.head()

Unnamed: 0,subreddit,score,Created Date,Comment,Sentiment
0,millionairemakers,2.0,2015-01-01 00:00:43,Starting out is tough. Strict Regulations in t...,0.8163
1,changetip,0.0,2015-01-01 00:01:22,To quit looking at the price of btc every 15 m...,0.0
2,sportsbook,2.0,2015-01-01 00:01:22,That's the reason I dislike betting using bitc...,0.1441
3,Bitcoin,2.0,2015-01-01 00:01:28,"Yup, you probably already installed bitcoin-se...",0.5927
4,Bitcoin,0.0,2015-01-01 00:01:46,I really like Coinbase for its iPhone app. I h...,0.9657


In [None]:
reddit.duplicated(keep='first').sum()

6578

Removing exact duplicate rows:

In [None]:
reddit.drop_duplicates(inplace=True)
reddit.shape[0]

5070476

The dataset goes from 5077054 to 5070476 observations after removing duplicates.
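
Note that `duplicated` and `drop_duplicates` compare whole rows by default, so a comment only counts as a duplicate when every selected column matches. A minimal sketch on made-up rows:

```python
import pandas as pd

comments = pd.DataFrame({
    "subreddit": ["Bitcoin", "Bitcoin", "btc"],
    "Comment": ["to the moon", "to the moon", "to the moon"],
    "Sentiment": [0.0, 0.0, 0.0],
})

n_dup = int(comments.duplicated(keep="first").sum())
deduped = comments.drop_duplicates()

print(n_dup, len(deduped))
```

The third row has the same text but a different subreddit, so it is kept. Deduplicating on the text alone would require passing `subset=["Comment"]`.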

In [None]:
# Checking vader performance:
print(reddit.query('Sentiment==1').sample(1)['Comment'].values)

['BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL SUPPORT CONTACT NUMBER 1-877-367-7461 CALL BITCOIN TECH SUPPORT NUMBER $$ 1.877.367.7461 %%BITCOIN TECHNICAL

In [None]:
print(reddit.query('Sentiment==-1').sample(1)['Comment'].values)

["The following comment by throwaway34234234523 was [openly]( greylisted. The original comment can be found(in censored form) at this link: np.reddit.com/r/ CryptoCurrency/comments/7poep7/-/dsire5o?context=4 The original comment's content was as follows: --- &gt; XRP is cancer for cryptocurrencies. Im sure you dont know abour technical specs of ripple, but is the worst coin for crypto. If it turns in main one day (as bitcoin), they will decrease the price of all alts until they cost almost zero, creating new coins. &gt; &gt; &gt;  100% Premine (cant mine new coins) &gt;  60% Held for creators (control) &gt;  No staking / No minig (no rewards for normal users) &gt;  Private blockchain (control, sell personal data) &gt;  Centralized (they can shutdown XRP blockchain when they want and crash the parity alts) &gt;  50$ stucked in wallet (20 XRP) (staking without reward) &gt;  Bound to identity (control, sell personal data) &gt; &gt; &gt; DO NOT SUPPORT RIPPLE (XRP), IT WANTS TO KILL CRYPTO

The comments that hit the maximum or minimum compound polarity (+1, -1) are clearly spam.

This happens because VADER treats capitalization as an intensity indicator. Since only 3 observations reached the maximum/minimum score, they are removed.


Other similar occurrences are controlled by weighting each comment's sentiment by its influence (score).
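
A contributing factor to the ±1 scores can be shown numerically. As far as can be told from the vaderSentiment source, the compound score is the sum of word valences passed through x / sqrt(x² + α) with α = 15 and rounded to four decimals, so long, repetitive text with a large valence sum saturates at the limits. The sketch below reimplements that formula for illustration only; it is not a call into the library, and the valence sums are made up.

```python
import math


def normalize(valence_sum, alpha=15):
    """Map an unbounded valence sum into [-1, 1] (illustrative reimplementation)."""
    norm = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    return max(-1.0, min(1.0, norm))


# Hypothetical valence sums: short comment, longer comment, repetitive spam
for valence_sum in (2.0, 20.0, 1000.0, -1000.0):
    print(valence_sum, round(normalize(valence_sum), 4))
```

Only very large sums round to exactly +1 or -1, which is consistent with so few observations reaching those values.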

In [None]:
reddit.query('Sentiment !=-1 & Sentiment != 1', inplace=True)

In [None]:
# Normalizing scores between 0-1

scaler = MinMaxScaler()

reddit['score_norm'] = scaler.fit_transform(reddit['score'].values.reshape(-1, 1))

# Normalized scores * compound polarity

reddit['Sentiment_weighted'] = reddit['Sentiment']*(reddit['score_norm'])

# Thus, the weighted sentiment varies between -1 and 1.
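
The weighting can be followed by hand on a few values. The sketch below applies the min-max formula (x - min) / (max - min), which is what MinMaxScaler computes with its default range, in plain Python. The scores and polarities are hypothetical.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scores = [-2.0, 0.0, 2.0, 8.0]        # hypothetical comment scores
sentiments = [-0.5, 0.9, 0.4, -0.8]   # hypothetical compound polarities

score_norm = min_max(scores)
weighted = [s * w for s, w in zip(sentiments, score_norm)]

for row in zip(scores, score_norm, sentiments, weighted):
    print(row)
```

Two consequences follow from this choice: the lowest-scored comment gets weight 0 and contributes nothing, and a comment with score 0 still gets a positive weight whenever negative scores exist in the data.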

In [None]:
# Defining daily dates:
reddit['Day'] = pd.to_datetime(reddit['Created Date']).dt.date
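
With a `Day` column in place, a per-day aggregate can be computed with `groupby`. The notebook does not show which aggregate is used at this point, so the sketch below assumes a daily mean plus a comment count, on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Day": ["2015-01-01", "2015-01-01", "2015-01-02"],
    "Sentiment_weighted": [0.04, 0.02, -0.01],
})

# Assumption: mean weighted sentiment per day, with the number of comments
daily = toy.groupby("Day")["Sentiment_weighted"].agg(["mean", "count"])
print(daily)
```

Keeping the count alongside the mean makes it easy to spot days where the average rests on very few comments.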

In [None]:
reddit.head()

Unnamed: 0,subreddit,score,Created Date,Comment,Sentiment,score_norm,Sentiment_weighted,Day
0,millionairemakers,2.0,2015-01-01 00:00:43,Starting out is tough. Strict Regulations in t...,0.8163,0.053064,0.043316,2015-01-01
1,changetip,0.0,2015-01-01 00:01:22,To quit looking at the price of btc every 15 m...,0.0,0.052999,0.0,2015-01-01
2,sportsbook,2.0,2015-01-01 00:01:22,That's the reason I dislike betting using bitc...,0.1441,0.053064,0.007646,2015-01-01
3,Bitcoin,2.0,2015-01-01 00:01:28,"Yup, you probably already installed bitcoin-se...",0.5927,0.053064,0.031451,2015-01-01
4,Bitcoin,0.0,2015-01-01 00:01:46,I really like Coinbase for its iPhone app. I h...,0.9657,0.052999,0.051181,2015-01-01


The dataset is exported.

In [None]:
reddit.to_csv('reddit.csv')

In [None]:
reddit.shape

(5070473, 8)