## Reddit Sentiment Model

A look into the different data, models, and results I found.

Notable Findings:
- I found no correlation between the Reddit subreddits I looked at and the prices of cryptocurrencies. Even when posts are grouped by the coin they discuss, Pearson's r, Kendall's tau, and Spearman's rank correlation all fail to find one. This suggests that other factors most likely drive the price fluctuations we see.
- Off-the-shelf sentiment models frequently scored statements incorrectly. This was especially true for 'investment lingo' such as "to the moon", "HODL", or "FOMO": statements that should have been strongly positive or negative came out close to neutral, because the pre-trained models treat these terms as neutral. A custom model might be the only way to make progress here.
- I reached out to 5 redditors who say they run sentiment analysis on investment subreddits (r/wallstreetbets, r/CryptoCurrency, r/Bitcoin), and 4 replied. Their conclusions ultimately mirrored mine.
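
For reference, the three correlation tests named above are all available in `scipy.stats`. The sketch below runs them on made-up numbers: the `sentiment` and `price_change` arrays are illustrative placeholders generated independently of each other, not the data or results from this notebook.

```python
import numpy as np
from scipy import stats

# Illustrative placeholder data, NOT the notebook's results:
# one sentiment score and one price change per day, drawn independently.
rng = np.random.default_rng(0)
sentiment = rng.normal(size=200)
price_change = rng.normal(size=200)

# Each call returns a (coefficient, p-value) pair.
pearson_r, pearson_p = stats.pearsonr(sentiment, price_change)
spearman_r, spearman_p = stats.spearmanr(sentiment, price_change)
kendall_tau, kendall_p = stats.kendalltau(sentiment, price_change)

for name, coef, p in [
    ("Pearson's r", pearson_r, pearson_p),
    ("Spearman's rho", spearman_r, spearman_p),
    ("Kendall's tau", kendall_tau, kendall_p),
]:
    print(f"{name}: coefficient={coef:.3f}, p-value={p:.3f}")
```

Coefficients near 0 with large p-values are what "no correlation" looks like under all three tests. Pearson measures linear association, while Spearman and Kendall work on ranks and so also pick up monotonic but non-linear relationships.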

Let's begin.

In [1]:
#To start, I import the libraries I need throughout the notebook (sentiment model imports come later).
import requests as re
import pandas as pd
import numpy as np
import os
import sys
import time
import yfinance as yf
import matplotlib.pyplot as plt
from ast import literal_eval

In [16]:
#This reads in posts from r/CryptoCurrency only, the first dataset I was able to play with. It pulls all posts back to the beginning of 2021.
if os.path.isfile('rCC.csv'):
    df = pd.read_csv('rCC.csv')
    df['data'] = df['data'].apply(literal_eval)
else:
    #Use this function to combine the array of strings.
    def flatten(input):
        new_list = []
        for i in input:
            if not (isinstance(i, str)):
                try:
                    for j in i:
                        if len(j) > 0:
                            new_list.append(j.replace('\n', ''))
                except:
                    print(i)
                    print(input)
                    raise NameError('NaN')
            else:
                new_list.append(i)
        return new_list

    df = pd.DataFrame()
    #While the last fetched post's date is after the start of the new year
    while len(df) == 0 or int(df.iloc[-1]['created_utc']) > 1609459200:
        #If there is no data in the dataset
        if len(df) == 0:
            r = re.get('https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100')
        #If there is data, pull the last date as the reference for the next API call
        else:
            r = re.get('https://api.pushshift.io/reddit/search/submission/?subreddit=CryptoCurrency&size=100&before=' + str(int(df.iloc[-1]['created_utc'])))
        #If successful, convert the data from JSON to a Dataframe, and clean the data into sentences -> column ['data']
        if r.status_code == 200:
            r = pd.json_normalize(r.json(), record_path='data')[['author', 'created_utc', 'id', 'is_self', 'link_flair_text', 'num_comments', 'score', 'selftext', 'title', 'url']]
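            #Note: len(r['selftext']) is the number of rows in the page, not the length of each post, so that term is True for any page with more than one row. r['selftext'].str.len() > 1 would filter per post.
            r = r.where((len(r['selftext']) > 1) & (r['selftext'] != '[removed]') & (r['selftext'] != '[deleted]') & (r['selftext'].notna())).dropna(how='all')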
            r['selftext'] = r['selftext'].str.split('.')
            r['data'] = r[['title', 'selftext']].values.tolist()
            r['data'] = r['data'].apply(flatten)
            df = pd.concat([df, r], ignore_index=True)
        #Delay in case the API pull is too fast and gets rejected.
        else:
            time.sleep(15)
    df.to_csv('rCC.csv')
df

Unnamed: 0.1,Unnamed: 0,author,created_utc,id,is_self,link_flair_text,num_comments,score,selftext,title,url,data
0,0,Harleyblackpanther,1.631018e+09,pjm9or,0.0,NEW-COIN,0.0,1.0,[''],Julio Iglesias Coin,https://bitclout.com/posts/25c09a3c19e37384385...,[Julio Iglesias Coin]
1,1,Goatblort,1.631018e+09,pjm9k0,0.0,MINING-STAKING,1.0,1.0,[''],Crypto mining explodes in Vietnam,https://coinmarketcap.com/headlines/news/Crypt...,[Crypto mining explodes in Vietnam]
2,2,ssyamchasa,1.631018e+09,pjm93g,0.0,MEDIA,0.0,1.0,[''],Brokoli: Green Vision With Flawless Crypto Exe...,https://m.investing.com/news/cryptocurrency-ne...,[Brokoli: Green Vision With Flawless Crypto Ex...
3,3,Harleyblackpanther,1.631018e+09,pjm8ox,0.0,NEW-COIN,1.0,1.0,[''],Julioiglesias coin. https://bitclout.com/posts...,https://i.redd.it/la52cqfat2m71.jpg,[Julioiglesias coin. https://bitclout.com/post...
4,4,ITmancoderwannabe,1.631018e+09,pjm7wq,0.0,,0.0,1.0,[''],Cryptocurrency Prices Today on September 7: So...,https://www.moneycontrol.com/news/business/cry...,[Cryptocurrency Prices Today on September 7: S...
...,...,...,...,...,...,...,...,...,...,...,...,...
272468,272468,ki777iz,1.609455e+09,knzz90,0.0,GENERAL-NEWS,0.0,1.0,[''],Darknet Marketplace Has Stopped Supporting Pay...,https://thecryptobasic.com/2020/12/31/darknet-...,[Darknet Marketplace Has Stopped Supporting Pa...
272469,272469,HashMoose,1.609455e+09,knzwiq,0.0,TRADING,7.0,1.0,[''],Representatives ask Mnuchin to extend comment ...,https://www.theblockcrypto.com/linked/89783/re...,[Representatives ask Mnuchin to extend comment...
272470,272470,clydedyed,1.609455e+09,knzwh9,0.0,COMEDY,2.0,1.0,[''],wtf is wrong with these cultist,https://i.imgur.com/Wd8UqFk.png,[wtf is wrong with these cultist]
272471,272471,Crypto-Hero,1.609455e+09,knzvzd,1.0,FINANCE,7.0,1.0,['REMINDER: Cash out on your holding on PayPal...,REMINDER: Cash out on your holding on PayPal f...,https://www.reddit.com/r/CryptoCurrency/commen...,[REMINDER: Cash out on your holding on PayPal ...
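
The paging logic in the cell above (request a page, then use the oldest post's created_utc as the next before= cursor until the cutoff is reached) can be sketched without network access. Here `fetch_page` is a hypothetical stand-in for the Pushshift request, and the empty-page check and the `max_pages` cap are additions of this sketch, not part of the original cell.

```python
CUTOFF_UTC = 1609459200  # 2021-01-01 00:00:00 UTC, the same cutoff as the cell above

# Hypothetical in-memory "API": ten fake posts, newest first, 100 seconds apart.
FAKE_POSTS = [{"id": f"p{i}", "created_utc": CUTOFF_UTC + 500 - 100 * i} for i in range(10)]

def fetch_page(before=None, size=3):
    """Stand-in for the HTTP call: return up to `size` posts older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]

def fetch_until(cutoff, max_pages=100):
    """Page backwards in time until the oldest fetched post is at or before `cutoff`."""
    posts, before = [], None
    for _ in range(max_pages):          # hard cap so a misbehaving API cannot loop forever
        page = fetch_page(before=before)
        if not page:                    # an empty page means there is nothing older to fetch
            break
        posts.extend(page)
        before = page[-1]["created_utc"]  # cursor: timestamp of the oldest post so far
        if before <= cutoff:
            break
    return posts

posts = fetch_until(CUTOFF_UTC)
print(len(posts), posts[-1]["created_utc"])
```

Bounding the loop and stopping on an empty page guards against spinning indefinitely when the API keeps failing or runs out of posts before the cutoff.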


In [17]:
#Read in bitcoin prices from the beginning of 2021 on an hourly basis.
#Now, since we want to focus on the actual sentiment analysis model, ignore this cell.
'''
if os.path.isfile('btcprice.csv'):
    btcprices = pd.read_csv('btcprice.csv')
else:
    btcprices = yf.download(tickers="BTC-USD", period="ytd", interval="1h")
    pd.DataFrame(btcprices).to_csv('btcprice.csv')
    pd.DataFrame(btcprices)
'''
print("Now, since we want to focus on the actual sentiment analysis model, ignore this cell.")

Now, since we want to focus on the actual sentiment analysis model, ignore this cell.


In [18]:
#Get the raw string data of the post.
df['rawdata'] = df['data'].apply(" ".join)
df

Unnamed: 0.1,Unnamed: 0,author,created_utc,id,is_self,link_flair_text,num_comments,score,selftext,title,url,data,rawdata
0,0,Harleyblackpanther,1.631018e+09,pjm9or,0.0,NEW-COIN,0.0,1.0,[''],Julio Iglesias Coin,https://bitclout.com/posts/25c09a3c19e37384385...,[Julio Iglesias Coin],Julio Iglesias Coin
1,1,Goatblort,1.631018e+09,pjm9k0,0.0,MINING-STAKING,1.0,1.0,[''],Crypto mining explodes in Vietnam,https://coinmarketcap.com/headlines/news/Crypt...,[Crypto mining explodes in Vietnam],Crypto mining explodes in Vietnam
2,2,ssyamchasa,1.631018e+09,pjm93g,0.0,MEDIA,0.0,1.0,[''],Brokoli: Green Vision With Flawless Crypto Exe...,https://m.investing.com/news/cryptocurrency-ne...,[Brokoli: Green Vision With Flawless Crypto Ex...,Brokoli: Green Vision With Flawless Crypto Exe...
3,3,Harleyblackpanther,1.631018e+09,pjm8ox,0.0,NEW-COIN,1.0,1.0,[''],Julioiglesias coin. https://bitclout.com/posts...,https://i.redd.it/la52cqfat2m71.jpg,[Julioiglesias coin. https://bitclout.com/post...,Julioiglesias coin. https://bitclout.com/posts...
4,4,ITmancoderwannabe,1.631018e+09,pjm7wq,0.0,,0.0,1.0,[''],Cryptocurrency Prices Today on September 7: So...,https://www.moneycontrol.com/news/business/cry...,[Cryptocurrency Prices Today on September 7: S...,Cryptocurrency Prices Today on September 7: So...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
272468,272468,ki777iz,1.609455e+09,knzz90,0.0,GENERAL-NEWS,0.0,1.0,[''],Darknet Marketplace Has Stopped Supporting Pay...,https://thecryptobasic.com/2020/12/31/darknet-...,[Darknet Marketplace Has Stopped Supporting Pa...,Darknet Marketplace Has Stopped Supporting Pay...
272469,272469,HashMoose,1.609455e+09,knzwiq,0.0,TRADING,7.0,1.0,[''],Representatives ask Mnuchin to extend comment ...,https://www.theblockcrypto.com/linked/89783/re...,[Representatives ask Mnuchin to extend comment...,Representatives ask Mnuchin to extend comment ...
272470,272470,clydedyed,1.609455e+09,knzwh9,0.0,COMEDY,2.0,1.0,[''],wtf is wrong with these cultist,https://i.imgur.com/Wd8UqFk.png,[wtf is wrong with these cultist],wtf is wrong with these cultist
272471,272471,Crypto-Hero,1.609455e+09,knzvzd,1.0,FINANCE,7.0,1.0,['REMINDER: Cash out on your holding on PayPal...,REMINDER: Cash out on your holding on PayPal f...,https://www.reddit.com/r/CryptoCurrency/commen...,[REMINDER: Cash out on your holding on PayPal ...,REMINDER: Cash out on your holding on PayPal f...


In [19]:
#Now we do TF-IDF on the rawdata column.
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

samp = df.copy()
#Stem if necessary
'''
stemmer = SnowballStemmer("english")
samp['stemmed'] = samp.rawdata.map(lambda x: ' '.join([stemmer.stem(y) for y in x.split(' ')]))
samp.stemmed.head()
'''
#A few problems I found with stemming: the stemmer did not just strip real suffixes (such as -ing, -ed, etc.); it often cut trailing letters off words (so some became som, disaster became disas). This led to a lot of inaccuracies and confusion.

'\nstemmer = SnowballStemmer("english")\nsamp[\'stemmed\'] = samp.rawdata.map(lambda x: \' \'.join([stemmer.stem(y) for y in x.split(\' \')]))\nsamp.stemmed.head()\n'

In [6]:
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec

CountVectorizer(max_df=0.5, ngram_range=(1, 2), stop_words='english')

In [7]:
cvec.fit(samp.rawdata)
cvec_counts = cvec.transform(samp.rawdata)
tr = TfidfTransformer()
transformed_weights = tr.fit_transform(cvec_counts)
transformed_weights

<272473x4278784 sparse matrix of type '<class 'numpy.float64'>'
	with 19236049 stored elements in Compressed Sparse Row format>

In [8]:
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
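#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement (available since 1.0).
weights_df = pd.DataFrame({'term': cvec.get_feature_names_out(), 'weight': weights})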
weights_df.sort_values(by='weight', ascending=False).head(150)

Unnamed: 0,term,weight
1063877,crypto,0.017253
588139,bitcoin,0.012398
2155360,just,0.008529
2276787,like,0.007628
2814336,people,0.006237
...,...,...
3292351,run,0.001792
3501528,smart,0.001787
3868853,trade,0.001783
2865196,platform,0.001769


In [9]:
#Check the VADER values for each individual word.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
def senticol(cols):
    return analyzer.polarity_scores(cols)['compound']
weights_df['vader'] = weights_df['term'].apply(senticol)
weights_df.sort_values(by='weight', ascending=False).to_csv('wordweights.csv')
weights_df.sort_values(by='weight', ascending=False).head(150)

Unnamed: 0,term,weight,vader
1063877,crypto,0.017253,0.0000
588139,bitcoin,0.012398,0.0000
2155360,just,0.008529,0.0000
2276787,like,0.007628,0.3612
2814336,people,0.006237,0.0000
...,...,...,...
3292351,run,0.001792,0.0000
3501528,smart,0.001787,0.4019
3868853,trade,0.001783,0.0000
2865196,platform,0.001769,0.0000


We notice that many of the highest-weighted terms have no VADER rating at all (a compound score of 0), which means our model might not return accurate results, especially for terms like hold, hodl, sell, long, moon, and buy. We need to tweak the VADER lexicon to fit our standard of sentiment.



In [10]:
EditSIA = SentimentIntensityAnalyzer()
new_words = {
    'buy': 2,
    'sell': -2,
    'long': 0.8,
    'buying': 2,
    'selling': -2,
    'dip': -1.5,
    'hodl': 1.5,
    'bull': 3,
    'bear': -3,
    'bullish': 3,
    'bearish': -3,
    'gold': 1.4,
    'sold': -1,
    'moon like': 2,
    'to the moon': 3.2,
    'purchase': 0.5,
    'bull run': 3,
    'bear market': -3,
    'green': 3,
    'correction': -3,
    'fomo': -2,
    'ath': 3,
    'volatile': -2.5,
    'rally': 2.5,
    'underperform': -3,
    'underperforming': -3,
    'slamming': -2,
    'slams': -2,
    'bubble': -3,
}
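#Note: VADER looks up one whitespace-separated token at a time, so the multi-word keys above ('moon like', 'to the moon', 'bull run', 'bear market') may never be matched; the single-word entries are the ones that take effect.
EditSIA.lexicon.update(new_words)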

In [11]:
def senticol2(cols):
    return EditSIA.polarity_scores(cols)['compound']
weights_df['edit'] = weights_df['term'].apply(senticol2)
weights_df.sort_values(by='weight', ascending=False).to_csv('wordweights.csv')
weights_df.sort_values(by='weight', ascending=False).head(150)

Unnamed: 0,term,weight,vader,edit
1063877,crypto,0.017253,0.0000,0.0000
588139,bitcoin,0.012398,0.0000,0.0000
2155360,just,0.008529,0.0000,0.0000
2276787,like,0.007628,0.3612,0.3612
2814336,people,0.006237,0.0000,0.0000
...,...,...,...,...
3292351,run,0.001792,0.0000,0.0000
3501528,smart,0.001787,0.4019,0.4019
3868853,trade,0.001783,0.0000,0.0000
2865196,platform,0.001769,0.0000,0.0000
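
The compound values in these tables follow from how VADER normalizes a summed valence x into the range (-1, 1): x / sqrt(x² + 15). The sketch below re-implements only that normalization step in plain Python. It assumes lexicon valences of 1.5 for "like" and 1.7 for "smart", which reproduce the scores shown above, and it ignores VADER's other heuristics (punctuation, capitalization, negation, boosters).

```python
import math

def vader_compound(valence_sum, alpha=15):
    """Normalize a summed valence onto (-1, 1), as VADER does for its compound score."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

print(round(vader_compound(1.5), 4))  # 0.3612, the score shown for "like"
print(round(vader_compound(1.7), 4))  # 0.4019, the score shown for "smart"
print(round(vader_compound(3.0), 4))  # 0.6124, a lone term with valence 3 (e.g. "bull" in new_words)
print(round(vader_compound(0.0), 4))  # 0.0, any term missing from the lexicon
```

VADER's built-in lexicon rates words on a scale from -4 to +4, so custom valences like the ones in new_words are best chosen within that range.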


In [15]:
#Randomly sample some reddit posts, see if they have more accurate sentiment analyses.

postsamp = df.copy().sample(100, random_state=10)
def vadersent(cols):
    compound = []
    for i in range(0, len(cols['data'])):
        score = analyzer.polarity_scores(cols['data'][i])
        compound.append(score['compound'])
    return np.mean(compound)

def vadersent2(cols):
    compound = []
    for i in range(0, len(cols['data'])):
        score = EditSIA.polarity_scores(cols['data'][i])
        compound.append(score['compound'])
    return np.mean(compound)

postsamp['vader'] = postsamp.apply(vadersent, axis=1)
postsamp['new_vader'] = postsamp.apply(vadersent2, axis=1)
postsamp

Unnamed: 0.1,Unnamed: 0,author,created_utc,id,is_self,link_flair_text,num_comments,score,selftext,title,url,data,rawdata,is_btc,is_eth,is_doge,vader,new_vader
28407,28407,Dreampopgazer,1.629376e+09,p7dtk5,1.0,TRADING,86.0,1.0,['On 20/21 July a lot of the crypto space (and...,Did you buy the July bottom? A celebration of ...,https://www.reddit.com/r/CryptoCurrency/commen...,[Did you buy the July bottom? A celebration of...,Did you buy the July bottom? A celebration of ...,True,True,False,0.276879,0.306421
247379,247379,SuineGeniuS,1.613447e+09,lkux3f,0.0,,0.0,1.0,[''],Voyager referral thread,/r/Voyagerreferralcodes/comments/lj4dca/referr...,[Voyager referral thread],Voyager referral thread,False,False,False,0.000000,0.000000
211718,211718,jrm7262,1.618646e+09,msmonl,1.0,,1.0,1.0,"['Hello Reddit Cryptocurrency Community,\n\nTh...",Is INX worth a punt?,https://www.reddit.com/r/CryptoCurrency/commen...,"[Is INX worth a punt?, Hello Reddit Cryptocurr...",Is INX worth a punt? Hello Reddit Cryptocurren...,False,False,False,0.090750,0.167217
265613,265613,Cryptodragonnz,1.610862e+09,kz0jjz,0.0,COMEDY,38.0,1.0,[''],"Hmmmmmm, this sounds familiar. Where have I se...",https://i.redd.it/3xgevzevytb61.jpg,"[Hmmmmmm, this sounds familiar. Where have I s...","Hmmmmmm, this sounds familiar. Where have I se...",False,False,False,0.361200,0.361200
79319,79319,BenderTheIV,1.626766e+09,onxb30,1.0,SELF-STORY,16.0,1.0,"[""I started this year investing at what was pr...",This is the first bear I encountered in the wild!,https://www.reddit.com/r/CryptoCurrency/commen...,[This is the first bear I encountered in the w...,This is the first bear I encountered in the wi...,False,False,False,0.110967,0.043783
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201852,201852,OhWiseWizard,1.619066e+09,mvxcg1,1.0,FOCUSED-DISCUSSION,2.0,1.0,"['Oasis Labs ([https://www', 'oasislabs', 'com...",Oasis Labs / ROSE - How come I never see this ...,https://www.reddit.com/r/CryptoCurrency/commen...,[Oasis Labs / ROSE - How come I never see this...,Oasis Labs / ROSE - How come I never see this ...,False,True,False,0.117186,0.117186
11085,11085,nj_crypto_news,1.630356e+09,peqtmd,0.0,EXCHANGE,2.0,1.0,[''],NFT Marketplace OpenSea Seeks Urgent Team Expa...,https://thecryptonewsweb.com/nft-marketplace-o...,[NFT Marketplace OpenSea Seeks Urgent Team Exp...,NFT Marketplace OpenSea Seeks Urgent Team Expa...,False,False,False,0.599400,0.599400
91456,91456,diamondhands_dev,1.625857e+09,oh256b,1.0,NEW-COIN,46.0,1.0,['I want to expand my portfolio as I only have...,Expanding the portfolio.,https://www.reddit.com/r/CryptoCurrency/commen...,"[Expanding the portfolio., I want to expand my...",Expanding the portfolio. I want to expand my p...,False,True,False,0.535180,0.535180
229027,229027,Kawalele,1.615881e+09,m63w0w,1.0,TRADING,9.0,1.0,"['While the title is true, it is also true tha...",1 Day of work traded for Bitcoin in the US in ...,https://www.reddit.com/r/CryptoCurrency/commen...,[1 Day of work traded for Bitcoin in the US in...,1 Day of work traded for Bitcoin in the US in ...,True,False,False,0.234600,0.250286


A few things we can notice with the new VADER model:
- VADER still has trouble with negations such as "never" and "can't", which should invert the sentiment of what follows. Sentences like "never bullish" or "can't go to the moon" are interpreted positively. This is the biggest obstacle to accurate sentiment analysis.
- Customizing the lexicon flips the sign of certain posts, making them positive instead of negative, which is extremely helpful for accuracy.
- Most changes, however, are marginal relative to the overall sentiment score. That could mean the posts were already scored accurately, that more custom words need to be added to the lexicon, or that averaging the sentence scores in a post is not a valid way to get a post-level sentiment value.
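
On the last point, the plain mean treats a three-word title and a long paragraph sentence equally, so neutral sentences can dilute a strongly worded one. The sketch below compares the plain mean with two alternatives on made-up scores. The length-weighted and non-neutral means are hypothetical options, not something this notebook evaluates.

```python
import numpy as np

# Made-up sentence-level compound scores for one post, with each sentence's word count.
sentence_scores = np.array([0.0, 0.0, 0.8])
word_counts = np.array([3, 2, 15])

# What vadersent/vadersent2 compute: every sentence counts equally.
plain_mean = sentence_scores.mean()

# Alternative 1 (assumption): weight each sentence by its length in words.
weighted_mean = np.average(sentence_scores, weights=word_counts)

# Alternative 2 (assumption): average only sentences that carry any sentiment.
nonneutral = sentence_scores[sentence_scores != 0]
nonneutral_mean = nonneutral.mean() if nonneutral.size else 0.0

print(round(plain_mean, 4), round(weighted_mean, 4), round(nonneutral_mean, 4))
```

Which aggregation is appropriate is a judgment call; comparing them against a hand-labeled sample of posts would be one way to decide.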

To change the posts shown in the cell above, set the random_state integer to a different value and a different sample will be drawn.

Below, I w

In [13]:
#Now we can sort the crypto posts by which coin they mention, starting with Bitcoin, Ethereum, and Dogecoin.
#Word boundaries (\b) match whole words only, so 'eth' does not match inside words such as 'teeth'.
lowered = df['rawdata'].str.lower()
df['is_btc'] = lowered.str.contains(r'\b(?:btc|bitcoin)\b', na=False)
df['is_eth'] = lowered.str.contains(r'\b(?:eth|ethereum)\b', na=False)
df['is_doge'] = lowered.str.contains(r'\b(?:doge|dogecoin)\b', na=False)
#Boolean indexing keeps only the matching rows and preserves column dtypes.
btc_only = df[df['is_btc']]
eth_only = df[df['is_eth']]
doge_only = df[df['is_doge']]
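
Coin matching with str.contains is sensitive to whitespace and word boundaries. The sketch below uses a few made-up titles to compare a pattern that has literal spaces around the `|` with a word-boundary pattern. The titles and variable names are illustrative only.

```python
import pandas as pd

# Made-up titles for illustration.
titles = pd.Series([
    "Bitcoin is up today",      # coin name at the start of the text
    "Sold all my BTC",          # ticker at the end of the text
    "Something else entirely",  # no mention
    "I like bitcoin a lot",     # coin name in the middle
])
lowered = titles.str.lower()

# Spaces are part of the regex: this matches "btc " or " bitcoin" only.
spaced = lowered.str.contains("btc | bitcoin")

# \b matches at word boundaries, including the start and end of the text.
bounded = lowered.str.contains(r"\b(?:btc|bitcoin)\b")

print(pd.DataFrame({"title": titles, "spaced": spaced, "bounded": bounded}))
```

The spaced pattern misses mentions at the very start or end of a post, while the word-boundary pattern catches them and still rejects the unrelated title.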

In [14]:
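#Bitcoin TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

btcv = TfidfVectorizer()
btcX = btcv.fit_transform(btc_only['rawdata'])
#Weights for the first Bitcoin post only; slice the sparse row before densifying to avoid building the full dense matrix.
tfidf = dict(zip(btcv.get_feature_names_out(), btcX[0].toarray().ravel()))
dict(sorted(tfidf.items(), key=lambda item: item[1], reverse=True))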