# BERT Sentiment Analysis on GME Reddit /r/wallstreetbets

### Validate Environment

In [51]:
!python --version

Python 3.8.0


In [52]:
!which python

/Users/melissacirtain/work/envs/ait/bin/python


In [50]:
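from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
import pandas as pd

# Load the pretrained BERT base model and its matching tokenizer.
# The classification head is newly initialized (see the warning in the output),
# so this model needs fine-tuning before it can make useful predictions.
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print('Pulled pretrained BERT transformer')

# Load the /r/wallstreetbets posts
df = pd.read_csv('datainputs/reddit_wsb.csv')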

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Pulled pretrained BERT transformer


In [53]:
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,"It's not about the money, it's about sending a...",55,l6ulcx,https://v.redd.it/6j75regs72e61,6,1611863000.0,,2021-01-28 21:37:41
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56


### Proof Of Concept

In [54]:
df.shape  # (36668, 8)
df['timestamp'].max()

'2021-02-28 16:53:18'

In [55]:
from transformers import pipeline

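# Allocate a sentiment-analysis pipeline. With no model argument it loads the
# library's default checkpoint, which the log below shows is
# distilbert-base-uncased-finetuned-sst-2-english (DistilBERT, not the BERT model loaded earlier).
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to include pipeline into the transformers repository.')
# Expected: [{'label': 'POSITIVE', 'score': 0.9978193640708923}]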

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_115']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'POSITIVE', 'score': 0.9978193640708923}]

In [56]:
df['title'][:5].apply(classifier)

0    [{'label': 'NEGATIVE', 'score': 0.993991911411...
1    [{'label': 'NEGATIVE', 'score': 0.999741792678...
2    [{'label': 'NEGATIVE', 'score': 0.999687314033...
3    [{'label': 'NEGATIVE', 'score': 0.996105611324...
4    [{'label': 'NEGATIVE', 'score': 0.996898174285...
Name: title, dtype: object

In [57]:
results = df['title'][:5].apply(classifier)
type(results)

pandas.core.series.Series

In [58]:
results[0][0]['score']

0.9939919114112854

In [60]:
results[0][0]['label']

'NEGATIVE'

In [61]:
df100 = df[:100].copy()
df100.shape


(100, 8)

### Classify 100 titles: ~8 seconds

In [63]:
%%time
df100['sentiment'] = df100['title'].apply(classifier)

CPU times: user 15.5 s, sys: 1.64 s, total: 17.2 s
Wall time: 8.17 s
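
At about 8.17 s per 100 titles, and assuming the time scales linearly, applying the classifier row by row to all 36,668 titles would take roughly 36668 / 100 × 8.17 s ≈ 50 minutes on this machine. One possible way to reduce per-call overhead is to hand the classifier lists of titles instead of one string at a time; Hugging Face pipelines accept a list of strings and return one result dict per input. The sketch below is not from the notebook: `classify_in_chunks`, `chunk_size`, and `stub_classifier` are hypothetical names, and the stub stands in for the real pipeline so the example runs without downloading a model.

```python
from typing import Callable, Dict, List, Sequence

Result = Dict[str, object]


def classify_in_chunks(classifier: Callable[[List[str]], List[Result]],
                       titles: Sequence[str],
                       chunk_size: int = 32) -> List[Result]:
    """Call `classifier` on successive lists of at most `chunk_size` titles."""
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    results: List[Result] = []
    for start in range(0, len(titles), chunk_size):
        chunk = list(titles[start:start + chunk_size])
        # One call per chunk; the classifier returns one dict per input text.
        results.extend(classifier(chunk))
    return results


def stub_classifier(texts: List[str]) -> List[Result]:
    """Hypothetical stand-in for the pipeline: a toy rule, not a real model."""
    return [{'label': 'POSITIVE' if t.isupper() else 'NEGATIVE', 'score': 1.0}
            for t in texts]


demo_titles = ['WE BREAKING THROUGH', 'Exit the system', 'THIS IS THE MOMENT']
demo_results = classify_in_chunks(stub_classifier, demo_titles, chunk_size=2)
print([r['label'] for r in demo_results])
```

With the real pipeline, the call would look like `classify_in_chunks(classifier, df['title'].tolist())`. Each result is then a flat dict, so the label is read as `r['label']` rather than `r[0]['label']`. Whether this is actually faster depends on the hardware and the library version, so it is worth timing on a small sample first.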


In [64]:
df100.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,sentiment
0,"It's not about the money, it's about sending a...",55,l6ulcx,https://v.redd.it/6j75regs72e61,6,1611863000.0,,2021-01-28 21:37:41,"[{'label': 'NEGATIVE', 'score': 0.993991911411..."
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10,"[{'label': 'NEGATIVE', 'score': 0.999741792678..."
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35,"[{'label': 'NEGATIVE', 'score': 0.999687314033..."
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57,"[{'label': 'NEGATIVE', 'score': 0.996105611324..."
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56,"[{'label': 'NEGATIVE', 'score': 0.996898174285..."


In [65]:
%%time
# Each 'sentiment' value is a one-element list: [{'label': ..., 'score': ...}].
# Pull the label and the score out into their own columns.
df100['sent'] = df100['sentiment'].apply(lambda x: x[0]['label'])
df100['sent_score'] = df100['sentiment'].apply(lambda x: x[0]['score'])
df100.head()

CPU times: user 3.9 ms, sys: 898 µs, total: 4.79 ms
Wall time: 5.05 ms


Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,sentiment,sent,sent_score
0,"It's not about the money, it's about sending a...",55,l6ulcx,https://v.redd.it/6j75regs72e61,6,1611863000.0,,2021-01-28 21:37:41,"[{'label': 'NEGATIVE', 'score': 0.993991911411...",NEGATIVE,0.993992
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10,"[{'label': 'NEGATIVE', 'score': 0.999741792678...",NEGATIVE,0.999742
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35,"[{'label': 'NEGATIVE', 'score': 0.999687314033...",NEGATIVE,0.999687
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57,"[{'label': 'NEGATIVE', 'score': 0.996105611324...",NEGATIVE,0.996106
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56,"[{'label': 'NEGATIVE', 'score': 0.996898174285...",NEGATIVE,0.996898
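
With the label and score in separate columns, one option for later aggregation is to fold them into a single signed number, positive for POSITIVE and negative for NEGATIVE. This is a minimal sketch, not something the notebook does; `signed_score` is a hypothetical helper, and the two sample rows reuse the rounded scores shown for rows 0 and 5.

```python
def signed_score(label: str, score: float) -> float:
    """Return +score for POSITIVE and -score for NEGATIVE."""
    if label == 'POSITIVE':
        return score
    if label == 'NEGATIVE':
        return -score
    raise ValueError(f'unexpected label: {label!r}')


# One-element lists shaped like the pipeline output in the 'sentiment' column
rows = [
    [{'label': 'NEGATIVE', 'score': 0.993992}],
    [{'label': 'POSITIVE', 'score': 0.993597}],
]
signed = [signed_score(r[0]['label'], r[0]['score']) for r in rows]
print(signed)  # [-0.993992, 0.993597]
```

On the DataFrame, the same helper could be applied as `df100['signed'] = [signed_score(l, s) for l, s in zip(df100['sent'], df100['sent_score'])]`.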


In [47]:
df100[df100['sent'].str.contains('POSITIVE')]


Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,sent,sentiment,sent_score
5,WE BREAKING THROUGH,405,l6uf7d,https://i.redd.it/2wef8tc062e61.png,84,1611862000.0,,2021-01-28 21:26:30,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.993596613407...",0.993597
6,SHORT STOCK DOESN'T HAVE AN EXPIRATION DATE,317,l6uf6d,https://www.reddit.com/r/wallstreetbets/commen...,53,1611862000.0,Hedgefund whales are spreading disinfo saying ...,2021-01-28 21:26:27,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.993022918701...",0.993023
7,THIS IS THE MOMENT,405,l6ub9l,https://www.reddit.com/r/wallstreetbets/commen...,178,1611862000.0,Life isn't fair. My mother always told me that...,2021-01-28 21:19:31,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.999053895473...",0.999054
9,I have nothing to say but BRUH I am speechless...,291,l6uas9,https://i.redd.it/bfzzw2yo42e61.jpg,27,1611862000.0,,2021-01-28 21:18:37,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.990014195442...",0.990014
10,"We need to keep this movement going, we all ca...",222,l6uao1,https://www.reddit.com/r/wallstreetbets/commen...,70,1611862000.0,I believe right now is one of those rare oppo...,2021-01-28 21:18:25,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.999506175518...",0.999506
14,I Love You Retards!!!!,176,l6u8hc,https://www.reddit.com/gallery/l6u8hc,32,1611861000.0,,2021-01-28 21:14:44,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.996614933013...",0.996615
16,To The Mass Relays & Beyond,107,l6u5j2,https://youtu.be/UXLVFnl3WcE,14,1611861000.0,,2021-01-28 21:09:32,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.985514760017...",0.985515
17,I come back to you now... At the turn of the t...,339,l6u40m,https://v.redd.it/nowyj61f22e61,33,1611861000.0,,2021-01-28 21:06:53,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.993447542190...",0.993448
19,"Daily Discussion Thread for January 28, 2021",841,l6u011,https://www.reddit.com/r/wallstreetbets/commen...,5942,1611860000.0,Your daily trading discussion thread. Please k...,2021-01-28 21:00:15,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.607607245445...",0.607607
23,I'm so proud of how far this subreddit has come,458,l6tuae,https://www.reddit.com/r/wallstreetbets/commen...,89,1611860000.0,I still remember when I first joined and most ...,2021-01-28 20:49:39,POSITIVE,"[{'label': 'POSITIVE', 'score': 0.999833345413...",0.999833
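
Most scores in this table are above 0.98, but the Daily Discussion thread is labeled POSITIVE at only about 0.61. A possible way to set such low-confidence calls apart is to relabel anything under a cutoff. The sketch below is an illustration only: the 0.9 cutoff, the `UNCERTAIN` label, and `label_with_threshold` are assumptions, not part of the notebook or the library.

```python
def label_with_threshold(label: str, score: float, threshold: float = 0.9) -> str:
    """Keep the model's label only when its score reaches the threshold."""
    # 'UNCERTAIN' is a made-up marker for low-confidence predictions.
    return label if score >= threshold else 'UNCERTAIN'


# Rounded (label, score) pairs taken from rows 5, 19, and 23 of the table above
samples = [
    ('POSITIVE', 0.993597),
    ('POSITIVE', 0.607607),
    ('POSITIVE', 0.999833),
]
relabeled = [label_with_threshold(label, score) for label, score in samples]
print(relabeled)  # ['POSITIVE', 'UNCERTAIN', 'POSITIVE']
```

The right cutoff would need checking against hand-labeled titles; 0.9 is only a starting point.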


### Distribution of Positive/Negative classifications

In [49]:
df100['sent'].value_counts()

NEGATIVE    66
POSITIVE    34
Name: sent, dtype: int64
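
The counts above can also be read as shares of the sample. The sketch below rebuilds the 66/34 split from those counts and normalizes it with the standard library, so it runs without the DataFrame; on `df100` itself, `df100['sent'].value_counts(normalize=True)` should give the same proportions.

```python
from collections import Counter

# Rebuild the label list from the counts reported above
labels = ['NEGATIVE'] * 66 + ['POSITIVE'] * 34

counts = Counter(labels)
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
print(shares)  # {'NEGATIVE': 0.66, 'POSITIVE': 0.34}
```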

### Just peeking around

In [31]:
# Print every title that mentions GME, separated by blank lines
print('\n\n'.join(df.loc[df['title'].str.contains('GME', na=False), 'title']))



In [32]:
classifier('Really? I can’t even buy GME or AMC for now? 😤')

[{'label': 'NEGATIVE', 'score': 0.9993067383766174}]

In [33]:
classifier('''JUST PUT IN ANOTHER 30K IN NOK CALLS LET'S GO! $GME $NOK BUY AND HOLD''')

[{'label': 'NEGATIVE', 'score': 0.9988563656806946}]