# BERT Sentiment Analysis on GME Reddit /r/wallstreetbets

### Validate Environment

In [1]:
!python --version

Python 3.8.0


In [2]:
!which python

/Users/melissacirtain/work/envs/ait/bin/python


In [3]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
import pandas as pd
import numpy as np

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print('Pulled pretrained BERT transformer')

df = pd.read_csv('../datainputs/reddit_wsb.csv')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Pulled pretrained BERT transformer


In [4]:
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,"It's not about the money, it's about sending a...",55,l6ulcx,https://v.redd.it/6j75regs72e61,6,1611863000.0,,2021-01-28 21:37:41
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56


### Filter down to just GME-related titles or bodies

In [5]:
df.shape  # (36668, 8) rows to start
df[df['title'].str.contains('GME')].shape  # (6711, 8)
df[(df['body'].notna()) & (df['body'].str.contains('GME'))].shape   # (5743, 8)

(5743, 8)

### Next steps/chat w/ Fernando

In [6]:
strings_to_search = ['game stop', 'gamestop', 'gme']
# note: retards is positive in this context
# explore transfer learning...(maybe, post MVP)
# positive, negative, performance, volume plots
# share dataframe w/ sentiment

### Pull in all the matching posts I can find

In [7]:
#searchfor = ['og', 'at']
#s[s.str.contains('|'.join(searchfor))]

# Search over all titles and non-empty bodies for our search expressions
pattern = '|'.join(strings_to_search)

# Case-sensitive: the search strings are lowercase, so 'GME' and 'GameStop' are missed
df[((df['body'].notna()) & (df['body'].str.contains(pattern))) |
  (df['title'].str.contains(pattern))]  # 1267 rows × 8 columns

# Ignoring case by lowercasing the text first (a (?i) regex prefix would also work)
df[((df['body'].notna()) & (df['body'].str.lower().str.contains(pattern))) |
  (df['title'].str.lower().str.contains(pattern))]  # YES! 12240 rows × 8 columns

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1.611862e+09,,2021-01-28 21:32:10
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1.611862e+09,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1.611862e+09,,2021-01-28 21:28:57
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1.611862e+09,,2021-01-28 21:26:56
6,SHORT STOCK DOESN'T HAVE AN EXPIRATION DATE,317,l6uf6d,https://www.reddit.com/r/wallstreetbets/commen...,53,1.611862e+09,Hedgefund whales are spreading disinfo saying ...,2021-01-28 21:26:27
...,...,...,...,...,...,...,...,...
36650,Wall Street won’t understand what’s happening ...,1378,lsdnyc,https://www.reddit.com/r/wallstreetbets/commen...,156,1.614308e+09,Where’s the boundary between gambling and inve...,2021-02-26 04:45:46
36651,Got in GME today. In a sea of red it was a gre...,111,lsdnk6,https://i.redd.it/o7kvl4zw5oj61.jpg,8,1.614308e+09,,2021-02-26 04:45:20
36654,Some DD on shorting Cramer's mustache.,107,lsdloy,https://www.reddit.com/r/wallstreetbets/commen...,10,1.614307e+09,So I was watching CNBC and saw Cramer talking ...,2021-02-26 04:43:12
36656,GameStop Round 2? How an options-buying frenzy...,59,lsdjwx,https://www.marketwatch.com/story/gamestop-rou...,13,1.614307e+09,,2021-02-26 04:41:08


In [8]:
# What else might we look for?

# grab indices matching the above, inspect rows not in that index...
known_gme_idx = df[((df['body'].notna()) & (df['body'].str.lower().str.contains('|'.join(strings_to_search)))) | 
  (df['title'].str.lower().str.contains('|'.join(strings_to_search)))].index
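
To inspect the rows that are not in the matched index, one option is to negate `Index.isin`. A small sketch on a toy frame; the frame contents and the names `toy`, `known_idx` and `other_rows` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['GME yolo', 'AMC gang', 'BB and NOK DD', 'gamestop earnings'],
})
known_idx = toy[toy['title'].str.contains('gme|gamestop', case=False, na=False)].index

# Rows whose index label is NOT among the known GME matches
other_rows = toy.loc[~toy.index.isin(known_idx)]
print(other_rows['title'].tolist())
```

Skimming a sample of these leftover titles is a quick way to spot search terms that the list above misses.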

### Classify Titles
- so we can make a time series

In [11]:
%%time

# Make a GME-only classified dataframe
from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
classifier = pipeline('sentiment-analysis')

print([x for x in df.columns])

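# Drop non-gme posts (known_gme_idx holds index labels, so select with .loc;
# .copy() gives an independent frame to add columns to)
df = df.loc[known_gme_idx].copy()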

# Apply classifier over title for the GME-related rows
print('classifying titles')
df['title_sentiment'] = df['title'].apply(classifier)


Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_57']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


['title', 'score', 'id', 'url', 'comms_num', 'created', 'body', 'timestamp']
classifying titles
CPU times: user 29min 29s, sys: 2min 55s, total: 32min 24s
Wall time: 14min 28s
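
The title pass above calls the pipeline once per title. Hugging Face pipelines also accept a list of strings and return one result dict per string, so classifying in chunks is one way to cut per-call overhead. Whether it actually runs faster depends on hardware and library version. The sketch below is an assumption-laden illustration: `classify_in_chunks` and `fake_classifier` are hypothetical names, and the stub stands in for the real pipeline so the example runs without downloading a model.

```python
def classify_in_chunks(texts, classifier, chunk_size=64):
    """Call `classifier` on successive chunks and return one result per text."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = list(texts[start:start + chunk_size])
        results.extend(classifier(chunk))
    return results

def fake_classifier(batch):
    # Stub that mimics the pipeline's list-in, list-of-dicts-out shape
    return [{'label': 'POSITIVE' if 'moon' in text else 'NEGATIVE', 'score': 1.0}
            for text in batch]

titles = ['GME to the moon', 'Exit the system', 'moon soon']
print(classify_in_chunks(titles, fake_classifier, chunk_size=2))
```

With the notebook's objects the call might look like `classify_in_chunks(df['title'].tolist(), classifier)`. Each element is then a plain dict rather than a one-element list, so a later `x[0]['label']` lookup would become `x['label']`.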


### Inspect title classifications

In [12]:
%%time

df.head()

CPU times: user 136 µs, sys: 2 µs, total: 138 µs
Wall time: 142 µs


Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,title_sentiment
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10,"[{'label': 'NEGATIVE', 'score': 0.999741792678..."
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35,"[{'label': 'NEGATIVE', 'score': 0.999687314033..."
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57,"[{'label': 'NEGATIVE', 'score': 0.996105611324..."
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56,"[{'label': 'NEGATIVE', 'score': 0.996898174285..."
6,SHORT STOCK DOESN'T HAVE AN EXPIRATION DATE,317,l6uf6d,https://www.reddit.com/r/wallstreetbets/commen...,53,1611862000.0,Hedgefund whales are spreading disinfo saying ...,2021-01-28 21:26:27,"[{'label': 'POSITIVE', 'score': 0.993022918701..."


### Replace NaN with ''

In [13]:
df['body'][df['body'].isna()] = ''
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['body'][df['body'].isna()] = ''


Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,title_sentiment
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10,"[{'label': 'NEGATIVE', 'score': 0.999741792678..."
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35,"[{'label': 'NEGATIVE', 'score': 0.999687314033..."
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57,"[{'label': 'NEGATIVE', 'score': 0.996105611324..."
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56,"[{'label': 'NEGATIVE', 'score': 0.996898174285..."
6,SHORT STOCK DOESN'T HAVE AN EXPIRATION DATE,317,l6uf6d,https://www.reddit.com/r/wallstreetbets/commen...,53,1611862000.0,Hedgefund whales are spreading disinfo saying ...,2021-01-28 21:26:27,"[{'label': 'POSITIVE', 'score': 0.993022918701..."
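
The warning above comes from chained indexing: `df['body'][mask] = ''` first selects a column and then assigns into the result, and pandas cannot guarantee that the assignment reaches the original frame. Two common alternatives are sketched below on a toy frame (the `toy`, `filled` and `located` names are made up for illustration).

```python
import pandas as pd

toy = pd.DataFrame({'title': ['a', 'b', 'c'], 'body': ['text', None, None]})

# Option 1: replace missing bodies with one column-level assignment
filled = toy.copy()
filled['body'] = filled['body'].fillna('')

# Option 2: a single .loc call selects the rows and the column together
located = toy.copy()
located.loc[located['body'].isna(), 'body'] = ''

print(filled['body'].tolist())
```

Both options leave the non-missing bodies untouched and produce the same column.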


### Classify bodies with a custom classifier wrapper

In [14]:
%%time
#assert False

# Keep only the first 500 characters of each body before classifying
df['short_body'] = df['body'].str[:500]

# Apply classifier over bodies for the GME-related rows (missing bodies are '' by now)
print('classifying bodies')
#df['body_sentiment'] = df['body'].apply(classifier)
def classify_bodies(text):
    # Classify one truncated body; on any error, log it and return the 'fail' sentinel
    try:
        return classifier(text)
    except Exception as e:
        print(f'failed on row: \n\n{text} \n\n***** {e}\n\n')
        return 'fail'

df['body_sentiment'] = df['short_body'].apply(classify_bodies)

classifying bodies
CPU times: user 51min 26s, sys: 11min 58s, total: 1h 3min 25s
Wall time: 28min 9s


In [15]:
df.head()
df.to_csv('fully_classified_gme_posts.csv')
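
One thing to watch when reloading this CSV: the sentiment columns hold Python lists of dicts, and `to_csv` writes them as their string representation. Read back naively, they come in as plain strings. A possible workaround, sketched here with an in-memory buffer and a toy frame (both hypothetical), is to parse the column with `ast.literal_eval` through `read_csv`'s `converters` argument.

```python
import ast
import io
import pandas as pd

frame = pd.DataFrame({
    'title': ['GME to the moon'],
    'title_sentiment': [[{'label': 'POSITIVE', 'score': 0.99}]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer)

# Naive read: the sentiment column is now a string
buffer.seek(0)
as_text = pd.read_csv(buffer, index_col=0)

# Parsed read: literal_eval turns the string back into a list of dicts
buffer.seek(0)
parsed = pd.read_csv(buffer, index_col=0,
                     converters={'title_sentiment': ast.literal_eval})

print(type(as_text['title_sentiment'].iloc[0]).__name__)
print(parsed['title_sentiment'].iloc[0][0]['label'])
```

Splitting labels and scores into their own columns before saving sidesteps the issue entirely.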

### Split classifications into labels and scores

In [17]:
%%time

# split out scores and labels for sentiment: body and title
df['body_sent'] = df['body_sentiment'].apply(lambda x: x[0]['label'])
df['body_score'] = df['body_sentiment'].apply(lambda x: x[0]['score'])
df['title_sent'] = df['title_sentiment'].apply(lambda x: x[0]['label'])
df['title_score'] = df['title_sentiment'].apply(lambda x: x[0]['score'])

CPU times: user 20.2 ms, sys: 1.83 ms, total: 22 ms
Wall time: 20.7 ms
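
The lambdas above assume every entry is a list holding one result dict. The body wrapper returns the string `'fail'` when classification errors out, and `x[0]['label']` raises a `TypeError` on that value. The split ran cleanly here, which suggests no body failed in this run. A more defensive variant could look like the sketch below; `split_sentiment` is a hypothetical helper, not part of the notebook.

```python
def split_sentiment(result):
    """Return (label, score) from a pipeline result, or (None, None) on failure."""
    if isinstance(result, list) and result and isinstance(result[0], dict):
        return result[0].get('label'), result[0].get('score')
    return None, None

print(split_sentiment([{'label': 'NEGATIVE', 'score': 0.9997}]))
print(split_sentiment('fail'))
```

In the notebook this might be applied as `df['body_sent'] = df['body_sentiment'].apply(lambda x: split_sentiment(x)[0])`, leaving failed rows as missing values instead of stopping the cell.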


### Save classified CSV

In [18]:
df.to_csv('classified_and_split_gme_posts.csv')