# Analysis of Reddit WallStreetBets Posts: Revealing Market Trends and Investor Sentiment

## Table of Contents

- [Introduction](#Introduction)
- [Data Description](#Data-Description)
- [Market Trends Analysis](#Market-Trends-Analysis)
- [Investor Sentiment Analysis](#Investor-Sentiment-Analysis)
- [Conclusion](#Conclusion)
- [Acknowledgements](#Acknowledgements)


## Introduction
With the widespread use of social media, the WallStreetBets (WSB) subreddit on Reddit has become a focal point for investors worldwide. This analysis aims to study the WSB posts dataset obtained from Kaggle to reveal market trends and investor sentiment. We will focus on the following key indicators: post title (title), score (score), number of comments (comms_num), and post creation time (created).

### Import the necessary Python libraries

In [1]:
import pandas as pd
import numpy as np
import spacy
from string import punctuation
from collections import Counter
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [2]:
# load data
wsb = pd.read_csv('reddit_wsb.csv')
wsb.head()

| Unnamed: 0 | title | score | id | url | comms_num | created | body | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | It's not about the money, it's about sending a... | 55 | l6ulcx | https://v.redd.it/6j75regs72e61 | 6 | 1611863000.0 | | 2021-01-28 21:37:41 |
| 1 | Math Professor Scott Steiner says the numbers ... | 110 | l6uibd | https://v.redd.it/ah50lyny62e61 | 23 | 1611862000.0 | | 2021-01-28 21:32:10 |
| 2 | Exit the system | 0 | l6uhhn | https://www.reddit.com/r/wallstreetbets/commen... | 47 | 1611862000.0 | The CEO of NASDAQ pushed to halt trading “to g... | 2021-01-28 21:30:35 |
| 3 | NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR... | 29 | l6ugk6 | https://sec.report/Document/0001193125-21-019848/ | 74 | 1611862000.0 | | 2021-01-28 21:28:57 |
| 4 | Not to distract from GME, just thought our AMC... | 71 | l6ufgy | https://i.redd.it/4h2sukb662e61.jpg | 156 | 1611862000.0 | | 2021-01-28 21:26:56 |


## Data Description

First, we perform descriptive statistical analysis on the dataset to understand the basic characteristics of the posts.

- Number of posts: There are 53187 posts in the dataset.
- Score distribution: The average score of the posts is 1382, with the highest score being 348241 and the lowest score being 0.
- Comments distribution: The average number of comments per post is 263, with the highest number of comments being 93268 and the lowest being 0.
- Creation time span: Based on the first and last rows of the dataset, the posts run from 2021-01-28 to 2021-08-02.


In [3]:
print(wsb.shape[0],"posts \n")

print('average score:',wsb['score'].mean(),'\n'
      'highest score:',wsb['score'].max(), '\n'
      'lowest score:', wsb['score'].min(),'\n')

print('average comment number:',wsb['comms_num'].mean(),'\n'
     'highest comment number:',wsb['comms_num'].max(),'\n'
     'lowest comment number:',wsb['comms_num'].min(),'\n')

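# Select the column by name rather than by position.
# This reads the first and last rows, so it assumes the file is ordered by time;
# wsb['timestamp'].min() and .max() would not depend on row order.
print('earliest post:',wsb['timestamp'].iloc[0],'\n'
      'newest post:' , wsb['timestamp'].iloc[-1])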

53187 posts 

average score: 1382.461052512832 
highest score: 348241 
lowest score: 0 

average comment number: 263.2602515652321 
highest comment number: 93268 
lowest comment number: 0 

earliest post: 2021-01-28 21:37:41 
newest post: 2021-08-02 12:00:14
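
The earliest and newest timestamps above are read from the first and last rows, which only works if the file is sorted by time. A minimal sketch of an order-independent check, assuming the `timestamp` column holds strings in the format shown in the preview (the three-row frame below is made-up illustration data, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real data: three rows in arbitrary order.
sample = pd.DataFrame({
    "title": ["b", "c", "a"],
    "timestamp": [
        "2021-03-01 10:00:00",
        "2021-08-02 12:00:14",
        "2021-01-28 21:37:41",
    ],
})

# Parse the strings into datetimes so min/max compare chronologically.
sample["timestamp"] = pd.to_datetime(sample["timestamp"])

earliest = sample["timestamp"].min()
newest = sample["timestamp"].max()
span_days = (newest - earliest).days  # whole days between the two posts

print("earliest post:", earliest)
print("newest post:", newest)
print("span in days:", span_days)
```

Applied to the real frame, the same three lines on `wsb['timestamp']` would report the span regardless of how the rows are ordered.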


## Market Trends Analysis

Next, we will analyze the market trends found within WSB posts.

- Popular stocks: By counting how often each token appears in post titles, including stock ticker symbols, we can identify the most discussed stocks.



In [4]:

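# Load the large English model; NER is not needed for lemmatization.
nlp = spacy.load("en_core_web_lg", disable=["ner", "textcat"])

def process(text):
    # Strip ASCII punctuation, collapse whitespace, then lemmatize and drop stop words.
    text = ''.join(c for c in text if c not in punctuation)
    tokens = [token.lemma_.lower() for token in nlp(' '.join(text.split())) if token.lemma_.lower() not in nlp.Defaults.stop_words]
    return tokens

# Lazily process each title, then flatten all tokens into one list.
title_comb = (process(title) for title in wsb['title'])

all_titles_processed = [token for title_tokens in title_comb for token in title_tokens]

# Count token frequencies across all titles.
title_counter = Counter(all_titles_processed)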

In [5]:
title_counter.most_common(10)

[('🚀', 18086),
 ('gme', 8783),
 ('buy', 6274),
 ('💎', 5666),
 ('hold', 5068),
 ('amc', 3492),
 ('robinhood', 3171),
 ('stock', 3095),
 ('sell', 2966),
 ('🙌', 2234)]

Looking at the most common tokens, I saw keywords like gme, amc, and robinhood. These keywords are closely tied to the stock market during the pandemic. Robinhood was the trading platform they used. AMC is a movie theater chain that was significantly affected by the pandemic because people did not want to get sick. GME is the ticker symbol for GameStop, which was also affected by the pandemic between 2020 and 2021.
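
The token counts above mix tickers with ordinary words and emoji. A minimal sketch of counting only ticker mentions, assuming a hand-picked watchlist (`WATCHLIST`, `count_tickers`, and the sample titles are hypothetical and not part of the original analysis):

```python
import re
from collections import Counter

# Assumed watchlist; a real analysis would need a fuller, curated ticker list.
WATCHLIST = {"GME", "AMC", "BB", "NOK"}

# Optional "$" prefix, then a whole word of 1 to 5 letters.
TICKER_RE = re.compile(r"\$?\b([A-Za-z]{1,5})\b")


def count_tickers(titles):
    """Count how many titles mention each watchlist ticker (once per title)."""
    counts = Counter()
    for title in titles:
        # Upper-case every short word, then keep only those on the watchlist.
        found = {match.upper() for match in TICKER_RE.findall(title)}
        counts.update(found & WATCHLIST)
    return counts


# Made-up titles for illustration only.
sample_titles = [
    "GME to the moon, hold $GME",
    "AMC and gme both up today",
    "Exit the system",
]
print(count_tickers(sample_titles).most_common())
```

Matching case-insensitively catches lowercase mentions such as "gme", at the cost of false positives for tickers that are also ordinary words.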

- Market sentiment: To gauge overall sentiment, we score each post title with NLTK's VADER-based SentimentIntensityAnalyzer and average the compound scores. A mean above 0.05 is labeled high, a mean below -0.05 is labeled low, and anything in between is labeled neutral.

In [6]:

# Requires the VADER lexicon: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def get_sentiment_score(text):
    # The compound score ranges from -1 (most negative) to 1 (most positive).
    sentiment = sia.polarity_scores(text)
    return sentiment['compound']

wsb['sent_score'] = wsb['title'].apply(get_sentiment_score)
mean_sentiment = wsb['sent_score'].mean()

# Label the average using the +/-0.05 cutoffs.
if mean_sentiment > 0.05:
    market_sentiment = "High"
elif mean_sentiment < -0.05:
    market_sentiment = "Low"
else:
    market_sentiment = "Neutral"

print("Market Sentiment:", market_sentiment, '\nsimple average sentiment analysis score:', mean_sentiment)

Market Sentiment: High 
simple average sentiment analysis score: 0.05002351702483689
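
A single average hides how activity shifts over time. A minimal sketch of tracking day-over-day changes in scores and comment counts, assuming `timestamp` parses as a datetime (the four-row frame is made-up illustration data, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real data.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-28 09:00:00", "2021-01-28 21:30:00",
        "2021-01-29 08:15:00", "2021-01-29 19:45:00",
    ]),
    "score": [10, 30, 100, 300],
    "comms_num": [2, 4, 20, 60],
})

# Group posts into calendar days and average each metric per day.
daily = (
    sample.set_index("timestamp")
          .resample("D")[["score", "comms_num"]]
          .mean()
)

# Day-over-day change: positive values mean engagement rose versus the previous day.
daily_change = daily.diff()

print(daily)
print(daily_change)
```

The same resampling could be applied to the `sent_score` column to see whether title sentiment moves together with engagement.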


## Investor Sentiment Analysis

To gain a deeper understanding of investor sentiment, we will analyze the content of WSB posts.

- Keyword extraction: Using natural language processing techniques, we can extract keywords from the post titles to reveal investors' focuses and sentiments.


In [7]:
# Build these once instead of on every call.
# NLTK's tokenizer and stopwords data must be downloaded beforehand.
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def extract_keywords(text, num_keywords=5):
    # Tokenize, lowercase, and keep alphanumeric tokens that are not stop words.
    words = [word.lower() for word in word_tokenize(text)]
    filtered_words = [word for word in words if word.isalnum() and word not in STOP_WORDS]

    # Stem the remaining words and count them.
    stemmed_words = [STEMMER.stem(word) for word in filtered_words]
    word_freq = Counter(stemmed_words)

    # TF-IDF-style score computed within this single title:
    # term frequency times a log term that down-weights repeated words.
    # This is not a corpus-level IDF.
    total_words = len(filtered_words)
    keyword_scores = {word: (freq / total_words) * np.log(total_words / (1 + freq)) for word, freq in word_freq.items()}

    sorted_keywords = sorted(keyword_scores.items(), key=lambda x: x[1], reverse=True)
    return [word for word, score in sorted_keywords[:num_keywords]]

wsb['keywords'] = wsb['title'].apply(extract_keywords)

In [8]:
keywords=[keyword for kws in wsb['keywords'] for keyword in kws]
mostkw=Counter(keywords)
mostkw.most_common(10)

[('gme', 6524),
 ('buy', 3466),
 ('hold', 3456),
 ('robinhood', 2653),
 ('amc', 2596),
 ('yolo', 2007),
 ('stock', 1838),
 ('go', 1734),
 ('sell', 1703),
 ('still', 1409)]

After applying the TF-IDF-style scoring to the titles, the results closely match the point I made above. Apart from general trading vocabulary such as buy, hold, stock, and sell, the remaining keywords are the main topics of discussion: gme, robinhood, amc, and yolo.
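
The scoring in `extract_keywords` works within a single title, so its log term depends only on that title's own counts. For comparison, here is a minimal sketch of a corpus-level TF-IDF, where the inverse document frequency is computed across all titles. The `tfidf_keywords` helper, the smoothed IDF formula, and the sample token lists are illustrative assumptions, not the method used above:

```python
import math
from collections import Counter


def tfidf_keywords(tokenized_titles, num_keywords=5):
    """Return the top TF-IDF tokens for each title in a small corpus."""
    n_docs = len(tokenized_titles)

    # Document frequency: in how many titles does each token appear?
    doc_freq = Counter()
    for tokens in tokenized_titles:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized_titles:
        term_freq = Counter(tokens)
        scores = {
            # tf = share of the title; idf = smoothed log inverse document frequency.
            word: (freq / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, freq in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:num_keywords])
    return results


# Hypothetical, already-tokenized titles.
sample = [
    ["gme", "moon", "hold"],
    ["gme", "buy", "dip"],
    ["amc", "hold", "gme"],
]
print(tfidf_keywords(sample, num_keywords=2))
```

Because "gme" appears in every sample title, its IDF is the lowest and it drops out of each title's top two. That is the main behavioral difference from the per-title scoring, which cannot tell a word that is common across the whole corpus from a rare one.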

## Conclusion

After completing an in-depth examination of Reddit WallStreetBets (WSB) posts, we have obtained valuable knowledge about market patterns and investor attitudes. The dataset from Kaggle allows us to analyze important factors like post titles, scores, comment counts, and creation times.

Our results show that the stocks GME and AMC, along with the trading platform Robinhood, have been popular subjects of discussion among investors, especially in light of the pandemic. Averaging sentiment scores over post titles gives a mean of about 0.05, just above the threshold we used for positive sentiment, so the general market sentiment is only slightly positive.

Moreover, our analysis of investor sentiment, employing methods such as keyword extraction and sentiment scoring using natural language processing techniques, has brought to light the key focuses and sentiments of investors. Keywords such as "gme," "buy," "hold," "robinhood," "amc," and "yolo" have become important subjects of conversation.

In short, this examination of WSB posts offers important insights for investors, analysts, and policymakers by providing a better grasp of market trends and investor attitudes. Future work can dig deeper into these data to reveal more about market dynamics and investor behavior.


## Acknowledgements

I would like to express my gratitude to the providers of the Reddit WallStreetBets Posts dataset. This analysis would not have been possible without their valuable contribution. The dataset can be found at the following link:

https://www.kaggle.com/gpreda/reddit-wallstreetsbets-posts