## Problem Statement and Initial Ideas

**Problem Statement:**

The wallstreetbets reddit community gets a lot of flack for its incredibly bold -- at times, downright absurd -- trades. However, it was very right about the GME saga, thanks for the EXPR adn AMC sympathy gains. How can I measure the success of the community in taking certain trade positions? 

**Initial Idea:** 

I would need to consider what stocks to look for and where to look for them within wallstreetbets. Would it be a good idea to look for all traded stocks on the NYSE? Certain industries? The NASDAQ (ultimately came back to this one)? 

I began by trying to consider the stocks in the S&P 500. Webscraping these stocks from wikipedia was simple enough and shown below. The initial idea was to look at the section of posts called 'due dilligence.' Here, redditors post their trade ideas and justifications, sometimes justification is a stretch. 

The idea would be to scrape the post titles, then see the number of likes, comments, and the stocks mentioned. If I could get a substantial list of posts, then I could use the number of likes and comments as a benchmark for how much redditors liked the due diligence and, subsequently, the stock. 

However, this idea proved flawed. It would be difficult to get enough variation of stocks with this method. The post titles often did not clearly state the stock by ticker (ex: AAPL or $APPL). The spirit of the idea did prove helpful though: mentions. 

In [2]:
from bs4 import BeautifulSoup
import pandas as pd 
import requests
import time
import csv

# import s and p 500 companies only 
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
headers = { 'User-Agent':'Mozilla/5.0'}

page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

# clean way to get data table with all info
stocktable = soup.find('table', {'class': 'wikitable'}) # gets first instance
df = pd.read_html(str(stocktable))
df=pd.DataFrame(df[0])


# dirty way to get stock symbols as list 
tab = soup.find('table', class_= "wikitable")
test = [stock.text for stock in tab.find_all('a', rel = 'nofollow')]
ticker = [stock for stock in test if stock != 'reports']

In [3]:
# when idea was to look at the posts for due diligence and see the mentions of stocks

from bs4 import BeautifulSoup
import requests
import time
import csv

url = 'https://ns.reddit.com/r/wallstreetbets/search?sort=new&restrict_sr=on&q=flair%3ADD'
headers = {'User-Agent': 'Mozilla/5.0'}


page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
for post in soup.find_all('a', attrs={'class': 'search-title'}):
    print(post.text)
    
posts = [post.text for post in soup.find_all('a', attrs={'class': 'search-title'})]

TLRY needs your help!
Remember January 2021? Pepperidge Farm Remembers. What's more, it looks like things are stacking up for not only a repeat, but a larger run.
The VCs are all making money on Heliogen (HLGN) and it's time we did too (by shorting it)
Undervalued Stock $GS
Listen up dingleberries, you're about to miss the ☢️uranium☢️ rocketship🚀🚀🚀
$LCID puts are free money
$LMND clearly a short target Come on Apes break shorty
SKLZ at the all time low. Upside inevitable?
Due diligence: AAPL and TSLA
$MYBUF | $BORNY The Most Significant Advancement in Science Since They Invented the Sun DD
$ALL state fueling up the rocket. That's Allstate's stand.
$LH My fellow retards is what we have been waiting for… The stars align
Should you listen to Jim Cramer? - I analyzed 20,000+ recommendations made by Jim Cramer during the last 5 years. Here are the results.
$ZIP: set up for a massive run in 2022
DD - What are Dark Pools and how do they work?
MSGS - Madison Square Garden significantly underva

In [15]:
# get the past months due diligence 

days_ago = 0
url = 'https://ns.reddit.com/r/wallstreetbets/search?sort=new&restrict_sr=on&q=flair%3ADD'
headers = {'User-Agent': 'Mozilla/5.0'}


page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

while days_ago <=31:

    for post in soup.find_all('div', class_ = "search-result"):

        # get post name, likes, comments, and days ago submitted 
        name = post.find('a', class_ = 'search-title').text
        # issue is new posts may not have likes, comments, or even content
        if post.find('div', class_ = 'md') is None: 
            content = ""
        else: 
            content = post.find('div', class_ = 'md').text
        comments = post.find('a', class_='search-comments').text.split(' ')[0]
        if post.find('span', class_ = 'search-score') is None: 
            likes = 0
        else: 
            likes = post.find('span', class_ = 'search-score').text.split(' ')[0]
        days_ago = post.find('span', class_ = 'search-time').text.split(' ')[1:]
        if days_ago[1] == 'hours':
            days_ago = 0
        elif days_ago[1] == 'day':
            days_ago = 1
        elif days_ago[1] == 'days':
            days_ago = int(days_ago[0])
        elif days_ago[1] == 'month':
            days_ago = 32
            
        result = [name, content, likes, comments, days_ago]

        with open('dd.csv', 'a') as file:
            writer = csv.writer(file)
            writer.writerow(result)
            
    next_button = soup.find('span', class_ = 'nextprev')
    next_page_link = next_button.find("a", {'rel':'nofollow next'}).attrs['href']
    time.sleep(2)
    page = requests.get(next_page_link, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    file.close()

In [16]:
# data frame of past months due diligence p 
dd = pd.read_csv('dd.csv', header = 0, names = ["post", "content", "likes", "comments", "days_ago"])
dd

Unnamed: 0,post,content,likes,comments,days_ago
0,Remember January 2021? Pepperidge Farm Remembe...,"Hi everyone, bob here.\nI posted this same dat...",0,48,"['44', 'minutes', 'ago']"
1,The VCs are all making money on Heliogen (HLGN...,Summary\nHeliogen (HLGN) got a solid pump on t...,0,3,"['50', 'minutes', 'ago']"
2,Undervalued Stock $GS,"I know, I know. Someone is recommending a ban...",0,9,"['51', 'minutes', 'ago']"
3,"Listen up dingleberries, you're about to miss ...",Uranium has had a few parabolic-type moves in ...,0,24,"['1', 'hour', 'ago']"
4,$LCID puts are free money,Just wanted to give people a heads up that $LC...,39,40,0
...,...,...,...,...,...
169,NIO: Addressing Near-Term Risks.,Call to Action\nDrawing attention to the key d...,3,34,32
170,"PLBY - Cardi B named ""Creative Director in Res...",Keeping this one short and sweet. I've been sh...,469,260,32
171,"$Qdel, trading at 6 times their q1 2022 earnings?",After their 25.6m test kits (284.2$M)1 Federal...,10,11,32
172,Ballsack Nike Projections? 👟💩👟💩🚀😛🚀😛,Nike no no yummy puts 💩🚀\nGiven the current nu...,2,12,32


## Pivot: Daily Discussion and Most Mentioned

It did not seem that the above method would really be able to quantify the amount of support of the WSB community in the same manner as comment mentions within the daily discussion. This discussion is very raw and unfiltered, and the timing of a new one each day really isolates each post to one trading day. So, I forged ahead with getting the most mentioned tickers of the daily discussion. 

**Step 1**: Get stocks to track mentions in the daily discussion comments. 
I decided to get the nasdaq stocks by webscraping them off nasdaqtrader.com. Then create a dictionary which will be updated later as post comments are iterated over. 

**Step 2**: Get the id's for the each daily discussion page to pass along to PRAW, a Python Reddit API Wrapper. 
These id's were scraped using the beautiful soup library. 

**Step 3**: From there, I scraped the data for november daily discussion comments. I had to use the parent comments only, meaning replies to the comments were not included. This is justifiable since popular tickers will still be mentioned many more times throughout the day if all children comments (replies) are not scraped. I did this for December 31 2021 as a test. Then, repeated the process for an entire month of data -- november 2020. 

**Step 4**: The function 'toptickers' takes two arguments -- month and top. It scrapes daily discussion comments from 2021 for the 'month' inputted and returns a list of length 'top' of the most mentioned tickers. 


In [30]:
# create dictionary of nasdaq stocks and val which will be updated when found in wsb reddit comments 
import pandas as pd 

url = 'http://www.nasdaqtrader.com/dynamic/symdir/nasdaqlisted.txt'
nasdaq = pd.read_csv(url, sep = '|')

stocks = list(nasdaq['Symbol'])
# create list of 1's length of symbols
val = [0] * len(stocks)

lst1 = zip(stocks, val)
stocks_dict = dict(lst1)

In [178]:
# use webscraping to get all the needed id's so they can be mad einto list and passed to create praw objects to get comments
# maybe also scrape the last three strings to get mon-day-year for post


from bs4 import BeautifulSoup
import requests
import time
import csv

counter = 0 
url = 'https://old.reddit.com/r/wallstreetbets/search?q=flair_name%3A%22Daily+Discussion%22&restrict_sr=1&sort=new'
headers= {'User-Agent': 'Mozilla/5.0'}


page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

while counter <= 100:

    for post in soup.find_all('div', class_ = "search-result"):

        # get post name, likes, comments, and days ago submitted 
        month, day, year = post.find('a', class_ = 'search-title').text.split(' ')[-3:]
        ref = post.find('a', {'class': 'search-title'}).attrs['href'].split('/')[6]
        result = [month, day, year,ref]
        with open('wsb_daily.csv', 'a') as file:
            writer = csv.writer(file)
            writer.writerow(result)
    
        counter += 1
        
    next_button = soup.find('span', class_ = 'nextprev')
    next_page_link = next_button.find("a", {'rel':'nofollow next'}).attrs['href']
    time.sleep(2)
    page = requests.get(next_page_link, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    file.close()

In [47]:
# gives file with the id's and date information of daily discussions
# needed to pass to comments scraper
wsb = pd.read_csv('wsb_daily.csv', names= ['month', 'day', 'year', 'id'])
wsb["day"]=wsb["day"].str.replace(',','')
id_list = list(wsb['id'])
wsb[30:60]


Unnamed: 0,month,day,year,id
30,December,13,2021,rey64p
31,December,10,2021,rd68ig
32,December,10,2021,rcr4sg
33,December,09,2021,rcfcod
34,December,09,2021,rc0ux0
35,December,08,2021,rbov18
36,December,08,2021,rb9dg9
37,December,07,2021,rawn4n
38,December,07,2021,rahlge
39,December,06,2021,ra4qvi


In [258]:
# webscraping comments from wsb daily discussion (dec 31) --test run

# import praw

# submission = reddit.submission(id ='rst2f2')

# commentslist = []

# submission.comments.replace_more(limit=None)
# for comment in submission.comments.list():
#     commentslist.append(comment.body)
    



In [180]:
# checking number of appearances of ticker for dec 31
# import re 

# mentions_dict = {}

# # regex pattern which will be used to find stocks from stoks_dict in comments
# # can refine this to capture other such possibilities 
# pattern = r'\b([A-Z]+)\b'


# for comment in commentslist: 
#     for ticker in re.findall(pattern, comment): 
#         if ticker in stocks_dict:
#             if ticker not in mentions_dict: 
#                 mentions_dict[ticker] = 1
#             else:
#                 mentions_dict[ticker] += 1

# # top 10 most mentioned (trending) nasdaq stock tickers
# trending = pd.Series(mentions_dict).sort_values(ascending = False)
# trending[1:10]

NVDA    22
AMD     13
AAPL    11
QQQ     10
ON      10
AMZN     9
SOFI     7
HOOD     6
GET      6
dtype: int64

In [69]:
# write as a function - get month and list of top most mentioned stocks
import praw 
import re

def toptickers(month, top):
    '''insert month and number of top stocks, return top mentioned stocks'''
    nov = wsb[wsb['month'] == month]
    ids = list(nov['id'])
    mentions_dict = {}
    pattern = r'\b([A-Z]+)\b'
    
    # hid personal info 
    reddit = praw.Reddit(client_id='your_id', client_secret='your_secret', user_agent='your_agent')
    commentslist = []

    # # loop over all nov days and id's
    for day_id in ids:
        tmp = []
        submission = reddit.submission(id = str(day_id))
        tmp.append(submission.title)
        # only parent comments
        submission.comments.replace_more(limit=0)

        for comment in submission.comments.list():
            tmp.append(comment.body) 

        # saves list of lists of all parent comments
        commentslist.append(tmp)

        # update the mentions dict
        for dailycomment in tmp:
            for ticker in re.findall(pattern, dailycomment): 
                if ticker in stocks_dict:
                    if ticker not in mentions_dict: 
                        mentions_dict[ticker] = 1
                    else:
                        mentions_dict[ticker] += 1

    # top 10 most mentioned (trending) nasdaq stock tickers
    trending = pd.Series(mentions_dict).sort_values(ascending = False)
    return(trending[0:(top+1)])

In [None]:
toptickers("November", 10)

## Assessment 

**How to Assess:** From here, there had to be a justifiable measure of the assessment and appropriate way to visualize the results of the wallstreebets community. I decided the best measure for comparison would need the historical pricing of QQQ, since it is an ETF which tracks the NASDAQ.

My initial idea is to gather the top ticker mentions on wallstreetbets for a particular month (say November) and track the following months return from market open on the 1st to market close at the end of the month (say December 1 open - December 31 close). 

Thankfully, the yahoo finance library allows users to track historical prices for a ticker of choice. I created a function 'ticker_change' which takes arguments for ticker -- the ticker of interest -- and mont -- the month to compare. 

Below, we can see the very negative percent change of the top tickers in November over the period from December 1st until December 31st. 

In [34]:
# now get the price change for QQQ (nasdaq etf) from dec 1 to dec 30 
from bs4 import BeautifulSoup

import yfinance as yf

qqq = yf.Ticker("QQQ")


# get historical market data
hist = qqq.history(start="2021-12-01", end="2022-01-01")
qqq_change = ((hist['Close'][-1] - hist['Open'][0])/ hist['Open'][0]) * 100
qqq_change

0.01962124305854888

In [35]:
def ticker_change(ticker, month):
    '''insert stock and month to return percent change for month'''
    if month == 'January':
        start = '2021-01-01'   
        end = '2021-02-01'
    elif month == 'February':
        start= '2021-02-01'
        end = '2021-03-01'
    elif month == 'March':
        start= '2021-03-01'
        end = '2021-04-01'
    elif month == 'April':
        start= '2021-04-01'
        end = '2021-05-01'
    elif month == 'May':
        start= '2021-05-01'
        end = '2021-06-01'
    elif month == 'June':
        start= '2021-06-01'
        end = '2021-07-01'
    elif month == 'July':
        start= '2021-07-01'
        end = '2021-08-01'
    elif month == 'August':
        start= '2021-08-01'
        end = '2021-09-01'
    elif month == 'September':
        start= '2021-09-01'
        end = '2021-10-01'
    elif month == 'October':
        start= '2021-10-01'
        end = '2021-11-01'
    elif month == 'November':
        start= '2021-11-01'
        end = '2021-12-01'
    elif month == 'December':
        start= '2021-12-01'
        end = '2022-01-01'
        
    ticker = yf.Ticker(str(ticker))


    # get historical market data
    hist = ticker.history(start= start, end= end)
    change = ((hist['Close'][-1] - hist['Open'][0])/ hist['Open'][0]) * 100
    return(change)

In [70]:
# run toptickers, get list of ticker, run ticker_change, compare to ticker_change(qqq)

# change above tops to ust the tickers here
top_ten = list(toptickers('November', 10).index)

stocks = []
change = []

for ticker in top_ten: 
    val = ticker_change(ticker, "December")
    stocks.append(ticker)
    change.append(val)
    

d = {'stocks': stocks, 'percent change': change}
df = pd.DataFrame(data = d)
df

Unnamed: 0,stocks,percent change
0,TSLA,-8.953212
1,NVDA,-11.463324
2,LCID,-29.641276
3,AMD,-10.270002
4,RIVN,-13.978761
5,PYPL,0.431378
6,TLRY,-31.078428
7,OCGN,-27.892231
8,SOFI,-9.200552
9,AAPL,6.024607
