# Scraping Reddit

Our mini project was inspired by the rise of "meme stocks" like GameStop and AMC, which was driven by retail investors rallying on one particular subreddit, r/WallStreetBets. We therefore wanted to scrape data from this subreddit to see whether the amount of activity there correlates with a stock's trade volume. Our hypothesis is that more people talking about a stock means more people buying and selling that stock.

### PRAW

PRAW is an acronym for "Python Reddit API Wrapper", a Python package that allows for simple access to Reddit's API. After looking at various options, PRAW seemed like the best way to scrape data from Reddit.

In [1]:
pip install praw

Note: you may need to restart the kernel to use updated packages.



In [1]:
import praw

### Creating Credentials to Use PRAW

Credit to this tutorial for teaching us how to get our own credentials to start using PRAW:
https://towardsdatascience.com/scraping-reddit-data-1c0af3040768

In [2]:
reddit = praw.Reddit(client_id='BZPt6Wy_TiwRZA', client_secret='xafgjbf0l1wTQ0x3j8aq8_-RxdXBuA', user_agent='dsai stonks')

Version 7.1.4 of praw is outdated. Version 7.2.0 was released Wednesday February 24, 2021.
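
Hard-coding the client ID and secret in a notebook exposes them to anyone who can read it. Below is a minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT`, as well as the helper `load_reddit_credentials`, are our own illustrative choices and not part of PRAW.

```python
import os


def load_reddit_credentials(env=None):
    """Collect the keyword arguments for praw.Reddit from environment variables.

    The variable names below are hypothetical; pick whatever suits your setup.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message if any variable is unset or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage (requires praw to be imported and the variables to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

The helper only builds a dictionary, so it can be checked without contacting Reddit.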


### Date Ranges

We wanted to scrape Reddit across a range of dates, so we created a list that contained every date in this range.

Only weekdays are included in this list, because the stock markets are closed on weekends! The function pandas.bdate_range() gives us a simple way to do this, as it returns all weekdays within a specified range.

Our full set of data from Reddit spans from 22nd June 2018 to 14th April 2021. What you see in the cell below is just a subset of this, as we split up the scraping among the team. 

In [3]:
import datetime
import pandas as pd

In [4]:
datelist = pd.bdate_range(start = datetime.date(2021, 3, 26), end = datetime.date(2021, 4, 9)).to_pydatetime().tolist()
converted_datelist = []

for item in datelist:
    converted_datelist.append(item.strftime("%Y-%m-%d"))

for item in converted_datelist:
    print(item)

2021-03-26
2021-03-29
2021-03-30
2021-03-31
2021-04-01
2021-04-02
2021-04-05
2021-04-06
2021-04-07
2021-04-08
2021-04-09
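
Note that `pandas.bdate_range()` skips weekends but not market holidays. The list above includes 2021-04-02, which was Good Friday, a day on which US stock exchanges were closed. The following is a sketch of how such days could be excluded, assuming we maintain the holiday list by hand (`market_holidays` is a hypothetical name and only covers this small date range).

```python
import datetime

import pandas as pd

# Assumption: a hand-maintained list of market holidays inside the scraping range.
market_holidays = [datetime.datetime(2021, 4, 2)]  # Good Friday 2021

# freq="C" selects a custom business-day frequency, which honours `holidays`.
trading_days = pd.bdate_range(
    start=datetime.date(2021, 3, 26),
    end=datetime.date(2021, 4, 9),
    freq="C",
    holidays=market_holidays,
)

trading_day_strings = trading_days.strftime("%Y-%m-%d").tolist()
print(trading_day_strings)
```

We did not do this in our own scraping; it is shown only as an option for anyone repeating the exercise.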


## Actual Scraping
Time to start scraping some data from Reddit! 

### Thread Titles
The subreddit r/WallStreetBets has a main discussion thread that is regulated, and the bulk of the activity happens there. We accessed this thread for each date in our desired range by searching for the thread's title. Note, however, that the thread is titled differently on Fridays.

Fridays: "Weekend Discussion Thread for the Weekend of March 19, 2021"

Mon-Thurs: "Daily Discussion Thread for February 08, 2021"
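
The two title patterns above can be captured in a small helper. This is a sketch of the same rule our scraping loop applies; the function name `discussion_thread_title` is ours, and the month name produced by `strftime` assumes an English locale.

```python
import datetime


def discussion_thread_title(day):
    """Return the expected r/WallStreetBets thread title for a given weekday."""
    date_part = day.strftime("%B %d, %Y")  # e.g. "February 08, 2021"
    if day.weekday() == 4:  # Monday is 0, so Friday is 4
        return "Weekend Discussion Thread for the Weekend of " + date_part
    return "Daily Discussion Thread for " + date_part


print(discussion_thread_title(datetime.date(2021, 3, 19)))
print(discussion_thread_title(datetime.date(2021, 2, 8)))
```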

### Thread Comments
For each day, we iterate through the comments on that day's discussion thread and count how many comments mention a certain company or stock. A comment that mentions the same company several times is counted once. We did this for 10 companies: Apple, Microsoft, Costco, Tesla, Nio, GameStop, AMC, Virgin Galactic, Tilray, and Bed Bath & Beyond.
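
Our counting uses plain substring checks, which means a short keyword such as "nio" also matches inside a word like "opinion". Below is a sketch of a stricter alternative that matches whole words only, using regular expressions. It is not what we ran for the results in this notebook; the helper names and sample comments are made up for illustration.

```python
import re


def build_pattern(keywords):
    """Compile a case-insensitive pattern that matches any keyword as a whole word."""
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def count_comments_mentioning(comments, keyword_map):
    """Count, per company, how many comments contain at least one of its keywords."""
    patterns = {name: build_pattern(words) for name, words in keyword_map.items()}
    counts = {name: 0 for name in keyword_map}
    for body in comments:
        for name, pattern in patterns.items():
            if pattern.search(body):
                counts[name] += 1
    return counts


# Made-up sample comments, for illustration only.
sample_comments = [
    "In my opinion NIO is cheap",
    "Just my opinion, nothing more",
    "Costco hot dogs forever",
]
sample_keywords = {"Nio": ["nio"], "Costco": ["costco"]}
counts = count_comments_mentioning(sample_comments, sample_keywords)
print(counts)
```

Whole-word matching removes accidental hits inside longer words, but it cannot tell a ticker such as "cost" apart from the ordinary English word, so some noise remains either way.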

### Choice of Stocks
We intentionally chose a wide range of stocks to do our analysis on. We have some legit growth stocks that are popular on Reddit, such as AAPL and TSLA. We also have some meme stocks that are extremely popular on Reddit, like AMC and GME. Finally, we included some stocks that are less popular on Reddit, such as SPCE, TLRY, and BBBY. 

In [5]:
apple_list = []
microsoft_list = []
costco_list = []
tesla_list = []
nio_list = []
gme_list = []
amc_list = []
spce_list = []
tlry_list = []
bbby_list = []

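for item in datelist:
    # Friday threads use the "Weekend Discussion" title; other weekdays use "Daily Discussion".
    if item.strftime("%a") == "Fri":
        title = "Weekend Discussion Thread for the Weekend of " + item.strftime("%B %d, %Y")
    else:
        title = "Daily Discussion Thread for " + item.strftime("%B %d, %Y")
    
    comment_list = []
    # Take the top search hit for the title; if nothing is found, comment_list stays empty.
    submission = reddit.subreddit('WallStreetBets').search(title, sort = "relevance", limit = 1)
    for post in submission:
        print(post.title)
        # Expand up to 100 "load more comments" placeholders before flattening the comment tree.
        post.comments.replace_more(limit=100)
        for comment in post.comments.list():
            comment_list.append(comment.body.lower())

    # Number of comments collected for this date (0 means no thread was found).
    print(len(comment_list), item)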
    
    apple = 0
    microsoft = 0
    costco = 0
    tesla = 0
    nio = 0
    gamestop = 0
    amc = 0
    spce = 0
    tlry = 0
    bbby = 0
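    # Count comments (not individual occurrences) that mention each company.
    # `in` is a substring test, so short keywords can also match inside longer words.
    for string in comment_list: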
        if ("apple" in string) or ("aapl" in string):
            apple += 1
        if ("msft" in string) or ("microsoft" in string):
            microsoft += 1
        if ("costco" in string) or ("cost" in string):
            costco += 1
        if ("tsla" in string) or ("tesla" in string):
            tesla += 1
        if ("nio" in string):
            nio += 1
        if ("gme" in string) or ("gamestop" in string):
            gamestop += 1
        if ("amc" in string):
            amc += 1
        if ("spce" in string) or ("virgin" in string):
            spce += 1
        if ("tlry" in string) or ("tilray" in string):
            tlry += 1
        if ("bbby" in string) or ("bed" in string) or ("bath" in string):
            bbby += 1
            
    apple_list.append(apple)
    microsoft_list.append(microsoft)
    costco_list.append(costco)
    tesla_list.append(tesla)
    nio_list.append(nio)
    gme_list.append(gamestop)
    amc_list.append(amc)
    spce_list.append(spce)
    tlry_list.append(tlry)
    bbby_list.append(bbby)

Weekend Discussion Thread for the Weekend of March 26, 2021
10252 2021-03-26 00:00:00
Daily Discussion Thread for March 29, 2021
9225 2021-03-29 00:00:00
Daily Discussion Thread for March 30, 2021
9102 2021-03-30 00:00:00
Daily Discussion Thread for March 31, 2021
8994 2021-03-31 00:00:00
Daily Discussion Thread for April 01, 2021
9164 2021-04-01 00:00:00
0 2021-04-02 00:00:00
Daily Discussion Thread for April 05, 2021
9062 2021-04-05 00:00:00
Daily Discussion Thread for April 06, 2021
7912 2021-04-06 00:00:00
Daily Discussion Thread for April 07, 2021
8301 2021-04-07 00:00:00
Daily Discussion Thread for April 08, 2021
8258 2021-04-08 00:00:00
Weekend Discussion Thread for the Weekend of April 09, 2021
9424 2021-04-09 00:00:00


### Exporting to CSV

This final portion of code collects our stored results into a DataFrame, which we then export to a CSV file for use in our next notebook.

In [6]:
import pandas as pd
df = pd.DataFrame(list(zip(converted_datelist, apple_list, microsoft_list, costco_list, tesla_list, nio_list, gme_list, amc_list, spce_list, tlry_list, bbby_list)),
                 columns = ['Date', 'Apple', 'Microsoft', 'Costco', 'Tesla', 'Nio', 'Gamestop', 'AMC', 'Virgin Galactic', 'Tilray', 'Bed Bath & Beyond'])

### Results

Here are some of the results. As explained earlier, this is just a subset of our final dataset, which is much larger. The number in each cell is the number of comments in that day's Reddit thread that mentioned that particular company/stock.

In [7]:
df

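| | Date | Apple | Microsoft | Costco | Tesla | Nio | Gamestop | AMC | Virgin Galactic | Tilray | Bed Bath & Beyond |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-03-26 | 80 | 19 | 49 | 178 | 110 | 442 | 60 | 14 | 11 | 41 |
| 1 | 2021-03-29 | 119 | 10 | 27 | 292 | 93 | 257 | 72 | 18 | 7 | 26 |
| 2 | 2021-03-30 | 169 | 16 | 28 | 482 | 119 | 394 | 99 | 13 | 19 | 26 |
| 3 | 2021-03-31 | 169 | 95 | 16 | 425 | 152 | 239 | 91 | 6 | 67 | 36 |
| 4 | 2021-04-01 | 116 | 34 | 23 | 806 | 219 | 172 | 293 | 6 | 21 | 22 |
| 5 | 2021-04-02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2021-04-05 | 142 | 46 | 30 | 667 | 84 | 658 | 275 | 22 | 35 | 20 |
| 7 | 2021-04-06 | 114 | 28 | 20 | 315 | 78 | 199 | 99 | 9 | 32 | 33 |
| 8 | 2021-04-07 | 141 | 29 | 20 | 351 | 148 | 179 | 74 | 14 | 71 | 27 |
| 9 | 2021-04-08 | 313 | 33 | 30 | 250 | 73 | 383 | 66 | 10 | 83 | 25 |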


In [8]:
df.to_csv("darren4.csv")
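
One thing to be aware of: `to_csv` writes the row index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is read with `pd.read_csv`. The sketch below shows how `index=False` avoids this. It uses a tiny two-row frame and an in-memory buffer instead of a real file, purely for illustration.

```python
import io

import pandas as pd

# Two rows taken from the results above, just to keep the example small.
small_df = pd.DataFrame({"Date": ["2021-03-26", "2021-03-29"], "Apple": [80, 119]})

buffer = io.StringIO()
small_df.to_csv(buffer, index=False)  # omit the row index from the output
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```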

### Final Comments

Unfortunately, there were some anomalies in the data. Row 5 (2021-04-02) is a whole row of zeroes. On some days, our search did not find the discussion thread on r/WallStreetBets, probably because it had been archived. There is not much we can do about that except delete row 5. Hence, you might notice some missing dates in our final dataset.
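
Deleting such rows can be automated. Here is a minimal sketch, assuming a date with no data shows up as zeroes in every count column; the three-row frame reuses a few values from the results above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Date": ["2021-04-01", "2021-04-02", "2021-04-05"],
    "Apple": [116, 0, 142],
    "Tesla": [806, 0, 667],
})

# Keep a row only if at least one count column is non-zero.
count_columns = sample.columns.drop("Date")
has_data = (sample[count_columns] != 0).any(axis=1)
cleaned = sample[has_data].reset_index(drop=True)
print(cleaned)
```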

After cleaning up the anomalies and compiling all our data into one large CSV file named "scrappedData.csv", we used it together with data from the Alpha Vantage API. **On to the next notebook!**