# Problem Statement

## Project Objective
Our client (Robinhood Markets Inc.) aims to expand its services from short-term options/stocks traders (such as those found in r/WallStreetBets) to long-term investors (such as those found in r/stocks). However, since these two subreddits have different interests, jargon, and audiences, the client needs to target its advertisements to the correct subreddit (r/WallStreetBets would not be interested in long-term investment). Hence, we are tasked with developing a model that can classify whether a post belongs to r/WallStreetBets or r/stocks, in order to serve each post with the corresponding advertisement.

## Introduction:
r/WallStreetBets (also known as r/wsb) is a subreddit for discussing stock and option trading. It has become notable for its colorful and profane jargon, aggressive trading strategies, harassment, and for playing a major role in the GameStop short squeeze that caused losses for some US firms and short sellers in a few days in early 2021 [[1]](https://en.wikipedia.org/wiki/R/wallstreetbets). The posts in r/wsb are dominated by memes, proposals/ideas for extremely risky stock/option plays, and reports of massive gains/losses from said plays.

On the other hand, r/stocks is a subreddit for more serious discussion of stocks and options, where participants usually post analyses and discussions of various stocks and companies. Discussions of highly risky plays on stocks with low capitalization and volume (typically known as "Penny Stocks") are outright banned in the subreddit. Instead, the discussions in r/stocks are geared towards serious long-term investment, which is specifically the target of Robinhood's new expansion plan.

## Scope:
For this project, we will be scraping all the posts from both subreddits in the period from August 2021 to August 2022. We selected this time period because, in the several months prior to August 2021, the majority of the discussions in both subreddits still revolved around the GameStop short squeeze [[2]](https://en.wikipedia.org/wiki/GameStop_short_squeeze). The jargon, vocabulary, and talking points of that topic are quite different from those of subsequent topics. As such, we have decided not to include discussion of that topic in the analysis and classification project.

## Success Evaluation:
The difficulty of this task comes from the fact that it is a highly imbalanced classification problem: there are nearly 5x more posts from r/wsb than from r/stocks. As such, using simple accuracy (i.e., the ratio of correct predictions) would give a misleadingly high performance figure, since a model that always predicts r/wsb would already be correct most of the time. For this project we will focus on the precision and recall [[3]](https://en.wikipedia.org/wiki/Precision_and_recall) in predicting the target class. The metrics are defined as follows in the context of advertising to the target class:
- True positive: an r/stocks post is correctly classified as the target class and is served the advertisement
- False positive: an r/wsb post is incorrectly classified as the target class (r/stocks), so the advertisement is served to the wrong subreddit
- True negative: an r/wsb post is correctly classified as the other class and is not served the advertisement
- False negative: an r/stocks post is incorrectly classified as the other class (r/wsb), so the advertisement is not served to the target class
- Precision: the fraction of advertisements served that reach the target class
- Recall: the fraction of posts in the target class that are served the advertisement

Based on the definitions above, we aim to strike a balance between precision and recall: the client should have wide enough coverage in serving the advertisement to the target class (recall) while maintaining good enough precision so as not to waste the advertising budget on the wrong class. As such, we can use the F1-score [[4]](https://en.wikipedia.org/wiki/F-score), which takes both of these metrics into account.
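To make the metric choice concrete, the following is a minimal sketch of how precision, recall, and the F1-score follow from the confusion counts defined above. The function name `precision_recall_f1` and all counts are hypothetical and purely illustrative (they are not results from this project); the 5:1 class ratio simply mirrors the imbalance described earlier.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the target class (r/stocks).

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data set: 1,000 r/stocks posts and 5,000 r/wsb posts.

# Baseline that always predicts r/wsb: tp=0, fp=0, fn=1000, tn=5000.
baseline_accuracy = 5000 / (1000 + 5000)         # looks high despite serving no ads
baseline_scores = precision_recall_f1(tp=0, fp=0, fn=1000)

# Hypothetical classifier: tp=800, fp=200, fn=200.
model_scores = precision_recall_f1(tp=800, fp=200, fn=200)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline precision/recall/F1: {baseline_scores}")
print(f"model precision/recall/F1: {model_scores}")
```

The always-r/wsb baseline reaches a high accuracy while its recall and F1 on the target class are zero, which is why accuracy alone would be a poor success criterion here.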

## Secondary Objective:
The secondary objective of this project is to analyze the correlation between each subreddit's sentiment on a particular stock and the future performance of that stock (defined as its price change over 7 days). This is to assess whether these subreddits have any predictive capability for making stock picks. If we find that they do, we can use the subreddits' predictions to inform/supplement the analysis of the client's investment team in making their stock purchase decisions.
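As a rough illustration of the quantities involved, the sketch below computes a 7-step forward price change and a Pearson correlation using only the standard library. It assumes one price per row and treats "7 days" as 7 rows; the helper names `forward_change` and `pearson`, and the sample numbers, are hypothetical and not taken from the project's data or its eventual implementation.

```python
from math import sqrt


def forward_change(prices, horizon=7):
    """Fractional price change from each row to the row `horizon` steps later.

    The last `horizon` rows have no future price, so they are dropped.
    """
    return [
        (prices[i + horizon] - prices[i]) / prices[i]
        for i in range(len(prices) - horizon)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily closing prices and per-day sentiment scores (illustrative only).
prices = [100, 101, 99, 102, 104, 103, 105, 107, 106, 110, 108]
sentiment = [0.2, -0.1, 0.4, 0.1]

future = forward_change(prices, horizon=7)   # one value per day with a known future price
corr = pearson(sentiment[: len(future)], future)
print(f"7-step forward changes: {future}")
print(f"sentiment vs. forward change correlation: {corr:.3f}")
```

Aligning each day's sentiment with the change that follows it (rather than the change that preceded it) is what makes the correlation a test of predictive, not reactive, behavior.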

# Data Scraping

<span style="color:red">Note: To access the pre-scraped datasets, you will need to extract the .rar files from the 'data_compressed' folder and put their contents in the 'data' folder</span>.

## Imports and Settings

In [1]:
## library imports

# data processing imports
import pandas as pd
import numpy as np

# scraping imports
import requests
from pmaw import PushshiftAPI

# misc imports
import time
import datetime
from dateutil.relativedelta import relativedelta

## Subreddit posts scraping using PMAW and Pushshift.io

In [2]:
scrape_start = datetime.datetime(2021, 8, 1) # set the start date for scraping
scrape_months = 12 # indicate the total duration of subreddits to be scraped (in months)

total_scraped_post = 0 # initialize total scraped post counter
total_timer = 0 # initialize total timer

wsb = pd.DataFrame() # instantiate empty dataframe to contain scraped wsb posts
stocks = pd.DataFrame() # instantiate empty dataframe to contain scraped stocks posts

for i in range(scrape_months): # looping through the number of months to be scraped
    
    # the subreddits' post data will be obtained through the PMAW, which is a wrapper of the PushShift API
    # the subreddits will be scraped in a monthly interval, this will be done by specifying the timestamp 'after' and 'before' which the subreddits will be scraped
    after = scrape_start+relativedelta(months=+(i)) # setting the 'after' datetime as a function of the start date and the current month increment in the loop
    before = scrape_start+relativedelta(months=+(i+1)) # setting the 'before' datetime as a function of the start date and the current month increment in the loop
    
    # converting the datetime to a timestamp
    scrape_after = int(after.timestamp())
    scrape_before = int(before.timestamp())
    
    
    # SCRAPING THE WALLSTREETBETS SUBREDDIT
    
    start_timer = time.time() # starting the timer (for displaying the time required to scrape one particular month of the subreddit)
    
    wsb_posts = PushshiftAPI().search_submissions(subreddit="wallstreetbets", # using the PMAW wrapper for the Pushshift API to scrape the wsb subreddit
                                              limit=31*1000, # obtain a maximum of 31,000 posts (or all the posts available in the month), assuming at most 1000 posts per day
                                              after=scrape_after, before=scrape_before) # setting the time boundaries of the scraping
    
    wsb_current_month = [post for post in wsb_posts] # storing the month's post in a list of dictionaries
    wsb_current_month = pd.DataFrame(wsb_current_month) # converting the list to a dataframe

    wsb = pd.concat([wsb,wsb_current_month]) # concatenating the current month's dataframe with the total dataframe
    
    duration_timer = round((time.time() - start_timer)/60, 2) # stopping the timer and storing the elapsed time in minutes
    total_timer += duration_timer # incrementing the total timer with the current timer
    total_scraped_post += wsb_current_month.shape[0] # incrementing the total post count with the current post count
    
    # displaying the report for the current month
    print(f"{i+1}a | Obtained {wsb_current_month.shape[0]} posts from 'r/wallstreetbets' between {after.strftime('%m/%Y')} and {before.strftime('%m/%Y')} | elapsed time: {duration_timer} mins")
    
    
    # SCRAPING THE STOCKS SUBREDDIT
    # starting the timer (for displaying the time required to scrape one particular month of the subreddit)
    start_timer = time.time()
    
    stocks_posts = PushshiftAPI().search_submissions(subreddit="stocks", # using the PMAW wrapper for the Pushshift API to scrape the stocks subreddit
                                              limit=31*400, # obtain a maximum of 12,400 posts, assuming at most 400 posts per day
                                              after=scrape_after, before=scrape_before) # setting the time boundaries of the scraping
        
    stocks_current_month = [post for post in stocks_posts] # storing the month's post in a list of dictionaries
    stocks_current_month = pd.DataFrame(stocks_current_month) # converting the list to a dataframe

    stocks = pd.concat([stocks,stocks_current_month]) # concatenating the current month's dataframe with the total dataframe
    
    duration_timer = round((time.time() - start_timer)/60, 2) # stopping the timer and storing the elapsed time in minutes
    total_timer += duration_timer # incrementing the total timer with the current timer
    total_scraped_post += stocks_current_month.shape[0] # incrementing the total post count with the current post count
    
    # displaying the report for the current month
    print(f"{i+1}b | Obtained {stocks_current_month.shape[0]} posts from 'r/stocks' between {after.strftime('%m/%Y')} and {before.strftime('%m/%Y')} | elapsed time: {duration_timer} mins")

# displaying the overall report for the whole scraping process
print(f"SCRAPING COMPLETED! Obtained a total of {total_scraped_post} posts | Total elapsed time: {round(total_timer, 2)} mins")

1a | Obtained 25367 posts from 'r/wallstreetbets' between 08/2021 and 09/2021 | elapsed time: 3.25 mins
1b | Obtained 4567 posts from 'r/stocks' between 08/2021 and 09/2021 | elapsed time: 0.69 mins
2a | Obtained 25574 posts from 'r/wallstreetbets' between 09/2021 and 10/2021 | elapsed time: 2.83 mins
2b | Obtained 4176 posts from 'r/stocks' between 09/2021 and 10/2021 | elapsed time: 0.48 mins
3a | Obtained 25911 posts from 'r/wallstreetbets' between 10/2021 and 11/2021 | elapsed time: 3.09 mins
3b | Obtained 4441 posts from 'r/stocks' between 10/2021 and 11/2021 | elapsed time: 0.52 mins
4a | Obtained 28085 posts from 'r/wallstreetbets' between 11/2021 and 12/2021 | elapsed time: 3.21 mins
4b | Obtained 5457 posts from 'r/stocks' between 11/2021 and 12/2021 | elapsed time: 0.63 mins
5a | Obtained 20304 posts from 'r/wallstreetbets' between 12/2021 and 01/2022 | elapsed time: 2.64 mins
5b | Obtained 4340 posts from 'r/stocks' between 12/2021 and 01/2022 | elapsed time: 0.54 mins
6a | 

## Data Export

In [3]:
# exporting the dataframe to a csv file
wsb.to_csv('data/wsb.csv') 
stocks.to_csv('data/stocks.csv')