# Notebook for Meme Stock App

## Table of contents
1. Importing Libraries
2. Getting data from Reddit
3. Cleaning Data
   1. Regex method
   2. NER
   3. Word2Vec
4. Data Display
5. Web App

### 1. Importing Libraries

In [None]:
#import relevant libraries
import praw
from datetime import datetime
import pytz
import re
import pandas as pd 
from collections import Counter 
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import streamlit as st
import plotly.express as px
import yfinance as yf
from plotly import graph_objs as go
import json
import nltk
from nltk.tokenize import word_tokenize
#from gensim.models import Word2Vec  # Disabled for now: Gensim v10.2 requires 'triu' from 'scipy.linalg', which SciPy has deprecated. Install the updated gensim from GitHub until the release with the fix is out.


### 2. Getting data from Reddit
#### Webscraper vs Reddit API
Reddit data is scraped regularly, so a web-scraper-based data miner was investigated first. The Reddit API was chosen instead because Reddit actively employs anti-scraping measures, which would threaten the long-term function of the application. For API access, a developer account was created and its credentials stored in a .env file. The limitation of using the Reddit API (through the PRAW library) rather than a web scraper is that Reddit enforces a limit of 60 API calls per minute. Because of this, an API_limit function was created to measure the number of API calls the code makes per run. PRAW functions that consume multiple API calls were avoided, and the current code is well within the enforced limit. As such, API calls were kept as the method of data mining.
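The notebook's API_limit function is not shown here, so the following is only a minimal sketch of how calls per minute could be counted. The class name `ApiCallCounter`, its methods, and the sliding-window approach are assumptions for illustration; only the 60-calls-per-minute figure comes from the text above.

```python
import time
from collections import deque


class ApiCallCounter:
    """Hypothetical helper: count calls and report how many fall in the last window."""

    def __init__(self, max_calls=60, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls            # limit quoted in the text: 60 per minute
        self.window_seconds = window_seconds
        self.clock = clock                    # injectable so the counter can be tested
        self.total = 0                        # calls recorded over the whole run
        self._recent = deque()                # timestamps still inside the window

    def record(self):
        """Call this once for every request sent to the API."""
        now = self.clock()
        self._recent.append(now)
        self.total += 1
        self._drop_expired(now)

    def calls_in_window(self):
        self._drop_expired(self.clock())
        return len(self._recent)

    def within_limit(self):
        return self.calls_in_window() <= self.max_calls

    def _drop_expired(self, now):
        # Discard timestamps that are older than the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
```

Calling `record()` next to each request and printing `total` at the end of a run would give the per-run figure described above, while `within_limit()` flags bursts that exceed the per-minute limit.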

#### Reddit API
Reddit provides an API for obtaining data from submissions and subreddits, accessed here through the PRAW library. Any subreddit, such as r/wallstreetbets, can be specified, and its submissions (otherwise known as *posts*) can be accessed along with relevant data (title, text, upvotes, etc.). Top submissions can be accessed up to a limit of 1000 submissions. Comments on a submission can also be accessed via a CommentForest instance, which PRAW exposes on each submission.

#### Submissions vs Daily Discussions
Submissions are general posts made within r/wallstreetbets. Daily Discussions, on the other hand, are daily posts in which active users comment on the stocks they are watching that day. Text data was mined from both submissions and Daily Discussion comments. Other parameters such as upvote count, number of comments, author, and time were also extracted in case a use case arose.
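The helper functions used in the next cell are not defined in this notebook, so here is one possible sketch of them. It assumes a `praw.Reddit` client is passed in explicitly as `reddit` (the notebook's own version presumably relies on a client created elsewhere), and the dictionary keys are illustrative names rather than the author's schema. The attributes read from each submission (`title`, `selftext`, `author`, `created_utc`, `score`, `upvote_ratio`, `num_comments`) are standard PRAW submission attributes.

```python
from datetime import datetime, timezone


def get_top_submissions(reddit, subreddit_name, limit=None, time_filter="year"):
    """Return an iterator over top submissions; `reddit` is assumed to be a praw.Reddit client."""
    return reddit.subreddit(subreddit_name).top(time_filter=time_filter, limit=limit)


def get_submission_data(submissions):
    """Flatten submission objects into plain dictionaries (one per submission)."""
    rows = []
    for s in submissions:
        rows.append({
            "id": s.id,
            "title": s.title,
            "text": s.selftext,
            # author is None when the account has been deleted
            "author": str(s.author) if s.author is not None else None,
            "created_utc": datetime.fromtimestamp(s.created_utc, tz=timezone.utc),
            "score": s.score,
            "upvote_ratio": s.upvote_ratio,
            "num_comments": s.num_comments,
        })
    return rows
```

The list of dictionaries can be passed straight to `pd.DataFrame(rows)` for later cleaning and counting.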

In [None]:
#Access top submissions from r/Wallstreetbets without submission limit and within a time range of 1 year
top_submissions = get_top_submissions('wallstreetbets', limit=None, time_filter='year')

#Extract data from reddit.submission object such as title, text, author, date, upvote_ratio, etc. 
top_submission_data = get_submission_data(top_submissions)

## 3. Cleaning Data

### Data Characteristics
The submission and comment data is unstructured and diverse, containing links, emojis, mixed upper and lower case, uneven spacing, misspellings, and more. On brief observation, most posts that mention specific companies use 'tickers', the short letter codes that identify a company's stock on an exchange (for example, GME for GameStop); however, some posts write out the full company name.

As such, it was quite difficult to interpret and extract company names and tickers. Company names are hard to extract because they follow no distinct linguistic pattern. Many consist of multiple words, such as 'Bank of America' or 'Morgan Stanley', whilst others are single nouns of any length, which are indistinguishable from non-company nouns. Tickers, on the other hand, are easier to extract because they follow a distinct pattern of 1-5 capitalised letters. Nonetheless, extracting tickers presented many challenges: submissions and comments do not always capitalise tickers, and they often contain acronyms as well as words and sentences capitalised for emphasis.

### Data preparation:
The first step was to remove punctuation, HTTP links, emojis, and '\n', the character that marks the start of a new line. Stopwords, meaning very common words such as 'the', 'is', and 'and' that carry little meaning on their own, were then removed using the NLTK library.
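The cleaning steps above can be sketched as a single function. This is a minimal illustration, not the notebook's implementation: the name `clean_text`, the small inline stopword set (standing in for NLTK's list), and the emoji character ranges are all assumptions.

```python
import re

# Small illustrative stopword set (assumption); NLTK's English list could be substituted.
STOPWORDS = {"a", "an", "and", "are", "at", "for", "in", "is", "it", "of", "on", "the", "to", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage (assumption): common pictograph, symbol, and flag blocks.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\U0000FE0F\U0000200D]"
)
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_text(text):
    """Remove links, emojis, punctuation, newlines, and stopwords, keeping original case."""
    text = URL_RE.sub(" ", text)      # links first, before punctuation removal breaks them up
    text = EMOJI_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    tokens = text.split()             # split() also discards newlines and repeated spaces
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(kept)
```

Case is deliberately left untouched so that capitalised tickers survive for the extraction step. One caveat of this sketch is that case-insensitive stopword removal would also drop a ticker that happens to spell a stopword.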

The next step was to extract the tickers and/or company names mentioned in each post or comment. Three methods were considered: regex, NER, and a machine learning method. To test which method performs best, a 'testing text' was created from Reddit comments and submissions containing 100 tickers and 100 company names. Each method is described below, along with its test results.

### Testing Method
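One way to rate each method against the testing text is to compare what it extracts with the known list of tickers and company names. The sketch below is an assumption about how such a rating could work; the function name `score_extraction` and the choice of recall and precision as metrics are illustrative, not taken from the notebook.

```python
def score_extraction(extracted, expected):
    """Compare extracted names against the expected list, ignoring case and duplicates."""
    found = {e.strip().lower() for e in extracted}
    wanted = {e.strip().lower() for e in expected}
    hits = found & wanted
    return {
        "hits": len(hits),
        # share of expected names that the method recovered
        "recall": len(hits) / len(wanted) if wanted else 0.0,
        # share of extracted names that were actually expected
        "precision": len(hits) / len(found) if found else 0.0,
        "missed": sorted(wanted - found),
    }
```

Recall shows how many of the 100 tickers and 100 company names a method recognises, while precision penalises a method that also returns many unrelated words.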

### Regex method
From the cleaned text, the first method extracted all words matching the pattern of 1-5 capital letters and then removed any words that had been manually added to a blacklist. The rationale was that most posts observed used tickers when discussing a company, so ticker matching alone could meet the app's needs. As expected, no company names were extracted.
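A minimal sketch of this approach is shown below. The pattern follows the 1-5 capital letter rule described above; the function name `extract_tickers` and the blacklist entries are illustrative assumptions, since the notebook's actual blacklist is not shown.

```python
import re
from collections import Counter

# Whole words made of 1 to 5 capital letters.
TICKER_RE = re.compile(r"\b[A-Z]{1,5}\b")

# Illustrative blacklist (assumption): capitalised words that are not tickers.
BLACKLIST = {"I", "A", "YOLO", "DD", "CEO", "USA", "WSB"}


def extract_tickers(text, blacklist=BLACKLIST):
    """Count candidate tickers in the text, skipping blacklisted words."""
    candidates = TICKER_RE.findall(text)
    return Counter(c for c in candidates if c not in blacklist)
```

For example, `extract_tickers("YOLO on GME and AMC, GME to the moon")` counts GME twice and AMC once, while a written-out name such as 'Bank of America' produces no matches, which is consistent with no company names being extracted.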

results

### NER
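NER, or named entity recognition, labels spans of text as entities such as organisations, people, or places; pretrained NER models are available in spaCy. Its potential benefit over regex is that it can recognise written-out company names, which the ticker pattern cannot. This method was attempted.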

results were

### Word2Vec
* Word2Vec was used because the app required a more customised data cleaning method to identify companies and tickers. Its benefit is that the similarity between words can be calculated, which helps to cover misspelled words and similar variants. The difficulty ...
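The similarity calculation mentioned above is typically cosine similarity between word vectors. The sketch below shows that calculation in plain Python. The vectors in `toy_vectors` are made up for illustration; in practice they would be learned by a Word2Vec model trained on the Reddit text, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors):
    """Return the (word, similarity) pair closest to `word` among the other entries."""
    target = vectors[word]
    others = ((w, cosine_similarity(target, vec)) for w, vec in vectors.items() if w != word)
    return max(others, key=lambda pair: pair[1])


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "tesla": [0.9, 0.1, 0.0],
    "tsla": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
```

With these toy values, `most_similar("tesla", toy_vectors)` picks "tsla", which is the kind of link between a company name and its ticker or a misspelling that the method relies on.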


TODO section
1. Testing: create a submission post of 100 tickers and 100 companies to see which method extracts the most against the list
   * Come up with a rating for each method: compare against the company list to see how many are recognised
2. Create text_cleaning function
   1. Remove punctuation
   2. Remove stopwords
   3. Remove '\n'
   4. Remove links
   5. Remove emojis
3. Create testing function
   1. Create submission text test
   


## 4. Data Display

## 5. Web App