# Notebook for Meme Stock App

## Table of contents
1. Importing Libraries
2. Investigating Optimum Ticker Extraction Method
   1. Regex Method
   2. String-Matching (FuzzyWuzzy)
   3. Word Vectorisation (GloVe)
3. Getting data from Reddit subreddits
4. Cleaning text data to extract companies and stock tickers
   1. Regex method
   2. Word2vec method
5. Streamlit webapp

## 1. Importing Libraries

In [1]:
from get_reddit import RedditSubmissions, RedditAPIHelper
import praw
import os
from dotenv import load_dotenv

from utils import *

from extraction_models import GloveModel, FuzzModel, RegexExtraction



In [None]:
# #import relevant libraries
# import praw
# from datetime import datetime
# import pytz
# import re
# import pandas as pd 
# from collections import Counter 
# from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# import streamlit as st
# import plotly.express as px
# import yfinance as yf
# from plotly import graph_objs as go
# import json
# import nltk
# from nltk.tokenize import word_tokenize
# #from gensim.models import Word2Vec #pip install updated gensim library from github, new release to fix bug due soon. Gensim v10.2 has bug (scipy deprecated 'triu' from 'scipy.linalg', triu required for gensim v10.2)


## 2. Ticker & Company Extraction Method Analysis

In [2]:
# load test data from file
test_data_file = 'test_data/reddit_test_data.txt'
actual_values_file = 'test_data/reddit_data_answers.txt'
actual_tickervalues_file = 'test_data/reddit_answers_tickers.txt'

#load test files into variable
test_data = ModelUtils.load_text_files(test_data_file)
true_values = ModelUtils.load_text_files(actual_values_file)
true_values_tickers = ModelUtils.load_text_files(actual_tickervalues_file)

#convert results to list of 'tokens' for performance_evaluation function
true_values_list = ModelUtils.convert_comma_separated_string_to_list(true_values)
true_values_tickers_list = ModelUtils.convert_comma_separated_string_to_list(true_values_tickers)

#load company data from public dictionary to make look-up dictionaries/lists
company_data = ModelUtils.get_company_information()

### Method 1: Regex Extraction
To use this method, we first create a list of company tickers from the company data. Regex extraction requires that the tokenised words keep their original case, which the preprocess_text function handles through its lower_case argument. The extract_tickers function evaluates each token: if it fits the pattern of a ticker (1-5 consecutive capital letters) and matches an entry in the list of tickers, it is counted as a match. The matches are compared against a list of true ticker values (true_values_tickers_list) extracted from the sample Reddit text. Because this method can only identify tickers and not company names, it is validated against a different set of true values than the string-matching and vectorisation methods, which makes it hard to compare the regex model directly against the other two.
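The matching rule described above can be sketched in a few lines. This is a minimal illustration and not the actual RegexExtraction.extract_tickers implementation, which is not shown in this notebook; the function name extract_known_tickers and the sample inputs are hypothetical.

```python
import re

# A ticker-shaped token: 1 to 5 capital letters and nothing else.
TICKER_PATTERN = re.compile(r"[A-Z]{1,5}")


def extract_known_tickers(tokens, known_tickers):
    """Return tokens that look like tickers and appear in known_tickers.

    Hypothetical helper for illustration; the notebook's own
    extract_tickers may differ (for example, it also reports timing).
    """
    known = set(known_tickers)  # set membership keeps each lookup cheap
    return [
        token
        for token in tokens
        if TICKER_PATTERN.fullmatch(token) and token in known
    ]


# Illustrative input only; not the notebook's test data.
sample_tokens = ["I", "bought", "GME", "and", "AMC", "YOLO", "gme"]
sample_tickers = ["GME", "AMC", "NOK"]
matches = extract_known_tickers(sample_tokens, sample_tickers)
print(matches)
```

Tokens such as "I" and "YOLO" fit the pattern but are rejected by the ticker list, while "gme" is rejected by the pattern itself, which is why the tokens must not be lowercased beforehand.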

In [3]:
# Create List of tickers
ticker_list = RegexExtraction.create_ticker_list(company_data)

# process the test data into tokens
tokens_upper = ModelUtils.preprocess_text(test_data, lower_case = False)

# run the RegexExtraction method and extract the tickers and time to identify each ticker
regex_tickers, regex_time = RegexExtraction.extract_tickers(tokens_upper, ticker_list)

# Evaluate the precision and sensitivity of the ticker_extraction method 
regex_precision, regex_sensitivity = ModelUtils.evaluate_model_performance(true_values_tickers_list, regex_tickers)

In [9]:
print(f'Tickers identified: {regex_tickers}')
print(f'Time taken per ticker: {regex_time}s')
print(f'Precision of Regex Model: {regex_precision:.3f}')
print(f'Sensitivity of Regex Model: {regex_sensitivity:.3f}')

Tickers identified: ['GME', 'GME', 'GME', 'AMC', 'GME', 'AMC', 'NOK', 'GME', 'GME', 'AMC', 'GME', 'GME', 'AG', 'SLV', 'GME', 'GME', 'JPM', 'GME', 'SLV', 'AG', 'AG', 'SLV', 'AG', 'NOK', 'GO', 'GME', 'NOK', 'PLTR', 'BB', 'GME', 'NOK', 'SU', 'AMC', 'TSLA', 'TSLA', 'DDS', 'DDS', 'GME', 'USA', 'GME', 'NOK', 'GME', 'BB', 'AMC', 'GME', 'BB', 'AMC', 'NOK', 'SLV', 'PSLV', 'CTRM', 'VALE', 'ZOM', 'AGI']
Time taken per ticker: 3.7037853627477973e-05s
Precision of Regex Model: 0.944
Sensitivity of Regex Model: 0.836


#### Regex Extraction Results
The results show that this method is fast, taking about 3.70e-05 seconds per ticker, primarily because of its low algorithmic complexity. The method was also precise: 94.4% of the tickers it extracted were true values. Its sensitivity was 83.6%, meaning it identified 83.6% of the true values in the text.
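For reference, precision and sensitivity over lists of extracted tokens can be computed as sketched below. This assumes that repeated mentions are counted individually (a multiset comparison); the actual ModelUtils.evaluate_model_performance implementation is not shown here and may count matches differently. The function name and example lists are hypothetical.

```python
from collections import Counter


def precision_and_sensitivity(true_tokens, predicted_tokens):
    """Compute (precision, sensitivity) for two lists of tokens.

    Assumption: each predicted token can match at most one true
    mention, so duplicates are compared by count.
    """
    true_counts = Counter(true_tokens)
    predicted_counts = Counter(predicted_tokens)
    # Multiset intersection keeps the smaller count for each token.
    true_positives = sum((true_counts & predicted_counts).values())
    precision = true_positives / len(predicted_tokens) if predicted_tokens else 0.0
    sensitivity = true_positives / len(true_tokens) if true_tokens else 0.0
    return precision, sensitivity


# Illustrative lists only; not the notebook's test data.
example_true = ["GME", "GME", "AMC", "NOK"]
example_predicted = ["GME", "AMC", "USA"]
precision, sensitivity = precision_and_sensitivity(example_true, example_predicted)
print(f"precision={precision:.3f}, sensitivity={sensitivity:.3f}")
```

Precision answers "how many of the extracted tokens were correct", while sensitivity answers "how many of the true tokens were found".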

### Method 2: String-Matching Extraction (FuzzyWuzzy)


### 2. Getting data from Reddit
#### Webscraper vs Reddit API
Reddit data is regularly collected by scraping, so a webscraper-based data miner was investigated first. The decision to move to the Reddit API was made because Reddit actively employs anti-webscraping measures. To keep the application working in the long term, a developer account was created and its credentials stored in a .env file. The limitation of using the Reddit API (through the PRAW library) rather than a webscraper is that Reddit enforces an API call limit of 60 calls per minute. Because of this, an API_limit function was created to measure the number of API calls the code makes per run. PRAW functions that use multiple API calls were avoided, and the current code is well within the enforced limit. As such, API calls were kept as the method of data mining.
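One way to check how close a run comes to a per-minute limit is to count calls inside a sliding time window. The sketch below is a generic illustration, not the API_limit function mentioned above (which is not shown in this notebook); the class name CallRateMonitor and its parameters are hypothetical.

```python
import time
from collections import deque


class CallRateMonitor:
    """Counts calls made within a sliding time window (hypothetical helper)."""

    def __init__(self, max_calls=60, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the monitor can be tested
        self._timestamps = deque()

    def record_call(self):
        """Register one API call at the current time."""
        self._timestamps.append(self._clock())

    def calls_in_window(self):
        """Return the number of calls made within the last window."""
        cutoff = self._clock() - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)

    def within_limit(self):
        """Return True while the call count does not exceed max_calls."""
        return self.calls_in_window() <= self.max_calls


monitor = CallRateMonitor(max_calls=60, window_seconds=60.0)
for _ in range(3):
    monitor.record_call()  # in real code, call this next to each API request
print(monitor.calls_in_window(), monitor.within_limit())
```

PRAW also reports the rate-limit state of the most recent response through reddit.auth.limits, which may be a simpler check where it is available.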

#### Reddit API
Reddit provides an API for obtaining data from submissions and subreddits, which is accessed here through the PRAW library. Any subreddit, such as r/wallstreetbets, can be specified, and its submissions (otherwise known as *posts*) can be accessed along with relevant data (title, text, upvotes, etc.). Top submissions can be accessed with a limit of 1000 submissions. Comments from a submission can also be accessed via a CommentForest instance, available as the submission's comments attribute.


In [None]:
# Load the .env file
load_dotenv('credentials.env')

# Get the credentials from the .env file
user_agent = os.getenv('USER_AGENT')
client_id = os.getenv('CLIENT_ID')
client_secret = os.getenv('CLIENT_SECRET')

# Initialize the Reddit API instance
reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

#### Submissions vs Daily Discussions
Submissions are general posts made within r/wallstreetbets. Daily Discussions, on the other hand, are daily posts in which active users comment on the stocks they are watching that day. Both submissions and comments from Daily Discussions were mined for text data. Other parameters such as upvote count, number of comments, author, and time were also extracted in case a use case arose.

In [None]:
# Create an instance of RedditSubmissions with the Reddit API instance
reddit_submissions = RedditSubmissions(reddit)

# Access top submissions from r/wallstreetbets over the past year, with no explicit submission limit
submissions = reddit_submissions.get_submissions('wallstreetbets', sort='top', limit=None, time_filter='year')

#Extract data from reddit.submission object such as title, text, author, date, upvote_ratio, etc. 
submission_data = reddit_submissions.extract_submission_data(submissions)

## 3. Cleaning Data

### Data Characteristics
The submission and comment data is unformatted and diverse, containing links, emojis, mixed upper and lower case, uneven spacing, misspellings, and more. On brief observation, most posts that mention specific companies use 'tickers', the short symbols of capital letters that identify a stock on an exchange (for example, GME for GameStop); however, some posts write out the full company name.

As such, it was quite difficult to interpret and extract company names and tickers. Company names are hard to extract because they follow no distinct linguistic pattern. Many consist of multiple words, such as 'Bank of America' or 'Morgan Stanley', whilst others are single nouns of any length, which are indistinguishable from non-company nouns. Tickers, on the other hand, are easier to extract because they follow a distinct pattern of 1-5 capitalised letters. Nonetheless, extracting tickers presented challenges: many submissions and comments do not capitalise tickers, and they often contain acronyms and other words or sentences capitalised for emphasis.

### Data preparation:
The first step was to remove punctuation, HTTP links, emojis, and newline characters ('\n', which mark the start of a new line). Stop words, i.e. very common words such as 'the' or 'and' that carry little meaning on their own, were then removed using the NLTK library.
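The cleaning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's preprocess_text function: the small STOP_WORDS set stands in for a library stop-word list, emojis are removed crudely by dropping all non-ASCII characters, and the function name clean_text is hypothetical.

```python
import re
import string

# Small illustrative stop-word set; a library list would be far longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "i", "on", "for"}

URL_PATTERN = re.compile(r"https?://\S+")
NON_ASCII_PATTERN = re.compile(r"[^\x00-\x7F]+")  # crude emoji removal
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, lower_case=False):
    """Strip links, emojis, newlines, punctuation and stop words."""
    text = URL_PATTERN.sub(" ", text)        # remove links before punctuation
    text = NON_ASCII_PATTERN.sub(" ", text)  # drops emojis (and any non-ASCII text)
    text = text.replace("\n", " ")
    text = text.translate(PUNCTUATION_TABLE)
    tokens = text.split()                    # also collapses uneven spacing
    if lower_case:
        tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token.lower() not in STOP_WORDS]


raw_post = "GME to the moon 🚀🚀\nSee https://example.com for DD!"
cleaned_tokens = clean_text(raw_post)
print(cleaned_tokens)
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being broken into word fragments. Case is preserved by default because ticker detection by pattern depends on capital letters.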

The next step was to extract the tickers and/or company names mentioned in each post or comment. Three methods were considered: regex, NER, and a machine learning method. To test which method performs best, a 'testing text' was created from Reddit comments and submissions, containing 100 tickers and 100 company names. Each method is described below and its test results explained.

### Testing Method

### Regex method
From the cleaned text, the first method extracted all words that matched the pattern of 1-5 capital letters and removed any words that had been manually added to a blacklist. The rationale was that most posts observed used tickers when discussing a company, so the method was expected to capture most company mentions. As expected, no company names were extracted.
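A blacklist-based variant of the pattern match might look like the sketch below. The blacklist contents and the function name extract_candidate_tickers are hypothetical; the actual blacklist used in the app is not shown in this notebook.

```python
import re

# Whole words made of 1 to 5 capital letters.
TICKER_SHAPED = re.compile(r"\b[A-Z]{1,5}\b")

# Hypothetical blacklist of capitalised words that are not tickers.
BLACKLIST = {"I", "A", "YOLO", "DD", "CEO", "USA"}


def extract_candidate_tickers(text, blacklist=BLACKLIST):
    """Return ticker-shaped words from text that are not blacklisted."""
    return [word for word in TICKER_SHAPED.findall(text) if word not in blacklist]


candidates = extract_candidate_tickers("YOLO on GME, I think NOK and Nokia are next")
print(candidates)
```

The word "Nokia" is skipped because only its first letter is capitalised, which illustrates why a pattern of this kind cannot pick up company names.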

results

### FuzzyWuzzy
Named entity recognition (NER) is a method available through the spaCy library. It was attempted as an alternative to the regex method.

results were

### GloVe
* Word2Vec was used because the app required a more customised data cleaning method to identify companies and tickers. The benefit is that a similarity score can be calculated between words, which helps to cover misspelled words, etc.
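The similarity calculation referred to above is usually cosine similarity between word vectors. The sketch below shows the calculation only, using made-up 3-dimensional vectors; real Word2Vec or GloVe embeddings have many more dimensions, and the toy_vectors values and the most_similar helper are hypothetical.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as none
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; not taken from any trained model.
toy_vectors = {
    "tesla": [0.9, 0.1, 0.3],
    "tsla": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.0],
}


def most_similar(word, candidates, vectors):
    """Return the candidate whose vector is closest to the word's vector."""
    return max(
        candidates,
        key=lambda candidate: cosine_similarity(vectors[word], vectors[candidate]),
    )


closest = most_similar("tsla", ["tesla", "banana"], toy_vectors)
print(closest)
```

In practice a threshold on the similarity score would decide whether a token counts as a match, and that threshold would need tuning against the test text.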


## 4. Data Display

## 5. Web App