# Notebook for Meme Stock App

## Table of contents
1. Importing Libraries
2. Getting data from Reddit
3. Cleaning Data
4. Ticker & Company Extraction Method Analysis
   1. Regex Extraction
   2. String-matching Extraction (FuzzyWuzzy)
   3. Word Vectorisation Extraction (GloVe)
5. Extract Tickers from WSB Data
6. Display Data in Web App

## 1. Importing Libraries

In [1]:
import os

# numpy is used later to build the threshold ranges (np.arange)
import numpy as np
import praw
from dotenv import load_dotenv

from utils import *
from extraction_models import GloveModel, FuzzModel, RegexExtraction
from get_reddit import RedditSubmissions, RedditAPIHelper



## 2. Getting data from Reddit
#### Webscraper vs Reddit API
Scraping Reddit is a common way to collect data, so a web-scraper-based data miner was investigated first. Because Reddit actively employs anti-scraping measures, the decision was made to use the Reddit API instead. To keep the application working in the long term, a developer account was created and its credentials stored in a .env file. The limitation of using the Reddit API (via the PRAW library) rather than a web scraper is that Reddit enforces a limit of 60 API calls per minute. An API_limit function was therefore created to measure how many API calls the code makes per run. PRAW functions that trigger multiple API calls were avoided, and the current code stays well within the enforced limit. As such, API calls were kept as the method of data mining.
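
The call-counting idea can be sketched with the standard library alone. This is a minimal sketch and not the notebook's API_limit function: the CallCounter class and its method names are hypothetical, and the budget of 60 calls per 60 seconds is simply the figure quoted above.

```python
import time
from collections import deque


class CallCounter:
    """Hypothetical helper: counts calls made inside a sliding time window."""

    def __init__(self, max_calls=60, window_seconds=60.0):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now=None):
        """Register one API call; `now` can be injected to make tests deterministic."""
        now = time.monotonic() if now is None else now
        self._timestamps.append(now)
        self._discard_old(now)

    def calls_in_window(self, now=None):
        """Number of recorded calls that are still inside the window."""
        now = time.monotonic() if now is None else now
        self._discard_old(now)
        return len(self._timestamps)

    def within_limit(self, now=None):
        """True while the number of recent calls does not exceed the budget."""
        return self.calls_in_window(now) <= self.max_calls

    def _discard_old(self, now):
        # Drop timestamps that have fallen out of the window
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
```

Calling record() next to each request and checking within_limit() at the end of a run gives a rough picture of how close the code comes to the limit.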

#### Reddit API
Reddit provides an API for obtaining data from submissions and subreddits, which is accessed here through the PRAW library. Any subreddit, such as r/wallstreetbets, can be specified, and its submissions (otherwise known as *posts*) can be accessed along with relevant data (title, text, upvotes, etc.). Top submissions can be accessed up to a limit of 1000 submissions. Comments from a submission can also be accessed via a CommentForest instance, which is available through the submission's comments attribute.


In [2]:
# Load the .env file
load_dotenv('credentials.env')

# Get the credentials from the .env file
user_agent = os.getenv('USER_AGENT')
client_id = os.getenv('CLIENT_ID')
client_secret = os.getenv('CLIENT_SECRET')

# Initialize the Reddit API instance
reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

#### Submissions vs Daily Discussions
Submissions are general posts made within r/wallstreetbets. Daily Discussions, on the other hand, are daily posts in which active users comment on the stocks they are watching that day. Both submissions and comments from Daily Discussions were mined for text data. Other parameters such as upvote count, number of comments, author, and time were also extracted in case a use case arose.


In [3]:
# Create an instance of RedditSubmissions with the Reddit API instance
reddit_submissions = RedditSubmissions(reddit)

# Access top submissions from r/wallstreetbets without a submission limit
submissions = reddit_submissions.get_submissions('wallstreetbets', sort='top', limit=None, time_filter='year')

# Extract data from each submission object, such as title, text, author, date, upvote_ratio, etc.
submission_data = reddit_submissions.extract_submission_data(submissions)

## 3. Cleaning Data

### Data Characteristics
The submission and comment data are unstructured and diverse, containing links, emojis, mixed upper and lower case, uneven spacing, misspellings, and more. On brief observation, most posts that mention specific companies use 'tickers', the short codes of capital letters that identify a stock on an exchange (for example, GME for GameStop); however, some posts write out the full company name.

As such, company names and tickers were difficult to interpret and extract. Company names are hard to extract because they follow no distinct linguistic pattern. Many consist of multiple words, such as 'Bank of America' or 'Morgan Stanley', whilst others are single nouns of any length that are indistinguishable from non-company nouns. Tickers, on the other hand, are easier to extract because they follow a distinct pattern of 1-5 capitalised letters. Nonetheless, extracting tickers presented challenges: submissions and comments do not always capitalise tickers, and they often contain acronyms and other words or sentences capitalised for emphasis.

### Data preparation:
The first step was to clean the text. Punctuation, http links, emojis, and newline characters ('\n', which mark the start of a new line) were removed. Stopwords from the NLTK library were also removed; these are very common words, such as 'the', 'is', and 'and', that carry little meaning on their own.
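
The cleaning steps above can be sketched as follows. This is a minimal, standard-library-only sketch and not the notebook's ModelUtils.preprocess_text: the function name clean_and_tokenise is hypothetical, and the small STOPWORDS set is an illustrative stand-in for the NLTK stopword list.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's list
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of", "i", "on", "it"}

URL_PATTERN = re.compile(r"https?://\S+")

# Translation table that turns every punctuation character into a space
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_and_tokenise(text, lower_case=True):
    """Hypothetical stand-in for the notebook's preprocessing function."""
    text = URL_PATTERN.sub(" ", text)               # remove http links
    text = text.replace("\n", " ")                  # newline characters become spaces
    # Crude emoji removal: drops every non-ASCII character, accents included
    text = text.encode("ascii", "ignore").decode()
    text = text.translate(PUNCTUATION_TABLE)        # remove punctuation
    if lower_case:
        text = text.lower()
    # Stopwords are compared in lowercase so the filter works in both modes
    return [tok for tok in text.split() if tok.lower() not in STOPWORDS]


print(clean_and_tokenise("I like GME 🚀🚀\nSee https://example.com, it is the best!",
                         lower_case=False))
```

Keeping lower_case=False preserves capitalisation, which matters for any extraction step that relies on tickers being written in capitals.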

The next step was to extract the tickers and/or company names mentioned in each post or comment. Three methods were investigated: regex, string matching, and word vectorisation. To test which method performs best, a 'testing text' was created from reddit comments and submissions that contained 100 tickers and 100 company names. Each method is described below and its test results explained.


In [11]:
# Clean and tokenise the words from each post in the dataframe
submission_data['tickers'] = submission_data['all_text'].apply(lambda x: ModelUtils.preprocess_text(x, lower_case = False))

## 4. Ticker & Company Extraction Method Analysis

### Load company data for look-up dictionaries for the different models
The test data is loaded using load_text_files, and the true-value files are then converted to lists for use in the evaluate_model_performance function. The company data is loaded from the JSON file downloaded from the SEC, which should contain all companies currently listed in the USA.

In [6]:
# load test data from file
test_data_file = 'test_data/reddit_test_data.txt'
actual_values_file = 'test_data/reddit_data_answers.txt'
actual_tickervalues_file = 'test_data/reddit_answers_tickers.txt'

#load test files into variable
test_data = ModelUtils.load_text_files(test_data_file)
true_values = ModelUtils.load_text_files(actual_values_file)
true_values_tickers = ModelUtils.load_text_files(actual_tickervalues_file)

#convert results to list of 'tokens' for performance_evaluation function
true_values_list = ModelUtils.convert_comma_separated_string_to_list(true_values)
true_values_tickers_list = ModelUtils.convert_comma_separated_string_to_list(true_values_tickers)

#load company data from public dictionary to make look-up dictionaries/lists
company_data = ModelUtils.get_company_information()

### Method 1: Regex Extraction
This method first requires a list of company tickers, which is created from the company data. The regex extraction method requires that the tokenised words are not converted to lowercase, which is handled by the lower_case argument of the preprocess_text function. The extract_tickers function evaluates each token: if it fits the pattern of a ticker (i.e. 1-5 capitalised letters in a row) and matches a ticker in the list of tickers, it is considered a match. The matches are compared against a list of true tickers (true_values_tickers_list) taken from the sample reddit text. Because this method can only identify tickers and not company names, it is validated against different true values than the string-matching and vectorisation methods, which makes it hard to compare the regex model directly against the other two.
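
The two checks described above (pattern, then look-up) can be sketched as follows. This is a minimal sketch and not the RegexExtraction implementation: extract_tickers_sketch is a hypothetical name, and the short ticker list is made up for illustration.

```python
import re

# A token qualifies only if it is made up of 1-5 capital letters and nothing else
TICKER_PATTERN = re.compile(r"[A-Z]{1,5}")


def extract_tickers_sketch(tokens, known_tickers):
    """Hypothetical stand-in: keep tokens that look like a ticker AND are a known ticker."""
    known = set(known_tickers)  # a set gives constant-time membership checks
    return [tok for tok in tokens
            if TICKER_PATTERN.fullmatch(tok) and tok in known]


tokens = ["YOLO", "GME", "to", "the", "moon", "gme", "AMC", "USA"]
print(extract_tickers_sketch(tokens, ["GME", "AMC", "USA", "NOK"]))
```

In this toy run YOLO fits the pattern but is rejected by the look-up, and the lowercase gme is missed entirely. USA passes both checks, which illustrates how an ordinary capitalised word that happens to be a listed ticker can slip through.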

In [7]:
# Create List of tickers
ticker_list = RegexExtraction.create_ticker_list(company_data)

# process the test data into tokens
tokens_upper = ModelUtils.preprocess_text(test_data, lower_case = False)

# run the RegexExtraction method and extract the tickers and time to identify each ticker
regex_tickers, regex_time = RegexExtraction.extract_tickers(tokens_upper, ticker_list)

# Evaluate the precision and sensitivity of the ticker_extraction method 
regex_precision, regex_sensitivity = ModelUtils.evaluate_model_performance(true_values_tickers_list, regex_tickers)

In [9]:
print(f'Tickers identified: {regex_tickers}')
print(f'Time taken per ticker: {regex_time}s')
print(f'Precision of Regex Model: {regex_precision:.3f}')
print(f'Sensitivity of Regex Model: {regex_sensitivity:.3f}')

Tickers identified: ['GME', 'GME', 'GME', 'AMC', 'GME', 'AMC', 'NOK', 'GME', 'GME', 'AMC', 'GME', 'GME', 'AG', 'SLV', 'GME', 'GME', 'JPM', 'GME', 'SLV', 'AG', 'AG', 'SLV', 'AG', 'NOK', 'GO', 'GME', 'NOK', 'PLTR', 'BB', 'GME', 'NOK', 'SU', 'AMC', 'TSLA', 'TSLA', 'DDS', 'DDS', 'GME', 'USA', 'GME', 'NOK', 'GME', 'BB', 'AMC', 'GME', 'BB', 'AMC', 'NOK', 'SLV', 'PSLV', 'CTRM', 'VALE', 'ZOM', 'AGI']
Time taken per ticker: 3.7037853627477973e-05s
Precision of Regex Model: 0.944
Sensitivity of Regex Model: 0.836


#### Regex Extraction Results
The results show that this method is fast, taking about 3.70e-05 seconds per ticker, primarily because of its low algorithmic complexity: each token needs only a pattern check and a list look-up. The method was also very precise: 94.4% of the tickers it identified were true values. Its sensitivity was 83.6%, meaning it found 83.6% of the true values in the text.
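
For reference, precision is the share of predicted values that are correct, TP / (TP + FP), and sensitivity is the share of true values that were found, TP / (TP + FN). The sketch below shows one way to compute both from two lists. It is an assumption about how such a function could work, not the notebook's evaluate_model_performance: the function name is hypothetical, and treating repeated mentions as separate items is a design choice made here for illustration.

```python
from collections import Counter


def precision_and_sensitivity(true_values, predicted_values):
    """Hypothetical sketch: both lists are treated as multisets, so duplicates count."""
    true_counts = Counter(true_values)
    predicted_counts = Counter(predicted_values)
    # Multiset intersection: a predicted item is a true positive at most
    # as many times as it appears in the true values
    true_positives = sum((true_counts & predicted_counts).values())
    precision = true_positives / len(predicted_values) if predicted_values else 0.0
    sensitivity = true_positives / len(true_values) if true_values else 0.0
    return precision, sensitivity


precision, sensitivity = precision_and_sensitivity(
    ["GME", "GME", "AMC", "NOK"],   # true values
    ["GME", "AMC", "USA"],          # predicted values
)
print(f"precision: {precision:.3f}, sensitivity: {sensitivity:.3f}")
```

In this toy case two of the three predictions are correct, and two of the four true mentions are recovered.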

### Method 2: String-matching Extraction (FuzzyWuzzy)
This method matches the strings of the tokenised text against the strings of company names and tickers. It requires a dictionary with tickers as keys and both company names and tickers as string values. This was built by converting the company data into two dictionaries, ticker_to_title and title_to_ticker, and merging them into a combined dictionary. Some data cleaning via the clean_titles function was also required to remove company suffixes. The combined dictionary was then preprocessed into a list of all possible values (called values) and a dictionary containing every possible value-key pair (values_to_key). This step tailors the data to the needs of the FuzzyWuzzy model and increases efficiency. In the fuzz_best_match function, FuzzyWuzzy calculates the string similarity of the input token against the values list to find the best match. The best match is then compared against a user-supplied similarity threshold, which allows the user to tune the precision of the function.

To run through a list of tokens, this function is called in a loop and each best match is appended to a list.

To evaluate the performance of this method at different thresholds, a function was created to test fuzz_best_match on reddit data with various similarity thresholds. The list of predicted values (i.e. best matches) for each threshold was compared individually against the list of true_values to ascertain time, precision, and sensitivity.
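
The best-match-plus-threshold logic can be sketched without installing FuzzyWuzzy. The sketch below swaps in the standard library's difflib.SequenceMatcher as the similarity measure, so its scores will not be identical to FuzzyWuzzy's token_sort_ratio. The function names and the toy values_to_key dictionary are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a FuzzyWuzzy scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_match_sketch(token, values_to_key, threshold=90):
    """Return the ticker whose value is most similar to `token`, or None."""
    best_value = max(values_to_key, key=lambda value: similarity(token, value))
    if similarity(token, best_value) >= threshold:
        return values_to_key[best_value]
    return None


# Toy look-up: every searchable string (ticker or cleaned title) maps to a ticker
values_to_key = {"gme": "GME", "gamestop": "GME", "amc": "AMC", "nokia": "NOK"}

tokens = ["gamestp", "nokia", "moon"]
matches = [best_match_sketch(tok, values_to_key, threshold=80) for tok in tokens]
print(matches)
```

Raising the threshold rejects more near misses, such as the misspelt gamestp, which trades sensitivity for precision.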

In [12]:
# Initiate fuzz_model class instance
fuzz_model = FuzzModel()

# Create 'fuzz' look-up dictionaries from the company data file
title_to_ticker, ticker_to_title, combined_dict = fuzz_model.create_fuzz_dicts(company_data)

# Preprocess these dictionaries to tailor them to the process.extractOne(token, values, scorer=fuzz.token_sort_ratio) fuzz model.
values_to_key, values = fuzz_model.fuzz_preprocess(combined_dict)

# Preprocess the text data into tokens
word_tokens = ModelUtils.preprocess_text(test_data)

# Run fuzz optimum threshold analysis against reddit data to observe time, precision, and sensitivity at varying thresholds
fuzz_result = fuzz_model.fuzz_optimum_threshold(word_tokens, values_to_key, values, true_values_list, thresholds = np.arange(70, 101, 5) )

In [13]:
fuzz_result

['Threshold: 70, precision: 0.157, sensitivity: 0.776, time taken to match: 0.5981s',
 'Threshold: 75, precision: 0.180, sensitivity: 0.776, time taken to match: 0.5612s',
 'Threshold: 80, precision: 0.250, sensitivity: 0.776, time taken to match: 0.5520s',
 'Threshold: 85, precision: 0.337, sensitivity: 0.776, time taken to match: 0.6092s',
 'Threshold: 90, precision: 0.611, sensitivity: 0.763, time taken to match: 0.4850s',
 'Threshold: 95, precision: 0.659, sensitivity: 0.763, time taken to match: 0.5166s',
 'Threshold: 100, precision: 0.659, sensitivity: 0.763, time taken to match: 0.5941s']

### Method 3: Word Vectorisation Extraction (GloVe Model)
This method uses word vectorisation with the GloVe (Global Vectors for Word Representation) model to extract company titles and tickers from text data. GloVe was developed at Stanford by Jeffrey Pennington, Richard Socher, and Christopher Manning and published in 2014. It represents words as vectors that can capture semantic relationships.

The GloVe model is first loaded from a pre-trained file and converted to a dictionary where words are keys and their corresponding vectors are values. 

Three dictionaries are created so that vectors can be looked up as values, with the key representing the match. The vectorised dictionaries are created by splitting the text into words and retrieving the vector for each word from the GloVe model. For companies whose names comprise multiple words, the average of the word vectors is computed so that the entire name is represented by a single vector. The three dictionaries are:

- ticker_to_vector: maps tickers to their vector representations.
- title_to_vector: maps company titles to their vector representations.
- title_lookup: maps tickers to their corresponding company titles.

From these dictionaries the vectors for both tickers and titles are combined into a single dictionary called combined_vec_dict. Each ticker maps to a tuple containing its own vector and the vector of its corresponding company title. 

To find a vector match, the glove_best_match function calculates the cosine similarity between the vector of a tokenized word and the vectors in the combined_vec_dict. The ticker with the highest similarity above a specified threshold is selected as the best match.

To determine the best threshold for matching, the glove_optimum_threshold function tests various thresholds and evaluates performance based on precision and sensitivity. It compares the predicted tickers with true tickers, recording the time taken for matching, precision, and sensitivity for each threshold.
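
The vector averaging and cosine-similarity matching described above can be sketched with toy data. This is a minimal sketch and not the GloveModel implementation: the function names are hypothetical, and the three-dimensional vectors are made up for illustration (the notebook loads 50-dimensional GloVe vectors).

```python
import math


def average_vector(words, word_vectors):
    """Average the vectors of the words found in the model; None if none are found."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_vector_match(token_vector, combined_vec_dict, threshold=0.9):
    """Return the ticker whose ticker or title vector is most similar, or None."""
    best_ticker, best_score = None, threshold
    for ticker, (ticker_vec, title_vec) in combined_vec_dict.items():
        for candidate in (ticker_vec, title_vec):
            if candidate is None:
                continue
            score = cosine_similarity(token_vector, candidate)
            if score >= best_score:
                best_ticker, best_score = ticker, score
    return best_ticker


# Made-up 3-d vectors standing in for a pre-trained model
word_vectors = {
    "bank": [1.0, 0.0, 0.0], "america": [0.0, 1.0, 0.0], "tesla": [0.0, 0.0, 1.0],
    "bac": [0.9, 0.1, 0.0], "tsla": [0.0, 0.1, 0.9],
}

# Each ticker maps to (ticker vector, averaged title vector)
combined_vec_dict = {
    "BAC": (word_vectors["bac"], average_vector(["bank", "of", "america"], word_vectors)),
    "TSLA": (word_vectors["tsla"], average_vector(["tesla"], word_vectors)),
}

print(best_vector_match(word_vectors["tesla"], combined_vec_dict, threshold=0.9))
```

Words missing from the model, such as "of" in this toy vocabulary, are skipped when averaging. That is one possible way to handle out-of-vocabulary words and is an assumption here.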

In [24]:
#Initiate GloVemodel instance
glove = GloveModel()

#define pathway to load pre-trained vector model
glove_filepath = ('/Users/fraserlevick/Documents/python_code/MScF_sem2_code/meme_stock_app_other/glove/glove.6B.50d.txt')

#load the glove model into a variable
glove_model = glove.load_glove_model(glove_filepath)

#Create vector look-up dictionaries from company data and glove model
ticker_to_vector, title_to_vector, title_lookup = glove.create_vector_dicts(company_data, glove_model, vector_size=50)

#create combined vector dictionary 
combined_vec_dict = glove.merge_vector_dicts(ticker_to_vector, title_to_vector, title_lookup)

# Run glove optimum threshold analysis against reddit data to observe time, precision, and sensitivity at varying thresholds
glove_result = glove.glove_optimum_threshold(word_tokens, true_values_list, combined_vec_dict, thresholds = np.arange(.699, .999, 0.05))

In [25]:
glove_result

['Threshold: 0.699, precision: 0.170, sensitivity: 0.763, time taken to match: 0.2790s',
 'Threshold: 0.749, precision: 0.190, sensitivity: 0.763, time taken to match: 0.2206s',
 'Threshold: 0.799, precision: 0.225, sensitivity: 0.763, time taken to match: 0.1832s',
 'Threshold: 0.8490000000000001, precision: 0.301, sensitivity: 0.763, time taken to match: 0.1807s',
 'Threshold: 0.8990000000000001, precision: 0.411, sensitivity: 0.763, time taken to match: 0.1893s',
 'Threshold: 0.9490000000000002, precision: 0.582, sensitivity: 0.750, time taken to match: 0.1469s',
 'Threshold: 0.9990000000000002, precision: 0.667, sensitivity: 0.737, time taken to match: 0.1278s']

## 5. Extract Tickers from WSB Data

In [17]:
#run the extraction model on the tokenised words
submission_data['tickers'] = submission_data['tickers'].apply(lambda tokens: RegexExtraction.extract_tickers(tokens, ticker_list, time_method = False))

#Get top x tickers from each submission
submission_data['tickers'] = submission_data['tickers'].apply(lambda tokens: TickerFrequencyProcessor.top_tickers(tokens,5))

# create ticker df for streamlit graph
ticker_dataframe = TickerFrequencyProcessor.process_ticker_frequencies(submission_data)

# Export ticker_dataframe as csv to save memory and increase app efficiency
ticker_file_path = 'ticker_frequencies.csv'
ticker_dataframe.to_csv(ticker_file_path, index=False)
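
The two aggregation steps in the cell above can be sketched with collections.Counter. This is a guess at the general idea, not the TickerFrequencyProcessor implementation: both function names are hypothetical, and counting each ticker once per submission is an assumption.

```python
from collections import Counter


def top_tickers_sketch(tickers, n):
    """Keep the n most frequently mentioned tickers of one submission."""
    return [ticker for ticker, _ in Counter(tickers).most_common(n)]


def ticker_frequencies_sketch(tickers_per_submission):
    """Count how many submissions mention each ticker."""
    counts = Counter()
    for tickers in tickers_per_submission:
        counts.update(set(tickers))  # each ticker counts once per submission
    return counts


print(top_tickers_sketch(["GME", "AMC", "GME", "NOK", "AMC", "GME"], 2))
print(ticker_frequencies_sketch([["GME", "AMC"], ["GME"], ["NOK", "GME"]]))
```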

## 6. Display Data in Web App

In [26]:
import subprocess

# Define the command to run your Streamlit app
command = ["streamlit", "run", "web_app.py"]

# Run the command
subprocess.run(command)


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.16.14.181:8501

  For better performance, install the Watchdog module:

  $ xcode-select --install
  $ pip install watchdog
            

