### Notes
* Selenium is generally slower and more resource-intensive than other scraping methods.
* Scrapy is less resource-intensive than Selenium, so I will use Scrapy.
* Requests and BeautifulSoup alone cannot directly simulate clicking through a website.
* Requests is an HTTP library for sending requests to web servers and fetching HTML content.
* BeautifulSoup is an HTML parsing library for extracting data from HTML documents.

### Notes
* Scrapy is usually configured through its own Scrapy project structure.
* As my web scraper is part of a larger project, I need to customise Scrapy to run from a script.

In [2]:
import praw
from datetime import datetime
import pytz
import re
import pandas as pd 
from collections import Counter 
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import streamlit as st
import plotly.express as px
import yfinance as yf
from plotly import graph_objs as go
import numpy as np
from typing import List
import string
import os

In [None]:
from gensim.models import Word2Vec

In [None]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

In [None]:
modelxx = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
modelxx.save("word2vec.model")
modelxx.train([["hello", "world"]], total_examples=1, epochs=1)

Connect to Reddit

In [None]:
# Read the credentials from environment variables rather than hard-coding them in the notebook
reddit = praw.Reddit(client_id=os.getenv('CLIENT_ID'), client_secret=os.getenv('CLIENT_SECRET'), user_agent=os.getenv('USER_AGENT'))

### Notes

For a submission this would be useful:
* author
* number of comments
* upvote_ratio
* comments
* date of submission .created_utc
* maybe
  * awards? There are a lot of them and they are downloaded as a dictionary

From comments:
* author .author
* comment score (upvote) .score
* comment text .body
* date of comment .created_utc 

Notes:
1. Iterating over a submission's comments and reading comment.body only reaches top-level comments, not replies (subcomments)
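
The limitation in the note above can be worked around by flattening the comment tree. A minimal sketch, assuming a PRAW `Submission` object is passed in: `comments.replace_more(limit=0)` drops the unresolved "load more comments" placeholders and `comments.list()` returns top-level comments and replies together. The helper name `collect_comment_rows` and the output keys are illustrative and not part of the project code.

```python
def collect_comment_rows(submission):
    """Return one dict per comment on the submission, including nested replies."""
    # Remove MoreComments placeholders so every remaining item is a real comment
    submission.comments.replace_more(limit=0)
    rows = []
    # .list() flattens the comment forest (top-level comments and all replies)
    for comment in submission.comments.list():
        rows.append({
            # author is None when the account has been deleted
            "author": str(comment.author) if comment.author else None,
            "score": comment.score,
            "body": comment.body,
            "created_utc": comment.created_utc,
        })
    return rows
```

Passing `limit=0` avoids the extra API requests that expanding every placeholder would cost, which matters given the request limit.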

1. Download all submissions/posts

N.B. Iteratively appending rows to a DataFrame is more computationally intensive than a single concatenation. A better solution is to append the rows to a list and then concatenate that list with the original DataFrame in one go.
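
A minimal sketch of the list-then-concatenate pattern described above. The column names and sample rows are made up for illustration; only `pd.DataFrame` and `pd.concat` are relied on.

```python
import pandas as pd

existing = pd.DataFrame({"ticker": ["GME"], "count": [3]})

# Collect new rows as plain dicts first; appending to a Python list is cheap
new_rows = []
for ticker, count in [("AMC", 2), ("NOK", 1)]:
    new_rows.append({"ticker": ticker, "count": count})

# Build one DataFrame from the list and concatenate a single time
combined = pd.concat([existing, pd.DataFrame(new_rows)], ignore_index=True)
```

`ignore_index=True` renumbers the rows so the combined frame has a clean 0..n-1 index.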

1. Parse daily discussion posts to access the comments

* author .author
* comment score (upvote) .score
* comment text .body
* date of comment .created_utc 

In [None]:
import json

In [None]:

blacklist = ['LLC', 'SEC', 'UK', 'USA', 'US', 'CEO', 'UCLA', 'US', 'OTC', 'I', 'TO', 'AI', 'WSB', 'A', 'GPU', 'TLDR', 'GOING', 'UP', 'ARE', 'SO', 'THE', 'MOON', 'US', 'OP', 'I', 'PDF']

timezone = 'Europe/Zurich'

def get_api_limit(timezone: str):
    """
    Purpose:
        identify how many API calls left in the limit imposed by reddit (600 per 10 minutes) 

    Arguments:
        timezone: timezone as a string e.g. ('Europe/Zurich')

    Output:
        a print statement reporting when the API limit resets and how many requests remain
    """
    try:
        #connect to reddit authority limits
        limit = reddit.auth.limits
        remaining_requests = limit['remaining']
        #get datetime of when the reset will occur & translate into local time
        #(datetime.utcfromtimestamp is deprecated since Python 3.12, so build a timezone-aware datetime directly)
        reset = datetime.fromtimestamp(limit['reset_timestamp'], tz=pytz.utc)
        reset_time = reset.astimezone(pytz.timezone(timezone))
        print(f'Limit resets at {reset_time}, you have {remaining_requests} requests remaining')
        
    except pytz.UnknownTimeZoneError:
        print(f'Unknown timezone: {timezone}')


Notes on functions:
1. get_tickers: run the preprocess function first, then get the tickers
   1. have an option to make the text lowercase (True / False)

In [None]:
from get_reddit import RedditSubmissions, RedditAPIHelper
import praw
import os
from dotenv import load_dotenv

In [19]:

# Load the .env file
load_dotenv('credentials.env')

# Get the credentials from the environment variables
user_agent = os.getenv('USER_AGENT')
client_id = os.getenv('CLIENT_ID')
client_secret = os.getenv('CLIENT_SECRET')

In [None]:
# Initialize the Reddit API instance
reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

# Create an instance of RedditSubmissions with the Reddit API instance
reddit_submissions = RedditSubmissions(reddit)

In [20]:
submissions = reddit_submissions.get_submissions('wallstreetbets',sort = 'top', limit=30, time_filter='year')
submission_data = reddit_submissions.extract_submission_data(submissions)

In [None]:
# tokenise the words from each post in the dataframe
modelutils = ModelUtils()
submission_data['tickers'] = submission_data['all_text'].apply(lambda x: modelutils.preprocess_text(x))

# is the input to the model a list of strings?
#check if submission_data['ticker_freq'] is a list of string, check if inputs to other models are also strings
#type(submission_data['tickers'][0])

#load company data
company_data = modelutils.get_company_information()

In [None]:
#setup fuzz model dictionaries
fuzz_model = FuzzModel()
title_to_ticker, ticker_to_title, combined_dict = fuzz_model.create_fuzz_dicts(company_data)
values_to_key, values = fuzz_model.fuzz_preprocess(combined_dict)

#run the extraction model on the tokenised words
submission_data['tickers'] = submission_data['tickers'].apply(lambda tokens: run_fuzz_model(tokens, values_to_key, values, 95))

#Get top x tickers from each submission
submission_data['tickers'] = submission_data['tickers'].apply(lambda tokens: TickerFrequencyProcessor.top_tickers(tokens,5))

# create ticker df for streamlit graph
ticker_dataframe = TickerFrequencyProcessor.process_ticker_frequencies(submission_data)

In [None]:
glove_text = 'Is it insider trading if I bought Boeing puts while I am inside the wrecked AAPL? Purely hypothetical of google:  \nImagine sitting in an airplane when suddenly the KO door blows out.   \nNow, while everyone is screaming and Yahoo for air, you instead turn on your noise-cancelling head-phones to ignore that crying baby next to you, calmly open your robin-hood app (or whatever broker you prefer, idc), and load up on Boeing puts.   \nThere is no way the market couldve already priced that in, it is literally just happening.  \nWould that be considered insider trading? I mean you are literally inside that wreck of an airplane...  \nOn the other hand, one could argue that you are also outside the airplane, given that the door just blew off...  \n'
modelutils = ModelUtils()
processed_glove_text = modelutils.preprocess_text(glove_text)
result = run_glove_model(processed_glove_text, combined_vec_dict, 0.9)
result

In [None]:
#GloveModel = GloveModel()
#setup glove model dictionaries
filepath = 'glove/glove.6B.50d.txt'
glove_model = GloveModel.load_glove_model(filepath)
ticker_to_vector, title_to_vector, title_lookup = GloveModel.create_vector_dicts(company_data, glove_model, vector_size=50)
combined_vec_dict = GloveModel.merge_vector_dicts(ticker_to_vector, title_to_vector, title_lookup)

#run the extraction model on the tokenised words
submission_data['tickers'] = submission_data['tickers'].apply(lambda tokens: run_glove_model(tokens, combined_vec_dict, 0.9))

#Get top x tickers from each submission
submission_data['tickers'] = submission_data['tickers'].apply(lambda tokens: TickerFrequencyProcessor.top_tickers(tokens,5))

# create ticker df for streamlit graph
ticker_dataframe = TickerFrequencyProcessor.process_ticker_frequencies(submission_data)

# New ticker isolation method

In [5]:
from extraction_models import GloveModel, FuzzModel, RegexExtraction
from utils import ModelUtils, TickerFrequencyProcessor
import numpy as np



In [9]:
mini_reddit_data = 'AGI and PVG, efficient companies with lots of GME interest As you may AMC, precious metals AMC are very manipulated by GME and AGI and tesla familiar with the matter know it:'
mini_reddit_answers = ['agi', 'pvg', 'gme', 'amc', 'amc', 'gme', 'agi', 'tsla']

In [7]:
company_data = ModelUtils.get_company_information()

In [25]:
# Extract tickers from the dictionary
company_ticks = {v['ticker'] for k, v in company_data.items()}

In [12]:
import time

In [10]:
#Load reddit data
reddit_tokens = ModelUtils.preprocess_text(mini_reddit_data, lower_case = True)
reddit_tokens_norm = ModelUtils.preprocess_text(mini_reddit_data, lower_case = False)

In [1]:
from get_reddit import RedditSubmissions, RedditAPIHelper
import praw
import os
from dotenv import load_dotenv

from utils import *

from extraction_models import GloveModel, FuzzModel, RegexExtraction



In [2]:
def convert_comma_separated_string_to_list(variable):
    """
    Purpose:
        Convert a comma-separated string into a list of strings.

    Arguments:
        variable: Comma-separated string
    
    Output:
        List of strings
    """
    # Split by comma, then strip leading and trailing whitespace from each element
    list_of_strings = [item.strip() for item in variable.split(',')]
    return list_of_strings

In [3]:
# load test data from file
test_data_file = 'test_data/reddit_test_data.txt'
actual_values_file = 'test_data/reddit_data_answers.txt'
actual_tickervalues_file = 'test_data/reddit_answers_tickers.txt'

#load test files into variable
test_data = ModelUtils.load_text_files(test_data_file)
true_values = ModelUtils.load_text_files(actual_values_file)
true_values_tickers = ModelUtils.load_text_files(actual_tickervalues_file)

#convert results to list of 'tokens' for performance_evaluation function
true_values_list = convert_comma_separated_string_to_list(true_values)
true_values_tickers_list = convert_comma_separated_string_to_list(true_values_tickers)

#load company data from public dictionary to make look-up dictionaries/lists
company_data = ModelUtils.get_company_information()

In [4]:
# Create List of tickers
ticker_list = RegexExtraction.create_ticker_list(company_data)

# process the test data into tokens
tokens_upper = ModelUtils.preprocess_text(test_data, lower_case = False)

# run the RegexExtraction method and extract the tickers and time to identify each ticker
regex_tickers, regex_time = RegexExtraction.extract_tickers(tokens_upper, ticker_list)

# Evaluate the precision and sensitivity of the ticker_extraction method 
regex_precision, regex_sensitivity = ModelUtils.evaluate_model_performance(true_values_tickers_list, regex_tickers)

In [8]:
regex_precision, regex_sensitivity

(0.9444444444444444, 0.8360655737704918)
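
The implementation of `ModelUtils.evaluate_model_performance` is not shown in this notebook, so the following is only a sketch of one plausible way to compute the two numbers: treat both lists as multisets, count the mentions they share as true positives, then divide by the number of extracted tickers (precision) or by the number of hand-labelled tickers (sensitivity). Counted this way, the two lists printed in this notebook share 51 mentions, which gives 51/54 and 51/61 and agrees with the figures reported above. The function name `precision_and_sensitivity` and the sample lists are illustrative.

```python
from collections import Counter

def precision_and_sensitivity(true_tokens, predicted_tokens):
    """Multiset-based precision and sensitivity (recall) for extracted tickers."""
    true_counts = Counter(true_tokens)
    predicted_counts = Counter(predicted_tokens)
    # Counter & Counter keeps the minimum count per key: the matched mentions
    true_positives = sum((true_counts & predicted_counts).values())
    precision = true_positives / len(predicted_tokens) if predicted_tokens else 0.0
    sensitivity = true_positives / len(true_tokens) if true_tokens else 0.0
    return precision, sensitivity

precision, sensitivity = precision_and_sensitivity(
    ["GME", "GME", "AMC", "BB"],   # hand-labelled tickers
    ["GME", "AMC", "AMC", "USA"],  # extracted tickers
)
```

Using multisets rather than sets means a ticker mentioned five times but extracted only once still counts as four misses.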

In [5]:
regex_tickers

['GME',
 'GME',
 'GME',
 'AMC',
 'GME',
 'AMC',
 'NOK',
 'GME',
 'GME',
 'AMC',
 'GME',
 'GME',
 'AG',
 'SLV',
 'GME',
 'GME',
 'JPM',
 'GME',
 'SLV',
 'AG',
 'AG',
 'SLV',
 'AG',
 'NOK',
 'GO',
 'GME',
 'NOK',
 'PLTR',
 'BB',
 'GME',
 'NOK',
 'SU',
 'AMC',
 'TSLA',
 'TSLA',
 'DDS',
 'DDS',
 'GME',
 'USA',
 'GME',
 'NOK',
 'GME',
 'BB',
 'AMC',
 'GME',
 'BB',
 'AMC',
 'NOK',
 'SLV',
 'PSLV',
 'CTRM',
 'VALE',
 'ZOM',
 'AGI']

In [6]:
true_values_tickers_list

['GME',
 'GME',
 'GME',
 'AMC',
 'GME',
 'GME',
 'BB',
 'AMC',
 'NOK',
 'GME',
 'GME',
 'AMC',
 'GME',
 'GME',
 'AG',
 'SLV',
 'GME',
 'GME',
 'JPM',
 'GME',
 'SLV',
 'AG',
 'AG',
 'SLV',
 'AG',
 'NOK',
 'GME',
 'NOK',
 'PLTR',
 'BB',
 'GME',
 'NOK',
 'SU',
 'RIDE',
 'AMC',
 'TSLA',
 'TSLA',
 'DDS',
 'DDS',
 'GME',
 'GME',
 'DFV',
 'NOK',
 'GME',
 'BB',
 'AMC',
 'GME',
 'BB',
 'AMC',
 'NOK',
 'NAKD',
 'PSLV',
 'NKD',
 'CTRM',
 'VALE',
 'VALE',
 'ZOM',
 'WPG',
 'WPG',
 'AGI',
 'PVG']

In [None]:
#load company data
company_data = modelutils.get_company_information()

In [None]:
#Model 2: fuzz method evaluation
fuzz_model = FuzzModel()
title_to_ticker, ticker_to_title, combined_dict = fuzz_model.create_fuzz_dicts(company_data)
values_to_key, values = fuzz_model.fuzz_preprocess(combined_dict)
fuzz_result = fuzz_model.fuzz_optimum_threshold(reddit_tokens, values_to_key, values, mini_reddit_answers, thresholds = np.arange(70, 101, 5) )

In [None]:
fuzz_result

In [None]:
#Model 3: GloVe method evaluation
filepath = 'glove/glove.6B.50d.txt'
glove_model = GloveModel.load_glove_model(filepath)
ticker_to_vector, title_to_vector, title_lookup = GloveModel.create_vector_dicts(company_data, glove_model, vector_size=50)
combined_vec_dict = GloveModel.merge_vector_dicts(ticker_to_vector, title_to_vector, title_lookup)
glove_result = GloveModel.glove_optimum_threshold(reddit_tokens, mini_reddit_answers, combined_vec_dict)


In [None]:
test_data = load_text_files('test_data.txt')
test_title_answers = load_text_files('test_company_answers.txt')
test_ticker_answers = load_text_files('ticker_answers.txt')

In [None]:
import builtins

# If the name `list` was rebound to a variable earlier in the notebook,
# point it back at the built-in list type
if list is not builtins.list:
    list = builtins.list

In [None]:
reddit_data = load_text_files('reddit_test_data.txt')
reddit_data_answers = load_text_files('reddit_data_answers.txt')
reddit_tokens = preprocess_text(reddit_data)

In [None]:
#prep fuzzy dictionaries
reddit_key_values, reddit_values = fuzz_preprocess(combined_dict)

In [None]:
mini_reddit_data = 'AGI and PVG, efficient companies with lots of GME interest As you may AMC, precious metals AMC are very manipulated by GME and AGI and tesla familiar with the matter know it:'
mini_reddit_answers = ['agi', 'pvg', 'gme', 'amc', 'amc', 'gme', 'agi', 'tsla']
m_red_tokens = preprocess_text(mini_reddit_data, lower_case = False)

In [None]:
regex_method = extract_tickers(m_red_tokens, blacklist)
reg_precision, reg_sensitivity = evaluate_model_performance(mini_reddit_answers, regex_method)
reg_precision, reg_sensitivity

In [None]:
m_red_tokens = ['agi', 'pvg', 'efficient','companies', 'lots','gme', 'interest', 'may', 'amc', 'precious', 'metals', 'amc', 'manipulated', 'gme', 'agi', 'tesla', 'familiar', 'matter','know']

In [None]:
# Create a scatter chart of ticker mentions over time
color_discrete_sequence = px.colors.qualitative.Plotly
fig = px.scatter(tickers, x='date', y='count', color='ticker',
                 title='Ticker Mention Frequency Over Time', color_discrete_sequence=color_discrete_sequence)

fig.show()

In [None]:
import subprocess

# Define the command to run your Streamlit app
command = ["streamlit", "run", "path_to_your_python_file/web_app.py"]

# Run the command
subprocess.run(command)

Select ticker & get info and trading volume

In [None]:
import json

# Define the relative path to the JSON file
file_path = 'company_tickers.json'
with open(file_path, 'r') as file:
    data = json.load(file)
cleaned_data = [{'ticker': entry["ticker"], 'company': entry['title']} for entry in data.values()]

In [None]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the necessary resources for tokenization


In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16, 12))
ticker_counts.plot(kind='line', ax=ax)
ax.set_title('Ticker Mention Frequency by Date')
ax.set_xlabel('Date')
ax.set_ylabel('Frequency')
plt.xticks(rotation=45)
plt.legend(title='Ticker')
fig.subplots_adjust(bottom=0.2, top=0.9)
fig.tight_layout()
plt.show()

In [None]:
# fig, ax = plt.subplots(figsize=(10, 6))
# cumulative_mentions.plot(kind='line', ax=ax)
# ax.set_title('Ticker Mention Frequency by Date')
# ax.set_xlabel('Date')
# ax.set_ylabel('Frequency')
# plt.xticks(rotation=45)
# plt.legend(title='Ticker')
# plt.tight_layout()
# plt.show()

Trying to figure out whether we are missing daily discussion threads
* Missing February 19, 2024: https://www.reddit.com/r/wallstreetbets/comments/1aukpdc/daily_discussion_thread_for_february_19_2024/
* Why?
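
One way to check for gaps like the one noted above is to parse the date out of each thread title and compare the result against the expected calendar. A minimal sketch, assuming the titles follow the "Daily Discussion Thread for <Month> <day>, <year>" pattern and that a thread is expected on every weekday; both are assumptions, and the helper name `missing_weekdays` is illustrative.

```python
import re
from datetime import date, datetime, timedelta

TITLE_PATTERN = re.compile(r"Daily Discussion Thread for (\w+ \d{1,2}, \d{4})")

def missing_weekdays(titles, start, end):
    """Return the weekdays in [start, end] that have no matching thread title."""
    found = set()
    for title in titles:
        match = TITLE_PATTERN.search(title)
        if match:
            found.add(datetime.strptime(match.group(1), "%B %d, %Y").date())
    missing = []
    day = start
    while day <= end:
        # weekday() < 5 keeps Monday to Friday (assumption: no daily thread at weekends)
        if day.weekday() < 5 and day not in found:
            missing.append(day)
        day += timedelta(days=1)
    return missing

titles = [
    "Daily Discussion Thread for February 16, 2024",
    "Daily Discussion Thread for February 20, 2024",
]
gaps = missing_weekdays(titles, date(2024, 2, 16), date(2024, 2, 20))
```

Here `gaps` contains only Monday, February 19, 2024, because the weekend in between is skipped.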

In [None]:


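# keep only the February threads (list_a is assumed to hold the thread titles)
february_discussions = [thread for thread in list_a if 'February' in thread]

def extract_date(thread_title):
    """Parse the date from a daily discussion thread title; return None if the title does not match."""
    match = re.search(r'Daily Discussion Thread for (\w+ \d{1,2}, \d{4})', thread_title)
    if match:
        return datetime.strptime(match.group(1), '%B %d, %Y')
    return None

# titles without a parsable date are skipped, as None cannot be compared with a datetime when sorting
sorted_threads = sorted((t for t in february_discussions if extract_date(t)), key=extract_date)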


Making sure I stay within the request limits

3 options
1. Get stock tickers by cleaning words individually
   1. Simple, but misses every mention that uses the company name instead of the ticker
   2. https://medium.com/@financial_python/how-to-get-trending-stock-tickers-from-reddit-using-praw-and-python-1fccc7f06748
2. Use NER (named entity recognition) from a lightweight NLP library
3. Use machine learning with NLP libraries - complex

Decision: do NER for now. Maybe upgrade to machine learning NLP libraries later.

Ticker extraction with Regex
* use re.compile if reusing the pattern multiple times in a program
* Issue: some strings such as 'A' and 'I' are technically tickers but are rarely meant as tickers in the text
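
A minimal sketch of the two points above: the pattern is compiled once with `re.compile` and reused, and ambiguous one- or two-letter symbols are filtered out. The pattern, the sample ticker set and the ambiguous set are illustrative assumptions, not the project's actual lists.

```python
import re

# Compiled once and reused for every post: 1 to 5 capital letters as a whole word
TICKER_PATTERN = re.compile(r"\b[A-Z]{1,5}\b")

# Hypothetical sample values; the real lists would come from the company data
KNOWN_TICKERS = {"GME", "AMC", "A", "I", "AI"}
AMBIGUOUS = {"A", "I", "AI"}

def find_tickers(text, known=KNOWN_TICKERS, ambiguous=AMBIGUOUS):
    """Return known tickers found in the text, skipping ambiguous symbols."""
    candidates = TICKER_PATTERN.findall(text)
    return [c for c in candidates if c in known and c not in ambiguous]

found = find_tickers("I think GME and $AMC are going up, A lot")
```

In this example `found` is `['GME', 'AMC']`: 'I' and 'A' match the pattern and are real symbols, but the ambiguous set removes them. The trade-off is that genuine mentions of those tickers are lost as well.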

NER Entity Extraction

In [None]:
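import spacy

# load the small English pipeline; `text` must already hold the post text to analyse
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

# collect the text of every entity labelled as an organisation
org_list = []
for entity in doc.ents:
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

# remove duplicates
org_list = list(set(org_list))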