# Sentiment Adjusted Stock Prediction
By:

Date: /2025

Social media carries a virtually endless stream of information about current world affairs. It would be useful to parse that stream automatically and extract the data relevant to deciding which investments to buy. We approach this problem by building a program that scrapes the data and derives key details from it using AI.

We believe we can get a better picture of where stocks are heading by combining two things: well-optimized stock data and a Twitter API. The first part of this report focuses on stock data collection.

# Stock Data Collection

Large, detailed stock datasets were hard to come by, so we had to come up with a strategy to build a large enough dataset that met a few acceptance criteria:

- A large array of stocks, given in this project as "stock_tickers.csv", or $S$.
- Stocks that are likely to stay relatively stable. For this I selected companies in descending order of market cap and collected as large a dataset as possible, $\max(size(S))$.
- Stocks whose data is freely available from Alpha Vantage (the API used in this project). With the set of Alpha Vantage stocks denoted $A$, the condition is $S \in A$.
- Stocks picked from the top of the Nasdaq top stock list $N$: $S_i = N_i, i = \{1,2,3,\dots,n\}$.
- The number of pulls the API allows between Wall Street's close on Friday and its open on Monday. (I don't know how to express this one formally, as the API throttling seems to happen at random.)

Provided mathematically as:

$$\max(size(S))$$

$$\text{s.t.}$$

$$S \in A$$

$$S_i = N_i, i = \{ 1,2,3,\dots,n \}$$

where $n$ is the number of entries I am able to pull over the weekend while the NYSE is closed.
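
The selection rule above can be sketched in a few lines. This is only an illustration, assuming the Nasdaq list is already sorted by market cap and that availability in $A$ can be tested one ticker at a time. The names `select_tickers`, `is_available`, `budget`, and the sample tickers are hypothetical and do not come from the project code.

```python
from typing import Callable, Iterable, List


def select_tickers(ranked: Iterable[str],
                   is_available: Callable[[str], bool],
                   budget: int) -> List[str]:
    """Walk the ranked list in order and keep the tickers that are available,
    stopping once `budget` pulls have been spent."""
    selected: List[str] = []
    pulls = 0
    for ticker in ranked:
        if pulls >= budget:
            break
        pulls += 1  # assumption: every availability check costs one API pull
        if is_available(ticker):
            selected.append(ticker)
    return selected


# Hypothetical stand-in data, not real tickers.
ranked_list = ["AAA", "BBB", "CCC", "DDD", "EEE"]
available = {"AAA", "CCC", "DDD", "EEE"}
picked = select_tickers(ranked_list, available.__contains__, budget=4)
print(picked)
```

With a budget of four pulls, the sketch checks the first four ranked tickers and keeps the three that are available, so the size of $S$ is bounded by the pull budget rather than by the length of the list.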



## $$S_i = N_i, i = \{ 1,2,3,\dots,n \}$$
This was a very simple constraint to satisfy: it only required searching the NASDAQ site for the highest market cap companies and downloading the 6000-entry CSV of stock valuations and tickers, from which I can isolate the tickers ([nasdaq.com/highest market cap](https://www.nasdaq.com/market-activity/index/spx/historical?page=1&rows_per_page=10&timeline=m1)). This data could not be used directly, because the stock market is in a constant state of flux, which calls for finer granularity (see the constraint below).
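
Isolating the tickers from the downloaded CSV might look like the sketch below. The column names `Symbol` and `Market Cap` are assumptions about the export format, and the sample rows are made up; adjust both to match the actual file.

```python
import csv
import io
from typing import List, Tuple


def top_tickers(csv_text: str, symbol_col: str = "Symbol",
                cap_col: str = "Market Cap") -> List[str]:
    """Return ticker symbols sorted by market cap, largest first."""
    usable: List[Tuple[float, str]] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            cap = float(row[cap_col])
            symbol = row[symbol_col].strip()
        except (KeyError, TypeError, ValueError, AttributeError):
            continue  # skip rows with a missing or non-numeric market cap
        usable.append((cap, symbol))
    usable.sort(key=lambda pair: pair[0], reverse=True)
    return [symbol for _, symbol in usable]


# Hypothetical sample rows standing in for the real download.
sample = """Symbol,Name,Market Cap
BBB,Example B,2500000000
AAA,Example A,9000000000
CCC,Example C,
"""
tickers = top_tickers(sample)
print(tickers)
```

Rows without a usable market cap are dropped rather than guessed at, which keeps the ranking well defined.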

## $$S \in A$$
This project relies on the stock data that can be freely obtained from the Alpha Vantage API ([alphavantage.co/documentation](https://www.alphavantage.co/documentation/)). This constraint was probed by running the ticker set from the constraint above through the API, with error handling to detect when a ticker was not in the set (after an 8 hour timeout, which was found to be long enough for the API limit to reset):

In [None]:
max_retries = 5

...

        if not time_series:
            print(f"No time series data found for {symbol}.")  # bad-data catch
            return
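
The membership check can also be sketched as a small classifier over the decoded JSON payload. The keys `Meta Data` and `Time Series (30min)` are the ones the project code checks. The keys `Error Message`, `Note`, and `Information` are assumptions about how the API signals a bad symbol or a rate limit, and the function name `classify_response` is hypothetical.

```python
from typing import Any, Dict

SERIES_KEY = "Time Series (30min)"  # key checked in the project's fetch loop


def classify_response(data: Dict[str, Any]) -> str:
    """Label a decoded payload as 'ok', 'empty', 'not_in_set', 'throttled', or 'unknown'."""
    if "Meta Data" in data:
        # Metadata with no bars corresponds to the "No time series data" case.
        return "ok" if data.get(SERIES_KEY) else "empty"
    if "Error Message" in data:  # assumed marker for an unsupported symbol
        return "not_in_set"
    if "Note" in data or "Information" in data:  # assumed rate-limit markers
        return "throttled"
    return "unknown"


print(classify_response({"Meta Data": {}, SERIES_KEY: {"t": {"1. open": "1.0"}}}))
```

Separating "not in the set" from "throttled" matters here because only the second case is worth retrying after a timeout.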

## $$\max(size(S))$$
This objective amounts to me running a few scripts periodically throughout the weekend to maximize the number of stocks that can be pulled from the API. The scripts are run with this .bat file:

In [None]:
@echo off

echo Starting up the app...

python -m compileall -q .

python main.py

pause

The code most pertinent to this report and this class is the fetch loop:

In [None]:
        for attempt in range(max_retries):
            # Try every API key in turn before counting the attempt as failed.
            for i, key in enumerate(keys):
                url = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={symbol}&interval=30min&apikey={key}'  # URL to Alpha Vantage
                response = requests.get(url)
                data = response.json()

                if "Meta Data" in data and "Time Series (30min)" in data:
                    print(f"[Key {i + 1}] Success for {symbol}.")
                    break
                else:
                    # Catch if they are limiting my rates (although I am starting to think it is by static IP).
                    print(f"[Key {i + 1}] failed or rate-limited. Trying next key...")

            else:
                # Inner loop finished without a break: every key failed.
                print(f"[Attempt {attempt + 1}] All keys failed. Waiting {retry_delay / 60} minutes...")
                if attempt == 3:  # ratchet catch to save all the data gleaned so far; overwritten if more data arrives
                    downloader = stockCSVDownloader()
                    downloader.move_to_downloads("stocks.csv")  # actual downloader

                    printer = lastStockPrinter()
                    printer.move_last_stock_to_downloads(symbol, str(self.count))
                    continue  # note: this skips the retry_delay sleep below
                elif attempt == 4:
                    time.sleep(retry_delay * 8)  # long sleep so I can leave this unattended

                time.sleep(retry_delay)
                continue
            break  # a key succeeded, so stop retrying
        else:
            print(f"Failed to fetch data for {symbol} after {max_retries} retries.")
            return

This piece of code queries the API and adds the stock data to the CSV, which is staged and downloaded once `attempt == 3` is reached, that is, after the fourth consecutive round of failures (my "soft catch," which hopefully catches the point where the API starts to throttle me). After it throttles me, the code prints the last stock that was queried; I turn on a VPN and continue the process from that ticker.
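
Resuming from the printed ticker can be sketched as a simple slice of the ticker list. The helper name `remaining_tickers` is hypothetical, and the sketch assumes the printed ticker is the one whose fetch failed, so it is included in the resumed run.

```python
from typing import List, Sequence


def remaining_tickers(tickers: Sequence[str], last_symbol: str) -> List[str]:
    """Return the tickers from `last_symbol` onward, inclusive.

    If `last_symbol` is not in the list, the full list is returned so that
    nothing is silently skipped.
    """
    try:
        start = tickers.index(last_symbol)
    except ValueError:
        return list(tickers)
    return list(tickers[start:])


# Hypothetical tickers for illustration.
todo = remaining_tickers(["AAA", "BBB", "CCC", "DDD"], "CCC")
print(todo)
```

If the printed ticker had in fact been saved before the throttle hit, starting one position later would avoid a duplicate pull.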

All `stocks_#` files are then combined and fed into a variance calculator written in Julia, given below:

In [None]:
using CSV
using DataFrames
using Statistics

df = CSV.read("top_stocks_data.csv", DataFrame)
grouped = groupby(df, :ticker)

results = DataFrame(
    ticker = String[],
    open_variance = Float64[],
    high_variance = Float64[],
    low_variance = Float64[],
    close_variance = Float64[],
    volume_variance = Float64[],
)

for g in grouped
    push!(results, (
        ticker = first(g.ticker),
        open_variance = var(g.open),
        high_variance = var(g.high),
        low_variance = var(g.low),
        close_variance = var(g.close),
        volume_variance = var(g.volume),
    ))
end

CSV.write("stock_variance.csv", results)

println("Variance data was written to stock_variance.csv.")
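
Julia's `var` computes the sample variance (denominator $n-1$) by default, which is also what Python's `statistics.variance` does. The sketch below is a Python cross-check for the `close` column only, not part of the project pipeline. The function name and sample rows are hypothetical, while the column names `ticker` and `close` follow the Julia script.

```python
import csv
import io
from collections import defaultdict
from statistics import variance
from typing import Dict, List


def close_variance_by_ticker(csv_text: str) -> Dict[str, float]:
    """Group rows by ticker and return the sample variance of `close`."""
    closes: Dict[str, List[float]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        closes[row["ticker"]].append(float(row["close"]))
    # Sample variance is undefined for a single point, so skip those tickers.
    return {t: variance(v) for t, v in closes.items() if len(v) >= 2}


# Hypothetical rows in the same column layout as top_stocks_data.csv.
sample = """ticker,open,high,low,close,volume
AAA,10,11,9,10.0,100
AAA,10,12,9,12.0,120
AAA,11,13,10,14.0,90
BBB,5,6,4,5.0,50
"""
result = close_variance_by_ticker(sample)
print(result)
```

For the sample rows, the three `AAA` closes have mean 12 and squared deviations summing to 8, giving a sample variance of $8 / (3 - 1) = 4$.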