# 00 Double Time Series LSTM Models
In this notebook, I will experiment with sequence data from both the 1-min and 1-day timeframes. Specifically, I will use the `tf.keras` functional API to feed the two sequences into one model, and I will experiment with embeddings for sectors and similar categorical features.

Just to list out a few things I want to try out:
 - Embeddings for different sectors
   - Make sure all sectors are represented properly
 - PCA, but with the proper data
 - Early stopping callback
 - Functional API for different time series
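
To make the last few bullets concrete, here is a minimal sketch of what a two-sequence model with a sector embedding and an early stopping callback could look like in the `tf.keras` functional API. Every shape, layer size, and name below (`build_double_lstm`, `minute_seq`, `day_seq`, `sector_id`, the 90/20 step counts) is a placeholder assumption for illustration, not a final architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical shapes, chosen only for illustration
N_MIN_STEPS, N_MIN_FEATS = 90, 5   # e.g. 90 one-minute bars with 5 features each
N_DAY_STEPS, N_DAY_FEATS = 20, 8   # e.g. 20 daily bars with 8 features each
N_SECTORS, EMBED_DIM = 13, 4       # 13 sector labels mapped to 4-dim vectors


def build_double_lstm():
    # One Input per data source: two sequences plus one integer sector id
    min_in = layers.Input(shape=(N_MIN_STEPS, N_MIN_FEATS), name="minute_seq")
    day_in = layers.Input(shape=(N_DAY_STEPS, N_DAY_FEATS), name="day_seq")
    sector_in = layers.Input(shape=(1,), dtype="int32", name="sector_id")

    # Each sequence gets its own LSTM encoder
    min_vec = layers.LSTM(32)(min_in)
    day_vec = layers.LSTM(16)(day_in)

    # The sector id is looked up in a learned embedding table
    sector_vec = layers.Flatten()(layers.Embedding(N_SECTORS, EMBED_DIM)(sector_in))

    # Merge the three branches and predict a single value
    merged = layers.Concatenate()([min_vec, day_vec, sector_vec])
    hidden = layers.Dense(16, activation="relu")(merged)
    out = layers.Dense(1, name="target")(hidden)

    model = Model(inputs=[min_in, day_in, sector_in], outputs=out)
    model.compile(optimizer="adam", loss="mse")
    return model


model = build_double_lstm()

# Stop training once the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
```

The callback would be passed to `model.fit(..., callbacks=[early_stop])` together with a validation split, and the inputs would be supplied as a list in the same order as `inputs=[...]` above.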

## Imports

In [7]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import heatmap
import dask
import dask.dataframe as dd
from dask import delayed
from pyarrow.parquet import ParquetFile
import pyarrow as pa
from tqdm import tqdm

import tulipy as ti

import sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.decomposition import IncrementalPCA

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras import initializers
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import register_keras_serializable
from tensorflow.keras.optimizers import SGD
import keras_tuner as kt
from keras_tuner import HyperParameters

import os
import sys
import warnings

# Data Loading and Initial Feature Engineering for LSTM Model
I will start by getting the day data and minute data in. Both of these sets of data were obtained from FRD, and both were cleaned in a separate notebook. The day data was created directly from the 1-min data, so the numbers should match perfectly.
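
The cleaning notebook itself is not reproduced here, but a check that daily bars agree with the 1-min bars could be sketched as below. The regular-session window (09:30 to 15:59) and the helper name `minute_to_day` are assumptions; the column names follow the day data shown later in this notebook.

```python
import pandas as pd


def minute_to_day(minute_df):
    """Aggregate 1-min OHLCV bars into daily bars (assumed session: 09:30-15:59)."""
    session = minute_df.between_time("09:30", "15:59")
    daily = session.resample("1D").agg(
        {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
    )
    # resample() also emits rows for days with no bars (e.g. weekends); drop them
    return daily.dropna(subset=["Open"])


# Toy 1-min bars (made-up values, not real prices)
idx = pd.to_datetime(
    ["2024-12-04 09:29", "2024-12-04 09:30", "2024-12-04 09:31",
     "2024-12-04 15:59", "2024-12-05 09:30"]
)
toy_minute = pd.DataFrame(
    {
        "Open": [1.0, 10.0, 11.0, 12.0, 20.0],
        "High": [1.0, 10.5, 11.5, 12.5, 20.5],
        "Low": [1.0, 9.5, 10.5, 11.5, 19.5],
        "Close": [1.0, 10.2, 11.2, 12.2, 20.2],
        "Volume": [5, 100, 200, 300, 400],
    },
    index=idx,
)

toy_daily = minute_to_day(toy_minute)
```

Comparing `toy_daily` against the stored day file for the same ticker (for example with `pd.testing.assert_frame_equal` on the shared columns) would confirm whether the two sources match.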

Here are a few caveats about the data loading:
1. I might only choose certain tickers, perhaps starting with the top 1,000 with the highest prices (or by market cap). To do that, I'll need to obtain a list of tickers with their market caps, that also matches the data that I have.
2. I will need to obtain a list of stocks with their respective sectors. Specifically, I want to reduce the number of stocks in the "Other" category as much as possible, and I want each of them to be well-defined.
   - There is the notion of trying to use the correlation between stocks (like I did with the MIDS 205 course) as potentially another predictive feature, but that's currently beyond the scope of what I'm trying to do here.
   - I will, however, try the embeddings strategy and see if they can learn anything about each other.
   - One more thing is that I could also try learning embeddings **from the tickers themselves** rather than just their sectors. That might actually provide even more support for certain tickers moving together, rather than just the correlation like I was using previously.
     - *This might only be possible/feasible when I learn about transformers, so perhaps not yet*
3. For every day for my chosen tickers, I will only fetch data from 8:00am to 9:35am, primarily because I believe the 1.5 pre-market (PM) hours before the 9:30am open will be the most helpful in defining what will happen to the stock. If I'm starting with trying to predict the Close price of a stock from its Open price, I will likely only use data up to and including 9:28am as my training data.
   - The reason I'm also thinking of fetching the first 5 minutes after the Open is that those initial trading minutes are the most volatile and often the most predictive of how a stock might perform. While I won't be using those minutes if I'm trying to predict the Close, I might want to test opening-range breakout strategies later on. If so, I might try some shorter-term predictions, perhaps seeing if I ever hit a certain R-level within XX minutes of entering at the 5-minute mark (or 15-min, 30-min, etc.).
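
The time window in caveat 3 can be sketched with `DataFrame.between_time`, which keeps rows whose time of day falls inside a window (both endpoints included by default). This assumes the 1-min data has a `DatetimeIndex`; the helper name `premarket_window` and the toy frame are hypothetical.

```python
import pandas as pd


def premarket_window(minute_df, start="08:00", end="09:35"):
    """Keep only the bars between start and end (inclusive) on every day."""
    return minute_df.between_time(start, end)


# Toy 1-min index covering 07:58 through 09:40 on one day
idx = pd.date_range("2024-12-04 07:58", "2024-12-04 09:40", freq="1min")
toy = pd.DataFrame({"Close": range(len(idx))}, index=idx)

# Full fetched window: 08:00 through 09:35
window = premarket_window(toy)

# Training slice for the Open-to-Close task: 08:00 through 09:28
train_part = window.between_time("08:00", "09:28")
```

Fetching the wider window once and slicing it per experiment keeps the opening-range minutes available without re-reading the files.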

## Getting the tickers and their respective GICS sector, industry group, industry, and sub-industry
To actually get the cleaned file of the tickers and their respective GICS hierarchical classification, I had to use 2 different sources:
 - https://stockanalysis.com/stocks/
 - https://www.msci.com/documents/1296102/29559863/GICS_structure_and_definitions_effective_close_of_March_17_2023.xlsx/e47b8086-56fd-c9d2-196f-c2054b24b1d4?t=1670964718735

...and manually merge them together (since the names of the industries/sub-industries were very different). As of now, there are still some tickers listed as "Other" or "Shell Companies", neither of which is one of the 11 GICS sectors. However, I'm going to assume that's ok, and hopefully the model will learn either that these aren't important or that they all trend in similar directions.

Both the Excel file used to manually merge the sources and the csv file containing just the tickers, the 4 GICS classifications, and their market cap (as of 4/16/25) are stored on the SSD.

In [13]:
# Getting the csv of the tickers, their GICS classes, and the market caps
tick_hier_mktcap_path = '/Volumes/T7/Tickers-and-Sector-Hierarchy/Tickers-GICS-4-Mkt-Cap.csv'
tick_hier_mktcap_df = pd.read_csv(tick_hier_mktcap_path)

tick_hier_mktcap_df.head()

,Ticker,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap
0,A,"Agilent Technologies, Inc.",Diagnostics & Research,Health Care Equipment,Health Care Equipment & Supplies,Health Care Equipment & Services,Health Care,29.29B
1,AA,Alcoa Corporation,Aluminum,Aluminum,Metals & Mining,Materials,Materials,6.50B
2,AACB,Artius II Acquisition Inc.,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,269.85M
3,AACG,ATA Creativity Global,Education & Training Services,Education Services,Diversified Consumer Services,Consumer Services,Consumer Discretionary,30.25M
4,AACT,Ares Acquisition Corporation II,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,700.00M


In [15]:
# Converting market cap to numeric
def convert_to_number(value):
    if pd.isna(value):
        return None
    value = str(value).replace(',', '').strip()

    if value.endswith('T'):
        return float(value[:-1]) * 1e12
    elif value.endswith('B'):
        return float(value[:-1]) * 1e9
    elif value.endswith('M'):
        return float(value[:-1]) * 1e6
    elif value.endswith('K'):
        return float(value[:-1]) * 1e3
    else:
        try:
            return float(value)
        except ValueError:
            return None

# Applying it on the market cap
tick_hier_mktcap_df['Market_Cap'] = tick_hier_mktcap_df['Market_Cap'].apply(convert_to_number)

tick_hier_mktcap_df

,Ticker,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap
0,A,"Agilent Technologies, Inc.",Diagnostics & Research,Health Care Equipment,Health Care Equipment & Supplies,Health Care Equipment & Services,Health Care,2.929000e+10
1,AA,Alcoa Corporation,Aluminum,Aluminum,Metals & Mining,Materials,Materials,6.500000e+09
2,AACB,Artius II Acquisition Inc.,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,2.698500e+08
3,AACG,ATA Creativity Global,Education & Training Services,Education Services,Diversified Consumer Services,Consumer Services,Consumer Discretionary,3.025000e+07
4,AACT,Ares Acquisition Corporation II,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,7.000000e+08
...,...,...,...,...,...,...,...,...
5476,ZVSA,"ZyVersa Therapeutics, Inc.",Biotechnology,Biotechnology,Biotechnology,"Pharmaceuticals, Biotechnology & Life Sciences",Health Care,1.850000e+06
5477,ZWS,Zurn Elkay Water Solutions Corporation,Pollution & Treatment Controls,Diversified Support Services,Commercial Services & Supplies,Commercial & Professional Services,Industrials,5.000000e+09
5478,ZYBT,Zhengye Biotechnology Holding Limited,Drug Manufacturers - Specialty & Generic,Drug Retail,Consumer Staples Distribution & Retail,Consumer Staples Distribution & Retail,Consumer Staples,3.597000e+08
5479,ZYME,Zymeworks Inc.,Biotechnology,Biotechnology,Biotechnology,"Pharmaceuticals, Biotechnology & Life Sciences",Health Care,7.837800e+08


We can see from the above that we have a clean list of tickers, along with the name of their respective company, their original sub-industry, their manually corrected sub-industry, their industry, their industry group, their sector, and their market cap. We'll use the market cap to choose our tickers in the next step, but one thing we need to consider is which of the 4 GICS hierarchy levels to use. It may be wise to only use the Sector, but it also could be valuable to focus on some of the more granular classifications.

Personally, since I want to use embeddings anyway, I may choose to use the finest-grained classification (sub-industry), but let's see how many unique classes each hierarchy level has:

In [17]:
# Getting the number of unique classes for each GICS hierarchy level
print("Number of unique classes for the GICS hierarchy level of:")
print(" - Sector:", tick_hier_mktcap_df['Sector'].nunique())
print(" - Industry Group:", tick_hier_mktcap_df['Industry_Group'].nunique())
print(" - Industry:", tick_hier_mktcap_df['Industry'].nunique())
print(" - Sub-Industry:", tick_hier_mktcap_df['Sub_Industry'].nunique())

Number of unique classes for the GICS hierarchy level of:
 - Sector: 13
 - Industry Group: 27
 - Industry: 69
 - Sub-Industry: 126


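The 13 sectors would be accounted for by the 11 GICS sectors plus the two non-GICS labels noted earlier ("Other" and "Shell Companies"). The number of classes *roughly* doubles from each hierarchy level to the next (13 → 27 → 69 → 126), which is good to know as we move on to selecting the tickers.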
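
Whichever level ends up feeding an embedding, the string labels first need to become integer ids. A minimal sketch with `pd.factorize` follows; the helper name `encode_level` and the `_id` column suffix are hypothetical, and the toy rows reuse the first five tickers shown above.

```python
import pandas as pd


def encode_level(df, column):
    """Map the string labels in `column` to integer ids, sorted alphabetically."""
    # factorize returns (codes, uniques); missing labels would get code -1
    codes, classes = pd.factorize(df[column], sort=True)
    return pd.Series(codes, index=df.index, name=column + "_id"), list(classes)


toy = pd.DataFrame(
    {
        "Ticker": ["A", "AA", "AACB", "AACG", "AACT"],
        "Sector": [
            "Health Care",
            "Materials",
            "Shell Companies",
            "Consumer Discretionary",
            "Shell Companies",
        ],
    }
)

toy["Sector_id"], sector_classes = encode_level(toy, "Sector")

# The embedding table needs one row per class
n_sector_classes = len(sector_classes)
```

The class list should be fitted once on the full ticker table and saved, so that the same label always maps to the same id at training and prediction time.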

## Selecting top 1000 tickers based on market cap (that also exist in my 1-min data)
Now that we have our industries for a general set of tickers, let's make sure that the tickers in our FRD dataset are represented within this group. If any FRD tickers are missing from the table above, we'll likely just remove them. (The other idea would be to simply label their hierarchy levels as "Other", but we'd still be missing the market cap, which will ultimately decide which tickers we keep.)
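
The trade-off between the two options can be seen in a small sketch using a left merge with `indicator=True`. The frames and values below are toy placeholders, not the real tables.

```python
import pandas as pd

# Toy stand-ins: tickers present in the price data vs. the GICS/market-cap table
data_tickers = pd.DataFrame({"Ticker": ["AAPL", "ADX", "MSFT"]})
gics = pd.DataFrame(
    {
        "Ticker": ["AAPL", "MSFT"],
        "Sector": ["Consumer Discretionary", "Information Technology"],
        "Market_Cap": [2.9e12, 2.7e12],  # placeholder values
    }
)

# Left merge keeps every data ticker; _merge flags the ones with no GICS row
merged = data_tickers.merge(gics, on="Ticker", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "Ticker"].tolist()

# Option 2: label the unmatched tickers as "Other"
merged["Sector"] = merged["Sector"].fillna("Other")

# The market cap is still missing, so these rows cannot be ranked by size
still_unranked = merged["Market_Cap"].isna().sum()
```

Because the market cap stays `NaN` for unmatched tickers, labeling them "Other" does not help with a market-cap cutoff, which is why dropping them is the simpler choice here.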

In [21]:
# Getting the paths to the SSD to the 1-min data and the day data
parquet_1min_path = '/Volumes/T7/Filtered-Cleaned-Parquet-ML'
parquet_day_path = '/Volumes/T7/Filtered-Cleaned-Parquet-1day'

# Cycling through the 1day parquet files to get the tickers
cleaned_tickers = []
for ticker_1day_parq in os.listdir(parquet_day_path):
    ticker_name = ticker_1day_parq.split('_')[0]
    cleaned_tickers.append(ticker_name)

len(cleaned_tickers)

2599

In [23]:
# Checking to see which tickers are missing from our list
hier_mktcap_tickers = list(tick_hier_mktcap_df['Ticker'])
missing_hier_mktcap_tickers = []

print("Tickers that are not in the dataset matching each stock to their GICS hierarchy levels and market cap:")
for cleaned_ticker in cleaned_tickers:
    if cleaned_ticker not in hier_mktcap_tickers:
        print(" -", cleaned_ticker)
        missing_hier_mktcap_tickers.append(cleaned_ticker)
print("Total number of missing tickers:", len(missing_hier_mktcap_tickers))

Tickers that are not in the dataset matching each stock to their GICS hierarchy levels and market cap:
 - ACCD
 - AWF
 - ALTR
 - ADX
 - AADI
 - CLM
 - CFB
 - CDXC
 - CDMO
 - B
 - BXMX
 - BTZ
 - AZPN
 - BTT
 - BMEZ
 - FBMS
 - CAF
 - EDD
 - AWP
 - ATSG
 - BSTZ
 - DM
 - FFIE
 - FFC
 - INFN
 - IIM
 - EXG
 - IGD
 - EVV
 - IGR
 - IFN
 - GHY
 - GOF
 - GLO
 - GGN
 - FPF
 - BGY
 - MPLN
 - ETY
 - ETV
 - PTVE
 - ISD
 - ETG
 - LAAC
 - KYN
 - ITCI
 - PHK
 - NVRO
 - NVG
 - NRK
 - NZF
 - NMZ
 - NAC
 - NARI
 - NAD
 - PTY
 - SUM
 - SQ
 - RNP
 - BCOV
 - HTLF
 - SASR
 - PML
 - SABA
 - RVNC
 - WIMI
 - RVT
 - RQI
 - ROIC
 - QRTEA
 - RA
 - ZUO
 - UTG
 - ENLC
 - YY
 - VOXX
 - TPX
 - BBN
 - NKLA
 - SMAR
 - PDI
 - EMD
 - EIM
 - BALY
 - NFJ
 - NEP
 - SILV
 - NEA
 - EFR
Total number of missing tickers: 89


We see that there are 89 tickers missing from our hierarchy/mktcap list, but quickly going through them suggests that most of these are quite small and not well known. Therefore, it's likely safe to exclude them, and we'll move forward by getting only the top 1,000 tickers in terms of market cap.

In [25]:
# Getting a new dataframe of only the top 1,000 tickers sorted by market cap
filtered_tickers = tick_hier_mktcap_df[tick_hier_mktcap_df['Ticker'].isin(cleaned_tickers)].sort_values(by='Market_Cap', ascending=False)
filtered_tickers = filtered_tickers[:1000].reset_index(drop=True)

filtered_tickers

,Ticker,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap
0,AAPL,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2.918340e+12
1,MSFT,Microsoft Corporation,Software - Infrastructure,Internet Services & Infrastructure,IT Services,Software & Services,Information Technology,2.762540e+12
2,NVDA,NVIDIA Corporation,Semiconductors,Semiconductors,Semiconductors & Semiconductor Equipment,Semiconductors & Semiconductor Equipment,Information Technology,2.549560e+12
3,GOOG,Alphabet Inc.,Internet Content & Information,Internet & Direct Marketing Retail,Internet & Direct Marketing Retail,Consumer Discretionary Distribution & Retail,Consumer Discretionary,1.918450e+12
4,GOOGL,Alphabet Inc.,Internet Content & Information,Internet & Direct Marketing Retail,Internet & Direct Marketing Retail,Consumer Discretionary Distribution & Retail,Consumer Discretionary,1.869210e+12
...,...,...,...,...,...,...,...,...
995,LNC,Lincoln National Corporation,Insurance - Life,Life & Health Insurance,Insurance,Insurance,Financials,5.130000e+09
996,QFIN,"Qifu Technology, Inc.",Credit Services,Specialized Finance,Financial Services,Financial Services,Financials,5.130000e+09
997,CADE,Cadence Bank,Banks - Regional,Regional Banks,Banks,Banks,Financials,5.120000e+09
998,GGB,Gerdau S.A.,Steel,Steel,Metals & Mining,Materials,Materials,5.110000e+09


## Getting the minute and day data from the chosen tickers
**Note**: I will start by just getting the day data to 1) test my new MacBook Pro, and 2) make sure I can combine the parquet files properly.

In [74]:
# Loading the day data of the 1000 tickers in filtered_tickers into one df
ticker_dfs = []
for ticker in tqdm(filtered_tickers['Ticker']):

    # Reading in this ticker's parquet file
    ticker_file_name = ticker + '_1day.parquet'
    ticker_path = os.path.join(parquet_day_path, ticker_file_name)
    ticker_dfs.append(pd.read_parquet(ticker_path))

# Concatenating once at the end avoids re-copying the growing df on every iteration
day_df = pd.concat(ticker_dfs)

len(day_df)

100%|██████████████████████████████████████| 1000/1000 [00:08<00:00, 121.81it/s]


1055000

In [42]:
# Loading AAPL and MSFT separately to inspect how pd.concat stacks two tickers
AAPL_path = '/Volumes/T7/Filtered-Cleaned-Parquet-1day/AAPL_1day.parquet'
MSFT_path = '/Volumes/T7/Filtered-Cleaned-Parquet-1day/MSFT_1day.parquet'
AAPL_df = pd.read_parquet(AAPL_path)
MSFT_df = pd.read_parquet(MSFT_path)

In [70]:
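# Scratch check (run before the loading loop): a DataFrame created with only
# column names, pd.DataFrame(columns=...), reports .empty as True
if day_df.empty: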
    print('yup')

yup


In [46]:
# Viewing the rows around the boundary between the AAPL and MSFT data
pd.concat([AAPL_df, MSFT_df])[1050:1060]

Datetime,Open,High,Low,Close,Volume,PM_High,PM_Low,PM_Volume,Ticker
2024-12-04,242.87,244.11,241.25,243.02,20217543.0,243.69,242.044,466815.0,AAPL
2024-12-05,243.99,244.54,242.13,243.08,18743262.0,244.2767,242.1,206871.0,AAPL
2024-12-06,242.905,244.63,242.08,242.86,19207219.0,243.23,242.41,169685.0,AAPL
2024-12-09,241.9,247.24,241.75,246.82,24141189.0,243.0,241.57,193561.0,AAPL
2024-12-10,246.91,248.21,245.34,247.8,19346976.0,248.0,246.134,286748.0,AAPL
2020-10-01,213.26,213.99,211.32,212.46,16694602.0,213.65,209.81,234387.0,MSFT
2020-10-02,208.0,210.99,205.54,206.18,22562045.0,209.49,206.69,457396.0,MSFT
2020-10-05,207.21,210.41,206.98,210.37,13910585.0,208.44,205.68,158613.0,MSFT
2020-10-06,208.83,210.18,204.82,205.91,20132019.0,209.73,208.5,164718.0,MSFT
2020-10-07,207.06,210.11,206.72,209.81,16113582.0,208.1,206.68,227525.0,MSFT
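
The slice above straddles the point where the AAPL rows end and the MSFT rows begin, so the `Datetime` index restarts and dates repeat across tickers in the stacked frame. A quick sanity check on the combined data could look like the sketch below; the helper name `check_stacked` and the toy values are assumptions.

```python
import pandas as pd


def check_stacked(day_df):
    """Return rows per ticker and whether each (Datetime, Ticker) pair is unique."""
    rows_per_ticker = day_df.groupby("Ticker").size()
    # Dates alone repeat across tickers, so test uniqueness on the pair
    keys = day_df.set_index("Ticker", append=True).index
    return rows_per_ticker, keys.is_unique


# Toy frames sharing the same dates (made-up values, not real prices)
dates = pd.to_datetime(["2024-12-09", "2024-12-10"])
toy_a = pd.DataFrame({"Close": [1.0, 2.0], "Ticker": "AAPL"}, index=dates)
toy_m = pd.DataFrame({"Close": [3.0, 4.0], "Ticker": "MSFT"}, index=dates)
toy_a.index.name = toy_m.index.name = "Datetime"

combined = pd.concat([toy_a, toy_m])
rows_per_ticker, pairs_unique = check_stacked(combined)
```

On the real data, the 1,055,000 rows across 1,000 tickers work out to 1,055 rows per ticker on average, which matches AAPL's rows ending at position 1,055 in the slice above; `rows_per_ticker` would show whether every ticker really has that count.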
