# 01 Day Time Series LSTM Models
In this notebook, I will experiment with sequence data from just the 1-day timeframe. Specifically, I will try out the tf.keras functional API and test learned embeddings for categorical features such as sectors.

Just to list out a few things I want to try out here:
 - Embeddings for different sectors
   - Make sure all sectors are represented properly
 - PCA, but with the proper data
 - Early stopping callback

## Imports

In [2]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import heatmap
import dask
import dask.dataframe as dd
from dask import delayed
from pyarrow.parquet import ParquetFile
import pyarrow as pa
from tqdm import tqdm

import tulipy as ti

import sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.decomposition import IncrementalPCA

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras import initializers
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import register_keras_serializable
from tensorflow.keras.optimizers import SGD
import keras_tuner as kt
from keras_tuner import HyperParameters

import os
import sys
import warnings

# Data Loading and Initial Feature Engineering for LSTM Model
I will start by getting the day data and minute data in. Both of these sets of data were obtained from FRD, and both were cleaned in a separate notebook. The day data was created directly from the 1-min data, so the numbers should match perfectly.

Here are a few caveats about the data loading:
1. I might only choose certain tickers, perhaps starting with the top 1,000 with the highest prices (or by market cap). To do that, I'll need to obtain a list of tickers with their market caps, that also matches the data that I have.
2. I will need to obtain a list of stocks with their respective sectors. Specifically, I want to reduce the number of stocks in the "Other" category as much as possible, and I want each of them to be well-defined.
   - I could also use the correlation between stocks (as I did in the MIDS 205 course) as another predictive feature, but that's currently beyond the scope of what I'm trying to do here.
   - I will, however, try the embeddings strategy and see if they can learn anything about each other.
   - One more thing is that I could also try learning embeddings **from the tickers themselves** rather than just their sectors. That might actually provide even more support for certain tickers moving together, rather than just the correlation like I was using previously.
     - *This might only be possible/feasible when I learn about transformers, so perhaps not yet*
3. For every day for my chosen tickers, I will only fetch data from 8:00am to 9:35am, primarily because I believe the 1.5 hours of premarket (PM) trading before the open will be the most helpful in determining what will happen to the stock. If I'm starting with trying to predict the Close price of a stock from its Open price, I will likely only use data up to and including 9:28am as my training data.
   - The reason I'm thinking of also fetching the first 5 minutes after the Open is that those initial trading minutes are the most volatile and often the most predictive of how a stock might perform. While I won't be using those minutes if I'm trying to predict the Close, I might want to test opening-range breakout strategies later on. If so, I might try some shorter-term predictions, perhaps seeing whether I ever hit a certain R-level within XX minutes of entering at the 5-minute mark (or 15-min, 30-min, etc.).
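
The time windows described in item 3 could be selected along these lines. This is only a sketch on synthetic data: it assumes the 1-min bars carry a `DatetimeIndex` in exchange-local time, and `minute_df`, `fetch_window`, and `train_window` are names made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic 1-min bars for one session, 04:00 through 16:00 (hypothetical data)
idx = pd.date_range("2020-10-01 04:00", "2020-10-01 16:00", freq="1min")
minute_df = pd.DataFrame({"Close": np.linspace(117.0, 118.0, len(idx))}, index=idx)

# Fetch window: 8:00am through 9:35am (between_time includes both endpoints by default)
fetch_window = minute_df.between_time("08:00", "09:35")

# Training inputs for a Close-vs-Open prediction stop at the 9:28am bar
train_window = fetch_window.between_time("08:00", "09:28")

print(len(fetch_window), len(train_window))
```

Keeping the wider fetch window and slicing the training window out of it means the opening-range bars stay available for later experiments without a second read.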

## Getting the tickers and their respective GICS sector, industry group, industry, and sub-industry
To actually get the cleaned file of the tickers and their respective GICS hierarchical classification, I had to use 2 different sources:
 - https://stockanalysis.com/stocks/
 - https://www.msci.com/documents/1296102/29559863/GICS_structure_and_definitions_effective_close_of_March_17_2023.xlsx/e47b8086-56fd-c9d2-196f-c2054b24b1d4?t=1670964718735

...and manually merge them together (since the names of the industries/sub-industries were very different). As of now, there are still some tickers listed as "Other" or "Shell Companies", neither of which is one of the 11 GICS sectors. However, I'm going to assume that that's ok, and hopefully the model will learn that either these aren't important, or perhaps they all trend in similar directions.

Both the Excel file used for the manual merge and the CSV file containing the tickers, their 4 GICS classifications, and their market caps (as of 4/16/25) are stored on the SSD.

In [3]:
# Getting the csv of the tickers, their GICS classes, and the market caps
tick_hier_mktcap_path = '/Volumes/T7/Tickers-and-Sector-Hierarchy/Tickers-GICS-4-Mkt-Cap.csv'
tick_hier_mktcap_df = pd.read_csv(tick_hier_mktcap_path)

tick_hier_mktcap_df.head()

Unnamed: 0,Ticker,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap
0,A,"Agilent Technologies, Inc.",Diagnostics & Research,Health Care Equipment,Health Care Equipment & Supplies,Health Care Equipment & Services,Health Care,29.29B
1,AA,Alcoa Corporation,Aluminum,Aluminum,Metals & Mining,Materials,Materials,6.50B
2,AACB,Artius II Acquisition Inc.,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,269.85M
3,AACG,ATA Creativity Global,Education & Training Services,Education Services,Diversified Consumer Services,Consumer Services,Consumer Discretionary,30.25M
4,AACT,Ares Acquisition Corporation II,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,700.00M


In [4]:
# Converting market cap to numeric
def convert_to_number(value):
    if pd.isna(value):
        return None
    value = str(value).replace(',', '').strip()

    if value.endswith('T'):
        return float(value[:-1]) * 1e12
    elif value.endswith('B'):
        return float(value[:-1]) * 1e9
    elif value.endswith('M'):
        return float(value[:-1]) * 1e6
    elif value.endswith('K'):
        return float(value[:-1]) * 1e3
    else:
        try:
            return float(value)
        except ValueError:
            return None

# Applying it on the market cap
tick_hier_mktcap_df['Market_Cap'] = tick_hier_mktcap_df['Market_Cap'].apply(convert_to_number)

tick_hier_mktcap_df

Unnamed: 0,Ticker,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap
0,A,"Agilent Technologies, Inc.",Diagnostics & Research,Health Care Equipment,Health Care Equipment & Supplies,Health Care Equipment & Services,Health Care,2.929000e+10
1,AA,Alcoa Corporation,Aluminum,Aluminum,Metals & Mining,Materials,Materials,6.500000e+09
2,AACB,Artius II Acquisition Inc.,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,2.698500e+08
3,AACG,ATA Creativity Global,Education & Training Services,Education Services,Diversified Consumer Services,Consumer Services,Consumer Discretionary,3.025000e+07
4,AACT,Ares Acquisition Corporation II,Shell Companies,Shell Companies,Shell Companies,Shell Companies,Shell Companies,7.000000e+08
...,...,...,...,...,...,...,...,...
5476,ZVSA,"ZyVersa Therapeutics, Inc.",Biotechnology,Biotechnology,Biotechnology,"Pharmaceuticals, Biotechnology & Life Sciences",Health Care,1.850000e+06
5477,ZWS,Zurn Elkay Water Solutions Corporation,Pollution & Treatment Controls,Diversified Support Services,Commercial Services & Supplies,Commercial & Professional Services,Industrials,5.000000e+09
5478,ZYBT,Zhengye Biotechnology Holding Limited,Drug Manufacturers - Specialty & Generic,Drug Retail,Consumer Staples Distribution & Retail,Consumer Staples Distribution & Retail,Consumer Staples,3.597000e+08
5479,ZYME,Zymeworks Inc.,Biotechnology,Biotechnology,Biotechnology,"Pharmaceuticals, Biotechnology & Life Sciences",Health Care,7.837800e+08
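
For a larger table, the same conversion can be done without a row-wise `apply`. The sketch below is one possible vectorized version, not the function used above; `parse_market_cap`, `SUFFIX_MULTIPLIER`, and the sample values are illustrative only. Strings that do not match the expected pattern come back as NaN.

```python
import pandas as pd

SUFFIX_MULTIPLIER = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

def parse_market_cap(caps: pd.Series) -> pd.Series:
    """Convert strings such as '29.29B' or '269.85M' to floats (hypothetical helper)."""
    cleaned = caps.astype(str).str.replace(",", "", regex=False).str.strip()
    # Split each value into its numeric part and an optional K/M/B/T suffix
    parts = cleaned.str.extract(r"^(?P<number>-?\d+(?:\.\d+)?)(?P<suffix>[KMBT])?$")
    number = pd.to_numeric(parts["number"], errors="coerce")
    multiplier = parts["suffix"].map(SUFFIX_MULTIPLIER).fillna(1.0)
    return number * multiplier

sample = pd.Series(["29.29B", "6.50B", "269.85M", "1,234", None, "n/a"])
parsed = parse_market_cap(sample)
print(parsed)
```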


We can see from the above that we have a clean list of tickers, along with the name of their respective company, their original sub-industry, their manually corrected sub-industry, their industry, their industry group, their sector, and their market cap. We'll use the market cap to choose our tickers in the next step, but one thing we need to consider is which of the 4 GICS hierarchy levels to use. It may be wise to only use the Sector, but it also could be valuable to focus on some of the more granular classifications.

Personally, since I want to use embeddings anyway, I may choose to use the finest-grain classification (sub-industry), but let's see how many unique classes each hierarchy level has:

In [5]:
# Getting the number of unique classes for each GICS hierarchy level
print("Number of unique classes for the GICS hierarchy level of:")
print(" - Sector:", tick_hier_mktcap_df['Sector'].nunique())
print(" - Industry Group:", tick_hier_mktcap_df['Industry_Group'].nunique())
print(" - Industry:", tick_hier_mktcap_df['Industry'].nunique())
print(" - Sub-Industry:", tick_hier_mktcap_df['Sub_Industry'].nunique())

Number of unique classes for the GICS hierarchy level of:
 - Sector: 13
 - Industry Group: 27
 - Industry: 69
 - Sub-Industry: 126


The number of classes seems to *roughly* double from one hierarchy level to the next, which is good to know as we move on to selecting the tickers.
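
Whichever level is chosen, an embedding layer needs each class mapped to an integer ID. A minimal sketch with `pd.factorize` on a toy frame follows. The three rows are copied from the table above, the `_ID` column names and `vocab_sizes` are hypothetical, and reserving ID 0 for unseen classes is an assumption rather than something this notebook does.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Ticker": ["A", "AA", "AACB"],
    "Sector": ["Health Care", "Materials", "Shell Companies"],
    "Sub_Industry": ["Health Care Equipment", "Aluminum", "Shell Companies"],
})

vocab_sizes = {}
for level in ["Sector", "Sub_Industry"]:
    codes, uniques = pd.factorize(toy_df[level])
    toy_df[level + "_ID"] = codes + 1       # shift by 1 so that 0 stays free for unknown classes
    vocab_sizes[level] = len(uniques) + 1   # number of distinct IDs an embedding table would hold

print(toy_df)
print(vocab_sizes)
```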

## Selecting top 1000 tickers based on market cap (that also exist in my 1-min data)
Now that we have our industries for a general set of tickers, let's make sure that the tickers in our FRD dataset are represented within this group. If any tickers are missing from the data above, we'll likely just remove them. (The other idea would be to simply label their hierarchy levels as "Other", but we'd still be missing the market cap, which will ultimately decide which tickers we keep.)

In [6]:
# Getting the paths to the SSD to the 1-min data and the day data
parquet_1min_path = '/Volumes/T7/Filtered-Cleaned-Parquet-ML'
parquet_day_path = '/Volumes/T7/Filtered-Cleaned-Parquet-1day'

# Cycling through the 1day parquet files to get the tickers
cleaned_tickers = []
for ticker_1day_parq in os.listdir(parquet_day_path):
    ticker_name = ticker_1day_parq.split('_')[0]
    cleaned_tickers.append(ticker_name)

len(cleaned_tickers)

2599

In [7]:
# Checking to see which tickers are missing from our list
hier_mktcap_tickers = list(tick_hier_mktcap_df['Ticker'])
missing_hier_mktcap_tickers = []

print("Tickers that are not in the dataset matching each stock to their GICS hierarchy levels and market cap:")
for cleaned_ticker in cleaned_tickers:
    if cleaned_ticker not in hier_mktcap_tickers:
        print(" -", cleaned_ticker)
        missing_hier_mktcap_tickers.append(cleaned_ticker)
print("Total number of missing tickers:", len(missing_hier_mktcap_tickers))

Tickers that are not in the dataset matching each stock to their GICS hierarchy levels and market cap:
 - ACCD
 - AWF
 - ALTR
 - ADX
 - AADI
 - CLM
 - CFB
 - CDXC
 - CDMO
 - B
 - BXMX
 - BTZ
 - AZPN
 - BTT
 - BMEZ
 - FBMS
 - CAF
 - EDD
 - AWP
 - ATSG
 - BSTZ
 - DM
 - FFIE
 - FFC
 - INFN
 - IIM
 - EXG
 - IGD
 - EVV
 - IGR
 - IFN
 - GHY
 - GOF
 - GLO
 - GGN
 - FPF
 - BGY
 - MPLN
 - ETY
 - ETV
 - PTVE
 - ISD
 - ETG
 - LAAC
 - KYN
 - ITCI
 - PHK
 - NVRO
 - NVG
 - NRK
 - NZF
 - NMZ
 - NAC
 - NARI
 - NAD
 - PTY
 - SUM
 - SQ
 - RNP
 - BCOV
 - HTLF
 - SASR
 - PML
 - SABA
 - RVNC
 - WIMI
 - RVT
 - RQI
 - ROIC
 - QRTEA
 - RA
 - ZUO
 - UTG
 - ENLC
 - YY
 - VOXX
 - TPX
 - BBN
 - NKLA
 - SMAR
 - PDI
 - EMD
 - EIM
 - BALY
 - NFJ
 - NEP
 - SILV
 - NEA
 - EFR
Total number of missing tickers: 89


We see that there are 89 tickers that are missing in our hierarchy/mktcap list, but quickly going through them reveals that most of these companies are quite small and not as well-known. Therefore, it's likely safe to not include those stocks, and we'll move forward by getting only the top 1,000 tickers in terms of market cap.
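
The same check can be written as a set difference, which avoids scanning a list once per ticker. A small sketch with made-up ticker lists (the real lists come from the cells above, and the `_example` names are hypothetical):

```python
# Hypothetical stand-ins for cleaned_tickers and the 'Ticker' column of the GICS table
cleaned_example = ["AAPL", "MSFT", "SQ", "ADX"]
reference_example = ["AAPL", "MSFT", "NVDA"]

# Tickers present in the price data but absent from the GICS/market-cap table
missing_example = sorted(set(cleaned_example) - set(reference_example))
print("Total number of missing tickers:", len(missing_example))
```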

In [8]:
# Getting a new dataframe of only the top 1,000 tickers sorted by market cap
filtered_tickers = tick_hier_mktcap_df[tick_hier_mktcap_df['Ticker'].isin(cleaned_tickers)].sort_values(by='Market_Cap', ascending=False)
filtered_tickers = filtered_tickers[:1000].reset_index(drop=True)

filtered_tickers

Unnamed: 0,Ticker,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap
0,AAPL,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2.918340e+12
1,MSFT,Microsoft Corporation,Software - Infrastructure,Internet Services & Infrastructure,IT Services,Software & Services,Information Technology,2.762540e+12
2,NVDA,NVIDIA Corporation,Semiconductors,Semiconductors,Semiconductors & Semiconductor Equipment,Semiconductors & Semiconductor Equipment,Information Technology,2.549560e+12
3,GOOG,Alphabet Inc.,Internet Content & Information,Internet & Direct Marketing Retail,Internet & Direct Marketing Retail,Consumer Discretionary Distribution & Retail,Consumer Discretionary,1.918450e+12
4,GOOGL,Alphabet Inc.,Internet Content & Information,Internet & Direct Marketing Retail,Internet & Direct Marketing Retail,Consumer Discretionary Distribution & Retail,Consumer Discretionary,1.869210e+12
...,...,...,...,...,...,...,...,...
995,LNC,Lincoln National Corporation,Insurance - Life,Life & Health Insurance,Insurance,Insurance,Financials,5.130000e+09
996,QFIN,"Qifu Technology, Inc.",Credit Services,Specialized Finance,Financial Services,Financial Services,Financials,5.130000e+09
997,CADE,Cadence Bank,Banks - Regional,Regional Banks,Banks,Banks,Financials,5.120000e+09
998,GGB,Gerdau S.A.,Steel,Steel,Metals & Mining,Materials,Materials,5.110000e+09


## Testing the day data upload speed between Macs (DO NOT EDIT)
**Note 1**: I'm simply testing how quickly I can retrieve the day data using my new Macbook Pro compared to my old Macbook Pro.

**Note 2**: I did these tests BEFORE I realized that I needed to add in the PM data up till 9:27am and 9:28am, and loading in the new data will obviously take longer in general.

### Using the old laptop to load the data (DO NOT EDIT)

In [57]:
# USING THE OLD LAPTOP AND TESTING THE SPEED (Python 3 (ipykernel))
# Getting the column names from the files (using AAPL because AAPL will always be in the data) and initializing df
AAPL_path = '/Volumes/T7/Filtered-Cleaned-Parquet-1day/AAPL_1day.parquet'
column_names_day = list(pd.read_parquet(AAPL_path).columns)
day_df = pd.DataFrame(columns=column_names_day)

# Loading the day data of the 1000 tickers in filtered_tickers into one df
for ticker in tqdm(filtered_tickers['Ticker']):

    # Reading in df
    ticker_file_name = ticker + '_1day.parquet'
    ticker_path = os.path.join(parquet_day_path, ticker_file_name)
    ticker_df = pd.read_parquet(ticker_path)

    # Setting day_df if empty
    if day_df.empty:
        day_df = ticker_df.copy()
    else:
        day_df = pd.concat([day_df, ticker_df])
    
len(day_df)

100%|███████████████████████████████████████| 1000/1000 [00:12<00:00, 78.73it/s]


1055000

It took ~12 seconds overall according to tqdm.

### Using the new laptop to load the data (DO NOT EDIT)

In [81]:
# Using the NEW laptop and testing the speed (Python [conda env:base]
# Getting the column names from the files (using AAPL because AAPL will always be in the data) and initializing df
AAPL_path = '/Volumes/T7/Filtered-Cleaned-Parquet-1day/AAPL_1day.parquet'
column_names_day = list(pd.read_parquet(AAPL_path).columns)
day_df = pd.DataFrame(columns=column_names_day)

# Loading the day data of the 1000 tickers in filtered_tickers into one df
for ticker in tqdm(filtered_tickers['Ticker']):

    # Reading in df
    ticker_file_name = ticker + '_1day.parquet'
    ticker_path = os.path.join(parquet_day_path, ticker_file_name)
    ticker_df = pd.read_parquet(ticker_path)

    # Setting day_df if empty
    if day_df.empty:
        day_df = ticker_df.copy()
    else:
        day_df = pd.concat([day_df, ticker_df])
    
len(day_df)

100%|██████████████████████████████████████| 1000/1000 [00:08<00:00, 118.59it/s]


1055000

It took ~8 seconds overall according to tqdm.

Overall, comparing the old laptop's data loading speed with the new laptop's speed, the new Macbook's loading speed is about 50% faster. This is a pretty small test, so it's really not saying much, but it at least gives me confidence that this new laptop is - at the very least - just as good as my old one.

## Getting the minute and day data from the chosen tickers
**Note**: I will start by just getting the day data to make sure I can combine the parquet files properly and get it to work. Also, this is after I updated the Filtered-Cleaned-Parquet-1day files to include PM metrics up till 9:27am and 9:28am.

In [9]:
# Getting the column names from the files (using AAPL because AAPL will always be in the data) and initializing df
AAPL_path = '/Volumes/T7/Filtered-Cleaned-Parquet-1day/AAPL_1day.parquet'
column_names_day = list(pd.read_parquet(AAPL_path).columns)
day_df = pd.DataFrame(columns=column_names_day)

# Loading the day data of the 1000 tickers in filtered_tickers into one df
for ticker in tqdm(filtered_tickers['Ticker']):

    # Reading in df
    ticker_file_name = ticker + '_1day.parquet'
    ticker_path = os.path.join(parquet_day_path, ticker_file_name)
    ticker_df = pd.read_parquet(ticker_path)

    # Setting day_df if empty
    if day_df.empty:
        day_df = ticker_df.copy()
    else:
        day_df = pd.concat([day_df, ticker_df])
    
len(day_df)

100%|███████████████████████████████████████| 1000/1000 [00:12<00:00, 79.70it/s]


1055000

Overall, it took about 12 seconds to load (though I'm now using Python 3 (ipykernel) for some reason again), which is longer than before, but not too bad.
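
Calling `pd.concat` inside the loop copies the growing frame on every iteration. Collecting the per-ticker frames in a list and concatenating once is a common alternative. The sketch uses small in-memory frames instead of the parquet files, and `make_fake_ticker_df` is a hypothetical helper for illustration only.

```python
import pandas as pd

def make_fake_ticker_df(ticker: str, n_days: int = 3) -> pd.DataFrame:
    """Stand-in for pd.read_parquet(ticker_path); returns a tiny daily frame."""
    dates = pd.date_range("2020-10-01", periods=n_days, freq="B")
    return pd.DataFrame({"Close": range(n_days), "Ticker": ticker}, index=dates)

# Read everything first, then concatenate a single time
frames = [make_fake_ticker_df(t) for t in ["AAPL", "MSFT", "NVDA"]]
combined_df = pd.concat(frames)  # the original date index of each frame is kept

print(len(combined_df))
```

With this pattern there is also no need to pre-build an empty frame from the AAPL column names.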

## Feature Engineering
Now that I have the data loaded in, I can derive the features and outcome variables.

In [227]:
# Getting the rank (day number) of each ticker
day_df['Rank'] = day_df.groupby('Ticker').cumcount() + 1

# Creating a new data_df by joining the filtered_tickers df to the day_df
data_df = pd.merge(day_df.reset_index(), filtered_tickers, how='left', on='Ticker').set_index(day_df.index)

# Renaming the "index" column created by reset_index() to "Datetime" (the index itself keeps the dates)
data_df = data_df.rename(columns={'index': 'Datetime'})

data_df.head()

Unnamed: 0,Datetime,Open,High,Low,Close,Volume,PM_High,PM_Low,PM_Volume,Ticker,PM_High_0927,PM_Low_0927,PM_Close_0927,PM_Volume_0927,PM_High_0928,PM_Low_0928,PM_Close_0928,PM_Volume_0928,Rank,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap
2020-10-01,2020-10-01,117.7,117.72,115.83,116.8,92076075.0,118.1,116.51,1755406.0,AAPL,118.1,116.51,117.45,1690772.0,118.1,116.51,117.68,1726947.0,1,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0
2020-10-02,2020-10-02,112.84,115.37,112.22,113.02,116084401.0,117.08,112.56,4460967.0,AAPL,117.08,112.7,113.0,4319384.0,117.08,112.56,112.75,4389769.0,2,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0
2020-10-05,2020-10-05,113.92,116.65,113.55,116.54,84421170.0,114.71,113.3,1690948.0,AAPL,114.71,113.3,113.97,1671728.0,114.71,113.3,113.92,1685852.0,3,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0
2020-10-06,2020-10-06,115.68,116.12,112.25,113.16,132805466.0,116.79,115.23,1358666.0,AAPL,116.79,115.23,116.01,1303941.0,116.79,115.23,115.85,1331333.0,4,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0
2020-10-07,2020-10-07,114.64,115.55,114.13,115.05,78303638.0,114.75,113.134,2071097.0,AAPL,114.75,113.134,114.7,2032261.0,114.75,113.134,114.6,2042679.0,5,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0
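
Because the left join attaches one metadata row to many daily rows, `pd.merge` can be asked to verify that shape. A sketch on toy frames, assuming the ticker table has exactly one row per ticker; with `validate="many_to_one"`, pandas raises `pandas.errors.MergeError` if the right-hand keys are not unique. The `daily` and `meta` frames and the Close values are made up.

```python
import pandas as pd

daily = pd.DataFrame({
    "Ticker": ["AAPL", "AAPL", "MSFT"],
    "Close": [1.0, 2.0, 3.0],  # toy values
})
meta = pd.DataFrame({
    "Ticker": ["AAPL", "MSFT"],
    "Sector": ["Consumer Discretionary", "Information Technology"],
})

# Fails loudly if a ticker appears more than once in `meta`
merged = pd.merge(daily, meta, how="left", on="Ticker", validate="many_to_one")
print(merged)
```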


Obtaining the outcome variables (though I will likely only use Close_Open_Diff_Ternary in the end):

In [228]:
# Getting difference between close price and open price (outcome variables)
data_df["Close_Open_Diff"] = data_df["Close"] - data_df["Open"]
data_df["Close_Open_Diff_Perc"] = (data_df["Close"] - data_df["Open"]) / data_df["Open"] * 100

# Getting binary classification for whether a stock ended positive (1) or negative/zero (0)
data_df["Close_Open_Diff_Binary"] = data_df["Close_Open_Diff"] > 0

# Getting ternary classification for whether a stock ended positive (2), negative (0), or not much change (1)
close_open_conditions = [
    data_df['Close_Open_Diff_Perc'] > 0.5,       # Close price was more than 0.5% above the Open price
    (data_df['Close_Open_Diff_Perc'] >= -0.5) &  # Close price was within 0.5% of the Open price (inclusive)
        (data_df['Close_Open_Diff_Perc'] <= 0.5),
    data_df['Close_Open_Diff_Perc'] < -0.5       # Close price was more than 0.5% below the Open price
]
close_open_choices = [2, 1, 0]  # Defining choices corresponding to conditions
data_df['Close_Open_Diff_Ternary'] = np.select(close_open_conditions, close_open_choices)  # Applying conditions

data_df.head()

Unnamed: 0,Datetime,Open,High,Low,Close,Volume,PM_High,PM_Low,PM_Volume,Ticker,PM_High_0927,PM_Low_0927,PM_Close_0927,PM_Volume_0927,PM_High_0928,PM_Low_0928,PM_Close_0928,PM_Volume_0928,Rank,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap,Close_Open_Diff,Close_Open_Diff_Perc,Close_Open_Diff_Binary,Close_Open_Diff_Ternary
2020-10-01,2020-10-01,117.7,117.72,115.83,116.8,92076075.0,118.1,116.51,1755406.0,AAPL,118.1,116.51,117.45,1690772.0,118.1,116.51,117.68,1726947.0,1,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-0.9,-0.764656,False,0
2020-10-02,2020-10-02,112.84,115.37,112.22,113.02,116084401.0,117.08,112.56,4460967.0,AAPL,117.08,112.7,113.0,4319384.0,117.08,112.56,112.75,4389769.0,2,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,0.18,0.159518,True,1
2020-10-05,2020-10-05,113.92,116.65,113.55,116.54,84421170.0,114.71,113.3,1690948.0,AAPL,114.71,113.3,113.97,1671728.0,114.71,113.3,113.92,1685852.0,3,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,2.62,2.29986,True,2
2020-10-06,2020-10-06,115.68,116.12,112.25,113.16,132805466.0,116.79,115.23,1358666.0,AAPL,116.79,115.23,116.01,1303941.0,116.79,115.23,115.85,1331333.0,4,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-2.52,-2.178423,False,0
2020-10-07,2020-10-07,114.64,115.55,114.13,115.05,78303638.0,114.75,113.134,2071097.0,AAPL,114.75,113.134,114.7,2032261.0,114.75,113.134,114.6,2042679.0,5,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,0.41,0.357641,True,1
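
A quick way to see how the three conditions behave at the edges is to run them on a few hand-picked values. The sketch below wraps the same logic in a hypothetical helper, `ternary_label`. Unlike the cell above, it passes `default=-1` to `np.select` so that rows matching no condition (for example NaN) are visible instead of silently receiving 0.

```python
import numpy as np
import pandas as pd

def ternary_label(pct_change, threshold=0.5):
    """Label percent moves: 2 = up, 1 = roughly flat, 0 = down, -1 = no condition matched."""
    pct = pd.Series(pct_change, dtype="float64")
    conditions = [
        pct > threshold,         # strictly above +threshold
        pct.abs() <= threshold,  # inside the band, boundaries included
        pct < -threshold,        # strictly below -threshold
    ]
    return np.select(conditions, [2, 1, 0], default=-1)

# First three values are Close_Open_Diff_Perc from the AAPL rows above; the rest are edge cases
labels = ternary_label([-0.764656, 0.159518, 2.29986, 0.5, -0.5, np.nan])
print(labels)
```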


Getting the previous day OHLCV metrics along with the PM HLCV metrics up till 9:27am and 9:28am:

In [229]:
# Getting previous day OHLCV metrics (shifted within each ticker so that a ticker's
# first row is NaN instead of inheriting the previous ticker's last row)
data_df["PDO"] = data_df.groupby("Ticker")["Open"].shift(1)
data_df["PDH"] = data_df.groupby("Ticker")["High"].shift(1)
data_df["PDL"] = data_df.groupby("Ticker")["Low"].shift(1)
data_df["PDC"] = data_df.groupby("Ticker")["Close"].shift(1)
data_df["PDV"] = data_df.groupby("Ticker")["Volume"].shift(1)

# Getting previous day PM HLCV metrics up till 9:27am and 9:28am
data_df["PD_PM_H_0927"] = data_df.groupby("Ticker")["PM_High_0927"].shift(1)
data_df["PD_PM_L_0927"] = data_df.groupby("Ticker")["PM_Low_0927"].shift(1)
data_df["PD_PM_C_0927"] = data_df.groupby("Ticker")["PM_Close_0927"].shift(1)
data_df["PD_PM_V_0927"] = data_df.groupby("Ticker")["PM_Volume_0927"].shift(1)

data_df["PD_PM_H_0928"] = data_df.groupby("Ticker")["PM_High_0928"].shift(1)
data_df["PD_PM_L_0928"] = data_df.groupby("Ticker")["PM_Low_0928"].shift(1)
data_df["PD_PM_C_0928"] = data_df.groupby("Ticker")["PM_Close_0928"].shift(1)
data_df["PD_PM_V_0928"] = data_df.groupby("Ticker")["PM_Volume_0928"].shift(1)
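
When all tickers are stacked in one frame, a plain `Series.shift(1)` carries the last row of one ticker into the first row of the next, whereas a per-ticker shift leaves that first row as NaN. A toy comparison, with made-up tickers and prices:

```python
import pandas as pd

stacked = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB", "BBB"],
    "Close": [10.0, 11.0, 50.0, 51.0],
})

# Plain shift ignores ticker boundaries; the grouped shift respects them
stacked["PDC_plain"] = stacked["Close"].shift(1)
stacked["PDC_by_ticker"] = stacked.groupby("Ticker")["Close"].shift(1)

print(stacked)
```

In the row where BBB starts, `PDC_plain` holds AAA's last close (11.0) while `PDC_by_ticker` is NaN.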

In my original project, I obtained the gap features using the **Open** price. However, to make this more realistic, I will use the Close price at 9:28am as a substitute for the Open. The reasons for doing so are as follows:
1. The Close price at 9:28am is only 1 minute away from the Open price, and even though a lot can happen in that last minute, this will need to be good enough.
2. One minute should hopefully be enough time for me to get the updated minute data from (probably) IBKR, run inference using my ML model, and prepare the trades accordingly at the Open.

In [230]:
# Getting the gap features from 9:28 Close compared to the PDC
data_df["Gap_from_PDC"] = (data_df["PM_Close_0928"] - data_df["PDC"]) / data_df["PDC"]
gap_from_PDC_threshold = 0.02  # 2%
data_df["Gap_Binary_Threshold"] = abs(data_df["Gap_from_PDC"]) > gap_from_PDC_threshold

# Getting daily range gap up features
data_df["Daily_Range_Gap_Up"] = (data_df["PM_Close_0928"] - data_df["PDH"]) / data_df["PDH"]
data_df["Daily_Range_Gap_Up"] = data_df["Daily_Range_Gap_Up"].clip(lower = 0)
data_df["DR_Gap_Up_Binary"] = data_df["Daily_Range_Gap_Up"] > 0

# Getting daily range gap down features
data_df["Daily_Range_Gap_Down"] = (data_df["PM_Close_0928"] - data_df["PDL"]) / data_df["PDL"]
data_df["Daily_Range_Gap_Down"] = data_df["Daily_Range_Gap_Down"].clip(upper = 0)
data_df["DR_Gap_Down_Binary"] = data_df["Daily_Range_Gap_Down"] < 0

# Getting previous 14-day average PM volume up till 9:28am (i.e., using PD_PM_V_0928)
avg_vol_window = 14
data_df["Avg_PM_Vol_Prev_14D"] = data_df["PD_PM_V_0928"].rolling(window=avg_vol_window).mean()

# Comparing current PM volume to the average PM volume over the previous 14 days
data_df["PM_Vol_Change"] = (data_df["PM_Volume_0928"] - data_df["Avg_PM_Vol_Prev_14D"]) / data_df["Avg_PM_Vol_Prev_14D"]

# Setting the first 14-day PM volume averages and previous volume change amounts of every ticker to np.nan
data_df.loc[data_df["Rank"] < 15, ["Avg_PM_Vol_Prev_14D", "PM_Vol_Change"]] = np.nan

display(data_df[10:20])
display(data_df.shape)

Unnamed: 0,Datetime,Open,High,Low,Close,Volume,PM_High,PM_Low,PM_Volume,Ticker,PM_High_0927,PM_Low_0927,PM_Close_0927,PM_Volume_0927,PM_High_0928,PM_Low_0928,PM_Close_0928,PM_Volume_0928,Rank,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap,Close_Open_Diff,Close_Open_Diff_Perc,Close_Open_Diff_Binary,Close_Open_Diff_Ternary,PDO,PDH,PDL,PDC,PDV,PD_PM_H_0927,PD_PM_L_0927,PD_PM_C_0927,PD_PM_V_0927,PD_PM_H_0928,PD_PM_L_0928,PD_PM_C_0928,PD_PM_V_0928,Gap_from_PDC,Gap_Binary_Threshold,Daily_Range_Gap_Up,DR_Gap_Up_Binary,Daily_Range_Gap_Down,DR_Gap_Down_Binary,Avg_PM_Vol_Prev_14D,PM_Vol_Change
2020-10-15,2020-10-15,118.75,121.2,118.15,120.745,92741440.0,121.49,117.25,2945665.0,AAPL,121.49,117.25,118.88,2876720.0,121.49,117.25,118.8,2906007.0,11,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,1.995,1.68,True,2,120.99,123.03,119.62,121.29,128874678.0,122.57,120.6,120.94,2134894.0,122.57,120.6,120.96,2152679.0,-0.020529,True,0.0,False,-0.006855,True,,
2020-10-16,2020-10-16,121.25,121.548,118.81,118.96,92366893.0,122.11,120.65,1883037.0,AAPL,122.11,120.65,121.75,1571778.0,122.11,120.65,121.85,1685166.0,12,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-2.29,-1.88866,False,0,118.75,121.2,118.15,120.745,92741440.0,121.49,117.25,118.88,2876720.0,121.49,117.25,118.8,2906007.0,0.009152,False,0.005363,True,0.0,False,,
2020-10-19,2020-10-19,119.96,120.419,115.66,116.0,100858537.0,121.05,119.02,1966792.0,AAPL,121.05,119.02,120.01,1926163.0,121.05,119.02,119.97,1949307.0,13,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-3.96,-3.3011,False,0,121.25,121.548,118.81,118.96,92366893.0,122.11,120.65,121.75,1571778.0,122.11,120.65,121.85,1685166.0,0.00849,False,0.0,False,0.0,False,,
2020-10-20,2020-10-20,116.23,118.98,115.63,117.5,103813586.0,118.35,116.12,1674408.0,AAPL,118.35,116.21,116.24,1617201.0,118.35,116.12,116.24,1648903.0,14,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,1.27,1.092661,True,2,119.96,120.419,115.66,116.0,100858537.0,121.05,119.02,120.01,1926163.0,121.05,119.02,119.97,1949307.0,0.002069,False,0.0,False,0.0,False,,
2020-10-21,2020-10-21,116.63,118.705,116.45,116.86,73640108.0,117.94,116.5,1272131.0,AAPL,117.94,116.5,117.27,1015301.0,117.94,116.5,116.75,1201839.0,15,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,0.23,0.197205,True,1,116.23,118.98,115.63,117.5,103813586.0,118.35,116.21,116.24,1617201.0,118.35,116.12,116.24,1648903.0,-0.006383,False,0.0,False,0.0,False,2443894.0,-0.508228
2020-10-22,2020-10-22,117.41,118.04,114.59,115.774,86318174.0,117.58,116.0,1522067.0,AAPL,117.58,116.0,117.45,1487131.0,117.58,116.0,117.45,1502827.0,16,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-1.636,-1.393408,False,0,116.63,118.705,116.45,116.86,73640108.0,117.94,116.5,117.27,1015301.0,117.94,116.5,116.75,1201839.0,0.005049,False,0.0,False,0.0,False,2406386.0,-0.375484
2020-10-23,2020-10-23,116.38,116.55,114.28,115.05,68258650.0,116.65,115.88,790808.0,AAPL,116.65,115.88,116.38,759476.0,116.65,115.88,116.33,773781.0,17,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-1.33,-1.142808,False,0,117.41,118.04,114.59,115.774,86318174.0,117.58,116.0,117.45,1487131.0,117.58,116.0,117.45,1502827.0,0.004802,False,0.0,False,0.0,False,2200176.0,-0.648309
2020-10-26,2020-10-26,114.01,116.55,112.88,115.06,90955903.0,115.32,113.31,1556823.0,AAPL,115.32,113.31,114.05,1470021.0,115.32,113.31,114.0,1525760.0,18,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,1.05,0.920972,True,2,116.38,116.55,114.28,115.05,68258650.0,116.65,115.88,116.38,759476.0,116.65,115.88,116.33,773781.0,-0.009126,False,0.0,False,-0.00245,True,2135028.0,-0.285368
2020-10-27,2020-10-27,115.42,117.28,114.54,116.53,76025170.0,116.0,115.0,877134.0,AAPL,116.0,115.0,115.38,844939.0,116.0,115.0,115.5,865306.0,19,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,1.11,0.961705,True,2,114.01,116.55,112.88,115.06,90955903.0,115.32,113.31,114.05,1470021.0,115.32,113.31,114.0,1525760.0,0.003824,False,0.0,False,0.0,False,2148915.0,-0.597329
2020-10-28,2020-10-28,115.04,115.43,111.1,111.15,108210209.0,116.89,114.51,1361530.0,AAPL,116.89,114.51,114.97,1323057.0,116.89,114.51,114.99,1340196.0,20,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-3.89,-3.381433,False,0,115.42,117.28,114.54,116.53,76025170.0,116.0,115.0,115.38,844939.0,116.0,115.0,115.5,865306.0,-0.013215,False,0.0,False,0.0,False,2064817.0,-0.350937


(1055000, 51)
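
As a sanity check, the gap features for one row can be recomputed by hand. The sketch below uses the values from the AAPL row dated 2020-10-15 in the output above; the lowercase variable names are mine.

```python
# Values from the AAPL row dated 2020-10-15 in the output above
pm_close_0928 = 118.80
pdc, pdh, pdl = 121.29, 123.03, 119.62

gap_from_pdc = (pm_close_0928 - pdc) / pdc        # gap relative to the previous close
gap_up = max((pm_close_0928 - pdh) / pdh, 0.0)    # only positive when above the previous high
gap_down = min((pm_close_0928 - pdl) / pdl, 0.0)  # only negative when below the previous low
gap_exceeds_threshold = abs(gap_from_pdc) > 0.02  # same 2% threshold as above

print(round(gap_from_pdc, 6), gap_up, round(gap_down, 6), gap_exceeds_threshold)
```

The 9:28am close sits below both the previous close and the previous low, so the gap-down feature is non-zero while the gap-up feature is clipped to 0, matching the row in the table.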

Getting the features for the SMA & EMA of the past XXX periods, the real-time RSI, and the real-time ATR & NATR:

In [231]:
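# SMA and EMA for the past XXX periods (computed per ticker so that no window spans two tickers)
PDC_by_ticker = data_df.groupby("Ticker")["PDC"]
data_df["SMA_Prev_14"] = PDC_by_ticker.transform(lambda s: s.rolling(window=14).mean())
data_df["SMA_Prev_50"] = PDC_by_ticker.transform(lambda s: s.rolling(window=50).mean())
data_df["EMA_Prev_9"] = PDC_by_ticker.transform(lambda s: s.ewm(span=9, adjust=False).mean())
data_df["EMA_Prev_20"] = PDC_by_ticker.transform(lambda s: s.ewm(span=20, adjust=False).mean())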

# Defining a function to calculate the RSI_14 based on the previous 13 days' Close changes and the current day's 9:28am Close price vs. the PDC
def calculate_real_time_RSI(df, period=14):
    # Defining the pure Close change amounts
    Close_Chg_Amt = df["Close"] - df["PDC"]
    Close_Gain_Amt = Close_Chg_Amt.clip(lower = 0)
    Close_Loss_Amt = Close_Chg_Amt.clip(upper = 0)

    # Defining the change amounts from the Close at 9:28am against the PDC
    Curr_Chg_Amt_0928 = df["PM_Close_0928"] - df["PDC"]
    Curr_Gain_Amt_0928 = Curr_Chg_Amt_0928.clip(lower = 0)
    Curr_Loss_Amt_0928 = Curr_Chg_Amt_0928.clip(upper = 0)

    # Calculating avg gain and loss using the previous 13 gain/loss amts (shift(1)) and the gain/loss amt at 9:28am compared to the PDC
    Avg_Gain = (Close_Gain_Amt.shift(1).rolling(window=(period-1)).sum() + Curr_Gain_Amt_0928) / period
    Avg_Loss = (Close_Loss_Amt.shift(1).rolling(window=(period-1)).sum() + Curr_Loss_Amt_0928) / period

    # Getting relative strength (RS) and relative strength index (RSI)
    RS = -1 * Avg_Gain / Avg_Loss  # Avg_Loss is <= 0, hence the -1. If Avg_Loss is 0, RS is infinite, 100 / (1 + RS) becomes 0, and the RSI is 100 (NaN if Avg_Gain is also 0)
    RSI = 100 - (100 / (1 + RS))
    
    return RSI

# Defining a function to calculate ATR_14 using the 9:28am period
def calculate_ATR(df, period=14):
    # Averaging the High-Low ranges of the previous 13 days (a simplification: gaps against the prior close are ignored for those days)
    Prev_Range = df["PDH"] - df["PDL"]
    Prev_ATR = Prev_Range.rolling(window=(period-1)).mean()

    # Calculating True Range and ultimately getting the ATR_14
    H_sub_L = df["PM_High_0928"] - df["PM_Low_0928"]
    H_sub_Cp = abs(df["PM_High_0928"] - df["PDC"])
    L_sub_Cp = abs(df["PM_Low_0928"] - df["PDC"])
    True_Range = pd.concat([H_sub_L, H_sub_Cp, L_sub_Cp], axis=1).max(axis=1)
    ATR = ((Prev_ATR * (period - 1)) + True_Range) / period

    return ATR

# Calculating RSI_14, ATR_14, and NATR_14 (all using the 9:28am period)
data_df["RSI_14"] = calculate_real_time_RSI(data_df, 14)
data_df["ATR_14"] = calculate_ATR(data_df, 14)
data_df["NATR_14"] = (data_df["ATR_14"] / data_df["PDC"]) * 100

# Blanking the 9, 14, 20, and 50 day metrics on each ticker's first rows, where there isn't enough history yet:
data_df.loc[data_df["Rank"] <= 9, ['EMA_Prev_9']] = np.nan
data_df.loc[data_df["Rank"] <= 14, ['SMA_Prev_14', 'RSI_14', 'ATR_14', 'NATR_14']] = np.nan
data_df.loc[data_df["Rank"] <= 20, ['EMA_Prev_20']] = np.nan
data_df.loc[data_df["Rank"] <= 50, ['SMA_Prev_50']] = np.nan

display(data_df.shape)

(1055000, 58)
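
To sanity-check the real-time RSI arithmetic, here is a minimal sketch on made-up prices with a 3-period window instead of 14. The `toy_rsi` helper and both toy frames are hypothetical; the helper only restates the arithmetic of `calculate_real_time_RSI` so the cell runs on its own. The first frame should give an RSI of about 60 (average gain 0.5, average loss -1/3, so RS = 1.5), and the second shows the zero-loss edge case, where the RSI comes out as 100.

```python
import pandas as pd

def toy_rsi(df, period=3):
    # Same arithmetic as calculate_real_time_RSI, restated so this cell is self-contained
    chg = df["Close"] - df["PDC"]
    cur = df["PM_Close_0928"] - df["PDC"]
    avg_gain = (chg.clip(lower=0).shift(1).rolling(period - 1).sum() + cur.clip(lower=0)) / period
    avg_loss = (chg.clip(upper=0).shift(1).rolling(period - 1).sum() + cur.clip(upper=0)) / period
    rs = -1 * avg_gain / avg_loss
    return 100 - (100 / (1 + rs))

# Mixed case (made-up prices): the closes moved +1 then -1, and the 9:28am price is +0.5 vs. the PDC
mixed = pd.DataFrame({
    "PDC":           [10.0, 11.0, 10.0],
    "Close":         [11.0, 10.0, 10.2],
    "PM_Close_0928": [10.0, 11.0, 10.5],
})
rsi_mixed = toy_rsi(mixed).iloc[-1]

# No-loss case (made-up prices): two up closes and an up pre-market, so the average loss is 0
no_loss = pd.DataFrame({
    "PDC":           [10.0, 11.0, 12.0],
    "Close":         [11.0, 12.0, 12.5],
    "PM_Close_0928": [10.0, 11.0, 12.5],
})
rsi_no_loss = toy_rsi(no_loss).iloc[-1]

print(rsi_mixed, rsi_no_loss)
```

Only the last row of each toy frame has enough history, which is the same reason the early `Rank` rows are set to `np.nan` in the cell above.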

Obtaining the Bollinger Band features, along with a volatility indicator: the number of standard deviations between the 9:28am Close and the SMA_20:

In [232]:
# Obtaining the center, upper, and lower Bollinger Bands based on the previous Close prices and the Close price at 9:28am, along with the
# number of standard deviations away from the SMA_20 the Close price at 9:28am is
def calculate_bbands(df, period=20, std_num=2):
    # Getting the rolling sum and the rolling sum of squares of the previous 19 PDCs
    PDC_Rolling_Sum = df["PDC"].rolling(window=(period-1), min_periods=(period-1)).sum()
    PDC_Rolling_Sum_Squares = df["PDC"].pow(2).rolling(window=(period-1), min_periods=(period-1)).sum()

    # Getting the combined sums with the Close price at 9:28am
    Combined_Sum = PDC_Rolling_Sum + df["PM_Close_0928"]
    Combined_Sum_Squares = PDC_Rolling_Sum_Squares + df["PM_Close_0928"] ** 2

    # Getting the std using the formula: sqrt((Σx^2 - (Σx)^2 / n) / (n-1))
    Combined_STD = np.sqrt((Combined_Sum_Squares - (Combined_Sum ** 2) / period) / (period - 1))

    # Getting the center Bollinger Band, which is effectively the average of the previous 19 PDCs plus the Close at 9:28am (i.e., SMA_20)
    SMA_20 = Combined_Sum / period

    # Getting the upper and lower Bollinger Bands based on std_num (default 2) stds above and below the SMA_20
    Upper_BBand = SMA_20 + std_num * Combined_STD
    Lower_BBand = SMA_20 - std_num * Combined_STD

    # Getting the number of standard deviations away from the SMA_20 at 9:28am
    Num_STD_from_SMA_20 = (df["PM_Close_0928"] - SMA_20) / Combined_STD

    return SMA_20, Upper_BBand, Lower_BBand, Num_STD_from_SMA_20

# Setting the SMA_20 at 9:28am, the upper and lower Bollinger Bands, and the num stds away from the SMA_20 to the original data_df
SMA_20, Upper_BBand, Lower_BBand, Num_STD_from_SMA_20 = calculate_bbands(data_df)
data_df["SMA_20"] = SMA_20
data_df["Upper_BBand"] = Upper_BBand
data_df["Lower_BBand"] = Lower_BBand
data_df["Num_STD_from_SMA_20"] = Num_STD_from_SMA_20

display(data_df[18:22])
display(data_df.shape)

Unnamed: 0,Datetime,Open,High,Low,Close,Volume,PM_High,PM_Low,PM_Volume,Ticker,PM_High_0927,PM_Low_0927,PM_Close_0927,PM_Volume_0927,PM_High_0928,PM_Low_0928,PM_Close_0928,PM_Volume_0928,Rank,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap,Close_Open_Diff,Close_Open_Diff_Perc,Close_Open_Diff_Binary,Close_Open_Diff_Ternary,PDO,PDH,PDL,PDC,PDV,PD_PM_H_0927,PD_PM_L_0927,PD_PM_C_0927,PD_PM_V_0927,PD_PM_H_0928,PD_PM_L_0928,PD_PM_C_0928,PD_PM_V_0928,Gap_from_PDC,Gap_Binary_Threshold,Daily_Range_Gap_Up,DR_Gap_Up_Binary,Daily_Range_Gap_Down,DR_Gap_Down_Binary,Avg_PM_Vol_Prev_14D,PM_Vol_Change,SMA_Prev_14,SMA_Prev_50,EMA_Prev_9,EMA_Prev_20,RSI_14,ATR_14,NATR_14,SMA_20,Upper_BBand,Lower_BBand,Num_STD_from_SMA_20
2020-10-27,2020-10-27,115.42,117.28,114.54,116.53,76025170.0,116.0,115.0,877134.0,AAPL,116.0,115.0,115.38,844939.0,116.0,115.0,115.5,865306.0,19,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,1.11,0.961705,True,2,114.01,116.55,112.88,115.06,90955903.0,115.32,113.31,114.05,1470021.0,115.32,113.31,114.0,1525760.0,0.003824,False,0.0,False,0.0,False,2148915.0,-0.597329,117.836357,,116.662398,,50.985545,3.248429,2.823247,,,,
2020-10-28,2020-10-28,115.04,115.43,111.1,111.15,108210209.0,116.89,114.51,1361530.0,AAPL,116.89,114.51,114.97,1323057.0,116.89,114.51,114.99,1340196.0,20,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-3.89,-3.381433,False,0,115.42,117.28,114.54,116.53,76025170.0,116.0,115.0,115.38,844939.0,116.0,115.0,115.5,865306.0,-0.013215,False,0.0,False,0.0,False,2064817.0,-0.350937,117.942071,,116.635919,,50.039494,3.413429,2.929227,117.03745,122.860503,111.214397,-0.703222
2020-10-29,2020-10-29,112.37,116.93,112.2,114.52,102297860.0,113.69,111.7,2178031.0,AAPL,113.69,111.7,112.44,2153965.0,113.69,111.7,112.44,2161444.0,21,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,2.15,1.913322,True,2,115.04,115.43,111.1,111.15,108210209.0,116.89,114.51,114.97,1323057.0,116.89,114.51,114.99,1340196.0,0.011606,False,0.0,False,0.0,False,2055168.0,0.051712,117.669214,,115.538735,116.462219,42.018284,3.585571,3.225885,116.62745,123.26468,109.99022,-1.261806
2020-10-30,2020-10-30,111.03,111.99,107.72,108.9,151657835.0,111.95,108.7,2798215.0,AAPL,111.4,108.7,110.96,2733096.0,111.95,108.7,111.01,2775072.0,22,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-2.13,-1.9184,False,0,112.37,116.93,112.2,114.52,102297860.0,113.69,111.7,112.44,2153965.0,113.69,111.7,112.44,2161444.0,-0.03065,True,0.0,False,-0.010606,True,2138593.0,0.297616,117.4935,,115.334988,116.277246,24.783753,3.736571,3.262811,116.63095,123.354034,109.907866,-1.672134


(1055000, 62)
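
As a quick check on the sum-of-squares shortcut used in `calculate_bbands`, the sketch below compares it against a direct sample standard deviation on 20 made-up values. The random prices and the `pm_close` value are stand-ins I made up for the 19 previous PDCs and `PM_Close_0928`; nothing here comes from the real data.

```python
import numpy as np
import pandas as pd

period = 20
rng = np.random.default_rng(0)

# Made-up stand-ins: 19 previous closes and one 9:28am close
prev_closes = pd.Series(100 + rng.normal(0, 1, period - 1))
pm_close = 101.3

# Shortcut from calculate_bbands: sqrt((Σx^2 - (Σx)^2 / n) / (n - 1))
combined_sum = prev_closes.sum() + pm_close
combined_sum_sq = (prev_closes ** 2).sum() + pm_close ** 2
std_from_sums = np.sqrt((combined_sum_sq - combined_sum ** 2 / period) / (period - 1))
sma_20 = combined_sum / period

# Direct computation over all 20 values (ddof=1 gives the sample standard deviation)
all_values = np.append(prev_closes.to_numpy(), pm_close)
std_direct = all_values.std(ddof=1)

# Number of standard deviations between the 9:28am close and the SMA_20
num_std = (pm_close - sma_20) / std_from_sums

print(std_from_sums, std_direct, num_std)
```

The two standard deviations should agree to within floating-point noise. The shortcut subtracts two large, nearly equal numbers, so for very high-priced tickers with a tiny spread it may be worth comparing against the direct version on a few rows.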

Obtaining the On Balance Volume and Stochastic Oscillator features:

In [237]:
# Signing the previous day's volume (PDV) by the direction of the previous day's close-to-close change
data_df["Close_Chg_Amt"] = data_df["Close"] - data_df["PDC"]
data_df["Prev_Pos_Neg_Vol"] = np.where(data_df["Close_Chg_Amt"].shift(1) > 0, data_df["PDV"],
                                       np.where(data_df["Close_Chg_Amt"].shift(1) < 0, data_df["PDV"] * -1, 0))
data_df.loc[data_df["Rank"] <= 2, ["Prev_Pos_Neg_Vol"]] = 0.0  # Setting Prev_Pos_Neg_Vol for first two days to 0 to calibrate properly

# Getting the previous On Balance Volume (OBV) by ticker to try predicting larger trends in price movements
data_df["Prev_OBV"] = data_df.groupby("Ticker")["Prev_Pos_Neg_Vol"].cumsum()

# Defining a function to calculate the real-time %K and %D in the stochastic oscillator formula (assuming k_period = 14 and d_period = 3)
def calculate_stochastic_oscillator(df, k_period=14, d_period=3):
    """
    Note: To account for the real-time data we get at 9:28am, I had to do the calculations a bit differently. To quickly explain:
      1. I calculate the %K based on the High and Low prices of the 13 previous days along with the High and Low prices of the current
      day up till 9:28am. This effectively gives me my 14 periods, and my "current closing price" is set to the Close price at 9:28am.
          - E.g., If I'm on Day 14, I use the H/L of days 1-13 along with H/L of Day 14 up till 9:28am to get my real-time %K.
      
      2. I calculate the %D based on the previous two "true" %K values along with the "real-time" %K. More specifically, the previous
      "true" %K values are calculated from 1) the PDC against the highest High and lowest Low of the 14 days ending on the previous
      day, and 2) the same calculation shifted back one more day. I then add those two values along with the real-time %K, then divide
      that total by my d_period (3), to ultimately get my real-time %D.
          - E.g., If I'm on Day 16, I can get my "true" %K values (calculated from the Close prices of the last day) from 1) days 1-14
          and 2) days 2-15, add those with my real-time %K I calculated above, and then divide by 3 to get my real-time %D.
    """
    # Copying only the columns we need
    df_so = df[["PDH", "PDL", "PDC", "PM_High_0928", "PM_Low_0928", "PM_Close_0928"]].copy()

    # Getting the previous %K, ONLY for the purposes of getting the real-time %D later on
    df_so["PDL_Min_14"] = df_so["PDL"].rolling(window=(k_period)).min()
    df_so["PDH_Max_14"] = df_so["PDH"].rolling(window=(k_period)).max()
    df_so["Prev_K"] = (df_so["PDC"] - df_so["PDL_Min_14"]) / (df_so["PDH_Max_14"] - df_so["PDL_Min_14"]) * 100
    
    # Getting the lowest lows and the highest highs of the past 13 periods plus the low/high at 9:28am to get real-time %K
    df_so["PDL_Min_13"] = df_so["PDL"].rolling(window=(k_period-1)).min()
    df_so["PDH_Max_13"] = df_so["PDH"].rolling(window=(k_period-1)).max()
    df_so["Low_Min_RT"] = df_so[["PDL_Min_13", "PM_Low_0928"]].min(axis=1)  # Lowest minimum in real-time (9:28am)
    df_so["High_Max_RT"] = df_so[["PDH_Max_13", "PM_High_0928"]].max(axis=1)  # Highest maximum in real-time (9:28am)

    # Defining the real-time %K and %D based on previous metrics
    df_so["Sto_Osc_K_RT"] = (df_so["PM_Close_0928"] - df_so["Low_Min_RT"]) / (df_so["High_Max_RT"] - df_so["Low_Min_RT"]) * 100
    df_so["Sto_Osc_D_RT"] = (df_so["Prev_K"].rolling(window=(d_period-1)).sum() + df_so["Sto_Osc_K_RT"]) / d_period

    return df_so["Sto_Osc_K_RT"], df_so["Sto_Osc_D_RT"]

# Getting the real-time %K and %D into the original df
Sto_Osc_K_RT, Sto_Osc_D_RT = calculate_stochastic_oscillator(data_df)
data_df["Sto_Osc_K_RT"] = Sto_Osc_K_RT
data_df["Sto_Osc_D_RT"] = Sto_Osc_D_RT
data_df.loc[data_df["Rank"] <= 13, ["Sto_Osc_K_RT"]] = np.nan  # Blanking %K for each ticker's first 13 days (not enough history)
data_df.loc[data_df["Rank"] <= 15, ["Sto_Osc_D_RT"]] = np.nan  # Blanking %D for each ticker's first 15 days (not enough history)

display(data_df[12:18])
display(data_df.shape)

Unnamed: 0,Datetime,Open,High,Low,Close,Volume,PM_High,PM_Low,PM_Volume,Ticker,PM_High_0927,PM_Low_0927,PM_Close_0927,PM_Volume_0927,PM_High_0928,PM_Low_0928,PM_Close_0928,PM_Volume_0928,Rank,Name,Sub_Industry_Orig,Sub_Industry,Industry,Industry_Group,Sector,Market_Cap,Close_Open_Diff,Close_Open_Diff_Perc,Close_Open_Diff_Binary,Close_Open_Diff_Ternary,PDO,PDH,PDL,PDC,PDV,PD_PM_H_0927,PD_PM_L_0927,PD_PM_C_0927,PD_PM_V_0927,PD_PM_H_0928,PD_PM_L_0928,PD_PM_C_0928,PD_PM_V_0928,Gap_from_PDC,Gap_Binary_Threshold,Daily_Range_Gap_Up,DR_Gap_Up_Binary,Daily_Range_Gap_Down,DR_Gap_Down_Binary,Avg_PM_Vol_Prev_14D,PM_Vol_Change,SMA_Prev_14,SMA_Prev_50,EMA_Prev_9,EMA_Prev_20,RSI_14,ATR_14,NATR_14,SMA_20,Upper_BBand,Lower_BBand,Num_STD_from_SMA_20,Close_Chg_Amt,Prev_Pos_Neg_Vol,Prev_OBV,Sto_Osc_K_RT,Sto_Osc_D_RT
2020-10-19,2020-10-19,119.96,120.419,115.66,116.0,100858537.0,121.05,119.02,1966792.0,AAPL,121.05,119.02,120.01,1926163.0,121.05,119.02,119.97,1949307.0,13,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-3.96,-3.3011,False,0,121.25,121.548,118.81,118.96,92366893.0,122.11,120.65,121.75,1571778.0,122.11,120.65,121.85,1685166.0,0.00849,False,0.0,False,0.0,False,,,,,119.12951,,,,,,,,,-2.96,-92366893.0,-154516677.0,,
2020-10-20,2020-10-20,116.23,118.98,115.63,117.5,103813586.0,118.35,116.12,1674408.0,AAPL,118.35,116.21,116.24,1617201.0,118.35,116.12,116.24,1648903.0,14,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,1.27,1.092661,True,2,119.96,120.419,115.66,116.0,100858537.0,121.05,119.02,120.01,1926163.0,121.05,119.02,119.97,1949307.0,0.002069,False,0.0,False,0.0,False,,,,,118.503608,,,,,,,,,1.5,-100858537.0,-255375214.0,30.523918,
2020-10-21,2020-10-21,116.63,118.705,116.45,116.86,73640108.0,117.94,116.5,1272131.0,AAPL,117.94,116.5,117.27,1015301.0,117.94,116.5,116.75,1201839.0,15,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,0.23,0.197205,True,1,116.23,118.98,115.63,117.5,103813586.0,118.35,116.21,116.24,1617201.0,118.35,116.12,116.24,1648903.0,-0.006383,False,0.0,False,0.0,False,2443894.0,-0.508228,117.606071,,118.302886,,49.924812,3.272357,2.784985,,,,,-0.64,103813586.0,-151561628.0,34.396355,
2020-10-22,2020-10-22,117.41,118.04,114.59,115.774,86318174.0,117.58,116.0,1522067.0,AAPL,117.58,116.0,117.45,1487131.0,117.58,116.0,117.45,1502827.0,16,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-1.636,-1.393408,False,0,116.63,118.705,116.45,116.86,73640108.0,117.94,116.5,117.27,1015301.0,117.94,116.5,116.75,1201839.0,0.005049,False,0.0,False,0.0,False,2406386.0,-0.375484,117.610357,,118.014309,,57.395659,3.218429,2.754089,,,,,-1.086,-73640108.0,-225201736.0,39.57382,38.298841
2020-10-23,2020-10-23,116.38,116.55,114.28,115.05,68258650.0,116.65,115.88,790808.0,AAPL,116.65,115.88,116.38,759476.0,116.65,115.88,116.33,773781.0,17,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,-1.33,-1.142808,False,0,117.41,118.04,114.59,115.774,86318174.0,117.58,116.0,117.45,1487131.0,117.58,116.0,117.45,1502827.0,0.004802,False,0.0,False,0.0,False,2200176.0,-0.648309,117.807071,,117.566247,,49.617932,3.193143,2.758083,,,,,-0.724,-86318174.0,-311519910.0,31.050228,31.033563
2020-10-26,2020-10-26,114.01,116.55,112.88,115.06,90955903.0,115.32,113.31,1556823.0,AAPL,115.32,113.31,114.05,1470021.0,115.32,113.31,114.0,1525760.0,18,Apple Inc.,Consumer Electronics,Consumer Electronics,Household Durables,Consumer Durables & Apparel,Consumer Discretionary,2918340000000.0,1.05,0.920972,True,2,116.38,116.55,114.28,115.05,68258650.0,116.65,115.88,116.38,759476.0,116.65,115.88,116.33,773781.0,-0.009126,False,0.0,False,-0.00245,True,2135028.0,-0.285368,117.700643,,117.062998,,51.658768,3.159857,2.746508,,,,,0.01,-68258650.0,-379778560.0,5.711921,17.946591


(1055000, 67)
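
To make the real-time %K and %D arithmetic easier to follow, here is a minimal scalar sketch on made-up numbers, using 3 previous days instead of 13. The helper names `real_time_k` and `real_time_d` are hypothetical and are not part of the notebook; they only restate the two formulas from `calculate_stochastic_oscillator` for a single day.

```python
def real_time_k(prev_highs, prev_lows, pm_high, pm_low, pm_close):
    # Lowest low / highest high across the previous days plus today's pre-market range
    lowest_low = min(min(prev_lows), pm_low)
    highest_high = max(max(prev_highs), pm_high)
    return (pm_close - lowest_low) / (highest_high - lowest_low) * 100

def real_time_d(prev_true_ks, k_rt):
    # Average of the previous "true" %K values and the real-time %K
    return (sum(prev_true_ks) + k_rt) / (len(prev_true_ks) + 1)

# Made-up inputs: three previous days of highs/lows, then today's pre-market up to 9:28am
k_rt = real_time_k(
    prev_highs=[12.0, 13.0, 12.5],
    prev_lows=[10.0, 11.0, 10.5],
    pm_high=14.0,
    pm_low=11.5,
    pm_close=13.0,
)

# Made-up "true" %K values for the two previous days
d_rt = real_time_d([60.0, 45.0], k_rt)

print(k_rt, d_rt)
```

Here the pre-market high (14.0) sets the top of the range and a previous day's low (10.0) sets the bottom, so %K is (13 - 10) / (14 - 10) * 100 = 75, and %D is (60 + 45 + 75) / 3 = 60. A completely flat window (highest high equal to lowest low) would divide by zero in this scalar version.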

There are certain metrics that I won't add to the model yet. Notably, I'll leave out MACD (because it indicates entry and exit signals, and my outcome variables don't care about intraday trading) and VWAP (because it needs 1-minute data that I haven't integrated yet).