
## Source data
- Kaggle Intraday Stock Data Set: https://www.kaggle.com/datasets/borismarjanovic/daily-and-intraday-stock-price-data

### Prereqs
- Tested with Python 3.10
- Assumes a folder of CSV-formatted data files that can be concatenated into a single master training data set containing multiple stock sequences
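
The folder assumption above can be checked before any data is loaded. The following is a minimal sketch, not part of the original notebook: the helper name `list_stock_files` is hypothetical, and it assumes the `<stocksymbol>.<countrycode>.txt` file naming that the loading cell further down relies on.

```python
from pathlib import Path


def list_stock_files(folder, suffix=".txt"):
    """Map stock symbol -> file path for every data file in `folder`.

    Assumes file names look like "<stocksymbol>.<countrycode>.txt".
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Data folder not found: {folder}")

    files = {}
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix == suffix:
            # Everything before the first dot is treated as the stock symbol.
            symbol = path.name.split(".", 1)[0]
            files[symbol] = path

    if not files:
        raise ValueError(f"No {suffix} files found in {folder}")
    return files
```

Calling `list_stock_files(folder_path)` on the data folder would fail early with a clear message if the folder is missing or empty, rather than part-way through the concatenation loop.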


In [5]:
import os

# Set the working directory first, then print it to confirm the runtime is executing in the expected directory
os.chdir("C:/Users/angusf/source/repos/synth_nonin")  # forward slashes also work on Windows
cwd = os.getcwd()
print(f"Current working directory: {cwd}")

Current working directory: C:\Users\angusf\source\repos\synth_nonin
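
As an alternative to changing the process-wide working directory, the data path can be built from a fixed root. This is only a sketch under stated assumptions: `PROJECT_ROOT` and `data_folder` are hypothetical names, and the sub-folder names are the ones used by the loading cell in this notebook.

```python
from pathlib import Path

# Assumption: the project root is the same folder the notebook changes into above.
PROJECT_ROOT = Path("C:/Users/angusf/source/repos/synth_nonin")


def data_folder(root, *parts):
    """Join `parts` onto `root`, letting pathlib supply the separators."""
    return Path(root).joinpath(*parts)


subset_folder = data_folder(PROJECT_ROOT, "KaggleDataSet", "5 Min", "Stocks", "SubSet")
print(subset_folder.as_posix())
```

Passing each folder name as a separate argument avoids hand-written separators, so a missing or doubled slash cannot slip into the path.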


In [6]:
import pandas  # pandas is used to manipulate dataframes
import re  # regular expressions are used to extract the stock symbol from the source file name

# Set the path to the folder containing the CSV files.
# os.path.join supplies the separator between cwd and the sub-path, which plain string concatenation left out.
folder_path = os.path.join(cwd, 'KaggleDataSet/5 Min/Stocks/SubSet')  # note the use of forward slashes
print(f"Folder path to bulk CSV files: {folder_path}")
# Get a list of all the CSV-formatted files in the folder (the source files use a .txt extension)
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]
print(f"Number of CSV files found: {len(csv_files)}")

# Read each CSV file into its own dataframe, then concatenate them all together
df_list = []
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    individual_file_df = pandas.read_csv(file_path)

    # Use the characters before the first dot as the stock symbol in the dataframe.
    # Assumes the file name is in the format "<stocksymbol>.<countrycode>.txt".
    # [^.]+ requires at least one character, so a name starting with a dot falls through to 'null'.
    filenamematch = re.match(r'^([^.]+)', file)
    if filenamematch:
        StockSymbolfromFilename = filenamematch.group(1)
    else:
        StockSymbolfromFilename = 'null'

    print(f"Stock symbol from filename:{StockSymbolfromFilename}")
    # Add a "Symbol" column populated with the stock symbol extracted from the file name
    individual_file_df['Symbol'] = StockSymbolfromFilename
    # Drop the unneeded OpenInt column: axis=1 drops a column, inplace=True modifies the dataframe rather than returning a copy
    individual_file_df.drop(['OpenInt'], axis=1, inplace=True)
    # Combine the Date and Time columns into a single string column called "Combineddatetime"
    individual_file_df['Combineddatetime'] = individual_file_df['Date'] + ' ' + individual_file_df['Time']

    df_list.append(individual_file_df)  # collect this file's dataframe for concatenation

# Concatenate the per-file dataframes into the master dataframe
master_input_data = pandas.concat(df_list, ignore_index=True)

# Print the resulting dataframe and save it to a CSV file
print(master_input_data)
master_input_data.to_csv('master_concatenated_stock_input_data.csv', index=False)


Folder path to bulk CSV files: C:\Users\angusf\source\repos\synth_nonin\KaggleDataSet/5 Min/Stocks/SubSet
Number of CSV files found: 202
Stock symbol from filename:a
Stock symbol from filename:aan
Stock symbol from filename:aav
Stock symbol from filename:abb
Stock symbol from filename:aeb
Stock symbol from filename:aed
Stock symbol from filename:aee
Stock symbol from filename:aeg
Stock symbol from filename:aegn
Stock symbol from filename:aeh
Stock symbol from filename:aehr
Stock symbol from filename:aeis
Stock symbol from filename:aek
Stock symbol from filename:ael
Stock symbol from filename:aem
Stock symbol from filename:aemd
Stock symbol from filename:aeo
Stock symbol from filename:aep
Stock symbol from filename:aer
Stock symbol from filename:aeri
Stock symbol from filename:aes
Stock symbol from filename:aet
Stock symbol from filename:aeti
Stock symbol from filename:aeua
Stock symbol from filename:aey
Stock symbol from filename:aezs
Stock symbol from filename:afam
Stock symbol from 