
## Source data
- Kaggle Intraday Stock Data Set: https://www.kaggle.com/datasets/borismarjanovic/daily-and-intraday-stock-price-data

### Prereqs
- Tested with Python 3.10
- Assumes a folder of CSV data sets, one file per stock, that can be concatenated into a master training data set containing multiple stock sequences
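
The concatenation assumption above can be checked before any data is loaded. The following is a minimal sketch, not part of the original notebook: the helper names `headers_by_file` and `mismatched_files` are hypothetical, and it assumes the files are comma-separated with a header row and a `.txt` extension, as in the Kaggle download.

```python
import csv
from pathlib import Path


def headers_by_file(folder, suffix=".txt"):
    """Map each data file name in `folder` to its header row (a list of column names)."""
    headers = {}
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        with path.open(newline="") as handle:
            # An empty file has no header row; record it as an empty list.
            headers[path.name] = next(csv.reader(handle), [])
    return headers


def mismatched_files(headers):
    """Return the names of files whose header differs from the first file's header."""
    if not headers:
        return []
    reference = next(iter(headers.values()))
    return [name for name, columns in headers.items() if columns != reference]
```

If `mismatched_files(headers_by_file(folder_path))` returns an empty list, every file shares the same columns and the frames should stack cleanly. Any names it does return are worth inspecting before they are concatenated.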


In [5]:
import os
# Set and view the working directory to ensure the runtime is executing in the expected directory
os.chdir("C:/Users/angusf/source/repos/synth_nonin")  # note the use of forward slashes
cwd = os.getcwd()
print(f"Current working directory: {cwd}")

Current working directory: C:\Users\angusf\source\repos\synth_nonin
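
An alternative to changing the working directory is to build an absolute path to the data folder and fail early if it is missing. The following is a minimal sketch, not the notebook's own method: `resolve_data_folder` is a hypothetical helper, and the folder names passed to it are whatever the local project layout uses.

```python
from pathlib import Path


def resolve_data_folder(project_root, *parts):
    """Join `parts` onto `project_root` and return the absolute folder path.

    Raises FileNotFoundError if the resulting folder does not exist.
    """
    folder = Path(project_root).joinpath(*parts).resolve()
    if not folder.is_dir():
        raise FileNotFoundError(f"Data folder not found: {folder}")
    return folder
```

For example, `resolve_data_folder(os.getcwd(), "KaggleDataSet", "5 Min", "Stocks", "SubSet")` would locate the subset folder, or raise before any file is read if the notebook was started from the wrong directory.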


In [6]:
import pandas # pandas used to manipulate dataframes
import re # regular expressions used to grab stock name from source data file name

# set the path to the folder containing the CSV files
folder_path = os.path.join(cwd, 'KaggleDataSet', '5 Min', 'Stocks', 'SubSet')  # os.path.join inserts the path separators
print(f"Folder path to bulk CSV files: {folder_path}")
# get a list of all the data files in the folder (CSV-formatted, but saved with a .txt extension)
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]
print(f"Number of CSV files found: {len(csv_files)}")

# read each CSV file into a separate dataframe and concatenate them together
df_list = []
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    individual_file_df = pandas.read_csv(file_path)

    # extract the leftmost characters up to the first dot to use as the stock symbol in the dataframe
    # assumes the filename is in the format "<stocksymbol>.<countrycode>.txt"
    filenamematch = re.match(r'^([^.]+)', file)
    if filenamematch:
        StockSymbolfromFilename = filenamematch.group(1)
    else:
        StockSymbolfromFilename = 'null'  # fallback when the filename starts with a dot

    print(f"Stock symbol from filename:{StockSymbolfromFilename}")
    # add a "Symbol" column populated with the stock symbol extracted from the filename
    individual_file_df['Symbol'] = StockSymbolfromFilename
    # drop the unneeded OpenInt column (axis=1 drops a column; inplace=True modifies the dataframe instead of returning a copy)
    individual_file_df.drop(['OpenInt'], axis=1, inplace=True)
    # combine the Date and Time columns into a single column called Combineddatetime
    individual_file_df['Combineddatetime'] = individual_file_df['Date'] + ' ' + individual_file_df['Time']

    df_list.append(individual_file_df)  # collect this file's dataframe for concatenation
master_input_data = pandas.concat(df_list, ignore_index=True)  # concatenate all dataframes into the master dataframe

# print the resulting dataframe and save it to a CSV file
print(master_input_data)
master_input_data.to_csv('master_concatenated_stock_input_data.csv', index=False)


Folder path to bulk CSV files: C:\Users\angusf\source\repos\synth_nonin./KaggleDataSet/5 Min/Stocks/SubSet
Number of CSV files found: 15
Stock symbol from filename:aapl
Stock symbol from filename:ae
Stock symbol from filename:alex
Stock symbol from filename:asr
Stock symbol from filename:bld
Stock symbol from filename:cbh
Stock symbol from filename:cltl
Stock symbol from filename:cuz
Stock symbol from filename:ebf
Stock symbol from filename:eufl
Stock symbol from filename:feye
Stock symbol from filename:fve
Stock symbol from filename:hayn
Stock symbol from filename:hbann
Stock symbol from filename:hsgx
             Date      Time      Open      High       Low     Close   Volume  \
0      2017-11-17  15:35:00  171.0400  171.0500  170.2500  170.3600  1808907   
1      2017-11-17  15:40:00  170.3600  170.4100  170.0600  170.0600   481179   
2      2017-11-17  15:45:00  170.0600  170.2900  169.8300  170.2500   580184   
3      2017-11-17  15:50:00  170.2600  170.2800  169.9700  169.9700   