# Cleaning All 1-Minute Files From FirstRate Data
Previously, our cleaning process worked as follows: Hayes started with the entire set of FRD csv files, converted them into parquet files, uploaded them to Google Drive, and filtered them down using a set of criteria. I then filtered further on Google Drive (through Google Colab) and ended up with only the parquet day files that had 1) data starting at 8:00am at the latest, and 2) a PM volume of at least 100,000 shares traded. This reduced the number of stocks significantly, and we planned to test the result with the BHOD Python backtester I developed.
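
For reference, the two criteria above can be sketched as a single check on one day file. This is only an illustration, not the code Hayes or I actually ran: the column names `timestamp` and `volume`, the helper name `passes_day_filter`, and the reading of "PM volume" as everything traded before the 9:30am open are all assumptions.

```python
import pandas as pd

def passes_day_filter(day, latest_start_min=8 * 60, open_min=9 * 60 + 30,
                      min_pm_volume=100_000):
    """Return True if one day's 1-minute bars meet both criteria.

    Assumes `day` has a datetime64 `timestamp` column and a numeric
    `volume` column (hypothetical names, not confirmed by the source).
    """
    if day.empty:
        return False
    # Minutes since midnight for every bar in the day
    minutes = day["timestamp"].dt.hour * 60 + day["timestamp"].dt.minute
    # Criterion 1: the first bar arrives at 8:00am or earlier
    starts_early_enough = minutes.min() <= latest_start_min
    # Criterion 2: volume before the (assumed) 9:30am open is large enough
    pm_volume = day.loc[minutes < open_min, "volume"].sum()
    return bool(starts_early_enough and pm_volume >= min_pm_volume)
```

Calling `passes_day_filter(df)` on each day file and keeping only the files that return `True` would reproduce the kind of subset described above, under those assumptions.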

However, we switched gears and started focusing on 1) forward testing through a TWS API bot, and 2) developing ML-based strategies. As of April 2025, Hayes is still working on the bot. I spent the last month working on my own bot as well, but I'm dedicating these notebooks to testing ML strategies that do not rely on our preconceived notions of a "Stock in Play" or of what counts as a good entry or exit. I'm therefore going to re-download and re-clean all of the files from FirstRate Data so that we have as much data as possible, and then I'll likely restrict what we actually use to a subset. More specifically, the plan is to use only data from after the market returned to normal following the COVID pandemic (i.e., starting ~October 2020) and to build our models on the time periods after that point.

## Imports

In [1]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import heatmap
import dask
import dask.dataframe as dd
from dask import delayed
from pyarrow.parquet import ParquetFile
import pyarrow as pa
from tqdm import tqdm

import tulipy as ti

import sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.decomposition import IncrementalPCA

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras import initializers
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import register_keras_serializable
from tensorflow.keras.optimizers import SGD
import keras_tuner as kt
from keras_tuner import HyperParameters

import os
import sys
import warnings

# Converting all FRD csv files to parquet files
I've downloaded all the FRD 1-min data onto the SSD, and now I'll begin the process of creating a new folder and converting every one of those files into parquet format.
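
A minimal sketch of what that conversion could look like is below. It is not the final implementation: the folder arguments, the helper names `parquet_path_for` and `convert_all`, and the assumption that each csv has no header row and six columns in the order listed in `FRD_COLUMNS` are all mine and should be checked against the actual files first.

```python
from pathlib import Path

import pandas as pd

# Assumed column layout of the FRD 1-min csv files (hypothetical; verify first)
FRD_COLUMNS = ["timestamp", "open", "high", "low", "close", "volume"]

def parquet_path_for(csv_path, csv_root, parquet_root):
    """Mirror a csv file's location under csv_root into parquet_root."""
    relative = Path(csv_path).relative_to(csv_root)
    return Path(parquet_root) / relative.with_suffix(".parquet")

def convert_all(csv_root, parquet_root):
    """Convert every csv under csv_root to parquet; return the written paths."""
    written = []
    for csv_path in sorted(Path(csv_root).rglob("*.csv")):
        out_path = parquet_path_for(csv_path, csv_root, parquet_root)
        # Create the new folder (and any subfolders) on first use
        out_path.parent.mkdir(parents=True, exist_ok=True)
        df = pd.read_csv(csv_path, header=None, names=FRD_COLUMNS,
                         parse_dates=["timestamp"])
        # Writing parquet needs an engine such as pyarrow installed
        df.to_parquet(out_path, index=False)
        written.append(out_path)
    return written
```

Keeping the path mapping in its own function makes it easy to confirm where each file will land before anything is written to the SSD.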