# Sequence Preparation
This notebook prepares a data file for machine learning (ML) experiments and then runs one experiment on it.  The data file is a CSV that gains two columns: a sequence group column, which groups related rows into sequences, and a label (target) column, which holds the value to predict for each sequence.  The resulting data is used to train and test a classification model that predicts the target value for new sequences.

The steps build on one another, so the code cells should be run in order.  After the last preparation step the data is ready for the ML experiment.

## File Preparation
The basic steps for getting a file ready for the ML experiment are:
1. Load the data file
1. Filter the columns of interest, which include the features and the target (also known as the label)
1. Normalize the data
1. Create a sequence group column using a rolling window
1. Shape the data
1. Split the data, build, compile & train the model
1. Evaluate the model

### Data File Assumptions
- The data file will be a CSV
- The data file will have a header row
- The data file will have a column that contains the target value
- The data file will be in sequential chronological order
- The data file will have a column that contains a date and time value to aid sequence grouping
- The data file is already cleansed with regard to missing values and outliers
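
These assumptions can be checked before running the steps below. The following is a minimal sketch, assuming the column names `date` and `reversal` that are used later in this notebook; `check_assumptions` is a hypothetical helper, and the inline CSV is made-up data standing in for the real file.

```python
import io
import pandas as pd

def check_assumptions(df, date_column, target_column):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Both the date/time column and the target column must be present
    for column in (date_column, target_column):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if problems:
        return problems
    # Rows are expected to be in chronological order
    dates = pd.to_datetime(df[date_column])
    if not dates.is_monotonic_increasing:
        problems.append("rows are not in chronological order")
    # The file is expected to be free of missing values
    if df.isna().any().any():
        problems.append("data contains missing values")
    return problems

# Made-up sample standing in for the real CSV file
sample_csv = io.StringIO(
    "date,reversal,rsi\n"
    "2024-01-02 09:30:00.000,0,41.2\n"
    "2024-01-02 09:35:00.000,1,43.8\n"
    "2024-01-02 09:40:00.000,0,40.1\n"
)
df_check = pd.read_csv(sample_csv)
problems = check_assumptions(df_check, "date", "reversal")
print(problems)  # []
```

The check for outliers is left out on purpose, since what counts as an outlier depends on the data.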


**Steps 1 & 2: Load the Data File and Filter the Columns of Interest**
- Update the file paths
- Update the column names for the features and the target

In [57]:
# STEP 1 & 2
# Declare the input & output file paths, the columns to write and the target column
# perform the imports
import pandas as pd

file_in = './data/sample.csv'
file_tmp = './tmp/sample.csv'
file_out = './data_prod/sample.csv'
file_training = './data_prod/sample_training.csv'
file_testing = './data_prod/sample_testing.csv'

columns_to_write = [
    'date',
    'reversal',
    'adx',
    'dmi',
    'fisherstransform',
    'fosc',
    'linreg',
    'linregintercept',
    'linregslope',
    'macd',
    'macd_avg',
    'macd_diff',
    'parabolic_sar',
    'rsi',
    'rsi_avg',
    'trendsequence']
target_column = 'reversal'

# Load the data from the input CSV file into a pandas dataframe
df = pd.read_csv(file_in)

# Convert 'date' column to datetime format for easier manipulation
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S.%f')

# Make sure the target column is stored as an integer
df[target_column] = df[target_column].astype(int)

# Keep only the columns of interest
df_filtered = df[columns_to_write]

# Optionally save the filtered dataframe to a temporary CSV file for inspection
#df_filtered.to_csv(file_tmp, index=False)

**Step 3: Normalize the Data**

This step uses scikit-learn's `MinMaxScaler` to rescale every feature to the range 0 to 1.  The `date` and `reversal` columns are excluded from scaling and left unchanged.
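
For reference, min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`, so the column minimum becomes 0 and the column maximum becomes 1. The sketch below reproduces that arithmetic in plain Python on made-up numbers; `min_max_scale` is a hypothetical helper written for illustration, not part of scikit-learn, and it assumes the column is not constant.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative values only; not taken from the sample data file
rsi = [30.0, 45.0, 60.0, 90.0]
scaled = min_max_scale(rsi)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```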

In [58]:
from sklearn.preprocessing import MinMaxScaler

# List of features to exclude from normalization
features_to_exclude = ['date', 'reversal']

# Dynamically select features to normalize (all features except the ones to exclude)
features_to_normalize = [col for col in df_filtered.columns if col not in features_to_exclude]

# Initialize the Min-Max Scaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit the scaler to your data (for the features to be normalized)
scaler.fit(df_filtered[features_to_normalize])

# Transform the data using the fitted scaler
df_normalized = df_filtered.copy()  # Create a copy of the DataFrame to keep the original data intact
df_normalized[features_to_normalize] = scaler.transform(df_filtered[features_to_normalize])

# df_normalized now contains the normalized data, excluding the specified features
#df_normalized.to_csv(file_out, index=False)

**Step 4: Create a Sequence Group Column Using a Rolling Window**

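This step slides a rolling window of `rows_per_group` consecutive records over the data and assigns each window a sequence group number.  A window is kept only if all of its records fall on the same calendar day and the record immediately after it falls on that day too.  The `reversal` value of that next record becomes the label of the window, stored as `future_reversal`.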
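
As a rough illustration of the idea, the sketch below slides a fixed-size window over a toy list of records. A window is kept only when all of its records share one day and the record right after it is on that day too, and that next record supplies the label. The record layout, the `build_windows` helper and the toy values are assumptions made for this example; the notebook cell does the equivalent work with pandas.

```python
from datetime import datetime

def build_windows(records, rows_per_group):
    """Return (window, label) pairs from chronologically ordered records.

    Each record is a (timestamp, reversal) tuple. The label of a window is
    the reversal value of the record that immediately follows it.
    """
    pairs = []
    # Stopping at len - rows_per_group guarantees a "next" record exists
    for start in range(len(records) - rows_per_group):
        window = records[start:start + rows_per_group]
        next_record = records[start + rows_per_group]
        days = {timestamp.date() for timestamp, _ in window}
        # Keep the window only if it and the next record share a single day
        if len(days) == 1 and next_record[0].date() in days:
            pairs.append((window, next_record[1]))
    return pairs

# Toy data: four records on one day, two on the next
records = [
    (datetime(2024, 1, 2, 9, 30), 0),
    (datetime(2024, 1, 2, 9, 35), 0),
    (datetime(2024, 1, 2, 9, 40), 1),
    (datetime(2024, 1, 2, 9, 45), 0),
    (datetime(2024, 1, 3, 9, 30), 1),
    (datetime(2024, 1, 3, 9, 35), 0),
]
pairs = build_windows(records, rows_per_group=2)
labels = [label for _, label in pairs]
print(len(pairs), labels)  # 2 [1, 0]
```

Windows starting at the third and fourth records are dropped: the first has its next record on a different day, and the second spans two days.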

In [62]:
# STEP 4
rows_per_group = 5

# Calendar day of each record, used to keep every window within a single day
df_normalized['logical_date'] = df_normalized['date'].dt.date

# Collect the windows in a list and concatenate them once at the end
windows = []

# Loop over the DataFrame to create rolling windows; stopping at
# len - rows_per_group guarantees that a next record always exists
for start in range(len(df_normalized) - rows_per_group):
    window = df_normalized.iloc[start:start + rows_per_group]
    next_record = df_normalized.iloc[start + rows_per_group]

    # Keep the window only if all of its records, and the record that follows it, share one logical date
    same_day = window['logical_date'].nunique() == 1
    next_same_day = next_record['logical_date'] == window.iloc[-1]['logical_date']
    if same_day and next_same_day:
        window_copy = window.copy()
        window_copy['sequence_group'] = start + 1
        # The label is the reversal value of the record right after the window
        window_copy['future_reversal'] = next_record['reversal']
        windows.append(window_copy)

final_df = pd.concat(windows, ignore_index=True)

# Drop the date columns and convert all remaining columns to float
final_df = final_df.drop(['date','logical_date'], axis=1)
final_df = final_df.astype(float)

# Report the number of rows (number of sequences x rows_per_group)
print(len(final_df))
#print(final_df.head())

# Optionally write the final DataFrame to a CSV file for inspection
#final_df.to_csv(file_tmp, index=False)

4045


**Step 5: Shape the Data**

Reshape the data for RNN input, which requires a 3D array of shape `[samples, timesteps, features]`.  Here the number of timesteps is `rows_per_group`.
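
To make the target shape concrete, the sketch below reshapes a small flat table into `[samples, timesteps, features]` with NumPy. The sizes (3 sequences, 5 timesteps, 4 features) are arbitrary values chosen for illustration, and the reshape assumes that the rows of each sequence are stored contiguously and that every sequence has the same length.

```python
import numpy as np

num_sequences, timesteps, num_features = 3, 5, 4  # illustrative sizes

# A flat table: one row per timestep, sequences stacked one after another
flat = np.arange(num_sequences * timesteps * num_features, dtype=float)
flat = flat.reshape(num_sequences * timesteps, num_features)

# Group every `timesteps` consecutive rows into one sample
cube = flat.reshape(num_sequences, timesteps, num_features)

print(flat.shape)  # (15, 4)
print(cube.shape)  # (3, 5, 4)
# The first sample is exactly the first `timesteps` rows of the flat table
print(np.array_equal(cube[0], flat[:timesteps]))  # True
```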

In [65]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Drop the 'sequence_group' column and separate features and labels
X = final_df.drop(['sequence_group', 'future_reversal'], axis=1)
y = final_df['future_reversal']

# Rows are already stored group by group, so reshape them to the RNN input shape
num_features = X.shape[1]
num_sequences = len(final_df) // rows_per_group

X_reshaped = X.values.reshape((num_sequences, rows_per_group, num_features))
# All rows of a sequence carry the same label, so keep the first one
y_reshaped = y.values.reshape((num_sequences, rows_per_group))[:, 0]

**Step 6: Split the Data, Build, Compile & Train the Model**
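
One point to be aware of: the rolling windows from Step 4 overlap, so a shuffled split can place nearly identical windows in both the training set and the test set. Holding out the most recent samples is a common alternative for time-ordered data. The sketch below shows one way to do that with plain slicing on made-up arrays; it is not what the cell in this step does, and `chronological_split` is a hypothetical helper.

```python
import numpy as np

def chronological_split(X, y, test_size=0.2):
    """Hold out the last `test_size` fraction of samples, without shuffling."""
    num_test = int(round(len(X) * test_size))
    split_at = len(X) - num_test
    return X[:split_at], X[split_at:], y[:split_at], y[split_at:]

# Made-up data: 10 sequences of 5 timesteps with 2 features each
X_demo = np.arange(10 * 5 * 2, dtype=float).reshape(10, 5, 2)
y_demo = np.arange(10) % 2

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo, test_size=0.2)
print(X_tr.shape, X_te.shape)  # (8, 5, 2) (2, 5, 2)
```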


In [66]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_reshaped, y_reshaped, test_size=0.2, random_state=42)

# Build the RNN model
model = Sequential([
    SimpleRNN(50, input_shape=(rows_per_group, num_features), return_sequences=False),
    Dropout(0.5),
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Step 7: Evaluate the Model**


In [72]:
from sklearn.metrics import confusion_matrix

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_acc:.4f}")

# Predict probabilities for the test set and convert them to binary class labels
y_pred = model.predict(X_test)
y_pred_classes = (y_pred > 0.5).astype("int32").ravel()  # Threshold at 0.5 and flatten to 1D

# Compute the confusion matrix (rows = actual classes, columns = predicted classes)
cm = confusion_matrix(y_test, y_pred_classes)

# Labels for rows and columns
labels = ['Not Reversed', 'Reversed']

# Prepare the header and column names with separators
header = f"| {'':12} | {'Predicted':^27} |"
column_names = f"| {'Actual':<12} | {labels[0]:^12} | {labels[1]:^12} |"

# Separator for clarity between header and rows
separator = "+" + "-"*14 + "+" + "-"*14 + "+" + "-"*14 + "+"

# Data rows: each row of cm is one actual class, each column one predicted class
row1 = f"| {labels[0]:<12} | {cm[0][0]:^12} | {cm[0][1]:^12} |"
row2 = f"| {labels[1]:<12} | {cm[1][0]:^12} | {cm[1][1]:^12} |"

# Assemble the full table
table = f"\n{header}\n{separator}\n{column_names}\n{separator}\n{row1}\n{separator}\n{row2}\n{separator}"

# Print the table
print(table)


Test Accuracy: 0.6605

|              |          Predicted          |
+--------------+--------------+--------------+
| Actual       | Not Reversed |   Reversed   |
+--------------+--------------+--------------+
| Not Reversed |     107      |      0       |
+--------------+--------------+--------------+
| Reversed     |      55      |      0       |
+--------------+--------------+--------------+
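
Accuracy alone can hide a lot here. As a worked check, the sketch below recomputes accuracy and recall from the four counts in the confusion matrix above, assuming the usual scikit-learn layout in which rows are actual classes and columns are predicted classes. The variable names are chosen for this example only.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class)
tn, fp = 107, 0  # actual "Not Reversed"
fn, tp = 55, 0   # actual "Reversed"

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
# Share of actual reversals that the model caught
recall_reversed = tp / (tp + fn)
# Accuracy obtained by always predicting "Not Reversed"
majority_share = (tn + fp) / total

print(f"accuracy         = {accuracy:.4f}")         # 0.6605
print(f"recall (class 1) = {recall_reversed:.4f}")  # 0.0000
print(f"majority share   = {majority_share:.4f}")   # 0.6605
```

The accuracy equals the share of the majority class and no test sample was predicted as `Reversed`, so on this run the model behaves like a constant majority-class predictor.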
