# Reversal Experiment using Recurrent Neural Networks

This notebook prepares a data file and runs machine learning experiments with neural networks, specifically recurrent neural networks, which retain memory across timesteps (RNN, LSTM, etc.). The models are used for binary classification.

The data file is a CSV that is augmented with a "sequence_group" column and a label (target) column. The "sequence_group" column groups rows that are related to each other into sequences, and the label column holds the target value for each sequence. The resulting file is used to train and test models that predict the target value for new sequences.

The steps outlined here will build upon one another and should be run sequentially so that the final data file will be processed using a number of different Neural Network models.

I highly recommend using a GPU for this experiment, as it greatly reduces the time it takes to train the models.

## File Preparation

The basic steps for getting a file ready for the ML experiment are:

1. Load the data file
1. Filter the columns of interest which would include the features and the target (also known as the label)
1. Normalize the data
1. Create a sequence group column using a rolling window
1. Shape the data
1. Split the data, then build, compile, train, and evaluate the models

The data file is a CSV created with NinjaTrader 8. Each row represents a Renko bar, and the remaining columns are indicators computed with their default settings. Note: These indicators are not good features for this problem because they are applied to a chart type and at a granularity that is not typical for them; they are used only to demonstrate the process of preparing the data file and running the machine learning experiments. The file was produced with the "Exporter" strategy, which I authored in my NinjaTrader repository on GitHub. A sample data file built from NQ 30-tick Renko bars is included in the /data directory.

### Data File Assumptions

- The data file will be a CSV
- The data file will have a header row
- The data file will have a column that contains the target value and it must be a binary value
- The data file will be in sequential chronological order
- The data file will have a column that contains a date and time value to aid sequence grouping
- The data file is already cleansed with regard to missing values and outliers
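
The assumptions above can be checked before running the rest of the notebook. The following is a minimal sketch using only the standard library; the function name `check_assumptions`, the inline sample, and the assumption that the target is stored as `0`/`1` text are illustrative and not part of the original notebook.

```python
import csv
import io
from datetime import datetime

# Hypothetical three-row sample in the layout the notebook assumes.
SAMPLE_CSV = """date,reversal,atr
2024-01-02 09:30:00.000,0,1.25
2024-01-02 09:31:10.500,1,1.40
2024-01-02 09:33:45.250,0,1.10
"""

DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def check_assumptions(text, date_column="date", target_column="reversal"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    for column in (date_column, target_column):
        if column not in header:
            problems.append(f"missing column: {column}")
    if problems:
        return problems

    rows = list(reader)
    # No empty cells (missing values); stop here so later parsing cannot fail.
    if any(value in ("", None) for row in rows for value in row.values()):
        problems.append("missing values present")
        return problems
    # The target must be binary (assumed here to be stored as 0/1).
    if any(row[target_column] not in ("0", "1") for row in rows):
        problems.append("target column is not binary")
    # Rows must be in chronological order.
    stamps = [datetime.strptime(row[date_column], DATE_FORMAT) for row in rows]
    if stamps != sorted(stamps):
        problems.append("rows are not in chronological order")
    return problems


problems = check_assumptions(SAMPLE_CSV)
print(problems or "all checks passed")
```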


**Steps 1 & 2: Load the Data File and Filter the Columns of Interest**
- Update the file paths
- Update the column names for the features and the target

In [263]:
# STEP 1 & 2
# Declare the input & output file paths, the columns to write and the target column
# perform the imports
import pandas as pd

file_in = './data/sample.csv'
file_tmp = './tmp/sample.csv'
file_out = './data_prod/sample.csv'
file_training = './data_prod/sample.csv'
file_testing = './data_prod/sample.csv'

columns_to_write = [
    'date',
    'higherclose',
    'reversal',
    'trendsequence',
    'adl',
    'adx',
    'adxr',
    'apz_lower',
    'apz_upper',
    'aroonoscillator',
    'atr',
    'bollinger_lower',
    'bollinger_middle',
    'bollinger_upper',
    'bop',
    'camarilla_r1',
    'camarilla_r2',
    'camarilla_r3',
    'camarilla_r4',
    'camarilla_s1',
    'camarilla_s2',
    'camarilla_s3',
    'camarilla_s4',
    'cci',
    'chaikinmoneyflow',
    'chaikinoscillator',
    'chaikinvolatility',
    'choppinessindex',
    'cmo',
    'currentday_open',
    'currentday_low',
    'currentday_high',
    'disparityindex',
    'dm_diplus',
    'dm_diminus',
    'dmi',
    'dmindex',
    'donchian_lower',
    'donchian_mean',
    'donchian_upper',
    'doublestochastics_k',
    'easeofmovement',
    'fibonacci_pp',
    'fibonacci_r1',
    'fibonacci_r2',
    'fibonacci_r3',
    'fibonacci_s1',
    'fibonacci_s2',
    'fibonacci_s3',
    'fisherstransform',
    'fosc',
    'kama',
    'keltner_lower',
    'keltner_mean',
    'keltner_upper',
    'linreg',
    'linregintercept',
    'linregslope',
    'macd',
    'macd_avg',
    'macd_diff',
    'mama_default',
    'mama_kama',
    'mfi',
    'momentum',
    'moneyflowoscillator',
    'orderflowcumulativedelta_deltaopen',
    'orderflowcumulativedelta_deltaclose',
    'orderflowcumulativedelta_deltahigh',
    'orderflowcumulativedelta_deltalow',
    'orderflowvwap_vwap',
    'orderflowvwap_s1_lower',
    'orderflowvwap_s1_higher',
    'orderflowvwap_s2_lower',
    'orderflowvwap_s2_higher',
    'orderflowvwap_s3_lower',
    'orderflowvwap_s3_higher',
    'parabolic_sar',
    'pfe',
    'ppo',
    'priceoscillator',
    'psychologicalline',
    'rsquared',
    'relativevigorindex',
    'rind',
    'roc',
    'rsi',
    'rsi_avg',
    'rss',
    'rvi',
    'stddev',
    'stochrsi',
    'stochastics_d',
    'stochastics_k',
    'stochasticsfast_d',
    'stochasticsfast_k',
    'trix',
    'trix_signal',
    'tsf',
    'tsi',
    'ultimateoscillator',
    'vortex_viplus',
    'vortex_viminus',
    'volma',
    'volume_oscillator',
    'vroc',
    'williamsr',
    'wisemanawesomeoscillator',
    'woodiescci',
    'woodiescci_turbo',
    'woodiespivot_pp',
    'woodiespivot_r1',
    'woodiespivot_r2',
    'woodiespivot_s1',
    'woodiespivot_s2'
    ]

group_helper = 'date' # This is the column that will be used to group the data and must be a datetime column in the format '%Y-%m-%d %H:%M:%S.%f'
target_column = 'reversal' # This is the column that will be used as the target column for the model and must be a binary column

# Load the data from the input CSV file into a pandas dataframe
df = pd.read_csv(file_in)

# Convert 'date' column to datetime format for easier manipulation
df[group_helper] = pd.to_datetime(df[group_helper], format='%Y-%m-%d %H:%M:%S.%f')

# Cast the target to int and keep only the columns of interest
df[target_column] = df[target_column].astype(int)
df_filtered = df[columns_to_write]
# Optional: save the filtered dataframe to a temporary CSV file for inspection
#df_filtered.to_csv(file_tmp, index=False)

**Step 3: Normalize the Data**

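This step normalizes the features. The cell below uses RobustScaler, which centers each feature on its median and scales it by its interquartile range, making it less sensitive to outliers. MinMaxScaler, which scales each feature to a range of 0 to 1, is left in the code as a commented-out alternative.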
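
To make the difference between the two scalers imported in the cell below concrete, here is a minimal NumPy sketch of both formulas applied to a made-up column containing one outlier. It mirrors the default behavior of `MinMaxScaler` and `RobustScaler` on a single feature; the sample values are hypothetical.

```python
import numpy as np

# Hypothetical feature column with one outlier (100.0).
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (x - min) / (max - min), mapped onto [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Robust scaling: (x - median) / IQR, where IQR = 75th - 25th percentile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
robust = (values - median) / (q3 - q1)

print(min_max.round(3))
print(robust.round(3))
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0, while robust scaling keeps them evenly spread around 0 and leaves the outlier as a large value.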

In [264]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

# List of features to exclude from normalization
features_to_exclude = [group_helper, target_column]

# Dynamically select features to normalize (all features except the ones to exclude)
features_to_normalize = [col for col in df_filtered.columns if col not in features_to_exclude]

# Initialize the Scaler
#scaler = MinMaxScaler(feature_range=(0, 1))
scaler = RobustScaler() # Centers on the median and scales by the interquartile range (not z-score); better at handling outliers

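# Fit the scaler to the data (for the features to be normalized)
# Note: fitting on the full dataset lets statistics from rows later used for testing influence the scaling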
scaler.fit(df_filtered[features_to_normalize])

# Transform the data using the fitted scaler
df_normalized = df_filtered.copy()  # Create a copy of the DataFrame to keep the original data intact
df_normalized[features_to_normalize] = scaler.transform(df_filtered[features_to_normalize])

#df_normalized.to_csv(file_out, index=False)

**Step 4: Create a Sequence Group Column Using a Rolling Window**

This step uses a rolling window to create a sequence group column, which groups rows that belong to the same sequence. The window size is a fixed number of records, set by the "rows_per_group" variable, so every sequence has exactly the same length and no padding is needed. Windows that span more than one calendar date are skipped, and each sequence is labeled with the target value of the record that immediately follows it.
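
The windowing logic can be illustrated without pandas. The sketch below is a simplified stand-in for the cell that follows, not a replacement for it: the helper name `rolling_windows` and the toy dates and labels are hypothetical. A window is kept only when all of its rows, and the row right after it, fall on the same date; the label is taken from that following row.

```python
from datetime import date


def rolling_windows(dates, labels, rows_per_group):
    """Yield (start, end, future_label) for every usable window.

    A window covers rows start..end-1. It is kept only when all of its rows,
    and the row immediately after it, share the same calendar date.
    """
    for start in range(len(dates) - rows_per_group):
        end = start + rows_per_group
        # dates[start:end + 1] is the window plus the row that follows it.
        if len(set(dates[start:end + 1])) == 1:
            yield start, end, labels[end]


# Hypothetical data: five bars on one day, then three on the next.
dates = [date(2024, 1, 2)] * 5 + [date(2024, 1, 3)] * 3
labels = [0, 0, 1, 0, 1, 0, 1, 1]

windows = list(rolling_windows(dates, labels, rows_per_group=2))
print(windows)
```

Windows starting at rows 3 and 4 are dropped because they, or the row after them, cross the date boundary.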

In [265]:
# STEP 4
rows_per_group = 5

df_normalized[f'logical_{group_helper}'] = df_normalized[group_helper].dt.date
# Collect each window in a list and concatenate once at the end
# (much faster than growing a DataFrame inside the loop)
windows = []
logical_col = f'logical_{group_helper}'

# Loop over the DataFrame to create rolling windows
for start in range(len(df_normalized) - rows_per_group):
    end = start + rows_per_group
    window = df_normalized.iloc[start:end]
    # Skip windows whose rows are not all on the same date
    if len(set(window[logical_col])) != 1:
        continue
    # The label is the target of the record that follows the window,
    # provided that record is on the same logical date
    next_row = df_normalized.iloc[end]
    if window.iloc[-1][logical_col] == next_row[logical_col]:
        future_target = next_row[target_column]
    else:
        future_target = None  # Set to None if the next record is on a different date

    window_copy = window.copy()
    window_copy['sequence_group'] = start + 1
    window_copy[f'future_{target_column}'] = future_target
    windows.append(window_copy)

final_df = pd.concat(windows, ignore_index=True)

# Filter out any sequence groups that don't have a future_reversal (indicating the next record was on a different day)
final_df = final_df.dropna(subset=[f'future_{target_column}'])

# Drop the date and convert all columns to float
final_df = final_df.drop([group_helper,f'logical_{group_helper}'], axis=1)
final_df = final_df.astype(float)

# Write the final DataFrame to a new CSV file
#print(len(final_df))
#print(final_df.head())
final_df.to_csv(file_tmp, index=False)


**Step 5: Shape the Data**
Shape the data for RNN input, which requires a 3D array of shape [samples, timesteps, features].
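
As a small illustration of this reshape, the sketch below turns a flat table, in which each sequence occupies `rows_per_group` consecutive rows, into the 3D array. The sizes and values are made up for the example.

```python
import numpy as np

rows_per_group = 3   # timesteps per sequence
num_features = 2     # feature columns per row

# Hypothetical flat table: 4 sequences x 3 rows each, 2 features per row.
flat = np.arange(4 * rows_per_group * num_features, dtype=float).reshape(-1, num_features)

num_sequences = len(flat) // rows_per_group
X = flat.reshape((num_sequences, rows_per_group, num_features))

print(flat.shape)  # (12, 2)
print(X.shape)     # (4, 3, 2)
```

Because the reshape preserves row order, `X[1]` holds exactly rows 3 to 5 of the flat table, so the approach only works when every sequence is stored contiguously and has the same length.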

In [266]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Separate features and labels: drop 'sequence_group' and the label column from the features
X = final_df.drop(['sequence_group', f'future_{target_column}'], axis=1)
y = final_df[f'future_{target_column}']

# Since the data is already grouped, reshape it to fit the RNN input shape
num_features = X.shape[1]
num_sequences = len(final_df) // rows_per_group 

X_reshaped = X.values.reshape((num_sequences, rows_per_group, num_features))
y_reshaped = y.values.reshape((num_sequences, rows_per_group))[:, 0]  # Every row in a sequence carries the same label, so take the first

**Step 6: Split the Data, Build, Compile, Train & Evaluate the Model**
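
The evaluation in this step prints a confusion matrix for each model. As a rough guide to reading one, the sketch below derives accuracy, precision, and recall from the four counts of the RNN result shown in the output further down. The variable names are illustrative and not part of the notebook's code.

```python
# Counts copied from the RNN confusion matrix in this notebook's output
# (rows are actual classes, columns are predicted classes).
tn, fp = 2272, 386    # actual "not reversal"
fn, tp = 1154, 210    # actual "reversal"

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of predicted reversals that were real
recall = tp / (tp + fn)      # share of real reversals that were caught

# Accuracy of always predicting the majority class ("not reversal").
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"majority-class baseline={baseline:.3f}")
```

On these counts the accuracy falls below the majority-class baseline, which is worth keeping in mind when comparing models on accuracy alone.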

In [284]:
from sklearn.metrics import confusion_matrix
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import GRU
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten
from tensorflow.keras.layers import Bidirectional

# Function to evaluate the model
def evaluate_model(model_name, model, X_test, y_test, target_column):
    # Evaluate the model on the test set
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)  # Set verbose=0 to not print the evaluation log
    #print(f"Test Accuracy: {test_acc:.4f}")
    
    # Generate model predictions
    y_pred = model.predict(X_test)
    y_pred_classes = (y_pred > 0.5).astype("int32")  # Convert probabilities to binary predictions
    
    # Calculate the confusion matrix; labels=[0, 1] guarantees a 2x2 matrix even if one class is absent
    cm = confusion_matrix(y_test, y_pred_classes, labels=[0, 1])
    
    # Labels for the confusion matrix
    labels = [f"not {target_column}", f"{target_column}"]
    
    # Format and print the confusion matrix (rows are actual classes, columns are predicted classes)
    header = f"{'Actual |':>19} {'Predicted':^33} |"
    column_names = f"| {'':>15} | {labels[0]:^15} | {labels[1]:^15} |"
    separator = "+" + "-"*17 + "+" + "-"*17 + "+" + "-"*17 + "+"
    row1 = f"| {labels[0]:<15} | {cm[0][0]:^15} | {cm[0][1]:^15} |"
    row2 = f"| {labels[1]:<15} | {cm[1][0]:^15} | {cm[1][1]:^15} |"
    table = f"\n{header}\n{separator}\n{column_names}\n{separator}\n{row1}\n{separator}\n{row2}\n{separator}"
    print(f"{model_name} Model Evaluation Results:")
    print(table)

def train_fit(model_name, model, X_train, y_train, X_test, y_test, target_column, set_epochs=10, set_batch_size=32):
    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])
    # Fit the model to the training data (note: the test set doubles as the validation set here)
    model.fit(X_train, y_train, epochs=set_epochs, batch_size=set_batch_size, validation_data=(X_test, y_test), verbose=0)
    # Evaluate the model
    evaluate_model(model_name, model, X_test, y_test, target_column)

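# Split the data into training and testing sets
# Note: train_test_split shuffles by default; because consecutive windows overlap,
# nearly identical sequences can land in both sets. Pass shuffle=False for a chronological split.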
X_train, X_test, y_train, y_test = train_test_split(X_reshaped, y_reshaped, test_size=0.2, random_state=42)

# Build the RNN model
model = Sequential([
    SimpleRNN(50, input_shape=(rows_per_group, num_features), return_sequences=False),
    Dropout(0.5),
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])
train_fit("RNN", model,X_train,y_train, X_test, y_test, target_column, 1, 1)

# Build the LSTM model
model = Sequential([
    LSTM(50, input_shape=(rows_per_group, num_features)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
train_fit("LSTM", model,X_train,y_train, X_test, y_test, target_column, 1, 1)

# Build the GRU model
model = Sequential([
    GRU(50, input_shape=(rows_per_group, num_features)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
train_fit("GRU", model,X_train,y_train, X_test, y_test, target_column, 1, 1)

# Build the Stacked LSTM model
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(rows_per_group, num_features)),
    Dropout(0.5),
    LSTM(50),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
train_fit("Stacked LSTM", model,X_train,y_train, X_test, y_test, target_column, 1, 1)

# Build the Bidirectional LSTM model
model = Sequential([
    Bidirectional(LSTM(50), input_shape=(rows_per_group, num_features)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
train_fit("Bidirectional LSTM", model,X_train,y_train, X_test, y_test, target_column, 1, 1)

# Build the 1D Convolutional Neural Network (CNN) model
model = Sequential([
    Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(rows_per_group, num_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
train_fit("1D Convolutional Neural Network (CNN)", model,X_train,y_train, X_test, y_test, target_column, 1, 1)

RNN Model Evaluation Results:

           Actual |             Predicted             |
+-----------------+-----------------+-----------------+
|                 |  not reversal   |    reversal     |
+-----------------+-----------------+-----------------+
| not reversal    |      2272       |       386       |
+-----------------+-----------------+-----------------+
| reversal        |      1154       |       210       |
+-----------------+-----------------+-----------------+
LSTM Model Evaluation Results:

           Actual |             Predicted             |
+-----------------+-----------------+-----------------+
|                 |  not reversal   |    reversal     |
+-----------------+-----------------+-----------------+
| not reversal    |      2652       |        6        |
+-----------------+-----------------+-----------------+
| reversal        |      1363       |        1        |
+-----------------+-----------------+-----------------+
GRU Model Evaluation Results:

          