# Sequence Preparation
This notebook prepares a data file for machine learning (ML) experiments.  The output is a CSV with two added columns: a sequence group column, which ties together the rows that belong to the same sequence, and a label (target) column, which holds the value a classification model should predict for that sequence.  The resulting file is used to train and test models that predict the target for new sequences.

The steps build on one another, so run the code cells in order to produce the final data file.

## File Preparation
The basic steps for getting a file ready for the ML experiment are:
1. Load the data file
1. Filter the columns of interest, which include the features and the target (also known as the label)
1. Normalize the data
1. Create a sequence group column using a rolling window
1. Split the data into a training set and a test set

### Data File Assumptions
- The data file will be a CSV
- The data file will have a header row
- The data file will have a column that contains the target value
- The data file will be in sequential chronological order
- The data file will have a column that contains a date and time value to aid sequence grouping
- The data file is already cleansed with regard to missing values and outliers
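
Most of the assumptions above can be checked before running the steps. The following is a minimal sketch, assuming the `date` and `reversal` column names used later in this notebook. `check_assumptions` is a hypothetical helper, the sample rows are made up for illustration, and outlier detection is left out.

```python
import io
import pandas as pd

def check_assumptions(df, target_column, date_column="date"):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for col in (target_column, date_column):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if date_column in df.columns:
        dates = pd.to_datetime(df[date_column])
        if not dates.is_monotonic_increasing:
            problems.append("rows are not in chronological order")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems

# Tiny in-memory stand-in for the real CSV (values are made up).
sample_csv = io.StringIO(
    "date,reversal,rsi\n"
    "2024-01-02 09:30:00.000,0,41.2\n"
    "2024-01-02 09:31:00.000,1,43.7\n"
    "2024-01-02 09:32:00.000,0,40.1\n"
)
sample = pd.read_csv(sample_csv)
problems = check_assumptions(sample, target_column="reversal")
print(problems)
```

Calling the helper on the dataframe loaded in Step 1 would surface a violated assumption early, before it shows up as a confusing failure in a later step.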


**Steps 1 & 2: Load the Data File and Filter the Columns of Interest**
- Update the file paths
- Update the column names for the features and the target

In [31]:
# STEPS 1 & 2
# Import pandas, then declare the input and output file paths,
# the columns to keep, and the target column
import pandas as pd

file_in = './data/sample.csv'
file_tmp = './tmp/sample.csv'
file_out = './data_prod/sample.csv'
file_training = './data_prod/sample_training.csv'
file_testing = './data_prod/sample_testing.csv'

columns_to_write = [
    'date',
    'reversal',
    'adx',
    'dmi',
    'fisherstransform',
    'fosc',
    'linreg',
    'linregintercept',
    'linregslope',
    'macd',
    'macd_avg',
    'macd_diff',
    'parabolic_sar',
    'rsi',
    'rsi_avg',
    'trendsequence']
target_column = 'reversal'

# Load the data from the input CSV file into a pandas dataframe
df = pd.read_csv(file_in)

# Convert 'date' column to datetime format for easier manipulation
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S.%f')

# Cast the target to int and keep only the specified columns
df[target_column] = df[target_column].astype(int)
df_filtered = df[columns_to_write]
# Optionally save the filtered dataframe to a temporary CSV file for inspection
#df_filtered.to_csv(file_tmp, index=False)

**Step 3: Normalize the Data**

This step will use the MinMaxScaler to normalize the data.  The MinMaxScaler will scale the data to a range of 0 to 1.
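
To make the scaling concrete, here is a small worked example on made-up values. It computes `(x - min) / (max - min)` per column by hand and compares the result with `MinMaxScaler`. The `toy` dataframe and its numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up feature values; the column names mirror two of the notebook's features.
toy = pd.DataFrame({"rsi": [30.0, 50.0, 70.0], "adx": [10.0, 15.0, 40.0]})

# Manual min-max scaling, computed column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation through scikit-learn.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)

print(manual)
print(np.allclose(manual.to_numpy(), scaled))
```

Note that the scaler in this notebook is fitted on the whole file before the split in Step 5. Fitting on the training rows only is a common alternative when test-set statistics should not influence the training features.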

In [34]:
from sklearn.preprocessing import MinMaxScaler

# List of features to exclude from normalization
features_to_exclude = ['date', 'reversal']

# Dynamically select features to normalize (all features except the ones to exclude)
features_to_normalize = [col for col in df_filtered.columns if col not in features_to_exclude]

# Initialize the Min-Max Scaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit the scaler to your data (for the features to be normalized)
scaler.fit(df_filtered[features_to_normalize])

# Transform the data using the fitted scaler
df_normalized = df_filtered.copy()  # Create a copy of the DataFrame to keep the original data intact
df_normalized[features_to_normalize] = scaler.transform(df_filtered[features_to_normalize])

# df_normalized now contains the normalized data, excluding the specified features.
# Write it to file_out, because Step 4 reads the normalized data back from that file.
df_normalized.to_csv(file_out, index=False)

**Step 4: Create a Sequence Group Column Using a Rolling Window**

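This step uses a rolling window to create the sequence group column.  Each window is a fixed number of consecutive records (`rows_per_group`) from the same calendar day, and every row in a window shares one sequence group ID.  The label for a window, `future_reversal`, is the `reversal` value of the record that immediately follows it; windows whose next record falls on a different day are dropped.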
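
The windowing can be illustrated on a handful of rows. This is a sketch on made-up data, with a window of 3 instead of 5 so the output stays readable. It follows the same rules as the cell below but is not the notebook's exact code.

```python
import pandas as pd

rows_per_group = 3  # smaller than the notebook's 5, for readability

# Five records on one day, two on the next (made-up values).
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-02 09:30", "2024-01-02 09:31", "2024-01-02 09:32",
        "2024-01-02 09:33", "2024-01-02 09:34",
        "2024-01-03 09:30", "2024-01-03 09:31",
    ]),
    "reversal": [0, 0, 1, 1, 0, 0, 1],
})
toy["logical_date"] = toy["date"].dt.date

windows = []
for start in range(len(toy) - rows_per_group):
    window = toy.iloc[start:start + rows_per_group]
    next_row = toy.iloc[start + rows_per_group]
    same_day = window["logical_date"].nunique() == 1
    next_same_day = next_row["logical_date"] == window["logical_date"].iloc[-1]
    # Keep the window only if it and the record after it share one calendar day.
    if same_day and next_same_day:
        windows.append(window.assign(sequence_group=start + 1,
                                     future_reversal=next_row["reversal"]))

grouped = pd.concat(windows, ignore_index=True)
print(grouped[["date", "reversal", "sequence_group", "future_reversal"]])
```

Only the windows starting at rows 0 and 1 survive here. The window starting at row 2 is followed by a record from the next day, and the window starting at row 3 spans two days. Because the windows overlap, the same source row appears in more than one sequence group.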

In [35]:
# STEP 4
rows_per_group = 5

# Load the normalized data from file_out
df = pd.read_csv(file_out)

# Convert 'date' column to datetime format for easier manipulation
df['date'] = pd.to_datetime(df['date'])
df['logical_date'] = df['date'].dt.date

# Initialize an empty DataFrame to hold the final results
final_df = pd.DataFrame()

# Loop over the DataFrame to create rolling windows
for start in range(len(df) - rows_per_group):
    window = df.iloc[start:start + rows_per_group]
    # Check if all the dates in the window are the same
    if len(set(window['logical_date'])) == 1:
        sequence_group = start + 1
        # Check if the next record exists and is on the same logical date
        if start + rows_per_group < len(df) and window.iloc[-1]['logical_date'] == df.iloc[start + rows_per_group]['logical_date']:
            future_reversal = df.iloc[start + rows_per_group]['reversal']
        else:
            future_reversal = None  # Set to None if there's no next record or it's on a different date
        
        window_copy = window.copy()
        window_copy['sequence_group'] = sequence_group
        window_copy['future_reversal'] = future_reversal
        final_df = pd.concat([final_df, window_copy], ignore_index=True)

# Filter out any sequence groups that don't have a future_reversal (indicating the next record was on a different day)
final_df = final_df.dropna(subset=['future_reversal'])

#final_df.to_csv(file_tmp, index=False)

**Step 5: Split the Data into a Training Set and a Test Set**

This step splits the data into a training set, used to train the machine learning models, and a test set, used to evaluate them.  The split is made by sequence group, so all rows of a group land in the same set.
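
A group-wise split can be sanity-checked on toy data. The sketch below uses ten hypothetical sequence groups of three rows each and confirms that no group ends up on both sides of the split. The `toy` dataframe is an assumption for illustration only.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Ten hypothetical sequence groups of three rows each.
toy = pd.DataFrame({
    "sequence_group": [g for g in range(1, 11) for _ in range(3)],
    "feature": range(30),
})

gss = GroupShuffleSplit(test_size=0.3, n_splits=1, random_state=42)
train_idx, test_idx = next(gss.split(toy, groups=toy["sequence_group"]))

train_groups = set(toy["sequence_group"].iloc[train_idx])
test_groups = set(toy["sequence_group"].iloc[test_idx])

# An empty intersection means every group is wholly in one set.
print(train_groups & test_groups)
```

Keeping groups whole does not stop the same source row from reaching both sets in this notebook, because the rolling windows from Step 4 overlap. Whether that matters depends on the experiment.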

In [17]:
from sklearn.model_selection import GroupShuffleSplit

# Define the GroupShuffleSplit splitter
# test_size can be a float representing the proportion of the dataset to include in the test split
gss = GroupShuffleSplit(test_size=0.3, n_splits=1, random_state=42)

# The split method requires the features, the target, and the groups.
# The target will be the 'future_reversal' column from your DataFrame.
# Groups will be the 'sequence_group' column from your DataFrame.
X = final_df.drop(columns=['future_reversal'])  # All columns except the target (still includes 'date', 'logical_date', and 'sequence_group')
y = final_df['future_reversal']  # Target
groups = final_df['sequence_group']  # Groups

# Perform the split
for train_idx, test_idx in gss.split(X, y, groups=groups):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# X_train and y_train now contain the training feature set and target set, respectively
# X_test and y_test contain the testing feature set and target set, respectively

# If you need the whole rows in the train/test sets you can do:
train_df = final_df.iloc[train_idx]
test_df = final_df.iloc[test_idx]

# Write the train and test sets to CSV files
train_df.to_csv(file_training, index=False)
test_df.to_csv(file_testing, index=False)

In [21]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Define the RNN model
model = Sequential([
    # Embedding layer if your input data is not already vectorized. Adjust input_dim and output_dim as needed.
    # Embedding(input_dim=num_features, output_dim=128, input_length=max_sequence_length),
    LSTM(64, return_sequences=False),  # Adjust the number of units based on your dataset complexity.
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])  # Use 'categorical_crossentropy' for multi-class.

history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test))


ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type bool).
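
This error means an array with `object` dtype reached TensorFlow. Two plausible sources in this notebook are `y_train`, because `future_reversal` mixes `None` with label values before `dropna` and can therefore stay `object`, and `X_train`, which still contains the `date` and `logical_date` columns. Separately, an LSTM layer expects 3D input shaped (samples, timesteps, features), while `X_train` is a 2D table with one row per timestep. The sketch below shows one way to build numeric 3D arrays from grouped rows. `to_lstm_arrays`, the two-column feature list, and the toy data are assumptions for illustration, and the reshape relies on every group having exactly `rows_per_group` rows.

```python
import numpy as np
import pandas as pd

rows_per_group = 5
feature_columns = ["adx", "rsi"]  # hypothetical subset of the notebook's features

# Toy stand-in for train_df: two groups of five rows; the label column has object dtype.
toy = pd.DataFrame({
    "sequence_group": np.repeat([1, 2], rows_per_group),
    "adx": np.linspace(0.0, 0.9, 2 * rows_per_group),
    "rsi": np.linspace(0.9, 0.0, 2 * rows_per_group),
    "future_reversal": pd.Series([True] * 5 + [False] * 5, dtype=object),
})

def to_lstm_arrays(frame, feature_columns, rows_per_group):
    """Return X shaped (groups, timesteps, features) and y with one label per group."""
    # A stable sort keeps the chronological row order inside each group.
    frame = frame.sort_values("sequence_group", kind="stable")
    n_groups = frame["sequence_group"].nunique()
    X = (frame[feature_columns]
         .to_numpy(dtype="float32")
         .reshape(n_groups, rows_per_group, len(feature_columns)))
    # The label is constant within a group, so the first value represents it.
    y = (frame.groupby("sequence_group")["future_reversal"]
         .first()
         .astype("float32")
         .to_numpy())
    return X, y

X_seq, y_seq = to_lstm_arrays(toy, feature_columns, rows_per_group)
print(X_seq.shape, X_seq.dtype, y_seq.shape, y_seq.dtype)
```

Arrays built this way from `train_df` and `test_df` could be passed to `model.fit` in place of `X_train` and `y_train`. Which feature columns to include is a modeling choice that the notebook does not settle.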