# Input Preparation Concept Snippets
This notebook collects concept snippets for preparing input files for machine learning. The snippets are not complete programs; each one illustrates an idea meant to be adapted into a larger codebase.

## Sequence Grouping
The following code builds a new DataFrame from rolling windows of `rows_per_group` consecutive rows that all fall on the same logical date. Each window gets a 'sequence_group' number (its starting row position plus one) and a 'future_reversal' value, which is the 'reversal' of the record immediately after the window. Because the windows overlap, a row can appear in more than one sequence group. Sequence groups without a future_reversal (the next record falls on a different day) are filtered out at the end.

In [1]:
import pandas as pd

rows_per_group = 2

# Assume the DataFrame is read from a CSV file or similar.
# For the purpose of this example, we'll create the DataFrame manually.
data = {
    'barcount': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'date': ['8/1/2023 07:01', '8/1/2023 07:02', '8/1/2023 07:03', '8/1/2023 07:04', '8/2/2023 07:05', '8/2/2023 07:06', '8/2/2023 07:07', '8/3/2023 07:08', '8/3/2023 07:09', '8/4/2023 07:10'],
    'feature1': ['a', 'c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's'],
    'feature2': ['b', 'd', 'f', 'h', 'j','l', 'n', 'p', 'r', 't'],
    'reversal': [False, False, True, False, True, False, False, True, False, False]
}

# Create the DataFrame
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
df['logical_date'] = df['date'].dt.date

# Collect each qualifying window in a list and concatenate once at the end
windows = []

# Slide a window of rows_per_group rows over the DataFrame.
# Stopping at len(df) - rows_per_group guarantees that a next record always exists.
for start in range(len(df) - rows_per_group):
    window = df.iloc[start:start + rows_per_group]
    # Skip windows that span more than one logical date
    if window['logical_date'].nunique() != 1:
        continue
    next_row = df.iloc[start + rows_per_group]
    # Use the next record's reversal only if it is on the same logical date
    if next_row['logical_date'] == window.iloc[-1]['logical_date']:
        future_reversal = next_row['reversal']
    else:
        future_reversal = None  # The next record is on a different date

    window_copy = window.copy()
    window_copy['sequence_group'] = start + 1
    window_copy['future_reversal'] = future_reversal
    windows.append(window_copy)

final_df = pd.concat(windows, ignore_index=True)

# Filter out any sequence groups that don't have a future_reversal (indicating the next record was on a different day)
final_df = final_df.dropna(subset=['future_reversal'])

# Show the final DataFrame
final_df



| | barcount | date | feature1 | feature2 | reversal | logical_date | sequence_group | future_reversal |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-08-01 07:01:00 | a | b | False | 2023-08-01 | 1 | True |
| 1 | 2 | 2023-08-01 07:02:00 | c | d | False | 2023-08-01 | 1 | True |
| 2 | 2 | 2023-08-01 07:02:00 | c | d | False | 2023-08-01 | 2 | False |
| 3 | 3 | 2023-08-01 07:03:00 | e | f | True | 2023-08-01 | 2 | False |
| 6 | 5 | 2023-08-02 07:05:00 | i | j | True | 2023-08-02 | 5 | False |
| 7 | 6 | 2023-08-02 07:06:00 | k | l | False | 2023-08-02 | 5 | False |
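
The window labels can also be derived without an explicit loop. The sketch below is one possible alternative, not the method used in the cell above: it shifts 'reversal' backwards by `rows_per_group` positions within each day, so every row that can start a valid window receives the reversal of the record that follows that window. The helper name `window_labels` is hypothetical, and the sketch assumes the rows are already sorted by date.

```python
import pandas as pd

def window_labels(df, rows_per_group):
    """Return the future reversal for every row that starts a valid window.

    Hypothetical helper: assumes df is sorted by 'date' and has a
    datetime 'date' column and a boolean 'reversal' column.
    """
    per_day = df.groupby(df['date'].dt.date)['reversal']
    # Within each day, look rows_per_group positions ahead; rows without
    # such a record on the same day get NaN and are dropped.
    future = per_day.shift(-rows_per_group)
    return future.dropna().astype(bool)

# Same sample data as the cell above (feature columns omitted)
sample = pd.DataFrame({
    'date': pd.to_datetime([
        '8/1/2023 07:01', '8/1/2023 07:02', '8/1/2023 07:03', '8/1/2023 07:04',
        '8/2/2023 07:05', '8/2/2023 07:06', '8/2/2023 07:07',
        '8/3/2023 07:08', '8/3/2023 07:09', '8/4/2023 07:10',
    ]),
    'reversal': [False, False, True, False, True, False, False, True, False, False],
})

labels = window_labels(sample, rows_per_group=2)
print(labels)
```

The index of `labels` holds the starting row position of each surviving window, so adding one to it gives the 'sequence_group' numbers in the table above.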


In [None]:
# This example creates non-rolling sequence_groups based on the date column. It assumes the DataFrame is already sorted by date.
# Declare the global variables
file_in = './data/sample.csv'
file_out = './data_prod/sample.csv'
rows_per_group = 5
columns_to_write = ['date', 'open', 'high', 'low', 'close', 'volume', 'reversal','sequence_group']
df = pd.read_csv(file_in)

# Convert 'date' column to datetime format for easier manipulation
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S.%f')

# Initialize the sequence_group column
df['sequence_group'] = None

# Group by day and number each full block of rows_per_group rows.
# Leftover rows at the end of a day keep sequence_group = None.
group_number = 1
for _, group in df.groupby(df['date'].dt.date):
    num_full_groups = len(group) // rows_per_group
    for i in range(num_full_groups):
        block_index = group.index[i * rows_per_group:(i + 1) * rows_per_group]
        df.loc[block_index, 'sequence_group'] = group_number
        group_number += 1

# Remove rows that were not assigned to a full group
df_final = df.dropna(subset=['sequence_group'])

# Save the modified dataframe with only the specified columns to a new CSV file
df_final.to_csv(file_out, index=False, columns=columns_to_write)
print(f"Output file saved to {file_out}")
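
The non-rolling grouping can be expressed without nested loops as well. The following is a minimal sketch under the same assumption as the cell above (the DataFrame is sorted by date); the function name `assign_sequence_groups` and the demo data are made up for illustration. It numbers each row's position within its day, keeps only the rows that fit into full blocks, and starts a new group number at the first row of every block.

```python
import pandas as pd

def assign_sequence_groups(df, rows_per_group, date_col='date'):
    """Label consecutive, non-overlapping blocks of rows within each day.

    Hypothetical helper: assumes df is sorted by date_col, which must
    be a datetime column.
    """
    day = df[date_col].dt.date
    pos = df.groupby(day).cumcount()                         # 0-based position within the day
    day_size = df.groupby(day)[date_col].transform('size')   # number of rows on that day
    # Keep only rows that belong to a full block of rows_per_group rows
    keep = pos < (day_size // rows_per_group) * rows_per_group
    out = df.loc[keep].copy()
    # The first row of every block increments the running group number
    out['sequence_group'] = (pos[keep] % rows_per_group == 0).cumsum()
    return out

# Made-up demo data: 5 rows on day one, 2 on day two, 4 on day three
demo = pd.DataFrame({
    'date': pd.to_datetime(
        ['2023-08-01 07:01'] * 5 + ['2023-08-02 07:01'] * 2 + ['2023-08-03 07:01'] * 4
    ),
    'close': range(11),
})

labelled = assign_sequence_groups(demo, rows_per_group=2)
print(labelled)
```

With `rows_per_group=2`, the fifth row of the first day is dropped because it does not complete a block, which mirrors the `dropna` step in the loop-based version.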