## Data Description
The dataset comprises 2.5 years of insulin dosage, blood glucose (BG), and Estimated Variability of Glucose (EVG) data. It was collected from a type 1 diabetic's insulin pump and continuous glucose monitor (CGM). Unfortunately, the data is not continuous over the full 2.5-year span and has gaps in some places. It was exported directly as .csv files that each span 30 days and contain 3 tables: one for insulin dosage (bolus), one for BG, and one for EVG. We split each file into its 3 tables and appended them into new files covering the full time span, titled Bolus.csv, BG.csv, and EVG.csv. The contents of these files are detailed below.

1. Bolus Table
    * Features:
        * Type: Type of bolus event (Always 'Bolus').
        * BolusType: Describes the way in which the bolus was used (categorical).
        * BolusDeliveryMethod: Method used to deliver the bolus (Auto or Standard) (categorical).
        * BG (mg/dL): Blood glucose levels at the time of bolus administration (continuous data).
        * SerialNumber: Device identifier.
        * CompletionDateTime: Timestamp when the dosing was completed.
        * InsulinDelivered: The standard unit measure for the amount of insulin delivered (continuous data).
        * FoodDelivered: Insulin delivered for food consumption (continuous data).
        * CorrectionDelivered: Insulin delivered for BG correction (continuous data).
        * CompletionStatusDesc: Status description of the dosage (categorical).
        * BolexStartDateTime: Not Used In Export.
        * BolexCompletionDateTime: Not Used In Export.
        * BolexInsulinDelivered: Not Used In Export.
        * BolexCompletionStatusDesc: Not Used In Export.
        * StandardPercent: (Always 100).
        * Duration (mins): (Always 0).
        * CarbSize: The amount of carbohydrates consumed in grams (continuous data).
        * TargetBG (mg/dL): Target blood glucose level for the subject in milligrams per deciliter (continuous data).
        * CorrectionFactor: Insulin sensitivity factor (continuous data).
        * CarbRatio: Insulin-to-carbohydrate ratio (continuous data).
    * Notes:
        * This table holds the most detailed records for insulin dosing decisions, food intake, and blood glucose corrections.
        * This table will be very useful in modeling relationships between carbohydrate intake, insulin dosage, and BG levels.
        * Bolus (definition: a single large dose of insulin taken to counter a rise in blood sugar) refers here to the device/method of insulin delivery, which is automated through a pocket-sized device with a refillable tank of insulin.
2. EVG Table
    * Features:
        * DeviceType: Type/Name of the device used to record the event.
        * SerialNumber: Device identifier.
        * Description: A text description of the type of data recorded (always 'EGV' in the raw export, which this document refers to as EVG).
        * EventDateTime: Timestamp of when the measurement was recorded.
        * Readings (mg/dL): The estimated glucose level at the recorded time, in milligrams per deciliter (mg/dL).
    * Notes: 
        * The EVG data is collected directly from a CGM.
        * EVG is designed for tracking overall trends and patterns in glucose levels rather than moment-to-moment decisions.
        * Useful for identifying time-in-range, glucose variability, and predicting hypo/hyperglycemia over time.
        
3. BG Table
    * Features:
        * DeviceType: Type/Name of device used to measure blood glucose.
        * SerialNumber: Device identifier.
        * Description: A text description of the type of data recorded (Always BG)
        * EventDateTime: Timestamp of the blood glucose measurement.
        * BG (mg/dL): Blood glucose levels in milligrams per deciliter (continuous data).
        * Note: Additional notes field (Always Blank).
    * Notes:
        * This data is manually entered and is recorded through a glucometer (Finger Prick).
        * The BG measurement reflects the actual glucose levels at the time of measurement.
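
The EVG notes above mention time-in-range and glucose variability. As a rough sketch of how these could be computed from the `Readings (mg/dL)` column, the snippet below defines a hypothetical helper, `glucose_summary`, and assumes the commonly used 70-180 mg/dL target range. It treats the share of readings as the share of time, which only holds if readings are evenly spaced; the sample values are made up for illustration.

```python
import pandas as pd

def glucose_summary(readings, low=70, high=180):
    """Summarise CGM readings (mg/dL). `low`/`high` are assumed target bounds."""
    readings = pd.Series(readings, dtype="float64").dropna()
    in_range = readings.between(low, high)  # inclusive on both ends
    return {
        "time_in_range_pct": 100 * in_range.mean(),
        "time_below_pct": 100 * (readings < low).mean(),
        "time_above_pct": 100 * (readings > high).mean(),
        # Coefficient of variation: sample standard deviation relative to the mean
        "cv_pct": 100 * readings.std() / readings.mean(),
    }

# Made-up readings for illustration only (not taken from the dataset)
sample = [65, 90, 120, 175, 190, 250, 110, 140]
summary = glucose_summary(sample)
print(summary)
```

On the real data, the same helper could be applied to `df['Readings (mg/dL)']` after loading EVG.csv, ideally per day or per week so that gaps in coverage do not distort the percentages.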


### General Dataset Properties

* Total \# of Features:
    * Bolus: 20 features
    * EVG: 5 features
    * BG: 6 features
* \# of Usable/Useful Features: (!!!!! Update possibly)
    * Bolus: 12 features
    * EVG: 2 features
    * BG: 2 features
* Table Sizes:
    * Bolus: 11,348 Records
    * EVG: 191,781 Records
    * BG: 2,832 Records
* Unique Dates: 
    * Bolus: 599 Days
    * EVG: 666 Days
    * BG: 663 Days
* Intersecting Dates: 599




## Model Overview

Our goal for this project is to create a machine learning model capable of predicting blood glucose levels from past data. Because the data is mostly continuous and ordered in time, we decided to use a Recurrent Neural Network (RNN). Since we want to draw on a large amount of data over varying time spans, the short-term memory of a plain RNN may produce less accurate predictions. To remedy this, we focus on an RNN variant called the Long Short-Term Memory (LSTM) network.

1. Research and Model Selection
    (***ADD Writing***)

1. Problem Framing and Input Variables
(***Update After Finishing***)
The predictive task involves forecasting BG levels using historical data from the Bolus dataset. We selected the following features as inputs to the model based on their availability, continuity, and relevance to BG level prediction:
    * BG (mg/dL): Blood glucose levels at specific times.
    * InsulinDelivered: The amount of insulin delivered.
    * FoodDelivered: Insulin dose related to food intake.
    * CarbSize: Amount of carbohydrates consumed.
Reasoning:
    * These features are continuous (non-categorical) and align with domain knowledge of factors influencing BG levels.
    * The Bolus dataset provides a large number of entries, allowing for robust training of the model.
To frame the input-output relationship, we structured the input data as a time window of the past 24 hours, using these features to predict BG levels at a future timestamp. We also plan to experiment with varying the length of the input window to optimize model accuracy.


1. Data Preprocessing
Steps for preparing the data for the LSTM model:
    (***ADD Writing***)
    * Data Cleaning:
        * Handle missing values and fill gaps in time-series data, ensuring continuity.
        * Filter out irrelevant or unused features to streamline inputs.
    * Feature Scaling:
        * Normalize all input features to ensure consistent scaling, which is critical for LSTM performance.
    * Windowing:
        * Create overlapping sequences of the past 24 hours' data to use as input features, with the BG level at the next timestamp as the target variable.
    * Train-Test Split:
        * Split the data into training and testing sets while preserving time order to prevent data leakage.
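
The windowing, scaling, and split steps above can be sketched as follows. This is only an illustration under stated assumptions: `make_windows` is a hypothetical helper, the input is a synthetic array standing in for the merged table (column 0 plays the role of BG), the 80/20 split ratio is arbitrary, and 288 steps equals 24 hours only if the rows sit on a regular 5-minute grid, which the gaps in the real data would break.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def make_windows(values, n_steps, target_col=0):
    """Turn a (time, features) array into overlapping (samples, n_steps, features)
    windows, each paired with the target column's value at the next time step."""
    X, y = [], []
    for start in range(len(values) - n_steps):
        end = start + n_steps
        X.append(values[start:end])
        y.append(values[end, target_col])
    return np.array(X), np.array(y)

# Synthetic stand-in for the merged table: 2000 time steps, 4 features
rng = np.random.default_rng(0)
values = rng.random((2000, 4))

# Chronological split: first 80% for training, the rest for testing (no shuffling)
split = int(len(values) * 0.8)
train_raw, test_raw = values[:split], values[split:]

# Fit the scaler on the training portion only, then reuse it on the test portion
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_raw)
test_scaled = scaler.transform(test_raw)

# 288 five-minute steps = 24 hours (assumes a regular 5-minute grid)
n_steps = 288
X_train, y_train = make_windows(train_scaled, n_steps)
X_test, y_test = make_windows(test_scaled, n_steps)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Fitting the scaler before splitting would let test-set minima and maxima influence the training inputs, which is a form of the leakage the split is meant to prevent.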
        
1. Model Implementation
We used the Keras library to implement the LSTM model, referencing a multivariate time-series forecasting tutorial as a guide. The key steps involved:
    * Model Architecture:
        * A sequential LSTM-based model was constructed with:
            * Input layers to process multivariate time-series data.
            * LSTM layers for feature extraction and sequence learning.
            * Dense layers for final predictions of BG levels.
    * Hyperparameter Selection:
        * We plan to fine-tune key hyperparameters, including:
            * Number of LSTM units.
            * Batch size.
            * Learning rate.
            * Number of time steps (length of the input sequence).
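
A minimal sketch of the kind of architecture described above, written with Keras. It is not the project's final model: `build_lstm_model` is a hypothetical helper, and the layer sizes, learning rate, and window length are placeholder values chosen only to show where each listed hyperparameter enters.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_model(n_steps, n_features, n_units=64, learning_rate=1e-3):
    """Small LSTM regressor; all defaults are placeholders, not tuned settings."""
    model = keras.Sequential([
        keras.Input(shape=(n_steps, n_features)),  # one window of multivariate data
        layers.LSTM(n_units),                      # sequence learning
        layers.Dense(1),                           # single predicted BG value
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
        metrics=["mae"],
    )
    return model

model = build_lstm_model(n_steps=12, n_features=4, n_units=8)
# Random inputs, used only to show the expected tensor shapes
dummy_batch = np.random.rand(2, 12, 4).astype("float32")
predictions = model.predict(dummy_batch, verbose=0)
print(predictions.shape)
```

The batch size is not part of the model definition; it would be passed to `model.fit(..., batch_size=...)` at training time.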

1. Experimental Design
To ensure the model performs well, we will:
    * Hyperparameter Optimization:
        * Use techniques like grid search or randomized search to identify optimal model parameters.
    * Validation:
        * Implement cross-validation to evaluate model performance.
    * Error Analysis:
        * Examine predictions and residuals to understand model strengths and weaknesses.
    * Feature Experimentation:
        * Test the inclusion or exclusion of additional features (e.g., CorrectionDelivered, TargetBG) to assess their impact on performance.
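
For the validation step, ordinary shuffled k-fold would mix future samples into the training folds. One possible alternative, shown here as a sketch rather than the chosen design, is scikit-learn's `TimeSeriesSplit`, which keeps every validation fold strictly after its training fold. The sample count and number of splits are arbitrary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for time-ordered samples (e.g. windows); index order = time order
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(X))
for fold, (train_idx, val_idx) in enumerate(folds):
    # Every validation index comes strictly after every training index
    print(f"fold {fold}: train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

Each candidate hyperparameter setting from a grid or randomized search could then be scored as the average validation error across these folds.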

1. Planned Outputs
The final outputs will include:
    * Predicted BG levels over time.
    * Performance metrics such as mean squared error (MSE) and mean absolute error (MAE).
    * Visualizations comparing predicted BG levels to actual values for test data.
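
As a small worked example of the two metrics named above, the sketch below computes MSE and MAE with NumPy. The helper name `regression_errors` and the values are made up for illustration.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return mean squared error and mean absolute error as plain floats."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    return {
        "mse": float(np.mean(residuals ** 2)),
        "mae": float(np.mean(np.abs(residuals))),
    }

# Made-up values in mg/dL
actual = [100, 150, 200, 120]
predicted = [110, 140, 200, 100]
errors = regression_errors(actual, predicted)
print(errors)  # {'mse': 150.0, 'mae': 10.0}
```

If the model is trained on scaled values, the predictions would need to be mapped back to mg/dL (for example with the scaler's `inverse_transform`) before these numbers are interpretable in clinical units.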







### Data Notes
* Not all days are covered; at points there are significant (5-8 week) gaps in the data.
* Different devices used to collect data may overlap in time (device marked as 'UNKNOWN' in data).
* There are 3 different tables (I think our main focus should be on the third table, which has the pump data, coupled with the better of the two BG tables (1 or 2)).
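
The gaps mentioned above can be located programmatically. The following is a sketch only: `find_gaps` is a hypothetical helper, the one-week threshold is an assumption, and the timestamps are made up.

```python
import pandas as pd

def find_gaps(timestamps, min_gap="7D"):
    """Return start, end, and length of every gap longer than `min_gap`."""
    ts = pd.to_datetime(pd.Series(timestamps)).sort_values().reset_index(drop=True)
    delta = ts.diff()  # time since the previous record
    mask = delta > pd.Timedelta(min_gap)
    return pd.DataFrame({
        "gap_start": ts.shift(1)[mask],
        "gap_end": ts[mask],
        "length": delta[mask],
    }).reset_index(drop=True)

# Made-up timestamps with one six-week hole
example = [
    "2023-01-01 00:00", "2023-01-01 00:05", "2023-01-01 00:10",
    "2023-02-12 00:10", "2023-02-12 00:15",
]
gaps = find_gaps(example)
print(gaps)
```

Applied to the `EventDateTime` or `CompletionDateTime` column of a combined table, the result could be used to cut the series into contiguous segments before windowing, so that no input window spans a gap.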

In [2]:
import pandas as pd
import csv
import os


def load_export_csv(file_path):
    # Define the headers for each table to identify them
    table_headers = {
        'table1': ["DeviceType", "SerialNumber", "Description", "EventDateTime", "Readings (mg/dL)"],
        'table2': ["DeviceType", "SerialNumber", "Description", "EventDateTime", "BG (mg/dL)", "Note"],
        'table3': ["Type", "BolusType", "BolusDeliveryMethod", "BG (mg/dL)", "SerialNumber",
                   "CompletionDateTime", "InsulinDelivered", "FoodDelivered", "CorrectionDelivered",
                   "CompletionStatusDesc", "BolexStartDateTime", "BolexCompletionDateTime",
                   "BolexInsulinDelivered", "BolexCompletionStatusDesc", "StandardPercent",
                   "Duration (mins)", "CarbSize", "TargetBG (mg/dL)", "CorrectionFactor",
                   "CarbRatio"]
    }
    
    # Initialize data storage for each table
    data_tables = {
        'table1': [],
        'table2': [],
        'table3': []
    }
    
    current_table = None  # To keep track of which table we're currently reading
    line_number = 0  # To track line numbers for debugging
    
    with open(file_path, 'r', newline='', encoding='utf-8-sig') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            line_number += 1
            # Strip whitespace from each cell
            row = [cell.strip() for cell in row]
            
            # Debugging: Print current row and line number
            # print(f"Line {line_number}: {row}")
            
            # Check if the row matches any table header
            if row[:len(table_headers['table1'])] == table_headers['table1']:
                current_table = 'table1'
                # print(f"Detected header for table1 at line {line_number}")
                continue  # Skip the header row
            elif row[:len(table_headers['table2'])] == table_headers['table2']:
                current_table = 'table2'
                # print(f"Detected header for table2 at line {line_number}")
                continue  # Skip the header row
            elif row[:len(table_headers['table3'])] == table_headers['table3']:
                current_table = 'table3'
                # print(f"Detected header for table3 at line {line_number}")
                continue  # Skip the header row
            elif not any(cell for cell in row):
                # Empty row signifies possible separation; skip
                current_table = None
                # print(f"Detected empty row at line {line_number}; resetting current_table")
                continue
            
            # If current_table is set, append the row to the corresponding data list
            if current_table:
                expected_columns = len(table_headers[current_table])
                actual_columns = len(row)
                
                if actual_columns != expected_columns:
                    print(f"Warning: Line {line_number} has {actual_columns} columns, expected {expected_columns}. Skipping row.")
                    continue  # Skip rows that don't match the expected column count
                
                # Replace '(Data)' placeholders with actual data if necessary
                cleaned_row = [cell if cell != '(Data)' else None for cell in row]
                data_tables[current_table].append(cleaned_row)
            else:
                # Rows outside of any table headers are ignored
                print(f"Warning: Line {line_number} is outside of any table. Skipping row.")
                continue
    
    # Convert lists to DataFrames with appropriate columns
    df_tables = {}
    for table_key, data in data_tables.items():
        if data:  # Only create DataFrame if there's data
            df = pd.DataFrame(data, columns=table_headers[table_key])
            df_tables[table_key] = df
        else:
            df_tables[table_key] = pd.DataFrame(columns=table_headers[table_key])
    
    return df_tables['table1'], df_tables['table2'], df_tables['table3']



file_path = './Data/2023PumpData/CSV_redacted_90945369_02Dec2024_1920-2.csv'  
df_table1, df_table2, df_table3 = load_export_csv(file_path)

# Display the DataFrames
print("Table 1 DataFrame:")
print(df_table1.head())

print("\nTable 2 DataFrame:")
print(df_table2.head())

print("\nTable 3 DataFrame:")
print(df_table3.head())

# Optionally, save the DataFrames to separate CSV files
os.makedirs('./DataTables', exist_ok=True)  # to_csv does not create missing directories
df_table1.to_csv(os.path.join('./DataTables', 'table_1.csv'), index=False)
df_table2.to_csv(os.path.join('./DataTables', 'table_2.csv'), index=False)
df_table3.to_csv(os.path.join('./DataTables', 'table_3.csv'), index=False)


Table 1 DataFrame:
  DeviceType SerialNumber Description        EventDateTime Readings (mg/dL)
0    Unknown       870772         EGV  2023-03-01T00:00:55              174
1    Unknown       870772         EGV  2023-03-01T00:05:55              172
2    Unknown       870772         EGV  2023-03-01T00:10:55              171
3    Unknown       870772         EGV  2023-03-01T00:15:55              168
4    Unknown       870772         EGV  2023-03-01T00:20:55              165

Table 2 DataFrame:
  DeviceType SerialNumber Description        EventDateTime BG (mg/dL) Note
0    Unknown       870772          BG  2023-03-01T16:07:59        238     
1    Unknown       870772          BG  2023-03-01T18:04:52        382     
2    Unknown       870772          BG  2023-03-01T20:09:14        156     
3    Unknown       870772          BG  2023-03-01T20:31:14        248     
4    Unknown       870772          BG  2023-03-02T12:46:15        157     

Table 3 DataFrame:
    Type BolusType BolusDeliveryMet

In [None]:
import glob

def gather_csv_files(main_directory):
    """
    Gathers all CSV files from subdirectories (2022PumpData, 2023PumpData, 2024PumpData)
    in the 'data' directory located in the main directory.

    Parameters:
        main_directory (str): The path to the main directory.

    Returns:
        list: A list of paths to all CSV files found in the specified subdirectories.
    """
    # Define the parent directory
    parent_directory = os.path.join(main_directory, 'Data')
    
    # List of subdirectories to search
    subdirectories = ['2022PumpData', '2023PumpData', '2024PumpData']
    
    # Collect CSV files from all subdirectories
    csv_files = []
    for subdir in subdirectories:
        subdir_path = os.path.join(parent_directory, subdir)
        print(subdir_path)
        csv_files.extend(glob.glob(os.path.join(subdir_path, '*.csv')))
    
    return csv_files

# Usage example
main_directory = './'  # Replace with the path to your main directory
csv_files = gather_csv_files(main_directory)

# Print the gathered CSV file paths
for file in csv_files:
    print(file)


In [None]:

def process_all_csv_files_combined(main_directory, output_directory):
    """
    Processes all CSV files in the 'Data/2022PumpData', 'Data/2023PumpData', and 'Data/2024PumpData'
    subdirectories, extracting tables and appending them into combined CSV files.

    Parameters:
        main_directory (str): The path to the main directory containing the 'Data' folder.
        output_directory (str): The path to the directory where combined tables will be saved.

    Returns:
        None
    """
    # Gather all CSV files
    csv_files = gather_csv_files(main_directory)

    # Ensure the output directory exists
    os.makedirs(output_directory, exist_ok=True)

    # Initialize empty DataFrames for combined output
    combined_table1 = pd.DataFrame()
    combined_table2 = pd.DataFrame()
    combined_table3 = pd.DataFrame()

    for file_path in csv_files:
        print(f"Processing file: {file_path}")
        
        try:
            # Load the CSV file and extract tables
            df_table1, df_table2, df_table3 = load_export_csv(file_path)
            
            # Append each table to its respective combined DataFrame
            combined_table1 = pd.concat([combined_table1, df_table1], ignore_index=True)
            combined_table2 = pd.concat([combined_table2, df_table2], ignore_index=True)
            combined_table3 = pd.concat([combined_table3, df_table3], ignore_index=True)
        
        except Exception as e:
            print(f"Error processing file {file_path}: {e}")

    # Define output file paths
    table1_file = os.path.join(output_directory, 'EVG.csv')
    table2_file = os.path.join(output_directory, 'BG.csv')
    table3_file = os.path.join(output_directory, 'Bolus.csv')

    # Save the combined tables to CSV
    combined_table1.to_csv(table1_file, index=False)
    combined_table2.to_csv(table2_file, index=False)
    combined_table3.to_csv(table3_file, index=False)

    print(f"Combined tables saved to: {output_directory}")


# Usage example
main_directory = './'  # Replace with the path to your main directory
output_directory = './DataTables'  # Replace with your desired output directory

process_all_csv_files_combined(main_directory, output_directory)


In [5]:

# Load the CSV files into pandas DataFrames
file_path1 = './DataTables/BG.csv'  # BG table (manual finger-prick readings)
file_path2 = './DataTables/EVG.csv'  # EVG table (CGM readings)
file_path3 = './DataTables/Bolus.csv'  # Bolus table (pump data)

# Load the data
df1 = pd.read_csv(file_path1)
df2 = pd.read_csv(file_path2)
df3 = pd.read_csv(file_path3)

# Convert the EventDateTime column to datetime and extract dates
df1['EventDate'] = pd.to_datetime(df1['EventDateTime']).dt.date
df2['EventDate'] = pd.to_datetime(df2['EventDateTime']).dt.date
df3['EventDate'] = pd.to_datetime(df3['CompletionDateTime']).dt.date

# Find the unique dates for each table
unique_dates1 = set(df1['EventDate'].unique())
unique_dates2 = set(df2['EventDate'].unique())
unique_dates3 = set(df3['EventDate'].unique())

print(len(unique_dates3))

# Find the intersection of unique dates across all three tables
common_dates = unique_dates1.intersection(unique_dates2).intersection(unique_dates3)

# Print the number of unique intersecting dates
print(f"Number of unique dates that intersect across all three tables: {len(common_dates)}")


663
Number of unique dates that intersect across all three tables: 599


In [None]:
"""
Time Based Features
Modify usage as needed.
"""

def date_handling(df, dfName):
    # Pick the timestamp column that belongs to the given table
    if dfName == 'Bolus':
        field = 'CompletionDateTime'
    elif dfName in ('BG', 'EVG'):
        field = 'EventDateTime'
    else:
        # Fail early instead of looking up an empty column name
        raise ValueError(f"Unknown table name: {dfName!r}")
    df = seconds_since_epoch(df, field)
    df = time_of_day(df, field)
    df = day_of_week(df, field)
    df = week(df, field)
    df = month(df, field)
    return df

def seconds_since_epoch(df, field):
    df[field] = pd.to_datetime(df[field])
    df['CompletionTimeSec'] = df[field].astype('int64') // 10**9
    return df

""" time_of_day():
    We may only want to show morning, afternoon, night?
"""
def time_of_day(df, field):
    df[field] = pd.to_datetime(df[field])

    df['TimeOfDaySec'] = (
        df[field].dt.hour * 3600 + 
        df[field].dt.minute * 60 + 
        df[field].dt.second
    )
    return df

def day_of_week(df, field):
    df[field] = pd.to_datetime(df[field])
    df['DayOfWeek'] = df[field].dt.weekday
    return df

def week(df, field):
    df[field] = pd.to_datetime(df[field])
    df['Week'] = df[field].dt.isocalendar().week
    return df

def month(df, field):
    df[field] = pd.to_datetime(df[field])
    df['Month'] = df[field].dt.month
    return df

def day_of_year(df, field):
    df[field] = pd.to_datetime(df[field])
    df['DayOfYear'] = df[field].dt.dayofyear
    return df




In [None]:
from sklearn.preprocessing import LabelEncoder
"""
def onehot_encoding(df):
    df = pd.get_dummies(df, columns=['BolusType', 'BolusDeliveryMethod', 'CompletionStatusDesc'])
    return df

def label_encoding(df):
    bolus_type_encoder = LabelEncoder()
    delivery_method_encoder = LabelEncoder()
    status_desc_encoder = LabelEncoder()

    # Apply label encoding
    df['BolusType_Encoded'] = bolus_type_encoder.fit_transform(df['BolusType'])
    df['BolusDeliveryMethod_Encoded'] = delivery_method_encoder.fit_transform(df['BolusDeliveryMethod'])
    df['CompletionStatusDesc_Encoded'] = status_desc_encoder.fit_transform(df['CompletionStatusDesc'])
    return df
"""
"""
    ETHAN: I am seeing some material on encoding categories with an embedding layer for the LSTM model. I believe the layer learns the encoding automatically if you use Embedding() after label encoding the data.
    
    from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
"""

''

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

def min_max_scaling(df, field):
    scaler = MinMaxScaler(feature_range=(0, 1))
    fieldNew = "Scaled" + field
    df[fieldNew] = scaler.fit_transform(df[[field]])
    return df

def standard_scaling(df, field):
    scaler = StandardScaler()
    fieldNew = "Scaled" + field
    df[fieldNew] = scaler.fit_transform(df[[field]])
    return df

def robust_scaling(df, field):
    scaler = RobustScaler()
    fieldNew = "Scaled" + field
    df[fieldNew] = scaler.fit_transform(df[[field]])
    return df

In [47]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler



def round_to_nearest_5_minutes(df):
    # Ensure CompletionDateTime is in datetime format; unparseable values become NaT
    df['CompletionDateTime'] = pd.to_datetime(df['CompletionDateTime'], errors='coerce')
    # Drop rows with invalid datetime entries; copy so the assignment below
    # does not trigger a SettingWithCopyWarning
    df = df.dropna(subset=['CompletionDateTime']).copy()
    # Round CompletionDateTime to the nearest 5 minutes
    df['CompletionDateTime'] = df['CompletionDateTime'].dt.round('5min')
    return df

def remove_cgm_on_time_overlap(df, cgm_types=('CGM', 'EVG')):
    # Flag rows whose CompletionDateTime occurs more than once
    is_duplicate_time = df.duplicated('CompletionDateTime', keep=False)
    # Flag CGM-derived rows: convert_data labels them 'CGM', the test data uses 'EVG'
    is_cgm = df['Type'].isin(cgm_types)
    # Drop CGM rows at duplicated timestamps. Note that this also drops CGM rows
    # that collide only with other CGM rows after rounding.
    df = df[~(is_duplicate_time & is_cgm)]

    return df

def convert_data(df):
    # Example Bolus table columns
    bolus_columns = [
        'Type', 'BolusType', 'BolusDeliveryMethod', 'BG (mg/dL)', 'SerialNumber',
        'CompletionDateTime', 'InsulinDelivered', 'FoodDelivered', 'CorrectionDelivered',
        'CompletionStatusDesc', 'BolexStartDateTime', 'BolexCompletionDateTime', 
        'BolexInsulinDelivered', 'BolexCompletionStatusDesc', 'StandardPercent', 
        'Duration (mins)', 'CarbSize', 'TargetBG (mg/dL)', 'CorrectionFactor', 'CarbRatio'
    ]

    # 1. Default values for fields that the EVG table does not provide
    default_numeric = 0
    default_categorical = ''

    # 2. Map EVG fields to Bolus fields (use the function argument, not the global evg_df)
    mapped_bolus = df.rename(columns={
        'Readings (mg/dL)': 'BG (mg/dL)',
        'SerialNumber': 'SerialNumber',
        'EventDateTime': 'CompletionDateTime'
    })

    # 3. Add missing columns with default values
    for col in bolus_columns:
        if col not in mapped_bolus.columns:
            if col in ['BG (mg/dL)', 'InsulinDelivered', 'FoodDelivered', 'CorrectionDelivered', 
                    'StandardPercent', 'Duration (mins)', 'CarbSize', 'TargetBG (mg/dL)', 
                    'CorrectionFactor', 'CarbRatio']:  # Numeric fields
                mapped_bolus[col] = default_numeric
            elif col == 'Type':
                mapped_bolus[col] = 'CGM'
            else:  # Categorical fields
                mapped_bolus[col] = default_categorical

    # 4. Reorder columns to match the Bolus table structure
    mapped_bolus = mapped_bolus[bolus_columns]

    # Resulting Bolus DataFrame
    #print(mapped_bolus)
    return mapped_bolus


def min_max_scaling(df, field):
    scaler = MinMaxScaler(feature_range=(0, 1))
    fieldNew = "Scaled" + field
    df[fieldNew] = scaler.fit_transform(df[[field]])
    return df

def map_categories_to_indices(df, columns):
    """
    Map categorical values to numeric indices.
    """
    mappings = {}
    for col in columns:
        df[col] = df[col].astype('category')
        df[f"{col}_idx"] = df[col].cat.codes  # Create numeric indices
        mappings[col] = dict(enumerate(df[col].cat.categories))  # Save mappings
    return df, mappings

def onehot_encoding(df, columns):
    df = pd.get_dummies(df, columns=columns)
    return df

def onehot_encoding_with_all_categories(df, categorical_columns, all_possible_categories):
    """
    One-hot encode specified categorical columns, ensuring all possible categories are included.
    """
    for col in categorical_columns:
        df[col] = pd.Categorical(df[col], categories=all_possible_categories[col])
    df = pd.get_dummies(df, columns=categorical_columns, prefix=categorical_columns, dummy_na=False)
    return df


# Load the CSV files into pandas DataFrames
#file_path1 = './DataTables/BG.csv'  # BG table (not used in this cell)
file_path2 = './DataTables/EVG.csv'  # EVG table (CGM readings)
file_path3 = './DataTables/Bolus.csv'  # Bolus table (pump data)

# Load the data
#df1 = pd.read_csv(file_path1)
evg_df = pd.read_csv(file_path2)
bolus_df = pd.read_csv(file_path3)

evg_mapped = convert_data(evg_df)

df = pd.concat([bolus_df, evg_mapped], ignore_index=True)
#print(evg_mapped)

# Round CompletionDateTime to the nearest 5 minutes
df = round_to_nearest_5_minutes(df)

# Sort the DataFrame by CompletionDateTime
df = df.sort_values(by='CompletionDateTime').reset_index(drop=True)

df = remove_cgm_on_time_overlap(df)
columns=['Type', 'BolusType', 'BolusDeliveryMethod', 'CompletionStatusDesc']
# Apply the function
df = onehot_encoding(df, columns)
columns=['SerialNumber', 'BolexStartDateTime', 'BolexCompletionDateTime', 'BolexInsulinDelivered', 'BolexCompletionStatusDesc']
df = df.drop(columns=columns)
bool_columns = df.select_dtypes(include='bool').columns
df[bool_columns] = df[bool_columns].astype(int)

df.rename(columns={'CompletionDateTime': 'Date'}, inplace=True)

# Move the 'Date' column to the first position
first_col = df.pop('Date')  # Remove 'Date' column
df.insert(0, 'Date', first_col)  # Reinsert it at position 0

# Display the DataFrame

# Check the resulting DataFrame
df.to_csv('test.csv', index=False)





# Check for NaN values; there are none
print("NaN values:", df.isnull().values.any())
# Organize data, make date into an index
df.set_index("Date", inplace=True)
# Note: the categorical columns were one-hot encoded above, so every remaining column is numeric
values = df.values
# specify columns to plot
#groups = [0, 1, 2, 3]
#i = 1
# plot each column
#pyplot.figure()
#for group in groups:
#	pyplot.subplot(len(groups), 1, i)
#	pyplot.plot(values[:, group])
#	pyplot.title(df.columns[group], y=0.5, loc='right')
#	i += 1
#pyplot.show()
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
print(scaled)

NaN values: False
[[0.         0.26737968 0.26737968 ... 1.         0.         0.        ]
 [0.22       0.10481283 0.08930481 ... 1.         0.         0.        ]
 [0.         0.05721925 0.05721925 ... 1.         0.         0.        ]
 ...
 [0.475      0.         0.         ... 0.         0.         0.        ]
 [0.49666667 0.0197861  0.         ... 1.         0.         0.        ]
 [0.43166667 0.         0.         ... 0.         0.         0.        ]]


In [None]:

# Sample test data
test_data = {
    'Type': ['EVG', 'Bolus', 'EVG', 'Bolus', 'EVG'],
    'CompletionDateTime': [
        '2022-05-26 06:23:59',
        '2022-05-26 06:25:00',
        '2022-05-26 10:08:17',
        '2022-05-26 10:10:00',
        '2022-05-26 11:16:39'
    ],
    'BG (mg/dL)': [137, 140, 99, 100, 138],
    'InsulinDelivered': [0, 4.5, 0, 3.8, 0]
}

# Convert to DataFrame
test_df = pd.DataFrame(test_data)

# Ensure CompletionDateTime is datetime
test_df['CompletionDateTime'] = pd.to_datetime(test_df['CompletionDateTime'])

# Apply rounding to the nearest 5 minutes
test_df['CompletionDateTime'] = test_df['CompletionDateTime'].dt.round('5min')

# Sort by CompletionDateTime
test_df = test_df.sort_values(by='CompletionDateTime').reset_index(drop=True)
print(test_df)
# Remove EVG records with overlapping times
test_df = remove_cgm_on_time_overlap(test_df)

# Print the result
print(test_df)


    Type  CompletionDateTime  BG (mg/dL)  InsulinDelivered
0    EVG 2022-05-26 06:25:00         137               0.0
1  Bolus 2022-05-26 06:25:00         140               4.5
2    EVG 2022-05-26 10:10:00          99               0.0
3  Bolus 2022-05-26 10:10:00         100               3.8
4    EVG 2022-05-26 11:15:00         138               0.0
    Type  CompletionDateTime  BG (mg/dL)  InsulinDelivered
1  Bolus 2022-05-26 06:25:00         140               4.5
3  Bolus 2022-05-26 10:10:00         100               3.8
4    EVG 2022-05-26 11:15:00         138               0.0
