# License Notice

Copyright (c) 2024 Warren Bebbington

This notebook is part of the simple-glucose-analysis project and is licensed under the MIT License. For the full license text, please see the LICENSE file in the project's root directory.

# How to Back Up the SQLite Database from the XDrip+ Android App

To manually back up the SQLite database in the XDrip+ app and save it for use in your `simple_glucose_analysis` project, follow these steps:

## Steps to Back Up the Database
1. **Open XDrip+ App**:
   - Launch the XDrip+ app on your Android device.

2. **Access the Menu**:
   - Tap the **hamburger menu** (three horizontal lines) located at the top right of the screen.

3. **Select Import/Export**:
   - From the dropdown menu, select **Import/Export**.

4. **Export Database**:
   - Choose the **Export Database** option.
   - Follow any prompts to confirm the backup location if necessary.

5. **Save the Database File**:
   - When prompted to select a save location, choose a folder that is easily accessible.
   - **Important**: Save the database file (typically named `export.sqlite`) in the main directory of your `simple_glucose_analysis` project.

6. **Verify Backup**:
   - Ensure the database file is saved correctly in your project directory. You can check this using a file explorer on your device or your computer.

## Using the Database in Your Project

Once the database file is saved in the `simple_glucose_analysis` project directory, you can load it into the preprocessing notebook.

**Note**: It's good practice to back up your database regularly to prevent data loss!
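
Before loading the export, it can help to confirm that the file really is a SQLite database. The sketch below is one way to do that using only the standard library: it checks for the 16-byte header that SQLite 3 files start with. The helper name `looks_like_sqlite` and the example path are hypothetical and not part of the project.

```python
from pathlib import Path

# SQLite 3 database files begin with this 16-byte header string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_sqlite(path):
    """Return True if `path` exists and starts with the SQLite 3 file header."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


if __name__ == "__main__":
    # Hypothetical path: replace with the location of your own export.
    print(looks_like_sqlite("export.sqlite"))
```

A `False` result usually means the path is wrong or the export did not complete, which is easier to spot here than from a later SQLAlchemy error.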


In [None]:
from sqlalchemy import create_engine, inspect
import pandas as pd

### Load your XDrip+ SQLite backup

In [None]:
# Path to your SQLite file
db_path = 'path-to-your-file.sqlite'

# Create an SQLAlchemy engine
engine = create_engine(f'sqlite:///{db_path}')

# Use SQLAlchemy's inspector to list all tables
inspector = inspect(engine)
tables = inspector.get_table_names()
print(tables)

In [None]:
# Load BgReadings table into a pandas DataFrame
glucose_data = 'BgReadings'  # Table containing all BG Readings from XDrip+
bg_df = pd.read_sql_table(glucose_data, con=engine)
bg_df['timestamp'] = pd.to_datetime(bg_df['timestamp'], unit='ms')

# Load Treatments table into a pandas DataFrame
treatments_data = 'Treatments'  # Table containing all Treatments from XDrip+
treatments_df = pd.read_sql_table(treatments_data, con=engine)
treatments_df['timestamp'] = pd.to_datetime(treatments_df['timestamp'], unit='ms')

# Explore the first few rows of the blood glucose table
bg_df.head()

In [None]:
treatments_df.head()

We can see that the `insulin` column in XDrip+ stores both basal and bolus insulin doses. In principle these can be told apart by the `insulinJSON` column, which shows the type of insulin you set in XDrip+: in this case Novorapid (bolus) and Levemir (basal). The initial plan was to loop over the table and, for each row where `insulin` has a value above 0.0, check `insulinJSON` for the word 'Novorapid'. If the word is present, the value moves to a column named `bolus`; if not, it goes into a column named `basal`. We will then drop the rest of the rows in the treatments table.

**UPDATE** - It seems the word 'Novorapid' is not always present in the `insulinJSON` column, so we will use the word 'Levemir' instead to try to isolate basal doses. This may differ depending on how you set up XDrip+.

**UPDATE** - Neither value is consistent enough to distinguish the insulin type, so I will use a cutoff of 10 units to decide whether a dose is basal or bolus. I have chosen 10 because my basal dose has always been above this and my maximum bolus dose is 6 units. This should adequately separate the two for my data. You may need to adjust this value.
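
One way to choose a cutoff for your own data is to look at how the recorded doses are distributed. The sketch below uses made-up doses (not my data) and a hypothetical `cutoff` variable; with a real backup you would start from `treatments_df['insulin']` instead.

```python
import pandas as pd

# Made-up doses for illustration only: small values are boluses, large ones basal
toy_treatments = pd.DataFrame(
    {"insulin": [0.0, 3.0, 4.5, 6.0, 14.0, 0.0, 2.0, 14.0, 15.0]}
)

# Keep only rows where insulin was actually recorded
doses = toy_treatments.loc[toy_treatments["insulin"] > 0, "insulin"]

# Count how often each dose size occurs, ordered by dose
dose_counts = doses.value_counts().sort_index()
print(dose_counts)

# A candidate cutoff is sensible if it falls in a clear gap between the two groups
cutoff = 10.0
largest_below = doses[doses < cutoff].max()
smallest_at_or_above = doses[doses >= cutoff].min()
print(f"Largest dose below cutoff: {largest_below}")
print(f"Smallest dose at or above cutoff: {smallest_at_or_above}")
```

If the two printed values sit close together, a single cutoff is unlikely to separate basal from bolus reliably and another rule may be needed.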

### Save Raw Data

We will save the data in CSV files for your own use. The `BgReadings` table contains more data worth looking into, and there seem to be other useful tables, including `HeartRate` (recorded by XDrip+ if health data is available on the Android device, e.g. from a smartwatch), `Calibrations` (calibration data), `BloodReadings` (finger-prick results) and more.
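
If you want to keep some of those extra tables as well, a loop like the one below is one option. It is a sketch rather than part of the project: the helper `export_tables` is hypothetical, it uses the standard-library `sqlite3` module rather than the SQLAlchemy engine, and the table names are taken from the paragraph above and may differ in your backup, so tables that do not exist are skipped.

```python
import sqlite3
from pathlib import Path

import pandas as pd


def export_tables(conn, table_names, out_dir):
    """Write each requested table that exists to <out_dir>/raw_<table>.csv."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Ask SQLite which tables are actually present in this backup
    existing = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }

    written = []
    for name in table_names:
        if name not in existing:
            continue  # Skip tables this backup does not contain
        # `name` was matched against sqlite_master above before being quoted here
        table_df = pd.read_sql_query(f'SELECT * FROM "{name}"', conn)
        target = out_path / f"raw_{name}.csv"
        table_df.to_csv(target, index=False)
        written.append(target)
    return written
```

For example, `export_tables(sqlite3.connect(db_path), ['HeartRate', 'Calibrations', 'BloodReadings'], 'data')` would write one CSV per table that is found and return the list of files written.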

In [None]:
bg_df.to_csv('data/raw_bg.csv')
treatments_df.to_csv('data/raw_treatments.csv')

# Clean data

In [None]:
bg_df.info()

### Split Basal and Bolus insulin

In [None]:
# Create two new columns 'bolus' and 'basal', initializing with NaN values
treatments_df['bolus'] = float('nan')
treatments_df['basal'] = float('nan')

# Filter rows where insulin > 0
insulin_positive = treatments_df['insulin'] > 0

# Filter rows where insulin >= 10
above_10 = treatments_df['insulin'] >= 10

# For rows where 'insulin' is >= 10, assign to 'basal'
treatments_df.loc[above_10, 'basal'] = treatments_df['insulin']

# For rows where 'insulin' > 0 and 'insulin' is < 10, assign to 'bolus'
treatments_df.loc[insulin_positive & ~above_10, 'bolus'] = treatments_df['insulin']

# Display the updated DataFrame to check the result
print(treatments_df[['insulin', 'bolus', 'basal']])

In [None]:
treatments_df.info()

## Unrequired data

We will now drop all unrequired columns and round the timestamps in both tables to 5-minute intervals, which improves the alignment of the two data sources.
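
To make the rounding step concrete, here is a small sketch on made-up readings (the timestamps and values are invented for illustration). It shows how irregular timestamps snap to the nearest 5-minute mark and how readings that land on the same mark are averaged.

```python
import pandas as pd

# Made-up readings at irregular times (values in mg/dL)
toy_bg = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 12:01:10", "2024-01-01 12:03:40", "2024-01-01 12:06:00"]
        ),
        "calculated_value": [100.0, 110.0, 130.0],
    }
)

# Snap each timestamp to the nearest 5-minute mark
toy_bg["timestamp"] = toy_bg["timestamp"].dt.round("5min")

# Readings that land on the same mark are averaged into one row
toy_aligned = toy_bg.groupby("timestamp")["calculated_value"].mean()
print(toy_aligned)
```

The first reading rounds down to 12:00, while the other two both round to 12:05 and are combined into a single averaged row.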

In [None]:
# Create dataframes with only our required columns and rename calculated_value to glucose
bg_df = bg_df[['calculated_value', 'timestamp']].copy()
bg_df['timestamp'] = pd.to_datetime(bg_df['timestamp']).dt.round('5min')
bg_df = bg_df.groupby('timestamp').agg({'calculated_value': 'mean'})
bg_df.rename(columns={'calculated_value': 'glucose'}, inplace=True)

# For treatments_df
treatments_df = treatments_df[['carbs', 'basal', 'bolus', 'timestamp']].copy()
treatments_df['timestamp'] = pd.to_datetime(treatments_df['timestamp']).dt.round('5min')
treatments_df = treatments_df.groupby('timestamp').agg({
    'carbs': 'sum',
    'basal': 'sum',
    'bolus': 'sum'
})

In [None]:
treatments_df

In [None]:
bg_df

## Unrealistic data

We will now limit glucose values to physiologically plausible limits to reduce the effect of sensor errors. At the upper end we cap glucose at 20.0 mmol/L, and at the lower end we raise any value below the Libre 2 cut-off of 2.2 mmol/L to that limit. We will also set any basal or bolus dose over 15u to 0, as these must be XDrip+ issues: I have never taken such large doses in my 18 years as a Type 1 diabetic.
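
As a quick worked example of those limits, the sketch below applies the same rules to single values in plain Python. The function names are hypothetical, and the divisor of 18 is the usual approximate mg/dL to mmol/L conversion for glucose.

```python
MGDL_PER_MMOL = 18.0  # Approximate conversion factor for glucose


def to_clipped_mmol(value_mgdl, lower=2.2, upper=20.0):
    """Convert a mg/dL reading to mmol/L and clamp it to [lower, upper]."""
    value_mmol = value_mgdl / MGDL_PER_MMOL
    return min(max(value_mmol, lower), upper)


def zero_if_implausible(dose, max_dose=15.0):
    """Return 0 for doses above max_dose, otherwise return the dose unchanged."""
    return 0 if dose > max_dose else dose


for reading in (36.0, 90.0, 450.0):
    print(reading, "mg/dL ->", to_clipped_mmol(reading), "mmol/L")
```

A reading of 36 mg/dL converts to 2.0 mmol/L and is raised to 2.2, while 450 mg/dL converts to 25.0 mmol/L and is capped at 20.0.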

In [None]:
# For treatments_df: Set any value above 15 in 'bolus' and 'basal' columns to 0
treatments_df['bolus'] = treatments_df['bolus'].apply(lambda x: 0 if x > 15 else x)
treatments_df['basal'] = treatments_df['basal'].apply(lambda x: 0 if x > 15 else x)

# For bg_df: limit glucose values to the range [2.2, 20.0] mmol/L.
# The readings are first converted from mg/dL to mmol/L.
# Comment out the conversion line below if your data is already in mmol/L.

# Convert glucose from mg/dL to mmol/L using the standard divisor of 18
bg_df['glucose'] = bg_df['glucose'] / 18.0
bg_df['glucose'] = bg_df['glucose'].clip(lower=2.2, upper=20.0)

In [None]:
treatments_df.describe()

In [None]:
# Replace all NaN values (carbs, basal and bolus) with 0 in the entire DataFrame
treatments_df.fillna(0, inplace=True)

treatments_df

In [None]:
bg_df.describe()

## Combine

We will now reindex both dataframes onto the complete date range and combine them into `combined_df`. We can then turn our focus to the glucose data: interpolating where suitable and dropping rows that fall in large gaps in the blood glucose readings.
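
The effect of the reindexing is easiest to see on a tiny example. The sketch below uses made-up data and `DataFrame.join` (rather than `pd.merge`) purely to keep it short; both frames share the same index after reindexing, so the result is equivalent here.

```python
import pandas as pd

# Made-up glucose readings with the 12:10 slot missing
toy_bg = pd.DataFrame(
    {"glucose": [5.5, 5.8, 6.4]},
    index=pd.to_datetime(
        ["2024-01-01 12:00", "2024-01-01 12:05", "2024-01-01 12:15"]
    ),
)

# Made-up treatments: a single carb entry at 12:10
toy_treatments = pd.DataFrame(
    {"carbs": [30.0]}, index=pd.to_datetime(["2024-01-01 12:10"])
)

# Build the full 5-minute grid covering both sources
start = min(toy_bg.index.min(), toy_treatments.index.min())
end = max(toy_bg.index.max(), toy_treatments.index.max())
grid = pd.date_range(start=start, end=end, freq="5min")

# Reindexing inserts NaN rows for slots a source has no data for
toy_combined = toy_bg.reindex(grid).join(toy_treatments.reindex(grid))
toy_combined["carbs"] = toy_combined["carbs"].fillna(0)
print(toy_combined)
```

The missing 12:10 glucose slot shows up as an explicit NaN row, which is what makes the gap checks in the next section possible.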

In [None]:
# Ensure both dataframes have a complete range of 5-minute intervals
start_time = min(bg_df.index.min(), treatments_df.index.min())
end_time = max(bg_df.index.max(), treatments_df.index.max())
full_range = pd.date_range(start=start_time, end=end_time, freq='5min')

bg_df = bg_df.reindex(full_range)
treatments_df = treatments_df.reindex(full_range)

# Merge the dataframes
combined_df = pd.merge(bg_df, treatments_df, left_index=True, right_index=True, how='outer')

# Explicitly name the index
combined_df.index.name = 'timestamp'

# Handle missing values
combined_df[['carbs', 'basal', 'bolus']] = combined_df[['carbs', 'basal', 'bolus']].fillna(0) # Fill all NaN values with 0

## Check glucose data consistency

We will now investigate any gaps in the `glucose` column, drop the rows belonging to gaps over 20 minutes, and interpolate all others.
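
The gap detection relies on labelling consecutive runs of missing rows. The sketch below shows the idea on a made-up series; the variable names are for illustration only.

```python
import pandas as pd

# Made-up glucose series on a 5-minute grid: one short gap (2 rows), one long gap (5 rows)
index = pd.date_range("2024-01-01 12:00", periods=10, freq="5min")
toy_glucose = pd.Series(
    [5.0, None, None, 6.0, None, None, None, None, None, 7.0], index=index
)

is_gap = toy_glucose.isna()

# The label increases by one every time the series switches between gap and
# non-gap, so each consecutive run of rows shares a single label
run_id = (is_gap != is_gap.shift()).cumsum()

# Number of rows in each run of missing readings
gap_lengths = is_gap[is_gap].groupby(run_id[is_gap]).size()
print(gap_lengths)

# Runs of more than 4 missing rows are treated as too long to interpolate
long_gap_ids = gap_lengths[gap_lengths > 4].index.tolist()
print(long_gap_ids)
```

Only the 5-row run is flagged as a long gap; the 2-row run is left for interpolation.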

In [None]:
combined_df.isna().sum()

In [None]:
# Identify gaps in glucose readings
combined_df['is_gap'] = combined_df['glucose'].isna()
combined_df['gap_group'] = (combined_df['is_gap'] != combined_df['is_gap'].shift()).cumsum()
gaps = combined_df[combined_df['is_gap']].groupby('gap_group')
gaps_greater_than_20min = gaps.filter(lambda x: len(x) > 4)

number_of_gaps = len(gaps_greater_than_20min['gap_group'].unique())
print(f"Number of gaps greater than 20 minutes: {number_of_gaps}")

In [None]:
# Save to csv if you wish to inspect for further insight into missing glucose readings in your data
gaps_greater_than_20min.to_csv('data/gaps_over_20min.csv')

In [None]:
if number_of_gaps > 0:
    print("Gaps greater than 20 minutes:")
    print(gaps_greater_than_20min.groupby('gap_group').size())  # Show each gap group and its number of rows

## Drop rows with large gaps

We will now drop all of these groups ensuring no gaps over 20 minutes will be interpolated and helping maintain the integrity of our data. We will then interpolate the remaining gaps.
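
The reason for dropping long gaps first is that the `limit` argument of `interpolate` only caps how many consecutive values are filled; it does not skip a gap that is longer than the limit. A small sketch on made-up values illustrates this:

```python
import pandas as pd

index = pd.date_range("2024-01-01 12:00", periods=8, freq="5min")

# Made-up series with a 6-row gap, longer than the 4-row limit
long_gap = pd.Series([1.0, None, None, None, None, None, None, 8.0], index=index)

# `limit=4` still fills the first 4 rows of the longer gap and leaves the rest as NaN
partly_filled = long_gap.interpolate(method="time", limit=4)
print(partly_filled)

# A short 2-row gap is filled completely
short_gap = pd.Series([5.0, None, None, 8.0], index=index[:4])
filled = short_gap.interpolate(method="time", limit=4)
print(filled)
```

Without the drop, the start of every long gap would be partly filled with interpolated values.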

In [None]:
# Drop the rows in gaps_greater_than_20min from combined_df and interpolate the rest
combined_df_cleaned = combined_df.drop(gaps_greater_than_20min.index)
combined_df_cleaned['glucose'] = combined_df_cleaned['glucose'].interpolate(method='time', limit=4) # Interpolate gaps up to 20 mins

In [None]:
combined_df_cleaned.info()

In [None]:
combined_df_cleaned.describe()

# Anonymise my data

I will now create a `day_of_week` column and a `time` column before dropping the timestamp column, so that the sample data provided with this project remains somewhat anonymous. You can skip this section if you're using this tool with your own data, but some of the analysis functions will need altering to work with the actual timestamps. I will try to incorporate a settings variable in the analysis notebook to define whether you're running it with sample data or your own data containing timestamps.
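
As an illustration of what this step keeps and what it discards, here is a sketch on a few made-up rows. It uses `reset_index(drop=True)` as a shorthand for resetting the index and then dropping the `timestamp` column.

```python
import pandas as pd

# Made-up rows on a 5-minute grid that cross midnight; 2024-01-01 was a Monday
index = pd.date_range("2024-01-01 23:55", periods=3, freq="5min", name="timestamp")
toy = pd.DataFrame({"glucose": [5.5, 5.7, 6.0]}, index=index)

# Keep only the weekday name and the time of day
toy["day_of_week"] = toy.index.day_name()
toy["time"] = toy.index.time

# Dropping the index removes the calendar date from the data
toy_anonymous = toy.reset_index(drop=True)
print(toy_anonymous)
```

The weekday and time of day survive, so weekly and daily patterns can still be analysed, but the calendar dates are gone.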

In [None]:
# Create a 'day_of_week' column from the day names of the DatetimeIndex
combined_df_cleaned['day_of_week'] = combined_df_cleaned.index.day_name()

# Create a 'time' column by extracting the time part of the DatetimeIndex
combined_df_cleaned['time'] = combined_df_cleaned.index.time

# Comment these two lines out to keep the timestamp in your data
combined_df_cleaned = combined_df_cleaned.reset_index()
combined_df_cleaned = combined_df_cleaned.drop('timestamp', axis=1)

In [None]:
combined_df_cleaned

In [None]:
combined_df_cleaned.describe()

# Conclusion
With this notebook we have successfully transformed the XDrip+ data backup from raw, unaligned data into a combined dataset ready for analysis. We have done the following:
- Loaded the BgReadings and Treatments tables from SQLite using pandas
- Removed all unnecessary columns from the BgReadings and Treatments tables
- Split the insulin column into basal and bolus dose columns using personal experience (adapt this section if needed)
- Cleaned the data, including:
    1. Renaming columns to more intuitive names
    2. Rounding irregular time intervals into consistent 5-minute intervals and aggregating data appropriately
    3. Clipping and modifying unrealistic data points using common sense and personal experience (adapt this section if needed)
    4. Addressing NaN values using a strict standard of not interpolating any gaps over 20 minutes, ensuring data validity
    5. Merging the two dataframes and handling missing values properly
    6. Thoroughly checking the consistency of the glucose data
    7. Adapting the data to maintain its temporal relationships whilst removing sensitive data
    
## Export your data

If you are running the analysis on your own data, you can export to a CSV file now and begin the analysis. Be aware that this data will span however long your XDrip+ backup covers, not just 90 days like the sample data.
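
If you would rather work with a window comparable to the sample data, one option is to keep only the most recent 90 days before exporting. The sketch below is an assumption-laden illustration: it uses made-up data and only applies if you skipped the anonymisation step and still have a timestamp index.

```python
import pandas as pd

# Made-up data: one reading per day for 100 days, indexed by timestamp
index = pd.date_range("2024-01-01", periods=100, freq="D", name="timestamp")
toy = pd.DataFrame({"glucose": 5.5}, index=index)

# Keep only rows within 90 days of the most recent timestamp
window = pd.Timedelta(days=90)
cutoff = toy.index.max() - window
recent = toy.loc[toy.index >= cutoff]
print(len(recent), recent.index.min(), recent.index.max())
```

With your own data, the same filter could be applied to `combined_df_cleaned` before calling `to_csv`.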


In [None]:
combined_df_cleaned.to_csv('data/sample_analysis_data.csv')

### End of Notebook