# License Notice

Copyright (c) 2024 Warren Bebbington

This notebook is part of the simple-glucose-analysis project and is licensed under the MIT License. For the full license text, please see the LICENSE file in the project's root directory.

# How to Back Up the SQLite Database from the XDrip+ Android App

To manually back up the SQLite database in the XDrip+ app and save it for use in your `simple_glucose_analysis` project, follow these steps:

## Steps to Backup the Database

1. **Open XDrip+ App**:
   - Launch the XDrip+ app on your Android device.

2. **Access the Menu**:
   - Tap the **hamburger menu** (three horizontal lines) located at the top right of the screen.

3. **Select Import/Export**:
   - From the dropdown menu, select **Import/Export**.

4. **Export Database**:
   - Choose the **Export Database** option.
   - Follow any prompts to confirm the backup location if necessary.

5. **Save the Database File**:
   - When prompted to select a save location, choose a folder that is easily accessible.
   - **Important**: Save the database file (typically named `export.sqlite`) in the main directory of your `simple_glucose_analysis` project.

6. **Verify Backup**:
   - Ensure the database file is saved correctly in your project directory. You can check this using a file explorer on your device or your computer.

## Using the Database in Your Project

Once the database file is saved in the `simple_glucose_analysis` project directory, you can load it into the preprocessing notebook.

**Note**: It's good practice to back up your database regularly to prevent data loss!
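
Before loading the export, it can help to confirm that the file is where you expect it and really is a SQLite database. The sketch below relies on the fact that every SQLite 3 file begins with the 16-byte header `SQLite format 3\0`. The helper name `is_sqlite_file` is illustrative and not part of the project.

```python
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of any SQLite 3 file


def is_sqlite_file(path):
    """Return True if `path` exists and starts with the SQLite 3 header."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
```

For example, `is_sqlite_file('export.sqlite')` should return `True` once the backup has been copied into the project directory.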


In [209]:
from sqlalchemy import create_engine, inspect
import pandas as pd

### Load your XDrip+ SQLite backup

In [None]:
# Path to your SQLite file
db_path = 'path-to-your-file.sqlite'

# Create an SQLAlchemy engine
engine = create_engine(f'sqlite:///{db_path}')

# Use SQLAlchemy's inspector to list all tables
inspector = inspect(engine)
tables = inspector.get_table_names()
print(tables)

In [211]:
# Load BgReadings table into a pandas DataFrame
glucose_data = 'BgReadings'  # Table containing all BG Readings from XDrip+
bg_df = pd.read_sql_table(glucose_data, con=engine)
bg_df['timestamp'] = pd.to_datetime(bg_df['timestamp'], unit='ms')

# Load Treatments table into a pandas DataFrame
treatments_data = 'Treatments'  # Table containing all Treatments from XDrip+
treatments_df = pd.read_sql_table(treatments_data, con=engine)
treatments_df['timestamp'] = pd.to_datetime(treatments_df['timestamp'], unit='ms')

# Explore the first few rows of the blood glucose table
bg_df.head()

Unnamed: 0,_id,a,age_adjusted_raw_value,b,c,calculated_value,calculated_value_slope,calibration,calibration_flag,calibration_uuid,...,raw_calculated,raw_data,rb,rc,sensor,sensor_uuid,source_info,time_since_sensor_started,timestamp,uuid
0,72464,1.245412e-13,190.23528,-0.419946,354008500000.0,133.923468,-3.5e-05,559.0,0,cba43ee7-5a7c-47e7-beb5-a5d7c211e64b,...,0.0,190.23528,-0.546322,460541600000.0,42,73b87f32-2b6b-47c9-b2bc-9283632d169a,,1211123000.0,2023-06-03 22:31:05.757,a63f2267-5f99-4663-8714-7146835e118c
1,72465,7.766019e-12,178.352928,-26.184433,22071300000000.0,124.789758,-3e-05,559.0,0,cba43ee7-5a7c-47e7-beb5-a5d7c211e64b,...,0.0,178.352928,-34.064212,28713300000000.0,42,73b87f32-2b6b-47c9-b2bc-9283632d169a,,1211424000.0,2023-06-03 22:36:06.812,d1386140-bdfe-4bd3-802e-9b406cf320c2
2,72466,2.847681e-11,173.176458,-96.014259,80931960000000.0,120.810715,-1.3e-05,559.0,0,cba43ee7-5a7c-47e7-beb5-a5d7c211e64b,...,0.0,173.176458,-124.908189,105287100000000.0,42,73b87f32-2b6b-47c9-b2bc-9283632d169a,,1211727000.0,2023-06-03 22:41:09.457,1f24e9fc-b7f5-4ae9-a5ca-2ba50b8d3e09
3,72467,-1.041753e-11,157.5294,35.124388,-29606900000000.0,108.783156,-2.2e-05,559.0,0,cba43ee7-5a7c-47e7-beb5-a5d7c211e64b,...,0.0,157.5294,45.694502,-38516610000000.0,42,73b87f32-2b6b-47c9-b2bc-9283632d169a,,1212273000.0,2023-06-03 22:50:16.249,b6395bd2-e2fb-4471-845a-ad2f2efb5e66
4,72468,-3.626429e-12,147.764695,12.227082,-10306390000000.0,101.277236,-2.5e-05,559.0,0,cba43ee7-5a7c-47e7-beb5-a5d7c211e64b,...,0.0,147.764695,15.906624,-13407930000000.0,42,73b87f32-2b6b-47c9-b2bc-9283632d169a,,1212573000.0,2023-06-03 22:55:15.702,3cdb52ca-e4e3-43d0-8274-f16c70c24513


In [212]:
treatments_df.head()

Unnamed: 0,_id,carbs,created_at,enteredBy,eventType,insulin,insulinJSON,notes,timestamp,uuid
0,4286,0.0,2023-06-03T23:35:28Z,xdrip,<none>,0.0,,Warning: Sensor will expire in 22 hours,2023-06-03 23:35:28.993,a25c9a90-63c5-478f-bb67-4c4abb5c9d38
1,4287,0.0,2023-06-03T23:58:08Z,xdrip,<none>,4.0,[],,2023-06-03 23:58:08.909,8dcd2210-1564-4ad1-95d3-4614edcb807e
2,4288,0.0,2023-06-04T03:16:45Z,xdrip,<none>,2.0,[],,2023-06-04 03:16:45.168,1e639240-3981-4676-b62c-ed9235070a09
3,4289,0.0,2023-06-04T04:29:36Z,xdrip,<none>,2.0,"[{""insulin"":""Novorapid"",""units"":2.0}]",,2023-06-04 04:29:36.016,046c164e-c3c9-4b6d-bf94-85c58786b1eb
4,4290,0.0,2023-06-04T08:00:49Z,xdrip,<none>,2.0,[],,2023-06-04 08:00:49.489,0873a329-e0aa-4942-9775-5b0a679863c1


The `insulin` column in XDrip+ stores both basal and bolus doses. In principle the two can be told apart by the `insulinJSON` column, which shows the type of insulin you set in XDrip+; in this case Novorapid (bolus) and Levemir (basal). The original plan was a function that goes through the table and, for each row where `insulin` is above 0.0, checks `insulinJSON` for the word 'Novorapid': if the word is present the value moves to a column named `bolus`, and if not it goes to a column named `basal`. The rest of the rows in the treatments table are then dropped.

**UPDATE** - The word 'Novorapid' is not always present in the `insulinJSON` column, so the word 'Levemir' is used instead to try to isolate basal doses. This may differ depending on how you set up XDrip+.

**UPDATE** - Neither value is consistent enough to distinguish the insulin type, so I will use a cut-off of 10 units to decide whether a dose is basal or bolus. I chose 10 because my basal dose has always been above this and my maximum bolus dose is 6 units. This should adequately determine which is which for my data. You may need to adjust these values.
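
As a rough illustration of how the two ideas could be combined, the sketch below trusts the insulin name in `insulinJSON` when one is recorded and falls back to the dose cut-off otherwise. This is not what the notebook does (it uses the cut-off alone), and the function name `classify_insulin`, its parameters, and the assumption that a Levemir entry has the same JSON shape as the Novorapid entry shown above are all hypothetical.

```python
import json


def classify_insulin(units, insulin_json, basal_names=("Levemir",), cutoff=10.0):
    """Label a dose 'basal' or 'bolus', or return None when no insulin was given."""
    if units is None or units <= 0:
        return None

    # insulinJSON may be missing, empty ('[]') or malformed; treat all as "no names"
    try:
        entries = json.loads(insulin_json) if insulin_json else []
    except (TypeError, ValueError):
        entries = []
    if not isinstance(entries, list):
        entries = []

    names = {
        str(entry.get("insulin", "")).lower()
        for entry in entries
        if isinstance(entry, dict) and entry.get("insulin")
    }
    if names:
        basal = {name.lower() for name in basal_names}
        return "basal" if names & basal else "bolus"

    # No usable name recorded: fall back to the dose-size cut-off
    return "basal" if units >= cutoff else "bolus"
```

Adjust `basal_names` and `cutoff` to match your own insulin types and dose sizes.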

### Save Raw Data

We will save the data in CSV files for your own use. The BgReadings table contains more data worth looking into, and there seem to be other useful tables, including HeartRate (recorded by XDrip+ if health data is available on the Android device, e.g. from a smartwatch), Calibrations (calibration data), BloodReadings (finger-prick results) and more.
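
To see which of these other tables actually hold data in your own backup, one option is to count the rows in each table. The sketch below uses only the standard-library `sqlite3` module; the helper name `table_row_counts` is illustrative.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as connection:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Table names cannot be bound as parameters, so quote them explicitly
            quoted = '"{}"'.format(name.replace('"', '""'))
            counts[name] = connection.execute(
                "SELECT COUNT(*) FROM " + quoted
            ).fetchone()[0]
        return counts
```

Calling `table_row_counts(db_path)` with the same `db_path` used earlier gives a quick overview of which tables are worth exploring.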

In [213]:
bg_df.to_csv('data/raw_bg.csv')
treatments_df.to_csv('data/raw_treaments.csv')

PermissionError: [Errno 13] Permission denied: 'data/raw_bg.csv'

# Clean data

In [214]:
bg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132658 entries, 0 to 132657
Data columns (total 29 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   _id                        132658 non-null  int64         
 1   a                          132658 non-null  float64       
 2   age_adjusted_raw_value     132658 non-null  float64       
 3   b                          132658 non-null  float64       
 4   c                          132658 non-null  float64       
 5   calculated_value           132658 non-null  float64       
 6   calculated_value_slope     132658 non-null  float64       
 7   calibration                132356 non-null  float64       
 8   calibration_flag           132658 non-null  int64         
 9   calibration_uuid           132280 non-null  object        
 10  dg_delta_name              129427 non-null  object        
 11  dg_mgdl                    132658 non-null  float64 

### Split Basal and Bolus insulin

In [215]:
# Create two new columns 'bolus' and 'basal', initializing with NaN values
treatments_df['bolus'] = float('nan')
treatments_df['basal'] = float('nan')

# Filter rows where insulin > 0
insulin_positive = treatments_df['insulin'] > 0

# Filter rows where insulin >= 10
above_10 = treatments_df['insulin'] >= 10

# For rows where 'insulin' is >= 10, assign to 'basal'
treatments_df.loc[above_10, 'basal'] = treatments_df['insulin']

# For rows where 'insulin' > 0 and 'insulin' is < 10, assign to 'bolus'
treatments_df.loc[insulin_positive & ~above_10, 'bolus'] = treatments_df['insulin']

# Display the updated DataFrame to check the result
print(treatments_df[['insulin', 'bolus', 'basal']])

      insulin  bolus  basal
0         0.0    NaN    NaN
1         4.0    4.0    NaN
2         2.0    2.0    NaN
3         2.0    2.0    NaN
4         2.0    2.0    NaN
...       ...    ...    ...
7397      0.0    NaN    NaN
7398     12.0    NaN   12.0
7399      0.0    NaN    NaN
7400      4.0    4.0    NaN
7401      2.0    2.0    NaN

[7402 rows x 3 columns]


In [216]:
treatments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7402 entries, 0 to 7401
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   _id          7402 non-null   int64         
 1   carbs        7402 non-null   float64       
 2   created_at   7402 non-null   object        
 3   enteredBy    7402 non-null   object        
 4   eventType    7402 non-null   object        
 5   insulin      7402 non-null   float64       
 6   insulinJSON  7239 non-null   object        
 7   notes        591 non-null    object        
 8   timestamp    7402 non-null   datetime64[ns]
 9   uuid         7402 non-null   object        
 10  bolus        4179 non-null   float64       
 11  basal        420 non-null    float64       
dtypes: datetime64[ns](1), float64(4), int64(1), object(6)
memory usage: 694.1+ KB


## Unrequired data

We will now drop all unrequired columns and round the timestamps in both tables to 5-minute intervals, which improves the alignment of the two data sources.
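
A small, made-up example shows what the rounding and aggregation do. Two readings that fall within the same 5-minute slot collapse into a single averaged row; the values here are invented for illustration only.

```python
import pandas as pd

# Made-up readings taken at slightly irregular times
readings = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-06-03 22:31:05", "2023-06-03 22:32:10", "2023-06-03 22:36:06"]
        ),
        "glucose": [130.0, 134.0, 125.0],
    }
)

# Snap each timestamp to the nearest 5-minute mark
readings["timestamp"] = readings["timestamp"].dt.round("5min")

# Readings that land on the same mark are averaged into one row
aligned = readings.groupby("timestamp")["glucose"].mean()
```

Glucose readings are averaged because they measure a level, whereas carbs and insulin are summed because they are amounts taken within the slot.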

In [217]:
# Create dataframes with only our required columns and rename calculated_value to glucose
bg_df = bg_df[['calculated_value', 'timestamp']].copy()
bg_df['timestamp'] = pd.to_datetime(bg_df['timestamp']).dt.round('5min')
bg_df = bg_df.groupby('timestamp').agg({'calculated_value': 'mean'})
bg_df.rename(columns={'calculated_value': 'glucose'}, inplace=True)

# For treatments_df
treatments_df = treatments_df[['carbs', 'basal', 'bolus', 'timestamp']].copy()
treatments_df['timestamp'] = pd.to_datetime(treatments_df['timestamp']).dt.round('5min')
treatments_df = treatments_df.groupby('timestamp').agg({
    'carbs': 'sum',
    'basal': 'sum',
    'bolus': 'sum'
})

In [218]:
treatments_df

Unnamed: 0_level_0,carbs,basal,bolus
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2023-06-03 23:35:00,0.0,0.0,0.0
2023-06-04 00:00:00,0.0,0.0,4.0
2023-06-04 03:15:00,0.0,0.0,2.0
2023-06-04 04:30:00,0.0,0.0,2.0
2023-06-04 08:00:00,0.0,0.0,2.0
...,...,...,...
2024-09-28 02:00:00,16.0,0.0,0.0
2024-09-28 03:05:00,0.0,12.0,0.0
2024-09-28 05:30:00,3.0,0.0,0.0
2024-09-28 09:45:00,0.0,0.0,2.0


In [219]:
bg_df

Unnamed: 0_level_0,glucose
timestamp,Unnamed: 1_level_1
2023-06-03 22:30:00,133.923468
2023-06-03 22:35:00,124.789758
2023-06-03 22:40:00,120.810715
2023-06-03 22:50:00,108.783156
2023-06-03 22:55:00,101.277236
...,...
2024-09-28 11:35:00,81.736290
2024-09-28 11:45:00,77.464741
2024-09-28 11:50:00,75.691645
2024-09-28 11:55:00,75.611050


## Unrealistic data

We will now limit glucose values to plausible physiological limits to reduce the effect of sensor errors. On the upper side we cap glucose at 20.0 mmol/L, and on the lower side at the Libre 2 cut-off of 2.2 mmol/L. We will also set any basal or bolus dose over 15 u to 0, as these must be XDrip+ issues: I have never taken such large doses in my 18 years as a Type 1 diabetic.
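
The conversion and clipping for a single reading can be written out as a plain function. This is only a sketch of the arithmetic, using the same factor of 18 and the same limits as this notebook; the function name is illustrative.

```python
MGDL_PER_MMOL = 18.0  # conversion factor used in this notebook


def mgdl_to_mmol_clipped(mgdl, lower=2.2, upper=20.0):
    """Convert a glucose value from mg/dL to mmol/L, then clamp it to [lower, upper]."""
    mmol = mgdl / MGDL_PER_MMOL
    return max(lower, min(upper, mmol))
```

For instance, a reading of 180 mg/dL becomes 10.0 mmol/L, while anything below 39.6 mg/dL is raised to 2.2 and anything above 360 mg/dL is capped at 20.0.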

In [220]:
# For treatments_df: set any value above 15 in the 'bolus' and 'basal' columns to 0
treatments_df['bolus'] = treatments_df['bolus'].mask(treatments_df['bolus'] > 15, 0)
treatments_df['basal'] = treatments_df['basal'].mask(treatments_df['basal'] > 15, 0)

# For bg_df: limit glucose values to the range [2.2, 20.0] mmol/L.
# The readings are in mg/dL, so they must be converted to mmol/L first.

# Convert glucose from mg/dL to mmol/L using the standard factor of 18
bg_df['glucose'] = bg_df['glucose'] / 18.0
bg_df['glucose'] = bg_df['glucose'].clip(lower=2.2, upper=20.0)

In [221]:
treatments_df.describe()

Unnamed: 0,carbs,basal,bolus
count,6878.0,6878.0,6878.0
mean,13.454171,0.725792,2.206005
std,19.659942,2.87303,2.479235
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,2.0,0.0,2.0
75%,20.0,0.0,4.0
max,140.0,15.0,15.0


In [222]:
# Replace any remaining NaN values (carbs, basal, bolus) with 0 in the entire DataFrame
treatments_df.fillna(0, inplace=True)

treatments_df

Unnamed: 0_level_0,carbs,basal,bolus
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2023-06-03 23:35:00,0.0,0.0,0.0
2023-06-04 00:00:00,0.0,0.0,4.0
2023-06-04 03:15:00,0.0,0.0,2.0
2023-06-04 04:30:00,0.0,0.0,2.0
2023-06-04 08:00:00,0.0,0.0,2.0
...,...,...,...
2024-09-28 02:00:00,16.0,0.0,0.0
2024-09-28 03:05:00,0.0,12.0,0.0
2024-09-28 05:30:00,3.0,0.0,0.0
2024-09-28 09:45:00,0.0,0.0,2.0


In [223]:
bg_df.describe()

Unnamed: 0,glucose
count,132404.0
mean,6.6009
std,2.96802
min,2.2
25%,4.540925
50%,6.121654
75%,8.020547
max,20.0


## Combine

We will now make sure both dataframes cover the complete date range at 5-minute intervals and combine them into `combined_df`. We can then turn to the glucose data: checking its consistency, interpolating where suitable, and dropping rows that fall in large gaps in the blood glucose readings.
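
The idea can be seen on a tiny, made-up example. Both frames are reindexed onto one shared 5-minute grid, so every slot exists in both before they are combined. The sketch uses `join` on the shared index rather than `pd.merge`, which is an equivalent choice here, and all values are invented.

```python
import pandas as pd

glucose = pd.DataFrame(
    {"glucose": [7.4, 6.9]},
    index=pd.to_datetime(["2023-06-03 22:30", "2023-06-03 22:40"]),
)
treatments = pd.DataFrame(
    {"carbs": [20.0], "bolus": [2.0]},
    index=pd.to_datetime(["2023-06-03 22:35"]),
)

# One row for every 5-minute slot between the earliest and latest timestamp
full_range = pd.date_range(
    start=min(glucose.index.min(), treatments.index.min()),
    end=max(glucose.index.max(), treatments.index.max()),
    freq="5min",
)

combined = glucose.reindex(full_range).join(treatments.reindex(full_range))

# A slot with no treatment means nothing was taken, so 0 is a safe fill;
# a slot with no glucose reading is unknown, so it stays NaN for now
combined[["carbs", "bolus"]] = combined[["carbs", "bolus"]].fillna(0)
```

The missing glucose value at 22:35 is left as NaN on purpose, so that it can be handled by the gap checks later on.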

In [224]:
# Ensure both dataframes have a complete range of 5-minute intervals
start_time = min(bg_df.index.min(), treatments_df.index.min())
end_time = max(bg_df.index.max(), treatments_df.index.max())
full_range = pd.date_range(start=start_time, end=end_time, freq='5min')

bg_df = bg_df.reindex(full_range)
treatments_df = treatments_df.reindex(full_range)

# Merge the dataframes
combined_df = pd.merge(bg_df, treatments_df, left_index=True, right_index=True, how='outer')

# Explicitly name the index
combined_df.index.name = 'timestamp'

# Handle missing values
combined_df[['carbs', 'basal', 'bolus']] = combined_df[['carbs', 'basal', 'bolus']].fillna(0) # Fill all NaN values with 0

                     carbs  basal  bolus
2023-06-03 22:30:00    NaN    NaN    NaN
2023-06-03 22:35:00    NaN    NaN    NaN
2023-06-03 22:40:00    NaN    NaN    NaN
2023-06-03 22:45:00    NaN    NaN    NaN
2023-06-03 22:50:00    NaN    NaN    NaN
...                    ...    ...    ...
2024-09-28 11:40:00    NaN    NaN    NaN
2024-09-28 11:45:00    NaN    NaN    NaN
2024-09-28 11:50:00    NaN    NaN    NaN
2024-09-28 11:55:00    NaN    NaN    NaN
2024-09-28 12:00:00    NaN    NaN    NaN

[138979 rows x 3 columns]


## Check glucose data consistency

We will now investigate the gaps in the `glucose` column, drop the rows belonging to gaps over 20 minutes, and interpolate all others.
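
Finding gaps comes down to grouping consecutive missing readings into runs and measuring each run's length. The plain-Python sketch below shows that run-length idea on a list in which `None` marks a missing 5-minute slot; the pandas cell that follows achieves the same grouping with `shift` and `cumsum`. The function name and the `max_fill` parameter are illustrative.

```python
from itertools import groupby


def find_long_gaps(values, max_fill=4):
    """Return (start_index, length) for each run of None longer than max_fill."""
    long_gaps = []
    position = 0
    for is_missing, run in groupby(values, key=lambda value: value is None):
        length = sum(1 for _ in run)
        if is_missing and length > max_fill:
            long_gaps.append((position, length))
        position += length
    return long_gaps
```

With `max_fill=4`, a run of five or more consecutive missing slots is reported as a long gap, matching the `len(x) > 4` test used in the notebook.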

In [226]:
combined_df.isna().sum()

glucose    6575
carbs         0
basal         0
bolus         0
dtype: int64

In [227]:
# Identify gaps in glucose readings
combined_df['is_gap'] = combined_df['glucose'].isna()
combined_df['gap_group'] = (combined_df['is_gap'] != combined_df['is_gap'].shift()).cumsum()
gaps = combined_df[combined_df['is_gap']].groupby('gap_group')
gaps_greater_than_20min = gaps.filter(lambda x: len(x) > 4)

number_of_gaps = len(gaps_greater_than_20min['gap_group'].unique())
print(f"Number of gaps greater than 20 minutes: {number_of_gaps}")

Number of gaps greater than 20 minutes: 19


In [228]:
# Save to csv if you wish to inspect for further insight into missing glucose readings in your data
gaps_greater_than_20min.to_csv('data/gaps_over_20min.csv')

In [229]:
if number_of_gaps > 0:
    print("Gaps greater than 20 minutes:")
    print(gaps_greater_than_20min.groupby('gap_group').sum()) # Show each group and size of group

Gaps greater than 20 minutes:
           glucose  carbs  basal  bolus  is_gap
gap_group                                      
100            0.0    0.0    0.0    0.0       5
1014           0.0   40.0    0.0   12.0     131
1448           0.0    0.0    0.0    0.0      12
2076           0.0    0.0    0.0    0.0       9
3838           0.0    0.0    0.0    0.0       5
3948           0.0    0.0    0.0    0.0     483
5554           0.0    0.0    0.0    2.0       5
5592           0.0    0.0    0.0    0.0       9
6548           0.0    0.0    0.0    0.0       5
6608           0.0    0.0    0.0    1.0      10
6780           0.0   14.0    0.0    0.0       7
6806           0.0   35.0    0.0    3.0       6
6866           0.0    0.0    0.0    0.0       6
7184           0.0    0.0    0.0    0.0       8
7508           0.0    0.0    0.0    0.0       5
7712           0.0    0.0    0.0    0.0      23
9048           0.0    0.0    0.0    0.0       8
9368           0.0    0.0    0.0    0.0       5
9452      

## Drop rows with large gaps

We will now drop all of these groups, so that no gap over 20 minutes is interpolated, which helps maintain the integrity of our data. We will then interpolate the remaining gaps.
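
The reason for dropping long gaps before interpolating is that the `limit` argument of `interpolate` does not skip long gaps; it only stops filling after that many consecutive values. The made-up example below shows a six-slot gap being partially filled when the rows are not dropped first.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2023-06-03 22:30", periods=8, freq="5min")
glucose = pd.Series([6.0] + [np.nan] * 6 + [9.5], index=index)

# limit=4 fills at most four consecutive missing values, working forwards
filled = glucose.interpolate(method="time", limit=4)

# Four slots of the six-slot gap were filled; the last two are still missing
remaining = int(filled.isna().sum())
```

Here the first four missing slots receive interpolated values and two stay NaN, which is exactly the half-filled result that dropping the long-gap rows first avoids.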

In [230]:
# Drop the rows in gaps_greater_than_20min from combined_df and interpolate the rest
combined_df_cleaned = combined_df.drop(gaps_greater_than_20min.index)
combined_df_cleaned['glucose'] = combined_df_cleaned['glucose'].interpolate(method='time', limit=4)  # Interpolate gaps up to 20 mins

In [231]:
combined_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 138187 entries, 2023-06-03 22:30:00 to 2024-09-28 12:00:00
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   glucose    138187 non-null  float64
 1   carbs      138187 non-null  float64
 2   basal      138187 non-null  float64
 3   bolus      138187 non-null  float64
 4   is_gap     138187 non-null  bool   
 5   gap_group  138187 non-null  int64  
dtypes: bool(1), float64(4), int64(1)
memory usage: 6.5 MB


In [234]:
combined_df_cleaned.describe()

Unnamed: 0,glucose,carbs,basal,bolus,gap_group
count,138187.0,138187.0,138187.0,138187.0,138187.0
mean,6.59302,0.668737,0.036125,0.109648,5203.090016
std,2.963226,5.269994,0.660076,0.731792,3271.588368
min,2.2,0.0,0.0,0.0,1.0
25%,4.535785,0.0,0.0,0.0,2347.0
50%,6.118622,0.0,0.0,0.0,5085.0
75%,8.012762,0.0,0.0,0.0,7825.0
max,20.0,140.0,15.0,15.0,11487.0


# Anonymise my data

I will now create a `day_of_week` column and a `time` column, then drop the timestamp column so that the sample data provided with this project remains somewhat anonymous. You can skip this section if you're using this tool with your own data, although some of the analysis functions will need altering to work with the actual timestamps. I will try to add a settings variable to the analysis notebook to define whether you're running it with the sample data or with your own data containing timestamps.
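
For a single timestamp, the transformation amounts to keeping the weekday name and the time of day and discarding the calendar date. A minimal standard-library sketch follows; the function name is illustrative, and the fixed list of day names is there only to keep the result independent of locale.

```python
from datetime import datetime

DAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
)


def anonymise_timestamp(timestamp):
    """Reduce a full timestamp to (day_of_week, time_of_day)."""
    # datetime.weekday() numbers the days from Monday = 0 to Sunday = 6
    return DAY_NAMES[timestamp.weekday()], timestamp.time()
```

The weekly and daily patterns survive this step, but the specific dates on which readings were taken do not.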

In [240]:
# Create a 'day_of_week' column from the day names of the DatetimeIndex
combined_df_cleaned['day_of_week'] = combined_df_cleaned.index.day_name()

# Create a 'time' column by extracting the time part of the DatetimeIndex
combined_df_cleaned['time'] = combined_df_cleaned.index.time

# Comment out the next two lines to keep the timestamp in your data
combined_df_cleaned = combined_df_cleaned.reset_index()
combined_df_cleaned = combined_df_cleaned.drop('timestamp', axis=1)

In [241]:
combined_df_cleaned

Unnamed: 0,glucose,carbs,basal,bolus,is_gap,gap_group,day_of_week,time
0,7.440193,0.0,0.0,0.0,False,1,Saturday,22:30:00
1,6.932764,0.0,0.0,0.0,False,1,Saturday,22:35:00
2,6.711706,0.0,0.0,0.0,False,1,Saturday,22:40:00
3,6.377608,0.0,0.0,0.0,True,2,Saturday,22:45:00
4,6.043509,0.0,0.0,0.0,False,3,Saturday,22:50:00
...,...,...,...,...,...,...,...,...
138182,4.422251,0.0,0.0,0.0,True,11486,Saturday,11:40:00
138183,4.303597,0.0,0.0,0.0,False,11487,Saturday,11:45:00
138184,4.205091,0.0,0.0,0.0,False,11487,Saturday,11:50:00
138185,4.200614,0.0,0.0,0.0,False,11487,Saturday,11:55:00


In [242]:
combined_df_cleaned.describe()

Unnamed: 0,glucose,carbs,basal,bolus,gap_group
count,138187.0,138187.0,138187.0,138187.0,138187.0
mean,6.59302,0.668737,0.036125,0.109648,5203.090016
std,2.963226,5.269994,0.660076,0.731792,3271.588368
min,2.2,0.0,0.0,0.0,1.0
25%,4.535785,0.0,0.0,0.0,2347.0
50%,6.118622,0.0,0.0,0.0,5085.0
75%,8.012762,0.0,0.0,0.0,7825.0
max,20.0,140.0,15.0,15.0,11487.0


# Conclusion
With this notebook we have successfully transformed the XDrip+ data backup from raw, unaligned data into a combined dataset ready for analysis. We have done the following:
- Loaded the BgReadings and Treatments tables from SQLite using pandas
- Removed all unnecessary columns from the BgReadings and Treatments tables
- Split the insulin column into basal and bolus dose columns using personal experience (adapt this section if needed)
- Cleaned the data, including:
    1. Renaming columns to more intuitive names
    2. Rounding irregular time intervals into consistent 5-minute intervals and aggregating data appropriately
    3. Clipping and modifying unrealistic data points using common sense and personal experience (adapt this section if needed)
    4. Addressing NaN values using a strict standard of not interpolating any gaps over 20 minutes, ensuring data validity
    5. Merging the two dataframes and handling missing values
    6. A thorough consistency check of the glucose data
    7. Adapting the data to maintain its temporal relationships whilst removing sensitive data
    
## Export your data

If you are running the analysis on your own data, you can export to a CSV file now and begin the analysis. Be aware that this data will span however long your XDrip+ backup covers, not just 90 days like the sample data.


In [243]:
# Export the cleaned, combined frame (assumed to be what `filtered_df` referred to)
combined_df_cleaned.to_csv('data/sample_analysis_data.csv')

### End of Notebook