# CGM Data Quality Analysis Tutorial
## Using cgm-data-processor with XDrip+ Backups

This notebook demonstrates a practical workflow for processing Continuous Glucose Monitoring (CGM) data using the cgm-data-processor tool. We'll walk through loading data from an XDrip+ SQLite backup, performing quality assessments, and exporting the processed data in a standardized format suitable for further analysis.

### Overview

The cgm-data-processor tool simplifies the process of working with CGM data by handling common preprocessing tasks and standardizing the output format. This example focuses on three key aspects:

1. Data Loading: Extracting CGM measurements, carbohydrate records, and insulin data from an XDrip+ SQLite backup
2. Quality Assessment: Evaluating data completeness, identifying gaps, and assessing measurement reliability
3. Standardized Export: Saving the processed data in a consistent CSV format that facilitates further analysis

### Prerequisites

Before running this notebook, ensure you have:
- The cgm-data-processor package installed
- An XDrip+ SQLite backup file
- Basic familiarity with Python and pandas

### Expected Output

The processed dataset will include:
- Glucose measurements aligned to 5-minute intervals
- Values in both mg/dL and mmol/L units
- Validated carbohydrate and insulin records
- Quality metrics for each time period
- Clearly marked data gaps and interpolated values

### Data Quality Considerations

Throughout this tutorial, we'll examine several key quality metrics:
- Measurement frequency and consistency
- Gap duration and distribution
- Sensor reliability indicators
- Record completeness for insulin and carbohydrate data

This quality assessment helps ensure that subsequent analyses are based on reliable data and that any limitations are well understood.
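
As a rough illustration of the "gap duration and distribution" metric, the sketch below summarises runs of consecutive missing readings from a boolean flag series on a regular 5-minute grid. The helper name `summarise_gaps` and its output columns are hypothetical and are not part of cgm-data-processor; the project's own gap analysis may work differently.

```python
import pandas as pd

def summarise_gaps(missing: pd.Series, interval_minutes: int = 5) -> pd.DataFrame:
    """Return one row per run of consecutive missing readings (illustrative helper)."""
    flags = missing.astype(bool)
    # A new run starts whenever the flag value changes
    run_id = (flags != flags.shift()).cumsum()
    rows = []
    for _, run in flags[flags].groupby(run_id[flags]):
        rows.append({"start": run.index[0], "end": run.index[-1], "n_intervals": len(run)})
    gaps = pd.DataFrame(rows, columns=["start", "end", "n_intervals"])
    gaps["duration_min"] = gaps["n_intervals"] * interval_minutes
    return gaps

# Small synthetic example: one 10-minute gap and one 5-minute gap
idx = pd.date_range("2024-01-01", periods=8, freq="5min")
flags = pd.Series([False, True, True, False, False, True, False, False], index=idx)
gap_table = summarise_gaps(flags)
print(gap_table)
```

A table like this makes it easy to see whether missing data is spread across many short dropouts or concentrated in a few long sensor outages.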

### Next Steps

After completing this tutorial, you'll have a standardized dataset ready for various analyses such as:
- Glucose variability assessment
- Meal response patterns
- Insulin sensitivity calculations
- Time-in-range analysis

Let's begin by importing the necessary libraries and setting up our environment.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML, Image
display(HTML("<style>.container { width:95% !important; }</style>")) # Make Jupyter cells wider for better visuals
import pprint

## Project Setup and Module Imports

This notebook relies on a modular codebase organized into three main components:

1. Preprocessing: Handles data loading from XDrip+ backups, cleaning operations, and timeline alignment
2. Analysis: Provides tools for assessing data quality, gap detection, and statistical analysis
3. Visualization: Creates informative dashboards and plots for quality assessment

The following code adds the project root to the Python path and imports the necessary functionality from each module. Each import is organized by its primary function to maintain clarity and facilitate future extensions of the codebase.

In [2]:
# Path modification used to allow Notebook access to src directory
import os
import sys
notebook_path = os.path.abspath('.')
project_root = os.path.abspath(os.path.join(notebook_path, '..', '..'))  # Normalised absolute path to the project root
if project_root not in sys.path:
    sys.path.append(project_root)

# Preprocessing Module - Load, clean and align data
from src.preprocessing.loading import XDrip
from src.preprocessing.cleaning import clean_classify_insulin, clean_classify_carbs, clean_glucose
from src.preprocessing.alignment import align_diabetes_data

# Analysis Module - Check and display data quality
from src.analysis.gaps import analyse_glucose_gaps
from src.analysis.insulin import analyse_insulin_over_time
from src.analysis.metrics import display_quality_metrics

# Visualisation Module - Format data for visual appeal in Jupyter
from src.visualisation.quality_dashboard import create_quality_dashboard
from src.visualisation.meal_statistics_dashboard import create_meal_statistics_dashboard
from src.visualisation.gap_dashboard import create_gap_dashboard

## Data Loading

Here we initialize our data processing pipeline by loading an XDrip+ SQLite backup file. The `XDrip` class provides a clean interface for accessing and processing the raw CGM data.

The SQLite backup file contains the complete dataset including glucose readings, insulin records, and carbohydrate entries. The `XDrip` class handles the low-level database interactions and provides structured access to this data.

Note: When using your own data, replace the `db_path` with the path to your XDrip+ SQLite backup file. XDrip+ backups can be generated from within the XDrip+ application under Settings > Data Source > Export Database.

In [3]:
# Path to your SQLite file
db_path = '../../data/export20240928-130349.sqlite'
data = XDrip(db_path) # Load db path into XDrip class - Class found in src directory

## Initial Data Extraction

In this step, we extract two primary datasets from the XDrip+ backup:

1. Glucose Measurements (`bg_df`): This dataset contains continuous glucose monitoring readings, typically recorded at 5-minute intervals. Each reading includes the glucose value and associated metadata such as the timestamp and reading quality indicators.

2. Treatment Records (`treatment_df`): This dataset encompasses both insulin administration and carbohydrate intake records. The timestamps in this dataset may be irregular as they correspond to specific events rather than continuous monitoring.

Both dataframes are indexed by timestamp, and rows with duplicate timestamps are dropped, which facilitates temporal analysis and alignment in subsequent processing steps. The `XDrip` class handles the SQL queries and initial data structuring, ensuring consistent data types and timestamp handling across the extracted datasets.

Note: These raw dataframes will undergo further processing and quality assessment before being combined into our standardized format. This two-stage loading approach allows us to validate and clean each data type independently before integration.

In [4]:
bg_df = data.load_glucose_df() # Load glucose data into a pandas dataframe - Function found in src directory
treatment_df = data.load_treatment_df() # Load treatment data into a pandas dataframe - Function found in src directory
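
For readers curious about what such a loader might do internally, here is a minimal sketch using `sqlite3` and pandas. It is not the `XDrip` class itself: the helper `load_table_sketch`, the table name `BgReadings`, the column names, and the assumption that timestamps are stored as epoch milliseconds are all illustrative guesses, demonstrated on a throwaway in-memory database.

```python
import sqlite3
import pandas as pd

def load_table_sketch(conn, table, columns, ts_col="timestamp"):
    """Read a table, index it by timestamp and drop duplicate timestamps (illustrative)."""
    query = f"SELECT {', '.join(columns)} FROM {table}"
    df = pd.read_sql_query(query, conn)
    # Assumption: timestamps are stored as milliseconds since the Unix epoch
    df[ts_col] = pd.to_datetime(df[ts_col], unit="ms")
    df = df.drop_duplicates(subset=ts_col, keep="first")
    return df.set_index(ts_col).sort_index()

# Demonstration on an in-memory database with an assumed schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BgReadings (timestamp INTEGER, calculated_value REAL)")
conn.executemany(
    "INSERT INTO BgReadings VALUES (?, ?)",
    [(1727500000000, 110.0), (1727500300000, 112.0), (1727500300000, 999.0)],
)
bg_demo = load_table_sketch(conn, "BgReadings", ["timestamp", "calculated_value"])
conn.close()
print(bg_demo)
```

The third inserted row shares a timestamp with the second, so only the first of the two survives, mirroring the duplicate handling described above.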

## Insulin Data Processing

The `clean_classify_insulin()` function processes the raw treatment records to create a structured insulin dataset. This critical step separates insulin records from other treatments and applies standardization rules specific to insulin data. 

There are two optional parameters that can be supplied to this function:
- `bolus_limit` - Dose threshold in units; insulin doses above this value are classified as basal (default = 8)
- `max_limit` - Maximum plausible dose in units; doses above this value are treated as errors and discarded (default = 15)

The function handles several key aspects of insulin data processing:
- Extracts insulin-specific records from the treatment dataset
- Classifies insulin entries into basal and bolus categories, using either metadata or dosage
- Validates dosage values and units
- Standardizes timestamp formats
- Removes any duplicate or invalid entries
- Sets index to timestamp column

The resulting `insulin_df` is a clean, validated set of insulin records split into basal and bolus doses, with a `labeled_insulin` flag indicating whether the user labeled the entry. It is ready to be integrated into our final standardized format.

In [5]:
insulin_df = clean_classify_insulin(treatment_df) # Function in source directory
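
The threshold logic described above can be sketched roughly as follows. This is not the body of `clean_classify_insulin()`: the function name `classify_insulin_sketch`, the input columns `insulin` and `label`, and the rule that a user-supplied label takes precedence over the dose threshold are assumptions made for illustration.

```python
import pandas as pd

def classify_insulin_sketch(doses: pd.DataFrame,
                            bolus_limit: float = 8.0,
                            max_limit: float = 15.0) -> pd.DataFrame:
    """Split doses into basal/bolus columns (hypothetical input layout).

    `doses` is indexed by timestamp with an 'insulin' column (units) and a
    'label' column containing 'basal', 'bolus' or None.
    """
    # Discard non-positive doses and doses above the plausibility limit
    df = doses[(doses["insulin"] > 0) & (doses["insulin"] <= max_limit)].copy()
    labeled = df["label"].isin(["basal", "bolus"])
    # Assumption: a user label wins; otherwise fall back to the dose threshold
    is_basal = df["label"].eq("basal") | (~labeled & (df["insulin"] > bolus_limit))
    out = pd.DataFrame(index=df.index)
    out["basal"] = df["insulin"].where(is_basal, 0.0)
    out["bolus"] = df["insulin"].where(~is_basal, 0.0)
    out["labeled_insulin"] = labeled
    return out

idx = pd.to_datetime(["2024-01-01 08:00", "2024-01-01 12:00",
                      "2024-01-01 13:00", "2024-01-01 22:00"])
demo = pd.DataFrame({"insulin": [2.0, 12.0, 20.0, 10.0],
                     "label": [None, None, None, "bolus"]}, index=idx)
insulin_demo = classify_insulin_sketch(demo)
print(insulin_demo)
```

In this toy input the 20-unit entry exceeds `max_limit` and is dropped, the unlabeled 12-unit entry is treated as basal, and the labeled 10-unit entry stays a bolus despite being above `bolus_limit`.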

## Carbohydrate Data Processing

After processing insulin records, we now clean and standardize the carbohydrate data using the `clean_classify_carbs()` function. This function implements specific validation rules to ensure data quality and consistency:

1. The function filters for meaningful carbohydrate entries by keeping only records with 1.0 grams or more, eliminating negligible or potentially erroneous entries.

2. It handles duplicate timestamps by keeping only the first entry for any given time, which prevents double-counting of meals while preserving the earliest recorded entry.

3. The resulting `carb_df` contains a simplified structure with just the essential carbohydrate quantities indexed by timestamp, making it ready for integration with our other standardized data streams.

This cleaned carbohydrate dataset will be crucial for analyzing meal-related glucose responses and understanding overall patterns in carbohydrate intake alongside glucose measurements.

In [6]:
carb_df = clean_classify_carbs(treatment_df) # Function in source directory
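
The two validation rules above are simple enough to sketch in a few lines of pandas. The helper name `clean_carbs_sketch` and the assumption that the treatment dataframe exposes a `carbs` column in grams are illustrative; the real function may differ in its details.

```python
import pandas as pd

def clean_carbs_sketch(treatments: pd.DataFrame, min_grams: float = 1.0) -> pd.DataFrame:
    """Keep carb entries of at least `min_grams`, first entry per timestamp (illustrative)."""
    # Rule 1: keep only meaningful entries (missing values are dropped too)
    carbs = treatments.loc[treatments["carbs"] >= min_grams, ["carbs"]]
    # Rule 2: for duplicate timestamps keep the first recorded entry
    return carbs[~carbs.index.duplicated(keep="first")]

idx = pd.to_datetime(["2024-01-01 08:00", "2024-01-01 12:30",
                      "2024-01-01 12:30", "2024-01-01 19:00"])
demo = pd.DataFrame({"carbs": [0.5, 30.0, 45.0, 12.0],
                     "insulin": [0.0, 4.0, 0.0, 1.0]}, index=idx)
carb_demo = clean_carbs_sketch(demo)
print(carb_demo)
```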

## Glucose Data Processing and Standardization

The `clean_glucose()` function performs comprehensive processing of raw CGM data to create a standardized, analysis-ready glucose dataset with consistent time intervals and validated measurements.

### Temporal Standardization
The function first standardizes the temporal aspects of the data by rounding timestamps to 5-minute intervals, which is the standard measurement frequency for most CGM systems. It then creates a complete timeline by generating a continuous 5-minute interval index spanning the entire monitoring period. This ensures we have a consistent temporal structure, even when raw data points are missing or irregularly spaced.
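
A minimal sketch of this kind of temporal standardization is shown below, assuming the readings arrive as a pandas Series indexed by timestamp. The helper name `to_five_minute_grid` is hypothetical, and readings that round to the same slot are averaged here so that the index is unique before reindexing.

```python
import pandas as pd

def to_five_minute_grid(bg: pd.Series) -> pd.Series:
    """Round readings to 5-minute slots and expand to a gap-free timeline (illustrative)."""
    rounded = bg.copy()
    rounded.index = rounded.index.round("5min")
    # Several readings can land in the same slot; average them
    averaged = rounded.groupby(level=0).mean()
    full_index = pd.date_range(averaged.index.min(), averaged.index.max(), freq="5min")
    # Slots without a reading become NaN
    return averaged.reindex(full_index)

ts = pd.to_datetime(["2024-01-01 00:00:40", "2024-01-01 00:04:10",
                     "2024-01-01 00:05:50", "2024-01-01 00:21:00"])
raw = pd.Series([100.0, 110.0, 120.0, 130.0], index=ts)
grid = to_five_minute_grid(raw)
print(grid)
```

The second and third readings both round to 00:05 and are averaged, while the 00:10 and 00:15 slots have no reading and appear as NaN.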

### Gap Handling and Data Quality
A key feature of the processing is its sophisticated handling of data gaps. The function identifies and flags missing data points, creating a 'missing' indicator that allows downstream analyses to distinguish between measured and interpolated values. For gaps up to 20 minutes (four 5-minute intervals), the function applies linear interpolation to estimate glucose values. This approach balances the need for continuous data with the importance of maintaining data integrity.
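
One way to express this gap policy in pandas is sketched below. It assumes a Series already on a regular 5-minute grid and fills only gaps of at most four consecutive intervals, leaving longer gaps entirely empty. Whether `clean_glucose()` treats longer gaps exactly this way is an assumption, and the helper name `flag_and_interpolate` is hypothetical.

```python
import numpy as np
import pandas as pd

def flag_and_interpolate(glucose: pd.Series, max_gap: int = 4) -> pd.DataFrame:
    """Flag missing slots and linearly fill gaps of at most `max_gap` intervals (illustrative)."""
    missing = glucose.isna()
    # Label each run of identical flag values, then measure the length of missing runs
    run_id = (missing != missing.shift()).cumsum()
    run_len = missing.groupby(run_id).transform("sum")
    filled = glucose.interpolate(method="linear", limit_area="inside")
    # Keep measured values and short-gap estimates; blank out long gaps again
    filled = filled.where(~missing | (run_len <= max_gap))
    return pd.DataFrame({"mg_dl": filled, "missing": missing})

idx = pd.date_range("2024-01-01", periods=11, freq="5min")
values = [100.0, np.nan, np.nan, 130.0,
          np.nan, np.nan, np.nan, np.nan, np.nan, 190.0, 200.0]
gap_demo = flag_and_interpolate(pd.Series(values, index=idx))
print(gap_demo)
```

Note that calling `interpolate(limit=4)` on its own would partially fill the start of a longer gap, which is why the sketch measures run lengths explicitly.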

### Measurement Standardization
The function implements several measurement quality controls:
- Glucose values are constrained to a physiologically reasonable range of 39.64 to 360.36 mg/dL (2.2 to 20.0 mmol/L)
- Measurements are provided in both mg/dL and mmol/L units, converted with the standard factor of 0.0555 (mmol/L = mg/dL × 0.0555)
- Where multiple readings exist for a single 5-minute interval, they are averaged to provide a single representative value
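
The range limits and unit conversion can be illustrated with a short sketch. It assumes that "constrained" means out-of-range values are clipped to the limits rather than dropped, which the source does not state explicitly; the helper name `standardise_units` is hypothetical.

```python
import pandas as pd

MGDL_TO_MMOL = 0.0555  # conversion factor quoted in the text

def standardise_units(mg_dl: pd.Series, low: float = 39.64, high: float = 360.36) -> pd.DataFrame:
    """Clip to the stated range and add a parallel mmol/L column (illustrative)."""
    # Assumption: values outside the range are clipped, not discarded
    clipped = mg_dl.clip(lower=low, upper=high)
    return pd.DataFrame({"mg_dl": clipped, "mmol_l": clipped * MGDL_TO_MMOL})

units_demo = standardise_units(pd.Series([20.0, 100.0, 400.0]))
print(units_demo)
```

With this factor, the limits 39.64 and 360.36 mg/dL correspond to roughly 2.2 and 20.0 mmol/L, matching the range given above.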

The resulting dataset includes three essential columns:
1. Glucose measurements in mg/dL
2. Parallel measurements in mmol/L
3. Missing data flags to indicate interpolated values

This processed glucose dataset forms the backbone of our standardized CGM data structure, providing a reliable foundation for subsequent analysis while maintaining transparency about data quality and completeness.

In [7]:
glucose_df = clean_glucose(bg_df) # Function in source directory

In [8]:
aligned_df = align_diabetes_data(glucose_df, carb_df, insulin_df)

  aligned_df['labeled_insulin'] = aligned_df['labeled_insulin'].fillna(False).astype('boolean')
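
The alignment step is not described in detail in this notebook, so the following is only a plausible sketch of how event data could be merged onto the 5-minute glucose grid: event timestamps are rounded to the nearest slot, doses in the same slot are summed, and slots without events are filled with zero or `False`. The function `align_sketch` and these rules are assumptions, not the implementation of `align_diabetes_data()`.

```python
import pandas as pd

def align_sketch(glucose_df, carb_df, insulin_df, freq="5min"):
    """Join carb and insulin events onto a regular glucose timeline (illustrative)."""
    def snap(events, how):
        # Round event timestamps to the grid and combine events sharing a slot
        snapped = events.copy()
        snapped.index = snapped.index.round(freq)
        return snapped.groupby(level=0).agg(how)

    aligned = glucose_df.join(snap(carb_df[["carbs"]], "sum"))
    aligned = aligned.join(snap(insulin_df[["bolus", "basal"]], "sum"))
    aligned = aligned.join(snap(insulin_df[["labeled_insulin"]], "any"))
    dose_cols = ["carbs", "bolus", "basal"]
    aligned[dose_cols] = aligned[dose_cols].fillna(0.0)
    # Convert to the nullable boolean dtype first, then fill slots without events
    aligned["labeled_insulin"] = aligned["labeled_insulin"].astype("boolean").fillna(False)
    return aligned

grid = pd.date_range("2024-01-01 00:00", periods=4, freq="5min")
glucose_demo = pd.DataFrame({"mg_dl": [100.0, 104.0, 109.0, 115.0]}, index=grid)
carb_events = pd.DataFrame({"carbs": [30.0]},
                           index=pd.to_datetime(["2024-01-01 00:04:30"]))
insulin_events = pd.DataFrame({"bolus": [2.0], "basal": [0.0], "labeled_insulin": [True]},
                              index=pd.to_datetime(["2024-01-01 00:09:00.500"]))
aligned_demo = align_sketch(glucose_demo, carb_events, insulin_events)
print(aligned_demo)
```

Casting to `boolean` before calling `fillna(False)` keeps the column in a nullable boolean dtype without relying on implicit downcasting of an object column.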


In [9]:
aligned_df.tail(40)

timestamp,mg_dl,mmol_l,missing,carbs,bolus,basal,labeled_insulin
2024-09-28 08:45:00,107.849155,5.985628,False,0.0,0.0,0.0,False
2024-09-28 08:50:00,106.559631,5.91406,False,0.0,0.0,0.0,False
2024-09-28 08:55:00,106.962607,5.936425,False,0.0,0.0,0.0,False
2024-09-28 09:00:00,108.413322,6.016939,False,0.0,0.0,0.0,False
2024-09-28 09:05:00,110.911775,6.155604,False,0.0,0.0,0.0,False
2024-09-28 09:10:00,113.652014,6.307687,False,0.0,0.0,0.0,False
2024-09-28 09:15:00,117.762373,6.535812,False,0.0,0.0,0.0,False
2024-09-28 09:20:00,123.968208,6.880236,False,0.0,0.0,0.0,False
2024-09-28 09:25:00,130.335234,7.233605,False,0.0,0.0,0.0,False
2024-09-28 09:30:00,136.299284,7.56461,False,0.0,0.0,0.0,False


In [10]:
aligned_df['labeled_insulin'].describe()

count     138979
unique         2
top        False
freq      135889
Name: labeled_insulin, dtype: object

In [11]:
insulin_df['labeled_insulin'].describe()

count     4596
unique       2
top       True
freq      3353
Name: labeled_insulin, dtype: object

In [12]:
aligned_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 138979 entries, 2023-06-03 22:30:00 to 2024-09-28 12:00:00
Freq: 5min
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   mg_dl            138316 non-null  float64
 1   mmol_l           138316 non-null  float64
 2   missing          138979 non-null  bool   
 3   carbs            138979 non-null  float64
 4   bolus            138979 non-null  float64
 5   basal            138979 non-null  float64
 6   labeled_insulin  138979 non-null  boolean
dtypes: bool(1), boolean(1), float64(5)
memory usage: 6.8 MB


In [13]:
insulin_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4596 entries, 2023-06-03 23:58:08.909000 to 2024-09-28 09:44:00.948000
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   basal            4596 non-null   float64
 1   bolus            4596 non-null   float64
 2   labeled_insulin  4596 non-null   bool   
dtypes: bool(1), float64(2)
memory usage: 241.2 KB


In [14]:
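# Filter rows where the insulin entry was labeled by the user
# ('labeled_insulin' is a boolean column with no missing values, so it can be used directly as a mask)
true_labeled_rows = aligned_df[aligned_df['labeled_insulin']]

# Display the filtered rows
print(true_labeled_rows)

# Count the user-labeled rows that contain a basal dose (non-null count per column)
basal_dose = aligned_df[(aligned_df['basal'] > 0) & aligned_df['labeled_insulin']]
print(basal_dose.count())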

                          mg_dl     mmol_l  missing  carbs  bolus  basal  \
2023-06-04 04:30:00  145.498863   8.075187    False    0.0    2.0    0.0   
2023-06-04 14:30:00  127.891029   7.097952    False    0.0    5.0    0.0   
2023-06-05 08:40:00  113.890816   6.320940    False   11.5    2.0    0.0   
2023-06-05 10:25:00  173.369688   9.622018    False    0.0    1.0    0.0   
2023-06-05 11:50:00  169.412673   9.402403    False    0.0    2.0    0.0   
...                         ...        ...      ...    ...    ...    ...   
2024-09-27 18:45:00   80.446765   4.464795    False    0.0    2.0   12.0   
2024-09-28 00:15:00  242.967116  13.484675     True    0.0    5.0    0.0   
2024-09-28 03:05:00   52.238423   2.899232    False    0.0    0.0   12.0   
2024-09-28 09:45:00  146.776668   8.146105    False    0.0    2.0    0.0   
2024-09-28 10:15:00  154.030242   8.548678    False   45.0    4.0    0.0   

                     labeled_insulin  
2023-06-04 04:30:00             True  
2023-06-0

In [15]:
aligned_df[(aligned_df['bolus'] > 0) | (aligned_df['basal'] > 0)]

timestamp,mg_dl,mmol_l,missing,carbs,bolus,basal,labeled_insulin
2023-06-04 00:00:00,88.254916,4.898148,False,0.0,4.0,0.0,False
2023-06-04 03:15:00,98.021656,5.440202,False,0.0,2.0,0.0,False
2023-06-04 04:30:00,145.498863,8.075187,False,0.0,2.0,0.0,True
2023-06-04 08:00:00,114.028257,6.328568,False,0.0,2.0,0.0,False
2023-06-04 14:30:00,127.891029,7.097952,False,0.0,5.0,0.0,True
...,...,...,...,...,...,...,...
2024-09-27 20:40:00,115.989277,6.437405,False,0.0,2.0,0.0,False
2024-09-28 00:15:00,242.967116,13.484675,True,0.0,5.0,0.0,True
2024-09-28 03:05:00,52.238423,2.899232,False,0.0,0.0,12.0,True
2024-09-28 09:45:00,146.776668,8.146105,False,0.0,2.0,0.0,True


In [16]:
aligned_df.to_csv('../../data/complete.csv')
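
When the exported file is loaded again for later analysis, the timestamp index has to be restored explicitly. The sketch below shows one way to do this with `pd.read_csv`; the helper name `read_processed_csv` is hypothetical, and an in-memory buffer with synthetic values stands in for the real file.

```python
import io
import pandas as pd

def read_processed_csv(source):
    """Load an exported CSV with the first column parsed as a datetime index (illustrative)."""
    return pd.read_csv(source, index_col=0, parse_dates=True)

# Round trip on a small synthetic frame instead of the real export
idx = pd.date_range("2024-09-28 08:45", periods=3, freq="5min")
demo = pd.DataFrame({"mg_dl": [107.8, 106.6, 107.0],
                     "missing": [False, False, True]}, index=idx)
buffer = io.StringIO(demo.to_csv())
restored = read_processed_csv(buffer)
print(restored.dtypes)
```

For the real export, the same helper would be called with the path passed to `to_csv` above.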

In [17]:
aligned_df[aligned_df['labeled_insulin'] == True]

timestamp,mg_dl,mmol_l,missing,carbs,bolus,basal,labeled_insulin
2023-06-04 04:30:00,145.498863,8.075187,False,0.0,2.0,0.0,True
2023-06-04 14:30:00,127.891029,7.097952,False,0.0,5.0,0.0,True
2023-06-05 08:40:00,113.890816,6.320940,False,11.5,2.0,0.0,True
2023-06-05 10:25:00,173.369688,9.622018,False,0.0,1.0,0.0,True
2023-06-05 11:50:00,169.412673,9.402403,False,0.0,2.0,0.0,True
...,...,...,...,...,...,...,...
2024-09-27 18:45:00,80.446765,4.464795,False,0.0,2.0,12.0,True
2024-09-28 00:15:00,242.967116,13.484675,True,0.0,5.0,0.0,True
2024-09-28 03:05:00,52.238423,2.899232,False,0.0,0.0,12.0,True
2024-09-28 09:45:00,146.776668,8.146105,False,0.0,2.0,0.0,True
