# 7. Analysis of Participant Characteristics in Exercise Studies

This Jupyter notebook summarizes participant characteristics across several exercise studies. It covers demographic data, exercise bout data, and metrics of glycemic control derived from continuous glucose monitoring (CGM) data.

### Structure of the Notebook:

1. Import Packages and Upload Data
   - The Python packages needed for data processing and statistical analysis are imported.
   - Raw data files are uploaded from specified directories, and an initial dataset df is created, which is then extended with id_key and race data from the DEXI and DEXIP studies.
   - The data directory for each study is specified.

2. Data Dictionary
   - The data is described in terms of its structure and content.
   - Data types, missing values, and basic statistics (min, max, quartiles) are calculated for numeric columns.
   - Unique values and categories are identified for categorical columns.

3. Statistical Analysis of Demographic and Exercise Bout Data
   - Lists of categorical and numeric variables related to exercise bouts and demographics are defined.
   - Summary statistics, including medians and interquartile ranges, are computed for each study and for all studies combined.

4. Metrics of Glycemic Control
   - Glycemic metrics for each study (T1-DEXI, T1-DEXIP, EXT-101, EXT-edu) are calculated using the metrics.all_standard_metrics function.
   - Results from the individual studies are combined into a single dataset, all_metrics.
   - Summary statistics for these metrics are computed for each study and for the combined dataset.

5. Compilation and Presentation of Results
   - The results from the demographic/exercise bout analysis and the glycemic control metrics analysis are combined.
   - A final table, ci_table, is created, which gives a side-by-side overview of the results across all studies.
   - This table is then saved to a CSV file for external use.

### Data Sources:

- Demographics data: demographics_df.csv
- ID key for the DEXIP study: id_key_dexip.csv
- Race data for the DEXI and DEXIP studies: dexi_race.csv, dexip_race.csv
- CGM data for each study: located in the study-specific directories defined below.

### Objective:

The goal of this analysis is to describe participant characteristics across the different studies in terms of demographics, exercise bout information, and glycemic control metrics. Presenting these side by side makes it easier to compare studies and to spot patterns relating exercise to glycemia.

## 7.0 Import packages and upload data

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, date, timedelta
from functools import reduce
import preprocess_helper
import sys
path = "../../diametrics"
sys.path.append(path)
import metrics

import warnings
warnings.filterwarnings('ignore')


In [2]:
# Directories
directory_101 = '../../data/tidy_data/extod_101/'
directory_edu = '../../data/tidy_data/extod_edu/'
directory_helm = '../../data/tidy_data/helmsley/'
directory_dexip = '../../data/tidy_data/dexip/'

In [3]:
# Upload df 
df = pd.read_csv('../../data/tidy_data/demographics_df.csv')

In [4]:
df.shape

(16490, 414)

# Upload ID key and merge with df
id_key = pd.read_csv('../1_preprocessing/id_key_dexip.csv').drop_duplicates()
df = df.merge(id_key, on='bout_id', how='inner')

id_key
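Because the merge above uses `how='inner'`, any bout without a match in the ID key is dropped without warning. The following is a minimal sketch of how that row loss could be audited, using toy frames rather than the real data; the column `participant` and all values are invented for illustration, and the `one_to_one` check assumes one row per `bout_id` on each side, which the notebook does not state.

```python
import pandas as pd

# Toy stand-ins for df and id_key; values are illustrative only
bouts = pd.DataFrame({"bout_id": ["a_1", "a_2", "b_1"], "duration": [20, 15, 25]})
key = pd.DataFrame({"bout_id": ["a_1", "b_1"], "participant": ["a", "b"]})

# An outer merge with indicator=True labels each row by where it was found
audit = bouts.merge(key, on="bout_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

# validate raises MergeError if bout_id is duplicated on either side
merged = bouts.merge(key, on="bout_id", how="inner", validate="one_to_one")
```

Inspecting `unmatched` before switching to the inner merge shows exactly which bouts would be lost.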

## 7.1. Data dictionary

In [5]:
# Convert ordinal variables to str
cols_to_convert = ['intensity', 'day_of_week'] #'season', 
for col in cols_to_convert:
    df[col] = df[col].astype(str)
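One side effect of `astype(str)` is worth knowing: depending on the pandas version, missing values can be turned into the literal string `'nan'`, after which `isnull()` no longer counts them. The `intensity` row in the categorical table further down (zero missing values, but `nan` listed as a category) is consistent with that. A minimal sketch of one way to keep missing values missing, on a toy series:

```python
import numpy as np
import pandas as pd

intensity = pd.Series([0.0, 1.0, np.nan])

# Convert to strings, then restore NaN wherever the original value was missing
as_str_keep_na = intensity.astype(str).where(intensity.notna())

n_missing = int(as_str_keep_na.isnull().sum())
```

With this variant, the missing-value count in the data dictionary would reflect the original column rather than the stringified one.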

In [6]:
# Calculate relevant info for data dictionary
numeric_dict = {}
categorical_dict = {}

for column in df.columns:
    missing_values = df[column].isnull().sum()

    if df[column].dtype in ['int64', 'float64']:  # Numeric columns
        numeric_dict[column] = {
            'data_type': str(df[column].dtype),
            'missing_values': missing_values,
            'min': df[column].min(),
            'Q1': df[column].quantile(0.25),
            'Q2': df[column].median(),
            'Q3': df[column].quantile(0.75),
            'max': df[column].max()
        }
    else:  # Categorical columns
        categorical_dict[column] = {
            'data_type': str(df[column].dtype),
            'unique_values': df[column].nunique(),
            'missing_values': missing_values,
            'categories': df[column].unique()#.tolist()
        }
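The dtype test in the loop above only recognizes `int64` and `float64`, so a column stored as `int32` or `float32`, for example, would be treated as categorical. As a hedged alternative, the sketch below splits columns with `select_dtypes` and builds the numeric summary with `describe()`; the toy frame and its column values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": pd.Series([30, 41, 25], dtype="int32"),
    "hba1c": [6.9, 7.4, None],
    "sex": ["F", "M", "F"],
})

# include="number" matches every integer and float width
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.columns.difference(numeric_cols).tolist()

# describe() gives min, quartiles and max in one call; missing counts are added alongside
numeric_summary = toy[numeric_cols].describe().T[["min", "25%", "50%", "75%", "max"]]
numeric_summary["missing_values"] = toy[numeric_cols].isnull().sum()
```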


In [7]:
# Save categorical data
cat_data = pd.DataFrame(categorical_dict).T
cat_data.to_csv('../../results/data_dictionary/categorical_data.csv')

In [8]:
cat_data

Unnamed: 0,data_type,unique_values,missing_values,categories
ID,object,835,0,"[dexip_100, dexip_101, dexip_102, dexip_103, d..."
start_datetime,object,16373,0,"[2021-08-10 13:06:00, 2021-08-10 15:32:22, 202..."
finish_datetime,object,16387,0,"[2021-08-10 13:26:00, 2021-08-10 15:47:22, 202..."
intensity,object,4,0,"[0.0, 1.0, 2.0, nan]"
type_of_exercise,object,328,0,"[Walking, Running/Jogging, Video Games, WALKIN..."
day_of_week,object,7,0,"[1, 2, 3, 4, 5, 6, 0]"
time_of_day,object,3,0,"[afternoon, evening, morning]"
form_of_exercise,object,3,0,"[aer, mix, ana]"
study,object,4,0,"[dexip, ext_101, ext_edu, helm]"
bout_id,object,16471,0,"[dexip_100_20210610130600, dexip_100_202132101..."


In [9]:
num_data = pd.DataFrame(numeric_dict).T

In [10]:
num_data.to_csv('../../results/data_dictionary/numeric_data.csv')

## 7.2. Statistical analysis

### 7.2.1. Demographic and exercise bout data

In [11]:
# Categorical variables relating to exercise bout
cat_bout = [
    'intensity',
    'day_of_week',
    'time_of_day',
    'form_of_exercise',
]

# Numeric variables relating to exercise bout
numeric_bout = [
    'duration',
    'start_glc',
    'start_roc',
]

In [12]:
# Categorical variables relating to demographics
cat_demo = [
    'sex',
    'insulin_modality',
    'race'
]

# Numeric variables relating to demographics
numeric_demo = [
    'age',
    'hba1c',
    'bmi',
    'years_since_diagnosis',
    ]

In [13]:
# Calculate relevant info for each study
overview = df.groupby('study').apply(lambda group: preprocess_helper.overview_results(group,  numeric_demo, cat_demo, numeric_bout, cat_bout))
overview = overview.reset_index().drop(columns='level_1')

   index         ID       start_datetime      finish_datetime intensity  \
0      0  dexip_100  2021-08-10 13:06:00  2021-08-10 13:26:00       0.0   
1      1  dexip_100  2021-08-10 15:32:22  2021-08-10 15:47:22       0.0   
2      2  dexip_100  2021-08-10 19:11:00  2021-08-10 19:36:00       0.0   
3      3  dexip_100  2021-08-11 12:58:00  2021-08-11 13:13:00       0.0   
4      4  dexip_100  2021-08-11 16:35:00  2021-08-11 16:50:00       0.0   

  type_of_exercise  month  day day_of_week time_of_day  ...  \
0          Walking      8   10           1   afternoon  ...   
1  Running/Jogging      8   10           1   afternoon  ...   
2      Video Games      8   10           1     evening  ...   
3  Running/Jogging      8   11           2   afternoon  ...   
4  Running/Jogging      8   11           2   afternoon  ...   

   glc__fourier_entropy__bins_5 glc__fourier_entropy__bins_10  \
0                      0.955700                      1.153742   
1                      0.410116         

In [14]:
# Calculate relevant info for all studies combined
all_studies = preprocess_helper.overview_results(df,  numeric_demo, cat_demo, numeric_bout, cat_bout)
all_studies['study'] = 'all'
all_results = pd.concat([all_studies, overview])


In [15]:
# Rename cols and reset index
all_results.columns = ['Feature', 'Result (Median [IQR])', 'Study']
all_results = all_results.reset_index(drop=True)

### 7.2.2 Metrics of glycemic control 

# T1-DEXI metrics
helmsley_cgm = pd.read_csv(directory_helm + 'cgm.csv')
helmsley_cgm['time'] = pd.to_datetime(helmsley_cgm['time'])
helm_metrics = metrics.all_standard_metrics(helmsley_cgm)

# T1-DEXIP metrics
dexip_cgm = pd.read_csv(directory_dexip + 'cgm.csv')
dexip_cgm['time'] = pd.to_datetime(dexip_cgm['time'])
dexip_metrics = metrics.all_standard_metrics(dexip_cgm)

# EXT-101 metrics
extod_101_cgm = pd.read_csv(directory_101 + 'cgm.csv')
extod_101_cgm['time'] = pd.to_datetime(extod_101_cgm['time'])
ext_101_metrics = metrics.all_standard_metrics(extod_101_cgm)

# EXT-edu metrics
extod_edu_cgm = pd.read_csv(directory_edu + 'cgm.csv')
extod_edu_cgm['time'] = pd.to_datetime(extod_edu_cgm['time'])
ext_edu_metrics = metrics.all_standard_metrics(extod_edu_cgm)

# Combine all studies
all_metrics = pd.concat([helm_metrics, dexip_metrics, ext_edu_metrics, ext_101_metrics])

# Save the metrics
all_metrics.to_csv('../../results/metrics_glycemic_control.csv')

In [16]:
all_metrics = pd.read_csv('../../results/metrics_glycemic_control.csv')

In [17]:
all_metrics

Unnamed: 0.1,Unnamed: 0,ID,Average glucose (mmol/L),eA1c (%),SD (mmol/L),CV (%),AUC (mmol h/L),LBGI,HBGI,MAGE (mmol/L),...,Number LV2 hypoglycemic events,Number prolonged hypoglycemic events,Avg. length of hypoglycemic events,Total time spent in hypoglycemic events,Total number hyperglycemic events,Number LV1 hyperglycemic events,Number LV2 hyperglycemic events,Number prolonged hyperglycemic events,Avg. length of hyperglycemic events,Total time spent in hyperglycemic events
0,0,helm_1,8.952967,7.259728,3.607006,40.288382,8.174627,0.836351,7.821003,8.225111,...,0,0,0 days 00:41:09,0 days 17:50:00,82,53,29,11,0 days 02:50:48,9 days 17:25:30
1,1,helm_1000,7.862634,6.573983,2.237376,28.455809,7.172314,0.332605,3.676388,5.506297,...,0,0,0 days 00:30:00,0 days 00:59:59,47,39,8,1,0 days 02:27:02,4 days 19:10:15
2,2,helm_1004,8.334872,6.870989,2.654680,31.850281,7.612339,0.800243,5.128033,6.160585,...,4,0,0 days 01:00:57,0 days 21:20:05,69,62,7,3,0 days 02:18:16,6 days 14:59:51
3,3,helm_1010,7.465760,6.324377,3.168540,42.440956,6.760118,2.282610,4.340416,6.931330,...,18,2,0 days 00:57:30,2 days 05:40:10,49,38,11,6,0 days 02:34:00,5 days 05:45:43
4,4,helm_1012,7.266100,6.198805,2.076445,28.577160,6.575361,0.716802,2.528263,5.060042,...,3,0,0 days 00:38:03,0 days 11:24:57,43,40,3,0,0 days 01:50:14,3 days 06:59:49
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
853,37,ext_101_3043,7.423824,6.298002,2.237951,30.145526,5.410065,0.650862,2.936287,5.427980,...,4,0,0 days 01:00:27,1 days 23:21:00,175,155,20,2,0 days 02:06:16,15 days 08:16:00
854,38,ext_101_3044,7.311511,6.227365,2.613584,35.746155,5.312929,1.786254,3.313367,6.443787,...,33,0,0 days 01:21:20,4 days 08:23:00,99,87,12,1,0 days 02:44:18,11 days 07:06:00
855,39,ext_101_3045,8.576277,7.022816,3.969074,46.279681,6.333879,2.439792,7.477813,9.405907,...,63,14,0 days 01:28:06,10 days 18:25:00,331,198,133,37,0 days 02:45:50,38 days 02:52:00
856,40,ext_101_3046,6.940633,5.994109,3.315082,47.763393,5.119489,5.290168,3.777357,7.447610,...,124,55,0 days 02:50:37,22 days 12:18:00,166,123,43,14,0 days 04:27:12,30 days 19:16:00
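Two of the columns above can be cross-checked by hand. The displayed values are consistent with CV = SD / mean × 100 and with the ADAG regression eA1c = (mean glucose in mmol/L + 2.59) / 1.59. The sketch below reproduces the first row; it is a consistency check on the printed numbers, not a statement of how `metrics.all_standard_metrics` is implemented, and both function names are hypothetical.

```python
def cv_percent(mean_glucose, sd_glucose):
    """Coefficient of variation as a percentage of the mean."""
    return sd_glucose / mean_glucose * 100


def ea1c_percent(mean_glucose_mmol):
    """Estimated A1c (%) from mean glucose in mmol/L (ADAG regression)."""
    return (mean_glucose_mmol + 2.59) / 1.59


# First row of the table above (helm_1)
mean_glc, sd_glc = 8.952967, 3.607006
print(round(cv_percent(mean_glc, sd_glc), 2))  # 40.29
print(round(ea1c_percent(mean_glc), 2))        # 7.26
```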


In [18]:
# Drop columns that won't be included in table
cols = ['Avg. length of hypoglycemic events',
        'Total time spent in hypoglycemic events',
        'Avg. length of hyperglycemic events',
        'Total time spent in hyperglycemic events',
        'ID',
        'Unnamed: 0']
all_metrics_dropped = all_metrics.drop(columns=cols)
remaining_cols = all_metrics_dropped.columns

In [19]:
len(all_metrics.ID.unique())

858

In [20]:
# Calculate median [IQR] glycemic metrics for all participants
all_metrics_iqr = preprocess_helper.col_iqr(all_metrics, remaining_cols)
all_metrics_iqr['Study'] = 'all'

# Calculate median [IQR] glycemic metrics for T1-DEXI (helm) participants
helm_metrics_iqr = preprocess_helper.col_iqr(all_metrics[all_metrics['ID'].str.startswith('helm')], remaining_cols)
helm_metrics_iqr['Study'] = 'helm'

# Calculate median [IQR] glycemic metrics for T1-DEXIP participants
dexip_metrics_iqr = preprocess_helper.col_iqr(all_metrics[all_metrics['ID'].str.startswith('dexip')], remaining_cols)
dexip_metrics_iqr['Study'] = 'dexip'

# Calculate median [IQR] glycemic metrics for EXT-edu participants
ext_edu_metrics_iqr = preprocess_helper.col_iqr(all_metrics[all_metrics['ID'].str.startswith('ext_edu')], remaining_cols)
ext_edu_metrics_iqr['Study'] = 'ext_edu'

# Calculate median [IQR] glycemic metrics for EXT-101 participants
ext_101_metrics_iqr = preprocess_helper.col_iqr(all_metrics[all_metrics['ID'].str.startswith('ext_101')], remaining_cols)
ext_101_metrics_iqr['Study'] = 'ext_101'

Average glucose (mmol/L)
8.3 [7.45, 9.55]
eA1c (%)
6.85 [6.31, 7.64]
SD (mmol/L)
2.88 [2.34, 3.57]
CV (%)
34.21 [29.98, 38.42]
AUC (mmol h/L)
7.43 [6.7, 8.58]
LBGI
0.7 [0.37, 1.13]
HBGI
5.37 [3.23, 9.04]
MAGE (mmol/L)
6.98 [5.73, 8.6]
TIR normal (%)
72.18 [57.2, 82.13]
TIR normal 1 (%)
46.16 [33.2, 59.17]
TIR normal 2 (%)
22.88 [18.81, 26.56]
TIR level 1 hypoglycemia (%)
1.98 [0.84, 3.67]
TIR level 2 hypoglycemia (%)
0.3 [0.08, 0.79]
TIR level 1 hyperglycemia (%)
18.82 [11.78, 25.12]
TIR level 2 hyperglycemia (%)
4.74 [1.43, 11.99]
Total number hypoglycemic events
12.0 [5.0, 25.0]
Number LV1 hypoglycemic events
10.0 [4.0, 20.0]
Number LV2 hypoglycemic events
2.0 [0.0, 4.0]
Number prolonged hypoglycemic events
0.0 [0.0, 0.0]
Total number hyperglycemic events
39.0 [24.0, 60.0]
Number LV1 hyperglycemic events
27.0 [15.25, 44.0]
Number LV2 hyperglycemic events
11.0 [5.0, 18.0]
Number prolonged hyperglycemic events
3.0 [1.0, 6.0]
Average glucose (mmol/L)
7.94 [7.24, 8.77]
eA1c (%)
6.62 [6.1
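The source of `preprocess_helper.col_iqr` is not shown in this notebook. Judging only from the printed strings, each metric is reported as `median [Q1, Q3]`. The sketch below shows one way such a summary could be produced; `median_iqr` and `summarize_columns` are hypothetical names, the rounding to two decimals is an assumption, and the toy values are invented.

```python
import pandas as pd


def median_iqr(series, decimals=2):
    """Format a numeric series as 'median [Q1, Q3]'."""
    q1, med, q3 = (float(series.quantile(q)) for q in (0.25, 0.5, 0.75))
    return f"{round(med, decimals)} [{round(q1, decimals)}, {round(q3, decimals)}]"


def summarize_columns(frame, columns, decimals=2):
    """Return a long table with one 'median [Q1, Q3]' string per column."""
    rows = [(col, median_iqr(frame[col], decimals)) for col in columns]
    return pd.DataFrame(rows, columns=["Feature", "Result (Median [IQR])"])


toy = pd.DataFrame({"CV (%)": [30.0, 34.0, 38.0, 42.0, 46.0]})
summary = summarize_columns(toy, ["CV (%)"])
print(summary.iloc[0, 1])  # 38.0 [34.0, 42.0]
```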

In [21]:
# Combine all the studies cgm data
overview_cgm = pd.concat([all_metrics_iqr, helm_metrics_iqr, dexip_metrics_iqr, ext_edu_metrics_iqr, ext_101_metrics_iqr])

# Rename cols
overview_cgm.columns = ['Feature', 'Result (Median [IQR])', 'Study']


### 7.2.3. Combine datasets

In [22]:
# Add CGM results to demographic and exercise data
all_results = pd.concat([all_results, overview_cgm])

In [23]:
# Pivot to a wide table: one row per feature, one column per study
ci_table = all_results.pivot(columns='Feature', values='Result (Median [IQR])', index='Study').T

In [24]:
ci_table_ordered = ci_table[['all', 'helm', 'dexip', 'ext_101', 'ext_edu']].reset_index()
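The reshaping in the two cells above goes from a long table (one row per feature and study) to a wide one (one column per study). A minimal sketch on toy data follows; the feature names are reused from the variable lists, but every value is invented.

```python
import pandas as pd

long_results = pd.DataFrame({
    "Feature": ["age", "bmi", "age", "bmi"],
    "Result (Median [IQR])": ["30 [25, 40]", "24 [22, 27]", "14 [12, 16]", "21 [19, 23]"],
    "Study": ["all", "all", "dexip", "dexip"],
})

# pivot needs each (Study, Feature) pair to be unique, otherwise it raises ValueError
wide = long_results.pivot(index="Study", columns="Feature",
                          values="Result (Median [IQR])").T

# Selecting columns by name fixes the study order; reset_index turns Feature into a column
ordered = wide[["all", "dexip"]].reset_index()
```

If the same feature were summarized twice for one study, `pivot` would fail, which makes it a useful guard against duplicated rows in the combined results.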

In [25]:
# Conversion function
def convert_to_mg_per_dl(value_str):
    # Conversion factor from mmol/L to mg/dL
    conversion_factor = 18.01559
    
    # Extract the main value and the range values
    main_value = float(value_str.split('[')[0].strip())
    range_values = [float(val) for val in value_str.split('[')[1].replace(']', '').split(',')]
    
    # Convert the values
    main_value_mg_per_dl = round(main_value * conversion_factor, 1)
    range_values_mg_per_dl = [round(val * conversion_factor, 1) for val in range_values]
    
    # Format the result
    result = "{} [{} , {}]".format(main_value_mg_per_dl, range_values_mg_per_dl[0], range_values_mg_per_dl[1])
    
    return result

# List of features to convert
features_to_convert = ["Average glucose (mmol/L)", "SD (mmol/L)", "start_glc"]

# Columns to apply the conversion to
cols_to_convert = ['all', 'helm', 'dexip', 'ext_101', 'ext_edu']  # Add other columns as needed

# Apply the function to the specified rows and columns
for feature in features_to_convert:
    ci_table_ordered.loc[ci_table_ordered['Feature'] == feature, cols_to_convert] = ci_table_ordered.loc[ci_table_ordered['Feature'] == feature, cols_to_convert].applymap(convert_to_mg_per_dl)

print(ci_table_ordered)

Study                                Feature  \
0                             AUC (mmol h/L)   
1                   Average glucose (mmol/L)   
2                                     CV (%)   
3                                       HBGI   
4                                       LBGI   
5                              MAGE (mmol/L)   
6            Number LV1 hyperglycemic events   
7             Number LV1 hypoglycemic events   
8            Number LV2 hyperglycemic events   
9             Number LV2 hypoglycemic events   
10     Number prolonged hyperglycemic events   
11      Number prolonged hypoglycemic events   
12                               SD (mmol/L)   
13             TIR level 1 hyperglycemia (%)   
14              TIR level 1 hypoglycemia (%)   
15             TIR level 2 hyperglycemia (%)   
16              TIR level 2 hypoglycemia (%)   
17                            TIR normal (%)   
18                          TIR normal 1 (%)   
19                          TIR normal 2

In [26]:
ci_table_ordered

Study,Feature,all,helm,dexip,ext_101,ext_edu
0,AUC (mmol h/L),"7.43 [6.7, 8.58]","7.21 [6.59, 7.99]","7.96 [7.05, 9.01]","6.51 [5.83, 7.26]","9.28 [8.01, 9.96]"
1,Average glucose (mmol/L),"149.5 [134.2 , 172.0]","143.0 [130.4 , 158.0]","158.2 [139.8 , 180.2]","158.5 [142.3 , 177.6]","185.2 [162.1 , 200.0]"
2,CV (%),"34.21 [29.98, 38.42]","32.96 [29.09, 36.87]","34.58 [30.79, 39.23]","40.34 [35.88, 46.18]","37.84 [34.76, 42.71]"
3,HBGI,"5.37 [3.23, 9.04]","4.34 [2.72, 6.64]","6.69 [4.12, 10.12]","7.46 [5.29, 11.25]","11.54 [7.53, 14.51]"
4,LBGI,"0.7 [0.37, 1.13]","0.72 [0.42, 1.17]","0.6 [0.27, 0.99]","1.03 [0.58, 2.43]","0.65 [0.33, 1.35]"
5,MAGE (mmol/L),"6.98 [5.73, 8.6]","6.46 [5.38, 7.56]","7.56 [6.08, 9.08]","8.29 [7.46, 9.93]","9.35 [8.3, 10.72]"
6,Number LV1 hyperglycemic events,"27.0 [15.25, 44.0]","39.0 [27.0, 48.0]","15.0 [11.0, 20.0]","124.0 [64.75, 198.0]","14.5 [8.0, 19.25]"
7,Number LV1 hypoglycemic events,"10.0 [4.0, 20.0]","16.0 [7.0, 26.0]","5.0 [2.0, 9.0]","43.5 [21.75, 81.75]","3.0 [2.0, 6.25]"
8,Number LV2 hyperglycemic events,"11.0 [5.0, 18.0]","11.0 [4.0, 20.0]","9.0 [4.0, 13.0]","64.5 [35.25, 130.0]","13.0 [8.75, 16.0]"
9,Number LV2 hypoglycemic events,"2.0 [0.0, 4.0]","2.0 [0.0, 5.0]","0.0 [0.0, 2.0]","19.0 [6.25, 45.75]","2.0 [0.0, 4.0]"
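The mmol/L to mg/dL conversion from cell 25 can be spot-checked against the table. The sketch below is a hypothetical variant, `convert_cell`, that pulls the three numbers out with a regular expression and rejects malformed input; it uses the same factor and the same output format as the notebook's function. Applied to the pre-conversion value for all participants, `8.3 [7.45, 9.55]`, it reproduces the `Average glucose` entry in the `all` column.

```python
import re

MMOL_TO_MGDL = 18.01559  # same factor as in the notebook's conversion function


def convert_cell(value_str, factor=MMOL_TO_MGDL, decimals=1):
    """Convert every number in a 'median [Q1, Q3]' string and rebuild the string."""
    numbers = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", value_str)]
    if len(numbers) != 3:
        raise ValueError(f"expected 'median [Q1, Q3]', got {value_str!r}")
    med, q1, q3 = (round(n * factor, decimals) for n in numbers)
    return f"{med} [{q1} , {q3}]"


print(convert_cell("8.3 [7.45, 9.55]"))  # 149.5 [134.2 , 172.0]
```

Note that the converted rows keep their original `(mmol/L)` feature labels, so the unit in the label no longer matches the values in those rows.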


In [27]:
# Save table
ci_table.to_csv('../../results/demographics_table.csv')