# 3. Preprocessing JAEB T1-DEXI dataset

The goal of this notebook is to prepare the T1-DEXI data for machine learning: predicting euglycemia during and around exercise.

EXTOD education was a pilot study in which 106 participants were randomly allocated to either standard care or an education programme teaching them how better to manage blood glucose around exercise.

The data used will be demographic, lab, clinical and, where available, physiological.

### Objectives:
1. Clean and combine the exercise diaries
2. Preprocess demographic data
3. Preprocess lab data
4. Prepare CGM data

## 3.0. Import packages

In [1]:
# Import packages
import pandas as pd
from datetime import timedelta as time
import warnings
import preprocess_helper
import functools as ft

warnings.filterwarnings('ignore')

In [2]:
# Set the directory
directory = '../../data/raw_data/helmsley/'

## 3.1. Clean and combine exercise diaries

### 3.1.1. Load files and rename columns

In [3]:
# Read file
exercise = pd.read_csv(directory+'pr_dexi.csv')

In [4]:
# Convert the start time to datetime
exercise['PRSTDTC'] = pd.to_datetime(exercise['PRSTDTC'])

In [5]:
exercise

Unnamed: 0,STUDYID,DOMAIN,USUBJID,PRSEQ,PRTRT,PRCAT,PRSCAT,PRSTDTC,PRTRTC,PLNEXDUR,EXCINTSY,HRSLSTML,EXCOMP,SNKBEFEX,INUSDTDY,BETAGTDY,RESQCARB
0,T1DEXI,PR,1.0,1.000000e+00,Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),2020-05-16 10:03:30,Biking (Indoor or Outdoor),180,1,2 hours,0,1,0,0,0
1,T1DEXI,PR,1.0,2.000000e+00,Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),2020-05-18 10:30:37,Biking (Indoor or Outdoor),88,1,>4 hours,0,1,0,0,0
2,T1DEXI,PR,1.0,3.000000e+00,Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),2020-05-30 09:46:11,Biking (Indoor or Outdoor),228,1,<1 hour,0,1,0,0,0
3,T1DEXI,PR,1.0,4.000000e+00,Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),2020-06-01 13:13:19,Biking (Indoor or Outdoor),87,1,<1 hour,0,1,0,0,0
4,T1DEXI,PR,1.0,5.000000e+00,Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),Biking (Indoor or Outdoor),2020-06-03 10:19:19,Biking (Indoor or Outdoor),135,1,<1 hour,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15195,,,,3.687825e-40,,"Walking, Dog Walking","Walking, Dog Walking",2021-04-07 12:29:33,"Walking, Dog Walking",25,1,<1 hour,0,0,0,0,0
15196,,,,3.687825e-40,,"Walking, Dog Walking","Walking, Dog Walking",2021-04-07 12:29:33,"Walking, Dog Walking",25,1,<1 hour,0,0,0,0,0
15197,,,,3.687825e-40,,"Walking, Dog Walking","Walking, Dog Walking",2021-04-07 12:29:33,"Walking, Dog Walking",25,1,<1 hour,0,0,0,0,0
15198,,,,3.687825e-40,,"Walking, Dog Walking","Walking, Dog Walking",2021-04-07 12:29:33,"Walking, Dog Walking",25,1,<1 hour,0,0,0,0,0


In [6]:
# Select and rename columns (copy so later in-place edits don't act on a view of `exercise`)
exdi = exercise[['USUBJID', 'PRSTDTC', 'PLNEXDUR', 'EXCINTSY', 'PRTRTC']].copy()
exdi.columns = ['ID', 'start_datetime', 'duration', 'intensity', 'type_of_exercise']

In [7]:
exdi.dropna(subset=['ID'], inplace=True)

In [8]:
# ID as int for concatenation
exdi['ID'] = exdi['ID'].astype(int)

In [9]:
exdi.shape

(15104, 5)

In [10]:
len(exdi.ID.unique())

502

### 3.1.2. Adjust date columns

In [10]:
# Calculate end time from start and duration (duration is in minutes)
exdi['finish_datetime'] = exdi['start_datetime'] + pd.to_timedelta(exdi['duration'], unit='m')

In [11]:
# Collect date information from start_datetime
exdi = preprocess_helper.date_preprocessing(exdi, 'start_datetime', False, True, True)

In [12]:
# Drop any instances with no id, start time or duration
exdi = exdi.dropna(subset=['ID', 'start_datetime', 'duration'])

In [13]:
# Keep only bouts with a positive duration
exdi = exdi.loc[exdi['duration']>0]

In [14]:
exdi.shape[0]

15098

In [15]:
def handle_overlaps(group):
    # Sort by start_datetime
    group = group.sort_values(by='start_datetime')
    
    non_overlapping = []
    overlap_count = 0  # Counter to limit the number of printed overlaps
    for _, row in group.iterrows():
        if non_overlapping:
            last_interval = non_overlapping[-1]
            # Check for overlap
            if row['start_datetime'] < last_interval['finish_datetime']:
                # Print the overlapping intervals (limited to a few for brevity)
                if overlap_count < 5:
                    print(f"Overlap detected for ID {group['ID'].iloc[0]}:")
                    print(f"Interval 1: {last_interval['start_datetime']} to {last_interval['finish_datetime']}")
                    print(f"Interval 2: {row['start_datetime']} to {row['finish_datetime']}\n")
                    overlap_count += 1

                # Durations in minutes; total_seconds() also counts whole days, unlike .seconds
                curr_duration = (row['finish_datetime'] - row['start_datetime']).total_seconds() / 60
                last_duration = (last_interval['finish_datetime'] - last_interval['start_datetime']).total_seconds() / 60
                
                # Decide which interval to keep
                if (curr_duration >= 10 and curr_duration <= 120) and (last_duration < 10 or last_duration > 120):
                    non_overlapping[-1] = row
                elif (last_duration >= 10 and last_duration <= 120) and (curr_duration < 10 or curr_duration > 120):
                    continue
                elif curr_duration > last_duration:
                    non_overlapping[-1] = row
            else:
                non_overlapping.append(row)
        else:
            non_overlapping.append(row)
    
    return pd.DataFrame(non_overlapping)

# Apply the function to each group
exdi = exdi.groupby('ID').apply(handle_overlaps).reset_index(drop=True)


Overlap detected for ID 67:
Interval 1: 2020-07-08 09:40:30 to 2020-07-08 11:00:30
Interval 2: 2020-07-08 11:00:24 to 2020-07-08 12:03:24

Overlap detected for ID 67:
Interval 1: 2020-07-20 08:30:46 to 2020-07-20 09:00:46
Interval 2: 2020-07-20 09:00:33 to 2020-07-20 09:30:33

Overlap detected for ID 80:
Interval 1: 2019-12-29 18:59:30 to 2019-12-29 19:13:30
Interval 2: 2019-12-29 19:13:29 to 2019-12-29 19:18:29

Overlap detected for ID 290:
Interval 1: 2020-02-18 18:52:12 to 2020-02-18 19:17:12
Interval 2: 2020-02-18 19:16:59 to 2020-02-18 19:46:59

Overlap detected for ID 317:
Interval 1: 2019-03-24 07:33:30 to 2019-03-24 08:03:30
Interval 2: 2019-03-24 07:57:45 to 2019-03-24 07:59:45

Overlap detected for ID 393:
Interval 1: 2020-03-01 19:47:16 to 2020-03-01 20:07:16
Interval 2: 2020-03-01 20:05:28 to 2020-03-01 20:55:28

Overlap detected for ID 393:
Interval 1: 2020-03-18 19:45:38 to 2020-03-18 20:15:38
Interval 2: 2020-03-18 20:15:12 to 2020-03-18 20:40:12

Overlap detected for ID

In [16]:
exdi.shape[0]

15024
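
A quick sanity check after the overlap handling is to count bouts that still start before the previous bout of the same participant has finished. The sketch below is not part of the original pipeline: `count_remaining_overlaps` is a hypothetical helper, and the toy frame is made up; it only assumes the `ID`, `start_datetime` and `finish_datetime` columns used above.

```python
import pandas as pd

def count_remaining_overlaps(df):
    """Count bouts that start before the previous bout of the same ID has finished."""
    df = df.sort_values(['ID', 'start_datetime'])
    # Finish time of the previous bout within each participant (NaT for the first bout)
    prev_finish = df.groupby('ID')['finish_datetime'].shift()
    # Comparisons against NaT evaluate to False, so first bouts are never counted
    return int((df['start_datetime'] < prev_finish).sum())

# Toy data (made up): participant 1 has one overlapping pair, participant 2 has none
toy = pd.DataFrame({
    'ID': [1, 1, 2],
    'start_datetime': pd.to_datetime(
        ['2020-01-01 10:00', '2020-01-01 10:20', '2020-01-01 10:00']),
    'finish_datetime': pd.to_datetime(
        ['2020-01-01 10:30', '2020-01-01 11:00', '2020-01-01 10:45']),
})

n_overlaps = count_remaining_overlaps(toy)
print(n_overlaps)
```

Running `count_remaining_overlaps(exdi)` after the `groupby(...).apply(handle_overlaps)` step should return 0 if every overlap was resolved.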

### 3.1.3. Type of exercise

In [17]:
# Map each type_of_exercise to a form_of_exercise
exdi['form_of_exercise'] = exdi['type_of_exercise'].apply(preprocess_helper.divide_exercise_into_type)

outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
outdoor chores
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
outdoor chores
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
outdoor chores
packing and loading for travel
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
cleaning/housework
outdoor chores
cleaning/housework
outdoor chores
cleaning/housework
outdoor chores
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
cleaning/housework
outdoor chores
cleaning/housew

In [18]:
# Removing bouts that aren't really exercise
#exdi.dropna(subset=['form_of_exercise'], inplace=True)

In [19]:
exdi.shape

(15024, 11)

## 3.2. Clean and combine CGM files

In [20]:
# Read the lab file, which includes the CGM readings
lb = pd.read_csv(directory+'lb_dexi.csv')

# Select CGM readings (copy so later assignments don't act on a view of `lb`)
cgm = lb.loc[lb['LBCAT']=='CGM'].copy()

# Convert timestamp to datetime
cgm['LBDTC'] = pd.to_datetime(cgm['LBDTC'])

# Select and rename columns 
cgm = cgm[['USUBJID', 'LBDTC', 'LBORRES']]
cgm.columns = ['ID', 'time', 'glc']

# Drop any null 
cgm.dropna(inplace=True)

# Convert glucose from mg/dL to mmol/L
cgm['glc'] = (cgm['glc']/18).round(2)

# ID to int for concatenation
cgm['ID'] = cgm['ID'].astype(int)
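
As a check on the unit conversion, dividing a glucose value in mg/dL by 18 gives its value in mmol/L. The sketch below is illustrative only: `mgdl_to_mmoll` is a hypothetical helper, the example values are made up, and it assumes the raw readings are in mg/dL.

```python
import pandas as pd

# Approximate factor used in the cell above; roughly 18.016 is more exact for glucose
MGDL_PER_MMOLL = 18

def mgdl_to_mmoll(glc_mgdl):
    """Convert a Series of glucose values from mg/dL to mmol/L, rounded to 2 d.p."""
    return (glc_mgdl / MGDL_PER_MMOLL).round(2)

# Made-up example readings in mg/dL
example = pd.Series([70, 126, 180])
converted = mgdl_to_mmoll(example)
print(converted.tolist())
```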


## 3.3. Lab & demographic data

In [21]:
# Read file
demographics = pd.read_csv(directory+'dm_dexi.csv')

# Select and rename columns (copy so later assignments don't act on a view)
demo = demographics[['USUBJID', 'AGE', 'SEX', 'RACE']].copy()
demo.columns = ['ID', 'age', 'sex', 'race']

# Replace m/f values
demo['sex'] = demo['sex'].replace({'M':'male', 'F':'female'})

# ID as int for concatenation
demo['ID'] = demo['ID'].astype(int)

### 3.3.1. Calculate hba1c

In [22]:
# Select the hba1c results from the lab file (copy so later assignments don't act on a view)
hba1c = lb.loc[lb['LBTESTCD']=='HBA1C'].copy()

# Make id int for concatenation
hba1c['USUBJID'] = hba1c['USUBJID'].astype(int)


# Select and rename columns
hba1c = hba1c[['USUBJID', 'LBORRES']]
hba1c.columns = ['ID', 'hba1c']

# Convert from % to mmol/mol
hba1c['hba1c'] = (hba1c['hba1c'] - 2.15) * 10.929
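
The conversion above is the standard linear relationship between HbA1c in % and in mmol/mol. A small worked example, assuming the raw values are percentages (the function names are hypothetical and only for illustration):

```python
def hba1c_percent_to_mmol_mol(percent):
    """Convert HbA1c from % to mmol/mol using the same formula as the cell above."""
    return (percent - 2.15) * 10.929

def hba1c_mmol_mol_to_percent(mmol_mol):
    """Inverse of the conversion, useful for spot-checking converted values."""
    return mmol_mol / 10.929 + 2.15

# An HbA1c of 7.0 % corresponds to roughly 53 mmol/mol
value = hba1c_percent_to_mmol_mol(7.0)
round_trip = hba1c_mmol_mol_to_percent(value)
print(round(value, 1), round(round_trip, 1))
```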



### 3.3.2. Calculate diabetes onset

In [23]:
# Set up list for results
fa_data = []

# Read file
fa = pd.read_sas(directory+'FA.xpt', encoding='utf-8',chunksize=10000, iterator=True)

In [24]:
# Read the file in chunks because it's so massive
for chunk in fa:
    chunk = chunk.loc[chunk['FAOBJ']== "DIABETES ONSET"]
    if chunk.shape[0]==0:
        continue
    fa_data.append(chunk[['USUBJID', 'FASTRESN']])

In [25]:
# Concat all the chunks
fa_data=pd.concat(fa_data)

# Rename columns
fa_data.columns = ['ID', 'years_since_diagnosis']

# Make id int for concatenation
fa_data['ID'] = fa_data['ID'].astype(int)
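
The filter-per-chunk pattern used above can be tried on a small in-memory example. This is only a sketch: it uses `pd.read_csv` with `chunksize` as a stand-in for `pd.read_sas`, and the rows are made up.

```python
import io
import pandas as pd

# Made-up stand-in for the FA file; the real data is read with pd.read_sas
raw = io.StringIO(
    "USUBJID,FAOBJ,FASTRESN\n"
    "1,DIABETES ONSET,12\n"
    "1,OTHER FINDING,3\n"
    "2,OTHER FINDING,5\n"
    "2,DIABETES ONSET,20\n"
)

kept = []
# Only one chunk (here 2 rows) is held in memory at a time
for chunk in pd.read_csv(raw, chunksize=2):
    chunk = chunk.loc[chunk['FAOBJ'] == 'DIABETES ONSET']
    if chunk.empty:
        continue
    kept.append(chunk[['USUBJID', 'FASTRESN']])

# Combine the filtered chunks and rename as in the notebook
onset = pd.concat(kept, ignore_index=True)
onset.columns = ['ID', 'years_since_diagnosis']
print(onset)
```

Filtering inside the loop keeps only the rows of interest, so the concatenated result stays small even when the source file is large.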

### 3.3.3. Calculate BMI

In [26]:
# Set up list for results
vs_data=[]

# Read file
vs = pd.read_sas(directory+'VS.xpt', encoding='utf-8',chunksize=10000, iterator=True)

In [27]:
# Read the file in chunks because it's so massive
for chunk in vs:
    chunk = chunk.loc[(chunk['VSTESTCD']== 'HEIGHT') | (chunk['VSTESTCD']=='WEIGHT')]
    if chunk.shape[0]==0:
        continue
    vs_data.append(chunk[['USUBJID', 'VSTESTCD', 'VSORRES']])

In [28]:
# Concat all the chunks
vs_data= pd.concat(vs_data)

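# Pivot to one row per participant, with one column per measurement
vs_data_adj = vs_data.pivot(index='USUBJID', columns='VSTESTCD', values='VSORRES')

# Pivoted columns are sorted alphabetically: HEIGHT, then WEIGHT
vs_data_adj.columns = ['h', 'w']

# Calculate BMI (the 703 factor assumes weight in lb and height in inches)
vs_data_adj['bmi'] = (703*vs_data_adj['w'])/(vs_data_adj['h']*vs_data_adj['h'])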

# Reset index
vs_data_adj = vs_data_adj.reset_index()

# Select only id and bmi and rename
bmi = vs_data_adj[['USUBJID', 'bmi']]
bmi.columns = ['ID', 'bmi']

# Make id int for concatenation
bmi['ID'] = bmi['ID'].astype(int)
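
The factor of 703 in the BMI calculation implies weight in pounds and height in inches. The sketch below compares that formula with the metric definition (kg/m²) for one made-up participant; the helper names and conversion constants are introduced here for illustration only.

```python
def bmi_imperial(weight_lb, height_in):
    """BMI from weight in pounds and height in inches (same formula as above)."""
    return 703 * weight_lb / height_in ** 2

def bmi_metric(weight_kg, height_m):
    """BMI from weight in kilograms and height in metres."""
    return weight_kg / height_m ** 2

LB_PER_KG = 2.20462   # pounds per kilogram
IN_PER_M = 39.3701    # inches per metre

# Made-up participant: 150 lb, 65 in
imperial = bmi_imperial(150, 65)
metric = bmi_metric(150 / LB_PER_KG, 65 / IN_PER_M)
print(round(imperial, 2), round(metric, 2))
```

If the height and weight in `VS.xpt` are recorded in metric units instead, the 703 factor should be dropped; checking the units column of the file is worthwhile before relying on the result.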

## 3.4. Insulin modality

In [29]:
# Read file
dx = pd.read_sas(directory+'DX.xpt', encoding='utf-8')

In [30]:
# Rename values
insulin_modality = dx[dx['DXCAT']=='INSULIN MODALITY'][['USUBJID', 'DXTRT']].replace({'MULTIPLE DAILY INJECTIONS': 'mdi', 
                                                                                      'INSULIN PUMP':'pump', 
                                                                                      'CLOSED LOOP INSULIN PUMP':'closed_loop'})

In [31]:
# Rename columns
insulin_modality.columns = ['ID', 'insulin_modality']

# Convert ID to int for concatenation
insulin_modality['ID'] = insulin_modality['ID'].astype(int)

## 3.5. Combine all dfs

In [32]:
# List of all dfs
dfs = [demo, hba1c, bmi, fa_data, insulin_modality]

# Merge them all
demo_lab = ft.reduce(lambda left, right: pd.merge(left, right, on='ID', how='outer'), dfs)
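
To illustrate what the chained outer merge does, the sketch below merges three small made-up tables on `ID`. Participants missing from a table get `NaN` in that table's columns rather than being dropped.

```python
import functools as ft
import pandas as pd

# Made-up tables with partially overlapping IDs
demo_toy = pd.DataFrame({'ID': [1, 2, 3], 'age': [30, 41, 25]})
hba1c_toy = pd.DataFrame({'ID': [1, 2], 'hba1c': [53.0, 48.0]})
bmi_toy = pd.DataFrame({'ID': [2, 3, 4], 'bmi': [24.1, 27.3, 22.0]})

# Merge pairwise from left to right, keeping every ID seen in any table
merged = ft.reduce(
    lambda left, right: pd.merge(left, right, on='ID', how='outer'),
    [demo_toy, hba1c_toy, bmi_toy],
)

# Count repeated IDs; more than zero would mean an input had several rows per ID
duplicated_ids = int(merged['ID'].duplicated().sum())
print(merged)
print(duplicated_ids)
```

If any input table holds more than one row per participant (for example, repeated lab results), the merge produces one output row per combination, so checking `duplicated_ids` on `demo_lab` is a cheap safeguard.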

In [33]:
# Prefix IDs with 'helm_' to make them unique
for i in [exdi, cgm, demo_lab]:
    i['ID'] = i['ID'].apply(lambda row: 'helm_'+str(row))

## 3.6. Save files

In [34]:
df_directory = '../../data/tidy_data/helmsley/'
exdi.to_csv(df_directory+'exercise.csv', index=False)
cgm.to_csv(df_directory+'cgm.csv', index=False)
demo_lab.to_csv(df_directory+'demo_lab.csv', index=False)