<h1><b>Gestational diabetes

<h2><b>1. Data collection

<h3><b> 1.1 Get Microbiome Data

In [59]:
%pip install gdown --quiet
%pip install plotly --quiet

Note: you may need to restart the kernel to use updated packages.


In [60]:
import re
import gdown
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import sklearn

In [61]:
# Download microbiome data provided by a research laboratory in Israel
file_id = '1kt7l75LsKXrQjykGQLppLRIpDrb2tp4D'
url = f'https://drive.google.com/uc?id={file_id}'
gdown.download(url, 'microbiom_data.csv', quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1kt7l75LsKXrQjykGQLppLRIpDrb2tp4D
To: c:\Users\artis\OneDrive\Документы\Работа\ML models\MicrobesAndGlucouseAnalysis\microbiom_data.csv
100%|██████████| 3.04M/3.04M [00:00<00:00, 18.0MB/s]


'microbiom_data.csv'

In [62]:
original_micr_data = pd.read_csv('microbiom_data.csv', dtype='string', encoding='UTF-8')

In [63]:
# The provided table stores mixed data types: the first 413 rows hold numeric counts,
# while row 413 holds text (it is used later for the OTU description)
type_constrained_micr_data = pd.concat([original_micr_data.iloc[0:413, 0].astype('string'),
                                        original_micr_data.iloc[0:413, 1:].astype('float')], axis=1)

In [64]:
numerical_micr_data = type_constrained_micr_data.copy()
sample = numerical_micr_data.pop('Sample')
visit = numerical_micr_data.pop('Visit')

In [65]:
# The sample name contains the patient number,
# which will be required to merge the tables
N = sample.apply(lambda row: int(row[0:3]))
N.name = 'N'

In [66]:
# Reading genetic sequences can be very costly in terms of memory consumption,
# so lines are yielded one at a time instead of loading the whole file
def read_line(file):
    # A file object is already a lazy line iterator
    yield from file

In [67]:
def read_files(files):
    LefSe_otu = []
    for file_name in files:
        try:
            with open(file_name, mode='r', encoding='UTF-8') as file:
                for line in read_line(file):
                    bacteria = re.findall(r'\bOTU_\d+\b', line)
                    if bacteria:
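                        # extend() handles any number of matches per line;
                        # append(*bacteria) fails when a line has more than one
                        LefSe_otu.extend(bacteria)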
        except FileNotFoundError:
            print(f"File '{file_name}' not found.")
        except IOError as e:
            print(f"Error reading file '{file_name}': {e}")
    return LefSe_otu
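As a quick check of the pattern used in `read_files`, the sketch below runs the same regular expression over a few made-up lines. The sample lines and the helper name `extract_otus` are hypothetical and are not taken from the LefSe files. The de-duplication step is an addition for illustration: passing a repeated label to `.loc` would return the same column more than once.

```python
import re

OTU_PATTERN = re.compile(r'\bOTU_\d+\b')

# Hypothetical lines for illustration only; the real LefSe files may look different.
sample_lines = [
    "OTU_241\t3.2\tabove_median\t0.01",
    "no identifier on this line",
    "OTU_51 and OTU_68 share a line",
    "OTU_241\t2.9\tabove_median\t0.03",
]

def extract_otus(lines):
    """Collect every OTU identifier, keeping first-seen order without duplicates."""
    found = []
    for line in lines:
        for otu in OTU_PATTERN.findall(line):
            if otu not in found:
                found.append(otu)
    return found

otus = extract_otus(sample_lines)
print(otus)
```

A line with two identifiers contributes both, and the repeated `OTU_241` is kept only once.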

In [68]:
# From the list of representatives of the gut microbiota,
# it was proposed to select for the study the bacteria that showed
# the greatest differences between patient groups as a result of LefSe analysis
LefSe_files = ['LefSe_above_median_BgMax_20_01.tsv', 'LefSe_above_median_iauc120_20_01.tsv']
selected_otu = read_files(LefSe_files)
otu = numerical_micr_data.loc[:, selected_otu]

In [69]:
micr_data = pd.concat([N, visit, otu], axis=1)
micr_data.set_index(['N', 'Visit'], inplace=True)

In [70]:
# The Visit column reflects the period in which the analytical sample was processed.
# The filter cuts off a small portion of the data that was corrupted in transit
micr_data = micr_data.loc[micr_data.index.get_level_values('Visit').isin([99, 146])]
micr_data = micr_data.reset_index(level='Visit', drop=True)

In [71]:
micr_data.sort_index()

N,OTU_241,OTU_187,OTU_197,OTU_587,OTU_609,OTU_439,OTU_529,OTU_51,OTU_68,OTU_79,...,OTU_338,OTU_305,OTU_312,OTU_496,OTU_420,OTU_193,OTU_454,OTU_399,OTU_257,OTU_337
281,0.0,23.0,35.0,0.0,10.0,6.0,0.0,0.0,19.0,0.0,...,17.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,14.0,0.0
288,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,15.0,...,3.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,14.0,0.0
289,0.0,0.0,58.0,0.0,61.0,0.0,0.0,0.0,66.0,0.0,...,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,47.0,0.0
290,0.0,0.0,0.0,111.0,0.0,0.0,0.0,151.0,107.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,0.0,0.0,35.0,0.0
291,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,92.0,0.0,...,8.0,30.0,0.0,0.0,3.0,0.0,0.0,0.0,36.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
872,0.0,106.0,0.0,0.0,0.0,0.0,0.0,988.0,0.0,205.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,156.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
874,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,127.0,67.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
875,0.0,137.0,0.0,0.0,0.0,0.0,0.0,0.0,13.0,0.0,...,12.0,0.0,0.0,0.0,0.0,21.0,0.0,0.0,0.0,0.0


<h3><b> 1.2 Get Clinical Data

In [72]:
# Set of clinical parameters proposed by the principal investigator for analysis
with open('clinical_param.txt', mode='r', encoding='UTF-8') as file:
    clinical_param = file.read().split(', ')

In [73]:
original_clinical_data = pd.read_csv('clinical_data.csv', index_col='N', usecols=['N', *clinical_param], dtype='float', encoding='UTF-8')
original_clinical_data.index = original_clinical_data.index.astype('int')

In [74]:
additional_clinical_data = pd.read_csv('additional_clinical_data.csv', index_col='N',
                                       usecols=['N', 'CGM_g_age1', 'GM_g_age2', 'diet_before_V1', 'Diet_duration_V1'],
                                       dtype='float', encoding='UTF-8')
additional_clinical_data.index = additional_clinical_data.index.astype('int')

In [75]:
original_clinical_data = original_clinical_data.join(additional_clinical_data, on='N')

In [76]:
# The filter is designed to cut off some patients who carelessly filled out data
clinical_data = original_clinical_data.copy()
clinical_data = clinical_data.loc[clinical_data['quality_cgm1']==0, :]
clinical_data.drop(columns=['quality_cgm1'], inplace=True)

In [77]:
# It was suggested that diaries of patients who had taken antibiotics
# less than 4 weeks prior to the study should not be analyzed
clinical_data.drop(labels=[712, 724], axis=0, inplace=True)

In [78]:
clinical_data.sort_index()

N,CGMS_срок,диета_срок,диета_кал,срок_кал1,Глюкоза_нт_общая,прибавка_m1,отеки1,АД_сист1,АД_диаст1,N_беременностей,...,ТГ_V1,ЛПВП_V1,ЛПОНП_V1,ЛПНП_V1,КА_V1,АБ_бер_ть,CGM_g_age1,GM_g_age2,diet_before_V1,Diet_duration_V1
78,30.0,21.0,,,5.26,6.0,0.0,120.0,80.0,2.0,...,1.37,2.71,0.63,4.59,1.93,1.0,,,1.0,8.0
166,35.0,33.0,,,4.60,8.0,0.0,110.0,70.0,1.0,...,3.81,2.44,1.75,5.61,3.02,0.0,,,0.0,0.0
198,31.0,24.0,,,5.37,3.0,1.0,120.0,80.0,3.0,...,2.21,2.12,1.01,2.90,1.84,0.0,,,0.0,0.0
203,,12.0,,,4.90,2.2,0.0,122.0,83.0,2.0,...,2.03,2.54,0.93,5.28,2.44,0.0,,,1.0,14.0
212,32.0,,,,4.18,9.0,0.0,110.0,70.0,1.0,...,2.25,2.08,1.03,4.90,2.85,0.0,,,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900,,12.0,,,5.40,0.4,0.0,127.0,90.0,3.0,...,,,,,,0.0,,,1.0,4.0
901,,8.0,20.0,28.0,5.20,10.0,0.0,100.0,55.0,2.0,...,,,,,,0.0,,,1.0,19.0
903,,12.0,,,6.90,0.0,0.0,122.0,77.0,3.0,...,2.35,1.41,1.08,2.34,2.43,0.0,,,1.0,14.0
904,,,,23.0,4.33,5.0,0.0,110.0,70.0,1.0,...,2.03,1.28,0.93,2.66,2.20,0.0,,,0.0,


<h3><b>1.3 Get Monitoring Data

In [79]:
# Set of garbage monitoring parameters (time lags, device info etc.)
with open('garbage_params.txt', mode='r', encoding='UTF-8') as file:
    garbage_params = file.read().split(', ')

In [80]:
original_monitoring_data = pd.read_csv('monitoring_data.csv', index_col='N',  dtype='string', encoding='UTF-8')

In [81]:
for column in original_monitoring_data.columns:
    try:
        original_monitoring_data[column] = original_monitoring_data[column].astype('float')
    except (ValueError, TypeError):
        # Leave columns that cannot be parsed as numbers as strings
        pass
original_monitoring_data.index = pd.to_numeric(original_monitoring_data.index, errors='raise')

In [82]:
# Remove all unnecessary items
redundant_params = original_monitoring_data.filter(regex='Unnamed|meal_items|meal_mass|meal_time|without', axis=1).columns
monitoring_data = original_monitoring_data.copy()
monitoring_data = monitoring_data.loc[:, ~monitoring_data.columns.isin([*garbage_params, *redundant_params])]

In [83]:
# Exclude patients who were taking insulin
insulin_features = ['i_before', 'i_before_t', 'i_type']
monitoring_data = monitoring_data[(monitoring_data['project'] == 3) & (monitoring_data['i_before'].isna())]
monitoring_data.drop(labels=['project', *insulin_features], axis=1, inplace=True)

In [84]:
# During the analysis of the literature, the most
# significant parameters for forecasting were identified
targets = ['BG30', 'BG60', 'BG90', 'BG120', 'BGMax', 'AUC60', 'AUC120', 'iAUC60', 'iAUC120']
factors = ['BG0', 'gi', 'gl', 'carbo', 'prot', 'fat']
monitoring_data.dropna(subset=[*factors, *targets], inplace=True)

In [85]:
monitoring_data.sort_index()

N,meal_type_n,gi,gl,carbo,prot,fat,kcal,water,mds,kr,...,алкоголь2,сладкие_напитки2,кофе2,сосиски2,ходьба1,подъем1,спорт1,ходьба2,подъем2,спорт2
77.0,2.0,14.7,55.2,45.00,19.90,42.20,642.30,342.27,8.00,36.89,...,1.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,1.0,1.0
77.0,3.0,18.6,19.4,29.50,47.90,12.50,432.10,389.52,22.24,7.19,...,1.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,1.0,1.0
77.0,1.0,5.3,18.4,29.40,40.60,14.00,414.70,366.38,20.58,8.79,...,1.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,1.0,1.0
77.0,2.0,51.3,85.2,75.30,24.20,31.90,691.00,349.60,8.45,66.95,...,1.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,1.0,1.0
77.0,3.0,50.7,49.3,47.60,10.20,19.80,415.00,262.25,23.80,23.80,...,1.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1017.0,4.0,54.9,36.6,23.53,11.26,8.76,220.30,63.10,0.00,0.11,...,1.0,3.0,3.0,3.0,2.0,2.0,1.0,2.0,2.0,1.0
1017.0,4.0,45.0,41.9,31.68,1.65,18.70,302.50,0.33,0.00,1.65,...,1.0,3.0,3.0,3.0,2.0,2.0,1.0,2.0,2.0,1.0
1017.0,1.0,55.0,54.6,47.85,25.29,12.56,400.61,325.41,0.00,0.12,...,1.0,3.0,3.0,3.0,2.0,2.0,1.0,2.0,2.0,1.0
1017.0,4.0,53.9,72.3,65.01,18.64,12.41,442.61,272.06,0.00,0.00,...,1.0,3.0,3.0,3.0,2.0,2.0,1.0,2.0,2.0,1.0


<h3><b>1.4 Merge tables

In [86]:
data = clinical_data.merge(micr_data, on='N', how='inner')

In [87]:
data = monitoring_data.merge(data, on='N', how='inner', suffixes=('_monitoring', '_clinical'))

In [88]:
redundant_params = data.filter(regex='_monitoring').columns
data.drop(labels=redundant_params, axis=1, inplace=True)
data.rename(columns=lambda x: re.sub(r'_clinical$', '', x), inplace=True)

In [89]:
data.sort_index(inplace=True)
data

N,meal_type_n,gi,gl,carbo,prot,fat,kcal,water,mds,kr,...,OTU_338,OTU_305,OTU_312,OTU_496,OTU_420,OTU_193,OTU_454,OTU_399,OTU_257,OTU_337
281.0,4.0,38.5,3.7,9.6,0.8,0.3,49.0,186.30,9.50,0.10,...,17.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,14.0,0.0
281.0,4.0,1.5,3.7,10.3,2.9,0.8,71.5,327.08,27.30,0.10,...,17.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,14.0,0.0
281.0,3.0,16.4,26.4,36.4,13.6,43.1,593.7,499.42,18.11,18.30,...,17.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,14.0,0.0
281.0,4.0,45.0,7.4,18.2,19.2,4.0,200.0,444.45,34.82,0.62,...,17.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,14.0,0.0
281.0,4.0,22.2,14.7,20.9,1.8,0.7,102.0,208.65,19.53,1.38,...,17.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,14.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
873.0,4.0,16.0,12.8,20.9,9.3,3.3,161.8,317.89,20.93,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
873.0,2.0,27.0,22.6,20.0,8.7,11.0,216.8,493.06,5.33,14.62,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
873.0,4.0,51.6,22.8,19.5,5.9,14.9,228.6,390.38,16.24,1.89,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
873.0,3.0,16.9,29.9,33.9,16.5,8.8,280.4,264.00,2.07,31.79,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h3><b>1.5 Data description

In [90]:
microbial_sequences = original_micr_data.iloc[413, 2:]
otu_micr_pairs = dict(zip(microbial_sequences.index, microbial_sequences.to_numpy()))
selected_otu_micr_pairs = {key: otu_micr_pairs[key] for key in micr_data.columns}

In [91]:
with open('dataset_description.txt', 'w', encoding='UTF-8') as file:

    file.write(f'Total number of unique patients: {len(data.index.unique())}\n')
    file.write(f'Total number of rows: {data.shape[0]}\n\n')
    
    file.write(f'FEATURES ({len(data.columns)})\n\n')
    file.write('\n'.join(data.columns) + '\n\n')

    file.write('OTU DECIPHER\n\n')
    for key, value in selected_otu_micr_pairs.items():
        file.write(f'{key}: {value}\n')

<h2><b>2. Exploratory data analysis

<h3><b>2.1 Exploring hidden patterns and feature engineering

In [94]:
# Let's examine the data characterizing the patients included in the preliminary dataset
N = data.index.unique()
noutof_clinical_data = clinical_data[clinical_data.index.isin(N)]
noutof_clinical_data.describe()

statistic,CGMS_срок,диета_срок,диета_кал,срок_кал1,Глюкоза_нт_общая,прибавка_m1,отеки1,АД_сист1,АД_диаст1,N_беременностей,...,ТГ_V1,ЛПВП_V1,ЛПОНП_V1,ЛПНП_V1,КА_V1,АБ_бер_ть,CGM_g_age1,GM_g_age2,diet_before_V1,Diet_duration_V1
count,12.0,69.0,58.0,82.0,97.0,95.0,97.0,97.0,97.0,97.0,...,96.0,96.0,63.0,63.0,63.0,95.0,97.0,16.0,97.0,97.0
mean,30.0,22.391304,4.862069,27.560976,4.917423,6.656842,0.14433,119.082474,74.835052,2.14433,...,1.960729,1.968438,0.843968,3.266508,2.172222,0.105263,30.113402,34.625,0.536082,2.979381
std,3.592922,6.761085,5.410997,3.399701,0.590796,3.873963,0.35325,11.542196,9.573923,1.274671,...,0.748446,0.414148,0.316791,0.814464,0.765191,0.30852,3.363023,2.680174,0.501287,5.004122
min,22.0,3.0,0.0,14.0,3.2,0.0,0.0,90.0,60.0,1.0,...,0.73,1.18,0.35,1.73,0.82,0.0,14.0,25.0,0.0,0.0
25%,28.75,17.0,2.0,26.0,4.5,4.25,0.0,110.0,70.0,1.0,...,1.4,1.65,0.61,2.69,1.705,0.0,28.0,35.0,0.0,0.0
50%,30.0,25.0,3.0,28.0,5.03,7.0,0.0,120.0,75.0,2.0,...,1.93,1.945,0.82,3.18,2.11,0.0,31.0,35.0,1.0,1.0
75%,31.5,26.0,5.0,30.0,5.3,9.5,0.0,125.0,80.0,3.0,...,2.3225,2.275,1.0,3.7,2.555,0.0,33.0,36.0,1.0,4.0
max,36.0,35.0,27.0,32.0,6.5,15.0,1.0,150.0,120.0,7.0,...,4.35,3.44,2.0,6.22,4.89,1.0,36.0,37.0,1.0,26.0


In [95]:
px.imshow(noutof_clinical_data.corr())

### Inference:
1) In general, the correlation between columns is low. The individual yellow dots correspond to columns that store test dates in weeks of pregnancy, which probably indicates that most patients had their tests either on the same day or with a small gap.
2) The small cluster at the top of the heat map reflects correlations between the number of births, abortions, children successfully carried, etc., which is expected.
3) The large cluster of features at the bottom of the diagonal indicates strong correlations between biochemical parameters that, as established in (1), were measured in the same week of gestation.
4) Associations between systolic and diastolic blood pressure, and between pre-pregnancy weight and BMI taken at the time of enrollment, are also seen.
5) White dots on the graph: CGMS_срок and GM_g_age2 contain too many NaN values; НТГ has no missing values, but its variance is almost 0.

### Conclusions
Let's remove these columns

In [96]:
data.drop(labels=['CGMS_срок', 'GM_g_age2', 'НТГ'], axis=1, inplace=True)
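The white cells on the heat map can also be found programmatically. Pearson correlation is undefined for a constant column and unreliable when almost all values are missing. The sketch below flags both cases on a toy frame. The column names, the helper `uninformative_columns` and both thresholds are illustrative assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; names and values are invented for illustration.
toy = pd.DataFrame({
    'mostly_missing': [30.0, np.nan, np.nan, np.nan, np.nan],
    'constant':       [0.0, 0.0, 0.0, 0.0, 0.0],
    'informative':    [4.6, 5.3, 4.9, 5.4, 4.2],
})

def uninformative_columns(df, max_nan_share=0.5, min_std=1e-8):
    """Return columns that are too sparse or (almost) constant."""
    nan_share = df.isna().mean()
    std = df.std(numeric_only=True)
    too_sparse = nan_share[nan_share > max_nan_share].index
    # std is NaN when fewer than two values are present; treat that as flat
    too_flat = std[std.fillna(0) <= min_std].index
    return sorted(set(too_sparse) | set(too_flat))

to_drop = uninformative_columns(toy)
print(to_drop)
```

The returned list could then be passed to `DataFrame.drop(columns=...)` instead of hard-coding the column names.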

In [97]:
px.box(pd.concat([noutof_clinical_data['Глюкоза_нт_общая'],
                  data['BGMax'].groupby('N').median(),
                  data['AUC120'].groupby('N').median(),
                  noutof_clinical_data['СД_у_родственников']],
                  axis=1), color='СД_у_родственников', notched=True, points='all')

### Inference:
Patients who have relatives with diabetes (the data do not specify whether type 1 or type 2) have higher fasting glucose levels (Глюкоза_нт_общая) and a noticeably higher median peak postprandial glucose level (BGMax), yet the total area under the glycemic curve (AUC120) is roughly the same. This suggests that higher peak glucose levels after meals are followed by a sharp relative drop in glycemia, (BGMax - BG120) / BG120. An alternative cause may be lower pre-meal glucose values, BG0 (not to be confused with fasting glucose, Глюкоза_нт_общая). The groups are also slightly imbalanced: there are somewhat more patients who have no relatives with diabetes.

In [98]:
px.box(pd.concat([data['BG0'].groupby('N').median(),
                  noutof_clinical_data['СД_у_родственников']],
                  axis=1), color='СД_у_родственников', notched=True, points='all')

In [99]:
px.box(pd.concat([((data['BGMax'].groupby('N').median()-data['BG120'].groupby('N').median())/data['BG120'].groupby('N').median()) ,
                  noutof_clinical_data['СД_у_родственников']],
                  axis=1), color='СД_у_родственников', notched=True, points='all')

### Conclusion
Let's add the relative drop in glucose from the peak to the 120-minute value, (BGMax - BG120) / BG120, as another target variable. Additionally, the median of this value over previous meals may be useful for predicting AUC120, but this requires more careful work: selecting the correct time window and consulting a domain specialist about the nature of this effect.

In [100]:
# Parentheses are required: without them the expression evaluates to BGMax - 1
data['BG_drop_rate'] = (data['BGMax'] - data['BG120']) / data['BG120']
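The relative drop described in the conclusion, (BGMax - BG120) / BG120, is sensitive to operator precedence, so a small worked example may help. The readings below are invented and assumed to be in mmol/L. The variable `unparenthesised` is shown only for contrast.

```python
import pandas as pd

# Toy glucose readings (assumed mmol/L); values are invented for illustration.
toy = pd.DataFrame({'BGMax': [8.0, 6.6], 'BG120': [5.0, 6.0]})

# Relative drop from the peak to the 120-minute value
toy['BG_drop_rate'] = (toy['BGMax'] - toy['BG120']) / toy['BG120']

# Division binds tighter than subtraction, so this is BGMax - 1, not a ratio
unparenthesised = toy['BGMax'] - toy['BG120'] / toy['BG120']

print(toy['BG_drop_rate'].round(2).tolist())
print(unparenthesised.round(2).tolist())
```

For the first row the drop is (8.0 - 5.0) / 5.0 = 0.6, whereas the unparenthesised form gives 8.0 - 1 = 7.0.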

In [101]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df = pd.concat([data['gl'], data['carbo'],
                data['gi'], data['meal_type_n']], axis=1)
df[['gl', 'carbo', 'gi']] = scaler.fit_transform(df[['gl', 'carbo', 'gi']])
px.scatter_ternary(df, a='gl', b='carbo', c='gi', color='meal_type_n')

In [102]:
df = pd.concat([data['carbo'], data['prot'],
                 data['fat'], data['meal_type_n']], axis=1)
df[['carbo', 'prot', 'fat']] = scaler.fit_transform(df[['carbo', 'prot', 'fat']])
px.scatter_ternary(df, a='carbo', b='prot', c='fat', color='meal_type_n')

### Inference
The graph shows that meal_type_n = 4, which corresponds to snacks, is associated with a high glycemic index (gi) and carbohydrates (carbo). It is reasonable to assume that patients ate sugar-containing products as snacks. It might also make sense to split meal times into more categories and see whether the classes can be visually separated on a ternary plot.

In [103]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder

nbins = 4
fig = make_subplots(rows=1, cols=nbins-1,
                    specs=[[{'type': 'ternary'} for n in range(nbins-1)]],
                    subplot_titles=[f'nbins = {n+2}' for n in range(nbins-1)],
                    horizontal_spacing=.2)

for n in range(2, nbins+1):
    binned_daytime = pd.cut(data['daytime'], bins=n)
    df = pd.concat([data['carbo'], data['prot'], data['fat'], binned_daytime], axis=1)
    df[['carbo', 'prot', 'fat']] = scaler.fit_transform(df[['carbo', 'prot', 'fat']])
    daytimes = df['daytime'].unique()
    encoder = LabelEncoder()
    colors = encoder.fit_transform(daytimes)
    for i, daytime in enumerate(daytimes):
        mask = df['daytime'] == daytime
        fig.add_trace(go.Scatterternary(a=df.loc[mask, 'carbo'],
                                        b=df.loc[mask, 'prot'],
                                        c=df.loc[mask, 'fat'],
                                        mode='markers',
                                        name=str(daytime),
                                        marker={'color': colors[i]}),
                                        row=1, col=n-1)
    fig.update_ternaries(aaxis_title='Carbo', baxis_title='Prot', caxis_title='Fat', row=1, col=n-1)
fig.update_layout(title='Changes in the nutritional composition of food throughout the day', legend_title='Time intervals')
fig.show()

### Conclusion
1) There is a slight bias towards foods high in saturated fat in the period before 11am. After consultation with a nutritionist and a diabetologist, it was learned that the postprandial glycemic response does tend to be sharper in the morning. A new binary variable should be added: 1 for eating before 11am, 0 for eating at or after 11am.
2) Snacking is associated with a higher glycemic index of the product, but, surprisingly, the glycemic load and the amount of carbohydrates in the product are relatively low. After consultation with a nutritionist, it became clear that these features were related to the carbohydrate-free diet the patients followed during the monitoring period.

In [104]:
data['daytimeb11'] = (data['daytime'] < 11).astype('int')

<h3><b>2.2 Replacing missing values

In [105]:
nans_percentage = data.isna().mean()
any_na = nans_percentage[nans_percentage > 0]
any_na.index

Index(['prec_meal_gi', 'prec_meal_gl', 'prec_meal_carbo', 'prec_meal_prot',
       'prec_meal_fat', 'prec_meal_pv', 'iAUCb240', 'iAUCb120', 'iAUCb60',
       'BGRiseb240', 'BGRiseb120', 'BGRiseb60', 'BGb240', 'BGb120', 'BGb60',
       'BGb50', 'BGb40', 'BGb30', 'BGb25', 'BGb20', 'BGb15', 'BGb10', 'BGb5',
       'bgBefore_glu', 'через1час_тест', 'через2часа_тест', 'ИЦН',
       'диета_срок', 'диета_кал', 'срок_кал1', 'прибавка_m1',
       'rs10830963_MTNR1B_N', 'ФР_V1', 'Хол_V1', 'ТГ_V1', 'ЛПВП_V1',
       'ЛПОНП_V1', 'ЛПНП_V1', 'КА_V1', 'АБ_бер_ть'],
      dtype='object')

### Inference
1) All prognostic parameters with the prefix "prec" refer to previous meals; their missing values are equivalent to zero.
2) All monitoring parameters are marked with time stamps from 0 to 120 and are critical for prediction. Missing values in them are rare, so the affected rows should be deleted.
3) The bgBefore_glu parameter corresponds to BG0, but the two are measured by different devices. To keep the dataset consistent, it is better to keep only the column with fewer missing values, i.e. BG0.
4) Gaps in the other parameters should not be counted in the same way, because they are patient-level values that are repeated in every row for that patient. We should collapse the dataset to one row per patient and count the missing values relative to the number of patients.
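Point 4 can be illustrated on a toy frame: when a patient-level value is repeated on every meal row, the row-level share of missing values is weighted by how many meals each patient logged. The data below are invented, and `lab_value` is a hypothetical column name. The sketch collapses with `first()`, which returns the first non-null value per group; this is one possible choice for repeated patient-level values.

```python
import numpy as np
import pandas as pd

# Toy data: patient 1 logged 4 meals, patient 2 logged 1 meal (values invented).
toy = pd.DataFrame(
    {'gi': [40.0, 55.0, 38.0, 61.0, 47.0],
     'lab_value': [np.nan, np.nan, np.nan, np.nan, 2.1]},
    index=pd.Index([1, 1, 1, 1, 2], name='N'),
)

# Share of missing values counted per row: dominated by patient 1
row_level_share = toy['lab_value'].isna().mean()

# One row per patient, then the share of patients with a missing value
per_patient = toy.groupby('N')['lab_value'].first()
patient_level_share = per_patient.isna().mean()

print(row_level_share, patient_level_share)
```

Here 80% of the rows are missing the value, but only one of the two patients is affected.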

### Conclusions

In [106]:
prec_meal_params = any_na.filter(regex='prec_').index
data[prec_meal_params] = data[prec_meal_params].fillna(value=0)

In [107]:
matches = re.findall(r"iAUCb\d+|BGRiseb\d+|BGb\d+", ', '.join(any_na.index.to_list()))
data.dropna(subset=matches, inplace=True)

In [108]:
data.drop('bgBefore_glu', axis=1, inplace=True)

In [109]:
row_per_patient = data.groupby('N').mean()
nans_percentage = row_per_patient.isna().mean()
any_na = nans_percentage[nans_percentage > 0]
any_na

через1час_тест         0.278351
через2часа_тест        0.268041
ИЦН                    0.041237
диета_срок             0.288660
диета_кал              0.402062
срок_кал1              0.154639
прибавка_m1            0.020619
rs10830963_MTNR1B_N    0.072165
ФР_V1                  0.010309
Хол_V1                 0.010309
ТГ_V1                  0.010309
ЛПВП_V1                0.010309
ЛПОНП_V1               0.350515
ЛПНП_V1                0.350515
КА_V1                  0.350515
АБ_бер_ть              0.020619
dtype: float64

### Inference
About half of the remaining parameters have more than 20% missing values, so it is simplest to delete them. The remaining gaps are treated as MAR (Missing at Random). It is not entirely clear whether to use MICE imputation or to leave the gaps as they are and use models based on decision trees. Nevertheless, since LSTM models showed the best results in predicting glycemia levels from continuous monitoring data, deep neural networks should be considered, and therefore the gaps should be filled in with MICE (Iterative Imputer) or KNN Imputer.

In [110]:
data.drop(labels=nans_percentage[nans_percentage > 0.2].index, axis=1, inplace=True) 

In [111]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
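# fit_transform returns a NumPy array, so pass the index explicitly to keep the patient number N
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns, index=data.index)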
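The inference above also names KNN Imputer as an option. A minimal sketch on invented data is given below, assuming scikit-learn's `KNNImputer` with `n_neighbors=2` (an arbitrary choice for the toy example). Passing `index=` when rebuilding the frame keeps the patient index `N`, which a bare `pd.DataFrame(array, columns=...)` would replace with 0..n-1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one gap; values and patient numbers are invented.
toy = pd.DataFrame(
    {'a': [1.0, 2.0, np.nan, 4.0],
     'b': [10.0, 20.0, 30.0, 40.0]},
    index=pd.Index([281, 281, 288, 289], name='N'),
)

# Each gap is filled with the mean of that feature over the nearest rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)

print(imputed)
```

Whichever imputer is chosen, fitting it on the training split only would avoid leaking information from held-out patients; that split is outside the scope of this sketch.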

<h3><b>2.3 Export

In [None]:
data.to_pickle('data.pkl')

In [None]:
data.to_csv('data.csv', sep=';', decimal=',', encoding='UTF-8')

<h2><b>3. Model pipeline development