# 6. Final preparation for machine learning
Before delving into the machine learning modeling, it's pivotal to ensure that the data is adequately cleaned, processed, and formatted. This section delineates the steps undertaken to finalize the data in preparation for model training and validation.
Highlights:

* Data Importing: Leveraging libraries such as pandas and numpy, the DEXIP study data is loaded and initialized.

* Target Variables: Two distinct binary target variables (y & y_3) are crafted, stemming from glucose level thresholds. Additionally, a 'stratify' column is introduced, marrying participant ID with the y_3 target, to aid in stratified sampling later on.

* Data Trimming: Emphasis is placed on data quality. Rows deficient in data (less than 60% sufficiency) are pruned, and columns with over 20% missing data are cast aside. Furthermore, rows lacking key columns like start_glc and duration are eradicated.

* Data Refinement: Innovative features like 'season' are derived from existing ones, offering richer data for modeling. Certain columns, mostly timestamps, are processed and converted to a numeric format.

* Tidying & Saving Data: A series of operations—including resetting indices, discarding superfluous columns, replacing infinite values, creating dummy variables for categorical columns, and employing KNN imputation for missing values—culminates in the final data. The finished dataframe, along with other critical components like feature matrix (X) and target vector (y), is archived for subsequent use.

With the data now in its prime, it paves the way for the ensuing phase: machine learning modeling.

## 6.0. Import packages and load data

In [1]:
import pandas as pd
import numpy as np
import preprocess_helper
import sys
from sklearn.impute import KNNImputer
path = "../../diametrics"
sys.path.append(path)

In [2]:
df = pd.read_csv('../../data/tidy_data/data_with_targets_dexip.csv')

In [3]:
df[(df['duration']>=10) | (df['duration']<=120)].shape

(20420, 853)

In [4]:
[i for i in df.columns]

['ID',
 'start_datetime',
 'finish_datetime',
 'intensity',
 'type_of_exercise',
 'starting_glucose',
 'finishing_glucose',
 'month',
 'day',
 'day_of_week',
 'time_of_day',
 'duration',
 'form_of_exercise',
 'interval',
 'study',
 'bout_id',
 'start_glc',
 'start_roc',
 'y',
 'age',
 'sex',
 'race',
 'hba1c',
 'bmi',
 'years_since_diagnosis',
 'insulin_modality',
 'urine_cpep',
 'urine_creat',
 'chol',
 'cpep',
 'h_index',
 'hdl',
 'ldl',
 'nhdl',
 'trig',
 'data_suff_during',
 'before_data_suff',
 'Average glucose (mmol/L)',
 'eA1c (%)',
 'SD (mmol/L)',
 'CV (%)',
 'AUC (mmol h/L)',
 'LBGI',
 'HBGI',
 'MAGE (mmol/L)',
 'TIR normal (%)',
 'TIR normal 1 (%)',
 'TIR normal 2 (%)',
 'TIR level 1 hypoglycemia (%)',
 'TIR level 2 hypoglycemia (%)',
 'TIR level 1 hyperglycemia (%)',
 'TIR level 2 hyperglycemia (%)',
 'Total number hypoglycemic events',
 'Number LV1 hypoglycemic events',
 'Number LV2 hypoglycemic events',
 'Number prolonged hypoglycemic events',
 'Avg. length of hypoglycemic

## 6.1. Prepare target variables

# Target variable for 5mmol/L
df['y'] = df['tir_below_5']>0

# Target variable for 3.9mmol/L
df['y_3'] = df['tir_below_3']>0

In [5]:
# Create column for stratifying based on ID and target (3.9)
df['stratify'] = df['ID']+ '_'+ df['y'].astype(str)

## 6.2. Cutting data

In [6]:
# Drop any instances without start glucose and duration
df = df.dropna(subset=['start_glc', 'duration']).reset_index()

In [7]:
df.shape

(18180, 855)

In [8]:
# Drop columns with < 60% data sufficiency
df = df.loc[df['data_suff_during']>=60]

In [9]:
# Drop columns with < 60% data sufficiency
df = df.loc[df['before_data_suff']>=40]

In [10]:
df.shape

(17657, 855)

In [11]:
# Drop columns with less than 80% data available
df_dropped = df.loc[:, df.isnull().mean() < .2]

In [12]:
df.shape

(17657, 855)

In [13]:
df.shape

(17657, 855)

## 6.3. Data adjustments

In [14]:
# Create 'season' variable
df['season'] = (df.month%12 // 3 + 1).astype(str)

In [15]:
# Make datetime cols numeric
cols = ['Avg. length of hypoglycemic events', 
        'Total time spent in hypoglycemic events',
         'Avg. length of hyperglycemic events',
         'Total time spent in hyperglycemic events', ]

for col in cols:
    df_dropped[col] = df_dropped[col].apply(lambda x: preprocess_helper.make_numeric(x)).astype(float)

## Get rid of weird bouts

In [16]:
# Convert form_of_exercise
df_dropped['form_of_exercise'] = df_dropped.type_of_exercise.apply(lambda x: 
                                                       preprocess_helper.divide_exercise_into_type(x))

hit
hit
hit
step class
step class
hit
hit
hit
hit
boot camp
painting
2.6
circuits
circuits
moving fence panels
nk (teaching?)
sheep weighing
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
strawberry picking
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
holiday decorating indoor and out
snow shoveling
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor chores
outdoor cho

In [17]:
df_dropped = df_dropped.dropna(subset=['form_of_exercise'])

In [18]:
df_dropped.shape

(16490, 414)

In [19]:

df_dropped=df_dropped.drop_duplicates(['bout_id', 'start_datetime'])

In [20]:
df_dropped.shape

(16477, 414)

## 6.4. Adjust and tidy data

In [21]:
# Save df for demographics table
df_dropped.to_csv('../../data/tidy_data/demographics_df.csv', index=False)

In [22]:
# Save id, bout id etc as a key
df_dropped[['ID', 'bout_id', 'start_datetime','finish_datetime']].to_csv('../1_preprocessing/id_key_dexip.csv', index=False)

In [23]:
# Reset index after dropping
df_dropped.reset_index(inplace=True, drop=True)

In [24]:
# Drop columns that arent needed
cols_to_drop = ['start_datetime', 'finish_datetime', 'type_of_exercise',
               'data_suff_during', # 'interval', 'tir_below_5', 'tir_below_3',
             'month', 'day',  'ID',  'index']

df = df_dropped.drop(columns=cols_to_drop)

In [25]:
# Repace infs in tsfresh with nan
df.replace([np.inf, -np.inf], np.nan, inplace=True)

In [26]:
# Create X 
X = df.drop(columns=['study','interval', 'stratify', 'bout_id', 'y'])

# Get dummies
X = pd.get_dummies(X)

# Fill missing data
imputer = KNNImputer(n_neighbors=5, missing_values=np.nan)
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Round intensity back to ordinal after imputing
X.intensity = X.intensity.round()

# Set y as 3.9mmol/L
y = df['y']

In [27]:
# Save everything
df.to_csv('../../data/tidy_data/final_df/df.csv', index=False)
df['stratify'].to_csv('../../data/tidy_data/final_df/strat.csv', index=False)
X.to_csv('../../data/tidy_data/final_df/X.csv', index=False)
y.to_csv('../../data/tidy_data/final_df/y.csv', index=False)