# Preparation

This project uses deep learning with Kullback-Leibler (KL) divergence. It requires a few basic packages. The following cell installs them: `lifelines`, `sklearn-pandas`, `torchtuples`, `optunity`, `statsmodels` and `pycox`. Note that `pycox` is built on top of `torchtuples`.



In [None]:
!pwd
!pip install lifelines
!pip install sklearn-pandas
!pip install torchtuples
!pip install optunity
!pip install statsmodels --upgrade
!pip install pycox

We use `google.colab.drive` to mount Google Drive, where the software is stored. At present the software is a collection of Python modules that wrap the components needed for the experiments.

In [2]:
from google.colab import drive

drive.mount("/content/drive")

%cd "/content/drive/MyDrive/Kevin He"
# %pwd
# from pycox.models import LogisticHazard

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Kevin He


`data_simulation` contains the code that simulates data for our experiments, `read_data` is a module that generates the different datasets, and `KLDL` wraps the components needed for model training and evaluation.

In [3]:
import os
import sys
import time

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import torch # For building the networks 
import torch.nn as nn
import torch.nn.functional as F
import torchtuples as tt

# For preprocessing
from random import sample
# from pycox.models import PMF
# from pycox.models import DeepHitSingle

import KLDL
from KLDL import NewlyDefinedLoss
import data_simulation
import read_data

# MNIST is part of torchvision
from torchvision import datasets, transforms

In [4]:
np.random.seed(1234)
_ = torch.manual_seed(1234)

If Drive is mounted correctly, the next cell prints the list of files in the given folder. For example, it prints all files in `/content/drive/My Drive/Kevin He`.

In [5]:
# Path of the project folder relative to "My Drive"; change it to match where you uploaded the files.
# Example: if the files are in a folder A4 inside a folder 2022WI, use "2022WI/A4".
GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = "Kevin He"
GOOGLE_DRIVE_PATH = os.path.join("/content/drive", "My Drive", GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))

# Add the folder to sys.path so we can import the .py files.
sys.path.append(GOOGLE_DRIVE_PATH)

['local_data.csv', 'prior_data.csv', 'local_data_2.csv', 'local_data_3.csv', 'prior_data_subset.csv', 'prior_data_subset_2.csv', 'prior_data_subset_3.csv', 'R-code.ipynb', 'local_data_together.csv', 'Untitled', 'Deep Learning with KL Divergence Simulation Image Data.ipynb', 'Deep Learning with KL Divergence.ipynb', 'Deep Learning with KL Divergence Real Data.ipynb', 'prior_data_subset_4.csv', 'Deep Learning with KL Divergence Simulation Non-linear-prop.ipynb', 'prior_data_subset_5.csv', 'Deep Learning with KL Divergence Simulation 3: Tutorial.ipynb', 'Tutorial 1: Using Our Model with Deep Learning as Prior.ipynb', 'Deep Learning with KL Divergence Simulation 2-2: Tutorial.ipynb', 'Deep Learning with KL-divergence Real Data: Tutorial.ipynb', 'Deep Learning with KL Divergence Simulation 1: Tutorial', 'Deep Learning with KL Divergence Non-linear Visualization.ipynb', '.ipynb_checkpoints', 'Deep Learning with KL Divergence Simulation 2-1: Tutorial.ipynb', 'Deep Learning with KL Divergence:

# Data Preprocessing

In this section you will see:

1. How to read different kinds of data with `read_data`.

2. How to discretize the time points when the event time is continuous.

## Read Data

`read_data` is the module used to generate the different kinds of data. It covers both simulated data and real data.

Usage: `read_data.simulation_data(option, n, grouping_number)`

`option`: the data-generating setting, i.e. whether the data is linear or non-linear and whether the hazards are proportional or non-proportional. There are 3 options: `non linear non ph`, `non linear ph`, `linear ph`.

`n`: the number of data points.

`grouping_number`: the number of groups; every group contains the same number of data points. If `grouping_number == 0`, a single dataset is generated; otherwise a list of datasets is generated. *Required: if `grouping_number > 0`, then `n mod grouping_number == 0`.*
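As a rough illustration of the divisibility requirement, the sketch below splits `n` row indices into `grouping_number` equal groups. It does not call `read_data`. The helper `split_into_groups` is a hypothetical name, and the assumption that groups are contiguous blocks of rows is inferred only from the example output in this tutorial (group 50 of a 30000-row, 100-group run starts at index 15000).

```python
import numpy as np


def split_into_groups(n, grouping_number):
    """Hypothetical helper: split row indices 0..n-1 into equal contiguous groups."""
    if grouping_number == 0:
        # A single dataset: all rows in one group.
        return [np.arange(n)]
    if n % grouping_number != 0:
        raise ValueError("n must be divisible by grouping_number")
    return np.split(np.arange(n), grouping_number)


groups = split_into_groups(30000, 100)
print(len(groups), len(groups[50]), groups[50][0])
```

With `n = 30000` and `grouping_number = 100`, every group holds 300 rows. A call such as `split_into_groups(30000, 7)` raises an error because the rows cannot be divided evenly.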

In [None]:
data_1 = read_data.simulation_data(option = "non linear non ph", n = 30000, grouping_number = 100)
data_1

In [13]:
data_1[50][1]

Unnamed: 0,x1,x2,x3,duration,event,temp
15000,0.390851,0.000927,0.564790,6.332959,1,50
15001,-0.807298,0.571272,-0.729109,7.068435,1,50
15002,-0.988140,0.348277,0.928933,0.839747,1,50
15003,-0.920326,-0.570021,-0.820893,30.000000,0,50
15004,0.738997,0.980095,-0.110314,0.710694,0,50
...,...,...,...,...,...,...
15295,0.529739,0.775728,-0.306719,6.134285,1,50
15296,-0.264944,-0.289541,-0.474699,14.559678,0,50
15297,-0.820400,0.137996,0.309843,6.658355,0,50
15298,-0.935507,-0.182500,-0.650530,3.273953,1,50


In [18]:
data_2 = read_data.simulation_data(option = "non linear non ph", n = 300, grouping_number = 0)
data_2

Unnamed: 0,x1,x2,x3,duration,event
0,0.451017,-0.143499,-0.516321,22.029775,0
1,-0.432551,0.737546,-0.812051,28.454158,1
2,0.056319,0.912434,-0.633304,1.583693,0
3,0.336413,-0.074994,-0.552038,10.458747,0
4,-0.594424,-0.307192,-0.374162,8.316722,0
...,...,...,...,...,...
295,0.574306,0.834613,0.095500,5.822907,1
296,-0.019220,0.879195,-0.764657,23.564802,1
297,-0.746507,0.480667,-0.932133,10.427925,1
298,-0.457561,-0.462020,-0.787083,18.039878,1


Usage: `read_data.support_data()`. This returns the processed SUPPORT data. More information will be described in the paper and added here.

---

Usage: `read_data.metabric_data()`. This returns the processed METABRIC data; the preprocessing follows DeepSurv.

In [7]:
support = read_data.support_data()
support

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,duration,event
0,62.84998,male,other,0,0,0,metastatic,97.0,69.0,22.0,36.00000,6.000000,141.0,1.199951,20.0,2029,0
1,60.33899,female,white,2,0,0,no,43.0,112.0,34.0,34.59375,17.097656,132.0,5.500000,74.0,4,1
2,52.74698,female,white,2,0,0,no,70.0,88.0,28.0,37.39844,8.500000,134.0,2.000000,45.0,47,1
3,42.38498,female,white,2,0,0,metastatic,75.0,88.0,32.0,35.00000,9.099609,139.0,0.799927,19.0,133,1
4,79.88495,female,white,1,0,0,no,59.0,112.0,20.0,37.89844,13.500000,143.0,0.799927,30.0,2029,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8826,68.61597,female,white,2,0,0,yes,71.0,110.0,10.0,36.19531,12.599609,128.0,0.799927,19.0,353,0
8827,66.07300,male,white,1,0,0,no,109.0,104.0,22.0,35.69531,7.399414,131.0,1.099854,22.0,350,0
8828,70.38196,male,white,1,0,0,no,111.0,83.0,24.0,36.69531,8.398438,139.0,2.699707,39.0,346,0
8829,47.01999,male,white,1,0,0,yes,99.0,110.0,24.0,36.39844,7.599609,135.0,3.500000,51.0,7,1


In [8]:
metabric = read_data.metabric_data()
metabric

Dataset 'metabric' not locally available. Downloading...
Done


Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,duration,event
0,5.603834,7.811392,10.797988,5.967607,1.0,1.0,0.0,1.0,56.840000,99.333336,0
1,5.284882,9.581043,10.204620,5.664970,1.0,0.0,0.0,1.0,85.940002,95.733330,1
2,5.920251,6.776564,12.431715,5.873857,0.0,1.0,0.0,1.0,48.439999,140.233337,0
3,6.654017,5.341846,8.646379,5.655888,0.0,0.0,0.0,0.0,66.910004,239.300003,0
4,5.456747,5.339741,10.555724,6.008429,1.0,0.0,0.0,1.0,67.849998,56.933334,1
...,...,...,...,...,...,...,...,...,...,...,...
1899,5.946987,5.370492,12.345780,5.741395,1.0,1.0,0.0,1.0,76.839996,87.233330,1
1900,5.339228,5.408853,12.176101,5.693043,1.0,1.0,0.0,1.0,63.090000,157.533340,0
1901,5.901610,5.272237,14.200950,6.139390,0.0,0.0,0.0,1.0,57.770000,37.866665,1
1902,6.818109,5.372744,11.652624,6.077852,1.0,0.0,0.0,1.0,58.889999,198.433334,0


For the following experiments, we set up prior data and local data to show how the package is used. In general, prior data has a large number of data points, but its model assumptions are more restrictive or it has fewer features. Here we therefore generate 10000 prior data points with only linear terms under the proportional hazards assumption. For the local data, we generate 300 data points from a non-linear, non-proportional hazards model, and we assume that this is the true model.

In [6]:
# Default grouping number is 0

data_local = read_data.simulation_data(option = "non linear non ph", n = 300)

prior_data = read_data.simulation_data(option = "linear ph", n = 10000)

## Continuous to Discrete

Most clinical data records event times on a continuous scale, which the discrete-time models in this project cannot use directly. We therefore discretize the time axis, using either equidistant or quantile-based cut points.

Usage: `KLDL.cont_to_disc(data, labtrans = None, scheme = 'quantiles', time_intervals = 20)`

`data`: the dataframe. It must contain the columns `duration` (the event time) and `event` (the event status, i.e. whether the observation is censored).

`labtrans`: the transformation scheme, an object of a class defined in PyCox. If you pass a `labtrans` object, the method treats it as an already **fitted** scheme: it is not refitted and is only used to transform the data. Otherwise, the method fits a new scheme on the data and returns it so that it can be reused for later transformations.

`scheme`: `'quantiles'` for quantile-based cut points; any other value selects equidistant cut points.

`time_intervals`: the number of time intervals for the model. Note that the time points start from $0$.
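The fit-once, reuse-later pattern described for `labtrans` can be sketched with plain NumPy. This is not the PyCox implementation: the helpers `fit_cuts` and `to_index` are hypothetical, the sketch ignores censoring when placing cut points, and it simply rounds every time up to the next grid point, which may differ from what `KLDL.cont_to_disc` does.

```python
import numpy as np


def fit_cuts(durations, time_intervals, scheme="quantiles"):
    """Hypothetical helper: build a grid of cut points starting at 0."""
    durations = np.asarray(durations, dtype=float)
    if scheme == "quantiles":
        cuts = np.quantile(durations, np.linspace(0.0, 1.0, time_intervals))
        cuts[0] = 0.0  # the grid starts from 0
    else:
        cuts = np.linspace(0.0, durations.max(), time_intervals)
    return np.unique(cuts)  # drop repeated cut points caused by ties


def to_index(durations, cuts):
    """Map each continuous time to the index of the first cut point >= that time."""
    idx = np.searchsorted(cuts, durations, side="left")
    return np.clip(idx, 0, len(cuts) - 1)


prior_times = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0])
local_times = np.array([3.0, 12.0, 25.0])

# Fit the grid on the prior data only, then reuse it for the local data.
cuts = fit_cuts(prior_times, time_intervals=4, scheme="equidistant")
prior_idx = to_index(prior_times, cuts)
local_idx = to_index(local_times, cuts)
print(cuts, prior_idx, local_idx)
```

Reusing the same grid for both datasets keeps the discrete time indices comparable, which is why the tutorial passes the fitted `labtrans` when transforming the local data.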

In [7]:
time_intervals = 20

labtrans, prior_data = KLDL.cont_to_disc(prior_data, time_intervals = time_intervals)
data_local = KLDL.cont_to_disc(data_local, labtrans = labtrans, time_intervals = time_intervals)

In [8]:
prior_data

Unnamed: 0,x1,x2,x3,duration,event
0,-0.686096,0.551481,-0.932677,14,0
1,-0.978083,-0.359753,0.053454,10,1
2,0.259453,0.763028,0.282278,7,1
3,-0.954624,0.390381,0.769304,10,1
4,0.502788,-0.473844,0.015149,13,1
...,...,...,...,...,...
9995,-0.804342,-0.374652,-0.291306,19,1
9996,-0.220712,-0.690892,-0.774906,4,1
9997,0.837798,-0.581770,0.629657,4,1
9998,0.633505,-0.266602,-0.436441,2,1


In [9]:
data_local

Unnamed: 0,x1,x2,x3,duration,event
0,-0.616961,0.244218,-0.124545,10,0
1,0.570717,0.559952,-0.454815,2,0
2,-0.447071,0.603744,0.916279,4,1
3,0.751865,-0.284365,0.001990,18,0
4,0.366926,0.425404,-0.259498,6,0
...,...,...,...,...,...
295,-0.194381,0.862594,0.346326,4,1
296,-0.161540,0.137333,-0.916579,15,0
297,0.991268,0.238008,0.558932,9,1
298,0.862677,-0.316763,0.622598,14,1


In [10]:
data_local.groupby("duration").count()

Unnamed: 0_level_0,x1,x2,x3,event
duration,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,4,4,4,4
1,18,18,18,18
2,20,20,20,20
3,13,13,13,13
4,13,13,13,13
5,10,10,10,10
6,18,18,18,18
7,9,9,9,9
8,26,26,26,26
9,16,16,16,16


# Model Training

In this section you will see:

1. How to train a prior model.

2. How to use the trained prior model to select a proper $\eta$, the hyperparameter that sets the relative weights of prior and local information (this may change in the future).

3. How to train a local model with or without the aid of prior model and the selected $\eta$.

4. How to do hyperparameter tuning.

## Prior Model Training

Usage: `KLDL.prior_model_generation(data, parameter_set = None, time_intervals = 20, dropout = 0.1, optimizer = tt.optim.Adam(), epochs = 512, patience = 5, verbose = False)`

`data`: the prior data (here `prior_data`).

`parameter_set`: a dict of tunable hyperparameters related to the model structure. It consists of:

* `hidden_nodes`: number of hidden nodes (default 32)
* `hidden_layers`: number of hidden layers (default 2)
* `batch_norm`: whether to use batch normalization (default `True`)
* `learning_rate`: learning rate (default 0.01)
* `batch_size`: batch size (default 32)

`time_intervals`: the number of time points for the discrete model. It should equal the value you passed to `cont_to_disc()` if you used it to discretize continuous time.

`dropout`: the dropout rate.

`optimizer`: the optimizer, wrapped by `torchtuples`.

`epochs`: the maximum number of training epochs.

`patience`: the number of epochs to wait for an improvement before early stopping.

`verbose`: whether to print the training logs of the deep learning model.
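To make the interaction between `epochs` and `patience` concrete, here is a minimal sketch of patience-based early stopping on a list of validation losses. The function `early_stopping_epoch` is hypothetical, and the early stopping callback used inside `KLDL` may differ in its details (for example in how it restores the best weights).

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, best validation loss seen so far)."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the waiting counter.
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best
    # No early stop triggered: training ran for all available epochs.
    return len(val_losses) - 1, best


losses = [3.0, 2.5, 2.4, 2.45, 2.41, 2.6]
stop_epoch, best_loss = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_loss)
```

In this example training stops at epoch 4, after two consecutive epochs without improvement over the best loss of 2.4. A larger `patience` tolerates longer plateaus at the cost of more training time.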

In [11]:
parameter_set = {
    "hidden_nodes": 32,
    "hidden_layers": 2,
    "batch_norm": True,
    "learning_rate": 0.0001,
    "batch_size": 32
}

In [12]:
model_prior = KLDL.prior_model_generation(prior_data, parameter_set = parameter_set, verbose = True)

0:	[1s / 1s],		train_loss: 7.2531,	val_loss: 6.8533
1:	[1s / 3s],		train_loss: 6.8347,	val_loss: 6.5090
2:	[1s / 4s],		train_loss: 6.4734,	val_loss: 6.1252
3:	[1s / 5s],		train_loss: 6.1032,	val_loss: 5.6914
4:	[1s / 6s],		train_loss: 5.6911,	val_loss: 5.3180
5:	[1s / 8s],		train_loss: 5.2043,	val_loss: 4.8396
6:	[1s / 9s],		train_loss: 4.7018,	val_loss: 4.1826
7:	[1s / 10s],		train_loss: 4.1963,	val_loss: 3.8376
8:	[1s / 12s],		train_loss: 3.7608,	val_loss: 3.4585
9:	[1s / 14s],		train_loss: 3.3914,	val_loss: 3.1168
10:	[2s / 16s],		train_loss: 3.1002,	val_loss: 2.8875
11:	[1s / 17s],		train_loss: 2.8824,	val_loss: 2.6823
12:	[1s / 19s],		train_loss: 2.7154,	val_loss: 2.5854
13:	[0s / 20s],		train_loss: 2.6181,	val_loss: 2.5327
14:	[0s / 20s],		train_loss: 2.5362,	val_loss: 2.4606
15:	[1s / 21s],		train_loss: 2.4774,	val_loss: 2.4159
16:	[0s / 22s],		train_loss: 2.4351,	val_loss: 2.3869
17:	[0s / 23s],		train_loss: 2.4034,	val_loss: 2.3728
18:	[0s / 24s],		train_loss: 2.3962,	val_loss

### Hyperparameter Tuning

We recommend hyperparameter tuning for the prior model (and also for the local model).

Usage: `KLDL.hyperparameter_set_list(hidden_nodes=[32, 64, 128],
                            hidden_layers=[2, 3, 4],
                            batch_norm=[True, False],
                            learning_rate=[0.0005, 0.001, 0.01, 0.1],
                            batch_size=[32, 64, 128])`

This generates a list of dicts by grid search. Each dict is a hyperparameter set that can be passed directly as the `parameter_set` argument of `model_generation`. The 5 options are described above (see the usage of `KLDL.prior_model_generation`).

*Required: Each parameter in this method must be a list even if only one element is passed*.
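A grid of this kind can be built with `itertools.product`. The sketch below is an illustration only: `grid` is a hypothetical helper, not the code behind `KLDL.hyperparameter_set_list`, and the order of the generated dicts may differ from the package's output.

```python
from itertools import product


def grid(**options):
    """Hypothetical helper: return one dict per combination of the listed values."""
    for name, values in options.items():
        # Mirror the requirement that every option is a list, even with one element.
        if not isinstance(values, list):
            raise TypeError(f"{name} must be a list, got {type(values).__name__}")
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*options.values())]


sets = grid(
    hidden_nodes=[32, 64, 128],
    hidden_layers=[2, 3, 4],
    batch_norm=[True, False],
    learning_rate=[0.0005, 0.001, 0.01, 0.1],
    batch_size=[32, 64, 128],
)
print(len(sets))  # 3 * 3 * 2 * 4 * 3 combinations
print(sets[0])
```

With the default lists this yields 216 candidate sets, so an exhaustive search trains 216 models. Passing single-element lists for some options is a simple way to shrink the grid.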

In [23]:
KLDL.hyperparameter_set_list()

[{'batch_norm': True,
  'batch_size': 32,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.0005},
 {'batch_norm': True,
  'batch_size': 64,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.0005},
 {'batch_norm': True,
  'batch_size': 128,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.0005},
 {'batch_norm': True,
  'batch_size': 32,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.001},
 {'batch_norm': True,
  'batch_size': 64,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.001},
 {'batch_norm': True,
  'batch_size': 128,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.001},
 {'batch_norm': True,
  'batch_size': 32,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.01},
 {'batch_norm': True,
  'batch_size': 64,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'learning_rate': 0.01},
 {'batch_norm': True,
  'batch_size': 128,
  'hidden_layers': 2,
  'hidden_nodes': 32,
  'lea

Below is a sketch of hyperparameter tuning. Because of time constraints, it is not tested in this tutorial.

In [None]:
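# DO NOT RUN!!
# Untested sketch: `data` is a placeholder, and the (log, model) return value is assumed here.

best_model = None
best_score = float("inf")

set_list = KLDL.hyperparameter_set_list()
for candidate in set_list:
    log, model = KLDL.model_generation(data, parameter_set = candidate)
    # Use the lowest validation loss reached during training as the score.
    score = log.to_pandas().val_loss.min()
    if score < best_score:
        best_score = score
        best_model = model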

## Cross Validation (CV)

After training a prior model, we need to choose the value of $\eta$, which determines how much we 'trust' the information from the prior model. Mathematically, $l_\eta = \frac{l - \eta l_{KL}}{\eta + 1}$, so the combined loss can be read as '1 part local information and $\eta$ parts prior information'. For example, if $\eta = 1$, local and prior information are weighted 50-50 ($\frac{1}{1 + 1} = \frac{1}{2} = 50\%$). The CV is 5-fold and the metric is averaged over the folds. Currently the C-index is used, measured on a held-out part of the local data.
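The weights implied by the $\frac{1}{\eta + 1}$ normalization can be computed directly. The helper `mixing_weights` below is a hypothetical name used only for this illustration; it computes the relative weights and nothing else about the loss.

```python
def mixing_weights(eta):
    """Return (local_weight, prior_weight) implied by the 1 / (eta + 1) normalization."""
    return 1.0 / (1.0 + eta), eta / (1.0 + eta)


for eta in [0.1, 1, 5]:
    local_w, prior_w = mixing_weights(eta)
    print(f"eta={eta}: local {local_w:.1%}, prior {prior_w:.1%}")
```

For the three values used later in this tutorial, the local share is about 90.9% for $\eta = 0.1$, 50% for $\eta = 1$ and 16.7% for $\eta = 5$, so a larger $\eta$ means more reliance on the prior model.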

Usage: `KLDL.cross_validation_eta(df_local, eta_list, model_prior, parameter_set=None,
                         time_intervals=20,
                         dropout=0.1,
                         optimizer=tt.optim.Adam(),
                         epochs=512,
                         patience=5,
                         verbose=False)`

`df_local`: the local data.

`eta_list`: the list of candidate $\eta$ values to choose from.

`model_prior`: the trained prior model.

For the other parameters, see above.

*Remark: Besides the best $\eta$, this method also returns `df_train, df_test, x_test`. This is because the CV step and the subsequent local model training should share the same training/test split (**the test data is used to evaluate all models in the comparison**). The split is therefore done inside this function and the resulting data is reused in the later steps.*

In [13]:
eta_list = [0.1, 1, 5]
eta, df_train, df_test, x_test = KLDL.cross_validation_eta(data_local, eta_list, model_prior)



eta:  0.1
[0.7497921862011637, 0.7847049044056525, 0.7830423940149626, 0.772236076475478, 0.7772236076475478]
0.7733998337489609
eta:  1
[0.7755610972568578, 0.7622610141313383, 0.771404821280133, 0.7863674147963424, 0.6974231088944306]
0.7586034912718204
eta:  5
[0.7398171238570241, 0.7464671654197839, 0.7364921030756443, 0.7605985037406484, 0.742310889443059]
0.7451371571072319
CV ends


In [14]:
eta

0.1
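The selection can be reproduced from the fold-wise scores printed above. The sketch assumes that the best $\eta$ is the one with the highest mean C-index over the 5 folds; this matches the returned value here, but the exact rule inside `KLDL.cross_validation_eta` is not shown in this tutorial.

```python
# Fold-wise C-index values copied from the CV output above.
cv_scores = {
    0.1: [0.7497921862011637, 0.7847049044056525, 0.7830423940149626,
          0.772236076475478, 0.7772236076475478],
    1: [0.7755610972568578, 0.7622610141313383, 0.771404821280133,
        0.7863674147963424, 0.6974231088944306],
    5: [0.7398171238570241, 0.7464671654197839, 0.7364921030756443,
        0.7605985037406484, 0.742310889443059],
}

# Average over folds, then keep the eta with the highest mean (assumed rule).
mean_scores = {eta: sum(scores) / len(scores) for eta, scores in cv_scores.items()}
best_eta = max(mean_scores, key=mean_scores.get)

print(mean_scores)
print(best_eta)
```

Note that $\eta = 1$ has the single best fold (about 0.786) but also the worst one (about 0.697), which is why averaging over folds matters.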

## Local Model Training

We will train two versions of the model. They have the same structure and differ only in whether prior information is used.

We use the selected $\eta$ and the trained prior model to train our local model. Note that we use the same split as before (the `df_train` and `df_test` returned by `KLDL.cross_validation_eta`).

Usage: `KLDL.mapper_generation(cols_standardize = None, cols_leave = None)`

`cols_standardize`: the list of features to standardize.

`cols_leave`: the list of features to pass through unchanged.

This function returns a mapper that transforms a dataframe into a `numpy` array.

*Remark: The features in the resulting array are `cols_standardize + cols_leave`. Features that are not in `cols_standardize` are **NOT** added to `cols_leave` by default.*
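The behaviour described in the remark can be sketched with pandas and NumPy. `SimpleMapper` is a hypothetical stand-in written for this illustration, not the object returned by `KLDL.mapper_generation`; it assumes standardization with the training mean and the population standard deviation.

```python
import numpy as np
import pandas as pd


class SimpleMapper:
    """Hypothetical stand-in: standardize some columns, pass others through."""

    def __init__(self, cols_standardize=None, cols_leave=None):
        self.cols_standardize = list(cols_standardize or [])
        self.cols_leave = list(cols_leave or [])

    def fit_transform(self, df):
        # Learn mean and standard deviation on the training data only.
        block = df[self.cols_standardize].to_numpy(dtype=float)
        self.mean_ = block.mean(axis=0)
        self.std_ = block.std(axis=0)
        return self.transform(df)

    def transform(self, df):
        scaled = (df[self.cols_standardize].to_numpy(dtype=float) - self.mean_) / self.std_
        left = df[self.cols_leave].to_numpy(dtype=float)
        # Output columns are cols_standardize followed by cols_leave.
        return np.hstack([scaled, left]).astype("float32")


df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.0, 1.0, 0.0], "x3": [5.0, 5.5, 6.0]})
mapper = SimpleMapper(cols_standardize=["x1"], cols_leave=["x2"])
x = mapper.fit_transform(df)
print(x.shape)  # x3 is dropped because it is in neither list
```

The column `x3` does not appear in the output because it was listed in neither argument, which is the pitfall the remark warns about.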

Usage: `KLDL.model_generation(x_train, x_val, y_train, y_val, with_prior=True, eta=None, model_prior=None,
                     parameter_set=None,
                     time_intervals=20,
                     dropout=0.1,
                     optimizer=tt.optim.Adam(),
                     epochs=512,
                     patience=5,
                     verbose=False)`

`x_train, x_val, y_train, y_val`: the x and y data used to train the model.

`with_prior`: whether to incorporate the prior model. If `False`, `eta` and `model_prior` are not required.

For the other parameters, see above.

*Remark: The data is passed in a seemingly unusual way (`x_train, x_val, y_train, y_val` rather than `df_train, df_val`) because the data structure used for training differs between the models with and without prior information.*

In [23]:
parameter_set = {
    "hidden_nodes": 32,
    "hidden_layers": 2,
    "batch_norm": True,
    "learning_rate": 0.005,
    "batch_size": 32
}

In [24]:
df_val = df_train.sample(frac=0.2)
df_train = df_train.drop(df_val.index)

mapper = KLDL.mapper_generation(cols_standardize = ['x1', 'x2', 'x3'])

x_train = mapper.fit_transform(df_train).astype('float32')
x_val = mapper.transform(df_val).astype('float32')

y_train = (x_train, df_train['duration'].values, df_train['event'].values)
y_val = (x_val, df_val['duration'].values, df_val['event'].values)

model = KLDL.model_generation(x_train, x_val, y_train, y_val, eta = eta, model_prior = model_prior, parameter_set = parameter_set, verbose = True)

0:	[0s / 0s],		train_loss: 8.2483,	val_loss: 7.5524
1:	[0s / 0s],		train_loss: 7.9858,	val_loss: 7.3296
2:	[0s / 0s],		train_loss: 7.6148,	val_loss: 7.1517
3:	[0s / 0s],		train_loss: 7.3960,	val_loss: 6.9831
4:	[0s / 0s],		train_loss: 7.0011,	val_loss: 6.8196
5:	[0s / 0s],		train_loss: 6.9090,	val_loss: 6.6447
6:	[0s / 0s],		train_loss: 6.7233,	val_loss: 6.4749
7:	[0s / 0s],		train_loss: 6.6260,	val_loss: 6.3057
8:	[0s / 0s],		train_loss: 6.2498,	val_loss: 6.1165
9:	[0s / 0s],		train_loss: 6.0998,	val_loss: 5.9197
10:	[0s / 0s],		train_loss: 6.0497,	val_loss: 5.7481
11:	[0s / 0s],		train_loss: 5.8341,	val_loss: 5.5688
12:	[0s / 0s],		train_loss: 5.5578,	val_loss: 5.3814
13:	[0s / 0s],		train_loss: 5.3787,	val_loss: 5.1904
14:	[0s / 0s],		train_loss: 5.2287,	val_loss: 5.0136
15:	[0s / 0s],		train_loss: 5.2278,	val_loss: 4.8247
16:	[0s / 0s],		train_loss: 5.1106,	val_loss: 4.6008
17:	[0s / 0s],		train_loss: 4.6288,	val_loss: 4.4145
18:	[0s / 0s],		train_loss: 4.4807,	val_loss: 4.2170
19:

We can also train an ordinary local model, i.e. a model that does not incorporate prior information.

In [27]:
parameter_set = {
    "hidden_nodes": 32,
    "hidden_layers": 2,
    "batch_norm": True,
    "learning_rate": 0.005,
    "batch_size": 32
}

In [28]:
# The data structure of y_train and y_val is different

get_target = lambda df: (df['duration'].values, np.array(df['event'].values, dtype = np.float32))

y_train = get_target(df_train)
y_val = get_target(df_val)

model_local = KLDL.model_generation(x_train, x_val, y_train, y_val, with_prior = False, parameter_set = parameter_set, verbose = True)

0:	[0s / 0s],		train_loss: 8.2824,	val_loss: 7.5078
1:	[0s / 0s],		train_loss: 7.8921,	val_loss: 7.3277
2:	[0s / 0s],		train_loss: 7.6798,	val_loss: 7.1883
3:	[0s / 0s],		train_loss: 7.2649,	val_loss: 7.1078
4:	[0s / 0s],		train_loss: 7.2439,	val_loss: 7.0381
5:	[0s / 0s],		train_loss: 6.9598,	val_loss: 6.9477
6:	[0s / 0s],		train_loss: 6.8880,	val_loss: 6.8648
7:	[0s / 0s],		train_loss: 6.6893,	val_loss: 6.7423
8:	[0s / 0s],		train_loss: 6.4755,	val_loss: 6.5773
9:	[0s / 0s],		train_loss: 6.1216,	val_loss: 6.4025
10:	[0s / 0s],		train_loss: 5.9379,	val_loss: 6.2195
11:	[0s / 0s],		train_loss: 5.8156,	val_loss: 6.0275
12:	[0s / 0s],		train_loss: 5.6496,	val_loss: 5.8409
13:	[0s / 0s],		train_loss: 5.3284,	val_loss: 5.6368
14:	[0s / 0s],		train_loss: 5.1364,	val_loss: 5.4424
15:	[0s / 0s],		train_loss: 5.0785,	val_loss: 5.2270
16:	[0s / 0s],		train_loss: 4.7531,	val_loss: 5.0362
17:	[0s / 0s],		train_loss: 4.6068,	val_loss: 4.8283
18:	[0s / 0s],		train_loss: 4.1924,	val_loss: 4.6221
19:

# Evaluation

In this section you will see:

1. How to evaluate a model on the test data.

Usage: `KLDL.evaluation_metrics(x_test, durations_test, events_test, model)`

`x_test, durations_test, events_test`: the covariates, event times and event indicators of the test dataset.

`model`: the model to evaluate.

Three metrics are used for evaluation: the time-dependent C-index, the Integrated Brier Score (IBS), and the Integrated Negative Binomial Log-Likelihood (INBLL). For the C-index, higher is better; for the other two, lower is better.
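To give some intuition for the C-index, here is a minimal sketch of a plain Harrell-type concordance index with a single risk score per subject. It is a simplification: the package reports the time-dependent C-index, which compares predicted survival curves, and the function `concordance_index` below is a hypothetical helper written only for illustration.

```python
def concordance_index(durations, events, risk):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable


durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(durations, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why a higher C-index is better.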

In [29]:
durations_test, events_test = get_target(df_test)

In [39]:
concordance_td_local, integrated_brier_score_local, integrated_nbll_local = KLDL.evaluation_metrics(x_test, durations_test, events_test, model_local)
concordance_td_prior, integrated_brier_score_prior, integrated_nbll_prior = KLDL.evaluation_metrics(x_test, durations_test, events_test, model_prior)
concordance_td, integrated_brier_score, integrated_nbll = KLDL.evaluation_metrics(x_test, durations_test, events_test, model)

print("The C-index for local without prior: ", concordance_td_local)
print("The C-index for local with prior: ", concordance_td)
print("The C-index for prior: ", concordance_td_prior)
print("The IBS for local without prior: ", integrated_brier_score_local)
print("The IBS for local with prior: ", integrated_brier_score)
print("The IBS for prior: ", integrated_brier_score_prior)
print("The INBLL for local without prior: ", integrated_nbll_local)
print("The INBLL for local with prior: ", integrated_nbll)
print("The INBLL for prior: ", integrated_nbll_prior)

The C-index for local without prior:  0.6533665835411472
The C-index for local with prior:  0.71238570241064
The C-index for prior:  0.7115544472152951
The IBS for local without prior:  0.142921780409424
The IBS for local with prior:  0.1502653697125696
The IBS for prior:  0.14910643935465429
The INBLL for local without prior:  0.43334645529434623
The INBLL for local with prior:  0.4544365897650282
The INBLL for prior:  0.451068920566815


# Future Work

1. Interpolation: convert discrete survival probabilities to continuous ones for a fair comparison.
2. Different link functions: modify the link function to suit different kinds of data (i.e. determine which link function suits which data).
3. Options for avoiding data expansion (requires an interface between C++ and Python).
4. Hyperparameter tuning based on random search.
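As a possible starting point for the last item, random search can be sketched by sampling each hyperparameter independently from its list of candidate values. This is not part of the package: `sample_parameter_sets` is a hypothetical helper, and the search space simply reuses the default lists shown for `KLDL.hyperparameter_set_list`.

```python
import random


def sample_parameter_sets(space, n_samples, seed=0):
    """Hypothetical helper: draw n_samples random hyperparameter dicts from the space."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_samples)
    ]


space = {
    "hidden_nodes": [32, 64, 128],
    "hidden_layers": [2, 3, 4],
    "batch_norm": [True, False],
    "learning_rate": [0.0005, 0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}

candidates = sample_parameter_sets(space, n_samples=10)
print(len(candidates))
print(candidates[0])
```

Each sampled dict has the same keys as a `parameter_set`, so it could be passed to the training functions in the same way as the grid search candidates, while training far fewer models than the full grid.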