# Self-supervised Learning

Pretrain a "foundation model" on a large unlabeled dataset, that is, data without outcomes. The purpose of pre-training is to guide the downstream fine-tuning task (survival prediction). **Note that because this is a toy example, the effect of pre-training may be marginal.**

The inputs to the model are the features from the dataset and **not the outcomes**.

**Note that this notebook may take a while to complete when the number of epochs is large.** For self-supervised learning, a large number of epochs is recommended. For illustration, we ran it for 1,000 epochs.
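
To make "learning from features without outcomes" concrete, here is a minimal sketch of one common self-supervised pretext task: hide a subset of feature values and ask the model to recover them from the remaining ones. The helper `mask_features` and the 30% masking fraction are hypothetical and chosen for illustration only; the masking scheme actually used by `SelfSupervisedDataGenerator` may differ.

```python
import random

def mask_features(sample, mask_fraction=0.3, rng=None):
    """Split one sample into visible inputs and masked targets.

    `sample` maps feature name -> value. Hypothetical helper for
    illustration; not the masking implemented by the library.
    """
    rng = rng or random.Random(0)
    names = sorted(sample)
    # Always mask at least one feature so there is something to predict.
    n_masked = max(1, round(mask_fraction * len(names)))
    masked = set(rng.sample(names, n_masked))
    visible = {k: v for k, v in sample.items() if k not in masked}
    targets = {k: sample[k] for k in masked}
    return visible, targets

# Toy sample with the same feature names as this notebook (values are made up).
sample = {"f_{}".format(i): float(i) for i in range(10)}
visible, targets = mask_features(sample, mask_fraction=0.3, rng=random.Random(0))
print("visible:", sorted(visible))
print("targets:", sorted(targets))
```

No outcome column appears anywhere in this sketch: the training signal comes entirely from the features themselves.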

In [None]:
import sys
sys.path.append('/root/capsule/environment/clinical_transformer/')

In [2]:
import pandas as pd
from collections import Counter

from xai.models.SimplifiedClinicalTransformer.Trainer import Trainer
from xai.losses.survival import cIndex_SigmoidApprox as cindex_loss
from xai.metrics.survival import sigmoid_concordance as cindex

2024-05-28 17:59:09.171291: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-28 17:59:09.856672: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
INFO	2024-05-28 17:59:16,217	generated new fontManager


In [3]:
from xai.models import Trainer
from xai.models import SelfSupervisedTransformer
from xai.models import OptimizedSelfSupervisedDataGenerator as SelfSupervisedDataGenerator

from xai.losses.selfsupervision.classifier_regression import CompositeLoss

In [4]:
from samecode.random import set_seed

In [5]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

## Parameters

In [6]:
max_features_percentile=100
test_size=0.1
mode='self-supervision'
learning_rate=0.0001
repetitions=1
epochs=1000
verbose=0
seed=0
embedding_size = 128
num_heads = 2
num_layers = 2

loss = CompositeLoss(feature_w=1, value_w=0.1)  # Weights of the two loss terms: feature (key) prediction and value prediction
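
As a rough illustration of what the two weights control, the sketch below combines a classification term (which feature was hidden) with a regression term (its value) as a weighted sum. This is an assumption based on the parameter names and the module name `classifier_regression`; the function `composite_loss` is hypothetical and is not the implementation of `CompositeLoss`.

```python
import math

def composite_loss(feature_probs, true_feature, value_pred, value_true,
                   feature_w=1.0, value_w=0.1):
    """Weighted sum of a classification and a regression term (illustrative only)."""
    # Cross-entropy for predicting which feature (key) was hidden.
    feature_term = -math.log(feature_probs[true_feature])
    # Squared error for predicting the hidden feature's value.
    value_term = (value_pred - value_true) ** 2
    return feature_w * feature_term + value_w * value_term

# Toy numbers: the model gives probability 0.7 to the correct feature
# and predicts 0.5 where the true value is 1.0.
loss_example = composite_loss([0.7, 0.2, 0.1], 0, value_pred=0.5, value_true=1.0)
print(loss_example)
```

With `feature_w=1` and `value_w=0.1`, an error in the value prediction contributes ten times less to the total than an equally sized classification term.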

In [7]:
data = pd.read_csv('../data/dataset-pretrain.data.csv')
features = ["f_{}".format(i) for i in range(10)]

In [8]:
features

['f_0', 'f_1', 'f_2', 'f_3', 'f_4', 'f_5', 'f_6', 'f_7', 'f_8', 'f_9']

In [9]:
!rm -r ../results/runs/FoundationModel

rm: cannot remove '../results/runs/FoundationModel': No such file or directory


In [10]:
set_seed(0)
outdir = '../results/runs/FoundationModel/'

trainer = Trainer(
    out_dir = outdir,
    max_features_percentile=max_features_percentile,
    test_size=test_size,
    mode=mode,
    model=SelfSupervisedTransformer, 
    dataloader=SelfSupervisedDataGenerator,
    loss=loss,
    metrics=[]
)

trainer.setup_data(
    data, 
    discrete_features = [],
    continuous_features = features,
)

trainer.setup_model(
    embedding_size=embedding_size, 
    num_heads=num_heads, 
    num_layers=num_layers,
    learning_rate=learning_rate,
    batch_size_max=True,  # Use a single batch the size of the training / testing set
    # batch_size=10000,  # Alternative: set an explicit batch size
    save_best_only=False
)

trainer.fit(repetitions=repetitions, epochs=epochs, verbose=verbose, seed=seed)

INFO	2024-05-28 17:59:18,295	Setting up working directory: ../results/runs/FoundationModel/
INFO	2024-05-28 17:59:18,298	Number of continuous features: 10
INFO	2024-05-28 17:59:18,298	Number of discrete features: 0
INFO	2024-05-28 17:59:18,299	Number of samples: 19000
INFO	2024-05-28 17:59:18,327	Number of classes: 18
INFO	2024-05-28 17:59:18,327	RUN ID: fold-0_id-0
INFO	2024-05-28 17:59:18,328	RUN ID out directory: ../results/runs/FoundationModel//fold-0_id-0/
INFO	2024-05-28 17:59:19,995	Training samples: 17100
INFO	2024-05-28 17:59:19,996	Testing samples: 1900
INFO	2024-05-28 17:59:20,031	Number of features at 100th percentile: 10 that are non nans
2024-05-28 17:59:21.193247: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-05-28 17:59:21.193282: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: e376292c42ee
202
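
The sample counts in the log follow directly from `test_size=0.1`. A quick check of the arithmetic, assuming the test set is simply the rounded fraction of all samples (how the `Trainer` rounds internally is not shown here, and `split_sizes` is a hypothetical helper):

```python
def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test size."""
    n_test = round(n_samples * test_size)
    return n_samples - n_test, n_test

# 19000 samples with test_size=0.1, as reported in the log above.
n_train, n_test = split_sizes(19000, 0.1)
print(n_train, n_test)
```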

In [11]:
from xai.models import clean_run

clean_run(
    path='../results/runs/FoundationModel/',
    keep=[1, 10, 100, 500, 1000]
)

Processing: ../results/runs/FoundationModel/
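
The `keep` argument suggests that only the checkpoints from the listed epochs are retained. The sketch below shows that selection logic on a plain list of file names. Both the helper `select_checkpoints` and the `epoch-NNNN` naming pattern are hypothetical; the files written by the `Trainer` and the exact behavior of `clean_run` may differ.

```python
import re

def select_checkpoints(filenames, keep):
    """Split file names into (kept, dropped) based on their epoch number.

    Files without a recognizable epoch number are kept untouched.
    """
    keep = set(keep)
    kept, dropped = [], []
    for name in filenames:
        match = re.search(r"epoch-(\d+)", name)
        if match and int(match.group(1)) not in keep:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

# Made-up file names for illustration.
files = ["model.epoch-{:04d}.h5".format(e) for e in (1, 2, 10, 50, 100)]
kept, dropped = select_checkpoints(files, keep=[1, 10, 100, 500, 1000])
print("kept:", kept)
print("dropped:", dropped)
```

Keeping a few log-spaced epochs such as 1, 10, 100, 500 and 1000 makes it possible to compare early and late pretrained models during fine-tuning without storing every checkpoint.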
