# ICR - Identifying Age-Related Conditions
## Using Machine Learning to detect conditions with measurements of anonymous characteristics

TabPFN (tabular prior-data fitted network) is a transformer that has been pretrained on synthetic tabular datasets to approximate Bayesian inference, so it can classify a small table in a single forward pass without gradient-based training on the target data. The paper https://arxiv.org/abs/2207.01848 presents TabPFN in detail. My model follows the demo shown in the README at https://github.com/automl/TabPFN.

We start by loading our libraries and reading in our data as we did before.

In [4]:
# load libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
import warnings
from sklearn.metrics import log_loss
from tabpfn import TabPFNClassifier
from imblearn.over_sampling import RandomOverSampler
warnings.filterwarnings('ignore')

# build the paths to the data files under the local storage location
DATA_DIR = os.path.join(os.environ['DATAFILES_PATH'], 'ICR_Competition')
TRAIN_DATA = os.path.join(DATA_DIR, 'train.csv')
TEST_DATA = os.path.join(DATA_DIR, 'test.csv')
GREEKS_DATA = os.path.join(DATA_DIR, 'greeks.csv')

# load training data
train_df = pd.read_csv(TRAIN_DATA)

# split off the features (drop the target and the ID), then one-hot encode any categorical columns
X = train_df.drop(columns=['Class', 'Id'])
X = pd.get_dummies(X, drop_first=True)

y = train_df['Class'].astype(int)
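
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does to a frame with one numeric and one categorical column. The toy frame and its column names (`feature_a`, `group`) are hypothetical and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy frame (illustrative names, not the competition columns)
toy = pd.DataFrame({
    "feature_a": [0.2, 0.5, 0.1],
    "group": ["A", "B", "B"],
})

# Numeric columns pass through unchanged; the categorical column becomes
# indicator columns, and drop_first=True drops the first level ("A")
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
```

With a two-level categorical column, dropping the first level leaves a single indicator column, which avoids feeding the model two perfectly redundant features.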

### Training, tuning, and validating the TabPFN.

In [13]:
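def bal_log_loss(p, y):
    """Balanced log loss: the mean of the per-class average negative log-likelihoods.

    p is an (n_samples, 2) array of class probabilities; y holds the 0/1 labels.
    """
    # clip probabilities so that log(0) cannot produce infinities
    p = np.clip(p, 1e-15, 1 - 1e-15)

    # indicator vectors for each class
    y0 = (y == 0).astype(int)
    y1 = (y == 1).astype(int)

    # number of samples in each class
    N0 = y0.sum()
    N1 = y1.sum()

    # average the log loss within each class, then across the two classes
    return (-np.sum(y0 * np.log(p[:, 0])) / N0 - np.sum(y1 * np.log(p[:, 1])) / N1) / 2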
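
As a sanity check on the metric, the sketch below evaluates a balanced log loss on three hand-made predictions and compares it with the ordinary (unweighted) log loss. The function name `balanced_log_loss_demo` and the toy numbers are assumptions made for illustration; they are not results from this notebook.

```python
import numpy as np

def balanced_log_loss_demo(p, y, eps=1e-15):
    """Illustrative balanced log loss for binary labels (hypothetical helper)."""
    p = np.clip(p, eps, 1 - eps)          # guard against log(0)
    loss0 = -np.mean(np.log(p[y == 0, 0]))  # average loss over class-0 rows
    loss1 = -np.mean(np.log(p[y == 1, 1]))  # average loss over class-1 rows
    return (loss0 + loss1) / 2              # each class counts equally

# Toy predictions: two class-0 rows and one class-1 row
p = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7]])
y = np.array([0, 0, 1])

balanced = balanced_log_loss_demo(p, y)

# Ordinary log loss: every row counts equally, so the majority class dominates
plain = -np.mean(np.log(p[np.arange(len(y)), y]))

print(f"balanced = {balanced:.4f}, plain = {plain:.4f}")
```

The single class-1 row carries half of the balanced loss but only a third of the plain loss, which is why the two numbers differ on imbalanced data.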

In [42]:
# set fold size for 5-fold CV (any rows left over after the last fold are never used for validation)
fold_sz = np.floor(X.shape[0] / 5).astype(int)

verbose = True

accs_total = []
blls_total = []

# loop through each fold
for i in range(5):
    
    print(f"\n*********************** FOLD #{i} ***********************\n")
    
    # boolean mask: True for training rows, False for this fold's validation rows
    mask = np.ones(X.shape[0], dtype=bool)
    mask[(i*fold_sz):((i+1)*fold_sz)] = False
    
    X_train_raw = X.to_numpy()[mask]
    y_train_raw = y.to_numpy()[mask]
    
    X_val = X.to_numpy()[~mask]
    y_val = y.to_numpy()[~mask]
    
    # oversample the diagnosed patients (minority class) in the training set only
    ros = RandomOverSampler(random_state=77, sampling_strategy='minority')
    X_train, y_train = ros.fit_resample(X_train_raw, y_train_raw)

    # shuffle, in case row order affects the model
    shuff_ind = np.random.permutation(len(y_train))

    X_train = X_train[shuff_ind]
    y_train = y_train[shuff_ind]

    # set up grid of hyper-params
    parameters = {
        'N_ensemble_configurations': range(12, 42, 2)
    }

    # init lists of metrics
    accs = []
    blls = []

    # init counter
    trial = 1

    for n in parameters['N_ensemble_configurations']:

        # fit TabPFN
        classifier = TabPFNClassifier(device='cpu', N_ensemble_configurations=n)
        classifier.fit(X_train, y_train)

        # collect validation probabilities
        probs = classifier.predict_proba(X_val)

        # collect validation accuracy
        acc = np.mean(np.argmax(probs, 1) == y_val)

        # collect validation balanced logarithmic loss
        bll = bal_log_loss(probs, y_val)

        # store validation metrics
        accs.append(acc)
        blls.append(bll)

        if verbose:
            print(f'Trial #{trial:2} || Using:  N_ensemble_configurations={n} || Balanced Log-Loss = {bll:.5f}')

        trial += 1
            
    accs_total.append(accs)
    blls_total.append(blls)
    



*********************** FOLD #0 ***********************

Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Trail # 1 || Using:  N_ensemble_configurations=12 || Balanced Log-Loss = 0.42141
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Trail # 2 || Using:  N_ensemble_configurations=14 || Balanced Log-Loss = 0.41848
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Trail # 3 || Using:  N_ensemble_configurations=16 || Balanced Log-Loss = 0.42590
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Trail # 4 || Using:  N_ensemble_configurations=18 || Balanced Log-Loss = 0.42609
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Trail # 5 || Using:  N_ensemble_configurations=20 || Balanced Log-Loss = 0.43226
Loading model that can be used for inference only
Using a Transform

NameError: name 'accs_total' is not defined
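
Once the cross-validation loop has run to completion, `blls_total` holds one list of balanced log losses per fold, with one entry per candidate value of `N_ensemble_configurations`. The sketch below shows one way such a table could be summarized to pick a setting. The array `blls_demo` is filled with synthetic placeholder numbers, not results from this notebook, and the choice of the lowest mean loss as the selection rule is an assumption.

```python
import numpy as np

# Candidate values, matching the grid range(12, 42, 2) used in the loop
n_values = list(range(12, 42, 2))

# Synthetic placeholder losses (NOT real results): 5 folds x 15 candidates
rng = np.random.default_rng(0)
blls_demo = 0.42 + 0.01 * rng.standard_normal((5, len(n_values)))

# Average across folds (axis 0) to get one score per candidate value
mean_bll = blls_demo.mean(axis=0)
std_bll = blls_demo.std(axis=0)

# Pick the candidate with the lowest mean balanced log loss
best_idx = int(np.argmin(mean_bll))
best_n = n_values[best_idx]

print(f"best N_ensemble_configurations = {best_n} "
      f"(mean BLL = {mean_bll[best_idx]:.5f} +/- {std_bll[best_idx]:.5f})")
```

With the real data, `np.array(blls_total)` would take the place of `blls_demo`. Reporting the standard deviation alongside the mean helps show whether differences between neighbouring settings exceed the fold-to-fold noise.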