# ICR - Identifying Age-Related Conditions
## Using Machine Learning to detect conditions with measurements of anonymous characteristics

In this notebook, we evaluate the validation results from the trained models: 

* **TabTransformer** w/ SMOTE
* **SVM** w/ SMOTE
* **XGBoost** w/ SMOTE

We first load in the validation probability estimates from each model...

In [30]:
# load libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
import warnings
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
warnings.filterwarnings('ignore')

# load validation results
TABTR_PROBS = pd.read_csv('val_pred_probs/amzn-tab-trans.csv').to_numpy()
SVM_PROBS = pd.read_csv('val_pred_probs/svm-tuned.csv').to_numpy()
XGB_PROBS = pd.read_csv('val_pred_probs/xgboost-tuned.csv').to_numpy()


# include paths to data from local storage location
TRAIN_DATA = os.environ['DATAFILES_PATH'] + '/ICR_Competition/' + 'train.csv'

# load training data
train_df = pd.read_csv(TRAIN_DATA)

# allocate
X = train_df.drop(columns=['Class', 'Id'])
X = pd.get_dummies(X, drop_first=True)

y = train_df['Class'].astype(int)

# train-validation split 
_, _, _, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

Define Balanced Logarithmic Loss Function

In [18]:
def bal_log_loss(p, y):
    ind0 = np.where(y==0)[0]
    ind1 = np.where(y==1)[0]
    
    N0 = len(ind0)
    N1 = len(ind1)
    
    y0 = (y==0).astype(int)
    y1 = y.astype(int)
    
    return (- np.sum(y0*np.log(p[:, 0]))/N0 - np.sum(y1*np.log(p[:, 1]))/N1) / 2

Compare individual validation results from each model.

In [33]:
acc = np.mean(np.argmax(SVM_PROBS, 1)==y_val)
bll = bal_log_loss(SVM_PROBS, y_val)

print(75*"*")
print(75*"*")
print(f'Test Accuracy for TabTransformer: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for TabTransformer: {bll:.4f}')
print(75*"*")

acc = np.mean(np.argmax(SVM_PROBS, 1)==y_val)
bll = bal_log_loss(SVM_PROBS, y_val)

print(75*"*")
print(f'Test Accuracy for SVM: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for SVM: {bll:.4f}')
print(75*"*")

acc = np.mean(np.argmax(XGB_PROBS, 1)==y_val)
bll = bal_log_loss(XGB_PROBS, y_val)

print(75*"*")
print(f'Test Accuracy for XGBoost: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for XGBoost: {bll:.4f}')
print(75*"*")
print(75*"*")

***************************************************************************
***************************************************************************
Test Accuracy for TabTransformer: 0.9113
Testing Balanced Logarithmic Loss for TabTransformer: 0.4894
***************************************************************************
***************************************************************************
Test Accuracy for SVM: 0.9113
Testing Balanced Logarithmic Loss for SVM: 0.4894
***************************************************************************
***************************************************************************
Test Accuracy for XGBoost: 0.9677
Testing Balanced Logarithmic Loss for XGBoost: 0.2356
***************************************************************************
***************************************************************************


 Try combining the output probabilites to create the *ensemble* model.

In [None]:
## try averaging

## try logistic regression

## compare results

### Output Probabilties 

We first generate probabilities for the *official* validaiton set with tuned hyperparameters.

In [16]:
# load training data
train_df = pd.read_csv(TRAIN_DATA)

# allocate
X = train_df.drop(columns=['Class', 'Id'])
X = pd.get_dummies(X, drop_first=True)

y = train_df['Class'].astype(int)

# train-validation split 
X_train_raw, X_val, y_train_raw, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

X_val['EJ_B'].fillna(value=X_train_raw['EJ_B'].mode())
X_val = X_val.fillna(value=X_train_raw.mean())

X_train_raw['EJ_B'].fillna(value=X_train_raw['EJ_B'].mode())
X_train_raw = X_train_raw.fillna(value=X_train_raw.mean())

# over sample the diagnosed patients in training set
oversample = SMOTE(random_state=77, sampling_strategy='minority')
X_train, y_train = oversample.fit_resample(X_train_raw, y_train_raw)

# shuffle (in case the model choice may be impacted by ordering)
shuff_ind = np.random.choice(len(y_train), len(y_train), replace=False)

X_train = X_train.iloc[shuff_ind,]
y_train = y_train.iloc[shuff_ind,]

# make uints just in case 
y_train.astype('uint8')
y_val.astype('uint8')

# fit TabPFN with optimal hyperparams 
classifier = make_pipeline(StandardScaler(), SVC(gamma='auto', 
                                                 kernel='rbf', 
                                                 C=c, 
                                                 probability=True))
classifier.fit(X_train, y_train)

# collect val probabilities 
probs = classifier.predict_proba(X_val.dropna())

# store the validation-set predictions in csv format, locally.
pd.DataFrame(probs).to_csv("val_pred_probs/svm-tuned.csv", header=True, index=False)

Then we generate the probabilites for the test set (for submission).

In [17]:
# fit TabPFN with optimal hyperparams 
classifier = make_pipeline(StandardScaler(), SVC(gamma='auto', 
                                                 kernel='rbf', 
                                                 C=c, 
                                                 probability=True))

# fill NaNs
X['EJ_B'].fillna(value=X['EJ_B'].mode())
X = X.fillna(value=X.mean())

# over sample the diagnosed patients in training set
oversample = SMOTE(random_state=77, sampling_strategy='minority')
X_rs, y_rs = oversample.fit_resample(X, y)

# shuffle (in case the model choice may be impacted by ordering)
shuff_ind = np.random.choice(len(y_rs), len(y_rs), replace=False)
X_rs = X_rs.iloc[shuff_ind,]
y_rs = y_rs.iloc[shuff_ind,]

# fit xgboost model
classifier.fit(X_rs, y_rs)

# load testing data
test_df = pd.read_csv(TEST_DATA)
test_df['EJ_B'] = (test_df['EJ'] == 'B').astype('int')
X_test = test_df.drop(columns=['Id', 'EJ'])

# collect val probabilities 
probs = classifier.predict_proba(X_test)

# store the test-set predictions in csv format, locally.
pd.DataFrame(probs).to_csv("test_pred_probs/svm-tuned.csv", header=True, index=False)