# ICR - Identifying Age-Related Conditions
## Using Machine Learning to detect conditions with measurements of anonymous characteristics

In this notebook, we evaluate the validation results from the trained models: 

* **TabTransformer** w/ SMOTE
* **SVM** w/ SMOTE
* **XGBoost** w/ SMOTE

We first load in the validation probability estimates from each model...

In [1]:
# load libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
import warnings
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
warnings.filterwarnings('ignore')

# load validation results
TABTR_VAL_PROBS = pd.read_csv('val_pred_probs/amzn-tab-trans.csv').to_numpy()
SVM_VAL_PROBS = pd.read_csv('val_pred_probs/svm-tuned.csv').to_numpy()
XGB_VAL_PROBS = pd.read_csv('val_pred_probs/xgboost-tuned.csv').to_numpy()

# load training results
TABTR_TRAIN_PROBS = pd.read_csv('train_pred_probs/amzn-tab-trans.csv').to_numpy()
SVM_TRAIN_PROBS = pd.read_csv('train_pred_probs/svm-tuned.csv').to_numpy()
XGB_TRAIN_PROBS = pd.read_csv('train_pred_probs/xgboost-tuned.csv').to_numpy()

# include paths to data from local storage location
TRAIN_DATA = os.environ['DATAFILES_PATH'] + '/ICR_Competition/' + 'train.csv'

# load training data
train_df = pd.read_csv(TRAIN_DATA)

# allocate
X = train_df.drop(columns=['Class', 'Id'])
X = pd.get_dummies(X, drop_first=True)

y = train_df['Class'].astype(int)

# train-validation split 
X_train_raw, X_val, y_train_raw, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

X_val['EJ_B'].fillna(value=X_train_raw['EJ_B'].mode())
X_val = X_val.fillna(value=X_train_raw.mean())

X_train_raw['EJ_B'].fillna(value=X_train_raw['EJ_B'].mode())
X_train_raw = X_train_raw.fillna(value=X_train_raw.mean())

# over sample the diagnosed patients in training set
oversample = SMOTE(random_state=77, sampling_strategy='minority')
X_train, y_train = oversample.fit_resample(X_train_raw, y_train_raw)

# shuffle (in case the model choice may be impacted by ordering)
np.random.seed(77)
shuff_ind = np.random.choice(len(y_train), len(y_train), replace=False)

X_train = X_train.iloc[shuff_ind,]
y_train = y_train.iloc[shuff_ind,]

FileNotFoundError: [Errno 2] No such file or directory: 'train_pred_probs/amzn-tab-trans.csv'

Define Balanced Logarithmic Loss Function

In [2]:
def bal_log_loss(p, y):
    ind0 = np.where(y==0)[0]
    ind1 = np.where(y==1)[0]
    
    N0 = len(ind0)
    N1 = len(ind1)
    
    y0 = (y==0).astype(int)
    y1 = y.astype(int)
    
    return (- np.sum(y0*np.log(p[:, 0]))/N0 - np.sum(y1*np.log(p[:, 1]))/N1) / 2

Compare individual validation results from each model.

In [3]:
acc = np.mean(np.argmax(SVM_VAL_PROBS, 1)==y_val)
bll = bal_log_loss(SVM_VAL_PROBS, y_val)

print(75*"*")
print(75*"*")
print(f'Test Accuracy for TabTransformer: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for TabTransformer: {bll:.4f}')
print(75*"*")

acc = np.mean(np.argmax(SVM_VAL_PROBS, 1)==y_val)
bll = bal_log_loss(SVM_VAL_PROBS, y_val)

print(75*"*")
print(f'Test Accuracy for SVM: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for SVM: {bll:.4f}')
print(75*"*")

acc = np.mean(np.argmax(XGB_VAL_PROBS, 1)==y_val)
bll = bal_log_loss(XGB_VAL_PROBS, y_val)

print(75*"*")
print(f'Test Accuracy for XGBoost: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for XGBoost: {bll:.4f}')
print(75*"*")
print(75*"*")

***************************************************************************
***************************************************************************
Test Accuracy for TabTransformer: 0.9113
Testing Balanced Logarithmic Loss for TabTransformer: 0.4894
***************************************************************************
***************************************************************************
Test Accuracy for SVM: 0.9113
Testing Balanced Logarithmic Loss for SVM: 0.4894
***************************************************************************
***************************************************************************
Test Accuracy for XGBoost: 0.9677
Testing Balanced Logarithmic Loss for XGBoost: 0.2356
***************************************************************************
***************************************************************************


Try combining the output probabilites to create the *ensemble* model. First we try averaging the probability outputs from our models.

In [29]:
# concatenate the probability of positives from each model (train set)
train_probs_df = np.concatenate((TABTR_TRAIN_PROBS[:. 1:], SVM_TRAIN_PROBS[:, 1:], XGB_TRAIN_PROBS[:, 1:]), axis=1)

# concatenate the probability of positives from each model (validation set)
val_probs_df = np.concatenate((TABTR_VAL_PROBS[:. 1:], SVM_VAL_PROBS[:, 1:], XGB_VAL_PROBS[:, 1:]), axis=1)

# averaging results
final_probs_avging = np.mean(val_probs_df, axis=1)[:, np.newaxis]
final_probs_avging = np.concatenate((1-final_probs_avging, final_probs_avging), axis=1)

# check new accuracy
acc = np.mean(np.argmax(final_probs_avging, 1)==y_val)

# check new balanced logarithmic loss
bll = bal_log_loss(final_probs_avging, y_val)

print(f'Test Accuracy for Averaging Ensemble: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for Averaging Ensemble: {bll:.4f}')

Test Accuracy for Averaging Ensemble: 0.9677
Testing Balanced Logarithmic Loss for Averaging Ensemble: 0.2999


Second, we try using logistic regression to generate a new probabilty estimate from the transformation of the linear combination of the outputs.

In [37]:
from sklearn.linear_model import LogisticRegression

lr_mod = LogisticRegression(penalty='none')
lr_mod.fit(train_probs_df, y_train)

# collect new validation probabilities 
probs = lr_mod.predict_proba(val_probs_df)

# collect new validation accuracy 
acc = np.mean(np.argmax(probs, 1) == y_val)

# collect new validation balanced logarithmic loss
bll = bal_log_loss(probs, y_val)

print(f'Test Accuracy for Logistic Regression Ensemble: {acc:.4f}')
print(f'Testing Balanced Logarithmic Loss for Logistic Regression Ensemble: {bll:.4f}')

Test Accuracy for Logistic Regression Ensemble: 0.9597
Testing Balanced Logarithmic Loss for Logistic Regression Ensemble: 0.2403


We can see above that the probability outputs from the ensemble demonstrates a small balanced logarithmic loss at about 0.2403.

### Output Probabilties 

We now generate the probabilites for the test set (for submission).

In [16]:
# load test set results
TABTR_TEST_PROBS = pd.read_csv('test_pred_probs/amzn-tab-trans.csv').to_numpy()
SVM_TEST_PROBS = pd.read_csv('test_pred_probs/svm-tuned.csv').to_numpy()
XGB_TEST_PROBS = pd.read_csv('test_pred_probs/xgboost-tuned.csv').to_numpy()

# concatenate the probability of positives from each model (test set)
test_probs_df = np.concatenate((TABTR_TEST_PROBS[:. 1:], SVM_TEST_PROBS[:, 1:], XGB_TEST_PROBS[:, 1:]), axis=1)

# collect test probabilities 
probs = lr_mod.predict_proba(test_probs_df)

# store the test-set predictions in csv format, locally.
pd.DataFrame(probs).to_csv("test_pred_probs/log-reg-ensemble.csv", header=True, index=False)