

```
Reproduction of results of Scheme A in original paper
Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

Kha Vo, Jitendra Jonnagaddala, Siaw-Teng Liaw

February 2019

Jounal of Biomedical Informatics

Source of original code provided by the author:

Resources:

  Ahmad Anis. Pytorch LSTM: The Definitive Guide. Mar. 2022. URL: https://cnvrg.io/pytorch-lstm/.

  Improving LBFGS algorithm in pytorch. URL: http://sagecal.sourceforge.net/pytorch/index.html#: ̃:text=Closure,documentation%5C%2C%5C%20with%5C%20a%5C%20small%5C%20modification.

  Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"


Additional Experiments can be found towards the end of the notebook.

```

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Download the following files and save them in your local file system where they will be retrieved at the time of uploading: febrl4_UNSW.csv, ePBRN_D_dup.csv, ePBRN_F_dup.csv, febrl3_UNSW.csv from the repo provided by the original.
Repo provided by authors can be found here: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble

Upload febrl4_UNSW.csv, ePBRN_D_dup.csv, ePBRN_F_dup.csv, febrl3_UNSW.csv after running the following cell.

In [2]:
# upload febrl4_UNSW.csv, ePBRN_D_dup.csv, ePBRN_F_dup.csv, febrl3_UNSW.csv
from google.colab import files

uploaded = files.upload()

Saving febrl4_UNSW.csv to febrl4_UNSW.csv
Saving ePBRN_D_dup.csv to ePBRN_D_dup.csv
Saving ePBRN_F_dup.csv to ePBRN_F_dup.csv
Saving febrl3_UNSW.csv to febrl3_UNSW.csv


In [3]:
!pip install recordlinkage 
import recordlinkage

Collecting recordlinkage
  Downloading recordlinkage-0.14-py3-none-any.whl (944 kB)
[K     |████████████████████████████████| 944 kB 13.5 MB/s 
Collecting jellyfish>=0.5.4
  Downloading jellyfish-0.9.0.tar.gz (132 kB)
[K     |████████████████████████████████| 132 kB 36.0 MB/s 
Building wheels for collected packages: jellyfish
  Building wheel for jellyfish (setup.py) ... [?25l[?25hdone
  Created wheel for jellyfish: filename=jellyfish-0.9.0-cp37-cp37m-linux_x86_64.whl size=73994 sha256=b75f5a600b3438ff6601210be02c349489a00d0f4b0aca636037cc59d1baab70
  Stored in directory: /root/.cache/pip/wheels/fe/99/4e/646ce766df0d070b0ef04db27aa11543e2767fda3075aec31b
Successfully built jellyfish
Installing collected packages: jellyfish, recordlinkage
Successfully installed jellyfish-0.9.0 recordlinkage-0.14


In [4]:
import recordlinkage as rl, pandas as pd, numpy as np
from sklearn.model_selection import KFold
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.utils import shuffle
from recordlinkage.preprocessing import phonetic
from numpy.random import choice
import collections, numpy
from IPython.display import clear_output
from sklearn.model_selection import train_test_split, KFold
import time

In [5]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

def generate_true_links(df): 
    # although the match_id column is included in the original df to imply the true links,
    # this function will create the true_link object identical to the true_links properties
    # of recordlinkage toolkit, in order to exploit "Compare.compute()" from that toolkit
    # in extract_function() for extracting features quicker.
    # This process should be deprecated in the future release of the UNSW toolkit.
    df["rec_id"] = df.index.values.tolist()
    indices_1 = []
    indices_2 = []
    processed = 0
    for match_id in df["match_id"].unique():
        if match_id != -1:    
            processed = processed + 1
            # print("In routine generate_true_links(), count =", processed)
            # clear_output(wait=True)
            linkages = df.loc[df['match_id'] == match_id]
            for j in range(len(linkages)-1):
                for k in range(j+1, len(linkages)):
                    indices_1 = indices_1 + [linkages.iloc[j]["rec_id"]]
                    indices_2 = indices_2 + [linkages.iloc[k]["rec_id"]]    
    links = pd.MultiIndex.from_arrays([indices_1,indices_2])
    return links

def generate_false_links(df, size):
    # A counterpart of generate_true_links(), with the purpose to generate random false pairs
    # for training. The number of false pairs in specified as "size".
    df["rec_id"] = df.index.values.tolist()
    indices_1 = []
    indices_2 = []
    unique_match_id = df["match_id"].unique()
    for j in range(size):
            false_pair_ids = choice(unique_match_id, 2)
            candidate_1_cluster = df.loc[df['match_id'] == false_pair_ids[0]]
            candidate_1 = candidate_1_cluster.iloc[choice(range(len(candidate_1_cluster)))]
            candidate_2_cluster = df.loc[df['match_id'] == false_pair_ids[1]]
            candidate_2 = candidate_2_cluster.iloc[choice(range(len(candidate_2_cluster)))]    
            indices_1 = indices_1 + [candidate_1["rec_id"]]
            indices_2 = indices_2 + [candidate_2["rec_id"]]  
    links = pd.MultiIndex.from_arrays([indices_1,indices_2])
    return links

def swap_fields_flag(f11, f12, f21, f22):
    return int((f11 == f22) and (f12 == f21))

def extract_features(df, links):
    c = rl.Compare()
    c.string('given_name', 'given_name', method='jarowinkler', label='y_name')
    c.string('given_name_soundex', 'given_name_soundex', method='jarowinkler', label='y_name_soundex')
    c.string('given_name_nysiis', 'given_name_nysiis', method='jarowinkler', label='y_name_nysiis')
    c.string('surname', 'surname', method='jarowinkler', label='y_surname')
    c.string('surname_soundex', 'surname_soundex', method='jarowinkler', label='y_surname_soundex')
    c.string('surname_nysiis', 'surname_nysiis', method='jarowinkler', label='y_surname_nysiis')
    c.exact('street_number', 'street_number', label='y_street_number')
    c.string('address_1', 'address_1', method='levenshtein', threshold=0.7, label='y_address1')
    c.string('address_2', 'address_2', method='levenshtein', threshold=0.7, label='y_address2')
    c.exact('postcode', 'postcode', label='y_postcode')
    c.exact('day', 'day', label='y_day')
    c.exact('month', 'month', label='y_month')
    c.exact('year', 'year', label='y_year')
        
    # Build features
    feature_vectors = c.compute(links, df, df)
    return feature_vectors

def generate_train_X_y(df):
    # This routine is to generate the feature vector X and the corresponding labels y
    # with exactly equal number of samples for both classes to train the classifier.
    pos = extract_features(df, train_true_links)
    train_false_links = generate_false_links(df, len(train_true_links))    
    neg = extract_features(df, train_false_links)
    X = pos.values.tolist() + neg.values.tolist()
    y = [1]*len(pos)+[0]*len(neg)
    X, y = shuffle(X, y, random_state=0)
    X = np.array(X)
    y = np.array(y)
    return X, y

def train_model(modeltype, modelparam, train_vectors, train_labels, modeltype_2):
    if modeltype == 'svm': # Support Vector Machine
        model = svm.SVC(C = modelparam, kernel = modeltype_2)
        model.fit(train_vectors, train_labels) 
    elif modeltype == 'lg': # Logistic Regression
        model = LogisticRegression(C=modelparam, penalty = modeltype_2,class_weight=None, dual=False, fit_intercept=True, 
                                   intercept_scaling=1, max_iter=5000, multi_class='ovr', 
                                   n_jobs=1, random_state=None)
        model.fit(train_vectors, train_labels)
    elif modeltype == 'nb': # Naive Bayes
        model = GaussianNB()
        model.fit(train_vectors, train_labels)
    elif modeltype == 'nn': # Neural Network
        model = MLPClassifier(solver='lbfgs', alpha=modelparam, hidden_layer_sizes=(256, ), 
                              activation = modeltype_2,random_state=None, batch_size='auto', 
                              learning_rate='constant',  learning_rate_init=0.001, 
                              power_t=0.5, max_iter=10000, shuffle=True, 
                              tol=0.0001, verbose=False, warm_start=False, momentum=0.9, 
                              nesterovs_momentum=True, early_stopping=False, 
                              validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        model.fit(train_vectors, train_labels)
    ## addtion for testing ablation
    elif modeltype == 'nn_ablation': # Neural Network
        model = MLPClassifier(solver='sgd', alpha=modelparam, hidden_layer_sizes=(256, ), 
                              activation = 'tanh',random_state=None, batch_size='auto', 
                              learning_rate='adaptive',  learning_rate_init=0.001, 
                              power_t=0.5, max_iter=10000, shuffle=True, 
                              tol=0.0001, verbose=False, warm_start=False, momentum=0.9, 
                              nesterovs_momentum=True, early_stopping=False, 
                              validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        model.fit(train_vectors, train_labels)
    return model

def classify(model, test_vectors):
    result = model.predict(test_vectors)
    return result

    
def evaluation(test_labels, result):
    true_pos = np.logical_and(test_labels, result)
    count_true_pos = np.sum(true_pos)
    true_neg = np.logical_and(np.logical_not(test_labels),np.logical_not(result))
    count_true_neg = np.sum(true_neg)
    false_pos = np.logical_and(np.logical_not(test_labels), result)
    count_false_pos = np.sum(false_pos)
    false_neg = np.logical_and(test_labels,np.logical_not(result))
    count_false_neg = np.sum(false_neg)
    precision = count_true_pos/(count_true_pos+count_false_pos)
    sensitivity = count_true_pos/(count_true_pos+count_false_neg) # sensitivity = recall
    confusion_matrix = [count_true_pos, count_false_pos, count_false_neg, count_true_neg]
    no_links_found = np.count_nonzero(result)
    no_false = count_false_pos + count_false_neg
    Fscore = 2*precision*sensitivity/(precision+sensitivity)
    metrics_result = {'no_false':no_false, 'confusion_matrix':confusion_matrix ,'precision':precision,
                     'sensitivity':sensitivity ,'no_links':no_links_found, 'F-score': Fscore}
    return metrics_result

def blocking_performance(candidates, true_links, df):
    count = 0
    for candi in candidates:
        if df.loc[candi[0]]["match_id"]==df.loc[candi[1]]["match_id"]:
            count = count + 1
    return count


trainset = 'febrl3_UNSW'
testset = 'febrl4_UNSW'

The following cell creates the train dataset.

In [6]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## TRAIN SET CONSTRUCTION

# Import
print("Import train set...")
df_train = pd.read_csv(trainset+".csv", index_col = "rec_id")
train_true_links = generate_true_links(df_train)
print("Train set size:", len(df_train), ", number of matched pairs: ", str(len(train_true_links)))

# Preprocess train set
df_train['postcode'] = df_train['postcode'].astype(str)
df_train['given_name_soundex'] = phonetic(df_train['given_name'], method='soundex')
df_train['given_name_nysiis'] = phonetic(df_train['given_name'], method='nysiis')
df_train['surname_soundex'] = phonetic(df_train['surname'], method='soundex')
df_train['surname_nysiis'] = phonetic(df_train['surname'], method='nysiis')

# Final train feature vectors and labels
X_train, y_train = generate_train_X_y(df_train)
print("Finished building X_train, y_train")

Import train set...
Train set size: 5000 , number of matched pairs:  1165


  s = s.str.replace(r"[\-\_\s]", "")


Finished building X_train, y_train


In [7]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

# Blocking Criteria: declare non-match of all of the below fields disagree
# Import
print("Import test set...")
df_test = pd.read_csv(testset+".csv", index_col = "rec_id")
test_true_links = generate_true_links(df_test)
leng_test_true_links = len(test_true_links)
print("Test set size:", len(df_test), ", number of matched pairs: ", str(leng_test_true_links))

print("BLOCKING PERFORMANCE:")
blocking_fields = ["given_name", "surname", "postcode"]
all_candidate_pairs = []
for field in blocking_fields:
    block_indexer = rl.BlockIndex(on=field)
    candidates = block_indexer.index(df_test)
    detects = blocking_performance(candidates, test_true_links, df_test)
    all_candidate_pairs = candidates.union(all_candidate_pairs)
    print("Number of pairs of matched "+ field +": "+str(len(candidates)), ", detected ",
         detects,'/'+ str(leng_test_true_links) + " true matched pairs, missed " + 
          str(leng_test_true_links-detects) )
detects = blocking_performance(all_candidate_pairs, test_true_links, df_test)
print("Number of pairs of at least 1 field matched: " + str(len(all_candidate_pairs)), ", detected ",
     detects,'/'+ str(leng_test_true_links) + " true matched pairs, missed " + 
          str(leng_test_true_links-detects) )

Import test set...
Test set size: 10000 , number of matched pairs:  5000
BLOCKING PERFORMANCE:
Number of pairs of matched given_name: 154898 , detected  3287 /5000 true matched pairs, missed 1713
Number of pairs of matched surname: 170843 , detected  3325 /5000 true matched pairs, missed 1675
Number of pairs of matched postcode: 53197 , detected  4219 /5000 true matched pairs, missed 781
Number of pairs of at least 1 field matched: 372073 , detected  4894 /5000 true matched pairs, missed 106


The following cell creates a test dataset.

In [8]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## TEST SET CONSTRUCTION

# Preprocess test set
print("Processing test set...")
print("Preprocess...")
df_test['postcode'] = df_test['postcode'].astype(str)
df_test['given_name_soundex'] = phonetic(df_test['given_name'], method='soundex')
df_test['given_name_nysiis'] = phonetic(df_test['given_name'], method='nysiis')
df_test['surname_soundex'] = phonetic(df_test['surname'], method='soundex')
df_test['surname_nysiis'] = phonetic(df_test['surname'], method='nysiis')

# Test feature vectors and labels construction
print("Extract feature vectors...")
df_X_test = extract_features(df_test, all_candidate_pairs)
vectors = df_X_test.values.tolist()
labels = [0]*len(vectors)
feature_index = df_X_test.index
for i in range(0, len(feature_index)):
    if df_test.loc[feature_index[i][0]]["match_id"]==df_test.loc[feature_index[i][1]]["match_id"]:
        labels[i] = 1
X_test, y_test = shuffle(vectors, labels, random_state=0)
X_test = np.array(X_test)
y_test = np.array(y_test)
print("Count labels of y_test:",collections.Counter(y_test))
print("Finished building X_test, y_test")

Processing test set...
Preprocess...
Extract feature vectors...


  s = s.str.replace(r"[\-\_\s]", "")


Count labels of y_test: Counter({0: 367179, 1: 4894})
Finished building X_test, y_test


The following cell is the baseline code for the SVM model.

In [102]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## BASE LEARNERS CLASSIFICATION AND EVALUATION
## SVM MODEL; linear kernel; C=0.005
print("BASE LEARNERS CLASSIFICATION PERFORMANCE:")
modeltype = 'svm' # choose between 'svm', 'lg', 'nn'
modeltype_2 = 'linear'#'rbf'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
modelparam_range = [.005] # C for svm, C for lg, alpha for NN
print("Model:",modeltype,", Param_1:",modeltype_2, ", tuning range:", modelparam_range)
precision = []
sensitivity = []
Fscore = []
nb_false = []

for modelparam in modelparam_range:
    start_training_time = time.time()
    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision += [final_eval['precision']]
    sensitivity += [final_eval['sensitivity']]
    Fscore += [final_eval['F-score']]
    nb_false  += [final_eval['no_false']]
    
print("No_false:",nb_false,"\n")
print("Precision:",precision,"\n")
print("Sensitivity:",sensitivity,"\n")
print("F-score:", Fscore,"\n")
print("")

BASE LEARNERS CLASSIFICATION PERFORMANCE:
Model: svm , Param_1: linear , tuning range: [0.005]
Total training time 0.030590295791625977
No_false: [81] 

Precision: [0.9872443814537356] 

Sensitivity: [0.9963220269718022] 

F-score: [0.9917624326248348] 




The following cell is the baseline code for the NN model.




In [103]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## BASE LEARNERS CLASSIFICATION AND EVALUATION
## NN MODEL; relu
print("BASE LEARNERS CLASSIFICATION PERFORMANCE:")
modeltype = 'nn' # choose between 'svm', 'lg', 'nn'
modeltype_2 = 'relu'#'rbf'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
modelparam_range = [100] # C for svm, C for lg, alpha for NN
print("Model:",modeltype,", Param_1:",modeltype_2, ", tuning range:", modelparam_range)
precision = []
sensitivity = []
Fscore = []
nb_false = []

for modelparam in modelparam_range:
    start_training_time = time.time()
    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision += [final_eval['precision']]
    sensitivity += [final_eval['sensitivity']]
    Fscore += [final_eval['F-score']]
    nb_false  += [final_eval['no_false']]
    
print("No_false:",nb_false,"\n")
print("Precision:",precision,"\n")
print("Sensitivity:",sensitivity,"\n")
print("F-score:", Fscore,"\n")
print("")

BASE LEARNERS CLASSIFICATION PERFORMANCE:
Model: nn , Param_1: relu , tuning range: [100]
Total training time 0.48044896125793457
No_false: [79] 

Precision: [0.9896278218425869] 

Sensitivity: [0.9942787086228034] 

F-score: [0.991947813678524] 




Ablation of NN model:
Changing to activation tp logistic and alpha parameter to 50

In [26]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## BASE LEARNERS CLASSIFICATION AND EVALUATION
## NN logistic; C=50
print("BASE LEARNERS CLASSIFICATION PERFORMANCE:")
modeltype = 'nn' # choose between 'svm', 'lg', 'nn'
modeltype_2 = 'logistic'#'rbf'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
modelparam_range = [50] # C for svm, C for lg, alpha for NN
print("Model:",modeltype,", Param_1:",modeltype_2, ", tuning range:", modelparam_range)
precision = []
sensitivity = []
Fscore = []
nb_false = []

for modelparam in modelparam_range:
    start_training_time = time.time()
    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision += [final_eval['precision']]
    sensitivity += [final_eval['sensitivity']]
    Fscore += [final_eval['F-score']]
    nb_false  += [final_eval['no_false']]
    
print("No_false:",nb_false,"\n")
print("Precision:",precision,"\n")
print("Sensitivity:",sensitivity,"\n")
print("F-score:", Fscore,"\n")
print("")

BASE LEARNERS CLASSIFICATION PERFORMANCE:
Model: nn , Param_1: logistic , tuning range: [50]
Total training time 4.082343816757202
No_false: [81] 

Precision: [0.9912226985099] 

Sensitivity: [0.9922353902738047] 

F-score: [0.9917287858674565] 




The following cell is the baseline for the LG model

In [105]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## BASE LEARNERS CLASSIFICATION AND EVALUATION
## LG MODEL; l2; C=0.2
print("BASE LEARNERS CLASSIFICATION PERFORMANCE:")
modeltype = 'lg' # choose between 'svm', 'lg', 'nn'
modeltype_2 = 'l2'#'rbf'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
modelparam_range = [0.2] # C for svm, C for lg, alpha for NN
print("Model:",modeltype,", Param_1:",modeltype_2, ", tuning range:", modelparam_range)
precision = []
sensitivity = []
Fscore = []
nb_false = []

for modelparam in modelparam_range:
    start_training_time = time.time()
    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision += [final_eval['precision']]
    sensitivity += [final_eval['sensitivity']]
    Fscore += [final_eval['F-score']]
    nb_false  += [final_eval['no_false']]
    
print("No_false:",nb_false,"\n")
print("Precision:",precision,"\n")
print("Sensitivity:",sensitivity,"\n")
print("F-score:", Fscore,"\n")
print("")

BASE LEARNERS CLASSIFICATION PERFORMANCE:
Model: lg , Param_1: l2 , tuning range: [0.2]
Total training time 0.026948213577270508
No_false: [144] 

Precision: [0.9746203037569944] 

Sensitivity: [0.996526358806702] 

F-score: [0.9854516063851283] 




Bagging performance for all baseline models.

In [106]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## ENSEMBLE CLASSIFICATION AND EVALUATION

print("BAGGING PERFORMANCE:\n")
modeltypes = ['svm', 'nn', 'lg'] 
modeltypes_2 = ['linear', 'relu', 'l2']
modelparams = [0.005, 100, 0.2]
nFold = 10
kf = KFold(n_splits=nFold)
model_raw_score = [0]*3
model_binary_score = [0]*3
model_i = 0
for model_i in range(3):
    start_bagging_time = time.time()
    modeltype = modeltypes[model_i]
    modeltype_2 = modeltypes_2[model_i]
    modelparam = modelparams[model_i]
    print(modeltype, "per fold:")
    iFold = 0
    result_fold = [0]*nFold
    final_eval_fold = [0]*nFold
    start_training_time = time.time()
    for train_index, valid_index in kf.split(X_train):
        X_train_fold = X_train[train_index]
        y_train_fold = y_train[train_index]
        md =  train_model(modeltype, modelparam, X_train_fold, y_train_fold, modeltype_2)
        result_fold[iFold] = classify(md, X_test)
        final_eval_fold[iFold] = evaluation(y_test, result_fold[iFold])
        print("Fold", str(iFold), final_eval_fold[iFold])
        iFold = iFold + 1
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    bagging_raw_score = np.average(result_fold, axis=0)
    bagging_binary_score  = np.copy(bagging_raw_score)
    bagging_binary_score[bagging_binary_score > 0.5] = 1
    bagging_binary_score[bagging_binary_score <= 0.5] = 0
    bagging_eval = evaluation(y_test, bagging_binary_score)
    print(modeltype, "bagging:", bagging_eval)
    end_bagging_time = time.time()
    print ('Total bagging time', end_bagging_time-start_bagging_time)
    print('')
    model_raw_score[model_i] = bagging_raw_score
    model_binary_score[model_i] = bagging_binary_score

BAGGING PERFORMANCE:

svm per fold:
Fold 0 {'no_false': 80, 'confusion_matrix': [4876, 62, 18, 367117], 'precision': 0.987444309437019, 'sensitivity': 0.9963220269718022, 'no_links': 4938, 'F-score': 0.9918633034987795}
Fold 1 {'no_false': 83, 'confusion_matrix': [4876, 65, 18, 367114], 'precision': 0.9868447682655332, 'sensitivity': 0.9963220269718022, 'no_links': 4941, 'F-score': 0.9915607524148449}
Fold 2 {'no_false': 84, 'confusion_matrix': [4876, 66, 18, 367113], 'precision': 0.9866450829623634, 'sensitivity': 0.9963220269718022, 'no_links': 4942, 'F-score': 0.991459943066287}
Fold 3 {'no_false': 80, 'confusion_matrix': [4876, 62, 18, 367117], 'precision': 0.987444309437019, 'sensitivity': 0.9963220269718022, 'no_links': 4938, 'F-score': 0.9918633034987795}
Fold 4 {'no_false': 85, 'confusion_matrix': [4876, 67, 18, 367112], 'precision': 0.9864454784543799, 'sensitivity': 0.9963220269718022, 'no_links': 4943, 'F-score': 0.9913591542136829}
Fold 5 {'no_false': 79, 'confusion_matrix'

In [107]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

thres = .99

print("STACKING PERFORMANCE:\n")
stack_raw_score = np.average(model_raw_score, axis=0)
stack_binary_score = np.copy(stack_raw_score)
stack_binary_score[stack_binary_score > thres] = 1
stack_binary_score[stack_binary_score <= thres] = 0
stacking_eval = evaluation(y_test, stack_binary_score)
print(stacking_eval)


STACKING PERFORMANCE:

{'no_false': 60, 'confusion_matrix': [4861, 27, 33, 367152], 'precision': 0.9944762684124386, 'sensitivity': 0.9932570494483041, 'no_links': 4888, 'F-score': 0.9938662850132898}


Experimenting NN results using PyTorch library instead of using Sklearn as by the original implementation.

In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import KFold

The following code is the experimentation of implementing an MLP classifier using the PyTorch library with 4 layers and relu activation.

In [147]:
class MLP(torch.nn.Module):
        def __init__(self, input_dim, hidden_size):
            super(MLP, self).__init__()
            self.fc1 = nn.Linear(input_dim, 512)
            self.fc2 = nn.Linear(512, 256)
            self.fc3 = nn.Linear(256, 128)
            self.fc4 = nn.Linear(128, 1)
            
        def forward(self, x):
            output = self.fc1(x)
            output = F.relu(output)
            output = self.fc2(output)
            output = F.relu(output)
            output = self.fc3(output)
            output = F.relu(output)
            output = self.fc4(output)
            return output

In [10]:
def evaluate_performace(y_test, result):
    acc = accuracy_score(result, y_test)
    p, r, f, _ = precision_recall_fscore_support(result, y_test, average='binary')
    roc_auc = 0.
    try:
        roc_auc = roc_auc_score(result, y_test)
    except ValueError:
        pass
    # roc_auc = roc_auc_score(result, y_test)
    tn, fp, fn, tp = confusion_matrix(y_test, result).ravel()
    no_links_found = np.count_nonzero(result)
    no_false = fp + fn
    cm = [tp, fp, fn, tn]
    metrics_result = {'no_false':no_false, 'confusion_matrix':cm ,'precision':p,
                        'sensitivity':r ,'no_links':no_links_found, 'F-score': f, 'roc_auc': roc_auc, 'accuracy':acc}
    return metrics_result

In [11]:
x_batch = torch.from_numpy(X_train)
y_batch = torch.from_numpy(y_train)

batch_size=200

my_dataset = torch.utils.data.TensorDataset(x_batch,y_batch) 
train_loader = torch.utils.data.DataLoader(dataset=my_dataset, batch_size=batch_size, drop_last=True, shuffle=True)


MLP Model

In [150]:
### MLP model
input_dim=x_batch.size()[1]
hidden_size=256
num_epoch=100
model = MLP(input_dim, hidden_size)#.to(device)
print(model)

loss_fct = nn.MSELoss().cuda()
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.001, tolerance_grad=0.0001)

MLP(
  (fc1): Linear(in_features=13, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=1, bias=True)
)


MLP training with 100 epochs

In [151]:
# Sources used: Improving LBFGS algorithm in pytorch. URL: http://sagecal.sourceforge.net/pytorch/index.html#: ̃:text=Closure,documentation%5C%2C%5C%20with%5C%20a%5C%20small%5C%20modification.
####
total_training_time = time.time()
times = np.array([])
#train MLP
for epoch in range(num_epoch):
        startTime = time.time()
        losses_arr = np.array([])
        for batch_idx, (data, target) in enumerate(train_loader):

            px, py = data.float(), target.float()

            def closure():
                optimizer.zero_grad()
                output = model(px)
                loss = loss_fct(output, py.unsqueeze(1))
                loss.backward()
                return loss
            l = optimizer.step(closure)
            losses_arr = np.append(losses_arr, l.item())
        avg_loss = np.average(losses_arr)
        print('Epoch {}: avg train loss: {}'.format(epoch, avg_loss))

        executionTime = (time.time() - startTime)
        times = np.append(times, executionTime)

end_training_time = (time.time() - total_training_time)
print ('Total training time', end_training_time)
avg_runtime_for_each_epoch = np.average(times)
print ('Average runtime for each epoch for MLP : ', avg_runtime_for_each_epoch)


Epoch 0: avg train loss: 0.06833279454572634
Epoch 1: avg train loss: 0.02622341076758775
Epoch 2: avg train loss: 0.018823575821112503
Epoch 3: avg train loss: 0.01761209617622874
Epoch 4: avg train loss: 0.017771575938571583
Epoch 5: avg train loss: 0.02987327900799838
Epoch 6: avg train loss: 0.023504851385951042
Epoch 7: avg train loss: 0.014021376486529003
Epoch 8: avg train loss: 0.012935584411025047
Epoch 9: avg train loss: 0.013305147347802465
Epoch 10: avg train loss: 0.013357972810891542
Epoch 11: avg train loss: 0.01331077329814434
Epoch 12: avg train loss: 0.013449008407240564
Epoch 13: avg train loss: 0.01365531354465268
Epoch 14: avg train loss: 0.013441263698041439
Epoch 15: avg train loss: 0.013355729254809294
Epoch 16: avg train loss: 0.013490675440566107
Epoch 17: avg train loss: 0.013007214123552496
Epoch 18: avg train loss: 0.013563064180991867
Epoch 19: avg train loss: 0.01351778827268969
Epoch 20: avg train loss: 0.01351767194203355
Epoch 21: avg train loss: 0.013

In [152]:
x_batch_test = torch.from_numpy(X_test)#.cuda()
y_batch_test = torch.from_numpy(y_test)#.cuda()

Using test data on the MLP classifier implemented in PyTorch.

In [153]:
# MLP model test
y_pred = model(x_batch_test.float())
after_train = loss_fct(y_pred.squeeze(), y_batch_test.float()) 
print('Test loss after Training' , after_train.item())
y_hat = (y_pred > 0.5).int()#.cpu()
stacking_eval = evaluate_performace(y_test, y_hat)
print(stacking_eval)

Test loss after Training 0.02214452251791954
{'no_false': 131, 'confusion_matrix': [4870, 107, 24, 367072], 'precision': 0.9950960359624029, 'sensitivity': 0.9785011050833836, 'no_links': 4977, 'F-score': 0.9867288015398643, 'roc_auc': 0.9892178635448081, 'accuracy': 0.9996479185536171}


Using LSTM Model - Experiment

In [66]:
# Sources used: Ahmad Anis. Pytorch LSTM: The Definitive Guide. Mar. 2022. URL: https://cnvrg.io/pytorch-lstm/.
class LSTM(nn.Module):

    def __init__(self, input_dim, hidden_size, num_layers, output_size, batch_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_dim, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        num_samples = x.size(0)
        ht = torch.zeros(self.num_layers, num_samples, self.hidden_size)
        ct = torch.zeros(self.num_layers, num_samples, self.hidden_size)
        l1, l2 = [l for l in (ht,ct)]
        output, _ = self.lstm(x, (l1, l2))
        output = output[:,-1,:]
        output = self.fc1(output)
        output = self.sigmoid(output)
        return output

In [67]:
lr = 0.0005
n_epochs = 10
input_dim = x_batch.size()[1]    
hidden_size = 128
num_layers = 2
output_size = 1

batch_size = 32

x_batch = torch.from_numpy(X_train)
y_batch = torch.from_numpy(y_train)

x_test_batch = torch.from_numpy(X_test)
y_test_batch = torch.from_numpy(y_test)

my_dataset = torch.utils.data.TensorDataset(x_batch,y_batch)
train_loader = torch.utils.data.DataLoader(dataset=my_dataset, batch_size=batch_size, drop_last=True, shuffle=True)


LSTM model

In [68]:
model = LSTM(input_dim, hidden_size, num_layers, output_size, batch_size)
print(model)
criterion = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=lr)


LSTM(
  (lstm): LSTM(13, 128, num_layers=2, batch_first=True)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)


LSTM model implemented using PyTorch training using 10 epochs.

In [69]:
print('Start model training')
start_training_time = time.time()
epoch_times = np.array([])
for epoch in range(0, n_epochs):
    epoch_start_time = time.time()
    losses = np.array([])
    for i, (x_batch, y_batch) in enumerate(train_loader):

        x_batch = x_batch.reshape(batch_size,  1, input_dim)
  
        opt.zero_grad()
        out = model(x_batch.float())

        loss = criterion(out, y_batch.float().unsqueeze(1))
        losses = np.append(losses, loss.item())
        loss.backward()
        opt.step()
    epoch_end_time = time.time()
    epoch_times = np.append(epoch_times, epoch_end_time-epoch_start_time)
    print('Epoch {}: train loss: {}'.format(epoch, np.average(losses)))

end_training_time = time.time()
print ('Average epoch training time', np.average(epoch_times))
print ('Total training time', end_training_time-start_training_time)


Start model training
Epoch 0: train loss: 0.581181164417002
Epoch 1: train loss: 0.17439208536719283
Epoch 2: train loss: 0.031143200732508883
Epoch 3: train loss: 0.015829219890292734
Epoch 4: train loss: 0.011918853138922714
Epoch 5: train loss: 0.01119055348681286
Epoch 6: train loss: 0.010492084235819574
Epoch 7: train loss: 0.009968274703876685
Epoch 8: train loss: 0.00975786931667244
Epoch 9: train loss: 0.00958296051572284
Average epoch training time 0.3683640003204346
Total training time 3.6905150413513184


In [70]:
x_batch_test = torch.from_numpy(X_test)
y_batch_test = torch.from_numpy(y_test)
my_dataset = torch.utils.data.TensorDataset(x_batch_test,y_batch_test) 
test_loader = torch.utils.data.DataLoader(dataset=my_dataset, batch_size=batch_size, drop_last=True, shuffle=True)

Evaluate LSTM model on test data.

In [71]:
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix

def eval_model(model, val_loader):
    model.eval()
    Y_pred = []
    Y_true = []

    model.eval()
    with torch.no_grad():
      for i, (x_batch, y_batch) in enumerate(test_loader):
          x_batch = x_batch.reshape(batch_size,  1, input_dim)
          y_hat = model(x_batch.float())
          y_hat = (y_hat > 0.5).int()
          Y_pred.extend(y_hat.tolist())
          Y_true.extend(y_batch.tolist())

    return Y_pred, Y_true

y_pred, y_true = eval_model(model, test_loader)
res = evaluate_performace(y_true, y_pred)
print (res)

{'no_false': 980, 'confusion_matrix': [4884, 970, 10, 366200], 'precision': 0.9979566816510013, 'sensitivity': 0.8343013324222753, 'no_links': 5854, 'F-score': 0.9088202456270933, 'roc_auc': 0.9171370128428518, 'accuracy': 0.997366044551475}
