

```
Reproduction of results of Scheme B in original paper
Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

Kha Vo, Jitendra Jonnagaddala, Siaw-Teng Liaw

February 2019

Jounal of Biomedical Informatics

Source of original code provided by the author:

Resources:

Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"


Additional Experiments can be found towards the end of the notebook.

```



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Download the following files and save them in your local file system where they will be retrieved at the time of uploading: febrl4_UNSW.csv, ePBRN_D_dup.csv, ePBRN_F_dup.csv, febrl3_UNSW.csv from the repo provided by the original.
Repo provided by authors can be found here: https://github.com/ePBRN/Medical-Record-Linkage-Ensemble

Upload febrl4_UNSW.csv, ePBRN_D_dup.csv, ePBRN_F_dup.csv, febrl3_UNSW.csv after running the following cell.

In [3]:
# upload febrl4_UNSW.csv, ePBRN_D_dup.csv, ePBRN_F_dup.csv, febrl3_UNSW.csv
from google.colab import files

uploaded = files.upload()

Saving febrl4_UNSW.csv to febrl4_UNSW.csv
Saving ePBRN_D_dup.csv to ePBRN_D_dup.csv
Saving ePBRN_F_dup.csv to ePBRN_F_dup.csv
Saving febrl3_UNSW.csv to febrl3_UNSW.csv


In [4]:
!pip install recordlinkage 
import recordlinkage

Collecting recordlinkage
  Downloading recordlinkage-0.14-py3-none-any.whl (944 kB)
[K     |████████████████████████████████| 944 kB 22.3 MB/s 
Collecting jellyfish>=0.5.4
  Downloading jellyfish-0.9.0.tar.gz (132 kB)
[K     |████████████████████████████████| 132 kB 51.5 MB/s 
Building wheels for collected packages: jellyfish
  Building wheel for jellyfish (setup.py) ... [?25l[?25hdone
  Created wheel for jellyfish: filename=jellyfish-0.9.0-cp37-cp37m-linux_x86_64.whl size=73968 sha256=0320a7a1024adfa35566168b693a2c48bbb1bc524128c32c4afe6cd061741a86
  Stored in directory: /root/.cache/pip/wheels/fe/99/4e/646ce766df0d070b0ef04db27aa11543e2767fda3075aec31b
Successfully built jellyfish
Installing collected packages: jellyfish, recordlinkage
Successfully installed jellyfish-0.9.0 recordlinkage-0.14


In [5]:
import recordlinkage as rl, pandas as pd, numpy as np
from sklearn.model_selection import KFold
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.utils import shuffle
from recordlinkage.preprocessing import phonetic
from numpy.random import choice
import collections, numpy
from IPython.display import clear_output
from sklearn.model_selection import train_test_split, KFold
import time

In [6]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

trainset = 'ePBRN_F_dup' 
testset = 'ePBRN_D_dup'

import recordlinkage as rl, pandas as pd, numpy as np
from sklearn.model_selection import KFold
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.utils import shuffle
from recordlinkage.preprocessing import phonetic
from numpy.random import choice
import collections, numpy
from IPython.display import clear_output
from sklearn.model_selection import train_test_split, KFold


def generate_true_links(df): 
    # although the match_id column is included in the original df to imply the true links,
    # this function will create the true_link object identical to the true_links properties
    # of recordlinkage toolkit, in order to exploit "Compare.compute()" from that toolkit
    # in extract_function() for extracting features quicker.
    # This process should be deprecated in the future release of the UNSW toolkit.
    df["rec_id"] = df.index.values.tolist()
    indices_1 = []
    indices_2 = []
    processed = 0
    for match_id in df["match_id"].unique():
        if match_id != -1:    
            processed = processed + 1
            # print("In routine generate_true_links(), count =", processed)
            # clear_output(wait=True)
            linkages = df.loc[df['match_id'] == match_id]
            for j in range(len(linkages)-1):
                for k in range(j+1, len(linkages)):
                    indices_1 = indices_1 + [linkages.iloc[j]["rec_id"]]
                    indices_2 = indices_2 + [linkages.iloc[k]["rec_id"]]    
    links = pd.MultiIndex.from_arrays([indices_1,indices_2])
    return links

def generate_false_links(df, size):
    # A counterpart of generate_true_links(), with the purpose to generate random false pairs
    # for training. The number of false pairs in specified as "size".
    df["rec_id"] = df.index.values.tolist()
    indices_1 = []
    indices_2 = []
    unique_match_id = df["match_id"].unique()
    unique_match_id = unique_match_id[~np.isnan(unique_match_id)] # remove nan values
    for j in range(size):
            false_pair_ids = choice(unique_match_id, 2)
            candidate_1_cluster = df.loc[df['match_id'] == false_pair_ids[0]]
            candidate_1 = candidate_1_cluster.iloc[choice(range(len(candidate_1_cluster)))]
            candidate_2_cluster = df.loc[df['match_id'] == false_pair_ids[1]]
            candidate_2 = candidate_2_cluster.iloc[choice(range(len(candidate_2_cluster)))]    
            indices_1 = indices_1 + [candidate_1["rec_id"]]
            indices_2 = indices_2 + [candidate_2["rec_id"]]  
    links = pd.MultiIndex.from_arrays([indices_1,indices_2])
    return links

def swap_fields_flag(f11, f12, f21, f22):
    return ((f11 == f22) & (f12 == f21)).astype(float)

def join_names_space(f11, f12, f21, f22):
    return ((f11+" "+f12 == f21) | (f11+" "+f12 == f22)| (f21+" "+f22 == f11)| (f21+" "+f22 == f12)).astype(float)

def join_names_dash(f11, f12, f21, f22):
    return ((f11+"-"+f12 == f21) | (f11+"-"+f12 == f22)| (f21+"-"+f22 == f11)| (f21+"-"+f22 == f12)).astype(float)

def abb_surname(f1, f2):
    return ((f1[0]==f2) | (f1==f2[0])).astype(float)

def reset_day(f11, f12, f21, f22):
    return (((f11 == 1) & (f12 == 1))|((f21 == 1) & (f22 == 1))).astype(float)

def extract_features(df, links):
    c = rl.Compare()
    c.string('given_name', 'given_name', method='levenshtein', label='y_name_leven')
    c.string('surname', 'surname', method='levenshtein', label='y_surname_leven')  
    c.string('given_name', 'given_name', method='jarowinkler', label='y_name_jaro')
    c.string('surname', 'surname', method='jarowinkler', label='y_surname_jaro')  
    c.string('postcode', 'postcode', method='jarowinkler', label='y_postcode')      
    exact_fields = ['postcode', 'address_1', 'address_2', 'street_number']
    for field in exact_fields:
        c.exact(field, field, label='y_'+field+'_exact')
    c.compare_vectorized(reset_day,('day', 'month'), ('day', 'month'),label='reset_day_flag')    
    c.compare_vectorized(swap_fields_flag,('day', 'month'), ('day', 'month'),label='swap_day_month')    
    c.compare_vectorized(swap_fields_flag,('surname', 'given_name'), ('surname', 'given_name'),label='swap_names')    
    c.compare_vectorized(join_names_space,('surname', 'given_name'), ('surname', 'given_name'),label='join_names_space')
    c.compare_vectorized(join_names_dash,('surname', 'given_name'), ('surname', 'given_name'),label='join_names_dash')
    c.compare_vectorized(abb_surname,'surname', 'surname',label='abb_surname')
    # Build features
    feature_vectors = c.compute(links, df, df)
    return feature_vectors

def generate_train_X_y(df):
    # This routine is to generate the feature vector X and the corresponding labels y
    # with exactly equal number of samples for both classes to train the classifier.
    pos = extract_features(df, train_true_links)
    train_false_links = generate_false_links(df, len(train_true_links))    
    neg = extract_features(df, train_false_links)
    X = pos.values.tolist() + neg.values.tolist()
    y = [1]*len(pos)+[0]*len(neg)
    X, y = shuffle(X, y, random_state=0)
    X = np.array(X)
    y = np.array(y)
    return X, y

def train_model(modeltype, modelparam, train_vectors, train_labels, modeltype_2):
    if modeltype == 'svm': # Support Vector Machine
        model = svm.SVC(C = modelparam, kernel = modeltype_2)
        model.fit(train_vectors, train_labels) 
    elif modeltype == 'lg': # Logistic Regression
        model = LogisticRegression(C=modelparam, penalty = modeltype_2,class_weight=None, dual=False, fit_intercept=True, 
                                   intercept_scaling=1, max_iter=5000, multi_class='ovr', 
                                   n_jobs=1, random_state=None)
        model.fit(train_vectors, train_labels)
    elif modeltype == 'nb': # Naive Bayes
        model = GaussianNB()
        model.fit(train_vectors, train_labels)
    elif modeltype == 'nn': # Neural Network
        model = MLPClassifier(solver='lbfgs', alpha=modelparam, hidden_layer_sizes=(256, ), 
                              activation = modeltype_2,random_state=None, batch_size='auto', 
                              learning_rate='constant',  learning_rate_init=0.001, 
                              power_t=0.5, max_iter=30000, shuffle=True, 
                              tol=0.0001, verbose=False, warm_start=False, momentum=0.9, 
                              nesterovs_momentum=True, early_stopping=False, 
                              validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        model.fit(train_vectors, train_labels)
    elif modeltype == 'nn_ablation': # Neural Network
        model = MLPClassifier(solver='sgd', alpha=modelparam, hidden_layer_sizes=(256, ), 
                              activation = 'tanh',random_state=None, batch_size='auto', 
                              learning_rate='adaptive',  learning_rate_init=0.001, 
                              power_t=0.5, max_iter=10000, shuffle=True, 
                              tol=0.0001, verbose=False, warm_start=False, momentum=0.9, 
                              nesterovs_momentum=True, early_stopping=False, 
                              validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        model.fit(train_vectors, train_labels)
    return model

def classify(model, test_vectors):
    result = model.predict(test_vectors)
    return result

    
def evaluation(test_labels, result):
    true_pos = np.logical_and(test_labels, result)
    count_true_pos = np.sum(true_pos)
    true_neg = np.logical_and(np.logical_not(test_labels),np.logical_not(result))
    count_true_neg = np.sum(true_neg)
    false_pos = np.logical_and(np.logical_not(test_labels), result)
    count_false_pos = np.sum(false_pos)
    false_neg = np.logical_and(test_labels,np.logical_not(result))
    count_false_neg = np.sum(false_neg)
    precision = count_true_pos/(count_true_pos+count_false_pos)
    sensitivity = count_true_pos/(count_true_pos+count_false_neg) # sensitivity = recall
    confusion_matrix = [count_true_pos, count_false_pos, count_false_neg, count_true_neg]
    no_links_found = np.count_nonzero(result)
    no_false = count_false_pos + count_false_neg
    Fscore = 2*precision*sensitivity/(precision+sensitivity)
    metrics_result = {'no_false':no_false, 'confusion_matrix':confusion_matrix ,'precision':precision,
                     'sensitivity':sensitivity ,'no_links':no_links_found, 'F-score': Fscore}
    return metrics_result

def blocking_performance(candidates, true_links, df):
    count = 0
    for candi in candidates:
        if df.loc[candi[0]]["match_id"]==df.loc[candi[1]]["match_id"]:
            count = count + 1
    return count

The following cell creates the train dataset.

In [7]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## TRAIN SET CONSTRUCTION

# Import
print("Import train set...")
df_train = pd.read_csv(trainset+".csv", index_col = "rec_id")
train_true_links = generate_true_links(df_train)
print("Train set size:", len(df_train), ", number of matched pairs: ", str(len(train_true_links)))

# Preprocess train set
df_train['postcode'] = df_train['postcode'].astype(str)

# Final train feature vectors and labels
X_train, y_train = generate_train_X_y(df_train)
print("Finished building X_train, y_train")

Import train set...
Train set size: 14078 , number of matched pairs:  3192
Finished building X_train, y_train


In [8]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

# Blocking Criteria: declare non-match of all of the below fields disagree
# Import
print("Import test set...")
df_test = pd.read_csv(testset+".csv", index_col = "rec_id")
test_true_links = generate_true_links(df_test)
leng_test_true_links = len(test_true_links)
print("Test set size:", len(df_test), ", number of matched pairs: ", str(leng_test_true_links))

print("BLOCKING PERFORMANCE:")
blocking_fields = ["given_name", "surname", "postcode"]
all_candidate_pairs = []
for field in blocking_fields:
    block_indexer = rl.BlockIndex(on=field)
    candidates = block_indexer.index(df_test)
    detects = blocking_performance(candidates, test_true_links, df_test)
    all_candidate_pairs = candidates.union(all_candidate_pairs)
    print("Number of pairs of matched "+ field +": "+str(len(candidates)), ", detected ",
         detects,'/'+ str(leng_test_true_links) + " true matched pairs, missed " + 
          str(leng_test_true_links-detects) )
detects = blocking_performance(all_candidate_pairs, test_true_links, df_test)
print("Number of pairs of at least 1 field matched: " + str(len(all_candidate_pairs)), ", detected ",
     detects,'/'+ str(leng_test_true_links) + " true matched pairs, missed " + 
          str(leng_test_true_links-detects) )

Import test set...
Test set size: 11731 , number of matched pairs:  2653
BLOCKING PERFORMANCE:
Number of pairs of matched given_name: 252552 , detected  1567 /2653 true matched pairs, missed 1086
Number of pairs of matched surname: 33832 , detected  1480 /2653 true matched pairs, missed 1173
Number of pairs of matched postcode: 79940 , detected  2462 /2653 true matched pairs, missed 191
Number of pairs of at least 1 field matched: 362910 , detected  2599 /2653 true matched pairs, missed 54


The following cell creates a test dataset.

In [9]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## TEST SET CONSTRUCTION

# Preprocess test set
print("Processing test set...")
print("Preprocess...")
df_test['postcode'] = df_test['postcode'].astype(str)

# Test feature vectors and labels construction
print("Extract feature vectors...")
df_X_test = extract_features(df_test, all_candidate_pairs)
vectors = df_X_test.values.tolist()
labels = [0]*len(vectors)
feature_index = df_X_test.index
for i in range(0, len(feature_index)):
    if df_test.loc[feature_index[i][0]]["match_id"]==df_test.loc[feature_index[i][1]]["match_id"]:
        labels[i] = 1
X_test, y_test = shuffle(vectors, labels, random_state=0)
X_test = np.array(X_test)
y_test = np.array(y_test)
print("Count labels of y_test:",collections.Counter(y_test))
print("Finished building X_test, y_test")

Processing test set...
Preprocess...
Extract feature vectors...
Count labels of y_test: Counter({0: 360311, 1: 2599})
Finished building X_test, y_test


The following cell is the baseline code for the SVM model.

In [10]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## BASE LEARNERS CLASSIFICATION AND EVALUATION
## SVM MODEL; rbf; C=0.001
print("BASE LEARNERS CLASSIFICATION PERFORMANCE:")
modeltype = 'svm' # choose between 'svm', 'lg', 'nn'
modeltype_2 = 'rbf'#'rbf'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
modelparam_range = [.001] # C for svm, C for lg, alpha for NN
print("Model:",modeltype,", Param_1:",modeltype_2, ", tuning range:", modelparam_range)
precision = []
sensitivity = []
Fscore = []
nb_false = []

for modelparam in modelparam_range:
    start_training_time = time.time()
    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision += [final_eval['precision']]
    sensitivity += [final_eval['sensitivity']]
    Fscore += [final_eval['F-score']]
    nb_false  += [final_eval['no_false']]
    
print("No_false:",nb_false,"\n")
print("Precision:",precision,"\n")
print("Sensitivity:",sensitivity,"\n")
print("F-score:", Fscore,"\n")
print("")

BASE LEARNERS CLASSIFICATION PERFORMANCE:
Model: svm , Param_1: rbf , tuning range: [0.001]
Total training time 2.0498197078704834
No_false: [5537] 

Precision: [0.3178323412698413] 

Sensitivity: [0.9861485186610235] 

F-score: [0.4807277501641189] 




The following cell is the baseline code for the NN model.

In [11]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## BASE LEARNERS CLASSIFICATION AND EVALUATION
## NN MODEL; relu; C=2000
print("BASE LEARNERS CLASSIFICATION PERFORMANCE:")
modeltype = 'nn' # choose between 'svm', 'lg', 'nn'
modeltype_2 = 'relu'#'rbf'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
modelparam_range = [2000] # C for svm, C for lg, alpha for NN
print("Model:",modeltype,", Param_1:",modeltype_2, ", tuning range:", modelparam_range)
precision = []
sensitivity = []
Fscore = []
nb_false = []

for modelparam in modelparam_range:
    start_training_time = time.time()
    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision += [final_eval['precision']]
    sensitivity += [final_eval['sensitivity']]
    Fscore += [final_eval['F-score']]
    nb_false  += [final_eval['no_false']]
    
print("No_false:",nb_false,"\n")
print("Precision:",precision,"\n")
print("Sensitivity:",sensitivity,"\n")
print("F-score:", Fscore,"\n")
print("")

BASE LEARNERS CLASSIFICATION PERFORMANCE:
Model: nn , Param_1: relu , tuning range: [2000]
Total training time 0.9440796375274658
No_false: [1208] 

Precision: [0.6919679823350814] 

Sensitivity: [0.9646017699115044] 

F-score: [0.8058502089360334] 




The following cell is the baseline code for the LG model.

In [13]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## BASE LEARNERS CLASSIFICATION AND EVALUATION
## LG MODEL; l2; C=0.005
print("BASE LEARNERS CLASSIFICATION PERFORMANCE:")
modeltype = 'lg' # choose between 'svm', 'lg', 'nn'
modeltype_2 = 'l2'#'rbf'  # 'linear' or 'rbf' for svm, 'l1' or 'l2' for lg, 'relu' or 'logistic' for nn
modelparam_range = [0.005] # C for svm, C for lg, alpha for NN
print("Model:",modeltype,", Param_1:",modeltype_2, ", tuning range:", modelparam_range)
precision = []
sensitivity = []
Fscore = []
nb_false = []

for modelparam in modelparam_range:
    start_training_time = time.time()
    md = train_model(modeltype, modelparam, X_train, y_train, modeltype_2)
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    final_result = classify(md, X_test)
    final_eval = evaluation(y_test, final_result)
    precision += [final_eval['precision']]
    sensitivity += [final_eval['sensitivity']]
    Fscore += [final_eval['F-score']]
    nb_false  += [final_eval['no_false']]
    
print("No_false:",nb_false,"\n")
print("Precision:",precision,"\n")
print("Sensitivity:",sensitivity,"\n")
print("F-score:", Fscore,"\n")
print("")

BASE LEARNERS CLASSIFICATION PERFORMANCE:
Model: lg , Param_1: l2 , tuning range: [0.005]
Total training time 0.023532390594482422
No_false: [1827] 

Precision: [0.5905678085405913] 

Sensitivity: [0.9684494036167757] 

F-score: [0.7337122868386532] 




Bagging performance for all baseline models.

In [14]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

## ENSEMBLE CLASSIFICATION AND EVALUATION

print("BAGGING PERFORMANCE:\n")
modeltypes = ['svm', 'nn', 'lg'] 
modeltypes_2 = ['rbf', 'relu', 'l2']
modelparams = [0.001, 2000, 0.005]
nFold = 10
kf = KFold(n_splits=nFold)
model_raw_score = [0]*3
model_binary_score = [0]*3
model_i = 0
for model_i in range(3):
    start_bagging_time = time.time()
    modeltype = modeltypes[model_i]
    modeltype_2 = modeltypes_2[model_i]
    modelparam = modelparams[model_i]
    print(modeltype, "per fold:")
    iFold = 0
    result_fold = [0]*nFold
    final_eval_fold = [0]*nFold
    start_training_time = time.time()
    for train_index, valid_index in kf.split(X_train):
        X_train_fold = X_train[train_index]
        y_train_fold = y_train[train_index]
        md =  train_model(modeltype, modelparam, X_train_fold, y_train_fold, modeltype_2)
        result_fold[iFold] = classify(md, X_test)
        final_eval_fold[iFold] = evaluation(y_test, result_fold[iFold])
        print("Fold", str(iFold), final_eval_fold[iFold])
        iFold = iFold + 1
    end_training_time = time.time()
    print ('Total training time', end_training_time-start_training_time)
    bagging_raw_score = np.average(result_fold, axis=0)
    bagging_binary_score  = np.copy(bagging_raw_score)
    bagging_binary_score[bagging_binary_score > 0.5] = 1
    bagging_binary_score[bagging_binary_score <= 0.5] = 0
    bagging_eval = evaluation(y_test, bagging_binary_score)
    print(modeltype, "bagging:", bagging_eval)
    end_bagging_time = time.time()
    print ('Total bagging time', end_bagging_time-start_bagging_time)
    print('')
    model_raw_score[model_i] = bagging_raw_score
    model_binary_score[model_i] = bagging_binary_score

BAGGING PERFORMANCE:

svm per fold:
Fold 0 {'no_false': 4581, 'confusion_matrix': [2550, 4532, 49, 355779], 'precision': 0.3600677774639932, 'sensitivity': 0.9811465948441709, 'no_links': 7082, 'F-score': 0.5268050821196157}
Fold 1 {'no_false': 4545, 'confusion_matrix': [2551, 4497, 48, 355814], 'precision': 0.3619466515323496, 'sensitivity': 0.981531358214698, 'no_links': 7048, 'F-score': 0.5288690784699906}
Fold 2 {'no_false': 4660, 'confusion_matrix': [2552, 4613, 47, 355698], 'precision': 0.3561758548499651, 'sensitivity': 0.9819161215852251, 'no_links': 7165, 'F-score': 0.5227365833674724}
Fold 3 {'no_false': 4614, 'confusion_matrix': [2551, 4566, 48, 355745], 'precision': 0.3584375439089504, 'sensitivity': 0.981531358214698, 'no_links': 7117, 'F-score': 0.5251132153149444}
Fold 4 {'no_false': 4563, 'confusion_matrix': [2550, 4514, 49, 355797], 'precision': 0.36098527746319364, 'sensitivity': 0.9811465948441709, 'no_links': 7064, 'F-score': 0.5277864017385905}
Fold 5 {'no_false': 

In [15]:
# Source used: Kha Vo and Jitendra Jonnagaddala and Siaw-Teng Liaw. (2019). Medical-Record-Linkage-Ensemble. Retrieved from https://github.com/ePBRN/Medical-Record-Linkage-Ensemble. Paper: "Statistical supervised meta-ensemble algorithm for data linkage"

thres = .99

print("STACKING PERFORMANCE:\n")
stack_raw_score = np.average(model_raw_score, axis=0)
stack_binary_score = np.copy(stack_raw_score)
stack_binary_score[stack_binary_score > thres] = 1
stack_binary_score[stack_binary_score <= thres] = 0
stacking_eval = evaluation(y_test, stack_binary_score)
print(stacking_eval)


STACKING PERFORMANCE:

{'no_false': 1036, 'confusion_matrix': [2505, 942, 94, 359369], 'precision': 0.7267188859878155, 'sensitivity': 0.9638322431704501, 'no_links': 3447, 'F-score': 0.8286470393648694}


The following code is the experimentation of implementing an MLP classifier using the PyTorch library with 4 layers and relu activation.

In [16]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import KFold

In [17]:
class MLP(torch.nn.Module):
        def __init__(self, input_dim, hidden_size):
            super(MLP, self).__init__()
            self.fc1 = nn.Linear(input_dim, 256)
            self.fc2 = nn.Linear(256, 128)
            self.fc3 = nn.Linear(128, 1)
            
        def forward(self, x):
            output = self.fc1(x)
            output = F.relu(output)
            output = self.fc2(output)
            output = F.relu(output)
            output = self.fc3(output)
            return output

In [18]:
def evaluate_performace(y_test, result):
    acc = accuracy_score(result, y_test)
    p, r, f, _ = precision_recall_fscore_support(result, y_test, average='binary')
    roc_auc = 0.
    try:
        roc_auc = roc_auc_score(result, y_test)
    except ValueError:
        pass
    tn, fp, fn, tp = confusion_matrix(y_test, result).ravel()
    no_links_found = np.count_nonzero(result)
    no_false = fp + fn
    cm = [tp, fp, fn, tn]
    metrics_result = {'no_false':no_false, 'confusion_matrix':cm ,'precision':p,
                        'sensitivity':r ,'no_links':no_links_found, 'F-score': f, 'roc_auc': roc_auc, 'accuracy':acc}
    return metrics_result

In [19]:
x_batch = torch.from_numpy(X_train)
y_batch = torch.from_numpy(y_train)

batch_size=200

my_dataset = torch.utils.data.TensorDataset(x_batch,y_batch) 
train_loader = torch.utils.data.DataLoader(dataset=my_dataset, batch_size=batch_size, drop_last=True, shuffle=True)


MLP Model

In [20]:
### MLP model
input_dim=x_batch.size()[1]
hidden_size=256
num_epoch=100
model = MLP(input_dim, hidden_size)#.to(device)
print(model)

loss_fct = nn.MSELoss().cuda()
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.001, tolerance_grad=0.0001)

MLP(
  (fc1): Linear(in_features=15, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
)


In [21]:
####
total_training_time = time.time()
times = np.array([])
#train MLP
for epoch in range(num_epoch):
        startTime = time.time()
        losses_arr = np.array([])
        for batch_idx, (data, target) in enumerate(train_loader):

            px, py = data.float(), target.float()

            def closure():
                optimizer.zero_grad()
                output = model(px)
                loss = loss_fct(output, py.unsqueeze(1))
                loss.backward()
                return loss
            l = optimizer.step(closure)
            losses_arr = np.append(losses_arr, l.item())
        avg_loss = np.average(losses_arr)
        print('Epoch {}: avg train loss: {}'.format(epoch, avg_loss))

        executionTime = (time.time() - startTime)
        times = np.append(times, executionTime)

end_training_time = (time.time() - total_training_time)
print ('Total training time', end_training_time)
avg_runtime_for_each_epoch = np.average(times)
print ('Average runtime for each epoch for MLP : ', avg_runtime_for_each_epoch)


Epoch 0: avg train loss: 0.0900833565861948
Epoch 1: avg train loss: 0.018745142065228954
Epoch 2: avg train loss: 0.01736730301091748
Epoch 3: avg train loss: 0.017382582228991292
Epoch 4: avg train loss: 0.0174736640746555
Epoch 5: avg train loss: 0.017216297076834787
Epoch 6: avg train loss: 0.017155839580922358
Epoch 7: avg train loss: 0.017333178089991692
Epoch 8: avg train loss: 0.017318070355442264
Epoch 9: avg train loss: 0.017262197850692655
Epoch 10: avg train loss: 0.017141237374274962
Epoch 11: avg train loss: 0.01726074648960944
Epoch 12: avg train loss: 0.017359069218077967
Epoch 13: avg train loss: 0.017411279582208204
Epoch 14: avg train loss: 0.01735428056769794
Epoch 15: avg train loss: 0.017247081794325384
Epoch 16: avg train loss: 0.017376036383211613
Epoch 17: avg train loss: 0.017392932497445616
Epoch 18: avg train loss: 0.017388079045039993
Epoch 19: avg train loss: 0.017181594405443437
Epoch 20: avg train loss: 0.017239679281990373
Epoch 21: avg train loss: 0.01

In [22]:
x_batch_test = torch.from_numpy(X_test)#.cuda()
y_batch_test = torch.from_numpy(y_test)#.cuda()

MLP Model

In [23]:
# MLP model
y_pred = model(x_batch_test.float())
after_train = loss_fct(y_pred.squeeze(), y_batch_test.float()) 
print('Test loss after Training' , after_train.item())
y_hat = (y_pred > 0.5).int()#.cpu()
stacking_eval = evaluate_performace(y_test, y_hat)
print(stacking_eval)

Test loss after Training 0.05276618152856827
{'no_false': 1952, 'confusion_matrix': [2522, 1875, 77, 358436], 'precision': 0.9703732204694113, 'sensitivity': 0.5735728906072322, 'no_links': 4397, 'F-score': 0.7209834190966266, 'roc_auc': 0.786679057287003, 'accuracy': 0.9946212559587777}
