# Automated metadata matching

Life sciences (LS) and clinical research institutes (academic research institutes, pharma companies, hospitals, clinics etc.) across the world produce large volumes of patient data. These range from clinical information such as diagnostic/prognostic data, to omic data such as genetic/proteomic/epigenetic screens, to pathological data such as MRI scans. One of the main objectives of LS research (both academic and industrial) is to gain actionable insights from these datasets that go beyond the diagnosis/prognosis of a (group of) patient(s), provide a deeper understanding of diseases, and shed light on new therapeutic options. It is becoming apparent that gaining such insights requires data from a large number of patients. This would be achievable if datasets from various institutes could be merged, but merging is proving to be a hugely challenging task, simply because different institutes use different standards, units, nomenclature etc. to store data. <br><br>

For instance, a patient's age is a common clinical parameter recorded by almost all organisations. One institute may name the variable that records it 'patient age', another 'age', and others 'age at diagnosis', 'days since birth', 'years since birth' etc. The values may also be in days, months, years etc. Therefore, to combine data from many institutes (and sometimes from within the same institute), it is essential to recognise that all of the above variables record the same thing, i.e. the patient's age, and to make sure that the units (days, months, years) are homogenised at the time of integration. <br><br>
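
As a rough illustration of the unit homogenisation step, the sketch below converts ages recorded in different units to years. The column names, units and values are made up for this example, and the conversion factors (365.25 days and 12 months per year) are an assumption, not part of any CDE definition.

```python
# Hypothetical conversion factors, for illustration only.
DAYS_PER_YEAR = 365.25
MONTHS_PER_YEAR = 12

# (column name, unit, value) triples as they might appear at three institutes
records = [
    ("patient age", "years", 54),
    ("days since birth", "days", 19723.5),
    ("age", "months", 648),
]

def to_years(value, unit):
    """Convert an age expressed in days, months or years to years."""
    if unit == "years":
        return float(value)
    if unit == "months":
        return value / MONTHS_PER_YEAR
    if unit == "days":
        return value / DAYS_PER_YEAR
    raise ValueError(f"unknown unit: {unit}")

# All three records describe the same age once the units agree.
ages_in_years = [to_years(value, unit) for _, unit, value in records]
print(ages_in_years)
```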

To assist in this process, the National Cancer Institute (NCI) created the concept of the CDE (common data element); see https://cdebrowser.nci.nih.gov/cdebrowserClient/cdeBrowser.html#/search for more details. CDEs provide a standard format for representing life sciences data: a standard variable name, the permissible values, units etc. for each clinical parameter. A large data dump of about 69000 CDEs is provided in XML format in the 'cde_database\full_database' folder, if you want to explore further. Some research organizations follow this standard, but the vast majority do not. Additionally, a huge amount of the data produced until now has not been standardized using CDEs. <br><br>

To be able to integrate data from various institutes, we need to be able to match the variable names in the clinical datasets to the corresponding CDE elements. Currently, there is a drive for developing ML/AI algorithms to achieve this.<br><br>


The code below is an initial attempt in this direction. In summary, it tries to match the variable names (generally the column headers in a clinical data file) and values (the column values) of the clinical parameters in a dataset to the long names and permissible values of the CDE elements. The objective is to find the CDE elements that most closely match each clinical parameter name (i.e. the column header). To do this, the following steps are performed: <br><br>

1. Convert selected aspects (e.g. long_name, permissible_values etc.) of all CDE elements into numerical vectors using a word embedding model that is itself trained on these data.

2. Convert the clinical parameter names (headers) and values into numerical vectors using the same word embedding model.

3. Match the vectors from the clinical data to the CDE vectors. This can be done in a few different ways: <br>
   (a) Unsupervised learning: fit a nearest-neighbour model to the CDE vectors and look up the nearest neighbours of each clinical parameter. <br>
   (b) Supervised learning: create feature vectors for all possible pairs of clinical parameters and CDE elements, treat the true pairs as the positive class (target = 1) and the remaining pairs as the negative class (target = 0), train a classifier on these data, and use it to evaluate new clinical parameters.
   
 
See more details below.
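
The unsupervised option (a) can be sketched with scikit-learn's `NearestNeighbors`. This is only a toy illustration: the two-dimensional vectors, the CDE identifiers and the choice of cosine distance are assumptions made for this example, not the settings used by the `cde_modelling` package.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D "embeddings" for three CDE elements (made-up numbers and IDs).
cde_ids = ["cde_age", "cde_gender", "cde_radiotherapy"]
cde_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
])

# Fit an unsupervised nearest-neighbour index on the CDE vectors.
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(cde_vectors)

# A clinical parameter (e.g. the header 'patient age') embedded in the same space.
query = np.array([[0.9, 0.1]])
distances, indices = nn.kneighbors(query)

# Candidate CDEs for this parameter, ranked from closest to farthest.
ranked = [cde_ids[i] for i in indices[0]]
print(ranked)
```

The supervised option (b) is the one developed in the rest of this notebook.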
   
   
   
   





## Install custom benchmark solutions libraries

In [None]:
!pip install cde_modelling_tools/.

In [None]:
!pip install -r requirements.txt

## Import necessary libraries

In [None]:
import pandas as pd

import json
import numpy as np
import random
import mlflow
from cde_modelling.modelling import CDE_data_modeller as cdm
from cde_modelling.parsing import TCGA_data_parser as tdp
from cde_modelling.utils import Accuracy_calculations as ac
import pickle 
from sklearn.model_selection import train_test_split

## File paths

In [None]:
clinical_data_files_dir = 'tcga_training_data/'

clinical_data_test_dir = 'tcga_test_data/'

cde_database_file = 'cde_database/combined_small_dataset.json'

parameter_file = 'params_supervised.json'

model_dir = 'models/'

test_gold_standard = 'gold_standard/test_gs.json'

## Load model parameters

In [None]:
# read model parameters
params = {}

with open(parameter_file,'r') as file:
    params = json.load(file)


## Create a fasttext model for the CDE database and index the individual CDE elements in the database 

FastText is a word embedding algorithm developed by Facebook. Given a corpus, it builds a model that tries to predict whether a pair of words appears in the same context. The model first converts each word into a numeric vector, and these vectors are used as features for that prediction. We are interested in the feature generation part, i.e. the part that converts words to numeric vectors. For more information on the FastText model, see https://radimrehurek.com/gensim/models/fasttext.html.

### FastText model training: 

To train a FastText model we first extracted the long_name and permissible_values of each CDE element. These were then parsed and cleaned (lower-cased, restricted to alphanumeric characters, and split into a bag of words). The preprocessed long names and permissible values of all CDE elements form the training corpus, which was then used to train a FastText model. The parameters for the model are in the JSON parameter file loaded above. The trained model is then used to index the CDE elements (i.e. to create numeric vectors representing each CDE). We created two sets of vectors for the CDE elements, one for the long names and the other for the permissible values. We also extracted the data_type information for each CDE element. Below is an example. Let's assume that the following is an (oversimplified) CDE element.
CDE_element: 
{
'public_id': 1234,
'long_name': 'received radiotherapy',
..............
'permissible_values': ['yes','no']
}.
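
The cleaning step described above (lower-casing, keeping alphanumeric characters only, splitting into words) might look roughly like the sketch below, applied to the example element. The `preprocess` helper is hypothetical, and treating each field as one "sentence" of the corpus is an assumption; the actual parsing in `cde_modelling` may differ.

```python
import re

def preprocess(text):
    """Lower-case, replace non-alphanumeric runs with spaces, and split into words."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return cleaned.split()

# Oversimplified CDE element from the example above.
cde_element = {
    "public_id": 1234,
    "long_name": "received radiotherapy",
    "permissible_values": ["yes", "no"],
}

# One token list per field; collected over all CDEs these would form the training corpus.
long_name_tokens = preprocess(cde_element["long_name"])
value_tokens = [tok for v in cde_element["permissible_values"] for tok in preprocess(v)]
corpus = [long_name_tokens, value_tokens]
print(corpus)
```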

To index the above CDE, we performed the following:

1. Vectorized the long_name entry (i.e. 'received radiotherapy') using the FastText model. To do that, we vectorized each word (i.e. 'received' and 'radiotherapy') of the long name separately. The word vectors were then normalized by their L2 norms and averaged. Say, for example, that the resulting long_name vector is [0.1, 0.345]. 

2. Vectorized the 'permissible_values' entry (i.e. 'yes', 'no') using the FastText model. To do that, we vectorized each word (i.e. 'yes' and 'no') in the permissible_values entry separately. The word vectors were then normalized by their L2 norms and averaged. Say, for example, that the resulting permissible_values vector is [0.981, 0.233]. 

3. Identified whether the permissible values are strings or numbers. Note that for the benchmark solution we kept this simple, but for the hackathon the participants can consider more granular data types, for example string, binary, float, int, long etc.
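
The "normalize by L2 norm, then average" operation used in steps 1 and 2 can be sketched as follows. The word vectors here are made-up stand-ins for the output of a trained FastText model, and `embed_phrase` is a hypothetical helper, not a function of the package.

```python
import numpy as np

# Toy word vectors standing in for a trained FastText model (made-up numbers).
word_vectors = {
    "received": np.array([3.0, 4.0]),
    "radiotherapy": np.array([0.0, 2.0]),
}

def embed_phrase(tokens, vectors):
    """L2-normalise each word vector, then average the normalised vectors."""
    normalised = []
    for token in tokens:
        vec = vectors[token]
        norm = np.linalg.norm(vec)
        # leave zero vectors unchanged to avoid dividing by zero
        normalised.append(vec / norm if norm > 0 else vec)
    return np.mean(normalised, axis=0)

long_name_vector = embed_phrase(["received", "radiotherapy"], word_vectors)
print(long_name_vector)
```

Normalising before averaging gives every word the same weight in the phrase vector, regardless of the length of its raw embedding.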


The combination of the above is used to numerically represent (index) each CDE. The class CDE_data_modeller in the package cde_modelling_tools does the above. Participants should explore using other fields of the CDE data to improve their chances of finding a match.

The CDE_data_modeller class not only creates the word embedding model and indexes (vectorizes) the CDE data elements, it can also save and load pretrained models and indexes.

In [None]:
# cde_data_modellers = cdm.CDE_data_modeller(cde_database_file, params)
# cde_data_modellers.create_model_and_cde_indexes()
# cde_data_modellers.save_model_and_indexes(model_dir+'fasttext/')

## Load a pretrained FastText model and saved indexes for CDE elements

In [None]:
cde_data_modellers = cdm.CDE_data_modeller(cde_database_file, params)
cde_data_modellers.load_model_and_cde_indexes(model_dir+'fasttext/')

## Load and parse training data

The training data are a set of clinical data files that record clinical information about patients, e.g. gender, age, disease_type, disease_sub_type, treatment received etc. The data are in table format, where the rows represent patients and the columns represent clinical parameters. For the training data, the CDE data element corresponding to each clinical parameter is provided. This information can be used to train machine learning algorithms to predict CDE elements for new clinical parameters.

In [None]:
tdpr = tdp.TCGA_data_processor(clinical_data_files_dir,True )
tcga_data = tdpr.get_parsed_data()


The parser returns the following types of information for each clinical parameter.

1. The name of the parameter (e.g. age, gender, etc.)
2. The list of values for each parameter (except id columns, continuous variables etc.)
3. The data type of the values. For instance, the data type of 'age' is 'number' and the data type of 'gender' is 'string'. 
4. A dictionary containing clinical parameters and their corresponding 

See the parsed data below.
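
A simple 'number' versus 'string' decision of the kind described in item 3 could be made as in the sketch below. The `infer_data_type` helper and the rule it applies (a column is a 'number' only if every non-empty value parses as a float) are assumptions for illustration; the parser in `cde_modelling` may use a different rule.

```python
def infer_data_type(values):
    """Return 'number' if every non-empty value parses as a float, else 'string'."""
    non_empty = [v for v in values if str(v).strip() != ""]
    if not non_empty:
        return "string"
    for v in non_empty:
        try:
            float(v)
        except (TypeError, ValueError):
            return "string"
    return "number"

# Toy columns as they might be read from a clinical data file.
columns = {"age": ["54", "61", "47"], "gender": ["male", "female", "female"]}
data_types = {name: infer_data_type(vals) for name, vals in columns.items()}
print(data_types)
```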

## Create base tables for model training

To create base tables I performed the following:

1. Indexed (vectorized using the FastText model) the headers (clinical parameter names) and values of each clinical parameter parsed in the previous step.
2. For each possible pair of clinical parameter and CDE element, calculated the following features: <br>
    (a) The difference between the embedding vectors of the CDE long_name and the clinical parameter name. <br>
    (b) The difference between the embedding vectors of the CDE permissible values and the values associated with the clinical parameter in the training dataset. <br>
    (c) A similarity measure (cosine similarity, correlation etc.) between the CDE long_name and clinical parameter name vectors. <br>
    (d) A similarity measure (cosine similarity, correlation etc.) between the CDE permissible_values and observed clinical parameter value vectors. <br>
    (e) The similarity between the data type of the CDE permissible values and the data type of the observed clinical parameter values. <br>

3. Note that in the base table one data point is represented by a pair (a clinical parameter and a CDE). For example, if there are 800 clinical parameters in the training data and 5000 CDE elements in the CDE dataset, then the base table will have 800 * 5000 = 4 million entries. Each entry will have the above features. The 'target' variable is defined as follows: <br>

$$
\text{target} =
\begin{cases}
1, & \text{if the CDE element is manually matched to the clinical parameter} \\
0, & \text{otherwise}
\end{cases}
$$
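
To make features (a) to (e) concrete, the sketch below builds one base-table row for a single (clinical parameter, CDE) pair. The vectors, the dictionary layout and the column names (chosen to contain 'feature' or 'metric') are assumptions for this example; `create_abt` may compute and name its features differently.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def pair_features(param, cde):
    """Build one base-table row for a (clinical parameter, CDE) pair."""
    row = {}
    # (a), (b): element-wise differences between the embedding vectors
    for i, d in enumerate(cde["name_vec"] - param["name_vec"]):
        row[f"feature_name_diff_{i}"] = float(d)
    for i, d in enumerate(cde["values_vec"] - param["values_vec"]):
        row[f"feature_values_diff_{i}"] = float(d)
    # (c), (d): similarity metrics between the same pairs of vectors
    row["metric_name_cosine"] = cosine_similarity(cde["name_vec"], param["name_vec"])
    row["metric_values_cosine"] = cosine_similarity(cde["values_vec"], param["values_vec"])
    # (e): 1 if the data types agree, else 0
    row["feature_same_data_type"] = int(cde["data_type"] == param["data_type"])
    return row

# Toy inputs: the names match exactly, the value vectors are orthogonal.
param = {"name_vec": np.array([1.0, 0.0]), "values_vec": np.array([0.0, 1.0]), "data_type": "string"}
cde = {"name_vec": np.array([1.0, 0.0]), "values_vec": np.array([1.0, 0.0]), "data_type": "string"}
row = pair_features(param, cde)
print(row)
```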

In the above example there are 800 clinical parameters, and if only one CDE element is matched to each clinical parameter, the target variable equals 1 in only 800 out of 4 million cases. The base table is therefore extremely imbalanced. To counter this we need to undersample (or oversample) the base table (called the abt in the code). The create_abt function in CDE_data_modeller allows undersampling. The undersampling ratio (number of cases with target = 0 / number of cases with target = 1) can be adjusted using the params dictionary. The default value is 5, which means that in the undersampled base table 16.67% of cases have target = 1 and 83.33% of cases have target = 0.
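
The effect of the sampling ratio can be checked on a toy table. The `undersample` function below is a hypothetical stand-in written with pandas; it is not the implementation inside `create_abt`, which is only assumed to behave similarly.

```python
import pandas as pd

def undersample(abt, sampling_ratio=5, seed=0):
    """Keep all positives and at most sampling_ratio negatives per positive."""
    positives = abt[abt["target"] == 1]
    negatives = abt[abt["target"] == 0]
    n_keep = min(len(negatives), sampling_ratio * len(positives))
    sampled_negatives = negatives.sample(n=n_keep, random_state=seed)
    # shuffle the combined rows so that the classes are interleaved
    combined = pd.concat([positives, sampled_negatives])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy base table: 4 matched pairs among 100 candidate pairs.
toy = pd.DataFrame({"target": [1] * 4 + [0] * 96})
balanced = undersample(toy, sampling_ratio=5)

positive_share = round(balanced["target"].mean() * 100, 2)
print(len(balanced), positive_share)
```

With a ratio of 5, one row in six is positive, which matches the 16.67% quoted above.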





## Tune the sampling ratio here


In [None]:
# params["features"]["sampling_ratio"] = 3

In [None]:
abt = cde_data_modellers.create_abt(tcga_data, params)

In [None]:
abt.head()

## Train a machine learning model using the abt created above
I created a separate class called create_model where a number of supervised and unsupervised learning algorithms are implemented (from sklearn library). The type and parameters of the model can be passed using the params dictionary. 

!!! Warning: Currently the two datatype columns in the abt (data_type_string, data_type_number) are complementary and hence redundant. Only one should be used for modelling. This needs to be corrected in future versions.

In [None]:
# First create feature vectors
features = [c for c in abt.columns if ('feature' in c) or ('metric' in c)]

# create design matrix and target data
X = abt[features]
y= abt['target']

# create training and validation data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state=42)


# import the Model class
from cde_modelling.modelling.create_models import Model

# create model
model = Model(params)

#fit the model
model.fit(X_train,y_train)

# calculate the accuracy of the model on the validation data
accuracy = model.accuracy(X_val, y_val)

In [None]:
abt

In [None]:
accuracy

## Random Forest

In [None]:
# read model parameters
params2 = {}

with open("params_supervised_RF.json",'r') as file:
    params2 = json.load(file)

model2 = Model(params2)

model2.fit(X_train, y_train)

accuracy2 = model2.accuracy(X_val, y_val)

In [None]:
accuracy2

## Logistic Regression

In [None]:
# read model parameters
params3 = {}

with open("params_supervised_LR.json",'r') as file:
    params3 = json.load(file)

print(params3)
model3 = Model(params3)

model3.fit(X_train, y_train)

accuracy3 = model3.accuracy(X_val, y_val)


In [None]:
accuracy3

## Cross Validation

### Stratified Cross Validation

In [None]:
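# stratified k-fold cross-validation of the model configured in params
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# Use a separate name for the underlying estimator so that the fitted `model`
# wrapper created above is not overwritten (it is needed later for prediction).
cv_model = Model(params).model
# random_state only takes effect with shuffle=True; recent scikit-learn
# versions raise an error if it is set while shuffle is False
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=66)

In [None]:
from sklearn.model_selection import cross_validate

scoring = {'acc': 'accuracy',
           'b_acc': 'balanced_accuracy',
           'f1': 'f1',
           'prec': 'precision',
           'rec': 'recall',
           'auc': 'roc_auc' 
          }
scores = cross_validate(cv_model, X, y, scoring=scoring,
                         cv=kfold, return_train_score=False)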

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (scores["test_acc"].mean()*100, scores["test_acc"].std()*100))
print("Balanced Accuracy: %.2f%% (%.2f%%)" % (scores["test_b_acc"].mean()*100, scores["test_b_acc"].std()*100))
print("F1: %.2f%% (%.2f%%)" % (scores["test_f1"].mean()*100, scores["test_f1"].std()*100))
print("Precision: %.2f%% (%.2f%%)" % (scores["test_prec"].mean()*100, scores["test_prec"].std()*100))
print("Recall: %.2f%% (%.2f%%)" % (scores["test_rec"].mean()*100, scores["test_rec"].std()*100))
print("AUC: %.2f%% (%.2f%%)" % (scores["test_auc"].mean()*100, scores["test_auc"].std()*100))

In [None]:
len(X)

## Generalise the whole thing

### Vectorise dataset

In [None]:
def vectorise_dataset(cde_data_modellers, params, sampling_ratio=5):
    
    # load and parse data
    tdpr = tdp.TCGA_data_processor(clinical_data_files_dir,True)
    tcga_data = tdpr.get_parsed_data()
    
    # Tune the unmatched:matched ratio, default set to be 5
    params["features"]["sampling_ratio"] = sampling_ratio
    
    # generate base table
    abt = cde_data_modellers.create_abt(tcga_data, params)
    
    return abt

### ML 

In [None]:
# import the Model class
from cde_modelling.modelling.create_models import Model

# stratified k-fold cross-validation utilities
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

In [None]:
def prep_ml(params, abt, clf="gradient boost", model_params=None, test_size=0.2, 
            random_state=42, cv=False):
    """Train and evaluate a classifier on the base table, optionally with 10-fold CV."""
    # avoid a mutable default argument; fall back to the original default
    if model_params is None:
        model_params = {'n_estimators': 125}
    
    # select the feature columns
    features = [c for c in abt.columns if ('feature' in c) or ('metric' in c)]

    # create design matrix and target data
    X = abt[features]
    y = abt['target']

    # create training and validation data
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=test_size, random_state=random_state)


    # import the Model class
    from cde_modelling.modelling.create_models import Model
    
    # NOTE: this overwrites the model settings in the caller's params dictionary
    params['model']['name'] = clf
    params['model']['model_params'] = model_params
    
    if not(cv):
        # create model
        model = Model(params)

        #fit the model
        model.fit(X_train,y_train)

        # calculate the accuracy of the model on the validation data
        accuracy = model.accuracy(X_val, y_val)
        
        print(accuracy)
        
    else:
        print(len(X))
        # CV model
        model = Model(params).model
        # random_state only takes effect with shuffle=True
        kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=66)
        scoring = {'acc': 'accuracy',
           'b_acc': 'balanced_accuracy',
           'f1': 'f1',
           'prec': 'precision',
           'rec': 'recall',
           'auc': 'roc_auc' 
          }
        print(params)
        scores = cross_validate(model, X, y, scoring=scoring, cv=kfold, return_train_score=False)
        
        print("Accuracy: %.2f%% (%.2f%%)" % (scores["test_acc"].mean()*100, scores["test_acc"].std()*100))
        print("Balanced Accuracy: %.2f%% (%.2f%%)" % (scores["test_b_acc"].mean()*100, scores["test_b_acc"].std()*100))
        print("F1: %.2f%% (%.2f%%)" % (scores["test_f1"].mean()*100, scores["test_f1"].std()*100))
        print("Precision: %.2f%% (%.2f%%)" % (scores["test_prec"].mean()*100, scores["test_prec"].std()*100))
        print("Recall: %.2f%% (%.2f%%)" % (scores["test_rec"].mean()*100, scores["test_rec"].std()*100))
        print("AUC: %.2f%% (%.2f%%)" % (scores["test_auc"].mean()*100, scores["test_auc"].std()*100))



In [None]:
abt_tst = vectorise_dataset(cde_data_modellers, params, sampling_ratio=5)

In [None]:
prep_ml(params, abt_tst, clf="gradient boost", model_params=params['model']['model_params'], cv=True)

In [None]:
params['model']['model_params']

In [None]:
model

## Test the model on the test dataset

To do that, we shall first create the base table for the test dataset by parsing and indexing the test set in the same way as was done for the training set.

In [None]:
tdp1 = tdp.TCGA_data_processor(clinical_data_test_dir,False )
test_data = tdp1.get_parsed_data()
test_abt =cde_data_modellers.create_abt(test_data)

## Make predictions for the test dataset using the trained model

Note that I have created a model.predict_and_convert_to_json function which returns the prediction in the following format: <br>
{
clinical parameter1: [most likely predictions, 2nd most likely prediction, .... , 20th most likely prediction] <br>
clinical parameter2: [most likely predictions, 2nd most likely prediction, .... , 20th most likely prediction] <br>
.....
clinical parametern: [most likely predictions, 2nd most likely prediction, .... , 20th most likely prediction] <br>
}
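
To show how per-pair scores turn into this ranked format, the sketch below sorts toy scores and keeps the top k CDE ids per clinical parameter. The `score` column and the `top_k_predictions` helper are hypothetical; this is not the implementation of `predict_and_convert_to_json`, only an illustration of the output shape.

```python
import pandas as pd

# Toy classifier scores for every (clinical parameter, CDE) pair (made-up numbers).
scores = pd.DataFrame({
    "headers":   ["age", "age", "age", "gender", "gender", "gender"],
    "public_id": [101, 102, 103, 101, 102, 103],
    "score":     [0.9, 0.2, 0.5, 0.1, 0.8, 0.3],
})

def top_k_predictions(scores, k, header_col="headers", id_col="public_id", score_col="score"):
    """Return {header: [CDE ids ordered from most to least likely]}, k ids per header."""
    ranked = scores.sort_values(score_col, ascending=False)
    return {
        header: group[id_col].head(k).tolist()
        for header, group in ranked.groupby(header_col, sort=False)
    }

results_sketch = top_k_predictions(scores, k=2)
print(results_sketch)
```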

In [None]:
test_abt.fillna(0, inplace = True)

index_cols = ['headers','public_id']
header_col = index_cols[0]
id_col = index_cols [1]

results = model.predict_and_convert_to_json(test_abt,20, index_cols, header_col, id_col)


In [None]:
results

## Calculate accuracy of prediction for the test dataset

Note that the participants won't have access to the gold standard data and therefore won't be able to perform the following step. However, participants can divide the training data into train, validation and test sets and perform the following on their own test split.
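
For participants scoring their own held-out split, a top-k accuracy over ranked predictions could be computed as in the sketch below. The assumed gold standard layout (one true CDE id per clinical parameter) and the `top_k_accuracy` helper are illustrative assumptions; the metric computed by `ac.calculate_accuracy` and the format of the gold standard file may differ.

```python
def top_k_accuracy(gold_standard, predictions, k=20):
    """Fraction of parameters whose true CDE id is among the first k predictions."""
    if not gold_standard:
        return 0.0
    hits = sum(
        1
        for header, true_id in gold_standard.items()
        if true_id in predictions.get(header, [])[:k]
    )
    return hits / len(gold_standard)

# Toy gold standard and ranked predictions (made-up ids).
gold = {"age": 101, "gender": 102, "race": 105}
preds = {"age": [101, 103], "gender": [103, 102], "race": [101, 102]}

top1 = top_k_accuracy(gold, preds, k=1)
top2 = top_k_accuracy(gold, preds, k=2)
print(top1, top2)
```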

In [None]:
test_gs = {}
with open(test_gold_standard, 'rb') as file:
    test_gs = json.load(file)
test_accuracy = ac.calculate_accuracy(test_gs,results)

In [None]:
test_accuracy

## Log model parameters etc. using mlflow

This will ensure reproducibility of results and will keep track of all models and results during the model development and calibration.

In [None]:
with mlflow.start_run():
    # print out current run_uuid
    run_uuid = mlflow.active_run().info.run_uuid
    print("MLflow Run ID: %s" % run_uuid)
    
    # log parameters
    mlflow.log_param("window_size", params["fasttext"]["window"])
    mlflow.log_param("min_count", params["fasttext"]["min_count"])
    mlflow.log_param("epochs", params["fasttext"]["epochs"])
    mlflow.log_param("vector_size", params["fasttext"]["vector_size"])
    
    
    mlflow.log_param("features_difference_types", params["features"]["differences"]["type"])
    mlflow.log_param("features_metrics", params["features"]["metrics"]["metric"])
    mlflow.log_param("features_metrics_sim_type", params["features"]["metrics"]["sim_type"])
    mlflow.log_param("features_metrics_scaling", params["features"]["metrics"]["scaling"])
    mlflow.log_param("features_sampling_ratio", params["features"]["sampling_ratio"])
    
    mlflow.log_param("model_type", params['model']["name"])
    
    for k in params['model']['model_params'].keys():
        mlflow.log_param("model_params_"+k, params['model']["model_params"][k])
    
    # log metrics
    
        
#     mlflow.log_metric("test_accuracy",test_accuracy)
    for k in accuracy.keys():
        if 'confusion' not in k:
            mlflow.log_metric("val_accuracy_"+k,accuracy[k])
    
    #mlflow.sklearn.logmodel()
    with open('models/'+run_uuid+'.pkl','wb') as file:
        pickle.dump(model, file)
    
    mlflow.end_run()

## Use 'mlflow ui' to compare and analyze various model performances

<img src="img/mlflow_ui.png">

In [None]:
!mlflow ui

In [None]:
accuracy.keys()

### To view mlflow ui go to http://localhost:5000