# Repurpose: A Python-based platform for reproducible similarity-based drug repurposing

Over the past years, many methods for similarity-based (a.k.a. knowledge-based,
guilt-by-association-based) drug repurposing have been proposed, yet most of
these studies do not provide the code or the model used in the study. To improve
reproducibility, we present a Python platform offering:
- drug feature data parsing and similarity calculation
- data balancing
- (disjoint) cross validation
- classifier building

Using this platform, we investigate the effect of using unseen data in the test
set on similarity-based classification.
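
As an illustration of the similarity calculation step listed above, the sketch below computes a pairwise drug-drug similarity from feature sets. It assumes a Jaccard (Tanimoto) coefficient over binary features; the measure actually used by the platform is not stated here, and `jaccard_similarity`, `pairwise_similarity` and the toy profiles are hypothetical names introduced only for this example.

```python
def jaccard_similarity(features_a, features_b):
    """Jaccard (Tanimoto) coefficient between two collections of features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty profiles
    return len(a & b) / float(len(a | b))


def pairwise_similarity(drug_to_features):
    """Return {drug: {other_drug: similarity}} for all distinct drug pairs."""
    drugs = sorted(drug_to_features)
    similarity = {drug: {} for drug in drugs}
    for i, drug in enumerate(drugs):
        for other in drugs[i + 1:]:
            value = jaccard_similarity(drug_to_features[drug],
                                       drug_to_features[other])
            similarity[drug][other] = value
            similarity[other][drug] = value  # the measure is symmetric
    return similarity


# Hypothetical toy profiles (e.g. target sets), not taken from the data sets
toy_profiles = {
    "drug_a": {"t1", "t2"},
    "drug_b": {"t2", "t3"},
    "drug_c": {"t4"},
}
toy_similarity = pairwise_similarity(toy_profiles)
print(toy_similarity["drug_a"]["drug_b"])
```

The same function could be applied separately to each feature type (chemical, target, phenotype) to obtain one similarity matrix per feature.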


## Requirements
The Python platform has the following dependencies:

- [Numpy](http://www.numpy.org)
- [Scikit-learn](http://scikit-learn.org)


## Data sets
The data sets used in the analysis are freely available
[online](http://astro.temple.edu/~tua87106/drugreposition.html).

We have modified these data sets slightly for parsing in Python by
- converting all drug, disease and side effect terms to lowercase
- removing the quotation marks and making the text tab-delimited
- adding the 'Drug' text to the header

These modified files are available under the _'data/'_ folder.
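
The modifications above could be reproduced roughly as follows. This is a minimal sketch, assuming the original files are space-delimited with quoted terms and a header that lacks a label for the first (drug) column; the real layout of the downloaded files may differ. `normalize_table` is a hypothetical helper and is not part of the platform. The example targets Python 3.

```python
import csv
import io


def normalize_table(raw_text, delimiter=" "):
    """Lowercase all terms, drop quotes, tab-delimit, and label the drug column."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter, quotechar='"')
    rows = [row for row in reader if row]  # skip empty lines
    header = [term.lower() for term in rows[0]]
    body = [[term.lower() for term in row] for row in rows[1:]]
    # Assumption: the header has one column fewer than the data rows
    if body and len(header) == len(body[0]) - 1:
        header = ["Drug"] + header
    return "\n".join("\t".join(row) for row in [header] + body)


# Hypothetical two-line input, not taken from the actual data sets
raw = '"Disease_A" "Disease_B"\n"Drug_X" 0 1\n'
print(normalize_table(raw))
```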

## Setting up the platform

In [1]:
import random
from src import ml, utilities
from toolbox import configuration

# Get parameters
#random.seed(52345) # for reproducibility
features = ["chemical", "target", "phenotype"]
model_type = "logistic" # ML model
prediction_type = "disease" # predict drug-disease associations
output_file = "data/validation.dat.test" # file containing run parameters and corresponding AUC values
n_proportion = 2 # proportion of negative instances compared to positives
n_subset = -1 # for faster results - subsampling data
knn = 20 # number of nearest drugs to check in the pharmacological space to assign a repurposing score
n_run = 10 # number of repetitions of cross-validation analysis
n_fold = 10 # number of folds in cross-validation
recalculate_similarity = True # whether the k-NN based repurposing score should be calculated within the training/test set
# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = False 

# Get data
drug_disease_file = "data/drug_disease.dat"
drug_side_effect_file = "data/drug_sider.dat"
drug_structure_file = "data/drug_structure.dat"
drug_target_file = "data/drug_protein.dat"
data = utilities.get_data(drug_disease_file, drug_side_effect_file, drug_structure_file, drug_target_file)
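
The `knn` parameter above sets how many nearest drugs contribute to the repurposing score of a drug-disease pair. One plausible form of such a score is sketched below: it sums the similarities of those of the `k` most similar drugs that are already known to be associated with the disease. This is an assumption made for illustration; the scoring used inside `ml.check_ml` is not shown in this document, and `knn_score` and the toy inputs are hypothetical.

```python
def knn_score(drug, disease, similarity, known, k=20):
    """Sum the similarities of the k nearest drugs that are known for `disease`.

    similarity: {drug: {other_drug: similarity value}}
    known:      {drug: set of associated diseases}
    """
    others = [other for other in similarity[drug] if other != drug]
    # Keep the k drugs most similar to the query drug
    nearest = sorted(others, key=lambda other: similarity[drug][other],
                     reverse=True)[:k]
    return sum(similarity[drug][other] for other in nearest
               if disease in known.get(other, set()))


# Hypothetical toy inputs
toy_sim = {"a": {"b": 0.9, "c": 0.5, "d": 0.1}}
toy_known = {"b": {"x"}, "c": {"y"}, "d": {"x"}}
print(knn_score("a", "x", toy_sim, toy_known, k=2))
```

With `recalculate_similarity = True`, a score of this kind would be computed from the training portion of each fold, so that known associations in the test set do not leak into the features.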

## Evaluating the effect of each feature on prediction performance

In [2]:
features = ["chemical", "target", "phenotype"]

for feature in features:
    features_modified = [ feature ]
    print(feature)
    # Check prediction accuracy of ML classifier on the data set using the parameters above
    ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features_modified, recalculate_similarity, disjoint_cv, output_file, model_fun = None)

chemical
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 77.4 (+/-0.2): [77.6, 77.3, 77.1, 77.2, 77.2, 77.4, 77.4, 77.6, 77.3, 77.5]
target
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 76.9 (+/-0.2): [76.8, 76.9, 76.9, 76.6, 76.9, 77.1, 77.1, 77.1, 76.9, 76.7]
phenotype
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 80.0 (+/-0.3): [80.4, 79.8, 80.3, 80.2, 80.3, 80.4, 79.8, 80.1, 79.7, 79.4]
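
Each `AUC over runs` line condenses the per-run AUC values into a single summary. The sketch below reproduces that summary for the `chemical` run, assuming the value in parentheses is a standard deviation (the document does not say which spread measure is reported). `summarize_auc` is a hypothetical helper, not a platform function.

```python
import statistics


def summarize_auc(values):
    """Format the mean and (population) standard deviation of per-run AUCs."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # assumption: spread is a standard deviation
    return "%.1f (+/-%.1f)" % (mean, spread)


# Per-run AUC values copied from the `chemical` output above
chemical_aucs = [77.6, 77.3, 77.1, 77.2, 77.2, 77.4, 77.4, 77.6, 77.3, 77.5]
print(summarize_auc(chemical_aucs))  # 77.4 (+/-0.2)
```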


## Evaluating the effect of data imbalance

In [None]:
n_fold = 10 # number of folds in cross-validation

# proportion of negative instances compared to positives
for n_proportion in [ 1, 2, 5, 20 ]: 
    print(n_proportion)
    # Check prediction accuracy of ML classifier on the data set using the parameters above
    ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None)

1
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 83.0 (+/-0.3): [83.1, 83.3, 83.2, 82.7, 83.1, 83.2, 83.4, 83.1, 82.4, 82.7]
2
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 84.1 (+/-0.2): [84.5, 84.1, 84.1, 83.7, 84.0, 84.3, 84.3, 83.9, 84.3, 84.2]
5
536
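
To make the role of `n_proportion` concrete, the sketch below keeps every known association as a positive and samples `n_proportion` times as many unlabeled pairs as negatives. This reading is consistent with the fold sizes reported for `n_proportion = 2` in the cross-validation output (3343 + 3344 = 6687 = 3 × 2229), but the sampling code itself is an assumption; `balance_pairs` is a hypothetical helper.

```python
import random


def balance_pairs(pairs, known, n_proportion, seed=None):
    """Return (positives, sampled negatives) with the requested ratio."""
    rng = random.Random(seed)
    positives = [pair for pair in pairs if pair in known]
    negatives = [pair for pair in pairs if pair not in known]
    # Never ask for more negatives than are available
    n_negative = min(len(negatives), n_proportion * len(positives))
    return positives, rng.sample(negatives, n_negative)


# Hypothetical toy data: 4 drugs x 3 diseases, 2 known associations
toy_pairs = [(drug, disease) for drug in "abcd" for disease in "xyz"]
toy_known = {("a", "x"), ("b", "y")}
pos, neg = balance_pairs(toy_pairs, toy_known, n_proportion=2, seed=0)
print(len(pos), len(neg))
```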

## Evaluating the effect of number of folds in cross validation

In [4]:
n_proportion = 2 # proportion of negative instances compared to positives

# number of folds in cross-validation
for n_fold in [ 2, 5, 10, 20 ]: 
    print(n_fold)
    # Check prediction accuracy of ML classifier on the data set using the parameters above
    # Use verbose argument for per cross-validation metrics
    ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None, verbose = True)


2
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
Fold: 1 # train: 3343 # test: 3344 AUC: 85.5 AUPRC: 84.8
Fold: 2 # train: 3344 # test: 3343 AUC: 85.0 AUPRC: 85.2
Fold: 1 # train: 3343 # test: 3344 AUC: 85.1 AUPRC: 84.9
Fold: 2 # train: 3344 # test: 3343 AUC: 85.8 AUPRC: 85.7
Fold: 1 # train: 3343 # test: 3344 AUC: 85.5 AUPRC: 85.2
Fold: 2 # train: 3344 # test: 3343 AUC: 85.3 AUPRC: 85.7
Fold: 1 # train: 3343 # test: 3344 AUC: 86.2 AUPRC: 85.8
Fold: 2 # train: 3344 # test: 3343 AUC: 84.7 AUPRC: 84.7
Fold: 1 # train: 3343 # test: 3344 AUC: 84.8 AUPRC: 85.0
Fold: 2 # train: 3344 # test: 3343 AUC: 86.3 AUPRC: 86.0
Fold: 1 # train: 3343 # test: 3344 AUC: 84.7 AUPRC: 84.9
Fold: 2 # train: 3344 # test: 3343 AUC: 86.7 AUPRC: 87.0
Fold: 1 # train: 3343 # test: 3344 AUC: 85.7 AUPRC: 85.5
Fold: 2 # train: 3344 # test: 3343 AUC: 85.3 AUPRC: 85.4
Fold: 1 # train: 3343 # test: 3344 AUC: 85.9 AUPRC: 85.7
Fold: 2 # train: 3344 # test: 3343 AUC: 84.8 AUPRC: 84.7
Fold: 1 # train: 3343 #

## Putting it together

## Evaluating similarity-based drug repurposing via cross validation


In [5]:
n_proportion = 2 # proportion of negative instances compared to positives
n_fold = 10 # number of folds in cross-validation

# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = False 

# Check prediction accuracy of ML classifier on the data set using the parameters above
ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None)

536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 84.1 (+/-0.3): [84.2, 83.6, 83.9, 84.1, 83.7, 84.7, 84.1, 83.7, 84.2, 84.5]


('AUC: 84.1', 'AUPRC: 83.5')

## Revisiting cross-validation using disjoint folds

In [6]:
# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = True

# Check prediction accuracy of ML classifier on the data set using the parameters above
ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None)

536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 64.3 (+/-1.0): [64.2, 63.7, 65.2, 63.2, 64.2, 66.1, 64.2, 63.2, 63.2, 65.7]


('AUC: 64.3', 'AUPRC: 61.8')
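
With `disjoint_cv = True`, the drugs in the drug-disease pairs of different folds do not overlap, so every test drug is unseen during training; this is the setting in which the AUC drops from 84.1 to 64.3 above. A minimal way to build such folds is to assign whole drugs, rather than individual pairs, to folds, as sketched below. How `ml.check_ml` builds its folds is not shown here; `disjoint_folds` is a hypothetical helper.

```python
import random


def disjoint_folds(pairs, n_fold, seed=None):
    """Split (drug, disease) pairs so that each drug appears in exactly one fold."""
    rng = random.Random(seed)
    drugs = sorted({drug for drug, _ in pairs})
    rng.shuffle(drugs)
    # Assign drugs to folds round-robin after shuffling
    fold_of = {drug: index % n_fold for index, drug in enumerate(drugs)}
    folds = [[] for _ in range(n_fold)]
    for pair in pairs:
        folds[fold_of[pair[0]]].append(pair)
    return folds


# Hypothetical toy data: 6 drugs x 2 diseases
toy_cv_pairs = [("d%d" % i, disease) for i in range(6) for disease in "xy"]
toy_folds = disjoint_folds(toy_cv_pairs, n_fold=3, seed=1)
for fold in toy_folds:
    print(sorted({drug for drug, _ in fold}))
```

Because folds are formed per drug, fold sizes can be uneven when drugs have different numbers of pairs.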