# Repurpose: A Python-based platform for reproducible similarity-based drug repurposing

Over the past years, many methods for similarity-based (a.k.a. knowledge-based,
guilt-by-association-based) drug repurposing have been proposed, yet most of
these studies do not provide the code or the model used in the study. To
improve reproducibility, we present a Python platform offering:
- drug feature data parsing and similarity calculation
- data balancing
- (disjoint) cross validation
- classifier building

Using this platform, we investigate the effect of using unseen data in the
test set on similarity-based classification.


## Requirements
The Python platform has the following dependencies:

- [Numpy](http://www.numpy.org)
- [Scikit-learn](http://scikit-learn.org)


## Data sets
The data sets used in the analysis are freely available
[online](http://astro.temple.edu/~tua87106/drugreposition.html).

We have modified these data sets slightly for parsing in Python by
- converting all drug, disease and side effect terms to lowercase
- removing the quotation marks and making the text tab-delimited
- adding the 'Drug' text to the header

These modified files are available under the _'data/'_ folder.
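
The modifications listed above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the files in _'data/'_: the function name `normalize_table` is hypothetical, and the comma delimiter of the raw input is an assumption (pass a different `delimiter` if the downloaded files are separated otherwise).

```python
import csv
import io


def normalize_table(raw_text, delimiter=","):
    """Return a lowercase, quote-free, tab-delimited copy of a raw table.

    Hypothetical helper: the raw delimiter is an assumption about the
    downloaded files, not something stated in this README.
    """
    # csv.reader strips the surrounding quotation marks from each field
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter, quotechar='"')
    rows = [row for row in reader if row]
    header = [term.lower() for term in rows[0]]
    body = [[field.lower() for field in row] for row in rows[1:]]
    # Add the 'Drug' label when the header lacks a name for the first column
    if body and len(header) == len(body[0]) - 1:
        header = ["Drug"] + header
    lines = ["\t".join(row) for row in [header] + body]
    return "\n".join(lines) + "\n"


raw = '"Asthma","Migraine"\n"Aspirin",0,1\n'
print(normalize_table(raw))
```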

## Setting up the platform

In [1]:
import random
from src import ml, utilities

# Get parameters
n_seed = 52345
random.seed(n_seed) # for reproducibility
features = ["chemical", "target", "phenotype"]
model_type = "logistic" # ML model
prediction_type = "disease" # predict drug-disease associations
output_file = "data/validation.dat" # file containing run parameters and corresponding AUC values
n_proportion = 2 # proportion of negative instances compared to positives
n_subset = -1 # for faster results - subsampling data
knn = 20 # number of nearest drugs to check in the pharmacological space to assign a repurposing score
n_run = 10 # number of repetitions of cross-validation analysis
n_fold = 10 # number of folds in cross-validation
recalculate_similarity = True # whether the k-NN based repurposing score should be calculated within the training/test set
# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = False 

# Get data
drug_disease_file = "data/drug_disease.dat"
drug_side_effect_file = "data/drug_sider.dat"
drug_structure_file = "data/drug_structure.dat"
drug_target_file = "data/drug_protein.dat"
data = utilities.get_data(drug_disease_file, drug_side_effect_file, drug_structure_file, drug_target_file)
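
The `knn` parameter above controls how many neighboring drugs contribute to the repurposing score of a drug-disease pair. The sketch below shows one common way to compute such a score: sum the similarities of the `k` most similar drugs that are already known to be associated with the disease. The function `knn_score` and its data structures are hypothetical and are not taken from `src.ml`, so the platform's own calculation may differ.

```python
def knn_score(drug, disease, similarity, known_pairs, k=20):
    """k-NN style repurposing score for a (drug, disease) pair.

    similarity  : dict mapping drug -> {other_drug: similarity value}
    known_pairs : set of (drug, disease) tuples with a known association
    Illustrative only; not the implementation in src.ml.
    """
    # Rank the other drugs by decreasing similarity and keep the top k
    neighbors = sorted(
        ((sim, other) for other, sim in similarity[drug].items() if other != drug),
        reverse=True,
    )[:k]
    # Only neighbors known to be associated with the disease contribute
    return sum(sim for sim, other in neighbors if (other, disease) in known_pairs)


similarity = {"a": {"b": 0.9, "c": 0.5, "d": 0.1}}
known_pairs = {("b", "x"), ("d", "x")}
print(knn_score("a", "x", similarity, known_pairs, k=2))
```

When `recalculate_similarity` is enabled, a score of this kind would be computed using only the known associations available in the training portion of each fold, so that test-set labels do not leak into the features.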

## Evaluating similarity-based drug repurposing via (traditional) cross validation

In [2]:
n_proportion = 2 # proportion of negative instances compared to positives
n_fold = 10 # number of folds in cross-validation

# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = False 

# Check prediction accuracy of ML classifier on the data set using the parameters above
ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None, n_seed = n_seed)

536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 84.1 (+/-0.3): [84.5, 84.4, 84.1, 84.1, 83.8, 84.2, 83.8, 84.3, 83.4, 84.1]


('AUC: 84.1', 'AUPRC: 83.7')

## Revisiting cross-validation using disjoint folds

In [3]:
# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = True

# Check prediction accuracy of ML classifier on the data set using the parameters above
ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None, n_seed = n_seed)

536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 65.6 (+/-0.5): [65.5, 65.5, 65.0, 66.1, 65.0, 65.5, 66.1, 65.7, 65.0, 66.4]


('AUC: 65.6', 'AUPRC: 62.8')
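
To make the idea of disjoint folds concrete, the sketch below assigns every drug to exactly one fold, so that all pairs involving that drug end up in the same fold and no drug appears in both the training and the test set. The function `disjoint_folds` is a hypothetical illustration and not the code behind `disjoint_cv = True`.

```python
import random


def disjoint_folds(pairs, n_fold, seed=None):
    """Split (drug, disease) pairs into folds that share no drugs.

    Illustrative only; the platform's own fold construction may differ.
    """
    rng = random.Random(seed)
    # Shuffle the unique drugs, then deal them out to folds round-robin
    drugs = sorted({drug for drug, _ in pairs})
    rng.shuffle(drugs)
    fold_of_drug = {drug: i % n_fold for i, drug in enumerate(drugs)}
    folds = [[] for _ in range(n_fold)]
    # Every pair follows its drug into that drug's fold
    for drug, disease in pairs:
        folds[fold_of_drug[drug]].append((drug, disease))
    return folds


pairs = [(drug, disease) for drug in "abcdef" for disease in ("x", "y")]
for fold in disjoint_folds(pairs, n_fold=3, seed=52345):
    print(sorted({drug for drug, _ in fold}))
```

With a traditional split, pairs are shuffled individually, so the same drug typically occurs in both training and test folds.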

## Evaluating the effect of data imbalance

In [4]:
# proportion of negative instances compared to positives
for n_proportion in [ 1, 2, 5, 20 ]: 
    print(n_proportion)
    # Check prediction accuracy of ML classifier on the data set using the parameters above
    ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None, n_seed = n_seed)

1
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 65.8 (+/-0.5): [65.6, 65.6, 65.5, 65.8, 65.5, 65.6, 65.8, 65.4, 65.5, 67.2]
2
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 65.6 (+/-0.5): [65.5, 65.5, 65.0, 66.1, 65.0, 65.5, 66.1, 65.7, 65.0, 66.4]
5
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 67.8 (+/-0.7): [67.8, 67.8, 67.0, 69.0, 67.0, 67.8, 69.0, 67.7, 67.0, 68.2]
20
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 68.9 (+/-0.5): [68.6, 68.6, 68.5, 69.8, 68.5, 68.6, 69.8, 68.6, 68.5, 69.1]
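
The role of `n_proportion` can be illustrated with a small sketch of negative subsampling: all known associations are kept as positives, and `n_proportion` times as many unknown pairs are drawn at random as negatives. The function `balance_pairs` is hypothetical and only approximates what the data balancing step might do.

```python
import random


def balance_pairs(pairs, known_pairs, n_proportion, seed=None):
    """Keep all positives and sample n_proportion negatives per positive.

    Illustrative only; not the balancing code used by ml.check_ml.
    """
    rng = random.Random(seed)
    positives = [pair for pair in pairs if pair in known_pairs]
    negatives = [pair for pair in pairs if pair not in known_pairs]
    # Never request more negatives than are available
    n_negative = min(len(negatives), n_proportion * len(positives))
    return positives, rng.sample(negatives, n_negative)


pairs = [(drug, disease) for drug in range(10) for disease in ("x", "y", "z")]
known_pairs = {(0, "x"), (1, "y"), (2, "z"), (3, "x")}
positives, negatives = balance_pairs(pairs, known_pairs, n_proportion=2, seed=52345)
print(len(positives), len(negatives))
```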


## Evaluating the effect of number of folds in cross validation

In [5]:
n_proportion = 2 # proportion of negative instances compared to positives

# number of folds in cross-validation
for n_fold in [ 2, 5, 10, 20 ]: 
    print(n_fold)
    # Check prediction accuracy of ML classifier on the data set using the parameters above
    # Use verbose argument for per cross-validation metrics
    ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None, verbose = False, n_seed = n_seed)


2
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 80.7 (+/-0.3): [80.6, 80.6, 80.6, 80.6, 80.6, 80.6, 80.6, 81.4, 80.6, 81.4]
5
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 73.6 (+/-0.7): [72.6, 72.6, 74.0, 74.1, 74.0, 72.6, 74.1, 74.1, 74.0, 74.0]
10
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 65.6 (+/-0.5): [65.5, 65.5, 65.0, 66.1, 65.0, 65.5, 66.1, 65.7, 65.0, 66.4]
20
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 59.1 (+/-0.6): [58.5, 58.5, 59.1, 60.1, 59.0, 58.5, 60.1, 59.1, 59.1, 59.6]


## Evaluating the effect of each feature in prediction performance

In [6]:
n_fold = 10 # number of folds in cross-validation
features = ["chemical", "target", "phenotype"]

for feature in features:
    features_modified = [ feature ]
    print(feature)
    # Check prediction accuracy of ML classifier on the data set using the parameters above
    ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features_modified, recalculate_similarity, disjoint_cv, output_file, model_fun = None, n_seed = n_seed)

chemical
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 62.0 (+/-0.6): [61.4, 61.4, 61.7, 62.6, 61.7, 61.4, 62.6, 62.5, 61.7, 63.2]
target
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 60.7 (+/-0.9): [61.0, 61.0, 61.7, 59.7, 61.7, 61.0, 59.7, 59.3, 61.7, 60.3]
phenotype
536 drugs, 578 diseases, 309808 pairs, 2229 known associations
AUC over runs: 64.7 (+/-0.4): [64.6, 64.6, 64.4, 65.4, 64.4, 64.6, 65.4, 64.4, 64.4, 65.4]
