# Repurpose: A Python-based platform for reproducible similarity-based drug repurposing

Over the past years, many methods for similarity-based (a.k.a. knowledge-based,
guilt-by-association-based) drug repurposing have been proposed, yet most of
these studies do not provide the code or the model used in the study. To
improve reproducibility, we present a Python platform offering:
- drug feature data parsing and similarity calculation
- data balancing
- (disjoint) cross validation
- classifier building

Using this platform, we investigate the effect of using unseen data in the
test set on similarity-based classification.


## Requirements
The Python platform has the following dependencies:

- [Numpy](http://www.numpy.org)
- [Scikit-learn](http://scikit-learn.org)
- [Toolbox](https://github.com/emreg00/toolbox)    


## Data sets
The data sets used in the analysis are freely available
[online](http://astro.temple.edu/~tua87106/drugreposition.html).

We have modified these data sets slightly for parsing in Python by
- converting all drug, disease and side effect terms to lowercase
- removing the quotation marks and making the text tab-delimited
- adding the 'Drug' text to the header

These modified files are available under the _'data/'_ folder.

## Setting up the platform

In [None]:
import random
from src import ml, utilities
from toolbox import configuration

# Get parameters
#random.seed(52345) # for reproducibility
features = ["chemical", "target", "phenotype"]
model_type = "logistic" # ML model
prediction_type = "disease" # predict drug-disease associations
output_file = "data/validation.dat.test" # file containing run parameters and corresponding AUC values
n_proportion = 2 # proportion of negative instances compared to positives
n_subset = -1 # for faster results - subsampling data
knn = 20 # number of nearest drugs to check in the pharmacological space to assign a repurposing score
n_run = 10 # number of repetitions of cross-validation analysis
n_fold = 10 # number of folds in cross-validation
recalculate_similarity = True # whether the k-NN based repurposing score should be calculated within the training/test set

# Get data
drug_disease_file = "data/drug_disease.dat"
drug_side_effect_file = "data/drug_sider.dat"
drug_structure_file = "data/drug_structure.dat"
drug_target_file = "data/drug_protein.dat"
data = utilities.get_data(drug_disease_file, drug_side_effect_file, drug_structure_file, drug_target_file)
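
To make the role of the `knn` parameter more concrete, here is a minimal sketch of a k-NN based repurposing score. It assumes the score of a drug-disease pair is the sum of the similarities between the query drug and those of its `k` most similar drugs that are known to treat the disease. The function `knn_score`, the drug names and the similarity values are hypothetical and for illustration only; the scoring actually implemented in `src.ml` may differ.

```python
def knn_score(similarity_to_query, treats_disease, k):
    """Sum the similarities of the k nearest drugs that are known to treat the disease.

    similarity_to_query: dict mapping every other drug to its similarity to the query drug
    treats_disease: set of drugs with a known association to the disease
    """
    # Rank the other drugs by decreasing similarity and keep the k nearest ones
    nearest = sorted(similarity_to_query, key=similarity_to_query.get, reverse=True)[:k]
    # Only neighbours already associated with the disease contribute to the score
    return sum(similarity_to_query[drug] for drug in nearest if drug in treats_disease)


# Toy example (made-up similarities)
similarity_to_query = {"drug_b": 0.8, "drug_c": 0.1, "drug_d": 0.4}
treats_disease = {"drug_b", "drug_d"}
score = knn_score(similarity_to_query, treats_disease, k=2)
print(score)
```

Under this reading, the set of known associations passed to the score would have to be restricted to the training pairs during cross-validation, which is presumably what `recalculate_similarity` controls.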

## Evaluating similarity-based drug repurposing via cross validation


In [None]:
# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = False 

# Check prediction accuracy of ML classifier on the data set using the parameters above
ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None)

536 drugs, 578 diseases, 309808 pairs, 2229 known associations
Fold: 1 # train: 6018 # test: 669
Fold: 2 # train: 6018 # test: 669
Fold: 3 # train: 6018 # test: 669
Fold: 4 # train: 6018 # test: 669
Fold:
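
The fold sizes reported in the output are consistent with the parameters set earlier, assuming that negative pairs are sampled at `n_proportion` times the number of known associations and that the resulting set is split into `n_fold` nearly equal parts. The short check below reproduces the arithmetic; it is a sanity check on the reported numbers, not a description of how `ml.check_ml` builds the folds.

```python
import math

n_positive = 2229   # known associations reported in the output
n_proportion = 2    # negatives per positive, as set in the parameters
n_fold = 10         # number of cross-validation folds

n_negative = n_proportion * n_positive   # sampled negative pairs
n_total = n_positive + n_negative        # balanced data set size
n_test = math.ceil(n_total / n_fold)     # size of the larger folds
n_train = n_total - n_test

print(n_total, n_train, n_test)  # 6687 6018 669
```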

## Revisiting cross-validation using disjoint folds

In [None]:
# whether the drugs in the drug-disease pairs of the cross-validation folds should be non-overlapping
disjoint_cv = True

# Check prediction accuracy of ML classifier on the data set using the parameters above
ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, output_file, model_fun = None)
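
The idea behind disjoint folds can be sketched as follows: instead of distributing individual drug-disease pairs over the folds, each drug is assigned to exactly one fold and all of its pairs follow it, so no drug in a test fold has been seen during training. The function `disjoint_folds` and the toy pairs below are hypothetical and only illustrate the principle; the platform's own fold construction may differ in detail (for instance in how it balances fold sizes).

```python
import random


def disjoint_folds(pairs, n_fold, seed=0):
    """Split (drug, disease) pairs into folds such that each drug occurs in one fold only."""
    drugs = sorted({drug for drug, _ in pairs})
    random.Random(seed).shuffle(drugs)
    # Assign drugs (not pairs) to folds in round-robin fashion
    fold_of = {drug: i % n_fold for i, drug in enumerate(drugs)}
    folds = [[] for _ in range(n_fold)]
    for drug, disease in pairs:
        folds[fold_of[drug]].append((drug, disease))
    return folds


# Toy example with placeholder names
pairs = [
    ("drug_a", "disease_1"), ("drug_a", "disease_2"),
    ("drug_b", "disease_1"), ("drug_c", "disease_3"),
    ("drug_d", "disease_2"), ("drug_d", "disease_3"),
]
folds = disjoint_folds(pairs, n_fold=2)
drug_sets = [{drug for drug, _ in fold} for fold in folds]
print(drug_sets)
```

scikit-learn's `GroupKFold`, with the drug as the group label, provides the same kind of group-wise splitting.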