# Understanding molecular bases of drug response and drug synergy
Understanding the synergistic effects of drugs is key to developing effective intervention strategies against diseases (such as AD, T2D, or both) and provides unprecedented opportunities to repurpose existing drugs. The AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge provides a rich data source for understanding synergistic drug behavior from pretreatment data; it spans cell viability data for 118 drugs and 85 cancer cell lines (primarily colon, lung, and breast). In collaboration with Dr. Baldo Oliva's group at GRIB, UPF-IMIM, we have been working on identifying the effects of confounding factors in the data set, such as dosage and the genetic background of the cell lines, and on developing algorithms that can predict the individual and synergistic effects of drugs.

The challenge has two subtasks: predicting drug synergy *(i)* using monotherapy response data and *(ii)* without using monotherapy response data. Participants are free to use any other data source (such as cell line gene expression, gene mutation, and drug target information provided in the challenge, or external data sets) and submit their predictions in 3 rounds.

<For the first round of the challenge, we built machine learning models to predict drug synergy for both of these tasks and chose the best-performing models to submit predictions. Among the machine learning models we tried, a combination of bootstrapped and ensemble tree-based predictors achieved the best performance on the training data set.

To improve prediction performance, we incorporated mutation data (for the drug targets in a given cell line) and the interactome-based contribution of the drug combination relative to the effect of each drug separately. To assess the interactome-based contribution of a drug or combination (characterized by a set of targets), we used GUILD, a network-based functional prioritization tool.

Interestingly, using GUILD improved the predictions for subtask *(ii)* but not for subtask *(i)*. We suspect this is because the monotherapy response data already describe the synergy best, so adding new features (such as those based on expression, mutation, or the interactome) may cause the predictor to overfit the training data set.>
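The idea of bootstrapped, ensemble tree-based predictors mentioned above can be illustrated with a minimal sketch. It is not the model submitted to the challenge: it assumes the `rpart` package for the individual trees, uses synthetic toy data in place of the challenge features, and the helper names `bag.trees` and `bag.predict` are hypothetical.

```r
library(rpart)  # regression trees; a recommended package in standard R installs

set.seed(1)
# Synthetic stand-ins for monotherapy features and a synergy score (NOT challenge data)
n = 200
toy = data.frame(ic50 = runif(n), h = runif(n, 0, 5), einf = runif(n, 0, 100))
toy$synergy = 0.3 * toy$einf - 5 * toy$h + rnorm(n, sd = 5)
train = toy[1:150, ]
test = toy[151:200, ]

# Fit one regression tree per bootstrap resample of the training rows
bag.trees = function(formula, data, n.trees = 25) {
    lapply(seq_len(n.trees), function(i) {
        idx = sample(nrow(data), replace = TRUE)
        rpart(formula, data = data[idx, ])
    })
}

# Ensemble prediction: average the predictions of the individual trees
bag.predict = function(trees, newdata) {
    preds = sapply(trees, predict, newdata = newdata)
    rowMeans(matrix(preds, nrow = nrow(newdata)))
}

trees = bag.trees(synergy ~ ic50 + h + einf, train)
pred = bag.predict(trees, test)
print(sprintf("Correlation on held-out toy samples: %f", cor(pred, test$synergy)))
```

Averaging over trees fitted to different resamples reduces the variance of a single tree, which is one reason such ensembles tend to be robust on small training sets.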

## Data overview
<2199 samples>

In [34]:
data.dir = "../data/"
set.seed(142341)
# Data overview
file.name = "Drug_synergy_data/ch1_train_combination_and_monoTherapy.csv"
f = read.csv(paste0(data.dir, file.name))
nrow(f)
summary(f)

      CELL_LINE          COMPOUND_A       COMPOUND_B     MAX_CONC_A    
 CAMA-1    :  64   AKT        : 243   PIK3C    : 227   Min.   : 0.003  
 MDA-MB-468:  59   IAP        : 152   MTOR_1   : 220   1st Qu.: 1.000  
 HCC1187   :  57   AKT_1      : 151   MAP2K_1  : 198   Median : 1.000  
 HCC1395   :  56   BCL2_BCL2L1: 121   FGFR     : 156   Mean   : 1.721  
 HCC70     :  54   MAP2K_1    : 121   IAP      :  89   3rd Qu.: 3.000  
 MDA-MB-453:  52   ATR_4      : 107   CSNK2A1_2:  75   Max.   :10.000  
 (Other)   :1857   (Other)    :1304   (Other)  :1234                   
   MAX_CONC_B         IC50_A               H_A              Einf_A      
 Min.   : 0.003   Min.   : 0.000003   Min.   : 0.0000   Min.   :  0.00  
 1st Qu.: 1.000   1st Qu.: 0.072514   1st Qu.: 0.4777   1st Qu.: 11.82  
 Median : 1.000   Median : 0.344153   Median : 1.3273   Median : 42.90  
 Mean   : 4.956   Mean   : 0.731751   Mean   : 3.1060   Mean   : 46.72  
 3rd Qu.: 3.000   3rd Qu.: 1.000000   3rd Qu.: 3.6625   3rd

## Data filtering

- Filter out low-quality samples (404 samples)

- Filter out cell lines with low sensitivity, that is, those for which $\min((Einf_A + Einf_B) / 2) > 20$ over their samples (8 cell lines, 142 samples)

In [35]:
library(plyr) # ddply
f$cat = f[,"SYNERGY_SCORE"]
# Drop samples that fail quality assurance
print(sprintf("Number of samples with QA < 1: %d", nrow(f[f$QA!=1,])))
f = f[f$QA==1,]
# Mean Einf of the two drugs in each sample
f$einf = (f$Einf_A + f$Einf_B) / 2
# Per cell line summary of synergy scores and mean Einf
a = ddply(f, ~ CELL_LINE, summarize, syn.sd = sd(cat), syn.min = min(cat), syn.med = median(cat), einf.sd = sd(einf), einf.min = min(einf), einf.med = median(einf))
cutoff = 20 # 20 (8 cell lines) # 10 (22 cell lines)
# Cell lines whose lowest mean Einf still exceeds the cutoff are insensitive
cell.lines = as.vector(a[a$einf.min>cutoff,"CELL_LINE"])
print(sprintf("Insensitive cell lines: %s", paste(cell.lines, collapse=",")))
nrow(f[f$CELL_LINE %in% cell.lines,])
f = f[!f$CELL_LINE %in% cell.lines,]
# Cell lines with a high minimum Einf have lower synergy
b = cor.test(a$syn.med, a$einf.min) # cor.test drops incomplete pairs itself; it has no "use" argument
print(sprintf("Correlation between einf.min and syn.med: %f %f", b$estimate[[1]], b$p.value)) # -0.235
library(caret) # findCorrelation
# Find correlated features
cor.mat = cor(f[,4:11])
cor.idx = findCorrelation(cor.mat, cutoff = .75)
# findCorrelation returns column indices of cor.mat, not of f
print(c("Correlated features:", colnames(cor.mat)[cor.idx]))


[1] "Number of samples with QA < 1: 404"
[1] "Insensitive cell lines: 22RV1,CAL-120,HCC1143,HCC1428,HCC1937,KU-19-19,UACC-812,VCaP"


[1] "Correlation between einf.min and syn.med: -0.235249 0.031231"
[1] "Correlated features:"


## Feature definition

- Monotherapy response based: 
    * max concentration
    * viability at max kill
    * IC50 
    * slope of the fit to the dose response curve
- Gene expression based
    * target level
    * pathway level
- Mutation information based
    * target level
    * pathway level
- GUILD prioritization based
- Drug similarity based
    * chemical formula
    * common targets
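
As an illustration of the "common targets" drug similarity feature listed above, one possible definition is the Jaccard index of the two drugs' target sets. This is a sketch under assumptions: the notebook does not state which similarity measure was used, and the drug names, target sets, and the `jaccard` helper below are hypothetical placeholders rather than challenge data.

```r
# Hypothetical drug-to-target map (illustrative placeholders, NOT challenge data)
targets = list(
    drug.a = c("AKT1", "AKT2", "AKT3"),
    drug.b = c("AKT1", "MTOR"),
    drug.c = c("MAP2K1", "MAP2K2")
)

# Jaccard index: shared targets divided by all targets of either drug
jaccard = function(x, y) {
    u = union(x, y)
    if (length(u) == 0) return(0)
    length(intersect(x, y)) / length(u)
}

# Pairwise similarity matrix over all drugs
drugs = names(targets)
sim = outer(drugs, drugs, Vectorize(function(a, b) jaccard(targets[[a]], targets[[b]])))
dimnames(sim) = list(drugs, drugs)
print(sim)
```

The value for a given drug pair could then be attached to each combination sample as a single numeric feature.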