# Structured Output Prediction of Anti-Cancer Drug Activity

Anas Atmani, Benoît Choffin, Domitille Coulomb, Paul Roujansky

### Data

**Input:**

- We consider 2305 distinct molecules. Each has several physico-chemical and geometric properties, which allow us to measure the similarity between any two molecules through a kernel. We end up with the (2305x2305) Gram matrix of the Tanimoto kernel.
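
As an illustration of how such a Gram matrix can be obtained, here is a minimal sketch of the Tanimoto kernel, assuming each molecule is described by a binary feature vector (the notebook itself loads a precomputed matrix, so the feature representation and the helper name `tanimoto_gram` are assumptions made for this example only). For two vectors x and y, the kernel is <x, y> / (<x, x> + <y, y> - <x, y>).

```python
import numpy as np

def tanimoto_gram(F):
    """Tanimoto Gram matrix for the rows of F (hypothetical helper).

    F is assumed to be an (n_molecules, n_features) binary matrix.
    """
    F = np.asarray(F, dtype=float)
    inner = F @ F.T                      # pairwise inner products <x, y>
    sq = np.diag(inner)                  # squared norms <x, x>
    denom = sq[:, None] + sq[None, :] - inner
    # Rows that are entirely zero give 0/0; we set those entries to 0.
    return np.divide(inner, denom, out=np.zeros_like(inner), where=denom > 0)

# Toy fingerprints (made up for the example)
fingerprints = np.array([[1, 1, 0, 0],
                         [1, 0, 1, 0],
                         [1, 1, 0, 0]])
K = tanimoto_gram(fingerprints)
```

The resulting matrix is symmetric, with ones on the diagonal for any non-zero vector.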

**Output:**

- We have a total of 59 cancer cell lines for which we would like to predict the effect of each molecule (active/inactive). These labels are provided in a (2305x59) "target" matrix.

- We also have external RNA-based data for each cancer cell line. By computing the (59x59) correlation matrix based on these features, we build a similarity graph between all cancer cell lines through a *maximum weight spanning tree* (MWST). Note that the graph does not need to be fully connected; a sparse structure such as a tree should considerably reduce computation time.
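
One possible way to extract such a tree is sketched below, using `scipy.sparse.csgraph.minimum_spanning_tree` on transformed weights. The helper name `max_weight_spanning_tree` and the toy correlation matrix are assumptions for illustration. Since correlations lie in [-1, 1], the quantity `2 - corr` is strictly positive, and a tree always has n-1 edges, so minimizing the sum of `2 - corr` is equivalent to maximizing the sum of correlations. The shift also avoids zero weights, which SciPy interprets as missing edges.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def max_weight_spanning_tree(corr):
    """Return the edges (i, j), i < j, of a maximum weight spanning tree.

    corr is assumed to be a symmetric correlation matrix with values in [-1, 1].
    """
    corr = np.asarray(corr, dtype=float)
    dist = 2.0 - corr             # strictly positive "distances"
    np.fill_diagonal(dist, 0.0)   # zero entries are treated as "no edge"
    tree = minimum_spanning_tree(dist).tocoo()
    return sorted((int(min(i, j)), int(max(i, j)))
                  for i, j in zip(tree.row, tree.col))

# Toy 4x4 correlation matrix (made up for the example)
corr = np.array([[1.0, 0.9, 0.1, 0.2],
                 [0.9, 1.0, 0.8, 0.3],
                 [0.1, 0.8, 1.0, 0.7],
                 [0.2, 0.3, 0.7, 1.0]])
edges = max_weight_spanning_tree(corr)
```

On this toy matrix the tree keeps the three strongest links, forming the chain 0-1-2-3.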

### Modelling

Two approaches:

- Take into account the similarities between the cancer cell lines and exploit this "structure" (the goal of the article) through an MMCRF algorithm.
- Perform prediction independently for each cancer cell line, through a standard classification algorithm such as an SVM.

In [3]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [4]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.sparse.csgraph import minimum_spanning_tree
import time

## Data import

In [10]:
file_names = ['ncicancer_input_kernel.txt',
            'ncicancer_bin_targets.txt',
            'ncicancer_targets.txt',
            'ncicancer_cancerCL_corr.txt']

data = []

' We import each dataset and append it to the list "data" '
for file in file_names:
    try:
        data.append(np.loadtxt('data_clean/'+file))
        print('%s loaded.' %file)
    except (OSError, ValueError) as err:
        print('Error: %s not loaded (%s).' % (file, err))

ncicancer_input_kernel.txt loaded.
ncicancer_bin_targets.txt loaded.
ncicancer_targets.txt loaded.
ncicancer_cancerCL_corr.txt loaded.


In [372]:
' We define the variables '
X_gram = data[0]
Y_class = data[1]
Y_reg = data[2]
cancer_correls = data[3]

In [373]:
' We check the shape of each variable '
X_gram.shape, Y_class.shape, Y_reg.shape, cancer_correls.shape

((2305, 2305), (2305, 59), (2305, 59), (59, 59))

## First approach: SVM

In [374]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [375]:
class partition():
    
    def __init__(self, n_splits=3, shuffle=False):
        '''
        - "n_splits"  number of folds (at least 2)
        - "shuffle"   boolean which states whether to shuffle the data or not before splitting into batches
        '''
        self.n_splits = n_splits
        self.shuffle = shuffle
        
    def get_splits(self, n):
        '''
        Compute the partition (indices contained in each fold)
        '''
        self.n = n
        
        self.idx = np.arange(self.n)
        if self.shuffle:
            np.random.shuffle(self.idx)
        
        # size of each fold; the last fold also receives the remainder
        self.l = self.n // self.n_splits
        
        self.partition = [self.idx[j*self.l:(j+1)*self.l] for j in range(self.n_splits-1)]
        self.partition.append(self.idx[(self.n_splits-1)*self.l:])
    
    def split(self, i, X, y):
        '''
        Performs k-fold with:
        - "i"   index of the split in the index partition
        - "X"   Gram (n*n) matrix
        - "y"   output (n*m) matrix (m outputs)
        '''
        
        idx_train = np.concatenate([self.partition[k] for k in range(self.n_splits) if k!=i])
        
        idx_test = self.partition[i]
        X_train = X[idx_train,:][:,idx_train]
        y_train = y[idx_train,:]
        X_test = X[idx_test,:][:,idx_train] # CAREFUL: rows are test samples, columns are training samples
        y_test = y[idx_test,:]
        
        return X_train, y_train, X_test, y_test
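
The slicing used in `split` deserves a short illustration. With a precomputed kernel, the training matrix contains kernel values between training samples, while the test matrix must contain kernel values between test samples (rows) and training samples (columns). The sketch below checks this on toy data, assuming a linear kernel and made-up index sets; it uses `np.ix_`, which is equivalent to the chained indexing above.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))   # toy data: 6 samples, 3 features
K = features @ features.T            # full (6x6) linear-kernel Gram matrix

idx_train = np.array([0, 1, 2, 3])
idx_test = np.array([4, 5])

# Kernel between training samples: shape (n_train, n_train)
K_train = K[np.ix_(idx_train, idx_train)]
# Kernel between test and training samples: shape (n_test, n_train)
K_test = K[np.ix_(idx_test, idx_train)]

print(K_train.shape, K_test.shape)
```

A model fitted with `kernel='precomputed'` on `K_train` expects a matrix shaped like `K_test` at prediction time, not the square block between test samples.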

In [387]:
class k_fold_CV():
    
    def __init__(self, C, n_splits, shuffle):
        '''
        - "C"         SVM hyperparameter
        - "n_splits"  number of folds of the k-fold cross validation (at least 2)
        - "shuffle"   boolean which states whether to shuffle the data or not before splitting into batches
        '''
        self.C = C
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.trained = False
        
    def fit_predict(self, X, y, verbose=True):
        
        self.n, self.m = y.shape
        ' we build "n_splits" folds '
        kFold = partition(n_splits=self.n_splits, shuffle=self.shuffle)
        kFold.get_splits(y.shape[0])
        shuffled_y = y[kFold.idx,:]
        
        model = SVC(C=self.C, kernel='precomputed')
        
        if verbose:
            print("Fold \t Computation time")
        
        for k in range(self.n_splits):
            ' we perform k-fold cross validation '
            startTime = time.time()
            
            ' we create the synthetic train and test datasets '
            X_train, y_train, X_test, y_test = kFold.split(k, X, y)
            
            Y_preds_j = []
            
            for j in range(y.shape[1]):
                ' we train one distinct model per cancer cell line '
                model.fit(X_train, y_train[:,j])
                ' we stack the results iteratively '
                Y_preds_j.append(model.predict(X_test))
            
            ' We stack the results obtained for each fold '
            if k==0:
                Y_preds = np.array(Y_preds_j).T
            else:
                Y_preds = np.concatenate((Y_preds, np.array(Y_preds_j).T))
        
            runTime = time.time() - startTime
            if verbose:
                print("%d/%d \t %d" %(k+1, self.n_splits, runTime))
        
        ' we calculate the classification accuracy for each cancer cell line '
        self.accuracies = np.array([accuracy_score(shuffled_y[:,i], Y_preds[:,i]) for i in range(Y_preds.shape[1])])
        self.trained = True
        
        
    def results(self):
        
        if self.trained:
            print("Results for %d folds on the full dataset:" %self.n_splits)
            print("Average = %.2f%%" %(np.mean(self.accuracies)*100))
            print("Standard deviation = %.2f%%" %(np.std(self.accuracies)*100))
        else:
            print("Not trained yet.")
        

In [388]:
model = k_fold_CV(C=100, n_splits=5, shuffle=True)

In [389]:
model.fit_predict(X_gram, Y_class)

Fold 	 Computation time
1/5 	 17
2/5 	 17
3/5 	 16
4/5 	 16
5/5 	 17


In [391]:
model.results()

Results for 5 folds on the full dataset:
Average = 76.63%
Standard deviation = 3.93%
