<a href="https://colab.research.google.com/github/alejogiley/Novartis-Hackaton-7/blob/master/Notebooks/Lee_NN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting affinities of antibiotic candidates to a DNA Gyrase

In [0]:
#!wget -c https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
#!chmod +x Anaconda3-5.1.0-Linux-x86_64.sh
#!bash ./Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p /usr/local

#!conda install -q -y --prefix /usr/local -c omnia --no-update-deps pdbfixer=1.4
#!conda install -q -y --prefix /usr/local -c conda-forge --no-update-deps xgboost=0.6a2
#!conda install -q -y --prefix /usr/local -c rdkit --no-update-deps rdkit=2017.09.1
#!conda install -q -y --prefix /usr/local -c deepchem --no-update-deps  deepchem-gpu=2.1.0

#import sys
#sys.path.append('/usr/local/lib/python3.6/site-packages/')

In [2]:
import os

import numpy  as np      # scientific computing: arrays
import scipy  as sp      # scientific computing: statistics
import pandas as pd      # data analysis tools

# Machine learning
import tensorflow as tf
import keras.backend as K

# Neural Network
from keras.models import Sequential
from keras.layers import Dropout, BatchNormalization
from keras.layers import Dense, Activation

# Data processing
from sklearn import preprocessing
from sklearn.model_selection import train_test_split 

# Visualization
import matplotlib.pyplot as plt

# Set random seed
np.random.seed(0)

Using TensorFlow backend.


### Neural Network

In [0]:
# GPU info
# !nvidia-smi

In [0]:
filepath = "https://raw.githubusercontent.com/alejogiley/Novartis-Hackaton-7/master/Data/Gyrase/AZ_Pyrrolamides_features.csv"
datasets = pd.read_csv(filepath)

Select __x__ and __y__ vectors.

In [0]:
# input and output
y = datasets['pIC50'].copy()
x = datasets.iloc[:,5:].copy()

# qualifiers classification
s1 = datasets['left' ].apply(lambda x: x*1).copy()
s2 = datasets['right'].apply(lambda x: x*1).copy()
s0 = s1 + s2

# > greater
gcutoff = datasets.loc[ datasets['qualifiers']]['pIC50'].copy()

# < lower
lcutoff = datasets.loc[~datasets['qualifiers']]['pIC50'].copy()

In [31]:
l.shape

(671,)

Split the machine-learning-ready dataset into __training__ and __test__ subsets.

In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
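If a separate validation subset is also wanted, one common approach is a second call to `train_test_split` on the training portion. The sketch below uses synthetic arrays; the names `x_demo`, `y_demo`, `x_fit`, `x_val` and `x_hold`, and the 20% validation fraction, are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the pIC50 vector (hypothetical data)
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First split: hold out a test subset, same settings as the cell above
x_fit, x_hold, y_fit, y_hold = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=42)

# Second split: carve a validation subset out of the remaining training data
# (the 20% fraction is an arbitrary choice for this sketch)
x_fit, x_val, y_fit, y_val = train_test_split(
    x_fit, y_fit, test_size=0.2, random_state=42)

print(len(y_fit), len(y_val), len(y_hold))
```

Because the second split only sees the training portion, the test subset stays untouched until the final evaluation.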

The performance of many machine-learning algorithms, neural networks in particular, is sensitive to how the data are preprocessed. Here we will normalize the features and $\log{\text{IC50}}$ to have zero mean and unit standard deviation.
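A minimal sketch of this normalization, assuming scikit-learn's `StandardScaler` is an acceptable choice (the notebook imports `sklearn.preprocessing` but does not show which scaler it uses). The arrays `x_train_demo`, `x_test_demo` and `y_train_demo` are synthetic placeholders. The scalers are fitted on the training subset only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic placeholders for the real training and test subsets (hypothetical data)
rng = np.random.RandomState(0)
x_train_demo = rng.normal(loc=3.0, scale=2.0, size=(80, 4))
x_test_demo = rng.normal(loc=3.0, scale=2.0, size=(20, 4))
y_train_demo = rng.normal(loc=6.0, scale=1.5, size=80)

# Fit the feature scaler on the training data, then apply it to both subsets
x_scaler = StandardScaler().fit(x_train_demo)
x_train_scaled = x_scaler.transform(x_train_demo)
x_test_scaled = x_scaler.transform(x_test_demo)

# The target is one-dimensional, so reshape it to a column before scaling
y_scaler = StandardScaler().fit(y_train_demo.reshape(-1, 1))
y_train_scaled = y_scaler.transform(y_train_demo.reshape(-1, 1)).ravel()

# Predictions made on the scaled target can be mapped back to original units
y_restored = y_scaler.inverse_transform(y_train_scaled.reshape(-1, 1)).ravel()
```

Keeping `y_scaler` around matters because the network then predicts in scaled units, and `inverse_transform` is needed to report affinities on the original scale.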

In [0]:
# input dimensions
number_of_features = x.shape[1]  # columns are features; shape[0] is the number of samples

# Start neural network
network = Sequential()

# Add fully connected layer with a ReLU activation function
network.add(Dense(units=1000, use_bias=False, input_shape=(number_of_features, )))
network.add(BatchNormalization())
network.add(Activation("relu"))
network.add(Dropout(0.25))

# Add fully connected layer with a ReLU activation function
network.add(Dense(units=2000, use_bias=False))
network.add(BatchNormalization())
network.add(Activation("relu"))
network.add(Dropout(0.25))

# Add fully connected layer with a ReLU activation function
network.add(Dense(units=500, activation='relu'))

# Add output layer with a linear activation function (regression target);
# softmax over a single unit would always output 1
network.add(Dense(units=1, activation='linear'))
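For a single-output regression head the choice of activation matters: softmax normalizes across the units of a layer, so over a single unit it returns 1 whatever the input. The NumPy sketch below illustrates this; the helper `softmax` and the sample logits are written for this example only.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Three samples, one output unit each (hypothetical pre-activation values)
single_unit_logits = np.array([[-3.0], [0.0], [7.5]])
single_unit_out = softmax(single_unit_logits)

print(single_unit_out.ravel())
```

Every row comes out as 1, so a one-unit softmax layer cannot represent a continuous target such as a normalized pIC50.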

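The customized loss function: exact measurements contribute a squared error, while censored measurements (reported only as greater or lower than a cutoff) are penalized only when the prediction falls on the wrong side of the cutoff.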

In [0]:
def custom_loss(y_true, y_pred):
    # get deltas
    z = y_pred - y_true
    g = y_pred - gcutoff
    l = y_pred - lcutoff
    # qualifiers adjusted Loss function
    return K.mean((1-s0)*K.square(z) + s1*K.relu(-g) + s2*K.relu(l), axis=-1)
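The same formula can be checked on a few numbers with a NumPy reference implementation. This is a sketch under assumptions: the qualifier flags are passed per sample (`is_greater`, `is_lower` are hypothetical names standing in for `s1` and `s2`), and for censored samples `y_true` is taken to hold the cutoff itself. Passing the flags per sample also sidesteps a practical issue with the Keras version above, where the full-dataset vectors would not line up with mini-batches.

```python
import numpy as np

def censored_loss_reference(y_true, y_pred, is_greater, is_lower):
    """NumPy reference for a qualifier-aware regression loss.

    is_greater marks labels reported as '> cutoff', is_lower marks '< cutoff';
    in both cases y_true is assumed to hold the cutoff value.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    is_greater = np.asarray(is_greater, dtype=float)
    is_lower = np.asarray(is_lower, dtype=float)

    exact = 1.0 - is_greater - is_lower               # 1 for uncensored samples
    delta = y_pred - y_true
    squared = exact * delta ** 2                      # ordinary squared error
    below = is_greater * np.maximum(-delta, 0.0)      # predicted under a '>' cutoff
    above = is_lower * np.maximum(delta, 0.0)         # predicted over a '<' cutoff
    return float(np.mean(squared + below + above))

# Three hypothetical samples: exact, '>' censored, '<' censored
loss_demo = censored_loss_reference(
    y_true=[5.0, 6.0, 4.0],
    y_pred=[5.5, 5.0, 3.0],
    is_greater=[0, 1, 0],
    is_lower=[0, 0, 1],
)
print(loss_demo)
```

In this example the exact sample contributes 0.25, the '>' sample contributes 1.0 because the prediction lies below its cutoff, and the '<' sample contributes nothing because the prediction is already below its cutoff.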

Results
-----------------

Plot

In [0]:
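import deepchem as dc  # needs the DeepChem install from the first (commented-out) cell

# Sample learning rate and decay log-uniformly; train for a fixed number of epochs
params_dict = {"learning_rate": np.power(10., np.random.uniform(-5, -3, size=1)),
               "decay": np.power(10., np.random.uniform(-6, -4, size=1)),
               "nb_epoch": [1500] }

# Note: get_data_shape() is a DeepChem Dataset method, so x_train must be a
# DeepChem Dataset here rather than the pandas DataFrame from train_test_split
n_features = x_train.get_data_shape()[0]

def model_builder(model_params, model_dir):
    # A scalar dropout is applied to every layer
    model = dc.models.MultitaskRegressor(
        1, n_features, layer_sizes=[2000,1000,1000,500], 
        dropouts=.25, batch_size=50, **model_params)
    return model

# Print best score | Plot 'loss vs epoch'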

In [0]:
# Correlation plot

predicted_test = best_dnn.predict(valid)
true_test = valid.y
plt.scatter(predicted_test, true_test)
plt.xlabel('Predicted Log IC50')
plt.ylabel('True Log IC50')
plt.title(r'DNN - predicted vs. true log-IC50s')
plt.plot([-7, 2], [-7, 2], color='k')
plt.xlim([-7, 2])
plt.ylim([-7, 2])
plt.show()
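A scatter plot is easier to interpret alongside a couple of summary numbers. The sketch below computes the root-mean-square error and the Pearson correlation with NumPy; `regression_summary` is a hypothetical helper and the input values are made up for illustration, not results from this model.

```python
import numpy as np

def regression_summary(y_true, y_pred):
    """Return RMSE and Pearson correlation between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
    pearson_r = float(np.corrcoef(y_true, y_pred)[0, 1])
    return {"rmse": rmse, "pearson_r": pearson_r}

# Made-up values in the same range as the plot axes above
summary_demo = regression_summary(
    y_true=[-5.0, -3.0, -1.0, 0.5],
    y_pred=[-4.5, -3.5, -0.5, 0.0],
)
print(summary_demo)
```

With the notebook's own variables, the call would presumably be `regression_summary(true_test, predicted_test)`, assuming both are array-like with one value per compound.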