# 1. Neural networks - MLP and DNN

A multi-layer perceptron (MLP) is a supervised learning model that learns a function by training on a dataset. Given enough hidden units, it can approximate any continuous function on a bounded domain to arbitrary accuracy (the universal approximation property). It works by propagating signals forward through the network to produce a prediction and propagating the error backward to adjust the weights, which is how the network adapts to the problem and the data.
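
The forward and backward passes described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the XOR toy data, the hidden layer size of 8, the tanh activation, the learning rate and the number of steps are all arbitrary choices and are not taken from the model used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): XOR, which a single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 8 units; sizes chosen only for illustration
W1 = rng.normal(0, 1, size=(2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(0, 1, size=(8, 1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    """Feedforward pass: hidden activations and output probabilities."""
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    return h, p

def bce(p, y):
    """Mean binary cross-entropy."""
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

lr = 0.5
_, p0 = forward(X)
loss_before = bce(p0, y)

for _ in range(2000):
    h, p = forward(X)
    # Backward pass: for sigmoid + cross-entropy the output error is (p - y)
    d_out = (p - y) / len(X)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    # Propagate the error through the tanh hidden layer
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0, keepdims=True)
    # Gradient descent step on all weights and biases
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

_, p1 = forward(X)
loss_after = bce(p1, y)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

Libraries such as Keras compute these gradients automatically, so the loop is never written by hand in practice.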

![Perceptron.png](attachment:Perceptron.png)

Source: https://pl.wikipedia.org/wiki/Perceptron#/media/Plik:Perceptron_moj.png

![ANN2.png](attachment:ANN2.png)

Source: https://en.wikipedia.org/wiki/Neural_network

A deep neural network (DNN) is an artificial neural network (ANN) with a more complex architecture built from a larger number of layers. During learning, a DNN adapts to the data, and the knowledge it gathers is stored in the weights connecting the artificial neurons. Developing and training a very complex neural network requires a specific approach called deep learning, which involves techniques that prevent overfitting and reduce the time needed to train the model. A DNN can also include specialized layers, such as filters that perform convolution on the data (convolutional neural networks).
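
One widely used technique for preventing overfitting is dropout, which the model later in this notebook applies through Keras `Dropout(0.3)` layers. The sketch below shows the idea in NumPy using the common "inverted dropout" formulation. The helper name `dropout` and the array sizes are assumptions made for illustration, and this is not a copy of the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
a = np.ones((1000, 100))            # dummy activations of a 100-unit layer
dropped = dropout(a, rate=0.3, rng=rng)

zero_fraction = float((dropped == 0).mean())
mean_activation = float(dropped.mean())
print(f"fraction zeroed: {zero_fraction:.3f}, mean activation: {mean_activation:.3f}")
```

At prediction time (`training=False`) the activations pass through unchanged, which is why the rescaling is done during training.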


# 2. Problem - predict whether a given chemical compound is an agonist of the estrogen nuclear receptor alpha

The 2014 Tox21 Data Challenge was designed to develop models capable of predicting the affinity of chemical compounds for specific receptors. The general idea was to develop tools that could be applied to searching for new active compounds and to screening for potential new drugs. We will focus on assessing the affinity of compounds for the estrogen receptor alpha.

qHTS assay to identify small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway using the BG1 cell line

Tox21 challenge: https://tripod.nih.gov/tox21/challenge/index.jsp

Dataset we will explore: https://pubchem.ncbi.nlm.nih.gov/bioassay/743079

![data.png](attachment:data.png)

In [55]:
#Import all required packages
from rdkit import Chem
from rdkit.Chem import AllChem
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn import linear_model
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l1_l2
from keras.wrappers.scikit_learn import KerasClassifier

#Load the data
df_raw = pd.read_csv("estrogen_nuclear_receptor.txt", sep="\t")

#Function to calculate a Morgan fingerprint (radius 3) as a 10000-bit binary vector
def morgan_fp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=10000)
    #Convert the bit string ("0101...") into a float array of 0.0/1.0 values
    npfp = np.array(list(fp.ToBitString())).astype('float')
    return npfp

In [56]:
#Calculate fingerprints: one row per compound, one column per fingerprint bit
data = np.zeros((df_raw.shape[0], 10000))
for i, smiles in enumerate(df_raw["SMILES"]):
    data[i] = morgan_fp(smiles)

In [57]:
data.shape

(8472, 10000)

In [58]:
#Create input and output vectors of variables
X=data
y=pd.DataFrame(df_raw['Agonist'])
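
The cells that follow use `X_train` and `y_train`, which are not defined in the cells shown here. One plausible way to obtain them is a stratified train/test split, sketched below on synthetic stand-in data. The arrays `X_demo` and `y_demo`, the 80/20 ratio and the seed are assumptions for illustration and not the split actually used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the fingerprint matrix and the binary 'Agonist' labels
X_demo = rng.integers(0, 2, size=(200, 50)).astype(float)
y_demo = np.array([1] * 20 + [0] * 180)   # imbalanced classes, 10% positives

# stratify=y_demo keeps the class ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123)

print(X_train.shape, X_test.shape)
print("positives in train/test:", y_train.sum(), y_test.sum())
```

Stratification matters for imbalanced labels, because a purely random split can leave very few active compounds in the test set.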

In [None]:
#Define model architecture and perform 10-fold cross-validation
#Note: X_train and y_train are not defined in the cells shown here; they are
#assumed to come from a train/test split of X and y
def DNN_model():
    model = Sequential()
    model.add(Dense(100, input_dim=X_train.shape[1], kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(100, kernel_initializer='normal', activation='relu', kernel_regularizer=l1_l2(l1=0.0, l2=0.5)))
    model.add(Dropout(0.3))
    model.add(Dense(100, kernel_initializer='normal', activation='relu', kernel_regularizer=l1_l2(l1=0.0, l2=0.5)))
    model.add(Dropout(0.3))
    model.add(Dense(100, kernel_initializer='normal', activation='relu', kernel_regularizer=l1_l2(l1=0.0, l2=0.5)))
    model.add(Dropout(0.3))
    #Sigmoid output yields a probability, as binary_crossentropy expects
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

dnn = KerasClassifier(build_fn=DNN_model, epochs=10, batch_size=1000, verbose=1)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
results = cross_validate(dnn,
                         X_train,
                         y_train,
                         cv=kfold,
                         scoring=['accuracy', 'balanced_accuracy', 'roc_auc', 'f1'],
                         return_train_score=True,
                         return_estimator=True)



In [64]:
results

{'fit_time': array([118.74217892, 111.05812097, 118.52625012, 127.69321537,
        117.79111862, 120.28922796, 122.0893321 , 126.15829253,
        125.78135157, 133.53012562]),
 'score_time': array([0.46792293, 0.37630415, 2.74190092, 0.41623378, 0.35397983,
        2.65362167, 2.59019184, 1.95253706, 2.77430415, 3.10672784]),
 'estimator': (<keras.wrappers.scikit_learn.KerasClassifier at 0x7f89850fe5c0>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f89bc4e8208>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f89c4557550>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f89bda88860>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f89880a1198>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f89837fc208>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f8981ecd0b8>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f89804fc4e0>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f897eb04a20>,
  <keras.wrappers.scikit_learn.KerasClassifier at 0x7f
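
`cross_validate` returns a dictionary with one array per metric, holding one value per fold, so the per-fold scores are usually summarized as mean and standard deviation. The sketch below shows this on synthetic data, with a logistic regression standing in for the Keras model so that it runs quickly. The data, the stand-in estimator and the name `summary` are assumptions for illustration, and the printed numbers are not results for the Tox21 dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)

# Synthetic binary problem: the label depends mostly on the first feature
X_demo = rng.normal(size=(300, 10))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
results = cross_validate(LogisticRegression(max_iter=1000),
                         X_demo,
                         y_demo,
                         cv=kfold,
                         scoring=['accuracy', 'balanced_accuracy', 'roc_auc', 'f1'],
                         return_train_score=True)

# Keep only the held-out (test) scores and summarize them across folds
summary = {name: (scores.mean(), scores.std())
           for name, scores in results.items()
           if name.startswith('test_')}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Comparing the `train_` and `test_` entries of the same metric is a quick check for overfitting: a large gap suggests the model memorizes the training folds.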

In [229]:
#Compare with a linear model (L2-regularized logistic regression)
fitscores = []
alphas = np.logspace(-2, 4, num=10)
for alpha in alphas:
    #C is the inverse of the regularization strength alpha
    estimator = linear_model.LogisticRegression(C=1/alpha)
    results = cross_validate(estimator,
                             X_train,
                             y_train,
                             cv=kfold,
                             scoring=['accuracy', 'roc_auc', 'f1'],
                             return_train_score=True,
                             return_estimator=True)
    fitscores.append(results)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [230]:
fitscores

[{'fit_time': array([7.48042917, 6.38767385, 6.08726859, 6.08746243, 6.3068459 ,
         6.13200331, 6.33558989, 6.00037503, 6.14705253, 5.99856544]),
  'score_time': array([0.09256196, 0.03329754, 0.03242826, 0.03232718, 0.0276804 ,
         0.03450823, 0.03423858, 0.0318563 , 0.03415322, 0.03790474]),
  'estimator': (LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
                      intercept_scaling=1, l1_ratio=None, max_iter=100,
                      multi_class='auto', n_jobs=None, penalty='l2',
                      random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                      warm_start=False),
   LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
                      intercept_scaling=1, l1_ratio=None, max_iter=100,
                      multi_class='auto', n_jobs=None, penalty='l2',
                      random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                      warm_start=F
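
The warnings printed during the sweep point to two issues: the labels were passed as a column vector rather than a 1-D array, and the `lbfgs` solver stopped at the default `max_iter=100`. The sketch below repeats the same kind of sweep on synthetic data with a 1-D label array and a higher `max_iter`, then selects the regularization strength with the best mean ROC AUC. The data, the value `max_iter=1000`, the 5-fold setting and the names `mean_auc` and `best_alpha` are assumptions for illustration, and none of the output applies to the Tox21 dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(1)

# Synthetic binary "fingerprints"; the label depends on the first three bits
X_demo = rng.integers(0, 2, size=(300, 40)).astype(float)
y_demo = (X_demo[:, :3].sum(axis=1)
          + rng.normal(scale=0.5, size=300) > 1.5).astype(int)  # 1-D labels

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
alphas = np.logspace(-2, 4, num=10)

mean_auc = []
for alpha in alphas:
    # C is the inverse regularization strength; max_iter raised from the default 100
    estimator = LogisticRegression(C=1 / alpha, max_iter=1000)
    res = cross_validate(estimator, X_demo, y_demo, cv=kfold, scoring=['roc_auc'])
    mean_auc.append(res['test_roc_auc'].mean())

mean_auc = np.array(mean_auc)
best_alpha = alphas[mean_auc.argmax()]
print(f"best alpha: {best_alpha:.3g}, mean ROC AUC: {mean_auc.max():.3f}")
```

Selecting the hyperparameter on cross-validated scores rather than on training scores keeps the comparison with the neural network fair.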

Questions and suggestions are highly welcome! adam.paclawski@uj.edu.pl