# Predict activity
## A notebook for predicting the activity of a set of compounds
The top three models (KNN, Random Forest and XGBoost) are used here to classify a new set of compounds, and prove to be relatively good classifiers. The confusion matrices and ROC curves show that KNeighborsClassifier had the highest true positive rate and the highest AUC, both of which are important in drug discovery. However, the predictions of none of the three classifiers correlated with the experimental activity values (fluorescence) according to the Spearman correlation, indicating poor early recognition: the classifiers could not rank the compounds by their likelihood of being more active than others.
## Instructions
1. Define the name of the file that contains the compounds and the name of its activity label column in [Read data](#read). Each molecule must...
    1. ...be identified by "CID";
    2. ...have an activity label (either active or inactive);
    3. ...have values for the molecular descriptors defined in [Classifiers and molecular descriptors](#classifiers).
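
These requirements can be checked before any model is loaded. The following is a minimal sketch, not part of the original notebook: `missing_columns` is a hypothetical helper, the column names are made up, and it assumes the column names of the input file are available as a list (for example `list(test_data.columns)`).

```python
def missing_columns(columns, activity_label, descriptors_dict):
    """Return the required column names that are absent from `columns`.

    Hypothetical helper: `columns` is any iterable of column names,
    e.g. list(test_data.columns).
    """
    required = {'CID', activity_label}
    for descriptors in descriptors_dict.values():
        required.update(descriptors)
    return sorted(required - set(columns))


# Illustrative values only; 'activity' is an assumed label name
example_columns = ['CID', 'activity', 'TPSA', 'LabuteASA']
example_descriptors = {'KNeighborsClassifier': ['NumHAcceptors', 'TPSA', 'LabuteASA']}
print(missing_columns(example_columns, 'activity', example_descriptors))
```

An empty list means every required column is present.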
    
## How it works
The models have already been trained and serialized (saved) to pickle files, one per model. Each model is loaded and given the values of its molecular descriptors for a set of compounds. It then predicts the activity of these compounds based on the data it was trained on, and the predictions are compared with the experimental activity labels.
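
The save, load and predict cycle can be sketched as follows. This is an illustration on invented toy data, not the training code behind the models used here: the descriptor values, the 0/1 label encoding and the choice of `n_neighbors=1` are all assumptions, and the model is round-tripped through bytes in memory rather than through a file.

```python
import pickle

from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for illustration: two descriptor values per compound
X_train = [[10.0, 1.0], [12.0, 1.5], [80.0, 6.0], [85.0, 6.5]]
y_train = [0, 0, 1, 1]  # 0 = inactive, 1 = active (assumed encoding)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Serialize the fitted model to bytes and restore it, as a pickle file would
restored = pickle.loads(pickle.dumps(model))

# The restored model predicts on compounds it has never seen
X_new = [[11.0, 1.2], [82.0, 6.2]]
predictions = restored.predict(X_new)
print(predictions)
```

The restored object carries its fitted state, so no retraining is needed before calling `predict`.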

## Table of contents
1. [Read data](#read)
2. [Classifiers and molecular descriptors](#classifiers)
3. [Confusion matrices](#confusion)
4. [ROC curves](#roc)    
5. [Class distribution](#class)    
6. [Spearman correlations](#spearman)    
7. [Error metrics](#error)    

<a id='read'></a>
## Read data

In [None]:
# File containing the compounds and their molecular descriptors
# Define the name of the input file here
compounds = None
# Define the name of the activity label here
activity_label = None

In [None]:
import pandas as pd

test_data = pd.read_csv(compounds)
test_data.head()

<a id='classifiers'></a>
## Classifiers and molecular descriptors

In [None]:
descriptors_dict = {}
# Top 3 models
descriptors_dict['RandomForestClassifier'] = ['NumRotatableBonds', 'NumHDonors', 'TPSA', 'LabuteASA']
descriptors_dict['KNeighborsClassifier'] = ['NumHAcceptors', 'TPSA', 'LabuteASA']
descriptors_dict['XGBClassifier'] = ['NumRotatableBonds', 'TPSA']

In [None]:
def load_pickle(model_name):
    import pickle
    # Only unpickle files from a trusted source: pickle can execute arbitrary code
    with open(f'../pickle/{model_name}.pickle', 'rb') as file:
        model_fitted = pickle.load(file)
    return model_fitted

<a id='confusion'></a>
## Confusion matrices

In [None]:
def print_confusion_matrix(y_test, y_pred):
    from sklearn.metrics import confusion_matrix
    # normalize='true' divides each row by the number of true samples in that class
    conf_matrix = confusion_matrix(y_test, y_pred, normalize='true')
    matrix = pd.DataFrame(conf_matrix)
    print(matrix.round(2))
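
With `normalize='true'`, scikit-learn divides each row of the confusion matrix by the number of samples that truly belong to that class, so each row sums to 1 and the diagonal holds the per-class recall (the true positive rate for the active class). The pure-Python sketch below reproduces that arithmetic on made-up labels; `row_normalized_confusion` is a hypothetical helper, not a scikit-learn function, and the 0/1 encoding is an assumption.

```python
def row_normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        counts[index[truth]][index[pred]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no true samples is left as zeros to avoid dividing by zero
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up labels: 0 = inactive, 1 = active (assumed encoding)
example_true = [0, 0, 0, 0, 1, 1]
example_pred = [0, 0, 0, 1, 1, 0]
example_matrix = row_normalized_confusion(example_true, example_pred, labels=[0, 1])
print(example_matrix)  # example_matrix[1][1] is the true positive rate
```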

In [None]:
y_test = test_data[activity_label]
Y = pd.DataFrame(test_data['CID'])
probas = pd.DataFrame(test_data['CID'])

for model_name in descriptors_dict.keys():
    model_fitted = load_pickle(model_name)
    subset = descriptors_dict[model_name]
    X_test = test_data[subset]
    y_pred = model_fitted.predict(X_test)
    Y[model_name] = y_pred
    
    print('\n', model_name)
    print_confusion_matrix(y_test, y_pred)
    
    # Column 1 holds the probability of the second class in model_fitted.classes_
    y_proba = model_fitted.predict_proba(X_test)
    probas[model_name] = y_proba[:,1]

<a id='roc'></a>
## ROC curves

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import roc_curve, auc

for model_name in descriptors_dict.keys():
    y_proba = probas[model_name]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.plot(fpr, tpr, label=f'{model_name}: {auc(fpr, tpr):>.3f}')
    
plt.plot([0,1], [0,1], linestyle='--')
plt.legend(title='Area Under the Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.show()
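
The area under the ROC curve equals the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one, with ties counted as one half. The sketch below computes it that way on made-up scores; `pairwise_auc` is a hypothetical helper and the 0/1 label encoding is an assumption.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: three of the four pairs are ranked correctly
example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

This pairwise form is quadratic in the number of compounds, so it is useful for understanding the metric rather than for large data sets.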

<a id='class'></a>
## Class distribution

In [None]:
def percentage_dist(values):
    values = pd.Series(values, name='values')
    # Name the columns explicitly so the result does not depend on the pandas version
    distribution = values.value_counts(
        normalize=True).mul(100).rename_axis('Class').reset_index(name='Percentage')
    return distribution

In [None]:
perc = percentage_dist(test_data[activity_label])
perc['Name'] = 'Experimental data'

for model_name in descriptors_dict.keys():
    y_pred = Y[model_name]
    dist = percentage_dist(y_pred)
    dist['Name'] = model_name
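    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    perc = pd.concat([perc, dist], ignore_index=True)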

sns.catplot(kind='bar', data=perc, x='Class', y='Percentage', hue='Name')
plt.title('Percentage of active/inactive compounds in the experimental data and in the predictions of each classifier', y=1.05)
plt.show()

<a id='spearman'></a>
## Spearman correlations

In [None]:
from scipy.stats import spearmanr

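# Assumption: the input file also holds the experimental fluorescence column
# 'f_inhibition_at_50_uM'; use another DataFrame here if it is stored elsewhere
df = pd.merge(test_data[['CID', 'f_inhibition_at_50_uM']], probas, on=['CID'])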
print('Spearman R')
for model_name in descriptors_dict.keys():
    # spearmanr returns (correlation, p-value); index 0 is the correlation coefficient
    print(f'{model_name}: {spearmanr(df[model_name], df["f_inhibition_at_50_uM"])[0]:.4f}')
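
`spearmanr` returns two values: the rank correlation coefficient and its p-value. A minimal sketch on made-up numbers shows that any monotonically increasing relationship gives a coefficient of 1 even when it is far from linear; the variable names below are illustrative only.

```python
from scipy.stats import spearmanr

# Made-up predicted probabilities and experimental readouts
example_scores = [0.1, 0.2, 0.5, 0.7, 0.9]
example_activity = [1.0, 4.0, 9.0, 50.0, 51.0]

# The first element is the coefficient, the second is the p-value
rho, p_value = spearmanr(example_scores, example_activity)
print(f'rho = {rho:.4f}')

# Reversing the order of one variable flips the sign of the coefficient
rho_reversed = spearmanr(example_scores, example_activity[::-1])[0]
print(f'rho (reversed) = {rho_reversed:.4f}')
```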

<a id='error'></a>
## Error metrics

In [None]:
def error_metrics(y_test, y_proba):
    import numpy as np
    from sklearn.metrics import log_loss
    
    # RMSE between predicted probabilities and true labels (labels must be numeric 0/1)
    rmse = np.linalg.norm(y_proba - y_test) / np.sqrt(len(y_test))
    logl = log_loss(y_test, y_proba)
    return rmse, logl

In [None]:
print('{:20s}\t{:5s}\t{:5s}'.format('Model', 'RMSE', 'log_loss'))
for model_name in descriptors_dict.keys():
    rmse, logl = error_metrics(y_test, probas[model_name])
    print(f'{model_name:20s}\t{rmse:.4f}\t{logl:.4f}')
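
Both metrics compare the predicted probability of the active class with the true label. Dividing the Euclidean norm of the differences by the square root of the number of samples gives the root mean squared error, and binary log loss is the mean negative log-likelihood of the true labels. The sketch below spells out that arithmetic with the standard library only, assuming labels encoded as 0/1 and probabilities strictly between 0 and 1; `rmse_and_log_loss` is a hypothetical helper and the numbers are made up.

```python
import math


def rmse_and_log_loss(y_true, y_proba):
    """RMSE and binary log loss for 0/1 labels and probabilities in (0, 1)."""
    n = len(y_true)
    squared_error = sum((p - y) ** 2 for y, p in zip(y_true, y_proba))
    rmse = math.sqrt(squared_error / n)
    log_likelihood = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_proba)
    )
    return rmse, -log_likelihood / n


example_rmse, example_logl = rmse_and_log_loss([1, 0, 1, 0], [0.8, 0.2, 0.6, 0.4])
print(f'{example_rmse:.4f}\t{example_logl:.4f}')
```

Lower values are better for both metrics; log loss penalizes confident wrong predictions more heavily than RMSE does.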