<a href="https://colab.research.google.com/github/dbetm/DeepLearningLifeSciences/blob/main/04_Molecules/Predict_solubility_using_GraphConvModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# SETUP DEEPCHEM
!pip install --pre deepchem
import deepchem
deepchem.__version__

In [None]:
import deepchem as dc

## Loading dataset

The dataset contains information about solubility, which is a measure of how easily a molecule dissolves in water.

If it does not dissolve easily, getting enough of it into a patient’s bloodstream to have a therapeutic effect may be impossible.

[Delaney dataset](https://github.com/deepchem/deepchem/blob/master/datasets/delaney-processed.csv)

In [None]:
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

## Building the model

`GraphConvModel` implements a graph convolutional network (GCN). A GCN works much like an ordinary CNN, but instead of starting with a vector of numbers for each pixel, it begins with a vector of numbers for each node and/or edge. When the graph represents a molecule, those numbers could be high-level chemical properties of each atom, such as its element, charge, and hybridization state.

GCNs have one important limitation: The calculation is based solely on the molecular graph. They receive no information about the molecule’s conformation, so they cannot hope to predict anything that is conformation-dependent. This makes them most suitable for small, mostly rigid molecules.
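To make the idea of "a vector of numbers for each node" concrete, here is a minimal sketch of a single neighbor-aggregation step in plain Python. It is an illustration only, not how `GraphConvModel` is implemented: the real model uses learned weights, nonlinearities, and pooling. The toy molecule, the two-number feature vectors, and the function names are all assumptions made for this example.

```python
# Toy molecular graph for the heavy atoms of ethanol: C-C-O.
# Hypothetical per-atom features: [is_carbon, is_oxygen].
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = [(0, 1), (1, 2)]  # undirected edges between atom indices


def neighbors_of(bonds, n_atoms):
    """Build an adjacency list from a list of undirected bonds."""
    nbrs = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    return nbrs


def graph_conv_step(features, nbrs):
    """One simplified aggregation step: each atom's new vector is the
    sum of its own vector and the vectors of its bonded neighbors."""
    updated = {}
    for atom, vec in features.items():
        total = list(vec)
        for n in nbrs[atom]:
            total = [t + f for t, f in zip(total, features[n])]
        updated[atom] = total
    return updated


nbrs = neighbors_of(bonds, len(features))
updated = graph_conv_step(features, nbrs)
print(updated)
```

Note that the only inputs are the atom features and the bond list. No 3D coordinates appear anywhere, which is the limitation described above: two conformations of the same molecule produce identical inputs.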

In [None]:
model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

## Training and measuring performance



In [None]:
model.fit(train_dataset, nb_epoch=110)

### Metrics

Pearson correlation coefficient (`r`): Measures the strength and direction of a linear relationship. For example, a perfectly linear relationship with a positive/negative trend has `r=1`/`r=-1`, while no linear relationship has `r=0`.

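`r^2`: The coefficient of determination, which is the proportion of the variation in `y` that is explained by `x`. With SSR (Sum of Squared Residuals) and TSS (Total Sum of Squares), the ratio SSR/TSS is the unexplained proportion, so `r^2 = 1-(SSR/TSS)`. 
It measures how close the data are to the fitted regression line: the closer to 1, the better the fit. Unless `r^2=1`, there will always be some variation in `y` that is unexplained by `x`. For a least-squares linear fit, this value equals the square of the Pearson coefficient `r`.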
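The two definitions can be compared with a small sketch in plain Python. The helper names and the toy numbers are assumptions chosen for illustration. The example uses predictions that are perfectly correlated with the true values but shifted by a constant, which shows that the squared Pearson coefficient and `1-(SSR/TSS)` can disagree when the predictions are taken directly from a model rather than from a least-squares line.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r2_from_residuals(y_true, y_pred):
    """Coefficient of determination computed as 1 - SSR/TSS."""
    mean_t = sum(y_true) / len(y_true)
    ssr = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ssr / tss


# Toy data: predictions follow the trend exactly but are offset by 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

r = pearson_r(y_true, y_pred)
print(r ** 2)                             # squared Pearson coefficient
print(r2_from_residuals(y_true, y_pred))  # 1 - SSR/TSS
```

Here the squared Pearson coefficient is 1 because the relationship is perfectly linear, while `1-(SSR/TSS)` is lower because the constant offset still counts as residual error. The two agree only when the predictions come from a least-squares linear fit.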

In [None]:
# Evaluate the squared Pearson correlation on the training and test sets
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(model.evaluate(train_dataset, [metric], transformers))
print(model.evaluate(test_dataset, [metric], transformers))

{'pearson_r2_score': 0.9112915522059409}
{'pearson_r2_score': 0.6738927468941923}


## Testing

In [None]:
# Molecules for testing, in SMILES (Simplified Molecular Input Line Entry System) format.
smiles = [
  'COC(C)(C)CCCC(C)CC=CC(C)=CC(=O)OC(C)C',
  'CCOC(=O)CC',
  'CSc1nc(NC(C)C)nc(NC(C)C)n1',
  'CC(C#C)N(C)C(=O)Nc1ccc(Cl)cc1',
  'Cc1cc2ccccc2cc1C'
]

We need to use RDKit to parse the SMILES strings, then use a DeepChem featurizer to convert them to the format expected by the graph convolution.


In [None]:
from rdkit import Chem

mols = [Chem.MolFromSmiles(s) for s in smiles]
featurizer = dc.feat.ConvMolFeaturizer()
x = featurizer.featurize(mols)

In [None]:
# Prediction
predicted_solubility = model.predict_on_batch(x)
print(predicted_solubility)

[[-0.28259313]
 [ 1.3402852 ]
 [ 0.26678705]
 [-0.4907226 ]
 [-0.3866803 ]]
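A caveat when reading these numbers: if the dataset was loaded with a label-normalizing transformer (the `transformers` list is passed to `model.evaluate` earlier for that reason), the raw predictions are likely in normalized units rather than in the dataset's original solubility units. The sketch below shows the arithmetic of undoing a z-score normalization. The function name is hypothetical, and `mean_y` and `std_y` are placeholder values, not the real statistics of the Delaney training set.

```python
def undo_normalization(y_norm, mean_y, std_y):
    """Map z-scored values back to the original scale: y = z * std + mean."""
    return [z * std_y + mean_y for z in y_norm]


# Placeholder statistics for illustration only (NOT the real Delaney values).
mean_y = -3.0
std_y = 2.0

normalized = [-0.5, 0.0, 1.0]
restored = undo_normalization(normalized, mean_y, std_y)
print(restored)
```

In practice, the transformers returned by the loader carry the real statistics, so relying on them is preferable to hand-coding the values.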


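NOTE: The model predicts unseen data noticeably worse than the training data (`r^2` of about 0.67 on the test set versus 0.91 on the training set), which suggests overfitting. Hyperparameter tuning could improve its performance.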