OSM Series 4 Candidates with Deep Generative Models - Round 2

A new round of series 4 candidates for the Open Source Malaria Project, including molecules generated from low-data generative models (adapted from the ETH Modlab) and molecules generated in a second round using the Reinvent 2.0 generative model with improved activity predictors. See Open Source Malaria discussion #34. Here are some candidates, sorted from less active 🔴 to more active 🔵 according to predictions:

We have tried to complement OSM issue #29, opened by Evariste Technologies. In that issue, a specific region of the chemical space is exploited to identify highly active compounds. Here, we provide a rougher exploration of the chemical space in the hope of identifying alternative lead compounds:

Data

All 405,766 molecules generated (with duplicates eliminated) can be found here: data_0.csv
A selection of the best 556 candidates according to the pipeline below can be found here: data_13.csv
A final list of the best 90 candidates based on activity can be found here: eosi_s4_candidates_90.csv
Explore the 90 candidates in this app!
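
As a quick way to inspect the candidate list, the following minimal sketch ranks rows by predicted IC50, where lower is better. It assumes the file is comma-delimited with a header row and uses the EosID, Smiles and IC50Pred columns described below. The function name `top_candidates` is hypothetical, and the toy rows are invented placeholders, not real candidates.

```python
import csv
import io

def top_candidates(handle, n=10):
    """Return the n rows with the lowest predicted IC50 (lower is better)."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: float(row["IC50Pred"]))
    return rows[:n]

# Toy stand-in for eosi_s4_candidates_90.csv; identifiers and values are invented.
toy_csv = io.StringIO(
    "EosID,Smiles,IC50Pred\n"
    "EOS-A,CCO,2.5\n"
    "EOS-B,CCN,0.4\n"
    "EOS-C,CCC,1.1\n"
)
best = top_candidates(toy_csv, n=2)
print([row["EosID"] for row in best])
```

To run it on the real file, pass an open file handle instead of the toy data, for example `top_candidates(open("eosi_s4_candidates_90.csv", newline=""))`.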

Results columns

Candidate molecules are listed along with the following columns:

Identifiers

EosID: ID number from Ersilia Open Source Initiative
InchiKey
Smiles

Activity predictions

IC50Pred: Activity prediction based on multiple descriptors and classical ML models. The lower the better. It is probably biased towards high values, so it should be a conservative estimate. Includes a confidence interval: upper bound (UB) and lower bound (LB).
DeepActivity: Activity prediction based on deep learning models (Grover and ChemProp). The higher the better. It is a composite z-score across several deep learning scores (ChemProp and Grover models trained on classification and regression tasks). Includes a confidence interval: upper bound (UB) and lower bound (LB).
Maip: Blood-stage antimalarial activity prediction using the MAIP tool
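
To illustrate what a composite z-score can look like, here is a minimal sketch that standardizes each model's scores across molecules and then averages them per molecule. The exact aggregation used for DeepActivity is not stated here, so the plain mean of z-scores is an assumption. The function names, model labels and toy scores are hypothetical.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All scores identical: no molecule stands out for this model.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def composite_zscore(score_table):
    """score_table maps a model name to a list of scores, one per molecule.

    Assumption: the composite is the unweighted mean of per-model z-scores.
    """
    per_model = [zscores(scores) for scores in score_table.values()]
    return [mean(column) for column in zip(*per_model)]

# Invented scores for three molecules from two hypothetical models.
toy_scores = {
    "chemprop_clf": [0.2, 0.5, 0.8],
    "grover_reg": [1.0, 2.0, 3.0],
}
composite = composite_zscore(toy_scores)
print(composite)
```

Standardizing first puts classification probabilities and regression outputs on a common scale before they are combined.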

Applicability domain

WhalesDist3Act: WHALES descriptors distance to the top-3 actives in the training set. These descriptors are used for scaffold hopping
Similarity: Tanimoto similarity to known series 4 compounds
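
For reference, the Tanimoto coefficient between two fingerprints is the number of shared "on" bits divided by the number of bits set in either. The sketch below computes it on plain Python sets of bit indices. Which fingerprint is used, and whether the Similarity column reports the maximum over known series 4 compounds, are not stated here, so taking the maximum is an assumption. The function names and toy fingerprints are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def max_similarity(query_bits, reference_fps):
    """Highest Tanimoto similarity of a query to any reference fingerprint.

    Assumption: similarity to a compound set is summarized by its maximum.
    """
    return max((tanimoto(query_bits, ref) for ref in reference_fps), default=0.0)

# Invented fingerprints: the query shares 3 of 5 distinct bits with the first reference.
query = {1, 2, 3, 4}
references = [{1, 2, 3, 5}, {7, 8}]
similarity = max_similarity(query, references)
print(similarity)
```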

Accessibility

SAScore: Synthetic accessibility. The lower the better.
RAScore: Retrosynthetic accessibility as predicted by the Reymond lab. The higher the better.
SybaScore: Fragment-based accessibility score by Voršilák et al. The higher the better.

Physicochemical properties

SLogP: Lipophilicity (predicted octanol-water partition coefficient, logP). Should be < 5.
QED: Drug-likeness. The higher the better.
NumRings: Number of rings in the molecule.
FractionCSP3: Fraction of carbon atoms that are sp3-hybridized.
FrHalogen: Number of halogen groups.
HeavyAtom: Heavy atom count.
Rotatable: Number of rotatable bonds.
Heteroatoms: Number of heteroatoms.
FrAlkylHalide: Number of alkyl halide fragments.

Chemotype

TriazoloHeteroaryl: Contains a heteroaryl ring in the right-hand side (RHS).
TriazoloPhenyl: Contains a phenyl ring (no heteroatoms) in the RHS.
TriazoloHeteroaryl - Para / - Meta / - Orto: Contains substituents in the para, meta or ortho positions.

Molecule generation steps

A first batch of molecules was generated in May 2021 using Reinvent 2.0. A detailed explanation and an analysis of the results of this first round can be found in our GitHub repo ersilia-os/osm-series4-candidates. We generated 116,728 new series 4 candidates. Code.
A second batch of molecules (209,310) was generated for the purposes of this analysis using Reinvent 2.0 in exploration mode, optimizing for activity based on a simple QSAR model built with RDKit descriptors. Code.
A third batch of molecules (150,365) was generated using a low-data generative model pre-trained on ChEMBL and a large fragment library. Code.

Selection of best candidates

All unique final molecules (405,766) have undergone a recursive selection process based on physicochemical properties, synthetic accessibility and predicted activity as follows:

Scripts for the filters applied can be found in the scripts folder in this repository.
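
The general shape of such a staged selection can be sketched as follows: filter on a physicochemical property, then on synthetic accessibility, then rank by predicted activity. This is only an illustration, not the pipeline in the scripts folder. The SLogP < 5 rule comes from the column description above, while the SAScore cutoff, the number of molecules kept, the function name and the toy molecules are all hypothetical.

```python
def select_candidates(molecules, max_slogp=5.0, max_sascore=4.0, keep=2):
    """Staged selection: physicochemical filter, accessibility filter, activity ranking.

    max_sascore and keep are illustrative values, not the ones used in the repository.
    """
    # Stage 1: physicochemical filter (SLogP should be < 5).
    stage1 = [m for m in molecules if m["SLogP"] < max_slogp]
    # Stage 2: synthetic accessibility filter (lower SAScore is better).
    stage2 = [m for m in stage1 if m["SAScore"] <= max_sascore]
    # Stage 3: rank by predicted IC50 (lower is better) and keep the top entries.
    ranked = sorted(stage2, key=lambda m: m["IC50Pred"])
    return ranked[:keep]

# Invented molecules; values are placeholders chosen to exercise each stage.
toy_molecules = [
    {"id": "A", "SLogP": 3.2, "SAScore": 2.9, "IC50Pred": 0.8},
    {"id": "B", "SLogP": 5.6, "SAScore": 2.5, "IC50Pred": 0.1},  # fails SLogP
    {"id": "C", "SLogP": 4.1, "SAScore": 5.2, "IC50Pred": 0.3},  # fails SAScore
    {"id": "D", "SLogP": 2.7, "SAScore": 3.4, "IC50Pred": 0.5},
    {"id": "E", "SLogP": 3.9, "SAScore": 3.8, "IC50Pred": 1.9},
]
selected = select_candidates(toy_molecules)
print([m["id"] for m in selected])
```

Applying the cheap property filters before the activity ranking keeps the most expensive predictions for molecules that are already plausible candidates.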

Run pipeline

For transparency and reproducibility, we provide code to run the full pipeline for candidate selection. Please download and uncompress the following folders and files:

chemprop
grover
predictorapp
syba.pkl (save it in utils/syba.pkl)
ra_model.onnx (save it in utils/ra_model.onnx)
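
Before running the pipeline, it can help to confirm that the two model files are where the instructions above say they should be. The sketch below checks only the two paths stated above. The locations of the chemprop, grover and predictorapp folders are not specified here, so they are left out. The function name `missing_resources` is hypothetical, and the demo uses a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path

# Paths stated in the instructions above, relative to the repository root.
REQUIRED_FILES = ["utils/syba.pkl", "utils/ra_model.onnx"]

def missing_resources(repo_root):
    """Return the required files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]

# Demo on a throwaway directory that contains only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    utils_dir = Path(tmp) / "utils"
    utils_dir.mkdir()
    (utils_dir / "syba.pkl").touch()
    demo_missing = missing_resources(tmp)
print(demo_missing)
```

In a real checkout, `missing_resources(".")` returning an empty list means both files are in place.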

The notebook with the process to select the best 90 candidates can be found here.
