A new round of series 4 candidates, including new low-data generative models and improved predictors

ersilia-os/osm-series4-candidates-2

DOI

OSM Series 4 Candidates with Deep Generative Models - Round 2

A new round of series 4 candidates for the Open Source Malaria Project, including molecules generated from low-data generative models (adapted from the ETH Modlab) and molecules generated in a second round using the Reinvent 2.0 generative model with improved activity predictors. See Open Source Malaria discussion #34. Here are some candidates, sorted from less active 🔴 to more active 🔵 according to predictions:

animation_2

We have tried to complement OSM issue #29, opened by Evariste Technologies. In that issue, a specific region of the chemical space is exploited to identify highly active compounds. Here, we provide a rougher exploration of the chemical space in the hope of identifying alternative lead compounds:

exploration_vs_exploitation

Data

  • All 405,766 molecules generated (with duplicates eliminated) can be found here: data_0.csv
  • A selection of the best 556 candidates according to the pipeline below can be found here: data_13.csv
  • A final list of the best 90 candidates based on activity can be found here: eosi_s4_candidates_90.csv
  • Explore the 90 candidates in this app!

Results columns

Candidate molecules are listed along with the following columns:

Identifiers

  • EosID: ID number from Ersilia Open Source Initiative
  • InchiKey
  • Smiles

Activity predictions

  • IC50Pred: Activity prediction based on multiple descriptors and classical ML models. The lower the better. Predictions are probably biased towards high values, so they should be conservative estimates. Includes a confidence interval (Upper Bound (UB) and Lower Bound (LB)).
  • DeepActivity: Activity prediction based on deep learning models (Grover and ChemProp). The higher the better. It is a composite z-score across several deep learning scores (ChemProp and Grover, each trained on classification and regression tasks). Includes a confidence interval (Upper Bound (UB) and Lower Bound (LB)).
  • Maip: Blood-stage antimalarial activity prediction using the MAIP tool
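DeepActivity is described above as a composite z-score across several deep learning scores. The exact aggregation is not spelled out here, so the following is only a minimal sketch, assuming that each model's scores are standardized across all molecules and then averaged per molecule. The function name and the toy scores are hypothetical.

```python
from statistics import mean, pstdev


def composite_zscore(scores_by_model):
    """Average per-model z-scores into one composite score per molecule.

    scores_by_model maps a model name to a list of scores, one per molecule,
    with molecules in the same order for every model.
    """
    z_columns = []
    for scores in scores_by_model.values():
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            # A constant column carries no ranking information.
            z_columns.append([0.0] * len(scores))
        else:
            z_columns.append([(s - mu) / sigma for s in scores])
    # zip(*...) transposes so that each tuple holds one molecule's z-scores.
    return [mean(per_molecule) for per_molecule in zip(*z_columns)]


# Toy scores for three molecules (hypothetical values, not real predictions).
toy_scores = {
    "chemprop": [0.2, 0.5, 0.8],
    "grover": [0.1, 0.4, 0.9],
}
composite = composite_zscore(toy_scores)
```

Standardizing first puts models with different output ranges on a common scale, so no single model dominates the average.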

Applicability domain

  • WhalesDist3Act: WHALES descriptors distance to the top-3 actives in the training set. These descriptors are used for scaffold hopping
  • Similarity: Tanimoto similarity to known series 4 compounds
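As an illustration of the Similarity column, the Tanimoto coefficient between two fingerprints is the number of shared on-bits divided by the number of bits set in either one. The sketch below assumes fingerprints are given as sets of on-bit indices (in practice these would come from a cheminformatics toolkit) and that the reported value is the maximum over the known compounds; both are assumptions, and all names and bit indices are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def max_similarity(candidate_bits, reference_fingerprints):
    """Highest Tanimoto similarity of a candidate to any reference compound."""
    return max(tanimoto(candidate_bits, ref) for ref in reference_fingerprints)


# Toy fingerprints (hypothetical bit indices, not real molecules).
known_series4 = [{1, 4, 7, 9}, {2, 4, 8}]
candidate = {1, 4, 7, 12}
similarity = max_similarity(candidate, known_series4)
```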

Accessibility

  • SAScore: Synthetic accessibility. The lower the better.
  • RAScore: Retrosynthetic accessibility as predicted by the Reymond lab. The higher the better.
  • SybaScore: Fragment-based accessibility score by Voršilák et al. The higher the better.

Physicochemical properties

  • SLogP: Predicted lipophilicity (octanol-water logP), used as a proxy for solubility. Should be < 5.
  • QED: Drug-likeness. The higher the better.
  • NumRings: Number of rings in the molecule.
  • FractionCSP3: Fraction of carbon atoms that are sp3-hybridized.
  • FrHalogen: Number of halogen groups.
  • HeavyAtom: Heavy atom count.
  • Rotatable: Number of rotatable bonds.
  • Heteroatoms: Number of heteroatoms.
  • FrAlkylHalide: Number of alkyl halide fragments.

Chemotype

  • TriazoloHeteroaryl: Contains a heteroaryl ring on the right-hand side (RHS).
  • TriazoloPhenyl: Contains a phenyl ring (no heteroatoms) on the RHS.
  • TriazoloHeteroaryl - Para / - Meta / - Orto: Contains substituents in the para, meta or ortho positions.

Molecule generation steps

  1. A first batch of molecules was generated in May 2021 using Reinvent 2.0. A detailed explanation and an analysis of the results of this first round can be found in our GitHub repo ersilia-os/osm-series4-candidates. We generated 116,728 new series 4 candidates. Code.
  2. A second batch of molecules (209,310) was generated for this analysis using Reinvent 2.0 in exploration mode, optimizing for activity with a simple QSAR model built with RDKit descriptors. Code.
  3. A third batch of molecules (150,365) was generated using a low-data generative model pre-trained on ChEMBL and on a large fragment library. Code.
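The batches are pooled and duplicates are eliminated before selection. A minimal sketch of that pooling step follows, assuming molecules are keyed by InChIKey; the results list InchiKey as an identifier, but the key actually used for deduplication is not stated here. The function name and the placeholder identifiers are hypothetical.

```python
def pool_unique(*batches):
    """Merge batches of (inchikey, smiles) pairs, keeping the first occurrence of each key."""
    unique = {}
    for batch in batches:
        for inchikey, smiles in batch:
            # setdefault leaves an existing entry untouched, so earlier batches win.
            unique.setdefault(inchikey, smiles)
    return unique


# Placeholder identifiers stand in for real InChIKeys and SMILES strings.
batch_1 = [("KEY-A", "smiles_a"), ("KEY-B", "smiles_b")]
batch_2 = [("KEY-B", "smiles_b"), ("KEY-C", "smiles_c")]
batch_3 = [("KEY-A", "smiles_a"), ("KEY-D", "smiles_d")]
pooled = pool_unique(batch_1, batch_2, batch_3)
```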

Selection of best candidates

All unique final molecules (405,766) have undergone a recursive selection process based on physicochemical properties, synthetic accessibility and predicted activity as follows:

Scripts for the filtering steps can be found in the scripts folder of this repository.
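To give a flavour of what one filtering pass might look like, here is a minimal sketch that is not the repository's actual scripts. It uses column names from the results description. The SLogP < 5 cutoff is taken from that description, while the SAScore cutoff, the `top_n` value, the function name, and the toy rows (including the EosID format) are assumptions made for illustration.

```python
import csv
import io

# Toy table using column names from the results description (values are invented).
TOY_CSV = """EosID,SLogP,SAScore,IC50Pred
EOS-1,3.2,2.9,0.8
EOS-2,5.6,2.5,0.3
EOS-3,4.1,6.5,0.5
EOS-4,2.7,3.1,0.4
"""


def select_candidates(rows, max_slogp=5.0, max_sascore=4.0, top_n=2):
    """Filter on properties and accessibility, then rank by predicted IC50."""
    kept = [
        row for row in rows
        if float(row["SLogP"]) < max_slogp and float(row["SAScore"]) <= max_sascore
    ]
    # Lower predicted IC50 is better, so sort ascending.
    kept.sort(key=lambda row: float(row["IC50Pred"]))
    return kept[:top_n]


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
selected_ids = [row["EosID"] for row in select_candidates(rows)]
```

In this toy table, EOS-2 is dropped for its SLogP and EOS-3 for its SAScore, and the remaining rows are ordered by predicted IC50.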

Run pipeline

For transparency and reproducibility, we provide code to run the full pipeline for candidate selection. Please download and uncompress the following folders and files:

The notebook with the process to select the best 90 candidates can be found here.