This is a follow-up to the previous work on generating new candidates for the Open Source Malaria (OSM) project. Please follow thread #34 on the OSM GitHub page, as well as the associated repos for Round1 and Round2, for an overview of the previous work.
We base the selection of the best candidates for experimental synthesis on a predictive model trained with our automated ML pipeline, ZairaChem. To benchmark the ML model, we reproduced the recently published competition for series 4 activity prediction (Tse et al., 2021). The training and test sets for this model are in the data folder under competition_benchmark.
The ZairaChem-trained model provides both a binary classification score (0: inactive, 1: active), with an activity threshold of 2.5 uM, and a prediction of the IC50 value itself:
- The classification model correctly predicted 30 of the 33 molecules in the list
- The regression model predicted values < 2.5 uM for 11 of the 12 real actives and > 2.5 uM for all real inactives
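The thresholding described above can be sketched as follows. This is a minimal illustration, not the ZairaChem implementation: the function names are hypothetical, the IC50 values are toy numbers rather than benchmark data, and treating a value exactly equal to the cut-off as inactive is an assumption.

```python
THRESHOLD_UM = 2.5  # activity cut-off from the text, in micromolar


def is_active(ic50_um, threshold=THRESHOLD_UM):
    """Return 1 (active) if IC50 is below the cut-off, else 0 (inactive).

    Assumption: a value exactly at the cut-off counts as inactive.
    """
    return 1 if ic50_um < threshold else 0


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
    }


# Toy values for illustration only (NOT the competition data), in uM.
measured = [0.3, 1.2, 2.0, 4.0, 10.0, 25.0]
predicted = [0.5, 0.9, 3.1, 5.5, 8.0, 30.0]

y_true = [is_active(v) for v in measured]
y_pred = [is_active(v) for v in predicted]
counts = confusion_counts(y_true, y_pred)

# Fraction of real actives recovered, and of real inactives rejected.
sensitivity = counts["tp"] / (counts["tp"] + counts["fn"])
specificity = counts["tn"] / (counts["tn"] + counts["fp"])
print(counts, round(sensitivity, 2), round(specificity, 2))
```

Applied to the regression results reported above, the same calculation gives a sensitivity of 11/12 and a specificity of 1.0.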
In addition, we trained a second model using a more restrictive potency cut-off of 1 uM. A comparison of the performance of the different models can be found under data > competition_benchmark.
We subsequently retrained the pipeline with all available data (original training set + competition molecules + newest synthesized molecules), producing two models (with binary classification cut-offs set at 1 and 2.5 uM, respectively). The models can be downloaded (1 uM and 2.5 uM cut-offs) and run with the ZairaChem package.
Following the methodology described by the ETH Modlab for low-data generative models, we generated 683 new series 4 molecules. For transfer learning, we used the best experimentally validated molecules (IC50 <= 1 uM; 89 molecules) and the best molecules from Round 2 (90 molecules).
We use the ZairaChem models to select the most active molecules (predicted active by both models, as described in the notebooks) from:
- Set of pre-selected candidates in Round 2 (17,876 molecules): 1094 molecules
- New synthesis round based on high actives (683 molecules): 201 molecules
- Best compounds selected in Round 2 (90 molecules): 35 molecules
We therefore provide the list of 35 candidates for experimental testing, plus an additional pool of 1295 molecules (1094 + 201) that can also be screened for interesting molecules.
These .csv files contain the SMILES, the probability of being active at the 1 uM and 2.5 uM cut-offs, respectively, and the IC50 predicted by the regression model.
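As a rough sketch of how such a file could be filtered for molecules predicted active by both models, the snippet below uses only the Python standard library. The column names (`smiles`, `proba_1uM`, `proba_2.5uM`, `pred_ic50`), the 0.5 probability cut-off, and the placeholder SMILES are all assumptions for illustration. Check the actual .csv headers and the notebooks for the criteria really used.

```python
import csv
import io

# Toy file content with HYPOTHETICAL column names and placeholder SMILES;
# the real .csv files may use different headers.
TOY_CSV = """smiles,proba_1uM,proba_2.5uM,pred_ic50
CCO,0.91,0.95,0.4
CCN,0.35,0.80,1.8
CCC,0.10,0.20,12.0
"""

PROBA_CUTOFF = 0.5  # assumed decision threshold, not taken from the source


def select_by_both_models(rows, cutoff=PROBA_CUTOFF):
    """Keep rows predicted active by both classifiers, most potent first."""
    selected = [
        r for r in rows
        if float(r["proba_1uM"]) >= cutoff and float(r["proba_2.5uM"]) >= cutoff
    ]
    # Lower predicted IC50 means higher predicted potency.
    return sorted(selected, key=lambda r: float(r["pred_ic50"]))


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
selected = select_by_both_models(rows)
for r in selected:
    print(r["smiles"], r["pred_ic50"])
```

To run this on a real file, replace `io.StringIO(TOY_CSV)` with an open file handle and adjust the column names to match the file's header.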