Compound embedding obtained with few-shot learning (FS-Mol), Morgan fingerprints, Grover embeddings, and Mordred descriptors

ersilia-os/compound-embedding

Ersilia compound embedding

This library provides bioactivity-aware chemical embeddings for small molecules. The lite directory contains the eosce library, intended for end users. The compound-embedding library is intended for development.

Quick start guide

1. Clone the repository

git clone https://github.com/ersilia-os/compound-embedding.git
cd compound-embedding/lite

2. Create a conda environment

conda create -n eosce python=3.8
conda activate eosce

3. Install the package with pip

pip install -e .

or if you have a GPU

pip install -e .[gpu]

4. Programmatically generate embeddings

from eosce import ErsiliaCompoundEmbeddings
model = ErsiliaCompoundEmbeddings()
embeddings = model.transform(["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"])

# Optionally, request grid embeddings
grid_embeddings = model.transform(["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"], grid=True)
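
Once embeddings are available, a common next step is to compare molecules by embedding similarity. The sketch below assumes that `model.transform` returns one fixed-length numeric vector per input SMILES string, which this README does not confirm. The short vectors are stand-ins rather than real model output, and `cosine_similarity` is a hypothetical helper, not part of eosce.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in vectors. In practice these would be rows of the result of
# model.transform([...]) (assumed: one vector per input SMILES).
emb_query = [0.2, -0.1, 0.7, 0.0]
emb_candidate = [0.1, -0.2, 0.9, 0.1]

similarity = cosine_similarity(emb_query, emb_candidate)
print(f"cosine similarity: {similarity:.3f}")
```

Values close to 1 indicate vectors pointing in nearly the same direction. Whether cosine or Euclidean distance suits these embeddings better is a design choice to validate on your own data.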

5. Generate embeddings using the CLI

For a single SMILES string:

eosce embed "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"

or, to save the results in a .csv file:

eosce embed "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1" -o output.csv

For multiple SMILES strings, pass an input file containing a single column with one SMILES string per row. An example is provided in lite/data:

eosce embed -i data/input.csv -o data/output.csv
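
A single-column input file like the one described above can be produced with Python's standard `csv` module. This is a minimal sketch: the helper name `write_smiles_csv` is hypothetical, and the header name `smiles` is an assumption, so check it against the example file in lite/data before relying on it.

```python
import csv
import tempfile
from pathlib import Path
from typing import Iterable


def write_smiles_csv(smiles: Iterable[str], path: Path, header: str = "smiles") -> int:
    """Write one SMILES string per row under a single header column.

    The header name is an assumption; match it to lite/data/input.csv.
    Returns the number of SMILES rows written.
    """
    count = 0
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])
        for smi in smiles:
            smi = smi.strip()
            if not smi:
                continue  # skip blank entries
            writer.writerow([smi])
            count += 1
    return count


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "input.csv"
    n_written = write_smiles_csv(
        ["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1", "CCO", ""], out_path
    )
    lines = out_path.read_text().splitlines()

print(f"wrote {n_written} SMILES rows")
```

The resulting file could then be passed to `eosce embed -i`, provided the column layout matches what the CLI expects.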

For grid embeddings:

eosce embed --grid "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1" -o output.csv

For help, run:

eosce embed --help

Developing

1. Clone the repository

git clone https://github.com/ersilia-os/compound-embedding.git
cd compound-embedding

2. Create a conda environment and activate it

conda config --set channel_priority flexible
conda env create -f env.yaml
conda activate crux

3. Install the Groverfeat library

bash install_grover.sh

Project Architecture

Phase 1

We generated a unique dataset that contains Grover descriptors, Mordred descriptors, and assay labels for the molecules in the FS-Mol dataset. This dataset was then used to train a prototypical network (ProtoNet) with Euclidean distance as the metric.
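
To illustrate how a prototypical network classifies with a Euclidean metric, here is a minimal sketch of the inference step only. Each class prototype is the mean of its support embeddings, and a query is scored by a softmax over negative squared Euclidean distances to the prototypes. In a real ProtoNet the vectors come from a learned encoder. The 2-D vectors, the labels, and the helper names below are made up for illustration and are not taken from this repository.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_probabilities(
    query: Sequence[float], prototypes: Dict[str, List[float]]
) -> Dict[str, float]:
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {
        label: -squared_euclidean(query, proto)
        for label, proto in prototypes.items()
    }
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy support set: two already-embedded molecules per class (made-up values).
support = {
    "active": [[1.0, 1.0], [1.2, 0.8]],
    "inactive": [[-1.0, -1.0], [-0.8, -1.2]],
}
prototypes = {label: mean_vector(vecs) for label, vecs in support.items()}

probs = class_probabilities([0.9, 1.1], prototypes)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

During training, the encoder that produces the embeddings is optimized so that queries land close to the prototype of their own class. That part is omitted here.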

Phase 2

We then used the trained ProtoNet to generate 2 million embeddings for the molecules in the reference library built from ChEMBL.

Phase 3

Finally, we used the dataset generated in Phase 2 to train a simple and fast neural network that maps ECFPs to the embeddings generated by the ProtoNet.
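
The idea of fitting a fast student model to reproduce a teacher's embeddings can be sketched on toy data. This is not the repository's training code. A single linear layer stands in for the neural network, random 8-bit vectors stand in for ECFPs, and a fixed random linear map stands in for the ProtoNet embeddings. All names and hyperparameters are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def predict(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Linear map from a fingerprint bit vector to an embedding vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mse(weights: Matrix, bias: Vector, xs: Matrix, ys: Matrix) -> float:
    """Mean squared error over every output dimension of every sample."""
    total, count = 0.0, 0
    for x, y in zip(xs, ys):
        pred = predict(weights, bias, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
        count += len(y)
    return total / count


def train(xs: Matrix, ys: Matrix, lr: float = 0.05, epochs: int = 200) -> Tuple[Matrix, Vector]:
    """Fit the student with per-sample gradient descent on half the squared error."""
    n_in, n_out = len(xs[0]), len(ys[0])
    weights = [[0.0] * n_in for _ in range(n_out)]
    bias = [0.0] * n_out
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = predict(weights, bias, x)
            for j in range(n_out):
                err = pred[j] - y[j]
                for i in range(n_in):
                    weights[j][i] -= lr * err * x[i]
                bias[j] -= lr * err
    return weights, bias


rng = random.Random(0)
n_bits, emb_dim, n_samples = 8, 2, 32
# Toy "fingerprints": random bit vectors standing in for ECFPs.
fingerprints = [[float(rng.randint(0, 1)) for _ in range(n_bits)] for _ in range(n_samples)]
# Toy "teacher": a fixed random linear map standing in for ProtoNet embeddings.
teacher = [[rng.uniform(-1, 1) for _ in range(n_bits)] for _ in range(emb_dim)]
targets = [predict(teacher, [0.0] * emb_dim, fp) for fp in fingerprints]

zero_weights = [[0.0] * n_bits for _ in range(emb_dim)]
initial_loss = mse(zero_weights, [0.0] * emb_dim, fingerprints, targets)
weights, bias = train(fingerprints, targets)
final_loss = mse(weights, bias, fingerprints, targets)
print(f"MSE before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

The practical appeal of this setup is that, once trained, the student needs only fingerprints at inference time, so the more expensive descriptor calculations used to train the teacher can be skipped.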

Getting required dataset

FS-Mol Dataset

The FS-Mol dataset is available as a download, FS-Mol Data, split into train, valid and test folders. Tasks are stored as individual compressed JSON Lines files, with each line holding the information for a single datapoint of the task. Each datapoint is stored as a JSON dictionary with a fixed structure:

{
    "SMILES": "SMILES_STRING",
    "Property": "ACTIVITY BOOL LABEL",
    "Assay_ID": "CHEMBL ID",
    "RegressionProperty": "ACTIVITY VALUE",
    "LogRegressionProperty": "LOG ACTIVITY VALUE",
    "Relation": "ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE",
    "AssayType": "TYPE OF ASSAY",
    "fingerprints": [...],
    "descriptors": [...],
    "graph": {
        "adjacency_lists": [
           [... SINGLE BONDS AS PAIRS ...],
           [... DOUBLE BONDS AS PAIRS ...],
           [... TRIPLE BONDS AS PAIRS ...]
        ],
        "node_types": [...ATOM TYPES...],
        "node_features": [...NODE FEATURES...]
    }
}
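
A task file in this layout can be read line by line with the standard library. The sketch below assumes the compression is gzip (a `.jsonl.gz` file), which the text above does not state explicitly. It first writes a tiny stand-in file so that it runs on its own. The record values and the `CHEMBL0000000` identifier are placeholders, not real data, and `iter_datapoints` is a hypothetical helper.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def iter_datapoints(path: Path) -> Iterator[Dict]:
    """Yield one dictionary per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Placeholder records that use only a few of the keys listed above.
records = [
    {"SMILES": "CCO", "Property": "1.0", "Assay_ID": "CHEMBL0000000"},
    {"SMILES": "CCN", "Property": "0.0", "Assay_ID": "CHEMBL0000000"},
]

with tempfile.TemporaryDirectory() as tmp:
    task_path = Path(tmp) / "example_task.jsonl.gz"
    # Write the stand-in task file: one JSON object per line.
    with gzip.open(task_path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    loaded = list(iter_datapoints(task_path))

smiles = [datapoint["SMILES"] for datapoint in loaded]
print(smiles)
```

Reading lazily with a generator keeps memory use low for large tasks, since only one datapoint is parsed at a time.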

Preprocessing pipelines

Generate training data for the ProtoNet

Merge Grover descriptors with FS-Mol Data

crux gen grover --inp "path/to/fs-mol/data" --out "path/to/save/output" 

Merge Mordred descriptors with FS-Mol Data

crux gen grover --inp "path/to/fs-mol-merged-grover/data" --out "path/to/save/output" 

Run QC checks to fix corrupted files

crux qc --inp "path/to/fs-mol/data" --out "path/to/fs-mol-merged-grover-mordred/data"

Train the ProtoNet

crux train protonet \
    --save_dir path/to/save/trained_model \
    --data_dir "path/to/fs-mol-merged-grover-mordred/data" \
    --num_train_steps 10000

Move fully trained model to package root

cp path/to/save/trained_model/FSMOL_protonet_{run identifier}/fully_trained.pt ./src/compound_embedding/

Download and move reference library to package root

The reference library can be downloaded from here. Move it to the package root, as in the previous step.

Generate training data for Ersilia Compound Embeddings model

mpiexec -n 4 python gen_efp_train.py

This will create an efp_training.hdf5 file in the directory where the command is executed.

Train Ersilia Compound Embeddings model

crux train efp --save_dir /path/to/save/checkpoints --data_file /path/to/efp_training.hdf5

License

This repository is open-sourced under the GPL-3 License. Please cite us if you use it.

About Us

The Ersilia Open Source Initiative is a non-profit organization (1192266) whose mission is to equip labs, universities and clinics in low- and middle-income countries (LMIC) with AI/ML tools for infectious disease research.

Help us achieve our mission or volunteer with us!
