Ersilia compound embedding

This library provides bioactivity-aware chemical embeddings for small molecules. The lite directory contains the eosce library, intended for end users. The compound-embedding library is intended for development.

Quick start guide

1. Clone the repository

git clone https://github.com/ersilia-os/compound-embedding.git
cd compound-embedding/lite

2. Create a conda environment

conda create -n eosce python=3.8
conda activate eosce

3. Install the package with pip

pip install -e .

or, if you have a GPU:

pip install -e ".[gpu]"

4. Programmatically generate embeddings

from eosce import ErsiliaCompoundEmbeddings
model = ErsiliaCompoundEmbeddings()
embeddings = model.transform(["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"])

# Optionally, request grid embeddings instead
grid_embeddings = model.transform(["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"], grid=True)
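
Once embeddings are available, a common next step is to compare molecules. The sketch below computes cosine similarity in plain Python. It assumes that `model.transform` returns one numeric vector per input SMILES (the return type is not stated here), and the short vectors are made-up stand-ins rather than real model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for rows of `embeddings`; real vectors are much longer.
emb_a = [0.2, 0.1, 0.7]
emb_b = [0.4, 0.2, 1.4]    # same direction as emb_a, twice the length
emb_c = [0.7, -0.1, -0.2]  # points in a different direction

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
```

Because cosine similarity ignores vector length, `emb_a` and `emb_b` score as essentially identical, while `emb_c` scores lower.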

5. Generate embeddings using the CLI

For a single SMILES string:

eosce embed "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1"

or, to save the results in a .csv file:

eosce embed "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1" -o output.csv

For multiple SMILES strings, pass an input file with a single column listing the SMILES. An example is provided in lite/data:

eosce embed -i data/input.csv -o data/output.csv
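
A single-column input file like this can be prepared with Python's standard csv module. This is a minimal sketch: the file name `my_smiles.csv` is hypothetical, and whether eosce expects a header row is not stated here, so the optional `header` argument is an assumption to be matched against the example in lite/data.

```python
import csv
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def write_smiles_csv(
    smiles: Iterable[str], path: Path, header: Optional[str] = None
) -> int:
    """Write one SMILES string per row in a single column; return rows written."""
    count = 0
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        if header is not None:
            # Assumption: only needed if the expected layout has a header row.
            writer.writerow([header])
        for entry in smiles:
            entry = entry.strip()
            if entry:  # skip blank entries
                writer.writerow([entry])
                count += 1
    return count


molecules = ["CCOC(=O)C1=CC2=CC(OC)=CC=C2O1", "CCO", "  "]
out_path = Path(tempfile.gettempdir()) / "my_smiles.csv"  # hypothetical file name
n_written = write_smiles_csv(molecules, out_path)
```

The resulting path can then be passed to `eosce embed -i`.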

For grid embeddings:

eosce embed --grid "CCOC(=O)C1=CC2=CC(OC)=CC=C2O1" -o output.csv

To see all available options, run:

eosce embed --help

Developing

1. Clone the repository

git clone https://github.com/ersilia-os/compound-embedding.git
cd compound-embedding

2. Create a conda environment and activate it

conda config --set channel_priority flexible
conda env create -f env.yml
conda activate crux

3. Install the Groverfeat library

bash install_grover.sh

Project Architecture

Phase 1

We generated a dataset that combines GROVER descriptors, Mordred descriptors, and assay labels for the molecules in the FS-Mol dataset. This dataset is then used to train a prototypical network (ProtoNet) with Euclidean distance as the metric.
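
To illustrate the idea behind a prototypical network with a Euclidean metric: each class prototype is the mean of that class's support-set embeddings, and a query is assigned to the class with the nearest prototype. The sketch below shows only this classification rule on made-up 2-D vectors. The encoder, the descriptor inputs, and the training loop of the actual model are not shown, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def class_prototypes(support: List[Tuple[Vector, int]]) -> Dict[int, List[float]]:
    """Average the support embeddings of each class into one prototype per class."""
    grouped: Dict[int, List[Vector]] = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_euclidean(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance; the square root is not needed for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(query: Vector, prototypes: Dict[int, List[float]]) -> int:
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_euclidean(query, prototypes[label]))


# Toy support set of (embedding, activity label) pairs: 1 = active, 0 = inactive.
support_set = [
    ([0.9, 1.1], 1),
    ([1.1, 0.9], 1),
    ([-1.0, -0.8], 0),
    ([-0.8, -1.0], 0),
]
prototypes = class_prototypes(support_set)
prediction = predict([0.7, 0.6], prototypes)
```

Here the query lies closer to the prototype of the active class, so it is assigned label 1.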

Phase 2

We then used the trained ProtoNet to generate 2 million embeddings for the molecules in the reference library built from ChEMBL.

Phase 3

Finally, we used the dataset generated in Phase 2 to train a simple, fast neural network that maps extended-connectivity fingerprints (ECFPs) to the embeddings produced by the ProtoNet.
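
Phase 3 amounts to fitting a fast "student" model to reproduce the embeddings of a slower "teacher". The sketch below illustrates that objective only. A linear map trained by gradient descent on mean squared error stands in for the neural network, and the 4-bit fingerprints and 2-D teacher embeddings are made-up numbers. None of the names come from the project's code.

```python
from typing import List

Matrix = List[List[float]]


def student_predict(weights: Matrix, fingerprint: List[float]) -> List[float]:
    """Linear student: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, fingerprint)) for row in weights]


def mean_squared_error(weights: Matrix, fingerprints: Matrix, targets: Matrix) -> float:
    """Mean over samples of the squared distance between student and teacher outputs."""
    total = 0.0
    for fp, target in zip(fingerprints, targets):
        total += sum((p - t) ** 2 for p, t in zip(student_predict(weights, fp), target))
    return total / len(fingerprints)


def fit_student(
    fingerprints: Matrix, targets: Matrix, lr: float = 0.1, epochs: int = 2000
) -> Matrix:
    """Full-batch gradient descent on the MSE, starting from all-zero weights."""
    n, n_in, n_out = len(fingerprints), len(fingerprints[0]), len(targets[0])
    weights = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        grad = [[0.0] * n_in for _ in range(n_out)]
        for fp, target in zip(fingerprints, targets):
            errors = [p - t for p, t in zip(student_predict(weights, fp), target)]
            for j, err in enumerate(errors):
                for k, x in enumerate(fp):
                    grad[j][k] += 2.0 * err * x / n
        for j in range(n_out):
            for k in range(n_in):
                weights[j][k] -= lr * grad[j][k]
    return weights


# Toy 4-bit "fingerprints" and 2-D "teacher embeddings" (made-up numbers).
toy_fps = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
toy_teacher = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.9], [0.8, 0.0]]

initial_loss = mean_squared_error([[0.0] * 4 for _ in range(2)], toy_fps, toy_teacher)
student = fit_student(toy_fps, toy_teacher)
final_loss = mean_squared_error(student, toy_fps, toy_teacher)
```

Once trained, the student needs only the fingerprint at inference time, which is what makes this kind of model cheap to run.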

Getting required dataset

FS-Mol Dataset

The FS-Mol dataset is available as a download, FS-Mol Data, split into train, valid and test folders. Tasks are stored as individual compressed JSONLines files, with each line containing the information for a single datapoint of the task. Each datapoint is stored as a JSON dictionary with a fixed structure:

{
    "SMILES": "SMILES_STRING",
    "Property": "ACTIVITY BOOL LABEL",
    "Assay_ID": "CHEMBL ID",
    "RegressionProperty": "ACTIVITY VALUE",
    "LogRegressionProperty": "LOG ACTIVITY VALUE",
    "Relation": "ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE",
    "AssayType": "TYPE OF ASSAY",
    "fingerprints": [...],
    "descriptors": [...],
    "graph": {
        "adjacency_lists": [
           [... SINGLE BONDS AS PAIRS ...],
           [... DOUBLE BONDS AS PAIRS ...],
           [... TRIPLE BONDS AS PAIRS ...]
        ],
        "node_types": [...ATOM TYPES...],
        "node_features": [...NODE FEATURES...]
    }
}
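
A task file in this layout can be read line by line with the standard library. The sketch below assumes gzip compression and a `.jsonl.gz` extension (the text above says only "compressed"), and it writes a tiny made-up file first so that it runs without the real dataset. The helper names and the demo values are hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def read_task(path: Path) -> Iterator[Dict]:
    """Yield one datapoint dictionary per non-empty line of a gzipped JSONLines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def smiles_and_labels(path: Path) -> List[Tuple[str, str]]:
    """Collect (SMILES, Property) pairs using the field names shown above."""
    return [(point["SMILES"], point["Property"]) for point in read_task(path)]


# Write a two-line demo file; values are made up and most fields are omitted.
demo_path = Path(tempfile.mkdtemp()) / "demo_task.jsonl.gz"
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"SMILES": "CCO", "Property": "1.0"}) + "\n")
    handle.write(json.dumps({"SMILES": "CCN", "Property": "0.0"}) + "\n")

pairs = smiles_and_labels(demo_path)
```

Reading lazily with a generator avoids holding a whole task in memory when only a few fields are needed.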

Preprocessing pipelines

Generate training data for ProtoNet

Merge Grover descriptors with FS-Mol Data

crux gen grover --inp "path/to/fs-mol/data" --out "path/to/save/output"

Merge Mordred descriptors with FS-Mol Data

crux gen grover --inp "path/to/fs-mol-merged-grover/data" --out "path/to/save/output"

Run QC checks to fix corrupted files

crux qc --inp "path/to/fs-mol/data" --out "path/to/fs-mol-merged-grover-mordred/data"

Train the ProtoNet

crux train protonet \
    --save_dir path/to/save/trained_model \
    --data_dir "path/to/fs-mol-merged-grover-mordred/data" \
    --num_train_steps 10000

Move fully trained model to package root

cp path/to/save/trained_model/FSMOL_protonet_{run identifier}/fully_trained.pt ./src/compound_embedding/
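
Since the run identifier in the directory name changes from run to run, the copy can also be scripted. The sketch below is one possible approach, assuming the `FSMOL_protonet_*` directory pattern shown above and picking the most recently modified checkpoint. The function names are hypothetical and not part of crux.

```python
import shutil
from pathlib import Path
from typing import Optional


def latest_checkpoint(model_dir: Path, name: str = "fully_trained.pt") -> Optional[Path]:
    """Return the most recently modified FSMOL_protonet_*/<name>, or None if absent."""
    candidates = list(model_dir.glob(f"FSMOL_protonet_*/{name}"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def copy_checkpoint(model_dir: Path, package_dir: Path) -> Path:
    """Copy the latest checkpoint into `package_dir` and return the new path."""
    checkpoint = latest_checkpoint(model_dir)
    if checkpoint is None:
        raise FileNotFoundError(f"no fully trained checkpoint under {model_dir}")
    package_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(checkpoint, package_dir / checkpoint.name))
```

For example, `copy_checkpoint(Path("path/to/save/trained_model"), Path("src/compound_embedding"))` would mirror the `cp` command above.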

Download and move reference library to package root

The reference library can be downloaded from here. Move it to the package root, as in the previous step.

Generate training data for Ersilia Compound Embeddings model

mpiexec -n 4 python gen_efp_train.py

This will create an efp_training.hdf5 file in the directory where the command is executed.

Train Ersilia Compound Embeddings model

crux train efp --save_dir /path/to/save/checkpoints --data_file /path/to/efp_training.hdf5

License

This repository is open-sourced under the GPL-3 License. Please cite us if you use it.

About Us

The Ersilia Open Source Initiative is a non-profit organization (1192266) whose mission is to equip labs, universities, and clinics in low- and middle-income countries (LMIC) with AI/ML tools for infectious disease research.

Help us achieve our mission or volunteer with us!
