How to predict DTI of new compounds with trained protein models? #5

Open
Bigrock-dd opened this issue Nov 23, 2020 · 10 comments

@Bigrock-dd

Thanks!

@Bigrock-dd
Author

Another question: you proposed three network options in your article. Where are the files for the other two networks?
Looking forward to your reply!

@tuncadogan
Collaborator

tuncadogan commented Nov 29, 2020

Hi, thank you for your interest, and sorry for the late reply. I cannot see your original question at the moment, but I believe you asked how to produce predictions with a DEEPScreen model that you trained.

The answer: we have not prepared ready-to-use models yet; however, you can do this by modifying the "train_val_test_dict.json" file of the target protein you are interested in. This file lists the ChEMBL IDs of the compounds in the training set, the validation set, and the hold-out test set, each with a label of 0 or 1 (inactive or active).

Let's say you wish to produce predictions for 100 compounds that are not known to interact with the target of interest. Add their IDs to the test split of that target inside the "train_val_test_dict.json" file. To place them in the test set, and not in the training or validation sets, add them after the expression "test": [ in the file. Also give each one a fake label of 0 or 1, in the same form as the compounds already in that file; it does not matter which label you choose.
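
The edit described above can also be scripted. The following is a minimal sketch, not part of DEEPScreen itself. It assumes the JSON file holds an object whose "test" key maps to a list of [compound_id, label] pairs; that layout, the "training" and "validation" key names, the function name, and the example IDs are all assumptions, so compare them with your own file before using it.

```python
import json

def add_prediction_compounds(split_dict, compound_ids, placeholder_label=0):
    """Return a copy of split_dict with compound_ids appended to the "test" split.

    ASSUMPTION: every split is a list of [compound_id, label] pairs.
    The placeholder label is arbitrary and carries no meaning.
    """
    # Copy each split so the caller's dictionary is left unchanged.
    updated = {name: list(entries) for name, entries in split_dict.items()}
    # Skip IDs that already appear in any split, so a training compound
    # is not silently duplicated into the test set.
    already_present = {entry[0] for entries in updated.values() for entry in entries}
    for compound_id in compound_ids:
        if compound_id in already_present:
            continue
        updated.setdefault("test", []).append([compound_id, placeholder_label])
        already_present.add(compound_id)
    return updated

# Toy data with made-up IDs; a real file would be read with json.load().
splits = {
    "training": [["CHEMBL1", 1]],
    "validation": [["CHEMBL2", 0]],
    "test": [["CHEMBL3", 1]],
}
updated = add_prediction_compounds(splits, ["MYCOMP_1", "MYCOMP_2", "CHEMBL1"])
print(json.dumps(updated["test"]))
# → [["CHEMBL3", 1], ["MYCOMP_1", 0], ["MYCOMP_2", 0]]
```

To apply this to a real file, load it with json.load(), pass the result through the function, and write it back with json.dump(), keeping a backup of the original split file.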

After that, train/test the model using the desired hyperparameters. When the process is finished, check the prediction results in the "best_val_test_predictions-....txt" file. The predictions for the compounds you added to the test dataset are in the right-most column.
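
Pulling out only the rows for your own compounds could look like the sketch below. The only detail taken from this thread is that the prediction is the right-most column. That each line describes one compound, that the compound ID is the first field, and that fields are tab-separated are assumptions, and the sample lines are made up for illustration.

```python
def read_predictions(lines, wanted_ids, sep="\t"):
    """Map compound ID -> right-most field, for the requested compounds only.

    ASSUMPTIONS: one compound per line, fields separated by `sep`,
    compound ID in the first field. Adjust these to the real file layout.
    """
    wanted = set(wanted_ids)
    found = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # skip headers, blank lines, or summary lines
        compound_id = fields[0].strip()
        if compound_id in wanted:
            found[compound_id] = fields[-1].strip()
    return found

# Made-up sample lines: ID, placeholder label, prediction.
sample = ["CHEMBL3\t1\t1\n", "MYCOMP_1\t0\t1\n", "MYCOMP_2\t0\t0\n"]
preds = read_predictions(sample, ["MYCOMP_1", "MYCOMP_2"])
print(preds)
# → {'MYCOMP_1': '1', 'MYCOMP_2': '0'}
```

With a real results file, pass the open file object as `lines` and change `sep` if the file uses a different delimiter.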

The only downside is that the reported predictive performance of this model will depend on how well your fake labels happen to agree with the predictions for these compounds, so do not trust those performance results. If you want a reliable performance estimate for your model, first train/test it without the compounds you wish to predict. Check the performance of this original model and, if you are satisfied, re-train/test the model, this time with your compounds added.
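
A small worked example shows why the reported numbers stop being meaningful once placeholder labels are mixed in. This is an illustration with made-up labels and predictions, using plain accuracy as a stand-in metric; it is not the metric code used by DEEPScreen.

```python
def accuracy(labels, predictions):
    """Fraction of positions where label and prediction agree."""
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)

# Made-up hold-out test set with true labels: 3 of 4 predictions are correct.
real_labels = [1, 0, 1, 0]
real_preds = [1, 0, 1, 1]

# Made-up predictions for four added compounds whose true labels are unknown.
new_preds = [1, 1, 1, 1]

baseline = accuracy(real_labels, real_preds)
with_fake_zeros = accuracy(real_labels + [0, 0, 0, 0], real_preds + new_preds)
with_fake_ones = accuracy(real_labels + [1, 1, 1, 1], real_preds + new_preds)

print(baseline, with_fake_zeros, with_fake_ones)
# → 0.75 0.375 0.875
```

The model and its predictions are identical in all three cases; only the arbitrary placeholder labels differ, which is why the performance should be read from the run without the added compounds.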

Two important things:

  1. The 2D images of your prediction compounds should be in: "target_training_datasets/chembl_id_of_your_target/imgs/"
  2. The compound IDs you add to the "train_val_test_dict.json" file should match the filenames of the image files of these compounds (e.g. compound ID: "CHEMBL123" and image filename: "CHEMBL123.png").
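
Both points can be checked before starting a training run. The helper below is a hypothetical sketch, not part of DEEPScreen; it only tests whether a "<compound_id>.png" file exists for each ID, and the demo uses a temporary directory in place of the real "imgs" folder.

```python
import os
import tempfile

def missing_images(compound_ids, img_dir, extension=".png"):
    """Return the IDs that have no '<id><extension>' file in img_dir."""
    present = set(os.listdir(img_dir))
    return [cid for cid in compound_ids if cid + extension not in present]

# Toy demo: a temporary directory stands in for the target's imgs/ folder.
with tempfile.TemporaryDirectory() as img_dir:
    # Create an empty placeholder file named after one compound ID.
    open(os.path.join(img_dir, "CHEMBL123.png"), "wb").close()
    missing = missing_images(["CHEMBL123", "MYCOMP_1"], img_dir)

print(missing)
# → ['MYCOMP_1']
```

An empty result means every ID added to the split file has a matching image filename.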

Since these compound images are generated with a specific library (RDKit), the same library and parameters should be used to generate the images of your new compounds. If those compounds are already in ChEMBL, you can find them in this file, which contains nearly 409K ChEMBL compounds: https://drive.google.com/file/d/1E7ZpLN_fMdXmPJPP7WH3IPWPceleP_3a/view?usp=sharing

If not, please follow the instructions described in our paper to construct the images of your compounds of interest. Please let me know if you have further questions.

Answer to your second question: we presented those three architectures in the first version of the DEEPScreen tool. This is the newer and better version (re-coded using modern frameworks, with GPU support), and it has only one CNN architecture. However, its predictive performance is on par with the old ones.

@Bigrock-dd
Author

Thank you very much for your patient answer! Very helpful to me!
In the first version of DEEPScreen, a compressed package file was damaged, so the code could not be reproduced. If possible, could you upload the compressed file named chembl_23_chemreps.txt.zip again? Thanks!

@tuncadogan
Collaborator

No problem at all, glad that it helped.

Yes, we could not retrieve that damaged package, and after that we moved to the new version of DEEPScreen. This means that even if you have all the files, you cannot directly use those pre-trained models (since the optimal threshold information is missing). But you can train your own model by following the instructions in the old version of our repository.

To find these instructions and the file you are looking for, please switch to the master branch of our repository (the new and default version of DEEPScreen is the "PyTorch" branch). In the master branch, the file is at this path: "DEEPScreen/trainingFiles/chembl_23_chemreps.txt.zip"

tuncadogan reopened this Dec 2, 2020

@ljh433

ljh433 commented Dec 9, 2020

Hi, I want to construct the images of my compounds of interest. Do you have the code to construct the images?

@tuncadogan
Collaborator

tuncadogan commented Dec 10, 2020

Hi, we currently do not have a ready-to-use module inside the platform for molecule image drawing. However, we use RDKit for this, as explained here:

https://www.rdkit.org/docs/GettingStartedInPython.html#drawing-molecules

Here is a snippet showing the settings we used for molecule drawing, with a simple example SMILES:

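>>> from rdkit import Chem
>>> from rdkit.Chem.Draw import rdMolDraw2D
>>> IMG_SIZE = 200  # image width and height in pixels
>>> smiles = "CCc1nc(N)nc(N)c1-c1ccc(Cl)cc1"
>>> mol = Chem.MolFromSmiles(smiles)
>>> d = rdMolDraw2D.MolDraw2DCairo(IMG_SIZE, IMG_SIZE)
>>> d.drawOptions().bondLineWidth = 1
>>> d.DrawMolecule(mol)
>>> d.FinishDrawing()
>>> # Save the PNG; name the file after the compound ID
>>> d.WriteDrawingText('comp_id_2.png')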
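
For more than a handful of compounds, the same drawing settings can be wrapped in a loop. The following is a minimal sketch under stated assumptions, not an official DEEPScreen module: the function names and the idea of passing a dictionary of compound ID to SMILES are hypothetical, and RDKit with Cairo support must be installed for the drawing step to run. The RDKit import sits inside the drawing function so that the path helper can be used without RDKit installed.

```python
import os

IMG_SIZE = 200  # same image size as in the snippet above

def image_path(compound_id, out_dir):
    """Build '<out_dir>/<compound_id>.png' so the filename matches the compound ID."""
    return os.path.join(out_dir, compound_id + ".png")

def draw_compound(smiles, png_path, img_size=IMG_SIZE):
    """Draw one SMILES string to a PNG; return False if the SMILES cannot be parsed."""
    # Imported here so the rest of this module works without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem.Draw import rdMolDraw2D

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for invalid SMILES
        return False
    drawer = rdMolDraw2D.MolDraw2DCairo(img_size, img_size)
    drawer.drawOptions().bondLineWidth = 1
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    drawer.WriteDrawingText(png_path)
    return True

def draw_all(smiles_by_id, out_dir):
    """Draw every compound in smiles_by_id; return the IDs that failed to parse."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for compound_id, smiles in smiles_by_id.items():
        if not draw_compound(smiles, image_path(compound_id, out_dir)):
            failed.append(compound_id)
    return failed
```

A call such as draw_all({"MYCOMP_1": "CCc1nc(N)nc(N)c1-c1ccc(Cl)cc1"}, "imgs") would write "imgs/MYCOMP_1.png"; the compound ID here is made up.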

Also, please find the pre-computed 2-D images of all compounds in ChEMBL v27 here:

https://drive.google.com/file/d/16T8NI1Umf8A0qeLu90Akbx3ic-vdAbUO/view?usp=sharing

and the pre-computed 2-D images of all compounds in DrugBank v5.1.7 here:

https://drive.google.com/file/d/11vSqg1SgX7y25TbX4EzNOjWNkSFVZzek/view?usp=sharing

@ljh433

ljh433 commented Dec 11, 2020

Thank you very much for your patient answer!

@Bigrock-dd
Author

Excuse me again. Does this model only need a labeled dataset prepared for each protein, rather than protein information as input?

@tuncadogan
Collaborator

Yes, proteins are used as labels in DEEPScreen, and the actual inputs to the models are the 2-D images of the compounds. We train an individual classifier for each target protein, which allows us to optimize the model parameters specifically for that protein.

@Bigrock-dd
Author

Thank you very much for your patient answer!
