How to screen for drugs with a protein not found in the file? #4

Closed
hima111997 opened this issue Sep 25, 2020 · 9 comments

Comments

@hima111997

Greetings sir,

I want to use your model to screen for drugs against a protein that is not in the file containing the protein names. Could you please help me?

Also, if I want to screen certain drugs from databases, how can I do this?

Thanks

@ahmetrifaioglu
Collaborator

ahmetrifaioglu commented Oct 5, 2020

Dear @hima111997,

Thank you for your interest in DEEPScreen. Below you can see the steps you need to follow to create a new model for a new target and test your drugs:

  1. Assume that your target name (or ID) is "MYTARGETID". First, you need to create images for all the compounds of the target of interest. For this, use the save_comp_imgs_from_smiles method in the data_processing module. This method takes three arguments: the target ID (tar_id), the compound ID (comp_id) and the compound's SMILES string (smiles). Please note that this method creates an image for a single compound, so you should call it once for each of your compounds. The images will be saved under "training_files/target_training_datasets/MYTARGETID/imgs".

  2. You should create a JSON file named "train_val_test_dict.json" under "training_files/target_training_datasets/MYTARGETID". This file should contain a dictionary whose keys are "training", "validation" and "test" (you can put your test drugs under the "test" key as described below), and the value of each key is a list of lists, where each inner list consists of two elements: a compound ID and a label (1 for an active compound, 0 for an inactive compound). You may put your test drugs as elements of the "test" list. A sample dictionary is given below. You may also check the content for the target "CHEMBL286" on the GitHub page.

{"training": [["CHEMBLTRAIN1", 1], ["CHEMBLTRAIN2", 0], ["CHEMBLTRAIN3", 0]], "test": [["CHEMBLTEST1", 1], ["CHEMBLTEST2", 0], ["MYDRUGTOBETESTED", 0]], "validation": [["CHEMBLVAL1", 1], ["CHEMBLVAL2", 0]]}
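The two steps above can be sketched as a short script. The compound IDs and SMILES strings below are placeholders, and the call to save_comp_imgs_from_smiles is shown commented out, since it requires the DEEPScreen repository (and its dependencies) on the path; only the JSON-writing part is generic:

```python
import json
import os

# Placeholder data: compound IDs mapped to SMILES strings (hypothetical examples).
compounds = {
    "CHEMBLTRAIN1": "CCO",
    "CHEMBLTEST1": "c1ccccc1",
    "MYDRUGTOBETESTED": "CC(=O)Oc1ccccc1C(=O)O",
}

target_id = "MYTARGETID"
target_dir = os.path.join("training_files", "target_training_datasets", target_id)
os.makedirs(target_dir, exist_ok=True)

# Step 1: one image per compound (requires the DEEPScreen repo on the path).
# from data_processing import save_comp_imgs_from_smiles
# for comp_id, smiles in compounds.items():
#     save_comp_imgs_from_smiles(target_id, comp_id, smiles)

# Step 2: write the train/validation/test split file.
train_val_test = {
    "training": [["CHEMBLTRAIN1", 1]],
    "validation": [],
    "test": [["CHEMBLTEST1", 1], ["MYDRUGTOBETESTED", 0]],  # dummy label for the unknown drug
}
with open(os.path.join(target_dir, "train_val_test_dict.json"), "w") as f:
    json.dump(train_val_test, f)
```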

After you complete steps 1 and 2, I guess you are good to go and you can call the command given on the GitHub page. Of course, you should try different hyper-parameters to get the best predictive model. A sample command is given below.

python main_training.py --targetid MYTARGETID --model CNNModel1 --fc1 256 --fc2 128 --lr 0.01 --bs 64 --dropout 0.25 --epoch 100 --en mytargetid_training

The test predictions and performance results will be generated as described on the GitHub page. I hope this helps. Please let us know if you have further questions.

@hima111997
Author

Thank you for this description, but I have three questions.

1- From the second step, I understood that I should know whether the drugs in the ChEMBL database interact with the protein of interest (experimental data). So let's say I don't know this information; what should I do?

2- In the second step, you said that I should add my drugs as values under the "test" key ("test": [["CHEMBLTEST1", 1], ["CHEMBLTEST2", 0], ["MYDRUGTOBETESTED", 0]]). My question is, why should I add "MYDRUGTOBETESTED" as a value under the test key? I want to predict whether it binds or not, so I think I should use it in the prediction step, is that right?

3- Is the splitting of the drugs into train, test and validation sets based on some knowledge, or is it random splitting?

@ahmetrifaioglu
Collaborator

1- If you do not know the labels of the drugs to be tested, then you can just assign 0 or 1 as their labels and use only their predictions. For the performance calculation, you may just use the ones with known labels.

2- The reason for adding the "MYDRUGTOBETESTED" under the key "test" is to get the predictions for it without using it in training. The predictions for the drugs will be written in the output file and you can just ignore the label column.

3- The splitting in this version is random. However, you can use a different type of splitting and build your training, test and validation sets accordingly.
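For illustration, a random split like the one described could be produced as below. The 80/10/10 proportions and the placeholder compound IDs are assumptions for this sketch, not DEEPScreen's actual defaults:

```python
import random

compound_ids = [f"CHEMBL{i}" for i in range(10)]  # placeholder IDs
labels = [i % 2 for i in range(10)]               # placeholder activity labels

pairs = list(zip(compound_ids, labels))
random.seed(42)          # fix the seed so the split is reproducible
random.shuffle(pairs)

# Assumed 80/10/10 split into training, validation and test sets.
n = len(pairs)
split = {
    "training": [list(p) for p in pairs[: int(0.8 * n)]],
    "validation": [list(p) for p in pairs[int(0.8 * n) : int(0.9 * n)]],
    "test": [list(p) for p in pairs[int(0.9 * n) :]],
}
print({k: len(v) for k, v in split.items()})  # {'training': 8, 'validation': 1, 'test': 1}
```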

@hima111997
Author

1- So, you mean to randomly assign 0 or 1 to the drugs in the train_val_test_dict.json file in case I don't know whether they interact or not, right? Will this random assignment affect the training? And what do you mean by "use only their predictions" in point 1?

"For the performance calculation, you may just use the ones with known labels." You mean to calculate the metrics manually using only the drugs with known labels?

2- From my knowledge of deep learning, we split the data into train-val-test, and after training, we use the model to predict "y" for data points not in the dataset we used during training. From my understanding of point 2, you said that we put "MYDRUGTOBETESTED" in the test data to get the predictions. Shouldn't we use it in the prediction step?

And thanks for your help.

@ahmetrifaioglu
Collaborator

> 1- so, you mean to randomly assign 0 or 1 to the drugs in the train_val_test_dict.json file in case I don't know whether they interact or not, right?

Yes, I meant this.

> will this random assigning affect the training?

It will not affect the training as test compounds are not used in the training steps. They are only used for independent testing.

> what do you mean by "use only their predictions" in point 1?

I just meant that you can just ignore the dummy labels that you put for the compounds with unknown labels. You can use the model predictions to predict the activity of a compound against the trained target.

> "For the performance calculation, you may just use the ones with known labels." you mean to calculate the metrics manually using only the drugs with known labels?

Yes, in this case you can calculate the performance for the compounds with known labels. I am aware that this is a quick-and-dirty solution for now. Alternatively, you may save the model and use it just to predict new compounds.
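As a sketch of that quick-and-dirty approach, the filtering could look like the snippet below. The rows here are hypothetical and do not reflect DEEPScreen's actual output format:

```python
# Hypothetical (compound_id, true_label, predicted_label) rows; true_label is
# None for compounds whose activity is experimentally unknown.
results = [
    ("CHEMBLTEST1", 1, 1),
    ("CHEMBLTEST2", 0, 1),
    ("MYDRUGTOBETESTED", None, 1),  # dummy-labeled compound: keep prediction only
]

# Collect predictions for the unknown-label compounds to inspect later...
predictions = {cid: pred for cid, true, pred in results if true is None}

# ...and score only the compounds with experimentally known labels.
known = [(true, pred) for _, true, pred in results if true is not None]
accuracy = sum(t == p for t, p in known) / len(known)
print(predictions)   # {'MYDRUGTOBETESTED': 1}
print(accuracy)      # 0.5
```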

> 2- from my knowledge in deep learning, we split the data into train-val-test, and after training, we use the model to predict "y" for data points not in the dataset we used during training. from my understanding of point 2, you said that we put "MYDRUGTOBETESTED" in the test data to get the predictions. should not we use it in the prediction step?

Yes, your knowledge is correct. I guess we are on the same page. Here is the explanation: we do split the data into training, validation and test datasets. The training dataset is used to train the model. The validation dataset is used for hyper-parameter tuning. Therefore, the training and validation datasets both affect the model, as they are used for training and hyper-parameter tuning. However, the test dataset is used only for independent testing, so the model does not see these compounds or their labels during training and validation at all. This is the same thing that you mentioned as the "prediction step." What I meant before is that you can also add the compounds that you want to test to this test set and get your predictions. If you do not know their labels, that is fine; just put some random label, as it has nothing to do with the training and validation.

I hope it is clearer now. Please let me know if you have further questions.

@hima111997
Author

1- "It will not affect the training as test compounds are not used in the training steps. They are only used for independent testing." From this, I understand that you meant I should add random labels only to the test data, but not to the training and validation data. However, I was asking about the training.

For example, let's say that I have a protein and want to screen a database, DrugBank for example; however, I don't know which drugs bind to this protein. From your previous answer, I understand that when I create the file, I should add random labels to the test data (all DrugBank compounds). What about the training data? Should I train on the ChEMBL database? However, in this case, I don't know which drugs bind to the protein. So should I also add random labels?

2- "What I meant before is that you can also add the compounds that you want to test under this test set and get your predictions. If you do not know their labels, that is fine then just put some random label as it has nothing to do with the training and validation." But in this case, I will not be able to know whether the model is doing well or not.

NOTE: I don't have any experimental data about the binding of DrugBank's drugs to this protein.

And sorry for these long questions.

@ahmetrifaioglu
Collaborator

1- OK, I see your point now. If you do not have enough training data for the target of interest, then DEEPScreen or similar methods will not be useful, as you need a certain amount of known active and inactive compounds against your target so that the model can be trained and used for prediction of new compounds. If your target protein has active and inactive data points in ChEMBL, DrugBank or a similar database, then you can create a training dataset and use DEEPScreen as described above. Otherwise, it is not possible to train a model. I think methods based on the guilt-by-association idea, or similarity-based virtual screening methods, could be helpful in your case.

2- Of course, if you do not know the labels in your test set, then you cannot determine whether you are doing well or badly on those particular compounds without doing experiments or finding evidence from a study. The reason to create a completely independent test set, separate from validation, is to evaluate the performance of the system on unseen compounds with known labels. On the other hand, one of the main reasons to provide predictions for compounds with unknown labels is to provide insight before conducting time-consuming and expensive experiments. For example, suppose you want to test a set of drugs against a protein. In such a case, you can use predictive models to get predictions before conducting experiments, to increase your hit rate and perform a pre-elimination.

@hima111997
Author

hima111997 commented Oct 6, 2020 via email

@ahmetrifaioglu
Collaborator

You are welcome! I am closing the issue now.
