✍️ Contribution period: Leila Yesufu #820
Comments
Task 3: Install the Ersilia Model Hub and test the simplest model
Following the instructions here, I installed the Ersilia Model Hub from the command-line interface. Some prerequisites must be installed before installing the Model Hub. They are as follows.
Conda has been successfully installed. This can be tested using
Main Installation.
Check that the CLI works on your terminal, and explore the available commands. We have successfully installed the Ersilia Model Hub. |
Task 3: Install the Ersilia Model Hub and test the simplest model
The first two ran without errors. The calculate command, however, gave me some errors.
I tried it at first and got an error such as this. Upon double-checking the schema file at /home/leila/eos/dest/eos3b5e/api_schema.json, I confirmed that there was no calculate key, only a run key. I then updated the schema using values from the logs. I ran it again and got another error, but this one wasn't a key error; it came from reading the input columns in the code, because the object had no len. I opened the code at /home/leila/ersilia/ersilia/io/readers/file.py and made a change in the read_input_columns function at line 321. This change ensures that the length check is only performed when h is not None. If h is None, the code proceeds accordingly.
With that change, I got this. I now get the input section but not the output. Upon further troubleshooting, I found my mistake in the schema file at /home/leila/eos/dest/eos3b5e/api_schema.json: when creating the calculate entry, I used "result" instead of "mw". I'm still finding it difficult to get the output value. |
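A minimal sketch of the kind of None guard described above. The helper name `header_width` is hypothetical and this is not the actual Ersilia code; it only illustrates why calling `len` on a missing header fails and how checking for None first avoids the error.

```python
from typing import Optional, Sequence


def header_width(h: Optional[Sequence[str]]) -> int:
    """Return the number of header columns, or 0 when no header was found.

    Hypothetical helper for illustration only; not part of Ersilia.
    """
    # len(None) raises a TypeError because None has no length,
    # so test for None before measuring the header.
    if h is None:
        return 0
    return len(h)


print(header_width(None))
print(header_width(["smiles"]))
```

The same pattern applies anywhere a value may be absent: check for None first, then use the value.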
Hello @leilayesufu, |
Seen, Thank you very much |
After a correction by @HellenNamulinda, I removed the environment and created it again. NOTE: I didn't alter the codebase after reinstalling it. The log file can be seen here: run.log. Task 3 has been completed successfully. Thank you @HellenNamulinda |
Task 4: Write a motivation statement to work at Ersilia
Motivation Statement to work at Ersilia
Hi, my name is Leila Yesufu, and I completed an Electrical Engineering degree in 2022. Ersilia's Open Source Initiative resonates with me on a personal level, as it reflects my passion for both healthcare and technology. I am from Nigeria, a middle-income country, and I have experienced firsthand the challenges that communities face in accessing tools for disease research, such as during the Ebola crisis of 2014. I really appreciate the value that this project can bring to the world, and I would love to work towards bringing that value.
The project's roadmap focuses on access to models and building capacity in data science. I also read that Ersilia is currently setting up a sustainable cloud infrastructure (AWS) to enable online ML model inference, which I am experienced in. This presents a chance for me to showcase and enhance my skills in Python, JavaScript, Git, machine learning, deep learning, conda, Google Colab, Django, Docker, Kubernetes, and AWS ML services such as AWS SageMaker. I am also open and eager to learn more technologies during the internship, expanding my knowledge beyond what I currently know.
During the internship, if granted, I will be committed to fully immersing myself in projects, collaborating with the Ersilia team, and contributing meaningfully. I hope to engage actively to maximize my learning and impact. I hope to learn more about the artificial intelligence and machine learning field and build a career in it.
Post-internship, I hope not only to have made valuable contributions to Ersilia but also to continue being part of Ersilia's team and of other projects and communities that use technology for social good. I see this internship as a stepping stone towards a life where I can actively contribute to projects that address real-world challenges in both healthcare and technology. |
Task 5: Submit your first contribution to the Outreachy site
I submitted an application to Ersilia through the Outreachy website and linked this issue as part of my contribution. |
Hi @leilayesufu, great work! Thanks for the updates. :) |
Thank youuu! @DhanshreeA |
@DhanshreeA I've successfully completed Week 1 - Get to know the community. Do I go ahead to Week 2 - Install and run an ML model? |
Hi @leilayesufu yes absolutely. Thanks for the updates. You can go ahead and get started with week 2 tasks. |
Week 2 - Install and run an ML model
Task 6: Select a model from the suggested list
After going through the proposed models, I selected Plasma Protein Binding (IDL-PPBopt): https://github.com/Louchaofeng/IDL-PPBopt
I made this choice after reading the publication for the model: https://pubs.acs.org/doi/10.1021/acs.jcim.2c00297
It sparked my interest because, by using an interpretable deep learning method, the model not only provides predictions but also gives insight into the factors that influence PPB. |
TASK 7: Install the model in your system
To install this model, I created a conda environment named 'IDL-PPBopt' and activated the environment using the following commands:
Then I installed the following packages which the model requires:
Other required packages
Although the repository authors didn't list pandas and matplotlib, I also installed them:
I then viewed all my installed packages; the output is here. |
Hi @leilayesufu thanks for the updates. However there's a slight confusion in your comment. You mention selecting "NCATS Rat Liver Microsomal Stability", however it appears you have worked with "Plasma Protein Binding (IDL-PPBopt)" instead. These are two different models. A side note, |
Ah, Thank you so much for the correction!!! I'll update it now. Thank you for the information on scikit-learn as well |
To test the IDL-PPBopt model, I located the Jupyter notebook (.ipynb) that is used to run the code, seen here. I then extracted all the steps needed to predict values and saved them in a Python file I created, named model.py. I did this because I wanted to run the predictions with plain Python. The code was still dependent on CUDA and a GPU, so I had to edit the model.py file. I first set the device to CPU using the following command
I then edited all the files that depended on CUDA when loading the file by changing the loading command. When I tried running the file, I got the error shown here. This was because the files used imports that depend on IPython from Jupyter, so I had to comment out all the lines that depend on IPython. After this was done, I tried running it again and got the error shown in the log here. To fix this, I changed the line of code
Then I ran it again and it finally worked. View the log file here. View the CSV output of the default input file in the Plasma Protein Binding (IDL-PPBopt) model here. The IDL-PPBopt model has been installed and run on my system. |
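As an alternative to commenting out the IPython-dependent lines, a guarded import lets the same file run both inside and outside Jupyter. This is only a sketch of that different technique, not what was done above. It assumes the code imported `display` from `IPython.display`, which the thread does not confirm.

```python
# Assumption: the notebook code used `display` from IPython.display;
# the thread does not say which IPython imports were actually involved.
try:
    from IPython.display import display
except ImportError:
    def display(obj):
        """Plain-Python stand-in used when IPython is not installed."""
        print(obj)


# Works in a notebook and in a plain `python model.py` run alike.
display("prediction step finished")
```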
To run predictions for the Essential Medicines List (EML) with the IDL-PPBopt model, I gave the provided EML list as input to my model.py. I edited the column heading from smiles to cano_smiles, then I ran the command. This is the successful log, and this is the predictions output: IDL-PPBopt model_eml_canonical_output.csv
Explanation of the result
The model uses an interpretable deep learning method to help us understand how likely a compound is to bind to plasma proteins. |
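The header change described above (smiles to cano_smiles) can also be done programmatically. The following is a minimal sketch using only the standard library. The function name `rename_column` and the two sample SMILES rows are illustrative assumptions, not taken from the EML file.

```python
import csv
import io


def rename_column(src, dst, old="smiles", new="cano_smiles"):
    """Copy a CSV from src to dst, renaming one header column.

    Hypothetical helper for illustration; all data rows are copied unchanged.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)  # first row holds the column names
    writer.writerow([new if name == old else name for name in header])
    writer.writerows(reader)


# Illustrative input: two example molecules, not the real EML list.
src = io.StringIO("smiles\nCCO\nc1ccccc1\n")
dst = io.StringIO()
rename_column(src, dst)
print(dst.getvalue())
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, as the `csv` module documentation recommends.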
The model should be a regression model because it outputs continuous values. |
Task 4: Understand Ersilia's backend and run the EML list with Ersilia
I decided to challenge myself and run the model with Docker containers. I located the model in the Ersilia Model Hub, then went to Docker Hub and got the command for pulling the image: "docker pull ersiliaos/eos22io" (view the log here). I ran the container in detached mode, as seen. After starting the container, I inspected it here and found that it has a custom entry point specified and uses the sh shell, so I accessed the shell terminal with the following command. Here are the logs. Explanation of the logs: I ran these two commands, as seen in the Dockerfile. The model had already been served. Then I used this command to download the EML list into the Docker shell
Then I ran this command to run the predictions on the downloaded file and save the result in a new file called ersilia_output.csv. The file can be seen here. I exited the Docker interactive shell and used the command below to copy the output generated inside the Docker container to my home directory
Comparison of the output from IDL-PPBopt and from running it with Ersilia
The IDL-PPBopt model with EML: IDL-PPBopt.model_eml_canonical_output.csv
I checked every single result and they were the same, apart from numbers 97 and 98, which were 0.93916506 and 0.9521944 in the IDL-PPBopt output but 0.9391651 and 0.95219433 in the Ersilia output, so only a slight rounding difference. It should also be noted that running the model with Ersilia produced 9 NaN results, giving 442 rows, while the IDL-PPBopt output left those results out, giving 433 rows. This is because, as seen in the log of the IDL-PPBopt model here, 9 compounds could not be featurized. From my understanding, ersiliaos/eos22io included them and gave a NaN result, while the IDL-PPBopt model left them out completely.
I have successfully completed week 2 tasks.
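The manual comparison above could be automated along these lines. This is a sketch, assuming both outputs keep the molecules in the same order, so that dropping the NaN rows from the Ersilia output realigns the two lists. The function name and the tolerance are illustrative choices. The two value pairs are the ones quoted above.

```python
import math


def compare_predictions(original, ersilia, rel_tol=1e-6):
    """Compare two prediction lists, ignoring NaN entries in `ersilia`.

    Returns (number of NaN rows, indices where the values disagree).
    Assumes both lists are in the same molecule order.
    """
    kept = [v for v in ersilia if not math.isnan(v)]
    if len(kept) != len(original):
        raise ValueError("row counts differ even after dropping NaN rows")
    mismatches = [
        i
        for i, (a, b) in enumerate(zip(original, kept))
        # isclose treats tiny float rounding differences as equal
        if not math.isclose(a, b, rel_tol=rel_tol)
    ]
    return len(ersilia) - len(kept), mismatches


original = [0.93916506, 0.9521944]
ersilia = [0.9391651, float("nan"), 0.95219433]
n_nan, mismatches = compare_predictions(original, ersilia)
print(n_nan, mismatches)
```

A relative tolerance is used rather than exact equality because the two outputs print the same values with different numbers of digits.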
|
Hi @leilayesufu thank you for the detailed updates. Proteins are organic compounds (containing C-H or C-C bonds), whereas the nine compounds that are not featurized within the original code are inorganic compounds or salts. Since the input and output file for a model within ersilia needs to be of the same length, ersilia simply outputs null corresponding to the molecules that don't produce an output (because they don't get featurized) |
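The length-preserving behaviour described above can be sketched as follows. This is not Ersilia's implementation. The function name, the SMILES strings, and the prediction values are all made up for illustration.

```python
def align_outputs(inputs, predictions):
    """Return one output per input, using None where no prediction exists.

    `predictions` maps an input molecule to its predicted value; molecules
    that could not be featurized are simply absent from the mapping.
    """
    return [predictions.get(smiles) for smiles in inputs]


# Illustrative data only: the salt has no prediction.
inputs = ["CCO", "[Na+].[Cl-]", "c1ccccc1"]
predictions = {"CCO": 0.42, "c1ccccc1": 0.87}
aligned = align_outputs(inputs, predictions)
print(aligned)
```

Because every input keeps its own row, the output can be joined back to the input file by position, which is why the Ersilia result has 442 rows while the original code returns 433.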
Thank you very much! |
WEEK 3
Suggest a new model and document it (1)
Slug: CLAMP
Model title: Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language
Publication: https://arxiv.org/abs/2303.03363
Github repo: https://github.com/ml-jku/clamp#clamp-clamp
License: GPLv3
Code: Python
Checkpoints provided: yes
Year: 2023
Description: CLAMP (Contrastive Language-Assay Molecule Pre-Training) is an ML model trained on molecule-bioassay pairs. It can be queried in natural language. Given a textual description of a bioassay, CLAMP can be used to predict which molecules are most relevant to that bioassay.
Datasets used:
Relevance to Ersilia: I believe it will be relevant to Ersilia because it is an AI/ML tool that can be used in drug discovery. CLAMP is also designed for zero-shot transfer learning in drug discovery, which means that the model can make predictions for molecules on bioassays it wasn't pretrained on. This can be used in drug discovery to explore new molecules for drugs. |
Suggest a new model and document it (2)
Slug: DrugApp
Title: Machine learning-based prediction of drug approvals using molecular, physicochemical, clinical trial, and patent-related features
Publication: https://www.tandfonline.com/doi/abs/10.1080/17460441.2023.2153830
Source Code: https://github.com/fulyaciray/DrugApp
Description: DrugApp is a tool that uses machine learning and different data sources to predict the likelihood of regulatory approval for drugs.
License: GPL-3.0 license
Code: Python
Year: 2022 |
Suggest a new model and document it (3)
Slug: RLBind
Title: RLBind: a deep learning method to predict RNA–ligand binding sites
Publication: https://www.researchgate.net/publication/365587894_RLBind_a_deep_learning_method_to_predict_RNA-ligand_binding_sites
Source Code: https://github.com/KailiWang1/RLBind/tree/main
Description: This model uses a deep convolutional neural network (CNN) to predict locations on RNA molecules where ligands are most likely to bind. This can be useful in drug design and drug discovery, and in using RNAs as targets for treating diseases.
License: Apache-2.0 license
Code: Python |
Running the models.
Model 1: CLAMP
I set up the environment using the commands in the GitHub repository; a YAML file was provided with all the dependencies. When running it, I encountered the following error: cliperror.txt. This was due to clip being a dependency not mentioned in the environment file; I fixed it by installing the CLIP (Contrastive Language-Image Pretraining) model. Then I ran the file again and got the required output here: output.txt. I then changed the input given to the pretrained model to the first 4 SMILES of the EML file, with the same bioassay, and got this result here.
|
Model 2: DrugApp
To test this model, I followed the setup instructions in the GitHub repository and downloaded the required dependencies. Then I navigated to the scripts directory and ran the following command. I ran the command. That just shows the date and time, but it saved a results CSV file, as seen here: results_prospective_analysis_Rare.csv. I also ran the command. The model also has commands to run the scripts for evaluation metrics and feature importances. This can be done with
CSV FILE CREATED: |
For this model, I cloned the Git repository and installed the listed requirements one after the other. I didn't use the environment YAML file provided because it is for a GPU-enabled version.
To test the model, I ran the command. I got the following error, cuda.txt, due to the use of CUDA in the file, so I edited the predict.py file so that it wouldn't depend on CUDA. Then I ran |
@DhanshreeA Hi, I'd love some feedback on the models I provided, please. |
Hello, Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again! |
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application