🦠 Model Request: CReM - chemically reasonable mutations framework for structure generation #505
Comments
@miquelduranfrigola please take a look. |
The SA (synthetic accessibility) score is on a scale of 1 to 10 (lower is better) while the SC (synthetic complexity) score is on a scale of 1 to 5 (again, lower is better). I am unsure how SCscore 2 compares to SAscore 2. |
Hi @DhanshreeA, SC score uses the number of reactions needed to synthesize a new molecule as a measure of complexity, whereas SA score also takes into account the presence of unusual structural features. They are related (usually, a molecule with a high SC score will also have a high SA score, and vice versa). Do you think it would be useful to incorporate the models for SA and SC independently as well? SA score is already in the Hub (eos9ei3), but I think SC score is not there: https://pubs.acs.org/doi/10.1021/acs.jcim.7b00622 |
@GemmaTuron I see. Thanks for that explanation, it makes a lot of sense. Incorporating SAscore 2 and SCscore 2 independently is going to be very straightforward, since the author has provided different fragment replacement databases for each of them. The only difference is that SA2 is 3 GB while SC2 is 721 MB, but I understand that Git LFS handles that. |
Hi @DhanshreeA! I am unsure what SA2 is and how it differs from the SA score from the original publication, which is already in the Hub. Unless there is a significant improvement, we are good with the one we have. |
Hi @GemmaTuron, I'll quote from their documentation:
The model we have in the hub computes SA score for a given molecule input, while this framework uses fragments of this score for generating new molecules. In fact the author only uses the SA score as defined in the original publication https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-1-8 |
A few more details about this package relevant to incorporating it within the hub. The library API provides three functions for generating mutations in a given input molecule:
Other configurable parameters that can affect the speed of execution include, from the documentation and as addressed in the original repo :
|
Nice initiative! Short comments:
I can share an SA2.5 database for ChEMBL, but it will be even larger (10-20 GB). If you have further questions regarding CReM, do not hesitate to contact me. |
Thanks @DrrDom this is fantastic feedback. Greatly appreciated. @DhanshreeA, for the deployed version in the Ersilia Model Hub, let's use a set of reasonable parameters and relatively small database size, just to avoid frustration from users. But let's make sure we encourage users to visit the actual CReM repo, which is well maintained and easy to install. We can reflect this nicely in the README file. What do you all think? |
@DhanshreeA I've made a few edits to your initial comment, in particular to mention more explicitly CReM. Let's hear @GemmaTuron 's feedback and then it is good to go! |
It is difficult to recommend certain settings because I do not know your use case. |
/approve |
New Model Repository Created! 🎉
@DhanshreeA, the Ersilia model repository has been successfully created and is available at:
Next Steps ⭐
Now that your new model repository has been created, you are ready to start contributing to it! Here are some brief starter steps for contributing to your new model repository:
Additional Resources 📚
If you have any questions, please feel free to open an issue and get support from the community! |
Associated PR: ersilia-os/eos4q1a#1 |
Hi @DhanshreeA I have added the model to the AirTable and have merged the PR |
Note: For sake of simplicity we are not including the Further action items:
Update (Dec 21):
|
Updates:
|
Further action items: |
@miquelduranfrigola / @GemmaTuron could one of you merge ersilia-os/eos4q1a#4? |
Updates:
Associated PR: ersilia-os/eos4q1a#5 @miquelduranfrigola @GemmaTuron Results on eml_canonical: |
To understand fingerprint generation and vectorization into Numpy arrays for the downstream task of KMeans clustering I took inspiration from https://github.com/PatWalters/kmeans/blob/master/kmeans.py (particularly for using the inbuilt ( |
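As a rough illustration of the vectorize-then-cluster idea described above, the sketch below turns sets of "on" bits into dense 0/1 vectors and runs a small Lloyd-style k-means on them. It is not the code from the PR: plain Python lists stand in for NumPy arrays, the toy fingerprints are made up, and the names `to_dense` and `kmeans` are hypothetical. In practice the fingerprints would come from RDKit and the clustering from scikit-learn.

```python
from typing import List, Sequence, Set


def to_dense(on_bits: Set[int], n_bits: int) -> List[float]:
    """Expand a set of on-bit indices into a dense 0/1 vector."""
    return [1.0 if i in on_bits else 0.0 for i in range(n_bits)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Sequence[float]], k: int, n_iter: int = 20) -> List[int]:
    """Tiny k-means; initialised with the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up 8-bit fingerprints: two pairs of similar "molecules".
toy_bits = [{0, 1, 2}, {0, 1, 3}, {5, 6, 7}, {4, 6, 7}]
vectors = [to_dense(bits, n_bits=8) for bits in toy_bits]
labels = kmeans(vectors, k=2)
print(labels)
```

One representative per cluster (for example, the member closest to each centroid) could then be kept to reduce a large set of generated molecules to a fixed number.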
While going through the RDKit documentation, I came across similar functionality that RDKit offers for picking a subset of diverse molecules: https://www.rdkit.org/docs/GettingStartedInPython.html#picking-diverse-molecules-using-fingerprints. I have not tried using it, so I am unsure whether it is better optimized for the task at hand than regular machine learning libraries like scikit-learn, which I have used so far. |
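For comparison with clustering, here is a minimal pure-Python sketch of greedy MaxMin selection, the general idea behind diversity pickers of this kind. It is an assumption-laden toy rather than RDKit's implementation: fingerprints are represented as frozensets of on-bit indices, the data are made up, and `tanimoto` and `maxmin_pick` are hypothetical helper names.

```python
from typing import FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # indices of the "on" bits


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two sets of on-bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def maxmin_pick(fps: Sequence[Fingerprint], n_pick: int, first: int = 0) -> List[int]:
    """Greedy MaxMin: repeatedly add the item farthest from the picked set."""
    if not fps:
        return []
    n_pick = min(n_pick, len(fps))
    picked = [first]
    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while len(picked) < n_pick:
        # Choose the candidate whose nearest picked neighbour is farthest away.
        nxt = max(
            (i for i in range(len(fps)) if i not in picked),
            key=lambda i: min_dist[i],
        )
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fps[nxt]))
    return picked


# Made-up fingerprints: two near-duplicate pairs and one outlier.
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
selection = maxmin_pick(toy_fps, n_pick=3)
print(selection)  # prints [0, 2, 4]
```

Unlike k-means, this approach needs only pairwise distances and never computes centroids, so it works directly on bit-vector similarities.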
Updates with integration testing with
Traceback (most recent call last):
File "/home/dee/miniconda3/envs/ersilia-env/bin/ersilia", line 33, in <module>
sys.exit(load_entry_point('ersilia', 'console_scripts', 'ersilia')())
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/bentoml/cli/click_utils.py", line 138, in wrapper
return func(*args, **kwargs)
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/bentoml/cli/click_utils.py", line 115, in wrapper
return_value = func(*args, **kwargs)
File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/bentoml/cli/click_utils.py", line 99, in wrapper
return func(*args, **kwargs)
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/cli/commands/api.py", line 36, in api
result = mdl.api(
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/core/model.py", line 334, in api
return self.api_task(
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/core/model.py", line 349, in api_task
for r in result:
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/core/model.py", line 178, in _api_runner_iter
for result in api.post(input=input, output=output, batch_size=batch_size):
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/serve/api.py", line 329, in post
self.output_adapter.adapt(
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 283, in adapt
df = self._to_dataframe(result)
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 227, in _to_dataframe
dtypes = [self.__pure_dtype(k) for k in output_keys]
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 227, in <listcomp>
dtypes = [self.__pure_dtype(k) for k in output_keys]
File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 156, in __pure_dtype
t = self._schema[k]["type"]
KeyError: 'outcome' |
Generating an example file as suggested here and using that I still run into the same error. |
Thanks @DhanshreeA. I think both options should be fine. If the scikit-learn option is already in place, then let's use that one. |
As for the error above, as discussed, let's make sure that the output file has a fixed number of columns (100). Comma-separated is fine. We may or may not include a header. |
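A minimal sketch of the fixed-width output suggested above, assuming each input molecule yields a variable-length list of generated SMILES. The helper names and the `smiles_NN` header labels are hypothetical and are not taken from the model repository; rows are truncated or padded with empty strings so that every line has exactly 100 comma-separated fields.

```python
import csv
import io
from typing import Iterable, List, Sequence

N_COLUMNS = 100  # fixed width agreed in the discussion above


def to_fixed_width(
    molecules: Sequence[str], n_columns: int = N_COLUMNS, fill: str = ""
) -> List[str]:
    """Truncate or pad a list of generated molecules to exactly n_columns."""
    row = list(molecules[:n_columns])
    row.extend([fill] * (n_columns - len(row)))
    return row


def write_rows(results: Iterable[Sequence[str]], handle, header: bool = True) -> None:
    """Write one fixed-width CSV row per input molecule."""
    writer = csv.writer(handle)
    if header:
        # Hypothetical column names; the header is optional.
        writer.writerow([f"smiles_{i:02d}" for i in range(N_COLUMNS)])
    for generated in results:
        writer.writerow(to_fixed_width(generated))


# Toy results: too few, too many, and no generated molecules.
buffer = io.StringIO()
write_rows([["CCO", "CCN"], ["c1ccccc1"] * 150, []], buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows), {len(r) for r in rows})
```

Keeping the column count constant means a schema declared once up front stays valid for every row, regardless of how many molecules were generated for a given input.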
Thanks @miquelduranfrigola I implemented this fix and the API works with the CLI now. Here is an output CSV file generated with five inputs. |
This PR is now good to be merged: ersilia-os/eos4q1a#5 |
Hi @DhanshreeA! I've tried the model in this GitHub workflow. It failed, unfortunately. Apparently, it is unable to find Did you try to fetch the model from your computer and check what is inside the |
Hi @DhanshreeA,
Update on test
I tested the model multiple times, but it failed to fetch on both the CLI and Colab. Error returned on both is link to colab
Test Environment
|
Hi @miquelduranfrigola, this is fixed by pushing the CReM replacement database onto the LFS server. The issue was that Git LFS was unable to find the CReM replacement database on the LFS server. I re-ran the build action in the GitHub workflow and can confirm that the model is fetched successfully. |
Thank you for your help @pauline-banye. The model works in Colab now. |
One final TODO (bonus):
|
Fantastic work @DhanshreeA ! I've marked the model as Ready in the AirTable. Closing issue! |
Before closing this issue, let's make sure the model is tested. |
Hi @DhanshreeA, prediction takes a long time; I tried it with a single molecule. |
We are doing the testing in ersilia-os/eos4q1a#7 |
Model Name
CReM fragment based structure generation
Model Description
The framework is an open-source implementation of fragment-based generative approaches for exploring chemical space while ensuring chemical validity. It uses a database of known compounds to derive interchangeable fragments, based on the context radius of an input molecule, and uses them to generate new molecules.
Slug
crem-structure-generation
Tags
Generative
Publication
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00431-w
Code
License
BSD 3-clause License
https://github.com/DrrDom/crem/blob/master/LICENSE.txt