Clean UP & Dockerization eos46ev #3
Hello @GemmaTuron @DhanshreeA
to this

With this change it is reading the SMILES strings from the CSV file properly.

Hello @GemmaTuron

Hello @GemmaTuron @DhanshreeA What do you suggest, @GemmaTuron, in this regard? Because I'm confused: it is not giving me an error, it just keeps skipping the SMILES that cause the fuss.

Hi @ZakiaYahya! This is the expected behaviour: SMILES that cannot be parsed are skipped, but it should not be failing on so many. If you pass one of the SMILES from the list that did not get calculated as a single input, does it return empty?
Hello @GemmaTuron @DhanshreeA When I tried different single SMILES inputs, for instance the SMILES string at
Even I tried the SMILES string at
For checking purposes, I've passed a SMILES string that produces
PS: I've updated Ersilia by pulling the latest code a few hours ago, and I tested it after that. From this behaviour, it seems that it depends neither on the size of the input we give the model nor on the particular input SMILES string we ask it to predict.

Hi @ZakiaYahya

Hi @GemmaTuron
I am unsure I am reproducing all your errors; I get null outputs, for example, for
Hello @GemmaTuron |
@ZakiaYahya moving our conversation from Slack to here, and adding a little more to what @GemmaTuron has already suggested:
Now, what you can do is upload the eml_canonical file to this server and see what it returns. If Detect is 1, the probability is likely > 0.5, and vice versa. With this approach we can also see how many SMILES their server is skipping/unable to parse, and compare that with the results we are getting. The model is also available to download from the link I shared above, which is likely where the original contributor got it from.

Hello @DhanshreeA Thanks,
Hello @GemmaTuron @miquelduranfrigola @DhanshreeA Apart from these SMILES: when I deleted them from eml_canonical.csv, the model works fine and gives me the probabilities. Here are the eml_canonical.csv file and the corresponding output/probability file, eml_canonical_remaining.csv. Let's go into the detail of the descriptors for each SMILES string (1) For each SMILES, here are the descriptors for both: descriptors_remaining.csv. Now I'm stuck on why these 7 SMILES are not making it to probabilities. Going through the error message, it seems there may be some undefined values in the descriptors causing the problem.
The result is: I'm stuck here now. Any suggestions on how to proceed further? Thanks.

Hi @ZakiaYahya Great work, thanks for all the details. It can happen that some descriptors cannot be calculated for specific molecules (very large or very small molecules, molecules with certain metals...). What we need to do is add a try/except so that these molecules are skipped without making the whole model fail.
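The try/except pattern suggested here could look something like the sketch below. `calc_descriptors` is a hypothetical stand-in for the model's real descriptor function, assumed to raise for problematic molecules; the point is that one bad molecule no longer aborts the whole run:

```python
# Sketch of the skip-on-failure pattern. calc_descriptors is a made-up
# stand-in for the model's real descriptor calculation, which may raise
# for unusual molecules (very large/small, certain metals, ...).
def calc_descriptors(smiles):
    if "Pt" in smiles:  # pretend metal-containing SMILES cannot be processed
        raise ValueError("cannot compute descriptors for %s" % smiles)
    return [len(smiles), smiles.count("C")]

def safe_descriptors(smiles_list):
    results = []
    for smi in smiles_list:
        try:
            results.append((smi, calc_descriptors(smi)))
        except Exception:
            # Skip the problematic molecule instead of failing the whole model
            results.append((smi, None))
    return results

out = safe_descriptors(["CCO", "[Pt]", "c1ccccc1"])
```

Molecules that fail come back as `None`, so downstream code can emit null for them while the rest of the file is still processed.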
Ok, right @GemmaTuron
Hello @ZakiaYahya, I don't know whether you tried adding print statements to examine the source of the error when you pass the problematic SMILES.
Is the code in the repo the latest? I can clone it and look more into it.

Oh okay @HellenNamulinda
Hello @GemmaTuron @HellenNamulinda Any suggestion, @GemmaTuron @miquelduranfrigola @DhanshreeA, on how to resolve this, as the null condition is not working here.
@ZakiaYahya, the model uses the sklearn package, and it is the one raising that error, meaning the input to the model contains either NaN or inf values.
For the problematic file shared, problematic_smiles.csv, it prints
All the above is just for debugging; it is not part of the solution. We can handle those NaN values, for example;
using run.sh. Here is the edited code for the main.py file: main.txt. This has not been tested within Ersilia, e.g. using --repo_path
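For illustration, one way to keep NaN/inf descriptors from reaching `model.predict` is to mask or impute non-finite rows first. This is a sketch with a made-up descriptor matrix, not the model's real descriptors:

```python
import numpy as np

# Hypothetical descriptor matrix: one row per molecule. Row 1 contains NaN
# and row 3 contains inf, mimicking the values sklearn complains about.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.inf, 0.0, 1.0],
])

# Boolean mask of rows where every descriptor is finite
finite_rows = np.isfinite(X).all(axis=1)

# Option 1: drop the offending molecules before calling model.predict,
# and report null for them in the output file
X_clean = X[finite_rows]

# Option 2: impute, e.g. replace non-finite values with the column mean
X_imputed = X.copy()
col_means = np.nanmean(np.where(np.isfinite(X), X, np.nan), axis=0)
bad = ~np.isfinite(X_imputed)
X_imputed[bad] = np.take(col_means, np.where(bad)[1])
```

Dropping (option 1) matches the skip-and-return-null behaviour discussed above; imputation (option 2) keeps the row count identical to the input, which is easier to align with Ersilia's one-output-per-input expectation.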
Thank you so much @HellenNamulinda

Hello @GemmaTuron @HellenNamulinda
So for eml_canonical.csv, there are 443 input SMILES and it gives 443 output probabilities as well. Here is the output file when tested with But when I tested it within Ersilia using
Thanks @ZakiaYahya This is strange. @ZakiaYahya - could you share the current code of this model?

Hello @miquelduranfrigola

Hello @miquelduranfrigola @GemmaTuron @DhanshreeA
And these are not even the SMILES that were giving NaN values. I am still not able to identify the source of the problem. Any suggestion would be helpful.

Hello @GemmaTuron
Hi @ZakiaYahya
Whereas if I pass a .csv file containing the molecule, I get For the molecules you pasted above, I reproduce what you describe. I am more puzzled by the CLI vs .csv file differences at this point; can you check whether that is happening for you as well?
@GemmaTuron I have certainly been able to reproduce what's happening with you and @ZakiaYahya:
It seems the CLI is running into an internal server error. E.g., I tried using the bento server endpoint from the Swagger UI, and I see an error there:
What's more interesting is that when I looked at the serve.log in the temp directories that Ersilia creates for a particular run, I see that only 434 molecules are being correctly processed (not even the 437 that we're getting from repo_path). Attaching the serve.log here, as well as the plain molecules: eos46ev_mols.txt. @ZakiaYahya, I need your help: can you check whether some of the molecules in the Output_Ersilia.csv file (in your comment here: #3 (comment)) are repeated?
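Checking for repeated molecules in a single-column SMILES file takes only a few lines of stdlib Python. A sketch, where `demo_smiles.csv` is a throwaway stand-in for Output_Ersilia.csv:

```python
import csv
from collections import Counter

def find_duplicates(path):
    """Return {smiles: count} for SMILES that appear more than once."""
    with open(path, newline="") as f:
        smiles = [row[0] for row in csv.reader(f) if row]
    return {s: n for s, n in Counter(smiles).items() if n > 1}

# Demo with a throwaway file standing in for the real output file
with open("demo_smiles.csv", "w") as f:
    f.write("CCO\nc1ccccc1\nCCO\nCC(=O)O\n")

dups = find_duplicates("demo_smiles.csv")
```

Here `dups` would report only the molecules appearing two or more times, which is enough to confirm or rule out the repetition hypothesis.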
@miquelduranfrigola how do I update the log level for the ersilia CLI if I want to see what is happening at every step? |
Thanks @DhanshreeA, @ZakiaYahya and @GemmaTuron This is most likely an issue on the Ersilia CLI side. Let me set up everything to troubleshoot this model. I will keep you posted.

Yes @GemmaTuron,

@DhanshreeA Hmm, yes, I can check it. But even if that is the case and the SMILES are repeated in eml_canonical.csv, I don't think that is the problem. Right?

This is extremely surprising. Let us think a bit about it.

Hi @miquelduranfrigola
A few more updates here: @ZakiaYahya and I spent some time with this model today during our 1:1, and it would appear that before the refactor, the model worked with single inputs (using its predict API).

Thanks @DhanshreeA and @ZakiaYahya - this is helpful.
I think I've fixed it: 19178b1

The problem was with the
The following code was skipping the first SMILES from the input file, assuming it was the header:

```python
smiles_df = pd.read_csv(smiles_file)
smiles = smiles_df[smiles_df.columns[0]].tolist()
mol = [Chem.MolFromSmiles(x) for x in smiles]
```

I have replaced it with:

```python
smiles = []
with open(smiles_file, "r") as f:
    reader = csv.reader(f)
    for r in reader:
        smiles += [r[0]]
mol = [Chem.MolFromSmiles(x) for x in smiles]
```

Please note that in the
As a result, the old code was skipping the first SMILES provided.
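For context on the fix above: `pd.read_csv` treats the first row of a file as the header by default (passing `header=None` would also have avoided the bug), while `csv.reader` makes no such assumption. A small stdlib sketch of the difference, using throwaway file contents:

```python
import csv

# A headerless SMILES file: every row is data, including the first one.
with open("smiles_noheader.csv", "w") as f:
    f.write("CCO\nCC(=O)O\nc1ccccc1\n")

# csv.reader returns all three rows; nothing is treated as a header.
with open("smiles_noheader.csv") as f:
    smiles = [row[0] for row in csv.reader(f)]

# Mimic what the old pandas-based code effectively did: treat row 0 as a
# header and silently drop the first molecule.
smiles_old_behaviour = smiles[1:]
```

The silent off-by-one is exactly the kind of discrepancy that shows up downstream as output counts that do not match input counts.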
This may explain the weird number of outputs discussed in this thread. The model was not failing at fetch time because fetch tries more than one molecule and does not do stringent checks on the output. It is becoming very apparent that we need to run tests before storing a model. We will discuss this today (@pittmanriley is setting up the placeholders for this).
@miquelduranfrigola

Hello @miquelduranfrigola The problem still remains the same.

Good catch about the header; this is probably happening in more than one model. @ZakiaYahya I'll try large files and see how many outputs we get.

Right @GemmaTuron

I have re-fetched the model (on macOS Ventura, M2) and I was able to produce good predictions for the whole EML list (except a couple of molecules, which is expected). I am at a loss as to why you could not. @miquelduranfrigola, do you have an idea? Might there be an internal checkpoint in the model where it skips molecules if they take too long?

Hi @GemmaTuron

I pulled the latest image from Docker Hub...

@GemmaTuron

I think one thing to point out is that

Thanks @DhanshreeA for the clarification. I am puzzled as to why the model would behave differently, because if predictions can be obtained, they should not appear as null.

Hello @GemmaTuron

Hello @GemmaTuron @DhanshreeA

Fantastic, thanks @ZakiaYahya for your persistence on this issue!