Clean UP & Dockerization eos46ev #3

Closed
GemmaTuron opened this issue Jun 27, 2023 · 49 comments

@GemmaTuron
Member

No description provided.

@ZakiaYahya
Collaborator

ZakiaYahya commented Jun 28, 2023

Hello @GemmaTuron @DhanshreeA
I was encountering the error [15:04:04] SMILES Parse Error: syntax error while parsing: smiles, followed by IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed.
After going through main.py in detail, it seems the problem is in how the SMILES are read: the "smiles" header of the csv file is treated as a SMILES string too, so RDKit returns a null descriptor for it, which causes the trouble. To fix this, I changed the lines that read the SMILES strings from the csv file
from this

sml = pd.read_csv(smiles_file, names=['Smiles'])
smiles = sml['Smiles'].to_numpy()
mol = [Chem.MolFromSmiles(x) for x in smiles]

to this

smiles_df = pd.read_csv(smiles_file)
smiles = smiles_df[smiles_df.columns[0]].tolist()
mol = [Chem.MolFromSmiles(x) for x in smiles]

With this change, the SMILES strings are read from the csv file properly.
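
A minimal sketch of the header issue described above, using only the standard library. The helper name read_smiles and the set of header labels are assumptions for illustration, not part of the model's main.py. The first row is dropped only when it matches a known header label, so files with and without a header yield the same list.

```python
import csv
import io

# Assumed header labels; adjust to the files actually used.
HEADER_LABELS = {"smiles", "smile", "input"}


def read_smiles(handle):
    """Return the first column of a CSV as a list, dropping a header row if present."""
    rows = [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]
    if rows and rows[0].lower() in HEADER_LABELS:
        rows = rows[1:]
    return rows


# The same two molecules, with and without a header line.
with_header = io.StringIO("smiles\nCCO\nCC(=O)O\n")
without_header = io.StringIO("CCO\nCC(=O)O\n")

print(read_smiles(with_header))
print(read_smiles(without_header))
```

Note that pd.read_csv(smiles_file) with default arguments always treats the first line as a header, so on a header-less file it would consume the first molecule.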

@ZakiaYahya
Collaborator

Hello @GemmaTuron
I've opened a PR for it. Kindly check it.
Thanks.

@ZakiaYahya
Collaborator

Hello @GemmaTuron @DhanshreeA
I've tried running the eos46ev model on eml_canonical.csv with the latest ersilia version. It is not giving me the Index-Out-of-Range error, but it is not giving output for all SMILES strings in the csv file. For your review I'm attaching the output file.

eos46ev_output.csv

What do you suggest in this regard, @GemmaTuron? I'm confused because it is not giving me an error, it just keeps skipping the SMILES that cause trouble.
Thanks.

@GemmaTuron
Member Author

Hi @ZakiaYahya !

This is the expected behaviour: SMILES that cannot be parsed are skipped, but it should not be failing on so many. If you pass one of the SMILES that did not get calculated as a single input, does it return empty?

@ZakiaYahya
Collaborator

Hello @GemmaTuron @DhanshreeA
Yes, I've tested single inputs both for SMILES that got a probability and for SMILES that didn't when passing the whole csv file, and I observed a strange behaviour that I haven't been able to explain.
For reference, here I'm again attaching the output prediction file for the whole eml_canonical.csv
eos46ev_output.csv

When I tried different single SMILES inputs, for instance the SMILES string at the 102nd index (row), the output prediction file shows that the model predicts a probability of 0.348619204921108 for it, but when I passed it as a single input to the model it gives me a null output

ersilia -v api run -i "Clc1ccccc1C(n2ccnc2)(c3ccccc3)c4ccccc4"
{
    "input": {
        "key": "VNFPBHJOKIVQEB-UHFFFAOYSA-N",
        "input": "Clc1ccccc1C(n2ccnc2)(c3ccccc3)c4ccccc4",
        "text": "Clc1ccccc1C(n2ccnc2)(c3ccccc3)c4ccccc4"
    },
    "output": {
        "probability": null
    }
}

I also tried the SMILES string at the 438th index: its probability in the output prediction file is 0.102768955346816, while it gives null when passed as a single input

ersilia -v api run -i "CC(=O)CC(c1ccccc1)C2=C(O)Oc3ccccc3C2=O"
{
    "input": {
        "key": "QTXVAVXCBMYBJW-UHFFFAOYSA-N",
        "input": "CC(=O)CC(c1ccccc1)C2=C(O)Oc3ccccc3C2=O",
        "text": "CC(=O)CC(c1ccccc1)C2=C(O)Oc3ccccc3C2=O"
    },
    "output": {
        "probability": null
    }
}

As a check, I passed a SMILES string that produces null in the output prediction file; it also produces null when passed as a single input to the model. For instance, taking the SMILES at the 319th index:

ersilia -v api run -i "CC(C)(S)[C@H](N)C(O)=O"
{
    "input": {
        "key": "VVNCNSJFMMFHPL-GSVOUGTGSA-N",
        "input": "CC(C)(S)[C@H](N)C(O)=O",
        "text": "CC(C)(S)[C@H](N)C(O)=O"
    },
    "output": {
        "probability": null
    }
}

PS: I updated Ersilia by pulling the latest code a few hours ago and tested after that. Also, from this behaviour, it seems to depend neither on the size of the input we give to the model nor on the particular SMILES string we ask it to predict.

@GemmaTuron
Member Author

Hi @ZakiaYahya
Thanks for this finding. This is worrying, since the outputs might be getting mixed up when passing long lists. I'll do some tests and add this for general discussion at the Tuesday meeting.

@ZakiaYahya
Collaborator

Hi @GemmaTuron
Oh right, sure.
Thanks.

@GemmaTuron
Member Author

GemmaTuron commented Jul 3, 2023

@ZakiaYahya

I am not sure I am reproducing all your errors. I get null outputs, for example, for Clc1ccccc1C(n2ccnc2)(c3ccccc3)c4ccccc4 on the CLI, and the molecule gets completely eliminated from the output file (see attached). This is not the expected behaviour: an empty line should appear if the molecule cannot be calculated.
Steps I suggest:

  • Install the model from source (its original repo) and get the "ground truth" for the eml canonical file
  • Do the tests again using the run.sh commands directly (so, use the eos model outside ersilia) and compare the results
  • Finally compare the above two tests with running the model inside ersilia

test.csv
out.csv

@ZakiaYahya
Collaborator

Hello @GemmaTuron
Sure, I think it is best to first check the original repo and its output results. I'll test from the original repo and then with run.sh, and will let you know as soon as possible.
Thanks.

@DhanshreeA
Member

DhanshreeA commented Jul 4, 2023

@ZakiaYahya moving our conversation from Slack to here and adding a little more to what @GemmaTuron has already suggested:

  1. I haven't been able to find any GitHub link for this paper, and the link shared by the authors does indeed have the model behind a web server. However, the pretrained model as well as the training dataset are shared by the authors on the same webpage as well. Linking it here: http://cadd.zju.edu.cn/chemtb/document

  2. When you put in some SMILES as inputs to the web server, you see the following three fields in the results: Detect, logP, and Weight. Since this model is formulated as a binary classification problem, this suggests that the Detect field is the one carrying the binary class result. I input the SMILES CCCC and got -1 (because it's not a molecule that's active against TB). Then I looked up a known TB drug (e.g. https://pubchem.ncbi.nlm.nih.gov/compound/Ethambutol#section=3D-Conformer), found its SMILES, input that on their web server, and got Detect as 1. Hence I can confirm that the server returns the final class label and not the probabilities.

Now, what you can do is upload the eml_canonical file in this server and see what it returns. If Detect is 1, it's likely the probability is > 0.5, and vice-versa. From this approach we can also see how many smiles their server is skipping/unable to parse and compare that with the results we are getting.

The model is also available to download from the link I shared above, which is likely where the original contributor also got it from.
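
The comparison suggested above can be sketched as follows. The 0.5 cut-off is the assumption stated in the comment, the -1/1 encoding follows what the server returned for the two test molecules, and all values in the example are made up for illustration.

```python
def to_detect_label(probability, threshold=0.5):
    """Map a probability to a server-style label: 1 (active), -1 (inactive), None (no prediction)."""
    if probability is None:
        return None
    return 1 if probability > threshold else -1


def compare_with_server(probabilities, server_labels):
    """Return the indices where the thresholded probability disagrees with the server label."""
    mismatches = []
    for i, (prob, label) in enumerate(zip(probabilities, server_labels)):
        if to_detect_label(prob) != label:
            mismatches.append(i)
    return mismatches


# Hypothetical values, not real model or server output.
ersilia_probabilities = [0.08, 0.91, None, 0.35]
server_detect = [-1, 1, None, 1]

print(compare_with_server(ersilia_probabilities, server_detect))
```

Rows where both sides are None are counted as agreeing, which also makes it easy to see whether the server and the model skip the same molecules.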

@ZakiaYahya
Collaborator

Hello @DhanshreeA Thanks,
Yes, I've uploaded eml_canonical.csv and it returns the logP and Weight parameters, which is why I was confused about it not giving probabilities. It makes sense now that Detect returns the class instead of a probability. We can check with that file too: take the probabilities from Ersilia's repo, map them to classes, and compare them with those results.

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 7, 2023

Hello @GemmaTuron @miquelduranfrigola @DhanshreeA
I've dug into why I'm getting "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')" when passing the whole eml_canonical.csv to the model. After checking all the SMILES in detail, there are 7 SMILES on which the model fails to give an output/probability. These are the failing SMILES:

problematic_smiles.csv

When I deleted these SMILES from eml_canonical.csv, the model works fine and gives me the probabilities. Here are the remaining eml_canonical.csv file and the corresponding output/probability file

eml_canonical_remaining.csv
output_ remaining.csv

Let's look in detail at the descriptors for each SMILES string

(1) For each SMILES, RDKit2DNormalized gives a descriptor vector of length 201
(2) For each SMILES mol, CalculateECFP4Fingerprint gives a fingerprint/descriptor vector of length 1024
(3) That means that, after concatenating these two types of descriptors, we get a total of 1224 descriptors for each SMILES

Here are the descriptors for both eml_canonical_remaining.csv and problematic_smiles.csv

descriptors_ remaining.csv
descriptors_problem.csv

Now I'm stuck on why these 7 SMILES are not getting probabilities. Going by the error message, it seems there may be some undefined values in the descriptors that cause the problem.
So I used the following snippet to check for NaN or infinity values in descriptors_problem.csv, but couldn't find any

import numpy as np
import pandas as pd

# Load your dataset
data = pd.read_csv('descriptors_problem.csv')

# Check for NAN or missing values
print(data.isnull().sum())

# Check for infinity values
print(np.isinf(data).sum())

The result is
descriptors 0
descriptors 0

I'm stuck here now. Any suggestions on how to proceed further?

Thanks.

@GemmaTuron
Member Author

Hi @ZakiaYahya

Great work, thanks for all the details. It can happen that some descriptors cannot be calculated for specific molecules (very large or very small molecules, molecules with certain metals...). What we need to do is add a try/except so that these molecules are skipped without making the whole model fail.
@HellenNamulinda can point you to a model where she did that!
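
A minimal sketch of the try/except idea, assuming a hypothetical featurize function that raises for molecules it cannot handle (this stand-in is not the model's descriptor code). Failed molecules get a None placeholder instead of being dropped, so the output keeps one row per input.

```python
def featurize(smiles):
    """Hypothetical stand-in for descriptor calculation; raises on inputs it cannot handle."""
    if "Pt" in smiles:  # pretend metal-containing molecules fail
        raise ValueError("descriptor could not be calculated")
    return [float(len(smiles))]


def featurize_all(smiles_list):
    """Featurize every molecule, keeping a None placeholder for the ones that fail."""
    features = []
    for smi in smiles_list:
        try:
            features.append(featurize(smi))
        except Exception as err:
            print(f"skipping {smi}: {err}")
            features.append(None)
    return features


inputs = ["CCO", "[Pt++].NC1CCCCC1N", "CC(=O)O"]
results = featurize_all(inputs)
print(len(inputs) == len(results))
```

Keeping the placeholder means the row count of the output always matches the input, which makes dropped molecules easier to spot.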

@ZakiaYahya
Collaborator

Ok right @GemmaTuron
@HellenNamulinda Can you kindly let me know in which model you did that work? It would be very helpful. Thanks

@HellenNamulinda

Hello @ZakiaYahya,
I worked on model eos2lqb, though there it was skipping invalid SMILES (SMILES that can't be converted), i.e. if mol is None: in main.py
While it doesn't have exactly the same issue, the idea of skipping can be applied.

I don't know whether you tried adding print statements to find the source of the error when you pass the problematic SMILES.
But from the code in your main.py, the model is likely failing at ecfp = ecfp.astype('float64'), where you convert the ecfp array to the 'float64' data type. This is where we need to catch the error

## produce ECFP fingerprint
ecfp = np.array(fingerprint.CalculateECFP4Fingerprint(mol[0])[0])
for i in range(1,len(mol)):
    fp = fingerprint.CalculateECFP4Fingerprint(mol[i])
    fp = np.array(fp[0])
    ecfp = np.vstack((ecfp,fp))
ecfp = ecfp.astype('float64')

Is the code in the repo the latest? I can clone it and look more into it.

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 7, 2023

Oh okay @HellenNamulinda
But it is not giving me null descriptors, I've checked that. But I'll give it a try.
Secondly, no, it is not failing at ecfp = ecfp.astype('float64'); it is failing at prediction time, i.e. pred = model.predict_proba(input_des). The main.py in the repo is not up to date; I've added a lot of print statements to my main.py now. I'll try your solution of ignoring problematic SMILES, though I don't think the null condition will work. I'll share my code with you if I need further help, but first let me try adding the ignore condition. Thanks for the help :)

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 7, 2023

Hello @GemmaTuron @HellenNamulinda
I've just checked: the mols are not None in my case.
For your information @GemmaTuron, it fails locally for those SMILES with ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). But when I tested it within ersilia using --repo-path, it does not fail or give any error: it produces the output predictions and just leaves the probability column empty for the problematic SMILES, returning the same number of rows as the input.

Any suggestions @GemmaTuron @miquelduranfrigola @DhanshreeA to resolve this, as the null condition is not working here?
Thanks

@HellenNamulinda

@ZakiaYahya,
Sorry, let me give more insight.
For this model, our main issue is not checking if mol is None: (that can be added); rather, we need to check the input that the model is accepting.
And sorry, I had pointed to the wrong model input. The model input is input_des, not ecfp.

The model uses the sklearn package, which is the one raising that error, meaning the input to the model contains either NaN or inf.
So we need to check whether input_des contains NaN, infinity or a value too large before passing it to model.predict_proba
Since there are so many features, we can try with only the problematic_smiles file and print those values for the first input

problematic_values = np.isnan(input_des[0]) | np.isinf(input_des[0]) | (np.abs(input_des[0]) > np.finfo(np.float64).max)

if np.any(problematic_values):
    problematic_indices = np.where(problematic_values)[0]
    problematic_data = input_des[0][problematic_values]
    for index, value in zip(problematic_indices, problematic_data):
        print("Index:", index, "Value:", value)

For the problematic file shared, problematic_smiles.csv, it prints

Index: 39 Value: nan
Index: 41 Value: nan
Index: 43 Value: nan
Index: 45 Value: nan

All the above is just for debugging; it is not part of the solution.

We can handle those NaN values, for example:

# replacing NAN with 0
input_des[np.isnan(input_des)] = 0.0

using run.sh
Input file: eml_canonical.csv
Output: eml_output_all.csv

Here is the edited code for main.py file; main.txt

This has not been tested within ersilia, e.g. using --repo_path

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 8, 2023

Thank you so much @HellenNamulinda
This is very helpful indeed. I had checked the problematic_smiles.csv descriptors for NaN or infinity before, but the check didn't find any, suggesting there were no NaN or infinity values. Maybe the commands I used couldn't pick up the values properly.
But I tried your solution and yes, it points out the NaN or infinity values, which is what we needed. Thanks for pointing that out. But it is still not processing the whole file. I don't know how it works on your side, because the main.txt you shared handles only index 0 of input_des for eml_canonical, which in my case is fine. So I didn't understand how this main.txt processed all the SMILES from eml_canonical and gave an output as well. Let me iterate your solution over all of input_des.
Thanks.

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 8, 2023

Hello @GemmaTuron @HellenNamulinda
I've iterated that solution over all of input_des and it works fine locally, giving me output probabilities for all SMILES strings, including the problematic ones (as we replace NaN with zeros).

for i in range(len(input_des)):
    problematic_values = np.isnan(input_des[i]) | np.isinf(input_des[i]) | (np.abs(input_des[i]) > np.finfo(np.float64).max)
    if np.any(problematic_values):
        problematic_indices = np.where(problematic_values)[0]
        problematic_data = input_des[i][problematic_values]
        for index, value in zip(problematic_indices, problematic_data):
            print("Index:", index, "Value:", value)
            
        
    # Handle NaN, replacing NAN with 0.0
    nan_indices = np.isnan(input_des[i])
    input_des[i][nan_indices] = 0.0

So for eml_canonical.csv there are 443 input SMILES and it also gives 443 output probabilities. Here is the output file when tested with run.sh
Output_ locally.csv

But when I tested it within ersilia using --repo-path, it seems to skip some of the input SMILES and I don't know why: for 443 input SMILES, the output file contains only 438 entries. Here is the output file

Output_ Ersilia.csv
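
The per-row loop above can also be written as whole-array operations. This is a sketch on a small made-up input_des; unlike the loop, it also zeroes infinite values, which is an extra assumption rather than something settled in this thread.

```python
import numpy as np

# Made-up descriptor matrix: 3 molecules x 4 features.
input_des = np.array(
    [
        [0.1, np.nan, 2.0, 3.0],
        [1.0, 3.0, 4.0, 5.0],
        [np.inf, 0.5, np.nan, 1.5],
    ]
)

# True wherever a value is NaN or +/-inf.
bad_mask = ~np.isfinite(input_des)

# Rows (molecules) that contain at least one problematic value.
bad_rows = np.where(bad_mask.any(axis=1))[0]
print("rows with non-finite values:", bad_rows.tolist())

# Replace every problematic value with 0.0, leaving the rest untouched.
cleaned = np.where(bad_mask, 0.0, input_des)
print(np.isfinite(cleaned).all())
```

Zero is only a placeholder value here; whether it is a sensible imputation for these descriptors is a separate question.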

@miquelduranfrigola
Member

Thanks @ZakiaYahya

This is strange. @ZakiaYahya - Could you share the current code of this model?

@ZakiaYahya
Collaborator

Hello @miquelduranfrigola
Yes, it is very strange. I'm trying to figure out which SMILES Ersilia is skipping, because I'm not discarding the problematic SMILES with NaN values, I simply replace NaN with zeros. It works fine with run.sh, giving the same number of output entries as input entries, but with --repo-path it shows this behaviour. I've just pushed the latest code to my repo, here's the link https://github.com/ZakiaYahya/eos46ev
Thanks.

@ZakiaYahya
Collaborator

Hello @miquelduranfrigola @GemmaTuron @DhanshreeA
I've compared the output files I got from run.sh and --repo-path, and these are the values that go missing when running the repo with --repo-path

Index: 100, Row: COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl
Index: 202, Row: CN1CC[C@@]23[C@H]4CCC(=O)[C@@H]2Oc5c(O)ccc(C[C@@H]14)c35
Index: 279, Row: CCCCC(C)(O)C/C=C/[C@H]1[C@H](O)CC(=O)[C@@H]1CCCCCCC(=O)OC
Index: 305, Row: [Pt++].NC1CCCCC1N.[O-]C(=O)C([O-])=O
Index: 408, Row: O=C1CCC(N2C(=O)c3ccccc3C2=O)C(=O)N1
Index: 442, Row: O.OC(Cn1ccnc1)([P](O)(O)=O)[P](O)(O)=O

And these are not even the SMILES that were giving NaN values. I'm still not able to identify the source of the problem. Any suggestion would be helpful.
Thanks.
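
A comparison like the one above can be scripted. The sketch below assumes each output file has the SMILES in its first column and no header; the file contents are made up for illustration.

```python
import csv
import io


def first_column(handle):
    """Read the first column of a header-less CSV into a list."""
    return [row[0] for row in csv.reader(handle) if row]


def missing_rows(reference, candidate):
    """Return (index, smiles) pairs present in reference but absent from candidate."""
    seen = set(candidate)
    return [(i, smi) for i, smi in enumerate(reference) if smi not in seen]


# Made-up stand-ins for the run.sh and --repo-path output files.
run_sh_output = io.StringIO("CCO,0.10\nCC(=O)O,0.07\nc1ccccc1,0.20\n")
ersilia_output = io.StringIO("CCO,0.10\nc1ccccc1,0.20\n")

missing = missing_rows(first_column(run_sh_output), first_column(ersilia_output))
for index, smi in missing:
    print(f"Index: {index}, Row: {smi}")
```

A set comparison ignores duplicates, so repeated SMILES in the input would need a separate count (for example with collections.Counter).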

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 10, 2023

Hello @GemmaTuron
Kindly have a look at the above comment. I still can't figure out why it works fine and gives the same number of outputs as inputs when tested with run.sh, while with --repo-path it skips some of the SMILES, as pointed out in the above comment, giving fewer outputs than inputs. The updated code is here https://github.com/ZakiaYahya/eos46ev
Can you please try to see if you are able to reproduce the behaviour?
Thanks.

@GemmaTuron
Member Author

Hi @ZakiaYahya
After doing some tests I have identified discrepancies between passing single inputs in the CLI vs full .csv files.
For example, for this molecule, when I calculate it directly through the CLI I get the following:
ersilia api -i "CC(=O)O"

{
    "input": {
        "key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
        "input": "CC(=O)O",
        "text": "CC(=O)O"
    },
    "output": {
        "probability": null
    }
}

Whereas if I pass a .csv file containing the molecule I get
QTBSBXVTEAMEQO-UHFFFAOYSA-N CC(=O)O 0.0754274023259567

For the molecules you pasted above I reproduce what you describe

I am more puzzled by the CLI vs .csv file differences at this point, can you check if that is happening to you as well?

@DhanshreeA
Member

DhanshreeA commented Jul 11, 2023

@GemmaTuron I have certainly been able to reproduce what's happening with you and @ZakiaYahya:

  1. When I run the model with run.sh I indeed get the output csv of length 443 vs when I run it with repo_path and I get 438 entries.
  2. However, when I run the model for a single input through ersilia CLI, I get probability null for every input molecule from the eml_canonical file (including the ones @ZakiaYahya has mentioned here: Clean UP & Dockerization eos46ev #3 (comment))

It seems the CLI is running into an internal server error. Eg, I tried using the bento server endpoint from the swagger UI, and I see an error there:
Screenshot from 2023-07-11 12-10-57

@DhanshreeA
Member

What's more interesting is, when I looked at the serve.log in the temp directories that ersilia creates for a particular run, I see that only 434 molecules are being correctly processed (not even the 437 that we're getting from repo_path)

Attaching the serve.log here, as well as plain molecules

eos46ev_mols.txt
eos_serve.log

@ZakiaYahya I need your help, can you check if some of the molecules in Output_Ersilia.csv file (in your comment here: #3 (comment)) are being repeated?

@DhanshreeA
Member

@miquelduranfrigola how do I update the log level for the ersilia CLI if I want to see what is happening at every step?

@miquelduranfrigola
Member

Thanks @DhanshreeA , @ZakiaYahya and @GemmaTuron

This is an issue on the Ersilia CLI side, most likely. Let me set up everything to troubleshoot this model. I will keep you posted.

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 11, 2023

Yes @GemmaTuron,
I am experiencing the same CLI vs .csv discrepancy, and I mentioned it in this comment as well: #3 (comment)
I haven't figured it out yet, because it works perfectly fine with run.sh and only behaves strangely when run within the Ersilia CLI. I think it's more likely that Ersilia, rather than the model, is not processing those SMILES, because if it were a model issue it would show the same thing when tested with run.sh

@ZakiaYahya
Collaborator

@DhanshreeA Hmm, yes, I can check whether molecules are repeated in Output_Ersilia.csv. But even if that is the case and SMILES are repeated in eml_canonical.csv, I don't think that is the problem. Right?

@GemmaTuron
Member Author

@ZakiaYahya and @DhanshreeA

This is extremely surprising. Let us think a bit about it.

@ZakiaYahya
Collaborator

Hi @miquelduranfrigola
Right, no problem at all. Once you're done, let me know and I'll test it on my side. Thanks

@DhanshreeA
Member

A few more updates here, @ZakiaYahya and I spent some time with this model today during our 1:1 and it would appear that before the refactor, the model worked with single inputs (using its predict api)
We checked out the last commit before @ZakiaYahya 's refactor (e3119ba), and did the usual fetch > serve > api run/predict.

@miquelduranfrigola
Member

Thanks @DhanshreeA and @ZakiaYahya - this is helpful.

@miquelduranfrigola
Member

I think I've fixed it: 19178b1

The problem was with the main.py.

The following code was skipping the first SMILES from the input file, assuming it was the header:

smiles_df = pd.read_csv(smiles_file)
smiles = smiles_df[smiles_df.columns[0]].tolist()
mol = [Chem.MolFromSmiles(x) for x in smiles]

I have replaced it by:

smiles = []
with open(smiles_file, "r") as f:
    reader = csv.reader(f)
    for r in reader:
        smiles += [r[0]]
mol = [Chem.MolFromSmiles(x) for x in smiles]

Please note that in the run.sh, the input file is always assumed to have no header.

As a result, the old code was skipping the first smiles provided.

  • Therefore, if one smiles was passed, we were obtaining a null
  • If >100 smiles were passed, we were obtaining a smaller number of outputs. The reason for this is that, internally, ersilia works in batches of 100. So, if we pass 300 compounds, we lose 3 molecules; if we pass 3000, we lose 30 molecules, etc.

This may explain the weird number of outputs discussed in this thread.

The model was not failing at fetch time because fetch tries more than one molecule and does not do stringent checks on the output.

It is becoming very apparent that we need to do tests before storing the model. Today we will discuss this (@pittmanriley is setting up the placeholders for this).
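
The batch arithmetic above can be checked with a short simulation. The batch size of 100 is taken from the explanation; the function name and the dummy inputs are illustrative only.

```python
def simulate_header_skip(items, batch_size=100):
    """Split items into batches and drop the first item of each, as a header-assuming reader would."""
    kept = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        kept.extend(batch[1:])  # first row is mistaken for a header
    return kept


for n in (1, 300, 443, 3000):
    molecules = [f"mol{i}" for i in range(n)]
    kept = simulate_header_skip(molecules)
    print(n, "inputs ->", len(kept), "outputs,", n - len(kept), "lost")
```

With 443 inputs the simulation keeps 438, which matches the 438 entries reported earlier in this thread for the repo-path runs, and a single input yields nothing, consistent with the null results from the CLI.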

@ZakiaYahya
Collaborator

Alright @miquelduranfrigola, I think that could be the reason. Let me check with these changes and test it locally.
Thanks.

@ZakiaYahya
Collaborator

ZakiaYahya commented Jul 11, 2023

Hello @miquelduranfrigola
I just fetched the latest code of the eos46ev model and tested it with the whole eml_canonical.csv, and it is skipping the probabilities for a lot of SMILES. Here's the output file, kindly have a look
eos46ev_new_output.csv

The problem still remains the same.

@GemmaTuron
Member Author

@miquelduranfrigola

Good catch about the header, this is probably happening in more than one model. @ZakiaYahya I'll try large files to see how many outputs we get.

@ZakiaYahya
Collaborator

Right @GemmaTuron
Let us know once you're done testing. For today, I think I should leave this model and wait for your response, as I have worked a lot on it but haven't been able to catch the problem yet.

@GemmaTuron
Member Author

GemmaTuron commented Jul 13, 2023

@ZakiaYahya

I have re-fetched the model (on macOS Ventura, M2) and was able to produce good predictions for the whole EML list (except for a couple of molecules, which is expected). I am at a loss as to why you could not. @miquelduranfrigola do you have an idea? Might there be an internal checkpoint in the model where it skips molecules if they take too long?
out_eml.csv

@ZakiaYahya
Collaborator

Hi @GemmaTuron
Did you test it with the code from before refactoring or the one I pushed recently? I don't know why I'm not getting probabilities for all SMILES.

@GemmaTuron
Member Author

I pulled the latest image from dockerhub...

@ZakiaYahya
Collaborator

@GemmaTuron
Let me check it on Colab and Docker as well, because on my side it is causing issues on the CLI, while it works fine with run.sh. (By the way, I'm taking this point from today's meeting.)
Plus, can you try it with eml_canonical as well? Just as a check, it would be really helpful.
Thanks.

@DhanshreeA
Member

I think one thing to point out is that run.sh only runs the main.py file. It bypasses the BentoML serving and how ersilia processes the output from the server. So the issue can be on any of these three fronts: the main.py file itself, how the service.py file is processing the output.csv, and finally what the ersilia CLI is getting and how it is reformatting it.

@GemmaTuron
Member Author

Thanks @DhanshreeA for the clarification. I am puzzled as to why the model would behave differently, because if predictions can be obtained, they should not appear as null.
@ZakiaYahya to be sure, could you add a print statement to your code so that, when running the CLI, it prints each result? That way we can see whether it is actually able to predict (it should, since it works with run.sh) and the error is elsewhere.
I have already tried with eml canonical, see the attached result above

@ZakiaYahya
Collaborator

Hello @GemmaTuron
Sure, let me add the print statements and check. I'm so confused by this model: using the latest code after refactoring, it gives me 437 outputs in the file out of 442 (it skips the SMILES entirely, not just the output), but when I run the code from the commit before refactoring, it does not skip the SMILES but skips the probabilities, returning null for many SMILES strings. Let me stick to the refactored code to avoid confusion. I'll try to check the behaviour using print statements.
Thanks

@ZakiaYahya
Collaborator

Hello @GemmaTuron @DhanshreeA
I've tested it from scratch. I was confused by the different versions of the code (the refactored one, the original one, and the refactored one plus Miquel's changes), which caused a lot of trouble in testing. I cloned the repo again from ersilia, incorporated Miquel's changes for reading SMILES to avoid skipping any, and then added the NaN handling to avoid null outputs; before that, I also pulled the latest ersilia code. Finally, the code is working perfectly fine. I've pushed the changes and opened a PR for it. Here are the eml_canonical outputs
eos46ev_test.csv
Thanks

@GemmaTuron
Member Author

fantastic, thanks @ZakiaYahya for the persistence in this issue!
