Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAJOR-BUG for node/edge predictions: Added ordered=True to to_mol #503

Merged
merged 6 commits into from
Jan 19, 2024

Conversation

DomInvivo
Copy link
Collaborator

@DomInvivo DomInvivo commented Jan 18, 2024

A major bug was discovered when working with node, nodepair and edge-level tasks.

The bug is due to RDKIT's weird behaviour in that it takes the AtomMapping and displays them correctly, but doesn't use them as indices. Unless you specify that you want to canonicalize the molecule. In case you have AtomMaps defined, it will NOT search for a canonical form, instead it will use the AtomMaps.

Below is the weird behaviour by RDKIT, the big numbers next to the carbons are the AtomMapping that are used to order the atoms with the same ordering as the labels. The small numbers are the indices. When visualizing the molecules to validate that the code was working, I didn't not look at the intrinsic indices, but only at the big number, assuming that the indices were correct.

image

I'm sorry for the inconvenience I have caused everyone... This bug was discovered this morning when trying to fix #467 .

I have validated the fix experimentally. Simply adding ordered=True in dm.to_mol yields massive improvements on both the PCQM4M and the PM6 datasets. In the plots below, I randomly subsample 20k molecules and run an MPNN model on the node-level tasks only. You can see drastic improvements.

What surprises me is that the metrics on PCQM are not that bad, especially since it can reach r2=0.85 on the 4M large-mix dataset. This led me to believe that the ordering was correct. But as you can see here, we can reach r2=0.95 with only 20k molecules and 100 epochs.

image

@DomInvivo DomInvivo added bug Something isn't working ASAP priority Major thing to do or fix labels Jan 18, 2024
Copy link

codecov bot commented Jan 18, 2024

Codecov Report

Merging #503 (7fd861c) into main (f698df4) will decrease coverage by 1.81%.
The diff coverage is 28.57%.

❗ Current head 7fd861c differs from pull request most recent head 17861c9. Consider uploading reports for the commit 17861c9 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #503      +/-   ##
==========================================
- Coverage   71.52%   69.72%   -1.81%     
==========================================
  Files          93       93              
  Lines        8707     8710       +3     
==========================================
- Hits         6228     6073     -155     
- Misses       2479     2637     +158     
Flag Coverage Δ
unittests 69.72% <28.57%> (-1.81%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
ipu 29.75% <ø> (-19.40%) ⬇️

@DomInvivo DomInvivo changed the title MAJOR-BUG: Added ordered=True to to_mol MAJOR-BUG for node/edge predictions: Added ordered=True to to_mol Jan 19, 2024
Copy link
Contributor

@s-maddrellmander s-maddrellmander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Looking through the rest of the code base we have a number of other places where this pattern is used that haven't been updated, is that intentional or do we need the same logic applied. i.e. is this only needed to be applied to the mol objects when using dm.to_mol or with the smiles as well? I've listed a few examples below, but searching the repo gives more.
graphs = []
--
for s in tqdm(smiles):
   mol = dm.to_mol(s)
   graphs.append(mol_to_graph_dict(mol, **featurizer))
 
print(graphs[0])
mol = dm.to_mol(mol=smiles)
mol_id = dm.unique_id(mol)

if isinstance(mol, str):
    mol = dm.to_mol(mol)
  • Most obviously in the test_featurizer.py - which if that needs changing to be ordered we could add an additional check to make sure that molecules are ordered, or if only mol objects need ordering a separate check could be added please?
err_msg = f"\n\tError for params:\n\t\tSMILES: {s}"
mol = dm.to_mol(s)
  • I would appreciate it if a test could be added that confirms the ordering and to ensure this can't silently break?

Copy link
Contributor

@s-maddrellmander s-maddrellmander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for checking!

@DomInvivo
Copy link
Collaborator Author

Thanks @s-maddrellmander for reviewing that. I have added comments on why the other calls to to_mol don't need ordered=True.

In your examples, in order, here are the reasons:

  1. This is just some profiling code. It could be added or not, but it's just to test the speed. Doesn't serve a use case.
  2. This function only returns the unique_id, and this doesn't depend on the atom ordering, only on the graph structure.
  3. I added a depreciation warning for all functions in graphium.features.properties.py to use datamol.to_fp instead. Also, fingerprints don't depend on the node ordering.

Regarding the testing, I have create issue #504 . I currently do not have time to test it properly and believe I should merge the fix, and test it later. I am confident that it will work thanks to the much better training dynamics observed above. If you disagree, and think that it should be tested before being merged, let me know.

@DomInvivo DomInvivo merged commit 7771164 into main Jan 19, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ASAP priority Major thing to do or fix bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong ordering of nodes/edges in multitasking for node/edge level tasks
2 participants