MAJOR-BUG for node/edge predictions: Added ordered=True to `to_mol` #503

DomInvivo · 2024-01-18T23:18:34Z

A major bug was discovered when working with node, nodepair and edge-level tasks.

The bug is due to RDKIT's weird behaviour in that it takes the AtomMapping and displays them correctly, but doesn't use them as indices. Unless you specify that you want to canonicalize the molecule. In case you have AtomMaps defined, it will NOT search for a canonical form, instead it will use the AtomMaps.

Below is the weird behaviour by RDKIT, the big numbers next to the carbons are the AtomMapping that are used to order the atoms with the same ordering as the labels. The small numbers are the indices. When visualizing the molecules to validate that the code was working, I didn't not look at the intrinsic indices, but only at the big number, assuming that the indices were correct.

I'm sorry for the inconvenience I have caused everyone... This bug was discovered this morning when trying to fix #467 .

I have validated the fix experimentally. Simply adding ordered=True in dm.to_mol yields massive improvements on both the PCQM4M and the PM6 datasets. In the plots below, I randomly subsample 20k molecules and run an MPNN model on the node-level tasks only. You can see drastic improvements.

What surprises me is that the metrics on PCQM are not that bad, especially since it can reach r2=0.85 on the 4M large-mix dataset. This led me to believe that the ordering was correct. But as you can see here, we can reach r2=0.95 with only 20k molecules and 100 epochs.

codecov · 2024-01-18T23:23:11Z

Codecov Report

Merging #503 (7fd861c) into main (f698df4) will decrease coverage by 1.81%.
The diff coverage is 28.57%.

❗ Current head 7fd861c differs from pull request most recent head 17861c9. Consider uploading reports for the commit 17861c9 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #503      +/-   ##
==========================================
- Coverage   71.52%   69.72%   -1.81%     
==========================================
  Files          93       93              
  Lines        8707     8710       +3     
==========================================
- Hits         6228     6073     -155     
- Misses       2479     2637     +158

Flag	Coverage Δ
unittests	`69.72% <28.57%> (-1.81%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
ipu	`29.75% <ø> (-19.40%)`	⬇️

s-maddrellmander

Looking through the rest of the code base we have a number of other places where this pattern is used that haven't been updated, is that intentional or do we need the same logic applied. i.e. is this only needed to be applied to the mol objects when using dm.to_mol or with the smiles as well? I've listed a few examples below, but searching the repo gives more.

graphs = []
--
for s in tqdm(smiles):
   mol = dm.to_mol(s)
   graphs.append(mol_to_graph_dict(mol, **featurizer))
 
print(graphs[0])

mol = dm.to_mol(mol=smiles)
mol_id = dm.unique_id(mol)

if isinstance(mol, str):
    mol = dm.to_mol(mol)

Most obviously in the test_featurizer.py - which if that needs changing to be ordered we could add an additional check to make sure that molecules are ordered, or if only mol objects need ordering a separate check could be added please?

err_msg = f"\n\tError for params:\n\t\tSMILES: {s}"
mol = dm.to_mol(s)

I would appreciate it if a test could be added that confirms the ordering and to ensure this can't silently break?

s-maddrellmander

Cool, thanks for checking!

DomInvivo · 2024-01-19T14:54:36Z

Thanks @s-maddrellmander for reviewing that. I have added comments on why the other calls to to_mol don't need ordered=True.

In your examples, in order, here are the reasons:

This is just some profiling code. It could be added or not, but it's just to test the speed. Doesn't serve a use case.
This function only returns the unique_id, and this doesn't depend on the atom ordering, only on the graph structure.
I added a depreciation warning for all functions in graphium.features.properties.py to use datamol.to_fp instead. Also, fingerprints don't depend on the node ordering.

Regarding the testing, I have create issue #504 . I currently do not have time to test it properly and believe I should merge the fix, and test it later. I am confident that it will work thanks to the much better training dynamics observed above. If you disagree, and think that it should be tested before being merged, let me know.

Added ordered=True to to_mol

c315e50

DomInvivo added bug Something isn't working ASAP priority Major thing to do or fix labels Jan 18, 2024

DomInvivo changed the title ~~MAJOR-BUG: Added ordered=True to to_mol~~ MAJOR-BUG for node/edge predictions: Added ordered=True to to_mol Jan 19, 2024

s-maddrellmander suggested changes Jan 19, 2024

View reviewed changes

DomInvivo added 4 commits January 19, 2024 09:36

Added comments explaining why ordered=True is not needed

df8d193

Depreciating the properties.py file

19055b2

Black linting

7fd861c

Documented why profile_mol_to_graph doesn't need ordered

4c19068

s-maddrellmander approved these changes Jan 19, 2024

View reviewed changes

black linting

17861c9

DomInvivo merged commit 7771164 into main Jan 19, 2024
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAJOR-BUG for node/edge predictions: Added ordered=True to `to_mol` #503

MAJOR-BUG for node/edge predictions: Added ordered=True to `to_mol` #503

DomInvivo commented Jan 18, 2024 •

edited

Loading

codecov bot commented Jan 18, 2024 •

edited

Loading

s-maddrellmander left a comment

s-maddrellmander left a comment

DomInvivo commented Jan 19, 2024

MAJOR-BUG for node/edge predictions: Added ordered=True to to_mol #503

MAJOR-BUG for node/edge predictions: Added ordered=True to to_mol #503

Conversation

DomInvivo commented Jan 18, 2024 • edited Loading

codecov bot commented Jan 18, 2024 • edited Loading

Codecov Report

s-maddrellmander left a comment

Choose a reason for hiding this comment

s-maddrellmander left a comment

Choose a reason for hiding this comment

DomInvivo commented Jan 19, 2024

MAJOR-BUG for node/edge predictions: Added ordered=True to `to_mol` #503

MAJOR-BUG for node/edge predictions: Added ordered=True to `to_mol` #503

DomInvivo commented Jan 18, 2024 •

edited

Loading

codecov bot commented Jan 18, 2024 •

edited

Loading