unique_id doesn't require MD5 hashing? #220

DomInvivo · 2023-12-13T17:19:58Z

In the unique_id function, an MD5 hash is computed on the InChIKey.

According to the InChI API reference for GetINCHIKeyFromINCHI, InChIKeys are zero-terminated strings that are written into a 28-byte buffer: https://www.inchi-trust.org/download/104/InChI_API_Reference.pdf (page 24). The full InChI string can be longer, but the key is apparently made from SHA-256 hashes of parts of the InChI string. Wikipedia describes the format of the key as XXXXXXXXXXXXXX-YYYYYYYYFV-P, with 14 uppercase characters, a dash, 10 uppercase characters, a dash, and 1 uppercase character, so it could probably be reduced further, though I'd want to find a source from IUPAC for that info before relying on it.

The MD5 hash has 32 characters, whereas the InChIKey has 27, so the hash is longer than the text it was computed from. Is there a purpose to having it?
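To make the length comparison concrete, here is a minimal sketch using only the Python standard library. It assumes the ID is simply the MD5 hex digest of the key string; the helper name `md5_hex` is made up for illustration, the regular expression encodes the 14-10-1 layout quoted above, and the sample key is a standard InChIKey quoted later in this thread.

```python
import hashlib
import re

# Layout quoted above: 14 letters, dash, 10 letters, dash, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def md5_hex(key: str) -> str:
    """Return the MD5 digest of an ASCII key as a lowercase hex string."""
    return hashlib.md5(key.encode("ascii")).hexdigest()


key = "VERCQLOBLOLFMW-UHFFFAOYSA-N"  # sample key quoted in this thread
digest = md5_hex(key)

print(len(key), len(digest))  # 27 32
print(bool(INCHIKEY_PATTERN.fullmatch(key)))  # True
```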

ndickson-nvidia · 2023-12-13T17:35:30Z

The IUPAC source describing the InChIKey format is here: https://www.inchi-trust.org/technical-faq/#13.1

An InChIKey should always be 27 characters, the 2 dash characters are redundant, and the "FV" characters might always be "NA" in the unique_id function (N for non-standard, and A for InChI version 1). Even without removing the FV characters, 25 uppercase characters can be encoded in 16 bytes fairly easily (or 15 bytes slightly less easily).
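As a rough illustration of the 15-byte variant, the 25 letters can be read as the digits of a base-26 integer, which fits because 26^25 < 2^120. This is only a sketch in Python; `pack_inchikey` and `unpack_inchikey` are hypothetical names, not functions from Datamol or RDKit.

```python
def pack_inchikey(key: str) -> bytes:
    """Encode the 25 letters of an InChIKey (dashes dropped) in 15 bytes."""
    if len(key) != 27 or key[14] != "-" or key[25] != "-":
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    letters = key[:14] + key[15:25] + key[26]
    if not all("A" <= c <= "Z" for c in letters):
        raise ValueError(f"non-uppercase character in: {key!r}")
    value = 0
    for c in letters:
        value = value * 26 + (ord(c) - ord("A"))  # one base-26 digit per letter
    return value.to_bytes(15, "big")  # safe because 26**25 < 2**120


def unpack_inchikey(packed: bytes) -> str:
    """Invert pack_inchikey, restoring the two dashes."""
    value = int.from_bytes(packed, "big")
    letters = []
    for _ in range(25):
        value, digit = divmod(value, 26)
        letters.append(chr(ord("A") + digit))
    s = "".join(reversed(letters))  # digits came out least significant first
    return f"{s[:14]}-{s[14:24]}-{s[24]}"


packed = pack_inchikey("VERCQLOBLOLFMW-UHFFFAOYSA-N")
print(len(packed))  # 15
print(unpack_inchikey(packed))  # VERCQLOBLOLFMW-UHFFFAOYSA-N
```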

maclandrol · 2023-12-13T18:58:39Z

The main purpose is to differentiate it from an InChIKey, and especially to prevent people from interpreting it as one, since we are using a non-standard InChI, which tries to improve the uniqueness of tautomeric forms of molecules. As the original InChIKey format, interpretation, and structure might not be respected here, we re-hash to avoid any misinterpretation.

hadim · 2023-12-14T18:20:13Z

@maclandrol said it all - the only reason for it is to avoid confusion with the regular inchikey.

@DomInvivo I would actually recommend moving to dm.hash_mol instead, which relies on a molecule-hashing function natively supported by rdkit (a contribution from folks at Schrödinger).

hadim · 2023-12-14T18:20:22Z

Closing here, but feel free to re-open!

ndickson-nvidia · 2023-12-14T20:31:45Z

No worries! The context was that Graphium currently uses:

mol = dm.to_mol(mol=smiles)
mol_id = dm.unique_id(mol)

to get a unique ID from a SMILES string for identifying if a molecule occurs in multiple datasets. I'm in the process of trying to move most of Graphium's dataset preparation to C++. It wouldn't need to use Datamol for this anymore, so I don't really need any changes in Datamol, and creating an InChIKey from a molecule is probably much slower than doing an MD5 hash. My current plan is to compact the 25 letters of each InChIKey into a pair of 64-bit integers, (26^13 < 2^64, so it's pretty simple), and use the pairs as keys, instead of using 32-character strings as keys.
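The plan itself targets C++, but the arithmetic is easy to check in Python first. The sketch below splits the 25 letters into 13 + 12 and reads each part as a base-26 number; `inchikey_to_u64_pair` is a hypothetical name, and `struct.pack` is used only to confirm that both halves really fit in unsigned 64-bit fields.

```python
import string
import struct


def _base26(letters: str) -> int:
    """Interpret uppercase letters as digits of a base-26 number."""
    value = 0
    for c in letters:
        value = value * 26 + (ord(c) - ord("A"))
    return value


def inchikey_to_u64_pair(key: str) -> tuple[int, int]:
    """Map an InChIKey to two integers that each fit in 64 bits."""
    letters = key.replace("-", "")
    if len(letters) != 25 or any(c not in string.ascii_uppercase for c in letters):
        raise ValueError(f"expected 25 uppercase letters in: {key!r}")
    # 13 letters in the first word (26**13 < 2**64), 12 in the second.
    return _base26(letters[:13]), _base26(letters[13:])


pair = inchikey_to_u64_pair("VERCQLOBLOLFMW-UHFFFAOYSA-N")
print(len(struct.pack("<QQ", *pair)))  # 16

# Tuples of ints are hashable, so the pair can be used directly as a dict key.
seen = {pair: ["dataset_a"]}
```

In C++ the same pair could be stored as two `uint64_t` values; how it is hashed for lookup there is a separate design choice that this sketch does not cover.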

hadim · 2023-12-14T21:21:07Z

@ndickson-nvidia ok, sounds good, and thanks for the context. Your approach seems legit and fits into a pair of 64-bit integers, but I just want to raise a point about using the "standard" inchikey.

It's very subtle, but the "standard" inchikey cannot differentiate between tautomers (two molecules that look very similar but where a few hydrogens are located at different positions of the graph). In certain cases that's totally fine, but in other cases it might not be what you want, depending on the downstream application.

To give you more context about the above, I am copying/pasting here a small blurb I have made in the past:

[... the topic of the convo is about finding a good way to generate unique ID for a molecular dataset described as SMILES..]

First, SMILES is likely a suboptimal choice: its length is variable, and the SMILES generation algorithm can differ between chemoinformatics libraries and even across rdkit versions (see for example rdkit/rdkit#4919 (comment)).

InChI and InChIKey are an international standard that is well respected within the field, and any implementation (rdkit or other software) will generate the same InChI/InChIKey for the same molecule.

That being said, InChI and InChIKey in their default implementation do not preserve the hydrogen layer information. As a consequence, two tautomers of the same molecule will generate the exact same InChIKey. See the example below:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.to_inchikey(mol1)  # VERCQLOBLOLFMW-UHFFFAOYSA-N
dm.to_inchikey(mol2)  # VERCQLOBLOLFMW-UHFFFAOYSA-N

If you want to differentiate between tautomers (and here it really depends on the use case, because sometimes this is not wanted), you can use a non-standard InChIKey that considers the hydrogen layers. Most software does not allow generating such keys, and you should be very careful because they look similar to a standard InChIKey (so be sure to document that this is not a standard InChIKey).

In datamol, we've added a function to generate such a non-standard InChIKey and added a simple MD5 hash on top of it to prevent confusion with the standard InChIKey. This is what we call the unique_id:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.unique_id(mol1)  # defe59076164fa24a3bbc365dc965b4b
dm.unique_id(mol2)  # 6316e004feb731d6981b5e28d3ae3f98

We developed unique_id a few years ago, but more recently folks at Schrödinger added a new molecular hashing function directly within rdkit that provides the same type of features. PR at rdkit/rdkit#5360. You can also find an RDKit UGM talk about it at https://github.com/rdkit/UGM_2022/blob/main/Presentations/nealschneider_introducing_registrationhash_lightning.pdf

That function is also available in datamol:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.hash_mol(mol1)  # 94f493d321cbd83c888e5f3f7e2e8054f956499f
dm.hash_mol(mol2)  # 0ef289740c9ba4d7af02c5d86fc2139fe938739a

So to summarize, before deciding on an ID I would recommend first defining exactly what it means for two molecules to be identical in your particular context, because this is a loose definition (due to the intrinsic physical, 3D and dynamic aspects of molecular systems).

I would also recommend storing a SMILES for every data point, so you can easily reconstruct the molecular object, in addition to one or multiple ID columns (inchikey, unique_id, hashmol, etc).
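To show why the choice of ID column matters, here is a small sketch that needs only the Python standard library. The SMILES strings and IDs are copied from the examples earlier in this comment; the record layout and the `group_by` helper are made up for illustration.

```python
from collections import defaultdict

# One record per data point: the SMILES plus several precomputed ID columns.
records = [
    {
        "smiles": "CN=c1cc[nH]cn1",
        "inchikey": "VERCQLOBLOLFMW-UHFFFAOYSA-N",
        "unique_id": "defe59076164fa24a3bbc365dc965b4b",
    },
    {
        "smiles": "CN=c1ccnc[nH]1",
        "inchikey": "VERCQLOBLOLFMW-UHFFFAOYSA-N",
        "unique_id": "6316e004feb731d6981b5e28d3ae3f98",
    },
]


def group_by(rows, id_column):
    """Group SMILES strings by the value of the chosen ID column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_column]].append(row["smiles"])
    return dict(groups)


print(len(group_by(records, "inchikey")))   # 1: the two tautomers collapse
print(len(group_by(records, "unique_id")))  # 2: the two tautomers stay apart
```

Keeping the SMILES next to the IDs means the grouping can be redone later with a different notion of identity without recomputing anything from scratch.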

Hope it helps!

hadim · 2023-12-14T21:24:52Z

Now back to your use case. Assuming you want to generate mol IDs that are sensitive to the hydrogen layer, I don't think you can use molhash from rdkit, since the implementation is on the Python side.

That being said, the inchikey from rdkit lives on the C++ side, so you could easily tune the options to generate a non-standard inchikey if you want to from C++. See the equivalent datamol function for help choosing the right parameters:

datamol/datamol/convert.py

Line 279 in 5d1cde1

def to_inchikey_non_standard(

Hope it helps!

ndickson-nvidia · 2023-12-14T21:54:52Z

Yep, I'm calling the same RDKit C++ function that eventually gets called by to_inchikey_non_standard, with the same options, so the only difference from unique_id should be that I'm skipping the MD5 hash:

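#include <GraphMol/SmilesParse/SmilesParse.h>
#include <INCHI-API/inchi.h>

// smiles_string holds the input SMILES; SmilesToMol may return null if it cannot be parsed.
RDKit::SmilesParserParams params;
std::unique_ptr<RDKit::RWMol> mol{ RDKit::SmilesToMol(smiles_string, params) };
const std::string inchiKeyString = RDKit::MolToInchiKey(*mol, "/FixedH /SUU /RecMet /KET /15T");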
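For readers who want to toggle individual options, the string passed in the call above can be assembled from named switches. This is a minimal Python sketch: `build_inchi_options` and its parameter names are hypothetical and not taken from Datamol or RDKit, only the five option tokens come from this thread, and the comments give my reading of what each token means.

```python
def build_inchi_options(
    fixed_h: bool = True,
    undefined_stereo: bool = True,
    reconnected_metals: bool = True,
    keto_enol: bool = True,
    tautomerism_15: bool = True,
) -> str:
    """Join the enabled InChI option tokens into one space-separated string."""
    switches = [
        ("/FixedH", fixed_h),            # fixed-hydrogen layer
        ("/SUU", undefined_stereo),      # include undefined/unknown stereo
        ("/RecMet", reconnected_metals), # reconnected-metal layer
        ("/KET", keto_enol),             # keto-enol tautomerism
        ("/15T", tautomerism_15),        # 1,5-tautomerism
    ]
    return " ".join(token for token, enabled in switches if enabled)


print(build_inchi_options())  # /FixedH /SUU /RecMet /KET /15T
```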

DomInvivo changed the title ~~unique_id doesn't require MD5 hashing~~ unique_id doesn't require MD5 hashing? Dec 13, 2023

hadim closed this as completed Dec 14, 2023
