unique_id doesn't require MD5 hashing? #220

Closed
DomInvivo opened this issue Dec 13, 2023 · 8 comments

Comments

@DomInvivo
Contributor

In the unique_id function, an MD5 hash is computed on the InChIKey.

According to the InChI API reference for GetINCHIKeyFromINCHI, InChIKeys are zero-terminated strings that are written into a 28-byte buffer: https://www.inchi-trust.org/download/104/InChI_API_Reference.pdf (page 24). The full InChI string can be longer, but the key is apparently made from SHA-256 hashes of parts of the InChI string. Wikipedia describes the key format as XXXXXXXXXXXXXX-YYYYYYYYFV-P: 14 uppercase characters, a dash, 10 uppercase characters, a dash, and 1 uppercase character. So it could probably be reduced further, though I'd want to find a source from IUPAC for that info before relying on it.

The MD5 hash has 32 hexadecimal characters, whereas the InChIKey has 27. So the hash ends up longer than the text it was computed from. Is there a purpose to having it?
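To make the length comparison concrete, here is a minimal sketch using only the Python standard library. It assumes the key layout described above and uses the standard InChIKey quoted later in this thread as sample input; unique_id itself hashes a non-standard key, so this only illustrates the sizes involved, not the actual unique_id value. The names `INCHIKEY_PATTERN` and `md5_hex` are made up for this example.

```python
import hashlib
import re

# Layout as described above: 14 letters, dash, 10 letters, dash, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def md5_hex(text: str) -> str:
    """Return the MD5 digest of `text` as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


key = "VERCQLOBLOLFMW-UHFFFAOYSA-N"  # sample key taken from this thread
digest = md5_hex(key)

print(bool(INCHIKEY_PATTERN.fullmatch(key)))  # does the key match the layout?
print(len(key), len(digest))                  # key length vs. hex digest length
```

An MD5 digest is 128 bits, so its hexadecimal form is always 32 characters regardless of the input length.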

@DomInvivo DomInvivo changed the title unique_id doesn't require MD5 hashing unique_id doesn't require MD5 hashing? Dec 13, 2023
@ndickson-nvidia

The IUPAC source describing the InChIKey format is here: https://www.inchi-trust.org/technical-faq/#13.1

An InChIKey should always be 27 characters. The 2 dash characters are redundant, and the "FV" characters might always be "NA" in the unique_id function (N for non-standard, and A for InChI version 1). Even without removing the FV characters, the remaining 25 uppercase characters can be encoded in 16 bytes fairly easily (or in 15 bytes slightly less easily).
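A rough sketch of the 15-byte variant, assuming the 25 letters are treated as digits of one base-26 number (26^25 needs 118 bits, which fits in 120). The 16-byte variant would presumably just spend 5 bits per letter (125 bits). The function names are hypothetical and not part of Datamol or RDKit.

```python
import string

ALPHABET = string.ascii_uppercase  # 26 symbols, "A" maps to 0


def inchikey_letters(key: str) -> str:
    """Drop the two dashes and check that 25 uppercase letters remain."""
    letters = key.replace("-", "")
    if len(letters) != 25 or any(c not in ALPHABET for c in letters):
        raise ValueError(f"not an InChIKey-shaped string: {key!r}")
    return letters


def pack_15_bytes(key: str) -> bytes:
    """Encode the 25 letters as one base-26 integer stored in 15 bytes."""
    value = 0
    for c in inchikey_letters(key):
        value = value * 26 + (ord(c) - ord("A"))
    return value.to_bytes(15, "big")


def unpack_15_bytes(data: bytes) -> str:
    """Inverse of pack_15_bytes: rebuild the dashed 27-character key."""
    value = int.from_bytes(data, "big")
    letters = []
    for _ in range(25):
        value, digit = divmod(value, 26)
        letters.append(chr(ord("A") + digit))
    s = "".join(reversed(letters))
    return f"{s[:14]}-{s[14:24]}-{s[24]}"


packed = pack_15_bytes("VERCQLOBLOLFMW-UHFFFAOYSA-N")
print(len(packed), unpack_15_bytes(packed))
```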

@maclandrol
Member

The main purpose is to differentiate it from an InChIKey, and especially to prevent people from interpreting it as one, since we are using a non-standard InChI, which tries to better distinguish the tautomeric forms of molecules. As the original InChIKey format and its interpretation might not be respected here, we re-hash to avoid any misinterpretation.

@hadim
Contributor

hadim commented Dec 14, 2023

@maclandrol said it all - the only reason for it is to avoid confusion with the regular inchikey.

@DomInvivo I would actually recommend moving to dm.hash_mol instead, which wraps a way of hashing a molecule that is natively supported by rdkit (a contribution from folks at Schrödinger).

@hadim
Contributor

hadim commented Dec 14, 2023

Closing here, but feel free to re-open!

@hadim hadim closed this as completed Dec 14, 2023
@ndickson-nvidia

No worries! The context was that Graphium currently uses:

mol = dm.to_mol(mol=smiles)
mol_id = dm.unique_id(mol)

to get a unique ID from a SMILES string, in order to identify whether a molecule occurs in multiple datasets. I'm in the process of moving most of Graphium's dataset preparation to C++. It wouldn't need to use Datamol for this anymore, so I don't really need any changes in Datamol, and creating an InChIKey from a molecule is probably much slower than doing an MD5 hash. My current plan is to compact the 25 letters of each InChIKey into a pair of 64-bit integers (26^13 < 2^64, so it's pretty simple) and use the pairs as keys, instead of using 32-character strings as keys.
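For illustration, the packing plan could look roughly like this in Python (the actual work is being done in C++, and the 13 + 12 split of the letters is an assumption of this sketch, chosen because 26^13 < 2^64). The function names are hypothetical.

```python
def pack_base26(letters: str) -> int:
    """Interpret a run of uppercase letters as a base-26 number."""
    value = 0
    for c in letters:
        if not "A" <= c <= "Z":
            raise ValueError(f"unexpected character {c!r}")
        value = value * 26 + (ord(c) - ord("A"))
    return value


def inchikey_to_u64_pair(key: str) -> tuple[int, int]:
    """Pack the 25 letters of an InChIKey-shaped string into two integers.

    Assumed split: first 13 letters, then the remaining 12. Both halves
    stay below 26**13, which is below 2**64.
    """
    letters = key.replace("-", "")
    if len(letters) != 25:
        raise ValueError(f"expected 25 letters, got {len(letters)}")
    return pack_base26(letters[:13]), pack_base26(letters[13:])


# The pair is hashable, so it can be used directly as a dictionary key.
seen = {inchikey_to_u64_pair("VERCQLOBLOLFMW-UHFFFAOYSA-N"): "dataset_a"}
print(inchikey_to_u64_pair("VERCQLOBLOLFMW-UHFFFAOYSA-N") in seen)
```

Since the mapping from letters to the integer pair is one-to-one, two keys collide only if the underlying InChIKeys are identical.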

@hadim
Contributor

hadim commented Dec 14, 2023

@ndickson-nvidia ok, sounds good, and thanks for the context. Your approach seems legit and fits into a pair of 64-bit integers, but I just want to raise a point about using the "standard" inchikey.

It's very subtle, but the "standard" inchikey cannot differentiate between tautomers (two molecules that look very similar but where a few hydrogens are located at different positions of the graph). In certain cases that's totally fine, but in other cases it might not be what you want, depending on the downstream application.

To give you more context about the above, I am copying/pasting here a small blurb I have made in the past:


[... the topic of the convo is about finding a good way to generate unique ID for a molecular dataset described as SMILES..]

First, SMILES is likely a suboptimal choice: its length is variable, and the SMILES generation algorithm can differ between chemoinformatics libraries and even across rdkit versions (see for example rdkit/rdkit#4919 (comment)).

Inchi and inchikey are an international standard that is well respected within the field, and any implementation (rdkit or other software) will generate the same inchi/inchikey for the same molecule.

That being said, inchi and inchikey in their default implementation do not preserve the hydrogen layer information. As a consequence, two tautomers of the same molecule will generate the exact same inchikey. See the example below:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.to_inchikey(mol1)  # VERCQLOBLOLFMW-UHFFFAOYSA-N
dm.to_inchikey(mol2)  # VERCQLOBLOLFMW-UHFFFAOYSA-N

If you want to differentiate between tautomers (and here it really depends on the use case, because sometimes this is not wanted), then you can use a non-standard inchikey version that considers the hydrogen layers. Most software does not allow generating those inchikeys, and you should be very careful because they look similar to standard inchikeys (so be sure to document that this is not a standard inchikey).

In datamol, we've added a function to generate such non-standard InChIKey and added a simple md5 function on top of it to prevent the confusion with standard inchikey. This is what we call the unique_id:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.unique_id(mol1)  # defe59076164fa24a3bbc365dc965b4b
dm.unique_id(mol2)  # 6316e004feb731d6981b5e28d3ae3f98

We developed unique_id a few years ago, but more recently folks at Schrödinger added a new molecular hashing function directly within rdkit that provides the same types of features. The PR is at rdkit/rdkit#5360. You can also find an RDKit UGM talk about it at https://github.com/rdkit/UGM_2022/blob/main/Presentations/nealschneider_introducing_registrationhash_lightning.pdf

That function is also provided on datamol:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.hash_mol(mol1)  # 94f493d321cbd83c888e5f3f7e2e8054f956499f
dm.hash_mol(mol2)  # 0ef289740c9ba4d7af02c5d86fc2139fe938739a

So to summarize, before deciding on an ID I would recommend first defining exactly what it means for two molecules to be identical in your particular context, because this is a loose definition (due to the intrinsically physical, 3D and dynamic nature of molecular systems).

I would also recommend storing a SMILES for every data point, so you can easily reconstruct the molecular object, in addition to one or multiple ID columns (inchikey, unique_id, hashmol, etc).
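As a small illustration of why the choice of ID column matters, the sketch below stores the two tautomers from the examples above with their SMILES and the ID values quoted in this thread, then groups them by each column. The record layout and the `group_by` helper are made up for this example; only the standard library is used.

```python
from collections import defaultdict

# One row per data point: the SMILES plus several ID columns.
# All values are copied from the examples earlier in this thread.
records = [
    {
        "smiles": "CN=c1cc[nH]cn1",
        "inchikey": "VERCQLOBLOLFMW-UHFFFAOYSA-N",
        "unique_id": "defe59076164fa24a3bbc365dc965b4b",
        "hash_mol": "94f493d321cbd83c888e5f3f7e2e8054f956499f",
    },
    {
        "smiles": "CN=c1ccnc[nH]1",
        "inchikey": "VERCQLOBLOLFMW-UHFFFAOYSA-N",
        "unique_id": "6316e004feb731d6981b5e28d3ae3f98",
        "hash_mol": "0ef289740c9ba4d7af02c5d86fc2139fe938739a",
    },
]


def group_by(rows, column):
    """Map each distinct value of `column` to the SMILES that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["smiles"])
    return dict(groups)


for column in ("inchikey", "unique_id", "hash_mol"):
    print(column, len(group_by(records, column)))
```

Grouping by the standard inchikey merges the two tautomers into one entry, while unique_id and hash_mol keep them apart, so which column to deduplicate on depends on the definition of "identical" chosen above.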

Hope it helps!

@hadim
Contributor

hadim commented Dec 14, 2023

Now back to your use case. Assuming you want to generate mol IDs that are sensitive to the hydrogen layer, I don't think you can use molhash from rdkit, since the implementation is on the Python side.

That being said, the inchikey from rdkit lives on the C++ side, so you could easily tune the options to generate a non-standard inchikey if you want to from C++. See the equivalent datamol function for help choosing the right parameters:

def to_inchikey_non_standard(

Hope it helps!

@ndickson-nvidia

Yep, I'm calling the same RDKit C++ function that eventually gets called by to_inchikey_non_standard, with the same options, so the only difference from unique_id should be that I'm skipping the MD5 hash:

RDKit::SmilesParserParams params;
std::unique_ptr<RDKit::RWMol> mol{ RDKit::SmilesToMol(smiles_string, params) };
const std::string inchiKeyString = MolToInchiKey(*mol, "/FixedH /SUU /RecMet /KET /15T");


4 participants