unique_id doesn't require MD5 hashing? #220
Comments
The IUPAC source describing the InChIKey format is here: https://www.inchi-trust.org/technical-faq/#13.1 An InChIKey should always be 27 characters; the 2 dash characters are redundant, and the "FV" characters might always be "NA" in the |
The main purpose is to differentiate it from an InChIKey, and especially to prevent people from interpreting it as one, since we are using a non-standard InChI, which tries to improve the uniqueness of tautomeric forms of molecules. Because the original InChIKey format and structure might not be respected here, we re-hash to avoid any misinterpretation.
@maclandrol said it all - the only reason for it is to avoid confusion with the regular inchikey. @DomInvivo I would actually recommend you to move to |
Closing here, but feel free to re-open!
No worries! The context was that Graphium currently uses:
to get a unique ID from a SMILES string for identifying if a molecule occurs in multiple datasets. I'm in the process of trying to move most of Graphium's dataset preparation to C++. It wouldn't need to use Datamol for this anymore, so I don't really need any changes in Datamol, and creating an InChIKey from a molecule is probably much slower than doing an MD5 hash. My current plan is to compact the 25 letters of each InChIKey into a pair of 64-bit integers, ( |
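The packing idea can be sketched as follows. This is only one possible scheme, not necessarily the one planned above: it treats the 25 letters as base-26 digits and splits them 13 + 12, because 26^13 is still below 2^64 while 26^14 is not. The function names `pack_inchikey` and `unpack_inchikey` are hypothetical, and the sketch is in Python for brevity even though the target is C++.

```python
A = ord("A")


def pack_inchikey(key: str) -> tuple[int, int]:
    """Pack the 25 letters of a 27-character InChIKey into two integers < 2**64.

    Hypothetical helper: letters are read as base-26 digits, 13 in the
    first integer and 12 in the second.
    """
    if len(key) != 27 or key[14] != "-" or key[25] != "-":
        raise ValueError("expected the form XXXXXXXXXXXXXX-YYYYYYYYFV-P")
    letters = key.replace("-", "")
    if len(letters) != 25 or not all("A" <= c <= "Z" for c in letters):
        raise ValueError("expected 25 uppercase ASCII letters")

    def to_int(chunk: str) -> int:
        value = 0
        for c in chunk:
            value = value * 26 + (ord(c) - A)
        return value

    return to_int(letters[:13]), to_int(letters[13:])


def unpack_inchikey(hi: int, lo: int) -> str:
    """Inverse of pack_inchikey: rebuild the dashed 27-character key."""

    def to_letters(value: int, count: int) -> str:
        out = []
        for _ in range(count):
            value, digit = divmod(value, 26)
            out.append(chr(A + digit))
        return "".join(reversed(out))

    letters = to_letters(hi, 13) + to_letters(lo, 12)
    return f"{letters[:14]}-{letters[14:24]}-{letters[24]}"
```

Since the mapping is a bijection on valid keys, two molecules get the same integer pair exactly when their InChIKeys are equal, so no extra collisions are introduced beyond those of the InChIKey itself.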
@ndickson-nvidia ok, sounds good, and thanks for the context. Your approach seems legit and fits into 64-bit integers, but I just want to raise a point about using the "standard" InChIKey. It's very subtle, but the "standard" InChIKey cannot differentiate between tautomers (two molecules that look very similar but where a few hydrogens are located at different positions of the graph). In certain cases that's totally fine, but in others it might not be what you want, depending on the downstream application. To give you more context about the above, I am copying/pasting here a small blurb I wrote in the past:
|
Now back to your use case. Assuming you want to generate molecule IDs that are sensitive to the hydrogen layer, I don't think you can use molhash from RDKit, since the implementation is on the Python side. That being said, the InChIKey from RDKit lives on the C++ side, so you could easily tune the options to generate a non-standard InChIKey from C++ if you want to. See the equivalent datamol function for help choosing the right parameters: Line 279 in 5d1cde1
Hope it helps! |
Yep, I'm calling the same RDKit C++ function that eventually gets called by
|
In the unique_id function, an MD5 hash is computed on the InChIKey.
According to the InChI API reference for GetINCHIKeyFromINCHI, InChIKeys are zero-terminated strings that are written into a 28-byte buffer: https://www.inchi-trust.org/download/104/InChI_API_Reference.pdf (page 24) The full InChI string can be longer, but the key is apparently made from SHA-256 hashes of parts of the InChI string. Wikipedia describes the format of the key as XXXXXXXXXXXXXX-YYYYYYYYFV-P, with 14 uppercase characters, a dash, 10 uppercase characters, a dash, and 1 uppercase character, so it could probably be reduced further, though I'd want to find a source from IUPAC for that info before relying on it.
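As a rough illustration of that layout, the sketch below splits a key into its parts with a regular expression. The function name `split_inchikey` and the dictionary field names are my own labels rather than official terminology, and the pattern only checks the shape of the string, not whether the key corresponds to a real structure.

```python
import re

# 14 letters, dash, 8 letters + flag letter + version letter, dash, 1 letter.
INCHIKEY_PATTERN = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])")


def split_inchikey(key: str) -> dict[str, str]:
    """Split a 27-character InChIKey into its blocks (hypothetical helper)."""
    match = INCHIKEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"not shaped like an InChIKey: {key!r}")
    first, second, flag, version, last = match.groups()
    return {
        "first_block": first,    # the 14 X characters
        "second_block": second,  # the 8 Y characters
        "flag": flag,            # the F character
        "version": version,      # the V character
        "last_char": last,       # the P character
    }


parts = split_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
print(parts["flag"] + parts["version"])  # prints "SA"
```

The example key is the standard InChIKey for ethanol, so its "FV" pair is "SA"; a key generated with non-standard options would carry a different flag letter.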
The MD5 hash has 32 characters, whereas the InChIKey has 27, so the hash is longer than the text it is computed from. Is there a purpose to having it?
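The length comparison can be checked with the standard library alone. This is a minimal sketch that assumes the ID is stored as the hexadecimal digest of the UTF-8 encoded key; `md5_hex` is a hypothetical helper, not the datamol function itself.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of a string (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # standard InChIKey for ethanol
digest = md5_hex(key)

print(len(key))     # prints 27
print(len(digest))  # prints 32
```

The underlying MD5 digest is 16 bytes; it only grows to 32 characters because each byte is written as two hexadecimal digits.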