When trying to generate isomeric or canonical SMILES from an SDF with a large number of atoms, the following error is thrown (see the mailing list):
An InChI could not be generated and used to canonise SMILES: null
Could not generate InChI Numbers: Too many atoms [did you forget 'LargeMolecules' switch?]
Attaching Andrew Dalke's detailed explanation of the issue below:
CDK uses InChI to generate absolute SMILES. Here's a comment from the code:
* Create a absolute SMILES generator. Unique SMILES uses the InChI to
* canonise SMILES and encodes isotope or stereo-chemistry. The InChI
* module is not a dependency of the SMILES module but should be present
* on the classpath when generation absolute SMILES.
If you remove either the SmiFlavor.Canonical or the SmiFlavor.Isomeric bit flag from your output flavor then you'll get a SMILES, though it won't be an absolute SMILES.
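To make the bit-flag workaround concrete, here is a minimal Python sketch of how clearing one bit changes which code path is taken. The bit values and the `needs_inchi` helper are hypothetical and are not taken from CDK's SmiFlavor; the only thing carried over from the explanation is the idea that the InChI-based labelling is needed when both Canonical and Isomeric are set.

```python
# Hypothetical bit values for illustration; not CDK's real SmiFlavor constants.
CANONICAL = 0b01
ISOMERIC = 0b10
ABSOLUTE = CANONICAL | ISOMERIC


def is_set(flavor, flag):
    """True when every bit of `flag` is present in `flavor`."""
    return (flavor & flag) == flag


def needs_inchi(flavor):
    """Assumption from this thread: only Canonical plus Isomeric uses InChI."""
    return is_set(flavor, CANONICAL) and is_set(flavor, ISOMERIC)


candidates = [
    ("absolute", ABSOLUTE),
    ("without Canonical", ABSOLUTE & ~CANONICAL),
    ("without Isomeric", ABSOLUTE & ~ISOMERIC),
]
for name, flavor in candidates:
    print(f"{name}: needs InChI = {needs_inchi(flavor)}")
```

Clearing either bit with `& ~FLAG` leaves the other one untouched, which is why both workarounds avoid the InChI call.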
More specifically, CDK uses InChI to generate the atom labels used during canonical SMILES generation. In cdk/smiles/SmilesGenerator.java there's a code path which looks like:
    // apply the canonical labelling
    if (SmiFlavor.isSet(flavour, SmiFlavor.Canonical)) {
        // determine the output order
        int[] labels = labels(flavour, molecule);
Thus, if SmiFlavor.Canonical and SmiFlavor.Isomeric are set, it ends up using code in cdk/graph/invariant/InChINumbersTools.java which configures InChI to do the atom order assignments, via the 'auxiliary information':
    public static long[] getNumbers(IAtomContainer atomContainer) throws CDKException {
        String aux = auxInfo(atomContainer, new InchiFlag[0]);
        ...
    static String auxInfo(IAtomContainer container, InchiFlag... flags) throws CDKException {
        InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
        boolean org = factory.getIgnoreAromaticBonds();
        factory.setIgnoreAromaticBonds(true);
        InChIGenerator gen = factory.getInChIGenerator(container, flags);
        factory.setIgnoreAromaticBonds(org); // an option on the singleton so we should reset for others
        if (gen.getStatus() == InchiStatus.ERROR)
            throw new CDKException("Could not generate InChI Numbers: " + gen.getMessage());
        return gen.getAuxInfo();
That calls into InChI, which has a check on the number of atoms (actually, it's in a few places, all with the same idea), where:
#define MAX_ATOMS 32766
#define NORMALLY_ALLOWED_INP_MAX_ATOMS 1024
The higher limit is enabled with the InChI flag 'LargeMolecules', see https://github.com/dan2097/jna-inchi/blob/master/jna-inchi-api/src/main/java/io/github/dan2097/jnainchi/InchiFlag.java#L47
/** Allows input of molecules up to 32767 atoms [Produces 'InChI=1B' indicating beta status of resulting identifiers]*/
so it appears that changing cdk/graph/invariant/InChINumbersTools.java line 49 from:
String aux = auxInfo(atomContainer, new InchiFlag[0]);
to have LargeMolecules in that 'new InchiFlag' array would make this work.
However, I'm not a Java developer and don't know how to make this change nor test it. I can say it does not seem to be user-configurable.
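As a rough model of why the flag matters, the following Python sketch mimics the size check using the two constants quoted above. The function name, the exact comparison (`>` rather than `>=`), and the exception type are my assumptions; this is not InChI's or CDK's actual code.

```python
# Limits quoted above from the InChI source.
MAX_ATOMS = 32766
NORMALLY_ALLOWED_INP_MAX_ATOMS = 1024


def check_atom_count(num_atoms, large_molecules=False):
    """Hypothetical model of the size check; returns the limit that applied."""
    limit = MAX_ATOMS if large_molecules else NORMALLY_ALLOWED_INP_MAX_ATOMS
    if num_atoms > limit:
        raise ValueError(f"Too many atoms: {num_atoms} > {limit}")
    return limit


# The 1079-atom test molecule, with and without the switch.
for large in (False, True):
    try:
        check_atom_count(1079, large_molecules=large)
        print(f"large_molecules={large}: accepted")
    except ValueError as err:
        print(f"large_molecules={large}: rejected ({err})")
```

Under this model a molecule between the two limits is rejected only because the switch is off, which matches the hint in the error message.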
I am a Python developer, and I can reproduce the error using my 'chemfp translate' tool, which uses a Java/Python bridge to work with the CDK. The following uses RDKit to translate a FASTA sequence to an SDF with 1079 atoms:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --out sdf | head -6
megatryp
RDKit
0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 1079 1232 0 0 0
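If you want to know ahead of time whether a record will hit the limit, the atom count can be read straight from the V3000 COUNTS line, whose first number is the atom count. This is a small sketch using only the Python standard library; the helper name is mine, and 1024 is the NORMALLY_ALLOWED_INP_MAX_ATOMS value quoted earlier.

```python
def v3000_atom_count(molblock):
    """Return the atom count from the first 'M  V30 COUNTS' line, or None."""
    for line in molblock.splitlines():
        fields = line.split()
        if fields[:3] == ["M", "V30", "COUNTS"]:
            return int(fields[3])
    return None


# Header lines like the ones shown above.
header = """megatryp
     RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1079 1232 0 0 0
"""

atoms = v3000_atom_count(header)
print(atoms, "atoms; above the default limit:", atoms > 1024)
```

Records above the limit could then be routed to one of the non-absolute flavors instead of failing.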
I can have it go from FASTA to SDF using RDKit then have CDK read the SDF to produce the SMILES generation failure:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --via sdf -U cdk --out smi
Error: CDK cannot create the SMILES string (input title='megatryp'): An InChI could not be generated and used to canonise SMILES: null, file '', line 1, record #1: first line is '>megatryp'. Skipping.
(the --via defaults to 'sdf' so I'll omit that in the rest).
I can configure CDK SMILES writer to use the Default flavor, but without the 'Canonical' option, to show that work-around gives a (non-canonical) SMILES:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Canonical | fold | head -2
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
Here I'll disable Isomeric instead, so it should be canonical but not isomeric, which might be okay for you:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Isomeric | fold | head -2
O=C(O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
That's the flavor you pass into SmilesGenerator().