InChi LargeMolecules Switch #974

Closed
cubbardo opened this issue May 26, 2023 · 0 comments · Fixed by #979


When trying to generate isomeric and canonical SMILES from an SDF record with a large number of atoms, the error:

An InChI could not be generated and used to canonise SMILES: null
Could not generate InChI Numbers: Too many atoms [did you forget 'LargeMolecules' switch?]

is thrown. See the mailing list discussion.

Attaching detailed explanation from Andrew Dalke of the issue below:

CDK uses InChI to generate absolute SMILES. Here's a comment from the code:

 * Create a absolute SMILES generator. Unique SMILES uses the InChI to
 * canonise SMILES and encodes isotope or stereo-chemistry. The InChI
 * module is not a dependency of the SMILES module but should be present
 * on the classpath when generation absolute SMILES.

If you remove either the SmiFlavor.Canonical or the SmiFlavor.Isomeric bit flag from your output flavor then you'll get a SMILES, though it won't be an absolute SMILES.
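To make the bit-flag removal concrete, here is a minimal, self-contained sketch. The constants are placeholders with made-up values, not CDK's actual SmiFlavor definitions (which may use different, multi-bit masks), and the class and method names are hypothetical. It only illustrates clearing one flag with `& ~flag` and the statement above that InChI is involved when both flags are set.

```java
class FlavorBits {
    // Placeholder bit values for illustration only; CDK's SmiFlavor defines its own.
    static final int CANONICAL = 0x1;
    static final int ISOMERIC = 0x2;
    static final int ABSOLUTE = CANONICAL | ISOMERIC;

    // Clear one flag from a flavor value, leaving all other bits untouched.
    static int without(int flavor, int flag) {
        return flavor & ~flag;
    }

    // Assumption taken from the description above: InChI is only needed
    // when the flavor is both canonical and isomeric.
    static boolean usesInchi(int flavor) {
        return (flavor & CANONICAL) != 0 && (flavor & ISOMERIC) != 0;
    }

    public static void main(String[] args) {
        System.out.println("absolute:          " + usesInchi(ABSOLUTE));
        System.out.println("without Canonical: " + usesInchi(without(ABSOLUTE, CANONICAL)));
        System.out.println("without Isomeric:  " + usesInchi(without(ABSOLUTE, ISOMERIC)));
    }
}
```

With CDK itself, the same `& ~` operation would presumably be applied to the real SmiFlavor constants before passing the result to the SmilesGenerator constructor.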

More specifically, CDK uses InChI to generate the atom labels used during canonical SMILES generation. In cdk/smiles/SmilesGenerator.java there's a code path which looks like:

        // apply the canonical labelling
        if (SmiFlavor.isSet(flavour, SmiFlavor.Canonical)) {

            // determine the output order
            int[] labels = labels(flavour, molecule);

where labels() is:

private static int[] labels(int flavour, final IAtomContainer molecule) throws CDKException {
    // FIXME: use SmiOpt.InChiLabelling
    long[] labels = SmiFlavor.isSet(flavour, SmiFlavor.Isomeric) ? inchiNumbers(molecule)
            : Canon.label(molecule,
                          GraphUtil.toAdjList(molecule),
                          createComparator(molecule, flavour));

Thus, if SmiFlavor.Canonical and SmiFlavor.Isomeric are set, it ends up using code in cdk/graph/invariant/InChINumbersTools.java which configures InChI to do the atom order assignments, via the 'auxiliary information':

public static long[] getNumbers(IAtomContainer atomContainer) throws CDKException {
    String aux = auxInfo(atomContainer, new InchiFlag[0]);
  ...

static String auxInfo(IAtomContainer container, InchiFlag... flags) throws CDKException {
    InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
    boolean org = factory.getIgnoreAromaticBonds();
    factory.setIgnoreAromaticBonds(true);
    InChIGenerator gen = factory.getInChIGenerator(container, flags);
    factory.setIgnoreAromaticBonds(org); // an option on the singleton so we should reset for others
    if (gen.getStatus() == InchiStatus.ERROR)
        throw new CDKException("Could not generate InChI Numbers: " + gen.getMessage());
    return gen.getAuxInfo();

That calls into the InChI library, which has the following check (it appears in a few places, all with the same idea):

max_num_at = ip->bLargeMolecules ? MAX_ATOMS : NORMALLY_ALLOWED_INP_MAX_ATOMS;
if (nNumAtoms >= max_num_at)
{
    TREAT_ERR( *err, 0, "Too many atoms [did you forget 'LargeMolecules' switch?]" );
    *err = 70;
    orig_inp_data->num_inp_atoms = -1;
    goto err_exit;
}

where

#define MAX_ATOMS 32766
#define NORMALLY_ALLOWED_INP_MAX_ATOMS 1024
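As a worked example of that check, here is a small Java port of the comparison. This is a sketch for illustration, not code from InChI or CDK; the class and method names are hypothetical, and the two constants are copied from the #define lines above. It is applied to the 1079-atom test molecule described later in this issue.

```java
class AtomLimitCheck {
    // Values copied from the #define lines quoted above.
    static final int MAX_ATOMS = 32766;
    static final int NORMALLY_ALLOWED_INP_MAX_ATOMS = 1024;

    // Mirrors the quoted C logic:
    //   max_num_at = ip->bLargeMolecules ? MAX_ATOMS : NORMALLY_ALLOWED_INP_MAX_ATOMS;
    //   if (nNumAtoms >= max_num_at) -> "Too many atoms"
    static boolean tooManyAtoms(int numAtoms, boolean largeMolecules) {
        int maxNumAt = largeMolecules ? MAX_ATOMS : NORMALLY_ALLOWED_INP_MAX_ATOMS;
        return numAtoms >= maxNumAt;
    }

    public static void main(String[] args) {
        int atoms = 1079; // atom count of the example SDF in this issue
        System.out.println("default switch:  rejected = " + tooManyAtoms(atoms, false));
        System.out.println("LargeMolecules:  rejected = " + tooManyAtoms(atoms, true));
    }
}
```

Because the quoted comparison is `>=`, a molecule with exactly 1024 atoms would also be rejected by this snippet when the switch is off.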

This limit is raised with the InChI flag 'LargeMolecules', documented in jna-inchi at https://github.com/dan2097/jna-inchi/blob/master/jna-inchi-api/src/main/java/io/github/dan2097/jnainchi/InchiFlag.java#L47 as:

/** Allows input of molecules up to 32767 atoms [Produces 'InChI=1B' indicating beta status of resulting identifiers]*/

so it appears that changing cdk/graph/invariant/InChINumbersTools.java line 49 from:

    String aux = auxInfo(atomContainer, new InchiFlag[0]);

to have LargeMolecules in that 'new InchiFlag' would make this work.
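The shape of that change can be sketched as follows. To keep the example runnable without CDK or jna-inchi on the classpath, it uses stand-ins: a one-constant `InchiFlag` enum and a hypothetical `auxInfo` that takes an atom count instead of an IAtomContainer and applies the limits quoted above. It is not the actual CDK code or the eventual fix, only an illustration of passing the flag through the varargs parameter.

```java
import java.util.Arrays;

class LargeMoleculesSketch {
    // Stand-in for jna-inchi's InchiFlag; only the constant discussed here.
    enum InchiFlag { LargeMolecules }

    // Hypothetical stand-in for InChINumbersTools.auxInfo(container, flags).
    // It takes an atom count and applies the limits quoted earlier in this issue.
    static String auxInfo(int atomCount, InchiFlag... flags) {
        boolean large = Arrays.asList(flags).contains(InchiFlag.LargeMolecules);
        int limit = large ? 32766 : 1024;
        if (atomCount >= limit) {
            throw new IllegalStateException(
                "Could not generate InChI Numbers: Too many atoms");
        }
        return "AuxInfo placeholder for " + atomCount + " atoms";
    }

    // Current call shape: an empty flag array.
    static String before(int atomCount) {
        return auxInfo(atomCount, new InchiFlag[0]);
    }

    // Proposed call shape: the flag is passed via the varargs parameter.
    static String after(int atomCount) {
        return auxInfo(atomCount, InchiFlag.LargeMolecules);
    }

    public static void main(String[] args) {
        System.out.println(after(1079));
        try {
            before(1079);
        } catch (IllegalStateException e) {
            System.out.println("before: " + e.getMessage());
        }
    }
}
```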

However, I'm not a Java developer and don't know how to make this change or test it. I can say it does not seem to be user-configurable.

I am a Python developer, and I can reproduce the error using my 'chemfp translate' tool, which uses a Java/Python bridge to work with the CDK. The following uses RDKit to translate a FASTA sequence to an SDF with 1079 atoms:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --out sdf | head -6
megatryp
RDKit

0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 1079 1232 0 0 0

Going from FASTA to SDF with RDKit and then having CDK read the SDF reproduces the SMILES generation failure:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --via sdf -U cdk --out smi
Error: CDK cannot create the SMILES string (input title='megatryp'): An InChI could not be generated and used to canonise SMILES: null, file '', line 1, record #1: first line is '>megatryp'. Skipping.

(The --via option defaults to 'sdf', so I'll omit it in the remaining examples.)

I can configure the CDK SMILES writer to use the Default flavor without the 'Canonical' option, which shows that this work-around gives a (non-canonical) SMILES:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Canonical | fold | head -2
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)

Here I'll disable Isomeric instead, so it should be canonical but not isomeric, which might be okay for you:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Isomeric | fold | head -2
O=C(O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(

That's the flavor you pass into SmilesGenerator().
