In [1]:
%run notebook_setup.py

sys.path.append('../..')

In [2]:
from sqlalchemy import create_engine

import random

# Rules for keto-enol tautomerism

## Introduction

Keto-enol tautomerism is unusual in the current context as the proton shift involves a carbon atom...

![keto-enol tautomerism](http://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Keto-Enol-Tautomerie.png/320px-Keto-Enol-Tautomerie.png)

The equilibrium here normally lies far to the left (_i.e._ the ketone is the dominant tautomer), but can be affected by a number of factors, the most pertinent here being the nature of the substituents.

For example, the enol form may be stabilised by conjugation of the double-bond with an adjacent pi-system and/or by H-bonding of the hydroxyl. These factors come together in some beta-diketones, where the equilibrium may lie towards the enol form...

![beta-diketone](http://upload.wikimedia.org/wikipedia/commons/thumb/9/99/AcacH.png/400px-AcacH.png)

Even this is not clear-cut, however. For example, where one of the carbonyls of the diketone is part of an ester or amide moiety the equilibrium shifts back towards the keto form. Similarly, steric effects or ring membership might disallow the formation of a hydrogen bond or disrupt conjugation. It is thus difficult to formulate clear-cut rules that may be applied to generate 'correct' tautomers in all cases.

Another issue is that it is difficult to find definitive information on tautomeric preferences for these systems, and the available data are sometimes contradictory (perhaps due to the differing solvent systems and measurement techniques used). For example, Carey & Sundberg (3e, p. 420) suggests pentane-2,4-dione (shown above) favours the keto form (in H<sub>2</sub>O), whereas March (6e, p. 99) suggests the enol form predominates.

Thus, it is an open question as to what would be the ideal strategy for handling these groups in the context of compound standardisation. Some thoughts are...

* Use a simple rule that transforms all enol groups to ketones. This has the virtue of simplicity, although it would have to be accepted that it would be applied inappropriately in some cases (_i.e._ where the enol was more stable).


* Use a variant of this rule that excludes those cases where the enol might be considered to be the more stable form. This could be made as fine-grained as necessary, although the problem then, of course, is in deciding which cases to handle. A further complication of this strategy would be that rules handling the opposite transformation would become necessary. In other words, where the enol was more stable, molecules containing the keto form would need to be transformed to reflect this.


* Do not apply any rule, but flag compounds containing enols such that the user may inspect them individually. This might be particularly appropriate in some contexts such as literature data, where the author of the document may have had good reason for assigning a 'non-standard' tautomer.
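
The third option can be sketched with a plain substructure check. This is a minimal illustration only, assuming RDKit is available; the helper name `flag_enols` is hypothetical, and the query is the general enol SMARTS used for the ChEMBL search later in this document.

```python
from rdkit import Chem

# General enol query: hydroxyl on a carbon that is double-bonded to a
# three-connected carbon
ENOL_QUERY = Chem.MolFromSmarts("[OH]-C=[CX3]")

def flag_enols(smiles_list):
    """Map each SMILES to True if it contains an enol group (hypothetical helper)."""
    flags = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        # Unparseable SMILES are reported as False rather than raising
        flags[smi] = mol is not None and mol.HasSubstructMatch(ENOL_QUERY)
    return flags

# Simple enol, butanone and phenol
flags = flag_enols(["OC(C)=CC", "CC(=O)CC", "Oc1ccccc1"])
```

Phenol is not flagged because `C` in SMARTS matches aliphatic carbon only, so aromatic hydroxyls are left out of the inspection list.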

### Note

Another issue that makes keto-enol tautomerism distinct from most heteroatom-based tautomerism is that the change in hybridisation state of the hydrogen-accepting carbon can more radically change the 3D geometry of a molecule (and hence its conformational behaviour), and might also introduce a stereogenic centre. Both of these might need consideration in some modelling contexts.
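
Both effects can be seen on a small pair of tautomers. The sketch below assumes RDKit and uses 3-methylpentan-2-one and its enol, chosen here purely for illustration; atom index 3 is the hydrogen-accepting carbon in both SMILES as written.

```python
from rdkit import Chem

enol = Chem.MolFromSmiles("OC(C)=C(C)CC")    # enol of 3-methylpentan-2-one
ketone = Chem.MolFromSmiles("CC(=O)C(C)CC")  # 3-methylpentan-2-one

# Hybridisation of the hydrogen-accepting carbon before and after the shift
enol_hyb = enol.GetAtomWithIdx(3).GetHybridization()
ketone_hyb = ketone.GetAtomWithIdx(3).GetHybridization()

# Potential stereocentres, including those with no assigned configuration
enol_centres = Chem.FindMolChiralCenters(enol, includeUnassigned=True)
ketone_centres = Chem.FindMolChiralCenters(ketone, includeUnassigned=True)

print(enol_hyb, enol_centres)
print(ketone_hyb, ketone_centres)
```

The keto form gains an unassigned stereocentre at the alpha carbon that has no counterpart in the planar enol.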

## Simple examples

As an illustration, some simple examples are shown below...

In [3]:
# Some simple illustrative test cases...

examples = pd.DataFrame(columns=["smiles", "description"], data=[
      ["OC(C)=CC",               "Simple enol"]
    , ["OC(C)=CC(=O)C",          "Enol conjugated with carbonyl: H-bond possible (acyclic)"]
    , ["OC1=C(C(=O)C)CCCC1",     "Enol conjugated with carbonyl: H-bond possible (cyclic 1)"]
    , ["OC1=C(C(=O)CCC2)C2CCC1", "Enol conjugated with carbonyl: H-bond possible (cyclic 2)"]
    , ["O=C1C(=C(O)C)CCCC1",     "Enol conjugated with carbonyl: H-bond possible (cyclic 3)"]
    , ["O=C1C(=C(O)CCC2)C2CCC1", "Enol conjugated with carbonyl: H-bond possible (cyclic 4)"]
    , ["OC1=CC(=O)CCC1",         "Enol conjugated with carbonyl: no H-bond possible (cyclic)"]
    , ["OC(C)=CC(=O)OC",         "Enol conjugated with carbonyl: ester/amide"]
])

examples['old'] = [Chem.MolFromSmiles(x) for x in examples['smiles']]

del examples['smiles']

# Helper function to run a transformation on a molecule...

def run_rxn(rxn, old):
    try:
        new = rxn.RunReactants((old,))[0][0]
        Chem.SanitizeMol(new)
        return new
    except Exception:  # e.g. IndexError when the reaction does not match
        return Chem.MolFromSmiles("*") # dummy mol to indicate no transform applied

### Example 1

Simplest enol -> keto transformation

In [4]:
smarts1 = "[OH:1][C:2]=[C:3]>>[O:1]=[C:2][C:3]"

rxn1 = AllChem.ReactionFromSmarts(smarts1)

examples["new"] = [run_rxn(rxn1, x) for x in examples["old"]]

examples

,description,old,new
0,Simple enol,,
1,Enol conjugated with carbonyl: H-bond possible (acyclic),,
2,Enol conjugated with carbonyl: H-bond possible (cyclic 1),,
3,Enol conjugated with carbonyl: H-bond possible (cyclic 2),,
4,Enol conjugated with carbonyl: H-bond possible (cyclic 3),,
5,Enol conjugated with carbonyl: H-bond possible (cyclic 4),,
6,Enol conjugated with carbonyl: no H-bond possible (cyclic),,
7,Enol conjugated with carbonyl: ester/amide,,
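
Because the depictions are not always to hand, the effect of the simple rule can also be inspected as SMILES. The following is a self-contained sketch, assuming RDKit; `to_keto_smiles` is a hypothetical helper that returns `None` rather than a dummy molecule when the rule does not match.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

rxn = AllChem.ReactionFromSmarts("[OH:1][C:2]=[C:3]>>[O:1]=[C:2][C:3]")

def to_keto_smiles(smiles):
    """Apply the simple enol -> keto rule once; return None if it does not match."""
    products = rxn.RunReactants((Chem.MolFromSmiles(smiles),))
    if not products:
        return None
    new = products[0][0]
    Chem.SanitizeMol(new)  # recompute implicit hydrogens after the bond changes
    return Chem.MolToSmiles(new)

for smi in ["OC(C)=CC", "OC(C)=CC(=O)C", "OC(C)=CC(=O)OC", "CC(=O)CC"]:
    print(smi, "->", to_keto_smiles(smi))
```

Only the first match is taken, so a molecule with several enol groups would need the rule applied repeatedly.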


### Example 2

Exclude beta-keto-enols, except where the ketone is 'deactivated' by conjugation with an ether or amine

In [5]:
smarts2 = "[OH!$(*C=C[C!$(*[O,NX3])]=O):1][C:2]=[C:3]>>[O:1]=[C:2][C:3]"

rxn2 = AllChem.ReactionFromSmarts(smarts2)

examples["new"] = [run_rxn(rxn2, x) for x in examples["old"]]

examples

,description,old,new
0,Simple enol,,
1,Enol conjugated with carbonyl: H-bond possible (acyclic),,
2,Enol conjugated with carbonyl: H-bond possible (cyclic 1),,
3,Enol conjugated with carbonyl: H-bond possible (cyclic 2),,
4,Enol conjugated with carbonyl: H-bond possible (cyclic 3),,
5,Enol conjugated with carbonyl: H-bond possible (cyclic 4),,
6,Enol conjugated with carbonyl: no H-bond possible (cyclic),,
7,Enol conjugated with carbonyl: ester/amide,,
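
Which molecules this rule touches can be checked without depictions by counting the product sets the reaction returns. This is a sketch under the assumption that RDKit is installed; `rule_applies` is a hypothetical helper name.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smarts2 = "[OH!$(*C=C[C!$(*[O,NX3])]=O):1][C:2]=[C:3]>>[O:1]=[C:2][C:3]"
rxn2 = AllChem.ReactionFromSmarts(smarts2)

def rule_applies(rxn, smiles):
    """True if the reaction yields at least one product set for this molecule."""
    return len(rxn.RunReactants((Chem.MolFromSmiles(smiles),))) > 0

simple_enol = rule_applies(rxn2, "OC(C)=CC")        # no beta-carbonyl
keto_enol = rule_applies(rxn2, "OC(C)=CC(=O)C")     # beta-keto-enol
ester_enol = rule_applies(rxn2, "OC(C)=CC(=O)OC")   # carbonyl is part of an ester

print(simple_enol, keto_enol, ester_enol)
```

The recursive term `$(*C=C[C...]=O)` describes the hydroxyl oxygen's environment, and the leading `!` removes hydroxyls in that environment from the match. The inner `!$(*[O,NX3])` in turn withdraws that protection when the carbonyl carbon carries a singly bonded oxygen or trivalent nitrogen.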


### Example 3

Exclude beta-keto-enols, except where...

* the ketone is 'deactivated' by conjugation with an ether or amine
* the moiety is in a ring, thus precluding the formation of the enol-stabilising intramolecular H-bond

In [6]:
smarts3 = "[OH!$([*!$(*C1=CC(~*~*~*~1)=O)!$(*C1=CC(~*~*~1)=O)]C=C[C!$(*[O,NX3])]=O):1][C:2]=[C:3]>>[O:1]=[C:2][C:3]"

rxn3 = AllChem.ReactionFromSmarts(smarts3)

examples["new"] = [run_rxn(rxn3, x) for x in examples["old"]]

examples

,description,old,new
0,Simple enol,,
1,Enol conjugated with carbonyl: H-bond possible (acyclic),,
2,Enol conjugated with carbonyl: H-bond possible (cyclic 1),,
3,Enol conjugated with carbonyl: H-bond possible (cyclic 2),,
4,Enol conjugated with carbonyl: H-bond possible (cyclic 3),,
5,Enol conjugated with carbonyl: H-bond possible (cyclic 4),,
6,Enol conjugated with carbonyl: no H-bond possible (cyclic),,
7,Enol conjugated with carbonyl: ester/amide,,
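
The difference between the second and third rules shows up on the cyclic cases. The sketch below, assuming RDKit and using the hypothetical helper `rule_applies`, compares the two on example SMILES taken from the table of test cases.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smarts2 = "[OH!$(*C=C[C!$(*[O,NX3])]=O):1][C:2]=[C:3]>>[O:1]=[C:2][C:3]"
smarts3 = ("[OH!$([*!$(*C1=CC(~*~*~*~1)=O)!$(*C1=CC(~*~*~1)=O)]C=C[C!$(*[O,NX3])]=O):1]"
           "[C:2]=[C:3]>>[O:1]=[C:2][C:3]")
rxn2 = AllChem.ReactionFromSmarts(smarts2)
rxn3 = AllChem.ReactionFromSmarts(smarts3)

def rule_applies(rxn, smiles):
    """True if the reaction yields at least one product set for this molecule."""
    return len(rxn.RunReactants((Chem.MolFromSmiles(smiles),))) > 0

# Enol and carbonyl in the same six-membered ring: no intramolecular H-bond possible
ring_rule2 = rule_applies(rxn2, "OC1=CC(=O)CCC1")
ring_rule3 = rule_applies(rxn3, "OC1=CC(=O)CCC1")

# Exocyclic carbonyl and acyclic case: H-bond still possible, so both stay as enols
exo_rule3 = rule_applies(rxn3, "OC1=C(C(=O)C)CCCC1")
acyclic_rule3 = rule_applies(rxn3, "OC(C)=CC(=O)C")

print(ring_rule2, ring_rule3, exo_rule3, acyclic_rule3)
```

Only five- and six-membered rings containing both the enol carbon and the carbonyl carbon are covered by the two ring patterns in `smarts3`.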


### Note

It should be noted that the two more complex examples are included mainly for illustrative purposes at present. No attempt has yet been made to encode the reverse (_e.g._ 'beta-diketone -> beta-keto-enol') transformations that would be necessary for consistency if these transforms were to be used in practice.

In the simplest case (Example 1), no such reverse transforms are necessary as all enols should be replaced with ketones.
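
For orientation only, a reverse rule of the kind mentioned above might look something like the following. This is a hypothetical sketch, not part of the implemented rule set: the SMARTS is an assumption that simply mirrors the exclusions of Example 2, and it ignores the ring cases handled in Example 3.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# ASSUMPTION: illustrative reverse rule, not taken from the rule set.
# Neither carbonyl carbon may carry a singly bonded O or trivalent N,
# and the central carbon must have a hydrogen to give up.
reverse_smarts = (
    "[O:1]=[C!$(*[O,NX3]):2][C!H0:3][C!$(*[O,NX3]):4]=[O:5]"
    ">>[O:1][C:2]=[C:3][C:4]=[O:5]"
)
reverse_rxn = AllChem.ReactionFromSmarts(reverse_smarts)

def to_enol_smiles(smiles):
    """Apply the hypothetical diketone -> keto-enol rule once; None if no match."""
    products = reverse_rxn.RunReactants((Chem.MolFromSmiles(smiles),))
    if not products:
        return None
    new = products[0][0]
    Chem.SanitizeMol(new)
    return Chem.MolToSmiles(new)

acac_enol = to_enol_smiles("CC(=O)CC(C)=O")    # pentane-2,4-dione
ester_case = to_enol_smiles("CC(=O)CC(=O)OC")  # methyl acetoacetate
```

An unsymmetrical diketone can enolise towards either carbonyl, so a practical rule would also need a policy for choosing between the resulting products.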

## Examples in ChEMBL

Enols in ChEMBL were inspected in order to investigate the effect of different versions of the transform on real molecules. Only the simplest transform is illustrated here, however.

In [7]:
# Local myChEMBL instance...

database = {
      'host':     'localhost'
    , 'port':      5432
    , 'database': 'mychembl'
    , 'user':     'chembl_21'
    , 'password': 'chembl_21'
}

db_url = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**database)

engine = create_engine(db_url)
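
Formatting credentials directly into the URL works for the values above, but breaks if a user name or password contains characters such as `@` or `/`. A minimal Python 3 sketch of one way around this is shown below; `build_db_url` is a hypothetical helper and the second password is made up for illustration.

```python
from urllib.parse import quote_plus

def build_db_url(user, password, host, port, database):
    """Assemble a PostgreSQL URL, percent-encoding the credentials (hypothetical helper)."""
    return "postgresql://{user}:{password}@{host}:{port}/{database}".format(
        user=quote_plus(user),
        password=quote_plus(password),
        host=host,
        port=port,
        database=database,
    )

db_url_plain = build_db_url("chembl_21", "chembl_21", "localhost", 5432, "mychembl")
db_url_special = build_db_url("chembl_21", "p@ss/word", "localhost", 5432, "mychembl")

print(db_url_plain)
print(db_url_special)
```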

In [8]:
# Substructure query on parents with at most 30 heavy atoms...

sql = """
select
      md.chembl_id
    , mol_send(rm.m) as mol
from
      (select distinct parent_molregno as molregno from molecule_hierarchy) parents
    , compound_structures cs
    , molecule_dictionary md
    , compound_properties cp
    , mols_rdkit rm
where
    parents.molregno = cs.molregno
and parents.molregno = md.molregno
and parents.molregno = cp.molregno
and parents.molregno = rm.molregno
and cp.heavy_atoms <= 30
and rm.m @> %s::qmol
"""

In [9]:
# Helper function to replace buffer with mol...

buffer_to_bytes = (lambda x: x.tobytes()) if six.PY3 else (lambda x: str(x))

def fix_mol(old):
    new = dict(old)
    new['mol'] = Chem.Mol(buffer_to_bytes(old['mol']))
    return new
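
The conversion can be tried without a database. The sketch below assumes that, under Python 3, the driver hands back a memoryview-like object holding RDKit's binary molecule format; a `memoryview` over `ToBinary()` output is used here as a stand-in for that value.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("OC(C)=CC(=O)C")

# Stand-in for the value returned by the database driver
buf = memoryview(mol.ToBinary())

# Same conversion as the Python 3 branch of buffer_to_bytes, then rebuild the molecule
restored = Chem.Mol(buf.tobytes())

print(Chem.MolToSmiles(restored))
```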

### Enols

This shows the result of running the simplest 'enol -> ketone' transformation on enols from ChEMBL.

In [10]:
# Very general substructure query for enols...

smarts = "[OH]-C=[CX3]"

records = engine.execute(sql, (smarts, )).fetchall()

len(records)

7439

In [11]:
# Create dataframe and run the simplest transform on the mols...

df = pd.DataFrame([fix_mol(x) for x in records]).set_index("chembl_id")

df['mol2'] = [run_rxn(rxn1, x) for x in df['mol']]

In [12]:
# Visualise a sample...

df.iloc[random.sample(range(len(df)), 50),]

chembl_id,mol,mol2
CHEMBL474988,,
CHEMBL1797806,,
CHEMBL187821,,
CHEMBL125948,,
CHEMBL1364621,,
CHEMBL3248288,,
CHEMBL1383318,,
CHEMBL1540053,,
CHEMBL552365,,
CHEMBL320720,,
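
The sample shown above changes on every run, and `random.sample` raises `ValueError` if fewer than 50 rows are available. A small sketch of a reproducible, size-safe alternative follows; `sample_positions` and its default seed are hypothetical.

```python
import random

def sample_positions(n_rows, k=50, seed=0):
    """Pick up to k distinct row positions, reproducibly (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    return rng.sample(range(n_rows), min(k, n_rows))

positions = sample_positions(7439)  # as many rows as the enol query returned
small = sample_positions(10)        # fewer rows than requested: returns all of them
```

The result can be passed straight to `iloc`, for example `df.iloc[sample_positions(len(df))]`.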


In most cases the simple transform seems to do what is expected. In some more complex molecular environments, though, especially where there are pi-systems in conjugation with the enol, the keto form can look awkward.

The question is whether this is enough to warrant the more complicated ruleset that would be required to handle it, or conversely, to warrant the removal of keto-enol tautomerism from the rules entirely.

### _beta_-diketones

As stated above, if the beta-keto enol form is to be preferred in some contexts, rules that transform the beta-diketone moiety in those contexts would be necessary. ChEMBL was searched for molecules to which such rules might apply...

In [13]:
# Find (acyclic) beta-diketones without substituents that might disfavour enol formation

smarts = "O=[C!$(*[O,NX3])]-!@[C!H0]-!@[C!$(*[O,NX3])]=O"

records = engine.execute(sql, (smarts, )).fetchall()

len(records)

885
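
What the query accepts and rejects can be checked on a few small molecules. This is a sketch assuming RDKit; the test molecules are chosen here for illustration and are not drawn from the ChEMBL results.

```python
from rdkit import Chem

query = Chem.MolFromSmarts("O=[C!$(*[O,NX3])]-!@[C!H0]-!@[C!$(*[O,NX3])]=O")

candidates = {
    "pentane-2,4-dione":             "CC(=O)CC(C)=O",
    "methyl acetoacetate (ester)":   "CC(=O)CC(=O)OC",
    "cyclohexane-1,3-dione (ring)":  "O=C1CC(=O)CCC1",
    "3,3-dimethylpentane-2,4-dione": "CC(=O)C(C)(C)C(C)=O",
}

hits = {name: Chem.MolFromSmiles(smi).HasSubstructMatch(query)
        for name, smi in candidates.items()}

for name, hit in hits.items():
    print(name, hit)
```

The `-!@` bonds restrict the match to acyclic linkages, `[C!H0]` requires a hydrogen on the central carbon, and the recursive terms reject ester and amide carbonyls.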

In [14]:
# Regenerate molecules and create data frame...

df2 = pd.DataFrame([fix_mol(x) for x in records]).set_index("chembl_id")

In [15]:
# Visualise a sample...

df2.iloc[random.sample(range(len(df2)), 50),]

chembl_id,mol
CHEMBL1454827,
CHEMBL3248910,
CHEMBL1469446,
CHEMBL2283072,
CHEMBL1969112,
CHEMBL1983319,
CHEMBL2408590,
CHEMBL3087332,
CHEMBL1568747,
CHEMBL145280,


It seems clear that at least some of these would likely favour the beta-keto enol form to some extent. Rules to convert these would thus be necessary if a more sophisticated scheme were to be implemented.

### Conclusion

Only the simplest 'enol -> ketone' transform is implemented at present. Thus, as all enols are converted to ketones, no 'reverse' rules such as 'beta-diketone -> beta-keto-enol' are necessary at this time. However, if a more sophisticated scheme were to be implemented, where not all enols are converted to the keto form, then rules to convert ketones to enols in the appropriate circumstances would be required.

Although the beta-diketone/beta-keto-enol system is the most obvious case where these complications could arise, there are a few other systems where the equilibrium may be significantly displaced towards the enol form. 