```
This script can be used for any purpose without limitation subject to the
conditions at http://www.ccdc.cam.ac.uk/Community/Pages/Licences/v2.aspx

This permission notice and the following statement of attribution must be
included in all copies or substantial portions of this script.

2022-06-01: Made available by the Cambridge Crystallographic Data Centre.

```

# Molecular Geometry Analysis

[Mogul](https://www.ccdc.cam.ac.uk/support-and-resources/ccdcresources/mogul_2020_1.pdf) uses a knowledge base of intramolecular geometric parameters derived from the CSD to perform geometric analyses on small molecules.
Similar molecular geometry analyses may be performed using the [Conformer API](https://downloads.ccdc.cam.ac.uk/documentation/API/descriptive_docs/molecular_geometry_analysis.html).

In [None]:
import logging
from pathlib import Path
from platform import platform
import sys
import os
from time import time

import warnings

In [None]:
with warnings.catch_warnings():
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)  # Ignore current 'distutils Version classes are deprecated' warning
    
    import pandas as pd

    import plotly.express as px

In [None]:
from IPython.display import HTML

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [None]:
import ccdc
from ccdc.conformer import GeometryAnalyser
from ccdc.io import MoleculeReader
from ccdc.diagram import DiagramGenerator

### Configuration

### Initialization

In [None]:
logger = logging.getLogger(__name__)

if not logger.hasHandlers():
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('[%(asctime)s %(levelname)-7s] %(message)s', datefmt='%y-%m-%d %H:%M:%S'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

In [None]:
logger.info(f"""
Platform:                     {platform()}

Python exe:                   {sys.executable}
Python version:               {'.'.join(str(x) for x in sys.version_info[:3])}

CSD version:                  {ccdc.io.csd_version()}
CSD directory:                {ccdc.io.csd_directory()}
API version:                  {ccdc.__version__}

CSDHOME:                      {os.environ.get('CSDHOME', 'Not set')}
CCDC_LICENSING_CONFIGURATION: {os.environ.get('CCDC_LICENSING_CONFIGURATION', 'Not set')}
""")

Set up a CCDC Diagram Generator...

In [None]:
diagram_generator = DiagramGenerator()

diagram_generator.settings.return_type = 'SVG'
diagram_generator.settings.explicit_polar_hydrogens = False
diagram_generator.settings.shrink_symbols = False

Utility to help with display in JupyterLab...

In [None]:
show_df = lambda df: HTML(df.to_html(escape=False).replace(r'\n', ''))

# show_df = lambda df: df.style.set_properties(**{'text-align': 'left'})

### Geometry analysis of a small molecule

First, set up a CCDC [Geometry Analyser](https://downloads.ccdc.cam.ac.uk/documentation/API/modules/conformer_api.html#ccdc.conformer.GeometryAnalyser)...

In [None]:
analyser = GeometryAnalyser()

analyser.settings.generalisation = False  # Use only fully-defined distributions
analyser.settings.ring.analyse   = False  # Can be slow, so disable it for now

Next, we load a molecule to analyse. This is a local copy of the ligand [4QQ](https://www.ebi.ac.uk/pdbe/entry/pdb/1ett/bound/4QQ) from the PDBe structure [1ETT](https://www.ebi.ac.uk/pdbe/entry/pdb/1ett) (Bovine Thrombin).

In [None]:
ligand_file = '1ett.mol2'

In [None]:
with MoleculeReader(ligand_file) as reader:
    
    molecule = reader[0]

If we depict the molecule, we see that the benzene rings are not shown as aromatic, as would be expected...

In [None]:
HTML(diagram_generator.image(molecule))

We thus standardise the molecule to CSD conventions...

_N.B._ this is not always necessary, but is quick and can't hurt for structures taken from outside the CSD ecosystem.

In [None]:
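molecule.remove_hydrogens()                  # Strip hydrogens so they can be re-added consistently
molecule.assign_bond_types(which='unknown')  # Assign types only to bonds whose type is unknown
molecule.standardise_aromatic_bonds()        # Apply CSD aromatic bond conventions
molecule.add_hydrogens()                     # Re-add hydrogens to the standardised structure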

In [None]:
HTML(diagram_generator.image(molecule))

Analyse our molecule of interest...

In [None]:
analysed_mol = analyser.analyse_molecule(molecule)

len(analysed_mol.analysed_torsions)  # Number of torsions found

Make a dataframe of the analysis results...

* `value` is the value of the torsion angle in the molecule being analysed.
* `unusual` indicates whether the geometric feature is considered unusual or not.
* `enough_hits` indicates whether there are enough hits in the CSD for a sound judgement.
* `d_min` is the distance, in degrees, from the query value to the nearest value in the CSD distribution.
* `local_density` is the percentage of CSD values within 10 degrees of the query value.
* `depiction` is a 2D depiction with the torsion highlighted.
* `object` is the API torsion object, cached here for later reference.
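
To make the `d_min` and `local_density` definitions concrete, the sketch below computes both for a toy list of angles. The helper names `nearest_distance` and `local_density_percent` and the sample values are hypothetical, and the code assumes the angles have already been folded onto 0-180 degrees; it illustrates the definitions given here and is not the CSD Python API's own implementation.

```python
def nearest_distance(query, observed):
    """Smallest absolute difference between the query angle and any observed angle."""
    return min(abs(query - v) for v in observed)


def local_density_percent(query, observed, window=10.0):
    """Percentage of observed angles lying within `window` degrees of the query."""
    close = sum(1 for v in observed if abs(query - v) <= window)
    return 100.0 * close / len(observed)


# Hypothetical sample: most observations cluster near 180 degrees
observed_angles = [172.0, 175.5, 178.0, 179.0, 180.0, 65.0, 60.0, 176.0]
query_angle = 95.0

d_min = nearest_distance(query_angle, observed_angles)
density = local_density_percent(query_angle, observed_angles)
print(f"d_min = {d_min:.1f}, local_density = {density:.1f}%")
```

A query far from every observation gives a large `d_min` and a low `local_density`, which is the pattern to look for in the dataframe built below.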

Local utility to depict a molecule with a torsion highlighted...

In [None]:
def depict_torsion(torsion):
    """Return a depiction of the molecule with the atoms of the given torsion highlighted."""
    return diagram_generator.image(molecule, highlight_atoms=[molecule.atoms[x] for x in torsion.atom_indices])

In [None]:
torsions_df = pd.DataFrame(
                [('-'.join(x.atom_labels), x.value, x.unusual, x.enough_hits, x.d_min, x.local_density, depict_torsion(x), x) for x in analysed_mol.analysed_torsions],
                columns=['atom_labels', 'value', 'unusual', 'enough_hits', 'd_min', 'local_density', 'depiction', 'object']
            ).sort_values('d_min', ascending=False).reset_index(drop=True)

torsions_df.shape

For convenience, we will examine only the torsions that are flagged as 'unusual' and that have enough hits for the result to be reasonably certain...

In [None]:
with warnings.catch_warnings():
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)  # Ignore current 'distutils Version classes are deprecated' warning
    
    unusual_df = torsions_df.query("unusual and enough_hits").drop(['unusual', 'enough_hits'], axis=1).reset_index(drop=True)

unusual_df.shape

In [None]:
show_df(unusual_df.drop('object', axis=1).head(3))  # Top three
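
The selection logic can also be tried without pandas or a CSD installation. The following is a minimal sketch of the same sort-then-filter step on plain dictionaries; the rows, atom labels and numbers are made up for illustration and do not come from the 1ETT ligand.

```python
# Toy rows standing in for the analysed torsions (all values are made up)
rows = [
    {"atom_labels": "C1-C2-C3-C4", "unusual": True,  "enough_hits": True,  "d_min": 12.4},
    {"atom_labels": "C2-C3-C4-N1", "unusual": False, "enough_hits": True,  "d_min": 0.3},
    {"atom_labels": "C3-C4-N1-C5", "unusual": True,  "enough_hits": False, "d_min": 25.0},
    {"atom_labels": "C4-N1-C5-O1", "unusual": True,  "enough_hits": True,  "d_min": 31.7},
]

# Sort by d_min, largest first, mirroring sort_values('d_min', ascending=False)
ranked = sorted(rows, key=lambda r: r["d_min"], reverse=True)

# Keep only rows that are both unusual and backed by enough hits
selected = [r for r in ranked if r["unusual"] and r["enough_hits"]]

for r in selected:
    print(r["atom_labels"], r["d_min"])
```

Note that the row with the second-largest `d_min` is dropped because it lacks enough hits: a large deviation alone does not qualify a torsion for closer inspection.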

### Plotting distributions of CSD values

Plotting a histogram of the CSD values used in the geometry analysis can be a great help in evaluating the result.

We will illustrate plotting with one of the unusual torsions shown above...

In [None]:
n = 1

torsion = unusual_df.iloc[n]['object']  # Extract the cached API torsion object from dataframe

In [None]:
with warnings.catch_warnings():
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)  # Ignore current 'distutils Version classes are deprecated' warning
    
    (px.histogram(
            x=torsion.distribution,
            range_x=(0, 180),
            title='Distribution of torsions in CSD with observed value marked'
        )
        .update_xaxes(title_text="Torsion Angle")
        .update_yaxes(title_text="Number of observations")
        .add_vline(x=abs(torsion.value))
        .show()
    )
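
The plot marks `abs(torsion.value)` on a 0-180 degree axis, which treats a torsion and its mirror image as equivalent. As a rough sketch of that idea, the code below folds signed angles onto 0-180 degrees and counts them in 10-degree bins. The helpers `fold_torsion` and `bin_counts` and the sample values are hypothetical, and the assumption that the CSD distribution is stored as folded values is inferred from the axis range used above rather than confirmed here.

```python
from collections import Counter


def fold_torsion(angle):
    """Map a signed torsion angle in degrees onto the 0-180 range."""
    wrapped = (angle + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    return abs(wrapped)


def bin_counts(angles, bin_width=10.0):
    """Count folded angles per bin; keys are the lower edge of each bin."""
    counts = Counter()
    for a in angles:
        folded = fold_torsion(a)
        # Put exactly 180 degrees in the last bin rather than a bin of its own
        lower = min((folded // bin_width) * bin_width, 180.0 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


sample = [-178.0, 179.5, 62.0, -58.0, 65.0, 180.0]  # hypothetical values
histogram = bin_counts(sample)
print(histogram)
```

Folding before binning keeps -178 and 179.5 degrees in the same bin, which is why the observed value is also passed through `abs()` before being drawn as a vertical line.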