# Cheminformatics Background and Concepts

This notebook will introduce you to some background concepts in Cheminformatics and get you started working with the software RDKit.
[Cheminformatics](https://en.wikipedia.org/wiki/Cheminformatics) involves the use of computational techniques, including machine learning and statistical methods, for managing, analyzing, and predicting chemical data and properties.
Cheminformatics techniques are commonly used in pharmaceutical and drug discovery applications.

One commonly used Python library for cheminformatics is [RDKit](https://en.wikipedia.org/wiki/RDKit).
RDKit is an open-source cheminformatics library, primarily developed in C++, and was created by [Dr. Greg Landrum](https://scholar.google.com/citations?user=xr9paY0AAAAJ&hl=en) in 2006. 
We will be using the Python interface to RDKit, though there are interfaces in other languages.


## Molecular Representations: SMILES

SMILES stands for "Simplified Molecular-Input Line-Entry System" and is a way to represent molecules as a string of characters.

You can read more about SMILES at [this tutorial](https://archive.epa.gov/med/med_archive_03/web/html/smiles.html), but rules for atoms and bonds are also repeated below.

### Atoms
SMILES supports all elements in the periodic table. An atom is represented using its respective atomic symbol. Upper case letters refer to non-aromatic atoms; lower case letters refer to aromatic atoms. If the atomic symbol has more than one letter the second letter must be lower case.

### Bonds
```
-	Single bond
=	Double bond
#	Triple bond
:	Aromatic bond
.	Disconnected structures
```
Single bonds are the default and therefore need not be entered. For example, 'CC' would mean that there is a non-aromatic carbon attached to another non-aromatic carbon by a single bond, and the computer would identify the structure as the chemical ethane. It is also assumed that the bond between two lower case atom symbols is aromatic. A blank terminates the SMILES string.

### Branches

A branch from a chain is specified by placing the SMILES symbol(s) for the branch in parentheses. Some examples:

```

CC(O)C	2-Propanol
CC(=O)C	2-Propanone
CC(CC)C	2-Methylbutane
CC(C)CC(=O)	3-Methylbutanal
c1c(N(=O)=O)cccc1	Nitrobenzene
CC(C)(C)CC	2,2-Dimethylbutane
```

Most of the time, you will not need to write a SMILES string by hand. You will be able to look up a molecule's SMILES string from a web database like [PubChem](https://pubchem.ncbi.nlm.nih.gov/).

You can also use tools like this [molecule sketcher](https://pubchem.ncbi.nlm.nih.gov//edit3/index.html) to draw molecules and get their SMILES strings.

<div class="exercise admonition">
<p class="admonition-title">Check Your Understanding</p>
<p>Based on what you've learned about SMILES strings, answer the following questions:</p>
    <ul>
        <li>What is the SMILES string for ethanol?</li>
        <li>What is the SMILES string for water?</li>
        <li>What molecule is represented by the SMILES string O=C=O?</li>
    </ul>
<p>Check your answers to the questions above using the PubChem molecule sketcher. Notice that you can type in a SMILES string and have the sketcher draw the molecule for you.</p>
</div>

### Fill in your answers here:
Double click the cell to edit.

1.
2.
3.


Now that we have an introduction to SMILES, 
we will use SMILES with RDKit to create RDKit molecule objects.


We are going to use a part of RDKit called `Chem`. To use `Chem`, we first have to import it. 

In [None]:
from rdkit import Chem

## Creating Molecules with RDKit

To get information about molecules in RDKit, we first have to create variables representing molecules.
RDKit has a molecule object that can be used to retrieve information or calculate properties.
First, the molecule's structure has to be communicated to RDKit in a way that computers understand.

### Creating molecules using SMILES

In the first part of this notebook, we learned about molecular representations using SMILES strings. Now we will use SMILES strings to create molecule objects in RDKit.

We can create a representation of methane using RDKit by using the `MolFromSmiles` function in `rdkit.Chem`.
If you put an RDKit `mol` object as the last thing in a Jupyter notebook cell, you will see a drawing of the molecule.

In [None]:
methane = Chem.MolFromSmiles("C")
methane
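As a minimal sketch of how `MolFromSmiles` behaves, the cell below builds a few molecules from SMILES strings that appeared earlier in this notebook. The dictionary name `examples` and the deliberately malformed string `"C1CC"` are illustrative choices, not part of the original notebook.

```python
from rdkit import Chem

# SMILES strings taken from the examples earlier in this notebook.
examples = {
    "ethane": "CC",
    "2-propanol": "CC(O)C",
    "2-propanone": "CC(=O)C",
}

for name, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)
    # Hydrogens are implicit here, so GetNumAtoms() counts only heavy atoms.
    print(f"{name}: {mol.GetNumAtoms()} heavy atoms, "
          f"canonical SMILES {Chem.MolToSmiles(mol)}")

# Two different ways of writing 2-propanol give the same canonical SMILES.
same = (Chem.MolToSmiles(Chem.MolFromSmiles("CC(O)C"))
        == Chem.MolToSmiles(Chem.MolFromSmiles("OC(C)C")))
print(f"Same canonical SMILES: {same}")

# A string RDKit cannot parse (an unclosed ring here) gives None, not a molecule.
bad = Chem.MolFromSmiles("C1CC")
print(f"Parsing failed: {bad is None}")
```

Checking for `None` after parsing is a useful habit when SMILES strings come from a file or database, since RDKit reports a parse failure as a message rather than raising an exception.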

Create RDKit molecule objects for the following molecules. You can look up the SMILES strings on <a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a>:
<p>
    <ul>
        <li> Propane
        <li> Ethene
        <li> Cyclohexane
        <li> Benzene
    </ul>
</p>

In [None]:
# fill in 
propane = 
propane

In [None]:
# fill in 
ethene = 
ethene

## Molecular Similarity: Substructure Searches

Sometimes you may wish to search a set of molecules and identify molecules that have certain functional groups.
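In the language of graph representation, we would be looking for our molecule graph to contain a certain subgraph.
In the following cells, we create a caffeine molecule and a carbon-oxygen double bond pattern, then search caffeine for that pattern.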

In [None]:
caffeine = Chem.MolFromSmiles('CN1C=NC2=C1C(=O)N(C(=O)N2C)C')
caffeine

In [None]:
o_pattern = Chem.MolFromSmiles("C=O")

In [None]:
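# Each match is a tuple of the atom indices in caffeine that fit the pattern.
matches = caffeine.GetSubstructMatches(o_pattern)
# Display the molecule again; the notebook renderer may highlight matched atoms.
caffeine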
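To see what the search returns, the sketch below repeats it and prints the results. It assumes the same caffeine SMILES string and C=O pattern as above; `HasSubstructMatch` is a yes/no version of the same search.

```python
from rdkit import Chem

caffeine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
o_pattern = Chem.MolFromSmiles("C=O")

# HasSubstructMatch answers only whether the pattern occurs at least once.
has_carbonyl = caffeine.HasSubstructMatch(o_pattern)
print(f"Contains C=O: {has_carbonyl}")

# GetSubstructMatches returns one tuple of atom indices per match.
matches = caffeine.GetSubstructMatches(o_pattern)
print(f"Number of matches: {len(matches)}")

# Translate the atom indices of each match back into element symbols.
match_symbols = []
for match in matches:
    symbols = [caffeine.GetAtomWithIdx(idx).GetSymbol() for idx in match]
    match_symbols.append(symbols)
    print(match, symbols)
```

Note that a pattern built with `MolFromSmiles` matches atoms by element, so it also finds carbonyl carbons that RDKit treats as aromatic. A pattern written in SMARTS follows stricter rules about aromatic and non-aromatic atoms.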

## Molecular Fingerprints

Molecular fingerprints are representations of molecules that are usually bit strings, or vectors of 0's and 1's. 
Fingerprints are built by considering the molecular structure (often as some sort of graph representation) and applying a certain algorithm to create the vector.

<center>
<img src="images/Topological_Fingerprint.png">
</center>

Image from [Chemistry LibreTexts](https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics/06%3A_Molecular_Similarity/6.01%3A_Molecular_Descriptors) [Cheminformatics Course](https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics)

There are many different fingerprinting algorithms, but they tend to fall into two groups: similarity fingerprints and substructure fingerprints.
A common similarity fingerprint that is used is the Morgan fingerprint.
A common substructure fingerprint that is used is the [Daylight fingerprint](https://www.daylight.com/dayhtml/doc/theory/theory.finger.html) (the RDKFingerprint is a Daylight-like fingerprint).

In [None]:
from rdkit.Chem import AllChem

fpgen = AllChem.GetMorganGenerator(radius=2)

benzene = Chem.MolFromSmiles("c1ccccc1")
aniline = Chem.MolFromSmiles("Nc1ccccc1")
pyridine = Chem.MolFromSmiles("n1ccccc1")

In [None]:
benzene

In [None]:
aniline

In [None]:
pyridine

In [None]:
# Get fingerprints for the molecules.
benzene_fp = fpgen.GetFingerprint(benzene)
aniline_fp = fpgen.GetFingerprint(aniline)
pyridine_fp = fpgen.GetFingerprint(pyridine)

RDKit will let us see the bitstring for each fingerprint:

In [None]:
benzene_fp.ToBitString()
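The full bit string is long and mostly zeros, so it can be easier to look at which bits are set. The sketch below assumes the generator is imported from `rdkit.Chem.rdFingerprintGenerator` and writes `fpSize=2048` explicitly, on the assumption that this matches the default length used above.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# fpSize is spelled out here; 2048 is assumed to be the default length.
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
benzene_fp = fpgen.GetFingerprint(Chem.MolFromSmiles("c1ccccc1"))

# Total length of the bit vector.
print(f"Number of bits: {benzene_fp.GetNumBits()}")

# How many bits are set to 1, and where they are.
on_bits = list(benzene_fp.GetOnBits())
print(f"Number of on bits: {benzene_fp.GetNumOnBits()}")
print(f"Positions of on bits: {on_bits}")
```

Because benzene is small and highly symmetric, only a handful of bits are set. Larger molecules with more distinct atom environments set more bits.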

In [None]:
from rdkit import DataStructs

# measure similarity - a higher number is more similar
DataStructs.TanimotoSimilarity(benzene_fp, aniline_fp)
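The Tanimoto similarity of two bit-vector fingerprints is the number of bits set in both, divided by the number of bits set in either. As a sketch, the cell below computes that ratio by hand for benzene and aniline and compares it with RDKit's value. The variable names are illustrative.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2)
benzene_fp = fpgen.GetFingerprint(Chem.MolFromSmiles("c1ccccc1"))
aniline_fp = fpgen.GetFingerprint(Chem.MolFromSmiles("Nc1ccccc1"))

# Collect the positions of the bits that are set in each fingerprint.
benzene_bits = set(benzene_fp.GetOnBits())
aniline_bits = set(aniline_fp.GetOnBits())

# Tanimoto = (bits set in both) / (bits set in at least one).
shared = len(benzene_bits & aniline_bits)
total = len(benzene_bits | aniline_bits)
manual = shared / total

rdkit_value = DataStructs.TanimotoSimilarity(benzene_fp, aniline_fp)
print(f"By hand: {manual:.4f}   RDKit: {rdkit_value:.4f}")

# A fingerprint compared with itself has the maximum similarity of 1.
self_similarity = DataStructs.TanimotoSimilarity(benzene_fp, benzene_fp)
print(f"Benzene vs. benzene: {self_similarity}")
```

The value always lies between 0 (no bits in common) and 1 (identical fingerprints).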

In [None]:
# Exercise - measure similarity between benzene and pyridine


## RDKit and Molecular Descriptors

A molecular descriptor is a numerical value or a set of values that represent specific structural or chemical features of a molecule. Molecular descriptors are based on molecular structure and allow statistical analysis and similarity measurements on molecules.

RDKit supports the calculation of many molecular descriptors. You can see a [full list of RDKit descriptors](https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors) or [see the module documentation](https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html).

To get molecular descriptors from RDKit, we import the `Descriptors` module.

```python
from rdkit.Chem import Descriptors
```

To calculate a descriptor, call the descriptor function with a molecule object as its argument:

```python
Descriptors.descriptor_name(molecule_variable)

```

Some commonly used descriptors are listed below, along with the RDKit function that calculates each one.


Name of Property                            | Name of RDKit Descriptor Function
--------------------------------------------|----------------------------------
molecular weight                            | Descriptors.MolWt
number of heavy atoms                       | Descriptors.HeavyAtomCount
number of H-bond donors                     | Descriptors.NumHDonors
number of H-bond acceptors                  | Descriptors.NumHAcceptors
octanol-water partition coefficient (log P) | Descriptors.MolLogP
topological polar surface area              | Descriptors.TPSA
number of rotatable bonds                   | Descriptors.NumRotatableBonds
number of aromatic rings                    | Descriptors.NumAromaticRings
number of aliphatic rings                   | Descriptors.NumAliphaticRings

In [None]:
from rdkit.Chem import Descriptors

In [None]:
print("Printing info for methane:")
print(f"The molecular weight is {Descriptors.MolWt(methane)}")
print(f"The number of aromatic rings is {Descriptors.NumAromaticRings(methane)}")
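The same pattern extends to several molecules at once. The sketch below loops over a small dictionary of molecules and prints a few descriptors from the table for each. The dictionary name `molecules` and the choice of molecules are illustrative.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Molecules built from SMILES strings used earlier in this notebook.
molecules = {
    "methane": Chem.MolFromSmiles("C"),
    "2-propanol": Chem.MolFromSmiles("CC(O)C"),
    "aniline": Chem.MolFromSmiles("Nc1ccccc1"),
}

for name, mol in molecules.items():
    # Each descriptor function takes a molecule and returns a number.
    mol_wt = Descriptors.MolWt(mol)
    heavy_atoms = Descriptors.HeavyAtomCount(mol)
    h_donors = Descriptors.NumHDonors(mol)
    aromatic_rings = Descriptors.NumAromaticRings(mol)
    print(f"{name}: MolWt={mol_wt:.2f}, heavy atoms={heavy_atoms}, "
          f"H-bond donors={h_donors}, aromatic rings={aromatic_rings}")
```

Storing molecules in a dictionary keeps each name paired with its molecule object, which makes it easy to add more molecules or more descriptor columns later.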
