<a href="https://colab.research.google.com/github/UAMCAntwerpen/2040FBDBIC/blob/master/Topic_01/Chemical_informatics_with_RDKit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Chemical informatics (or chemo-informatics) with RDKit

In this notebook, we'll give a quick overview of RDKit and its functions.

First install the necessary Python libraries:

In [None]:
!pip install rdkit mols2grid requests

Now import the necessary Python libraries:

In [None]:
# RDKit chemistry
from rdkit import Chem

# RDKit drawing
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdDepictor
IPythonConsole.ipython_useSVG = True
rdDepictor.SetPreferCoordGen(True)

# Library to display molecules in a grid
import mols2grid

# Library to download files
import requests

### Displaying a chemical structure

Create a molecule (benzene) from a SMILES string and put the molecule into a variable called **mol**:

In [None]:
mol = Chem.MolFromSmiles("c1ccccc1")

We can display the value of a variable in a Jupyter notebook by typing the variable name as the last line of a cell and pressing Shift+Return:

In [None]:
mol

Get some information about this molecule:

In [None]:
print("number of atoms:", mol.GetNumAtoms())
nBonds = mol.GetNumBonds()
print("number of bonds:", nBonds)

The **mol** variable holds an RDKit molecule object rather than human-readable text, but the molecule can be converted back to a SMILES string:

In [None]:
print(Chem.MolToSmiles(mol))
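
As a small illustration of the conversion back to SMILES, the sketch below parses two different spellings of benzene (aromatic and Kekulé form) and writes each one out again. It assumes the default behaviour of `Chem.MolToSmiles()`, which writes a canonical SMILES, so both inputs should give the same string. The variable names are only for illustration.

```python
from rdkit import Chem

# Two different ways of writing benzene: aromatic and Kekulé form.
inputs = ["c1ccccc1", "C1=CC=CC=C1"]

# MolToSmiles() writes a canonical SMILES by default, so both inputs
# are expected to come out as the same string.
canonical = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in inputs]
print(canonical)
print("same output:", canonical[0] == canonical[1])
```

This is why the output SMILES is useful for comparing molecules that were entered in different ways.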

Invalid SMILES can be detected by testing whether the returned molecule is **None**:

In [None]:
for smiles in ["CCCC", "c"]:
  print("Smiles:", smiles)
  mol = Chem.MolFromSmiles(smiles)
  print(mol)
  print(mol is None)
  print("")

Hydrogen atoms are by default considered to be implicitly present, but not explicit. The **AllChem.AddHs()** and **AllChem.RemoveHs()** functions can be used to make the hydrogens explicit or implicit:

In [None]:
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles("C")
print("number of atoms:", mol.GetNumAtoms())
print("number of bonds:", mol.GetNumBonds())
print("")
mol = AllChem.AddHs(mol)
print("number of atoms:", mol.GetNumAtoms())
print("number of bonds:", mol.GetNumBonds())
print("")
mol = AllChem.RemoveHs(mol)
print("number of atoms:", mol.GetNumAtoms())
print("number of bonds:", mol.GetNumBonds())
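
Implicit hydrogens are still counted by RDKit even though they are not atoms in the molecule object. The sketch below uses ethanol as an illustrative example (not one of the molecules above) and asks each heavy atom how many hydrogens it carries with `GetTotalNumHs()`, then compares the atom count before and after `Chem.AddHs()`.

```python
from rdkit import Chem

# Ethanol with implicit hydrogens: only the three heavy atoms are stored.
mol = Chem.MolFromSmiles("CCO")

# Ask every heavy atom how many hydrogens (implicit + explicit) it carries.
h_counts = [(atom.GetSymbol(), atom.GetTotalNumHs()) for atom in mol.GetAtoms()]
print(h_counts)

# After AddHs() the hydrogens become real atoms in a new molecule object.
mol_h = Chem.AddHs(mol)
print("heavy atoms only:", mol.GetNumAtoms())
print("with explicit Hs:", mol_h.GetNumAtoms())
```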

The SMILES representations for most marketed drugs are available from the Wikipedia page for the corresponding drug. For instance, we can get the SMILES for the oncology drug Imatinib (aka Gleevec) from [Wikipedia](https://en.wikipedia.org/wiki/Imatinib). With this SMILES string in hand, we can generate an RDKit molecule:

In [None]:
glvc = Chem.MolFromSmiles("Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C")

In [None]:
glvc
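
When a SMILES string is copied from an external page, a quick sanity check is to compare the molecular formula with the one listed for the drug. The sketch below does this for the Imatinib SMILES used above, assuming the `rdMolDescriptors.CalcMolFormula()` helper; the comparison value C29H31N7O is the published formula of Imatinib.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Same SMILES string as used for the glvc molecule above.
glvc = Chem.MolFromSmiles("Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C")

# The formula includes the implicit hydrogens.
formula = rdMolDescriptors.CalcMolFormula(glvc)
print("formula:", formula)
print("heavy atoms:", glvc.GetNumAtoms())
print("matches published formula:", formula == "C29H31N7O")
```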

## Looping over atoms and bonds

In [None]:
mol = Chem.MolFromSmiles('C1OC1')
for atom in mol.GetAtoms():
  print(atom.GetAtomicNum(), atom.GetIdx(), atom.GetSymbol(), atom.GetExplicitValence())

Atom indices:

In [None]:
for i in range(0, mol.GetNumAtoms()):
  print(i, mol.GetAtomWithIdx(i).GetSymbol())

Atom neighbors:

In [None]:
for atom in mol.GetAtoms():
  neighbors = atom.GetNeighbors()
  print(neighbors)
  print(atom.GetIdx(), end = ": ")
  for neighbor in neighbors: print(neighbor.GetIdx(), end="-")
  print("")

Bonds:

In [None]:
for bond in mol.GetBonds():
  bt = bond.GetBondType()
  bbi = bond.GetBeginAtomIdx()
  bei = bond.GetEndAtomIdx()
  print(bt, bbi, "-", bei)
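
The atom-neighbor and bond loops describe the same connectivity from two directions. As a minimal sketch, the code below builds an adjacency list from the bonds of the same three-membered ring and checks it against `GetNeighbors()`. The `adjacency` dictionary is just an illustrative data structure, not something RDKit provides.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("C1OC1")

# Start with an empty neighbor list for every atom index.
adjacency = {atom.GetIdx(): [] for atom in mol.GetAtoms()}

# Each bond contributes one entry to both of its end atoms.
for bond in mol.GetBonds():
    a, b = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
    adjacency[a].append(b)
    adjacency[b].append(a)

for idx, nbrs in adjacency.items():
    # Compare with the neighbor indices reported by the atom itself.
    from_atom = sorted(n.GetIdx() for n in mol.GetAtomWithIdx(idx).GetNeighbors())
    print(idx, mol.GetAtomWithIdx(idx).GetSymbol(), sorted(nbrs), from_atom)
```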

## Rings

In [None]:
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = True
mol = Chem.MolFromSmiles('OC1C2C1CC2')
mol

In [None]:
for atom in mol.GetAtoms():
  idx = atom.GetIdx()
  ring = atom.IsInRing()
  r3 = atom.IsInRingSize(3)
  r4 = atom.IsInRingSize(4)
  r6 = atom.IsInRingSize(6)
  print(idx, ring, r3, r4, r6)

In [None]:
mol = Chem.MolFromSmiles("OC1C2C1CC2")
smallestSetOfSmallestRings = Chem.GetSymmSSSR(mol)
n_sssr = len(smallestSetOfSmallestRings)
print(n_sssr)
for i in range(n_sssr): print(list(smallestSetOfSmallestRings[i]))
mol

In [None]:
mol = Chem.MolFromSmiles("c12ccccc1cccc2")
smallestSetOfSmallestRings = Chem.GetSymmSSSR(mol)
n_sssr = len(smallestSetOfSmallestRings)
print(n_sssr)
for i in range(n_sssr): print(list(smallestSetOfSmallestRings[i]))
mol

In [None]:
mol = Chem.MolFromSmiles("[C@H]12[C@@H]3[C@@H]4[C@H]1[C@H]5[C@@H]4[C@H]3[C@@H]25")
smallestSetOfSmallestRings = Chem.GetSymmSSSR(mol)
n_sssr = len(smallestSetOfSmallestRings)
print(n_sssr)
for i in range(n_sssr): print(list(smallestSetOfSmallestRings[i]))
mol
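
Note that `Chem.GetSymmSSSR()` returns a symmetrized ring set, which can contain more rings than a strict smallest set of smallest rings. A rough way to see this, sketched below, is to compare its size with the number of independent rings computed as bonds minus atoms plus connected fragments. The three SMILES are the ring systems used above (the last one without stereo labels), and the helper name `independent_ring_count` is made up for this example.

```python
from rdkit import Chem


def independent_ring_count(mol):
    """Number of independent rings: bonds - atoms + connected fragments."""
    n_fragments = len(Chem.GetMolFrags(mol))
    return mol.GetNumBonds() - mol.GetNumAtoms() + n_fragments


ring_systems = ["OC1C2C1CC2", "c12ccccc1cccc2", "C12C3C4C1C5C4C3C25"]
counts = []
for smi in ring_systems:
    mol = Chem.MolFromSmiles(smi)
    n_independent = independent_ring_count(mol)
    n_symmetrized = len(Chem.GetSymmSSSR(mol))
    counts.append((n_independent, n_symmetrized))
    print(smi, "independent:", n_independent, "symmetrized:", n_symmetrized)
```

For the cage-like third molecule the symmetrized set is expected to be larger than the independent count, because symmetry-equivalent rings are all kept.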

## Reading and writing molecules

Single molecules:

In [None]:
mol = Chem.MolFromSmiles("c1ccccc1")
mol is None

In [None]:
mol = Chem.MolFromSmiles("c1cCc1")
mol is None

In [None]:
smiles = ['c1ccccc1', 'c1cCc1']
mols = []
for s in smiles:
  mol = Chem.MolFromSmiles(s)
  if mol is None:
    continue
  else:
    mols.append(mol)
print(len(mols))
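
A variation on the loop above keeps track of the rejected strings as well, which helps when cleaning a larger input list. This is only a sketch: the candidate strings are illustrative, and `CO(C)C` is used as an invalid example because it gives oxygen three bonds.

```python
from rdkit import Chem

candidates = ["c1ccccc1", "CO(C)C", "CCO"]

valid, invalid = [], []
for s in candidates:
    mol = Chem.MolFromSmiles(s)
    if mol is None:
        # Parsing or sanitization failed: remember the offending string.
        invalid.append(s)
    else:
        valid.append(mol)

print("parsed:", len(valid))
print("rejected:", invalid)
```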

InChI strings:

In [None]:
from rdkit.Chem import inchi
mol = Chem.MolFromSmiles("C1CCNCC1")
inchistring = inchi.MolToInchi(mol)
print(inchistring)
mol = inchi.MolFromInchi(inchistring)
print(Chem.MolToSmiles(mol))
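
The round trip through InChI can be checked by comparing canonical SMILES before and after. The sketch below also generates an InChIKey, the fixed-length hashed form of the InChI, with `inchi.MolToInchiKey()`. It assumes an RDKit build with InChI support, which is the case for the standard `pip` package as far as we know.

```python
from rdkit import Chem
from rdkit.Chem import inchi

mol = Chem.MolFromSmiles("C1CCNCC1")  # piperidine
smiles_before = Chem.MolToSmiles(mol)

# SMILES -> InChI -> molecule -> SMILES
inchi_string = inchi.MolToInchi(mol)
restored = inchi.MolFromInchi(inchi_string)
smiles_after = Chem.MolToSmiles(restored)

# The InChIKey is a 27-character hash that is convenient for database lookups.
inchi_key = inchi.MolToInchiKey(mol)

print(inchi_string)
print(inchi_key)
print("round trip preserved the molecule:", smiles_before == smiles_after)
```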

MOL blocks:

In [None]:
mol = Chem.MolFromSmiles('C1CCC1')
print(Chem.MolToMolBlock(mol))

Property data:

In [None]:
mol.SetProp("_Name","cyclobutane")
print(Chem.MolToMolBlock(mol))

In [None]:
mol.GetProp("_Name")
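
The `_Name` property ends up on the first line of the MOL block, and it is read back from there when the block is parsed. A minimal sketch of that round trip, using `Chem.MolFromMolBlock()`:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("C1CCC1")
mol.SetProp("_Name", "cyclobutane")

# Write the molecule to a MOL block and parse that text again.
block = Chem.MolToMolBlock(mol)
restored = Chem.MolFromMolBlock(block)

print("title line:", block.splitlines()[0])
print("restored name:", restored.GetProp("_Name"))
print("restored atoms:", restored.GetNumAtoms())
```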

## Working with conformations

In [None]:
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol.SetProp("_Name", "aspirin")
print(Chem.MolToMolBlock(mol))

In [None]:
mol = Chem.AddHs(mol)
print(Chem.MolToMolBlock(mol))

In [None]:
AllChem.EmbedMolecule(mol)
print(Chem.MolToMolBlock(mol))

In [None]:
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol = Chem.AddHs(mol)
conformationIds = AllChem.EmbedMultipleConfs(mol, numConfs=10)
print(len(conformationIds))
w = Chem.SDWriter("aspirin.sdf")
for cid in conformationIds: w.write(mol, confId = cid)
w.close()

In [None]:
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol = Chem.AddHs(mol)
conformationIds = AllChem.EmbedMultipleConfs(mol, numConfs=10)
rmslist = []
AllChem.AlignMolConformers(mol, RMSlist = rmslist)
for rms in rmslist: print(rms)
w = Chem.SDWriter("aspirin.sdf")
for cid in conformationIds: w.write(mol, confId = cid)
w.close()
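
Conformer embedding is stochastic, so the coordinates differ from run to run. The sketch below shows one possible refinement, not part of the workflow above: it fixes the `randomSeed` to make the embedding reproducible, relaxes every conformer with the MMFF94 force field through `AllChem.MMFFOptimizeMoleculeConfs()`, and reports the conformer with the lowest energy. The seed value and iteration limit are arbitrary choices.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))

# A fixed random seed makes the embedding reproducible between runs.
cids = list(AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42))

# Returns one (not_converged, energy) pair per conformer;
# a first value of 0 means the minimization converged.
results = AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)
energies = [energy for _not_converged, energy in results]

# Position of the lowest-energy conformer in the list of conformer ids.
best = min(range(len(cids)), key=lambda i: energies[i])
print("conformers:", len(cids))
print("lowest-energy conformer id:", cids[best])
print("energy (kcal/mol):", round(energies[best], 2))
```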

## Substructure searching: SMARTS

In [None]:
m = Chem.MolFromSmiles('c1ccccc1O')
m

In [None]:
smartsMol = Chem.MolFromSmarts('ccO')
m.HasSubstructMatch(smartsMol)

In [None]:
m.GetSubstructMatch(smartsMol)

In [None]:
m.GetSubstructMatches(smartsMol)
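
Each match is a tuple of atom indices in the molecule, ordered like the atoms in the SMARTS pattern. The sketch below prints the element symbol for every matched index, which makes it easier to see that `ccO` matches phenol twice: once through each of the two aromatic carbons next to the carbon that carries the oxygen.

```python
from rdkit import Chem

m = Chem.MolFromSmiles("c1ccccc1O")
pattern = Chem.MolFromSmarts("ccO")

# One tuple per match; position i in the tuple is the molecule atom
# that was matched by pattern atom i.
matches = m.GetSubstructMatches(pattern)
for match in matches:
    symbols = [m.GetAtomWithIdx(i).GetSymbol() for i in match]
    print(match, symbols)
```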

In [None]:
bromophenols = ["Oc1ccccc1Br", "Oc1cccc(Br)c1", "Oc1ccc(Br)cc1"]
mols = []
for bp in bromophenols: mols.append(Chem.MolFromSmiles(bp))

In [None]:
p = Chem.MolFromSmarts("Br[$(c1c([OH])cccc1),$(c1ccc([OH])cc1)]")
for mol in mols:
  if mol.HasSubstructMatch(p):
    print(Chem.MolToSmiles(mol), "True")
  else:
    print(Chem.MolToSmiles(mol), "False")
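
The pattern above uses recursive SMARTS: each `$(...)` describes the environment of a single atom, and the comma means "or". It therefore reads as "a bromine attached to an aromatic carbon that is either ortho or para to a hydroxyl group". The sketch below repeats the search with the three isomers labeled (the labels are added here for illustration) and shows that only the bromine and its carbon are returned as matched atoms.

```python
from rdkit import Chem

pattern = Chem.MolFromSmarts("Br[$(c1c([OH])cccc1),$(c1ccc([OH])cc1)]")

isomers = {
    "ortho": "Oc1ccccc1Br",
    "meta": "Oc1cccc(Br)c1",
    "para": "Oc1ccc(Br)cc1",
}

results = {}
for name, smi in isomers.items():
    mol = Chem.MolFromSmiles(smi)
    results[name] = mol.HasSubstructMatch(pattern)
    # Atoms inside $(...) only constrain the environment; they are not
    # part of the returned match, which is empty when nothing matches.
    print(name, results[name], mol.GetSubstructMatch(pattern))
```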

### Reading multiple chemical structures

The RDKit also provides the ability to read molecules from common molecular structure formats. In the code below we use the RDKit's **Chem.SDMolSupplier()** function to read molecules from an [SD file](https://en.wikipedia.org/wiki/Chemical_table_file). First, we'll download the file from GitHub:

In [None]:
url = "https://raw.githubusercontent.com/UAMCAntwerpen/2040FBDBIC/master/Topic_01/Example_compounds.sdf"
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
with open('example_compounds.sdf', 'w') as f:
  f.write(r.text)

Now we'll read the file:

In [None]:
mols = [x for x in Chem.SDMolSupplier("example_compounds.sdf")]

The code above reads the molecules into a list. When we display this, we see a list of molecule objects. Below we'll take a look at a couple of ways to display multiple chemical structures in a grid:

In [None]:
mols
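
`Chem.SDMolSupplier()` yields **None** for records it cannot parse, so it can be worth filtering the list. The self-contained sketch below does not use the downloaded file: it writes two illustrative molecules to a temporary SD file (the file name `demo.sdf` is made up), reads them back, skips any failed records, and prints the names stored in the file.

```python
import os
import tempfile

from rdkit import Chem

# Write a small SD file with two named molecules.
named = {"benzene": "c1ccccc1", "ethanol": "CCO"}
path = os.path.join(tempfile.mkdtemp(), "demo.sdf")
writer = Chem.SDWriter(path)
for name, smi in named.items():
    mol = Chem.MolFromSmiles(smi)
    mol.SetProp("_Name", name)
    writer.write(mol)
writer.close()

# Read the file back, skipping records that failed to parse.
mols = [m for m in Chem.SDMolSupplier(path) if m is not None]
names = [m.GetProp("_Name") for m in mols]
print(len(mols), names)
```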

### Displaying multiple chemical structures in a grid

The RDKit's built-in **MolsToGridImage()** function provides a convenient way of displaying a grid of structures:

In [None]:
Draw.MolsToGridImage(mols,molsPerRow=4,useSVG=True)
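
`MolsToGridImage()` also accepts a `legends` list, with one label per molecule, and a `subImgSize` for the size of each cell. The sketch below uses three illustrative molecules rather than the downloaded file. In a notebook, putting `img` on the last line of the cell displays the grid.

```python
from rdkit import Chem
from rdkit.Chem import Draw

named = {"benzene": "c1ccccc1", "phenol": "c1ccccc1O", "piperidine": "C1CCNCC1"}
mols = [Chem.MolFromSmiles(smi) for smi in named.values()]
legends = list(named.keys())  # one label per molecule, in the same order

img = Draw.MolsToGridImage(
    mols,
    molsPerRow=3,
    subImgSize=(200, 200),
    legends=legends,
    useSVG=True,
)
img
```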