# Descriptors

## Molecular Descriptors

> Molecular descriptors can be defined as mathematical representations of molecules’ properties that are generated by algorithms. The numerical values of molecular descriptors are used to quantitatively describe the physical and chemical information of the molecules. An example of molecular descriptors is the LogP which is a quantitative representation of the [lipophilicity](https://www.sciencedirect.com/topics/pharmacology-toxicology-and-pharmaceutical-science/lipophilicity) of the molecules, it is obtained by measuring the partitioning of the molecule between an aqueous phase and a lipophilic phase which consists usually of water/*n*-octanol. - [source](https://www.sciencedirect.com/topics/medicine-and-dentistry/molecular-descriptor)
> 
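
As a rough illustration of the definition quoted above, LogP is the base-10 logarithm of the ratio between a compound's equilibrium concentration in the octanol phase and its concentration in the water phase. The sketch below computes it from two measured concentrations. The helper name `log_p` and the example values are hypothetical and are not part of Datamol.

```python
import math


def log_p(conc_octanol: float, conc_water: float) -> float:
    """Return log10 of the octanol/water concentration ratio.

    Hypothetical helper for illustration only. Both concentrations
    must be positive and expressed in the same units.
    """
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# A compound that is 100 times more concentrated in octanol than in water
print(log_p(conc_octanol=100.0, conc_water=1.0))
```

A positive value means the compound prefers the lipophilic phase, and a negative value means it prefers the aqueous phase.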

Molecular descriptors can generally be classified in four ways:

![image.png](attachment:c13f670f-e6ec-4364-adc7-044cea35600c.png)

([source](https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/05%3A_5._Quantitative_Structure_Property_Relationships/5.03%3A_Molecular_Descriptors))

## Tutorial

In this tutorial, we’ll show how descriptors can be useful as filters in the drug discovery process. This tutorial was inspired by the [TeachOpenCADD talktorial](https://projects.volkamerlab.org/teachopencadd/talktorials/T002_compound_adme.html?highlight=descriptors). We highly encourage you to read through the theory in the talktorial before diving in, so that you understand ADME and why we care about it in the drug discovery process. It provides the background needed to fully understand the purpose of this tutorial.

This tutorial focuses on the following descriptors and their limits:

- Molecular weight ≤ 500 Da
- Number of hydrogen bond acceptors (HBAs) ≤ 10
- Number of hydrogen bond donors (HBDs) ≤ 5
- Calculated LogP (octanol-water partition coefficient) ≤ 5

These descriptors and their limits are collectively known as **[Lipinski’s rule of five (Ro5)](https://www.sciencedirect.com/science/article/abs/pii/S0169409X96004231)**, a method used to estimate a compound’s oral bioavailability based solely on its chemical structure. If a molecule violates these rules (e.g. a molecular weight of 700 Da), it is more likely to **exhibit poor absorption or permeation**, and it can subsequently be removed from your list.
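
The four limits above can be sketched as a small check on a single compound. This is a minimal illustration, not Datamol’s API: the names `Ro5Properties`, `count_ro5_violations` and `passes_ro5` are hypothetical, and so are the property values. The `max_violations` parameter is an assumption added for flexibility. It defaults to 0, which matches the strict filter used later in this tutorial, while a commonly used relaxed variant tolerates one violation.

```python
from dataclasses import dataclass


@dataclass
class Ro5Properties:
    """Hypothetical container for the four Ro5 descriptors of one compound."""

    mw: float    # molecular weight in Da
    n_hba: int   # number of hydrogen bond acceptors
    n_hbd: int   # number of hydrogen bond donors
    logp: float  # calculated LogP


def count_ro5_violations(props: Ro5Properties) -> int:
    """Count how many of the four Ro5 limits the compound exceeds."""
    return sum(
        [
            props.mw > 500,
            props.n_hba > 10,
            props.n_hbd > 5,
            props.logp > 5,
        ]
    )


def passes_ro5(props: Ro5Properties, max_violations: int = 0) -> bool:
    """Return True if the compound has at most `max_violations` violations."""
    return count_ro5_violations(props) <= max_violations


# Made-up compound with a molecular weight of 700 Da (one violation)
heavy = Ro5Properties(mw=700.0, n_hba=8, n_hbd=3, logp=4.2)
print(count_ro5_violations(heavy))
print(passes_ro5(heavy))
print(passes_ro5(heavy, max_violations=1))
```

With the strict default the 700 Da compound is rejected, whereas allowing one violation would keep it.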

This tutorial walks through a real-world scenario in four parts:

- **Part 1:** Obtain a virtual screening library from **[Enamine](https://enamine.net/compound-libraries/targeted-libraries/dna-library)**
    - The DNA library is designed to identify novel active compounds against proteins that are essential for DNA stability. At 5530 compounds, it is one of Enamine’s smaller libraries. The same functions could be applied to some of the larger libraries using Datamol’s parallelization utilities.
- **Part 2:** Calculate the molecular properties relevant to the Ro5 for each compound in the library
- **Part 3:** Investigate compliance with the Ro5
- **Part 4:** Reveal the statistics for the dataset of compounds using the Ro5 as a filter. With this, we can answer our question: how many compounds fulfill vs. violate the Ro5?
    - We also plot the descriptor distributions with Matplotlib and Seaborn to make the data easier to interpret

In [None]:
import datamol as dm

# Part 1: Obtain a list of molecules and visualize
# Load the SDF file downloaded from Enamine with the flag as_df set to True
# This will automatically create a 'smiles' column from the SDF file
data = dm.read_sdf('/home/data/Enamine_DNA_Libary_5530cmpds_20200831.sdf', as_df=True)
smiles = data["smiles"].tolist()
# Convert each SMILES string into a molecule object
mols = [dm.to_mol(s) for s in smiles]
# Display nine molecules from the library in a 3x3 grid
dm.to_image(mols[900:909], n_cols=3)

In [None]:
# Calculate a specific descriptor for a compound
n_aromatic_atoms = dm.descriptors.n_aromatic_atoms(mols[0])
print("Number of aromatic atoms in the compound is", n_aromatic_atoms)

In [None]:
# Part 2: Calculate the relevant molecular properties for the Ro5 for the list

# Calculate many descriptors for a compound
dm.descriptors.compute_many_descriptors(mols[900])

In [None]:
# Batch compute many descriptors for a list of compounds
df = dm.descriptors.batch_compute_many_descriptors(mols)
df

In [None]:
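# Part 3: Investigate compliance with Ro5

# A compound is kept only if it satisfies all four criteria
ro5_mask = (
    (df['mw'] <= 500)
    & (df['n_lipinski_hba'] <= 10)
    & (df['n_lipinski_hbd'] <= 5)
    & (df['clogp'] <= 5)
)
print(f"{ro5_mask.sum()} of {len(df)} compounds satisfy all Ro5 criteria")
df = df[ro5_mask]
df

# 5350 of the 5530 compounds in the dataset satisfy all criteria in the rule of 5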
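
To see how many compounds fulfill vs. violate the Ro5 without discarding any rows, one option is to count violations per compound before filtering. The sketch below does this on a tiny made-up table that reuses the descriptor column names from above (`mw`, `n_lipinski_hba`, `n_lipinski_hbd`, `clogp`). The values, the `toy` DataFrame, and the `RO5_LIMITS` dictionary are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Upper limit for each Ro5 descriptor, keyed by the column names used above
RO5_LIMITS = {"mw": 500, "n_lipinski_hba": 10, "n_lipinski_hbd": 5, "clogp": 5}

# Made-up descriptor values for three compounds (illustration only)
toy = pd.DataFrame(
    {
        "mw": [320.4, 712.9, 489.0],
        "n_lipinski_hba": [5, 12, 9],
        "n_lipinski_hbd": [2, 6, 1],
        "clogp": [2.1, 3.3, 5.8],
    }
)

# Boolean table: True where a compound exceeds the limit for that descriptor
violations = toy[list(RO5_LIMITS)].gt(pd.Series(RO5_LIMITS))

# Number of violated rules per compound, and strict compliance (no violations)
toy["n_violations"] = violations.sum(axis=1)
toy["ro5_compliant"] = toy["n_violations"] == 0

print(violations.sum())                     # number of violations per rule
print(toy["ro5_compliant"].value_counts())  # fulfill vs. violate
```

Keeping the violation counts as columns also makes it easy to see which rule removes the most compounds, or to relax the filter later.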

In [None]:
# Part 4: Reveal the statistics for the dataset of compounds using Ro5 as a filter. How many fulfill vs. violate Ro5? 
# Plot the distribution of each Ro5 descriptor

import matplotlib.pyplot as plt
import seaborn as sns

# Set the default font size before creating the figure
plt.rcParams['font.size'] = 12
fig, axs = plt.subplots(ncols=4, figsize=(25, 6))

# One histogram per descriptor, each on its own axis
sns.histplot(df, x='mw', ax=axs[0])
sns.histplot(df, x='n_lipinski_hba', ax=axs[1])
sns.histplot(df, x='n_lipinski_hbd', ax=axs[2])
sns.histplot(df, x='clogp', ax=axs[3])
plt.tight_layout()
plt.show()

If you’re curious to learn more about some of the other established rules in the drug discovery industry, the following names are good starting points for a search:

- Rules of CNS
- BBB score
- Rule of Egan
- Rule-of-5
- Beyond Rule-of-5
- Rule-of-4
- Ghose Filter
- Zinc Rule
- Rule of GSK (4/400)
- Lead-Like Soft Rule
- Oprea’s Rule
- Pfizer Rule (3/75)
- REOS Filter
- Rule-of-3
- Extended Rule-of-3
- Veber Filter

## References

- TeachOpenCADD - [https://projects.volkamerlab.org/teachopencadd/talktorials/T002_compound_adme.html?highlight=descriptors](https://projects.volkamerlab.org/teachopencadd/talktorials/T002_compound_adme.html?highlight=descriptors)
- ADME criteria ([Wikipedia](https://en.wikipedia.org/wiki/ADME) and [Mol Pharm. (2010), 7(5), 1388-1405](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025274/))
- What are lead compounds? ([Wikipedia](https://en.wikipedia.org/wiki/Lead_compound))
- What is the LogP value? ([Wikipedia](https://en.wikipedia.org/wiki/Partition_coefficient))
- Lipinski et al. “Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings.” ([Adv. Drug Deliv. Rev. (1997), 23, 3-25](https://www.sciencedirect.com/science/article/pii/S0169409X96004231))