<a href="https://colab.research.google.com/github/chupvl/gcolab/blob/main/2023_04_25_rdkit_fingerprints_nonfolded.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How many actual "fragments" are in the drug-like ChEMBL space?

*Vladimir Chupakhin, 2023-04-27*

Fragment-based descriptors such as Morgan fingerprints are among the simplest descriptors used in ML for chemistry, and they are usually used in a folded (fixed-length) form. Typical folded sizes are 1024 or 2048 bits (2048 is the default of the RDKit Morgan generator), with some recommendations to use higher dimensionality (>=4096) for computational chemogenomics tasks.

Descriptors are folded for two main reasons: to save space and to get a uniform length. The price is bit collisions (different fragments mapped to the same bit) and, sometimes, insensitivity to small structural changes.
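
To make the collision cost concrete, here is a minimal sketch of folding. It assumes folding is done by taking each unfolded bit id modulo the folded size, which is a common scheme. The bit ids and the helper name `fold_bits` are made up for illustration and are not part of the RDKit API.

```python
def fold_bits(bit_ids, n_bits):
    """Fold unfolded bit ids into a fixed-length space of n_bits positions."""
    # Assumption: folding is a plain modulo of the bit id by the folded size
    return {b % n_bits for b in bit_ids}


# Hypothetical unfolded bit ids for one molecule (not real Morgan hashes)
sparse_bits = [5, 1029, 2053, 77, 4173]

folded_1024 = fold_bits(sparse_bits, 1024)
folded_4096 = fold_bits(sparse_bits, 4096)

# Bits lost to collisions = distinct unfolded bits - distinct folded bits
lost_1024 = len(set(sparse_bits)) - len(folded_1024)
lost_4096 = len(set(sparse_bits)) - len(folded_4096)

print(sorted(folded_1024), lost_1024)
print(sorted(folded_4096), lost_4096)
```

With these toy ids, five fragments collapse onto two positions at 1024 bits and onto four positions at 4096 bits, which is the trade-off behind the recommendation to use larger folded sizes.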

Let's check how many distinct bits actually occur in drugs, as an example of FDA-approved chemical space, versus a random selection of 300K compounds from the Enamine REAL virtual space of 3B compounds, which is commonly used in virtual screening.

## Libs

In [1]:
!pip install rdkit -q

In [2]:
import rdkit
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
from rdkit.Chem.Draw import IPythonConsole
print(rdkit.__version__)
%pylab inline

import itertools
import pandas as pd

2022.09.5
Populating the interactive namespace from numpy and matplotlib


In [3]:
def getUniqueBitsForMList(lst_mol, radius):
    """
    Collect the unique unfolded (sparse) Morgan fingerprint bits over a list of molecules.

    Parameters:
    lst_mol (list): A list of RDKit molecule objects (None entries are skipped)
    radius (int): The radius to use for the Morgan fingerprint generator

    Returns:
    set: A set of unique bit ids generated using the Morgan fingerprint generator

    """
    mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius)
    unique_bits_set = set()
    for m in lst_mol:
        # Chem.MolFromSmiles returns None for SMILES it cannot parse
        if m is None:
            continue
        # The sparse fingerprint is not folded to a fixed fpSize
        unique_bits_set.update(mfpgen.GetSparseFingerprint(m).GetOnBits())
    return unique_bits_set

## Checking for ChEMBL drugs

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [5]:
df_drugs = pd.read_csv('/content/gdrive/MyDrive/gcollab/data/chembl/chembl_drugs_lite_20230221.csv')
lst_drugs = list(set(df_drugs['canonical_smiles_std'].to_list()))
mols_drugs = [ Chem.MolFromSmiles(s) for s in lst_drugs ]
len(lst_drugs)

4875

In [6]:
nUB_drugs = {}
for r in [2, 3, 4]:
  UB = getUniqueBitsForMList(mols_drugs, r)
  nUB_drugs[r] = len(UB)
  print(f'{len(UB)} unique bits for {len(mols_drugs)} drugs for radius {r} of Morgan fingerprint')

29201 unique bits for 4875 drugs for radius 2 of Morgan fingerprint
75912 unique bits for 4875 drugs for radius 3 of Morgan fingerprint
117811 unique bits for 4875 drugs for radius 4 of Morgan fingerprint


## Checking for 300K random compounds from Enamine lead-like

In [7]:
df_ell = pd.read_csv('/content/gdrive/MyDrive/gcollab/data/enamine_real/Enamine_random.txt.gz', header=None, compression='gzip', sep='\t')
df_ell.columns = ['smiles', 'id', 'sm']
df_ell = df_ell[['smiles', 'id']]
df_ell.head()

| | smiles | id |
|---|---|---|
| 0 | CNC1=NC=CC=C1Br | Z1650167172 |
| 1 | CSCCN1CCC(N(C)C)CC1 | Z644156236 |
| 2 | COC(C)(C)CN(C)C(=O)CSC | Z1412073910 |
| 3 | CCC(C)OCC(=O)NC(C)CO | Z1497619376 |
| 4 | CC(CO)CSC(C)C(=O)NC1CC1 | Z1268437678 |


In [8]:
df_ell.shape

(392661, 2)

In [9]:
# surprisingly fast enough, but dask would be nice to have
df_ell_300k = df_ell.sample(300000)
lst_ell_300k = list(set(df_ell_300k['smiles'].to_list()))
mols_ell_300k = [ Chem.MolFromSmiles(s) for s in lst_ell_300k ]
len(lst_ell_300k)

300000

In [10]:
nUB_ell = {}
for r in [2, 3, 4]:
  UB = getUniqueBitsForMList(mols_ell_300k, r)
  nUB_ell[r] = len(UB)
  print(f'{len(UB)} unique bits for 300K Enamine Lead-like compounds for radius {r} of Morgan fingerprint')

89070 unique bits for 300K Enamine Lead-like compounds for radius 2 of Morgan fingerprint
596124 unique bits for 300K Enamine Lead-like compounds for radius 3 of Morgan fingerprint
1899777 unique bits for 300K Enamine Lead-like compounds for radius 4 of Morgan fingerprint


In [13]:
nUB_ell_100K = {}
for r in [2, 3, 4]:
  UB = getUniqueBitsForMList(mols_ell_300k[:100000], r)
  nUB_ell_100K[r] = len(UB)
  print(f'{len(UB)} unique bits for 100K Enamine Lead-like compounds for radius {r} of Morgan fingerprint')

65627 unique bits for 100K Enamine Lead-like compounds for radius 2 of Morgan fingerprint
361353 unique bits for 100K Enamine Lead-like compounds for radius 3 of Morgan fingerprint
961158 unique bits for 100K Enamine Lead-like compounds for radius 4 of Morgan fingerprint
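
The 100K and 300K runs above are two points on a growth curve. A minimal sketch of computing such a curve in a single pass is below. It assumes the on-bits of each molecule are already available as plain lists of integers (for example from `GetOnBits()`); `unique_bit_growth` and the toy data are hypothetical names introduced only for illustration.

```python
def unique_bit_growth(bit_lists, checkpoints):
    """Return {n: number of unique bits seen in the first n molecules}."""
    wanted = set(checkpoints)
    seen = set()
    growth = {}
    for n, bits in enumerate(bit_lists, start=1):
        # Accumulate every bit seen so far
        seen.update(bits)
        if n in wanted:
            growth[n] = len(seen)
    return growth


# Hypothetical on-bit lists, one per molecule (toy values, not real fingerprints)
toy_bit_lists = [[1, 2], [2, 3], [3, 4, 5], [1, 5]]

growth = unique_bit_growth(toy_bit_lists, checkpoints=[1, 2, 4])
print(growth)
```

Recording several checkpoints in one pass avoids recomputing fingerprints for each subset size, and a flattening curve would indicate that new compounds mostly reuse fragments that were already seen.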


## Stats

In [14]:
stats = pd.DataFrame({'Drugs (~5K)': nUB_drugs, 
                      'Enamine REAL Lead-like (100K)': nUB_ell_100K,
                      'Enamine REAL Lead-like (300K)': nUB_ell})
stats

| Radius | Drugs (~5K) | Enamine REAL Lead-like (100K) | Enamine REAL Lead-like (300K) |
|---|---|---|---|
| 2 | 29201 | 65627 | 89070 |
| 3 | 75912 | 361353 | 596124 |
| 4 | 117811 | 961158 | 1899777 |


So, for radius 3, the number of unique bits per compound is about 16 for the drug set (75,912 bits over 4,875 compounds), and it drops to about 4 and 2 for the larger 100K and 300K samples from the Enamine REAL library. The Enamine virtual space is, of course, repetitive, so each additional compound contributes fewer new bits. Still, the total number of unique bits keeps growing with the size of the chemical space: to more than half a million at radius 3 and about 1.9 million at radius 4 for 300K compounds. This raises the good old question about using folded descriptors for ML tasks, especially in a very large chemical space.
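
The per-compound figures can be reproduced from the counts in the table above. The sketch below also adds a rough load estimate, the average number of distinct unfolded bits that would share one folded position. That estimate assumes fragments spread evenly over the folded positions, which is an idealization and not something measured in this notebook. The helper names are hypothetical.

```python
# Unique-bit counts reported in the table above, keyed by Morgan radius
unique_bits = {
    "Drugs (~5K)": {2: 29201, 3: 75912, 4: 117811},
    "Enamine REAL Lead-like (100K)": {2: 65627, 3: 361353, 4: 961158},
    "Enamine REAL Lead-like (300K)": {2: 89070, 3: 596124, 4: 1899777},
}
# Number of compounds in each set, as reported by the notebook
n_compounds = {
    "Drugs (~5K)": 4875,
    "Enamine REAL Lead-like (100K)": 100000,
    "Enamine REAL Lead-like (300K)": 300000,
}


def bits_per_compound(name, radius):
    """Unique bits in a set divided by the number of compounds in it."""
    return unique_bits[name][radius] / n_compounds[name]


def bits_per_position(name, radius, n_bits):
    """Average number of distinct unfolded bits sharing one folded position."""
    # Assumption: fragments spread evenly over the n_bits folded positions
    return unique_bits[name][radius] / n_bits


for name in unique_bits:
    print(f"{name}: {bits_per_compound(name, 3):.1f} unique bits per compound at radius 3")

load = bits_per_position("Enamine REAL Lead-like (300K)", 3, 4096)
print(f"{load:.1f} distinct bits per folded position at 4096 bits")
```

Under the even-spread assumption, on the order of 145 distinct radius-3 fragments from the 300K set would share each position of a 4096-bit fingerprint.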