# Preprocessing a dataset of blood-brain barrier molecules with Datamol
> Sanitizing and manipulating molecules with binary labels of blood-brain barrier penetration (permeability).

- toc: false 
- badges: true
- comments: true
- categories: [python, bioinformatics, datasets, SMILES, cheminformatics, datamol, RDKit, molecules]
- image: images/mol.gif

SMILES (Simplified Molecular Input Line Entry System) is a standard notation that represents the molecular structure of a compound as a machine-readable string. The notation follows a handful of rules that allow the string to be converted into an image or a graph. SMILES strings can then be used to generate further representations for training machine learning models.
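To make the "handful of rules" a little more concrete, here is a minimal sketch that splits a SMILES string into its atom, bond, branch, and ring-closure tokens using only the standard library. The regular expression is a simplified assumption covering common organic-subset SMILES, not the full grammar, and `tokenize_smiles` is a hypothetical helper rather than a datamol or RDKit function.

```python
import re

# Simplified token pattern (an assumption, not the complete SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atom, e.g. [Cl] or [C@H]
    r"|Br|Cl"              # two-letter organic-subset atoms
    r"|[BCNOPSFI]"         # one-letter aliphatic atoms
    r"|[bcnops]"           # aromatic atoms
    r"|%\d{2}|\d"          # ring-closure labels
    r"|[()=#\-+/\\.:~*$]"  # branches, bonds, fragment separator
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


# First SMILES string of the BBBP dataset used in this post
tokens = tokenize_smiles("[Cl].CC(C)NCC(O)COc1cccc2ccccc12")
print(tokens[:6])  # ['[Cl]', '.', 'C', 'C', '(', 'C']
```

Tokens like these are what a parser walks through to build the molecular graph: atoms become nodes, digits close rings, and parentheses open and close branches.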

In [27]:
import datamol as dm
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In [28]:
# Load the dataset, which can be found on MoleculeNet.org
BBBP_df = pd.read_csv("data/BBBP.csv")
BBBP_df.head()

| | num | name | p_np | smiles |
|---|---|---|---|---|
| 0 | 1 | Propanolol | 1 | `[Cl].CC(C)NCC(O)COc1cccc2ccccc12` |
| 1 | 2 | Terbutylchlorambucil | 1 | `C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl` |
| 2 | 3 | 40730 | 1 | `c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...` |
| 3 | 4 | 24 | 1 | `C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C` |
| 4 | 5 | cloxacillin | 1 | `Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...` |


The dataframe has four named columns: the molecule number (`num`), the name, a binary label for blood-brain barrier permeability (`p_np`), and the SMILES string.

In [29]:

# The name and number can be dropped
BBBP_df = BBBP_df.drop(["num", "name"], axis=1)

# Checking the data for null values
BBBP_df["smiles"].isnull().values.any()

False

Mols and SMILES need to be sanitized; otherwise we can be left with SMILES that are complete nonsense, for example because of errors arising from kekulization.

![](images/kekul.jpg)

RDKit generates the alternating positions of the double bonds and then, in a second step called "aromatization", labels the ring as aromatic. In panel (2), there are three possible Lewis structures contributing to the actual structure (i.e. there is resonance), so the software would have to generate all three to be able to search for identical structures. [1]
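To illustrate why one aromatic system can correspond to several alternating double-bond patterns, the sketch below enumerates them in plain Python. It assumes a simple aromatic hydrocarbon in which every ring atom takes part in exactly one double bond, so each Kekulé structure is a perfect matching of the ring graph. This is not RDKit's kekulization algorithm; the atom numbering and the helper `kekule_structures` are hypothetical, and naphthalene is only an assumed example that may differ from the molecule shown in the figure.

```python
def kekule_structures(n_atoms, bonds):
    """Return every way to place double bonds so each atom has exactly one."""
    adjacency = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)

    def extend(unmatched):
        # All atoms paired up: one complete double-bond assignment found
        if not unmatched:
            yield frozenset()
            return
        first = min(unmatched)
        # Try a double bond from the lowest unpaired atom to each free neighbor
        for partner in sorted(adjacency[first] & unmatched):
            rest = unmatched - {first, partner}
            for matching in extend(rest):
                yield matching | {(first, partner)}

    return list(extend(frozenset(range(n_atoms))))


# Benzene: a single six-membered ring
benzene_bonds = [(i, (i + 1) % 6) for i in range(6)]

# Naphthalene: a ten-membered perimeter plus the fusion bond between atoms 4 and 9
naphthalene_bonds = [(i, (i + 1) % 10) for i in range(10)] + [(4, 9)]

print(len(kekule_structures(6, benzene_bonds)))       # 2
print(len(kekule_structures(10, naphthalene_bonds)))  # 3
```

Each returned set lists the atom pairs joined by a double bond. A toolkit that stored only one of these forms would need to recognize the others as the same molecule, which is what the aromatic labelling avoids.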

Below is a function that uses datamol to preprocess the dataset, including steps to generate mol objects, SELFIES, InChI, and InChIKeys for each molecule. The function also standardizes mols and SMILES, drops NA values, and returns a dataframe.

In [30]:
def preprocess_smiles(df):
    df["mol"] = [dm.to_mol(x) for x in df["smiles"]]  # generate mols from SMILES
    df["mol"] = [dm.fix_mol(x) for x in df["mol"]]  # fix mols

    df = df.dropna().copy()  # drop rows with NA values; copy so later assignments do not target a view

    df["mol"] = [dm.sanitize_mol(x, sanifix=True, charge_neutral=False) for x in df["mol"]]  # sanitize mol objects
    df["mol"] = [dm.standardize_mol(x, disconnect_metals=False, normalize=True, reionize=True, uncharge=False, stereo=True) for x in df["mol"]]  # standardize mol objects

    df["standard_smiles"] = [dm.standardize_smiles(x) for x in df["smiles"]]  # standardize the original SMILES
    df["selfies"] = [dm.to_selfies(x) for x in df["mol"]]  # generate SELFIES
    df["inchi"] = [dm.to_inchi(x) for x in df["mol"]]  # generate InChI
    df["inchikey"] = [dm.to_inchikey(x) for x in df["mol"]]  # generate InChIKey

    return df

Running the function and taking a look at the outputs:

In [31]:
data_clean = preprocess_smiles(BBBP_df)

In [32]:
data_clean.shape

(2039, 7)

In [33]:
data_clean.head()

| | p_np | smiles | mol | standard_smiles | selfies | inchi | inchikey |
|---|---|---|---|---|---|---|---|
| 0 | 1 | `[Cl].CC(C)NCC(O)COc1cccc2ccccc12` | `<img data-content="rdkit/molecule" src="data:i...` | `CC(C)NCC(O)COc1cccc2ccccc12.[Cl-]` | `[C][C][Branch1][C][C][N][C][C][Branch1][C][O][...` | `InChI=1S/C16H21NO2.ClH/c1-12(2)17-10-14(18)11-...` | ZMRUPTIKESYGQW-UHFFFAOYSA-M |
| 1 | 1 | `C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl` | `<img data-content="rdkit/molecule" src="data:i...` | `CC(C)(C)OC(=O)CCCc1ccc(N(CCCl)CCCl)cc1` | `[C][C][Branch1][C][C][Branch1][C][C][O][C][=Br...` | `InChI=1S/C18H27Cl2NO2/c1-18(2,3)23-17(22)6-4-5...` | SZXDOYFHSIIZCF-UHFFFAOYSA-N |
| 2 | 1 | `c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...` | `<img data-content="rdkit/molecule" src="data:i...` | `CC1COc2c(N3CCN(C)CC3)c(F)cc3c(=O)c(C(=O)O)cn1c23` | `[C][C][C][O][C][=C][Branch1][N][N][C][C][N][Br...` | `InChI=1S/C18H20FN3O4/c1-10-9-26-17-14-11(16(23...` | GSDSWSVVBLHKDQ-UHFFFAOYSA-N |
| 3 | 1 | `C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C` | `<img data-content="rdkit/molecule" src="data:i...` | `CC(=O)NCCCOc1cccc(CN2CCCCC2)c1` | `[C][C][=Branch1][C][=O][N][C][C][C][O][C][=C][...` | `InChI=1S/C17H26N2O2/c1-15(20)18-9-6-12-21-17-8...` | FAXLXLJWHQJMPK-UHFFFAOYSA-N |
| 4 | 1 | `Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...` | `<img data-content="rdkit/molecule" src="data:i...` | `Cc1onc(-c2ccccc2Cl)c1C(=O)N[C@@H]1C(=O)N2[C@@H...` | `[C][C][O][N][=C][Branch1][#Branch2][C][=C][C][...` | `InChI=1S/C19H18ClN3O5S/c1-8-11(12(22-28-8)9-6-...` | LQOLIRLGBULYKD-JKIFEVAISA-N |
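The InChIKeys in the last column have a fixed three-block layout, which makes them handy for grouping or de-duplicating molecules. The sketch below splits two of the keys from the output above. `split_inchikey` is a hypothetical helper, and the meaning attached to each block reflects my understanding of the standard InChIKey layout (14 characters for connectivity, 8 for the remaining layers plus a standard flag and a version character, and a final protonation character), so treat it as an assumption to check against the InChI documentation.

```python
def split_inchikey(inchikey):
    """Split an InChIKey into its blocks (layout assumed: 14-10-1 characters)."""
    connectivity, remainder, protonation = inchikey.split("-")
    if (len(connectivity), len(remainder), len(protonation)) != (14, 10, 1):
        raise ValueError(f"Unexpected InChIKey layout: {inchikey!r}")
    return {
        "connectivity": connectivity,     # hash of the molecular skeleton
        "other_layers": remainder[:8],    # hash of stereo and other layers
        "standard": remainder[8] == "S",  # "S" marks a standard InChI
        "version": remainder[9],
        "protonation": protonation,       # "N" is expected for no (de)protonation
    }


# Two InChIKeys taken from the dataframe output above
keys = ["ZMRUPTIKESYGQW-UHFFFAOYSA-M", "LQOLIRLGBULYKD-JKIFEVAISA-N"]
for key in keys:
    parts = split_inchikey(key)
    print(parts["connectivity"], parts["other_layers"], parts["protonation"])
```

Comparing only the first block is one possible way to find entries that share a skeleton but differ in stereochemistry, although whether that is appropriate depends on the modelling task.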


The data contains roughly a 3:1 ratio of positive to negative labels, which biases the dataset towards molecules that cross the blood-brain barrier. This imbalance may need to be addressed when training models.

In [34]:
counts = data_clean['p_np'].value_counts().to_dict()
print(counts)

{1: 1560, 0: 479}
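One possible way to address the imbalance is to weight each class inversely to its frequency. The sketch below starts from the label counts printed above and applies the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, using only the standard library. This weighting scheme is an assumption for illustration; resampling or a different loss function may suit a given model better.

```python
# Label counts reported above for the cleaned dataset
counts = {1: 1560, 0: 479}

n_samples = sum(counts.values())
n_classes = len(counts)

# Ratio of positive (p_np = 1) to negative (p_np = 0) labels
imbalance_ratio = counts[1] / counts[0]

# "Balanced" weights: the rarer class receives the larger weight
class_weights = {
    label: n_samples / (n_classes * n) for label, n in counts.items()
}

print(f"positive:negative = {imbalance_ratio:.2f}:1")
for label, weight in sorted(class_weights.items()):
    print(f"label {label}: weight {weight:.3f}")
```

With these counts the negative class ends up weighted a little more than three times as heavily as the positive class, mirroring the ratio between the two labels.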


## References

1. Urbaczek, Sascha. A consistent cheminformatics framework for automated virtual screening. Ph.D. thesis, Universität Hamburg, August 2014. URL: http://ediss.sub.uni-hamburg.de/volltexte/2015/7349/; URN: urn:nbn:de:gbv:18-73491.