SMILES is a standard way of encoding the molecular structure of a compound as a simple string. The string representation of the structure in the figure above is OC(=O)C1=CN(C2CC2)c3cc(N4CCNCC4)c(F)cc3C1=O. These strings are easy to convert into 2D drawings, which makes SMILES a popular choice both for ML models and for visualization. There are other ways to represent structures as well, but SMILES has the advantage of being fairly human-readable, and it can be transformed into other representations such as graphs.
Since I am a data scientist without a strong chemistry background, I will not go in depth on how SMILES works. Readers who want more detail can consult the OpenSMILES documentation.
In short, it is a powerful way of representing structures: it can express different kinds of atoms, bonds, and rings, as well as more complex concepts such as branching and aromaticity.
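
To make the building blocks a bit more concrete, here is a minimal sketch that splits a SMILES string into tokens and labels each one. The regular expression and the helper names `tokenize` and `describe` are my own simplification for illustration, not part of datamol or RDKit, and they cover only the common organic subset rather than the full OpenSMILES grammar.

```python
import re
from collections import Counter

# Simplified token pattern (an assumption, not the full OpenSMILES grammar):
# bracket atoms, two-letter halogens, organic-subset atoms (lowercase means
# aromatic), bond symbols, branch parentheses, ring-closure digits, and the dot.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[=#\-/\\:]|\(|\)|%\d{2}|\d|\.)"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def describe(token):
    """Return a coarse label for a single token."""
    if token.startswith("["):
        return "bracket atom"
    if token in ("(", ")"):
        return "branch"
    if token[0].isdigit() or token.startswith("%"):
        return "ring closure"
    if token in ("=", "#", "-", "/", "\\", ":"):
        return "bond"
    if token == ".":
        return "disconnection"
    if token.islower():
        return "aromatic atom"
    return "atom"


# The example string quoted in the text above
example = "OC(=O)C1=CN(C2CC2)c3cc(N4CCNCC4)c(F)cc3C1=O"
counts = Counter(describe(t) for t in tokenize(example))
for kind, n in counts.items():
    print(f"{kind}: {n}")
```

Each ring is opened and closed by a matching digit, so the number of ring-closure tokens is twice the number of ring bonds written this way, and each branch contributes one opening and one closing parenthesis.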

In [1]:
import datamol as dm
import pandas as pd

In [3]:
BBBP_df = pd.read_csv("data/BBBP.csv")
BBBP_df.head()

Unnamed: 0,num,name,p_np,smiles
0,1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12
1,2,Terbutylchlorambucil,1,C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl
2,3,40730,1,c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...
3,4,24,1,C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C
4,5,cloxacillin,1,Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...


In [4]:
BBBP_df = BBBP_df.drop(["num", "name"], axis=1)
BBBP_df["smiles"].isnull().values.any()
#BBBP_df = BBBP_df.dropna()

False

Mols and SMILES need to be sanitized; otherwise we can be left with SMILES that are complete nonsense, for example because of errors that arise during kekulization.
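
As a small illustration of what sanitization catches, the sketch below feeds three strings to RDKit directly (datamol is built on top of RDKit). The example strings and the dictionary labels are my own choices for illustration; the only behavior relied on is that `Chem.MolFromSmiles` sanitizes by default and returns `None` when a structure fails.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parser error messages so only our own output is printed
RDLogger.DisableLog("rdApp.*")

# Hypothetical test strings: one valid, two chemically invalid
candidates = {
    "valid aromatic ring": "c1ccccc1",
    "oxygen with three bonds": "CO(C)C",
    "ring that cannot be kekulized": "c1cc1",
}

# Chem.MolFromSmiles sanitizes by default and returns None on failure
parsed = {label: Chem.MolFromSmiles(smi) for label, smi in candidates.items()}

for label, mol in parsed.items():
    status = "ok" if mol is not None else "rejected"
    print(f"{label}: {status}")
```

Rows whose molecule comes back as `None` are the ones that need to be dropped or repaired before any further processing.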

![](images/kekul.jpg)

If you were to search for the left molecule in panel (1) using an image search or a SMILES string, you might miss the right molecule in that panel.

According to the RDKit documentation, the software routinely generates the alternate positions of the double bonds and then, in a second step it calls "aromatization", labels the ring as aromatic. In panel (2), there are three possible Lewis structures contributing to the actual structure (i.e. there is resonance), so the software would have to generate all three to be able to search for identical structures.
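
The point about alternate double-bond positions can be checked with a short sketch. It assumes RDKit is used directly, and the two o-xylene strings are my own example rather than the molecules in the figure: they differ only in where the ring double bonds are drawn.

```python
from rdkit import Chem

# Two Kekulé drawings of o-xylene: the ring double bonds sit in alternate positions
kekule_a = "CC1=C(C)C=CC=C1"
kekule_b = "CC1=CC=CC=C1C"

mol_a = Chem.MolFromSmiles(kekule_a)
mol_b = Chem.MolFromSmiles(kekule_b)

# Sanitization perceives aromaticity, so both inputs end up with an aromatic ring
aromatic_atoms_a = sum(atom.GetIsAromatic() for atom in mol_a.GetAtoms())
aromatic_atoms_b = sum(atom.GetIsAromatic() for atom in mol_b.GetAtoms())

# Canonical SMILES are written from the aromatic form, so they can be compared
canonical_a = Chem.MolToSmiles(mol_a)
canonical_b = Chem.MolToSmiles(mol_b)

print("raw strings identical:", kekule_a == kekule_b)
print("canonical SMILES identical:", canonical_a == canonical_b)
```

Comparing raw input strings would treat these as two different molecules, which is why the pipeline below standardizes every structure before deriving identifiers from it.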

In [None]:
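# Build the "mol" column first (it does not exist yet); dm.to_mol returns None for unparsable SMILES
BBBP_df["mol"] = [dm.to_mol(x) for x in BBBP_df["smiles"]]
print(BBBP_df["mol"].isnull().values.any())
BBBP_df = BBBP_df.dropna(subset=["mol"])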

In [None]:
smiles_column = "smiles"
mols_column = "mol"

In [None]:
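def preprocess(row):
    # Parse the SMILES string; dm.to_mol returns None when parsing fails
    mol = dm.to_mol(row[smiles_column], ordered=True)
    if mol is not None:
        mol = dm.fix_mol(mol)

    # Store the molecule on the row so failed rows can be dropped afterwards
    row[mols_column] = mol
    return row


# Drop the rows whose SMILES could not be parsed
data_clean = BBBP_df.apply(preprocess, axis=1).dropna(subset=[mols_column])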

In [None]:
data_clean

In [None]:
#BBBP_df["mol"] = [dm.to_mol(x) for x in BBBP_df['smiles']]
#BBBP_df["mol"] = [dm.fix_mol(x) for x in BBBP_df['mol']]
#BBBP_df = BBBP_df.dropna()
#BBBP_df["mol"] = [dm.sanitize_mol(x, sanifix=True, charge_neutral=False) for x in BBBP_df['mol']]
#BBBP_df["mol"] = [dm.standardize_mol(x, disconnect_metals=False, normalize=True, reionize=True, uncharge=False, stereo=True) for x in BBBP_df['mol']]



#mol = dm.to_mol(row[smiles_column], ordered=True)



In [None]:
#BBBP_df["standard_smiles"] = [dm.standardize_smiles(x) for x in BBBP_df['smiles']]
#BBBP_df["selfies"] = [dm.to_selfies(x) for x in BBBP_df['mol']]
#BBBP_df["inchi"] = [dm.to_inchi(x) for x in BBBP_df['mol']]
#BBBP_df["inchikey"] = [dm.to_inchikey(x) for x in BBBP_df['mol']]

In [16]:
def preprocess_smiles(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    df["mol"] = [dm.to_mol(x) for x in df["smiles"]]
    df["mol"] = [dm.fix_mol(x) if x is not None else None for x in df["mol"]]

    shape_before = df.shape[0]

    # Drop unparsable rows; .copy() avoids pandas' SettingWithCopyWarning below
    df = df.dropna().copy()

    shape_after = df.shape[0]

    df["mol"] = [dm.sanitize_mol(x, sanifix=True, charge_neutral=False) for x in df["mol"]]
    df["mol"] = [dm.standardize_mol(x, disconnect_metals=False, normalize=True, reionize=True, uncharge=False, stereo=True) for x in df["mol"]]

    df["standard_smiles"] = [dm.standardize_smiles(x) for x in df["smiles"]]
    df["selfies"] = [dm.to_selfies(x) for x in df["mol"]]
    df["inchi"] = [dm.to_inchi(x) for x in df["mol"]]
    df["inchikey"] = [dm.to_inchikey(x) for x in df["mol"]]

    cleaned_data = f"shape prior to cleaning: {shape_before} shape after cleaning: {shape_after}"

    return df, cleaned_data

# The function returns a (DataFrame, summary string) tuple, so unpack both
data_clean, cleaning_summary = preprocess_smiles(BBBP_df)

In [17]:
data_clean

Unnamed: 0,p_np,smiles,mol,standard_smiles,selfies,inchi,inchikey
0,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12,"<img data-content=""rdkit/molecule"" src=""data:i...",CC(C)NCC(O)COc1cccc2ccccc12.[Cl-],[C][C][Branch1][C][C][N][C][C][Branch1][C][O][...,InChI=1S/C16H21NO2.ClH/c1-12(2)17-10-14(18)11-...,ZMRUPTIKESYGQW-UHFFFAOYSA-M
1,1,C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl,"<img data-content=""rdkit/molecule"" src=""data:i...",CC(C)(C)OC(=O)CCCc1ccc(N(CCCl)CCCl)cc1,[C][C][Branch1][C][C][Branch1][C][C][O][C][=Br...,"InChI=1S/C18H27Cl2NO2/c1-18(2,3)23-17(22)6-4-5...",SZXDOYFHSIIZCF-UHFFFAOYSA-N
2,1,c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...,"<img data-content=""rdkit/molecule"" src=""data:i...",CC1COc2c(N3CCN(C)CC3)c(F)cc3c(=O)c(C(=O)O)cn1c23,[C][C][C][O][C][=C][Branch1][N][N][C][C][N][Br...,InChI=1S/C18H20FN3O4/c1-10-9-26-17-14-11(16(23...,GSDSWSVVBLHKDQ-UHFFFAOYSA-N
3,1,C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C,"<img data-content=""rdkit/molecule"" src=""data:i...",CC(=O)NCCCOc1cccc(CN2CCCCC2)c1,[C][C][=Branch1][C][=O][N][C][C][C][O][C][=C][...,InChI=1S/C17H26N2O2/c1-15(20)18-9-6-12-21-17-8...,FAXLXLJWHQJMPK-UHFFFAOYSA-N
4,1,Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...,"<img data-content=""rdkit/molecule"" src=""data:i...",Cc1onc(-c2ccccc2Cl)c1C(=O)N[C@@H]1C(=O)N2[C@@H...,[C][C][O][N][=C][Branch1][#Branch2][C][=C][C][...,InChI=1S/C19H18ClN3O5S/c1-8-11(12(22-28-8)9-6-...,LQOLIRLGBULYKD-JKIFEVAISA-N
...,...,...,...,...,...,...,...
2045,1,C1=C(Cl)C(=C(C2=C1NC(=O)C(N2)=O)[N+](=O)[O-])Cl,"<img data-content=""rdkit/molecule"" src=""data:i...",O=c1[nH]c2cc(Cl)c(Cl)c([N+](=O)[O-])c2[nH]c1=O,[O][=C][NH1][C][=C][C][Branch1][C][Cl][=C][Bra...,InChI=1S/C8H3Cl2N3O4/c9-2-1-3-5(6(4(2)10)13(16...,CHFSOFHQIZKQCR-UHFFFAOYSA-N
2046,1,[C@H]3([N]2C1=C(C(=NC=N1)N)N=C2)[C@@H]([C@@H](...,"<img data-content=""rdkit/molecule"" src=""data:i...",C[S+](CC[C@H](N)C(=O)[O-])C[C@H]1O[C@@H](n2cnc...,[C][S+1][Branch1][N][C][C][C@H1][Branch1][C][N...,InChI=1S/C15H22N6O5S/c1-27(3-2-7(16)15(24)25)4...,MEFKEPWMEQBLKI-AIRLBKTGSA-N
2047,1,[O+]1=N[N](C=C1[N-]C(NC2=CC=CC=C2)=O)C(CC3=CC=...,"<img data-content=""rdkit/molecule"" src=""data:i...",CC(Cc1ccccc1)n1cc([N-]C(=O)Nc2ccccc2)[o+]n1,[C][C][Branch1][#Branch2][C][C][=C][C][=C][C][...,InChI=1S/C18H18N4O2/c1-14(12-15-8-4-2-5-9-15)2...,SAJPRPXALCNNRQ-UHFFFAOYSA-N
2048,1,C1=C(OC)C(=CC2=C1C(=[N+](C(=C2CC)C)[NH-])C3=CC...,"<img data-content=""rdkit/molecule"" src=""data:i...",CCc1c(C)[n+]([NH-])c(-c2ccc(OC)c(OC)c2)c2cc(OC...,[C][C][C][=C][Branch1][C][C][N+1][Branch1][C][...,InChI=1S/C22H26N2O4/c1-7-15-13(2)24(23)22(14-8...,IQLRWLFPGKTMDX-UHFFFAOYSA-N


In [None]:
smiles_column = "smiles"

def _preprocess(row):
    mol = dm.to_mol(row[smiles_column], ordered=True)
    mol = dm.fix_mol(mol)
    mol = dm.sanitize_mol(mol, sanifix=True, charge_neutral=False)
    mol = dm.standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, uncharge=False, stereo=True)

    row["standard_smiles"] = dm.standardize_smiles(dm.to_smiles(mol))
    row["selfies"] = dm.to_selfies(mol)
    row["inchi"] = dm.to_inchi(mol)
    row["inchikey"] = dm.to_inchikey(mol)
    return row

data_clean = BBBP_df.apply(_preprocess, axis=1)    
data_clean.head()

1. Urbaczek, Sascha. A consistent cheminformatics framework for automated virtual screening. Ph.D. thesis, Universität Hamburg, August 2014. URL: http://ediss.sub.uni-hamburg.de/volltexte/2015/7349/; URN: urn:nbn:de:gbv:18-73491