<span style="color:red;">BBBP DATASET:</span>

BBBP stands for Blood-Brain Barrier Penetration. The BBBP dataset pairs the molecular structure of each chemical compound with a label indicating whether it penetrates the blood-brain barrier.
It provides binary permeability labels for just over 2000 compounds.

    The raw CSV file contains the following columns:

    - "name" - Name of the compound
    - "smiles" - SMILES representation of the molecular structure (a text-based notation that encodes the structure of chemical molecules)
    - "p_np" - Binary label for penetration/non-penetration (1: Penetrates, 0: Doesn't Penetrate)





In [1]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import os


In [2]:
data = pd.read_csv("./BBBP.csv")  
class_counts = data.groupby('p_np').size()
print(class_counts)

p_np
0     483
1    1567
dtype: int64


<small><i>
Class 1 (Penetrant / BBB+) is the majority class (~76%)\
Class 0 (Non-penetrant / BBB−) is the minority class (~24%)
</i></small>
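
The class shares can be recomputed directly from the printed counts. This is a minimal sketch that assumes only the two counts shown in the output above; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the groupby output above (label -> number of compounds)
class_counts = pd.Series({0: 483, 1: 1567}, name="p_np")

# Fraction of the dataset that falls in each class
class_share = class_counts / class_counts.sum()
print(class_share.round(3))
```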

<span style="color:red;">MOLECULAR DESCRIPTOR GENERATION:</span>

What Are Molecular Descriptors?\
Molecular descriptors are numerical values that quantitatively describe various chemical and physical properties of molecules. They act as features in machine learning models, enabling algorithms to "understand" molecules by transforming them from complex chemical structures into fixed-size numerical vectors.
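
As a small illustration of the idea, the sketch below turns a single molecule into a short numerical vector with RDKit. The molecule (ethanol, `CCO`) and the four descriptors are arbitrary choices made for this example; they are not taken from the dataset or from the pipeline below.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Ethanol, chosen only for illustration
mol = Chem.MolFromSmiles("CCO")

# Each descriptor maps the molecule to one number
features = {
    "MolWt": Descriptors.MolWt(mol),            # molecular weight
    "TPSA": Descriptors.TPSA(mol),              # topological polar surface area
    "NumHDonors": Descriptors.NumHDonors(mol),  # hydrogen-bond donors
    "MolLogP": Descriptors.MolLogP(mol),        # estimated logP
}

# The values, in a fixed order, form the feature vector for this molecule
vector = list(features.values())
print(features)
```

Every molecule processed this way yields a vector of the same length, which is what lets a standard machine learning model consume it.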

Below I implement two commonly used descriptor calculators: RDKit and Mordred (built on top of RDKit).\
What is done:\
Converting SMILES strings to molecules\
Computing descriptors (RDKit or Mordred)\
Cleaning and scaling the resulting feature set\
Returning the final descriptor matrix (RDKit or Mordred) and saving it to a .csv file
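
The cleaning and scaling step can be checked on a toy table before running it on real descriptors. The sketch below applies the same sequence of operations (drop sparse columns, keep numeric columns, drop constant columns, mean-impute, standardize) to made-up data; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy descriptor table with 10 rows (all values are invented)
toy = pd.DataFrame({
    "varying": np.arange(1.0, 11.0),                        # kept
    "constant": np.full(10, 5.0),                           # dropped: constant
    "sparse": [1.0, 2.0] + [np.nan] * 8,                    # dropped: mostly missing
    "one_gap": [0.5, 1.5, np.nan, 2.0, 3.0, 1.0, 2.5, 0.0, 4.0, 3.5],  # kept, imputed
    "text": list("abcdefghij"),                             # dropped: not numeric
})

# Keep columns with at least 90% non-missing values
kept = toy.dropna(axis=1, thresh=int(0.9 * len(toy)))
# Keep numeric columns only
kept = kept.select_dtypes(include=[np.number])
# Drop columns whose values all equal the first row
kept = kept.loc[:, (kept != kept.iloc[0]).any()]

# Fill remaining gaps with the column mean, then z-score each column
imputed = SimpleImputer(strategy="mean").fit_transform(kept)
scaled = StandardScaler().fit_transform(imputed)
cleaned = pd.DataFrame(scaled, columns=kept.columns)

print(list(cleaned.columns))
```

After scaling, each remaining column has zero mean and unit (population) standard deviation.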

In [3]:
import pandas as pd
import numpy as np
from rdkit import Chem
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from mordred import Calculator, descriptors as mordred_descriptors
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')


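def smiles_to_mols(smiles_list):
    """Parse SMILES strings; return the molecules and the positions that parsed."""
    mols = []
    valid_idx = []
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        # MolFromSmiles returns None when the SMILES cannot be parsed
        if mol is not None:
            mols.append(mol)
            valid_idx.append(i)
    return mols, valid_idx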

def compute_mordred(mols):
    calc = Calculator(mordred_descriptors, ignore_3D=True)
    return calc.pandas(mols)


def compute_rdkit(mols):
    desc_names = [d[0] for d in Descriptors._descList]
    calc = MoleculeDescriptors.MolecularDescriptorCalculator(desc_names)
    
    desc_data = []
    for mol in mols:
        mol = Chem.AddHs(mol)
        desc = calc.CalcDescriptors(mol)
        desc_data.append(desc)
    
    return pd.DataFrame(desc_data, columns=desc_names)

def clean_scale(df):
    """Remove columns with more than 10% missing values
      Keep only numeric columns
      Remove constant columns
      Impute missing values with column means
      Standardize (z-score normalization) all features"""
    
    df = df.dropna(axis=1, thresh=int(0.9 * len(df)))
    df = df.select_dtypes(include=[np.number])
    df = df.loc[:, (df != df.iloc[0]).any()]
    
    imputer = SimpleImputer(strategy='mean')
    scaler = StandardScaler()
    imputed = imputer.fit_transform(df)
    scaled = scaler.fit_transform(imputed)
    
    return pd.DataFrame(scaled, columns=df.columns)

def prepare_descriptors(data, name_col=None, smiles_col='smiles', target_col=None, method=None):
    """Build a cleaned, scaled descriptor table; method is 'rdkit' or 'mordred'."""
    mols, valid_idx = smiles_to_mols(data[smiles_col])    
    if method == 'mordred':
        desc_df = compute_mordred(mols)
    elif method == 'rdkit':
        desc_df = compute_rdkit(mols)
    else:
        raise ValueError("Invalid method: choose 'rdkit' or 'mordred'")
    desc_df = clean_scale(desc_df)

 
    if name_col:
        name_series = data.iloc[valid_idx][name_col].reset_index(drop=True)
        desc_df.insert(0, name_col, name_series)
    if target_col:
        target_series = data.iloc[valid_idx][target_col].reset_index(drop=True)
        desc_df[target_col] = target_series

    return desc_df


In [4]:
df_rdkit = prepare_descriptors(data, name_col="name", smiles_col="smiles", target_col="p_np", method="rdkit")
df_rdkit.to_csv("BBBP_rdkit_descriptors.csv", index=False)
df_mordred = prepare_descriptors(data, name_col="name", smiles_col="smiles", target_col="p_np", method="mordred")
df_mordred.to_csv("BBBP_mordred_descriptors.csv", index=False)

100%|██████████| 2039/2039 [01:06<00:00, 30.51it/s]
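
The progress bar counts 2039 molecules, while the class counts printed earlier sum to 2050 rows. One plausible reading, which the output itself does not confirm, is that the remaining rows were skipped because their SMILES could not be parsed. The sketch below shows that mechanism on three made-up rows, one of which has a deliberately broken SMILES (`C1CC`, an unclosed ring), and how the labels stay aligned with the molecules that survive.

```python
import pandas as pd
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence the parse-error message

# Hypothetical rows; the second SMILES is intentionally invalid
toy = pd.DataFrame({
    "smiles": ["CCO", "C1CC", "c1ccccc1"],
    "p_np": [1, 0, 1],
})

mols, valid_idx = [], []
for i, smi in enumerate(toy["smiles"]):
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:  # None signals an unparsable SMILES
        mols.append(mol)
        valid_idx.append(i)

# Select labels by position so they line up with the parsed molecules
labels = toy.iloc[valid_idx]["p_np"].reset_index(drop=True)
print(len(toy), "rows ->", len(mols), "molecules")
```

Comparing `len(data)` with the number of rows in the saved descriptor files would show how many compounds were dropped in the real run.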
