<center><h2>Data Preprocessing</h2></center>

This notebook presents the data preprocessing steps for the test dataset used in compound–protein interaction (CPI) prediction for Alzheimer’s disease targets. The dataset was curated from Philippine articles and contains 1,121 ligand SMILES strings for natural products from Philippine medicinal plants.

## Library Details

Start by importing the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

## Data Cleaning

This section loads the dataset, previews it, and checks it for missing values.

In [2]:
# Load the dataset
NP_Data = pd.read_csv('Downloads/NP DATASET.csv')

In [3]:
NP_Data.head()

Unnamed: 0,Organism Name,Compound Name,SMILES,Source
0,Momordica Charantia,Quercetin,Oc1cc(O)c2C(=O)C(=C(Oc2c1)c3ccc(O)c(O)c3)O,NPASS
1,Momordica Charantia,Niacin,OC(=O)c1cccnc1,NPASS
2,Momordica Charantia,Retinol,CC(=C\CO)/C=C/C=C(C)/C=C/C1=C(C)CCCC1(C)C,NPASS
3,Momordica Charantia,Ascorbate,OC[C@H](O)[C@H]1OC(=O)[C-](O)C1=O,NPASS
4,Momordica Charantia,Vitamin E,CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC[C@]1(C)CCc2c(...,NPASS


In [4]:
NP_Data.isnull().sum()

Organism Name    0
Compound Name    0
SMILES           0
Source           0
dtype: int64

In [5]:
NP_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1121 entries, 0 to 1120
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Organism Name  1121 non-null   object
 1   Compound Name  1121 non-null   object
 2   SMILES         1121 non-null   object
 3   Source         1121 non-null   object
dtypes: object(4)
memory usage: 35.2+ KB


### <center>Feature Extraction</center>

In [6]:
import re
from rdkit import Chem
from rdkit.Chem import AllChem
import networkx as nx

In [7]:
def smiles_tokenizer(smiles):
    """
    Tokenize SMILES string using regex pattern.
    
    Args:
        smiles (str): SMILES string to tokenize
        
    Returns:
        list: List of tokens
    """
    pattern = r'(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
    tokens = re.findall(pattern, smiles)
    return tokens
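
As a quick check of the tokenizer, the sketch below reuses the same pattern on the niacin SMILES from the dataset and adds a round-trip test. The names `tokenize` and `is_lossless` are hypothetical and not part of the original notebook. The check relies on the fact that `re.findall` silently skips characters the pattern does not cover, so joining the tokens reproduces the input only when nothing was dropped.

```python
import re

# Same pattern as smiles_tokenizer above
SMILES_PATTERN = r'(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'

def tokenize(smiles):
    """Split a SMILES string into tokens with the regex above."""
    return re.findall(SMILES_PATTERN, smiles)

def is_lossless(smiles):
    """Hypothetical helper: True if no character was skipped during tokenization."""
    return "".join(tokenize(smiles)) == smiles

niacin_tokens = tokenize("OC(=O)c1cccnc1")
print(niacin_tokens)
print(is_lossless("OC(=O)c1cccnc1"))  # every character is covered by the pattern
print(is_lossless("CCO "))            # the trailing space is not covered, so it is dropped
```

A check like this could be run over the `SMILES` column before tokenizing to flag strings that contain stray whitespace or unsupported characters.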

In [8]:
def smiles_to_molecular_graph(smiles):
    """
    Convert SMILES string to molecular graph representation.
    
    Args:
        smiles (str): SMILES string
        
    Returns:
        dict: Dictionary containing:
            - 'mol': RDKit molecule object
            - 'graph': NetworkX graph
            - 'node_features': Dictionary of node features (atomic number, degree, etc.)
            - 'edge_features': Dictionary of edge features (bond types)
            - 'adjacency_matrix': Numpy adjacency matrix
    """
    # Parse SMILES to RDKit molecule
    mol = Chem.MolFromSmiles(smiles)
    
    if mol is None:
        raise ValueError(f"Invalid SMILES string: {smiles}")
    
    # Compute 2D coordinates (useful for depiction; the graph features below do not depend on them)
    AllChem.Compute2DCoords(mol)
    
    # NetworkX graph
    G = nx.Graph()
    
    # Nodes
    node_features = {}
    for atom in mol.GetAtoms():
        idx = atom.GetIdx()
        G.add_node(idx)
        node_features[idx] = {
            'atomic_num': atom.GetAtomicNum(),
            'symbol': atom.GetSymbol(),
            'degree': atom.GetDegree(),
            'formal_charge': atom.GetFormalCharge(),
            'hybridization': str(atom.GetHybridization()),
            'is_aromatic': atom.GetIsAromatic(),
            'num_hydrogens': atom.GetTotalNumHs()
        }
    
    # Edges
    edge_features = {}
    for bond in mol.GetBonds():
        start = bond.GetBeginAtomIdx()
        end = bond.GetEndAtomIdx()
        G.add_edge(start, end)
        edge_features[(start, end)] = {
            'bond_type': str(bond.GetBondType()),
            'is_conjugated': bond.GetIsConjugated(),
            'is_aromatic': bond.GetIsAromatic()
        }
        # Reverse edge for undirected graph
        edge_features[(end, start)] = edge_features[(start, end)]
    
    # Adjacency matrix
    n_atoms = mol.GetNumAtoms()
    adj_matrix = np.zeros((n_atoms, n_atoms))
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()
        j = bond.GetEndAtomIdx()
        adj_matrix[i, j] = 1
        adj_matrix[j, i] = 1
    
    return {
        'mol': mol,
        'graph': G,
        'node_features': node_features,
        'edge_features': edge_features,
        'adjacency_matrix': adj_matrix
    }
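
To make the adjacency-matrix step concrete without RDKit, here is a minimal sketch for ethanol (SMILES `CCO`), whose heavy atoms are C0, C1 and O2 with bonds C0-C1 and C1-O2. The bond list is written by hand as an assumption standing in for `mol.GetBonds()`, and plain nested lists stand in for the NumPy array.

```python
# Hand-written bond list for ethanol (CCO); stands in for mol.GetBonds()
n_atoms = 3
bonds = [(0, 1), (1, 2)]

# Fill both (i, j) and (j, i) so the matrix is symmetric (undirected graph)
adj = [[0] * n_atoms for _ in range(n_atoms)]
for i, j in bonds:
    adj[i][j] = 1
    adj[j][i] = 1

# Row sums give the number of bonded heavy-atom neighbours per atom
degrees = [sum(row) for row in adj]
is_symmetric = all(
    adj[i][j] == adj[j][i] for i in range(n_atoms) for j in range(n_atoms)
)

print(adj)
print(degrees)
print(is_symmetric)
```

For molecules parsed from SMILES with implicit hydrogens, these row sums should match the `degree` entries stored in `node_features`.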

In [9]:
def safe_smiles_to_graph(smiles):
    """
    Safely convert SMILES to molecular graph with error handling.
    
    Args:
        smiles (str): SMILES string
        
    Returns:
        dict or None: Molecular graph dictionary if successful, None if error
    """
    try:
        return smiles_to_molecular_graph(smiles)
    except Exception as e:
        # str() guards against non-string inputs (e.g. NaN) when truncating for the log
        print(f"Error processing SMILES: {str(smiles)[:50]}... - {e}")
        return None
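
Printing errors is enough for a quick look, but keeping the failed inputs makes them easier to review later. The sketch below is one possible variation, not the notebook's method: `collect_failures` and `toy_parser` are hypothetical names, and `toy_parser` is only a stand-in for `smiles_to_molecular_graph` that rejects SMILES containing `[?]` placeholder atoms.

```python
def collect_failures(func, inputs):
    """Apply func to each input; return (results, failures), with None for failed items."""
    results, failures = [], []
    for index, value in enumerate(inputs):
        try:
            results.append(func(value))
        except Exception as exc:
            results.append(None)
            failures.append({"index": index, "input": value, "error": str(exc)})
    return results, failures

def toy_parser(smiles):
    """Stand-in parser (assumption): rejects SMILES containing '[?]' placeholders."""
    if "[?]" in smiles:
        raise ValueError(f"Invalid SMILES string: {smiles}")
    return {"smiles": smiles}

results, failures = collect_failures(toy_parser, ["OC(=O)c1cccnc1", "O(C(=O)[?])[?]"])
print(len(failures), failures[0]["index"])
```

The `failures` list can then be turned into a table and inspected or exported alongside the cleaned data.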

In [10]:
from tqdm import tqdm
tqdm.pandas()

NP_Data['SMILES Tokens'] = NP_Data['SMILES'].progress_apply(smiles_tokenizer)
NP_Data['Molecular Graph'] = NP_Data['SMILES'].progress_apply(safe_smiles_to_graph)

100%|██████████| 1121/1121 [00:00<00:00, 21021.03it/s]
[03:22:20] Explicit valence for atom # 43 Cl, 1, is greater than permitted
 22%|██▏       | 246/1121 [00:00<00:02, 337.30it/s]

Error processing SMILES: C[C@@H]1O[C@@H](OC[C@H]2O[C@@H](OC3=C(Oc4cc(O)cc(O... - Invalid SMILES string: C[C@@H]1O[C@@H](OC[C@H]2O[C@@H](OC3=C(Oc4cc(O)cc(O)c4C3=O)c5ccc(O)c(O)c5)[C@H](O)[C@@H](O)[C@@H]2O)[C@H](O)[C@H](O)[C@H]1O[Cl-].OC[C@H]1O[C@@H](Oc2cc3c(O)cc(O)cc3[o+]c2c4ccc(O)cc4)[C@H](O)[C@@H](O)[C@@H]1O


[03:22:21] SMILES Parse Error: syntax error while parsing: O(C(O[?])([?])[?])[?]
[03:22:21] SMILES Parse Error: check for mistakes around position 7:
[03:22:21] O(C(O[?])([?])[?])[?]
[03:22:21] ~~~~~~^
[03:22:21] SMILES Parse Error: extra open parentheses while parsing: O(C(O[?])([?])[?])[?]
[03:22:21] SMILES Parse Error: check for mistakes around position 2:
[03:22:21] O(C(O[?])([?])[?])[?]
[03:22:21] ~^
[03:22:21] SMILES Parse Error: extra open parentheses while parsing: O(C(O[?])([?])[?])[?]
[03:22:21] SMILES Parse Error: check for mistakes around position 4:
[03:22:21] O(C(O[?])([?])[?])[?]
[03:22:21] ~~~^
[03:22:21] SMILES Parse Error: Failed parsing SMILES 'O(C(O[?])([?])[?])[?]' for input: 'O(C(O[?])([?])[?])[?]'
[03:22:21] SMILES Parse Error: syntax error while parsing: [O]C(=O)[?][?][?][?][?][?][?][?][?][?][?][?][?][?][?][?][?]
[03:22:21] SMILES Parse Error: check for mistakes around position 10:
[03:22:21] [O]C(=O)[?][?][?][?][?][?][?][?][?][?][?]
[03:22:21] ~~~~~~~~~^
[03:22

Error processing SMILES: O(C(O[?])([?])[?])[?]... - Invalid SMILES string: O(C(O[?])([?])[?])[?]
Error processing SMILES: [O]C(=O)[?][?][?][?][?][?][?][?][?][?][?][?][?][?]... - Invalid SMILES string: [O]C(=O)[?][?][?][?][?][?][?][?][?][?][?][?][?][?][?][?][?]
Error processing SMILES: [O]C(=O)[?]... - Invalid SMILES string: [O]C(=O)[?]
Error processing SMILES: O([C]=O)[?]... - Invalid SMILES string: O([C]=O)[?]
Error processing SMILES: O(C(=O)[?])[?]... - Invalid SMILES string: O(C(=O)[?])[?]
Error processing SMILES: O([C](O[?])[?])[?]... - Invalid SMILES string: O([C](O[?])[?])[?]
Error processing SMILES: [O]C(=O)[?][?]... - Invalid SMILES string: [O]C(=O)[?][?]


100%|██████████| 1121/1121 [00:03<00:00, 326.18it/s]


In [11]:
NP_Data.head()

Unnamed: 0,Organism Name,Compound Name,SMILES,Source,SMILES Tokens,Molecular Graph
0,Momordica Charantia,Quercetin,Oc1cc(O)c2C(=O)C(=C(Oc2c1)c3ccc(O)c(O)c3)O,NPASS,"[O, c, 1, c, c, (, O, ), c, 2, C, (, =, O, ), ...",{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...
1,Momordica Charantia,Niacin,OC(=O)c1cccnc1,NPASS,"[O, C, (, =, O, ), c, 1, c, c, c, n, c, 1]",{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...
2,Momordica Charantia,Retinol,CC(=C\CO)/C=C/C=C(C)/C=C/C1=C(C)CCCC1(C)C,NPASS,"[C, C, (, =, C, \, C, O, ), /, C, =, C, /, C, ...",{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...
3,Momordica Charantia,Ascorbate,OC[C@H](O)[C@H]1OC(=O)[C-](O)C1=O,NPASS,"[O, C, [C@H], (, O, ), [C@H], 1, O, C, (, =, O...",{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...
4,Momordica Charantia,Vitamin E,CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC[C@]1(C)CCc2c(...,NPASS,"[C, C, (, C, ), C, C, C, [C@@H], (, C, ), C, C...",{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...


In [12]:
# Check how many failed
print(f"Failed to process: {NP_Data['Molecular Graph'].isna().sum()} rows")

# Drop the rows where the graph could not be created
NP_Data = NP_Data.dropna(subset=['Molecular Graph'])

# Reset the index so it is contiguous after dropping rows
NP_Data = NP_Data.reset_index(drop=True)

# Verify the final clean shape
print(f"Final clean dataset size: {NP_Data.shape}")

Failed to process: 8 rows
Final clean dataset size: (1113, 6)


In [13]:
NP_Data.to_csv("NP_Data.csv", index=False)
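
Note that `to_csv` writes object columns as their string representation, so the `Molecular Graph` dictionaries (including the RDKit `Mol` object) cannot be rebuilt from this CSV. If the graph features need to survive a round trip, one option is to serialize the plain-Python parts separately. The sketch below is an assumption about how that could look, using hypothetical helpers `graph_to_json` and `graph_from_json`; tuple edge keys are converted to `"i-j"` strings because JSON object keys must be strings.

```python
import json

def graph_to_json(node_features, edge_features):
    """Serialize node and edge feature dicts (shaped like those built above) to JSON."""
    payload = {
        "nodes": {str(idx): feats for idx, feats in node_features.items()},
        "edges": {f"{i}-{j}": feats for (i, j), feats in edge_features.items()},
    }
    return json.dumps(payload)

def graph_from_json(text):
    """Rebuild both dictionaries, restoring int node ids and tuple edge keys."""
    payload = json.loads(text)
    nodes = {int(idx): feats for idx, feats in payload["nodes"].items()}
    edges = {
        tuple(int(part) for part in key.split("-")): feats
        for key, feats in payload["edges"].items()
    }
    return nodes, edges

# Tiny hand-written example with two atoms and one bond (illustrative values only)
nodes = {0: {"symbol": "C", "is_aromatic": False}, 1: {"symbol": "O", "is_aromatic": False}}
bond = {"bond_type": "SINGLE", "is_aromatic": False}
edges = {(0, 1): bond, (1, 0): bond}

restored_nodes, restored_edges = graph_from_json(graph_to_json(nodes, edges))
print(restored_nodes == nodes, restored_edges == edges)
```

The RDKit molecule itself does not need to be stored, since it can be re-created from the `SMILES` column with `Chem.MolFromSmiles`.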