## Notebook 1: Data Exploration and Preprocessing

This notebook loads the DILIrank dataset, cleans it, resolves a molecular structure (SMILES string) for each drug, and generates numerical features (Morgan fingerprints) for our machine learning model.

### Setup

In [None]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import RDLogger 
import cirpy 
from tqdm.notebook import tqdm
import time
import os

tqdm.pandas()

print("Libraries imported successfully.")

Libraries imported successfully.


### Load the Dataset

In [2]:
try:
    df = pd.read_csv('../data/raw/DILIrank_dataset.csv')
    print("Dataset loaded successfully.")
    print(f"Shape of the dataset: {df.shape}")
except FileNotFoundError:
    print("Error: DILIrank_dataset.csv not found.")
    print("Please make sure you have saved the dataset as a CSV in the 'data/raw/' directory.")
    raise  # Stop here: every later cell needs `df` to exist

Dataset loaded successfully.
Shape of the dataset: (1036, 6)


### Initial Exploration

In [3]:
print("First 5 rows of the dataset:")
df.head()

# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

# Check the distribution of our target variable, 'vDILIConcern'
print("\nDistribution of DILI Concern classes:")
print(df['vDILIConcern'].value_counts())

First 5 rows of the dataset:



Missing values in each column:
LTKBID            0
Compound Name     0
Severity Class    0
Label Section     0
vDILIConcern      0
Version           0
dtype: int64

Distribution of DILI Concern classes:
vDILIConcern
No-DILI-Concern           312
Less-DILI-Concern         278
Ambiguous DILI-concern    254
Most-DILI-Concern         192
Name: count, dtype: int64


### Data Cleaning and Preprocessing

For our baseline model, we will simplify the problem from four classes to a binary classification: **DILI Concern (1)** vs. **No DILI Concern (0)**. Binarizing graded labels is a common way to set up a baseline. The code below assigns 0 to "No-DILI-Concern" and 1 to every other label, so the positive class combines the "Most", "Less", and "Ambiguous" classes.


In [4]:
# Create the binary target column 'dili_concern'
df['dili_concern'] = df['vDILIConcern'].apply(lambda x: 0 if str(x).strip() == 'No-DILI-Concern' else 1)

# Verify the new distribution
print("Distribution of the new binary target 'dili_concern':")
print(df['dili_concern'].value_counts())

# Clean up compound names
df['Compound Name'] = df['Compound Name'].str.strip()

df_clean = df[['Compound Name', 'vDILIConcern', 'dili_concern']].copy()
df_clean.head()

Distribution of the new binary target 'dili_concern':
dili_concern
1    724
0    312
Name: count, dtype: int64


| | Compound Name | vDILIConcern | dili_concern |
|---|---|---|---|
| 0 | mercaptopurine | Most-DILI-Concern | 1 |
| 1 | acetaminophen | Most-DILI-Concern | 1 |
| 2 | azathioprine | Most-DILI-Concern | 1 |
| 3 | chlorpheniramine | No-DILI-Concern | 0 |
| 4 | clofibrate | Less-DILI-Concern | 1 |
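
The value counts above list four classes, so the handling of the ambiguous class changes the size of the positive class noticeably. The following is a minimal sketch, using only the counts printed above, that makes the label mapping explicit. `NOTEBOOK_MAPPING` mirrors the lambda in the cell above. `STRICT_MAPPING`, which leaves the ambiguous class out, is a hypothetical alternative shown for comparison and is not what this notebook does.

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
class_counts = Counter({
    "No-DILI-Concern": 312,
    "Less-DILI-Concern": 278,
    "Ambiguous DILI-concern": 254,
    "Most-DILI-Concern": 192,
})

# Mirrors the notebook's rule: everything except "No-DILI-Concern" becomes 1.
NOTEBOOK_MAPPING = {
    "No-DILI-Concern": 0,
    "Less-DILI-Concern": 1,
    "Ambiguous DILI-concern": 1,
    "Most-DILI-Concern": 1,
}

# Hypothetical alternative (an assumption for illustration):
# None marks a class that would be excluded from the binary task.
STRICT_MAPPING = {**NOTEBOOK_MAPPING, "Ambiguous DILI-concern": None}


def to_binary(label, mapping):
    """Return 0 or 1 for a known class, None for an excluded one; raise on unknown labels."""
    key = str(label).strip()
    if key not in mapping:
        raise KeyError(f"Unexpected class label: {key!r}")
    return mapping[key]


def binary_counts(counts, mapping):
    """Sum the per-class counts under a given binary mapping."""
    totals = Counter()
    for label, n in counts.items():
        totals[to_binary(label, mapping)] += n
    return totals


notebook = binary_counts(class_counts, NOTEBOOK_MAPPING)
strict = binary_counts(class_counts, STRICT_MAPPING)
print("notebook rule:", notebook[1], "positive,", notebook[0], "negative")
print("strict rule:  ", strict[1], "positive,", strict[0], "negative,", strict[None], "excluded")
```

Unlike the `else 1` branch of a lambda, an explicit dictionary raises on a misspelled or unexpected label instead of silently counting it as positive.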


### Fetching Molecular Structures (SMILES)

Our model can't understand "Aspirin". It needs a machine-readable representation of the molecule. We'll use the `cirpy` library, a Python interface to the NCI/CADD Chemical Identifier Resolver (CIR), to look up each drug name and retrieve its **SMILES string**, a standard text format for chemical structures.


In [5]:
def get_smiles_from_name_cir(compound_name):
    """
    Resolves a compound name to a SMILES string using the cirpy library.
    """
    try:
        # The 'smiles' representation is what we want
        smiles = cirpy.resolve(compound_name, 'smiles')
        if smiles is None:
            print(f"'{compound_name}' not found by cirpy.")
        return smiles
    except Exception as e:
        print(f"An error occurred for '{compound_name}': {e}")
        return None

print("--- Running Sanity Check with cirpy ---")
test_smiles = get_smiles_from_name_cir("Aspirin")
if test_smiles:
    print(f"Sanity Check PASSED. Found Aspirin: {test_smiles}\n")
else:
    print("Sanity Check FAILED. Could not retrieve SMILES for Aspirin.")
    print("This strongly indicates a persistent network/firewall issue.\n")

print("Fetching SMILES strings using cirpy... (This may take a few minutes)")
df_clean['smiles'] = df_clean['Compound Name'].progress_apply(get_smiles_from_name_cir)

print("\nFinished fetching SMILES.")

--- Running Sanity Check with cirpy ---
Sanity Check PASSED. Found Aspirin: CC(=O)Oc1ccccc1C(O)=O

Fetching SMILES strings using cirpy... (This may take a few minutes)



'rifampin' not found by cirpy.
'carboplatin' not found by cirpy.
'corticotropin' not found by cirpy.
'cholestyramine' not found by cirpy.
'isorbide mononitrate' not found by cirpy.
'sucralfate' not found by cirpy.
'cupric chloride' not found by cirpy.
'minocycline' not found by cirpy.
'prazosin' not found by cirpy.
'corticorelin ovine triflutate' not found by cirpy.
'sirolimus' not found by cirpy.
'sacrosidase' not found by cirpy.
'nesiritide' not found by cirpy.
'gemtuzumab ozogamicin' not found by cirpy.
'trospium' not found by cirpy.
'exenatide' not found by cirpy.
'lisdexamfetamine' not found by cirpy.
'fesoterodine' not found by cirpy.
'levocetirizine dihydrochloride' not found by cirpy.
'nilotinib' not found by cirpy.
'ferumoxytol' not found by cirpy.
'degarelix' not found by cirpy.
'fospropofol' not found by cirpy.
'pazopanib' not found by cirpy.
'pancrelipase' not found by cirpy.
'ecallantide' not found by cirpy.
'dermatan' not found by cirpy.
'pafuraidine' not found by cirpy.
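
Because each lookup is a network request, a transient failure is reported the same way as a name that is genuinely unknown. The sketch below shows one possible way to wrap a name-to-SMILES function with retries and an in-memory cache. `make_cached_resolver` and `flaky_lookup` are hypothetical names introduced here, and the fake lookup stands in for the real service so the example runs offline.

```python
import time


def make_cached_resolver(resolve, retries=3, delay=1.0, sleep=time.sleep):
    """Wrap a name -> SMILES callable with retries and an in-memory cache.

    `resolve` is any callable that returns a SMILES string, or None when
    the name is unknown. Exceptions are treated as transient failures.
    """
    cache = {}

    def cached(name):
        key = name.strip().lower()
        if key in cache:
            return cache[key]
        result = None
        for attempt in range(retries):
            try:
                result = resolve(name)
                break  # A None result means "not found", so it is not retried
            except Exception:
                # Transient failure (e.g. a timeout): back off, then retry.
                if attempt < retries - 1:
                    sleep(delay * (attempt + 1))
        # If every attempt failed, the name is recorded as unresolved (None).
        cache[key] = result
        return result

    return cached


# Offline stand-in for the real lookup: fails once, then answers.
calls = []


def flaky_lookup(name):
    calls.append(name)
    if len(calls) == 1:
        raise ConnectionError("simulated timeout")
    return {"aspirin": "CC(=O)Oc1ccccc1C(O)=O"}.get(name.lower())


resolver = make_cached_resolver(flaky_lookup, sleep=lambda seconds: None)
print(resolver("Aspirin"))  # first attempt fails, second succeeds
print(resolver("aspirin"))  # served from the cache, no new lookup
print(len(calls))           # number of underlying lookups performed
```

In this notebook the wrapped callable would presumably be `lambda name: cirpy.resolve(name, 'smiles')`, with the result passed to `progress_apply` in place of `get_smiles_from_name_cir`.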


Check whether any SMILES strings could not be resolved, then drop those rows.

In [6]:
missing_smiles = df_clean['smiles'].isnull().sum()
print(f"\nNumber of compounds where SMILES could not be found: {missing_smiles}")

if missing_smiles > 0:
    print("\nSome compounds with missing SMILES (first 10):")
    print(df_clean[df_clean['smiles'].isnull()]['Compound Name'].head(10).tolist())

# Drop the rows where we couldn't find a structure
df_final = df_clean.dropna(subset=['smiles']).copy()
print(f"\nShape of the dataset after dropping missing SMILES: {df_final.shape}")


Number of compounds where SMILES could not be found: 126

Some compounds with missing SMILES (first 10):
['rifampin', 'carboplatin', 'corticotropin', 'cholestyramine', 'isorbide mononitrate', 'sucralfate', 'cupric chloride', 'minocycline', 'prazosin', 'corticorelin ovine triflutate']

Shape of the dataset after dropping missing SMILES: (910, 4)


### Feature Engineering - Molecular Fingerprints

Now that we have the SMILES strings, we can use `RDKit` to convert them into **Morgan fingerprints**. A fingerprint is a numerical vector (a list of 0s and 1s) in which each position records the presence or absence of particular small chemical substructures within the molecule. Here we use a radius of 2 and a length of 1,024 bits. This is the numerical input our XGBoost model will learn from.
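
To make the "list of 0s and 1s" concrete, here is a toy sketch of how integer substructure identifiers can be folded into a fixed-length bit vector. It illustrates the folding idea only and is not RDKit's implementation: the identifiers below are made up, whereas RDKit derives them by hashing each atom's environment up to the chosen radius. `fold_to_bits` is a hypothetical helper, not an RDKit function.

```python
def fold_to_bits(feature_ids, n_bits=1024):
    """Fold integer substructure identifiers into a fixed-length 0/1 list."""
    bits = [0] * n_bits
    for feature_id in feature_ids:
        # Two different identifiers can land on the same position (a collision).
        bits[feature_id % n_bits] = 1
    return bits


# Made-up identifiers standing in for hashed atom environments.
toy_ids = [5, 1029, 300, 2000]

fp = fold_to_bits(toy_ids, n_bits=1024)
on_bits = [i for i, bit in enumerate(fp) if bit]
print(len(fp))   # vector length is fixed by n_bits
print(on_bits)   # 5 and 1029 share a position, so only three bits are set
```

A longer vector lowers the chance of such collisions at the cost of more features for the model.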


In [None]:
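def generate_fingerprint(smiles):
    """
    Generates a Morgan fingerprint (radius 2, 1024 bits) from a SMILES string.
    Returns a list of 0/1 integers, or None if RDKit cannot parse the SMILES.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Second argument is the radius; nBits sets the length of the bit vector
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    return list(fp)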

# Temporarily disable RDKit warnings
lg = RDLogger.logger()
lg.setLevel(RDLogger.CRITICAL)

print("\nGenerating Morgan Fingerprints...")
if not df_final.empty:
    df_final['fingerprint'] = df_final['smiles'].progress_apply(generate_fingerprint)
    print("Finished generating fingerprints.")
else:
    print("Skipping fingerprint generation as no SMILES were found.")

# Re-enable RDKit warnings
lg.setLevel(RDLogger.INFO)


Generating Morgan Fingerprints...



Finished generating fingerprints.


### Save the Processed Data

In [None]:
output_dir = '../data/processed'
output_path = os.path.join(output_dir, 'dili_data_clean.csv')

# Save the processed data only if it's not empty
if not df_final.empty:
    # Create the directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Save the dataframe to csv (each 'fingerprint' list is written as text, e.g. "[0, 1, ...]")
    df_final.to_csv(output_path, index=False)
    print(f"\nProcessed data saved to '{output_path}'")
else:
    print("\nNo data to save.")


Processed data saved to '../data/processed/dili_data_clean.csv'
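
One consequence of saving to CSV is that each fingerprint list is stored as its text form and comes back as a string when the file is reloaded. The sketch below uses only the standard library to show that round trip on a made-up five-bit fingerprint, along with a more compact bit-string encoding. The sample row and the helper names `bits_to_string` and `string_to_bits` are assumptions for illustration.

```python
import ast
import csv
import io

fingerprint = [0, 1, 0, 0, 1]  # made-up, shortened fingerprint

# Write one row to an in-memory CSV; the list is stored via its text form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["smiles", "fingerprint"])
writer.writerow(["CCO", fingerprint])

# Read it back: the cell is now a string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["fingerprint"]
restored = ast.literal_eval(raw)  # safely parse the text back into a list
print(type(raw).__name__, raw)
print(restored == fingerprint)


def bits_to_string(bits):
    """Encode a 0/1 list as a compact string such as '01001'."""
    return "".join(str(bit) for bit in bits)


def string_to_bits(text):
    """Decode a string such as '01001' back into a 0/1 list."""
    return [int(char) for char in text]


compact = bits_to_string(fingerprint)
print(compact, string_to_bits(compact) == fingerprint)
```

When reloading with pandas, the parsing step could be applied at read time, for example `pd.read_csv(output_path, converters={'fingerprint': ast.literal_eval})`.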
