# Dataset Preprocessing

In [7]:
import numpy as np
import pandas as pd
from rdkit import (Chem, RDLogger)
from rdkit.Chem import Descriptors
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit.ML.Descriptors import MoleculeDescriptors as md

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    roc_auc_score, confusion_matrix, classification_report, roc_curve
)
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

## Molecule, Descriptor, and Outlier Utility Functions

### `molecule_from_smiles(smiles)`
Converts a SMILES string into a cleaned **RDKit molecule object**, while temporarily silencing RDKit logs to avoid console spam.

**Process:**
1. Parse the SMILES into an RDKit molecule (`Chem.MolFromSmiles`).
2. Remove salts and keep the **largest fragment** using `LargestFragmentChooser`.
3. Re-sanitize the molecule to ensure validity.
4. Mute RDKit logging during processing, then set it back to `INFO` afterward.

**Returns:**
- `(molecule, status)`  
  - `molecule`: RDKit molecule object or `None`  
  - `status`: `"succeed"`, `"failed"`, or `"error: <message>"`
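
As a rough illustration of steps 1 and 2, the sketch below parses a two-fragment SMILES and keeps the largest fragment. The sodium acetate input and the variable names are assumptions chosen for this example and are not taken from the dataset; only `Chem.MolFromSmiles` and `LargestFragmentChooser` come from the description above.

```python
from rdkit import Chem, RDLogger
from rdkit.Chem.MolStandardize import rdMolStandardize

# Silence RDKit messages for this demo (parse errors would otherwise be logged).
RDLogger.logger().setLevel(RDLogger.CRITICAL)

# Hypothetical example input: sodium acetate written as two fragments.
salt = Chem.MolFromSmiles("CC(=O)[O-].[Na+]")

# Keep only the largest fragment, which drops the sodium counter-ion.
chooser = rdMolStandardize.LargestFragmentChooser()
parent = chooser.choose(salt)

salt_atoms = salt.GetNumAtoms()      # heavy atoms before stripping
parent_atoms = parent.GetNumAtoms()  # heavy atoms after stripping

# An unparsable string yields None, which the function reports as "failed".
invalid = Chem.MolFromSmiles("not_a_smiles")

print(salt_atoms, parent_atoms, invalid)
```

The heavy-atom count drops by one because the lone sodium ion is discarded, while the unparsable string comes back as `None` rather than raising.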

---

### `calculate_descriptors(molecule)`
Calculates all available **1D and 2D molecular descriptors** using RDKit’s built-in descriptor list.

**Steps:**
1. Collect all descriptor names from `Descriptors._descList`.  
2. Use `MolecularDescriptorCalculator` to compute their values for the molecule.  
3. Return as a dictionary mapping *descriptor name → value*.

**Returns:**
- `dict`: `{ descriptor_name: value }`

---

### `outliers_iqr(df, factor=1.5)`
Applies the **Interquartile Range (IQR)** rule to cap extreme numeric values.

**Process:**
- Compute Q1 (25%) and Q3 (75%) for each column.  
- Define bounds: `[Q1 − factor×IQR, Q3 + factor×IQR]` (`factor` defaults to 1.5).  
- Values beyond these limits are **clipped** to the nearest boundary.  
- Columns with zero IQR (flat values) are skipped.

**Purpose:**
Limits the influence of outliers **without deleting rows**, which preserves the dataset structure and helps stabilize machine learning models.
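
The clipping rule can be checked by hand on a tiny column. The sketch below uses made-up values with a single extreme entry and mirrors the arithmetic described above; it is an illustration of the rule, not a replacement for `outliers_iqr`.

```python
import pandas as pd

# Illustrative data with one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
factor = 1.5

q1 = values.quantile(0.25)  # 2.0
q3 = values.quantile(0.75)  # 4.0
iqr = q3 - q1               # 2.0

lower = q1 - factor * iqr   # -1.0
upper = q3 + factor * iqr   # 7.0

# Values outside [lower, upper] are moved to the nearest bound.
clipped = values.clip(lower, upper)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

The extreme value is pulled down to the upper bound while the row count stays the same.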

In [8]:
def molecule_from_smiles(smiles):
    lg = RDLogger.logger()
    # Temporarily silence RDKit logs (Only critical)
    lg.setLevel(RDLogger.CRITICAL)
    try:
        # Extract molecule
        molecule = Chem.MolFromSmiles(smiles, sanitize=True)
        if molecule is None:
            return None, "failed"

        # Remove salts
        clean_molecule = rdMolStandardize.LargestFragmentChooser()
        molecule = clean_molecule.choose(molecule)

        # Sanitize molecule again to reflect changes
        Chem.SanitizeMol(molecule)
        return molecule, "succeed"
    except Exception as e:
        return None, f"error: {e}"
    finally:
        # re-enable logging afterward
        lg.setLevel(RDLogger.INFO)


def calculate_descriptors(molecule):
    # Collect the names of all RDKit 1D/2D descriptors
    descriptor_names = [name for name, _ in Descriptors._descList]

    # Use descriptors to calculate values
    calculator = md.MolecularDescriptorCalculator(descriptor_names)
    descriptor_values = calculator.CalcDescriptors(molecule)

    # Create dictionary
    descriptors = dict(zip(descriptor_names, descriptor_values))
    return descriptors


def outliers_iqr(df, factor=1.5):
    df_copy = df.copy()
    for col in df_copy.columns:
        # Assumes every column is numeric (the caller passes numeric descriptors only)
        q1 = df_copy[col].quantile(0.25)
        q3 = df_copy[col].quantile(0.75)
        iqr = q3 - q1

        # If IQR is 0 - column is too flat → skip
        if iqr == 0:
            continue

        lower = q1 - factor * iqr
        upper = q3 + factor * iqr

        # Apply the IQR limits
        df_copy[col] = df_copy[col].clip(lower, upper)
    return df_copy

## Dataset Processing and Descriptor Cleaning Pipeline

This notebook processes the *in chemico* dataset by:
- Converting SMILES strings into RDKit molecule objects  
- Computing 1D/2D molecular descriptors  
- Cleaning and preparing descriptors for machine learning  

It produces two key outputs:
1. **Full Excel report** – original dataset + raw descriptors + molecule build status  
2. **Clean CSV files** – train/test feature matrices ready for ML modeling

---

### Configuration Variables

| Variable | Description |
|-----------|--------------|
| `ORIG_DATASET` | Path to the original Excel file with SMILES and labels |
| `SKIP_ROWS` | Number of rows to skip at the top of the Excel file (e.g. non-data header) |
| `SMILES_COL` | Column name containing SMILES strings |
| `TARGET_COL` | Column name of the target variable (e.g. `Phototoxicity`) |
| `FULL_OUTPUT_DATASET` | Excel output with all raw descriptors + molecule status |
| `TRAIN_X_CSV`, `TEST_X_CSV` | Clean numeric descriptor files for ML (train/test) |
| `TRAIN_Y_CSV`, `TEST_Y_CSV` | Corresponding label files for ML (train/test) |
| `SIMILARITY_THRESHOLD` | Drop descriptors where ≥ this fraction of values are identical (e.g. 0.80 = 80%) |

---

### Workflow Overview

#### 1 Load the Dataset
- Read the Excel file specified by `ORIG_DATASET` using `pandas.read_excel()`.
- Skip any non-data rows (`SKIP_ROWS`).

#### 2 Convert SMILES → RDKit Molecules
- Loop through the SMILES column (`SMILES_COL`).
- Convert each SMILES to an RDKit `Mol` object using `molecule_from_smiles()`.
- Track molecule build status (`"succeed"`, `"failed"`, or `"error: ..."`).

#### 3 Compute RDKit Descriptors
- For each valid molecule, compute 1D/2D descriptors with `calculate_descriptors()`.
- Store descriptor values as dictionaries in a list.

#### 4 Build the Descriptor Table
- Convert the list of descriptor dictionaries into a single `pandas.DataFrame`.
- Each **descriptor** becomes a **column**; each **molecule** becomes a **row**.
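
A minimal sketch of this step, assuming placeholder descriptor names (`desc_a`, `desc_b`) and toy values: an empty dictionary, which is what a failed molecule contributes, becomes a row of `NaN` so the table stays aligned with the original dataset.

```python
import pandas as pd

# Placeholder descriptor dictionaries; the middle entry mimics a failed molecule.
rows = [
    {"desc_a": 1.0, "desc_b": 10.0},
    {},
    {"desc_a": 3.0, "desc_b": 30.0},
]

# One row per molecule, one column per descriptor.
table = pd.DataFrame(rows)

# Rows that are entirely NaN correspond to molecules without descriptors.
empty_rows = table.isna().all(axis=1).tolist()
print(table.shape, empty_rows)
```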

#### 5 Descriptor Cleaning (Leak-Free)
All cleaning is fitted **only on the training set** to prevent data leakage:
- Keep only **numeric** descriptor columns.
- Replace `inf` and `-inf` values with `NaN`.
- Fill missing values with the **median** from the **training set**.
- Drop **constant or near-constant** descriptors where ≥ `SIMILARITY_THRESHOLD` of values are identical.
- Apply **IQR-based clipping** (`outliers_iqr(df, factor=1.5)`) on the **training data only** to cap extreme outliers.
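
The fit-on-train, apply-to-both pattern behind the imputation and near-constant steps can be sketched on toy data. The column names, values, and the `threshold` variable below are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Toy train/test features (illustrative values only).
X_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0, 7.0],
    "b": [0.0, 0.0, 0.0, 0.0, 1.0],
})
X_test = pd.DataFrame({
    "a": [np.nan, 100.0],
    "b": [1.0, 1.0],
})
threshold = 0.80  # plays the role of SIMILARITY_THRESHOLD

# Fit: medians come from the training split only.
train_medians = X_train.median()
# Transform: the same medians fill gaps in both splits.
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)

# Fit: flag columns whose most frequent training value reaches the threshold.
near_constant = [
    col for col in X_train.columns
    if X_train[col].value_counts(normalize=True, dropna=False).max() >= threshold
]
# Transform: drop the same columns from both splits.
X_train = X_train.drop(columns=near_constant)
X_test = X_test.drop(columns=near_constant)

print(train_medians["a"], near_constant, X_test["a"].tolist())
```

The missing test value is filled with the training median, even though the test split contains a much larger value, and column `b` is dropped from both splits because 80% of its training values are identical.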

#### 6 Merge and Save Outputs
- Merge the **original dataset** and **raw descriptor data** into one DataFrame,  
  adding a `MoleculeStatus` column.
- Save:
  - The full annotated dataset → `FULL_OUTPUT_DATASET` (Excel)
  - Cleaned train/test features → `TRAIN_X_CSV` and `TEST_X_CSV`
  - Train/test labels → `TRAIN_Y_CSV` and `TEST_Y_CSV`
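
Column-wise merging with `pd.concat(axis=1)` aligns rows by index label, which is why both frames get a fresh index before merging. The sketch below, using hypothetical frames, shows what can go wrong when the labels differ.

```python
import pandas as pd

# A frame that kept non-contiguous index labels, e.g. after filtering.
original = pd.DataFrame({"name": ["x", "y"]}, index=[5, 9])
# A frame built from a list, with a fresh 0..1 index.
descriptors = pd.DataFrame({"desc_a": [1.0, 2.0]})

# Without resetting, rows are matched by label and padded with NaN.
misaligned = pd.concat([original, descriptors], axis=1)

# With a reset index, rows are matched by position.
aligned = pd.concat(
    [original.reset_index(drop=True), descriptors.reset_index(drop=True)],
    axis=1,
)
print(misaligned.shape, aligned.shape)
```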

#### 7 Logging and Summary
After processing:
- Print the number of rows/columns for the full and clean datasets.
- List descriptors dropped as constant or almost constant.
- Show a preview (`head()`) of both full and cleaned feature sets.

---

### Notes
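- Columns that are **completely NaN** are dropped *before splitting* (safe operation).
- Median imputation and near-constant filtering are **fit only on the training set** and then applied to both splits.
- IQR clipping is applied to the **training set only**, so the saved test features are left unclipped.
- Together, these steps keep test-set statistics out of the preprocessing and guard against **data leakage** in later ML experiments.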

In [9]:
# Configuration
ORIG_DATASET = "in_chemico_dataset.xlsx"
SKIP_ROWS = 1
SMILES_COL = "SMILES code"
TARGET_COL = "Phototoxicity"
FULL_OUTPUT_DATASET = "in_chemico_dataset_processed.xlsx"

# Outputs
TRAIN_X_CSV = "in_chemico_x_train.csv"
TEST_X_CSV = "in_chemico_x_test.csv"
TRAIN_Y_CSV = "in_chemico_y_train.csv"
TEST_Y_CSV = "in_chemico_y_test.csv"

# Near-constant threshold (tolerance)
SIMILARITY_THRESHOLD = 0.80

# Load dataset and skip first row (Header)
dataset = pd.read_excel(ORIG_DATASET, engine="openpyxl", skiprows=SKIP_ROWS)

# Build descriptors
descriptor_rows = []
state_molecules = []
molecules = []

# Loop over the SMILES column
for smiles in dataset[SMILES_COL].astype(str):
    molecule, state = molecule_from_smiles(smiles)
    state_molecules.append(state)
    molecules.append(molecule)

    # If molecule construction failed - empty placeholder
    if molecule is None:
        descriptor_rows.append({})
        continue
    # Calculate descriptors for each molecule
    descriptor_rows.append(calculate_descriptors(molecule))

# Convert list of dictionaries into dataframe
descriptor_data_all = pd.DataFrame(descriptor_rows)

# Keep everything + status
output = pd.concat(
    [dataset.reset_index(drop=True), descriptor_data_all.reset_index(drop=True)],
    axis=1
)
output["MoleculeStatus"] = state_molecules

# Output whole dataset with descriptors and state
with pd.ExcelWriter(FULL_OUTPUT_DATASET, engine="openpyxl") as writer:
    output.to_excel(writer, index=False, sheet_name="Descriptors")

print(f"Full - Rows: {len(output)}/Columns: {output.shape[1]}")
print(output.head().to_string(index=False))

# Drop failed molecules - boolean mask, True where a molecule was built
molecules_right = [molecule is not None for molecule in molecules]
if not any(molecules_right):
    raise ValueError("No valid molecules after SMILES parsing.")

dataset_ok = dataset.loc[molecules_right].reset_index(drop=True)
descriptor_ok = descriptor_data_all.loc[molecules_right].reset_index(drop=True)

# Target
y_full = dataset_ok[TARGET_COL].astype(int)

# Take only numeric descriptor columns and turn +/-inf into NaN
X_full = descriptor_ok.select_dtypes(include=[np.number]).copy()
X_full = X_full.replace([np.inf, -np.inf], np.nan)

# Drop columns that are entirely NaN
all_nan_cols = X_full.columns[X_full.isna().all()].tolist()
if all_nan_cols:
    print(f"Dropping {len(all_nan_cols)} NaN columns.")
    X_full = X_full.drop(columns=all_nan_cols)

# Split dataset - train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, test_size=0.2, random_state=42, stratify=y_full
)

# Calculate medians for each column in train only
train_medians = X_train.median(numeric_only=True)

# Fill missing values in both train and test using those medians
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)

# Compute constants on train only
constant_cols = []
for col in X_train.columns:
    top_freq = X_train[col].value_counts(normalize=True, dropna=False).max()
    if top_freq >= SIMILARITY_THRESHOLD:
        constant_cols.append(col)

# Drop from train and apply same drop to test
if constant_cols:
    X_train = X_train.drop(columns=constant_cols)
    X_test = X_test.drop(columns=constant_cols)
    print(f"Dropped {len(constant_cols)} constant/almost-constant columns.")

# Clip outliers with the IQR rule on train only (test is left unclipped)
X_train = outliers_iqr(X_train, factor=1.5)

X_train.to_csv(TRAIN_X_CSV, index=False)
X_test.to_csv(TEST_X_CSV, index=False)
y_train.to_csv(TRAIN_Y_CSV, index=False, header=[TARGET_COL])
y_test.to_csv(TEST_Y_CSV, index=False, header=[TARGET_COL])

print(f"Train - Rows: {len(X_train)}/Columns: {X_train.shape[1]}")
print("First rows of train x:")
print(X_train.head().to_string(index=False))
print(f"Test - Rows: {len(X_test)}/Columns: {X_test.shape[1]}")
print("First rows of train y:")
print(y_train.head().to_string(index=False))
X_train.describe()

Full - Rows: 162/Columns: 230
                          Name                                                                                              IUPAC name CAS registry number    Structure  Phototoxicity                                                      SMILES code                            Sources               Note    Unnamed: 8 Unnamed: 9  Unnamed: 10 Unnamed: 11  MaxAbsEStateIndex  MaxEStateIndex  MinAbsEStateIndex  MinEStateIndex      qed       SPS   MolWt  HeavyAtomMolWt  ExactMolWt  NumValenceElectrons  NumRadicalElectrons  MaxPartialCharge  MinPartialCharge  MaxAbsPartialCharge  MinAbsPartialCharge  FpDensityMorgan1  FpDensityMorgan2  FpDensityMorgan3  BCUT2D_MWHI  BCUT2D_MWLOW  BCUT2D_CHGHI  BCUT2D_CHGLO  BCUT2D_LOGPHI  BCUT2D_LOGPLOW  BCUT2D_MRHI  BCUT2D_MRLOW   AvgIpc  BalabanJ    BertzCT      Chi0     Chi0n     Chi0v      Chi1    Chi1n    Chi1v    Chi2n    Chi2v    Chi3n    Chi3v    Chi4n    Chi4v  HallKierAlpha           Ipc    Kappa1   Kappa2   Kappa3  Labute

| | MaxAbsEStateIndex | MaxEStateIndex | MinAbsEStateIndex | MinEStateIndex | qed | SPS | MolWt | HeavyAtomMolWt | ExactMolWt | NumValenceElectrons | ... | fr_C_O_noCOO | fr_NH0 | fr_NH1 | fr_aniline | fr_benzene | fr_bicyclic | fr_ether | fr_halogen | fr_para_hydroxylation | fr_pyridine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | ... | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 |
| mean | 11.352259 | 11.352259 | 0.193352 | -0.913807 | 0.606684 | 14.198433 | 317.702837 | 300.605333 | 317.265179 | 115.248062 | ... | 0.593023 | 1.263566 | 0.434109 | 0.5 | 1.317829 | 0.604651 | 0.523256 | 0.670543 | 0.294574 | 0.232558 |
| std | 2.430989 | 2.430989 | 0.188957 | 1.107129 | 0.211797 | 4.792314 | 126.620808 | 122.16805 | 126.421616 | 43.814066 | ... | 0.772495 | 1.444382 | 0.632393 | 0.728869 | 0.819532 | 0.797149 | 0.80005 | 0.913219 | 0.564622 | 0.476177 |
| min | 6.194108 | 6.194108 | 0.000297 | -3.038324 | 0.139518 | 6.0 | 46.069 | 40.021 | 46.041865 | 20.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 10.206402 | 10.206402 | 0.052299 | -1.315329 | 0.481898 | 10.857143 | 232.239 | 220.143 | 232.084792 | 88.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 11.846296 | 11.846296 | 0.124071 | -0.83088 | 0.652473 | 12.333333 | 313.788 | 297.66 | 313.09819 | 112.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 12.881264 | 12.881264 | 0.277145 | -0.166667 | 0.791627 | 16.678571 | 381.379 | 367.267 | 381.075882 | 138.0 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| max | 15.645307 | 15.645307 | 0.614414 | 1.311296 | 0.89323 | 25.410714 | 605.089 | 587.953 | 604.562518 | 213.0 | ... | 2.5 | 5.0 | 2.5 | 2.5 | 3.0 | 2.5 | 2.5 | 2.5 | 2.0 | 2.0 |
