# Decision Tree Model

This notebook does two things:

1. Import a CSV file and convert SMILES into fingerprints
2. Apply those fingerprints as input to train a Decision Tree Model

Most of the code was generated with ChatGPT to kick-start the project and then modified.

In [1]:
# Imports 
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

After importing the necessary libraries, we load the CSV file containing the data. The file holds 15,618 molecules as SMILES strings (text). Two additional columns provide the InChI identifier (column `InChi`) and the activity (float) of each molecule.

## Data and SMILES -> Fingerprints

In [2]:
# Load the CSV file
file_path = "../Data/Mtb_published_regression_AC_Cleaned(in).csv"  # Replace with your file path
data = pd.read_csv(file_path)

# Ensure the file has the necessary columns
required_columns = {"SMILES", "InChi", "Activity"}
if not required_columns.issubset(data.columns):
    raise ValueError(f"The input CSV must contain the following columns: {required_columns}")

After loading the data, we convert each SMILES string into a Morgan fingerprint (radius 2, 1024 bits) using the RDKit library.

In [3]:
# Convert SMILES to molecular fingerprints
def smiles_to_fingerprint(smiles):
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
            return np.array(fp)
        else:
            return None
    except Exception as e:
        print(f"Error processing SMILES {smiles}: {e}")
        return None

# Generate fingerprints
data['fingerprint'] = data['SMILES'].apply(smiles_to_fingerprint)
num_SMILES_raw = data["SMILES"].size
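
Recent RDKit releases emit a deprecation warning for `AllChem.GetMorganFingerprintAsBitVect` and point to the generator-based API instead. The following is a minimal sketch of the same radius-2, 1024-bit Morgan fingerprint built with `rdFingerprintGenerator`, assuming an RDKit version that ships this module. The helper name `smiles_to_fingerprint_v2` and the example SMILES strings are illustrative only and are not part of the original notebook.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Build the generator once and reuse it for every molecule.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)

def smiles_to_fingerprint_v2(smiles):
    """Return a 1024-bit Morgan fingerprint as a NumPy array, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable SMILES
    if mol is None:
        return None
    return morgan_gen.GetFingerprintAsNumPy(mol)

fp_ethanol = smiles_to_fingerprint_v2("CCO")           # ethanol, illustrative input
fp_invalid = smiles_to_fingerprint_v2("not_a_smiles")  # deliberately invalid input
print(fp_ethanol.shape, int(fp_ethanol.sum()), fp_invalid)
```

Because `Chem.MolFromSmiles` signals a parse failure by returning `None` rather than raising, the explicit `is None` check is what filters out invalid molecules here.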

After generating the fingerprints, we drop any molecules whose SMILES could not be parsed.

In [4]:
# Remove rows with invalid SMILES
data = data.dropna(subset=['fingerprint'])
X = np.array(data['fingerprint'].tolist())
y = data['Activity'].values

To check how many molecules were removed, we compare the size of the original dataset with the size of the filtered dataset.

In [5]:
print(f"Original Dataset has {num_SMILES_raw} molecules")
num_fingerprints = data["fingerprint"].size
print(f"{num_SMILES_raw - num_fingerprints} molecules were removed")

Original Dataset has 15618 molecules
0 molecules were removed


## Model

The dataset is split 8:1:1 into training, test, and validation sets.

In [6]:
# Initial 80-20 split (train and temp)
hitch_hiker = 42
non_training_size = 0.2
validation_size = 0.5
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=non_training_size, random_state=hitch_hiker)

# Further split temp into 50-50 (test and validation)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=validation_size, random_state=hitch_hiker)

# Check the sizes
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"Validation set size: {len(X_val)}")

# Build and train the decision tree model
model = DecisionTreeRegressor(random_state=hitch_hiker)
model.fit(X_train, y_train)

Training set size: 12494
Testing set size: 1562
Validation set size: 1562
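
The set sizes above follow from how `train_test_split` rounds: for a fractional `test_size`, scikit-learn takes the ceiling for the held-out part and gives the remainder to the training part. The small sketch below reproduces the 12494/1562/1562 split on placeholder indices. The array `indices` stands in for the real `X` and is an assumption of this example.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_molecules = 15618                # dataset size reported above
indices = np.arange(n_molecules)   # placeholder for the fingerprint matrix X

# Same two-stage split as the notebook: 80/20, then the 20% halved.
train_idx, temp_idx = train_test_split(indices, test_size=0.2, random_state=42)
test_idx, val_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

# The held-out size is ceil(test_size * n); training receives the remainder.
expected_temp = math.ceil(0.2 * n_molecules)
print(len(train_idx), len(test_idx), len(val_idx), expected_temp)
```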


In [7]:
# Evaluate on test and validation sets
y_test_pred = model.predict(X_test)
y_val_pred = model.predict(X_val)

mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)
mse_val = mean_squared_error(y_val, y_val_pred)
r2_val = r2_score(y_val, y_val_pred)

print(f"Test Set - Mean Squared Error: {mse_test}, R^2 Score: {r2_test}")
print(f"Validation Set - Mean Squared Error: {mse_val}, R^2 Score: {r2_val}")

# Save the model (optional)
import joblib
joblib.dump(model, "decision_tree_model.pkl")

Test Set - Mean Squared Error: 0.5828544096592759, R^2 Score: 0.1161765333343604
Validation Set - Mean Squared Error: 0.6034630754251734, R^2 Score: 0.11768918302789977


['decision_tree_model.pkl']
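
The list printed above is the return value of `joblib.dump`, which reports the file(s) written. A saved model can later be restored with `joblib.load`. The following is a minimal round-trip sketch on synthetic data; the toy arrays and the temporary file path are assumptions of this example, not the notebook's data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the fingerprint matrix and activities.
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(200, 16))
y_toy = X_toy[:, 0] * 1.5 + rng.normal(scale=0.1, size=200)

toy_model = DecisionTreeRegressor(random_state=42).fit(X_toy, y_toy)

# Save to a temporary directory, reload, and confirm identical predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_tree.pkl")
    joblib.dump(toy_model, model_path)
    restored_model = joblib.load(model_path)

predictions_match = np.array_equal(
    toy_model.predict(X_toy), restored_model.predict(X_toy)
)
print(predictions_match)
```

Pickled scikit-learn models are generally only safe to reload with the same library version that wrote them, so it is worth recording the scikit-learn version alongside the `.pkl` file.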

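An MSE of about 0.58 to 0.60 looks small, but it can only be judged relative to the spread of the activity values. The R^2 score of about 0.12 on both sets shows that the model explains only around 12% of the variance in activity, so it performs only slightly better than always predicting the mean activity.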
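
Since R^2 = 1 - MSE / Var(y) on a given set, the two reported metrics together imply the variance of the activity values in that set. The sketch below works through that arithmetic using the printed results, rounded to five decimals. The function name `implied_variance` is introduced here for illustration.

```python
# Metrics printed by the evaluation cell (rounded).
mse_test, r2_test = 0.58285, 0.11618
mse_val, r2_val = 0.60346, 0.11769

def implied_variance(mse, r2):
    """Variance of the targets implied by R^2 = 1 - MSE / Var(y)."""
    return mse / (1.0 - r2)

var_test = implied_variance(mse_test, r2_test)
var_val = implied_variance(mse_val, r2_val)

# Fraction of the target variance that the model leaves unexplained.
unexplained_test = mse_test / var_test
print(f"Implied variance: test={var_test:.3f}, validation={var_val:.3f}")
print(f"Unexplained fraction on test set: {unexplained_test:.3f}")
```

Read this way, the MSE is close to the variance of the activities themselves, which is why a seemingly small error still corresponds to a weak fit.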

## Visualise

In [8]:
from sklearn.tree import export_graphviz
import graphviz

# Export the decision tree to DOT format
dot_data = export_graphviz(
    model, 
    out_file=None, 
    feature_names=None,     # Replace with feature names
    filled=True, 
    rounded=True, 
    special_characters=True
)

# Render the DOT data to an image
graph = graphviz.Source(dot_data)
graph.render("decision_tree")  # Saves as decision_tree.pdf
graph.view()                   # Opens the visualization in a viewer

ModuleNotFoundError: No module named 'graphviz'
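
The `ModuleNotFoundError` above means the `graphviz` Python package is not installed in this environment, so the cell stops before any tree is exported. Until it is installed, scikit-learn's `export_text` offers a plain-text view of a fitted tree without extra dependencies. The following is a minimal sketch on synthetic data; the toy arrays and the `bit_i` feature names are assumptions of this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in: 8 binary "fingerprint bits", activity driven mostly by bit 0.
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(300, 8))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

# A shallow tree keeps the printout readable.
toy_tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_toy, y_toy)

feature_names = [f"bit_{i}" for i in range(X_toy.shape[1])]
tree_text = export_text(toy_tree, feature_names=feature_names, max_depth=2)
print(tree_text)
```

For the notebook's own model, a comparable call would be `export_text(model, max_depth=3)`. Limiting the depth matters because an unconstrained tree trained on roughly 12,500 molecules is likely far too large to read in full.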