###### Decision Tree Model - Morgan

This Notebook will do two things: 

1. Import a CSV file and convert SMILES into Morgan fingerprints
2. Apply those fingerprints as input to train a Decision Tree Model

Most of the code was sourced from ChatGPT to kick-start the project. Modifications were applied. 

In [1]:
# Imports 
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

After importing the necessary libraries, we load the CSV file containing the data. This file holds 15,618 molecules as SMILES strings (text). Two additional columns provide the International Chemical Identifier (InChI) and the activity (float) of each molecule.

## Data and Fingerprints

In [2]:
def load_csv(file_path):
    """Read a CSV file and return a pandas DataFrame."""

    # Load the CSV file
    data = pd.read_csv(file_path)

    # Ensure the file has the necessary columns
    required_columns = {"SMILES", "InChi", "Activity"}
    if not required_columns.issubset(data.columns):
        raise ValueError(f"The input CSV must contain the following columns: {required_columns}")

    return data

After loading the data, we convert the SMILES into Morgan fingerprints using the RDKit library.

In [7]:
# Convert SMILES to molecular fingerprints
def smiles_to_fingerprint(smiles, radius = 2, nBits = 2048):
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=nBits)
            return np.array(fp)
        else:
            return None
    except Exception as e:
        print(f"Error processing SMILES {smiles}: {e}")
        return None

# Convert InChI to molecular fingerprints
def inchis_to_fingerprint(inchis, radius = 2, nBits = 2048):
    try:
        # Parse the InChI string; the result is None if it is invalid
        mol = Chem.MolFromInchi(inchis)
        if mol:
            fp_inchi = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=nBits)
            return np.array(fp_inchi)
        else:
            return None
    except Exception as e:
        print(f"Error processing InChI {inchis}: {e}")
        return None
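
To build intuition for the `nBits` parameter, the toy sketch below folds a set of integer feature identifiers into a fixed-length bit vector. It is not RDKit's Morgan algorithm. The identifiers and the helper name `fold_to_bits` are hypothetical, chosen only to show how a shorter vector makes collisions between distinct features more likely.

```python
def fold_to_bits(feature_ids, n_bits):
    """Fold integer feature identifiers into a fixed-length 0/1 list (toy illustration)."""
    bits = [0] * n_bits
    for fid in feature_ids:
        bits[fid % n_bits] = 1  # distinct ids can land on the same position
    return bits

# Hypothetical identifiers standing in for hashed atom environments
features = [3, 11, 19, 42, 1027]

small = fold_to_bits(features, n_bits=8)
large = fold_to_bits(features, n_bits=2048)

print(sum(small))  # 2: four of the five ids collide on position 3
print(sum(large))  # 5: every id keeps its own position
```

With only 8 positions, several features share a bit and the model can no longer tell them apart. A longer vector keeps them separate at the cost of more input columns.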

After creating the fingerprints, we remove any rows with missing values, including molecules that RDKit could not parse.

In [8]:
# Remove rows with missing values
def clean_data(data):
    """Drop rows with missing values and return X (fingerprints) and y (activities)."""

    original_size = data["SMILES"].size

    # Remove any missing values
    data = data.dropna(subset=['SMILES'])
    data = data.dropna(subset=['InChi'])
    data = data.dropna(subset=['Activity'])
    data = data.dropna(subset=['Fingerprint'])

    # Define X, y
    X = np.array(data['Fingerprint'].tolist())
    y = data['Activity'].values

    # Check the difference between clean and unclean data
    final_size = data["SMILES"].size
    print(f"Original Dataset has {original_size} molecules")
    print(f"{original_size-final_size} of molecule were removed")
    print(f"Final Size is {final_size}")
    
    return X, y

To check how many molecules were removed, the function prints the size of the dataset before and after cleaning.

## Model
Let's train a simple Decision Tree Regressor.

In [10]:
# Load data
data = load_csv("../Data/training.csv")
data.head()
data.describe()

| | Activity |
|---|---|
| count | 15618.0 |
| mean | 5.006235 |
| std | 0.840051 |
| min | 1.87257 |
| 25% | 4.428867 |
| 50% | 4.936465 |
| 75% | 5.50864 |
| max | 10.0 |


In [11]:


# First split: 80% training, 20% (validation + test)
train_data, temp_data = train_test_split(data, test_size=0.2, random_state=42)

# Second split: 50% validation, 50% test (from the 20%)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Prep training set
train_data['Fingerprint'] = train_data['SMILES'].apply(smiles_to_fingerprint)
X_train, y_train = clean_data(train_data)

# Prep validation set
val_data['Fingerprint'] = val_data['SMILES'].apply(smiles_to_fingerprint)
X_val, y_val = clean_data(val_data)

# Prep test set
test_data['Fingerprint'] = test_data['SMILES'].apply(smiles_to_fingerprint)
X_test, y_test = clean_data(test_data)

# Check the sizes
print("")
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"Validation set size: {len(X_val)}")

Original Dataset has 12494 molecules
0 of molecule were removed
Final Size is 12494
Original Dataset has 1562 molecules
0 of molecule were removed
Final Size is 1562
Original Dataset has 1562 molecules
0 of molecule were removed
Final Size is 1562

Training set size: 12494
Testing set size: 1562
Validation set size: 1562
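
The set sizes in the output follow from how the two fractions compose. Here is a minimal sketch of the arithmetic, assuming the test portion is rounded up and the remainder goes to the other portion. The helper `split_sizes` is hypothetical and only mirrors that assumed rounding.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_rest, n_test) for a fractional split, rounding the test part up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

n_total = 15618  # rows reported by data.describe()

n_train, n_temp = split_sizes(n_total, 0.2)  # 80% training / 20% held out
n_val, n_test = split_sizes(n_temp, 0.5)     # held-out part halved: 10% / 10%

print(n_train, n_val, n_test)  # 12494 1562 1562
```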


In [12]:
# Build and train the decision tree model
model = DecisionTreeRegressor(criterion='squared_error', splitter='best', 
                              max_depth=None, min_samples_split=2, min_samples_leaf=1, 
                              min_weight_fraction_leaf=0.0, max_features=None, random_state=None, 
                              max_leaf_nodes=None, min_impurity_decrease=0.0, ccp_alpha=0.0)
model.fit(X_train, y_train)

# Predict from the model
y_test_pred = model.predict(X_test)

# Calculate MSE and R^2 on the test data
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f"Test Set - Mean Squared Error: {mse_test}, R^2 Score: {r2_test}")


Test Set - Mean Squared Error: 0.5261513274444239, R^2 Score: 0.23072508248931467


## Testing and Validating
Evaluate the trained model on the test and validation sets.

In [20]:
# Evaluate on test and validation sets
y_test_pred = model.predict(X_test)
y_val_pred = model.predict(X_val)

mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)
mse_val = mean_squared_error(y_val, y_val_pred)
r2_val = r2_score(y_val, y_val_pred)

print(f"Test Set - Mean Squared Error: {mse_test}, R^2 Score: {r2_test}")
print(f"Validation Set - Mean Squared Error: {mse_val}, R^2 Score: {r2_val}")

# Save the model (optional)
import joblib
joblib.dump(model, "decision_tree_model.pkl")

Test Set - Mean Squared Error: 0.5360262138012457, R^2 Score: 0.21628721643950632
Validation Set - Mean Squared Error: 0.5262215107557578, R^2 Score: 0.2020530133038475


['decision_tree_model.pkl']
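
A validation set is typically used to choose hyperparameters before the test set is consulted. The sketch below shows that idea on synthetic data. The random features, the candidate depths, and the variable names are all assumptions for illustration, not results from this dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic binary features standing in for fingerprints
X_demo = rng.integers(0, 2, size=(300, 32))
# Target depends on the first four bits plus noise
y_demo = X_demo[:, :4].sum(axis=1) + rng.normal(0.0, 0.5, size=300)

X_tr, y_tr = X_demo[:200], y_demo[:200]
X_va, y_va = X_demo[200:], y_demo[200:]

depths = [2, 4, 8, None]  # None places no limit on tree depth
val_mse = {}
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    val_mse[depth] = mean_squared_error(y_va, tree.predict(X_va))

# Pick the depth with the lowest validation error
best_depth = min(val_mse, key=val_mse.get)
print(f"Best max_depth on the validation split: {best_depth}")
```

The same loop could be run with `X_train`, `y_train`, `X_val`, and `y_val` from this notebook, keeping the test set untouched until a final depth is chosen.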

The MSE (about 0.53) looks small in absolute terms, but the R^2 score of roughly 0.2 shows that the model explains only about 20% of the variance in activity, so the fit is weak. Increasing the radius to 3 reduced the R^2 score, while increasing the number of fingerprint bits (nBits) increased it.
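
To see how the two metrics relate, recall that R^2 = 1 - MSE / Var(y) on the evaluated set. Here is a minimal sketch using the test-set numbers printed above, rounded to four decimals. The implied variance is derived from those two numbers, not measured separately.

```python
import math

mse_reported = 0.5360  # test-set MSE from the output above (rounded)
r2_reported = 0.2163   # test-set R^2 from the output above (rounded)

# Typical prediction error, in the same units as the activity
rmse = math.sqrt(mse_reported)

# Variance of y implied by R^2 = 1 - MSE / Var(y)
implied_variance = mse_reported / (1 - r2_reported)
implied_std = math.sqrt(implied_variance)

print(f"RMSE: {rmse:.3f}")
print(f"Implied std of y: {implied_std:.3f}")
```

An RMSE close to the spread of the activities themselves (the full dataset has a standard deviation of about 0.84) is consistent with the low R^2 score.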