<a href="https://colab.research.google.com/github/chupvl/gcolab/blob/main/2023_04_14_chatgpt_for_xgb_qsar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hey ChatGPT, write me an XGBoost-based classification model for compounds using Morgan fingerprints

Sure, I can help you create an XGBoost-based classification model for compounds using Morgan fingerprints. Below is an example of a basic script in Python, using RDKit for generating Morgan fingerprints and the XGBoost library for creating the classification model.

Assuming you have a dataset containing SMILES strings and their corresponding labels (1 for active, 0 for inactive), you can follow the steps below:

First, install the required packages:

In [1]:
!pip install numpy pandas rdkit xgboost scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rdkit
  Downloading rdkit-2022.9.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.4 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2022.9.5


Create a Python script for the XGBoost classifier with Morgan fingerprints:

In [2]:
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

In [3]:
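def generate_morgan_fingerprints(smiles_list, radius, nBits):
    """Return an (n_molecules, nBits) array of Morgan fingerprint bits."""
    fps = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # MolFromSmiles returns None when the SMILES cannot be parsed
            raise ValueError(f"Invalid SMILES: {smiles!r}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nBits)
        fps.append(np.array(fp))
    return np.array(fps)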

In [4]:
def main():
    # Load your dataset
    # data = pd.read_csv('your_dataset.csv')
    # smiles_list = data['smiles'].tolist()
    # labels = data['label'].tolist()

    # Example data
    smiles_list = ['CCO', 'CCN', 'CCBr']
    labels = [1, 0, 1]

    # Generate Morgan fingerprints
    radius = 2
    nBits = 2048
    fingerprints = generate_morgan_fingerprints(smiles_list, radius, nBits)

    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(fingerprints, labels, test_size=0.2, random_state=42)

    # Train the XGBoost classifier
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)

    # Predict the test set and evaluate the performance
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    print("Accuracy:", accuracy)
    print("Confusion Matrix:", cm)

if __name__ == "__main__":
    main()

Accuracy: 0.0
Confusion Matrix: [[0 0]
 [1 0]]


This script demonstrates how to generate Morgan fingerprints from SMILES strings, split the dataset into training and test sets, train an XGBoost classifier, and evaluate its performance using accuracy and a confusion matrix.
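
The accuracy of 0.0 printed above is an artifact of the three-molecule toy set rather than a verdict on the approach. As a rough sketch, assuming that a fractional test_size is rounded up to a whole number of samples (which is how scikit-learn's train_test_split treats float sizes), the split sizes work out as follows. The helper name split_sizes is made up for illustration:

```python
import math

def split_sizes(n_samples, test_size):
    """Approximate how a fractional test_size maps to whole sample counts."""
    n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
    n_train = n_samples - n_test                # the remainder is used for training
    return n_train, n_test

n_train, n_test = split_sizes(3, 0.2)
print(n_train, n_test)  # → 2 1

# With a single test compound, accuracy can only ever be 0.0 or 1.0
possible_accuracies = sorted({correct / n_test for correct in range(n_test + 1)})
print(possible_accuracies)  # → [0.0, 1.0]
```

With two training compounds and one test compound, the metric cannot distinguish a good model from a bad one, so a real dataset is needed before reading anything into the numbers.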

Please note that this is a basic example that you should adapt to your dataset and requirements. You may also want to tune the XGBoost hyperparameters (for example max_depth, learning_rate, and n_estimators) to improve the model's performance.

# Great! Can you please also add various descriptors like MW, logP, and the like, and combine them with the Morgan fingerprint?

Certainly! You can calculate various molecular descriptors using RDKit, such as molecular weight (MW), the octanol-water partition coefficient (logP), and the numbers of hydrogen-bond donors and acceptors. In the following example, I'll show you how to calculate these descriptors and combine them with the Morgan fingerprints:



In [5]:
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

def generate_morgan_fingerprints_and_descriptors(smiles_list, radius, nBits):
    fps_and_descriptors = []
    for smiles in smiles_list:
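        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # MolFromSmiles returns None when the SMILES cannot be parsed
            raise ValueError(f"Invalid SMILES: {smiles!r}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nBits)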

        # Calculate molecular descriptors
        mw = Descriptors.MolWt(mol)
        logp = Descriptors.MolLogP(mol)
        num_h_donors = Descriptors.NumHDonors(mol)
        num_h_acceptors = Descriptors.NumHAcceptors(mol)

        # Combine Morgan fingerprints and descriptors
        combined_features = np.concatenate((fp, [mw, logp, num_h_donors, num_h_acceptors]))
        fps_and_descriptors.append(combined_features)
    return np.array(fps_and_descriptors)

def main():
    # Load your dataset
    # data = pd.read_csv('your_dataset.csv')
    # smiles_list = data['smiles'].tolist()
    # labels = data['label'].tolist()

    # Example data
    smiles_list = ['CCO', 'CCN', 'CCBr']
    labels = [1, 0, 1]

    # Generate Morgan fingerprints and descriptors
    radius = 2
    nBits = 2048
    features = generate_morgan_fingerprints_and_descriptors(smiles_list, radius, nBits)

    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

    # Train the XGBoost classifier
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)

    # Predict the test set and evaluate the performance
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    print("Accuracy:", accuracy)
    print("Confusion Matrix:", cm)

if __name__ == "__main__":
    main()


Accuracy: 0.0
Confusion Matrix: [[0 0]
 [1 0]]


# Fantastic! Thank you! But can you use the rdkit.Chem.rdMolDescriptors module instead of rdkit.Chem.Descriptors, and please use all possible descriptors?


Sure, I can modify the script to use the rdkit.Chem.rdMolDescriptors module and calculate the descriptors it exposes. Note that rdkit.Chem.Descriptors itself wraps many rdMolDescriptors functions, so switching modules does not necessarily give you more descriptors. Here's the updated script:

In [6]:
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

def generate_morgan_fingerprints_and_descriptors(smiles_list, radius, nBits):
    fps_and_descriptors = []
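    # rdMolDescriptors names its descriptor functions Calc* (for example CalcTPSA)
    calc_names = [name for name in dir(rdMolDescriptors) if name.startswith('Calc')]
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Invalid SMILES: {smiles!r}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nBits)

        # Calculate molecular descriptors using rdMolDescriptors
        descriptors = []
        for descriptor_name in calc_names:
            try:
                value = getattr(rdMolDescriptors, descriptor_name)(mol)
            except Exception:
                # Needs extra arguments or 3D coordinates; treat as missing
                value = None
            # Keep scalar numbers only; NaN keeps every row the same length
            is_number = isinstance(value, (int, float))
            descriptors.append(float(value) if is_number else np.nan)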

        # Combine Morgan fingerprints and descriptors
        combined_features = np.concatenate((fp, descriptors))
        fps_and_descriptors.append(combined_features)
    return np.array(fps_and_descriptors)

def main():
    # Load your dataset
    # data = pd.read_csv('your_dataset.csv')
    # smiles_list = data['smiles'].tolist()
    # labels = data['label'].tolist()

    # Example data
    smiles_list = ['CCO', 'CCN', 'CCBr']
    labels = [1, 0, 1]

    # Generate Morgan fingerprints and descriptors
    radius = 2
    nBits = 2048
    features = generate_morgan_fingerprints_and_descriptors(smiles_list, radius, nBits)

    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

    # Train the XGBoost classifier
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)

    # Predict the test set and evaluate the performance
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    print("Accuracy:", accuracy)
    print("Confusion Matrix:", cm)

if __name__ == "__main__":
    main()


Accuracy: 0.0
Confusion Matrix: [[0 0]
 [1 0]]


In this updated script, I replaced the rdkit.Chem.Descriptors module with the rdkit.Chem.rdMolDescriptors module. The script iterates through the descriptor functions in rdMolDescriptors and calculates them for each molecule. Some of these functions need extra arguments or 3D coordinates, so the script uses a try-except block to handle the calls that fail.
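
One side effect of silently skipping failed descriptors is worth spelling out: if a function fails for one molecule but not for another, the feature rows end up with different lengths and can no longer be stacked into a matrix. The sketch below illustrates this with made-up stand-in functions (toy_descriptors, features_skip, and features_pad are hypothetical names, not RDKit API) and shows one possible remedy, recording NaN for failures, which XGBoost treats as a missing value by default:

```python
import math

# Hypothetical stand-ins for descriptor functions, operating on a SMILES string
toy_descriptors = {
    "length": lambda s: len(s),
    "second_char_as_int": lambda s: int(s[1]),  # raises ValueError for "CCO"
}

def features_skip(smiles):
    """Drop failing descriptors, as a bare try/except-pass does."""
    row = []
    for func in toy_descriptors.values():
        try:
            row.append(float(func(smiles)))
        except Exception:
            pass
    return row

def features_pad(smiles):
    """Record NaN for failing descriptors so every row has the same length."""
    row = []
    for func in toy_descriptors.values():
        try:
            row.append(float(func(smiles)))
        except Exception:
            row.append(math.nan)
    return row

for smi in ["CCO", "C1CC1"]:
    print(smi, features_skip(smi), features_pad(smi))
# → CCO [3.0] [3.0, nan]
# → C1CC1 [5.0, 1.0] [5.0, 1.0]
```

The padded variant keeps column positions aligned across molecules, so column i always refers to the same descriptor.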

Keep in mind that using all possible descriptors increases the dimensionality of the input features, which can slow training and encourage overfitting, especially on small datasets. Depending on your specific use case, it might be useful to select a subset of relevant descriptors or apply dimensionality reduction techniques like PCA. Additionally, consider tuning the XGBoost hyperparameters to improve the model's performance.
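
As one small instance of descriptor selection, columns that take the same value for every compound carry no information and can be dropped before training. The following is a minimal sketch on a made-up feature matrix; the helper name drop_constant_columns is an assumption for illustration, and the input is assumed to contain no NaN values:

```python
import numpy as np

def drop_constant_columns(X):
    """Return X without zero-variance columns, plus the boolean keep-mask."""
    X = np.asarray(X, dtype=float)
    keep = X.var(axis=0) > 0  # True where a column varies across compounds
    return X[:, keep], keep

# Made-up matrix: 3 compounds x 4 features, first column is constant
X = np.array([
    [0, 1, 46.07, 1],
    [0, 1, 45.08, 1],
    [0, 0, 108.97, 0],
])

X_reduced, keep = drop_constant_columns(X)
print(keep.tolist())    # → [False, True, True, True]
print(X_reduced.shape)  # → (3, 3)
```

Keeping the mask alongside the reduced matrix makes it possible to apply the same column selection to new compounds at prediction time.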