# Predictive Toxicology using QSAR Analysis with RDKit

This Jupyter Notebook aims to explore Quantitative Structure-Activity Relationship (QSAR) analysis with a focus on predicting toxicity.

The QSAR workflow involves two main steps:

**Step 1: Creating the Mathematical Model for Toxicity**

In the first step, we build a mathematical model of toxicity from the available dataset. The model captures the quantitative relationship between the chemical structure of each compound (encoded as molecular fingerprints) and its toxicity.

**Step 2: Using the Toxicity Model for Filtering Ligand-Based Virtual Screening**

Once the QSAR model has been built and validated, it can be used to filter compounds in ligand-based virtual screening.

### Step 1: Creating the Mathematical Model for Toxicity

***Dataset Preparation***: Load the toxicity dataset, which contains liver-toxicity information for a set of chemical compounds. Clean the data: remove compounds with salts, remove charged compounds, and drop NaN entries.

***Descriptor Calculation***: Calculate molecular descriptors for each compound in the dataset using RDKit. 

***Model Building***: Select a suitable machine learning or statistical model. Train the model using the computed molecular descriptors as features and the toxicity data as the target variable.

***Model Validation***: Evaluate the performance of the model using cross-validation and metrics such as RMSE and the R2 score.

***Model Optimization***: Fine-tune the model parameters to improve its predictive performance, if needed.

***Model Interpretation***: Analyze the model to understand which molecular features contribute to toxicity predictions. This insight can be valuable for designing safer compounds.
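As a rough illustration of the interpretation step, a fitted random forest exposes `feature_importances_`, an impurity-based ranking of the input features. The sketch below runs on synthetic binary features: the data, the 20-bit width, and the choice of bit 3 as the informative feature are assumptions made for illustration and are not part of the liver-toxicity dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for fingerprint data: 200 "compounds" x 20 binary bits
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 20))
# Hypothetical target driven almost entirely by bit 3, plus a little noise
y_toy = 5.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_toy, y_toy)

# Rank the bits from most to least important
ranking = np.argsort(model.feature_importances_)[::-1]
top_bit = int(ranking[0])
print("Most important bit:", top_bit)
```

With MACCS fingerprints, each ranked index corresponds to one structural key, so the top-ranked bits point to the substructures the model relies on most.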

In [None]:
import pandas as pd
import requests
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
import matplotlib.pyplot as plt
import numpy as np

**Machine Learning in Python: Scikit-learn**

[Scikit-learn webpage](https://scikit-learn.org/stable/index.html)

[Scikit-learn GitHub](https://github.com/scikit-learn/scikit-learn)

[Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)


In [None]:
#!pip install scikit-learn 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import cross_val_score

***Dataset Preparation***

The initial liver-toxicity dataset was obtained from the FDA DILIrank dataset: https://www.fda.gov/science-research/liver-toxicity-knowledge-base-ltkb/drug-induced-liver-injury-rank-dilirank-dataset

In [None]:
excel_file = 'LiverToxdf.xlsx'
# Load the Excel file into a DataFrame
df_tox = pd.read_excel(excel_file)
df_tox.head()

In [None]:
len(df_tox)

In [None]:
#Remove NaN 
df_tox_cleaned = df_tox.dropna().reset_index(drop=True)
len(df_tox_cleaned)

In [None]:
# Drop rows with compounds containing charges
df_tox_cleaned = df_tox_cleaned[df_tox_cleaned['SMILES'].apply(lambda x: all(atom.GetFormalCharge() == 0 for atom in Chem.MolFromSmiles(x).GetAtoms()))].reset_index(drop=True)
len(df_tox_cleaned)

- `df_tox_cleaned['SMILES'].apply(lambda x: all(atom.GetFormalCharge() == 0 for atom in Chem.MolFromSmiles(x).GetAtoms()))`: This part of the code applies a lambda function to each value in the 'SMILES' column of the DataFrame. The lambda function converts the SMILES string into an RDKit molecule (`Chem.MolFromSmiles(x)`) and then checks that the formal charge of every atom in that molecule is zero (`atom.GetFormalCharge() == 0`).
- `df_tox_cleaned[...]`: This filters the DataFrame (`df_tox_cleaned`), keeping only the rows where the above condition is true. In other words, rows containing molecules with any atom whose formal charge is not zero are removed.
- `.reset_index(drop=True)`: After filtering, the index is reset to reflect the new dataset without the removed rows. The argument `drop=True` prevents the old index from being stored as an additional column.
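One caveat of the filter described above is that `Chem.MolFromSmiles` returns `None` when RDKit cannot parse a SMILES string, in which case calling `.GetAtoms()` raises an error. A minimal sketch of a more defensive check follows; `is_neutral` is a hypothetical helper name introduced here, and the three example strings are made up for illustration.

```python
from rdkit import Chem

def is_neutral(smiles):
    """Return True if the SMILES parses and every atom has zero formal charge."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit could not parse the string
        return False
    return all(atom.GetFormalCharge() == 0 for atom in mol.GetAtoms())

# Ethanol, a quaternary ammonium cation, and an unparseable string
examples = ["CCO", "C[N+](C)(C)C", "not_a_smiles"]
flags = [is_neutral(s) for s in examples]
print(dict(zip(examples, flags)))
```

Such a helper could be passed directly to `df_tox_cleaned['SMILES'].apply(...)`. Note that a per-atom test like this also rejects net-neutral molecules drawn with charge-separated groups, such as a nitro group written as `[N+](=O)[O-]`.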

In [None]:
#Count the number of compounds in each severity class category
severity_counts = df_tox_cleaned['Severity Class'].value_counts().sort_index()
severity_counts

In [None]:
# Create a bar chart with matplotlib
plt.bar(severity_counts.index, severity_counts, edgecolor='black')

# Add labels and title
plt.xlabel('Severity Class')
plt.ylabel('Number of Compounds')
plt.title('Count of Compounds by Severity Category')

# Show the plot
plt.show()

When dealing with imbalanced datasets, where certain classes are underrepresented, undersampling is a technique that involves reducing the number of instances in the majority class to balance the distribution. Remember that undersampling can lead to a loss of information, and its success depends on the specific characteristics of your dataset and the problem you are addressing. 

In [None]:
# Find the most populated class
most_populated_class = severity_counts.idxmax()

# Set the desired number of instances for all classes (you can adjust it according to your needs)
desired_size = 40

# Create an empty DataFrame to store the balanced dataset
balanced_df = pd.DataFrame()

# Apply undersampling only to the most populated class
for severity_class, count in severity_counts.items():
    if severity_class == most_populated_class:
        # If it's the most populated class, sample it down to at most desired_size rows
        undersampled_indices = df_tox_cleaned[df_tox_cleaned['Severity Class'] == severity_class].sample(n=min(desired_size, count), random_state=42).index
    else:
        # If it's not the most populated class, include all instances
        undersampled_indices = df_tox_cleaned[df_tox_cleaned['Severity Class'] == severity_class].index

    # Append the selected rows to the balanced DataFrame
    balanced_df = pd.concat([balanced_df, df_tox_cleaned.loc[undersampled_indices]])

In [None]:
#Count the number of compounds in each severity class category
severity_counts2 = balanced_df['Severity Class'].value_counts().sort_index()
severity_counts2
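The loop above reduces only the most populated class. If several classes are oversized, one possible variant is to cap every class at the same maximum. The sketch below assumes a toy DataFrame; `undersample`, `toy`, and the cap of 4 are hypothetical names and values chosen for illustration.

```python
import pandas as pd

def undersample(df, label_col, max_per_class, seed=42):
    """Randomly keep at most max_per_class rows from each class."""
    parts = []
    for _, group in df.groupby(label_col):
        n = min(len(group), max_per_class)  # small classes are kept whole
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Toy data: class 0 has 10 rows, class 1 has 3, class 2 has 5
toy = pd.DataFrame({"label": [0] * 10 + [1] * 3 + [2] * 5, "value": range(18)})
balanced_toy = undersample(toy, "label", max_per_class=4)
counts = balanced_toy["label"].value_counts().sort_index()
print(counts)
```

Classes smaller than the cap are left untouched, so this reduces imbalance without discarding data from already rare classes.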

***Descriptor Calculation***
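The descriptors used in this notebook are MACCS keys, a fixed-length bit vector in which each bit flags the presence of a predefined substructure. As a small illustration of what one fingerprint looks like, the sketch below computes it for a single molecule and copies it into a NumPy array. Aspirin is an arbitrary example, and `MACCSkeys.GenMACCSKeys` is used here as an alternative entry point to the call used in the notebook.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

# Aspirin, used only as an example molecule
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
fp = MACCSkeys.GenMACCSKeys(mol)

# Copy the bit vector into a NumPy array of 0/1 values
arr = np.zeros((0,), dtype=np.int8)
DataStructs.ConvertToNumpyArray(fp, arr)

print("Number of bits:", fp.GetNumBits())
print("Bits set:", int(arr.sum()))
```

RDKit's MACCS vector has 167 positions because bit 0 is unused, which is why the feature matrix built later has 167 columns.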

In [None]:
maccs_fingerprints_list = []

# Iterate through the SMILES column and calculate the MACCS fingerprint of each compound
for k, row in df_tox_cleaned.iterrows():
    mol = Chem.MolFromSmiles(row.SMILES)

    # Calculate MACCS fingerprints
    maccs_fingerprints = AllChem.GetMACCSKeysFingerprint(mol)
    maccs_fingerprints_list.append(
        {"Compound Name": row["Compound Name"], "smiles": row["SMILES"], "fingerprint": maccs_fingerprints}
    )

# Create a DataFrame from the list of MACCS fingerprints
df_maccs = pd.DataFrame(maccs_fingerprints_list)
df_maccs

In [None]:
df_tox_cleaned.columns

In [None]:
df_tox_cleaned = pd.merge(df_tox_cleaned, df_maccs, on='Compound Name')
df_tox_cleaned

***Data Splitting***

Split the dataset into training and testing sets.
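Because the severity classes are imbalanced, a purely random split can leave rare classes out of the test set. One option worth considering is the `stratify` argument of `train_test_split`, which keeps class proportions similar in both subsets. The sketch below uses synthetic data (the array shapes and class sizes are assumptions); the notebook's own split does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 "compounds" with 167 binary features
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 167))
# Imbalanced integer labels: 50, 30, 15 and 5 samples per class
y_toy = np.repeat([0, 1, 2, 3], [50, 30, 15, 5])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Count how many samples of each class ended up in the test set
print(np.bincount(y_te))
```

Stratification requires at least two members per class, so very rare severity classes may need to be merged or handled separately.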

In [None]:
#Define variables
X = np.array(df_tox_cleaned['fingerprint'].to_list())
Y = df_tox_cleaned['Severity Class']  # The target variable
# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

***Model Building: Random Forest***

In [None]:
# Create and train the Random Forest regression model
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(X_train, Y_train)

# Make predictions on the test set
Y_pred_rf = random_forest_model.predict(X_test)

mse_rf = mean_squared_error(Y_test, Y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(Y_test, Y_pred_rf)
pearson_corr_rf = np.corrcoef(Y_test, Y_pred_rf)[0, 1]
mae_rf = mean_absolute_error(Y_test, Y_pred_rf)

# Print the calculated metrics for Random Forest
print(f'Random Forest - Mean Squared Error (MSE): {mse_rf}')
print(f'Random Forest - Root Mean Squared Error (RMSE): {rmse_rf}')
print(f'Random Forest - R2 Score: {r2_rf}')
print(f'Random Forest - Pearson Correlation Coefficient (r): {pearson_corr_rf}')
print(f'Random Forest - Mean Absolute Error (MAE): {mae_rf}')

- Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): Look for models with lower MSE and RMSE values, as they indicate more accurate predictions. RMSE is expressed in the same units as the target.
- R2 Score: A higher R2 score indicates a better fit of the model to the data. A value of 1 is a perfect fit, and values at or below 0 mean the model does no better than predicting the mean.
- Pearson Correlation Coefficient (r): Look for models with a Pearson correlation coefficient closer to 1. A higher value indicates a stronger linear relationship between predicted and observed values.
- Mean Absolute Error (MAE): Look for models with a lower MAE, as it indicates a smaller average absolute difference between predictions and actual observations.
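To make these definitions concrete, the metrics can be computed by hand with NumPy. The sketch below uses four made-up observed and predicted values, not results from the notebook.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # made-up observed values
y_pred = np.array([1.5, 2.0, 2.5, 4.5])  # made-up predictions

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)            # mean of squared errors
rmse = np.sqrt(mse)                      # back in the units of the target
mae = np.mean(np.abs(residuals))         # mean of absolute errors
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
pearson_r = np.corrcoef(y_true, y_pred)[0, 1]

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f} r={pearson_r:.4f}")
```

These should agree with `mean_squared_error`, `mean_absolute_error`, and `r2_score` from scikit-learn on the same inputs.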

**Model Evaluation**

Cross-Validation Procedure:

In k-fold cross-validation, the dataset is divided into k subsets (folds). The model is trained and evaluated k times, using a different fold for evaluation each time and the remaining folds for training. The performance metrics are then averaged across all folds to provide a more robust assessment.
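The fold rotation described above can be made explicit with scikit-learn's `KFold`. The sketch below uses ten dummy samples and prints which indices are used for training and validation in each fold; the shuffling and seed are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # ten dummy samples, two features each

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_toy)):
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
    validation_indices.extend(val_idx.tolist())

# Every sample should appear in exactly one validation fold
print(sorted(validation_indices))
```

`cross_val_score` runs an equivalent loop internally, fitting a fresh copy of the model on each training portion and scoring it on the held-out fold.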

In [None]:
# Perform cross-validation for Random Forest (5-fold cross-validation)
cv_scores_rf = cross_val_score(random_forest_model, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse_scores_rf = np.sqrt(-cv_scores_rf) # Convert the negative mean squared error to positive (sklearn returns neg_mean_squared_error)
print("Random Forest - Cross-Validation RMSE Scores:", cv_rmse_scores_rf)
print("Mean CV RMSE Score for Random Forest:", cv_rmse_scores_rf.mean())

In [None]:
plt.scatter(Y_test, Y_pred_rf, alpha=0.4, label='Random Forest')
plt.plot([0, 10], [0, 10], color='red', linestyle='--')  # Diagonal line for reference
plt.title('Random Forest Regression') 
plt.xlabel('Experimental Toxicity', fontsize='large', fontweight='bold')
plt.ylabel('Predicted Toxicity', fontsize='large', fontweight='bold')
plt.xlim(0, 10)
plt.ylim(0, 10)
plt.legend()
plt.show()

In [None]:

def performance_by_hyperparameter(n_estimators, random_state):
    random_forest_model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    cv_scores_rf = cross_val_score(random_forest_model, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')
    cv_rmse_scores_rf = np.sqrt(-cv_scores_rf) # Convert the negative mean squared error to positive (sklearn returns neg_mean_squared_error)
    return cv_rmse_scores_rf.mean()

In [None]:
hyperparameter_tuning = []
for n_estimators in [20, 50, 100, 200, 500]:
    for i in range(5):
        rmse = performance_by_hyperparameter(n_estimators, i)
        hyperparameter_tuning.append(dict(n_estimators=n_estimators, rmse=rmse))
hyperparameter_tuning = pd.DataFrame.from_records(hyperparameter_tuning)
hyperparameter_tuning

In [None]:
fig, ax = plt.subplots(1)
ax.scatter(hyperparameter_tuning['n_estimators'], hyperparameter_tuning['rmse'])
ax.set_xscale("log")
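The manual scan above varies a single hyperparameter. When several hyperparameters are tuned together, scikit-learn's `GridSearchCV` can run the same cross-validated comparison over a grid. The sketch below uses synthetic data, and the grid values (`n_estimators`, `max_depth`) are assumptions chosen to keep it fast rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: 80 "compounds" x 20 binary bits
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(80, 20))
y_toy = X_toy[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=80)

param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_toy, y_toy)

# Scores are negated RMSE values, so flip the sign for reporting
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

After fitting, `search.best_estimator_` holds a model refit on all the supplied data with the best parameter combination.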


### Step 2: Using the Toxicity Model for Filtering Ligand-Based Virtual Screening

In [None]:
#Load your dataset
Dataset = pd.read_csv('MoleculeDatabase_compounds_lipinski.csv')
Dataset.head()

***Descriptor calculation***

In [None]:
#Copy only necessary columns
columns  = ['chembl_id', 'smiles']
Dataset2 = Dataset[columns].copy()
Dataset2.head()

In [None]:
maccs_fingerprints_list = []

# Iterate through the SMILES column and calculate the MACCS fingerprint of each candidate
for k, row in Dataset2.iterrows():
    mol = Chem.MolFromSmiles(row['smiles'])

    # Calculate MACCS fingerprints
    maccs_fingerprints = AllChem.GetMACCSKeysFingerprint(mol)
    maccs_fingerprints_list.append(
        {"chembl_id": row['chembl_id'], "fingerprints": maccs_fingerprints, "smiles": row['smiles']}
    )

# Create a DataFrame from the list of MACCS fingerprints
candidates = pd.DataFrame.from_records(maccs_fingerprints_list)
candidates

In [None]:
X = np.array(candidates['fingerprints'].to_list())

In [None]:
candidates['Predicted_Severity'] = random_forest_model.predict(X)
candidates

In [None]:
# Create a histogram
plt.hist(candidates['Predicted_Severity'], bins=30, edgecolor='black', alpha=0.7)

# Add labels and title
plt.xlabel('Predicted Severity')
plt.ylabel('Number of Compounds')
plt.title('Distribution of Predicted Severity')

# Show the histogram
plt.show()
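The filtering itself amounts to keeping candidates whose predicted severity falls below a chosen cutoff. A minimal sketch follows on a made-up table: the identifiers, the predicted values, and the cutoff of 3.0 are all hypothetical, and a suitable threshold depends on the project.

```python
import pandas as pd

toy_candidates = pd.DataFrame({
    "chembl_id": ["ID_A", "ID_B", "ID_C", "ID_D"],  # made-up identifiers
    "Predicted_Severity": [1.2, 4.8, 2.9, 6.1],     # made-up predictions
})
severity_cutoff = 3.0  # hypothetical threshold

# Keep low-severity candidates, least toxic first
low_risk = (
    toy_candidates[toy_candidates["Predicted_Severity"] < severity_cutoff]
    .sort_values("Predicted_Severity")
    .reset_index(drop=True)
)
print(low_risk)
```

The same boolean mask could be applied to the `candidates` DataFrame once `Predicted_Severity` has been filled in.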

**Exercise**

Repeat the exercise with a different model:

- [Neural Networks](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor)
- [Kernel Ridge Regression](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge)
- [Nearest Neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)

Send a figure with the final result (best prediction and hyperparameter validation).