In [None]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

<div class="alert alert-block alert-success">

# EPFL Course: CH-630 Drug Discovery

## Doctoral School EDCH

## Week 4: Exercises
</div>

 <h1 style="color:green;"> Lesson 4.5: Exercises on Ligand-Based Methods </h1>

Ligand-based methods use information about known ligands to make predictions about new compounds. Instead of relying on the protein structure, these approaches analyze the chemical features, patterns and relationships between the molecules themselves (their shapes, fingerprints, physicochemical descriptors and measured activities) to infer what makes a compound active or inactive.

In this lesson we will explore:

1. Basic cheminformatics with RDKit and molecular descriptors 
2. PCA for dimensionality reduction 
3. Simple machine learning regression (Random Forest) to predict a molecular property<br><br> 

> 💡 This lesson will help you understand how ligand-based computational methods work, from computing molecular descriptors to exploring chemical space and building simple predictive models.

<h2 style="color:green;"> Step 1: Basic cheminformatics with RDKit </h2>

In this first step, we will use RDKit, one of the most widely used open-source cheminformatics libraries. RDKit provides a complete toolkit for representing, manipulating, and analyzing molecular structures directly in Python.

We will transform chemical structures, typically stored as SMILES strings or SDF files, into meaningful mathematical objects. This allows us to compute molecular fingerprints and descriptors, evaluate similarity between molecules, and prepare inputs for machine learning models.
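
To illustrate what evaluating similarity can mean in practice, here is a minimal sketch of the Tanimoto coefficient, one common similarity measure for binary fingerprints. The fingerprints below are hypothetical sets of on-bit indices made up for this example, and the helper name `tanimoto` is an assumption; in a real workflow the fingerprints would be computed with RDKit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # two empty fingerprints: define the similarity as 0 by convention
        return 0.0
    return len(a & b) / len(union)

# hypothetical toy fingerprints (not computed from real molecules)
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8, 9}
fp_3 = {2, 3}

sim_12 = tanimoto(fp_1, fp_2)   # 3 shared bits out of 5 distinct bits
sim_13 = tanimoto(fp_1, fp_3)   # no shared bits
print(sim_12, sim_13)
```

The value ranges from 0 (no bits in common) to 1 (identical fingerprints).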

<h3 style="color:green;"> Step 1.1: Load a small dataset </h3>

We will use a small subset of the ESOL dataset, which contains experimental water solubility values expressed on a base-10 logarithmic scale (logS).

Higher logS values correspond to greater water solubility, whereas lower (more negative) logS values indicate poor water solubility. Low solubility can lead to poor dissolution and may limit the bioavailability of a compound.
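
To make the scale concrete, the sketch below converts logS values back to concentrations. It assumes that logS is the base-10 logarithm of the solubility expressed in mol/L, as in the ESOL dataset; the helper name `logs_to_molar` is hypothetical, and the two logS values are taken from the subset used in this lesson.

```python
def logs_to_molar(log_s):
    """Convert logS (assumed to be log10 of the solubility in mol/L) to mol/L."""
    return 10 ** log_s

# two compounds from the ESOL subset used in this lesson
for name, log_s in [("Ethanol", 1.1), ("1-Pentene", -2.68)]:
    print(f"{name}: logS = {log_s:+.2f} -> {logs_to_molar(log_s):.4f} mol/L")

# a difference of one logS unit is a tenfold difference in solubility
ratio = logs_to_molar(-1.0) / logs_to_molar(-2.0)
print("ratio for one logS unit:", round(ratio, 6))
```

Because the scale is logarithmic, the roughly four logS units separating the most and least soluble compounds in this subset correspond to about four orders of magnitude in concentration.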

Reference: *Delaney, 2004 https://pubs.acs.org/doi/10.1021/ci034243x.*

This gives us a realistic property to predict later with machine learning.

In [None]:
# compound and real experimental solubility values (logS) from the ESOL dataset
data = {"Compound ID": ["1-Butene", "Ethanol", "Butane", "Butanethiol", "Benzene", "Pyridazine", "Pyridine", "Pyrimidine",
                        "2-Iodopropane", "Dipropyl ether", "1,2-Dichloroethane", "1-Pentene", "2-Hydroxypyridine", "Acetamide",
                        "Fluorobenzene", "Anisole", "Bromochloromethane", "Diethyl ether", "Ethane", "2-pyrrolidone"],
        "SMILES": ["CCC=C", "CCO", "CCCC", "CCCCS", "c1ccccc1", "c1ccnnc1", "c1ccncc1", "c1cncnc1", "CC(C)I", "CCCOCCC",
                         "ClCCCl", "CCCC=C", "Oc1ccccn1", "CC(=O)N", "Fc1ccccc1", "COc1ccccc1", "ClCBr", "CCOCC", "CC", "O=C1CCCN1"],
        "logS_measured": [-1.94, 1.1, -2.57, -2.18, -1.64, 1.1, 0.76, 1.1, -2.09, -1.62, -1.06, -2.68, 1.02, 1.58, -1.8, -1.85, -0.89, -0.09, -1.36, 1.07]}

First, we will display the data as a table (a `DataFrame`) using the `pandas` library.

In [None]:
df = pd.DataFrame(data)
df

<h3 style="color:green;"> Step 1.2: Convert molecules and compute descriptors </h3>

Once we have a list of molecules represented as SMILES strings, we can use RDKit to convert each SMILES into an internal molecular object and compute a variety of basic physicochemical descriptors.

These descriptors summarize important molecular properties and are widely used in cheminformatics and QSAR studies. In this exercise, we focus on a small set of simple but informative descriptors:

- Molecular Weight (MW) – total mass of the molecule

- LogP – predicted octanol/water partition coefficient (a measure of hydrophobicity)

- HBD (Hydrogen Bond Donors) – number of groups capable of donating H-bonds

- HBA (Hydrogen Bond Acceptors) – number of groups capable of accepting H-bonds

- TPSA (Topological Polar Surface Area) – a measure related to polarity and permeability


In [None]:
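# function to compute molecular descriptors from a SMILES string
def compute_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None when the SMILES string cannot be parsed
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        'MW': Descriptors.MolWt(mol),           # molecular weight
        'LogP': Descriptors.MolLogP(mol),       # predicted octanol/water logP
        'HBD': Descriptors.NumHDonors(mol),     # number of H-bond donors
        'HBA': Descriptors.NumHAcceptors(mol),  # number of H-bond acceptors
        'TPSA': Descriptors.TPSA(mol)           # topological polar surface area
    }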

In [None]:
# compute descriptors for each compound
X = df['SMILES'].apply(compute_descriptors).apply(pd.Series)
X

We obtained the matrix $\rm X$, which is a clean numeric feature matrix:

- Each row = one molecule

- Each column = one descriptor (MW, LogP, HBD, HBA, TPSA)

This is the format required for data visualization (PCA) or machine learning (Random Forest).

<h2 style="color:green;"> Step 2: PCA to explore chemical space </h2>

Each molecule in the dataset is described by several numerical descriptors: molecular weight, logP, H-bond donors, H-bond acceptors and TPSA.

In real drug-discovery projects, the number of descriptors can easily reach hundreds to thousands (fingerprints, structural fragments, physicochemical descriptors, 3D descriptors, etc.).

This means each molecule corresponds to a point in a high-dimensional space: 5 dimensions in our simple case, but hundreds to thousands in real drug discovery projects.

As humans cannot visualize high-dimensional data, we need a way to reduce the dimensionality while keeping most of the information.<br><br>

**Principal Component Analysis (PCA)** is a linear dimensionality reduction technique that allows us to visualize and explore data.
With PCA, the dataset is transformed so that the directions capturing the largest variation in the data are easily identifiable.

The steps are:
- Identify directions in the dataset with maximum variance

- Create new axes called principal components (PC1, PC2, …)

- Order them such that: PC1 captures the largest variance, PC2 captures the next largest, and so on.

By keeping only the first two principal components (PC1 and PC2), we can plot molecules in 2D, providing a map of chemical space that highlights similarities, differences, and trends in molecular properties.<br><br>
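
The steps listed above can be sketched by hand with NumPy. This is a minimal illustration on a hypothetical random matrix (`X_toy`, 6 "molecules" by 3 "descriptors"), not the descriptor matrix of this lesson, and it is not meant to reproduce how scikit-learn implements PCA internally.

```python
import numpy as np

# hypothetical toy descriptor matrix: 6 molecules x 3 descriptors
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 3)) * np.array([10.0, 1.0, 0.1])

# 1. center each descriptor (PCA works on mean-centered data)
X_centered = X_toy - X_toy.mean(axis=0)

# 2. covariance matrix of the descriptors and its eigen-decomposition
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order

# 3. order the components by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. project the data onto the first two principal components
scores = X_centered @ eigvecs[:, :2]

print("variance along PC1, PC2:", eigvals[:2])
print("fraction of total variance kept:", eigvals[:2].sum() / eigvals.sum())
```

The eigenvectors define the new axes, and each eigenvalue is the variance of the data along the corresponding axis.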

Consequently, PCA plots are often used in drug discovery for:

- spotting chemical diversity

- identifying outliers

- comparing different chemical series

- checking if a dataset is well-balanced

In [None]:
# perform PCA on the descriptor matrix, reducing the descriptor space to 2 dimensions
pca = PCA(n_components=2)

# the high-dimensional data is projected onto these 2 principal components
X_pca = pca.fit_transform(X)
print(f"PC1: {X_pca[:,0]}")
print(f"PC2: {X_pca[:,1]}")

With PCA, we compressed the high-dimensional descriptor space of our dataset into 2 dimensions while keeping the largest possible amount of information (i.e., variance).
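
One practical caveat: PCA is sensitive to the scale of the descriptors, so a descriptor with large numerical values (MW, for instance) can dominate the first component. A common option, which is not used in the cell above, is to standardize each descriptor before running PCA. The sketch below uses a hypothetical two-column toy matrix to show the effect; `StandardScaler` is scikit-learn's standardization class.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: column 0 mimics a large-scale descriptor (like MW),
# column 1 a small-scale one (like HBD)
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(100, 30, size=50),
                         rng.normal(1, 0.5, size=50)])

# PCA on the raw values: PC1 is almost entirely the large-scale column
pca_raw = PCA(n_components=2).fit(X_toy)

# PCA on standardized values: each column has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X_toy)
pca_scaled = PCA(n_components=2).fit(X_scaled)

print("raw   :", pca_raw.explained_variance_ratio_)
print("scaled:", pca_scaled.explained_variance_ratio_)
```

Whether to standardize is a modelling choice: it gives every descriptor the same weight, which may or may not be what you want for a given dataset.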

After reducing the descriptors to two principal components (PC1 and PC2), we can create a 2D scatter plot:

- Each point represents one molecule

- Proximity between points may indicate chemical similarity based on the descriptors used

    -   Molecules that are close together may have similar molecular weight, logP, H-bond counts, TPSA

    -   Molecules far apart may differ significantly in these properties

- Molecules are color-coded by a property (logS in our case, but pKa, logP, pIC50 are also common). This allows us to see patterns and trends:

    -   A gradient in color might reveal how the property changes with molecular structure

    -   Clusters of molecules with similar properties may be identified


In [None]:
# the data can now be plotted in a 2D scatter plot
plt.scatter(X_pca[:,0], X_pca[:,1], c=df['logS_measured'], cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of molecular descriptors')
plt.colorbar(label='logS')
plt.show()

In [None]:
# Variance captured by each principal component

var_pc1 = np.var(X_pca[:, 0], ddof=1)
var_pc2 = np.var(X_pca[:, 1], ddof=1)

print(var_pc1, var_pc2)

<div class="alert alert-block alert-info">

1. By construction, PC1 captures the largest share of the variance in the dataset, so the points should be spread over a wider range along PC1 than along PC2. How are your data spread along PC1 and PC2?

2. Are some points isolated? If so, their descriptor values are very different from the rest and they could be outliers.

3. The color bar shows logS (solubility) values. If the descriptors are related to solubility, molecules with similar colors should tend to lie close together in PC space. Are the most soluble molecules well separated in the plot? Do you see a structure-property (logS) trend?

</div>

In conclusion, PCA helps in:

- Visualizing chemical diversity in a dataset

- Exploring structure–property relationships

- Detecting potential outliers or artifacts

- Preparing data for further analysis, like machine learning regression <br><br>

However, it is important to remember that PCA does not create clusters by itself; it only allows us to explore the data visually. Even when the first two components capture most of the variability, they do not capture all of it, and some information is inevitably lost.
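
How much information is kept or lost can be quantified from a fitted scikit-learn `PCA` object through its `explained_variance_ratio_` attribute. The sketch below uses hypothetical random data (`X_toy`) rather than the descriptor matrix of this lesson, and also checks that `explained_variance_` matches the sample variance of the projected coordinates.

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical toy data: 30 samples x 5 features
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(30, 5))

pca_toy = PCA(n_components=2)
scores = pca_toy.fit_transform(X_toy)

# variance of the projected data along each PC (sample variance, ddof=1)
var_scores = np.var(scores, axis=0, ddof=1)

# fraction of the total variance captured by PC1 and PC2
ratio = pca_toy.explained_variance_ratio_
print("explained variance       :", pca_toy.explained_variance_)
print("variance of the scores   :", var_scores)
print("explained variance ratio :", ratio)
print("fraction lost in 2D      :", 1 - ratio.sum())
```

Reporting the explained variance ratio alongside a PCA plot tells the reader how faithful the 2D map is to the full descriptor space.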

It is also important to note that the PCs do not usually have a direct physicochemical meaning; each one is a linear combination of the original descriptors.

<h2 style="color:green;"> Step 3: Simple Machine Learning Regression </h2>

Once we have represented molecules with numerical descriptors and explored the chemical space using PCA, we can try to predict a molecular property from these descriptors.

**Machine learning regression** allows us to learn patterns between molecular features (like MW, LogP, H-bond counts, TPSA) and experimental properties (like solubility, logS). In this exercise, we will use a **Random Forest Regressor**, a robust and widely used ensemble method that combines multiple decision trees to improve prediction accuracy and reduce overfitting.

The steps are:

- Split the dataset into training and test sets;

- Train the Random Forest model on the training data;

- Predict the property for the test set and compare with true values;

- Assess model performance using metrics like Mean Squared Error (MSE);

- Inspect feature importance to understand which descriptors contribute most to the predictions.

<h3 style="color:green;"> Step 3.1: Split the dataset into training and test sets </h3>

- Before training any machine learning model, we need to separate the data we will use to learn patterns (training set) from the data we will use to evaluate the model (test set). If we evaluate the model on the same data it was trained on, it might appear to perform perfectly, but this can be misleading because the model could just memorize the training examples rather than learning general rules.

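- Testing the model on data that was not seen during training does not prevent overfitting, but it reveals it: a large gap between training and test performance indicates that the model memorized examples instead of learning general patterns.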

- Typically, 70–80% of the data is used for training and 20–30% for testing.
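
Conceptually, a hold-out split amounts to shuffling the row indices and setting a fraction aside. The sketch below does this by hand with NumPy for a hypothetical set of 20 samples and a 30% test fraction; it illustrates the idea only and is not how `train_test_split` is implemented internally.

```python
import numpy as np

n_samples = 20          # same size as the dataset in this lesson
test_fraction = 0.30

# shuffle the row indices reproducibly
rng = np.random.default_rng(0)
indices = rng.permutation(n_samples)

# the first 30% of the shuffled indices form the test set, the rest the training set
n_test = int(round(test_fraction * n_samples))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print("train size:", len(train_idx), "test size:", len(test_idx))
```

With 20 molecules this leaves 14 for training and 6 for testing, so the evaluation rests on very few points and will vary noticeably with the random seed.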

In [None]:
# target of the model = logS
y = df['logS_measured']

# train/test split (70% train and 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

<h3 style="color:green;"> Step 3.2: Train the Random Forest model on the training data </h3>

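- The Random Forest algorithm builds an ensemble of decision trees. Each tree is trained on a bootstrap sample (a random sample with replacement) of the training data and, depending on the settings, considers only a random subset of the descriptors at each split.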

- By averaging the predictions of many trees, Random Forest reduces errors and is less sensitive to noise or outliers.

- During training, the model learns relationships between descriptors (MW, LogP, HBD, HBA, TPSA) and the target property (logS).
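
The averaging step can be checked directly: a fitted scikit-learn `RandomForestRegressor` exposes its individual trees in the `estimators_` attribute. The sketch below uses hypothetical toy data (`X_toy`, `y_toy`) and compares the mean of the per-tree predictions with the prediction of the forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# hypothetical toy regression data: 40 samples x 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=40)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_toy, y_toy)

# predictions of every individual decision tree, shape (n_trees, n_samples)
tree_preds = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])

# the forest prediction is the mean over the trees
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_toy[:5])

print("per-tree spread (std):", tree_preds.std(axis=0))
print("average matches forest:", np.allclose(manual_average, forest_pred))
```

The spread of the per-tree predictions also gives a rough, informal sense of how much the trees disagree on a given molecule.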

In [None]:
# train model
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

<h3 style="color:green;"> Step 3.3: Predict the property for the test set and compare with true values. </h3>
Once trained, the model is used to predict logS for molecules it has not seen before (the test set).

Plotting the predictions against the experimental values gives a first visual check of how well the model generalizes.

In [None]:
# predictions
y_pred = model.predict(X_test)

In [None]:
# plot predicted vs true
plt.scatter(y_test, y_pred, color='blue')
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--')  # diagonal line
plt.xlabel('True logS')
plt.ylabel('Predicted logS')
plt.title('Random Forest Regression')
plt.show()

<h3 style="color:green;"> Step 3.4: Assess model performance </h3>

- Comparing predictions with experimental values allows us to check how well the model generalizes.

- The **mean squared error (MSE)** measures the average squared difference between predicted and true values.

- Lower MSE means the model's predictions are closer to experimental values.

- The **coefficient of determination (R²)** quantifies the fraction of variance in the experimental data explained by the model.

- An R² value close to 1 indicates strong predictive performance, while values near 0 (or negative) indicate poor explanatory power.
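
Both metrics are simple to compute by hand, which helps in interpreting them. The sketch below uses hypothetical true and predicted values chosen only for illustration; `y_true` and `y_hat` are not results from this lesson.

```python
import numpy as np

# hypothetical true and predicted values, for illustration only
y_true = np.array([1.0, 0.0, -1.0, -2.0])
y_hat = np.array([0.5, 0.0, -1.5, -2.0])

# MSE: mean of the squared residuals
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual)
print("R^2:", r2_manual)
```

For these numbers the MSE is 0.125 and R² is 0.9. A model that always predicts the mean of `y_true` would have R² = 0, which is why negative values mean the model does worse than that trivial baseline.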

In [None]:
# evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('MSE:', mse)
print('R^2:', r2)

<h3 style="color:green;"> Step 3.5: Inspect feature importance </h3>

Random Forest allows us to evaluate how much each descriptor contributes to the predictions.

This can provide chemical insight: for example, if LogP or TPSA is highly important, it suggests solubility is strongly influenced by hydrophobicity or polarity.

Feature importance is a useful way to interpret "black-box" machine learning models in a chemistry context.
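
As a sanity check on how to read these numbers, the sketch below fits a forest on hypothetical toy data in which the target depends on only one feature. The feature names are made up for the example. In scikit-learn, `feature_importances_` are impurity-based and normalized to sum to 1, so they are relative rather than absolute measures.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# hypothetical toy data: the target depends only on the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
names = ["informative", "noise_1", "noise_2"]   # hypothetical feature names

# print the features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{names[idx]}: {importances[idx]:.2f}")
```

On a dataset as small as the one in this lesson, importances can change noticeably with the random seed or the train/test split, so they are best read as hints rather than firm conclusions.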

In [None]:
# feature importance
importances = model.feature_importances_
for desc, imp in zip(X.columns, importances):
    print(f"{desc}: {imp:.2f}")

1. How well does the model generalize to unseen molecules? Are the predictions for the test set close to the true values?

2. Which descriptors are the most important for predicting solubility (logS) with this model?

3. What chemical insight can you derive from the feature importance? For example, why might LogP or TPSA be particularly relevant?

4. Are there any molecules where the prediction is particularly inaccurate? Can you explain why based on their descriptors?

<h2 style="color:orange;"> Exercise </h2>

- Try adding additional molecular properties or descriptors (e.g., number of rotatable bonds, aromatic proportion) and see how the model performance and feature importance change.

- Experiment with different Random Forest hyperparameters (e.g., number of trees, maximum depth) and observe the effect on prediction accuracy.
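
One possible starting point for the second exercise is a small loop over hyperparameter values, scored with cross-validation instead of a single split. The sketch below is an assumption about how one might organize such an experiment, not a prescribed solution: it uses hypothetical random data (`X_toy`, `y_toy`) as a stand-in for the descriptor matrix and logS values, and the grid of values is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# hypothetical stand-in for the descriptor matrix and the logS values
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
y_toy = X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.2, size=60)

results = {}
for n_trees in (10, 50, 200):
    for depth in (2, None):          # None lets the trees grow fully
        candidate = RandomForestRegressor(n_estimators=n_trees, max_depth=depth,
                                          random_state=0)
        # 5-fold cross-validated R^2 (the default score for regressors)
        scores = cross_val_score(candidate, X_toy, y_toy, cv=5)
        results[(n_trees, depth)] = scores.mean()
        print(f"n_estimators={n_trees:>3}, max_depth={depth}: mean R^2 = {scores.mean():.2f}")
```

To try it on the lesson's data, replace `X_toy` and `y_toy` with `X` and `y`; with only 20 molecules the scores will be noisy, so small differences between settings should not be over-interpreted.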