# AI-PSCI-008: Compound Clustering & Chemical Space Visualization

**AI in Pharmaceutical Sciences: Bench to Bedside**  
VCU School of Pharmacy | VIP Program | Spring 2026

---

**Week 4 | Module: AI in Drug Discovery | Estimated Time: 60-90 minutes**

**Prerequisites**: AI-PSCI-001 through AI-PSCI-007

---

**🎯 This is the FINAL target-agnostic talktorial!** In Week 5 (AI-PSCI-009), you will select your drug target and apply all these skills to your chosen system.

## 🎯 Learning Objectives

After completing this talktorial, you will be able to:

1. Cluster molecules by structural similarity using hierarchical and k-means clustering
2. Apply dimensionality reduction techniques (PCA, t-SNE, UMAP) to fingerprints
3. Create publication-quality chemical space visualizations
4. Interpret clustering results to identify structural families
5. Select diverse compound subsets for screening

---

## 📚 Background

### What is Chemical Space?

**Chemical space** is the theoretical space encompassing all possible molecules. A commonly cited estimate puts the number of drug-like molecules at around 10⁶⁰, vastly more than the number of stars in the observable universe and far more than could ever be synthesized or tested. Understanding and navigating chemical space is essential for drug discovery.

Visualizing chemical space helps us:
- **Identify clusters**: Groups of structurally similar compounds
- **Assess diversity**: How well does our library cover chemical space?
- **Spot outliers**: Unusual compounds that might be interesting or problematic
- **Guide optimization**: Where are active compounds located?

### Clustering Methods

**1. Hierarchical Clustering**
- Builds a tree (dendrogram) of compounds
- Can use different linkage methods (ward, complete, average)
- Good for visualizing relationships
- Doesn't require specifying the number of clusters upfront

**2. K-Means Clustering**
- Partitions data into k clusters
- Fast and scalable
- Requires specifying k in advance
- Works well with spherical clusters

### Dimensionality Reduction

Molecular fingerprints typically have hundreds to thousands of dimensions (2048 bits for the Morgan fingerprints used here). We need to reduce this to 2 or 3 dimensions for visualization:

**1. PCA (Principal Component Analysis)**
- Linear method, preserves global structure
- Fast, reproducible
- May miss non-linear relationships

**2. t-SNE (t-Distributed Stochastic Neighbor Embedding)**
- Non-linear, preserves local structure
- Great for revealing clusters
- Results vary with random seed

**3. UMAP (Uniform Manifold Approximation and Projection)**
- Non-linear, preserves local structure and often retains more global structure than t-SNE
- Usually faster than t-SNE and scales better to large datasets
- Widely used for chemical space visualization

### Key Concepts

- **Dendrogram**: Tree diagram showing hierarchical clustering
- **Silhouette score**: Measure of clustering quality (-1 to 1, higher is better)
- **Perplexity**: t-SNE parameter controlling neighborhood size
- **Scaffold**: Core chemical structure shared by a cluster

---

## 🛠️ Setup

Run this cell to install required packages:

In [None]:
#@title üõ†Ô∏è Install Packages
!pip install rdkit umap-learn -q
print("‚úÖ Packages installed successfully!")

Import the required libraries:

In [None]:
#@title üì¶ Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem
from rdkit.Chem import Draw, AllChem, Descriptors, MACCSkeys
from rdkit import DataStructs
from rdkit.Chem import rdFingerprintGenerator
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist, squareform
import umap
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', 10)
plt.rcParams['figure.figsize'] = [10, 8]
plt.rcParams['figure.dpi'] = 100

print("✅ All libraries imported!")

---

## 🔬 Guided Inquiry 1: Preparing Compounds for Clustering

### Context

Before we can cluster molecules, we need a dataset with fingerprints. Let's create a diverse set of approved drugs from different therapeutic classes to demonstrate clustering.

### Your Task

Using your AI assistant, write code to:

1. Create a dataset of 20+ approved drugs from various therapeutic classes:
   - Antibiotics (trimethoprim, amoxicillin, ciprofloxacin)
   - NSAIDs (aspirin, ibuprofen, naproxen)
   - Statins (atorvastatin, simvastatin, rosuvastatin)
   - Antihypertensives (lisinopril, amlodipine, losartan)
   - Antidepressants (fluoxetine, sertraline, escitalopram)
   - Others (metformin, omeprazole, acetaminophen)

2. Generate Morgan fingerprints (radius=2, 2048 bits) for each drug

3. Convert fingerprints to a numpy array for clustering

4. Display basic statistics about the dataset

💡 **Prompting Tips**:
- Ask: "What are the SMILES for common approved drugs?"
- Ask: "How do I convert RDKit fingerprints to numpy arrays?"
- Use `np.array([list(fp) for fp in fps])` to convert fingerprints
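
A minimal sketch of steps 2 and 3 on a three-drug toy set, not the full 20+ drug dataset the task asks for. The SMILES strings and the names `toy_drugs`, `generator`, and `fp_array` are illustrative assumptions; check any SMILES you use against a trusted source such as ChEMBL or PubChem.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Hypothetical toy set: name -> (SMILES, therapeutic class)
toy_drugs = {
    "aspirin": ("CC(=O)OC1=CC=CC=C1C(=O)O", "NSAID"),
    "ibuprofen": ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "NSAID"),
    "acetaminophen": ("CC(=O)NC1=CC=C(O)C=C1", "Other"),
}

# Morgan fingerprint generator with the settings named in the task
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

names, classes, fps = [], [], []
for name, (smiles, drug_class) in toy_drugs.items():
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Skip anything RDKit cannot parse instead of failing later
        print(f"Could not parse SMILES for {name}")
        continue
    names.append(name)
    classes.append(drug_class)
    fps.append(generator.GetFingerprint(mol))

# One row per drug, one column per fingerprint bit (values are 0 or 1)
fp_array = np.array([list(fp) for fp in fps], dtype=np.uint8)

print("Array shape:", fp_array.shape)
print("Bits set per drug:", dict(zip(names, fp_array.sum(axis=1).tolist())))
```

Keeping `names` and `classes` in the same order as the rows of `fp_array` makes it straightforward to label and color points in the later inquiries.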

### Verification

After running your code, confirm:
- [ ] Dataset has 20+ drugs with valid SMILES
- [ ] Fingerprint array shape is (n_drugs, 2048)
- [ ] Drugs span multiple therapeutic classes
- [ ] Class labels are assigned for later coloring

📓 **Lab Notebook**: Record the drug classes represented. Why might drugs in the same class cluster together?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 2: Hierarchical Clustering with Dendrograms

### Context

Hierarchical clustering builds a tree structure showing relationships between compounds. The **dendrogram** visualization helps us understand which molecules are most similar and identify natural groupings.

### Your Task

Using your AI assistant, write code to:

1. Calculate pairwise Tanimoto distances between all molecules

2. Perform hierarchical clustering using Ward's method

3. Create a dendrogram with drug names as labels

4. Color the dendrogram by therapeutic class

💡 **Prompting Tips**:
- Ask: "How do I create a distance matrix for hierarchical clustering?"
- Tanimoto distance = 1 - Tanimoto similarity
- `scipy.cluster.hierarchy.linkage()` performs the clustering
- `scipy.cluster.hierarchy.dendrogram()` creates the visualization
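
The distance and linkage steps can be sketched on hand-made bit vectors. The four 8-bit "fingerprints" and the names `toy_fps` and `toy_names` are hypothetical stand-ins for your real fingerprint array. For boolean vectors, SciPy's Jaccard distance is the same quantity as 1 - Tanimoto similarity. One caveat: Ward's method formally assumes Euclidean distances, so applying it to Tanimoto distances is a heuristic; `method="average"` is a common alternative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: two "families" that share no bits with each other
toy_names = ["A1", "A2", "B1", "B2"]
toy_fps = np.array(
    [
        [1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 0],
    ],
    dtype=bool,
)

# Condensed vector of pairwise Tanimoto (Jaccard) distances
condensed = pdist(toy_fps, metric="jaccard")

# Square form is easier to read; A1 vs A2 share 3 of 4 set bits -> distance 0.25
tanimoto_dist = squareform(condensed)

# linkage() takes the condensed form, not the square matrix
Z = linkage(condensed, method="ward")

# Cut the tree into two flat clusters
toy_labels = fcluster(Z, t=2, criterion="maxclust")

print(np.round(tanimoto_dist, 2))
print(dict(zip(toy_names, toy_labels.tolist())))
```

Passing `Z` to `dendrogram(Z, labels=toy_names)` draws the tree; the height at which two branches merge reflects how dissimilar they are.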

### Verification

After running your code, confirm:
- [ ] Dendrogram displays all drug names
- [ ] NSAIDs cluster together (aspirin, ibuprofen, naproxen)
- [ ] Statins form a distinct cluster
- [ ] Height of linkage indicates dissimilarity

📓 **Lab Notebook**: Which drug classes form the tightest clusters? Which are more spread out?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 3: K-Means Clustering

### Context

K-means is a fast clustering algorithm that partitions data into k groups. Unlike hierarchical clustering, we need to specify the number of clusters beforehand. The **silhouette score** helps us choose the optimal k.

### Your Task

Using your AI assistant, write code to:

1. Test different numbers of clusters (k=2 to k=8)

2. Calculate silhouette score for each k

3. Plot silhouette scores to find optimal k

4. Perform k-means with optimal k and assign cluster labels

5. Display drugs in each cluster

💡 **Prompting Tips**:
- Ask: "How do I determine the optimal number of clusters for k-means?"
- Silhouette score: -1 (bad) to 1 (good)
- `sklearn.cluster.KMeans` for clustering
- `sklearn.metrics.silhouette_score` for evaluation
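
A minimal sketch of the silhouette scan, run on a synthetic matrix instead of real fingerprints. The data (three "families" of six 60-bit vectors with about 5% of bits flipped) and the names `toy_X`, `scores`, and `best_k` are assumptions made for illustration. K-means works in Euclidean space, which is only a rough proxy for Tanimoto similarity on binary fingerprints.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a fingerprint matrix (hypothetical data):
# each of 3 families switches on its own 20-bit block.
base = np.repeat(np.eye(3), 20, axis=1)   # shape (3, 60)
toy_X = np.repeat(base, 6, axis=0)        # shape (18, 60), 6 members per family

# Flip roughly 5% of bits so family members are similar but not identical
flips = rng.random(toy_X.shape) < 0.05
toy_X = np.where(flips, 1.0 - toy_X, toy_X)

# Score every candidate k from 2 to 8
scores = {}
for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, cluster_labels)

# The k with the highest mean silhouette is the candidate "optimal" k
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k:", best_k)
```

On this toy matrix the score should peak at the three families that were built in. Real drug fingerprints are usually far less clean, so expect a flatter curve and lower scores.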

### Verification

After running your code, confirm:
- [ ] Silhouette plot shows a clear peak
- [ ] Drugs in same cluster share structural features
- [ ] Drug class often correlates with cluster assignment
- [ ] Each cluster has interpretable members

📓 **Lab Notebook**: What is the optimal number of clusters? Do the clusters make pharmaceutical sense?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 4: Dimensionality Reduction with PCA

### Context

2048-dimensional fingerprints can't be visualized directly. **PCA (Principal Component Analysis)** is a linear method that finds the directions of maximum variance in the data, reducing it to 2D for visualization.

### Your Task

Using your AI assistant, write code to:

1. Apply PCA to reduce fingerprints from 2048D to 2D

2. Calculate explained variance for each component

3. Create a scatter plot colored by drug class

4. Add drug names as labels

💡 **Prompting Tips**:
- Ask: "How do I use PCA for dimensionality reduction in sklearn?"
- `explained_variance_ratio_` shows how much variance each PC captures
- Use different markers or colors for drug classes
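
The PCA step might look like the sketch below, again on synthetic data. The three-family toy matrix and the names `toy_X`, `toy_classes`, and `coords` are assumptions; substitute your own fingerprint array and class labels.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a fingerprint matrix (hypothetical data):
# 3 families x 6 members, 60 bits, about 5% of bits flipped as noise.
base = np.repeat(np.eye(3), 20, axis=1)
toy_X = np.repeat(base, 6, axis=0)
flips = rng.random(toy_X.shape) < 0.05
toy_X = np.where(flips, 1.0 - toy_X, toy_X)
toy_classes = np.repeat(["family A", "family B", "family C"], 6)

# Project from 60 dimensions down to 2
pca = PCA(n_components=2)
coords = pca.fit_transform(toy_X)

# Fraction of total variance captured by each principal component
explained = pca.explained_variance_ratio_

print("Projected shape:", coords.shape)
print(f"PC1: {explained[0]:.1%}, PC2: {explained[1]:.1%}")
print(f"Total for 2 PCs: {explained.sum():.1%}")
```

To plot, loop over the unique values in `toy_classes` and call `plt.scatter(coords[mask, 0], coords[mask, 1], label=cls)` for each class mask, then put the explained-variance percentages in the axis labels. With real 2048-bit fingerprints, the first two components often capture a much smaller share of the variance than on this toy data.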

### Verification

After running your code, confirm:
- [ ] Scatter plot shows clear separation of some drug classes
- [ ] First 2 PCs capture reasonable variance (check `explained_variance_ratio_`)
- [ ] Similar drugs appear close together
- [ ] Labels are readable

📓 **Lab Notebook**: How much variance do the first 2 PCs explain? What does this tell you about molecular fingerprint data?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 5: t-SNE for Non-Linear Visualization

### Context

**t-SNE** is a non-linear dimensionality reduction method that preserves local structure. It's excellent at revealing clusters but the results depend on the **perplexity** parameter (roughly, the expected number of neighbors).

### Your Task

Using your AI assistant, write code to:

1. Apply t-SNE with perplexity=5 (good for small datasets)

2. Create a scatter plot colored by drug class

3. Compare t-SNE visualization to PCA

4. Try different perplexity values (3, 5, 10) and observe effects

💡 **Prompting Tips**:
- Ask: "What perplexity value should I use for t-SNE?"
- Rule of thumb: perplexity ≈ sqrt(n_samples); it must always be smaller than n_samples
- For small datasets (<50), use perplexity 3-10
- t-SNE is stochastic, so set `random_state` for reproducibility
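
A minimal sketch of the perplexity comparison, using a synthetic 18-sample matrix so that every tested perplexity stays below the sample count. The data and the names `toy_X` and `embeddings` are assumptions; `init="pca"` is one reasonable choice here, not a requirement.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for a fingerprint matrix (hypothetical data):
# 3 families x 6 members, 60 bits, about 5% of bits flipped as noise.
base = np.repeat(np.eye(3), 20, axis=1)
toy_X = np.repeat(base, 6, axis=0)
flips = rng.random(toy_X.shape) < 0.05
toy_X = np.where(flips, 1.0 - toy_X, toy_X)

# One embedding per perplexity; each value must be < n_samples (18 here)
embeddings = {}
for perplexity in (3, 5, 10):
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",        # deterministic starting layout
        random_state=42,   # fixes the stochastic optimization
    )
    embeddings[perplexity] = tsne.fit_transform(toy_X)

for perplexity, emb in embeddings.items():
    print(f"perplexity={perplexity}: embedding shape {emb.shape}")
```

Plotting the three embeddings side by side (for example with `plt.subplots(1, 3)`) makes the effect of perplexity easier to see. Distances between clusters and cluster sizes in a t-SNE plot are not directly meaningful, so compare neighborhoods, not absolute positions.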

### Verification

After running your code, confirm:
- [ ] Clusters are more visually separated than PCA
- [ ] Similar drugs are close together
- [ ] Different perplexities give different views
- [ ] Results are reproducible with fixed random_state

📓 **Lab Notebook**: How does t-SNE compare to PCA? Which reveals drug class structure better?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 6: UMAP for Chemical Space

### Context

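**UMAP** (Uniform Manifold Approximation and Projection) is a widely used non-linear method for chemical space visualization. It is typically faster than t-SNE and often retains more of the global structure, although distances between well-separated clusters should still be interpreted with caution.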

### Your Task

Using your AI assistant, write code to:

1. Apply UMAP with n_neighbors=5 and min_dist=0.1

2. Create a publication-quality scatter plot

3. Add cluster labels from k-means

4. Create a final comparison figure: PCA vs t-SNE vs UMAP

💡 **Prompting Tips**:
- Ask: "What are good UMAP parameters for molecular data?"
- `n_neighbors` controls local vs global balance (like perplexity)
- `min_dist` controls how tightly points cluster
- Lower `min_dist` = tighter clusters

### Verification

After running your code, confirm:
- [ ] UMAP shows clear cluster structure
- [ ] Global relationships are preserved better than with t-SNE
- [ ] Drug classes form recognizable groups
- [ ] Comparison shows strengths of each method

📓 **Lab Notebook**: Which visualization method would you use for a publication? Why?

In [None]:
# Your code here



---

## ✅ Checkpoint

Before moving on to the next talktorial, confirm you can:

- [ ] Prepare molecular fingerprints for clustering
- [ ] Create and interpret dendrograms from hierarchical clustering
- [ ] Perform k-means clustering and select optimal k using silhouette scores
- [ ] Apply PCA, t-SNE, and UMAP for dimensionality reduction
- [ ] Create publication-quality chemical space visualizations
- [ ] Compare and choose appropriate methods for different purposes

### Your lab notebook should include:

- [ ] Dendrogram of the drug dataset
- [ ] Silhouette score plot for k-means
- [ ] Comparison figure: PCA vs t-SNE vs UMAP
- [ ] Notes on which drugs cluster together and why
- [ ] Reflection on method selection criteria

---

**🎉 Congratulations!** You have completed the Foundation Phase (Weeks 1-4). In Week 5, you will select your drug target and begin applying these skills to YOUR chosen system!

## 🤔 Reflection Questions

Answer these in your lab notebook:

1. **Method Selection**: You're preparing a figure for a journal submission showing chemical diversity in a screening library. Which visualization method would you choose and why?

2. **Clustering Interpretation**: Suppose some antibiotics cluster with non-antibiotics and some NSAIDs end up far apart. What does this tell you about the relationship between structure and function?

3. **Practical Application**: How could you use clustering to select a diverse subset of 10 compounds from a library of 1000 for experimental testing?

---

## 📖 Further Reading

- [UMAP Documentation](https://umap-learn.readthedocs.io/) - Official UMAP guide
- [t-SNE Explained](https://distill.pub/2016/misread-tsne/) - Interactive visualization of t-SNE behavior
- [TeachOpenCADD T005](https://projects.volkamerlab.org/teachopencadd/talktorials/T005_compound_clustering.html) - Compound clustering tutorial
- [McInnes et al. (2018)](https://arxiv.org/abs/1802.03426) - Original UMAP paper

---

## 🔗 Connection to Research

Compound clustering and chemical space visualization are essential for:

- **Library design**: Selecting diverse compounds for screening
- **Hit analysis**: Understanding structure-activity relationships
- **Lead optimization**: Navigating chemical space around a hit
- **Patent analysis**: Mapping competitive landscapes
- **Machine learning**: Feature visualization and model interpretation

### What's Next?

In **Week 5 (AI-PSCI-009)**, you will:
1. **Select your drug target** from the 6-target portfolio
2. Apply ChEMBL queries to YOUR target
3. Begin building your personalized drug discovery pipeline

All the skills you've learned (Python, RDKit, ChEMBL, fingerprints, filtering, and clustering) will now be applied to real pharmaceutical research!

---

*AI-PSCI-008 Complete. Foundation Phase Complete! Proceed to AI-PSCI-009: Protein Data Acquisition & Target Selection.*