# ChemCluster Project Report

**CH-200 – Practical Programming in Chemistry**  
**Group:** Elisa Rubbia, Romain Guichonnet, Flavia Zabala Perez  
**Date:** May 2025

This project was created as part of the CH-200 “Practical Programming in Chemistry” course at EPFL. It integrates cheminformatics tools such as RDKit, scikit-learn, and Streamlit to build a web-based app that is both accessible and extensible. Throughout this report, we demonstrate how ChemCluster can assist in typical tasks of early-stage drug discovery or structure–property exploration.


---
## Table of Contents

1. Setup and Initialization  
  1.1 Installation from PyPI  
  1.2 Running Locally from Source  

2. Dataset Mode – Analysis of Molecular Libraries  
  2.1 Data Input and Preprocessing  
  2.2 Descriptor Calculation and Fingerprint Generation  
  2.3 Dimensionality Reduction  
  2.4 Clustering with KMeans  
  2.5 Visualization and Molecule Inspection  
  2.6 Cluster Filtering by Properties  
  2.7 Export Functionality  

3. Single Molecule Mode – Conformer Generation and Visualization  
  3.1 Molecular Input  
  3.2 Conformer Generation  
  3.3 Conformer Clustering  
  3.4 Visualization of Centroids  
  3.5 Property Calculation for Centroids  

4. Results and Observations  
  4.1 Interface Overview  
  4.2 Dataset Mode – Clustering a Molecular Library  
  4.3 Single Molecule Mode – Conformer Clustering  

5. Underlying Algorithms and Tools  
  5.1 Principal Component Analysis (PCA)  
  5.2 KMeans Clustering  
  5.3 Tanimoto Similarity and Fingerprints  
  5.4 Butina Clustering  
  5.5 Interactive Visualization Tools  

6. Testing and Validation  
  6.1 Coverage and Scope  
  6.2 Example Test Snippet  

7. Challenges Faced  

8. Conclusion and Outlook


--- 

## Welcome to ChemCluster!

ChemCluster is a streamlined and interactive platform designed to facilitate the analysis and visualization of chemical datasets using molecular clustering techniques. It leverages the power of **RDKit** for cheminformatics and **Streamlit** for web interface deployment, offering an intuitive solution for researchers and students alike.

Two main modes of operation are supported by the platform:

- **Dataset Mode**: This mode allows users to upload datasets (in `.sdf`, `.mol` or `.csv` format) containing multiple molecular structures. The molecules are encoded as structural fingerprints and clustered by similarity, and key physicochemical descriptors are computed for each of them. The resulting clusters are visualized interactively, and users can explore individual structures along with their 2D/3D representations and calculated properties.
- **Single Molecule Mode**: In this workflow, ChemCluster generates a set of conformers for a single input molecule. These conformers are clustered based on structural similarity, and representative structures are visualized in 3D, enabling users to analyze conformational diversity effectively.


---

## 1. Setup and Initialization

To begin using ChemCluster, you have two options: installation via PyPI or running the application locally from source.

### 1.1 Installation from PyPI
The simplest way to install ChemCluster is through the Python Package Index. In a terminal or command prompt, run:

```bash
pip install chemcluster
```


Once installed, you can launch the application by executing:

```bash 
chemcluster
```


This command will open the ChemCluster interface directly in your default web browser.

### 1.2 Running Locally from Source (for Development or Contribution)
If you prefer to contribute to the project or wish to run it locally in a development environment, follow these steps:
1. Clone the repository from GitHub:

```bash
git clone https://github.com/erubbia/chemcluster
```

2. Navigate into the project directory:


```bash
cd chemcluster
```


3. Create a conda environment based on the project's environment file:

```bash 
conda env create -f environment.yml
```

4. Activate the newly created environment:

```bash 
conda activate chemcluster-env
```

5. Finally, install the project in editable mode:

```bash 
pip install -e .
```

After this setup, you can launch ChemCluster the same way by running `chemcluster` in the terminal.

---

## 2. Dataset Mode – Analysis of Molecular Libraries


The dataset mode enables users to upload and analyze molecular datasets containing multiple compounds. This mode is suitable for exploring the diversity of a chemical library, identifying representative clusters, and filtering molecules based on specific chemical properties.

The clustering workflow is based on molecular descriptors derived from circular fingerprints, and involves dimensionality reduction and unsupervised clustering. Molecules within clusters can be interactively explored and exported for further analysis.

### 2.1 Data Input and Preprocessing
Users can upload molecular datasets in `.csv`, `.sdf`, or `.mol` formats. For `.csv` files, the application searches for a column named `SMILES` or similar and passes its entries to the `clean_smiles_list()` function. This function converts each string into an RDKit `Mol` object and removes invalid or unreadable entries to ensure robust downstream processing.

```python
from chemcluster import clean_smiles_list

# smiles_column: the SMILES strings read from the uploaded file
mols, smiles_list = clean_smiles_list(smiles_column)
```
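
As a rough illustration of what such a cleaning step involves, the sketch below filters a list of SMILES strings with RDKit. The helper name `filter_valid_smiles` is hypothetical, and this is not the actual implementation of `clean_smiles_list()`; it only assumes that empty or unparsable entries should be dropped.

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse warnings for invalid input


def filter_valid_smiles(smiles_iterable):
    """Hypothetical helper: keep only the SMILES strings that RDKit can parse."""
    mols, kept = [], []
    for smi in smiles_iterable:
        if not isinstance(smi, str) or not smi.strip():
            continue  # skip empty cells and non-string values such as NaN
        mol = Chem.MolFromSmiles(smi.strip())
        if mol is None:
            continue  # RDKit returns None when parsing or sanitization fails
        mols.append(mol)
        kept.append(smi.strip())
    return mols, kept


mols, smiles_list = filter_valid_smiles(["CCO", "not_a_smiles", "", "c1ccccc1"])
print(smiles_list)
```

Returning the surviving SMILES alongside the `Mol` objects keeps both lists index-aligned, which is convenient when labels and properties are attached to molecules later on.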

### 2.2 Descriptor Calculation and Fingerprint Generation
Each valid molecule is encoded into a binary fingerprint vector using RDKit's implementation of Morgan fingerprints. This is achieved via the `get_fingerprint()` function, which returns a 2048-bit representation capturing the molecular environment of atoms.

The similarity between all pairs of molecules is then calculated using the Tanimoto coefficient, a widely used metric for binary vector comparison:

```python
from rdkit import DataStructs
from chemcluster import get_fingerprint

fps = [get_fingerprint(mol) for mol in mols]

# Tanimoto similarity for one pair of molecules (indices i and j)
i, j = 0, 1
similarity = DataStructs.TanimotoSimilarity(fps[i], fps[j])
```

These similarities are converted into a distance matrix by computing `1 - similarity`, which serves as input for PCA.
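
To make the `1 - similarity` step concrete, here is a minimal NumPy sketch that builds the full distance matrix in one pass. It assumes the fingerprints have already been converted to a 0/1 array with one row per molecule; the function name `tanimoto_distance_matrix` and the toy data are illustrative and not part of ChemCluster.

```python
import numpy as np


def tanimoto_distance_matrix(bits):
    """Pairwise Tanimoto distances for a (n_molecules, n_bits) 0/1 array."""
    bits = np.asarray(bits, dtype=np.int64)
    on_counts = bits.sum(axis=1)        # number of set bits per fingerprint
    shared = bits @ bits.T              # size of the intersection for every pair
    union = on_counts[:, None] + on_counts[None, :] - shared
    # Two all-zero fingerprints have an empty union; treat them as identical here.
    similarity = np.divide(
        shared.astype(float),
        union,
        out=np.ones(shared.shape, dtype=float),
        where=union > 0,
    )
    return 1.0 - similarity


# Toy fingerprints with 4 bits each (real Morgan fingerprints use 2048 bits)
toy_bits = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
dist_matrix = tanimoto_distance_matrix(toy_bits)
print(dist_matrix.round(3))
```

The resulting matrix is symmetric with a zero diagonal, so each row can be read as the "distance profile" of one molecule against the whole library.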

### 2.3 Dimensionality Reduction
To facilitate visualization and clustering, the high-dimensional distance matrix is projected into a 2D space using Principal Component Analysis (PCA). PCA reduces the complexity of the data while retaining the directions of maximum variance.

The transformation is performed using `sklearn.decomposition.PCA`, producing 2D coordinates that reflect the relative positions of molecules in descriptor space:

```python
from sklearn.decomposition import PCA
coords = PCA(n_components=2).fit_transform(dist_matrix)
```
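
Because two components rarely capture everything, it can be useful to check how much variance the projection retains. The sketch below does this with scikit-learn's `explained_variance_ratio_` attribute; the random symmetric matrix is only a stand-in for a real Tanimoto distance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a real distance matrix: symmetric with a zero diagonal
raw = rng.random((20, 20))
dist_matrix = (raw + raw.T) / 2
np.fill_diagonal(dist_matrix, 0.0)

pca = PCA(n_components=2)
coords = pca.fit_transform(dist_matrix)

# Fraction of the total variance captured by each of the two components
retained = pca.explained_variance_ratio_
print(f"PC1: {retained[0]:.1%}, PC2: {retained[1]:.1%}, total: {retained.sum():.1%}")
```

A low total suggests that the 2D scatter plot should be read as a coarse map of the library rather than a faithful picture of all pairwise distances.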

### 2.4 Clustering with KMeans
Molecules are grouped into clusters using the KMeans algorithm from `sklearn.cluster`. The optimal number of clusters `k` is determined by evaluating the silhouette score over a range of possible values (typically k = 2 to 10). The silhouette score quantifies both the cohesion within clusters and the separation between them.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = 2, -1
for k in range(2, 11):
    # Fixed seed so that the scored model and the final model agree
    model = KMeans(n_clusters=k, random_state=0).fit(coords)
    score = silhouette_score(coords, model.labels_)
    if score > best_score:
        best_k, best_score = k, score
labels = KMeans(n_clusters=best_k, random_state=0).fit_predict(coords)
```

### 2.5 Visualization and Molecule Inspection
The clustering results are displayed as a 2D scatter plot using Plotly, where each point corresponds to a molecule in PCA space. The `plotly_events()` function captures user clicks on individual points, triggering the display of the molecule’s structure, 3D viewer, and properties.

```python
from streamlit_plotly_events import plotly_events

# fig: the Plotly scatter plot of the PCA coordinates
selected_points = plotly_events(fig, click_event=True)
```

This interactivity enables quick identification and inspection of interesting chemical structures.

### 2.6 Cluster Filtering by Properties
For each cluster, ChemCluster calculates the average values of selected molecular descriptors using the `calculate_properties()` function. This information is used to filter clusters that match user-defined criteria, such as high LogP or low TPSA.

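```python
import pandas as pd
from chemcluster import calculate_properties

cluster_props_summary = {}
for c in set(labels):
    # Indices of the molecules assigned to cluster c
    idxs = [i for i, lbl in enumerate(labels) if lbl == c]
    props = [calculate_properties(mols[i]) for i in idxs]
    # Keep numeric columns only, then average them per cluster
    df = pd.DataFrame(props).select_dtypes(include='number')
    cluster_props_summary[c] = df.mean()
```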
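
The filtering step itself can be sketched as a boolean mask over the per-cluster means. In the example below, the property keys `"LogP"` and `"TPSA"`, the toy values, and the thresholds are all assumptions made for illustration; the actual keys returned by `calculate_properties()` may be named differently.

```python
import pandas as pd

# Toy per-cluster means; in the app these would come from cluster_props_summary.
# The keys "LogP" and "TPSA" are assumed names, not confirmed ones.
cluster_props_summary = {
    0: pd.Series({"LogP": 3.8, "TPSA": 45.0}),
    1: pd.Series({"LogP": 1.2, "TPSA": 110.0}),
    2: pd.Series({"LogP": 4.5, "TPSA": 130.0}),
}
summary_df = pd.DataFrame(cluster_props_summary).T  # one row per cluster

# Example criteria: lipophilic clusters with a small polar surface area
mask = (summary_df["LogP"] >= 3.0) & (summary_df["TPSA"] <= 90.0)
matching_clusters = summary_df.index[mask].tolist()
print(matching_clusters)
```

Collecting the means in a single DataFrame makes it straightforward to combine several criteria or to display the summary as a table in the interface.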

### 2.7 Export Functionality
The filtered clusters and their associated molecular data can be exported in `.csv` format for documentation or further analysis. This is handled using Pandas’ `to_csv()` method and Streamlit’s `download_button()`:

```python
import streamlit as st

# cluster_df: DataFrame holding the molecules of the selected cluster
csv = cluster_df.to_csv(index=False).encode("utf-8")
st.download_button("Download Cluster Molecules", data=csv,
                   file_name="cluster_data.csv", mime="text/csv")
```

Once a molecular dataset is uploaded, ChemCluster automatically processes the molecules, removes invalid entries, and calculates structural fingerprints. These fingerprints are then compared using Tanimoto similarity, producing a distance matrix that captures how chemically close each molecule is to the others.

This matrix is reduced into two dimensions using PCA, allowing clusters of structurally similar molecules to emerge in a scatter plot. Users can click directly on any point to view the 2D and 3D structure of a molecule, as well as its properties.

Moreover, ChemCluster calculates average property values for each cluster and allows users to filter clusters matching specific criteria, such as high logP or low polar surface area (TPSA). These refined groups can then be exported for documentation or further cheminformatics work.

---
## 3. Single Molecule Mode – Conformer Generation and Visualization

In Single Molecule Mode, ChemCluster allows users to enter or draw a molecule, generate 3D conformers, perform clustering based on conformational similarity, and visualize representative conformations.

This mode is particularly useful for analyzing the conformational landscape of flexible molecules or inspecting representative shapes for downstream modeling.

### 3.1 Molecular Input
The user can enter a molecule via a SMILES string or draw it interactively using an embedded chemical editor. The SMILES string is parsed with RDKit to generate a `Mol` object.

```python
from rdkit import Chem
mol = Chem.MolFromSmiles("CCO")
```

### 3.2 Conformer Generation
RDKit's ETKDG algorithm is used to generate multiple conformers for the input molecule, with the number of conformers chosen by the user. Explicit hydrogen atoms are added before embedding to improve the 3D geometry, and each conformer is then relaxed with the UFF force field.


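```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(mol)  # explicit hydrogens are needed for sensible 3D geometries
# numConfs is set by the user in the app; 50 is an example value
cids = AllChem.EmbedMultipleConfs(mol, numConfs=50, randomSeed=42)
AllChem.UFFOptimizeMoleculeConfs(mol)  # relax every conformer with UFF
```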

### 3.3 Conformer Clustering
The conformers are clustered based on their pairwise RMSD values using the Butina clustering algorithm. This allows identification of conformational families. The conformer with the lowest average RMSD within each cluster is selected as the centroid.

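```python
from rdkit.Chem import rdMolAlign
from rdkit.ML.Cluster import Butina

# Lower-triangle RMSD list, in the order Butina.ClusterData expects
dists = []
for i in range(len(cids)):
    for j in range(i):
        # Use the conformer IDs rather than the loop counters
        rms = rdMolAlign.GetBestRMS(mol, mol, cids[i], cids[j])
        dists.append(rms)
clusters = Butina.ClusterData(dists, len(cids), 1.5, isDistData=True)  # 1.5 Å cutoff
```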
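
The centroid rule described above (lowest average RMSD within a cluster) can be sketched with NumPy as follows. The helper names `condensed_to_square` and `pick_centroid` are hypothetical, and the toy distances stand in for real RMSD values; this is an illustration of the rule, not ChemCluster's own code.

```python
import numpy as np


def condensed_to_square(dists, n):
    """Expand a lower-triangle list (row i, columns 0..i-1) into an n x n matrix."""
    square = np.zeros((n, n))
    k = 0
    for i in range(n):
        for j in range(i):
            square[i, j] = square[j, i] = dists[k]
            k += 1
    return square


def pick_centroid(cluster, rmsd_square):
    """Return the member with the lowest mean RMSD to the members of its cluster."""
    members = list(cluster)
    sub = rmsd_square[np.ix_(members, members)]
    return members[int(np.argmin(sub.mean(axis=1)))]


# Toy data: 4 conformers, distances listed in the same order as the nested loop
dists = [0.4, 0.5, 0.3, 2.0, 2.1, 1.9]
rmsd_square = condensed_to_square(dists, 4)
clusters = [(0, 1, 2), (3,)]
centroids = [pick_centroid(c, rmsd_square) for c in clusters]
print(centroids)
```

Expanding the condensed list into a square matrix costs some memory but makes per-cluster lookups simple, which is acceptable for the few dozen conformers typically generated here.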

### 3.4 Visualization of Centroids
Users can select centroid conformers to visualize them in 3D using py3Dmol. This allows comparing conformational representatives interactively.

```python
import py3Dmol
from rdkit import Chem

viewer = py3Dmol.view(width=400, height=400)
# centroid_id: conformer ID of the centroid selected by the user
mb = Chem.MolToMolBlock(mol, confId=centroid_id)
viewer.addModel(mb, "mol")
viewer.setStyle({"stick": {}})
viewer.zoomTo()
```

### 3.5 Property Calculation for Centroids
For each centroid conformer, the `calculate_properties()` function is used to compute relevant physicochemical descriptors. These include molecular weight, LogP, number of hydrogen bond donors/acceptors, TPSA, rotatable bonds, etc.

```python
from chemcluster import calculate_properties
props = calculate_properties(mol, mol_name="example")
```

---

## 4. Results and Observations

This section provides a demonstration of ChemCluster’s capabilities through visual examples from the user interface.

### 4.1 Interface Overview

Launching ChemCluster brings up the main page where users can select between two analysis modes:

![Main page](https://github.com/erubbia/ChemCluster/blob/main/assets/main_page.png?raw=true)

---

### 4.2 Dataset Mode – Clustering a Molecular Library

Users can upload a dataset (e.g., in SMILES format), perform PCA dimensionality reduction, and cluster molecules using KMeans. Clusters are shown as colored groups on a scatter plot.

![PCA plot](https://github.com/erubbia/ChemCluster/blob/main/assets/pca_plot.png?raw=true)

Clicking on a molecule displays its 2D structure and computed properties:

![Molecule selected](https://github.com/erubbia/ChemCluster/blob/main/assets/pca_plot2.png?raw=true)

Filtering based on selected properties (e.g., LogP, TPSA) highlights matching clusters. Molecules from a selected cluster can then be exported as `.csv`.

---

### 4.3 Single Molecule Mode – Conformer Clustering

In single molecule mode, a SMILES string is entered and 3D conformers are generated. These conformers are clustered using Butina clustering.

Centroid conformers are displayed in 3D using py3Dmol:

![Centroid superposition](https://github.com/erubbia/ChemCluster/blob/main/assets/centroid_superposition.png?raw=true)


---
## 5. Underlying Algorithms and Tools

ChemCluster integrates several cheminformatics and data science algorithms to enable clustering, dimensionality reduction, and visualization of molecular data. This section outlines the most important computational techniques and their relevance to the application.

### 5.1 Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is used in ChemCluster to reduce the dimensionality of molecular descriptor space. Since pairwise Tanimoto distances result in a high-dimensional space (equal to the number of molecules), PCA allows projecting this data into two or three dimensions for visualization and clustering.

PCA works by computing the eigenvectors of the covariance matrix and finding orthogonal axes (principal components) that capture the most variance.

```python
from sklearn.decomposition import PCA
coords = PCA(n_components=2).fit_transform(dist_matrix)
```
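
The eigenvector description above can be turned into a short NumPy sketch. It is meant only to illustrate the principle: scikit-learn's `PCA` relies on an SVD-based solver rather than an explicit covariance matrix, so component signs may differ. The function name `pca_project` and the random data are assumptions for the example.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project the rows of `data` onto the top principal components."""
    centered = data - data.mean(axis=0)       # PCA operates on mean-centered data
    cov = np.cov(centered, rowvar=False)      # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order], eigvals[order]


rng = np.random.default_rng(1)
# Five features with very different spreads, so the leading components are clear
data = rng.normal(size=(30, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
coords, variances = pca_project(data)
print(coords.shape, variances)
```

Each eigenvalue equals the variance of the data along its eigenvector, which is why sorting by eigenvalue selects the directions of maximum variance.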

### 5.2 KMeans Clustering
KMeans is an unsupervised clustering algorithm used to partition the dataset into `k` clusters by minimizing the sum of squared distances between each point and the centroid of its assigned cluster. The optimal value of `k` is selected using the silhouette score, which balances inter-cluster separation and intra-cluster cohesion.

KMeans assumes a Euclidean space and seeks compact, spherical clusters, which aligns with the reduced PCA space.

```python
from sklearn.cluster import KMeans
model = KMeans(n_clusters=best_k).fit(coords)
labels = model.labels_
```
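
For reference, the two quantities mentioned in this subsection can be written explicitly. The notation is introduced here for illustration: $C_j$ denotes cluster $j$ with centroid $\mu_j$, $a(i)$ is the mean distance from point $i$ to the other members of its own cluster, and $b(i)$ is the mean distance from point $i$ to the members of the nearest other cluster.

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

KMeans minimizes $J$, while the silhouette score is the mean of $s(i)$ over all points. It ranges from $-1$ to $1$, with higher values indicating compact, well-separated clusters.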

### 5.3 Tanimoto Similarity and Fingerprints
Tanimoto similarity is a measure commonly used to compare binary molecular fingerprints. It calculates the ratio of the intersection to the union of two bit vectors:

$$\text{Tanimoto}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

This metric is well-suited for comparing the presence or absence of chemical substructures. RDKit's `DataStructs.TanimotoSimilarity()` computes this efficiently from Morgan fingerprints.
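
As a small worked example of the formula, the sketch below represents each fingerprint as the set of its on-bit indices. The function name `tanimoto` and the bit indices are made up for illustration; in ChemCluster the computation is delegated to RDKit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(union)


fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7, 12, 20}

# Intersection {1, 5, 12} has 3 bits, union {1, 5, 7, 9, 12, 20} has 6 bits
similarity = tanimoto(fp_a, fp_b)
distance = 1 - similarity
print(similarity, distance)
```

Bits that are off in both fingerprints never enter the ratio, which is why the metric suits sparse fingerprints where most of the 2048 positions are zero.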

### 5.4 Butina Clustering
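For single molecule conformers, ChemCluster uses Butina clustering to group similar 3D conformers based on their pairwise RMSD distances. The algorithm is non-hierarchical: for each conformer it counts the neighbors lying within a distance cutoff, takes the conformer with the most neighbors as a cluster centroid, assigns its neighbors to that cluster, and repeats with the remaining conformers. The number of clusters therefore follows from the cutoff rather than being fixed in advance.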

```python
from rdkit.ML.Cluster import Butina
# The cutoff is passed through the distThresh argument
clusters = Butina.ClusterData(rmsd_values, numConfs, distThresh=1.5, isDistData=True)
```
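
To show the idea behind the algorithm, here is a simplified pure-Python sketch of Butina-style clustering on a full, symmetric distance matrix. It is not RDKit's implementation; the function name `butina_sketch`, the tie-breaking rule, and the toy distances are assumptions made for the example.

```python
def butina_sketch(dist, cutoff):
    """Simplified Butina-style clustering on a full, symmetric distance matrix."""
    n = len(dist)
    # Neighbors of each point: every other point within the cutoff
    neighbors = [
        {j for j in range(n) if j != i and dist[i][j] <= cutoff} for i in range(n)
    ]
    # Visit points from most to fewest neighbors (ties broken by index)
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        # The centroid comes first, followed by its still-unassigned neighbors
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(tuple(members))
    return clusters


toy_dist = [
    [0.0, 0.5, 0.6, 3.0, 3.2],
    [0.5, 0.0, 0.4, 3.1, 3.3],
    [0.6, 0.4, 0.0, 2.9, 3.0],
    [3.0, 3.1, 2.9, 0.0, 0.7],
    [3.2, 3.3, 3.0, 0.7, 0.0],
]
clusters = butina_sketch(toy_dist, cutoff=1.5)
print(clusters)
```

Lowering the cutoff splits the data into more, tighter clusters, while raising it merges conformational families, so this single parameter plays the role that `k` plays in KMeans.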

### 5.5 Interactive Visualization Tools
- **`plotly_events()`** enables user interaction on the PCA scatter plot, allowing users to click and retrieve information about individual molecules.
- **`py3Dmol`** is used to render 3D structures in the browser. It interprets mol-block data and displays conformers with interactive rotation and zoom capabilities.

These tools enhance accessibility and allow hands-on exploration of chemical space and molecular shape diversity.

---
## 6. Testing and Validation

To ensure reliability and maintainability, ChemCluster includes a suite of unit tests covering all core functionalities. These tests are located in the `tests/` directory and can be executed using either `pytest` or `tox`, ensuring compatibility across Python versions.

### 6.1 Coverage and Scope
The following components are covered by unit tests:
- `clean_smiles_list`: checks for SMILES parsing and molecule validity
- `get_fingerprint`: verifies correct generation of Morgan fingerprints
- `calculate_properties`: ensures property values match expected reference
- `mol_to_base64_img`: validates base64 image string formatting
- `show_3d_molecule`: checks py3Dmol viewer object creation

### 6.2 Example Test Snippet

```python
from rdkit import Chem
from chemcluster import calculate_properties

def test_calculate_properties():
    mol = Chem.MolFromSmiles("CCO")  # ethanol
    result = calculate_properties(mol, mol_name="ethanol")
    assert isinstance(result, dict)
    assert abs(result["Molecular Weight"] - 46.07) < 0.1
```

---
## 7. Challenges Faced
While the development of ChemCluster was successful and rewarding overall, we faced the following challenges:

- **High dimensionality of molecular space:**  
  Molecular fingerprints often have thousands of dimensions, so effective clustering relies on robust dimensionality reduction. Even then, important variance can be lost in the projection.

- **Molecular representation bottlenecks:**  
  Traditional Morgan fingerprints may fail to capture finer details, such as electronic effects or stereochemical subtleties. This in turn affects the quality of the clustering.
  
- **Performance bottlenecks:**  
  When processing more than 1000 molecules or generating large numbers of 3D conformers, the application slowed down noticeably. This is most likely due to the computational cost of fingerprint comparison, distance matrix construction, and 3D optimization.

- **Errors in SMILES parsing:**  
  Molecules may pass the initial SMILES validity check but fail at 3D embedding. As a consequence, incorrect or unexpected geometries occasionally appeared, especially for stereochemically rich or flexible molecules.

- **Cross-platform compatibility and Streamlit limitations:**  
  The app’s frontend, built with Streamlit, showed slight differences between platforms (notably Windows vs. macOS), such as font rendering and responsiveness. Some third-party components like `streamlit_plotly_events` or `py3Dmol` also required version-specific fixes or careful dependency management.

These issues were mitigated through automatic molecule cleaning, limitation of dataset size, and standardization of 3D visualization settings. While not fully resolved, they are documented and provide a clear direction for future improvements.



---
## 8. Conclusion and Outlook

The development of ChemCluster provided valuable insights into the challenges and potential of combining cheminformatics tools with unsupervised machine learning techniques. The application has successfully enabled both conformer-based and fingerprint-based molecular clustering, while also offering interactive visualization and analysis via a Streamlit interface. 

Naturally, there still remain areas for improvement:

- **Improving molecular representations:**  
  A promising direction would be to use learned molecular embeddings, such as those generated by transformer-based models like ChemBERTa, which encode richer chemical context by leveraging large-scale molecular data during pretraining.

- **Explore alternative clustering algorithms:**  
 Instead of relying only on KMeans, other unsupervised clustering algorithms such as DBSCAN or agglomerative clustering could be evaluated.

- **Explore alternative similarity metrics:**  
 Conformer clustering currently relies on RMSD, which measures geometrical alignment but not energetic or functional similarity. It would therefore be interesting to use metrics based on conformer energy differences, shape overlap, and pharmacophoric features. This would offer more chemically meaningful clustering, particularly in the context of bioactive conformations.

- **Investigation of hydrogen bonding interactions:**  
Extending the current pipeline to identify and visualize H-bonding networks between conformer centroids in solution could reveal solvation effects. This could be especially valuable for applications involving transition state sampling (e.g. in computational chemistry), where conformer bias can greatly affect kinetic predictions.
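
As an illustration of the alternative clustering direction listed above, the sketch below runs scikit-learn's DBSCAN directly on a precomputed distance matrix. The toy matrix and the `eps` and `min_samples` values are illustrative assumptions rather than tuned settings, and this is not part of the current ChemCluster pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy precomputed distance matrix: two tight groups and one outlier
dist_matrix = np.array([
    [0.0, 0.1, 0.2, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1, 0.9],
    [0.8, 0.9, 0.9, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.9, 0.0],
])

# eps and min_samples are example values, not recommended defaults
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(dist_matrix)
print(labels)
```

Unlike KMeans, this approach needs no preset number of clusters and marks isolated molecules as noise (label `-1`), which may suit libraries containing structural singletons.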

In summary, by addressing these directions, ChemCluster could evolve into a more robust platform for both molecular clustering and conformational analysis.