# ChemCluster Project Report

**CH-200 – Practical Programming in Chemistry**  
**Group:** Elisa Rubbia, Romain Guichonnet, Flavia Zabala Perez  
**Date:** May 2025

## Welcome to ChemCluster!

ChemCluster is an interactive platform for analyzing and visualizing chemical datasets with molecular clustering techniques. It uses **RDKit** for cheminformatics and **Streamlit** for the web interface, offering an intuitive tool for researchers and students alike.

Two main modes of operation are supported by the platform:

- **Dataset Mode**: This mode allows users to upload datasets (in `.csv` or `.sdf` format) containing multiple molecular structures. These are then clustered based on structural similarity, and representative structures for each cluster are highlighted for further exploration.
- **Single Molecule Mode**: In this workflow, ChemCluster generates a set of 3D conformers for a single input molecule. These conformers are clustered, and representative structures are visualized both in 2D and 3D, enabling users to analyze conformational diversity effectively.

## 1. Setup and Initialization

To begin using ChemCluster, you have two options: installation via PyPI or running the application locally from source.

### 1.1 Installation from PyPI
The simplest way to install ChemCluster is through the Python Package Index. In a terminal or command prompt, run:

```bash
pip install chemcluster
```
Once installed, you can launch the application by executing:

```bash
chemcluster
```
This command will open the ChemCluster interface directly in your default web browser.

### 1.2 Running Locally from Source (for Development or Contribution)
If you prefer to contribute to the project or wish to run it locally in a development environment, follow these steps:
1. Clone the repository from GitHub:
```bash
git clone https://github.com/erubbia/ChemCluster
```
2. Navigate into the project directory:
```bash
cd ChemCluster
```
3. Create a conda environment based on the project's environment file:
```bash
conda env create -f environment.yml
```
4. Activate the newly created environment:
```bash
conda activate chemcluster-env
```
5. Finally, install the project in editable mode:
```bash
pip install -e .
```
After this setup, you can launch ChemCluster the same way by running `chemcluster` in the terminal.

## 2. Dataset Mode – Analysis of Molecular Libraries

The dataset mode in ChemCluster allows users to upload a file containing multiple molecules in formats such as `.csv`, `.sdf`, or `.mol`. These datasets are typically used to analyze structural diversity, identify clusters of similar compounds, or inspect physicochemical properties.

### 2.1 Functional Overview
Upon upload, the molecular structures are converted into RDKit `Mol` objects. When working with CSV files, the relevant column containing SMILES strings is parsed and cleaned using the function `clean_smiles_list`. This ensures invalid entries are excluded prior to analysis.

Example:
```python
from chemcluster import clean_smiles_list
mols, smiles_list = clean_smiles_list(smiles_column)
```
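
The cleaning step can be pictured with the following minimal sketch. It is not the actual body of `clean_smiles_list`: the name `clean_smiles_sketch`, the handling of empty cells, and the choice to return canonical SMILES are assumptions made for illustration.

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse warnings for invalid entries


def clean_smiles_sketch(raw_smiles):
    """Hypothetical stand-in for clean_smiles_list: keep only parsable SMILES."""
    mols, kept = [], []
    for entry in raw_smiles:
        if not isinstance(entry, str) or not entry.strip():
            continue  # skip empty cells and non-string values such as None
        mol = Chem.MolFromSmiles(entry.strip())
        if mol is None:
            continue  # RDKit returns None when it cannot parse the string
        mols.append(mol)
        kept.append(Chem.MolToSmiles(mol))  # canonical SMILES (an assumption)
    return mols, kept


mols, smiles_list = clean_smiles_sketch(["CCO", "not_a_smiles", "", None, "c1ccccc1"])
```

Here only ethanol and benzene survive, so `mols` and `smiles_list` stay aligned index by index, which the later clustering steps rely on.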

### 2.2 Computational Procedure
Each molecule is encoded as a Morgan fingerprint (2048-bit vector), which represents circular substructures. A pairwise Tanimoto similarity matrix is computed, followed by a transformation into a distance matrix (1 - similarity). To enable visualization and clustering, the dimensionality is reduced using Principal Component Analysis (PCA).
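
To make the similarity-to-distance step concrete, here is a small sketch in plain Python. Fingerprints are represented as sets of "on" bit positions rather than RDKit bit vectors, and the names `tanimoto` and `distance_matrix` are illustrative, not part of ChemCluster.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def distance_matrix(fingerprints):
    """Symmetric matrix of Tanimoto distances (1 - similarity)."""
    n = len(fingerprints)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = 1.0 - tanimoto(fingerprints[i], fingerprints[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Toy fingerprints: each set lists the positions of the bits that are set.
fps = [{1, 5, 9, 12}, {1, 5, 9, 40}, {3, 7}]
dist_matrix = distance_matrix(fps)
```

The first two toy fingerprints share three of five distinct bits, giving a similarity of 0.6 and a distance of 0.4; the third shares none with either and sits at the maximum distance of 1.0.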

Clustering is performed using the KMeans algorithm. To determine the optimal number of clusters (k), ChemCluster evaluates models for values ranging from 2 to 10. The configuration that maximizes the silhouette score, which is a metric assessing both the separation between clusters and the cohesion within them, is selected automatically.

Relevant functions and methods:   
- `get_fingerprint()` – generates molecular fingerprints
- `PCA` from `sklearn.decomposition`
- `KMeans` from `sklearn.cluster`
- `silhouette_score` for quality assessment

### 2.3 Output and Visualization
Each molecule is projected onto a 2D PCA plane and assigned a cluster label. These results are visualized using an interactive Plotly scatter plot. Users can inspect any molecule by clicking on a point in the plot, revealing its structure, SMILES, and calculated descriptors.

The interface also allows users to export selected clusters as `.csv` files for further analysis.

### 2.4 Example Code Snippet

```python
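from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# dist_matrix: square (n_molecules x n_molecules) Tanimoto distance matrix
coords = PCA(n_components=2).fit_transform(dist_matrix)

# Try k = 2..10 and keep the k with the highest silhouette score
best_k, best_score = 2, -1.0
for k in range(2, 11):
    # Fixed random_state keeps the cluster assignment reproducible between runs
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    score = silhouette_score(coords, model.labels_)
    if score > best_score:
        best_k, best_score = k, score

# Refit with the selected k to obtain the final cluster labels
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(coords)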
```

## 3. Single Molecule Mode – Conformer Generation and Clustering

In addition to handling datasets, ChemCluster provides functionality to explore the 3D conformational space of a single molecule. This mode is useful for investigating structural flexibility, intramolecular interactions, and conformer diversity.

### 3.1 Workflow Description
The user inputs a molecule through a SMILES string or draws it using an embedded editor. The SMILES string is then converted into an RDKit `Mol` object. Explicit hydrogen atoms are added to improve 3D geometry prediction.

A set of conformers is generated using RDKit’s ETKDG algorithm, which combines distance geometry with experimental torsion-angle preferences to produce realistic 3D geometries. The embedded conformers are then refined by force-field minimization (UFF or MMFF94).

### 3.2 Conformer Clustering
Once generated, the conformers are clustered based on Root-Mean-Square Deviation (RMSD) of atomic positions. This step is carried out using the Butina clustering algorithm, which is a fast, greedy method suited for large sets of conformers.

Within each cluster, the conformer with the lowest average RMSD to others is selected as a representative (centroid).
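
The representative selection can be sketched in plain Python as follows. The helper names `condensed_index` and `pick_representative` are hypothetical, and the RMSD values are toy numbers; the sketch only assumes that distances are stored as a lower triangle filled row by row, the same layout used when building the RMSD list for Butina clustering.

```python
def condensed_index(i, j):
    """Position of the pair (i, j) in a lower-triangle distance list built row by row."""
    if i < j:
        i, j = j, i
    return i * (i - 1) // 2 + j


def pick_representative(cluster, dists):
    """Return the cluster member with the lowest mean RMSD to the other members."""
    if len(cluster) == 1:
        return cluster[0]

    def mean_rmsd(member):
        others = [m for m in cluster if m != member]
        return sum(dists[condensed_index(member, m)] for m in others) / len(others)

    return min(cluster, key=mean_rmsd)


# Toy RMSD values for four conformers, ordered (1,0), (2,0), (2,1), (3,0), (3,1), (3,2).
dists = [0.4, 0.5, 0.3, 2.0, 2.1, 2.2]
clusters = [(0, 1, 2), (3,)]
representatives = [pick_representative(c, dists) for c in clusters]
```

In the toy data, conformer 1 has the smallest mean RMSD (0.35) within the first cluster, and conformer 3 represents its own singleton cluster.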

### 3.3 Implementation Highlights
- `AllChem.EmbedMultipleConfs()` – generates conformers using ETKDG
- `AllChem.UFFOptimizeMoleculeConfs()` – optimizes each conformer
- `rdMolAlign.GetBestRMS()` – calculates RMSD between conformers
- `Butina.ClusterData()` – performs RMSD-based clustering

### 3.4 Example Code Snippet
```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign
from rdkit.ML.Cluster import Butina

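mol = Chem.MolFromSmiles('CCO')
mol = Chem.AddHs(mol)  # explicit hydrogens improve the 3D geometry
cids = list(AllChem.EmbedMultipleConfs(mol, numConfs=50, randomSeed=42))
_ = AllChem.UFFOptimizeMoleculeConfs(mol)  # force-field refinement of every conformer

# Pairwise RMSD as a condensed lower triangle, the layout Butina.ClusterData expects
dists = []
for i in range(len(cids)):
    for j in range(i):
        # GetBestRMS aligns conformer cids[i] onto cids[j] and returns the RMSD
        rms = rdMolAlign.GetBestRMS(mol, mol, cids[i], cids[j])
        dists.append(rms)

# 1.5 Å is the RMSD cutoff below which two conformers count as neighbours
clusters = Butina.ClusterData(dists, len(cids), 1.5, isDistData=True)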
```

## 4. Molecular Properties and Cluster Filtering

ChemCluster calculates a variety of key physicochemical properties for each molecule using RDKit. These descriptors help users assess drug-likeness, polarity, flexibility, and more. In dataset mode, the properties can also be aggregated at the cluster level to assist in filtering and selection.

### 4.1 Property Overview
The following molecular descriptors are calculated for each molecule:

- Molecular Weight (g/mol)
- LogP (hydrophobicity, from Crippen model)
- Number of Hydrogen Bond Donors
- Number of Hydrogen Bond Acceptors
- TPSA (Topological Polar Surface Area)
- Number of Rotatable Bonds
- Number of Aromatic Rings
- Heavy Atom Count

### 4.2 Calculation Method
All properties are computed with RDKit’s built-in descriptor functions, wrapped inside the helper function `calculate_properties()` defined in `core.py`. This function takes an RDKit `Mol` object and returns a dictionary of descriptors.

```python
from chemcluster import calculate_properties
props = calculate_properties(mol, mol_name="aspirin")
```
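
For orientation, a helper of this kind could look like the sketch below. It is not the code in `core.py`: the function name `calculate_properties_sketch` and all dictionary keys other than `"Molecular Weight"` are assumptions, while the RDKit descriptor calls themselves are standard.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors


def calculate_properties_sketch(mol, mol_name="molecule"):
    """Hypothetical stand-in for calculate_properties; most key names are assumptions."""
    return {
        "Name": mol_name,
        "Molecular Weight": Descriptors.MolWt(mol),
        "LogP": Crippen.MolLogP(mol),
        "H-Bond Donors": Lipinski.NumHDonors(mol),
        "H-Bond Acceptors": Lipinski.NumHAcceptors(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        "Rotatable Bonds": Lipinski.NumRotatableBonds(mol),
        "Aromatic Rings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "Heavy Atoms": mol.GetNumHeavyAtoms(),
    }


aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
props = calculate_properties_sketch(aspirin, mol_name="aspirin")
```

Returning a flat dictionary keeps the result easy to load into a pandas DataFrame, one row per molecule.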

### 4.3 Cluster-Based Property Filtering
In dataset mode, ChemCluster aggregates molecular properties at the cluster level to enable filtering. For each cluster, average values for the descriptors are computed and compared to overall means. Users can select properties of interest (e.g., high LogP, low TPSA) and retrieve clusters that meet these criteria.

This enables rapid identification of chemical subspaces with desired features.
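
A minimal sketch of this comparison, assuming the per-cluster means are available as plain dictionaries. The function name `clusters_matching`, the "high"/"low" convention, and the numeric values are illustrative assumptions rather than ChemCluster's actual implementation.

```python
def clusters_matching(cluster_means, overall_means, criteria):
    """Return cluster labels whose mean descriptors satisfy every criterion.

    criteria maps a descriptor name to "high" (above the overall mean)
    or "low" (below the overall mean).
    """
    selected = []
    for label, means in cluster_means.items():
        ok = True
        for prop, direction in criteria.items():
            if direction == "high" and not means[prop] > overall_means[prop]:
                ok = False
            elif direction == "low" and not means[prop] < overall_means[prop]:
                ok = False
        if ok:
            selected.append(label)
    return sorted(selected)


# Made-up per-cluster averages, for illustration only.
cluster_means = {
    0: {"LogP": 3.8, "TPSA": 45.0},
    1: {"LogP": 1.2, "TPSA": 110.0},
    2: {"LogP": 3.1, "TPSA": 95.0},
}
overall_means = {"LogP": 2.7, "TPSA": 83.0}
hits = clusters_matching(cluster_means, overall_means, {"LogP": "high", "TPSA": "low"})
```

With these toy values only cluster 0 combines an above-average LogP with a below-average TPSA, so it is the single cluster returned.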

### 4.4 Example Code
```python
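import numpy as np
import pandas as pd
from chemcluster import calculate_properties

# labels: cluster label per molecule; mols: matching list of RDKit Mol objects
cluster_props_summary = {}
for c in set(labels):
    idxs = [i for i, lbl in enumerate(labels) if lbl == c]
    props = [calculate_properties(mols[i]) for i in idxs]
    # Keep numeric descriptors only, then average them over the cluster
    df = pd.DataFrame(props).select_dtypes(include=[np.number])
    cluster_props_summary[c] = df.mean()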
```

## 5. Export and Interactivity

ChemCluster is designed with an interactive interface built using Streamlit and Plotly, allowing users to dynamically explore and export results. This interactivity is essential for intuitive cheminformatics analysis and facilitates quick insight into molecular datasets.

### 5.1 Interactive Exploration of Clusters
Clusters are visualized in a 2D PCA space using Plotly. Each point in the scatter plot corresponds to a molecule. By clicking on a point, the associated molecular structure is displayed alongside its computed properties. This enables rapid investigation of cluster composition.

The interface is implemented with `streamlit_plotly_events` to detect clicks on Plotly points and display corresponding information.

### 5.2 2D and 3D Molecular Visualization
- **2D Structures:** Rendered using RDKit’s `MolToImage()` function
- **3D Structures:** Displayed interactively using `py3Dmol` via the helper function `show_3d_molecule()`

These views allow for both topological and spatial inspection of selected molecules and conformers.

### 5.3 Export of Selected Data
Users may export cluster contents and selected molecule data to `.csv` files. This includes:
- SMILES
- Cluster label
- All calculated molecular descriptors

This feature supports downstream use in other cheminformatics tools or for documentation/reporting purposes.

### 5.4 Example Code Snippet
```python
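import streamlit as st

# Export the cluster DataFrame to CSV.
# cluster_df holds the SMILES, cluster label, and descriptors of the selected cluster.
csv = cluster_df.to_csv(index=False).encode('utf-8')
st.download_button("Download Cluster Molecules", data=csv,
                   file_name=f"cluster_{selected_cluster}_molecules.csv",
                   mime="text/csv")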
```

## 6. Results and Observations

To demonstrate the functionality of ChemCluster, we tested the application with a small-molecule input: flavone, provided as an `.sdf` file (`Flavone.sdf`), which may also be used during the oral presentation.

In single molecule mode, ChemCluster successfully processed the molecule, added explicit hydrogens, and generated a set of conformers using the ETKDG algorithm. The generated conformers were clustered based on RMSD distances using the Butina algorithm. The application identified a small number of representative conformers (cluster centroids), which were then visualized in both 2D and 3D.

The physicochemical properties of flavone, including molecular weight, LogP, TPSA, and hydrogen-bonding features, were computed and displayed within the application interface. All values were consistent with expectations for a moderately hydrophobic, aromatic compound.

The export feature allowed downloading all conformer and property data in a structured `.csv` format. This facilitates further analysis or integration with external cheminformatics workflows.

*Note: A final version of this section will include updated values and screenshots based on the exact molecule used during the oral demonstration.*


## 7. Testing and Validation

To ensure reliability and maintainability, ChemCluster includes a suite of unit tests covering all core functionalities. These tests are located in the `tests/` directory and can be executed using either `pytest` or `tox`, ensuring compatibility across Python versions.

### 7.1 Coverage and Scope
The following components are covered by unit tests:
- `clean_smiles_list`: checks for SMILES parsing and molecule validity
- `get_fingerprint`: verifies correct generation of Morgan fingerprints
- `calculate_properties`: ensures property values match expected reference
- `mol_to_base64_img`: validates base64 image string formatting
- `show_3d_molecule`: checks py3Dmol viewer object creation

### 7.2 Example Test Snippet
```python
from rdkit import Chem
from chemcluster import calculate_properties


def test_calculate_properties():
    mol = Chem.MolFromSmiles("CCO")  # ethanol
    result = calculate_properties(mol, mol_name="ethanol")
    assert isinstance(result, dict)
    assert abs(result["Molecular Weight"] - 46.07) < 0.1
```

## 8. Challenges Faced

While the development of ChemCluster was overall successful and rewarding, several technical and practical challenges emerged during the implementation and testing phases:

- **Performance with large datasets or high conformer counts:**  
  When processing large molecule sets or generating a high number of conformers (especially >200 per molecule), the application experienced significant slowdowns during 3D generation and clustering. This was particularly evident with long optimization times and memory consumption.

- **3D visualization issues with certain SMILES:**  
  Although py3Dmol generally worked well, some SMILES strings led to unexpected 3D representations or viewer artifacts. This may be linked to problematic stereochemistry or non-standard atom arrangements in the input data.

- **Cross-platform compatibility limitations:**  
  The app's dependencies (e.g. `py3Dmol`, `streamlit_plotly_events`) occasionally behaved differently between Windows and macOS environments. Some packages required manual installation or specific version pinning depending on the OS.

These limitations were addressed when possible and are documented for future improvements.



## 9. Conclusion and Outlook

ChemCluster has proven to be a functional and user-friendly cheminformatics platform for exploring molecular structures and clusters through RDKit and Streamlit. The application allows both single-molecule and dataset-based analyses, including conformer generation, clustering, property calculation, 2D/3D visualization, and interactive selection.

The project successfully integrates key aspects of molecular analysis into a cohesive, visual, and reproducible workflow. It has been used to process real molecular inputs such as flavone, and demonstrates clear separation of molecular clusters, meaningful descriptor distributions, and smooth user interaction.

While the main functionalities are complete and operational, additional features were considered during the development process. These include more advanced modeling or solution-phase interaction analysis, which were postponed in favor of keeping the application focused, functional, and robust.

### Future Perspective

A promising extension could be the visualization of hydrogen bonding interactions between different centroid conformers of a given molecule in solution. This would help investigate the possible over- or underestimation of transition states (TS) depending on conformer environments. Although this idea was not implemented in the final version, it remains a valuable direction for future work.
