# Analyzing similarity of molecular datasets

This notebook illustrates how the `NearestNeighborsRetrieverTanimoto` can be used for analyzing the Tanimoto similarities of two datasets. 

Such an analysis is useful in many applications. For example, measuring how similar new molecules are to the training set helps to assess the applicability domain when making predictions. Alternatively, the similarity between training and test set can be evaluated to understand how well a model generalizes.

The notebook has the following sections:

**How to compute dataset similarities?**

**How to analyze similarities between train and test set?**

**Comparison to native RDKit Tanimoto computation**

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from rdkit import DataStructs
from sklearn.model_selection import train_test_split

from molpipeline import ErrorFilter, Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.estimators.nearest_neighbor import NearestNeighborsRetrieverTanimoto
from molpipeline.mol2any import MolToMorganFP
from molpipeline.utils.kernel import tanimoto_similarity_sparse

For this notebook we use 20k molecules from ChEMBL35 as a dataset.

In [None]:
df = pd.read_csv("example_data/chembl_35_20k.smi.gz", index_col="index")

In [None]:
df

## How to compute dataset similarities? 

To start the comparison we need the fingerprints as sparse matrices.

In [None]:
%%time
error_filter = ErrorFilter()

fingerprint_pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        ("error_filter", error_filter),
        ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2, return_as="sparse")),
    ],
    n_jobs=-1,
)

fp_matrix = fingerprint_pipeline.transform(df["smiles"])
fp_matrix

The resulting fingerprint matrix has the shape (19999, 2048), showing that one molecule could not be processed.

To compare datasets, we need to define a target and a query dataset. The `NearestNeighborsRetrieverTanimoto` retrieves the k most similar molecules in the target dataset for every query fingerprint. In this example, we use the same matrix as target and query dataset and retrieve the 3 nearest neighbors by setting `k=3`.

In [None]:
%%time
target_fps = fp_matrix
query_fps = target_fps

retriever = NearestNeighborsRetrieverTanimoto(target_fps, k=3, n_jobs=-1)
indices, similarities = retriever.predict(query_fps)

The retriever returns two arrays: `indices`, which holds the positions of the hits in the target dataset, and `similarities`, which holds the hits' Tanimoto similarities.

In [None]:
indices

In [None]:
indices.shape

The `indices` array contains one row for each query fingerprint and three columns for the 3 nearest neighbors. The hits of each query are sorted from left to right in descending order of similarity. The `similarities` array has the same shape as `indices` but contains the Tanimoto scores.
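To make the layout of the two arrays concrete, here is a minimal brute-force sketch in plain NumPy. It is not MolPipeline's implementation; the helper names `tanimoto_matrix` and `top_k` and the tiny fingerprints are made up for illustration.

```python
import numpy as np

# Toy binary fingerprints (rows = molecules, columns = bits); made-up values.
target = np.array(
    [
        [1, 1, 0, 0, 1, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 1],
        [1, 0, 0, 0, 1, 0],
    ],
    dtype=np.int64,
)
query = target[:2]  # use the first two molecules as queries


def tanimoto_matrix(a, b):
    """Dense Tanimoto similarity between all rows of a and all rows of b.

    Assumes that no fingerprint is all zeros (otherwise the union can be 0).
    """
    intersection = a @ b.T  # number of shared on-bits
    bits_a = a.sum(axis=1)[:, None]
    bits_b = b.sum(axis=1)[None, :]
    union = bits_a + bits_b - intersection
    return intersection / union


def top_k(sim, k):
    """Return the indices and similarities of the k best hits per row."""
    order = np.argsort(-sim, axis=1, kind="stable")[:, :k]
    return order, np.take_along_axis(sim, order, axis=1)


sim = tanimoto_matrix(query, target)
toy_indices, toy_similarities = top_k(sim, k=3)
print(toy_indices)       # one row per query, one column per neighbor
print(toy_similarities)  # same shape, sorted in descending order per row
```

Both outputs have the shape `(n_queries, k)`, here `(2, 3)`, and column 0 always holds the best hit of each query.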

In [None]:
similarities

Since we used the same dataset as query and target, every query has at least one hit with a similarity of 1.0: the query itself. Some queries have several hits with a Tanimoto score of 1.0, which happens when distinct entries in the dataset share an identical fingerprint.
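When the self-match is not of interest, it can be filtered out after retrieval. The sketch below uses a hypothetical helper `drop_self_hits` and assumes that query i and target i are the same molecule and that each row lists its own index exactly once. That assumption can fail if more than k entries share the same fingerprint, which is why the helper checks it. The example arrays are made-up values.

```python
import numpy as np


def drop_self_hits(indices, similarities):
    """Remove each query's own row index from its list of hits.

    Assumes query i corresponds to target i and that every row of
    `indices` contains the value i exactly once.
    """
    n_queries, k = indices.shape
    is_self = indices == np.arange(n_queries)[:, None]
    if not np.all(is_self.sum(axis=1) == 1):
        raise ValueError("each row must contain its own index exactly once")
    keep = ~is_self
    # Boolean indexing flattens row by row, so reshaping restores the rows.
    return (
        indices[keep].reshape(n_queries, k - 1),
        similarities[keep].reshape(n_queries, k - 1),
    )


# Toy retrieval result: molecules 1 and 2 have identical fingerprints,
# so the self-match of molecule 1 is not in the first column.
toy_indices = np.array([[0, 1, 2], [2, 1, 0], [1, 2, 0]])
toy_similarities = np.array([[1.0, 0.5, 0.5], [1.0, 1.0, 0.5], [1.0, 1.0, 0.5]])

nn_indices, nn_similarities = drop_self_hits(toy_indices, toy_similarities)
print(nn_indices)
print(nn_similarities)
```

Comparing against the row number, rather than simply dropping the first column, keeps duplicates of the query in the result.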

## How to analyze similarities between train and test set?

The nearest neighbors can be used to analyze the similarity between training and test set, which is an essential tool for understanding the generalization capabilities of machine learning models. In addition, this information can be used to select an appropriate data splitting strategy.

First we make a train/test split with our ChEMBL data and a dummy y vector because we don't use the labels in this example. 

In [None]:
# let's use dummy values for y
y = np.zeros(fp_matrix.shape[0], dtype=np.int64)

X_train, X_test, y_train, y_test = train_test_split(
    fp_matrix, y, test_size=0.33, random_state=42,
)

In [None]:
X_train

We use the `NearestNeighborsRetrieverTanimoto` to get the nearest neighbor in the training set for each test compound.

In [None]:
retriever = NearestNeighborsRetrieverTanimoto(X_train, k=1, n_jobs=-1)
indices, similarities = retriever.predict(X_test)
similarities

Let's look at the mean similarity of the test compounds to their most similar compound in the training set.

In [None]:
np.mean(similarities)

We can also plot the distribution of similarities to get a better impression of how similar the train and test set are to each other.

In [None]:
sns.histplot(pd.DataFrame({"1nn_similarities": similarities}), bins=50)
plt.title("1-nearest neighbor Tanimoto similarities to training data")

As the histogram shows, the similarity between the test and training set is relatively high, with most compounds having a similarity > 0.6 and even ~250 molecules with a Tanimoto score of 1. However, this is just a hypothetical example. If a real-world dataset had such high similarities, we would probably use cluster or time splits to reduce the similarity and the risk of data leakage.
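The numbers read off the histogram can also be computed directly. The following sketch uses a hypothetical helper `summarize_nn_similarities` and made-up similarity values; the threshold of 0.6 only mirrors the value mentioned above and is not a recommended cutoff.

```python
import numpy as np


def summarize_nn_similarities(similarities, threshold=0.6):
    """Summarize 1-nearest-neighbor similarities of a test set."""
    sims = np.asarray(similarities, dtype=float).ravel()
    return {
        "mean": float(sims.mean()),
        "median": float(np.median(sims)),
        # Fraction of test compounds with a close neighbor in the training set.
        "fraction_above_threshold": float(np.mean(sims > threshold)),
        # Test compounds whose fingerprint also occurs in the training set.
        "n_identical": int(np.sum(np.isclose(sims, 1.0))),
    }


# Made-up values standing in for the `similarities` array of the notebook.
toy_similarities = np.array([1.0, 0.9, 0.7, 0.65, 0.5, 0.3, 1.0, 0.8])
summary = summarize_nn_similarities(toy_similarities)
print(summary)
```

Tracking these values across different splitting strategies gives a quick indication of how much a cluster or time split reduces the overlap between train and test set.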

## Comparison to native RDKit Tanimoto computation

`NearestNeighborsRetrieverTanimoto` performs an exhaustive comparison to find the k-nearest neighbors. To do this, the full similarity matrix must be computed. MolPipeline's algorithm for finding these Tanimoto similarity scores differs from the approach in RDKit. In MolPipeline, we use an implementation based on sparse matrices that exploits the sparse matrix dot product algorithm from scipy. The central function is `tanimoto_similarity_sparse` which computes the full similarity matrix.
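The idea behind the sparse approach can be sketched in a few lines: for binary fingerprints, the dot product of two rows counts their shared on-bits, and the union follows from the per-row bit counts. The function below, `tanimoto_sparse_sketch`, is a hypothetical illustration and not necessarily how `tanimoto_similarity_sparse` is written. Returning 0 for two empty fingerprints is an assumption of this sketch.

```python
import numpy as np
from scipy import sparse


def tanimoto_sparse_sketch(a, b):
    """Tanimoto similarity between all rows of two binary sparse matrices."""
    # The sparse dot product only touches non-zero entries and yields
    # the number of shared on-bits for every pair of rows.
    intersection = (a @ b.T).toarray().astype(float)
    bits_a = np.asarray(a.sum(axis=1)).ravel()
    bits_b = np.asarray(b.sum(axis=1)).ravel()
    union = bits_a[:, None] + bits_b[None, :] - intersection
    # Assumption: pairs of empty fingerprints (union == 0) get similarity 0.
    result = np.zeros_like(intersection)
    np.divide(intersection, union, out=result, where=union > 0)
    return result


# Toy binary fingerprints with made-up values.
toy_fps = sparse.csr_matrix(
    np.array(
        [
            [1, 1, 0, 0, 1, 0],
            [1, 1, 0, 0, 0, 0],
            [0, 0, 1, 1, 0, 1],
            [1, 0, 0, 0, 1, 0],
        ]
    )
)
toy_sim = tanimoto_sparse_sketch(toy_fps, toy_fps)
print(toy_sim.round(3))
```

The intermediate `intersection` and `union` matrices are dense and have one entry per query/target pair, which is where the memory cost of this approach comes from.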

In [None]:
%%time
sim_matrix = tanimoto_similarity_sparse(fp_matrix, fp_matrix)
sim_matrix.shape

In [None]:
sim_matrix

To get the full similarity matrix with RDKit's `BulkTanimotoSimilarity`, we need the fingerprints in a different data structure, for example as `ExplicitBitVect`.

In [None]:
%%time
error_filter = ErrorFilter()

fingerprint_pipeline2 = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        ("error_filter", error_filter),
        (
            "morgan2_2048",
            MolToMorganFP(n_bits=2048, radius=2, return_as="explicit_bit_vect"),
        ),
    ],
    n_jobs=-1,
)

fp_matrix_explicit = fingerprint_pipeline2.transform(df["smiles"])
fp_matrix_explicit[:4]

Now, let's compute the full similarity matrix using RDKit's `BulkTanimotoSimilarity`.

In [None]:
%%time
sim_mat_rdkit = np.full((len(fp_matrix_explicit), len(fp_matrix_explicit)), np.nan)
for i, query_fp in enumerate(fp_matrix_explicit):
    sim_mat_rdkit[i, :] = DataStructs.BulkTanimotoSimilarity(
        query_fp, fp_matrix_explicit,
    )
sim_mat_rdkit.shape

In [None]:
if not np.allclose(sim_matrix, sim_mat_rdkit):
    raise AssertionError("Similarities are not the same")

Based on this simple comparison, MolPipeline's similarity matrix computation is about 2-3 times faster than RDKit's. However, there are many other aspects that this notebook does not cover. For example, `tanimoto_similarity_sparse` uses more memory because it needs intermediate matrices, while `BulkTanimotoSimilarity` needs almost no additional memory. In addition, both approaches can be parallelized with different strategies (one is implemented in `NearestNeighborsRetrieverTanimoto`), which can be beneficial in different scenarios. Lastly, while the functions discussed here are useful for easy analysis in Python, there are highly optimized tools for similarity search, like [Arthor](https://www.nextmovesoftware.com/arthor.html), which should probably be used when search speed is essential.
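One common way to limit the memory needed for intermediate matrices is to process the queries in row blocks, so that only a `chunk_size x n_targets` similarity block exists at any time. The sketch below illustrates this for the 1-nearest neighbor with dense NumPy arrays. The helpers `tanimoto_block` and `best_hit_chunked` are hypothetical, the data is random, and this is not a description of how `NearestNeighborsRetrieverTanimoto` works internally.

```python
import numpy as np


def tanimoto_block(a, b):
    """Dense Tanimoto similarities; assumes no all-zero fingerprints."""
    intersection = (a @ b.T).astype(float)
    union = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :] - intersection
    return intersection / union


def best_hit_chunked(query, target, chunk_size):
    """Find the most similar target for each query, one row block at a time."""
    n_queries = query.shape[0]
    best_idx = np.empty(n_queries, dtype=np.int64)
    best_sim = np.empty(n_queries, dtype=float)
    for start in range(0, n_queries, chunk_size):
        stop = start + chunk_size
        # Only this block of the similarity matrix is held in memory.
        sim = tanimoto_block(query[start:stop], target)
        idx = sim.argmax(axis=1)
        best_idx[start:stop] = idx
        best_sim[start:stop] = sim[np.arange(sim.shape[0]), idx]
    return best_idx, best_sim


# Random toy fingerprints; bit 0 is set everywhere to avoid empty rows.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=(50, 32))
query = rng.integers(0, 2, size=(10, 32))
target[:, 0] = 1
query[:, 0] = 1

idx_chunked, sim_chunked = best_hit_chunked(query, target, chunk_size=4)
idx_full, sim_full = best_hit_chunked(query, target, chunk_size=len(query))
print(np.array_equal(idx_chunked, idx_full))
```

Because the blocks are independent of each other, they could also be distributed over several workers, at the cost of holding one block per worker in memory.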