### Interactive Visualization of Chemical Space
In many situations, we want to quickly visualize the chemical space occupied by a set of compounds.  In this space, chemically similar compounds will be close together and dissimilar compounds will be farther apart.  This notebook provides a brief example of how to create an interactive plot where the chemical structures of compounds corresponding to selected points are shown below the plot.

**Important Note if You're Running in Colab**   
After the libraries are installed, you'll see a message saying "Your session has crashed, automatically restarting".  Don't worry about this.  We're simply forcing Colab to restart and pick up the newly installed libraries. Continue to execute the notebook cells, and everything will work.

In [2]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install useful_rdkit_utils jupyter-scatter mols2grid scikit-learn
    exit()

In [5]:
import pandas as pd
import useful_rdkit_utils as uru
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import numpy as np
import jscatter
import mols2grid
import ipywidgets
import warnings

#### 1. Read the input data
Read a dataset with drugs from the [ChEMBL](https://www.ebi.ac.uk/chembl/) database. 

In [7]:
url = "https://raw.githubusercontent.com/PatWalters/datafiles/refs/heads/main/chembl_drugs.smi"
df = pd.read_csv(url,sep=" ",names=["SMILES","Name"])

#### 2. Generate chemical fingerprints
Instantiate a fingerprint generator object from [useful_rdkit_utils](https://github.com/PatWalters/useful_rdkit_utils). This is just a convenience wrapper around the [RDKit Morgan fingerprint generator](https://greglandrum.github.io/rdkit-blog/posts/2023-01-18-fingerprint-generator-tutorial.html). 

In [8]:
smi2fp = uru.Smi2Fp()
df['fp'] = df.SMILES.apply(smi2fp.get_np)

#### 3. Reduce the fingerprint dimensionality with PCA
We are going to use t-distributed Stochastic Neighbor Embedding (TSNE) to project the chemical fingerprints generated above into two dimensions. TSNE works better when the dimensionality of the input data has first been reduced to ~50 features.  We will use Principal Component Analysis (PCA) to reduce the fingerprints to 50 dimensions.

In [9]:
pca = PCA(n_components=50)
pcs = pca.fit_transform(np.stack(df.fp))
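
A quick way to sanity-check the choice of 50 components is to look at how much of the total variance they retain. The sketch below is only an illustration: it uses a random bit matrix (`fake_fps`, a hypothetical stand-in for `np.stack(df.fp)`), so the percentage it prints says nothing about the ChEMBL data. With the real fingerprints, the same `explained_variance_ratio_` attribute can be read from the `pca` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: random bits stand in for real fingerprints (200 "molecules" x 2048 bits).
rng = np.random.default_rng(0)
fake_fps = rng.integers(0, 2, size=(200, 2048))

pca = PCA(n_components=50)
pcs = pca.fit_transform(fake_fps)

# Fraction of the total variance captured by the 50 retained components.
retained = pca.explained_variance_ratio_.sum()
print(f"{pcs.shape[1]} components retain {retained:.1%} of the variance")
```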

#### 4. Project the PCs into two dimensions with TSNE
Now we can reduce the 50-dimensional principal components to 2 dimensions for plotting. Note that I used a context manager to suppress a few annoying warning messages. The coordinates from the TSNE projection are added to the dataframe as **tsne_x** and **tsne_y**.

In [10]:
with warnings.catch_warnings():
    warnings.filterwarnings("ignore",category=FutureWarning)
    tsne = TSNE(n_components=2,init='pca')
    df[["tsne_x","tsne_y"]] = tsne.fit_transform(pcs).tolist()
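
TSNE is stochastic, so the layout can differ from run to run. If a repeatable plot matters, one option is to pass a fixed `random_state`, a standard scikit-learn parameter that the cell above leaves unset. The sketch below assumes random data (`fake_pcs`, a hypothetical stand-in for `pcs`) and an arbitrary seed value.

```python
import numpy as np
from sklearn.manifold import TSNE

# Assumption: random values stand in for the 50 principal components of 100 molecules.
rng = np.random.default_rng(0)
fake_pcs = rng.normal(size=(100, 50))

# Fixing random_state seeds the optimization; 42 is an arbitrary choice.
tsne = TSNE(n_components=2, init="pca", random_state=42)
coords = tsne.fit_transform(fake_pcs)
print(coords.shape)  # one (x, y) pair per input row
```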

#### 5. Generate an interactive scatterplot
That's all we need to do. Now we can make a plot of chemical space using the nifty [Jupyter Scatter](https://github.com/flekschas/jupyter-scatter) component.  You can control this component using the icons on the left of the plot.  Click on the second icon from the top to put the component into selection mode.  Click and drag to select a set of points, and the corresponding chemical structures will be shown below the plot.  The third icon from the top can be used to change the selection mode.  For efficiency, I've limited the display to 25 chemical structures.  This can be easily changed in the code block below.

In [11]:
scatter = jscatter.Scatter(data=df,x="tsne_x", y="tsne_y")
output = ipywidgets.Output()

@output.capture(clear_output=True)
def selection_change_handler(change):
    # Show at most 25 of the selected structures in a grid below the plot
    selected = df.loc[change.new].head(25)
    display(mols2grid.display(selected, subset=["img", "Name"], template="static",
                              prerender=True, size=(200, 200)))
            
scatter.widget.observe(selection_change_handler, names=["selection"])

ipywidgets.VBox([scatter.show(), output])
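
To make the display limit easier to change, the row selection can be pulled out into a small helper. This is a minimal sketch with hypothetical names (`MAX_STRUCTURES`, `rows_to_display`) and toy data. Like the handler above, it assumes the selection is a list of dataframe index labels.

```python
import pandas as pd

MAX_STRUCTURES = 25  # hypothetical constant; change this to show more or fewer structures


def rows_to_display(frame, selection, max_rows=MAX_STRUCTURES):
    """Return at most max_rows rows of frame for the selected index labels."""
    return frame.loc[list(selection)].head(max_rows)


# Toy data standing in for the ChEMBL dataframe.
toy = pd.DataFrame({"SMILES": ["C", "CC", "CCC", "CCCC"],
                    "Name": ["methane", "ethane", "propane", "butane"]})
subset = rows_to_display(toy, [3, 0, 2], max_rows=2)
print(subset["Name"].tolist())  # ['butane', 'methane']
```

Inside the handler, `df.loc[change.new].head(25)` would then become `rows_to_display(df, change.new)`.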
