### Compute UMAP embedding based on biomolecular structures
A file containing the precomputed UMAP embedding based on the 20k biomolecular structures set ([biostructures_20k.csv](biostructures_20k.csv)) is provided, enabling the projection of new structures onto this embedding.
The MCES distances from every new structure to every biomolecular structure must be provided. They can be computed as Myopic MCES distances with [https://github.com/AlBi-HHU/molecule-comparison](https://github.com/AlBi-HHU/molecule-comparison).

In [1]:
# imports
import pandas as pd
import pickle
import numpy as np

#### Load newly computed MCES distances

In [8]:
new_distances = pd.read_csv('new_mces_distances_example.csv')
# for biomolecular structures, do not consider outlier clusters
with open('biostructures_20k_outlier_indices.txt') as f:
    # a set keeps the membership test below fast
    outlier_indices = {int(line) for line in f if line.strip()}
biostructures_set = [s.strip() for i, s in enumerate(pd.read_csv('biostructures_20k.csv').smiles.tolist())
                     if i not in outlier_indices]
print(f'{len(biostructures_set)=}')

len(biostructures_set)=18096


#### Bring into the correct format

In [9]:
new_smiles = new_distances.smiles1.unique().tolist()
new_distances = new_distances.set_index(['smiles1', 'smiles2'])
mces_array = np.full((len(new_smiles), len(biostructures_set)), np.nan)
for i, smiles1 in enumerate(new_smiles):
    for j, smiles2 in enumerate(biostructures_set):
        mces_array[i, j] = new_distances.loc[(smiles1, smiles2), 'mces']
print(f'{mces_array.shape=}')

mces_array.shape=(10, 18096)
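The nested loop above performs one index lookup per pair, which can become slow for larger inputs. As a minimal sketch of an alternative, the long-format table can be pivoted in one step and checked for missing pairs. The helper name `distances_to_array` and the toy data are hypothetical; the sketch assumes the columns `smiles1`, `smiles2` and `mces` used above, with at most one row per pair.

```python
import numpy as np
import pandas as pd


def distances_to_array(distances, row_smiles, col_smiles):
    """Pivot long-format distances into a (rows x columns) array.

    Hypothetical helper: assumes `distances` has the columns 'smiles1',
    'smiles2' and 'mces', with at most one row per (smiles1, smiles2) pair.
    """
    wide = (distances
            .set_index(['smiles1', 'smiles2'])['mces']
            .unstack('smiles2')
            .reindex(index=row_smiles, columns=col_smiles))
    # every requested pair needs a distance before the array is used further
    n_missing = int(wide.isna().sum().sum())
    if n_missing:
        raise ValueError(f'{n_missing} distance pairs are missing')
    return wide.to_numpy(dtype=float)


# Toy data standing in for the real distance file (values are made up).
toy = pd.DataFrame({
    'smiles1': ['CCO', 'CCO', 'CCN', 'CCN'],
    'smiles2': ['C', 'CC', 'C', 'CC'],
    'mces': [2.0, 1.0, 2.5, 1.5],
})
toy_array = distances_to_array(toy, ['CCO', 'CCN'], ['CC', 'C'])
print(toy_array.shape)
```

With the notebook's variables, the call would presumably be `distances_to_array(pd.read_csv('new_mces_distances_example.csv'), new_smiles, biostructures_set)`, that is, on the table before `set_index` is applied.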


#### Load precomputed UMAP embedding

In [None]:
with open('umap_embedding_biostructures.pkl', 'rb') as f:
    umap = pickle.load(f)
umap_projected = umap.transform(mces_array)

#### Append new distances to `umap_df.csv` to visualize

In [7]:
new_distances_df = pd.DataFrame({'umap1': umap_projected[:, 0], 'umap2': umap_projected[:, 1], 
                                 'smiles': new_smiles, 
                                 'set': ['new_distances'] * len(new_smiles)})
umap_df = pd.read_csv('umap_df.csv')
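# to_csv returns None, so concatenate first and write the result afterwards
umap_df = pd.concat([umap_df, new_distances_df], axis=0, ignore_index=True)
umap_df.to_csv('umap_df.csv', index=False)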
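Because the cell above writes back to `umap_df.csv`, running it a second time appends the same structures again. One possible way to make the step repeatable is to drop any rows already labelled with the new set before concatenating. This is only a sketch: the helper name `append_projection`, the `'biostructures'` label and the toy coordinates are assumptions, and the real `umap_df.csv` may contain further columns.

```python
import pandas as pd


def append_projection(umap_df, projected_df, set_name='new_distances'):
    """Return umap_df with the rows of `set_name` replaced by projected_df.

    Hypothetical helper: assumes both frames have a 'set' column.
    """
    kept = umap_df[umap_df['set'] != set_name]
    return pd.concat([kept, projected_df], axis=0, ignore_index=True)


# Toy frames standing in for umap_df.csv and the projected structures.
existing = pd.DataFrame({'umap1': [0.0, 1.0], 'umap2': [0.5, 1.5],
                         'smiles': ['C', 'CC'],
                         'set': ['biostructures', 'biostructures']})
new_rows = pd.DataFrame({'umap1': [2.0], 'umap2': [2.5],
                         'smiles': ['CCO'], 'set': ['new_distances']})

once = append_projection(existing, new_rows)
twice = append_projection(once, new_rows)  # a rerun does not add duplicates
print(len(once), len(twice))
```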

#### Visualization
See [display_umap.ipynb](display_umap.ipynb).