# Molecule Embedding Example
This example shows how to pass molecules and text through our model to obtain their embeddings. We use the checkpoint from the GraphTextRetrieval downstream task. Before running it, extract the [model data](https://huggingface.co/andrewt28/MolLM/tree/main/GraphTextRetrieval) for GraphTextRetrieval into `downstream/GraphTextRetrieval` and activate the `prediction` Conda environment.

We first extract the embeddings for a small molecule and two pieces of text. Then, we compute the cosine similarity between each text embedding and the molecule embedding.

In [None]:
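import importlib
import sys

# Add the graph-transformer directory to the import path so the MolLM module can be found
sys.path.insert(0, '../../downstream/graph-transformer')
MolLMPkg = importlib.import_module("MolLM")
# The MolLM class lives inside the module of the same name
MolLM = MolLMPkg.MolLM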

In [2]:
model = MolLM(
    '../../downstream/GraphTextRetrieval/all_checkpoints/model-epoch=394.ckpt',
    '../../downstream/GraphTextRetrieval/',
    '../../downstream/GraphTextRetrieval/bert_pretrained',
)


The cell below passes the SMILES string of a small molecule, aspirin, through the MolLM model to obtain its embedding.

In [3]:
# Aspirin molecule
molecule_embedding = model.forward_molecule('O=C(C)Oc1ccccc1C(=O)O')
molecule_embedding.shape

torch.Size([1, 768])

The cell below passes a piece of text through the MolLM model to obtain its embedding. In this case, the text describes aspirin.

In [4]:
# Aspirin description
text_embedding = model.forward_text('Acetylsalicylic acid appears as odorless white crystals or crystalline powder with a slightly bitter taste.')
text_embedding.shape

torch.Size([1, 768])

Then, we compute the cosine similarity between the molecule and text embeddings.

In [5]:
from torch.nn.functional import cosine_similarity

cosine_similarity(molecule_embedding, text_embedding)

tensor([0.5192])
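To make the score above easier to interpret, here is a minimal sketch of what cosine similarity computes, written in plain Python on tiny hand-made vectors. The helper `cosine_sim` is a hypothetical name introduced only for illustration; it is not part of MolLM or PyTorch, and the vectors are toy values, not model outputs.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (illustrative values only, not real embeddings)
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])                 # anti-parallel vectors

print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the two vectors, not on their lengths: it is close to 1 for vectors pointing the same way, 0 for perpendicular vectors, and close to -1 for opposite vectors. `torch.nn.functional.cosine_similarity` applies the same formula along the chosen dimension and additionally guards against zero-length vectors with a small `eps`.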

Next, we obtain the embedding of another piece of text. In this case, the text is unrelated to aspirin. Finally, we compute the cosine similarity between this embedding and that of the aspirin molecule. As expected, it is much lower than the previous similarity (0.0510 versus 0.5192).

In [7]:
text_embedding2 = model.forward_text('Sodium octadecanoate is an organic sodium salt comprising equal numbers of sodium and stearate ions.')
cosine_similarity(molecule_embedding, text_embedding2)
# As expected, a description unrelated to aspirin has a much lower cosine similarity to the aspirin molecule embedding

tensor([0.0510])
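The two comparisons above can be generalized into a small retrieval step: score several candidate text embeddings against one molecule embedding in a single call and sort them. The sketch below uses random tensors as stand-ins for real embeddings (an assumption, so that it runs without the checkpoint). Only the `[1, 768]` shape is taken from the outputs above, and the variable names are illustrative.

```python
import torch
from torch.nn.functional import cosine_similarity

torch.manual_seed(0)

# Stand-ins for real embeddings (assumption: random values with the [1, 768] shape shown above)
molecule_embedding = torch.randn(1, 768)
candidate_embeddings = torch.randn(3, 768)

# Make candidate 1 deliberately close to the molecule so the ranking is easy to check
candidate_embeddings[1] = molecule_embedding[0] + 0.1 * torch.randn(768)

# The [1, 768] molecule embedding broadcasts against the [3, 768] candidates,
# giving one similarity score per candidate
scores = cosine_similarity(molecule_embedding, candidate_embeddings, dim=1)

# Indices of the candidates, from most to least similar
ranking = torch.argsort(scores, descending=True)
best = int(ranking[0])

print(scores, ranking)
```

With the real model, the candidate matrix could be built by concatenating the `[1, 768]` outputs of `model.forward_text` for each description, for example with `torch.cat`. This assumes every call returns a tensor of the shape shown above.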