### Loading the Prot2Vec Model and Dataset

In [16]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec

In [17]:
prot2vec = Word2Vec.load("prot2vec.model")

In [18]:
df = pd.read_csv("Downloads/CPI_Data.csv")

df.head()

| Unnamed: 0 | SMILES | Protein Sequence | Target Name | Binding Label | 3-gram tokens | SMILES Tokens | Molecular Graph |
|---|---|---|---|---|---|---|---|
| 0 | `Fc1ccc2nc3CCCCc3c(NCCCCCC(=O)NCCc3c[nH]c4ccccc...` | `MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGG...` | Acetylcholinesterase | 1 | `['MRP', 'RPP', 'PPQ', 'PQC', 'QCL', 'CLL', 'LL...` | `['F', 'c', '1', 'c', 'c', 'c', '2', 'n', 'c', ...` | `{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...` |
| 1 | `COc1ccc2[nH]cc(CCNC(=O)CCCCCNc3c4CCCCc4nc4cccc...` | `MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGG...` | Acetylcholinesterase | 1 | `['MRP', 'RPP', 'PPQ', 'PQC', 'QCL', 'CLL', 'LL...` | `['C', 'O', 'c', '1', 'c', 'c', 'c', '2', '[nH]...` | `{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...` |
| 2 | `S=C(CCCCCCCNc1c2CCCCc2nc2ccccc12)NCCc1c[nH]c2c...` | `MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGG...` | Acetylcholinesterase | 1 | `['MRP', 'RPP', 'PPQ', 'PQC', 'QCL', 'CLL', 'LL...` | `['S', '=', 'C', '(', 'C', 'C', 'C', 'C', 'C', ...` | `{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...` |
| 3 | `Clc1cc(Cl)c2c(NCCCCCCC(=O)NCCc3c[nH]c4ccccc34)...` | `MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGG...` | Acetylcholinesterase | 1 | `['MRP', 'RPP', 'PPQ', 'PQC', 'QCL', 'CLL', 'LL...` | `['Cl', 'c', '1', 'c', 'c', '(', 'Cl', ')', 'c'...` | `{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...` |
| 4 | `O=C(CCCCCCNc1c2CCCCc2nc2ccccc12)NCCc1c[nH]c2cc...` | `MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGG...` | Acetylcholinesterase | 1 | `['MRP', 'RPP', 'PPQ', 'PQC', 'QCL', 'CLL', 'LL...` | `['O', '=', 'C', '(', 'C', 'C', 'C', 'C', 'C', ...` | `{'mol': <rdkit.Chem.rdchem.Mol object at 0x000...` |


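**parse_3grams(x):** The CSV stores each 3-gram list as a string. This function uses ast.literal_eval to convert that string back into an actual Python list.

**grams_to_prot2vec(...):** This looks up each 3-gram in our Prot2Vec "dictionary" and gets its vector, skipping any 3-gram that is not in the model's vocabulary. It then takes the mean (average) of all vectors in a sequence to create one single, fixed-length representation for the whole protein. If none of the 3-grams are in the vocabulary, it returns a zero vector of the same length.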

In [19]:
import ast

def parse_3grams(x):
    return ast.literal_eval(x)
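
To see why the parsing step is needed, here is a small standalone sketch. The sample string is made up for illustration, shortened from the kind of value shown in the 3-gram tokens column. A list written to CSV comes back as a plain string, and ast.literal_eval turns it back into a list while only accepting Python literals (it does not run arbitrary code):

```python
import ast

# Illustrative value only: how a 3-gram list looks after a CSV round trip.
raw = "['MRP', 'RPP', 'PPQ']"

grams = ast.literal_eval(raw)

print(type(raw).__name__)    # str
print(type(grams).__name__)  # list
print(grams[0])              # MRP
```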

In [20]:
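def grams_to_prot2vec(grams, model):
    # Keep only the 3-grams that exist in the model's vocabulary.
    vectors = [model.wv[g] for g in grams if g in model.wv]
    if len(vectors) == 0:
        # No known 3-grams: fall back to a zero vector of the right size.
        return np.zeros(model.vector_size)
    # Average across 3-grams to get one fixed-length vector per protein.
    return np.mean(vectors, axis=0)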
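
The averaging logic can be checked by hand on a toy example. The sketch below does not load the real model: toy_vectors is a hypothetical 2-dimensional lookup table standing in for prot2vec.wv, and mean_pool is an illustrative helper that mirrors the same steps (skip unknown 3-grams, average the rest, fall back to zeros).

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for a trained model.
toy_vectors = {
    "MRP": np.array([1.0, 0.0]),
    "RPP": np.array([0.0, 1.0]),
}

def mean_pool(grams, lookup, dim):
    """Average the vectors of known 3-grams; return zeros if none are known."""
    vectors = [lookup[g] for g in grams if g in lookup]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

known = mean_pool(["MRP", "RPP", "XXX"], toy_vectors, dim=2)  # "XXX" is skipped
unknown = mean_pool(["XXX", "YYY"], toy_vectors, dim=2)       # nothing matches

print(known)    # [0.5 0.5]
print(unknown)  # [0. 0.]
```

One consequence of the fallback: every protein whose 3-grams are all out of vocabulary receives the same all-zero vector, so it may be worth counting how often that happens in a given dataset.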

In [21]:
protein_vectors = np.vstack(
    df["3-gram tokens"]
      .apply(parse_3grams)
      .apply(lambda g: grams_to_prot2vec(g, prot2vec))
)

**prot_cols:** We name each of the columns (e.g., prot2vec_0, prot2vec_1, etc.), one per dimension of the embedding, using the model's vector size (prot2vec.vector_size).

**pd.DataFrame(...):** This puts our protein vectors into a table with one row per row of the original dataset, in the same order.

In [22]:
prot_cols = [f"prot2vec_{i}" for i in range(prot2vec.vector_size)]
prot_df = pd.DataFrame(protein_vectors, columns=prot_cols)

Organizing the data this way makes it easy to "fuse" these protein features with the compound features (from MolBERT and GAT) during the training of the final Mnemosyne model.
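
As a rough sketch of what such a fusion could look like, the example below concatenates protein features with compound features column-wise. The frames toy_prot_df, molbert_df and gat_df, their column names, and their sizes are hypothetical placeholders filled with random numbers, not the actual Mnemosyne pipeline. The sketch assumes every feature table has one row per dataset row, in the same order.

```python
import numpy as np
import pandas as pd

n_rows = 3  # toy dataset size

# Hypothetical stand-ins for the real feature tables (random values).
rng = np.random.default_rng(0)
toy_prot_df = pd.DataFrame(rng.normal(size=(n_rows, 4)),
                           columns=[f"prot2vec_{i}" for i in range(4)])
molbert_df = pd.DataFrame(rng.normal(size=(n_rows, 2)),
                          columns=[f"molbert_{i}" for i in range(2)])
gat_df = pd.DataFrame(rng.normal(size=(n_rows, 2)),
                      columns=[f"gat_{i}" for i in range(2)])

# Column-wise concatenation. Rows are matched by index label, so all frames
# should share the same default RangeIndex.
fused = pd.concat([toy_prot_df, molbert_df, gat_df], axis=1)

print(fused.shape)  # (3, 8)
```

Because pd.concat aligns on index labels rather than on position, a frame that was filtered or shuffled beforehand would need reset_index(drop=True) before being fused this way.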

In [24]:
prot_df.to_csv("Prot2Vec.csv", index=False)

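Finally, we save the extracted features to a CSV file named Prot2Vec.csv. Passing index=False leaves out the row index, which avoids an extra "Unnamed: 0" column like the one in CPI_Data.csv when the file is read back.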
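
A quick way to confirm that a save like this round-trips cleanly is to write a small table, read it back, and compare. The sketch below uses made-up values and a temporary file rather than the real Prot2Vec.csv. The float32 dtype is chosen because gensim stores its vectors as float32, while pd.read_csv returns float64, so the comparison uses a tolerance instead of exact equality.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy float32 features, mimicking the dtype gensim uses for its vectors.
features = pd.DataFrame(
    np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32),
    columns=["prot2vec_0", "prot2vec_1"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.csv")
    features.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same column names, so no stray index column was written.
same_layout = list(reloaded.columns) == list(features.columns)
# Values agree up to floating-point tolerance.
same_values = np.allclose(reloaded.to_numpy(), features.to_numpy())

print(same_layout, same_values)  # True True
```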