# Testing with the Original Embeddings from the CSV file and the Embeddings from the Morgan Fingerprint model

## COVID-19 Drug Repurposing via disease-compounds relations
This example shows how to do drug repurposing using DRKG even with the pretrained model.

In [None]:
import csv
import numpy as np
import pandas as pd
import sys
import torch as th
import torch.nn.functional as fn

import tensorflow as tf
from tensorflow import keras
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolDescriptors as rd
from eosce.models import ErsiliaCompoundEmbeddings

## Collecting COVID-19 related disease
At the very beginning we need to collect a list of disease of Corona-Virus(COV) in DRKG. We can easily use the Disease ID that DRKG uses for encoding the disease. Here we take all of the COV disease as target.

In [2]:
COV_disease_list = [
'Disease::SARS-CoV2 E',
'Disease::SARS-CoV2 M',
'Disease::SARS-CoV2 N',
'Disease::SARS-CoV2 Spike',
'Disease::SARS-CoV2 nsp1',
'Disease::SARS-CoV2 nsp10',
'Disease::SARS-CoV2 nsp11',
'Disease::SARS-CoV2 nsp12',
'Disease::SARS-CoV2 nsp13',
'Disease::SARS-CoV2 nsp14',
'Disease::SARS-CoV2 nsp15',
'Disease::SARS-CoV2 nsp2',
'Disease::SARS-CoV2 nsp4',
'Disease::SARS-CoV2 nsp5',
'Disease::SARS-CoV2 nsp5_C145A',
'Disease::SARS-CoV2 nsp6',
'Disease::SARS-CoV2 nsp7',
'Disease::SARS-CoV2 nsp8',
'Disease::SARS-CoV2 nsp9',
'Disease::SARS-CoV2 orf10',
'Disease::SARS-CoV2 orf3a',
'Disease::SARS-CoV2 orf3b',
'Disease::SARS-CoV2 orf6',
'Disease::SARS-CoV2 orf7a',
'Disease::SARS-CoV2 orf8',
'Disease::SARS-CoV2 orf9b',
'Disease::SARS-CoV2 orf9c',
'Disease::MESH:D045169',
'Disease::MESH:D045473',
'Disease::MESH:D001351',
'Disease::MESH:D065207',
'Disease::MESH:D028941',
'Disease::MESH:D058957',
'Disease::MESH:D006517'
]

## Treatment relation

Two treatment relations in this context

In [3]:
treatment = ['Hetionet::CtD::Compound:Disease','GNBR::T::Compound:Disease']

## Using Original Embeddings

In [4]:
# Read the input file with the SMILES and original embedding columns
input_df = pd.read_csv('smiles_embeddings_infer_drugs.csv')
drug_smiles = input_df['SMILES']
original_embeddings = input_df.iloc[:, 3:].values
print(original_embeddings)

[[-0.27149197 -0.5939862  -0.37011808 ... -0.50732565  0.15921181
  -0.67021894]
 [-0.4293835  -0.35515165 -0.45263517 ...  0.6304008   0.44173548
  -0.43939406]
 [-0.6724328  -0.2223129  -0.5301088  ...  0.37865484  0.36450392
  -0.3003229 ]
 ...
 [ 0.16583948  0.6799841  -0.5094551  ... -0.6383422  -0.52847636
  -0.60481673]
 [ 0.5265029   0.631999   -0.58314145 ...  0.52151537 -0.53969806
   0.5051691 ]
 [-0.5288267  -0.43796182 -0.6211387  ... -0.58782583 -0.57391274
  -0.54813313]]


To reproduce the authors' code, I need the drug bank ID. So I will create a mapping that gets the drugbank ID of a molecule from the SMILES. I'm using the `drugbank_info/drugbank_smiles.txt` file for the mapping.

The dictionary will look like this {'CCCC': 'Compound::DB000'}

The limitation here is that the new input SMILES may not have an associated DrugBank ID that's in the `drugbank_info/drugbank_smiles.txt` file.

In [5]:
# Read the content of the text file
with open('drugbank_smiles.txt', 'r') as file:
    file_content = file.read()

# Split the content into lines and iterate over them
lines = file_content.split('\n')
drugbank_dict = {}

for line in lines:
    # Split each line into SMILES and DrugBank ID
    fields = line.split('\t')
    
    if len(fields) == 2:
        smiles = fields[1]
        drugbank_id = 'Compound::' + fields[0]
        # Add the SMILES and DrugBank ID to the dictionary
        drugbank_dict[smiles] = drugbank_id

# flip the drugbank_dict so we can get the SMILES from the drugbank ID
smiles_dict = {value: key for key, value in drugbank_dict.items()}

# Print the resulting dictionary
# print(drugbank_dict)

In [6]:
# Get a list of the drugs in the input file
drug_list = []

# Use the SMILES to extract the drugBank ID and append it to the list
for smiles in drug_smiles:
    drug_list.append(drugbank_dict.get(smiles))

# An edge case is that the list could contain None values
print('There are', len(drug_list), 'input drugbank ID before filtering out None')
drug_list = list(filter(lambda x: x is not None, drug_list))
print('There are', len(drug_list), 'input drugbank ID after filtering out None')
print(drug_list[:5])

There are 6521 input drugbank ID before filtering out None
There are 6521 input drugbank ID after filtering out None
['Compound::DB00605', 'Compound::DB00983', 'Compound::DB01240', 'Compound::DB11755', 'Compound::DB12184']


In [103]:
# # Quick check to make sure my embeddings from the dataset match the original embeddings from drkg
# entity_idmap_file = '../data/embed/entities.tsv'
# entity_map = {}
# entity_id_map = {}
   
# with open(entity_idmap_file, newline='', encoding='utf-8') as csvfile:
#     reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['name', 'id'])
#     for row_val in reader:
#         entity_map[row_val['name']] = int(row_val['id'])
#         entity_id_map[int(row_val['id'])] = row_val['name']

# entity_emb = np.load('../data/embed/DRKG_TransE_l2_entity.npy')
# drug_ids = [entity_map[drug] for drug in drug_list]
# drug_ids = th.tensor(drug_ids).long()
# drug_emb = th.tensor(entity_emb[drug_ids]) # embedding from knowledge graph

# my_drug_emb = th.tensor(original_embeddings) # my extracted embeddings
# display(drug_emb)
# display(my_drug_emb)

# print(len(drug_emb))
# print(len(my_drug_emb))

tensor([[-0.2715, -0.5940, -0.3701,  ..., -0.5073,  0.1592, -0.6702],
        [-0.4294, -0.3552, -0.4526,  ...,  0.6304,  0.4417, -0.4394],
        [-0.6724, -0.2223, -0.5301,  ...,  0.3787,  0.3645, -0.3003],
        ...,
        [ 0.1658,  0.6800, -0.5095,  ..., -0.6383, -0.5285, -0.6048],
        [ 0.5265,  0.6320, -0.5831,  ...,  0.5215, -0.5397,  0.5052],
        [-0.5288, -0.4380, -0.6211,  ..., -0.5878, -0.5739, -0.5481]])

tensor([[-0.2715, -0.5940, -0.3701,  ..., -0.5073,  0.1592, -0.6702],
        [-0.4294, -0.3552, -0.4526,  ...,  0.6304,  0.4417, -0.4394],
        [-0.6724, -0.2223, -0.5301,  ...,  0.3787,  0.3645, -0.3003],
        ...,
        [ 0.1658,  0.6800, -0.5095,  ..., -0.6383, -0.5285, -0.6048],
        [ 0.5265,  0.6320, -0.5831,  ...,  0.5215, -0.5397,  0.5052],
        [-0.5288, -0.4380, -0.6211,  ..., -0.5878, -0.5739, -0.5481]],
       dtype=torch.float64)

6521
6521


In [7]:
gamma = 12.0

def transE_l2(head, rel, tail):
    score = head + rel - tail
    return gamma - th.norm(score, p=2, dim=-1)


def edge_score(embeddings):
    '''Function to calculate the edge scores.

    Argument
    ---------
    embeddings (array). Array of size 400 containing 
            the embeddings of the SMILES molecule.

    Returns
    --------
    scores (tensor). Tensor showing the edge score for 
            each disease based on the drug_embeddings, relation_embeddings,
            and COVID_disease embeddings.
    '''
    
    # Load entity and relation mapping files
    entity_idmap_file = '../data/embed/entities.tsv'
    relation_idmap_file = '../data/embed/relations.tsv'

    # Get drugname/disease name to entity ID mappings
    entity_map = {}
    entity_id_map = {}
    relation_map = {}
    
    with open(entity_idmap_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['name', 'id'])
        for row_val in reader:
            entity_map[row_val['name']] = int(row_val['id'])
            entity_id_map[int(row_val['id'])] = row_val['name']

    with open(relation_idmap_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['name', 'id'])
        for row_val in reader:
            relation_map[row_val['name']] = int(row_val['id'])

    # Handle the ID mapping
    drug_ids = [entity_map[drug] for drug in drug_list]
    disease_ids = [entity_map[disease] for disease in COV_disease_list]
    treatment_rid = [relation_map[treat] for treat in treatment]

    # Load embeddings
    entity_emb = np.load('../data/embed/DRKG_TransE_l2_entity.npy')
    rel_emb = np.load('../data/embed/DRKG_TransE_l2_relation.npy')

    drug_ids = th.tensor(drug_ids).long()
    disease_ids = th.tensor(disease_ids).long()
    treatment_rid = th.tensor(treatment_rid)

    # Use our model's embeddings here
    drug_emb = th.tensor(embeddings)
    # drug_emb = th.tensor(entity_emb[drug_ids]) # get embeddings from knowledge graph
    
    treatment_embs = [th.tensor(rel_emb[rid]) for rid in treatment_rid]

    scores_per_disease = []
    dids = []
    for rid in range(len(treatment_embs)):
        treatment_emb=treatment_embs[rid]
        for disease_id in disease_ids:
            disease_emb = entity_emb[disease_id]
            score = fn.logsigmoid(transE_l2(drug_emb, treatment_emb, disease_emb))
            scores_per_disease.append(score)
            dids.append(drug_ids)
    scores = th.cat(scores_per_disease)
    dids = th.cat(dids)
    
    # Sort scores in descending order
    idx = th.flip(th.argsort(scores), dims=[0])
    scores = scores[idx].numpy()
    dids = dids[idx].numpy()

    # Now we output proposed treatments
    _, unique_indices = np.unique(dids, return_index=True)
    topk=100
    topk_indices = np.sort(unique_indices)[:topk]
    proposed_dids = dids[topk_indices]
    proposed_scores = scores[topk_indices]

    # Now we list the pairs in the form of (drug, treat, disease, score)
    # We select the top K relevant drugs according to the edge score
    edge_score_dict = {}
    for i in range(topk):
        drug = int(proposed_dids[i])
        score = proposed_scores[i]
        edge_score_dict.update({smiles_dict.get(entity_id_map[drug]): score})
        # remove the print function later
        print("{}\t{}".format(entity_id_map[drug], score))
        # show the smiles and the score
        # print("{}\t{}\t{}".format(smiles_dict.get(entity_id_map[drug]), entity_id_map[drug], score))
                    
    return edge_score_dict

In [10]:
# Check edge score
print(len(original_embeddings))
original_score = edge_score(original_embeddings)
print(original_score)

6521
Compound::DB00811	-0.21416790981372616
Compound::DB00993	-0.835088998203896
Compound::DB00635	-0.8974785851566892
Compound::DB01082	-0.9854882290451629
Compound::DB01234	-0.9984012514597904
Compound::DB00982	-1.0160717658176406
Compound::DB00563	-1.018946852539912
Compound::DB00290	-1.0641067344374928
Compound::DB01394	-1.0806760468321792
Compound::DB01222	-1.0845477037354971
Compound::DB00415	-1.0853976195309496
Compound::DB01004	-1.096669807865025
Compound::DB00860	-1.1004784467889137
Compound::DB00681	-1.1011557352112846
Compound::DB00688	-1.1256872785724772
Compound::DB00624	-1.1428297627951873
Compound::DB00959	-1.1618414920458742
Compound::DB00115	-1.186812089630245
Compound::DB00091	-1.1906721734060766
Compound::DB01024	-1.2051163728597931
Compound::DB00741	-1.2147067702585472
Compound::DB00441	-1.2320411184654292
Compound::DB00158	-1.2346547610031624
Compound::DB00499	-1.2525165420743032
Compound::DB00929	-1.273049804221849
Compound::DB00770	-1.282553108383881
Compound::DB

The output for the first drug and some other drugs point matches those in the original code.

Note: not all the drugs in this list are in the drug list the original authors used in the `infer_drug.tsv` file. 
Remember that the SMILES of some drugs were missing. The input file here is filtered to only consider the drug from the `infer_drug.tsv` file that have an associated SMILES data.

In [11]:
# print(len(scores))

## Using Morgan Fingerprint model predicted embeddings

In [12]:
def calculate_morgan_fingerprint(smiles_series, radius=2, n_bits=2048):
    '''Function to convert SMILES to fingerprint using the Morgan Fingerprint for a Pandas Series.

    Parameters
    -----------
    smiles_series (Pandas Series): Series containing SMILES of the compounds.
    radius (int): controls the radius of the fingerprint.
    n_bits (int): controls the length of the fingerprint bit vector.

    Returns
    -------
    fingerprints (NumPy Array): fingerprints of SMILES in the input series
    '''
    fingerprints = []

    for smiles in smiles_series:
        # Convert the input SMILES string into an RDKit molecule object.
        mol = Chem.MolFromSmiles(smiles)
        # If the molecule conversion is successful, then generate the fingerprint
        if mol is not None:
            fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
            arr = np.zeros((1,))
            AllChem.DataStructs.ConvertToNumpyArray(fingerprint, arr)
            fingerprints.append(arr)

    return np.array(fingerprints)


morgan_fingerprints = calculate_morgan_fingerprint(drug_smiles)
print("The length of drug smiles is:", len(drug_smiles))
print("The length of morgan_fingerprints is:", len(morgan_fingerprints))
print(morgan_fingerprints)

[09:38:01] Unusual charge on atom 0 number of radical electrons set to zero


The length of drug smiles is: 6521
The length of morgan_fingerprints is: 6521
[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 1. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [13]:
# load the saved model
model = keras.models.load_model("mf_model")

# predict the embeddings of the drug Morgan fingerprints
mf_embeddings = model.predict(morgan_fingerprints)
print(mf_embeddings)

[[-0.14539145 -0.07305996 -0.28591534 ... -0.17516898  0.1930192
  -0.24724808]
 [-0.19003028 -0.01532836 -0.34017324 ... -0.03350564  0.27475265
  -0.4107109 ]
 [-0.18157184 -0.18642853 -0.53062004 ... -0.19943923  0.17700678
  -0.42576954]
 ...
 [ 0.15702343  0.5273648  -0.18834566 ...  0.09023038 -0.00148031
  -0.07461459]
 [ 0.26712334  0.35497493 -0.05726355 ...  0.06099455  0.00921259
   0.13521963]
 [-0.14691627  0.14291099 -0.08404915 ... -0.0940645   0.15892325
  -0.3037098 ]]


In [14]:
# Check edge score
mf_score = edge_score(mf_embeddings)
# print(mf_score)

Compound::DB00140	-0.4077950716018677
Compound::DB00348	-0.41536837816238403
Compound::DB00495	-0.41737428307533264
Compound::DB12028	-0.42042261362075806
Compound::DB00432	-0.42594149708747864
Compound::DB11797	-0.4264325499534607
Compound::DB00249	-0.4421328902244568
Compound::DB00507	-0.4443260729312897
Compound::DB04335	-0.4454585909843445
Compound::DB00200	-0.45016348361968994
Compound::DB13421	-0.4518240988254547
Compound::DB00446	-0.4569593667984009
Compound::DB01002	-0.45817339420318604
Compound::DB00297	-0.45817339420318604
Compound::DB00198	-0.46341702342033386
Compound::DB01048	-0.4650264084339142
Compound::DB03247	-0.4653477966785431
Compound::DB13795	-0.4672207832336426
Compound::DB00707	-0.46888118982315063
Compound::DB00811	-0.4717784821987152
Compound::DB00781	-0.47182223200798035
Compound::DB01280	-0.4769327640533447
Compound::DB00442	-0.478443443775177
Compound::DB11869	-0.4797147214412689
Compound::DB00770	-0.481482595205307
Compound::DB00460	-0.4841911494731903
Comp

## Using Morgan Fingerprint Count Embeddings

In [15]:
def calculate_morgan_fingerprint_count(smiles_series, radius=2, n_bits=2048):
    '''Function to convert SMILES to fingerprint using the Morgan Fingerprint Count for a Pandas Series.

    Parameters
    -----------
    smiles_series (Pandas Series): Series containing SMILES of the compounds.
    radius (int): controls the radius of the fingerprint.
    n_bits (int): controls the length of the fingerprint bit vector.

    Returns
    -------
    fingerprints (NumPy Array): fingerprints of SMILES in the input series
    '''
    fingerprints = []

    for smiles in smiles_series:
        # Convert the input SMILES string into an RDKit molecule object.
        mol = Chem.MolFromSmiles(smiles)
        # If the molecule conversion is successful, then generate the fingerprint
        if mol is not None:
            fingerprint = rd.GetHashedMorganFingerprint(mol, radius=radius, nBits=n_bits)
            arr = np.zeros((n_bits,), dtype=np.uint8)
            for idx, count in fingerprint.GetNonzeroElements().items():
                arr[idx] = count if count < 255 else 255
            fingerprints.append(arr)

    return np.array(fingerprints)


morgan_fingerprint_counts = calculate_morgan_fingerprint_count(drug_smiles)
print("The length of drug smiles is:", len(drug_smiles))
print("The length of morgan_fingerprint_counts is:", len(morgan_fingerprint_counts))
print(morgan_fingerprint_counts)

[09:38:32] Unusual charge on atom 0 number of radical electrons set to zero


The length of drug smiles is: 6521
The length of morgan_fingerprint_counts is: 6521
[[0 0 0 ... 0 0 0]
 [0 2 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [16]:
# load the saved model
model = keras.models.load_model("mfcount_model")

# predict the embeddings of the drug Morgan fingerprints
mfcount_embeddings = model.predict(morgan_fingerprints)
print(mfcount_embeddings)

[[-0.06825772  0.30465627 -0.12158845 ... -0.12277345  0.2011324
  -0.23716313]
 [-0.20824523 -0.01959234 -0.48053306 ...  0.12968855  0.3793604
  -0.48335794]
 [-0.08968617  0.09335519 -0.5797926  ... -0.05975341  0.31130734
  -0.5547119 ]
 ...
 [ 0.20823444  0.6298635  -0.24004918 ...  0.13009495  0.05724981
   0.09389218]
 [ 0.15205142  0.5382722  -0.35446736 ...  0.25447556 -0.01474598
   0.1489524 ]
 [ 0.04268427  0.48600382 -0.48005092 ...  0.08301169  0.08417349
   0.06148876]]


In [17]:
# Check edge score
mfc_score = edge_score(mfcount_embeddings)
# print(mfc_score)

Compound::DB04815	-0.3808886408805847
Compound::DB04335	-0.3813525438308716
Compound::DB13776	-0.38916975259780884
Compound::DB00432	-0.39102819561958313
Compound::DB00495	-0.40762394666671753
Compound::DB00249	-0.4102410078048706
Compound::DB01280	-0.4227445721626282
Compound::DB00242	-0.43035271763801575
Compound::DB00811	-0.4365576207637787
Compound::DB12028	-0.4375351071357727
Compound::DB00631	-0.4496358335018158
Compound::DB02315	-0.4517326056957245
Compound::DB00445	-0.45474064350128174
Compound::DB00997	-0.45474064350128174
Compound::DB01181	-0.45863136649131775
Compound::DB01629	-0.46170029044151306
Compound::DB09115	-0.4620950222015381
Compound::DB00299	-0.4674661159515381
Compound::DB02103	-0.4677414298057556
Compound::DB00531	-0.4702516794204712
Compound::DB13182	-0.47361448407173157
Compound::DB13069	-0.4784637689590454
Compound::DB00380	-0.47886425256729126
Compound::DB15427	-0.4801975190639496
Compound::DB13389	-0.4813809394836426
Compound::DB12901	-0.48162218928337097
C

## Using Ersilia Embeddings

In [18]:
def calculate_ersilia_fingerprint(smiles_series):
    '''Function to convert SMILES to embedding using the Ersilia Compound Embeddings

    Parameters
    -----------
    smiles_series (Pandas Series): Series containing SMILES of the compounds.
    
    Returns
    -------
    embeddings (NumPy Array): embeddings of SMILES in the input series
    '''
    fingerprints = []

    for smiles in smiles_series:
        model = ErsiliaCompoundEmbeddings()
        embeddings = model.transform([smiles])
        fingerprints.append(embeddings[0])
        
    return np.array(fingerprints)

ersilia_descriptor = calculate_ersilia_fingerprint(drug_smiles)
print("The length of drug smiles is:", len(drug_smiles))
print("The length of ersilia descriptor is:", len(ersilia_descriptor))
print(ersilia_descriptor)

[09:42:51] Unusual charge on atom 0 number of radical electrons set to zero


The length of drug smiles is: 6521
The length of ersilia descriptor is: 6521
[[ 0.0257058  -0.04700303  0.0038386  ... -0.04954001  0.00230189
   0.06074152]
 [-0.01488564  0.00200031  0.0076263  ... -0.02522343  0.00974543
  -0.05920968]
 [ 0.03319988  0.00829009  0.11105975 ... -0.06268101 -0.05922099
  -0.13830455]
 ...
 [ 0.14135769  0.04299419 -0.05294156 ... -0.11926071 -0.04383421
  -0.07094503]
 [-0.00320739 -0.07315785  0.02023898 ... -0.00566167 -0.04526901
   0.02813001]
 [ 0.05420799  0.01565388 -0.06837106 ... -0.04647554 -0.05509707
   0.00550069]]


In [19]:
print(ersilia_descriptor.shape)
# ersilia_descriptor = ersilia_descriptor.reshape((6521, 1024))
# print(ersilia_descriptor.shape)

(6521, 1024)


In [20]:
print(len(ersilia_descriptor))

6521


In [21]:
# load the saved model
model = keras.models.load_model("ersilia_model")

# predict the embeddings of the Ersilia descriptor
ersilia_embeddings = model.predict(ersilia_descriptor)
print(ersilia_embeddings)

[[ 0.01235009  0.13490671 -0.4879381  ... -0.19719847  0.48371866
  -0.45392495]
 [-0.2091402   0.11309679 -0.3237701  ...  0.02232243  0.21990132
  -0.2811888 ]
 [ 0.04469691  0.18003142 -0.41040844 ... -0.10965852  0.16823182
  -0.30109778]
 ...
 [-0.02620227  0.41637787 -0.0159163  ...  0.04635789  0.10165697
   0.00709242]
 [ 0.13899654  0.37279862 -0.20471048 ...  0.09301486  0.00790367
   0.1540771 ]
 [-0.10602047  0.17622861 -0.24496007 ... -0.10999662  0.29111537
  -0.1998289 ]]


In [22]:
# Check edge score
print(len(ersilia_embeddings))
ersilia_score = edge_score(ersilia_embeddings)
# print(ersilia_score)

6521
Compound::DB03609	-0.4530784487724304
Compound::DB12531	-0.4638391435146332
Compound::DB12028	-0.4656132757663727
Compound::DB00718	-0.470846951007843
Compound::DB06683	-0.47480902075767517
Compound::DB01629	-0.4816054105758667
Compound::DB01610	-0.4853154718875885
Compound::DB05020	-0.4909569025039673
Compound::DB06575	-0.4927115738391876
Compound::DB00406	-0.5021226406097412
Compound::DB15427	-0.5057070255279541
Compound::DB00249	-0.509706974029541
Compound::DB00495	-0.5119695067405701
Compound::DB12901	-0.516657292842865
Compound::DB00460	-0.5180578231811523
Compound::DB13421	-0.5205394625663757
Compound::DB02857	-0.5230657458305359
Compound::DB00442	-0.5237544178962708
Compound::DB00631	-0.5266939401626587
Compound::DB08596	-0.5288973450660706
Compound::DB04198	-0.5298973917961121
Compound::DB01157	-0.5334029197692871
Compound::DB00432	-0.5351057052612305
Compound::DB12753	-0.5413323044776917
Compound::DB02285	-0.541774570941925
Compound::DB03986	-0.5490472316741943
Compound::

## Compare the scores to determine the best embeddings
Use the R-squared score metric to see if the embedding edge score matches the original embedding edge score.

**Do you think I should make sure the dictionaries are not sorted based on their edge scores before comparing them with the R-squared metric?**

Since different drugs are in each list (top 100), this may not be a good approach to evaluating the best embedding.

In [1]:
# from sklearn.metrics import r2_score

# # Calculate R^2
# r_squared_mf = r2_score(list(original_score.values()), list(mf_score.values()))
# r_squared_mfc = r2_score(list(original_score.values()), list(mfc_score.values()))
# r_squared_ersilia = r2_score(list(original_score.values()), list(ersilia_score.values()))

# print(f"The R^2 score for the Morgan FingerPrint embedding is: {r_squared_mf}")
# print(f"The R^2 score for the Morgan FingerPrint Count embedding is: {r_squared_mfc}")
# print(f"The R^2 score for the Ersilia embedding is: {r_squared_ersilia}")

I'll also try using Mean Squared Error.

In [2]:
# from sklearn.metrics import mean_squared_error

# mse_mf = mean_squared_error(list(original_score.values()), list(mf_score.values()))
# mse_mfc = mean_squared_error(list(original_score.values()), list(mfc_score.values()))
# mse_ersilia = mean_squared_error(list(original_score.values()), list(ersilia_score.values()))

# print(f"The Mean Squared Error for the Morgan FingerPrint embedding is: {mse_mf}")
# print(f"The Mean Squared Error for the Morgan FingerPrint Count embedding is: {mse_mfc}")
# print(f"The Mean Squared Error for the Ersilia embedding is: {mse_ersilia}")