In this part of the tutorial, we run two ontology-based methods to produce vector representations of biological entities: Onto2Vec and OPA2Vec.

## Onto2Vec

Onto2Vec produces vector representations based on the logical axioms of an ontology and the known associations between ontology classes and biological entities. In the case study below, we use Onto2Vec to produce vector representations of proteins based on their GO annotations and the GO logical axioms.
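To make the idea concrete, the sketch below shows, on invented toy data, how ontology axioms and entity annotations could be flattened into a corpus of "sentences" that a word-embedding model can learn from. It assumes that this is roughly how the input is prepared; the axiom strings, the `hasFunction` relation name, the IDs, and the `build_corpus` helper are all hypothetical and are not taken from the Onto2Vec code.

```python
# Toy illustration only: axioms, IDs, and helper names are hypothetical.
def build_corpus(axioms, annotations, relation="hasFunction"):
    """Return one token list ("sentence") per axiom and per annotation."""
    corpus = [axiom.split() for axiom in axioms]
    for entity, go_class in annotations:
        # An annotation becomes a three-token sentence: entity, relation, class.
        corpus.append([entity, relation, go_class])
    return corpus


toy_axioms = [
    "GO_B SubClassOf GO_A",
    "GO_C SubClassOf GO_A",
]
toy_annotations = [("P1", "GO_B"), ("P2", "GO_C")]

corpus = build_corpus(toy_axioms, toy_annotations)
for sentence in corpus:
    print(" ".join(sentence))
```

Because proteins and ontology classes appear in the same sentences, an embedding model trained on such a corpus places them in one shared vector space.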

In [10]:
org_id = '4932'  # yeast; use '9606' for human data
!python onto2vec/runOnto2Vec.py -ontology data/go.owl -associations data/train/{org_id}.OPA_associations.txt -outfile data/{org_id}.onto2vec_vecs -entities data/train/{org_id}.protein_list.txt

	
		*********** Onto2Vec Running ... ***********


		1.Reasoning over ontology ...

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Loading of Axioms ...
Loading ...
    1% ... 98% (intermediate progress lines omitted)
    ... finished
    ... finished
Property Saturation Initialization ...
    ... finished
Reflexive Property Computation ...
    ... finished
Object Property Hierarchy and Composition Computation ...

## OPA2Vec

In addition to the ontology axioms and their entity associations, OPA2Vec also uses ontology metadata and literature to represent biological entities. The code below runs OPA2Vec on GO and protein-GO associations to produce protein vector representations.

In [9]:
!python opa2vec/runOPA2Vec.py  -ontology data/go.owl -associations data/train/{org_id}.OPA_associations.txt -outfile data/{org_id}.opa2vec_vecs -entities data/train/{org_id}.protein_list.txt

	
		*********** OPA2Vec Running ... ***********


		1.Ontology Processing ...

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Loading of Axioms ...
Loading ...
    1% ... 99% (intermediate progress lines omitted)
    ... finished
    ... finished
Property Saturation Initialization ...
    ... finished
Reflexive Property Computation ...
    ... f

## Generate features

Map each protein to its corresponding vector.

In [11]:
org_id = '9606'  # or '4932' for yeast data
onto2vec_map = {}
opa2vec_map = {}
# Each line holds a protein ID followed by its space-separated vector components.
with open(f'data/{org_id}.onto2vec_vecs', 'r') as f:
    for line in f:
        protein, vector = line.strip().split(" ", maxsplit=1)
        onto2vec_map[protein] = vector
with open(f'data/{org_id}.opa2vec_vecs', 'r') as f:
    for line in f:
        protein, vector = line.strip().split(" ", maxsplit=1)
        opa2vec_map[protein] = vector
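The maps above keep each vector as a raw string, which is convenient for writing feature files but not for arithmetic. The sketch below converts such a line into numbers, using a made-up line in the same `ID v1 v2 ...` layout; the `parse_vector_line` helper and the sample values are illustrative assumptions.

```python
def parse_vector_line(line):
    """Split 'ID v1 v2 ...' into (ID, list of floats)."""
    protein, vector = line.strip().split(" ", maxsplit=1)
    return protein, [float(x) for x in vector.split()]


protein, vec = parse_vector_line("P1 0.5 -1.25 2.0\n")
print(protein, vec)  # P1 [0.5, -1.25, 2.0]
```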


Generate pair features for the training/validation/testing datasets

In [13]:
import random

data_type = ['train', 'valid', 'test']
for i in data_type:
    pair_data = []
    label_map = {}
    # Positive pairs (known interactions) get label 1.
    with open(f'data/{i}/{org_id}.protein.links.v11.0.txt', 'r') as f1:
        for line in f1:
            prot1, prot2 = line.strip().split()
            pair_data.append((prot1, prot2))
            label_map[(prot1, prot2)] = 1
    # Negative pairs get label 0.
    with open(f'data/{i}/{org_id}.negative_interactions.txt', 'r') as f2:
        for line in f2:
            prot1, prot2 = line.strip().split()
            pair_data.append((prot1, prot2))
            label_map[(prot1, prot2)] = 0
    random.shuffle(pair_data)
    with open(f'data/{i}/{org_id}.onto2vec_features', 'w') as f3, \
         open(f'data/{i}/{org_id}.opa2vec_features', 'w') as f4, \
         open(f'data/{i}/{org_id}.labels', 'w') as f5, \
         open(f'data/{i}/{org_id}.pairs', 'w') as f6:
        for prot1, prot2 in pair_data:
            # Keep only pairs for which both proteins have both kinds of vectors.
            if all(p in onto2vec_map and p in opa2vec_map for p in (prot1, prot2)):
                f6.write(f'{prot1} {prot2}\n')
                f5.write(f'{label_map[(prot1, prot2)]}\n')
                f4.write(f'{opa2vec_map[prot1]} {opa2vec_map[prot2]}\n')
                f3.write(f'{onto2vec_map[prot1]} {onto2vec_map[prot2]}\n')
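Each feature line written above is simply the two protein vectors placed side by side. A minimal sketch of that layout, and of how a reader of the file could split a line back into its two halves, is shown below; the helper names and the three-dimensional sample vectors are made up for illustration.

```python
def make_pair_feature(vec1, vec2):
    """Join two space-separated vector strings into one feature line."""
    return f"{vec1} {vec2}"


def split_pair_feature(line):
    """Recover the two vectors (as float lists) from a feature line."""
    values = [float(x) for x in line.split()]
    half = len(values) // 2
    return values[:half], values[half:]


line = make_pair_feature("0.1 0.2 0.3", "0.4 0.5 0.6")
v1, v2 = split_pair_feature(line)
print(v1, v2)  # [0.1, 0.2, 0.3] [0.4, 0.5, 0.6]
```

Splitting in half only works because both vectors of a pair come from the same embedding method and therefore have the same length.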
                                    

## Cosine similarity

We compute the cosine similarity between protein vectors to explore the neighbors of each protein and find its most similar proteins. Interaction prediction is then based on the similarity value, under the assumption that proteins with highly similar feature vectors are more likely to interact.
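As a quick reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below, on made-up two-dimensional vectors, shows the two extreme cases; the `cosine_similarity` helper is illustrative only. Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, that is, one minus this value.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 3.0]  # orthogonal to a
print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```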

In [None]:
import numpy as np
from scipy.spatial.distance import cosine
from itertools import islice


def to_array(vector):
    # The maps above store each vector as a space-separated string.
    return np.array(vector.split(), dtype=float)


# Write the cosine similarity of every pair in each dataset split.
# Pairs are read from the .pairs files, so each line aligns with the .labels file.
# scipy's cosine() returns a distance, so the similarity is 1 - distance.
for split in ['train', 'valid', 'test']:
    with open(f'data/{split}/{org_id}.pairs', 'r') as pairs, \
         open(f'data/{split}/{org_id}.onto_sim', 'w') as onto_cos, \
         open(f'data/{split}/{org_id}.opa_sim', 'w') as opa_cos:
        for line in pairs:
            prot1, prot2 = line.strip().split()
            cosine_onto = 1 - cosine(to_array(onto2vec_map[prot1]), to_array(onto2vec_map[prot2]))
            cosine_opa = 1 - cosine(to_array(opa2vec_map[prot1]), to_array(opa2vec_map[prot2]))
            onto_cos.write(f'{cosine_onto}\n')
            opa_cos.write(f'{cosine_opa}\n')

# Find the n most similar proteins to a query protein.
query = "A0A024RBG1"  # example ID; it must be a key of opa2vec_map
n = 10

# Mapping entities to vectors
classes = list(opa2vec_map)
vectors_map = {p: to_array(opa2vec_map[p]) for p in classes}

cosine_sim = {}
for p in classes:
    if p != query:
        cosine_sim[p] = 1 - cosine(vectors_map[p], vectors_map[query])

# Retrieving neighbors, most similar first
sortedmap = sorted(cosine_sim, key=cosine_sim.get, reverse=True)
for rank, d in enumerate(islice(sortedmap, n), start=1):
    print(f'{rank}. {d}\t{cosine_sim[d]}')

## Evaluation

In [None]:
import numpy as np
from scipy.stats import rankdata


def load_test_data(data_file):
    """Read protein pairs, one whitespace-separated pair per line."""
    data = []
    with open(data_file, 'r') as f:
        for line in f:
            it = line.strip().split()
            data.append((it[0], it[1]))
    return data


def compute_rank_roc(ranks, n_prots):
    """Area under the cumulative rank curve, normalized by the number of proteins."""
    auc_x = sorted(ranks.keys())
    auc_y = []
    tpr = 0
    sum_rank = sum(ranks.values())
    for x in auc_x:
        tpr += ranks[x]
        auc_y.append(tpr / sum_rank)
    auc_x.append(n_prots)
    auc_y.append(1)
    # np.trapz was renamed to np.trapezoid in NumPy 2.0
    trapezoid = np.trapezoid if hasattr(np, 'trapezoid') else np.trapz
    return trapezoid(auc_y, auc_x) / n_prots


# Index the proteins and stack their Onto2Vec vectors into a unit-normalized matrix.
proteins = list(onto2vec_map)
prot_dict = {p: idx for idx, p in enumerate(proteins)}
prot_embeds = np.array([onto2vec_map[p].split() for p in proteins], dtype=np.float32)
prot_embeds /= np.linalg.norm(prot_embeds, axis=1, keepdims=True)

# Known training interactions, used for the filtered ranks.
# Assumption: training partners of a protein are pushed to the bottom of its
# ranking so that they do not compete with the test interaction being scored.
train_partners = {}
for c, d in load_test_data(f'data/train/{org_id}.protein.links.v11.0.txt'):
    if c in prot_dict and d in prot_dict:
        train_partners.setdefault(prot_dict[c], []).append(prot_dict[d])

# Load test data and compute ranks for each protein
test_data = load_test_data(f'data/test/{org_id}.protein.links.v11.0.txt')
eval_data = [(prot_dict[c], prot_dict[d]) for c, d in test_data
             if c in prot_dict and d in prot_dict]
top1 = top10 = top100 = mean_rank = 0
ftop1 = ftop10 = ftop100 = fmean_rank = 0
ranks = {}
franks = {}
n = len(eval_data)
for c, d in eval_data:
    # Cosine distance from protein c to every protein (smaller means more similar)
    res = 1 - prot_embeds @ prot_embeds[c]
    res[c] = np.inf  # exclude the query protein itself
    index = rankdata(res, method='average')
    rank = index[d]
    if rank == 1:
        top1 += 1
    if rank <= 10:
        top10 += 1
    if rank <= 100:
        top100 += 1
    mean_rank += rank
    ranks[rank] = ranks.get(rank, 0) + 1

    # Filtered rank
    fres = res.copy()
    if c in train_partners:
        fres[train_partners[c]] = np.inf
    index = rankdata(fres, method='average')
    rank = index[d]
    if rank == 1:
        ftop1 += 1
    if rank <= 10:
        ftop10 += 1
    if rank <= 100:
        ftop100 += 1
    fmean_rank += rank
    franks[rank] = franks.get(rank, 0) + 1

top1 /= n
top10 /= n
top100 /= n
mean_rank /= n
ftop1 /= n
ftop10 /= n
ftop100 /= n
fmean_rank /= n

rank_auc = compute_rank_roc(ranks, len(proteins))
frank_auc = compute_rank_roc(franks, len(proteins))

print(f'{top10:.2f} {top100:.2f} {mean_rank:.2f} {rank_auc:.2f}')
print(f'{ftop10:.2f} {ftop100:.2f} {fmean_rank:.2f} {frank_auc:.2f}')
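The ranking metrics used in this cell can be checked by hand on a tiny example. The sketch below uses invented distances for three query proteins and four candidates each; `rank_of_target` and `hits_at_k` are hypothetical helpers written for this illustration, and ties are assumed not to occur.

```python
def rank_of_target(distances, target):
    """1-based rank of distances[target] when sorted ascending (no ties assumed)."""
    return 1 + sum(d < distances[target] for d in distances)


def hits_at_k(ranks, k):
    """Fraction of queries whose true partner is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# (distances from the query to each candidate, index of the true partner)
toy = [
    ([0.9, 0.1, 0.5, 0.7], 1),
    ([0.2, 0.8, 0.4, 0.6], 2),
    ([0.3, 0.2, 0.1, 0.9], 3),
]
toy_ranks = [rank_of_target(d, t) for d, t in toy]
print(toy_ranks)  # [1, 2, 4]
print(hits_at_k(toy_ranks, 1), hits_at_k(toy_ranks, 2))
print(sum(toy_ranks) / len(toy_ranks))  # mean rank
```

A lower mean rank and a higher fraction of hits within the top k both indicate that true interaction partners sit closer to the query in the embedding space.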

## Siamese neural network

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters
num_epochs = 100
num_classes = 2
batch_size = 50
learning_rate = 0.0001


# Load dataset
def load_split(split):
    # Assumption: the pair features written by the feature-generation cell are
    # used as input; each line holds the two protein vectors side by side,
    # so it is split in half.
    features = np.loadtxt(f'data/{split}/{org_id}.onto2vec_features', ndmin=2)
    labels = np.loadtxt(f'data/{split}/{org_id}.labels', ndmin=1)
    half = features.shape[1] // 2
    x1 = torch.from_numpy(features[:, :half]).float()
    x2 = torch.from_numpy(features[:, half:]).float()
    y = torch.from_numpy(labels).long()
    return TensorDataset(x1, x2, y)


train_data = load_split('train')
test_data = load_split('test')
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)


# Define network
class Net(nn.Module):
    def __init__(self, input_dim=200):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(input_dim, 600), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(600, 400), nn.ReLU())
        self.layer3 = nn.Sequential(nn.Linear(400, 200), nn.ReLU())
        self.drop_out = nn.Dropout()
        self.dis = nn.Linear(200, num_classes)

    def forward(self, data):
        # Both inputs pass through the same layers (shared weights).
        res = []
        for x in data:
            out = self.layer1(x)
            out = self.layer2(out)
            out = self.layer3(out)
            out = self.drop_out(out)
            res.append(out)
        # Classify the pair from the absolute difference of the two encodings.
        output = torch.abs(res[1] - res[0])
        return self.dis(output)


# Create network
network = Net(input_dim=train_data.tensors[0].shape[1])

# Use cross entropy for back propagation
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
loss_list = []
acc_list = []
network.train()
for epoch in range(num_epochs):
    for i, (x1, x2, labels) in enumerate(train_loader):
        # Run the forward pass
        outputs = network((x1, x2))
        loss = criterion(outputs, labels)
        loss_list.append(loss.item())

        # Back propagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Get prediction
        total = labels.size(0)
        _, predicted = torch.max(outputs, 1)
        correct = (predicted == labels).sum().item()
        acc_list.append(correct / total)

        if (i + 1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'.format(
                epoch + 1, num_epochs, i + 1, total_step, loss.item(), (correct / total) * 100))

# Test the model
network.eval()
all_outputs = []
with torch.no_grad():
    correct = 0
    total = 0
    for x1, x2, labels in test_loader:
        outputs = network((x1, x2))
        all_outputs.append(outputs.cpu().numpy())
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
# Save the raw outputs for all test pairs in one file
np.savetxt('output.csv', np.concatenate(all_outputs))
print('Accuracy of model on test dataset is: {} %'.format((correct / total) * 100))
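One property of this design worth noting: because both proteins are encoded with the same weights and combined through an absolute difference, the pair representation does not depend on the order of the two inputs. The NumPy sketch below illustrates this with a toy one-layer encoder; the random weights, the dimensions, and the helper names are assumptions made for the illustration and do not reproduce the network above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # shared weights of a toy one-layer encoder


def encode(x):
    """Toy shared encoder: linear layer followed by ReLU."""
    return np.maximum(x @ W, 0.0)


def pair_representation(x1, x2):
    """Absolute difference of the two encodings."""
    return np.abs(encode(x2) - encode(x1))


a = rng.normal(size=4)
b = rng.normal(size=4)
print(np.allclose(pair_representation(a, b), pair_representation(b, a)))  # True
```

This symmetry matches the task, since a protein-protein interaction has no inherent direction.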
