# Welcome to the OPA2Vec tutorial session.

## Introduction

OPA2Vec is a tool that can be used to produce feature vectors for biological entities based on an ontology and its annotations. 

The source code of OPA2Vec is available at: https://github.com/bio-ontology-research-group/opa2vec/.

## Dependencies

First of all, we need to prepare the environment to run OPA2Vec. OPA2Vec is implemented in 3 different programming languages python, groovy and perl. The versions used for each language are the following: 
- python 2.7.5
- groovy 2.4.10 JVM:1.8.0_121
- perl: v5.16.3

OPA2Vec also uses the gensim python library which requires scipy and numpy. Assuming you have numpy and scipy installed, you can install gensim by running the following in your terminal: 

In [None]:
%%bash
pip install --upgrade gensim

Now that our enviornment is hopefully ready, we can go ahead and run OPA2Vec. The folder where you found this tutorial contains some files that we could use as input samples, in particular:
- *phenomenet.owl*, the file containing the ontology we would like to use in owl format.
- *go_associations*, a file containing protein to GO function annotations. 
- *entities.lst*, an optional file containing the list of biological entities we are interested in getting a vector representation for.


Let's go ahead and run OPA2Vec from the command line: 

In [None]:
%%bash
python runOPA2Vec.py phenomenet.owl go_associations -entities entities.lst -annotations all -reasoner elk

If everything goes well, an output file, *AllVectorResults.lst*, will be created in no more than 20 min and should contain the obtained vector representations. A sample output file *AllVectorResults1* is available in this tutorial.

As you can see in the command above, in addition to the two mandatory input files , *phenomenet.owl* and *go_associations*, we have also specified three optional parameters: *all* for the *-annotations* parameter to make OPA2Vec use all annotation properties from the metadata of the ontology, *elk* as our reasoner of choice and the *entities.lst* file.

OPA2Vec allows you to choose more optional parameters depending on your data and type of application.  
In particular, the optional parameters we are allowed to specify in the command line are :
 
    -embedsize [embedding size]
    Size of obtained vectors (will depend on training model)

    -windsize [window size]
    Window size for word2vec model

    -mincount [min count]
    Minimum count value for word2vec model

    -model [model]
    Preferred word2vec architecture, sg or cbow

    -annotations [metadata annotations] List of full URIs of annotation properties to be included in metadata separated by a comma . Use 'all' for all annotation properties (default) or 'none' for no annotation property


    -pretrained [pre-trained model] Pre-trained word2vec model for background knowledge. If no pre-trained model is specified, the program will assume you have downloaded the default pre-trained model from http://bio2vec.net/data/pubmed_model/ 

Let's now have a look at what the output file looks like:

In [None]:
%%bash
head -45 AllVectorResults1

The vectors are printed accross multiple lines which makes them a bit hard to process. We can transform the vectors using the following script to a more convenient format: 

In [None]:
import os
import sys

# process vectors
file=open("processed_vectors",'w')

input_file="AllVectorResults1"

inf =open (input_file)
for line in inf:
	 line.strip().replace ('[',"").replace(']',"\n")
	 file.write (line.strip().replace ('[',"").replace(']',"\n")),

file.close

Our vectors look much better now :

In [None]:
%%bash
head -20 processed_vectors

Let's try a few things we can do with our vectors.
As an example, given a query protein *p*,let's find the *n* closest proteins to it based on pairwise cosine similarity of the obtained vectors. To do so, we first need to install the sklearn package containing the cosine similarity function in python:  

In [None]:
%%bash
 apt-get -y install python-sklearn

We can now go ahead and try to find the 10 closest neighbors to protein *A0A024RBG1* as an example. To speed up the calculation, we fix the set of entities we compare to a set of 1000 entity only.

In [7]:
import os
import sys
import numpy
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine 
from itertools import islice

#1.Defining query and # of neighbors (coud be given as input)
#query =str(sys.argv[1])
#n = int (sys.argv[2])
query ="A0A024RBG1"
n=10

#2.Reading input: vectors and entities
vectors=numpy.loadtxt("sample_vectors");
text_file="sample_entities"
classfile=open (text_file)
mylist=[]
for linec in classfile:
	mystr=linec.strip()
	mylist.append(mystr)


#3.Mapping Entities to Vectors
vectors_map={}
for i in range(0,len(mylist)):
	vectors_map[mylist[i]]=vectors[i,:]
	

#4.Calculating cosine similarity to query
cosine_sim={}
for x in range(0,len(mylist)):
	if (mylist[x]!=query): 	
		v1=vectors_map[mylist[x]]
		v2=vectors_map[query]
		value=cosine(v1,v2)
		cosine_sim[mylist[x]]=value

#5.Retrieving the n closest neighbors to query
sortedmap=sorted(cosine_sim,key=cosine_sim.get, reverse=True)
iterator=islice(sortedmap,n)
i =1
for d in iterator:
	print (str(i)+". "+ str(d) +"\t"+str(cosine_sim[d])+"\n")
	i +=1

1. A6NL82	0.24056068519081897

2. A0A0U1RR11	0.21344038506854024

3. A0A2R8YFB7	0.20980879136776454

4. A0A286YFB4	0.19988283188103195

5. A0A0U1RRI6	0.19704075439528834

6. A0A286YET3	0.19369666960546184

7. A0A286YFG1	0.1928914756300507

8. A6NJ78	0.19175714812144584

9. A0A286YF60	0.1890638514071129

10. A6NNA2	0.18722186307718114

