Conda environment:

In [None]:
# TODO create and activate conda env
# biovec only works with python 3.6 because of old dependencies, here is conda env for that:

# conda create -n biovec python=3.6 ipython pandas ipykernel
# conda activate biovec
# pip3 install biovec

Raw data download:

In [None]:
!wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz -O - | gunzip -c > swissprot.fasta

## Testing biovec package

### Training (takes a few hours)

In [None]:
import biovec

retrain = False
if retrain:
    pv = biovec.models.ProtVec("swissprot.fasta", corpus_fname="output_corpusfile_path.txt", workers=80)
    pv.save('swissprot.protvec.model')
pv = biovec.models.load_protvec('swissprot.protvec.model')

### Application

In [None]:
import pandas as pd
import numpy as np

def get_nlp_feature(sequence: str):
    # to_vecs returns one summed vector per 3-mer offset; concatenate them into one feature row
    arrays = pv.to_vecs(sequence)
    return pd.Series(np.concatenate(arrays))

sequences = pd.read_table("/home/ad/biovec_test/uniprot_transporter_sequences.tsv", index_col=0, squeeze=True)
encoded = sequences.apply(get_nlp_feature)
encoded.to_csv("/home/ad/biovec_test/uniprot_transporter_sequences_encoded.tsv", sep="\t")
encoded.head()

### Trying to write own code that uses Word2Vec:

In [9]:
import numpy as np
import re

# from biovec repo
def split_ngrams(seq, n):
    """
    'AGAMQSASM' => [['AGA', 'MQS', 'ASM'], ['GAM','QSA'], ['AMQ', 'SAS']]
    """
    a, b, c = zip(*[iter(seq)]*n), zip(*[iter(seq[1:])]*n), zip(*[iter(seq[2:])]*n)
    str_ngrams = []
    for ngrams in [a,b,c]:
        x = ["".join(ngram) for ngram in ngrams]
        str_ngrams.append(x)
    return str_ngrams


#### Testing Word2Vec:

Word embeddings are explained in detail here: https://www.youtube.com/watch?v=gQddtTdmG_8

In short, the algorithm is trained on a corpus, i.e. a text. After training, it can produce a vector of length n (for example n=100) for each word. The similarity between two words is then calculated as the cosine similarity, which is given as dot(X,Y)/(norm(X)*norm(Y)), where X and Y are the vectors of the two words (or ngrams/kmers) and norm() is the Euclidean norm. The cosine distance is 1 minus this value.

A higher similarity score means that the words are more similar, in the context of the text that was fed to the neural network.
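
The dot(X,Y)/(norm(X)*norm(Y)) formula above can be checked on toy vectors. The following is a minimal sketch in plain Python; the function name `cosine_similarity` and the three-dimensional vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(x, y):
    """Return dot(x, y) / (norm(x) * norm(y)) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy vectors, invented for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1. The length of the vectors does not matter, only their direction.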

The idea behind the biovec package is to transform a sequence of amino acids into three lists of non-overlapping 3-mers. Each list starts at a different sequence position (0, 1, 2), so that together they cover every 3-mer in the sequence. This is probably supposed to teach the kmer order to the algorithm.
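
As an illustration of the three starting positions, here is a small sketch that splits a sequence into one list of non-overlapping kmers per offset. The helper name `split_into_frames` is hypothetical and not part of biovec. Unlike the `split_ngrams` function copied from the repo, it derives the number of offsets from n instead of hard-coding three.

```python
def split_into_frames(seq, n=3):
    """Split seq into n lists of non-overlapping n-mers, one list per starting offset."""
    frames = []
    for offset in range(n):
        # Step through the sequence in strides of n, keeping only complete n-mers.
        frame = [seq[i:i + n] for i in range(offset, len(seq) - n + 1, n)]
        frames.append(frame)
    return frames

frames = split_into_frames("AGAMQSASM", n=3)

# Each frame can be written as a space-separated "sentence" of kmer "words".
sentences = [" ".join(frame) for frame in frames]
for sentence in sentences:
    print(sentence)
```

For the example sequence from the docstring above, this gives the same three lists: AGA MQS ASM, GAM QSA, and AMQ SAS.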

Every sequence is then treated as a "sentence", where the words are the kmers in the sequence. One of the standard methods for calculating the similarity between two sentences is to sum the word vectors/embeddings of each sentence into a single sentence vector, and then calculate the cosine similarity between the two resulting vectors. Here is an example, using normal words and sentences:

In [108]:
# example: from https://stackoverflow.com/a/66127454

import numpy as np
from scipy import spatial
import gensim.downloader as api

# choose from multiple models: https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    # lowercase and split on whitespace
    return [i.lower() for i in s.split()]

def get_vector(s):
    # sentence vector = sum of the word vectors
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)

def get_similarity(v0, v1):
    # scipy returns the cosine distance, so similarity = 1 - distance
    return 1 - spatial.distance.cosine(v0, v1)

print('s0 vs s1 ->', get_similarity(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', get_similarity(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', get_similarity(get_vector(s0), get_vector(s3)))

s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071


The pretrained models and datasets come from the gensim-data GitHub repository linked in the code above. The number at the end of the model name stands for the vector size. The example above uses vectors of length 50; the highest number seems to be 300. The vector size seems to be a tradeoff between storage space/computation time and accuracy.

An alternative that has not been tried yet would be Doc2Vec, which is available in the same Python package (gensim). TODO I should try that. This post explains Doc2Vec, and also a Word2Vec solution that uses cosine similarity and averaged vectors: https://datascience.stackexchange.com/a/23998

The Word Mover's Distance algorithm is another method, and is also included in the gensim package: https://datascience.stackexchange.com/a/31497

Finally, another score I came across was the Dice score (Sørensen-Dice coefficient), which is mathematically equivalent to the F1 score. I would create a set of kmers for each of the two sequences, then divide twice the number of kmers in the intersection set by the sum of the sizes of both sets: 2*|A∩B| / (|A|+|B|). This would not even need any word embeddings.
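
A minimal sketch of such a kmer set comparison, assuming the standard Dice definition 2*|A∩B| / (|A|+|B|) and overlapping kmers with k=3. The function names `kmer_set` and `dice_score` and the two example sequences are made up for illustration.

```python
def kmer_set(seq, k=3):
    """Return the set of all overlapping kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def dice_score(seq_a, seq_b, k=3):
    """Dice coefficient between the kmer sets of two sequences."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        # Both sequences are shorter than k; define the score as 0 here (an assumption).
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Toy sequences, invented for illustration only.
identical = dice_score("AGAMQSASM", "AGAMQSASM")
partial = dice_score("AGAMQSASM", "AGAMQSTTT")
disjoint = dice_score("AAAAAA", "CCCCCC")

print(identical, partial, disjoint)
```

Because the kmers are collected into sets, repeated kmers count only once and the kmer order is ignored, which is the main difference to the embedding-based approaches above.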

In [None]:
# TODO