
## Unsupervised models

## Supervised models

## Sentence encoder example
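These notes don't include a snippet for this section; below is a minimal sketch assuming the `sentence-transformers` package (the model name `all-MiniLM-L6-v2` is an illustrative choice, not from the original):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # assumed checkpoint name
embeddings = model.encode(['How old are you?', 'What is your age?'])
print(embeddings.shape)  # (2, 384) for this particular model
```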

## Distance

### Cosine distance

```python
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
import numpy as np
from scipy.spatial.distance import cdist, cosine
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([0, 1, 2, 3])
b = np.array([0, 1, 3, 3])
# scipy's cosine() returns a distance, so similarity = 1 - distance;
# sklearn's cosine_similarity() expects 2-D inputs, hence the reshape.
print('scipy cosine similarity: {}, sklearn similarity: {}'.format(
    1 - cosine(a, b), cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]))
# scipy cosine similarity: 0.981022943176, sklearn similarity: 0.981022943176

# scipy vectorization: cosine distances from one query to a whole corpus at once
corpus_embeddings = [a, b]  # replace with your own list of embedding arrays
query_embedding = a
distances = cdist(np.array([query_embedding]), np.array(corpus_embeddings), 'cosine')[0]
tmp_sims = 1 - distances
```

### Edit distance
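The header has no notes attached; here is a minimal Levenshtein (edit) distance sketch, where `edit_distance` is a hypothetical helper and `dp[i][j]` holds the minimum number of insertions, deletions, and substitutions needed to turn `s[:i]` into `t[:j]`:

```python
def edit_distance(s, t):
    # dp[i][j]: min edits to turn s[:i] into t[:j]
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dp[i][0] = i  # delete all i characters of s
    for j in range(len(t) + 1):
        dp[0][j] = j  # insert all j characters of t
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

print(edit_distance('kitten', 'sitting'))  # 3
```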

### Jaro–Winkler distance
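Jaro–Winkler boosts the plain Jaro similarity for strings that share a common prefix. A usage sketch assuming the `jellyfish` package (the function is named `jaro_winkler_similarity` in recent releases):

```python
import jellyfish  # assumed dependency, not pinned by these notes

# Classic example pair: Jaro ≈ 0.767; Jaro–Winkler ≈ 0.813 because of the
# shared "di" prefix.
print(jellyfish.jaro_winkler_similarity('dixon', 'dicksonx'))
```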

## Notes on the paper: Distributed Representations of Sentences and Documents

- Two models are proposed: PV-DM (Paragraph Vector – Distributed Memory) and PV-DBOW (Paragraph Vector – Distributed Bag of Words). PV-DM is analogous to CBOW in Word2vec, and PV-DBOW is analogous to Skip-gram. PV-DM is consistently better than PV-DBOW (a gensim sketch of both models follows these notes).
- Model principle: during training, the paragraph vector is concatenated with several word vectors from the paragraph and used to predict the following word in the given context. Paragraph vectors are unique to each paragraph, while word vectors are shared across all paragraphs. At prediction time, the word vectors are held fixed and a new paragraph vector is trained until convergence.
- For PV-DM: using concatenation is often better than sum.

*(figure: PV-DM architecture)*

- For PV-DBOW: the paragraph vector alone is trained to predict words randomly sampled from the paragraph, ignoring word order.

*(figure: PV-DBOW architecture)*

- BOW features lose the ordering of the words and also ignore their semantics (the dot product of any two distinct one-hot word vectors is zero). Concatenating word vectors preserves word order.
- Weighted averaging of word vectors loses the word order in the same way the standard bag-of-words models do.
- For long documents, bag-of-words models perform quite well.
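As a concrete sketch of both models, gensim's `Doc2Vec` can train either variant (gensim is an assumed dependency here; the toy corpus is illustrative, not from the original notes):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets a unique tag, which becomes its paragraph vector.
corpus = [
    TaggedDocument(words=['the', 'cat', 'sat'], tags=['doc0']),
    TaggedDocument(words=['the', 'dog', 'barked'], tags=['doc1']),
]

# dm=1 selects PV-DM; dm_concat=1 concatenates the paragraph vector with the
# context word vectors (the paper's preferred variant) instead of averaging.
pv_dm = Doc2Vec(corpus, dm=1, dm_concat=1, vector_size=50, window=2,
                min_count=1, epochs=40)

# dm=0 selects PV-DBOW: the paragraph vector alone predicts sampled words.
pv_dbow = Doc2Vec(corpus, dm=0, vector_size=50, min_count=1, epochs=40)

# Inference fixes the word vectors and trains a fresh paragraph vector.
new_vec = pv_dm.infer_vector(['the', 'cat', 'barked'])
print(new_vec.shape)  # (50,)
```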

## Split Chinese sentence

```python
# from https://github.com/fxsjy/jieba/issues/575
import re

# Matches Chinese sentence-ending punctuation (fullwidth . ; 。 ! ? etc.),
# optionally followed by up to two closing quotes/brackets, or a fullwidth
# colon when it introduces a quotation or ends the string.
resentencesp = re.compile(r'([﹒﹔﹖﹗.;。!?]["’”」』]{0,2}|:(?=["‘“「『]{1,2}|$))')

def split_paragraph(sentence):
    slist = []
    for i in resentencesp.split(sentence):
        if resentencesp.match(i) and slist:
            # i is a sentence-ending separator: glue it back onto the
            # preceding sentence
            slist[-1] += i
        elif i:
            slist.append(i)
    return slist
```
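A quick usage check (toy input, not from the original issue):

```python
print(split_paragraph('你好!今天天气不错。我们走吧'))
# ['你好!', '今天天气不错。', '我们走吧']
```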