Text2Vec

Easily generate document/paragraph/sentence vectors and calculate similarity.

Blog post (in Chinese)

The goal of this repository is to provide a tool that easily generates document/paragraph/sentence vectors, both for similarity calculation and as input to further machine learning models.

Requirements

  • spaCy 2.0 (with the English model downloaded and installed)
  • gensim
  • numpy
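Besides installing the packages (for example with pip), the spaCy English model also needs to be downloaded, e.g. via python -m spacy download en for spaCy 2.0. A quick way to verify the setup (the model alias "en" is an assumption; use whichever English model you installed):

import spacy

# raises OSError if the English model has not been downloaded yet
nlp = spacy.load("en")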

Usage of Text to Vector (text2vec)

  • Initialize: a pre-trained Doc2Vec/Word2Vec model.
import text2vec
  • Input: a list of documents; doc_list is a list of documents/paragraphs/sentences (a minimal example is shown below).
t2v = text2vec.text2vec(doc_list)
  • Output: a list of vectors of dimension N.
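For illustration, a minimal call might look like the following (the sentences are made-up sample data, and doc_list is assumed to hold raw text strings):

import text2vec

# a toy corpus of three short texts (hypothetical sample data)
doc_list = [
    "The cat sits on the mat.",
    "A dog plays in the park.",
    "Machine learning builds models from data.",
]

t2v = text2vec.text2vec(doc_list)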

We can perform the transformation in any of the following ways.

# Use TFIDF
docs_tfidf = t2v.get_tfidf()

# Use Latent Semantic Indexing (LSI)
docs_lsi = t2v.get_lsi()

# Use Random Projections (RP)
docs_rp = t2v.get_rp()

# Use Latent Dirichlet Allocation (LDA)
docs_lda = t2v.get_lda()

# Use Hierarchical Dirichlet Process (HDP)
docs_hdp = t2v.get_hdp()

# Use Average of Word Embeddings
docs_avgw2v = t2v.avg_wv()

# Use Weighted Word Embeddings wrt. TFIDF
docs_emb = t2v.tfidf_weighted_wv()

For a more detailed introduction to using Weighted Word Embeddings wrt. TFIDF, please read here.
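The idea behind the last two methods is to average a document's word vectors, either uniformly or with each word weighted by its TF-IDF value. Below is a rough sketch of the weighted variant written directly against gensim; it illustrates the technique rather than the repository's exact implementation, and the whitespace tokenization and the toy Word2Vec model are assumptions made to keep the example self-contained.

import numpy as np
from gensim import corpora, models

# tokenized documents (the repository tokenizes with spaCy; plain split()
# is used here only to keep the sketch self-contained)
tokenized = [doc.lower().split() for doc in doc_list]

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]
tfidf = models.TfidfModel(bow_corpus)

# any word embeddings will do; here a tiny Word2Vec model is trained on the
# corpus itself (use vector_size= instead of size= on gensim >= 4.0)
w2v = models.Word2Vec(tokenized, size=100, min_count=1)
word_vectors = w2v.wv

def tfidf_weighted_vector(bow, dim=100):
    # average the word vectors of one document, weighted by TF-IDF
    vec = np.zeros(dim)
    total_weight = 0.0
    for word_id, weight in tfidf[bow]:
        word = dictionary[word_id]
        if word in word_vectors:
            vec += weight * word_vectors[word]
            total_weight += weight
    return vec / total_weight if total_weight > 0 else vec

docs_emb_sketch = [tfidf_weighted_vector(bow) for bow in bow_corpus]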

Usage of Similarity Calculation (simical)

For example, suppose we want to calculate the similarity/distance between the first two sentences in the docs_emb we just computed.

Note that the cosine measure is a similarity (values close to 1 mean most similar, values close to 0 mean least similar), while the other measurements are actually distances (the larger the value, the less similar). It is therefore better to calculate the distance for all possible pairs and then rank them (see the sketch after the examples below).

# Initialize
import text2vec
sc = text2vec.simical(docs_emb[0], docs_emb[1])

# Use Cosine
simi_cos = sc.Cosine()

# Use Euclidean
simi_euc = sc.Euclidean()

# Use Triangle's Area Similarity (TS)
simi_ts = sc.Triangle()

# Use Sector's Area Similarity (SS)
simi_ss = sc.Sector()

# Use TS-SS
simi_ts_ss = sc.TS_SS()
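As noted above, when more than two documents are involved it is usually more useful to compute the distance for every pair and rank the results. A small sketch using the same simical class (illustrative only; it assumes docs_emb from the previous section and uses TS-SS as the distance measure):

from itertools import combinations

import text2vec

# compute the TS-SS distance for every pair of document vectors
pairs = []
for i, j in combinations(range(len(docs_emb)), 2):
    sc = text2vec.simical(docs_emb[i], docs_emb[j])
    pairs.append((i, j, sc.TS_SS()))

# smaller distance means more similar, so sort ascending
pairs.sort(key=lambda p: p[2])

# show the ten most similar pairs
for i, j, dist in pairs[:10]:
    print(i, j, dist)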

Reference

https://radimrehurek.com/gensim/tut2.html

https://github.com/sdimi/average-word2vec

https://github.com/taki0112/Vector_Similarity