Merge branch 'master' of https://github.com/ddangelov/top2vec
ddangelov committed Mar 23, 2020
2 parents 87d7895 + 82ec778 commit d031f64
72 changes: 70 additions & 2 deletions README.md
@@ -1,2 +1,70 @@
# Top2Vec
Generate topic, document and word embeddings.
Top2Vec
=======

Topic2Vector is an algorithm for topic modeling. It automatically detects topics present in text
and generates jointly embedded topic, document, and word vectors. Once you have trained a Top2Vec
model you can do the following (a short sketch of these operations follows the list):
* Get the number of detected topics.
* Get topics.
* Search topics by keywords.
* Search documents by topic.
* Find similar words.
* Find similar documents.
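
Below is a minimal sketch of what those operations could look like once a model is trained. The
method names and return values (``get_num_topics``, ``get_topics``, ``search_topics``,
``search_documents_by_topic``, ``similar_words``) are assumptions based on the capability list
above, not a definitive reference, so check the API for the exact signatures.

```python
from top2vec import Top2Vec

# Train on a list of document strings (see Usage below).
model = Top2Vec(documents)

# Number of detected topics (method name assumed).
num_topics = model.get_num_topics()

# Topic words and their scores for every detected topic (assumed).
topic_words, word_scores, topic_nums = model.get_topics(num_topics)

# Search topics by keywords (assumed signature).
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
    keywords=["medicine"], num_topics=5)

# Documents most representative of a given topic (assumed).
docs_found, document_scores, document_ids = model.search_documents_by_topic(
    topic_num=topic_nums[0], num_docs=5)

# Words most similar to a given keyword (assumed).
words, word_scores = model.similar_words(keywords=["medicine"], num_words=10)
```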

Benefits
--------
1. Automatically finds number of topics.
2. No stop words required.
3. No need for stemming/lemmatizing.
4. Works on short text.
5. Creates jointly embedded topic, document, and word vectors.
6. Has search functions built in.

How does it work?
-----------------

The assumption underlying the algorithm is that many semantically similar documents
are indicative of an underlying topic. The first step is to create a joint embedding of
document and word vectors. Once documents and words are embedded in a vector
space, the goal of the algorithm is to find dense clusters of documents and then identify which
words attracted those documents together. Each dense area is a topic, and the words that
attracted the documents to the dense area are the topic words.

**The Algorithm:**

1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html).
2. Create a lower-dimensional embedding of the document vectors using [UMAP](https://github.com/lmcinnes/umap).
3. Find dense areas of documents using [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan).
4. For each dense area, calculate the centroid of its document vectors in the original dimension; this centroid is the topic vector.
5. Find the n closest word vectors to the resulting topic vector (a sketch of this pipeline follows these steps).
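
The sketch below strings those five steps together directly from [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html), [UMAP](https://github.com/lmcinnes/umap), and [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan). It is only an illustration of the idea, not Top2Vec's actual implementation; the tokenization and all parameter values are assumptions chosen for readability.

```python
import umap
import hdbscan
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = ["..."]  # raw document strings

# 1. Jointly embedded document and word vectors (naive whitespace tokenization).
tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(documents)]
d2v = Doc2Vec(tagged, vector_size=300, min_count=2, workers=4, epochs=40)
doc_vectors = d2v.dv.vectors  # document vectors (gensim >= 4.0)

# 2. Lower-dimensional embedding of the document vectors.
umap_embedding = umap.UMAP(n_neighbors=15, n_components=5,
                           metric="cosine").fit_transform(doc_vectors)

# 3. Dense areas of documents; label -1 marks noise points.
labels = hdbscan.HDBSCAN(min_cluster_size=15,
                         metric="euclidean").fit_predict(umap_embedding)

# 4. Topic vector = centroid of each cluster in the original dimension.
topic_vectors = [doc_vectors[labels == c].mean(axis=0)
                 for c in sorted(set(labels)) if c != -1]

# 5. Topic words = the n closest word vectors to each topic vector.
for topic_vector in topic_vectors:
    print(d2v.wv.similar_by_vector(topic_vector, topn=10))
```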


Installation
------------

The easiest way to install Top2Vec is:

pip install top2vec


Usage
-----

```python
from top2vec import Top2Vec

model = Top2Vec(documents)
```
Parameters:

* ``documents``: The input corpus; it should be a list of strings.

* ``speed``: This parameter determines how long the model takes to train.
The 'fast-learn' option is the fastest but generates the lowest quality
vectors. The 'learn' option learns better quality vectors but takes longer
to train. The 'deep-learn' option learns the best quality vectors but
takes significantly longer to train.

* ``workers``: The number of worker threads used to train the model. A larger
number leads to faster training, as in the example below.
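
For example, training with the options described above might look like this. The argument names
follow the parameter list; the values are only illustrative.

```python
from top2vec import Top2Vec

# 'deep-learn' trains the highest quality vectors but takes the longest;
# more workers speed up training on multi-core machines.
model = Top2Vec(documents, speed="deep-learn", workers=8)
```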
