Commit

Update README.md
ddangelov committed Mar 23, 2020
1 parent 6e6cde0 commit 62968ec
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -35,27 +35,27 @@ attracted the documents to the dense area are the topic words.
**1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html).**
>Documents will be placed close to other similar documents and close to the most distinguishing words.
![Joint Document and Word Embedding](images/doc_word_embedding.svg)
![Joint Document and Word Embedding](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/doc_word_embedding.svg?sanitize=true)

**2. Create a lower-dimensional embedding of document vectors using [UMAP](https://github.com/lmcinnes/umap).**
>Document vectors in high-dimensional space are very sparse, so dimension reduction helps find dense areas. Each point is a document vector.
![UMAP dimension reduced Documents](images/umap_docs.png)
![UMAP dimension reduced Documents](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/umap_docs.png?sanitize=true)

**3. Find dense areas of documents using [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan).**
>The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster.
![HDBSCAN Document Clusters](images/hdbscan_docs.png)
![HDBSCAN Document Clusters](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/hdbscan_docs.png?sanitize=true)

**4. For each dense area, calculate the centroid of the document vectors in the original dimension; this centroid is the topic vector.**
>The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated.
![Topic Vector](images/topic_vector.svg)
![Topic Vector](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_vector.svg?sanitize=true)
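
The centroid calculation in step 4 can be sketched in plain Python. This is a minimal illustration and not Top2Vec's own code: the function name `topic_vectors`, the toy vectors, and the labels are hypothetical, and it assumes the HDBSCAN convention that outlier documents carry the label `-1`.

```python
from collections import defaultdict


def topic_vectors(doc_vectors, labels, outlier_label=-1):
    """Return {cluster label: centroid} computed in the original dimension.

    Hypothetical helper: documents labelled `outlier_label` are skipped,
    mirroring the description that outliers do not contribute.
    """
    groups = defaultdict(list)
    for vec, label in zip(doc_vectors, labels):
        if label != outlier_label:
            groups[label].append(vec)
    # Centroid = per-dimension mean of the vectors in each dense area.
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in groups.items()
    }


# Toy data (assumed): four 2-d document vectors, the last one an outlier.
docs = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [9.0, 9.0]]
labels = [0, 0, 1, -1]
topics = topic_vectors(docs, labels)
```

Here the outlier at `[9.0, 9.0]` has no influence on either centroid, which matches the role of the red points in the figure.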

**5. Find the n closest word vectors to the resulting topic vector.**
>The closest word vectors in order of proximity become the topic words.
![Topic Words](images/topic_words.svg)
![Topic Words](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_words.svg?sanitize=true)
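
Step 5 amounts to a nearest-neighbour lookup over the word vectors. The sketch below assumes cosine similarity as the proximity measure; the function names, the toy vocabulary, and its vectors are hypothetical and serve only to illustrate the ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def topic_words(topic_vector, word_vectors, n):
    """Return the n words whose vectors lie closest to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(topic_vector, word_vectors[word]),
        reverse=True,  # most similar first, i.e. in order of proximity
    )
    return ranked[:n]


# Toy vocabulary (assumed) embedded in the same 2-d space as the topic vector.
word_vectors = {
    "goal": [0.9, 0.1],
    "match": [0.7, 0.3],
    "senate": [0.1, 0.9],
}
words = topic_words([1.0, 0.0], word_vectors, n=2)
```

Because document and word vectors share one embedding space, the same similarity measure can compare a topic vector with any word vector directly.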

Installation
------------