Update README.md
ddangelov committed Mar 23, 2020
1 parent 9e1457d commit bcae769
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -30,19 +30,19 @@ space the goal of the algorithm is to find dense clusters of documents, then ide
 words attracted those documents together. Each dense area is a topic and the words that
 attracted the documents to the dense area are the topic words.

-**The Algorithm:**
+# The Algorithm:

-1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html).
+**1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html).**
 >Documents will be placed to other similar documents and close to most distinguishing words.
 ![Joint Document and Word Embedding](images/doc_word_embedding.svg)
-2. Create lower dimensional embedding of document vectors using [UMAP](https://github.com/lmcinnes/umap)
+**2. Create lower dimensional embedding of document vectors using [UMAP](https://github.com/lmcinnes/umap)**
 >Document vectors in high dimensional space are very sparse, dimension reduction allows the discovery of dense areas.
 ![UMAP dimension reduced Documents](images/umap_docs.png)
-3. Find dense areas of documents using [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan)
+**3. Find dense areas of documents using [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan)**
 >The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific topic.
 ![HDBSCAN Document Clusters](images/hdbscan_docs.png)
-4. For each dense area calculate centroid of document vectors in original dimension. (centroid = topic vector)
-5. Find n-closest word vectors to the resulting topic vector
+**4. For each dense area calculate centroid of document vectors in original dimension. (centroid = topic vector)**
+**5. Find n-closest word vectors to the resulting topic vector**
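
Steps 4 and 5 of the list above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names `topic_vectors` and `closest_words` are hypothetical, the toy vectors stand in for Doc2Vec output, the `labels` array stands in for HDBSCAN cluster labels (where `-1` marks outliers), and cosine similarity is assumed as the measure of "closest", which the README does not specify.

```python
import numpy as np


def topic_vectors(doc_vectors, labels):
    """Step 4: centroid of each cluster's document vectors in the original dimension."""
    topics = {}
    for label in np.unique(labels):
        if label == -1:  # assumed HDBSCAN convention: -1 marks outliers with no topic
            continue
        topics[int(label)] = doc_vectors[labels == label].mean(axis=0)
    return topics


def closest_words(topic_vector, word_vectors, vocab, n=3):
    """Step 5: the n words whose vectors are most similar to the topic vector.

    Cosine similarity is an assumption made for this sketch.
    """
    word_norms = np.linalg.norm(word_vectors, axis=1)
    sims = word_vectors @ topic_vector / (word_norms * np.linalg.norm(topic_vector))
    top = np.argsort(-sims)[:n]  # indices sorted by descending similarity
    return [vocab[i] for i in top]


# Toy stand-ins for jointly embedded document and word vectors (hypothetical data).
doc_vectors = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])
labels = np.array([0, 0, 1, 1, -1])  # the last document is treated as an outlier
vocab = ["sports", "game", "election", "vote"]
word_vectors = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]])

topics = topic_vectors(doc_vectors, labels)
topic_words = {
    label: closest_words(vec, word_vectors, vocab, n=2)
    for label, vec in topics.items()
}
print(topic_words)
```

Because the centroid is taken over the original document vectors rather than the UMAP-reduced ones, the topic vector lives in the same space as the word vectors, which is what makes the nearest-word lookup in step 5 meaningful.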


Installation