Merge pull request #149 from sashafrey/master
Documentation whitepaper - BigARTM as a Service [skip ci]
Showing 4 changed files with 25 additions and 2 deletions.
@@ -0,0 +1,22 @@
BigARTM as a Service
====================

The following diagram shows a suggested topology for a query service that involves topic modelling on Big Data.

.. image:: _images/cloud_service.png
   :alt: cloud_service

Here the main role of Hadoop / MapReduce is to convert your large unstructured data into a compact bag-of-words representation.
Thanks to its out-of-core design and high performance, BigARTM can then process this data on a single compute-optimized node.
The resulting topic model should be replicated to all query instances that serve user requests.

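As a rough illustration of the bag-of-words step, the sketch below shows what a per-document map function might compute. The function name ``bag_of_words`` and the tokenization rule (lowercase, alphanumeric runs) are assumptions made for this example; they are not part of BigARTM or Hadoop.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for one document (hypothetical helper).

    Assumption: tokens are lowercase runs of ASCII letters and digits.
    A real pipeline would likely add stemming, stop-word removal, etc.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    counts = bag_of_words("The cat saw the cat.")
    for word, n in sorted(counts.items()):
        print(word, n)
```

In a MapReduce job, a function like this would typically run once per document in the map phase, with the reduce phase assembling the shared vocabulary.
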
To avoid a query-time dependency on the BigARTM component, you may want to infer the topic distributions ``theta_{td}`` for new documents in your own code.
This can be done as follows. Start from a uniform topic assignment ``theta_{td} = 1 / |T|`` and update it in the following loop:

.. image:: _images/theta_update.png
   :alt: theta_update

where ``n_dw`` is the number of occurrences of word ``w`` in document ``d``, and ``phi_wt`` is an element of the Phi matrix.
In BigARTM the loop is repeated :attr:`ModelConfig.inner_iterations_count` times (defaults to ``10``).
To precisely replicate BigARTM behavior, one needs to account for class weights and include regularizers.
Please contact us if you need more details.
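The loop described above might be sketched in plain Python as follows. The update formula is shown only as an image here, so this sketch assumes the standard unregularized EM update, ``theta_td <- (1 / n_d) * sum_w n_dw * phi_wt * theta_td / sum_s(phi_ws * theta_sd)``. The function name ``infer_theta`` and the dictionary-based layout of Phi are hypothetical choices for illustration. Class weights and regularizers are deliberately left out, so the result is not guaranteed to match BigARTM exactly.

```python
def infer_theta(n_dw, phi, num_topics, iterations=10):
    """Infer the topic distribution of one document (hypothetical helper).

    n_dw       -- dict mapping word -> count in the document
    phi        -- dict mapping word -> list of p(w|t), one value per topic
    num_topics -- |T|, the number of topics
    iterations -- number of passes; 10 mirrors the documented default
    """
    # Words absent from the model carry no information, so skip them.
    counts = {w: n for w, n in n_dw.items() if w in phi}

    # Start from the uniform assignment theta_td = 1 / |T|.
    theta = [1.0 / num_topics] * num_topics
    if not counts:
        return theta

    for _ in range(iterations):
        new_theta = [0.0] * num_topics
        for w, n in counts.items():
            phi_w = phi[w]
            # Normalizer: sum over topics s of phi_ws * theta_sd.
            z = sum(phi_w[t] * theta[t] for t in range(num_topics))
            if z == 0.0:
                continue
            for t in range(num_topics):
                new_theta[t] += n * phi_w[t] * theta[t] / z
        total = sum(new_theta)
        if total == 0.0:
            break
        # Normalize so that theta sums to 1 over topics.
        theta = [v / total for v in new_theta]
    return theta


if __name__ == "__main__":
    # Toy Phi matrix with two topics; each topic column sums to 1.
    phi = {"cat": [0.5, 0.1], "dog": [0.4, 0.1], "gpu": [0.1, 0.8]}
    print(infer_theta({"cat": 3, "dog": 2}, phi, num_topics=2))
```

Because the function needs only the Phi matrix and the word counts of the new document, each query instance can run it locally without calling BigARTM.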