BatchVect should not gather dictionary when loading ready batches (#831)
* BatchVect should not gather dictionary when loading ready batches

* shameful bug fix in docs
MelLain committed Aug 8, 2017
1 parent 7056e08 commit 527d30e
Showing 2 changed files with 4 additions and 7 deletions.
docs/tutorials/python_userguide/loading_data.txt (1 addition, 1 deletion)
@@ -66,7 +66,7 @@ In this case the token order in the dictionary (and in further :math:`\Phi` matr

Take into consideration that the library will ignore any token from the batches that was not present in the vocab file, if you used one. ``Dictionary`` contains a lot of useful information about the collection. For example, each unique token in it has a corresponding variable called value. When BigARTM gathers the dictionary, it stores the relative frequency of the token in this variable. You can read about the use-cases of this variable in further sections.

-Well, now you have a dictionary. It can be saved on the dick to prevent it's re-creation. You can save it in the binary format:
+Well, now you have a dictionary. It can be saved on the disk to prevent its re-creation. You can save it in the binary format:

.. code-block:: python

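The diff collapses the code block that follows this paragraph. For reference, a save/load round trip with ``artm.Dictionary`` looks roughly like the sketch below (the directory and dictionary names are placeholders, not taken from the diff):

.. code-block:: python

    import artm

    # Gather a dictionary from a directory of ready batches.
    my_dictionary = artm.Dictionary()
    my_dictionary.gather(data_path='my_collection_batches')

    # Save it in the binary format to avoid re-creating it next time.
    my_dictionary.save(dictionary_path='my_collection_batches/my_dictionary')

    # Later: load it back (save() appends the .dict extension).
    my_dictionary.load(dictionary_path='my_collection_batches/my_dictionary.dict')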
python/artm/batches_utils.py (3 additions, 6 deletions)
@@ -69,7 +69,8 @@ def __init__(self, batches=None, collection_name=None, data_path='', data_format
:param dict vocabulary: dict with vocabulary, key - index of n_wd, value - token
:param bool gather_dictionary: create or not the default dictionary in vectorizer;\
    if data_format == 'bow_n_wd' - automatically set to True;\
-    and if data_weight is list - automatically set to False
+    and if data_format == 'batches' or data_weight is list -\
+    automatically set to False
:param class_ids: list of class_ids or single class_id to parse and include in batches
:type class_ids: list of str or str
:param artm.ARTM process_in_memory_model: ARTM instance that will use this vectorizer, is\
@@ -99,7 +100,7 @@ def __init__(self, batches=None, collection_name=None, data_path='', data_format
        self._batch_size = batch_size

        self._dictionary = None
-        if gather_dictionary and not isinstance(data_weight, list):
+        if gather_dictionary and not isinstance(data_weight, list) and data_format != 'batches':
            self._dictionary = Dictionary()

        if data_format == 'bow_n_wd':
@@ -216,10 +217,6 @@ def _parse_batches(self, data_weight=None, batches=None):
            self._batches_list += [Batch(os.path.join(data_p, batch)) for batch in batches]
            self._weights += [data_w for i in range(len(batches))]

-            # next code will be processed only if for-loop has only one iteration
-            if self._dictionary is not None:
-                self._dictionary.gather(data_path=data_p)
-
def _parse_n_wd(self, data_weight=None, n_wd=None, vocab=None):
def __reset_batch():
batch = messages.Batch()
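In usage terms, the fix means that loading ready batches no longer triggers an implicit dictionary gather. A rough sketch of the resulting behavior, assuming the standard ``artm.BatchVectorizer`` and ``artm.Dictionary`` API (the directory name is a placeholder):

.. code-block:: python

    import artm

    # data_format='batches' now implies gather_dictionary=False,
    # so no default dictionary is built while loading ready batches.
    batch_vectorizer = artm.BatchVectorizer(data_path='my_collection_batches',
                                            data_format='batches')

    # If a dictionary is needed, gather it explicitly from the same batches.
    my_dictionary = artm.Dictionary()
    my_dictionary.gather(data_path='my_collection_batches')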
