BatchVect should not gather dictionary when loading ready batches (#831)
* BatchVect should not gather dictionary when loading ready batches

* shameful bug fix in docs
MelLain committed Aug 8, 2017
1 parent 7056e08 commit 527d30e
Showing 2 changed files with 4 additions and 7 deletions.
docs/tutorials/python_userguide/loading_data.txt (1 addition, 1 deletion)
@@ -66,7 +66,7 @@ In this case the token order in the dictionary (and in further :math:`\Phi` matr

Take into consideration that the library will ignore any token from the batches that was not present in the vocab file, if you used one. ``Dictionary`` contains a lot of useful information about the collection. For example, each unique token in it has a corresponding variable called value. When BigARTM gathers the dictionary, it stores the relative frequency of the token in this variable. You can read about the use-cases of this variable in further sections.

-Well, now you have a dictionary. It can be saved on the dick to prevent it's re-creation. You can save it in the binary format:
+Well, now you have a dictionary. It can be saved on the disk to prevent its re-creation. You can save it in the binary format:

.. code-block:: python

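The diff collapses the code block that follows this paragraph. For reference, a save/load round trip with ``artm.Dictionary`` looks roughly like the sketch below (the directory and dictionary names are placeholders, not taken from the diff):

.. code-block:: python

    import artm

    # Gather a dictionary from a directory of ready batches.
    my_dictionary = artm.Dictionary()
    my_dictionary.gather(data_path='my_collection_batches')

    # Save it in the binary format to avoid re-creating it next time.
    my_dictionary.save(dictionary_path='my_collection_batches/my_dictionary')

    # Later: load it back (save() appends the .dict extension).
    my_dictionary.load(dictionary_path='my_collection_batches/my_dictionary.dict')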
python/artm/batches_utils.py (3 additions, 6 deletions)
@@ -69,7 +69,8 @@ def __init__(self, batches=None, collection_name=None, data_path='', data_format
:param dict vocabulary: dict with vocabulary, key - index of n_wd, value - token
:param bool gather_dictionary: create or not the default dictionary in vectorizer;\
    if data_format == 'bow_n_wd' - automatically set to True;\
-    and if data_weight is list - automatically set to False
+    and if data_format == 'batches' or data_weight is list -\
+    automatically set to False
:param class_ids: list of class_ids or single class_id to parse and include in batches
:type class_ids: list of str or str
:param artm.ARTM process_in_memory_model: ARTM instance that will use this vectorizer, is\
@@ -99,7 +100,7 @@ def __init__(self, batches=None, collection_name=None, data_path='', data_format
        self._batch_size = batch_size

        self._dictionary = None
-        if gather_dictionary and not isinstance(data_weight, list):
+        if gather_dictionary and not isinstance(data_weight, list) and data_format != 'batches':
            self._dictionary = Dictionary()

        if data_format == 'bow_n_wd':
@@ -216,10 +217,6 @@ def _parse_batches(self, data_weight=None, batches=None):
            self._batches_list += [Batch(os.path.join(data_p, batch)) for batch in batches]
            self._weights += [data_w for i in range(len(batches))]

-            # next code will be processed only if for-loop has only one iteration
-            if self._dictionary is not None:
-                self._dictionary.gather(data_path=data_p)
-
def _parse_n_wd(self, data_weight=None, n_wd=None, vocab=None):
def __reset_batch():
batch = messages.Batch()
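In usage terms, the fix means that loading ready batches no longer triggers an implicit dictionary gather. A rough sketch of the resulting behavior, assuming the standard ``artm.BatchVectorizer`` and ``artm.Dictionary`` API (the directory name is a placeholder):

.. code-block:: python

    import artm

    # data_format='batches' now implies gather_dictionary=False,
    # so no default dictionary is built while loading ready batches.
    batch_vectorizer = artm.BatchVectorizer(data_path='my_collection_batches',
                                            data_format='batches')

    # If a dictionary is needed, gather it explicitly from the same batches.
    my_dictionary = artm.Dictionary()
    my_dictionary.gather(data_path='my_collection_batches')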
