BigARTM v0.7.4 Release notes
============================
BigARTM v0.7.4 is a big release that includes major rework of dictionaries and `MasterModel <https://github.com/bigartm/bigartm/issues/325>`_.
`bigartm/stable` branch
-----------------------
Up until now BigARTM had only one ``master`` branch, containing the latest code.
This branch potentially includes untested code and unfinished features.
We are now introducing ``bigartm/stable`` branch, and encourage all users to
stop using ``master`` and start fetching from ``stable``.
The ``stable`` branch will lag behind ``master``, and will be moved forward to ``master``
as soon as the maintainers decide that it is ready.
At that point we will introduce a new tag (something like `v0.7.3 <https://github.com/bigartm/bigartm/tree/v0.7.3>`_)
and produce a new release for Windows.
In addition, ``stable`` branch also might receive small urgent fixes in between releases,
typically to address critical issues reported by our users.
Such fixes will also be included in the ``master`` branch.
MasterModel
-----------
MasterModel is a new set of low-level APIs that allow users of the C interface to infer models and apply them to new data.
The APIs are ``ArtmCreateMasterModel``, ``ArtmReconfigureMasterModel``, ``ArtmFitOfflineMasterModel``, ``ArtmFitOnlineMasterModel`` and ``ArtmRequestTransformMasterModel``,
together with the corresponding protobuf messages. For a usage example see ``src/bigartm/srcmain.cc``.
These APIs should be easy to understand for users who are familiar with the Python interface. Essentially, we take the ``ARTM`` class from Python
and push it down into the core.
Now users can create their model via ``MasterModelConfig`` (protobuf message),
fit it via ``ArtmFitOfflineMasterModel`` or ``ArtmFitOnlineMasterModel``, and apply it to new data via ``ArtmRequestTransformMasterModel``.
This means that the user no longer has to orchestrate low-level building blocks such as ``ArtmProcessBatches``, ``ArtmMergeModel``, ``ArtmRegularizeModel`` and ``ArtmNormalizeModel``.
``ArtmCreateMasterModel`` is similar to ``ArtmCreateMasterComponent`` in the sense that it returns a ``master_id``,
which can later be passed to all other APIs. This means that most APIs will continue working as before.
This applies to ``ArtmRequestThetaMatrix``, ``ArtmRequestTopicModel``, ``ArtmRequestScore``, and many others.
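Conceptually, one fit-offline pass processes every batch, merges the per-batch token-topic counters, applies regularizers, and normalizes the result into the phi matrix. A minimal pure-Python sketch of the merge and normalize steps (toy data; the function names are illustrative, not the real ``c_interface`` calls, and the real work happens inside the BigARTM core):

```python
# Toy illustration of the merge + normalize steps that
# ArtmFitOfflineMasterModel now drives internally. Names and data are
# illustrative only.

def merge_counters(per_batch_counters):
    """Sum per-batch n_wt counters into one model (ArtmMergeModel analogue)."""
    merged = {}
    for counters in per_batch_counters:
        for (token, topic), count in counters.items():
            merged[(token, topic)] = merged.get((token, topic), 0.0) + count
    return merged

def normalize(n_wt, num_topics):
    """Turn counters into p(token | topic) columns (ArtmNormalizeModel analogue)."""
    totals = [0.0] * num_topics
    for (_, topic), count in n_wt.items():
        totals[topic] += count
    return {(token, topic): count / totals[topic]
            for (token, topic), count in n_wt.items()}

# Two "batches" worth of token-topic counters:
batch1 = {("cat", 0): 2.0, ("dog", 0): 1.0, ("tax", 1): 3.0}
batch2 = {("cat", 0): 1.0, ("law", 1): 1.0}

phi = normalize(merge_counters([batch1, batch2]), num_topics=2)
print(phi[("cat", 0)])  # 3.0 / 4.0 = 0.75
```

After normalization each topic column sums to one, which is exactly the invariant the real ``ArtmNormalizeModel`` step maintains.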
Rework of dictionaries
----------------------
The previous implementation of dictionaries was quite messy, and we are cleaning it up. This effort is not finished yet, but we decided to release the current version because
it is a major improvement over the previous one.
At the low-level (``c_interface``), we now have the following methods to work with dictionaries:
* ``ArtmGatherDictionary`` collects a dictionary based on a folder with batches,
* ``ArtmFilterDictionary`` filters tokens from the dictionary based on their term frequency or document frequency,
* ``ArtmCreateDictionary`` creates a dictionary from a custom ``DictionaryData`` object (protobuf message),
* ``ArtmRequestDictionary`` retrieves a dictionary as ``DictionaryData`` object (protobuf message),
* ``ArtmDisposeDictionary`` deletes dictionary object from BigARTM,
* ``ArtmImportDictionary`` imports a dictionary from a binary file,
* ``ArtmExportDictionary`` exports a dictionary to a binary file.
All dictionaries are identified by a string ID (``dictionary_name``).
Dictionaries can be used to initialize the model, in regularizers or in scores.
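The filtering rule applied by ``ArtmFilterDictionary`` can be sketched in plain Python. In this toy reading (the function and variable names are illustrative, not the real implementation), ``min_df`` is the minimum number of documents a token must occur in, and ``max_df_rate`` is the maximum allowed share of documents:

```python
# Toy sketch of document-frequency filtering in the spirit of
# ArtmFilterDictionary: keep a token only if it occurs in at least `min_df`
# documents and in at most a `max_df_rate` share of all documents.
# Illustrative only; the real filter lives in the BigARTM core.

def filter_tokens(df, num_documents, min_df=0, max_df_rate=1.0):
    """df maps token -> number of documents containing it."""
    return {token: count for token, count in df.items()
            if count >= min_df and count / num_documents <= max_df_rate}

df = {"the": 950, "topic": 120, "bigartm": 8}
kept = filter_tokens(df, num_documents=1000, min_df=10, max_df_rate=0.4)
print(sorted(kept))  # ['topic'] -- "the" is too frequent, "bigartm" too rare
```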
Note that ``ArtmImportDictionary`` and ``ArtmExportDictionary`` now use a different format.
For this reason we require that all imported or exported files end with ``.dict`` extension.
This limitation is only introduced to make users aware of the change in binary format.
.. warning::

   All dictionaries created with previous BigARTM versions have to be re-generated
   before they can be used with this release.
Please note that in the next version (`BigARTM v0.8.0`) we are planning to break the dictionary format once again.
This is because we will introduce the ``boost.serialize`` library for all import and export methods.
From that point on, the ``boost.serialize`` library will allow us to upgrade formats without breaking backwards compatibility.
The following example illustrates how to work with the new dictionaries from Python.
.. code-block:: python

   # Parse a collection in UCI format from D:\Datasets\docword.kos.txt
   # and D:\Datasets\vocab.kos.txt,
   # and store the resulting batches into D:\Datasets\kos_batches
   batch_vectorizer = artm.BatchVectorizer(data_format='bow_uci',
                                           data_path=r'D:\Datasets',
                                           collection_name='kos',
                                           target_folder=r'D:\Datasets\kos_batches')

   # Initialize the model. For now dictionaries exist within the model,
   # but we will address this in the future.
   model = artm.ARTM(...)

   # Gather a dictionary named `dict` from batches.
   # The resulting dictionary will contain all distinct tokens that occur
   # in those batches, together with their term frequencies.
   model.gather_dictionary("dict", r"D:\Datasets\kos_batches")

   # Filter the dictionary by removing tokens with too high or too low
   # document frequency, and save the result as `filtered_dict`.
   model.filter_dictionary(dictionary_name='dict',
                           dictionary_target_name='filtered_dict',
                           min_df=10, max_df_rate=0.4)

   # Initialize the model from `filtered_dict`
   model.initialize("filtered_dict")

   # Import/export functionality
   model.save_dictionary("filtered_dict", r"D:\Datasets\kos.dict")
   model.load_dictionary("filtered_dict2", r"D:\Datasets\kos.dict")
Changes in the infrastructure
-----------------------------
* Static linkage for bigartm command-line executable on Linux.
To disable static linkage use ``cmake -DBUILD_STATIC_BIGARTM=OFF ..``
* Install BigARTM python API via ``python setup.py install``
Changes in core functionality
-----------------------------
* Custom transform function for KL-div regularizers
* Ability to initialize the model with custom seed
* ``TopicSelection`` regularizers
* ``PeakMemory`` score (Windows only)
* Different options to name batches when parsing collection
(``GUID`` as today, and ``CODE`` for sequential numbering)
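The two batch-naming schemes can be sketched as follows (a toy illustration; the actual file names are produced by the collection parser, and the helper names here are hypothetical):

```python
# Toy illustration of the two batch-naming options: GUID (random identifier,
# as before) versus CODE (sequential numbering). Helper names are
# illustrative, not part of the BigARTM API.
import uuid

def batch_name_guid():
    return "%s.batch" % uuid.uuid4()

def batch_name_code(index):
    return "%06d.batch" % index

print(batch_name_code(0))   # 000000.batch
print(batch_name_code(11))  # 000011.batch
print(batch_name_guid())    # e.g. 1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed.batch
```

Sequential ``CODE`` names keep batches in a stable, human-readable order on disk, while ``GUID`` names stay unique across repeated parses.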
Changes in Python API
---------------------
* ``ARTM.dispose()`` method for managing native memory
* ``ARTM.get_info()`` method to retrieve internal state
* Performance fixes
* Expose class prediction functionality
Changes in C++ interface
------------------------
* Consume ``MasterModel`` APIs in C++ interface.
Going forward this is the only C++ interface that we will support.
Changes in console interface
----------------------------
* Better options to work with dictionaries
* ``--write-dictionary-readable`` to export a dictionary in human-readable format
* ``--force`` switch to let user overwrite existing files
* ``--help`` generates much better examples
* ``--model-v06`` to experiment with old APIs (``ArtmInvokeIteration`` / ``ArtmWaitIdle`` / ``ArtmSynchronizeModel``)
* ``--write-scores`` switch to export scores into file
* ``--time-limit`` option to time-box model inference (as an alternative to the ``--passes`` switch)