Merge pull request #851 from bigartm/master
Bring stable branch forward and release v0.9.0
JeanPaulShapo committed Nov 4, 2017
2 parents 4bf8a7a + e38027c commit 41b9109
Showing 200 changed files with 6,529 additions and 2,203 deletions.
17 changes: 17 additions & 0 deletions .gitignore
@@ -35,6 +35,9 @@ datasets/docword.*.txt
*/ipch/*
*.vcxproj.user

# CLion garbage
.idea/

# Auto-generated files (by pdflatex)
collateral/*.aux
collateral/*.log
@@ -59,6 +62,8 @@ src/Win32/*
*.log.ERROR.*
*.batch

*\#*\#*

# Binaries
/src/artm/unittests/
/src/cpp_client/cpp_client
@@ -128,6 +133,7 @@ src/Win32/*
docs/_build/*

build*
cmake-build-debug*

# protobuf generated files
/python/artm/wrapper/messages_pb2.py
@@ -153,3 +159,14 @@ src/artm/messages.pb.h
.ycm_extra_conf.py

src/artm/version.h

# MAC OS temp files
*.DS_Store

# bigartm logs
*bigartm.ERROR
*bigartm.INFO
*bigartm.WARNING

# test data
kos/
33 changes: 28 additions & 5 deletions .travis.yml
@@ -5,12 +5,21 @@ language:
# Currently Travis doesn't support multiple values for language variable
# - cpp

# compiler:
#compiler:
# - gcc

# not sure whether the following configuration will be available
# with non-C/C++ language (e.g python)
#cache: ccache

cache:
directories:
- $HOME/.ccache

python:
- "2.7"
- "3.4"
- "3.5"

addons:
apt:
@@ -23,21 +32,35 @@ addons:
- cmake-data
- libboost1.55-all-dev
- g++-4.9
- gcc-4.9

env:
- COMPILER=g++-4.9
global:
# variables for caching
- CCACHE_DIR=$HOME/.ccache
- CCACHE_COMPILERCHECK=content
- CCACHE_COMPRESS=true
- CCACHE_NODISABLE=true
- CCACHE_MAXSIZE=500M
matrix:
- GCC_VER=4.9

before_install:
# we need latest pip to work with only-binary option
- pip install -U pip
- pip install -U pytest pep8
- pip install -U numpy pandas tqdm --only-binary numpy pandas
- pip install -U pytest pep8 wheel
- pip install -U numpy scipy pandas tqdm --only-binary numpy scipy pandas
- pip install protobuf==3.0.0
# configure ccache
# code from https://github.com/urho3d/Urho3D/blob/master/.travis.yml
- export PATH=$(whereis -b ccache |grep -o '\S*lib\S*'):$PATH
- export CXX=g++-$GCC_VER CC=gcc-$GCC_VER
- for compiler in gcc g++; do ln -s $(which ccache) $HOME/$compiler-$GCC_VER; done && export PATH=$HOME:$PATH

install:
- if [[ $TRAVIS_PYTHON_VERSION == 2* ]]; then ./codestyle_checks.sh; fi
- mkdir build
- pushd build && export CXX=$COMPILER && cmake -DBUILD_BIGARTM_CLI_STATIC=OFF .. && make -j2 && file ./bin/bigartm && popd
- pushd build && cmake -DBUILD_BIGARTM_CLI_STATIC=OFF .. && make -j2 && file ./bin/bigartm && popd
- pushd python && python setup.py install && popd

before_script:
4 changes: 0 additions & 4 deletions CMakeLists.txt
@@ -35,11 +35,7 @@ option(BUILD_TESTS "Indicates whether to build artm_tests" ON)
option(BUILD_BIGARTM_CLI "Indicates whether to build bigartm-CLI executable" ON)
option(BUILD_INTERNAL_PYTHON_API "Indicates whether to build Python API" ON)

if (MSVC OR APPLE)
set(PYTHON python CACHE INTERNAL "Python command")
else (MSVC OR APPLE)
set(PYTHON python2 CACHE INTERNAL "Python command")
endif (MSVC OR APPLE)

if (BUILD_BIGARTM_CLI AND UNIX AND NOT APPLE)
option(BUILD_BIGARTM_CLI_STATIC "Request build of static executable bigartm (for Linux only)" ON)
54 changes: 36 additions & 18 deletions README.md
@@ -17,7 +17,7 @@ The state-of-the-art platform for topic modeling.

# What is BigARTM?

BigARTM is a tool for [topic modeling](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf) based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding the weighted sums of regularizers to the optimization criterion. BigARTM is known to combine well very different objectives, including sparsing, smoothing, topics decorrelation and many others. Such combination of regularizers significantly improves several quality measures at once almost without any loss of the perplexity.
BigARTM is a powerful tool for [topic modeling](https://en.wikipedia.org/wiki/Topic_model) based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding the weighted sums of regularizers to the optimization criterion. BigARTM is known to combine well very different objectives, including sparsing, smoothing, topics decorrelation and many others. Such combination of regularizers significantly improves several quality measures at once almost without any loss of the perplexity.

### References

@@ -29,7 +29,7 @@ BigARTM is a tool for [topic modeling](https://www.cs.princeton.edu/~blei/papers

### Related Software Packages

- [David Blei's List](https://www.cs.princeton.edu/~blei/topicmodeling.html) of Open Source topic modeling software
- [David Blei's List](http://www.cs.columbia.edu/~blei/topicmodeling_software.html) of Open Source topic modeling software
- [MALLET](http://mallet.cs.umass.edu/topics.php): Java-based toolkit for language processing with topic modeling package
- [Gensim](https://radimrehurek.com/gensim/): Python topic modeling library
- [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) has an implementation of [Online-LDA algorithm](https://github.com/JohnLangford/vowpal_wabbit/wiki/Latent-Dirichlet-Allocation)
@@ -84,33 +84,51 @@ bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictio

### Interactive Python interface

Check out the documentation for the ARTM Python interface
[in English](http://nbviewer.ipython.org/github/bigartm/bigartm-book/blob/master/ARTM_tutorial_EN.ipynb) and
[in Russian](http://nbviewer.ipython.org/github/bigartm/bigartm-book/blob/master/ARTM_tutorial_RU.ipynb)
BigARTM supports full-featured and clear Python API (see [Installation](http://docs.bigartm.org/en/latest/installation/index.html) to configure Python API for your OS).

Refer to [tutorials](http://docs.bigartm.org/en/latest/tutorials/index.html) for details on how to install and start using Python interface.
Example:

```python
# A stub
import bigartm

model = bigartm.ARTM(num_topics=15)
batch_vectorizer = bigartm.BatchVectorizer(data_format='bow_uci',
collection_name='kos',
target_folder='kos')
model.fit_offline(batches, passes=5)
print model.phi_
import artm

# Prepare data
# Case 1: data in CountVectorizer format
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from numpy import array

cv = CountVectorizer(max_features=1000, stop_words='english')
n_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T
vocabulary = cv.get_feature_names()

bv = artm.BatchVectorizer(data_format='bow_n_wd',
n_wd=n_wd,
vocabulary=vocabulary)

# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
bv = artm.BatchVectorizer(data_format='bow_uci',
collection_name='kos',
target_folder='kos_batches')

# Learn simple LDA model (or you can use advanced artm.ARTM)
model = artm.LDA(num_topics=15, dictionary=bv.dictionary)
model.fit_offline(bv, num_collection_passes=20)

# Print results
model.get_top_tokens()
```

Refer to [tutorials](http://docs.bigartm.org/en/latest/tutorials/python_tutorial.html) for details on how to start using BigARTM from Python, [user's guide](http://docs.bigartm.org/en/latest/tutorials/python_userguide/index.html) can provide information about more advanced features and cases.
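
The `.T` in Case 1 above suggests that the `bow_n_wd` format expects a words-by-documents count matrix, with one row per vocabulary entry and one column per document. The following is a minimal sketch of that layout using only the Python standard library; `build_n_wd` is a hypothetical helper written for illustration and is not part of BigARTM, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def build_n_wd(documents):
    """Build a words-by-documents count matrix (hypothetical helper).

    Returns (n_wd, vocabulary), where n_wd[w][d] is the number of times
    vocabulary[w] occurs in documents[d].
    """
    # Count tokens per document; lowercase + whitespace split is an assumption.
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Sorted vocabulary gives a stable row order.
    vocabulary = sorted(set().union(*counts))
    # One row per word, one column per document.
    n_wd = [[c[token] for c in counts] for token in vocabulary]
    return n_wd, vocabulary


docs = ["topic models find topics", "models need data", "data data data"]
n_wd, vocabulary = build_n_wd(docs)

for token, row in zip(vocabulary, n_wd):
    print(token, row)
```

In the README example the matrix is a NumPy array, so a list of lists like this one would presumably need to be wrapped with `numpy.array(n_wd)` before being passed as `n_wd=` together with `vocabulary=`.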

### Low-level API

- [C++ Interface](http://docs.bigartm.org/en/latest/ref/cpp_interface.html)
- [Plain C Interface](http://docs.bigartm.org/en/latest/ref/c_interface.html)
- [C++ Interface](http://docs.bigartm.org/en/latest/api_references/cpp_interface.html)
- [Plain C Interface](http://docs.bigartm.org/en/latest/api_references/c_interface.html)


## Contributing

Refer to the [Developer's Guide](http://docs.bigartm.org/en/latest/devguide.html).
Refer to the [Developer's Guide](http://docs.bigartm.org/en/latest/devguide.html) and follow the [Code Style](https://github.com/bigartm/bigartm/wiki/Code-style).

To report a bug use [issue tracker](https://github.com/bigartm/bigartm/issues). To ask a question use [our mailing list](https://groups.google.com/forum/#!forum/bigartm-users). Feel free to make [pull request](https://github.com/bigartm/bigartm/pulls).

10 changes: 5 additions & 5 deletions appveyor.yml
@@ -24,8 +24,8 @@ environment:

matrix:

- PYTHON_VERSION: 3.5
MINICONDA: C:\Miniconda35-x64
- PYTHON_VERSION: 3.6
MINICONDA: C:\Miniconda36-x64

- PYTHON_VERSION: 2.7
MINICONDA: C:\Miniconda-x64
@@ -44,9 +44,9 @@ clone_folder: C:\projects\bigartm
install:
- "set PATH=%MINICONDA%;%MINICONDA%\\Scripts;%PATH%"
- conda config --set always_yes yes --set changeps1 no
- conda update -q conda
- conda update -q -c conda-forge conda
- conda info -a
- conda install numpy pandas pytest
- conda install -c conda-forge numpy scipy pandas pytest
- conda install -c conda-forge tqdm

# scripts to run before build
@@ -56,7 +56,7 @@ before_build:
- cmd: cd build
- cmd: if "%platform%"=="Win32" set CMAKE_GENERATOR_NAME=Visual Studio 14 2015
- cmd: if "%platform%"=="x64" set CMAKE_GENERATOR_NAME=Visual Studio 14 2015 Win64
- cmd: cmake -DPYTHON=python -G "%CMAKE_GENERATOR_NAME%" -DCMAKE_BUILD_TYPE=%configuration% -DBOOST_ROOT="%BOOST_ROOT%" -DBOOST_LIBRARYDIR="%BOOST_LIBRARYDIR%" ..
- cmd: cmake -DPYTHON=python -G "%CMAKE_GENERATOR_NAME%" -DCMAKE_BUILD_TYPE=%configuration% -DBOOST_ROOT:PATH="%BOOST_ROOT%" -DBOOST_LIBRARYDIR:PATH="%BOOST_LIBRARYDIR%" ..
- cmd: cd %BIGARTM_UNITTEST_DATA%
- ps: Start-FileDownload 'https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt'
- ps: Start-FileDownload 'https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt'
4 changes: 2 additions & 2 deletions docs/api_references/c_interface.txt
@@ -1,6 +1,6 @@
==================
===========
C Interface
==================
===========

This document explains all public methods of the low level BigARTM interface, written in plain C language.

2 changes: 1 addition & 1 deletion docs/api_references/index.txt
@@ -6,7 +6,7 @@
.. _tutorial:

API References
=========
==============

.. toctree::
:maxdepth: 2
2 changes: 1 addition & 1 deletion docs/api_references/python_interface.txt
@@ -1,5 +1,5 @@
Python Interface
====================
================

This document describes all classes and functions
in python interface of BigARTM library.
2 changes: 1 addition & 1 deletion docs/api_references/python_interface/artm_model.txt
@@ -1,5 +1,5 @@
ARTM model
================
==========

This page describes ARTM class.

2 changes: 1 addition & 1 deletion docs/api_references/python_interface/batches_utils.txt
@@ -1,5 +1,5 @@
Batches Utils
================
=============

This page describes BatchVectorizer class.

2 changes: 1 addition & 1 deletion docs/api_references/python_interface/dictionary.txt
@@ -1,5 +1,5 @@
Dictionary
================
==========

This page describes Dictionary class.

2 changes: 1 addition & 1 deletion docs/api_references/python_interface/lda_model.txt
@@ -1,5 +1,5 @@
LDA model
================
=========

This page describes LDA class.

18 changes: 17 additions & 1 deletion docs/api_references/python_interface/regularizers.txt
@@ -1,5 +1,5 @@
Regularizers
================
============

This page describes *KlFunctionInfo* and *Regularizer* classes.

@@ -43,6 +43,22 @@ See detailed description of regularizers :doc:`../../tutorials/regularizers_desc
:members:
:special-members: __init__

.. autoclass:: BitermsPhiRegularizer
:members:
:special-members: __init__

.. autoclass:: HierarchySparsingThetaRegularizer
:members:
:special-members: __init__

.. autoclass:: TopicSegmentationPtdwRegularizer
:members:
:special-members: __init__

.. autoclass:: SmoothTimeInTopicsPhiRegularizer
:members:
:special-members: __init__

.. autoclass:: NetPlsaPhiRegularizer
:members:
:special-members: __init__
2 changes: 1 addition & 1 deletion docs/api_references/python_interface/score_tracker.txt
@@ -1,5 +1,5 @@
Score Tracker
================
=============

This page describes *ScoreTracker classes.

2 changes: 1 addition & 1 deletion docs/api_references/python_interface/scores.txt
@@ -1,5 +1,5 @@
Scores
================
======

This page describes *Scores classes.

2 changes: 1 addition & 1 deletion docs/devguide/code_style.txt
@@ -1,5 +1,5 @@
Code style
=========================
==========

.. sidebar:: Configure Visual Studio

6 changes: 3 additions & 3 deletions docs/devguide/create_regularizer.txt
@@ -1,5 +1,5 @@
Creating New Regularizer
=========================
========================

This manual describes the steps needed to create your own regularizer in the core of the BigARTM library. We assume you are in the root directory of BigARTM. Google Protocol Buffers are used throughout, so we also assume you are familiar with them. The instructions are accompanied by examples of two regularizers, one per matrix (New Regularizer Phi and New Regularizer Theta).

@@ -183,7 +183,7 @@ Also take into consideration the notation of parameters naming (for example, cla
for the changes to take effect.

Phi regularizer C++ code
-------------
------------------------

All you need is to implement the method

@@ -196,7 +196,7 @@ All you need is to implement the method
Here you use p_wt, n_wt and all the information passed as parameters through the config to compute r_wt and store it in the ``result`` variable. Multiplication by ``tau`` and the use of relative regularization coefficients are handled automatically in later computations, so you don't need to worry about them.

Theta regularizer C++ code
-------------
--------------------------

You need to create a class implementing the ``RegularizeThetaAgent`` interface (e.g., ``NewRegularizerThetaAgent``) and a class implementing ``RegularizerInterface`` interface (e.g., ``NewRegularizerTheta``).

2 changes: 1 addition & 1 deletion docs/devguide/dev_build_linux.txt
@@ -1,5 +1,5 @@
Build C++ code on Linux
=========================
=======================

Refer to :doc:`/installation/linux`.

2 changes: 1 addition & 1 deletion docs/devguide/downloads.txt
@@ -1,5 +1,5 @@
Downloads (Windows)
=========================
===================

Download and install the following tools:

2 changes: 1 addition & 1 deletion docs/devguide/ipython_windows.txt
@@ -1,5 +1,5 @@
Working with iPython notebooks remotely
=========================
=======================================

It turned out to be common scenario to run BigARTM on a Linux server (for example on Amazon EC2), while connecting to it from Windows through ``putty``.
Here is a convenient way to use ``ipython notebook`` in this scenario:
2 changes: 1 addition & 1 deletion docs/devguide/proto_windows.txt
@@ -1,5 +1,5 @@
Compiling .proto files on Windows
=========================
=================================

1. Open a new command prompt

2 changes: 1 addition & 1 deletion docs/devguide/python_windows.txt
@@ -1,5 +1,5 @@
Python code on Windows
=========================
======================

* Install Python 2.7 (this step is already done if you are following the instructions above),
