LogicLDA - Topic modeling with First-Order Logic (FOL) domain knowledge

David Andrzejewski
Department of Computer Sciences
University of Wisconsin-Madison, USA


This code implements inference for the LogicLDA model [1].  LogicLDA
extends Latent Dirichlet Allocation (LDA) [2] by allowing the user to
specify a weighted first-order logic (FOL) knowledge base (KB), as in
Markov Logic Networks (MLN) [3].  The inferred topics are then
influenced both by document-word corpus statistics and by the
user-specified logical KB.  This code is implemented in Java and
performs scalable MAP inference via stochastic gradient descent.
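
As a schematic example, a weighted rule in the KB might softly seed a
topic.  The line below uses the Z/W predicate notation from the paper
[1], not the on-disk rule syntax (see the RULES file for the actual
format):

    1.0 :  W(i, election) => Z(i, 7)

i.e., with weight 1.0, any corpus position i whose word is "election"
should be assigned to topic 7.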


The LogicLDA Java project can be built with Maven:

$ mvn package
$ cp ./target/logiclda-0.0.1-SNAPSHOT-jar-with-dependencies.jar ./logiclda.jar


Say that our dataset is named 'nyt' (see INPUT/OUTPUT FILES below).
Then we run LogicLDA inference with:

java -jar logiclda.jar nyt 500 100 10000 25 194582

which does the following:

- runs 500 iterations of LogicLDA collapsed Gibbs sampling to initialize
- runs 100 outer / 10000 inner iterations of Mirror Descent SGD inference
- prints the top 25 words for each topic to nyt.topics
- uses 194582 as the random number seed
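
In general, the positional arguments follow the pattern below (the
bracketed names are descriptive placeholders used in this README, not
strings recognized by the program):

java -jar logiclda.jar <dataset> <gibbs-iters> <sgd-outer-iters> <sgd-inner-iters> <topwords> <seed>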

An example dataset and bash script can be found in ./test


INPUT/OUTPUT FILES

The input and output files obey a naming convention where all files
consist of the name of the dataset plus a filetype-specific
extension.  For example, if we were processing a corpus of New York
Times articles, we would supply input files nyt.vocab, ...
and LogicLDA would give us output files nyt.topics, nyt.sample, ...

Note that *.words should contain the integer indices of word tokens,
not the tokens themselves.  For example, if *.vocab is

foo
bar
baz

then the string "foo foo baz bar foo" should appear in *.words as: 0 0 2 1 0

These kinds of representations can be built from plaintext with the
textproc code.
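
For a rough sense of what that conversion involves, here is a minimal,
self-contained Java sketch of the token-to-index mapping (an
illustration only, not the actual textproc implementation):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy illustration of the *.vocab -> *.words mapping described above.
    public class VocabIndexer {
        public static void main(String[] args) {
            // *.vocab: one word per line, line number = word index
            List<String> vocab = Arrays.asList("foo", "bar", "baz");
            Map<String, Integer> index = new HashMap<>();
            for (int i = 0; i < vocab.size(); i++) {
                index.put(vocab.get(i), i);
            }

            // Convert a tokenized corpus to the *.words representation
            String[] tokens = "foo foo baz bar foo".split(" ");
            List<Integer> words = new ArrayList<>();
            for (String tok : tokens) {
                words.add(index.get(tok));
            }
            System.out.println(words);  // prints: [0, 0, 2, 1, 0]
        }
    }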

In the listings below, T = number of topics, W = vocabulary size,
D = number of documents, and N = total number of corpus positions.

Input files:

        .rules          FOL rules (see RULES for details)
        .alpha          [T] alpha hyperparameter
        .beta           [TxW] beta hyperparameter
        .doclist        [D] document names
        .vocab          [W] vocabulary (one word per line)
        .words          [N] word indices for each corpus position
        .docs           [N] document indices for each corpus position

Optional input files:

        .sent           [N] sentence indices for each corpus position
        .init           [N] initial z-sample state

Output files:

        .sample         [N] latent topic indices for each corpus position
        .phi            [TxW] topic-word probabilities P(w|z)
        .theta          [DxT] document-topic probabilities P(z|d)
        .topics         plaintext summary of learned topics
        .logic          plaintext summary of logic rule satisfaction
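
As a concrete (hypothetical) illustration of the position-indexed [N]
files: if the toy corpus "foo foo baz bar foo" above were split into
two documents ("foo foo baz" and "bar foo"), the parallel files would
line up as

        *.words   0 0 2 1 0
        *.docs    0 0 0 1 1

where column i describes corpus position i: its word index and the
index of the document containing it.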


This software is designed to allow the straightforward inclusion of
custom rule types and sources of side information (see EXTENDING for
details).

This software is open-source, released under the terms of the GNU
General Public License version 3, or any later version of the GPL (see
COPYING).

[1] Andrzejewski, D., Zhu, X., Craven, M., and Recht, B. (2011).  A
Framework for Incorporating General Domain Knowledge into Latent
Dirichlet Allocation Using First-Order Logic. In Proceedings of the
22nd International Joint Conference on Artificial Intelligence (IJCAI
2011).
[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet
Allocation.  Journal of Machine Learning Research (JMLR) 3
(Mar. 2003), 993-1022.

[3] Domingos, P. and Lowd, D. (2009).  Markov Logic: An Interface
Layer for Artificial Intelligence. Synthesis Lectures on Artificial
Intelligence and Machine Learning, Morgan and Claypool Publishers.