Naive-Bayes-Gibbs Sampler following the Resnik / Hardisty tutorial


Benjamin Boerschinger, 02/09/13

Three different samplers for Naive Bayes document classification using only unigram features.

For a tutorial on the samplerCollapsed.py sampler, refer to:

Resnik, P., Hardisty, E., Gibbs Sampling for the Uninitiated, 2010 - http://drum.lib.umd.edu/handle/1903/10058

The three samplers differ as follows:

samplerFullyUncollapsed.py	-	integrates out neither the label distribution nor the word distributions
samplerUncollapsed.py		-	integrates out the label distribution but not the word distributions
samplerCollapsed.py		-	integrates out all latent distributions
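
For orientation, the collapsed full conditional that samplerCollapsed.py is built around can be sketched in a few lines. The following is a minimal illustration of one collapsed Gibbs sweep under symmetric Beta/Dirichlet priors (gamma_pi on the label distribution, gamma_theta on the word distributions); the function and variable names are assumptions for exposition, not the repository's actual code:

import math
import random
from collections import Counter

def collapsed_gibbs_sweep(docs, labels, nclasses, gamma_pi=1.0, gamma_theta=1.0):
    """One sweep of collapsed Gibbs sampling over document labels.
    docs: list of Counter (word counts per document); labels: list of int,
    updated in place."""
    V = len(set(w for d in docs for w in d))
    class_docs = Counter(labels)                        # documents per class
    class_words = [Counter() for _ in range(nclasses)]  # word counts per class
    class_total = [0] * nclasses                        # tokens per class
    for d, l in zip(docs, labels):
        class_words[l].update(d)
        class_total[l] += sum(d.values())

    for j, d in enumerate(docs):
        old, n_j = labels[j], sum(d.values())
        # remove document j's counts from its current class
        class_docs[old] -= 1
        class_words[old].subtract(d)
        class_total[old] -= n_j

        # log of the collapsed full conditional for each class
        logp = []
        for c in range(nclasses):
            lp = math.log(class_docs[c] + gamma_pi)
            for w, cnt in d.items():
                for i in range(cnt):
                    lp += math.log(class_words[c][w] + gamma_theta + i)
            for i in range(n_j):
                lp -= math.log(class_total[c] + V * gamma_theta + i)
            logp.append(lp)

        # sample a new label proportional to exp(logp)
        m = max(logp)
        probs = [math.exp(x - m) for x in logp]
        r = random.random() * sum(probs)
        new = nclasses - 1
        for c in range(nclasses):
            if r <= probs[c]:
                new = c
                break
            r -= probs[c]

        # add document j's counts back under the sampled label
        labels[j] = new
        class_docs[new] += 1
        class_words[new].update(d)
        class_total[new] += n_j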

There is also a simple random-input generator, so you can check that the samplers work on data which fits their assumptions perfectly:

generator.py <nTopic1> <nTopic2> > <fake-data>

will write to <fake-data> nTopic1 paragraphs drawn from one (fixed) topic, followed by nTopic2 paragraphs drawn from another.
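
Conceptually, the generator draws each paragraph as a bag of words from one of two fixed unigram distributions. A minimal sketch under that assumption (the vocabulary and topic weights below are invented for illustration and need not match generator.py):

import random
import sys

# two invented unigram topics over a shared vocabulary
VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon"]
TOPIC1 = [0.5, 0.2, 0.1, 0.1, 0.1]
TOPIC2 = [0.1, 0.1, 0.1, 0.2, 0.5]

def generate(n_topic1, n_topic2, length=50, seed=None):
    """Print n_topic1 paragraphs from TOPIC1, then n_topic2 from TOPIC2,
    one bag-of-words paragraph per line."""
    rng = random.Random(seed)
    for weights, n in ((TOPIC1, n_topic1), (TOPIC2, n_topic2)):
        for _ in range(n):
            print(" ".join(rng.choices(VOCAB, weights=weights, k=length)))

if __name__ == "__main__":
    generate(int(sys.argv[1]), int(sys.argv[2]))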

Run each sampler as follows:

python <sampler> <input> [niters=20] [nclasses=2]

The input is a single text file, with one "document" per line. The samplers produce output like:

# iteration logProb pi
0 -19794.929792 0.468 0 0 0 0 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0
1 -19535.626706 0.597 1 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 -19474.710527 0.548 1 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 -19388.812707 0.629 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 -19377.120791 0.661 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The first column of each line is the iteration, the second the current log-probability of the state, the third (somewhat arbitrarily) the probability of label 0, followed
by the current labeling of the input paragraphs.
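
If you want to post-process a run, each sample line splits cleanly on whitespace. A small parsing sketch (the file name samples.txt is hypothetical, standing in for redirected sampler output):

def parse_sample_line(line):
    """Split one sampler output line into (iteration, logprob, pi, labels)."""
    fields = line.split()
    return (int(fields[0]),                # iteration
            float(fields[1]),              # log-probability of the state
            float(fields[2]),              # probability of label 0
            [int(x) for x in fields[3:]])  # current labeling

with open("samples.txt") as f:
    samples = [parse_sample_line(l) for l in f if not l.startswith("#")]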

To demonstrate the importance of collapsing, compare the performance of the three samplers on the toy data ari_hes_ari_20_20_20.txt, made up of the first 20 paragraphs each of Aristotle's Categories, Hesse's Siddhartha, and Aristotle's Poetics, texts as per Project Gutenberg (www.gutenberg.org).

All but the collapsed sampler fail; the collapsed sampler, run with 2 classes, even properly "brackets" the Hesse part of the input.

On the other hand, all samplers converge reasonably quickly on toy data generated by generator.py. Try 4_10.txt, or generate your own fake input with generator.py.
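
To put a number on a run against the Aristotle/Hesse corpus, you can score the final labeling against the block structure suggested by the file name. A sketch, assuming 2 classes and the 20/20/20 layout described above, reusing parse_sample_line from the earlier sketch:

def two_class_accuracy(pred, gold):
    """Accuracy for two classes, invariant to swapping the label names."""
    hits = sum(p == g for p, g in zip(pred, gold))
    return max(hits, len(gold) - hits) / len(gold)

# assumed reference labeling: 20 Aristotle, 20 Hesse, 20 Aristotle paragraphs
gold = [0] * 20 + [1] * 20 + [0] * 20
final_labels = samples[-1][3]  # labeling from the last parsed sample
print(two_class_accuracy(final_labels, gold))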
