Naive-Bayes-Gibbs Sampler following the Resnik / Hardisty tutorial


Benjamin Boerschinger, 02/09/13

Three different samplers for Naive Bayes document classification using only unigram features.

For a tutorial on the samplerCollapsed.py sampler, refer to:

Resnik, P., Hardisty, E., Gibbs Sampling for the Uninitiated, 2010 - http://drum.lib.umd.edu/handle/1903/10058

The three samplers differ as follows:

samplerFullyUncollapsed.py	-	integrates out neither the label distribution nor the word distributions
samplerUncollapsed.py		-	integrates out the label distribution but not the word distributions
samplerCollapsed.py		-	integrates out all latent distributions
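
For orientation, the collapsed full conditional that samplerCollapsed.py is built around can be sketched in a few lines. The following is a minimal illustration of one collapsed Gibbs sweep under symmetric Beta/Dirichlet priors (gamma_pi on the label distribution, gamma_theta on the word distributions); the function and variable names are assumptions for exposition, not the repository's actual code:

import math
import random
from collections import Counter

def collapsed_gibbs_sweep(docs, labels, nclasses, gamma_pi=1.0, gamma_theta=1.0):
    """One sweep of collapsed Gibbs sampling over document labels.
    docs: list of Counter (word counts per document); labels: list of int,
    updated in place."""
    V = len(set(w for d in docs for w in d))
    class_docs = Counter(labels)                        # documents per class
    class_words = [Counter() for _ in range(nclasses)]  # word counts per class
    class_total = [0] * nclasses                        # tokens per class
    for d, l in zip(docs, labels):
        class_words[l].update(d)
        class_total[l] += sum(d.values())

    for j, d in enumerate(docs):
        old, n_j = labels[j], sum(d.values())
        # remove document j's counts from its current class
        class_docs[old] -= 1
        class_words[old].subtract(d)
        class_total[old] -= n_j

        # log of the collapsed full conditional for each class
        logp = []
        for c in range(nclasses):
            lp = math.log(class_docs[c] + gamma_pi)
            for w, cnt in d.items():
                for i in range(cnt):
                    lp += math.log(class_words[c][w] + gamma_theta + i)
            for i in range(n_j):
                lp -= math.log(class_total[c] + V * gamma_theta + i)
            logp.append(lp)

        # sample a new label proportional to exp(logp)
        m = max(logp)
        probs = [math.exp(x - m) for x in logp]
        r = random.random() * sum(probs)
        new = nclasses - 1
        for c in range(nclasses):
            if r <= probs[c]:
                new = c
                break
            r -= probs[c]

        # add document j's counts back under the sampled label
        labels[j] = new
        class_docs[new] += 1
        class_words[new].update(d)
        class_total[new] += n_j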

There is also a simple random-input generator, so you can check that the samplers work on data which fits their assumptions perfectly:

generator.py <nTopic1> <nTopic2> > <fake-data>

will write to <fake-data> nTopic1 paragraphs drawn from one (fixed) topic, followed by nTopic2 paragraphs drawn from another.
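
Conceptually, the generator draws each paragraph as a bag of words from one of two fixed unigram distributions. A minimal sketch under that assumption (the vocabulary and topic weights below are invented for illustration and need not match generator.py):

import random
import sys

# two invented unigram topics over a shared vocabulary
VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon"]
TOPIC1 = [0.5, 0.2, 0.1, 0.1, 0.1]
TOPIC2 = [0.1, 0.1, 0.1, 0.2, 0.5]

def generate(n_topic1, n_topic2, length=50, seed=None):
    """Print n_topic1 paragraphs from TOPIC1, then n_topic2 from TOPIC2,
    one bag-of-words paragraph per line."""
    rng = random.Random(seed)
    for weights, n in ((TOPIC1, n_topic1), (TOPIC2, n_topic2)):
        for _ in range(n):
            print(" ".join(rng.choices(VOCAB, weights=weights, k=length)))

if __name__ == "__main__":
    generate(int(sys.argv[1]), int(sys.argv[2]))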

Run each sampler as follows:

python <sampler> <input> [niters=20] [nclasses=2]

The input is a single text file, with one "document" per line. The samplers produce output like:

# iteration logProb pi
0 -19794.929792 0.468 0 0 0 0 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0
1 -19535.626706 0.597 1 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 -19474.710527 0.548 1 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 -19388.812707 0.629 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 -19377.120791 0.661 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The first column of each line is the iteration, the second the current log-probability of the state, the third (somewhat arbitrarily) the probability of label 0, followed
by the current labeling of the input paragraphs.
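
If you want to post-process a run, each sample line splits cleanly on whitespace. A small parsing sketch (the file name samples.txt is hypothetical, standing in for redirected sampler output):

def parse_sample_line(line):
    """Split one sampler output line into (iteration, logprob, pi, labels)."""
    fields = line.split()
    return (int(fields[0]),                # iteration
            float(fields[1]),              # log-probability of the state
            float(fields[2]),              # probability of label 0
            [int(x) for x in fields[3:]])  # current labeling

with open("samples.txt") as f:
    samples = [parse_sample_line(l) for l in f if not l.startswith("#")]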

To demonstrate the importance of collapsing, compare the performance of the three samplers on the toy data ari_hes_ari_20_20_20.txt, made up of the first 20 paragraphs each of Aristotle's Categories, Hesse's Siddhartha, and Aristotle's Poetics, texts as per Project Gutenberg (www.gutenberg.org).

All but the collapsed sampler fail; the collapsed sampler, run with 2 classes, even properly "brackets" the Hesse part of the input.

On the other hand, all samplers converge reasonably quickly on toy data generated by generator.py. Try 4_10.txt, or generate your own fake input with generator.py.
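
To put a number on a run against the Aristotle/Hesse corpus, you can score the final labeling against the block structure suggested by the file name. A sketch, assuming 2 classes and the 20/20/20 layout described above, reusing parse_sample_line from the earlier sketch:

def two_class_accuracy(pred, gold):
    """Accuracy for two classes, invariant to swapping the label names."""
    hits = sum(p == g for p, g in zip(pred, gold))
    return max(hits, len(gold) - hits) / len(gold)

# assumed reference labeling: 20 Aristotle, 20 Hesse, 20 Aristotle paragraphs
gold = [0] * 20 + [1] * 20 + [0] * 20
final_labels = samples[-1][3]  # labeling from the last parsed sample
print(two_class_accuracy(final_labels, gold))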
