boerschi/gibbsnaivebayes
Benjamin Boerschinger, 02/09/13

Three different samplers for Naive Bayes document classification using only unigram features. Refer to Resnik, P., and Hardisty, E., "Gibbs Sampling for the Uninitiated", 2010 (http://drum.lib.umd.edu/handle/1903/10058) for a tutorial covering the samplerCollapsed.py sampler.

The three samplers differ as follows:

    samplerFullyUncollapsed.py - integrates out neither the label distribution nor the word distributions
    samplerUncollapsed.py      - integrates out the label distribution but not the word distributions
    samplerCollapsed.py        - integrates out all latent distributions

There is also a simple random-input generator to make sure the samplers work on data which fits their assumptions perfectly:

    python generator.py <nTopic1> <nTopic2> > <fake-data>

writes to <fake-data> nTopic1 paragraphs drawn from one (fixed) topic, and nTopic2 paragraphs drawn from another.

Run each sampler as follows:

    python <sampler> <input> [niters=20] [nclasses=2]

The input is a single text file with one "document" per line. The samplers produce output like

    # iteration logProb pi
    0 -19794.929792 0.468 0 0 0 0 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0
    1 -19535.626706 0.597 1 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    2 -19474.710527 0.548 1 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    3 -19388.812707 0.629 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    4 -19377.120791 0.661 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The first column of each line is the iteration, the second the current log-probability of the state, the third (somewhat arbitrarily) the probability of label 0, followed by the current labeling of the input paragraphs.
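The collapsed scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's samplerCollapsed.py: the function name and the hyperparameters alpha (label pseudo-count) and beta (word pseudo-count) are assumptions. Each sweep removes one document from the count tables, scores it against the Dirichlet-multinomial predictive of each class, and resamples its label.

```python
import math
import random

def collapsed_gibbs(docs, nclasses=2, niters=20, alpha=1.0, beta=1.0, seed=0):
    """docs: list of token lists. Returns the final label assignment."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    labels = [rng.randrange(nclasses) for _ in docs]
    # Sufficient statistics: documents per class, word counts per class.
    class_docs = [0] * nclasses
    class_words = [{} for _ in range(nclasses)]
    class_total = [0] * nclasses

    def update(i, c, sign):
        class_docs[c] += sign
        class_total[c] += sign * len(docs[i])
        for w in docs[i]:
            class_words[c][w] = class_words[c].get(w, 0) + sign

    for i, c in enumerate(labels):
        update(i, c, +1)

    for _ in range(niters):
        for i, doc in enumerate(docs):
            update(i, labels[i], -1)               # hold document i out
            logp = []
            for c in range(nclasses):
                # Collapsed label prior: proportional to (docs in c + alpha).
                lp = math.log(class_docs[c] + alpha)
                # Collapsed likelihood: Dirichlet-multinomial predictive of doc.
                seen, n = {}, 0
                for w in doc:
                    k = seen.get(w, 0)
                    lp += math.log(class_words[c].get(w, 0) + k + beta)
                    lp -= math.log(class_total[c] + n + V * beta)
                    seen[w], n = k + 1, n + 1
                logp.append(lp)
            # Sample a class from the normalized probabilities.
            m = max(logp)
            probs = [math.exp(x - m) for x in logp]
            r = rng.random() * sum(probs)
            acc, c = 0.0, nclasses - 1
            for j, p in enumerate(probs):
                acc += p
                if r < acc:
                    c = j
                    break
            labels[i] = c
            update(i, c, +1)
    return labels
```

Because both latent distributions are integrated out, the only state the sampler carries between sweeps is the label vector and its count tables, which is what makes the collapsed chain mix so much better than the uncollapsed variants.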
To demonstrate the importance of collapsing, compare the performance of the three samplers on the toy data ari_hes_ari_20_20_20.txt, made up of 20 paragraphs each from Aristotle's Categories, Hesse's Siddhartha, and Aristotle's Poetics, in that order, as per Project Gutenberg (www.gutenberg.org). All but the collapsed sampler fail, whereas the collapsed sampler even properly "brackets" the Hesse part of the input when run with 2 classes. On toy data generated by generator.py, on the other hand, all samplers converge reasonably quickly. Try 4_10.txt, or generate your own fake input using generator.py.
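For reference, a toy generator in the spirit of generator.py can be sketched as follows. This is not the repository's code: the vocabulary, topic weights, paragraph length, and function name are all illustrative assumptions. Each paragraph is a bag of words drawn from one of two fixed unigram distributions, so the data fits the Naive Bayes assumptions exactly.

```python
import random

def generate(n_topic1, n_topic2, length=30, seed=0):
    """Return n_topic1 + n_topic2 paragraphs, one per string."""
    rng = random.Random(seed)
    # Two fixed unigram distributions with disjoint (illustrative) vocabularies.
    topic1 = {"alpha": 0.5, "beta": 0.3, "gamma": 0.2}
    topic2 = {"delta": 0.6, "epsilon": 0.2, "zeta": 0.2}

    def draw(dist):
        words, weights = zip(*dist.items())
        return " ".join(rng.choices(words, weights=weights, k=length))

    return [draw(topic1) for _ in range(n_topic1)] + \
           [draw(topic2) for _ in range(n_topic2)]
```

Printing the returned list one paragraph per line reproduces the one-"document"-per-line input format the samplers expect.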
About
Naive-Bayes-Gibbs Sampler following the Resnik / Hardisty tutorial