libbow text classifier framework with patches that make it compile (for me) on OS X Snow Leopard
Latest commit af6cbf0 Sep 12, 2009 @fizx compiles :)

README

OS X Port by Kyle Maxwell

The following sites were useful:
- http://fugutabetai.com/?postid=170
- http://dev.gentoo.org/~vanquirius/gcc4-porting-guide.html

Bag Of Words Library README
***************************

`libbow', version 1.0.

   Documentation and updates for `libbow' are available at
http://www.cs.cmu.edu/~mccallum/bow

   Rainbow is a C program that performs document classification using
one of several different methods, including naive Bayes, TFIDF/Rocchio,
K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's
Probabilistic Indexing, and a simple-minded form of shrinkage with
naive Bayes.

   Rainbow's accompanying library, `libbow', is a library of C code
intended for support of statistical text-processing programs.  The
current source distribution includes the library, a text classification
front-end (rainbow), a simple TFIDF-based document retrieval front-end
(arrow), an AltaVista-style document retrieval front-end (archer), and an
unsupported document clustering front-end with hierarchical clustering
and deterministic annealing (crossbow).

The library provides facilities for:
 *  Recursively descending directories, finding text files.
 *  Finding `document' boundaries when there are multiple docs per file.
 *  Tokenizing a text file, according to several different methods.
 *  Including N-grams among the tokens.
 *  Mapping strings to integers and back again, very efficiently.
 *  Building a sparse matrix of document/token counts.
 *  Pruning vocabulary by occurrence counts or by information gain.
 *  Building and manipulating word vectors.
 *  Setting word vector weights according to naive Bayes, TFIDF, and a
     simple form of Probabilistic Indexing.
 *  Scoring queries for retrieval or classification.
 *  Writing all data structures to disk in a compact format.
 *  Reading the document/token matrix from disk in an efficient,
     sparse fashion.
 *  Performing test/train splits, and automatic classification tests.
 *  Operating in server mode, receiving and answering queries over a
     socket.

   It is known to compile on most UNIX systems, including Linux,
Solaris, SunOS, IRIX and HP-UX.  Six months ago, it compiled on
Windows NT (with a GNU build environment); it would probably work again
with little effort.  Patches to the code are most welcome.

   It is relatively efficient.  Reading, tokenizing and indexing the raw
text of 20,000 UseNet articles takes about 3 minutes.  Building a naive
Bayes classifier from 10,000 articles, and classifying the other 10,000
takes about 1 minute.

   The code conforms to the GNU coding standards.  It is released under
the GNU Library General Public License (LGPL).

The library does not:
        Have parsing facilities.
        Do smoothing across N-gram models.
        Claim to be finished.
        Have good documentation.
        Claim to be bug-free.
        ...many other things.

Rainbow
=======

   `Rainbow' is a standalone program that does document classification.
Here are some examples:

   *      rainbow -i ./training/positive ./training/negative

     Using the text files found under the directories
     `./training/positive' and `./training/negative', tokenize, build
     word vectors, and write the resulting data structures to disk.

   *      rainbow --query=./testing/254

     Tokenize the text document `./testing/254', and classify it,
     producing output like:

          /home/mccallum/training/positive 0.72
          /home/mccallum/training/negative 0.28

   *      rainbow --test-set=0.5 -t 5

     Perform 5 trials.  Each trial makes a new random test/train split,
     holding out half of the documents for testing, and outputs the
     classification of each test document.


   Typing `rainbow --help' will give a list of all rainbow options.

   After you have compiled `libbow' and `rainbow', you can run the
shell script `./demo/script' to see an annotated demonstration of the
classifier in action.

   More information and documentation are available at
http://www.cs.cmu.edu/~mccallum/bow

Rainbow improvements coming eventually:
   Better documentation.
   Incremental model training.

Arrow
=====

   `Arrow' is a standalone program that does document retrieval by
TFIDF.

   Index all the documents in directory `foo' by typing

     arrow --index foo

   Make a single query by typing

     arrow --query

   then typing your query, and pressing Control-D.

   If you want to make many queries, it will be more efficient to run
arrow as a server, and query it multiple times without restarts by
communicating through a socket.  Type, for example,

     arrow --query-server=9876

   And access it through port number 9876.  For example:

     telnet localhost 9876

   In this mode there is no need to press Control-D to end a query.
Simply type your query on one line, and press return.

Crossbow
========

   `Crossbow' is a standalone program that does document clustering.
Sorry, there is no documentation yet.

Archer
======

   `Archer' is a standalone program that does document retrieval with
AltaVista-type queries, using +, -, "", etc.  The commands in the
`arrow' examples above also work for archer.  See `archer --help' for
more information.
