The source behind nutrimatic.org.
openfst-1.5.0 Upgrade openfst to 1.5.0 Dec 8, 2015
.gitignore add slack module, and gitignore Jan 9, 2017
COPYING initial import into new repository Mar 19, 2009
README add changes designed for os x compatibility Apr 16, 2017
build.py add changes designed for os x compatibility Apr 16, 2017
cgi-search.py Strip whitespace from query input Dec 31, 2016
cgi-slack.py add slack module, and gitignore Jan 9, 2017
dump-index.cpp add changes designed for os x compatibility Apr 16, 2017
explore-index.cpp add changes designed for os x compatibility Apr 16, 2017
expr-anagram.cpp Fix a bunch of format strings Dec 8, 2015
expr-filter.cpp initial import into new repository Mar 19, 2009
expr-intersect.cpp initial import into new repository Mar 19, 2009
expr-optimize.cpp initial import into new repository Mar 19, 2009
expr-parse.cpp - update README Mar 25, 2009
expr.h initial import into new repository Mar 19, 2009
find-anagrams.cpp initial import into new repository Mar 19, 2009
find-expr.cpp Simplify treatment of spaces Dec 31, 2016
find-phone-words.cpp Sundry fixes to problems reported by lahosken (thanks!) Nov 25, 2010
index-reader.cpp Mobile friendly HTML; also a different protocol for calling cgi-searc… Dec 24, 2016
index-walker.cpp Switch to 64-bit counts in each trie node. Dec 8, 2015
index-writer.cpp add changes designed for os x compatibility Apr 16, 2017
index.h Switch to 64-bit counts in each trie node. Dec 8, 2015
make-index.cpp Update README and make-index to support WikiExtractor.py Dec 25, 2016
memoize.py add changes designed for os x compatibility Apr 16, 2017
merge-indexes.cpp Switch to 64-bit counts in each trie node. Dec 8, 2015
remove-markup.cpp Fix a bunch of format strings Dec 8, 2015
search-driver.cpp - update README Mar 25, 2009
search-printer.cpp Fixes to pagination, resource accounting, and formatting Apr 27, 2010
search.h initial import into new repository Mar 19, 2009
test-expr.cpp Fixes to pagination, resource accounting, and formatting Apr 27, 2010

README

This is Nutrimatic (http://nutrimatic.org/usage.html).

To build the source, run "./build.py".  You will need the following installed:
   * Python
   * g++
   * libxml2 (ubuntu: apt-get install libxml2-dev; osx: pip install lxml)
   * libtre (ubuntu: apt-get install libtre-dev; osx: brew install tre)
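
Before running build.py, it can help to confirm that the command-line tools
above are on your PATH. The following is a minimal sketch, not part of
Nutrimatic: the command names "python" and "g++" and the helper missing_tools
are assumptions for illustration (your system may use "python3" or another
compiler name), and it does not check for the libxml2 or libtre libraries.

```python
import shutil

# Command names assumed to match the prerequisites listed above;
# adjust them for your platform (for example "python3").
REQUIRED_TOOLS = ["python", "g++"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```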

To do anything useful, you will need to build an index from Wikipedia.

1. Download the latest Wikipedia database dump (this is a ~13GB file!):

     wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

2. Extract the text from the articles using Wikipedia Extractor
   (this generates ~12GB of text, and can take several hours!):

     # See http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
     wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py
     python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2

   This will write many files named text/??/wiki_??.

3. Index the text (this generates ~50GB of data, and can also take hours!):

     find text -type f | xargs cat | bin/make-index wikipedia

   This will write many files named wikipedia.?????.index.
   (You can break this up; run make-index several times with different
   sets of input files, replacing "wikipedia" with unique names each time.)

4. Merge the indexes; I normally do this in two stages:

     for x in 0 1 2 3 4 5 6 7 8 9
     do bin/merge-indexes 2 wikipedia.????$x.index wiki-merged.$x.index
     done

     bin/merge-indexes 5 wiki-merged.*.index wiki-merged.index

   There's nothing magical about this approach with 10 batches; you can merge
   the files any way you like. The 2 and 5 are minimum phrase frequency
   cutoffs (how many times a string must occur to be included).

5. Enjoy your new index:

     bin/find-expr wiki-merged.index '<aciimnrttu>'
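
The two-stage merge in step 4 can also be planned from a script. The sketch
below is an illustration only, not part of Nutrimatic: the function
plan_two_stage_merge is hypothetical, and it assumes the index files are named
wikipedia.?????.index with a digit as the last character of the middle field,
as the shell loop in step 4 implies. It only builds the merge-indexes command
lines (cutoff, then inputs, then output, as in step 4); it does not run them.

```python
import glob


def plan_two_stage_merge(index_files, first_cutoff=2, final_cutoff=5,
                         merge_bin="bin/merge-indexes"):
    """Return merge-indexes command lines for a two-stage merge.

    Files are grouped by the last character of their sequence field,
    mirroring the wikipedia.????$x.index glob used in step 4.
    """
    if not index_files:
        return []

    batches = {}
    for name in sorted(index_files):
        # "wikipedia.00012.index" -> sequence "00012" -> batch "2"
        sequence = name.split(".")[-2]
        batches.setdefault(sequence[-1], []).append(name)

    commands = []
    merged = []
    # Stage 1: one merge per batch, using the lower cutoff.
    for digit in sorted(batches):
        output = "wiki-merged.%s.index" % digit
        commands.append(
            [merge_bin, str(first_cutoff)] + batches[digit] + [output])
        merged.append(output)
    # Stage 2: merge the per-batch results, using the higher cutoff.
    commands.append(
        [merge_bin, str(final_cutoff)] + merged + ["wiki-merged.index"])
    return commands


if __name__ == "__main__":
    for command in plan_two_stage_merge(glob.glob("wikipedia.?????.index")):
        print(" ".join(command))
```

Unlike the shell loop, this sketch skips batches that have no input files.
Each printed line can be run as-is, or passed to subprocess if you prefer to
drive the merge from Python.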

If you want to set up the web interface, write a short shell wrapper that runs
cgi-search.py with environment variables pointing it at your binaries and data
files, e.g.:

     #!/bin/sh

     export NUTRIMATIC_FIND_EXPR=/path/to/nutrimatic/bin/find-expr
     export NUTRIMATIC_INDEX=/path/to/nutrimatic/data/wiki-merged.index
     exec /path/to/nutrimatic/cgi-search.py

Then arrange for your web server to invoke that shell wrapper as a CGI script.
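
If a shell wrapper is inconvenient, the same idea could be written in Python.
This is a rough sketch under the same assumptions as the shell example: the
/path/to/nutrimatic paths are placeholders, the helper nutrimatic_env is
hypothetical, and the only things taken from the wrapper above are the two
environment variable names and the final exec of cgi-search.py.

```python
#!/usr/bin/env python
import os
import sys

# Placeholder locations, as in the shell wrapper; replace with real paths.
ROOT = "/path/to/nutrimatic"
CGI_SCRIPT = ROOT + "/cgi-search.py"


def nutrimatic_env(find_expr, index, base_env=None):
    """Return a copy of the environment with both variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NUTRIMATIC_FIND_EXPR"] = find_expr
    env["NUTRIMATIC_INDEX"] = index
    return env


if __name__ == "__main__":
    env = nutrimatic_env(ROOT + "/bin/find-expr",
                         ROOT + "/data/wiki-merged.index")
    if os.path.exists(CGI_SCRIPT):
        # Replace this process with cgi-search.py, like "exec" in the shell.
        os.execve(CGI_SCRIPT, [CGI_SCRIPT], env)
    else:
        sys.stderr.write("cgi-search.py not found at %s\n" % CGI_SCRIPT)
```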

Have fun,

-- egnor@ofb.net