Inspired by Madnani (2009), this project tries to implement a queryable server for an English language model of variable n-gram order.
The most complicated aspect of the installation will be compiling SRILM. Once you have that toolkit downloaded and added to your $PATH, run these commands:
1. pip install --user -r requirements.txt to install Python dependencies
2. ./bootstrap.sh to build LM and load database
3. python manage.py runserver [::]:8000 to run the API server for ngram queries
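Since the steps above depend on SRILM being reachable from your $PATH, a quick check before running them can save a failed build. The sketch below is not part of this repository. It assumes that the SRILM binaries `ngram-count` and `ngram` are the ones needed; which tools bootstrap.sh actually calls is not confirmed here.

```python
import shutil

# Binaries shipped with SRILM. Which of them bootstrap.sh actually
# invokes is an assumption made for this sketch.
SRILM_TOOLS = ("ngram-count", "ngram")


def missing_tools(tools=SRILM_TOOLS):
    """Return the names in `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on $PATH:", ", ".join(missing))
    else:
        print("SRILM tools found.")
```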
Step 2 above can take a few minutes, depending on hardware. A sample run, using a batch size of 1000 per database commit:
$ time ./bootstrap.sh
Creating tables ...
Installing custom SQL ...
Installing indexes ...
Installed 0 object(s) from 0 fixture(s)
Generating countfile from corpus 8/8 ('combined')...
Building language model (/home/conor/gits/language-model-server/corpus/nltk-combined.lm)... done.
Number of 1-grams committed to database: 223000
Number of 2-grams committed to database: 1795000
Number of 3-grams committed to database: 445000
Number of 4-grams committed to database: 287000
Number of 5-grams committed to database: 176000
Finished loading database.
real 2m36.424s
user 2m31.558s
sys 0m3.919s
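The run above reads the built language model and commits n-grams to the database in batches. A minimal sketch of that idea follows, under several assumptions: the .lm file is in the ARPA text format that SRILM writes by default, `sqlite3` stands in for the project's actual database layer, and the `ngram` table, its columns, and the function names `read_arpa` and `load` are hypothetical. The loader in this repository may differ.

```python
import re
import sqlite3

BATCH_SIZE = 1000  # n-grams per commit, as in the sample run


def read_arpa(lines):
    """Yield (order, ngram, logprob, backoff) from ARPA-format lines."""
    order = 0  # 0 until the first "\N-grams:" section header is seen
    for line in lines:
        line = line.strip()
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))
        elif line == "\\end\\":
            break
        elif order and line:
            fields = line.split()
            logprob = float(fields[0])
            ngram = " ".join(fields[1:1 + order])
            # The backoff weight is optional (absent for the highest order).
            backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
            yield order, ngram, logprob, backoff


def load(conn, entries, batch_size=BATCH_SIZE):
    """Insert entries, committing once per batch; return the row count."""
    # Hypothetical schema, used only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngram "
        "(n INTEGER, text TEXT, logprob REAL, backoff REAL)"
    )
    batch, total = [], 0
    for entry in entries:
        batch.append(entry)
        if len(batch) == batch_size:
            conn.executemany("INSERT INTO ngram VALUES (?, ?, ?, ?)", batch)
            conn.commit()
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO ngram VALUES (?, ?, ?, ?)", batch)
        conn.commit()
        total += len(batch)
    return total
```

Committing once per batch rather than once per row keeps the number of transactions small, which is usually the main cost when loading a few million rows.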
- write tests (damn it)
- expand API with params for querying
- investigate open source LMs
  - KenLM for ngramModel?
- investigate factored language model setup
  - SRILM has a built-in for this
- upgrade SRILM to newest version
- consider breaking out custom lm as python lib