simplistic calculation of the ratio of dictionary words to all words in a METS Alto OCR file

README

This is a simplistic demonstration of how you can calculate the 
ratio of dictionary words to all words in a METS Alto OCR XML file.

The latest dump of the English Wiktionary is used because it's available
and somewhat sizable: ~2 million words.

0. install python
1. install lxml
2. wget http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-page.sql.gz
3. make_dictionary.py
4. make a pot of tea while you are waiting for this to finish :)
5. alto_words.py example.xml
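
For readers who want to see the idea without running the scripts, here is a
minimal sketch of the ratio calculation. It is not the code in alto_words.py.
It assumes that words live in the CONTENT attribute of ALTO String elements,
that the dictionary is already loaded as a set of lowercase words, and that
stripping surrounding punctuation is an acceptable normalization. The function
names and the sample document are made up for illustration, and the standard
library's xml.etree.ElementTree stands in for lxml so the example has no
dependencies.

```python
import xml.etree.ElementTree as ET


def alto_words(alto_xml):
    """Yield the CONTENT attribute of every ALTO String element."""
    root = ET.fromstring(alto_xml)
    for element in root.iter():
        # Drop the namespace prefix so any ALTO schema version matches.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "String":
            content = element.get("CONTENT")
            if content:
                yield content


def dictionary_ratio(alto_xml, dictionary):
    """Return the fraction of OCR words found in the dictionary set."""
    # Assumed normalization: strip surrounding punctuation and lowercase.
    words = [w.strip(".,;:!?\"'()[]").lower() for w in alto_words(alto_xml)]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)


# Hypothetical ALTO fragment with one OCR error ("brwn").
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="The"/><SP/><String CONTENT="quick"/><SP/>
    <String CONTENT="brwn"/><SP/><String CONTENT="fox."/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(dictionary_ratio(SAMPLE, {"the", "quick", "brown", "fox"}))
```

A low ratio suggests poor OCR quality, though proper nouns and words missing
from the dictionary will also pull it down.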