This is a simple demonstration of how to calculate the ratio of dictionary words to all words in a METS/ALTO OCR XML file. The latest dump of the English Wiktionary is used as the dictionary because it's freely available and reasonably large: ~2 million words.

0. install python
1. install lxml
2. wget http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-page.sql.gz
3. run make_dictionary.py
4. make a pot of tea while you are waiting for this to finish :)
5. run alto_words.py example.xml
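
The ratio calculated in the last step can be sketched roughly as below. This is not the actual alto_words.py, only a minimal illustration under some assumptions: the OCR words are taken from the CONTENT attribute of ALTO String elements, the dictionary is a small in-memory set standing in for whatever make_dictionary.py produces, and the standard library's xml.etree.ElementTree is used instead of lxml so the example runs without extra installs. The function names alto_words and dictionary_ratio are hypothetical.

```python
import xml.etree.ElementTree as ET


def alto_words(xml_text):
    """Yield the CONTENT attribute of every ALTO <String> element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Strip any namespace prefix such as "{http://...}String".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "String":
            content = elem.get("CONTENT")
            if content:
                yield content


def dictionary_ratio(xml_text, dictionary):
    """Return the fraction of OCR words found in `dictionary`.

    `dictionary` is assumed to be a set of lowercase words.
    """
    total = 0
    found = 0
    for word in alto_words(xml_text):
        # Normalize: lowercase and trim surrounding punctuation.
        normalized = word.lower().strip(".,;:!?\"'()[]")
        if not normalized:
            continue
        total += 1
        if normalized in dictionary:
            found += 1
    return found / total if total else 0.0


# A tiny made-up ALTO fragment; "brwn" simulates an OCR error.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="The"/><SP/>
    <String CONTENT="quick"/><SP/>
    <String CONTENT="brwn"/><SP/>
    <String CONTENT="fox."/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    words = {"the", "quick", "brown", "fox"}
    print(dictionary_ratio(SAMPLE, words))  # prints 0.75
```

Matching on the local tag name rather than a fixed namespace is a deliberate shortcut here, since ALTO files in the wild use several namespace versions. A low ratio suggests poor OCR quality, although proper nouns and archaic spellings will also lower it.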