simplistic calculation of the ratio of dictionary words to all words in a METS Alto OCR file
.gitignore
README
alto_words.py
example.xml
make_dictionary.py

README

This is a simplistic demonstration of how you can calculate the 
ratio of dictionary words to all words in a METS Alto OCR XML file.

The latest dump of the English Wiktionary is used as the dictionary because
it's available and somewhat sizable: ~2 million words.
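
As a rough illustration of how a word list could be pulled out of the
page.sql.gz dump named in the steps below, here is a minimal Python 3 sketch.
It is not necessarily what make_dictionary.py does. It assumes each row of the
dump starts with (page_id,page_namespace,'page_title',..., that namespace 0
holds the dictionary entries, and that lowercasing the titles is acceptable.
The names titles_from_dump and load_dictionary are made up for this example.

```python
import gzip
import re

# Assumed row shape inside an INSERT statement:
#   (page_id,page_namespace,'page_title',...
# The title group allows backslash-escaped characters such as \'.
ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")


def titles_from_dump(lines):
    """Yield lowercased main-namespace page titles from page.sql lines."""
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for _page_id, namespace, title in ROW.findall(line):
            if namespace != "0":
                continue  # assumption: only namespace 0 holds entries
            # Undo SQL backslash escaping, then turn the underscores
            # MediaWiki stores in titles back into spaces.
            title = re.sub(r"\\(.)", r"\1", title)
            yield title.replace("_", " ").lower()


def load_dictionary(path):
    """Read a gzipped page.sql dump and return its titles as a set."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        return set(titles_from_dump(dump))
```

A regular expression is a shortcut here, not a real SQL parser: a title that
itself contains text shaped like a row start would confuse it, which is
tolerable for a simplistic demonstration.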

0. install Python
1. install lxml
2. wget http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-page.sql.gz
3. run make_dictionary.py
4. make a pot of tea while you are waiting for this to finish :)
5. run alto_words.py example.xml
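
To make the idea behind the last step concrete, here is a minimal Python 3
sketch of the ratio calculation. It is an illustration, not the code in
alto_words.py. It assumes that words live in the CONTENT attribute of ALTO
String elements, that the dictionary is a set of lowercase words, and that
stripping surrounding punctuation is good enough as normalisation. It uses the
standard library's xml.etree.ElementTree instead of lxml so it runs without
extra installs. The names alto_words and dictionary_ratio are made up for this
example.

```python
import xml.etree.ElementTree as ET


def alto_words(xml_text):
    """Yield the CONTENT attribute of every ALTO String element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop any "{namespace}" prefix so the ALTO version does not matter.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "String" and elem.get("CONTENT"):
            yield elem.get("CONTENT")


def dictionary_ratio(xml_text, dictionary):
    """Return the fraction of OCR words that appear in the dictionary."""
    total = 0
    known = 0
    for word in alto_words(xml_text):
        # Assumed normalisation: lowercase, strip surrounding punctuation.
        token = word.lower().strip(".,;:!?\"'()[]")
        if not token:
            continue  # the string was punctuation only
        total += 1
        if token in dictionary:
            known += 1
    return known / total if total else 0.0


# Tiny made-up ALTO fragment with one OCR error ("qwick").
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="The"/><SP/>
    <String CONTENT="qwick"/><SP/>
    <String CONTENT="brown"/><SP/>
    <String CONTENT="fox."/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

WORDS = {"the", "quick", "brown", "fox"}

print(dictionary_ratio(SAMPLE, WORDS))  # 0.75
```

A low ratio hints at poor OCR quality, but proper nouns and hyphenated line
breaks will also count as misses with this simple approach.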