simplistic calculation of the ratio of dictionary words to all words in a METS Alto OCR file

README
This is a simplistic demonstration of how you can calculate the 
ratio of dictionary words to all words in a METS Alto OCR XML file.

The latest dump of the English Wiktionary is used because it's available
and somewhat sizable: ~2 million words.

0. install Python
1. install lxml
2. wget http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-page.sql.gz
3. make_dictionary.py
4. make a pot of tea while you are waiting for this to finish :)
5. alto_words.py example.xml
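
For readers who want to see the idea behind step 5 before running anything,
here is a minimal sketch of the ratio calculation. It is not the code in
alto_words.py. It assumes the words live in the CONTENT attribute of ALTO
String elements, uses the standard library's xml.etree.ElementTree instead of
lxml so it runs without extra installs, and takes the dictionary as a plain
Python set. The function names and the sample XML are made up for
illustration.

```python
import re
import xml.etree.ElementTree as ET


def alto_words(xml_text):
    """Yield the lowercased text of every ALTO String element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Compare local names only: the ALTO namespace URI varies by version.
        if elem.tag.rsplit("}", 1)[-1] == "String":
            content = elem.get("CONTENT", "")
            # Strip leading/trailing punctuation so "fox." matches "fox".
            word = re.sub(r"^\W+|\W+$", "", content).lower()
            if word:
                yield word


def dictionary_ratio(xml_text, dictionary):
    """Return the fraction of words in the file that are in `dictionary`."""
    words = list(alto_words(xml_text))
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)


# Hypothetical sample, not the repository's example.xml.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="The"/><SP/><String CONTENT="quick"/><SP/>
    <String CONTENT="brwn"/><SP/><String CONTENT="fox."/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    # Three of the four words are in the toy dictionary.
    print(dictionary_ratio(SAMPLE, {"the", "quick", "brown", "fox"}))
```

A low ratio hints at poor OCR quality, though proper nouns and words missing
from the dictionary will also pull it down.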