<?xml version="1.0" encoding="UTF-8"?>
<div class="templates:surround?with=templates/page-margins.html&at=container">
<p>
"Greek, sir, is like lace; every man gets as much of it as he can." (<a href="https://archive.org/stream/lifeofsamueljoh04bosw#page/155/mode/1up/search/greek">Samuel Johnson</a>)
</p>
<h3>Overview</h3>
<p>This site catalogues the results of our ongoing campaign to produce high-quality
OCR of polytonic, or 'ancient', Greek texts in an HPC environment. It comprises <span class="app:countCatalog"/>
<a href="catalog.html">volumes</a>, principally from archive.org, but also from original scans and other resources. There are over 12 million <a href="http://heml.mta.ca:8080/exist/apps/lace2/side_by_side_view.html?documentId=624438295&runId=2018-07-22-17-50&classifier=oxford_lunate-00044700.pyrnn.gz&fileNum=25">pages</a> of OCR output in total, including experimental and rejected results.</p>
<div class="app:renderMarkdown">
Results are presented in a hierarchical organization, beginning with the volume identifier. Each identifier is associated with one or more 'runs', or attempts at OCRing that volume. A run has a date stamp and is associated with a classifier and an aggregate best b-score (roughly indicating the quality of the Greek output). Each run produces various kinds of output. The most important of these are:
1. `raw hocr output:` the data generated by our OCR process, usually with multiple copies for each page, rendered at a range of binarization thresholds
2. `selected hocr output:` a filtered version of the data in (1), with each page image represented by a single, best, output page. Output based on an older process also provides the following steps:
3. `blended hocr output:` the data in (2), but with any word that is not a dictionary word replaced by the corresponding word from the `raw` output in (1), should one of the `raw` pages offer a dictionary word in its place.
4. `selected hocr output spellchecked:` the data in (3) processed through a weighted Levenshtein distance spellchecking algorithm that is meant to correct simple OCR errors
5. `combined hocr output:` where archive.org provides OCR output for Latin script (not Greek), this final step pieces together the data in (4) with archive.org's output, preferring archive.org's output where our output suggests that the data is Latin. If archive.org provides Greek output, this step is no different from (4).
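The spellchecking idea in step (4) can be illustrated with a minimal sketch. This is not the Rigaudon or Ciaconna spellcheck engine: the function names, the confusion pairs, their costs and the `max_distance` threshold below are all hypothetical, chosen only to show how a weighted Levenshtein distance lets visually similar characters be swapped more cheaply than unrelated ones.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical substitution costs for characters an OCR engine might confuse.
# These pairs and values are assumptions for illustration only.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("ν", "υ"): 0.3,
    ("ο", "σ"): 0.4,
    ("σ", "ς"): 0.2,
}


def substitution_cost(a: str, b: str) -> float:
    """Return 0 for identical characters, a reduced cost for known
    confusion pairs (in either order), and 1 otherwise."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))


def weighted_levenshtein(source: str, target: str) -> float:
    """Edit distance with unit insert/delete costs and weighted substitutions."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,                                    # delete s_char
                current[j - 1] + 1.0,                                 # insert t_char
                previous[j - 1] + substitution_cost(s_char, t_char),  # substitute
            ))
        previous = current
    return previous[-1]


def correct_word(word: str, dictionary: Iterable[str], max_distance: float = 0.5) -> str:
    """Replace word with its nearest dictionary entry if that entry is
    within max_distance; otherwise return the word unchanged."""
    candidates = list(dictionary)
    if word in candidates:
        return word
    best = min(candidates, key=lambda c: weighted_levenshtein(word, c), default=None)
    if best is not None and weighted_levenshtein(word, best) <= max_distance:
        return best
    return word
```

With these assumed costs, a word that differs from a dictionary entry only by a confusable character is corrected, while a word with no near neighbour is left as it is. A real implementation would also need to settle on a Unicode normalization form for polytonic characters before comparing them.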
###Code
These data were generated with two different OCR processes. All results since 2014 have employed the [Ciaconna Greek OCR](https://github.com/brobertson/ciaconna) process. This is based on the Ocropus open source engine, with custom classifiers, image preprocessing and spell-check routines written in Python. Ciaconna's high-level scripts are integrated with Compute Canada's [Sharcnet](https://www.sharcnet.ca/my/front/) scheduling software, since that facility's resources were used to generate these results.
The earlier process, used from 2012 to 2014, is named 'Rigaudon' and is based on the Gamera image processing
library. All code and classifiers for Rigaudon are posted in a
[GitHub repository](https://github.com/brobertson/rigaudon). This holds
the modified Gamera source code, ancillary Python scripts such as the
spellcheck engine, and the Bash scripts that coordinate the process in an HPC environment through Sun Grid Engine.
Details of Rigaudon's operation are outlined in a [white paper](https://docs.google.com/document/d/1iYfqflLybd3f9bBfTBk8aY_FTq05kCnq4tiKwYpqtrM/edit?usp=sharing).
Our July 2013 presentation at the London Digital Classicist seminar series is [available online](http://www.digitalclassicist.org/wip/wip2013-07br.mp4) from the Institute of Classical Studies.
###Web Editing Software
The Lace editing and visualizing software you are now using is available as a package for eXist-db in a [GitHub repository](https://github.com/brobertson/Lace2). A previous version of Lace, which used Python Flask, is also [archived](https://github.com/brobertson/Lace) on GitHub.
###Context
This is a continuation of efforts begun through the Digging Into Data Round I project [Toward Dynamic Variorum Editions](http://sites.tufts.edu/dynamicvariorum/), in which -- as the project [white paper](https://securegrants.neh.gov/PublicQuery/main.aspx?f=1&gn=HJ-50013-10) notes -- we discovered both the tantalizing potential of Greek OCR and the poor results that OCR engines at that time produced when operating at scale.
In order to bootstrap that process, we adapted the most extensible and successful of the frameworks to that date, the Gamera Greek [OCR engine](http://gamera.informatik.hsnr.de/addons/greekocr4gamera/) by Dalitz and Brandt. Using the AceNET HPC environment we analyzed a sample of the Google Greek and Latin corpus with twenty classifiers composed by Canadian undergraduate students. From this, we produced a quantitative [report](http://www.perseus.tufts.edu/publications/dve/RobertsonGreekOCR/) on the efficacy of our modified OCR code.
On the basis of this work, we received a 2012/2013 Humanities Computing Grant from Compute Canada, making this large-scale processing possible.
###Support
This work has benefited from the support of:
+ NEH, JISC, SSHRC, through Digging into Data I
+ Compute Canada, which provided the use of a dedicated machine and continues to provide computing resources
+ The [New Brunswick Innovation Foundation](https://www.nbif.ca) (Canada)
+ ILC-CNR, Pisa, which facilitated meetings
+ Greg Crane, whose supportiveness is as unbounded as his enthusiasm</div>
</div>