OCR'd texts present special challenges to tokenization.  Consider this selection from an OCR'd version of Darwin's Origin of Species from the [Internet Archive](https://archive.org/download/originofspecies00darwuoft/originofspecies00darwuoft_djvu.txt):

```
the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may
```

Here the printing convention of line-break hyphenation would, under a standard tokenizer, generate incorrect tokens like `interest-ing` (or perhaps `interest-` and `ing`).  Design a better tokenizer (even just using pre- and post-processing) for these texts.  Note that the correct tokenization of `interest-ing` is `interesting`, while the correct tokenization of `newly-formed` is still `newly-formed`.

For a more thorough library for handling OCR'd book data, see [tedunderwood/DataMunging](https://github.com/tedunderwood/DataMunging).
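
One possible first pre-processing step is to isolate only the hyphens that fall at the end of a line, since a hyphen in the middle of a line was presumably intended by the author. The sketch below is an illustration rather than a complete solution: the name `find_line_break_hyphens` is hypothetical, and the sample string is a shortened excerpt with a made-up mid-line compound (`well-known`) added for contrast.

```python
import re

# A word fragment, a hyphen at the end of a line (optionally followed by
# stray spaces or tabs), and the fragment that opens the next line.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-[ \t]*\n(\w+)")

def find_line_break_hyphens(text):
    """Return (left, right) fragment pairs that are split across a line break."""
    return LINE_BREAK_HYPHEN.findall(text)

# Shortened excerpt; "well-known" is invented here to show a mid-line hyphen.
sample = (
    "published an interest-\n"
    "ing essay on crosses between newly-\n"
    "formed varieties of well-known kinds"
)
pairs = find_line_break_hyphens(sample)
print(pairs)
```

Only the two line-final hyphens are reported as candidates; the mid-line `well-known` is left alone. Whether each candidate should be joined or keep its hyphen is a separate decision.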


In [1]:
import sys, nltk, re

In [2]:
def read_text(filename):
    lines=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            lines.append(line.rstrip())
    return lines        

In [3]:
filename="../data/darwin_origin_ia.txt"

In [4]:
lines=read_text(filename)

In [5]:
testText="""the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may"""

In [46]:
no_line_breaks = testText.replace("\n", " ").replace("- ", "-").lower().split(" ")
no_line_breaks

# Issue: need to differentiate between newly-formed and other words like im-portance

['the',
 'inhabitants',
 'of',
 'the',
 'surrounding',
 'districts',
 'will,',
 'also,',
 'be',
 'thus',
 'prevented.',
 'moritz',
 'wagner',
 'has',
 'lately',
 'published',
 'an',
 'interest-ing',
 'essay',
 'on',
 'this',
 'subject,',
 'and',
 'has',
 'shown',
 'that',
 'the',
 'service',
 'rendered',
 'by',
 'isolation',
 'in',
 'preventing',
 'crosses',
 'between',
 'newly-formed',
 'varieties',
 'is',
 'probably',
 'greater',
 'even',
 'than',
 'i',
 'supposed.',
 'but',
 'from',
 'reasons',
 'already',
 'assigned',
 'i',
 'can',
 'by',
 'no',
 'means',
 'agree',
 'with',
 'this',
 'naturalist,',
 'that',
 'migration',
 'and',
 'isolation',
 'are',
 'neces-sary',
 'elements',
 'for',
 'the',
 'formation',
 'of',
 'new',
 'species.',
 'the',
 'im-portance',
 'of',
 'isolation',
 'is',
 'likewise',
 'great',
 'in',
 'preventing,',
 'after',
 'any',
 'physical',
 'change',
 'in',
 'the',
 'conditions',
 'such',
 'as',
 'of',
 'climate',
 'ele-vation',
 'of',
 'the',
 'land,',
 '&c.,

In [44]:
# word_tokenize expects a string, so join the list back into one text first
nltk.word_tokenize(" ".join(no_line_breaks))

['the',
 'inhabitants',
 'of',
 'the',
 'surrounding',
 'districts',
 'will',
 ',',
 'also',
 ',',
 'be',
 'thus',
 'prevented',
 '.',
 'moritz',
 'wagner',
 'has',
 'lately',
 'published',
 'an',
 'interest-ing',
 'essay',
 'on',
 'this',
 'subject',
 ',',
 'and',
 'has',
 'shown',
 'that',
 'the',
 'service',
 'rendered',
 'by',
 'isolation',
 'in',
 'preventing',
 'crosses',
 'between',
 'newly-formed',
 'varieties',
 'is',
 'probably',
 'greater',
 'even',
 'than',
 'i',
 'supposed',
 '.',
 'but',
 'from',
 'reasons',
 'already',
 'assigned',
 'i',
 'can',
 'by',
 'no',
 'means',
 'agree',
 'with',
 'this',
 'naturalist',
 ',',
 'that',
 'migration',
 'and',
 'isolation',
 'are',
 'neces-sary',
 'elements',
 'for',
 'the',
 'formation',
 'of',
 'new',
 'species',
 '.',
 'the',
 'im-portance',
 'of',
 'isolation',
 'is',
 'likewise',
 'great',
 'in',
 'preventing',
 ',',
 'after',
 'any',
 'physical',
 'change',
 'in',
 'the',
 'conditions',
 'such',
 'as',
 'of',
 'climate',
 'ele-

In [49]:
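hyphenated_words = []
for string in no_line_breaks:
    if "-" in string:
        # Split at the hyphen to get the two candidate halves
        a = string.split("-")
        print(a)
        hyphenated_words.append(a)
        # Next step (not implemented here): if a[0] + a[1] is in a dictionary
        # of known words, emit the joined form as a single token; otherwise
        # keep the hyphen. Trailing punctuation (e.g. 'ganisms;') has to be
        # stripped before the lookup.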

['interest', 'ing']
['newly', 'formed']
['neces', 'sary']
['im', 'portance']
['ele', 'vation']
['or', 'ganisms;']
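
The dictionary check outlined above could be completed roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real word list, and `rejoin_line_break_hyphens` is an illustrative name. It works on the raw text (before line breaks are removed) so that only line-final hyphens are considered, and the `\w+` groups keep trailing punctuation such as `;` out of the lookup.

```python
import re

# Hypothetical stand-in for a real word list; in practice this would be
# loaded from a dictionary file or a corpus vocabulary.
KNOWN_WORDS = {"interesting", "necessary", "importance", "elevation", "organisms"}

LINE_BREAK_HYPHEN = re.compile(r"(\w+)-[ \t]*\n(\w+)")

def rejoin_line_break_hyphens(text, known_words):
    """Drop a line-break hyphen only when the joined form is a known word."""
    def resolve(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in known_words:
            return joined            # interest- / ing -> interesting
        return left + "-" + right    # newly- / formed -> newly-formed
    return LINE_BREAK_HYPHEN.sub(resolve, text)

sample = (
    "published an interest-\n"
    "ing essay on crosses between newly-\n"
    "formed varieties, and better adapted or-\n"
    "ganisms; and thus"
)
cleaned = rejoin_line_break_hyphens(sample, KNOWN_WORDS)
print(cleaned)
```

The cleaned string can then be passed to `nltk.word_tokenize`. The heuristic is only as good as the word list: a true compound whose joined form also happens to be a word would lose its hyphen, and a split word missing from the list would keep it.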
