### Problem Definition: Tokenizing OCR'd Text

This initial markdown cell sets up the problem. Texts scanned using Optical Character Recognition (OCR) often contain artifacts from the original printed format. A common issue is **line-break hyphenation**, where a word is split with a hyphen at the end of a line (e.g., `interest-` on one line and `ing` on the next). A standard tokenizer would incorrectly treat these as two separate tokens (`interest-` and `ing`) or a single incorrect token (`interest-ing`).

The goal is to design a smarter tokenizer that can correctly join words split across lines (like `interesting`) while preserving legitimate hyphenated words (like `newly-formed`).

```
the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may
```

Here the printing convention of line-break hyphenation would, under a standard tokenizer, generate incorrect tokens like `interest-ing` (or perhaps `interest-` and `ing`). Design a better tokenizer (even one that relies only on pre- and post-processing) for these texts. Note that the correct tokenization of `interest-ing` is `interesting`, but the correct tokenization of `newly-formed` is still `newly-formed`.
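
To see why simple pre-processing alone is not enough, the sketch below (not part of the original notebook) deletes every hyphen that is immediately followed by a newline. The `sample` string is an abridged, hypothetical excerpt of the passage above, chosen to contain one true line-break hyphenation and one legitimate compound.

```python
import re

# Two line breaks from the sample passage: a true line-break hyphenation
# ("interest-" / "ing") and a legitimate compound ("newly-" / "formed").
sample = (
    "published an interest-\n"
    "ing essay on crosses between newly-\n"
    "formed varieties"
)

# Naive pre-processing: remove every hyphen that sits directly before a newline.
naive = re.sub(r"-\n", "", sample)

print(naive)
# published an interesting essay on crosses between newlyformed varieties
```

The first break is repaired correctly, but `newly-formed` collapses into the non-word `newlyformed`. Deciding between the two cases needs some outside knowledge of which joined forms are real words.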

For a more thorough library for handling OCR'd book data, see https://github.com/tedunderwood/DataMunging

### Cell 1: Importing Libraries

This cell imports the necessary Python libraries for the task.
* `sys`: Provides access to system-specific parameters and functions (imported here but not used in the cells that follow).
* `nltk`: The Natural Language Toolkit, a popular library for NLP tasks like tokenization.
* `re`: The regular expression module, used for pattern matching and string manipulation.

In [1]:
# Import necessary libraries
import sys, nltk, re

### Cell 2: Defining a Text Reading Function

This cell defines a helper function, `read_text`, to open and read a file. It iterates through the file line by line, removes any trailing whitespace (like newline characters `\n`) from each line, and returns the content as a list of strings.

In [2]:
# Define a function to read a text file and return its lines.
def read_text(filename):
    # Initialize an empty list to store the lines from the file.
    lines=[]
    # Open the specified file for reading.
    with open(filename) as file:
        # Loop through each line in the file.
        for line in file:
            # Remove any trailing whitespace (like the newline character) and append to the list.
            lines.append(line.rstrip())
    # Return the list of cleaned lines.
    return lines
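
As a quick check of this behavior, here is a minimal, self-contained sketch. The name `read_text_lines` and the explicit `encoding` parameter are assumptions for illustration, not part of the notebook. It writes a tiny temporary file and reads it back, showing that stripping trailing whitespace leaves the end-of-line hyphen as the last character of the line.

```python
import os
import tempfile

def read_text_lines(filename, encoding="utf-8"):
    """Return the lines of a file with trailing whitespace removed."""
    with open(filename, encoding=encoding) as handle:
        return [line.rstrip() for line in handle]

# Write a two-line file (the first line has stray trailing spaces) and read it back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("published an interest-  \ning essay\n")
    path = tmp.name

demo_lines = read_text_lines(path)
os.remove(path)

print(demo_lines)  # ['published an interest-', 'ing essay']
```

Because the trailing spaces are removed along with the newline, a later check such as `line.endswith("-")` still detects the hyphen.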

### Cell 3: Specifying the Data File

This cell defines a string variable `filename` that holds the path to the text file we want to process. This file contains the full text of Darwin's *Origin of Species*.

In [3]:
# Set the variable 'filename' to the path of the input text file.
filename="../data/darwin_origin_ia.txt"

### Cell 4: Reading the Text File

Here, the `read_text` function defined earlier is called with the `filename` path. The entire content of the Darwin text is loaded into the `lines` variable as a list of strings.

In [4]:
# Call the read_text function to load the content of the file into the 'lines' variable.
lines=read_text(filename)

### Cell 5: Creating a Sample Text for Testing

This cell creates a small, multi-line string variable named `testText`. This string contains the exact passage from Darwin's work shown in the initial problem description. Using this smaller sample makes it easier and faster to develop and test the de-hyphenation logic before running it on the entire book.

In [5]:
# Define a multi-line string containing the sample text for testing the tokenizer.
testText="""the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may"""

### Cell 6: Building a Custom Vocabulary

The core strategy for deciding whether to join a hyphenated word is to check if the combined word exists in a dictionary. This cell builds a custom vocabulary (`vocab`) for this purpose.

1.  It starts by populating `vocab` with a standard English dictionary from `/usr/share/dict/words`.
2.  It then enhances this dictionary by adding all the non-hyphenated words from the actual book text (`lines`). This helps account for specialized terms, names, or archaic words present in Darwin's writing that might not be in a standard modern dictionary.

In [6]:
# The strategy is to check if a de-hyphenated word exists in a dictionary.
# This cell builds that dictionary.

# Initialize an empty set to serve as our vocabulary (only membership matters).
vocab=set()

# Open the system's built-in dictionary file.
with open("/usr/share/dict/words") as file:
    # Iterate through each word (line) in the dictionary file.
    for line in file:
        # Add the word (in lowercase) to our vocabulary for fast lookups.
        vocab.add(line.rstrip().lower())

# Now, augment the vocabulary with words from the book itself.
# This accounts for proper nouns, jargon, etc.
for line in lines:
    # Tokenize the current line into words using NLTK.
    words=nltk.word_tokenize(line, language="english")
    # Iterate through each tokenized word.
    for word in words:
        # Check if the word is NOT a line-break hyphenation fragment.
        if not word.endswith("-"):
            # Add the word (in lowercase) to our vocabulary.
            vocab.add(word.lower())
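
The same two-step idea can be sketched as a reusable function. This is a minimal sketch under stated assumptions, not the notebook's implementation: `build_vocab` is a hypothetical helper, the system word list is treated as optional because `/usr/share/dict/words` is not present on every machine, and a simple regular expression stands in for `nltk.word_tokenize` so the example runs without NLTK data.

```python
import os
import re

def build_vocab(text_lines, dict_path="/usr/share/dict/words"):
    """Build a lowercase vocabulary set from an optional word list and a text."""
    vocab = set()
    # Step 1: load the word list if it exists on this machine.
    if os.path.exists(dict_path):
        with open(dict_path, encoding="utf-8", errors="ignore") as handle:
            vocab.update(line.strip().lower() for line in handle if line.strip())
    # Step 2: add every token from the text that is not a line-break fragment.
    for line in text_lines:
        # Regex stand-in for a tokenizer: letters, optional internal hyphens,
        # and an optional trailing hyphen (which marks a fragment).
        for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*-?", line):
            if not word.endswith("-"):
                vocab.add(word.lower())
    return vocab

demo_vocab = build_vocab(
    ["an interesting essay on isolation", "has lately published an interest-"],
    dict_path="",  # skip the system word list so the result is reproducible
)
print(sorted(demo_vocab))
# ['an', 'essay', 'has', 'interesting', 'isolation', 'lately', 'on', 'published']
```

The fragment `interest-` is excluded, while the intact word `interesting` from elsewhere in the text is kept. This is what later allows the joined candidate to be recognized.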

### Cell 7: Implementing the De-hyphenation Logic

This is the main cell where the custom tokenization logic is implemented. It processes the `testText` line by line, identifies words split by hyphens at line breaks, and intelligently joins them based on the `vocab` created in the previous step.

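The code iterates through each line, checks whether its last token ends with a hyphen, and looks ahead to the first token of the next line. If joining the two fragments yields a word found in `vocab`, they are merged into a single token at the end of the current line. A flag (`previousLineHyphenMatch`) records the merge so that the second fragment is dropped from the start of the next line rather than emitted twice. If the joined form is not in `vocab`, both fragments are left unchanged, so `newly-` and `formed` remain separate tokens rather than being rejoined as `newly-formed`.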

In [9]:
# --- Tokenization and De-hyphenation ---

# Split the test text into a list of individual lines.
lines=testText.split("\n")
# Initialize an empty list to store the tokenized version of each line.
tokenized_lines=[]
# Loop through each line of the raw text.
for line in lines:
    # Tokenize the line into words using NLTK's tokenizer.
    tok_words=nltk.word_tokenize(line, language="english")
    # Append the list of tokenized words to our list of tokenized lines.
    tokenized_lines.append(tok_words)
    
# Initialize an empty list to hold the final, corrected tokens.
tokens=[]
# A boolean flag to track if the previous line ended in a hyphen that was successfully merged.
previousLineHyphenMatch=False

# Loop through the tokenized lines using an index to allow lookups to the next line.
for idx,words in enumerate(tokenized_lines):
    # A flag to track if a merge occurs on the *current* line. Reset for each line.
    flag=False
    
    # Check if the line is not empty, ends with a hyphenated word, and is not the last line of the text.
    if len(words) > 0 and words[-1].endswith("-") and idx < len(tokenized_lines)-1:
        # Get the list of words from the next line.
        nextwords=tokenized_lines[idx+1]
        # Check if the next line is not empty.
        if len(nextwords) > 0:
            # Get the first word of the next line.
            first=nextwords[0]
            # Create a candidate word by removing the hyphen from the last word of the current line
            # and concatenating it with the first word of the next line.
            candidate="%s%s" % (re.sub("-$", "", words[-1]), first)
            
            # Check if the lowercase version of the candidate word exists in our vocabulary.
            if candidate.lower() in vocab:
                # If it exists, replace the hyphenated fragment with the complete, merged word.
                words[-1]=candidate
                
                # Set the flag to True, indicating a successful merge occurred.
                # This will be used to remove the first word from the *next* line.
                flag=True
           
    # If the previous line was successfully merged, we need to skip the first word of the current line.
    if previousLineHyphenMatch:
        # Append all words from the current line *except the first one* to our final token list.
        tokens.append(words[1:])
    else:
        # Otherwise, append all words from the current line as they are.
        tokens.append(words)
    
    # Remember whether a merge happened so the next iteration can skip its first word.
    previousLineHyphenMatch = flag

# --- Display Results ---
    
# Print the final, corrected text.
print("Tokenized:\n")
for line in tokens:
    print(' '.join(line))
# Print the original text for comparison.
print("\nOriginal:\n")
print(testText)

Tokenized:

the inhabitants of the surrounding districts will , also , be thus
prevented . Moritz Wagner has lately published an interesting
essay on this subject , and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed .
But from reasons already assigned I can by no means agree
with this naturalist , that migration and isolation are necessary
elements for the formation of new species . The importance
of isolation is likewise great in preventing , after
any physical change in the conditions such as of climate elevation
of the land , & c. , the immigration of better adapted organisms
; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants . Lastly , isolation will give time for a
new variety to be improved at a slow rate ; and this may

Original:

the inhabitants of the surrounding districts will, also, be thus
pre
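
In the tokenized output above, `newly-` and `formed` are left as two tokens on separate lines. One possible extension, sketched here as an assumption rather than as the notebook's method, is to fall back to the hyphenated form when the joined form is not in the vocabulary. The function name `dehyphenate`, the whitespace tokenization, and the tiny `demo_vocab` are all hypothetical. The sketch also assumes that the compound appears intact elsewhere in the text, so that it is already in the vocabulary.

```python
def dehyphenate(lines, vocab):
    """Join line-break hyphenations, using `vocab` to choose how to join."""
    # Whitespace tokenization keeps the sketch dependency-free.
    tokenized = [line.split() for line in lines]
    result = []
    skip_first = False  # True when the previous line absorbed our first token
    for idx, words in enumerate(tokenized):
        words = words[1:] if skip_first else list(words)
        skip_first = False
        has_next = idx + 1 < len(tokenized) and len(tokenized[idx + 1]) > 0
        if words and words[-1].endswith("-") and has_next:
            stem = words[-1][:-1]
            first = tokenized[idx + 1][0]
            # Ignore trailing punctuation when looking the word up.
            core = first.rstrip(".,;:!?")
            if (stem + core).lower() in vocab:
                # True line-break hyphenation: drop the hyphen.
                words[-1] = stem + first
                skip_first = True
            elif (stem + "-" + core).lower() in vocab:
                # Legitimate compound: keep the hyphen.
                words[-1] = stem + "-" + first
                skip_first = True
        result.append(words)
    return result

demo_vocab = {"interesting", "newly-formed", "organisms"}
demo = dehyphenate(
    [
        "published an interest-",
        "ing essay on crosses between newly-",
        "formed varieties of or-",
        "ganisms; and thus",
    ],
    demo_vocab,
)
for line in demo:
    print(" ".join(line))
# published an interesting
# essay on crosses between newly-formed
# varieties of organisms;
# and thus
```

If neither form is in the vocabulary, the fragments are left untouched, which matches the conservative behavior of the cell above.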
