# Stripping an ECCO-TCP text for parts #

While the bulk of the work in creating a training for Tesseract 3 lies in isolating and identifying the letter forms that we want Tesseract to recognize, we can also improve the computer's chances of recognizing text correctly by including lists of words in the training. Several of the trainings that we'll look at begin with the English dictionary provided by the Early Modern OCR Project (EMOP) at Texas A&M, which is a word list drawn from the EEBO-TCP and ECCO-TCP corpora. (See [EMOP's description](http://emop.tamu.edu/outcomes/github/TesseractTrainiing).)

In building a training aimed at reading Thomson's *Sophonisba* with the greatest possible accuracy, though, I wanted to ensure that the training included words from *Sophonisba* that *weren't* included in EMOP's dictionary.

I began with the double-keyed human transcription of *Sophonisba* created by the ECCO-TCP project. Because I wanted to create a training tuned not simply for recognizing *text* but rather for recognizing *type*, I performed a series of find-and-replace operations to replace groups of characters with Unicode representations of the ligatures that would have been used in Bowyer's printing shop. (Some of these ligatures--like ﬅ--are represented in Unicode and can be displayed on screen. Others do not have a standard Unicode representation and appear as "replacement" characters in a text editor: � or similar.) The resulting file is not one a human reader would want to try to read, but it works well for a computer.
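
The find-and-replace pass described above might look something like the following. This is only a sketch: the `LIGATURES` table and the `apply_ligatures` function are hypothetical names, and the table covers only ligatures that have standard Unicode code points, not the full set used for the *Sophonisba* training. Longer letter groups come first so that "ffi" is not consumed by the "ff" or "fi" rules.

```python
# -*- coding: utf-8 -*-
# Hypothetical mapping from letter groups to Unicode ligature code points.
# The actual replacements made for the Sophonisba file are not listed here.
LIGATURES = [
    (u'ffi', u'\ufb03'),      # ffi ligature
    (u'ffl', u'\ufb04'),      # ffl ligature
    (u'ff', u'\ufb00'),       # ff ligature
    (u'fi', u'\ufb01'),       # fi ligature
    (u'fl', u'\ufb02'),       # fl ligature
    (u'\u017ft', u'\ufb05'),  # long s + t ligature
]


def apply_ligatures(text, ligatures=LIGATURES):
    """Replace each letter group with its ligature, longest groups first."""
    for letters, ligature in ligatures:
        text = text.replace(letters, ligature)
    return text


# A made-up sample containing a long s (U+017F).
sample = u'a \u017ftiff affliction'
print(apply_ligatures(sample))
```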

The code in this notebook goes through a few steps:
+ Extract the text of *Sophonisba* from the ECCO-TCP XML file, throwing away the TEI markup in the process.
+ Construct a list of distinct words in *Sophonisba*. (The word "the" appears in *Sophonisba* 777 times. We only need it once for this purpose.)
+ Check to see which of the distinct words in *Sophonisba* are and are not in the EMOP dictionary.
+ Save a text file of words from *Sophonisba* that aren't in the EMOP dictionary, with one word per line.
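
To illustrate the difference between word occurrences and distinct words, here is a small sketch using `collections.Counter` on a made-up sentence (not text from the play). The notebook itself builds its list by hand further down; `Counter` is just one convenient way to see both the counts and the distinct words at once.

```python
from collections import Counter

# Toy stand-in for the stripped play text; not taken from Sophonisba.
sample = u'the queen and the general and the senate'

# Counter maps each word to the number of times it occurs.
counts = Counter(sample.split())

# The keys of the Counter are the distinct words, each listed once.
distinct = sorted(counts)

print(counts[u'the'])
print(distinct)
```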

### Import Python packages ###

This notebook uses `BeautifulSoup` with `lxml` to parse the TEI XML and `re` for some substitutions. It also uses the `codecs` package for opening the ECCO-TCP file, rather than the more familiar `with open()`, in order to cope more gracefully with the UTF-8 character encoding of the file. Python 3 treats strings as Unicode by default and its built-in `open()` accepts an `encoding` argument, which is an argument for updating this code.
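
As a rough sketch of what an updated version could do instead of `codecs.open()`, `io.open()` takes an `encoding` argument and is available in both Python 2.6+ and Python 3 (where it is the same function as the built-in `open()`). The temporary file and its contents below are made up so that the example is self-contained.

```python
import io
import os
import tempfile

# Write a small UTF-8 file to a temporary location (a stand-in for the real data file).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'w', encoding='utf-8') as outfile:
    outfile.write(u'Carthage\u2014Glorious')

# io.open() decodes as it reads, so we get Unicode text back rather than raw bytes.
with io.open(path, 'r', encoding='utf-8') as infile:
    text = infile.read()

print(len(text))
```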

In [None]:
import codecs
import re
from bs4 import BeautifulSoup
import lxml

### Getting the text ###

We open the file with `codecs` and pass the contents of the file to `BeautifulSoup`.

In [None]:
with codecs.open('/media/sf_RBSDigitalApproaches/data/0613_Thursday_data/K132743.000-typography.xml', 'r', 'utf-8') \
as infile :
    
    soup = BeautifulSoup(infile,'xml')
    # Find the "text" element of our TEI document, then get_text() to get the text content, throwing away all 
    # the markup. 
    stripped = soup.find('text').get_text()
    
    # Get rid of the line breaks and extraneous white space. The re package's sub() function takes three arguments:
    # 1) a string or regular expression to search for; 2) a string to substitute in place of that string or regular
    # expression when it's found; and 3) the text to search.
    # In this case, we are defining the regular expression in place with re.compile() (we're looking for one or more
    # contiguous white space characters--note that a newline character counts as a variety of white space, as does a tab).
    # Whenever we find a run of white space, we'll replace it with a single white space.
    smushed = re.sub(re.compile(r'\s+'),' ',stripped)
    
    # Save the resulting text to a file. Note that a single white space turns up at the beginning of the text
    # (any white space before the first word has been collapsed to one space above). We'll strip that left-most
    # white space from the text, then let codecs encode the text as UTF-8 as it is written to the file.
    with codecs.open('/media/sf_RBSDigitalApproaches/data/0613_Thursday_data/K132743.000-typography-text.txt', 'w', 'utf-8') as outfile :
        outfile.write(smushed.lstrip())
        print('Text saved')
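
The effect of the `\s+` substitution is easier to see on a small example. The string below is invented to resemble the kind of output `get_text()` might return; real TEI output will differ. It also shows where a leading space can come from: any white space before the first word is collapsed to a single space rather than removed.

```python
import re

# Made-up stand-in for the output of get_text(); real TEI output will differ.
stripped = u'\n   SOPHONISBA.\n\tA TRAGEDY.\n\n  ACT I.  '

# Every run of white space (spaces, tabs, newlines) becomes one space,
# including the runs at the very start and end of the text.
smushed = re.sub(r'\s+', u' ', stripped)
print(repr(smushed))

# An alternative: split on white space and re-join, which also trims both ends.
tidy = u' '.join(stripped.split())
print(repr(tidy))
```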

### Remove punctuation and identify distinct words ###

This block starts by removing most punctuation. I've had to do this in two passes (the second removes em-dashes between two words) because the em-dash has to be represented with its Unicode hex value. I have a sense that there's a cleaner way to do this, but it's escaping me for the time being.

Next, it builds up a list of distinct words by reading through the words one at a time and saving each new word it encounters.

In [None]:
# Compile a regular expression for the characters we want to remove: common punctuation marks and digits
punctuation = re.compile(r'[.,?;:!()\d]')

# Create an empty list to hold our distinct words.
distinct_words = []

# Open the text file we saved in the last block
with codecs.open('/media/sf_RBSDigitalApproaches/data/0613_Thursday_data/K132743.000-typography-text.txt', 'r', 'utf-8') \
as infile :
    text = infile.read()
    # Use the re.sub() function to search for the regular expression we defined above for punctuation marks. When 
    # we encounter any of those characters, delete it (by replacing it with nothing: '')
    de_punctuate = re.sub(punctuation,'',text)
    
    # Create a list of all of the words in the text by splitting the text on every white space
    all_words = de_punctuate.split(' ')
       
    # Iterate through the list of all the words in the text
    for word in all_words :
        # In some cases, an emdash appears between two words with no surrounding white space. This would lead to "words"
        # like "Carthage--Glorious", so we need to get rid of those em-dashes and check each of the words on either
        # side of the emdash. 
        # First, we search each word for an em-dash...
        if re.search(u'\u2014', word) is not None :
            # If we find an em-dash, we partition the word on the first em-dash. This creates a three-item tuple:
            # 0) the part before the em-dash; 1) the em-dash itself; 2) the part after the em-dash.
            # We create a list of the first and last items in that tuple--that is, the word before the em-dash
            # and the word after the em-dash
            conjoined_words = [word.partition(u'\u2014')[0], word.partition(u'\u2014')[2]]
            
            # Now we check each of the two words that had previously been connected by an em-dash.
            for conjoined_word in conjoined_words :
                # If the word is not in our list of distinct_words already, add it to the list
                if conjoined_word.strip() not in distinct_words :
                    print('"' + conjoined_word + '": Haven\'t seen this before')
                    distinct_words.append(conjoined_word.strip())
                # If the word is already in our list of distinct_words, move along to the next word
                else :
                    print('"' + conjoined_word + '": Got that one already')
        
        # If we *don't* find an em-dash in our word...
        else :
            # If this word isn't already in our list of distinct_words, add it to the list
            if word.strip() not in distinct_words :
                print('"' + word.strip() + '": Haven\'t seen this before')
                distinct_words.append(word.strip())
            # If the word is already in our list of distinct_words, move along.
            else :
                print('"' + word.strip() + '": Got that one already')      
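
One possible "cleaner way" to handle the em-dash, offered as a sketch rather than a drop-in replacement: treat the em-dash as a word separator alongside white space, so that a single `re.split()` does the work of both passes. The function name `distinct_words_in` and the sample sentence are made up for illustration. A `set` tracks which words have been seen, which is also faster than searching a list on every word.

```python
import re

# Characters that separate words: white space and the em-dash (U+2014).
separators = re.compile(u'[\\s\u2014]+')
# The same punctuation marks and digits that the notebook's pattern removes.
punctuation = re.compile(r'[.,?;:!()\d]')


def distinct_words_in(text):
    """Return the distinct words in text, in order of first appearance."""
    seen = set()
    ordered = []
    for word in separators.split(punctuation.sub(u'', text)):
        # Skip the empty strings that split() leaves at the edges of the text.
        if word and word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered


# A made-up sample with an em-dash joining two words.
sample = u'O Carthage\u2014Glorious city! O Carthage, fallen (1730).'
print(distinct_words_in(sample))
```

Unlike `partition()`, which splits only on the first em-dash in a word, this approach handles words joined by more than one em-dash.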

### Write our distinct words to a file ###

In [None]:
# Let's put these words in alphabetical order. Because.
alphabetical = sorted(distinct_words)
# Create a text file to receive our words.
with codecs.open('/media/sf_RBSDigitalApproaches/data/0613_Thursday_data/Sophonisba_distinct_words.txt', 'w', 'utf-8')\
as outfile :
    # Loop through each word in our list of distinct_words
    for distinct_word in alphabetical :
        # This is just a check to eliminate a couple of gremlins I was seeing...
        if distinct_word == '' or distinct_word == '"' :
            pass
        else :
            # Write each word in the list of distinct_words to the file, followed by a line break
            outfile.write(distinct_word + '\n')
            print('Saving: ' + distinct_word)
    print('Word list saved')

### Look for words that are in *Sophonisba* that aren't in the EMOP dictionary ###


In [None]:
# Create an empty list to hold all the words in the EMOP dictionary
emop_words = []
# Open the EMOP dictionary
with codecs.open('/media/sf_RBSDigitalApproaches/data/0613_Thursday_data/Font_Training_Files/EMOP-dictionary.txt', \
'r', 'utf-8') as dictfile :
    # Loop through the lines in the EMOP dictionary
    for line in dictfile :
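        # Add each line (one word) to the list we created above, stripping the trailing line break so that
        # the entries can be compared with the stripped words from Sophonisba below.
        emop_words.append(line.strip())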
    
# Open the file of distinct words from Sophonisba (which we will refer to by the variable name "sophtext") and
# create an output file for the words that prove not to be in the EMOP dictionary
with codecs.open('/media/sf_RBSDigitalApproaches/data/0613_Thursday_data/Sophonisba_distinct_words.txt', 'r', 'utf-8') \
as sophtext, codecs.open('/media/sf_RBSDigitalApproaches/data/0613_Thursday_data/Font_Training_Files/Sophonisba-words.txt', \
'w', 'utf-8') as outfile :
    # Loop through the lines of sophtext
    for line in sophtext :
        sophword = line.strip('\n')
        # If the word isn't in the list of emop_words, write it to our file.
        if sophword not in emop_words :
            print('"' + sophword + '": wasn\'t in the EMOP dictionary. Saving.')
            outfile.write(sophword + '\n')
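
The comparison can also be expressed as a set difference. The sketch below uses short in-memory lists in place of the two files, and the words are purely illustrative (no claim is made about what the EMOP dictionary actually contains). It strips line endings on both sides, since a dictionary entry read as `"the\n"` will never compare equal to `"the"`.

```python
# Toy stand-ins for the two word-list files, one word per line.
dictionary_lines = [u'the\n', u'queen\n', u'senate\n']
play_lines = [u'Masinissa\n', u'queen\n', u'the\n', u'Syphax\n']

# Strip the line endings on both sides so that "the\n" and "the" compare equal.
dictionary_words = set(line.strip() for line in dictionary_lines)
play_words = set(line.strip() for line in play_lines)

# Words in the play that the dictionary lacks, in alphabetical order.
missing = sorted(play_words - dictionary_words)
print(missing)
```

Membership tests against a `set` are also much faster than against a list, which matters when the dictionary is large.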
