# Introduction

Richard Hakluyt (1553-1616), though an ordained priest, is chiefly remembered for his promotion of English colonial expansion, and especially for his printed works. He published *Divers Voyages Touching the Discoverie of America* in 1582; the first edition of *The Principall Navigations, Voiages, Traffiques and Discoueries of the English Nation* followed in 1589 and was expanded into three volumes in 1598–1600. The latter, a truly voluminous compilation of English and broader European travel narratives (a modern edition spans 11 volumes and over 5,000 pages in total, with another volume devoted to indexes), is the focus of the present investigation, which examines the representation of colonial violence through computational text analysis and tests the digital methods against a concrete research project.

Despite the obvious challenges posed by the shaky standardization and odd spellings of Early Modern English, Hakluyt is particularly well-suited for computational methods due to the sheer volume of text; particular affordances are further opened up by his relative discursive consistency, as will be expanded below. Digital Humanities (which I roughly lump here with computational text analysis or text mining) has been hailed as redeeming literary studies from the limits of what is humanly readable--tens, at most hundreds of works--limits that impose a narrow canon on the tens and hundreds of thousands of works published since the advent of print (and perhaps implicitly as delivering the discipline from the low prestige of subjective Humanities-style scholarship). It has also been derided as at best a gimmick and at worst a blatant neoliberal takeover of one of the last bastions of critique. In the current project, however, the limitations of text mining, and in particular the simplistic and 'flat' nature of the query results, will be not only acknowledged but exploited as part of the investigation. Tracking the representations of violence in Hakluyt's collection, I will use digital methods, particularly word frequency counts and topic modeling, as a proxy for a hasty or naive reading, and then contrast it with a closer reading of sample passages. I argue that Hakluyt dramatically downplays colonial violence and is in fact much more concerned with intra-European conflict (primarily hostilities between English and Spanish), while colonial violence is often not explicitly registered as violence and takes more sophisticated methods to uncover. The vast volume of the text, conducive to digital methods, precludes a full or even a representative close reading, and so I will not aspire to definitively solve this problem, but rather to raise valuable questions.

*The Principal Navigations*, famously termed "the Prose Epic of the modern English nation" (James A. Froude, 1852), reads as vehemently nationalist in its commitment to British expansion and power as well as in its principled stand against the Spanish as a curb on said English power (though some of the modern connotations of nationalism are no doubt an anachronism for any 16th-century writer). Hakluyt could thus be expected to play up the Spanish Black Legend at every opportunity with a Las Casas-like litany of colonial atrocities to both denigrate the Spanish and call for a benevolent British alternative (as he indeed has done in the *Discourse Concerning Western Planting*). Overt instances of violence in the reports, however, address intra-European conflict at least as often as they do properly colonial violence (meaning European-American conflict). Given the scope of 16th-century colonial devastations, an equal representation is indeed a dramatic misrepresentation, somewhat evocative of Rob Nixon's 2011 argument about "slow violence." Nixon investigates the disparity in media and activist attention between bounded instances of spectacular violence featuring clearly identifiable acts and perpetrators (9/11 comes to mind as a particularly striking example) and the objectively vaster human harms of environmental degradation, which often slip by unnoticed by the centers of power. Although colonial violence was often 'spectacular' in the most horrific of ways, it clearly gets short shrift in *The Principal Navigations*; here as well as in Nixon's research, the victims who suffer a second time through erasure are the oppressed and disempowered. The suffering of the American 'Indians,' like that of today's poor (often also located in the Global South), becomes second-rate, implying that they are effectively counted as less than human, worth less than English or even enemy Spanish.
The symptomatic hasty or naive computer-reading that skims over American suffering will be complemented by spot-readings on ostensibly low-violence passages giving an occasional taste of the kinds of oppression that fail to register as violence, many of which cluster around issues of property or possession and the evocative keyword 'take.'

The project derives from an open-ended exploration of *The Principal Navigations* as much concerned with training myself in computational text analysis as with producing research findings as such, and it retains some of the messiness of that process. The current notebook essentially shapes code and analysis into a coherent narrative while trying to clean up the programming. Where the analysis runs internally in Python, I will make sure that it runs right, but where external tools are used--especially regarding topic modeling, where every run is necessarily different due to the random component of the algorithm--I will rely on the products of the original computations.

The notebook begins with text cleaning, which, though distinctly unexciting, is a necessary foundation for any computational analysis. It proceeds with some basic numbers on the scope and simple characteristics of *The Principal Navigations*, and then surveys the broad discursive qualities of the text through word frequencies and topic modeling. Finally, I zero in on matters of violence, first exploring its discursive neighborhood through word frequencies and network graphs, and then diving into more specific detailed analysis with topic modeling. The main finding will be the approximate ratio between representations of intra-European violence and representations of colonial violence; the under-representation of colonial violence will be given further dimension through a series of close readings that point out the violence that fails to register through the computer reading. In this way, the very limitations of computational text analysis reinforce the argument: in only registering the most obvious instances of violence, the algorithm shows us a valuable lesson on silencing and erasure.

# Text Cleaning

Text cleaning is widely acknowledged to be in equal measures crucial and unexciting; feel free to search-hop to the "Basic Numbers Rundown" section down below unless you have a practical interest in the process (perhaps particularly in handling Early Modern English spelling challenges).

## Source Selection

Oxford University Press has contracted a critical edition of *The Principal Navigations*, but its website at http://www.hakluyt.org/ is currently down and it is in any case still a work in progress ([cf. website here](https://mooreinstitute.ie/projects/the-hakluyt-edition/)). A [report from that project](https://ora.ox.ac.uk/objects/uuid:9f4e7aa8-7368-4085-bb28-ce45e24e3a19) indicates a number of available editions, "including H.R Evans’s five-volume *Hakluyt’s Collection of the Early Voyages, Travels, and Discoveries of the English Nation* (1809‒12); a ‘rearranged’ version by Edmund Goldsmid (published in Edinburgh in sixteen volumes, 1885‒90)); the twelve-volume edition prepared by James MacLehose and Sons in Glasgow (1903‒05); and the Dent Everyman edition in eight volumes (1907) which excluded Hakluyt’s Latin texts." Like the project editors, I have found Goldsmid's rearrangements and Dent's modifications to compromise the integrity of the source. The project ultimately chose the TCP transcription of the original edition [available through EEBO](https://quod.lib.umich.edu/e/eebo/A02495.0001.001?view=toc) as its starting point, but the advantage of the proximity to the original text is negated for my purpose by its strict adherence to Early Modern English spelling conventions, particularly the frequent substitutions of u for v and vice-versa. The MacLehose edition conveniently modernizes that particular usage, and I chose to use that instead. I was able to access a PDF version through [Cambridge Core](https://www.cambridge.org/core/search?q=%22The+Principal+Navigations+Voyages+Traffiques+and+Discoveries+of+the+English+Nation%22&_csrf=Yf0FTrIr-KAzLwYWrAIZ6B5I2_7Pr1rlHFl8). In retrospect, the Goldsmid edition, freely available [here](http://onlinebooks.library.upenn.edu/webbin/metabook?id=hakluyt), would have sufficed and even proven advantageous.
When I ran my core analysis on the third, American volume of the *Navigations*, I had to selectively exclude reports concerning travels through the Pacific and on towards China--while Goldsmid conveniently "grouped together those voyages which relate to the same parts of the globe, instead of adopting the somewhat haphazard arrangement of the original edition."
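
For readers who would rather work from the original-spelling transcription, the u/v interchange can be partly normalized by rule. The following is only a minimal sketch and nothing in this notebook depends on it: the function name `modernize_uv` and its two rules are assumptions made for illustration, they expect lowercase input, and they will mis-handle some words (Latin passages in particular).

```python
import re

def modernize_uv(word):
    """Heuristically modernize u/v in a lowercase Early Modern English word."""
    # word-initial 'v' before a consonant usually stands for a vowel: vnto -> unto, vp -> up
    word = re.sub(r'^v(?=[^aeiouy])', 'u', word)
    # 'u' between two vowels usually stands for a consonant: haue -> have, euery -> every
    word = re.sub(r'(?<=[aeiou])u(?=[aeiou])', 'v', word)
    return word

sample = ['vnto', 'vp', 'haue', 'euery', 'discouerie', 'voyage', 'house']
print([modernize_uv(w) for w in sample])
```

Any rule set of this kind would need checking against a concordance before being applied to a whole corpus, which is part of why an edition that already modernizes the usage was the more practical choice.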

## Text Extraction from PDFs (with basic cleaning)

Cambridge Core offers clean scans in PDF format with an underlying text layer (thus, selectable & searchable rather than merely a static image scan), one file per chapter or section in the paper volume. Each volume follows a standard numbered file naming format, so I downloaded each into a separate folder to avoid confusion.

In [7]:
from IPython.display import IFrame
IFrame("text-data/sample.pdf", width=600, height=300)

As seen in the sample PDF, there are a few obvious challenges: headers and footers, sidenotes, intrusions from the preceding and next chapters, and an ornamental letter at the opening of each chapter that evades OCR. 

### PDFminer attempt

I first tried to use the PDF text layer as given through the [pdfminer library](https://pypi.org/project/pdfminer/), leaving out blank lines, blank pages, and headers & footers. I also used the process to extract the relevant date from the headers, encoding the following metadata in the text filename: volume number, chapter number, geographical region, date, title, and page range. The code ran as follows:

In [9]:
import pdfminer
from pdfminer.high_level import extract_text
from pdfminer.high_level import extract_pages
import re
import os

In [None]:
def remove_blank_lines(text):
    '''
    remove blank lines from text given as list of lines
    parameters: text: chapter text split into lines
    returns: all non-empty lines from text
    '''
    # keep every line that is not an empty string (whitespace-only lines are kept, as before)
    return [line for line in text if line != '']

def find_date(headers):
    '''
    extract date from list of chapter headers
    parameters: headers: list of lines from chapter headers
    returns: date identified as the most common numerical component of the header lines, zero-padded to 4 digits
    '''
    # join with a space so that digits from adjacent header lines cannot fuse into one number
    headers_string = ' '.join(headers)
    numbers = re.findall(r'[0-9]+', headers_string)
    if len(numbers) == 0: return 'XXXX'
    elif len(numbers) == 1: common_number = numbers[0]
    else:
        common_number = max(set(numbers), key = numbers.count)
    if int(common_number) > 300 and int(common_number) < 1620:
        return common_number.zfill(4)
    else:
        return 'XXXX'

def empty_page(lines):
    '''
    determine whether a page is empty of text content
    parameters: lines: page text as lines
    returns: True if page contains fewer than 10 lines or lines average fewer than 4 chars
    '''
    if len(lines) < 10: return True
    lines_lens = [len(line) for line in lines]
    if sum(lines_lens)/len(lines_lens) < 4: return True
    return False

def chapter_process(chapter):
    '''
    extract chapter text free of headers & footers and determine date
    parameters: chapter: path of pdf chapter
    returns: 
        chapter text as single string 
        date with leading zeroes to 4 digits or XXXX if failed to extract
    possible enhancements:
        resolve linebreak dashes
        crop out side notes
    '''
    headers = []
    chapter_text_list = []
    chapter_pages = len(list(extract_pages(chapter)))

    for pagenum in range(chapter_pages): 
        # process page by page, clearing headers, footers & blank lines; storing headers for date extract
        text = extract_text(chapter, page_numbers = [pagenum])
        lines = text.splitlines()
        lines_sans_blanks = remove_blank_lines(lines)
        if empty_page(lines_sans_blanks): continue
        headers += lines_sans_blanks[:3]
        if not ('.1_pp' in chapter and pagenum == 0):  #remove header except for first page of first chapter in each volume, which has no header
            del lines_sans_blanks[:3]
        del lines_sans_blanks[-4:] # remove footer
        chapter_text_list += lines_sans_blanks

    #override hyphen-broken words at ends of lines excepting the last
    for i in range(len(chapter_text_list) - 1):
        if chapter_text_list[i][-1] == '-':
            chapter_text_list[i] = chapter_text_list[i][:-1] + chapter_text_list[i+1]
            chapter_text_list[i+1] = ' '

    #joining list of lines into one string and cleaning out extra spaces
    chapter_text_string = ' '.join(chapter_text_list)
    chapter_text_string = re.sub(r'\s+',' ', chapter_text_string)
    return(chapter_text_string, find_date(headers))

def vol_chap_geog_prange(vol, chapter):
    '''
    extract volume, chapter number, broad geographical designation, title and page range ready for feeding into txt file names
    parameters: vol as int; chapter as path of pdf chapter
    returns: 
            vol_z as number zeroed to 2 digits
            chap_z as number zeroed to 2 digits
            geog as CCCC determined by volume / chapter numbers below:
                01.01-04.04: NNE-
                04.05-06.17: SSE1
                06.18-07.16: SSE2
                07.17-11.43: AMER
            title as extracted from file name
            page range zeroed to 3 digits each number
    '''
    chap = chapter[chapter.find('.') + 1 : chapter.find('_')] #extract chap num between first dot and first underscore
    geog = 'XXXX' #to raise flag just in case something escapes
    if vol in [1,2,3]:
        geog = 'NNE-'
    elif vol == 4:
        if int(chap) in range(5): geog = 'NNE-'
        else: geog = 'SSE1'
    elif vol == 5:
        geog = 'SSE1'
    elif vol == 6:
        if int(chap) in range(18): geog = 'SSE1'
        else: geog = 'SSE2'
    elif vol == 7:
        if int(chap) in range(17): geog = 'SSE2'
        else: geog = 'AMER'
    else: geog = 'AMER'

    title = chapter[:-4]
    for i in range(4): # remove section & pages through 4th underscore
        title = title[title.find('_') + 1:]

    page_range = chapter
    for i in range(2):
        page_range = page_range[page_range.find('_') + 1:]
    page_start = page_range[:page_range.find('_')]
    page_range = page_range[page_range.find('_') + 1:]
    page_end = page_range[:page_range.find('_')]
    page_range = page_start.zfill(3) + '-' + page_end.zfill(3)
    return(str(vol).zfill(2), chap.zfill(2), geog, title, page_range)   

In [None]:
# iterate over volumes, then over chapter pdfs excluding front matter etc
for vol in range(1,12):
    filelist = os.scandir(os.getcwd() + '/' + str(vol))
    for entry in filelist:
        if entry.is_file(): 
            if (((vol == 1 and entry.name.startswith('06')) 
                or (vol > 1 and entry.name.startswith('04'))) 
                and entry.name[3] != '0'):  #identify body chapters
                    print(str(vol) + '_' + entry.name)
                    chapter = str(vol) + '/' + entry.name
                    
                    #extract and save text, extract file name components and fit into file name
                    chapter_text, date = chapter_process(chapter)
                    vol_n, chap_n, geog, title, page_range = vol_chap_geog_prange(vol, chapter)
                    filename = vol_n + '_' + chap_n + '_' + geog + '_' + date + '_' + title + '_pp.' + page_range

                    #create text file
                    with open(filename + '.txt', 'w') as f:
                        f.write(chapter_text)

The process worked reasonably well, but the original OCR largely treated the sidenotes as an extension of the text, substantially garbling the original sentences whenever a sidenote showed. 
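
The date extraction in this pass rests on a simple heuristic: the running headers repeat the year of the voyage, so the most frequent number across a chapter's headers is taken as the date, provided it falls in a plausible range. Here is a minimal, self-contained sketch of that idea on made-up header lines. The function name `most_common_year` and the sample headers are assumptions for illustration; the range bounds mirror the ones used in `find_date`.

```python
import re
from collections import Counter

def most_common_year(headers, earliest=300, latest=1620):
    """Return the most frequent number in the header lines if it is a plausible year, else 'XXXX'."""
    # join with spaces so digits from neighboring lines stay separate
    numbers = re.findall(r'[0-9]+', ' '.join(headers))
    if not numbers:
        return 'XXXX'
    candidate, _ = Counter(numbers).most_common(1)[0]
    if earliest < int(candidate) < latest:
        return candidate.zfill(4)
    return 'XXXX'

# hypothetical running headers collected from three pages of one chapter
headers = ['A.D. 1578.', 'THE ENGLISH VOYAGES', '12',
           'A.D. 1578.', 'THE ENGLISH VOYAGES', '13', 'A.D. 1578.']
print(most_common_year(headers))
```

Page numbers also appear in the headers, but each occurs only once per chapter, so the repeated year outvotes them; a chapter with very few pages is where the heuristic is most likely to fail.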

### FineReader extraction

[ABBYY FineReader](https://pdf.abbyy.com/), applied to the PDFs, performed better, and I re-OCR'd the 589 files using the batch function, resulting in text files where the side notes, when correctly identified as such (definitely not always, but perhaps half the time), formed distinct paragraphs. Because the original PDF volumes were kept in separate folders and FineReader reproduced that structure, my processing had to account for it, as I wanted to have all files in a single folder.

Having already formed the filenames as I wanted them, I created a ledger file for easy retrieval and reference; a colleague has suggested I keep the metadata exclusively in a ledger and trim the file names down to a minimum, but I chose the immediate availability of filename-metadata over the efficiency and elegance of referencing it through a ledger every time.
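
To make the filename convention concrete: every field sits at a fixed character position, so it can be read back with plain slices. The sketch below gathers those slices into one function and applies them to a made-up filename. Both the function name `parse_hakluyt_filename` and the example filename are hypothetical, and the date is left as a string here so that an unresolved `XXXX` would not raise an error.

```python
def parse_hakluyt_filename(fileid):
    """Split a corpus filename into its metadata fields by fixed character positions."""
    first_page, last_page = (int(p) for p in fileid[-11:-4].split('-'))
    return {
        'vol': int(fileid[:2]),
        'chap': int(fileid[3:5]),
        'geog': fileid[6:10],
        'date': fileid[11:15],
        'title': fileid[16:-15],
        'pages': (first_page, last_page),
        'page_length': last_page - first_page + 1,
    }

# hypothetical filename, made up for illustration
example = '07_17_AMER_1578_Some_Voyage_Title_pp.123-145.txt'
print(parse_hakluyt_filename(example))
```

The trade-off is the one noted above: slicing is immediate and needs no lookup, but it breaks silently if any field ever changes width, which is the case a ledger would handle more gracefully.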

In [10]:
def HK_date(fileid):
    #extracts date component from Hakluyt text file name, returns as integer
    return(int(fileid[11:15]))
def HK_geog(fileid):
    #extracts geography component from Hakluyt text file name, returns as string
    return(fileid[6:10])
def HK_title(fileid):
    #extracts title component from Hakluyt text file name, returns as string
    return(fileid[16:-15])
def HK_vol(fileid):
    #extracts volume component from Hakluyt text file name, returns as integer
    return(int(fileid[:2]))
def HK_chap(fileid):
    #extracts chapter component from Hakluyt text file name, returns as integer
    return(int(fileid[3:5]))
def HK_pages(fileid):
    #extracts page range component from Hakluyt text file name, returns as tuple of integers
    pages_char = fileid[-11:-4]
    pages_char_split = pages_char.split('-')
    return(int(pages_char_split[0]), int(pages_char_split[1]))
def HK_page_length(fileid):
    #extracts page length component from Hakluyt text file name, returns as integer
    first_page, last_page = HK_pages(fileid)
    return(last_page-first_page+1)

In [48]:
import csv
with open('text-data/ledger.csv', 'w', newline='') as ledgerfile: 
    ledger = csv.writer(ledgerfile) 
    ledger.writerow(['vol', 'chap', 'geog', 'date', 'title', 'pages'])
    filelist = os.scandir('text-data/CambridgeCore MacLehose pdfminer extract')
    for file in filelist:
        ledger.writerow([HK_vol(file.name), HK_chap(file.name), HK_geog(file.name), HK_date(file.name), HK_title(file.name), str(HK_pages(file.name)[0]).zfill(3)+'-'+str(HK_pages(file.name)[1]).zfill(3)])

Thus, processing the FineReader output, I merely did what I could to clean out headers & footers, and drew the formatted file names from the ledger:

In [47]:
def process_text(path):
    '''
    clean up a text file from FineReader
    parameters: path: text file path
    returns: text as string cleaned of headers & footers based on Cambridge Hakluyt edition
    '''
    clean_lines = []
    with open(path, 'r', encoding="utf8") as f:
        text = f.readlines()
        i = 0

        def is_footer(line):
            #footer lines contain a URL or the 'material originally positioned' notice
            return 'https://' in line or 'The material originally positioned' in line

        while i < len(text) - 1:
            #skip up to 4 short lines - suspects for header remnants
            skipped = 0
            while skipped < 4 and i < len(text) and len(text[i]) < 35:
                i += 1
                skipped += 1
            if i > len(text) - 1: break
            # after header, collect body text until footer start
            while not is_footer(text[i]):
                clean_lines.append(text[i])
                i+=1
                if i >= len(text) - 1: break
            #once footer starts, skip footer lines, check for page number and repeat loop
            while is_footer(text[i]):
                i+=1
                if i > len(text) - 1: break
            if i > len(text) - 1: break
            #check for page number that occasionally precedes footer -- now in clean text
            if clean_lines and len(clean_lines[-1]) < 4:
                clean_lines.pop()
    #returns text as string with newlines preserving line breaks just in case
    return('\n'.join(clean_lines))
def create_filename(vol, file_name):
    '''
    create file name based on format set in pdfminer extract, including date, drawing on CSV ledger
    parameters:
        vol: number of volume as integer
        file_name: text file name
        returns: new filename as string
    csv headers: vol chap geog date title pages
    '''
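    #format vol and chap as zero-padded strings for the file name
    vol = str(vol).zfill(2)
    #slice chapter portion of filename
    chap = file_name[3:5]
    #correct for chaps 0 through 9
    if chap[1] == '_':
        chap = '0' + chap[0]
    r = None
    with open ('text-data/ledger.csv', mode = 'r', newline = '') as ledger:
        ledger_reader = csv.DictReader(ledger)
        #iterate over csv rows until find relevant one; compare as integers, since the ledger may store unpadded numbers
        for row in ledger_reader:
            if int(row['vol']) == int(vol) and int(row['chap']) == int(chap):
                r = row
                break
    if r is None:
        raise ValueError('no ledger entry for volume ' + vol + ', chapter ' + chap)
    #re-pad the date, which loses leading zeroes if the ledger stored it as a number
    name = '_'.join([vol, chap, r['geog'], r['date'].zfill(4), r['title']])
    return(name + '_pp.' + r['pages'] + '.txt')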

In [51]:
#iterate through volumes
os.makedirs('text-data/ARG/', exist_ok=True)  #does not fail if the folder already exists from an earlier run
for vol in range(1,12):
    #filelist = os.scandir('text-data/Cambridge MacLehose FineReader OCR/' + str(vol))
    #having processed the FineReader output once, I removed the original files to an archive folder; I have to account for that here
    filelist = os.scandir('text-data/archive/Cambridge MacLehose FineReader OCR raw/' + str(vol))
    for entry in filelist:
        #identify body chapters
        if entry.is_file(): 
            #only pick up body chapters rather than front matter and such
            if (((vol == 1 and entry.name.startswith('06')) 
                or (vol > 1 and entry.name.startswith('04'))) 
                and entry.name[3] != '0'):
                # get clean text and desired filename
                text = process_text(entry.path)
                filename = create_filename(vol, entry.name)
                # create text file
                with open('text-data/ARG/'+filename, 'w', encoding="utf8") as f:
                    f.write(text)

## Text Cleaning

While there is some basic text cleaning in the above code, I realized on multiple later occasions that I could or needed to do better, further branching and splitting up my text corpuses. I aggregate the text cleaning operations below for the sake of organization, but I will likely have to keep scrubbing midway through later research projects too, as issues or opportunities come up unexpectedly.

Thus, recognizing that some headers and artifacts that probably mark volume and page number of the original edition (e.g., "[II. i. 281.]") made it into the text, I ran an extra scrub as follows:

In [52]:
filelist = os.scandir('text-data/ARG/')
for entry in filelist:
    with open(entry.path, 'r', encoding="utf8") as fr:
        text = fr.read()
        text = text.replace('THE ENGLISH VOYAGES', '')
        text = re.sub(r'\[.{1,9}?\]', '', text)
        with open('text-data/ARG/' + entry.name, 'w', encoding="utf8") as fw:
            fw.write(text)
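
A note on the bracket pattern in the scrub above: `\[.{1,9}?\]` removes any bracketed span of up to nine characters, so it can also catch short editorial insertions, and it will not match a reference longer than nine characters between the brackets (the example "[II. i. 281.]", written with spaces, has eleven). A more targeted pattern might look like the sketch below. The name `FOLIO_REF` and the assumed shape of a reference (Roman-numeral volume, optional lowercase part, Arabic page number) are my assumptions and would need checking against the corpus.

```python
import re

# assumed shape: [II. i. 281.] or [III. 45.] -- volume, optional part, page
FOLIO_REF = re.compile(r'\[[IVX]+\.\s*(?:[ivx]+\.\s*)?\d+\.?\]')

sample = 'they departed [II. i. 281.] from the coast [as some say] toward the West'
print(FOLIO_REF.sub('', sample))

# the generic pattern leaves the spaced reference untouched
print(re.sub(r'\[.{1,9}?\]', '', '[II. i. 281.]'))
```

The targeted pattern leaves other bracketed text in place, at the cost of missing references that OCR has mangled out of the expected shape.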

***
**Note: potentially far more elegant computer-vision solution to headers, footers, and sidenotes**

William Mattingly has put forward a far more refined solution for trimming excess text on a page [here](https://github.com/wjbmattingly/text-analysis-for-ancient-and-medieval-languages/blob/main/ancient_medieval_02.ipynb), but I've never gotten around to getting the basic facility with OpenCV necessary to adapt his code to my case; in retrospect, I would not have gained much by this improvement, though it still irks me now and then.
***

An important data stewardship issue comes up with operations such as text cleaning: do I just overwrite the 'dirty' corpus, potentially losing data, or do I create a new copy, leaving behind a trail of useless data? I opted for the second approach in my original workflow, which made a lot of sense especially as I was testing things out and could easily, say, overwrite the corpus with empty spaces or, in some ways worse, get what looks like the desired result but lose or garble 30% of the data. As I am working with verified code in this cleaned-up notebook, I am opting to overwrite whenever it makes sense, but it is a choice worth addressing explicitly.
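
A middle path between the two approaches is to snapshot the corpus folder once before an in-place scrub, so that overwriting stays reversible without a copy accumulating at every step. The following is a minimal sketch, not part of the original workflow; the function name `snapshot_corpus` and the timestamped naming scheme are assumptions, and the default archive location simply reuses the `text-data/archive` folder mentioned earlier.

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot_corpus(folder, archive_root='text-data/archive'):
    """Copy a corpus folder to a timestamped archive folder and return the copy's path."""
    source = Path(folder)
    stamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    target = Path(archive_root) / f'{source.name}_{stamp}'
    # copytree creates the target (and any missing parent folders) itself
    shutil.copytree(source, target)
    return target

# intended use, before a cell that rewrites files in place:
# snapshot_corpus('text-data/ARG')
```

Old snapshots can then be deleted by hand once a scrub has been verified, which keeps the trail of stale copies short.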

In another attempt to patch up an imperfect text corpus, I tried standardizing variant spellings by simple find and replace. By the time I got to it, I had multiple corpuses, which complicated everything (see the 'text cleaning' notebook); had I thought of it earlier in the process, it could have been much simpler, as below. As I only got to it late in the process, I did not use it extensively except to consolidate spellings of tokens I had particular interest in, such as violence indicators.

In [100]:
def replace(old, new, folder):
    '''
    replaces all instances of 'old' token with 'new' token across all files in 'folder'; 
    records replacement in text file
    arguments:
        old (str) : token to be replaced (inserted into the regex as-is)
        new (str) : substitute token 
        folder (str) : folder within text-data whose files are rewritten in place
    '''
    filelist = os.scandir('text-data/' + folder)
    for entry in filelist:
        with open(entry.path, 'r', encoding="utf8") as fr:
                text = fr.read()
                # replacement through regex to account for adjacent punctuation & ignore case
                # lookarounds leave neighboring characters unconsumed, so repeated tokens and a token at the very end also match
                # the function replacement inserts 'new' literally, even if it starts with a number
                text = re.sub(r'(?<![a-zA-Z])(?:' + old + r')(?![a-zA-Z])', lambda match: new, text, flags=re.I)
        with open(entry.path, 'w', encoding="utf8") as fw:
                fw.write(text)
    with open ('text-data/'+ 'ARG_replacement_record.txt', 'a', encoding="utf8") as f:
        f.write(old + ' -> ' + new + ' in ' + folder + '\n')

In [83]:
import pandas as pd
import nltk
from nltk.corpus import PlaintextCorpusReader

sw = pd.read_csv('text-data/stopwords.csv')
# combine the stopword columns, dropping the empty cells that pad the shorter columns
stop_words = [w for w in (sw['nltk'].tolist() + sw['eliz'].tolist() + sw['hk'].tolist() + sw['pronouns'].tolist()) if not pd.isnull(w)]

# nltk collection & freqdist for identifying candidates for spelling scrub
corpus_root = 'text-data/ARG'
hakluyt = PlaintextCorpusReader(corpus_root, '.*')
hakluyt_col = nltk.TextCollection(hakluyt)
hakluyt_fd = nltk.FreqDist(word for word in hakluyt.words() if word.isalpha() and word not in stop_words)

In [34]:
# print out high-frequency words to identify the most frequent misspellings to maximize impact of spelling scrub
for token, _ in hakluyt_fd.most_common(150):
     print(token, end = '|')

I|The|great|one|And|men|day|came|two|made|good|time|called|many|land|place|come|said|king|people|ships|certaine|found|sea|leagues|river|man|three|water|much|went|sayd|part|countrey|ship|sent|first|make|every|set|towne|In|things|England|God|night|dayes|way|brought|English|A|small|side|tooke|goe|coast|But|John|Captaine|This|foure|company|take|S|yeere|voyage|long|Island|They|see|put|taken|West|divers|downe|For|owne|Cape|North|shore|M|course|degrees|rest|thence|South|places|neere|departed|five|away|Master|goods|together|maner|est|comming|hundred|winde|done|halfe|store|others|thing|Indians|give|farre|use|returne|East|towards|thought|say|high|number|From|letters|reason|morning|Lord|left|little|Generali|THE|cause|name|Spaniards|betweene|order|sixe|quod|gave|Hand|shot|last|must|shippe|meanes|saw|citie|William|faire|victuals|house|ENGLISH|VOYAGES|houses|Thomas|shippes|aboord|

In [102]:
# use fuzzy search to identify misspellings of terms of especial interest
from fuzzywuzzy import fuzz
from fuzzywuzzy import process 
for word in hakluyt_col.vocab():
    if fuzz.ratio('halberd', word) > 70:
        print("'"+word+"',", end = ' ')

'hard', 'shalbe', 'chamber', 'habere', 'haled', 'hailed', 'handled', 'altered', 'hale', 'hald', 'healed', 'haberet', 'habetur', 'halsers', 'habited', 'halberds', 'Chamber', 'halters', 'haberi', 'halser', 'halberd', 'haberem', 'Shalbe', 'herd', 'Calaber', 'Chaleur', 'haulser', 'halbard', 'halbards', 'chalybe', 'haerede', 'Kaleber', 'habenda', 'shaled', 'alberta', 'haleberts', 'Thaber', 'halberdiers', 'haeredi', 'haberes', 'halbardes', 'habendi', 'halowed', 'Halberds', 'Filberd', 'halter', 'halted', 'hayled', 'shoaled', 'haberdash', 
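
The same kind of shortlist can be sketched with the standard library alone, in case installing fuzzywuzzy is inconvenient. This is an assumed alternative rather than the method used in this notebook: `difflib.get_close_matches` ranks candidates by `SequenceMatcher` ratio on a 0 to 1 scale, and the mini-vocabulary below is made up for illustration.

```python
from difflib import get_close_matches

# hypothetical mini-vocabulary; in the notebook this would be the corpus vocabulary
vocab = ['halberd', 'halbert', 'halebert', 'halbards', 'harbour', 'chamber', 'herd']

# cutoff=0.7 plays the role of the "> 70" threshold in the fuzzy search
variants = get_close_matches('halberd', vocab, n=10, cutoff=0.7)
print(variants)
```

As with the fuzzy search, the shortlist mixes true spelling variants with unrelated near-neighbors, so each candidate still needs a look in context before any replacement.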

In [105]:
#concordance to make sure that I am indeed dealing with a misspelling and not a separate term
hakluyt_col.concordance('naturali', 100)

no matches


In [103]:
for w in ['halbert', 'halebert']:
    replace(w,'halberd', 'ARG')
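
Before running a replacement across a whole folder, it can help to try the whole-word matching on a string in memory. The sketch below is a stand-alone illustration, not the notebook's own function: the name `replace_token` is hypothetical, the token is escaped so it is treated literally, and lookarounds are used so that only complete words are matched.

```python
import re

def replace_token(text, old, new):
    """Replace whole-word, case-insensitive occurrences of `old` with `new`."""
    pattern = r'(?<![a-zA-Z])' + re.escape(old) + r'(?![a-zA-Z])'
    # a function replacement inserts `new` literally, so digits or backslashes in it are safe
    return re.sub(pattern, lambda match: new, text, flags=re.I)

sample = 'A halbert, two Halberts and a halbert'
print(replace_token(sample, 'halbert', 'halberd'))
```

Note that the plural `Halberts` is left alone because the token must be bounded by non-letters on both sides; plurals and other inflected forms need their own replacement pass.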

### Trimming Out Duplicate Passages

A more substantial clean-up operation concerned the little portions of previous and next chapters in each PDF file (excepting the rare cases where a chapter starts or ends on a clean page-break).

In [13]:
#creating page length frequency distribution to identify number of chapters 3 pages long or less
hakluyt_lengths = [HK_page_length(fileid) for fileid in hakluyt.fileids()]
length_fd = nltk.FreqDist(hakluyt_lengths)
print(length_fd[1]+length_fd[2]+length_fd[3],' chapters out of 589 are 3 pages or shorter and thus substantially impacted by imprecise chapter divisions based on pages rather than chapter title location')

259  chapters out of 589 are 3 pages or shorter and thus substantially impacted by imprecise chapter divisions based on pages rather than chapter title location


Given that a full PDF page is reproduced at each chapter split, there is no algorithmic way to decide where the split should happen (unless I retool the OCR to capture text size so as to identify the title, perhaps outputting HTML instead of raw text as I had). I ended up using the titles that came with the PDF files, which, interestingly, match the table-of-contents titles, themselves similar but not identical to the titles that appear in the text. I did not, unfortunately, think to look up fuzzy matching at the time, so instead I made my own crude version of it. I was really reluctant to lose data when my matching failed, so I opted for a time-intensive (several days of work) manual confirmation of the break line as follows:

In [None]:
# core functions:
# - title_match: ratio of stopword-filtered words from ledger title matched in the section
# - page-approx: flags true once string equals/surpasses maximum text likely to fit on a single page
# - user-select: print out page sections with numbers and mark user input

# algo:
# 1. split chapter into line sections
# 2. assemble first_page
#     1. identify highest title_match
#     2. print out ledger title, max title_match, collect user input
#     3. if input == y, store title_match position
#     4. else user-select position
# 3. assemble last_page
#     1. identify highest title_match
#     2. print out ledger title, max title_match, collect user input
#     3. if input == y, store title_match position **from end**
#     4. else user-select position **from end**
# 4. new file
#     1. assemble from total lines based on start and end indexes
#     2. save into new folder
import csv  # needed by next_title
def title_match(title, section):
    """determines ratio of stopword-filtered words from ledger title matched in the section"""
    # only compare against the start of the section, where a title would sit
    section_trim = section[:len(title) * 2]
    # split into words: iterating over the string itself would yield single characters
    title_filtered = [word for word in title.split() if word not in stop_words]
    if not title_filtered:  # empty title, or one made up entirely of stopwords
        return 0
    matching_words = [word for word in title_filtered if word in section_trim]
    return len(matching_words) / len(title_filtered)
def page_approx(lines):
    '''flags true once lines equal/surpass maximum text likely to fit on a single page'''
    return len(' '.join(lines)) >= 2600
def user_select(title, lines):
    '''prints title & page sections w/index and returns user input on decided match'''
    print('_____________________________________________________________________________')
    print('select best match for: ', title)
    print('-----------------------------------------------------------------------------')
    # enumerate keeps the printed index correct even when identical lines repeat
    for i, line in enumerate(lines):
        if len(line) > 3:
            print(i, ': ', line[:400])
    print('_____________________________________________________________________________')
    print('select best match for: ', title)
    print('-----------------------------------------------------------------------------')
    return int(input('select line'))
def next_title(raw_title):
    '''based on current title, accesses ledger to determine the next chapter title
    returns next chapter title + next chapter starting page for overlap calc'''
    with open('text-data/ledgertagged.csv') as ledgertaggedcsv:
        ledger = list(csv.DictReader(ledgertaggedcsv, delimiter = ","))
    for i, row in enumerate(ledger):
        if int(row['vol'])==HK.HK_vol(raw_title) and int(row['chap'])==HK.HK_chap(raw_title):
            if int(row['vol']) == 11 and int(row['chap'])== 43: #checking for last entry
                # no next chapter; returning a pair keeps the caller's tuple unpacking working
                return '', None
            next_row = ledger[i + 1]
            next_chapter_start_page = int(next_row['pages'][:3])
            return " ".join(next_row['title'].split('_')), next_chapter_start_page
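
For comparison, the fuzzy matching I did not think to look up at the time is available in Python's standard library. The following is only a sketch of how it might replace the word-overlap ratio, not what I actually ran: `fuzzy_title_score` and `best_fuzzy_match` are hypothetical names, and the sample lines are invented to imitate OCR output with variant spelling.

```python
from difflib import SequenceMatcher

def fuzzy_title_score(title, line):
    """Similarity in [0, 1] between a ledger title and the start of a line."""
    # compare case-insensitively, and only against as much of the line as a title could cover
    window = line[:len(title) * 2].lower()
    return SequenceMatcher(None, title.lower(), window).ratio()

def best_fuzzy_match(title, lines):
    """Return (index, score) of the line that most resembles the title."""
    scores = [fuzzy_title_score(title, line) for line in lines]
    best_index = max(range(len(lines)), key=scores.__getitem__)
    return best_index, scores[best_index]

# invented sample lines, for illustration only
sample_page = [
    'and so they departed from thence with great ioy.',
    'The voyage of M. John Davis to the Northwest, Anno 1585.',
    'Certaine marchants of London set foorth two barkes',
]
index, score = best_fuzzy_match('The voyage of Master John Davis to the Northwest', sample_page)
print(index, round(score, 2))
```

Because the score is computed over characters rather than whole words, it tolerates abbreviations and spelling variants in the in-text titles; the manual confirmation step would still be advisable wherever the best score is low.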

In [None]:
filelist = os.scandir('text-data/ARG')
newdir = r'text-data/ARG/'
end_title = '' #initializing flag for known title based on end_title from preceding chapter
char_block = '''||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||'''
for entry in filelist:
    #skip past completed files
    last_vol_n, last_chap_n = 11, 28
    if HK.HK_vol(entry.name) < last_vol_n or (HK.HK_vol(entry.name) == last_vol_n and HK.HK_chap(entry.name) <= last_chap_n):
        continue
    #determine start line
    with open(entry.path, 'r', encoding="utf8") as f:
        chapter_lines = f.readlines()
        chapter_lines.append(' ') #appending extra line in case I need to end the chapter after the last line when page break matches chapter break
        chapter_title = " ".join(HK.HK_title(entry.name).split('_'))
        first_page = []
        line_index = 0
        while not (page_approx(first_page) or line_index == len(chapter_lines)):
            first_page.append(chapter_lines[line_index])
            line_index += 1
        if end_title == '':
            print(char_block*2)
            print(entry.name)
            print('find start line')
            best_match = 0
            best_title = first_page[0] #fallback so a stale or undefined best_title is never used when nothing matches
            for line in first_page:
                score = title_match(chapter_title, line)
                if score > best_match:
                    best_title = line
                    best_match = score
            title_approved = input(f'--target: {chapter_title}, \n--auto match: {first_page.index(best_title)} : {best_title[:400]} \n--approve? y/n')
            if title_approved == 'y':
                start_index = first_page.index(best_title)
            else:
                start_index = user_select(chapter_title, first_page)
        else:
            start_index = first_page.index(end_title)
    #determine last line
        print(char_block)
        print(entry.name)
        print('find end line')
        chapter_title, next_chap_start_page = next_title(entry.name)
        #flag if next chapter starts on new page
        _, chapter_end_page = HK.HK_pages(entry.name)
        if next_chap_start_page == chapter_end_page + 1:
            print('||| next chapter starts on new page |||')
        last_page = []
        #print('last page just after creation', last_page)
        line_index = -1
        while not (page_approx(last_page) or line_index == len(chapter_lines)*-1):
            last_page.append(chapter_lines[line_index])
            line_index -= 1
        #print('last page initially assembled', last_page)
        last_page.reverse()
        #print('last page after reversal', last_page)
        #print('last page length', len(last_page))
        best_match = 0
        best_title = last_page[0] #fallback so the start-line match is not reused when nothing matches
        for line in last_page:
            score = title_match(chapter_title, line)
            if score > best_match:
                best_title = line
                best_match = score
        title_approved = input(f'--target: {chapter_title}, \n--auto match: {best_title[:400]} \n--approve? y/n')
        if title_approved == 'y':
            end_index = last_page.index(best_title)
        else:
            end_index = user_select(chapter_title, last_page)
        #print('end index first established', end_index)
        end_index = end_index - len(last_page) # flipping to count from end
        #print('end index flipped', end_index)
    #write new file
    new_lines = chapter_lines[start_index : end_index]
    #no need for +1 in end_index as I don't want to include next title in the text of the chapter
    new_text = '\n'.join(new_lines)
    #print(f'writing to file, index start {start_index} end {end_index} text {new_text}')
    with open(newdir+entry.name, 'w', encoding="utf8") as f2:
        f2.write(new_text)
    # determine if there's page overlap between current and next chapter, 
    # in which case end title can be reused as start title
    if next_chap_start_page == chapter_end_page:
        end_title = last_page[end_index]
    else: end_title = ''

All in all, ~125,000 words were cast off in trimming, including, as I learned much later, one entire short chapter; on the whole, that is really not a bad result.

**Given the labor involved, at this point I copied CC_ML_FR_trimmed_cleaned over into ARG instead of actually replicating the process, and turned the {n}'s back into newlines.**

In [125]:
import shutil
shutil.copytree('text-data/CC_ML_FR_trimmed_cleaned', 'text-data/ARG', dirs_exist_ok=True)
for entry in os.scandir('text-data/ARG'):
    with open(entry.path, 'r+', encoding="utf8") as f:
        text = f.read()
        text = text.replace('{n}', '\n')
        f.seek(0)
        f.write(text)
        f.truncate()

### Clearing Out Non-English Text

On many occasions, Hakluyt presents non-English text (usually Latin, with some Spanish and a little Portuguese, Italian, and possibly French) followed by a translation. The vast majority of the text is English, so this has little effect on word frequency counts, but it shows up as distracting noise in topic modeling and network graphs, so I eventually decided to trim it out.

In [111]:
import langid
from langid.langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en','la','es'])

In [127]:
source = 'text-data/ARG'
target = 'text-data/ARG_EN/'
os.mkdir(target)
for entry in os.scandir(source):
    with open(entry.path, 'r', encoding="utf8") as f:
        text = f.read()
        #splitting into paragraphs since what I'm looking to weed out is large non-English sections
        #(usually lasting half a document) rather than disparate words or sentences
        parags = text.split('\n')
        EN_parags = []
        for parag in parags:
            #using the identifier configured above, so that the en/la/es restriction actually applies
            if identifier.classify(parag)[0] == 'en':
                EN_parags.append(parag)
        EN_text = '\n'.join(EN_parags)
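        # if most of the text is English anyway, assume the dropped paragraphs were
        # small bits of English misidentified as another language, and keep everything
        if len(EN_text) > 0.7 * len(text):
            EN_text = text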
        with open(target + entry.name, 'w', encoding="utf8") as fw:
            fw.write(EN_text+'.') #adding a full stop to avoid zero-length files that will be a pain later
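
The filtering rule in the cell above can be isolated and tested without any language-identification library. The following is a minimal sketch under stated assumptions: `keep_english` and its `classify` parameter are hypothetical names, and `toy_classify` is a deliberately naive stand-in that only looks for a handful of Latin function words.

```python
def keep_english(text, classify, keep_all_ratio=0.7):
    """Drop non-English paragraphs unless the text is overwhelmingly English anyway.

    classify: any callable mapping a paragraph to a language code (hypothetical parameter).
    """
    paragraphs = text.split('\n')
    english_text = '\n'.join(p for p in paragraphs if classify(p) == 'en')
    # if most of the text survived, assume the dropped paragraphs were misidentified English
    if len(english_text) > keep_all_ratio * len(text):
        return text
    return english_text

def toy_classify(paragraph):
    # stand-in for a real classifier, for illustration only
    latin_markers = {'et', 'cum', 'quod', 'est', 'sunt', 'ut'}
    words = paragraph.lower().split()
    hits = sum(word in latin_markers for word in words)
    return 'la' if words and hits / len(words) > 0.2 else 'en'

mostly_latin = 'Quod est et sunt ut cum rege et regina\nThe same in English'
mostly_english = ('A long English paragraph about the voyage and the ships and the men\n'
                  'short one\nquod est')

print(keep_english(mostly_latin, toy_classify))
print(keep_english(mostly_english, toy_classify) == mostly_english)
```

One consequence of this rule is worth keeping in mind: a chapter that is more than 70% English keeps all of its paragraphs, including any short non-English ones.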

### Lemmatization & Spelling Normalization with MorphAdorner

Even a perfectly transcribed copy of the *Navigations* would still pose a great deal of trouble for computational analysis due to the inconsistency of Early Modern English and its distance from current English standards. Further, for many operations, such as word frequency counts and topic modeling, it is desirable to bring words down to their basic dictionary forms, allowing, say, "Indians" to be counted as an instance of "Indian" rather than as a separate entity; this process is known as lemmatization.

NLTK lemmatization does not handle Hakluyt's text particularly well:

In [68]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for w in ['kingdome', 'doe', 'us']:
    print(w,":", lemmatizer.lemmatize(w))

kingdome : kingdome
doe : doe
us : u


I initially thought to use a spelling corrector, but didn't find one at the time. Coming back to the idea now, I find that it would not have gotten me far either:

In [129]:
from textblob import TextBlob
t= '''YOu are to understand, that at the feast of Easter, there was a great companie of 
Nobles with Pope John and Conradus the Emperour assembled at Rome'''
textBlb = TextBlob(t) 
textCorrected = textBlb.correct()
print(textCorrected)

you are to understand, that at the feast of Master, there was a great companies of 
Nobles with Hope John and Conradus the Emperor assembled at Some


Fortunately, folks at Northwestern have developed [MorphAdorner](http://morphadorner.northwestern.edu/morphadorner/), a tool for spelling standardization, lemmatization, and a few other operations, tailored to multiple out-of-date English variants, including Early Modern. MorphAdorner runs from the command line, and (in the simple implementation I used) returns a text file where each original token gets a line with a tab-delimited list of the following: original token, part-of-speech tag, normalized spelling, lemma; it is easy to feed that back into Python and generate new text files with either standardized spelling or lemmatized tokens.

MorphAdorner commands:

- cd C:\Users\apovzner\Documents\morphadorner-2.0.1 (or wherever you put the working folder; see [installation instructions](http://morphadorner.northwestern.edu/morphadorner/download/))

- adornplainemetext C:\Users\apovzner\Documents\Hakluyt\text-data\morphadorner-outputs\ARG_EN C:\Users\apovzner\Documents\Hakluyt\text-data\ARG_EN\\*.txt
  - adornplainemetext: for adorning Early Modern English plaintext files; see [here](http://morphadorner.northwestern.edu/morphadorner/documentation/adorningatext/) for more options
  - full format: [command] \outputdir \inputdir\ [specific file or wildcard format]
  
Since MorphAdorner produces files that require further processing, I've concentrated all those in the separate morphadorner-outputs folder. It does not preserve newlines, so I start by replacing them with a flag, "{n}", and then convert those back into newlines after.

In [130]:
for entry in os.scandir('text-data/ARG_EN'):
    with open(entry.path, 'r+', encoding="utf8") as f:
        text = f.read()
        text = text.replace('\n', '{n}')
        f.seek(0)
        f.write(text)
        f.truncate()

In [139]:
def adorned_extract(adorn_output, target, mode):
    '''
    takes MorphAdorner's output and converts it to a lemmatized/normalized-spelling plaintext,
    substituting newlines for '{ n }'
    adorn_output: folder path of MorphAdorner's output
    target: new corpus folder
    mode: 'lem' for lemmatization or 'spel' for normalized spelling
    '''
    if mode == 'lem':
        col = 4
    elif mode == 'spel':
        col = 3
    else:
        raise ValueError('unrecognized mode requested')
    os.mkdir(target)
    filelist = os.scandir(adorn_output)
    for entry in filelist:
        if entry.is_file():
            with open(entry.path, 'r', encoding="utf8") as f:
                new_text = ''
                for line in f:
                    #extracting token-by-token from either the lemmatized or spel-normalized column
                    new_text += (line.split()[col] + ' ')
                #recovering original line breaks
                new_text = new_text.replace('{ n }', '\n')
                with open((target + '/' + entry.name), 'w', encoding="utf8") as fw:
                    fw.write(new_text)

In [140]:
adorned_extract('text-data/morphadorner-outputs/ARG_EN', 'text-data/ARG_EN_lem', 'lem')
adorned_extract('text-data/morphadorner-outputs/ARG_EN', 'text-data/ARG_EN_spel', 'spel')
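
The newline placeholder round trip can be sanity-checked in isolation. This is a sketch only: `toy_tokenize` is a hypothetical stand-in for a tokenizer that separates punctuation from words, which is why the flag written as `{n}` has to be recovered as `{ n }` once the tokens are rejoined with spaces.

```python
import re

def toy_tokenize(text):
    # stand-in tokenizer (an assumption for illustration): words and punctuation as separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

original = 'first line of a chapter\nsecond line'
flagged = original.replace('\n', '{n}')      # protect line breaks before tokenization
tokens = toy_tokenize(flagged)               # '{', 'n', '}' come out as three tokens
rejoined = ' '.join(tokens)                  # ... 'chapter { n } second' ...
restored = rejoined.replace('{ n }', '\n')   # recover the original line breaks
print([line.strip() for line in restored.split('\n')])
```

The restored lines carry a stray space on either side of each recovered newline, which is harmless for word counts but worth stripping if exact line content matters.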

# Basic Numbers Rundown