# OCR Use (ocr_use.py)

This notebook documents how the image files downloaded from the Internet Archive underwent optical character recognition (OCR). The script, ocr_use.py, acts as a Tesseract wrapper that performs OCR on the entire corpus sequentially, volume by volume. As with the Adjustment Recommendation documentation, this notebook uses a set of 10 images from the "lawsresolutionso1891nort" volume to demonstrate the process.

In [1]:
import os, sys
import pandas
from datetime import datetime

#get ocr functions
sys.path.append(os.path.abspath("./"))
from ocr_func import cutMarg, adjustImg, tsvOCR

In addition to the raw image files, ocr_use.py makes use of three separate metadata files:
1. **xmljpegmerge** - Contains information about each image file (hand side, leaf number, Internet Archive url, etc.)
2. **marginalia_metadata** - Page-level metadata generated during the marginalia determination process (see documentation). This file contains coordinates of areas to be OCR'd and background color information for each image.
3. **adjustments** - Contains adjustment recommendations for each volume as determined by adjRec.py (see documentation)

In [2]:
#Set paths for metadata files
masterlist = "xmljpegmerge_demo.csv"
margdata = "marginalia_metadata_demo.csv"
adjdata = "adjustments_demo.csv"

#Set paths for image files
rootImgDir = "./images/"  

#Set output directory
outDir = "./output/"

#Read metadata csv files into memory
mastercsv = pandas.read_csv(masterlist)
margcsv = pandas.read_csv(margdata)
adjcsv = pandas.read_csv(adjdata)

Once all three metadata files are loaded into memory, they are merged into a single `pandas.DataFrame` object (`fcsv`).

**NOTE** - *In the production version of this code, the filenames are generated in the `mastercsv` DataFrame to reflect the .jp2 format of our corpus images. However, for this example, our sample images are in .jpg format rather than .jp2 format. As a result, the codeblock below generates ".jpg" extensions for the filenames instead.*

A list of volumes is then created using the `pandas.DataFrame.groupby` method. For this demonstration, the resulting list (`vols`) contains only a single volume.

In [3]:
#Create column for volume (the part of the filename before the first underscore)
mastercsv["volume"] = mastercsv["filename"].str.split("_").str[0]

#Merge csvs
# mastercsv["filename"] = mastercsv["filename"] + ".jp2"
mastercsv["filename"] = mastercsv["filename"] + ".jpg" 
mcsv = mastercsv.merge(margcsv, left_on="filename", right_on="file")
fcsv = mcsv.merge(adjcsv, on = "volume", how = "right")

#get separate volumes
volsGrouped = fcsv.groupby("volume")
vols = volsGrouped.groups.keys()
vols

dict_keys(['lawsresolutionso1891nort'])
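
The effect of the two merges above can be checked on toy data. The sketch below is illustrative only: the DataFrames and their values (`volA`, `volB`, and so on) are made up, and only the column names used as merge keys follow this notebook. It shows that the first (inner) merge drops pages without marginalia metadata, and that the second merge keeps only volumes listed in the adjustments file.

```python
import pandas

# Toy stand-ins for the three metadata files (all values are hypothetical).
master = pandas.DataFrame({
    "filename": ["volA_0001", "volA_0002", "volB_0001"],
    "sectiontype": ["public laws", "private laws", "public laws"],
})
marg = pandas.DataFrame({
    "file": ["volA_0001.jpg", "volA_0002.jpg"],
    "angle": [0.0, 0.5],
})
adj = pandas.DataFrame({"volume": ["volA"], "color": [0.75]})

# Same steps as the cell above: derive the volume, add the extension, merge.
master["volume"] = master["filename"].str.split("_").str[0]
master["filename"] = master["filename"] + ".jpg"

pages = master.merge(marg, left_on="filename", right_on="file")  # inner merge
merged = pages.merge(adj, on="volume", how="right")

vols = list(merged.groupby("volume").groups.keys())
print(len(pages), len(merged), vols)
```

Here `volB_0001.jpg` disappears in the first merge because it has no row in `marg`, so `volB` never reaches the list of volumes.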

Once we have created the list of volumes, we can iterate through each one to perform the necessary OCR operations. This is normally accomplished with a loop, but for this example we forgo the loop so that each step inside it can be broken down more easily.

In [4]:
# #loop through volumes
# for vol in vols:

# DEMO
# set 'vol' to our example volume to compensate for the above loop 
# not being included
vol = list(vols)[0]

Next, we create an output directory for the volume if it does not already exist and select that volume's rows from the merged metadata DataFrame.

In [5]:
#create a folder for the volume in the output directory if it doesn't already exist
#(makedirs also creates outDir itself when it is missing)
newdir = os.path.normpath(os.path.join(outDir, vol))
os.makedirs(newdir, exist_ok=True)

#select rows for volume
voldf = volsGrouped.get_group(vol)

#DEMO
voldf.head()

Unnamed: 0,filename,leafNum,handSide,page,sectiontype,sectiontitle,fileUrl,volume,file,angle,side,cut,backR,backG,backB,bbox1,bbox2,bbox3,bbox4,color,invert,autocontrast,blur,sharpen,smooth,xsmooth
0,lawsresolutionso1891nort_0272.jpg,272,LEFT,226,public laws,Public Laws of the State of North Carolina Ses...,https://archive.org/download/lawsresolutionso1...,lawsresolutionso1891nort,lawsresolutionso1891nort_0272.jpg,0.0,left,350,229,212,185,427,337,1915,2960,0.75,False,4,False,False,False,False
1,lawsresolutionso1891nort_0374.jpg,374,LEFT,328,public laws,Public Laws of the State of North Carolina Ses...,https://archive.org/download/lawsresolutionso1...,lawsresolutionso1891nort,lawsresolutionso1891nort_0374.jpg,0.0,left,350,228,212,187,446,352,1927,2964,0.75,False,4,False,False,False,False
2,lawsresolutionso1891nort_0542.jpg,542,LEFT,496,public laws,Public Laws of the State of North Carolina Ses...,https://archive.org/download/lawsresolutionso1...,lawsresolutionso1891nort,lawsresolutionso1891nort_0542.jpg,0.0,left,350,228,212,186,472,337,1957,2954,0.75,False,4,False,False,False,False
3,lawsresolutionso1891nort_0606.jpg,606,LEFT,558,public laws,Public Laws of the State of North Carolina Ses...,https://archive.org/download/lawsresolutionso1...,lawsresolutionso1891nort,lawsresolutionso1891nort_0606.jpg,0.0,left,353,227,211,184,459,318,1944,3225,0.75,False,4,False,False,False,False
4,lawsresolutionso1891nort_0771.jpg,771,RIGHT,723,private laws,Private Laws of the State of North Carolina Se...,https://archive.org/download/lawsresolutionso1...,lawsresolutionso1891nort,lawsresolutionso1891nort_0771.jpg,0.0,right,1484,214,198,174,46,345,1530,2510,0.75,False,4,False,False,False,False
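
Because the per-image loop later in this notebook reads specific columns from each row, it can be useful to confirm that they are present before starting a long OCR run. The following is a minimal sketch, not part of ocr_use.py: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the required list is simply the set of columns that this notebook's grouping and loop code accesses.

```python
# Columns read by the grouping steps and the per-image loop in this notebook.
REQUIRED_COLUMNS = [
    "volume", "sectiontype", "file", "angle",
    "bbox1", "bbox2", "bbox3", "bbox4",
    "backR", "backG", "backB",
    "color", "autocontrast", "blur", "sharpen", "smooth", "xsmooth",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in a stable order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Header of the merged demo DataFrame shown above (index column omitted).
header = (
    "filename,leafNum,handSide,page,sectiontype,sectiontitle,fileUrl,volume,file,"
    "angle,side,cut,backR,backG,backB,bbox1,bbox2,bbox3,bbox4,"
    "color,invert,autocontrast,blur,sharpen,smooth,xsmooth"
).split(",")

print(missing_columns(header))                                # []
print(missing_columns([c for c in header if c != "angle"]))   # ['angle']
```

With the real data, the check could be called as `missing_columns(fcsv.columns)`.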


After extracting all volume-specific rows from the metadata DataFrame, the volume is separated into law types. 

In this corpus, each physical volume is separated into several law type sections. For example, a single volume may contain "Public Laws," "Private Laws," "Public Local Laws," etc. Each of these law type sections was treated as a separate entity in the post-OCR steps of our workflow, so it is important here to group the images in a single physical volume by the type of laws that can be found on each page. Separate law type section-specific OCR output files can then be generated, meaning that "Private Laws," "Public Laws," etc. in a given physical volume all receive their own set of corresponding OCR output files. 

The example images included in this notebook belong to two law type sections: "Private Laws" and "Public Laws."

In [6]:
#get separate section types
secsGrouped = voldf.groupby("sectiontype")
secs = secsGrouped.groups.keys()

#DEMO
secs

dict_keys(['private laws', 'public laws'])

Following the separation of a single volume into law type sections, we initiate a loop to walk through all of the law type sections in sequence:

1. Within a given volume, iterate through each law type section
    2. Within a given law type section, iterate through all of the image files that belong to that law type section
        3. For each of these images:
            * Open the raw image file *(.jp2 in the production code, .jpg in our example)*
            * Set the marginalia removal information for that image as determined by the marginalia determination process
            * Set the image adjustment information for that image as determined by the adjustment recommendation process
            * Record the adjustments made to a given image using a .txt file
            * Perform the actual OCR using Tesseract, accessed via functions stored in ocr_func.py. Metadata recorded above are used as arguments
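
The nesting described above (volume, then law type section, then image) can be sketched without pandas. This is an illustration only: the `records` list and the `processing_order` helper are hypothetical stand-ins for the merged DataFrame and the `groupby` calls used in this notebook, and the three file names are taken from the demo set.

```python
from itertools import groupby
from operator import itemgetter

VOL = "lawsresolutionso1891nort"

# Hypothetical stand-in for rows of the merged metadata DataFrame.
records = [
    {"volume": VOL, "sectiontype": "public laws", "file": VOL + "_0272.jpg"},
    {"volume": VOL, "sectiontype": "private laws", "file": VOL + "_0771.jpg"},
    {"volume": VOL, "sectiontype": "public laws", "file": VOL + "_0374.jpg"},
]


def processing_order(rows):
    """Yield (volume, section, files) in the order the nested loops visit them."""
    key = itemgetter("volume", "sectiontype")
    # sorted() is stable, so files keep their original order within a section.
    ordered = sorted(rows, key=key)
    for (vol, sec), group in groupby(ordered, key=key):
        yield vol, sec, [row["file"] for row in group]


plan = list(processing_order(records))
for vol, sec, files in plan:
    print(vol, sec, files)
```

Sections come out in sorted order ("private laws" before "public laws"), which matches the default key ordering of a pandas `groupby`.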
       

In [7]:
#create separate OCR files for each section
for sec in secs:

    #select rows for section type
    secsdf = secsGrouped.get_group(sec)

    print(datetime.now().strftime("%H:%M") + " Processing " + vol + " " + sec + "...")

    #Loop through section
    for row in secsdf.itertuples():

#         img = os.path.normpath(os.path.join(rootImgDir, vol + "_jp2", row.file))
        img = os.path.normpath(os.path.join(rootImgDir, vol + "_jpg", row.file))

        #set up margin cutting
        cuts = {"rotate" : row.angle,
                "left" : row.bbox1,
                "up" : row.bbox2,
                "right" : row.bbox3,
                "lower" : row.bbox4,
                "border" : 200,
                "bkgcol" : (row.backR, row.backG, row.backB)}

        #set up image adjustment            
        adjustments = {"color": row.color, 
                       "autocontrast": row.autocontrast,
                       "blur": row.blur,
                       "sharpen": row.sharpen,
                       "smooth": row.smooth,
                       "xsmooth": row.xsmooth}

        #Record image adjustments (the same per-volume file is rewritten for each image)
        with open(os.path.normpath(os.path.join(outDir, vol, vol + "_adjustments.txt")), "w") as adjf:
            adjf.write("IMAGE ADJUSTMENTS\n\n")
            for key, value in adjustments.items():
                adjf.write("{}: {}\n".format(key, value))

        #OCR the image
        tsvOCR((adjustImg(cutMarg(img, **cuts), **adjustments)), 
               savpath = os.path.normpath(os.path.join(outDir, vol, vol + "_" + sec + ".txt")), 
               tsvfile = vol + "_" + sec + "_data.tsv")

11:47 Processing lawsresolutionso1891nort private laws...
11:48 Processing lawsresolutionso1891nort public laws...


The above processes result in several output files for a single physical volume:
* *volume*_adjustments.txt - stores the image adjustments used to OCR that particular volume
* *volume*_*section*.txt - stores a compiled version of the OCR'd text for a given law type section. One such file is produced for each OCR'd law type section in a physical volume
* *volume*_*section*_data.tsv - stores a word-level .tsv file for a given law type section. The rows in this file correspond to the individual tokens (words) recorded by the OCR, along with page coordinates and a confidence value for each. One such file is produced for each OCR'd law type section in a physical volume
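
The naming scheme above can be sketched as a small helper. This is a hedged illustration rather than code from ocr_use.py: `output_paths` is a hypothetical function, and it assumes that the .tsv file is written to the same volume folder as the .txt file, which is consistent with the path read back later in this notebook.

```python
import os


def output_paths(out_dir, vol, sec):
    """Build the three output paths for one volume and law type section."""
    vol_dir = os.path.normpath(os.path.join(out_dir, vol))
    return {
        "adjustments": os.path.join(vol_dir, vol + "_adjustments.txt"),
        "text": os.path.join(vol_dir, vol + "_" + sec + ".txt"),
        # Assumption: the .tsv lands next to the .txt file.
        "tsv": os.path.join(vol_dir, vol + "_" + sec + "_data.tsv"),
    }


paths = output_paths("./output/", "lawsresolutionso1891nort", "public laws")
for kind, path in paths.items():
    print(kind, path)
```

Note that section names such as "public laws" contain a space, so the resulting file names do as well; downstream tools that take these paths on a command line need to quote them.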


The most important files in the above output are the '_data.tsv' files. These are the 'raw' files that will be used in the law splitting and cleanup process, which directly follows the OCR process. The first 30 rows of our example Public Laws 'raw' file are shown below.

In [8]:
# DEMO - 'raw' file rows for one of our example sections: "lawsresolutionso1891nort_public laws"
pandas.read_csv("./output/lawsresolutionso1891nort/lawsresolutionso1891nort_public laws_data.tsv",sep="\t", encoding="cp1252")[0:30]

Unnamed: 0,left,top,width,height,conf,text,name
0,0,0,1888,3023,-1,,lawsresolutionso1891nort_0272.jpg
1,210,208,1470,2606,-1,,lawsresolutionso1891nort_0272.jpg
2,211,208,1466,658,-1,,lawsresolutionso1891nort_0272.jpg
3,211,208,1460,44,-1,,lawsresolutionso1891nort_0272.jpg
4,211,210,173,42,96,"petition,",lawsresolutionso1891nort_0272.jpg
5,417,212,40,31,96,to,lawsresolutionso1891nort_0272.jpg
6,495,209,66,34,96,the,lawsresolutionso1891nort_0272.jpg
7,592,208,142,44,96,capital,lawsresolutionso1891nort_0272.jpg
8,772,209,109,33,96,stock,lawsresolutionso1891nort_0272.jpg
9,912,209,42,32,96,of,lawsresolutionso1891nort_0272.jpg
