CSC 630/1 Zufelt <br>
Michelle Chao, Sam Xifaras, Darius Lam, Moe Sunami, Diva Harsoor

# Catalog Card Processing
Creating a training dataset for optical character recognition in order to read catalog cards (Robert S. Peabody Museum of Archaeology, 1930s to 1970s)
___

Given scanned PDFs of typewritten catalog cards, our goal is to create a dataset in which each row is a set of images of each character in a card.<br> <br>
Tasks: 
<ul>
    <li>Convert PDFs to TIFFs</li>
    <li>Cut the images into different sections corresponding to data type </li>
    <li>Cut the images into characters and add each image as a row of a DataFrame </li>
    <li>Label the DataFrame with the proper character labels</li>
    <li>Write a script that presents a user with a randomly generated row/image and its corresponding label (to check for accuracy)</li>
    <li>Display the process in this Jupyter Notebook</li>
</ul>
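The accuracy check in the fifth task could be sketched roughly as follows. This is only an illustration: the row layout (a dict with `label` and `image` keys), the toy data, and the helper name `pick_random_row` are assumptions, not the final DataFrame format.

```python
import random

def pick_random_row(rows, rng=random):
    """Return (index, row) for one randomly chosen row of the dataset."""
    index = rng.randrange(len(rows))
    return index, rows[index]

# Hypothetical stand-in for the labelled dataset: each row pairs a tiny
# binarized character image (nested lists of 0/255) with its label.
rows = [
    {"label": "A", "image": [[0, 255], [255, 0]]},
    {"label": "7", "image": [[255, 255], [0, 0]]},
]

# Passing a seeded generator makes the choice reproducible.
index, row = pick_random_row(rows, random.Random(0))
print(f"row {index}: labelled {row['label']!r}")
```

In the notebook itself, the chosen image would be shown with `plt.imshow` next to the printed label so a person can confirm that they match.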
___

In [None]:
import numpy as np
import subprocess
from PIL import Image
import cv2 #used for Otsu's Thresholding
import os
from skimage import io
from skimage import segmentation
from sklearn.cluster import DBSCAN
from matplotlib import pyplot as plt
import re
%matplotlib inline

The functions below make calls to the following script, `util.sh`, which performs two important functions. First, the `merge` function, meant to be performed only once, moves all of the pdfs from their respective Acc. No. directories to the parent directory, `Accession Files`. This was mostly for convenience: dealing with file names that contain spaces in bash is a hassle, and the accession numbers are already indicated in the filenames, so we lose no information by doing this. Second, the `convert` function uses Ghostscript to convert a single pdf, identified by its accession and catalog numbers, into a 300 dpi tif.

Here is the code:
   
```
# util.sh
# Uses pushd/popd, so it needs bash (or another shell that provides them).

if [ "$1" = "merge" ]
then
    pushd "$2"
    for f in *
    do
        # Skip anything that is not a directory (e.g. pdfs that were already merged)
        [ -d "$f" ] || continue

        echo "Moving files from $f"
        pushd "$f"
        cp -fpv -- * ../
        popd

        echo "Removing $f"
        rm -fR -- "$f"
    done
    popd

elif [ "$1" = "convert" ]
then

    ACCNO=$2
    CATNO=$3

    # Path to the directory that holds the pdfs
    ROOT=$4

    # Path to the ghostscript executable
    GS=$5

    pushd "${ROOT}"

    "$GS" -dNOPAUSE -r300 -sDEVICE=tiffscaled24 -sCompression=lzw -dBATCH \
        "-sOutputFile=${ACCNO}_${CATNO}.tif" "${ACCNO}_${CATNO}.pdf"

    popd

fi
```
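For comparison, the `merge` step could also be written directly in Python, which sidesteps the problem of spaces in file names. This is a minimal sketch, not the code we ran: the function name `merge_in_python` is hypothetical, and it assumes every subdirectory of `acc_files_dir` holds only plain files.

```python
import shutil
from pathlib import Path

def merge_in_python(acc_files_dir):
    """Copy every file out of each subdirectory, then delete the subdirectory."""
    root = Path(acc_files_dir)
    # Snapshot the subdirectories first, because the loop modifies `root`.
    for sub in [p for p in root.iterdir() if p.is_dir()]:
        for f in sub.iterdir():
            if f.is_file():
                # copy2 also preserves timestamps, similar to `cp -p`
                shutil.copy2(f, root / f.name)
        shutil.rmtree(sub)
```

Like the shell version, this overwrites files that share a name, so it relies on the accession number being part of every filename.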

In [14]:
def merge(acc_files_dir='peabody_files/Accession Files'):
    
    """
    Moves all the pdfs to the parent directory, so that all the pdfs are in one directory, 
    and deletes all the 'Acc. No.' directories.
    """
    return subprocess.call(["sh", "util.sh", "merge", acc_files_dir])

def convert(acc_no, cat_no, gs_exec, acc_files_dir='peabody_files/Accession Files'):
    """Converts the specified pdf to a tif and returns it as a Pillow image."""
    fname = str(acc_no) + '_' + str(cat_no).zfill(4) + '.tif'
    tif_path = os.path.join(acc_files_dir, fname)

    # Convert the specified pdf to a tif; the tif is written inside acc_files_dir
    subprocess.call(["sh", "util.sh", "convert", str(acc_no), str(cat_no).zfill(4), acc_files_dir, gs_exec])

    # Pillow reads lazily, so copy the pixel data into memory before deleting the file
    with Image.open(tif_path) as tif:
        img = tif.copy()

    # Delete the tif from the directory it was written to
    os.remove(tif_path)

    return img

In [5]:
# Run this if the pdfs are still organized into their respective directories.
# merge()

0

In [15]:
# Example of opening an image
img = convert(1, 2, 'C:/Program Files/gs/gs9.20/bin/gswin64c.exe')
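The Ghostscript call made by the `convert` branch of `util.sh` can also be assembled in Python without the shell script. The sketch below reuses exactly the flags from `util.sh`; the helper names `gs_command` and `convert_without_shell` are hypothetical, and the second function assumes the pdf already sits in `acc_files_dir`.

```python
import os
import subprocess

def gs_command(acc_no, cat_no, gs_exec):
    """Build the Ghostscript argument list used by the convert branch of util.sh."""
    stem = f"{acc_no}_{str(cat_no).zfill(4)}"
    return [
        gs_exec, "-dNOPAUSE", "-r300", "-sDEVICE=tiffscaled24",
        "-sCompression=lzw", "-dBATCH",
        f"-sOutputFile={stem}.tif", f"{stem}.pdf",
    ]

def convert_without_shell(acc_no, cat_no, gs_exec, acc_files_dir):
    """Run Ghostscript inside acc_files_dir and return the path of the new tif."""
    cmd = gs_command(acc_no, cat_no, gs_exec)
    # cwd plays the role of pushd/popd; check=True raises if Ghostscript fails
    subprocess.run(cmd, cwd=acc_files_dir, check=True)
    tif_name = cmd[-2].split("=", 1)[1]
    return os.path.join(acc_files_dir, tif_name)
```

Passing the arguments as a list means paths containing spaces, such as `Accession Files`, need no extra quoting.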

___
We decided to use Otsu's Thresholding to binarize the image into only black and white pixels. It searches for the threshold that minimizes the intra-class variance (equivalently, maximizes the inter-class variance), where the two classes are the dark pixels and the light pixels. The variance of each class is weighted by the probability that a pixel falls in that class. Pixels above the chosen threshold are then assigned the maximum value (`255` in our code) and all the others the value `0`.
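To make the idea concrete, here is a small pure-Python sketch of Otsu's method on a flat list of grey levels. It is an illustration only, not the OpenCV implementation we call later; the function name `otsu_threshold_value` and the sample pixel values are made up for the example. It picks the threshold that maximizes the between-class variance, which is equivalent to minimizing the within-class variance.

```python
def otsu_threshold_value(pixels):
    """Return the Otsu threshold for an iterable of integer grey levels in 0..255."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their grey levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so this split is not meaningful
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to the constant factor 1 / total**2
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t

# Made-up sample: four dark pixels and four bright ones
pixels = [10, 12, 11, 200, 205, 198, 13, 202]
t = otsu_threshold_value(pixels)
# Same convention as cv2.THRESH_BINARY: values above t become 255, the rest 0
binary = [255 if p > t else 0 for p in pixels]
```

Any threshold between the dark and bright groups separates this sample equally well; the sketch keeps the lowest such value.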

We used the OpenCV module to apply Otsu's Thresholding method on the images. The following command allowed us to install OpenCV:
* conda install --channel https://conda.anaconda.org/menpo opencv3



The cells below apply Otsu's Thresholding method to binarize the images and save the results in a new folder called `peabody_files_otsu` (the folder is on Google Drive).
Afterwards, we use the scikit-image library to crop the images.

In [None]:
import numpy as np
from PIL import Image
import cv2
import os
from skimage import io
from sklearn.cluster import DBSCAN
from matplotlib import pyplot as plt
import re
%matplotlib inline

In [None]:
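def otsu_threshold(im_path):
    """Reads an image and returns its binarized (0/255) version using Otsu's method."""
    im = Image.open(im_path)
    im = cv2.cvtColor(np.array(im), cv2.COLOR_RGB2GRAY)
    # Trim a 10-pixel border from every side before thresholding
    im = im[10:im.shape[0]-10, 10:im.shape[1]-10]
    # With THRESH_OTSU the threshold argument (0) is ignored and chosen automatically
    threshold, binarized = cv2.threshold(im, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binarized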

In [None]:
acc_nums = ['1','2','3','4','5','6','7','16','17','18','20','21','22','23','24','25']

The following two cells create the new folder containing the binarized images.

In [None]:
save_folder = "peabody_files_otsu"
# makedirs creates both levels and does not fail if the folders already exist
os.makedirs(save_folder + "/Accession_Files", exist_ok=True)

In [None]:
rootdir = 'peabody_files/Accession_Files'

for acc_no in acc_nums:
    ## Creates new folder to save files in
    new_folder_path = "peabody_files_otsu/Accession_Files/Acc._No._" + str(acc_no)
    os.makedirs(new_folder_path, exist_ok=True)
    
    ## Finding image
    path = rootdir + "/Acc._No._" + str(acc_no)
    ## os.walk yields (directory, subdirectories, files) for each directory under path
    for dirpath, subdirs, filelist in os.walk(path):
        for image_name in filelist:
            image_path = os.path.join(dirpath, image_name)
            image_save_name = image_name + ".png"
            save_path = new_folder_path + "/" + image_save_name
            cv2.imwrite(save_path, otsu_threshold(image_path))

___
We have a few different ideas about how to separate out each character. Some of those processes work better with only one line of text (rather than text that wraps) and will interpret the black section boxes as individual characters. For ease down the line, we decided to store each card as a dictionary of section names and the cropped image of the corresponding data. <br><br>
In the next cells we use skimage to crop the binarized images, and save these pieces into a dictionary.

In [None]:
def crop_image(sk_im):
    """Takes a skimage image array and crops it into sections; returns a dictionary of the pieces."""
    pieces = {}

    # Left box: four rows of equal height
    left_box = sk_im[5:600,5:420]
    width,height = left_box.shape[1],left_box.shape[0]//4
    left_box_pieces = [left_box[(i-1)*height:i*height,0:width] for i in range(1,5)]
    
    # Right box: four rows of equal height
    right_box = sk_im[5:600,470:1770]
    width,height = right_box.shape[1],right_box.shape[0]//4
    right_box_pieces = [right_box[(i-1)*height:i*height,0:width] for i in range(1,5)]
    
    # Bottom box: each row is a quarter of the box height, but only the first three rows are kept
    bottom_box = sk_im[620:1100,10:1775]
    width,height = bottom_box.shape[1],bottom_box.shape[0]//4
    bottom_box_pieces = [bottom_box[(i-1)*height:i*height,0:width] for i in range(1,4)]
    
    names = ['cat_no','acc_no','orig_no','photo_no']
    pieces.update(dict(zip(names,left_box_pieces)))
    
    names = ['name','site','site_no','locality']
    pieces.update(dict(zip(names,right_box_pieces)))
    
    names = ['situation','remarks','figured']
    pieces.update(dict(zip(names,bottom_box_pieces)))
    
    return pieces

In [None]:
rootdir = 'peabody_files_otsu/Accession_Files'
filed_dict = {} ## maps each accession number to a list of the dictionaries of cropped images
total_array = [] ## contains one list of dictionaries per accession number
for acc_no in acc_nums:
    path = rootdir + "/Acc._No._" + str(acc_no)
    folder = []
    for dirpath, subdirs, filelist in os.walk(path):
        for image_name in filelist:
            image_path = os.path.join(dirpath, image_name)
            image = io.imread(image_path)
            ### if you want `folder` to be a dictionary with keys of image names eg. key = 1_0016,
            ### initialize `folder = {}` above (instead of `folder = []`) and use:
            
            # folder[image_name.split(".")[0]] = crop_image(image)
            
            ### otherwise folder is a list
            
            folder.append(crop_image(image)) 
            
    total_array.append(folder)
    filed_dict[acc_no] = folder

The above cell goes through all of the binarized images in each accession-number folder and crops each image into its labelled sections. The pieces of an image are stored in a dictionary whose keys are the section labels on the card, with one dictionary per image. The dictionaries for all images under an accession number are collected into a list, and that list can be accessed by accession number through the dictionary called `filed_dict`.

If we want to see the piece containing `name` in the first image in the Accession No. 1 folder, for example, we can use the following code.

In [None]:
io.imshow(filed_dict['1'][0]['name'],cmap='gray',vmin=0,vmax=255)
plt.show()

___
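Segmentation: in the next cells we run scikit-image's Felzenszwalb segmentation on a test image and crop out the bounding box of each resulting segment.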

In [None]:
image = plt.imread('test.png')
plt.imshow(image,cmap=plt.cm.gray)
image.shape

In [None]:
seg = segmentation.felzenszwalb(image,scale=5,sigma=.25,min_size=50)
plt.imshow(seg)
print(np.max(seg))

In [None]:
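def bbox2(img):
    """Returns the first and last row and column indices that contain a True pixel."""
    rows = np.any(img, axis=1)
    cols = np.any(img, axis=0)
    rmin, rmax = np.where(rows)[0][[0, -1]]
    cmin, cmax = np.where(cols)[0][[0, -1]]

    return rmin, rmax, cmin, cmax

# Segment labels run from 0 to np.max(seg), so loop over all of them
for i in range(np.max(seg) + 1):
    rmin, rmax, cmin, cmax = bbox2(seg == i)
    plt.figure(i)
    # rmax and cmax are inclusive, so add 1 to keep the last row and column
    plt.imshow(image[rmin:rmax+1, cmin:cmax+1], cmap=plt.cm.gray)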
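The bounding-box step can also be done in a single pass over the label image instead of building one mask per segment. The sketch below shows this on a toy grid in pure Python; the function name `label_bounding_boxes` and the toy labels are assumptions for illustration, not part of the pipeline above.

```python
def label_bounding_boxes(labels):
    """Map each label in a 2-D grid to (rmin, rmax, cmin, cmax), all inclusive."""
    boxes = {}
    for r, row in enumerate(labels):
        for c, label in enumerate(row):
            if label not in boxes:
                boxes[label] = (r, r, c, c)
            else:
                rmin, rmax, cmin, cmax = boxes[label]
                boxes[label] = (min(rmin, r), max(rmax, r),
                                min(cmin, c), max(cmax, c))
    return boxes

# Toy label grid standing in for the output of a segmentation step
toy = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 0, 2],
]
boxes = label_bounding_boxes(toy)

# The bounds are inclusive, so add 1 when slicing
rmin, rmax, cmin, cmax = boxes[1]
crop = [row[cmin:cmax + 1] for row in toy[rmin:rmax + 1]]
```

Because a box can contain pixels from neighbouring segments (the `0` inside the crop of segment `1` here), a crop may need masking before it is used as a character image.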