CSC 630/1 Zufelt <br>
Michelle Chao, Sam Xifaras, Darius Lam, Moe Sunami, Diva Harsoor

# Catalog Card Processing
Creating a training dataset for optical character recognition in order to read catalog cards (Robert S. Peabody Museum of Archaeology, 1930s to 1970s)
___

Given scanned PDFs of typewritten catalog cards, our goal is to create a dataset in which each row is a set of images of each character in a card.<br> <br>
Tasks: 
<ul>
    <li>Convert PDFs to TIFFs</li>
    <li>Cut the images into different sections corresponding to data type </li>
    <li>Cut the images into characters and add each image as a row of a DataFrame </li>
    <li>Label the DataFrame with the proper character labels</li>
    <li>Write a script that presents a user with a randomly generated row/image and its corresponding label (to check for accuracy)</li>
    <li>Display the process in this Jupyter Notebook</li>
</ul>
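
To make the character-cutting task concrete, here is a minimal sketch of one common approach: splitting a line of text wherever a whole pixel column is blank. This is an illustration under assumptions, not necessarily the method used later in this notebook. The names `split_columns` and `cut_characters` are hypothetical, and the "image" is a plain nested list of 0/1 values standing in for a thresholded scan.

```python
def split_columns(bitmap):
    """Return (start, end) column ranges that contain ink; `end` is exclusive.

    `bitmap` is a list of rows, each a list of 0 (background) or 1 (ink).
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # Column profile: how many ink pixels each column contains.
    profile = [sum(row[x] for row in bitmap) for x in range(width)]

    spans = []
    start = None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                  # a glyph begins here
        elif not ink and start is not None:
            spans.append((start, x))   # the glyph ended at the first blank column
            start = None
    if start is not None:
        spans.append((start, width))   # the glyph touches the right edge
    return spans


def cut_characters(bitmap):
    """Cut the bitmap into one sub-bitmap per ink span."""
    return [[row[a:b] for row in bitmap] for a, b in split_columns(bitmap)]


# Toy 3x7 "line of text" with two glyphs separated by blank columns.
line = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
]

# One record per character image; labels are filled in by a later step.
rows = [{"image": glyph, "label": None} for glyph in cut_characters(line)]
```

Each record in `rows` could then become one row of the DataFrame. Real scans are noisier than this toy input, so touching or broken characters would likely need extra handling.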
___

_Here we'll explain trying to use Ghostscript with Python 3 and how Sam used PostScript instead (including the code)_

_Declare our DataFrame and so on_

The functions below make calls to the following script, `util.sh`, which has two modes. The `merge` mode, meant to be run only once, moves all of the PDFs from their respective Acc. No. directories into the parent directory, `Accession Files`. This is mainly for convenience: file names with spaces are a hassle to handle in bash, and because the accession numbers already appear in the file names, flattening the directories loses no information. The `convert` mode runs Ghostscript on a single PDF, identified by its accession and catalog numbers, and writes a 300 dpi TIFF with the same base name into the same directory.

Here is the code:
   
```
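# util.sh
# Usage:
#   sh util.sh merge   <accession files dir>
#   sh util.sh convert <acc no> <cat no> <accession files dir> <ghostscript executable>
#
# Written for plain sh: uses `=` in tests and `cd` instead of pushd/popd.

if [ "$1" = "merge" ]
then
    cd "$2" || exit 1
    for f in *
    do
        # Skip anything that is not an 'Acc. No.' directory
        [ -d "$f" ] || continue

        echo "Moving files from $f"
        # Only remove the directory if the move succeeded
        if mv -fv "$f"/* .
        then
            echo "Removing $f"
            rm -fR "$f"
        fi
    done

elif [ "$1" = "convert" ]
then

    ACCNO=$2
    CATNO=$3

    # This is the path to the directory that holds the PDFs
    ROOT=$4

    # Path to the ghostscript executable
    GS=$5

    cd "$ROOT" || exit 1

    "$GS" -dNOPAUSE -dBATCH -r300 -sDEVICE=tiffscaled24 -sCompression=lzw \
        -sOutputFile="${ACCNO}_${CATNO}.tif" "${ACCNO}_${CATNO}.pdf"

fi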
```
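
Since the main motivation for `merge` is avoiding spaces in bash paths, the same step can also be sketched in pure Python, where paths are ordinary strings. This is a hedged alternative, not the code the notebook actually runs. `merge_py` is a hypothetical name, and it assumes every entry directly under the accession directory is either an 'Acc. No.' directory or an already-moved file.

```python
import shutil
from pathlib import Path


def merge_py(acc_files_dir="peabody_files/Accession Files"):
    """Move every file out of the per-accession subdirectories into
    `acc_files_dir`, then remove the emptied subdirectories.

    Returns the number of files moved.
    """
    root = Path(acc_files_dir)
    moved = 0
    # Build the directory list up front, because the loop modifies `root`.
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        for f in sorted(sub.iterdir()):
            if f.is_file():
                target = root / f.name
                if target.exists():
                    # Refuse to overwrite rather than silently lose a file.
                    raise FileExistsError(f"{target} already exists")
                shutil.move(str(f), str(target))
                moved += 1
        # rmdir() only succeeds on an empty directory, so nothing
        # that was not moved gets deleted by accident.
        sub.rmdir()
    return moved
```

Unlike the shell version, this sketch stops with an error if two files share a name or if a subdirectory still contains something after the move.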

In [11]:
import os
import subprocess
from PIL import Image

In [14]:
def merge(acc_files_dir='peabody_files/Accession Files'):
    
    """
    Moves all the pdfs to the parent directory, so that all the pdfs are in one directory, 
    and deletes all the 'Acc. No.' directories.
    """
    return subprocess.call(["sh", "util.sh", "merge", acc_files_dir])

def convert(acc_no, cat_no, gs_exec, acc_files_dir='peabody_files/Accession Files'):
    
    """
    Converts the pdf for the given accession and catalog numbers to a tif,
    returns it as a Pillow image, and deletes the temporary tif file.
    """
    fname = str(acc_no) + '_' + str(cat_no).zfill(4)

    # Convert the specified pdf to tif; util.sh writes the tif into acc_files_dir
    subprocess.check_call(["sh", "util.sh", "convert", str(acc_no), str(cat_no).zfill(4), acc_files_dir, gs_exec])
    
    tif_path = os.path.join(acc_files_dir, fname + '.tif')
    
    # Copy the pixel data into memory so the image stays usable after the file is deleted
    with Image.open(tif_path) as im:
        tif = im.copy()
    
    # Delete the tif from acc_files_dir (os.remove also works where `rm` is unavailable)
    os.remove(tif_path)
    
    return tif

In [5]:
# Run this if the pdfs are still organized into their respective directories.
# merge()

0

In [15]:
# Example of opening an image
img = convert(1, 2, 'C:/Program Files/gs/gs9.20/bin/gswin64c.exe')
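
For reference, the Ghostscript call made by the `convert` branch of `util.sh` can also be issued straight from Python, which avoids needing `sh` on Windows. The following is a minimal sketch under that assumption. `build_gs_args` and `run_gs` are hypothetical helper names, and the flags are copied from `util.sh` rather than chosen independently.

```python
import subprocess


def build_gs_args(acc_no, cat_no, gs_exec):
    """Build the same Ghostscript argument list that util.sh uses."""
    name = f"{acc_no}_{str(cat_no).zfill(4)}"   # e.g. 1_0002
    return [
        gs_exec,
        "-dNOPAUSE",
        "-dBATCH",
        "-r300",                    # render at 300 dpi
        "-sDEVICE=tiffscaled24",    # 24-bit colour TIFF output
        "-sCompression=lzw",
        f"-sOutputFile={name}.tif",
        f"{name}.pdf",
    ]


def run_gs(acc_no, cat_no, gs_exec, acc_files_dir="peabody_files/Accession Files"):
    """Run Ghostscript inside acc_files_dir.

    Raises subprocess.CalledProcessError if Ghostscript reports a failure.
    """
    subprocess.run(build_gs_args(acc_no, cat_no, gs_exec), cwd=acc_files_dir, check=True)
```

Passing the arguments as a list means file names and the Ghostscript path are never re-parsed by a shell, so spaces in `Program Files` or `Accession Files` need no quoting.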