# DatasetMake

Here, I endeavor to assemble a suitable testing document. This testing document will be a random selection of pages from our corpus, but it will not be a _uniform_ random selection: pages that are observed to be "unique" will be over-represented by a known, quantifiable amount. The intent of enriching the sample with such pages is to:
1. Represent as many _languages_ as possible that feature reasonably prominently in the dataset.
    * No language (e.g., all pictures) is one possibility here.
1. Represent all _page orientations_ that feature reasonably prominently in the dataset, possibly including:
    * 0 degrees
    * 90 degrees
    * 180 degrees
    * 270 degrees
    * None of the above
1. Represent all exceptional elements that feature reasonably prominently in the dataset, possibly including:
    * Presence of at least one image
    * Presence of tabular data

In this case, "reasonably prominently" means "with high enough frequency as to have a non-negligible effect on overall OCR accuracy."

I also endeavor to quantify the frequency of these characteristics for the following reasons:
* It is necessary to determine which circumstances feature "reasonably prominently."
* More importantly, it is necessary to determine the magnitude of the effect of any bug on our OCR output. This will make our accuracy metrics more meaningful.
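
To make the role of the known enrichment concrete, here is a minimal sketch of how per-category results from an enriched sample could be reweighted into an estimate of corpus-wide accuracy. The category names, frequencies, and accuracy figures are invented for illustration; they are not measurements from this corpus, and `reweighted_accuracy` is a hypothetical helper.

```python
def reweighted_accuracy(corpus_freq, sample_accuracy):
    """Estimate corpus-wide accuracy from per-category accuracy.

    corpus_freq     - maps category name -> fraction of corpus pages
                      in that category (fractions must sum to 1)
    sample_accuracy - maps category name -> accuracy measured on the
                      pages of that category in the enriched sample
    """
    total = sum(corpus_freq.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError('corpus frequencies must sum to 1')
    # Weight each category by how common it is in the corpus, not by
    # how common it is in the (deliberately enriched) sample.
    return sum(corpus_freq[c] * sample_accuracy[c] for c in corpus_freq)


# Hypothetical numbers, for illustration only.
corpus_freq = {'typical': 0.95, 'rotated': 0.04, 'tabular': 0.01}
sample_accuracy = {'typical': 0.90, 'rotated': 0.50, 'tabular': 0.70}

estimate = reweighted_accuracy(corpus_freq, sample_accuracy)
# Treating every category as equally common would understate accuracy.
naive = sum(sample_accuracy.values()) / len(sample_accuracy)
print(round(estimate, 3), round(naive, 3))
```

With these made-up figures, the rare categories pull the naive average down much more than they pull the reweighted estimate down. This is why the frequencies need to be quantified before the accuracy metrics can be interpreted.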

Here, I install dependencies:

In [1]:
!pip install PyMuPDF
!pip install -U -q PyDrive

Collecting PyMuPDF
  Downloading https://files.pythonhosted.org/packages/bf/e8/bfd971ed4515fcdc0f7eec374a515f4608b141c62a0fb6949ad8425fb80b/PyMuPDF-1.18.13-cp37-cp37m-manylinux2010_x86_64.whl (6.4MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.18.13


Here, I initialize Drive access:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.settings import InvalidConfigError
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Here, I import needed packages:

In [3]:
import fitz
from shutil import copyfile
import os
import numpy as np
import pandas as pd
import time
import gspread
from oauth2client.client import GoogleCredentials
from google.colab import auth

Here, I declare global variables and utility functions:

In [8]:
CATALOG_NAMES = [
  'catalog.20200407.aa', 'catalog.20200407.ab', 'catalog.20200407.ac',
  'catalog.20200407.ad', 'catalog.20200407.ae'
]
LARGE_SAMPLE_PATH = ('/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/'
                     'awca-ocr/large_sample.pdf')
DEVELOPMENT_SAMPLE_PATHS = (
    '/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/'
    'development_sample.pdf',
    '/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/'
    'development_sample1.pdf',
    '/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/'
    'development_sample2.pdf',
    '/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/'
    'development_sample3.pdf',
    '/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/'
    'development_sample4.pdf',
    )
COMPRESSED_SAMPLE_PATH = ('/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/'
                          'pDonovan/awca-ocr/compressed_sample.pdf')
CORPUS_ROOT = '/content/drive/MyDrive/AWCA/PDFtp'
WORKING_DIR = ('/content/drive/MyDrive/AWCA/Colab_notebooks/OCR/'
               'pDonovan/awca-ocr/temp')

In [5]:
def get_df(title, gc, worksheet=0, has_headers=True):
    """Returns a pandas.DataFrame representation of the
    (WORKSHEET)th worksheet of the Google Sheets (GSHEET)
    file that has title TITLE.
    TITLE - the title of the desired spreadsheet
    GC    - the GSpread credentials needed to retrieve the spreadsheet
    WORKSHEET - the index of the desired worksheet within
        the spreadsheet
    HAS_HEADERS - set to False if the spreadsheet does not
        have a header row at the top.
    It is not necessary to specify the path or the GSHEET
    file extension. Note that this creates undefined
    behavior when your google drive has multiple spreadsheets
    with the same name (i.e., you do not know which one
    will be opened).
    """
    # For details on how to handle GSHEET files, see
    # https://gspread.readthedocs.io/en/latest/api.html
    contents = gc.open(title).get_worksheet(worksheet).get_all_values()
    if has_headers:
        return pd.DataFrame.from_records(
            data=contents[1:],
            columns=contents[0]
        )
    return pd.DataFrame.from_records(contents)

In [10]:
auth.authenticate_user()
GC = gspread.authorize(GoogleCredentials.get_application_default())

In [11]:
CATALOG = pd.concat(
    (get_df(name, GC) for name in CATALOG_NAMES),
    ignore_index=True
)
CATALOG.head()

Unnamed: 0,ID,md5,Size,mime-type,Created Date,Modified Date,Folder,Name,gdoc-id,gdoc-url,gdoc-length,gdoc-exceptions,gdoc-timestamp,delta-t (s),Notes
0,0BwjSAKD6JzR6WXBVczdtRVBuVG8,c2bbbcdb46850c82a6f0639122da65b3,43036522,video/quicktime,2017-03-06T21:54:39.618Z,2017-03-06T21:54:39.618Z,ane.pdf.share/By Topic (or field)/Teaching Tools,Flex Search Workspace.mov,,,,,,,
1,0ByFFNduW4doJWUhIQmhqQ0ZCbnM,4b182737a83195784fccbbd2c2532e5a,11711124,video/quicktime,2013-01-31T00:48:04.693Z,2013-01-31T00:48:04.693Z,ane.pdf.share/By Topic (or field)/Teaching Too...,V22i3003 -ls T718 dM ls rain movie.MOV,,,,,,,
2,0ByFFNduW4doJSlZYZTdCTG1oa0E,4b182737a83195784fccbbd2c2532e5a,11711124,video/quicktime,2013-01-31T00:48:46.458Z,2013-01-31T00:48:46.458Z,ane.pdf.share/By Topic (or field)/Teaching Too...,V22i3003 -ls T718 dM rain movie.MOV,,,,,,,
3,0ByFFNduW4doJWThDMC15Nm1Ucms,0b760c00c95eb567b91f195a6f6badba,11831672,video/quicktime,2013-01-31T00:49:32.192Z,2013-01-31T00:49:32.192Z,ane.pdf.share/By Topic (or field)/Teaching Too...,V22i3004 -ls T718 dM draining movie.MOV,,,,,,,
4,0ByFFNduW4doJT0lDUFdSeEdRdkU,0b760c00c95eb567b91f195a6f6badba,11831672,video/quicktime,2013-01-31T00:50:18.413Z,2013-01-31T00:50:18.413Z,ane.pdf.share/By Topic (or field)/Teaching Too...,V22i3004 -ls T718 dM ls draining movie.MOV,,,,,,,


In [12]:
CATALOG.shape

(530285, 15)

In [None]:
CATALOG.sample(5)

Unnamed: 0,ID,md5,Size,mime-type,Created Date,Modified Date,Folder,Name,gdoc-id,gdoc-url,gdoc-length,gdoc-exceptions,gdoc-timestamp,delta-t (s),Notes
298770,1_SDwv9CsL-yqdFp5mAMJ38nOoD2cickL,36f2d245f0ff88da86e6f50120e9fc31,460,image/png,2019-07-03T00:18:51.117Z,2019-07-03T00:18:51.117Z,ane.pdf.share/To be sorted.../EDUB2BAA/Assyrio...,MES.png,,,,,,,
302018,1awMtYVJwYpnp489C3PrNdmrgC50HA_NG,500667ac1e165330714e8c071d48ffa2,13097485,application/pdf,2019-07-03T07:13:14.421Z,2019-07-03T07:13:14.421Z,ane.pdf.share/To be sorted.../EDUB2BAA/Bit Enk...,741–791 English Index.pdf,,,,,,,
98080,0B9Ibqa26YXiReEVUV2QyeHhpZEk,a0259a6cc6bceeeb3e6c715db71dd8c9,3725312,application/msword,2015-10-29T07:35:05.763Z,2015-10-29T07:35:05.763Z,ane.pdf.share/By Series (or encyclopedia)/KBo/...,KBo 41.1-100,,,,,,,
273747,1gC2IlZBvGkiGQx4FUCrV-h9uQf8VNrnW,63f2945c374ed84cd225fcd0b019f7bc,303736,application/pdf,2019-07-03T00:31:12.665Z,2019-07-03T00:31:12.665Z,ane.pdf.share/To be sorted.../EDUB2BAA/Assyrio...,Yamada - 1998 (JCS 50) - Euphrates Crossings S...,,,,,,,
389295,1at490UIeEgBREXtwzxa5u4dFO-SSEDZU,bbe7fceff80c3cef202a3e0b95072691,4096,application/msword,2019-07-05T17:30:51.341Z,2019-07-05T17:30:51.341Z,ane.pdf.share/To be sorted.../EDUB2BAA/Strahil...,._86.doc,,,,,,,


## 1. Build a Large Representative Document

The objective here is simply to sample PDFs uniformly at random (with replacement), take one page chosen uniformly at random from each, and stitch the pages into one PDF for easy reference.

I am aware that `sample_pdfs` is not good code. It is only intended to be run once. Please don't maintain it.
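
One property of this procedure is worth making explicit. `sample_pdfs` draws documents uniformly and `combine` then draws one page per document, so the selection is uniform over documents rather than over individual pages. The sketch below uses made-up document names and page counts to compare the per-page selection probability under the two schemes; `page_probabilities` is a hypothetical helper, not part of the notebook.

```python
from fractions import Fraction


def page_probabilities(page_counts):
    """Per-page selection probability for a single draw under two schemes.

    page_counts - maps a (hypothetical) document name -> number of pages
    Returns (two_stage, page_uniform), two dicts keyed by document name
    that give the probability of selecting any one given page of it.
    """
    n_docs = len(page_counts)
    n_pages = sum(page_counts.values())
    # Two-stage: pick a document uniformly, then one of its pages uniformly.
    two_stage = {d: Fraction(1, n_docs * n) for d, n in page_counts.items()}
    # Page-uniform: every page in the corpus is equally likely.
    page_uniform = {d: Fraction(1, n_pages) for d in page_counts}
    return two_stage, page_uniform


# Made-up page counts, for illustration only.
page_counts = {'article.pdf': 10, 'monograph.pdf': 490}
two_stage, page_uniform = page_probabilities(page_counts)
for name in page_counts:
    print(name, two_stage[name], page_uniform[name])
```

Under the two-stage scheme, a given page of a short document is more likely to be chosen than a given page of a long one. That may well be acceptable for a reference document, but it is worth keeping in mind when frequencies measured on the sample are extrapolated to the corpus.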

In [6]:
def sample_pdfs(catalog, n, root, working_dir, random):
  """Returns an iterable of N paths to copies of PDF files selected
  uniformly at random from CATALOG.
  """
  pdfs = catalog[
    (catalog['mime-type'] == 'application/pdf')
    & np.array([name[:2] != '._' for name in catalog.Name])
  ]
  # The exact value 1e9 is unimportant -- it is not a magic number.
  pdfs = pdfs.sample(n, random_state=random.integers(1e9), replace=True)
  pdfs.index = list(range(len(pdfs.index)))
  original_paths = pdfs.apply(
      (lambda row: os.path.join(root, row.Folder, row.Name)),
      axis=1,
  )
  new_paths = pdfs.apply(
      (lambda row: (os.path.join(working_dir, row.ID) + '.pdf')),
      axis=1
  )
  paths = list()
  for idx in original_paths.index:
    while True:
      try:
        print('DEBUG: ', idx, ', ', new_paths[idx])
        if os.path.exists(new_paths[idx]):
          print('DEBUG: found PDF at ', new_paths[idx])
          paths.append(new_paths[idx])
        elif os.path.exists(original_paths[idx]):
          print('DEBUG: found PDF at ', original_paths[idx])
          paths.append(original_paths[idx])
        else:
          drive.CreateFile({'id': pdfs.loc[idx, 'ID']}).GetContentFile(new_paths[idx])
          print('DEBUG: created PDF at ', new_paths[idx])
          paths.append(new_paths[idx])
        break
      except Exception as e:
        print('DEBUG: ', e)
        # Re-authenticate and refresh the credentials held by the
        # module-level `gauth` (which `drive` was built from), then retry.
        auth.authenticate_user()
        gauth.credentials = GoogleCredentials.get_application_default()
  return paths

In [7]:
def combine(pdfs, output_path, random):
  """Writes a document to OUTPUT_PATH with one page selected
  uniformly at random from each document in PDFS.
  :param pdfs: an iterable of paths to PDF files
  :param output_path: the path to the file to be written
  :param random: a numpy.random.Generator instance
  """
  out = fitz.open()
  pdfs = list(pdfs)
  errors = list()
  for i, path in enumerate(pdfs):
    try:
      pdf = fitz.open(path, filetype='pdf')
      assert len(pdf) > 0, 'PDF must have at least one page.'
    except Exception as e:
      errors.append(str(e))
      print(e)
      continue
    chosen = int(random.integers(len(pdf)))
    print('{} DEBUG: Selecting page {} from {}'.format(i, chosen, path))
    out.insert_pdf(pdf, from_page=chosen, to_page=chosen)
  print('Terminated with {} errors.'.format(len(errors)))
  out.save(output_path)
  return errors

In [None]:
random = np.random.default_rng(333) # Time of day when this code was written
combine(
    sample_pdfs(CATALOG, 1600, CORPUS_ROOT, WORKING_DIR, random),
    LARGE_SAMPLE_PATH, random
)

DEBUG:  0 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1T5zmjoa2zcBkiIu6vbpoxcGmoFmS7ZqO.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1T5zmjoa2zcBkiIu6vbpoxcGmoFmS7ZqO.pdf
DEBUG:  1 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1CpervdsqrDmIOuEprs2woDBKiIgTwCDi.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1CpervdsqrDmIOuEprs2woDBKiIgTwCDi.pdf
DEBUG:  2 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1MHzaehYGC-RwqCfrSIrB3cnQo6Guuy2c.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1MHzaehYGC-RwqCfrSIrB3cnQo6Guuy2c.pdf
DEBUG:  3 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0ByFFNduW4doJZFdBRkRTa0dlM3c.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Univ Presses/Oxford Univ Press/0195393

mupdf: expected trailer marker
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict


106 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1RYJaBUAb5xS5t7VDjiXdK4rmaqqiy-zR.pdf
107 DEBUG: Selecting page 240 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1BbERM8OTfrKsB4aAsUgGHFb0pbb1AQ1d.pdf
108 DEBUG: Selecting page 10 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1ur6vuunbBT474-YexKMBq_V3TmvNTjv4.pdf
109 DEBUG: Selecting page 487 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/B/B/Bright, John - A History of Israel, 4th Edition. 2000.pdf
110 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1XuQiwz_S_nikQcC2DkuB6uvTAlEEQWll.pdf
111 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1ZsuKPkhrRokotnS8-LlGkEYPmCq87yWT.pdf
112 DEBUG: Selecting page 11 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1aaGfc5xmJOHC4S

mupdf: expected trailer marker
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict


169 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1pXHU7U6m2QJfr2_sJGDEK0rz5zDBi1OF.pdf
170 DEBUG: Selecting page 7 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Oqk5VjKpkpy6IqB7RoykfiOfDkdaRiLO.pdf
171 DEBUG: Selecting page 50 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/Sumer/Sumer Separata/المجلد الخامس و العشرون، 1969 - الجزء 1 و 2/noormags-Nineveh_The_1968_-_1969_Campagn.pdf
172 DEBUG: Selecting page 5 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1ShUR4ok7l2utFBH8pXD7LfCnNgQVl5PN.pdf
173 DEBUG: Selecting page 4 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1nE9H50mtPDhNKkMrgNCbZr9YDXOuNYM1.pdf
174 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Dissertations/Studies in ancient Anatolian language and culture (Turkey) by Taylor, Patrick John.pdf
175 DEBUG: Selecting pag

mupdf: cannot recognize version marker
mupdf: no objects found


401 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Festschriften/Fs Neumann = GVL 17/Janda_FsNeumann2.PDF
402 DEBUG: Selecting page 8 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1KR3AyoAxFGnQ5WfoMGm51_8X4nKeOE-U.pdf
403 DEBUG: Selecting page 22 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1X_uGWWnuVZotcPbzdCNGlDE7FQckPrQ1.pdf
404 DEBUG: Selecting page 142 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/JNES/jnes55.pdf
405 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Библиотека/Hebrew language/History of Hebrew Grammatical Thought/Karaite grammar/מאור עין/געש. צירופים של חלקי הדיבר היכולים להעמיד מבע עצמאי לפי החיבור הדקדוקי הקראי מאור עין.pdf
406 DEBUG: Selecting page 7 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/17r_T9koJWuBLv6Iw64yuIttvbDWXZkMJ.pdf
407 DEBUG:

mupdf: cannot recognize version marker
mupdf: cannot tell in file


441 DEBUG: Selecting page 136 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/18tHG4Uczn3J1Pu_4uNM6ZcMwcHFAknZz.pdf
442 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1yiJRPPgN8lVofcbNJypMcLAcVjavp10e.pdf
443 DEBUG: Selecting page 12 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/16IKOKH2O0SJY1lJrzTQxd0acDHTYoxR-.pdf
444 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/B/B/burkert-von ullikummi zum kaukasus, zur kontinuität einer mündlichen erzählung (2).pdf
445 DEBUG: Selecting page 230 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Univ Presses/Cambridge Univ Press/0521821320.Cambridge.University.Press.Reading.the.Past.Current.Approaches.to.Interpretation.in.Archaeology.Jan.2004.pdf
446 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1BtHlPnoZZtwIyQT

mupdf: cannot find startxref
mupdf: invalid key in dict
mupdf: expected trailer marker


501 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1zx6dgb8QpAgKpBwz0Qt_v7lJraSHptN4.pdf
502 DEBUG: Selecting page 396 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Series (or encyclopedia)/Encyclopedias & Companions/Cambridge History/Cambridge_History_of_Judaism_3.pdf
503 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1uVC5W_zRouo_1S7pmv1iRvv0Z0z2JsTi.pdf
504 DEBUG: Selecting page 153 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1CyAGLvXcIznHEKIc5wBufHBfvJ2dUvMI.pdf
505 DEBUG: Selecting page 11 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1-tGMAB56r6p_i7jKBBxfuVWHXyDoD0CQ.pdf
506 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1TnHSvyQmRrchivpyuU0Y6HJQYvdAiNTz.pdf
507 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-o

mupdf: cannot recognize version marker


645 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1KcAZpDFHa8H6lklYPb-1-PKieWjoBhtk.pdf
646 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1F0Uu8tUU1yGTXGSrNYAwd0E8jPaJ-4ID.pdf
647 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1AU2iprjpy9U49b9gCSjtUpbqnf2gznQ_.pdf
648 DEBUG: Selecting page 5 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/ArAn (Archivum Anatolicum-Anadolu Arşivleri)/2 Pdf/04 GÜNBATTI, Cahit, Two New Tablets Throwing Light on the Relations Between Anatolian Kings and Assyrian Merchants in the Period of the Assyrian Colonies.pdf
649 DEBUG: Selecting page 42 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Series (or encyclopedia)/TC/TC_3.3.pdf
650 DEBUG: Selecting page 101 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Series (or encyclopedia)/HdO/HdO 28 Sivan, Daniel - Ugarit

mupdf: expected trailer marker


747 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Nv8H8jZbDJFBLxvP6KDwfu71lyPpjbz4.pdf
748 DEBUG: Selecting page 34 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1zy0D-TyNq4HTDzm6D60r4M81-8MSg8jx.pdf
749 DEBUG: Selecting page 87 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/T/Taleb - The Black Swan.pdf
750 DEBUG: Selecting page 22 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Site Reports/Israel/Farah N/Mallet, J. -Tell el-Farah II,1 - Le Bronze Moyen.pdf
751 DEBUG: Selecting page 97 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1IoMs8IGEzlGM9V3FJLDJ2Ffklj8K82Xe.pdf
752 DEBUG: Selecting page 290 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1XNIvnF-RKK5G_HqalEJCiaZbQNQJgsdg.pdf
753 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/14OQcebr

mupdf: cannot recognize version marker
mupdf: cannot tell in file


815 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1XIXNGEct5okKt5kfh54FldQg9yyHUU7h.pdf
816 DEBUG: Selecting page 8 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Rn5kEF30PZqPxDkTZGYUASqDxipLp6rP.pdf
817 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1i0pMhmEDnuyziJWqTDvoxBfO4YQgc7A7.pdf
818 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1F___wvvn8F5eVVXcgUERsZxrhGkW2jmS.pdf
819 DEBUG: Selecting page 191 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1W34anOVt8fw7GC6_YtpWTUKVb6WMWBU9.pdf
820 DEBUG: Selecting page 9 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1wdwkqPgLiQHABRN2UVwCb_CVr3ebF4DG.pdf
821 DEBUG: Selecting page 256 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1IgOY7iaD5BJ3HlQ3tNTZ8I7qU2HZM78V.pdf
82

mupdf: expected trailer marker


911 DEBUG: Selecting page 42 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1zP0wHwlNXaBY4NLk8CD_UFw3pJ-dHWIq.pdf
912 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1zKb9d5_lFwZuQ4azf3HGB9mF52vAIMES.pdf
913 DEBUG: Selecting page 71 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Series (or encyclopedia)/VS/VAS-26_bis.PDF
914 DEBUG: Selecting page 188 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1EmApUd80mIPuK2MQfgQe2ibx0goO19_T.pdf
915 DEBUG: Selecting page 10 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Teaching Tools/ANE Shared Files/Archeologie/Artikel/A/Astour_1972.pdf
916 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1WARaSylc2dVmQltn3sowLEPY2NzMYVi3.pdf
917 DEBUG: Selecting page 13 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/R/R (dirty)/ronchi1.PDF
9

mupdf: cannot find startxref
mupdf: zlib error: incorrect header check
mupdf: corrupt object stream (2244 0 R)
mupdf: zlib error: incorrect header check
mupdf: corrupt object stream (2245 0 R)
mupdf: zlib error: incorrect header check
mupdf: corrupt object stream (2246 0 R)
mupdf: zlib error: incorrect header check
mupdf: corrupt object stream (2247 0 R)
mupdf: zlib error: incorrect header check
mupdf: corrupt object stream (2248 0 R)


960 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Библиотека/Hebrew language/Biblical Hebrew/Dictionaries/מנדלקרן – קונקורדנצייה לתנך/תיקונים לתיקונים לקונקורדנציה של מנדלקרן.pdf
961 DEBUG: Selecting page 11 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1KzclGgM_kc0p9kvPPd9Zf56aPENxywQB.pdf
962 DEBUG: Selecting page 12 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/B/Bar-Yosef and Zilhao 2006 Towards a Definition of the Aurignacia.pdf
963 DEBUG: Selecting page 92 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Egypt/Coptic/Coptic_CSSC_2_Leipoldt_1908.PDF
964 DEBUG: Selecting page 7 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1z1UlGU-HFe0A13Ni7mHdTscCT1i5WveN.pdf
965 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1L8jxfHQmsdw8aKTTEsqf94td01fnyTRX

mupdf: expected object number





mupdf: No default Layer config


980 DEBUG: Selecting page 35 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1ea6kepHzvfIdJGNfSDL-qzKvch9iK7P2.pdf
981 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9Ibqa26YXiRT1d0c0E4TEFhS2s.pdf
982 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/12H7A1nCSGRqJ6QbSpgfY8zw792ZwVLY8.pdf
983 DEBUG: Selecting page 30 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/S/Snell, Daniel C. ed - A Companion to Ancient Near East [Blackwell, 2005].pdf
984 DEBUG: Selecting page 19 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1xApIQ5Zu0S0Id2ot_iYZ9jqFkAM7KkvE.pdf
985 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1hzBMwHx6rSiah1cqgtFmETwb_rie7UeI.pdf
986 DEBUG: Selecting page 15 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1qTfm

mupdf: expected trailer marker


1006 DEBUG: Selecting page 292 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Библиотека/Иудаизм периода II Храма/schaefer_judeophobia-attitudes.toward.the.jews.in.the.ancient.world.pdf
1007 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1h12Uav3Lmgish_tb45bydTR1mRmO-4do.pdf
1008 DEBUG: Selecting page 5 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1wOW9jwom_-BPVitFDhX4_jz-A1J_2bpY.pdf
1009 DEBUG: Selecting page 13 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1NJvOl3p8xBfyEPjNsRHZZV68BIt675UM.pdf
1010 DEBUG: Selecting page 11 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1HKsaXB6pmwSoXpELEiTJHgUsgtU0u177.pdf
1011 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/17_CM_RA5JOZmJyq0AJjnRD2INHbRGzCE.pdf
1012 DEBUG: Selecting page 2 from /content/drive/MyDrive/A

mupdf: expected trailer marker


1035 DEBUG: Selecting page 4 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1n63CD9RN8IiAwsfUmMdHIq16Q1N_XXNw.pdf
1036 DEBUG: Selecting page 25 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1e5aI86e5xuRbNPcHtJ7Ug50C6lMKjuZo.pdf
1037 DEBUG: Selecting page 97 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1GMwdZu1buDmtbDMXbjYkVF2v-8n8oHls.pdf
1038 DEBUG: Selecting page 4 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Assyrian/Old Assyrian/OA Secondary sources/Steiner_1990_UKHB_hattusha_Anitta.pdf
1039 DEBUG: Selecting page 218 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/L/Levinson, Bernard - The Right Chorale, Studies in Biblical Law and Interpretation, 2011.pdf
1040 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/Iraq/Iraq (separata)/Iraq (volumes by year)/Iraq1983/4200195.pdf
1041 DEBUG: 

mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict


1171 DEBUG: Selecting page 183 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1qJsa2uTroNwi0n6OSXsxXRRjwuH4qhys.pdf
1172 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1FOTTHRZb_OhaHttBlMsy2uXP2IBKglgR.pdf
1173 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1vXaQnvR9BsPBl6bD02yl8S3d5sL_krja.pdf
1174 DEBUG: Selecting page 19 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Teaching Tools/ANE Shared Files/Old Assyrian/Secondary Sources/rowlandson.PDF
1175 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/SAAB/SAAB Separata/2.2 02 Parpola.pdf
1176 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1IqqNwMJm67BBK4M2JgjuVs5qjVGGgkQn.pdf
1177 DEBUG: Selecting page 60 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1HoAe

mupdf: expected trailer marker


1183 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1S8jcqIrKu2RbKPWpdUfFC-E3m6n4Fl0O.pdf
1184 DEBUG: Selecting page 248 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/11tI6Y99qe3yINAB7LJ9OJCmlVqYsGq5h.pdf
1185 DEBUG: Selecting page 15 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/189PGsXK3tRGFlxQJqnYvo70qbXsCkWgd.pdf
1186 DEBUG: Selecting page 4 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1tfrseg2HiPNNjhkx-aO9tm113jCLsBF_.pdf
1187 DEBUG: Selecting page 9 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1X2lUMLgZj_AiCvatVSHJmSg9hbq7kmtL.pdf
1188 DEBUG: Selecting page 12 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1ZcaH48rlBecdU5KGe-9YVi7ilfvGqSYj.pdf
1189 DEBUG: Selecting page 15 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1t9WqUVe66HvYsfsWSKFFTdRbfUNE9O2

mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict


1324 DEBUG: Selecting page 9 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1_8zURWXwmHfEw3aAT5bwiR6Ug2tMHf4R.pdf
1325 DEBUG: Selecting page 124 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1TYp6UGZtk4QT3o1rgIlxcNqMzFJD5G4h.pdf
1326 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1uMCKLlHKiRZFsZhsiaA_Ab5vW0d5Yd94.pdf


mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict


1327 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1AvQfRFlwXZFo0Xds7ci22AbPByqp-j11.pdf
1328 DEBUG: Selecting page 206 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/17IxoBNxwjFmEAYY2ii_lhNJS4E6jCyRH.pdf
1329 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/L/Leemans, W.F. - The Trade Relations of Babylonia and the Question of Relations with Egypt in the OB Period. JESHO 3, 1960, pp. 21-37.pdf
1330 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1OfCSxawZps-mU6Bmr5jeUGjgreQ4mZEC.pdf
1331 DEBUG: Selecting page 16 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1kKJ2VzF2Yrh5-i-Txv3t_-ayAkA-D_CU.pdf
1332 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1r3QNGjeZ24zxiZH01W1MsLLbdlFkfSP0.pdf
1333 DEBUG: Selecting page 10 from /con

mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict


1513 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1yBTQF9oiuNHuF2qdZ45wpTsBIFy57Qal.pdf
1514 DEBUG: Selecting page 17 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1uzguPi8xnS1GcwlFJTG5dYPE3_zad3Ch.pdf
1515 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1BBXISZYhcOlYRdJVkZGRZxi5OB1mLOyT.pdf
1516 DEBUG: Selecting page 14 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1T6ZVAqKUBgDOwhxuvl3YjBWEZOdUDbI6.pdf
1517 DEBUG: Selecting page 17 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Egypt/Posener 1951 - La litterature Egyptienne (RdE 6).PDF
1518 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1-yme5RI4K-61BWnvws-922djUANOBpZR.pdf
1519 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1UQ

['no objects found',
 'cannot tell in file',
 'invalid key in dict',
 'cannot tell in file',
 'PDF must have at least one page.']
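
As a quick way to summarize the list returned by `combine`, the messages can be tallied. This is a minimal sketch that copies the five messages shown above and assumes that the 1600 requested documents are the right denominator for a failure rate.

```python
from collections import Counter

# Error messages copied from the list returned by `combine` above.
errors = [
    'no objects found',
    'cannot tell in file',
    'invalid key in dict',
    'cannot tell in file',
    'PDF must have at least one page.',
]

error_counts = Counter(errors)
# Assumption: 1600 is the denominator because 1600 documents were requested.
failure_rate = len(errors) / 1600

for message, count in error_counts.most_common():
    print(f'{count:>3}  {message}')
print(f'failure rate: {failure_rate:.4%}')
```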

## 2. Analyze the Large Representative Document

The final product of our analysis of the large representative document is a spreadsheet with metadata (languages, page orientations, and other unusual features) associated with each page in the large representative document.
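
As a sketch of what one row of that spreadsheet might look like, the record below has one field per kind of metadata named above. The field names, the `;`-separated language encoding, and the `to_csv` helper are assumptions made for illustration, not the layout actually used.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PageAnnotation:
    """Hypothetical per-page record; the field names are illustrative."""
    page: int          # zero-based page index in the large sample
    languages: str     # e.g. 'eng;deu', or '' for no language
    orientation: str   # '0', '90', '180', '270', or 'other'
    has_image: bool    # at least one image is present
    has_table: bool    # tabular data is present


def to_csv(rows):
    """Serialize an iterable of PageAnnotation records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(PageAnnotation)],
        lineterminator='\n',
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


rows = [
    PageAnnotation(0, 'eng', '0', False, False),
    PageAnnotation(1, '', '90', True, False),
]
csv_text = to_csv(rows)
print(csv_text)
```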

### Automatic First Pass

I wish to use my OCR system to speed this process along. This carries a significant risk: errors introduced in the automated first pass of annotation could bias our evaluation procedures downstream.

I began by re-assessing some design choices in the class `tesseract_manager.Text`. In particular, I chose to replace `pdf2image` with the more versatile library `PyMuPDF`, which can also extract text. (Unlike `PDFMiner`, `PyMuPDF` does not seem to take hours or days to do page segmentation on a single page.) Here are some notes:
* Magnification seems to improve accuracy when the images are generated from text (unsurprisingly), ~~but possibly not when rasters are generated from rasters (because you cannot get more information than you started with).~~ I have found at least two examples in which scaling up an image by a factor of 2 leads to a dramatic improvement in accuracy with Tesseract. I do not know why. This has a downside (the program becomes 2-3 times slower), but I am keeping it in the code for now.
* It will be clear enough when it makes sense to use OCR. Use OCR if there are no large images -- regardless of whether there is text already (because we don't know where the text came from or whether it is accurate). What's a large image? Perhaps it should cover at least half the page -- although we could require it to cover at least 90% of the page or so, since the scanned documents seem to have images that are exactly or almost exactly the same size as the page.
* I caught a poor design choice that was causing the OCR system to give up completely when OSD (orientation and script detection) failed. This was unreasonable: OSD does not seem to use the same algorithms as OCR, so a failure of OSD does not seem to imply a failure of OCR.
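
The "large image" question in the notes above can be made concrete with a small sketch. It is not the code in `tesseract_manager.Text`. It assumes that the page rectangle and the bounding boxes of the images on the page have already been obtained from the PDF library as `(x0, y0, x1, y1)` tuples, and every function name here is hypothetical. It only measures how much of the page the largest image covers; how that result feeds into the decision to use OCR is left to the caller.

```python
from typing import Iterable, Tuple

# (x0, y0, x1, y1) in page coordinates; assumed to come from the PDF library.
Rect = Tuple[float, float, float, float]


def rect_area(rect: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def clip_to_page(rect: Rect, page: Rect) -> Rect:
    """Intersect an image rectangle with the page rectangle."""
    return (
        max(rect[0], page[0]),
        max(rect[1], page[1]),
        min(rect[2], page[2]),
        min(rect[3], page[3]),
    )


def largest_image_coverage(page: Rect, images: Iterable[Rect]) -> float:
    """Fraction of the page area covered by the single largest image."""
    page_area = rect_area(page)
    if page_area == 0:
        return 0.0
    return max(
        (rect_area(clip_to_page(img, page)) / page_area for img in images),
        default=0.0,
    )


def has_large_image(
    page: Rect, images: Iterable[Rect], threshold: float = 0.5
) -> bool:
    """True if some image covers at least `threshold` of the page."""
    return largest_image_coverage(page, images) >= threshold
```

With a letter-sized page `(0, 0, 612, 792)`, a full-page scan passes both the 50% and the 90% thresholds discussed above, while a small embedded figure passes neither, so the choice between the two thresholds only matters for images of intermediate size.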

I am continuing to test `tesseract_manager.Text` by running it and looking at confidence scores.
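
The OSD point in the notes above can be sketched as follows. This is a minimal illustration, not the actual implementation in `tesseract_manager.Text`: the three callables (`detect_orientation`, `rotate`, `recognize`) are hypothetical stand-ins for whatever OSD, rotation, and OCR routines are in use, and `recognize` is assumed to return the text together with a confidence score.

```python
from typing import Callable, Optional, Tuple


def ocr_with_optional_osd(
    image: object,
    detect_orientation: Callable[[object], int],
    rotate: Callable[[object, int], object],
    recognize: Callable[[object], Tuple[str, float]],
) -> Tuple[str, float, Optional[int]]:
    """Run OCR even if orientation detection fails.

    Returns (text, confidence, angle), where angle is None if OSD failed.
    """
    try:
        angle: Optional[int] = detect_orientation(image)
    except Exception:
        # Assumption: any OSD failure is non-fatal. Real code should catch
        # the specific exception type raised by the OSD routine.
        angle = None

    # Only rotate when OSD succeeded and reported a non-zero angle.
    upright = rotate(image, angle) if angle else image
    text, confidence = recognize(upright)
    return text, confidence, angle
```

Returning the angle (or `None`) alongside the confidence score makes it possible to check afterwards whether pages on which OSD failed really do have lower confidence.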

In [None]:
random = np.random.default_rng(926) # Time of day when this code was written
combine(
    sample_pdfs(CATALOG, 400, CORPUS_ROOT, WORKING_DIR, random),
    DEVELOPMENT_SAMPLE_PATHS[0], random
)

DEBUG:  0 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1NmeSSoI3gmzCLdzdMi1sdfJWtImwsTfI.pdf
DEBUG: created PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1NmeSSoI3gmzCLdzdMi1sdfJWtImwsTfI.pdf
DEBUG:  1 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/12TFLzxU57KAVJbS5CYt6_8w3DIMsm5F6.pdf
DEBUG: created PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/12TFLzxU57KAVJbS5CYt6_8w3DIMsm5F6.pdf
DEBUG:  2 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1mCWuVyZs6cfiMLPnoNJMhUx3QYga-5SE.pdf
DEBUG: created PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1mCWuVyZs6cfiMLPnoNJMhUx3QYga-5SE.pdf
DEBUG:  3 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1966_2SyrfNK0jq9NjeqBNRLQ7pXo1Xrb.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Site Reports/general/Religi

mupdf: expected object number


158 DEBUG: Selecting page 116 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/F/Fincke ed 2014 Divination in the Ancient Near East, A Workshop on Divination.pdf
159 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Rs_mjKNXcMe0Qo7GYwI4fzUu8cdz12F9.pdf
160 DEBUG: Selecting page 71 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/C/Clay, Albert - Documents from The Temple Archives of Nippur Dated in the Reigns of Cassite Rulers. BE 14, 1906.pdf
161 DEBUG: Selecting page 56 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/G/Gledhill et al (eds) - State and Society. The emergence and development of social hierarchy and political centralization. OWA 4, 1988.pdf
162 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Y5LQakz4QJ3Nxp0PqZJNRNp9_QJyFvmZ.pdf
163 DEBUG: Selecting page 59 from /content/drive/MyDrive/AWCA/P

mupdf: expected object number


219 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Egypt/Jasnow  1997 - The Greek Romance and Demotic Egyptian Literature (JNES 56).pdf
220 DEBUG: Selecting page 4 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1iOgtj0uzahtfncp3njoLuZmAWKjwA9p7.pdf
221 DEBUG: Selecting page 8 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/B/Barr, James - Biblical Chronology. 1987.pdf
222 DEBUG: Selecting page 152 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Библиотека/Христианство/Pam_star_rus_literat_3_1862 Ложныя и отреченныя книги.pdf
223 DEBUG: Selecting page 9 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1nJIDeAO3FgNuWCPtBFkbPEqDNcsd75o_.pdf
224 DEBUG: Selecting page 5 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1YKnIHanCAC8CWOkFSAoio3qiTBB_Kuo4.pdf
225 DEBUG: Selecting page 13

mupdf: expected trailer marker


354 DEBUG: Selecting page 60 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/JNES/JNES 65, 2006.pdf
355 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Ancient Aramaic/Imperial Aramaic/Biblical Aramaic/Blake Studies in Semitic Grammar V.pdf
356 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Series (or encyclopedia)/Studia_Mediterranea/StMed_14_Pecchioli Daddi_Vincolo per i Governatori.PDF
357 DEBUG: Selecting page 8 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1NpV-lvl5q7kCwR1H_ZaWKbWMp8TkEk7E.pdf
358 DEBUG: Selecting page 13 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/19DknKYkO4EiuJSXuGvKvfyWjydtoev66.pdf
359 DEBUG: Selecting page 184 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1mpqV8m1CceMvn5ODGFY4gY2rnZllQtGs.pdf
360 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Cola

mupdf: expected trailer marker


378 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Egypt/Jasnow 2003 - Middle Kingdom and Second Interm. period.PDF
379 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1MkMctsxjKP_vBjrOXkU13S4tqcXwG0VY.pdf
380 DEBUG: Selecting page 464 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1RivBcipf4XdStLsqOtCkSflQ1mK9by6h.pdf
381 DEBUG: Selecting page 39 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/16l959IqvsgWQFP3YD-YyyzzVEyONXreo.pdf
382 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Et6AlnFKXITrdja7wclkHc-Ly1Ex2vyL.pdf
383 DEBUG: Selecting page 12 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1qOcpY-2ID-pl82sRyO2bPNMkDjVF3d9O.pdf
384 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9

[]

In [13]:
random = np.random.default_rng(528) # Time of day when this code was written
combine(
    sample_pdfs(CATALOG, 400, CORPUS_ROOT, WORKING_DIR, random),
    DEVELOPMENT_SAMPLE_PATHS[1], random
)

DEBUG:  0 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Sjj_H4u1L70DlzADvjHkUpqgLelgzEjj.pdf
DEBUG: created PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Sjj_H4u1L70DlzADvjHkUpqgLelgzEjj.pdf
DEBUG:  1 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1BQOeO81FjGskR5H0RKhw0gfXkTXNcCcS.pdf
DEBUG: created PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1BQOeO81FjGskR5H0RKhw0gfXkTXNcCcS.pdf
DEBUG:  2 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0ByFFNduW4doJNTBVYWdXZHl4T00.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Univ Presses/Cambridge Univ Press/0521399785.Cambridge.University.Press.Geometry.of.Low-Dimensional.Manifolds.Vol.1.Gauge.Theory.and.Algebraic.Surfaces.Jan.1991.pdf
DEBUG:  3 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9Ibqa26YXiRZGRueTI3VWxDeDQ.pdf
DEBU

mupdf: No default Layer config


6 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Библиотека/Hebrew language/History of Hebrew Grammatical Thought/Karaite grammar/מאור עין/געש. צירופים של חלקי הדיבר היכולים להעמיד מבע עצמאי לפי החיבור הדקדוקי הקראי מאור עין.pdf
7 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1dg8ScT0o6tKSBT5qSsKV4_1NPMQ8RtFu.pdf
8 DEBUG: Selecting page 28 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1cAsSekXGrvQUdgGaS2zZ0jPf2DgrMVsJ.pdf
9 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/188uLrPwvFEySv6Cpr6r4xEktpGe0dlPk.pdf
10 DEBUG: Selecting page 181 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1F_N8xW8LOECdlLhLK6D9tu1CGzckl9jU.pdf
11 DEBUG: Selecting page 7 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1iv0fnZIucV4lA5wrSG_B3qTET5sMdDlR.pdf
12 

mupdf: expected trailer marker


21 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/E/E (dirty)/ephal1983000.PDF
22 DEBUG: Selecting page 12 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/15GgxpRo5MIP4Wuxs0tHFRJQBAO3w2e50.pdf
23 DEBUG: Selecting page 562 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1ynCx8hR_-o3JjNXHZTf2dFTir2agKX9Q.pdf
24 DEBUG: Selecting page 4 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1ETDOmvXh595peWtrgKYc5CYBXzkW7J0L.pdf
25 DEBUG: Selecting page 250 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1swe5QQk_jK2qBSS1t2gw3ImImMqSIza-.pdf
26 DEBUG: Selecting page 30 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1DvvQdP88ynb2lHLhztPw2TQe9Ya4uOE2.pdf
27 DEBUG: Selecting page 116 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1kNT_9TimD_YwcBtoSXth7BZXr5NqTzjO.pdf
28 DEBUG: Selec

mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict


135 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9Ibqa26YXiRQzVYaHhkNzVlQkk.pdf
136 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1z8hyGctfpdmOMsnGTQEfVUoNBuoUDkr7.pdf
137 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1_LXho9L6TBm_UVgbyzUcfsxXKL_krFYf.pdf
138 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1u3ucwGeYCGqC4T4DDqmJPLf9d_fS1IfW.pdf
139 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Assyrian/Old Assyrian/OA Secondary sources/von_Soden_1956_Or25_beschworung.pdf
140 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9Ibqa26YXiRQ3NacWlmdkZrcEk.pdf
141 DEBUG: Selecting page 126 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1

mupdf: expected trailer marker


228 DEBUG: Selecting page 57 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1kveRMiX28Q0unplsR6c_nm-q3IfDMLre.pdf
229 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1Bci0OH3wtNEQmxILaB8nyVZ3FwE57qQf.pdf
230 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1-RJ_M39StWr2ECz8KfA8qsTc2RrY0xBc.pdf
231 DEBUG: Selecting page 13 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1NK0BK8e4xKR42G7nyHz28K26NYad9Gss.pdf
232 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/B/B/Blau--Some Difficulties in the Reconstruction of Proto-Hebrew and Proto-Canaanite.pdf
233 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1YV5ovygWusgmJjc_kG0ZPKLHtupMQnzf.pdf
234 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-

mupdf: expected object number


373 DEBUG: Selecting page 164 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Univ Presses/0520080920.University.of.California.Press.Losing.Face.Status.Politics.in.Japan.Nov.1992.pdf
374 DEBUG: Selecting page 9 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1V33AjDdy1rQMLUpIV_CTXWXQOSzmJwoF.pdf
375 DEBUG: Selecting page 18 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/17DRmQcivvPBOBIQx2pCxYF8j5PiA3rcp.pdf
376 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1-DpTIxFvksW4rbz0da0ZUvSOndziOgfv.pdf
377 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1z7zKqDOhy1uaNfXwxkVxRiKorn8RcDAj.pdf
378 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1z8nz0QZLvgawOOs7N_yYbpb04_ZCp7M8.pdf
379 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/

[]

In [15]:
random = np.random.default_rng(559) # Time of day when this code was written
combine(
    sample_pdfs(CATALOG, 400, CORPUS_ROOT, WORKING_DIR, random),
    DEVELOPMENT_SAMPLE_PATHS[2], random
)

DEBUG:  0 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0ByFFNduW4doJamlobXdkeEZhSDg.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/C/Carter, R. & G. Philip - Beyond the Ubaid. Transformation and Integration in the Late Prehistoric Societies of the Middle East. SAOC 63, 2010.pdf
DEBUG:  1 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9Ibqa26YXiRbi0yZmRYaXN6Qlk.pdf
DEBUG: found PDF at  /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Series (or encyclopedia)/BE/BE 1:2/BE-1-2-title.pdf
DEBUG:  2 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1g7RohpoQJgJVBbTejRNm_R1tqan1k-AI.pdf
DEBUG: created PDF at  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1g7RohpoQJgJVBbTejRNm_R1tqan1k-AI.pdf
DEBUG:  3 ,  /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1dlbPGXfVJPm2eugJLzMq9jJ9gEX8AD7d.pdf
DEBUG: created PDF at  /con

mupdf: expected object number



18 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/Iraq/Iraq (separata)/Iraq (volumes by year)/Iraq2003 (1)/Iraq2003/4200536 (2).pdf
19 DEBUG: Selecting page 176 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Periodical/OrNS/OrNS 63, 1994.pdf
20 DEBUG: Selecting page 168 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/P/Perring - The Roman House in Britain.pdf
21 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1BZJzbyoBfC7_rTmU7YiQ6ZNguREUi8v-.pdf
22 DEBUG: Selecting page 35 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Univ Presses/0262042517.The.MIT.Press.Governing.Global.Electronic.Networks.International.Perspectives.on.Policy.and.Power.Dec.2008.pdf
23 DEBUG: Selecting page 22 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9Ibqa26YXiRWmZiV0JTa0lkQzQ.pdf
24 DEBUG: Selecting page 1 from /content/dr

mupdf: cannot find startxref


162 DEBUG: Selecting page 785 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1B8RkPCanv0VgASNmPEq-GnKWc065eWwh.pdf
163 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Assyrian/Old Assyrian/OA Secondary sources/Michel_1998_3UHKB_Quelques réflexions sur les archives_p419-433.pdf
164 DEBUG: Selecting page 10 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/L/Larsen, Mogens - Introduction, literacy and social complexity. State and Society 1995 pp. 173-190.pdf
165 DEBUG: Selecting page 68 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Teaching Tools/ANE Shared Files/From G/FromF/ANE Dictionaries/Hittite/HEG/HEG I-K.PDF
166 DEBUG: Selecting page 263 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1fG2hL0Azft9aK9j-J1eXAXBe7_WZmCY2.pdf
167 DEBUG: Selecting page 6 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/a

mupdf: expected object number


231 DEBUG: Selecting page 404 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Библиотека/Semitic languages/Aramaic/Old Aramaic/Muraoka-Porten - A Grammar of Egyptian Aramaic - 2nd Ed _p - 2003.pdf
232 DEBUG: Selecting page 10 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1j4nKoAnRZUJGMRnJFb-14IJIkEPNBVZP.pdf
233 DEBUG: Selecting page 148 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1eYZcE3zNp2tu6vTEEnFhf8oZXCiB6Fsn.pdf


mupdf: No default Layer config


234 DEBUG: Selecting page 134 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Series (or encyclopedia)/ABL/ABL-14.pdf
235 DEBUG: Selecting page 3 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/T/Toy - 1885 - The Massoretic Vowel-System.pdf
236 DEBUG: Selecting page 17 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/13ZxdxuT7OxE8Bt6_PVG3uGc2pKksZbJq.pdf
237 DEBUG: Selecting page 29 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1fnnT59RiUh_MzvqkWp5h-lA78H6halGd.pdf
238 DEBUG: Selecting page 208 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1-4rynbJp7bJevZwbNwFvo7bHu_9jMi5l.pdf
239 DEBUG: Selecting page 64 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/18Q4hegKYyznKDJgEazJNVnZA62OJo8r6.pdf
240 DEBUG: Selecting page 5 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1s3G5JVsO8xr5Y0irThU1CsqaFmrlo5kr.pdf
241 DEBUG

mupdf: expected trailer marker


265 DEBUG: Selecting page 38 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/C/C (dirty)/Carroue, F., Etudes de Geographie et de Topographie Sumeriennes. III ASJ.pdf
266 DEBUG: Selecting page 267 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1DvhDpfRjnIYru87fzoVrITuOeuqW43_o.pdf
267 DEBUG: Selecting page 1 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/0B9Ibqa26YXiRVkp0THA2amhpeUE.pdf
268 DEBUG: Selecting page 230 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Univ Presses/0816620512.University.of.Minnesota.Press.Wild.Knowledge.Science.Language.and.Social.Life.in.a.Fragile.Environment.Jun.1992.pdf
269 DEBUG: Selecting page 34 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1qMB51tHoHLfjPIaXTR-V3iUzP9z5lHd9.pdf
270 DEBUG: Selecting page 2 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1VTbPM3SvWqlH7T4OkZ0qIp8NwfKYtlhF.pd

mupdf: No default Layer config


354 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Area Studies/Assyrian/Old Assyrian/OA Secondary sources/Steiner_1989_OzgucTFS_anitta.pdf
355 DEBUG: Selecting page 131 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1AN9yogoeDIwWhlBnM7ibj6jIRQRG3_6m.pdf
356 DEBUG: Selecting page 179 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1NFE-YqiGhRNCSHbErsCqiAdPb7ueWrz1.pdf
357 DEBUG: Selecting page 201 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Author (or editor)/B/B/Birot, Maurice - Textes Administratifs de la Salle 5 du Palais, 2eme Partie. ARMt 12, 1964.pdf
358 DEBUG: Selecting page 71 from /content/drive/MyDrive/AWCA/PDFtp/ane.pdf.share/By Topic (or field)/Teaching Tools/ANE Shared Files/Archeologie/Kämmerer_Schwiderski_1998_DAW.pdf
359 DEBUG: Selecting page 0 from /content/drive/MyDrive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr/temp/1HhOUmltoC7yhJO1WNnuyRAB4EcGdvK2d.

[]

## 3. Build a Compressed Representative Document

The objective here is to take a stratified sample of the pages in the large representative document.
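
A minimal sketch of one way to take such a sample, assuming each page has already been assigned a single stratum label (for example, a tuple of language and orientation taken from the spreadsheet). The function name, the `fraction` parameter, and the `min_per_stratum` floor are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_sample(
    pages: Sequence[Tuple[int, Hashable]],
    fraction: float,
    rng: random.Random,
    min_per_stratum: int = 1,
) -> List[int]:
    """Sample page indices stratum by stratum.

    `pages` pairs a page index with its stratum label. Each stratum
    contributes about `fraction` of its pages, but never fewer than
    `min_per_stratum` (or all of its pages, if it has fewer than that).
    """
    strata: Dict[Hashable, List[int]] = defaultdict(list)
    for page_index, label in pages:
        strata[label].append(page_index)

    chosen: List[int] = []
    # Iterate in a fixed order so a given seed always gives the same sample.
    for label in sorted(strata, key=repr):
        members = strata[label]
        k = max(min_per_stratum, round(fraction * len(members)))
        k = min(k, len(members))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)
```

Because of the `min_per_stratum` floor, rare strata are over-represented relative to their frequency. The per-stratum counts before and after sampling would need to be recorded so that this enrichment remains a known, quantifiable amount.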

## 4. Evaluate an OCR System

Here, I define a process for measuring _time_ and _overall accuracy_ for an OCR system.
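
As a rough sketch of what such a process could look like: the code below times a hypothetical `ocr` callable over a set of pages and scores it by character error rate (total edit distance divided by total ground-truth length). The choice of character error rate as the accuracy measure is an assumption made for illustration, as are the function names and the shape of the `pages` argument.

```python
import time
from typing import Callable, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate(
    ocr: Callable[[object], str],
    pages: Sequence[Tuple[object, str]],
) -> Tuple[float, float]:
    """Return (elapsed_seconds, character_error_rate) for `ocr`.

    `pages` pairs each page image with its ground-truth text. Only the OCR
    calls are timed; scoring happens afterwards.
    """
    start = time.perf_counter()
    outputs = [ocr(image) for image, _ in pages]
    elapsed = time.perf_counter() - start

    total_errors = 0
    total_chars = 0
    for output, (_, truth) in zip(outputs, pages):
        total_errors += edit_distance(truth, output)
        total_chars += len(truth)
    error_rate = total_errors / total_chars if total_chars else 0.0
    return elapsed, error_rate
```

Pooling errors over all pages, rather than averaging per-page rates, weights each page by the amount of text it contains, which is one reasonable reading of "overall accuracy".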