# GroundTruth

In this notebook, I create a PDF and a corresponding CSV file containing the page text and associated metadata.

I begin with the usual boilerplate.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr

/content/drive/.shortcut-targets-by-id/1W2EROe2FItlaK99U-WY_qaBOc2UD_LI0/AWCA/Colab_notebooks/OCR/pDonovan/awca-ocr


In [31]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.18.19-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.18.19


In [44]:
import pandas as pd
import gspread
from oauth2client.client import GoogleCredentials
from google.colab import auth
import fitz
import os

In [4]:
auth.authenticate_user()
GC = gspread.authorize(GoogleCredentials.get_application_default())

In [5]:
def get_df(title, gc, worksheet=0, has_headers=True):
    """Returns a pandas.DataFrame representation of the
    (WORKSHEET)th worksheet of the Google Sheets (GSHEET)
    file that has title TITLE.
    TITLE - the title of the desired spreadsheet
    GC    - the GSpread credentials needed to retrieve the spreadsheet
    WORKSHEET - the index of the desired worksheet within
        the spreadsheet
    HAS_HEADERS - set to False if the spreadsheet does not
        have a header row at the top.
    It is not necessary to specify the path or the GSHEET
    file extension. Note that the result is unpredictable
    when your Google Drive has multiple spreadsheets
    with the same name (i.e., you do not know which one
    will be opened).
    """
    # For details on how to handle GSHEET files, see
    # https://gspread.readthedocs.io/en/latest/api.html
    contents = gc.open(title).get_worksheet(worksheet).get_all_values()
    if has_headers:
        return pd.DataFrame.from_records(
            data=contents[1:],
            columns=contents[0]
        )
    return pd.DataFrame.from_records(contents)
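
To make the `has_headers` switch concrete without touching Google Sheets, here is a minimal sketch that feeds `pd.DataFrame.from_records` a hand-written list of lists. The rows are hypothetical; they only assume that `get_all_values()` hands back one list of strings per spreadsheet row.

```python
import pandas as pd

# Hypothetical rows, shaped like the list of lists that get_df works with
contents = [
    ["pg", "text"],
    ["1", "first page"],
    ["2", "second page"],
]

# has_headers=True: the first row becomes the column names
with_headers = pd.DataFrame.from_records(data=contents[1:], columns=contents[0])

# has_headers=False: every row is data and the columns are numbered
without_headers = pd.DataFrame.from_records(contents)

print(list(with_headers.columns))     # ['pg', 'text']
print(list(without_headers.columns))  # [0, 1]
print(with_headers["pg"].tolist())    # ['1', '2'] (still strings)
```

The last line shows that the values remain strings, so numeric columns need an explicit conversion before they are used as numbers.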

## Save the True Text

Here, I load the manually transcribed texts and save them along with the corresponding metadata.

In [23]:
transcriptions = get_df('large_sample_page', GC, 0)
data = get_df('large_sample_page', GC, 1)

In [24]:
# The sheet values arrive as strings; convert the page numbers to integers
transcriptions['pg'] = transcriptions['pg'].astype(int)
data['pg'] = data['pg'].astype(int)

In [25]:
data.head()

Unnamed: 0,zero-based index,text,orientation,language,mean_confidence,used_original_text,time,scale,pg,Unnamed: 10,true_orientation,true_language,image,cuneiform,translit,mix_lang,confirmed,Bib,Fm,Abb,two,dev
0,0.0,"\n\n\n166 Michael Heltzer, The Galgila Family ...",0.0,eng,89.9,False,1622237950.0,2.26,1,,0.0,eng,False,False,False,False,True,True,False,True,True,False
1,,,,,,False,,,1,,,,False,False,False,False,False,False,True,False,False,False
2,,,,,,False,,,1,,,,False,False,False,False,False,False,False,False,False,False
3,1.0,\n\n\nURKUNDEN\n\nDES\n\nALTEN REICHS\n\nERSTE...,0.0,deu,84.2,False,1622237950.0,1.75,2,,0.0,deu,True,False,False,False,True,False,False,False,False,False
4,2.0,\n\n\n30 BENJAMIN R. FOSTER\n\nAt 500 x magnif...,0.0,eng,93.9,False,1622237955.0,1.75,3,,0.0,eng,False,False,False,False,True,False,False,False,False,False


In [26]:
transcriptions.head()

Unnamed: 0,sort,sort_permuted,pg,dev,text,transcribed_by,Unnamed: 7,checked by (initials)
0,0.4832038444,0.003096678761,308,False,Gisa Plateau\nDer Taltempel des Chephren wurde...,Adam,,
1,0.2070127014,0.004254688511,1329,False,34 LAWRENCE H. SCHIFFMAN\nthe debate was engag...,Adam,,
2,0.2624726036,0.004493012561,972,False,81\nDAVID OATES\nstanding arch. Their purpose ...,Peter,,
3,0.377048576,0.0064743734,480,False,TEXTS AND FRAGMENTS\nThe three fragments publi...,Peter,,
4,0.0554161696,0.00725169713,17,False,Keilschrifttexte im Kunsthistorischen Museum W...,Peter,,


In [29]:
combined = transcriptions.join(data.set_index('pg'), on='pg', rsuffix='_data')
combined.head()

Unnamed: 0,sort,sort_permuted,pg,dev,text,transcribed_by,Unnamed: 7,checked by (initials),zero-based index,text_data,orientation,language,mean_confidence,used_original_text,time,scale,_data,true_orientation,true_language,image,cuneiform,translit,mix_lang,confirmed,Bib,Fm,Abb,two,dev_data
0,0.4832038444,0.003096678761,308,False,Gisa Plateau\nDer Taltempel des Chephren wurde...,Adam,,,307,\n7 \n \n \n \n \nGisa Pla...,,deu,,True,1622240493,,,0,deu,True,False,False,False,True,False,False,False,False,False
1,0.2070127014,0.004254688511,1329,False,34 LAWRENCE H. SCHIFFMAN\nthe debate was engag...,Adam,,,1328,the debate was engaged by a series of appropri...,,eng,,True,1622250517,,,0,eng,False,False,False,False,True,False,False,False,False,False
2,0.2624726036,0.004493012561,972,False,81\nDAVID OATES\nstanding arch. Their purpose ...,Peter,,,971,\n\n\n&r DAVID OATES\n\nstanding arch. Their p...,0.0,eng,93.9,False,1622247090,1.75,,0,eng,False,False,False,False,True,False,False,False,False,False
3,0.377048576,0.0064743734,480,False,TEXTS AND FRAGMENTS\nThe three fragments publi...,Peter,,,479,\n\n\nTEXTS AND FRAGMENTS\n\nThe three fragmen...,0.0,eng,93.8,False,1622242324,1.75,,0,eng,True,True,False,False,True,False,False,True,False,False
4,0.0554161696,0.00725169713,17,False,Keilschrifttexte im Kunsthistorischen Museum W...,Peter,,,16,Keilschrifttexte im Kunsthistorischen Museum \...,,deu,,True,1622238091,,,0,deu,False,False,False,False,True,False,False,False,False,False
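
The suffixed column names in the output above come from `rsuffix`: when both frames share a column name, the right-hand copy gets `_data` appended. The following is a small sketch with made-up frames (the values are hypothetical, only the column names mirror the notebook) that shows the same left join on `pg`.

```python
import pandas as pd

# Hypothetical stand-ins for the two worksheets
left = pd.DataFrame({
    "pg": [308, 17],
    "text": ["manual A", "manual B"],
    "dev": [False, False],
})
right = pd.DataFrame({
    "pg": [17, 308, 5],
    "text": ["ocr B", "ocr A", "ocr C"],
    "dev": [True, False, False],
})

# A repeated key on the right would repeat the matching left rows
print(right["pg"].is_unique)  # True

# Left join on pg; overlapping right-hand columns receive the suffix
joined = left.join(right.set_index("pg"), on="pg", rsuffix="_data")

print(list(joined.columns))          # ['pg', 'text', 'dev', 'text_data', 'dev_data']
print(joined["text_data"].tolist())  # ['ocr A', 'ocr B']
```

Because `data.head()` shows `pg` equal to 1 on several rows, it may be worth running the `is_unique` check on the real `data.pg` before joining.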


Note that the column `text_data` is untrusted, whereas the column `text` is trusted because it was manually transcribed.

In [30]:
combined.to_csv('./benchmark/ground_truth1/page.csv')
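
One detail to keep in mind when this file is read back: `to_csv` writes the row index as an unnamed first column by default. The sketch below uses an in-memory buffer and a hypothetical two-row frame to show how that column appears on reload and how `index_col=0` restores the original layout.

```python
import io

import pandas as pd

# Hypothetical frame standing in for `combined`
df = pd.DataFrame({"pg": [308, 17], "text": ["Gisa Plateau", "Keilschrifttexte"]})

buffer = io.StringIO()
df.to_csv(buffer)  # default settings: the index is written as the first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as an ordinary column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index

print(list(naive.columns))     # ['Unnamed: 0', 'pg', 'text']
print(list(restored.columns))  # ['pg', 'text']
```

Passing `index=False` to `to_csv` is the alternative if the index carries no information.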

## Make the PDF

Here, I make the PDF containing the pages that correspond to the transcribed text.

In [52]:
sample = fitz.open('./large_sample.pdf')

The `pg` column is 1-based, whereas PyMuPDF page indices are 0-based, so 1 is subtracted from each page number. Only the first 100 rows of `combined` are used.

In [46]:
sample.select([page_number - 1 for page_number in list(combined.pg)[:100]])
# Create the output directory if it does not already exist
os.makedirs('./benchmark/ground_truth1/pages', exist_ok=True)
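
The index conversion can also be isolated in a small helper that checks the page numbers before they reach `select`. This is only a sketch: `to_zero_based`, `page_count`, and `limit` are names invented here, and the page count of 1500 in the usage line is a placeholder rather than the real length of the sample PDF.

```python
def to_zero_based(page_numbers, page_count, limit=None):
    """Convert 1-based page numbers to 0-based indices, preserving order.

    page_count is the number of pages in the source document;
    limit optionally keeps only the first `limit` page numbers.
    """
    selected = list(page_numbers)[:limit]
    # Reject anything that cannot be a valid 1-based page number
    out_of_range = [n for n in selected if not 1 <= n <= page_count]
    if out_of_range:
        raise ValueError(f"page numbers out of range: {out_of_range}")
    return [n - 1 for n in selected]


# Hypothetical usage with three page numbers, keeping the first two
indices = to_zero_based([308, 1329, 972], page_count=1500, limit=2)
print(indices)  # [307, 1328]
```

Failing early on an out-of-range page number is a design choice for this sketch; it avoids discovering a bad row only when the PDF library rejects the index.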

Garbage collection must be requested when saving; otherwise, the data for the deselected pages apparently remains in the output file even though those pages no longer appear in the document.

In [61]:
sample.save('./benchmark/ground_truth1/pages/group0.pdf', garbage=2)