Created on 03/02/2019 
<br>
@author: Benjamin Consolvo


## 1. Pytesseract approach using only the PDF first
First, I use pytesseract to extract the text from the PDF alone, and then use one of pytesseract's functions to get a 0-100 confidence value for each word. This step does not use the provided word library; the confidence values come from the algorithm in Google's Tesseract library. Some of the technical papers outlining the Tesseract OCR method can be found here:
https://tesseract-ocr.github.io/

In [None]:
# Installing required Python libraries for code if needed
# !pip install wand
# !python -m pip install --upgrade pip
# Installed pytesseract from notes on StackExchange: 
# https://stackoverflow.com/questions/48357030/pytesseract-output-is-not-defined
# !pip install tesseract
# !pip install tesseract-ocr

In [None]:
# importing necessary libraries 
from wand.image import Image as Img
try:  
    from PIL import Image
except ImportError:  
    import Image
import pytesseract
import glob
import pandas as pd
import numpy as np
from pytesseract import Output


In [None]:
# Sometimes necessary to point the command for tesseract to the installed location on a Windows machine:
# pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

In [None]:
# PDF to JPG - divides into a JPG file for every page of the PDF
with Img(filename='doc_test.pdf', resolution=300) as img:
    img.compression_quality = 99
    img.save(filename='doc_test.jpg')

In [None]:
# Use PIL to open the image, and use pytesseract to transcribe the image to a text string.
# Must have pytesseract and tesseract properly installed, and have the proper command for opening tesseract
# Making a list of all jpg files if I wanted to use a for-loop over all of my jpg files
im_list = glob.glob('doc_test-*.jpg') 
im_list = sorted(im_list)
# Function to extract text from an image using pytesseract
def ocr_1(filename):
    im = Image.open(filename) # opening the image with PIL Image function
    text = pytesseract.image_to_string(im) # Uses Tesseract to convert JPG image to a string of text
    scores = pytesseract.image_to_data(im, output_type=pytesseract.Output.DATAFRAME) # To provide extracted data about each word extracted from the page.
    b = scores[['text','conf']] # Keeping only the text and confidence columns (each word has a confidence from 0-100)
    b2 = b.conf.replace(-1,np.nan) # Replacing -1 confidence values (rows that are not words) with NaNs, so that they are ignored
    mean_b2 = b2.mean() # Taking the mean confidence value of this page (NaNs are excluded from the mean)
    print('The mean confidence value of', filename, 'is', mean_b2,'.')
    return text

In [None]:
# Running OCR on all 5 images from the PDF and storing text in a list.
lst_scores = []
for i in im_list:
    df_text = ocr_1(i)
    lst_scores.append(df_text)

The confidence values printed above are unrelated to the confidence values that I develop in my function in section 2; they come directly from Google's Tesseract engine.
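
To make the averaging step concrete without needing Tesseract installed, here is a minimal sketch on a hand-made table. The rows are hypothetical and only imitate the `text` and `conf` columns returned by `image_to_data`; as far as I understand, rows that describe layout rather than words carry a confidence of -1, which is why they are masked before averaging. The helper name `mean_word_confidence` is my own.

```python
import numpy as np
import pandas as pd

# Hypothetical rows imitating the 'text' and 'conf' columns of image_to_data output.
# Rows with conf == -1 stand in for layout entries that are not words.
sample = pd.DataFrame({
    "text": [None, "Invoice", "number", None, "t0tal"],
    "conf": [-1.0, 96.0, 91.0, -1.0, 42.0],
})

def mean_word_confidence(data):
    """Average the per-word confidences, skipping rows marked with -1."""
    conf = data["conf"].replace(-1, np.nan)  # NaNs are ignored by mean()
    return conf.mean()

page_mean = mean_word_confidence(sample)
print(round(page_mean, 2))
```

Only the three word rows contribute to the mean; leaving the -1 rows in place would pull the average down for reasons unrelated to OCR quality.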

In [None]:
# Saving the text to a dataframe, and then writing out a CSV to be used in the confidence function later.
df = pd.DataFrame(lst_scores)
df.to_csv('pyt_output.csv',index=False,header=False)

## 2. Writing my own confidence function using doc_test_ocr.csv extracted OCR text and valid_words.txt dictionary
Now, I write my own confidence function based on the text outputs provided.


In theory, the best accuracy measure for a document would be to first transcribe the entire document manually, verify that the manual transcription has no mistakes, and then compare it with the OCR output. This comparison could be done character by character, word by word, paragraph by paragraph, or over the whole document at once.
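
As a rough illustration of such a comparison, the sketch below scores an OCR string against a trusted reference using Python's standard `difflib`. This is not something the notebook does; the function names and the two strings are hypothetical, and `SequenceMatcher` is only one of several reasonable ways to align two texts.

```python
from difflib import SequenceMatcher

def char_similarity(reference, ocr_text):
    """Return a 0-100 similarity score from a character-level alignment."""
    return 100 * SequenceMatcher(None, reference, ocr_text).ratio()

def word_accuracy(reference, ocr_text):
    """Return the percentage of reference words matched, in order, by the OCR words."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    matcher = SequenceMatcher(None, ref_words, ocr_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100 * matched / len(ref_words) if ref_words else 0.0

# Hypothetical strings, not taken from doc_test.pdf
reference = "the quick brown fox"
ocr_text = "the qu1ck brown fox"
char_score = char_similarity(reference, ocr_text)
word_score = word_accuracy(reference, ocr_text)
print(round(char_score, 1), word_score)
```

A single wrong character costs little at the character level but a whole word at the word level, so the two granularities can tell quite different stories about the same page.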
<br>
<br>
However, manually transcribing every document to test OCR accuracy would take significant time and effort, and would not be feasible for large collections of documents with many pages of text.
<br>
<br>
Instead, we can derive a confidence measure from the word library provided. By cross-referencing the words found in the OCR output with the words in the provided dictionary, we can assign a confidence value to each paragraph and, ultimately, to the document as a whole. We cannot assume that this confidence value is a measure of accuracy, as we do not have a "true" copy of the document's text. The aim is instead a confidence measure that reflects the quality of the scanned PDF document.
<br> 
<br> 
Assumptions:
- The word dictionary provided contains all of the words that would be written in the PDF document.
- Comparing the OCR output with the provided dictionary is a good measure of scan quality
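
A minimal sketch of the idea, assuming a tiny made-up dictionary and a single made-up OCR line: normalize the text the same way the dictionary is normalized (lower case, no punctuation, no digits), then report the share of tokens found in the dictionary. The function name `dictionary_confidence` is hypothetical. Note that this sketch counts every occurrence of a word, whereas the `ocr_all` function in this notebook counts distinct matching words per paragraph, so the two will not give identical numbers.

```python
import re

def dictionary_confidence(text, valid_words):
    """Score 0-100: share of normalized tokens in `text` found in `valid_words`."""
    cleaned = text.lower()
    cleaned = re.sub(r"[^\w\s]", "", cleaned)  # drop punctuation
    cleaned = re.sub(r"[\d_]+", "", cleaned)   # drop digits and underscores
    tokens = cleaned.split()                   # split() also collapses repeated spaces
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in valid_words)
    return 100 * hits / len(tokens)

# Hypothetical mini-dictionary and OCR line, for illustration only
toy_words = {"the", "scan", "was", "clear"}
toy_score = dictionary_confidence("The scan, the 2nd SCAN was cleer.", toy_words)
print(round(toy_score, 1))
```

Here "nd" (what remains of "2nd") and the misread "cleer" are the two misses out of seven tokens, which shows how both preprocessing leftovers and genuine OCR errors lower the score.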

In [None]:
import pandas as pd
import numpy as np
import datetime

In [None]:
def ocr_all(ocr_csv,dct_csv):
    #### Reading in the OCR output and words dictionary ####
    ocr_o = pd.read_csv(ocr_csv,names = ["Text"]) # Each line is a new paragraph in this OCR output.
    dct_o = pd.read_csv(dct_csv,names = ["Words"]) # Reading in the dictionary of words
    dct_o = dct_o['Words'].tolist() # Make the words dictionary dataframe into a list.
    #ocr_o.head() # display first 5 lines of the transcribed paragraphs
    
    #### PREPROCESSING OF THE TRANSCRIPTION OUTPUT ####
    '''
    Because the provided word dictionary only contains lower case words, with no numbers and no punctuation, 
    we must preprocess the OCR text so that it can be matched against the word dictionary. We must:
        1. Convert all text to lower case
        2. Remove all punctuation (including underscores)
        3. Remove all digits
        4. Replace more than 1 space with 1 space
    '''
    ocr_lc = pd.DataFrame(ocr_o['Text'].str.lower()) # Put all words in lower case
    ocr_lc['Text'] = ocr_lc['Text'].str.replace(r'[^\w\s]','',regex=True) # remove all punctuation from text string
    ocr_lc['Text'] = ocr_lc['Text'].str.replace('_','',regex=False) # removes all underscores _ from text 
    ocr_lc['Text'] = ocr_lc['Text'].str.replace(r'\d+','',regex=True) # remove all numbers
    ocr_lc['Text'] = ocr_lc['Text'].str.strip() # removes leading and trailing spaces from strings
    
    # For loop to reduce multiple spaces to 1 space for all strings
    lst_0 = []
    for i in ocr_lc['Text']:
        a = ' '.join(i.split())
        lst_0.append(a)
    se_1 = pd.Series(lst_0)
    ocr_lc['Text'] = pd.Series(lst_0)
    '''
    ### Not currently used
    ### Optional for loop to limit word length to eliminate short 1-2 letter words from OCR output
    lst_1 = []
    for i in se_1:
        b = ' '.join( [w for w in i.split() if len(w)>2] )
        lst_1.append(b)
    lst_1
    '''
    #### Write out the preprocessed text file ####
    out_csv = ocr_csv.split('.')[-2] + '_preproc.csv'
    ocr_lc.to_csv(out_csv,index=False) #Write out the preprocessed CSV file for QC
    #ocr_lc.head() # Display top 5 lines of preprocessed text
    
    #### Calculating statistics on words and putting into a master dataframe ####
    ocr_master = pd.DataFrame(ocr_lc['Text'].str.split()) # split each row on whitespace to make lists of words
    l1 = ocr_master['Text'].tolist() # converting the dataframe column Text into a list type
    word_totals = list(map(len,l1)) # getting the number of words in each row
    se = pd.Series(word_totals) # converting list to series
    ocr_master['word_totals'] = se.values # inserting the word totals into the dataframe ocr_master
    #ocr_master.head() # Displaying first 5 lines of new dataframe.
    ser1 = ocr_master['Text']
    dct_set = set(dct_o) # Building the dictionary set once, rather than once per row
    lst1 = []
    for i in ser1:
        countw = len(set(i) & dct_set) # Key line: number of distinct words in each row that also appear in the dictionary
        lst1.append(countw) # Putting the count into a list
    se_wc = pd.Series(lst1) # Making the lst1 with the count into a series before putting into the dataframe
    
    ocr_master['matching_words'] = se_wc.values # Adding the number of matching words as a column in the master dataframe
    ocr_master['d'] = ocr_master['matching_words'] / ocr_master['word_totals'] # Fraction (0-1) of matching words for each paragraph detected
    
    ocr_master['all_words'] = ocr_master['word_totals'].sum() # Total of all words
    ocr_master['weight'] = ocr_master['word_totals'] / ocr_master['all_words'] # Weight for each set of words
    ocr_master['weighted_grade'] = ocr_master['d'] * ocr_master['weight'] # Weighted grade for each set of words
    #ocr_master # Displaying the new data-frame with the new statistics columns
    final_grade = 100 * ocr_master['weighted_grade'].sum() # Adding all of the weighted grades for a final score
    final_grade = round(final_grade) # final confidence out of 100 for the whole document
    return ocr_master, final_grade
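
The weighting at the end of `ocr_all` can be checked on a small example. Because each paragraph's weight is its share of the total word count, the weighted sum works out to the total number of matches divided by the total number of words. The sketch below uses made-up counts (the `toy` table and its numbers are hypothetical) to show that both routes give the same grade.

```python
import pandas as pd

# Hypothetical per-paragraph counts, not taken from doc_test_ocr.csv
toy = pd.DataFrame({"word_totals": [10, 30, 60], "matching_words": [5, 12, 45]})

# Same steps as the weighting columns in ocr_all
toy["d"] = toy["matching_words"] / toy["word_totals"]
toy["weight"] = toy["word_totals"] / toy["word_totals"].sum()
weighted_grade = 100 * (toy["d"] * toy["weight"]).sum()

# Pooled version: the weights cancel the per-paragraph denominators
pooled_grade = 100 * toy["matching_words"].sum() / toy["word_totals"].sum()
print(round(weighted_grade, 6), round(pooled_grade, 6))
```

Keeping the per-paragraph columns is still useful for inspection, since they show which paragraphs pull the document score down.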

In [None]:
# Running the function on the provided output OCR CSV file with the provided (valid_words.txt) dictionary
df_master_1, confidence_1 = ocr_all('doc_test_ocr.csv','valid_words.txt')
print('The scanned image has a confidence of',confidence_1,'out of 100')

In [None]:
# Running the function on the pytesseract output .csv file with the provided (valid_words.txt) dictionary
df_master_2, confidence_2 = ocr_all('pyt_output.csv','valid_words.txt')
print('The scanned image has a confidence of',confidence_2,'out of 100')

## Conclusions
- Many transcribed words were not found in the dictionary, so the confidence score may seem abnormally low (around 50 out of 100).
- With a more comprehensive word dictionary, the score would improve.
- However, a number of the transcribed words are not words in the English language at all, which lowers the score. <br> For example:
  - "ailic", "pes", "cps", "sm", "oc"
- The OCR output provided scored higher (49 / 100) than the output from my crude pytesseract OCR (36 / 100).
- The confidence measure on the scan quality of the PDF is really a measure of the quality of the OCR output, and of whether the transcribed words actually appear in the valid_words.txt dictionary.


In [None]:
# OCR output score table from #2
df_master_1

In [None]:
# Pytesseract score table
df_master_2