# OcrevalUAtion: evaluation of a batch on file level

This notebook provides a simple script that runs the ocrevalUAtion tool (see: https://github.com/impactcentre/ocrevalUAtion) automatically over a batch of files. The tool itself reports only the totals per category for the complete batch; this code reports the scores per file. The output is a .csv file that lists, for every file, the 'CER', 'WER' and 'WER (order independent)' scores obtained by comparing it with the Ground Truth file.

This code is written for files with a filename in this format: **"identifier_pagenr_type.extension"**

**Identifier:** a unique code that is used in the names of both the Ground Truth file and the OCR file <br>
**Pagenr:** the page number of the source page that is captured in the file <br>
**Type:** GT for the Ground Truth file, OCR for the OCR file to be evaluated <br>
**Extension:** any extension that is accepted by the ocrevalUAtion tool. In this example .xml is used.
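
The naming convention can be checked before a batch is started. The following is a minimal sketch, assuming that the identifier itself contains no underscore (the `split('_')` calls in the steps below rely on the same assumption). `parse_filename` and `ParsedName` are hypothetical helpers for illustration and are not part of ocrevalUAtion.

```python
import os
from typing import NamedTuple


class ParsedName(NamedTuple):
    identifier: str
    pagenr: str
    file_type: str
    extension: str


def parse_filename(filename: str) -> ParsedName:
    """Split 'identifier_pagenr_type.extension' into its four parts."""
    stem, extension = os.path.splitext(filename)
    parts = stem.split("_")
    # Assumption: exactly three underscore-separated parts before the extension
    if len(parts) != 3:
        raise ValueError(f"unexpected filename format: {filename!r}")
    identifier, pagenr, file_type = parts
    if file_type not in ("GT", "OCR"):
        raise ValueError(f"type must be GT or OCR, got {file_type!r}")
    return ParsedName(identifier, pagenr, file_type, extension)


parsed = parse_filename("abc123_0007_GT.xml")
print(parsed.identifier, parsed.pagenr, parsed.file_type, parsed.extension)
```

Files that do not follow the convention raise a `ValueError`, so they can be reported instead of silently producing a wrong identifier.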

## Step 0: Import relevant packages

In [2]:
import os
import subprocess
from time import sleep
import pandas as pd
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

## Step 1: Run the ocrevalUAtion tool for every fileset

In [None]:
path = "path_to_ground_truth_files"  # Placeholder: insert the path to the directory in which you have stored the Ground Truth files
location_jar_file = "path_to_jar_file/ocrevalUAtion-1.3.4.jar"  # Placeholder: insert the path to where you have stored the 'ocrevalUAtion-1.3.4.jar' file

# Iterate through all the Ground Truth files
for filename in os.listdir(path):
    # Split the filename, and extract the identifier and pagenr together as identifier
    identifier = filename.split('_')[0] + "_" + filename.split('_')[1]

    file_gt = path + "/" + identifier + "_GT.xml"
    # Placeholder: insert the path to the OCR files; the identifier picks the matching file
    file_ocr = "path_to_ocr_files/" + identifier + "_OCR.xml"
    # Placeholder: choose the directory and the name for the output file
    output = "path_to_output_directory/" + identifier + "_ocr.html"

    # Run the ocrevalUAtion tool with the given parameters. The arguments are passed as a list,
    # which works without a shell and also handles paths that contain spaces.
    process = subprocess.call(["java", "-cp", location_jar_file, "eu.digitisation.Main", "-gt", file_gt, "-ocr", file_ocr, "-o", output])

    # Let the program wait for a few seconds, so the system does not get overloaded
    sleep(5)
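
The call above can also be wrapped so that missing files are skipped and failed runs are noticed. This is a sketch only: `build_command` and `run_evaluation` are hypothetical helper names, and it assumes that the tool follows the usual convention of exit code 0 on success, which is not confirmed here.

```python
import subprocess
from pathlib import Path


def build_command(jar_file, file_gt, file_ocr, output):
    """Return the java invocation as a list of arguments."""
    return [
        "java", "-cp", str(jar_file), "eu.digitisation.Main",
        "-gt", str(file_gt), "-ocr", str(file_ocr), "-o", str(output),
    ]


def run_evaluation(jar_file, file_gt, file_ocr, output):
    """Run the tool for one file pair; return True if java exits with code 0."""
    # Skip the pair when one of the input files is missing
    for required in (jar_file, file_gt, file_ocr):
        if not Path(required).is_file():
            print(f"Skipping: {required} not found")
            return False
    # Assumption: exit code 0 means the report was written successfully
    return subprocess.call(build_command(jar_file, file_gt, file_ocr, output)) == 0
```

Inside the loop, `run_evaluation(location_jar_file, file_gt, file_ocr, output)` could then replace the direct `subprocess.call`, and the identifiers for which it returns `False` could be collected for a later check.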

## Step 2: Extract the scores from the output files and put them in a .csv file

In [None]:
path = "path_to_output_directory"  # Placeholder: insert the path to the directory in which the output files are stored
# Collect one row of scores per output file
rows = []

# Iterate through all the output files
for filename in os.listdir(path):
    # Open the file and parse it with Python's built-in HTML parser
    with open(path + "/" + filename, encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
    # Split the filename, and extract the identifier and pagenr together as identifier
    identifier = filename.split('_')[0] + "_" + filename.split('_')[1]
    # Find the first table (this is the table in which the scores are stored)
    table = soup.find("table")
    # Find the tags in which 'CER', 'WER', and 'WER (order independent)' are stored and take the next tag to get the score
    cerScore = table.find('td', string='CER').find_next('td')
    werScore = table.find('td', string='WER').find_next('td')
    werOIScore = table.find('td', string='WER (order independent)').find_next('td')
    # Add the scores of the file, including its identifier, to the list of rows
    rows.append({'identifier': identifier, 'CER': cerScore.text, 'WER': werScore.text, 'WER (order independent)': werOIScore.text})

# Create a dataframe from the collected rows (DataFrame.append was removed in pandas 2.0)
dfScore = pd.DataFrame(rows, columns=['identifier', 'CER', 'WER', 'WER (order independent)'])
# Write the created dataframe to a .csv file
dfScore.to_csv('Score_ocr_evalution.csv', sep=';')

## Example output

;identifier;CER;WER;WER (order independent) <br>
0;file1;11,74;23,20;20,37 <br>
1;file2;2,84;7,40;6,95 <br>
2;file3;3,86;9,81;9,01 <br>
3;file4;2,76;6,91;6,53 <br>
4;file5;3,30;12,11;11,55 <br>
5;file6;11,46;25,18;20,39 <br>
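
The scores in this file are stored as text with a decimal comma, so they need to be converted before any calculation. The sketch below shows one way to read them back with the standard library, assuming the semicolon-separated layout shown above; `read_scores` is a hypothetical helper, and the sample text repeats the first two rows of the example.

```python
import csv
import io

SCORE_COLUMNS = ("CER", "WER", "WER (order independent)")


def read_scores(csv_text):
    """Map each identifier to its scores, converted from '11,74' to 11.74."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    scores = {}
    for row in reader:
        scores[row["identifier"]] = {
            # Replace the decimal comma so that float() can parse the value
            name: float(row[name].replace(",", "."))
            for name in SCORE_COLUMNS
        }
    return scores


sample = (
    ";identifier;CER;WER;WER (order independent)\n"
    "0;file1;11,74;23,20;20,37\n"
    "1;file2;2,84;7,40;6,95\n"
)
scores = read_scores(sample)
print(scores["file1"]["CER"])
```

For a file on disk, the same function can be given the result of `open('Score_ocr_evalution.csv', encoding='utf-8').read()`.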

## Example output in table

| identifier | CER | WER | WER (order independent) |
|---|---|---|---|
| file1 | 11,74 | 23,20 | 20,37 |
| file2 | 2,84 | 7,40 | 6,95 |
| file3 | 3,86 | 9,81 | 9,01 |
| file4 | 2,76 | 6,91 | 6,53 |
| file5 | 3,30 | 12,11 | 11,55 |
| file6 | 11,46 | 25,18 | 20,39 |