# Measure OCR

See [usage](https://among.github.io/fusus/use/).

See [example](https://github.com/among/fusus/blob/master/example/doExample.ipynb).

How to read this notebook:

1.  *best experience*
    get this repository on your computer and run `jupyter lab`.
    Also install the table of contents extension in Jupyter Lab, since this is a lengthy notebook.
    You can then run the code cells yourself.
1.  *good reading experience*
    read it on [NbViewer](https://nbviewer.jupyter.org/github/among/fusus/blob/master/example/doExample.ipynb)
1.  *suboptimal*
    read it directly on [GitHub](https://github.com/among/fusus/blob/master/example/doExample.ipynb)
    (it takes a long time to load)

# Previously ...

We have run OCR on all pages in the in-directory,
by means of [checkOcr.ipynb](checkOcr.ipynb).

In [1]:
%load_ext autoreload
%autoreload 2
!cd `pwd`

Import the fusus package (see [install](https://among.github.io/fusus/about/install.html)).

In [2]:
from fusus.book import Book
from fusus.ocr import getProofColor

Initialize the processing line.

In [3]:
B = Book()

  0.05s Loading for Kraken: ~/github/among/fusus/model/arabic_generalized.mlmodel
  1.16s model loaded


# Visualise OCR confidence

Here is how we color the degrees of confidence reported by the Kraken OCR engine.
We translate a confidence (a number between 0 and 100, inclusive) into an HSL color:

In [4]:
for i in range(101):
    clr = getProofColor(100 - i, test=True)
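
As a rough illustration of this kind of mapping, here is a minimal sketch that turns a confidence into a hue between red and green. It is not the code behind `getProofColor`: the function name `confidence_to_hsl`, the hue range (0 to 120 degrees), and the fixed saturation and lightness are all assumptions made for this example.

```python
def confidence_to_hsl(conf, saturation=100, lightness=50):
    """Map a confidence in 0..100 to a CSS HSL color string.

    Hypothetical helper, not part of fusus.
    Assumption: 0 maps to red (hue 0) and 100 maps to green (hue 120).
    """
    if not 0 <= conf <= 100:
        raise ValueError(f"confidence must be between 0 and 100, got {conf}")
    # Scale the confidence linearly onto the red-to-green part of the hue circle.
    hue = round(conf * 120 / 100)
    return f"hsl({hue}, {saturation}%, {lightness}%)"


# Print a few sample points of the scale.
for conf in (0, 50, 100):
    print(conf, confidence_to_hsl(conf))
```

A linear hue ramp keeps low-confidence material visually distinct (red) from high-confidence material (green), which may or may not match the scale that fusus actually uses.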

# Analyse OCR confidence

## First, a single page

In [6]:
B.measureQuality(132)

 3m 27s Batch of 1 pages: 132
 3m 27s Start measuring ocr quality of these images
   |     0.00s word-confidences of OCR results for 1 pages   


item,# of words,min,max,average,notes
overall,36,87,100,96,
p132,36,87,100,96,


   |     0.00s char-confidences of OCR results for 1 pages


item,# of chars,min,max,average,notes
overall,200,53,100,96,
p132,200,53,100,96,


   |     0.01s by-char-confidences of OCR results for 30 characters


item,# of chars,min,max,average,worst results
⌊ ⌋,29,84,100,98,p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132
⌊!⌋,1,63,63,63,p132
⌊.⌋,2,91,99,95,p132 p132
⌊0⌋,1,99,99,99,p132
⌊5⌋,1,100,100,100,p132
⌊6⌋,1,100,100,100,p132
⌊9⌋,1,66,66,66,p132
⌊ء⌋,2,59,100,80,p132 p132
⌊ا⌋,26,81,100,98,p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132 p132
⌊ب⌋,6,99,100,100,p132 p132 p132 p132 p132 p132
⌊ة⌋,7,78,100,94,p132 p132 p132 p132 p132 p132 p132
⌊ت⌋,3,93,100,97,p132 p132 p132
⌊خ⌋,1,100,100,100,p132
⌊د⌋,6,53,100,89,p132 p132 p132 p132 p132 p132
⌊ر⌋,1,87,87,87,p132
⌊س⌋,5,71,100,94,p132 p132 p132 p132 p132
⌊ص⌋,2,93,100,96,p132 p132
⌊ط⌋,4,79,100,94,p132 p132 p132 p132
⌊ع⌋,6,94,100,99,p132 p132 p132 p132 p132 p132


 3m 27s all done
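
The word and character tables above report a count, minimum, maximum and average per page, plus an `overall` row. The sketch below shows one way such a summary could be computed from raw confidences. The `summarize` helper and its input layout (a dict from page number to a list of confidences) are hypothetical, and the rounding that fusus applies to the average may differ.

```python
def summary_row(item, values):
    """Return (item, count, min, max, rounded average) for a list of confidences."""
    return (item, len(values), min(values), max(values), round(sum(values) / len(values)))


def summarize(confidences_by_page):
    """Build an 'overall' row followed by one row per page.

    Hypothetical helper, not part of fusus.
    Pages without any confidences are skipped.
    """
    rows = []
    all_values = []
    for page, values in sorted(confidences_by_page.items()):
        if not values:
            continue
        all_values.extend(values)
        rows.append(summary_row(f"p{page:03d}", values))
    if all_values:
        # The overall row aggregates every value across all pages.
        rows.insert(0, summary_row("overall", all_values))
    return rows


# Made-up confidences, for illustration only.
for row in summarize({132: [87, 100, 96], 111: [41, 100]}):
    print(",".join(str(cell) for cell in row))
```

Note that the overall average is taken over all individual values, not over the per-page averages, so pages with more words or characters weigh more heavily.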


## All pages

In [7]:
B.measureQuality()

 3m 31s Batch of 19 pages: 47-48,58-59,63,67,101-102,111-113,121-122,131-132,200,300,400,999
 3m 31s Start measuring ocr quality of these images
   |     0.00s word-confidences of OCR results for 19 pages  


item,# of words,min,max,average,notes
overall,3839,41,100,95,
p047,174,58,100,95,
p048,174,65,100,95,
p058,168,62,100,96,
p059,266,46,100,95,
p063,265,58,100,96,
p067,258,72,100,96,
p101,212,57,100,94,
p102,359,55,100,95,
p111,28,41,100,80,
p112,340,62,100,96,
p113,255,63,100,95,
p121,167,55,100,94,
p122,105,62,100,93,
p131,279,74,100,96,
p132,36,87,100,96,
p200,170,66,100,93,
p300,245,77,100,96,
p400,324,46,100,95,


   |     0.00s char-confidences of OCR results for 19 pages


item,# of chars,min,max,average,notes
overall,19795,19,100,96,
p047,934,23,100,95,
p048,900,40,100,96,
p058,784,51,100,96,
p059,1369,38,100,96,
p063,1273,39,100,96,
p067,1327,42,100,96,
p101,1113,33,100,95,
p102,1821,42,100,96,
p111,111,24,100,80,
p112,1651,37,100,96,
p113,1338,39,100,96,
p121,759,37,100,95,
p122,616,38,100,93,
p131,1487,39,100,97,
p132,200,53,100,96,
p200,891,19,100,93,
p300,1406,44,100,96,
p400,1736,40,100,96,


   |     0.01s by-char-confidences of OCR results for 61 characters


item,# of chars,min,max,average,worst results
⌊ ⌋,3489,39,100,98,p111 p200 p300 p048 p111 p122 p047 p058 p111 p113 p400 p047 p058 p101 p102 p200 p400 p059 p067 p102
⌊!⌋,6,63,100,87,p132 p113 p113 p113 p112 p113
⌊(⌋,101,53,100,93,p058 p200 p048 p122 p122 p101 p113 p122 p112 p059 p063 p122 p400 p111 p121 p101 p112 p058 p058 p300
⌊)⌋,98,44,100,95,p067 p200 p121 p122 p122 p200 p111 p048 p121 p063 p063 p400 p101 p113 p122 p102 p122 p059 p113 p101
⌊-⌋,46,75,100,97,p047 p067 p102 p059 p101 p113 p102 p048 p102 p102 p063 p101 p048 p102 p059 p063 p067 p113 p200 p048
⌊.⌋,157,46,100,92,p059 p121 p102 p101 p063 p102 p121 p400 p058 p122 p121 p113 p121 p102 p101 p102 p101 p200 p101 p122
⌊0⌋,14,51,100,90,p112 p101 p102 p200 p102 p102 p102 p132 p101 p102 p102 p102 p102 p102
⌊1⌋,83,32,100,94,p111 p048 p112 p300 p102 p122 p102 p122 p200 p101 p102 p067 p101 p111 p047 p113 p121 p102 p102 p101
⌊2⌋,49,50,100,95,p102 p048 p059 p047 p113 p102 p101 p101 p102 p122 p101 p102 p102 p400 p063 p101 p047 p048 p048 p058
⌊3⌋,43,49,100,93,p122 p122 p122 p102 p121 p102 p101 p102 p101 p101 p048 p400 p101 p200 p102 p400 p047 p058 p059 p063
⌊4⌋,38,57,100,95,p113 p101 p101 p101 p059 p121 p101 p112 p113 p101 p101 p063 p101 p112 p122 p101 p101 p102 p112 p113
⌊5⌋,28,44,100,92,p102 p101 p059 p102 p048 p101 p063 p102 p102 p101 p101 p101 p102 p047 p113 p131 p058 p058 p059 p067
⌊6⌋,22,69,100,95,p400 p102 p101 p122 p102 p400 p067 p122 p063 p102 p102 p400 p048 p059 p063 p067 p067 p101 p102 p102
⌊7⌋,24,64,100,95,p067 p067 p059 p112 p102 p101 p102 p102 p102 p112 p102 p102 p112 p101 p101 p101 p102 p102 p102 p102
⌊8⌋,35,66,100,95,p101 p102 p113 p102 p113 p101 p102 p101 p112 p102 p122 p113 p113 p101 p102 p102 p102 p102 p112 p048
⌊9⌋,18,57,100,94,p102 p132 p101 p400 p101 p112 p059 p059 p059 p063 p101 p102 p102 p102 p102 p102 p102 p102
⌊:⌋,55,59,100,92,p121 p400 p112 p400 p200 p112 p102 p102 p131 p101 p200 p112 p101 p102 p102 p102 p131 p112 p102 p112
⌊=⌋,1,78,78,78,p101
⌊[⌋,3,77,100,88,p200 p400 p112


 3m 31s all done


Now we only regenerate the proof pages, without showing the statistics:

In [8]:
B.measureQuality(showStats=False, updateProofs=True)

 3m 37s Batch of 19 pages: 47-48,58-59,63,67,101-102,111-113,121-122,131-132,200,300,400,999
 3m 37s Start measuring ocr quality of these images
 3m 37s   end regenrating proof files
 3m 39s all done  19 999.jpg                                 
