Confidence value calculation (CC - WC - PC) - annotation extension #23

Open
ntra00 opened this Issue Jul 1, 2014 · 1 comment

ntra00 commented Jul 1, 2014

Submitter: CCS  (J.Bauer@content-conversion.com)
Submitted: 2013-02
Status: Discussion
Backwards compatible: Yes (Only Annotation)
To ALTO Version: ?

The schema does not define how the page, word and character confidence values are calculated.
To make confidence values comparable, the idea is to share the calculation methods and agree on a common rule.

Below are the calculation methods CCS has used so far in docWorks.

Precondition detail:

ABBYY FineReader up to version 7.1: the character confidence range is 28 (good) to 55 (bad).

ABBYY FineReader from version 8.0 on: the character confidence range is 0 (good) to 100 (bad).

These ranges have to be transformed into the range defined by ALTO (0 to 9; see below), which loses precision.

Because of that, CCS calculates WC from the more precise ABBYY values (range 28 - 55 / 0 - 100) rather than from the rounded CC digits. As a result, the WC in an ALTO file can differ by rounding from a WC recomputed from the CC values in the same file!

CC:

In ALTO the character confidence is defined on a scale of "0" to "9": "0" is best, "9" is worst.

Character confidence is derived from the ABBYY character confidence.
The results from the FineReader engines are normalized to the ALTO scale of 0 to 9 per character.
E.g. the word FAX, recognized 100% correctly by the OCR engine, gets a CC of 000: one digit for every character.
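
The exact mapping from the ABBYY ranges to the ALTO digits is not spelled out in this issue. As a minimal sketch, assuming a plain linear rescaling with rounding (the names `abbyy_to_alto_cc` and `cc_string` and the clamping step are illustrative assumptions, not docWorks code), both engine ranges could be handled like this:

```python
def abbyy_to_alto_cc(value, good, bad):
    """Map an ABBYY character confidence onto the ALTO scale 0 (best) .. 9 (worst).

    Assumes a linear mapping between `good` and `bad`; the real docWorks
    mapping is not given in this issue.
    """
    # Clamp to the engine's documented range before rescaling.
    clamped = min(max(value, good), bad)
    fraction = (clamped - good) / (bad - good)
    return int(round(fraction * 9))


def cc_string(values, good, bad):
    """Build a CC attribute value: one digit per character."""
    return "".join(str(abbyy_to_alto_cc(v, good, bad)) for v in values)


# FineReader up to 7.1: 28 (good) .. 55 (bad); a perfectly recognized "FAX".
print(cc_string([28, 28, 28], good=28, bad=55))  # 000
# FineReader 8.0 and later: 0 (good) .. 100 (bad).
print(cc_string([0, 12, 100], good=0, bad=100))  # 019
```

Under this assumption many distinct engine values collapse onto the same digit, which is where the precision is lost.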

WC:

Word Confidence is determined based on character level confidence.
The better the character confidence the better the word confidence.
In addition the word confidence is influenced by the dictionary verification.

If a word is found in the dictionary, its word confidence value increases.
The longer the word, the higher the confidence value.
(Explanation: if a long word, e.g. with 15 characters, is found in the dictionary, it is very likely correct, because a misrecognized character is unlikely to produce another dictionary word by accident. Short words like 'fun' / 'fan' will both be found in the dictionary, so the dictionary check gives no additional assurance that the right word was detected.)
For that reason, words with 2 or fewer characters are not checked against the dictionary.

The word confidence is normalized to an interval of "0.00" to "1.00": "1.00" is best, "0.00" is worst.
Calculation:
double( (sum CC)/numChar )/1000.0 - normalization to (0,1)
Example:

    <String HPOS="5485" VPOS="4654" WIDTH="468" HEIGHT="109" CONTENT="quorum" WC="1.00" CC="211110"/>
    <SP HPOS="5953" VPOS="4762" WIDTH="104"/>
    <String HPOS="6057" VPOS="4606" WIDTH="524" HEIGHT="132" CONTENT="conliflmg" WC="0.89" CC="110121122"/>
    <SP HPOS="6581" VPOS="4762" WIDTH="61"/>
    <String HPOS="6643" VPOS="4592" WIDTH="128" HEIGHT="118" CONTENT="of" WC="0.93" CC="02"/>
    <SP HPOS="6770" VPOS="4762" WIDTH="52"/>
    <String HPOS="6822" VPOS="4635" WIDTH="61" HEIGHT="66" CONTENT="a" WC="0.85" CC="2"/>
    <SP HPOS="6883" VPOS="4762" WIDTH="71"/>
    <String HPOS="6954" VPOS="4597" WIDTH="468" HEIGHT="137" CONTENT="majority" WC="1.00" CC="12101111"/>
    <SP HPOS="7422" VPOS="4762" WIDTH="52"/>
    <String HPOS="7474" VPOS="4578" WIDTH="123" HEIGHT="113" CONTENT="of" WC="0.96" CC="01"/>

When a word is in the dictionary, the confidence is 1.0; otherwise it is computed, mainly as the average of all "reversed" CC values. For example, for "212": ((10-2) + (10-1) + (10-2)) / 3 = 25/3 = 8.33, which gives a WC of 0.83.
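
To make the "reversed CC" average concrete, here is a small sketch that applies it to the CC strings from the ALTO example above and prints the result next to the WC stored in the file. The function name `wc_from_cc` is made up for illustration; only the arithmetic comes from this issue.

```python
def wc_from_cc(cc):
    """Average of the 'reversed' CC digits (10 - digit), scaled to 0..1."""
    reversed_values = [10 - int(digit) for digit in cc]
    return sum(reversed_values) / len(reversed_values) / 10.0


# (CONTENT, WC, CC) copied from the ALTO example in this issue.
examples = [
    ("quorum", 1.00, "211110"),
    ("conliflmg", 0.89, "110121122"),
    ("of", 0.93, "02"),
    ("a", 0.85, "2"),
    ("majority", 1.00, "12101111"),
    ("of", 0.96, "01"),
]

for content, wc_in_file, cc in examples:
    print(f"{content:10} file WC={wc_in_file:.2f}  from CC={wc_from_cc(cc):.2f}")

print(f"{wc_from_cc('212'):.2f}")  # 0.83, the worked example from the text
```

The recomputed values are close to the stored ones but not identical (for instance 0.88 versus 0.89 for "conliflmg"), which fits the statement that WC is derived from the more precise engine values. "quorum" and "majority" carry 1.00 in the file, consistent with the dictionary rule.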

For short words (fewer than 3 characters) the risk of incorrect characters is higher, so the confidence is calculated differently (details still pending).

Details:

FR9 (also FR8.1 and FR10): the ABBYY character confidence range is 0-100.
The character confidence is normalized to (0,9). The word confidence is the sum of the character confidences divided by the number of characters, i.e. their average.

Before the WC attribute is written to ALTO, the word is checked against the ABBYY dictionary. Whenever the word is found in the dictionary, the confidence is increased:
1000 - ((1000 - charConfLevel) / (chars.GetSize()*3));

Otherwise, if the word is not found in the ABBYY dictionary, the initially determined word confidence level is used and normalized to (0,1).

Note:
charConfLevel: word confidence, i.e. the average confidence on a character basis.
chars.GetSize(): number of characters in the word.
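
The dictionary boost can be tried out with a short sketch. It assumes that `charConfLevel` lives on an internal 0..1000 scale (inferred from the formula and the division by 1000.0) and uses floating-point division; whether the original code divides integers is not stated here. The function names are illustrative.

```python
def boosted_word_confidence(char_conf_level, num_chars):
    """Dictionary boost: 1000 - ((1000 - charConfLevel) / (chars.GetSize() * 3)).

    Floating-point division is an assumption; the issue does not say
    which numeric types the original code uses.
    """
    return 1000 - ((1000 - char_conf_level) / (num_chars * 3))


def to_wc(confidence_0_1000):
    """Normalize the assumed internal 0..1000 value to the WC range 0..1."""
    return confidence_0_1000 / 1000.0


# Same starting confidence, increasing word length.
for length in (3, 6, 15):
    boosted = boosted_word_confidence(850, length)
    print(length, round(boosted, 1), f"{to_wc(boosted):.2f}")
```

With the same starting confidence the boosted value grows with the word length, matching the rule that longer dictionary words get a higher confidence. Taken on its own, the formula reaches exactly 1000 only when `charConfLevel` is already 1000.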

PC:

The Page Confidence is calculated as the average dictionary confidence of all alpha-numeric characters.
The page confidence is normalized to an interval of "0.00" to "1.00": "1.00" is best, "0.00" is worst.

Details:
The confidence is calculated by adding up the confidences of all XMLTexts (sum of character confidences):

set confidenceSum [expr $confidenceSum + $noOfAlphaNumChars * $confidence ]
Finally, the total page confidence is calculated with this formula:
return [ expr $confidenceSum/$pgNoOfAlphaNumChars ]

Note:

confidence: XMLText dictionary confidence

The sum of the character confidences divided by the number of characters on the page (normalized in the end to (0,1)) determines the Page Confidence.
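
As a rough illustration, the page confidence amounts to a weighted average in which each XMLText contributes its dictionary confidence weighted by its number of alpha-numeric characters. The sketch below assumes the input is a list of `(num_alnum_chars, confidence)` pairs and that the page total equals the sum of the per-block counts; the function name, the input shape and the `None` return for an empty page are assumptions, not docWorks behavior.

```python
def page_confidence(text_blocks):
    """Weighted average of per-block dictionary confidence.

    `text_blocks` is a list of (num_alnum_chars, confidence) pairs, one per
    XMLText; this input shape is an assumption made for the sketch.
    Returns None when the page has no alpha-numeric characters.
    """
    confidence_sum = 0.0
    total_chars = 0
    for num_alnum_chars, confidence in text_blocks:
        confidence_sum += num_alnum_chars * confidence
        total_chars += num_alnum_chars
    if total_chars == 0:
        return None
    return confidence_sum / total_chars


# Two blocks: 120 characters at 0.95 and 30 characters at 0.60.
print(round(page_confidence([(120, 0.95), (30, 0.60)]), 2))  # 0.88
```

The larger block dominates the result, so a short, badly recognized caption lowers the page value much less than a badly recognized body text would.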

If there are zones but no OCR, the returned confidence value is 999, signalling a bad confidence level.
For blank pages the returned confidence value is 100, i.e. full confidence.

@ntra00 ntra00 self-assigned this Jul 1, 2014


ntra00 Jul 1, 2014

Member

Markus.Enders@bl.uk said
at 11:41 am on Feb 21, 2013

Regarding the Character Confidence (CC):
I think the new Glyph would be a replacement and extension to the CC attribute. It would allow us to store additional information and use the same value range for the confidence as the WC and PC are using (0-1).
For backwards compatibility we should still support the CC attribute but define it as "deprecated" in the schema documentation.


@ntra00 ntra00 added the 2 discussion label Jul 1, 2014

@jukervin jukervin changed the title from 2013-02 confidence value calculation (CC - WC - PC) - annotation extension to Confidence value calculation (CC - WC - PC) - annotation extension Sep 10, 2014

@Jo-CCS Jo-CCS referenced this issue Mar 23, 2016

Closed

Glyphs (IMPACT) #26

@acpopat acpopat assigned acpopat and unassigned ntra00 Sep 20, 2017
