# Example: Detect and Repair Unrecognized Characters
PyMuPDF v1.19.0 introduced built-in OCR support, which can recognize characters that normal text extraction returns as illegible.
The following script reads a document page via `get_text("dict")`. Whenever a text span contains unrecognized characters (which MuPDF returns as the Unicode replacement character `0xFFFD`), the script OCRs that span and uses the recognized text instead.
A similar approach has always been possible with the `easyocr` package or by invoking an installed Tesseract via `subprocess`. The built-in solution is cleaner, works with the included batteries, and is more than 10 times faster.
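
The detection step can be sketched on its own, without PyMuPDF installed. The snippet assumes the nested `blocks` / `lines` / `spans` layout that `get_text("dict")` delivers and uses a hand-built dictionary in place of a real page. The helper name `damaged_spans` and the sample data are hypothetical and serve illustration only.

```python
REPLACEMENT = chr(0xFFFD)  # the character MuPDF emits for unrecognized glyphs


def damaged_spans(page_dict):
    """Yield every span whose text contains at least one U+FFFD."""
    for block in page_dict.get("blocks", []):
        # blocks without a "lines" key (e.g. image blocks) are skipped
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if REPLACEMENT in span.get("text", ""):
                    yield span


# Hand-built stand-in for page.get_text("dict"); values are made up.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "Constructor now requires",
                         "bbox": (72, 72, 200, 84)},
                        {"text": "the page\u2019s \ufffdedia\ufffdo\ufffd.",
                         "bbox": (72, 90, 220, 102)},
                    ]
                }
            ],
        },
        {"type": 1},  # stand-in for an image block: no "lines" key
    ]
}

hits = list(damaged_spans(sample_page))
print(len(hits), [s["bbox"] for s in hits])
```

Only the spans yielded here would need to be rendered and passed to OCR, which keeps the number of OCR invocations small.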

In [1]:
"""
Demo script using MuPDF OCR.

Extract text of a page and interpret unrecognized characters using OCR.
MuPDF codes unrecognizable characters as 0xFFFD = 65533.
Extraction option is "dict", which delivers contiguous pieces of text within
one line that share the same font properties (color, fontsize, etc.). Together
with the language parameter, this helps Tesseract find the correct character.

The basic approach is to invoke OCR only if the span text contains at least
one chr(65533) character.

--------------
This demo will only OCR text that is already known to be text. This means it
does not look at parts of a page containing images or text encoded as drawings.
--------------

Dependencies:
PyMuPDF v1.19.0 or later
"""
import fitz
import time

assert tuple(map(int, fitz.VersionBind.split("."))) >= (1, 19, 0), "Need PyMuPDF v1.19.0 or later"
assert fitz.TESSDATA_PREFIX, "Need Tesseract's tessdata for OCR function"
DPI = 400  # high resolution
OCR_TIME = 0
PIX_TIME = 0


def get_tessocr(page, span):
    """Return OCR-ed span text using Tesseract.

    Args:
        page: fitz.Page
        span: a span from get_text("dict")
    Returns:
        The OCR-ed text of the bbox.
    """
    global OCR_TIME, PIX_TIME
    # Step 1: Make a high-resolution image of the span bbox.
    t0 = time.perf_counter()
    # scale by DPI / 72 (PDF user space uses 72 units per inch)
    pix = page.get_pixmap(matrix=fitz.Matrix(DPI / 72, DPI / 72), clip=span["bbox"])
    t1 = time.perf_counter()

    # Step 2: OCR the bbox. Delivers a 1-page PDF in memory
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    ocrpage = ocrpdf[0]
    new_text = ocrpage.get_text()  # extract OCR-ed text
    t2 = time.perf_counter()
    OCR_TIME += t2 - t1
    PIX_TIME += t1 - t0

    # Tesseract ignores leading spaces, hence some corrections
    old_text = span["text"]  # the original span text
    # compute number of leading spaces
    lblanks = len(old_text) - len(old_text.lstrip())

    # prefix OCRed text with this many spaces
    new_text = " " * lblanks + new_text

    # walk through old text replacing illegible chars with the OCR result
    return_string = ""  # we will return this string
    for i, char in enumerate(old_text):
        if char != chr(0xFFFD) or i >= len(new_text):
            # this char was no problem, or the OCR text is too short
            return_string += char
        else:
            return_string += new_text[i]  # else take recognized char
    print("before OCR: '%s'" % old_text)
    print(" after OCR: '%s'" % return_string)
    return return_string


doc = fitz.open("1page.pdf")
ocr_count = 0
for page in doc:
    blocks = page.get_text("dict", flags=0)["blocks"]
    for b in blocks:
        for l in b["lines"]:
            for s in l["spans"]:
                text = s["text"]
                if chr(0xfffd) in text:  # invalid characters encountered!
                    # invoke OCR
                    ocr_count += 1
                    new_text = get_tessocr(page, s)

print("-------------------------")
print("OCR invocations: %i." % ocr_count)
if ocr_count > 0:  # averages are undefined if OCR was never invoked
    print(
        "Pixmap time: %g (avg %g) seconds."
        % (round(PIX_TIME, 5), round(PIX_TIME / ocr_count, 5))
    )
    print(
        "OCR time: %g (avg %g) seconds."
        % (round(OCR_TIME, 5), round(OCR_TIME / ocr_count, 5))
    )


before OCR: ' – integer containing the number of bytes of one line of the pi��ap’s IRe�t '
 after OCR: ' – integer containing the number of bytes of one line of the pixmap’s IRect '
before OCR: 'Co�st�u�to� �ow �e�ui�es the page’s �edia�o�.  '
 after OCR: 'Constructor now requires the page’s mediabox.  '
before OCR: 'Co�st�u�to� �ow �e�ui�es the page’s �edia�o�. '
 after OCR: 'Constructor now requires the page’s mediabox. '
before OCR: 'The �ase �lass fo� P�MuPDF’s '
 after OCR: 'The base class for PYMuPDF’s '
-------------------------
OCR invocations: 4.
Pixmap time: 0.00855 (avg 0.00214) seconds.
OCR time: 0.14418 (avg 0.03605) seconds.
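
The output above reflects a purely positional merge: each `0xFFFD` in the original span is replaced by the OCR character at the same index, which is also why `PYMuPDF` appears where OCR misread a lowercase letter. The following is a minimal standalone sketch of that merge, not the script's exact code. The function name `merge_ocr` is hypothetical, and the sketch makes two assumptions of its own: it strips whitespace from the OCR text before padding, and it keeps the original `0xFFFD` whenever the OCR text is too short to supply a replacement.

```python
REPLACEMENT = chr(0xFFFD)


def merge_ocr(old_text, ocr_text):
    """Replace each U+FFFD in old_text by the OCR character at that index."""
    # OCR drops leading spaces, so re-create them to keep indices aligned
    lblanks = len(old_text) - len(old_text.lstrip())
    padded = " " * lblanks + ocr_text.strip()

    merged = []
    for i, char in enumerate(old_text):
        if char == REPLACEMENT and i < len(padded):
            merged.append(padded[i])  # take the recognized character
        else:
            merged.append(char)  # keep the original character
    return "".join(merged)


fixed = merge_ocr("The \ufffdase \ufffdlass", "The base class\n")
padded_case = merge_ocr(" Co\ufffdst", "Const")
short_case = merge_ocr("a\ufffd\ufffd", "ab")
print(repr(fixed), repr(padded_case), repr(short_case))
```

A positional merge only works when OCR returns the same number of characters as the original span. If Tesseract splits or joins glyphs, the indices drift, so checking that both strings have equal length before trusting the result may be worthwhile.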
