# OCR Demo: Full vs. Partial OCR of Pages
PyMuPDF v1.19.1 offers two ways to OCR a document page: **_full_** or **_partial_**. In both cases, a `TextPage` object is created, which is available for text extraction and text search as usual. All text processing methods have been extended with the new parameter `textpage` so that they can reference the OCR result.
* A **_full OCR_** renders the page to an image at the desired resolution and interprets that image.
   - All **_visible text_** on the page will be OCRed.
   - All text will have Tesseract's `GlyphLessFont`.
   - May take around 2 seconds, depending on the amount of text and the DPI.
* A **_partial OCR_** interprets only the images displayed by the page.
   - The DPI parameter is not needed, because the original images are OCRed.
   - Text will be a **_mixture of normal and OCR text_**. Normal text retains its properties.
   - Can be much faster than a full OCR.
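
The choice between the two modes can be sketched as a small helper. This is only an illustration: the function name `make_textpage` and its `force_full` parameter are made up for this example, and the rule "no images, no OCR" is an assumption that will not suit every document (for instance, pages whose text is drawn as vector graphics). The methods called on `page` are the ones used in this notebook plus `get_images()`.

```python
def make_textpage(page, force_full=False, dpi=300):
    """Create a TextPage for `page`, using OCR only where it can help.

    `page` is expected to be a PyMuPDF page object.
    `make_textpage` and `force_full` are hypothetical names for this sketch.
    """
    if force_full:
        # Full OCR: render the whole page at `dpi` and interpret the picture.
        return page.get_textpage_ocr(flags=0, dpi=dpi, full=True)
    if page.get_images():
        # Partial OCR: only the images displayed by the page are interpreted,
        # so no DPI value is needed.
        return page.get_textpage_ocr(flags=0, full=False)
    # Assumption: without images there is nothing for a partial OCR to do.
    return page.get_textpage(flags=0)
```

The returned object can then be passed to the text methods, for example `page.get_text(textpage=make_textpage(page))`.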

In [1]:
import fitz

if tuple(map(int, fitz.VersionBind.split("."))) < (1, 19, 1):
    raise ValueError("Need at least v1.19.1 of PyMuPDF")

# example PDF contains normal text and two overlapping images
doc = fitz.open("partial-ocr.pdf")
page = doc[0]

## Full Page OCR
First make a **_full page OCR_**. Please take a look at the PDF and note the two little text lines. They are contained in a separate, non-transparent image, which covers some text of the larger image underneath it.

In [2]:
# make the TextPage object. It does all the OCR.
full_tp = page.get_textpage_ocr(flags=0, dpi=300, full=True)

# now look at what we have got
print(page.get_text(textpage=full_tp))

PDF
PyMuPDF — the Python
soe urs ton na
luPDF
PyMuPDF Documentation
Release 1.19.0
Jorj X. McKie
Oct 17, 2021



Or blockwise output, getting rid of some of the unwanted linebreaks:

In [3]:
blocks = page.get_text("blocks", textpage=full_tp)
for b in blocks:
    print(b[4].replace("\n", " "))

PDF 
PyMuPDF — the Python 
soe urs ton na luPDF PyMuPDF Documentation Release 1.19.0 
Jorj X. McKie 
Oct 17, 2021 


Not very impressive either way: the original text (last 4 lines) was detected ok, but text in the pictures looks quite garbled ... no surprise!

> Please note that the OCR process scans the page from top-left to bottom-right, which is therefore also the sequence of the extracted text.

This is what we get when looking at details of each text span:

In [4]:
for block in page.get_text("dict", textpage=full_tp)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short in the display
            print( span["font"], bbox, span["text"])

GlyphLessFont IRect(222, 188, 297, 221) PDF
GlyphLessFont IRect(282, 273, 390, 298) PyMuPDF
GlyphLessFont IRect(389, 273, 411, 298)  —
GlyphLessFont IRect(410, 273, 453, 298)  the
GlyphLessFont IRect(452, 273, 540, 298)  Python
GlyphLessFont IRect(283, 300, 307, 336) soe
GlyphLessFont IRect(306, 300, 331, 336)  urs
GlyphLessFont IRect(330, 300, 363, 336)  ton
GlyphLessFont IRect(362, 300, 391, 336)  na
GlyphLessFont IRect(457, 300, 521, 336) luPDF
GlyphLessFont IRect(239, 348, 352, 373) PyMuPDF
GlyphLessFont IRect(351, 348, 538, 373)  Documentation
GlyphLessFont IRect(423, 382, 488, 396) Release
GlyphLessFont IRect(487, 382, 541, 396)  1.19.0
GlyphLessFont IRect(432, 478, 463, 496) Jorj
GlyphLessFont IRect(462, 478, 484, 496)  X.
GlyphLessFont IRect(483, 478, 540, 496)  McKie
GlyphLessFont IRect(470, 641, 489, 656) Oct
GlyphLessFont IRect(488, 641, 510, 656)  17,
GlyphLessFont IRect(509, 641, 538, 656)  2021
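
Since each OCRed word arrives as its own span, line text has to be put back together by concatenating the spans of a line. The following is a minimal sketch of that step on a hand-made dictionary that mimics the `"dict"` layout used above. The helper name `line_texts` is hypothetical, and the grouping of the sample spans into two lines is assumed for illustration.

```python
def line_texts(page_dict):
    """Join the span texts of every line in a "dict"-style structure."""
    lines = []
    for block in page_dict["blocks"]:
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            lines.append("".join(span["text"] for span in line["spans"]))
    return lines


# Hand-made sample: span texts copied from the output above, line grouping assumed.
sample = {
    "blocks": [
        {"lines": [{"spans": [{"text": "Jorj"}, {"text": " X."}, {"text": " McKie"}]}]},
        {"lines": [{"spans": [{"text": "Oct"}, {"text": " 17,"}, {"text": " 2021"}]}]},
    ]
}

print(line_texts(sample))
```

With a real page, the same helper would be called as `line_texts(page.get_text("dict", textpage=full_tp))`.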


## Partial OCR
Let's see what a **_partial OCR_** can do for us.

A partial OCR `TextPage` internally stores text in the following sequence:
1. Normal text
2. OCR text from images in the same sequence as the page displays those images

So we should use the `sort` parameter of text extraction to get the text in reading order.
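
To illustrate why sorting matters, here is a small sketch that reorders a few items by their bounding boxes. The bounding boxes are taken from span output of this page, and the stored order follows the sequence described above (normal text first, then OCR text). The sort key (bottom edge, then left edge) is an assumption made for this example and may differ from what `sort=True` does internally.

```python
# (text, (x0, y0, x1, y1)) in the assumed internal storage order
items = [
    ("PyMuPDF Documentation", (237, 342, 541, 374)),  # normal text, stored first
    ("PDF", (222, 188, 297, 221)),                    # OCR text from an image
    ("PyMuPDF", (282, 273, 390, 298)),                # OCR text from an image
]


def reading_order(items):
    """Sort by vertical position (bottom edge), then by horizontal position."""
    return sorted(items, key=lambda item: (item[1][3], item[1][0]))


ordered = [text for text, _ in reading_order(items)]
print(ordered)  # ['PDF', 'PyMuPDF', 'PyMuPDF Documentation']
```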

In [5]:
partial_tp = page.get_textpage_ocr(flags=0, full=False)

# look at the result
print(page.get_text(textpage=partial_tp, sort=True))  # sort by vertical, then horizontal

=
PDF
Some text as line
1.
Some more text as line 2.
PyMuPDF — the Python
bindings for MuPDF
PyMuPDF Documentation
Release 1.19.0
Jorj X. McKie
Oct 17, 2021



This is much better. Looking again at the span details:

In [6]:
for block in page.get_text("dict", textpage=partial_tp, sort=True)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short
            print( span["font"], bbox, span["text"])

GlyphLessFont IRect(284, 189, 297, 202) =
GlyphLessFont IRect(222, 188, 297, 221) PDF
GlyphLessFont IRect(283, 302, 307, 310) Some
GlyphLessFont IRect(306, 302, 326, 310)  text
GlyphLessFont IRect(325, 302, 339, 310)  as
GlyphLessFont IRect(338, 302, 356, 310)  line
GlyphLessFont IRect(360, 302, 363, 310) 1.
GlyphLessFont IRect(283, 320, 307, 328) Some
GlyphLessFont IRect(306, 320, 333, 328)  more
GlyphLessFont IRect(332, 320, 350, 328)  text
GlyphLessFont IRect(349, 320, 363, 328)  as
GlyphLessFont IRect(362, 320, 381, 328)  line
GlyphLessFont IRect(380, 320, 389, 328)  2.
GlyphLessFont IRect(282, 273, 390, 298) PyMuPDF
GlyphLessFont IRect(389, 273, 411, 298)  —
GlyphLessFont IRect(410, 273, 453, 298)  the
GlyphLessFont IRect(452, 273, 539, 298)  Python
GlyphLessFont IRect(301, 305, 394, 331) bindings
GlyphLessFont IRect(393, 305, 433, 331)  for
GlyphLessFont IRect(432, 305, 521, 331)  MuPDF
NimbusSanL-Bold IRect(237, 342, 541, 374) PyMuPDF Documentation
NimbusSanL-BoldItal IRect(422,

As mentioned, normal text is **_not OCRed_** in this case, so it keeps its own font, font size, position information, etc., whereas OCRed text appears with Tesseract's `GlyphLessFont`.

> During its internal processing, MuPDF treats every word returned by Tesseract as a separate text span.
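
Because OCRed spans carry the font name `GlyphLessFont`, the two kinds of text can be told apart after extraction. The sketch below does this on a hand-made dictionary in the `"dict"` layout. The helper name `split_by_origin` is hypothetical, and the approach assumes that the PDF's normal text does not itself use `GlyphLessFont` (which could happen if the file was produced by an earlier OCR run).

```python
OCR_FONT = "GlyphLessFont"  # font name shown for OCRed spans in the output above


def split_by_origin(page_dict):
    """Return (ocr_texts, normal_texts) based on each span's font name."""
    ocr, normal = [], []
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):
            for span in line["spans"]:
                target = ocr if span["font"] == OCR_FONT else normal
                target.append(span["text"])
    return ocr, normal


# Hand-made sample using fonts and texts from the span output above.
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"font": "GlyphLessFont", "text": "bindings"},
            {"font": "GlyphLessFont", "text": " for"},
            {"font": "GlyphLessFont", "text": " MuPDF"},
        ]}]},
        {"lines": [{"spans": [
            {"font": "NimbusSanL-Bold", "text": "PyMuPDF Documentation"},
        ]}]},
    ]
}

ocr_words, normal_spans = split_by_origin(sample)
print(len(ocr_words), "OCR spans,", len(normal_spans), "normal span(s)")
```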

## Performance
As mentioned at the beginning, the OCR work is done during `TextPage` creation. Even without OCR, textpage creation is the most time-consuming part of text processing.

Creating OCR textpages may easily take 100 to several thousand times longer. It should therefore happen only once per document page.

The new `textpage` parameter in all text processing methods allows referring to an existing textpage and will suppress creating another one.
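
One way to make sure the expensive step runs only once per page is a small cache around textpage creation. This is a sketch, not part of PyMuPDF: the class `TextPageCache` and its `make_textpage` argument are hypothetical names. It keys entries by `page.number`, so one cache instance should be used per document.

```python
class TextPageCache:
    """Create each page's TextPage at most once and reuse it afterwards."""

    def __init__(self, make_textpage):
        # `make_textpage` is any callable that takes a page and returns a
        # TextPage, e.g. lambda p: p.get_textpage_ocr(flags=0, full=False)
        self._make_textpage = make_textpage
        self._cache = {}

    def get(self, page):
        key = page.number  # page number within its document
        if key not in self._cache:
            self._cache[key] = self._make_textpage(page)  # slow step, done once
        return self._cache[key]


# Intended usage with a real page (not executed here):
#   cache = TextPageCache(lambda p: p.get_textpage_ocr(flags=0, full=False))
#   tp = cache.get(page)
#   text = page.get_text(textpage=tp)
#   hits = page.search_for("MuPDF", textpage=tp)
```

Both the extraction and the search then reuse the same OCR result instead of triggering a new one.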

Here are some performance comparisons for our example page:

In [7]:
# normal text extraction - no OCR
%timeit page.get_textpage(flags=0)  # suppress image extraction

142 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [8]:
# full page OCR
%timeit page.get_textpage_ocr(flags=0, full=True, dpi=300)

400 ms ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
# partial OCR
%timeit page.get_textpage_ocr(flags=0, full=False)

262 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The above numbers illustrate that OCRing a page is time-consuming! For this page, creating an OCR `TextPage` is roughly 1,800 (partial) to 2,800 (full) times slower than creating a normal one.
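
The slowdown factors follow directly from the mean timings reported above. The short calculation below simply divides them, so the values apply to this example page and this machine only.

```python
# Mean timings from the %timeit results above, converted to seconds
plain_s = 142e-6        # normal textpage, no OCR
full_ocr_s = 400e-3     # full page OCR at 300 DPI
partial_ocr_s = 262e-3  # partial OCR (images only)

full_ratio = full_ocr_s / plain_s
partial_ratio = partial_ocr_s / plain_s

print(f"full OCR:    about {full_ratio:,.0f} times slower")
print(f"partial OCR: about {partial_ratio:,.0f} times slower")
print(f"full vs. partial: {full_ocr_s / partial_ocr_s:.2f}")
```

This gives factors of about 2,817 and 1,845, and shows that the full OCR took about 1.5 times as long as the partial OCR for this page.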

Once you **_have_** the textpage, however, **_processing_** its text is as fast as it ever was:

In [10]:
# normal textpage
normal_tp = page.get_textpage(flags=0)
%timeit page.get_text(textpage=normal_tp)

6.87 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [11]:
# full page OCR
%timeit page.get_text(textpage=full_tp)

6.82 µs ± 63.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [12]:
# partial page OCR
%timeit page.get_text(textpage=partial_tp)

7.85 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
