Looking for some examples on how to use docTR for OCR-related tasks? You've come to the right place 😀

# Installation

Install all the dependencies to make the most out of docTR. The project provides two main [installation](https://mindee.github.io/doctr/latest/installing.html) streams: one for the stable release (updated once every 45 days on average), and one for developer mode.

## Latest stable release

This will install the latest stable release published by our teams on PyPI. It is intended to provide a clean, stable experience for all users.

In [2]:
# TensorFlow
!pip install python-doctr[tf]
# PyTorch
#!pip install python-doctr[torch]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-doctr[tf]
  Downloading python_doctr-0.6.0-py3-none-any.whl (239 kB)
Collecting pypdfium2<4.0.0,>=3.0.0
  Downloading pypdfium2-3.21.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Collecting pyclipper<2.0.0,>=1.2.0
  Downloading pyclipper-1.3.0.post4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (608 kB)
Collecting weasyprint>=55.0
  Downloading weasyprint-58.1-py3-none-any.whl (265 kB)

## From source

Between stable releases, we constantly iterate on community feedback to improve the library. Bug fixes and performance improvements are regularly pushed to the project's Git repository. With this installation method, you get all the latest features that have not yet made their way into a PyPI release!

In [None]:
# Install the most up-to-date version from GitHub
# TensorFlow
!pip install -e git+https://github.com/mindee/doctr.git#egg=python-doctr[tf]
# PyTorch
#!pip install -e git+https://github.com/mindee/doctr.git#egg=python-doctr[torch]

Now go to `Runtime/Restart runtime` for your changes to take effect!

# Basic usage

We're going to review the main features of docTR 🎁
To render the results properly later on, we first need to install some free fonts:

In [None]:
# Install some free fonts for result rendering
!sudo apt-get install fonts-freefont-ttf -y

Let's take care of all the imports up front:

In [1]:
%matplotlib inline
import os

# Let's pick the desired backend
os.environ['USE_TF'] = '1'
#os.environ['USE_TORCH'] = '1'

import matplotlib.pyplot as plt

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

For the next steps, we need a PDF document to showcase the library's features:

In [18]:
# Download a sample
#!wget https://eforms.com/download/2019/01/Cash-Payment-Receipt-Template.pdf
# Read the file
doc = DocumentFile.from_pdf("/content/sample_data/2022_07_6_9001532503-5.pdf")
print(f"Number of pages: {len(doc)}")

Number of pages: 2


Under the hood, docTR runs Deep Learning models to perform the different tasks it supports. Those models were built and trained with very popular frameworks for maximum compatibility (you will be pleased to know that you can switch from [PyTorch](https://pytorch.org/) to [TensorFlow](https://www.tensorflow.org/) without noticing any difference). By default, our high-level API picks sensible default values so that you get high-performing models without having to know the details. All of this is wrapped in a `Predictor` object, which takes care of pre-processing, model inference and post-processing for you ⚡

Let's instantiate one!

In [8]:
# Instantiate a pretrained model
predictor = ocr_predictor(pretrained=True)

By default, PyTorch models provide a nice visual description of their architecture, which is handy when debugging or checking what you just created. We also added a similar feature for the TensorFlow backend so that you don't miss out on this nice assistance.

Let's dive into this model 🕵

In [19]:
# Display the architecture
print(predictor)

OCRPredictor(
  (det_predictor): DetectionPredictor(
    (pre_processor): PreProcessor(
      (resize): Resize(output_size=(1024, 1024), method='bilinear')
      (normalize): Normalize(mean=[0.7979999780654907, 0.7850000262260437, 0.7720000147819519], std=[0.2639999985694885, 0.27489998936653137, 0.28700000047683716])
    )
    (model): DBNet(
      (feat_extractor): IntermediateLayerGetter()
      (fpn): FeaturePyramidNetwork(channels=128)
      (probability_head): <keras.engine.sequential.Sequential object at 0x7fb8cd9183a0>
      (threshold_head): <keras.engine.sequential.Sequential object at 0x7fb8cd8c37f0>
      (postprocessor): DBPostProcessor(bin_thresh=0.3, box_thresh=0.1)
    )
  )
  (reco_predictor): RecognitionPredictor(
    (pre_processor): PreProcessor(
      (resize): Resize(output_size=(32, 128), method='bilinear', preserve_aspect_ratio=True, symmetric_pad=False)
      (normalize): Normalize(mean=[0.6940000057220459, 0.6949999928474426, 0.6930000185966492], std=[0.298999

Here we are inspecting the most complex (and highest-level) object of the docTR API: an OCR predictor. Since docTR achieves Optical Character Recognition by first localizing textual elements (Text Detection), then extracting the corresponding text from each location (Text Recognition), the OCR predictor wraps two sub-predictors: one for text detection, and the other for text recognition.
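
To make the two-stage idea concrete, here is a toy sketch in plain Python. It is not docTR's implementation: the "image" is just a grid of characters, and `ToyOCRPipeline`, `toy_detect` and `toy_recognize` are hypothetical names invented for this illustration. It only shows how a detector that returns boxes and a recognizer that reads each crop can be chained together.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax), max values exclusive
Image = List[List[str]]          # toy "image": a grid of single characters


@dataclass
class ToyOCRPipeline:
    """Hypothetical wrapper chaining a detection step and a recognition step."""

    detect: Callable[[Image], List[Box]]
    recognize: Callable[[Image], str]

    def __call__(self, page: Image) -> List[Tuple[Box, str]]:
        results = []
        for box in self.detect(page):
            xmin, ymin, xmax, ymax = box
            # Crop the region found by the detector, then read it
            crop = [row[xmin:xmax] for row in page[ymin:ymax]]
            results.append((box, self.recognize(crop)))
        return results


def toy_detect(page: Image) -> List[Box]:
    """Stand-in detector: one box per run of non-space characters in a row."""
    boxes = []
    for y, row in enumerate(page):
        start = None
        for x, ch in enumerate(row + [" "]):  # trailing sentinel closes the last run
            if ch != " " and start is None:
                start = x
            elif ch == " " and start is not None:
                boxes.append((start, y, x, y + 1))
                start = None
    return boxes


def toy_recognize(crop: Image) -> str:
    """Stand-in recognizer: the crop already contains the characters."""
    return "".join("".join(row) for row in crop)


page = [list("cash  receipt"), list("total 42")]
pipeline = ToyOCRPipeline(detect=toy_detect, recognize=toy_recognize)
words = pipeline(page)
for box, text in words:
    print(box, text)
```

In the real library, each stage additionally has its own pre-processor and post-processor, as the printed architecture shows.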

## Basic inference

It looks quite complex, doesn't it?
Well, that will not prevent you from easily getting nice results. See for yourself:

In [13]:
!wget https://eforms.com/download/2019/01/Cash-Payment-Receipt-Template.pdf
# Read the file
doc = DocumentFile.from_pdf("/content/sample_data/table.pdf")
print(f"Number of pages: {len(doc)}")

--2023-03-22 22:20:08--  https://eforms.com/download/2019/01/Cash-Payment-Receipt-Template.pdf
Resolving eforms.com (eforms.com)... 104.21.92.25, 172.67.185.85, 2606:4700:3031::ac43:b955, ...
Connecting to eforms.com (eforms.com)|104.21.92.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16080 (16K) [application/pdf]
Saving to: ‘Cash-Payment-Receipt-Template.pdf.2’


2023-03-22 22:20:08 (105 MB/s) - ‘Cash-Payment-Receipt-Template.pdf.2’ saved [16080/16080]

Number of pages: 1


In [20]:
result = predictor(doc)

## Prediction visualization

If you prefer to inspect the results visually, docTR includes a few visualization features. We will first overlay our predictions on the original document:

In [None]:
result.show(doc)

Looks accurate!
But we can go further: if the extracted information is correctly structured, we should be able to recreate the page entirely. So let's do this 🎨

In [None]:
synthetic_pages = result.synthesize()
plt.figure(figsize=(50, 50))
plt.imshow(synthetic_pages[0])
plt.axis('off')
plt.show()

## Exporting results

OK, so the predictions are relevant, but how would you integrate this into your own document processing pipeline? Perhaps you're not using Python at all?

Well, if your pipeline can consume JSON or XML, both export formats are already supported 🤗

In [17]:
# JSON export
json_export = result.export()
print(json_export)

{'pages': [{'page_idx': 0, 'dimensions': (1584, 1224), 'orientation': {'value': None, 'confidence': None}, 'language': {'value': None, 'confidence': None}, 'blocks': [{'geometry': ((0.1181640625, 0.125), (0.267578125, 0.142578125)), 'lines': [{'geometry': ((0.1181640625, 0.125), (0.267578125, 0.142578125)), 'words': [{'value': 'Example', 'confidence': 0.9993807673454285, 'geometry': ((0.1181640625, 0.125), (0.2099609375, 0.142578125))}, {'value': 'table', 'confidence': 0.9983372092247009, 'geometry': ((0.212890625, 0.125), (0.267578125, 0.1396484375))}]}], 'artefacts': []}, {'geometry': ((0.115234375, 0.14453125), (0.4775390625, 0.166015625)), 'lines': [{'geometry': ((0.115234375, 0.14453125), (0.4775390625, 0.166015625)), 'words': [{'value': 'This', 'confidence': 0.9135354161262512, 'geometry': ((0.115234375, 0.1455078125), (0.162109375, 0.1640625))}, {'value': 'is', 'confidence': 0.9996023774147034, 'geometry': ((0.1650390625, 0.146484375), (0.185546875, 0.1650390625))}, {'value': 'a
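
The export is a nested dictionary (pages, blocks, lines, words), so it can be walked with plain Python. The sketch below uses a trimmed sample that mirrors the first two words of the output above rather than a live `result.export()` call. It assumes that `dimensions` is `(height, width)` and that each `geometry` is a straight box given as relative `((xmin, ymin), (xmax, ymax))` coordinates; `iter_words` is a hypothetical helper, not part of docTR.

```python
import json

# Trimmed sample mirroring the structure printed above (first two words only)
sample_export = {
    "pages": [
        {
            "page_idx": 0,
            "dimensions": (1584, 1224),
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {
                                    "value": "Example",
                                    "confidence": 0.9993807673454285,
                                    "geometry": ((0.1181640625, 0.125), (0.2099609375, 0.142578125)),
                                },
                                {
                                    "value": "table",
                                    "confidence": 0.9983372092247009,
                                    "geometry": ((0.212890625, 0.125), (0.267578125, 0.1396484375)),
                                },
                            ]
                        }
                    ]
                }
            ],
        }
    ]
}


def iter_words(export):
    """Yield (text, confidence, pixel_box) for every word in the export."""
    for page in export["pages"]:
        height, width = page["dimensions"]  # assumption: (height, width)
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    # assumption: straight boxes in relative coordinates
                    (xmin, ymin), (xmax, ymax) = word["geometry"]
                    box = (
                        round(xmin * width),
                        round(ymin * height),
                        round(xmax * width),
                        round(ymax * height),
                    )
                    yield word["value"], word["confidence"], box


words = list(iter_words(sample_export))
page_text = " ".join(text for text, _, _ in words)
print(page_text)

# Serialize for non-Python consumers (tuples become JSON arrays)
json_string = json.dumps(sample_export)
```

Once serialized with `json.dumps`, the same structure can be read from any language with a JSON parser.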

In [None]:
# XML export
xml_output = result.export_as_xml()
print(xml_output[0][0])

b'<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><title>docTR - XML export (hOCR)</title><meta content="text/html; charset=utf-8" http-equiv="Content-Type" /><meta content="python-doctr 0.4.1a0" name="ocr-system" /><meta content="ocr_page ocr_carea ocr_par ocr_line ocrx_word" name="ocr-capabilities" /></head><body><div class="ocr_page" id="page_1" title="image; bbox 0 0 1224 1584; ppageno 0" /><div class="ocr_carea" id="block_1" title="bbox 385 101                     842 133"><p class="ocr_par" id="par_1" title="bbox 385 101                     842 133"><span class="ocr_line" id="line_1" title="bbox 385 101                         842 133;                         baseline 0 0; x_size 0; x_descenders 0; x_ascenders 0"><span class="ocrx_word" id="word_1" title="bbox 385 102                             488 131;                             x_wconf 100">CASH</span><span class="ocrx_word" id="word_2" title="bbox 497 101                             675 133;                   
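
The XML export follows the hOCR layout, so the words can be recovered with the standard library. The sketch below parses a shortened, hand-written sample modeled on the output above instead of a real `export_as_xml()` result. It assumes that every word is a `span` with class `ocrx_word` whose `title` holds `bbox` and `x_wconf` fields; `parse_hocr_words` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Shortened sample modeled on the export above (one line, one word)
sample_hocr = (
    b'<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><body>'
    b'<span class="ocr_line" id="line_1" title="bbox 385 101    842 133">'
    b'<span class="ocrx_word" id="word_1" '
    b'title="bbox 385 102      488 131;      x_wconf 100">CASH</span>'
    b'</span></body></html>'
)


def parse_hocr_words(hocr_bytes):
    """Return one dict per ocrx_word span: text, pixel bbox and confidence."""
    root = ET.fromstring(hocr_bytes)
    words = []
    for span in root.iter(f"{XHTML_NS}span"):
        if span.get("class") != "ocrx_word":
            continue
        title = span.get("title", "")
        # \s+ tolerates the runs of spaces present in the exported titles
        bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
        conf = re.search(r"x_wconf\s+(\d+)", title)
        if bbox is None:
            continue
        words.append(
            {
                "text": span.text or "",
                "bbox": tuple(int(v) for v in bbox.groups()),
                "confidence": int(conf.group(1)) if conf else None,
            }
        )
    return words


parsed = parse_hocr_words(sample_hocr)
print(parsed)
```

With a real export, the bytes in `xml_output[0][0]` could be passed to the same function, provided the assumptions above hold for your docTR version.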