OCR utils

Python tools for interacting with Tesseract

Features

Detects tables in PDFs and images and performs OCR on each cell
Performs OCR on a PDF and generates an SVG image

Quick Start

from ocr_utils import pdf_to_svg

pdf_to_svg(
    input_filename='in.pdf',
    output_filename='out.svg',
    detect_tables=True,
    lang='eng',
)
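To convert several files, the call above can be wrapped in a small loop. The following is a minimal sketch and not part of the library: `convert_directory` and its `convert` parameter are hypothetical names introduced for illustration. The converter is passed in as an argument, so the helper assumes only the keyword arguments shown in the Quick Start call.

```python
from pathlib import Path
from typing import Callable, List


def convert_directory(
    src_dir: str,
    dst_dir: str,
    convert: Callable[..., object],
    lang: str = "eng",
) -> List[Path]:
    """Convert every PDF in src_dir to an SVG in dst_dir (hypothetical helper)."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Sort the inputs so that the processing order is deterministic.
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".svg")
        # Same keyword arguments as the Quick Start call.
        convert(
            input_filename=str(pdf),
            output_filename=str(target),
            detect_tables=True,
            lang=lang,
        )
        written.append(target)
    return written
```

With the library installed, this could be called as `convert_directory('pdfs', 'svgs', pdf_to_svg)` after `from ocr_utils import pdf_to_svg`; the directory names here are placeholders.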

Execution example

Input pdf

Output svg

Installation

Stable Release: pip install tesseract_ocr_utils
Development Head: pip install git+https://github.com/envinorma/ocr_utils.git

This library is built on pytesseract and pdf2image, which have dependencies that cannot be installed with pip. See each library's installation page for instructions on installing them.

For example, on Ubuntu, the following packages need to be installed:

apt-get install libarchive13
apt-get install tesseract-ocr
apt-get install poppler-utils
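Before running the library, it may help to confirm that the system tools are reachable. The sketch below is an illustration, not part of ocr_utils: `missing_executables` is a hypothetical helper, and it assumes that `tesseract-ocr` provides a `tesseract` executable and that `poppler-utils` provides `pdftoppm`. Executable names may differ on other platforms.

```python
import shutil
from typing import Iterable, List


def missing_executables(
    names: Iterable[str] = ("tesseract", "pdftoppm"),
) -> List[str]:
    """Return the executables from names that are not found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm were found on PATH")
```

An empty result only means the executables were found; it does not check their versions or the installed Tesseract language data.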

Documentation

For full package documentation please visit envinorma.github.io/ocr_utils.

Development

See CONTRIBUTING.md for information related to developing the code.

MIT license
