pdf_to_json

Python module to convert a PDF file to JSON format.

The goal is to quickly extract all the available information in a document into a Python dictionary. The dictionary can then be stored in a database or a CSV file (for later machine learning processing).

The extracted information can include:

Document metadata: title, format, version, creation date, author
Page images
Page texts (in Unicode, when available; no OCR is performed) and text attributes (fonts, etc.)
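As a rough illustration of the "store it in a CSV file" step, the sketch below flattens a nested dictionary into a single CSV row with dotted column names. The keys in `sample` (`metadata`, `pages`, `text`) are hypothetical placeholders, not the actual layout produced by this module; inspect a real output before relying on any field name.

```python
import csv
import io


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one level, using dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot left by the parent level.
        flat[prefix.rstrip(".")] = obj
    return flat


def dict_to_csv(doc):
    """Return CSV text with one header line and one row for the document."""
    flat = flatten(doc)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(flat))
    writer.writeheader()
    writer.writerow(flat)
    return buffer.getvalue()


# Hypothetical document dictionary, for illustration only.
sample = {
    "metadata": {"title": "UDHR", "author": "UN"},
    "pages": [{"text": "Article 1"}],
}
csv_text = dict_to_csv(sample)
print(csv_text)
```

One row per document keeps the table easy to load later, at the cost of very wide rows for long documents; a one-row-per-page layout may suit some pipelines better.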

This tool uses the excellent poppler library.

This tool was initially intended for multilingual PDF document processing. The following tests describe the use case:

https://github.com/antoinecarme/pdf_to_json_tests/tree/master/data/multilingual

Demo

Also available as a Jupyter notebook.

import pdf_to_json as p2j
import json

# web document : UDHR
url = "https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf"

# Convert the document into a python dictionary
lConverter = p2j.pdf_to_json.pdf_to_json_converter()
lDict = lConverter.convert(url)

print(json.dumps(lDict, indent=4))
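For multilingual documents it can be convenient to keep the extracted text readable on disk rather than escaped as `\uXXXX` sequences. The sketch below is one possible way to do that with the standard `json` module and `ensure_ascii=False`. The helper names `save_as_json` and `load_json` and the stand-in dictionary are assumptions made for this example; they are not part of pdf_to_json.

```python
import json
import tempfile
from pathlib import Path


def save_as_json(doc, path):
    """Write the dictionary as UTF-8 JSON, keeping non-ASCII characters as-is."""
    text = json.dumps(doc, indent=4, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_json(path):
    """Read back a dictionary written by save_as_json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Stand-in for the dictionary returned by the converter (hypothetical keys).
doc = {"language": "fra", "text": "Déclaration universelle des droits de l'homme"}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "document.json"
    save_as_json(doc, out_path)
    raw = out_path.read_text(encoding="utf-8")
    restored = load_json(out_path)
```

In real use, `doc` would be the `lDict` returned by `convert` in the demo above.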

Installation

pip install --upgrade git+https://github.com/antoinecarme/pdf_to_json.git
