Scanned vs Structured PDF Processor

The script identifies whether a given PDF is structured (text-based) or scanned. For a text-based PDF, it uses the pdftotext tool to extract the text content and saves the pages in the given folder. It also splits the PDF into individual single-page PDFs using pdftk.

Prerequisites

Make sure that pdftotext, pdfinfo and pdftk are installed on your computer. pdftotext and pdfinfo are available in poppler-utils. pdftk has to be installed separately.
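A quick way to confirm the prerequisites is to check whether the three tools are on the PATH. The snippet below is a minimal sketch, not part of this repository; the function name missing_tools is hypothetical.

```python
import shutil

# The three command-line tools named in the prerequisites.
REQUIRED_TOOLS = ("pdftotext", "pdfinfo", "pdftk")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty list means all three tools were found.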

Installing pdftk in Amazon Linux

Apparently pdftk can't be installed easily on Amazon Linux. However, there is a workaround.

How it works

Reads the PDF file.
Uses pdfinfo to get the total number of pages in the PDF, its size, and whether it is encrypted.
If the PDF is encrypted (i.e. password protected), it writes stats.json with { "status":"Encryption", .. }, throws an Exception, and exits the script.
If the PDF is not encrypted:
- Uses pdftotext to dump the text and checks the size of the extracted text content. If the text averages 500 bytes or more per page, the PDF is considered structured; otherwise it is treated as scanned.
- Uses pdftk to extract each PDF page and saves it in the pages folder.
- If the PDF is structured, it uses pdftotext to extract the text content page by page and puts the txt files in the text folder.
- If the PDF is non-structured (i.e. scanned), it uses the ABBYY OCR service to extract the text content (TODO).
- Creates a stats.json file with the following content (status = [Scanned|Structured|Encrypted]):

{ "status": "Structured", "pages": 5 }
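The decision logic described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in PdfProcessor.py: all function names are hypothetical, the "500 bytes or more" comparison is one reading of the threshold, and the pdftotext and pdftk command lines show one plausible way to invoke those tools rather than the script's actual calls.

```python
import json

# Average extracted-text size per page that separates structured from scanned.
BYTES_PER_PAGE_THRESHOLD = 500


def parse_pdfinfo(output):
    """Turn the 'Key: value' lines printed by pdfinfo into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_encrypted(info):
    """pdfinfo reports 'Encrypted: no' or 'Encrypted: yes (...)'."""
    return info.get("Encrypted", "no").lower().startswith("yes")


def classify(text, pages, threshold=BYTES_PER_PAGE_THRESHOLD):
    """Return 'Structured' if the text averages >= threshold bytes per page."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    average = len(text.encode("utf-8")) / pages
    return "Structured" if average >= threshold else "Scanned"


def build_stats(status, pages):
    """Serialize the stats.json content shown above."""
    return json.dumps({"status": status, "pages": pages})


def page_text_command(pdf_path, page, txt_path):
    """pdftotext arguments that extract a single page (-f first, -l last)."""
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, txt_path]


def split_pages_command(pdf_path, pattern):
    """pdftk arguments that burst a PDF into one file per page."""
    return ["pdftk", pdf_path, "burst", "output", pattern]
```

The command builders return argument lists suitable for subprocess.run, so the parsing and classification steps can be checked without the external tools being installed.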

Test

Execute bash runtest.sh to run all the tests at once.

Run

Register with ABBYY to get an application-id and password, then copy settings.config.bak to settings.config and update the application-id and password.
Run python run.py to see the options.
python run.py -i tests/sample.pdf -o out -l french creates the folder out/text with the extracted text files, out/pages with the separated PDF files, and out/stats.json. For a French contract, it OCRs the document in that language. For now, only english, french and spanish are supported. The language is optional and defaults to english.
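After a run, the output folder can be inspected with a few lines of Python. The helper below is a hypothetical sketch, not part of the repository; it assumes only the layout described above (stats.json plus text and pages subfolders) and makes no assumption about how the files inside those folders are named.

```python
import json
from pathlib import Path


def summarize_output(out_dir):
    """Read stats.json and list the files written to text/ and pages/."""
    out = Path(out_dir)
    stats = json.loads((out / "stats.json").read_text(encoding="utf-8"))

    def list_names(folder):
        # A folder may be absent, e.g. no text/ folder for an encrypted PDF.
        return sorted(p.name for p in folder.iterdir()) if folder.is_dir() else []

    return {
        "status": stats.get("status"),
        "pages": stats.get("pages"),
        "text_files": list_names(out / "text"),
        "page_files": list_names(out / "pages"),
    }


if __name__ == "__main__":
    print(summarize_output("out"))
```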

TODO

handle more exceptions

anjesh/pdf-processor
