Tesseract 5.0 OCR on N (X4) threads

Performs OCR on all .pdf files found in the input folder using Tesseract version 5.0. It processes N files simultaneously (on N*4 threads).
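The parallel scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in run_docker2.py: the function names (`find_pdfs`, `total_threads`, `ocr_one`, `run_parallel`) are hypothetical, and `ocr_one` is a stub standing in for the actual Tesseract call.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

THREADS_PER_PROCESS = 4  # per the README: Tesseract uses 4 threads per process


def find_pdfs(input_dir):
    """Return the .pdf files directly inside input_dir, sorted by path."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def total_threads(processes):
    """Threads in use when `processes` files are OCR'd simultaneously."""
    return processes * THREADS_PER_PROCESS


def ocr_one(pdf_path):
    """Hypothetical stub for the per-file OCR step.

    The real tool would run Tesseract here; this stub only returns the
    name of the .txt file that would be written for the given PDF.
    """
    return Path(pdf_path).with_suffix(".txt").name


def run_parallel(input_dir, processes=2, worker=ocr_one):
    """Apply `worker` to every PDF, with up to `processes` files in flight."""
    pdfs = find_pdfs(input_dir)
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(worker, pdfs))
```

When trying this out, call `run_parallel` from under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform.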

Note: Memory usage can be high when running multiple processes in parallel. For example, -p 10 can require about 100 GB of RAM!
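To pick a value for -p, the figure in the note can be turned into a rough estimate. The sketch below assumes that memory grows linearly with the number of processes (about 10 GB each, derived from -p 10 = 100 GB); that linearity is an assumption, and real usage will depend on the PDFs being processed. The function names are hypothetical.

```python
# Derived from the note above: -p 10 corresponds to roughly 100 GB of RAM.
# Linear scaling per process is an assumption, not a measured guarantee.
GB_PER_PROCESS = 100 / 10


def estimated_ram_gb(processes):
    """Rough RAM estimate (in GB) for a given -p value."""
    return processes * GB_PER_PROCESS


def max_processes(available_gb):
    """Largest -p value that fits the estimate, never less than 1."""
    return max(1, int(available_gb // GB_PER_PROCESS))
```

For instance, under this assumption a machine with 64 GB of RAM would be limited to `max_processes(64)` processes, and the default of -p 2 would be estimated at `estimated_ram_gb(2)` GB.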

Install

bash ./build_docker.sh

Options:

-i : Path to the input folder (the folder containing PDF files to be OCR'd).

-o : Path to the output folder (if not specified, .txt files will be saved in the input folder).

-p : Number of processes. Default = 2 (Note: Tesseract uses 4 threads per process, so up to 24% of the processors can be used).

-l : Language of the documents to be OCR'd. Default = "hun" (Note: Only Hungarian and English language dictionaries are installed).

-d : Whether to report the model's reliability (confidence) regarding character recognition. Default = False.

Example:

python3 run_docker2.py -i /home/pdf -o /home/txt -p 10 -l hun -d True
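For illustration, the options and defaults listed above could be parsed along these lines. This is a sketch only and may differ from what run_docker2.py actually does: the attribute names (`input_dir`, `output_dir`, `processes`, `language`, `reliability`), the helper `str_to_bool`, and the choice to make -i mandatory are all assumptions.

```python
import argparse


def str_to_bool(value):
    """Parse 'True'/'False' strings, as passed in `-d True`."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def parse_args(argv=None):
    """Parse the documented flags; attribute names are hypothetical."""
    parser = argparse.ArgumentParser(description="OCR PDF files with Tesseract")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="folder containing the PDF files")
    parser.add_argument("-o", dest="output_dir", default=None,
                        help="folder for the .txt files (default: input folder)")
    parser.add_argument("-p", dest="processes", type=int, default=2,
                        help="number of parallel processes")
    parser.add_argument("-l", dest="language", default="hun",
                        help="document language, e.g. hun or eng")
    parser.add_argument("-d", dest="reliability", type=str_to_bool,
                        default=False,
                        help="report character-recognition reliability")
    args = parser.parse_args(argv)
    # If no output folder is given, .txt files go next to the PDFs.
    if args.output_dir is None:
        args.output_dir = args.input_dir
    return args
```

Calling `parse_args(["-i", "/home/pdf"])` would then fall back to the documented defaults: two processes, language "hun", and output written to the input folder.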
