An OCR application for Farsi/Persian documents. It uses the open-source text recognition engine Tesseract 5.1.0 and Python 3.
Each image is preprocessed before it is passed to Tesseract. This improves Tesseract's recognition results and corrects the rotation angle of the image if needed. After the image is converted to a txt file, the OCR quality can be measured with the Levenshtein distance metric (by putting the original.docx of the intended image into the Data directory).
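The Levenshtein measurement mentioned above can be illustrated with a minimal sketch. It assumes the reference text and the OCR output are already loaded as Python strings (reading original.docx is not shown). The function names `levenshtein` and `similarity` are hypothetical, and normalizing by the longer string's length is an assumption, not necessarily how this project reports its score.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(reference: str, recognized: str) -> float:
    """Score in [0, 1]; 1.0 means the strings are identical.
    Normalizing by the longer length is an assumption of this sketch."""
    longest = max(len(reference), len(recognized))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(reference, recognized) / longest
```

A lower distance (or a similarity closer to 1.0) indicates that the recognized text is closer to the reference. The comparison works on Unicode code points, so Persian text can be passed in directly.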
- Install Tesseract
You can install Tesseract either from a pre-built binary package or by building it from source.
- Install Farsi language data for Tesseract
Download language training data (fas.traineddata) and move the file to the following directory:
mv fas.traineddata /usr/local/share/tessdata
- Install poppler (PDF rendering library) for your OS
Ubuntu-based Linux: apt-get install -y poppler-utils
macOS: brew install poppler
Windows: download poppler for Windows and install it
- Install dependencies via requirements.txt
pip install -r requirements.txt
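Before running the application, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch, not part of this repository: the function names are hypothetical, it assumes PDF rendering relies on poppler's `pdftoppm` tool, and it only looks for the language file in `TESSDATA_PREFIX` (if set), any extra directories you pass in, and the `/usr/local/share/tessdata` path used above.

```python
import os
import shutil
from pathlib import Path

DEFAULT_TESSDATA = Path("/usr/local/share/tessdata")  # path used in the install step


def find_traineddata(lang="fas", extra_dirs=()):
    """Return the first existing <lang>.traineddata file, or None."""
    candidates = []
    prefix = os.environ.get("TESSDATA_PREFIX")  # optional override, if set
    if prefix:
        candidates.append(Path(prefix))
    candidates.extend(Path(d) for d in extra_dirs)
    candidates.append(DEFAULT_TESSDATA)
    for directory in candidates:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None


def missing_prerequisites(lang="fas", extra_dirs=(), which=shutil.which):
    """List the external prerequisites that could not be located."""
    # Executables expected on PATH (pdftoppm is an assumption of this sketch).
    missing = [tool for tool in ("tesseract", "pdftoppm") if which(tool) is None]
    if find_traineddata(lang, extra_dirs) is None:
        missing.append(f"{lang}.traineddata")
    return missing
```

Calling `missing_prerequisites()` returns an empty list when everything was found. If your system keeps tessdata somewhere else, pass that directory through `extra_dirs`.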
docker build -t farsiocr .
docker run --name ocr -it --rm -v $PWD/data:/app/data -v $PWD/output:/app/output farsiocr
Copy your PDF or image files into the data directory (the sample image in the Data directory was downloaded from the internet).
Run src/ocr.py and the results will be created in the output directory.
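To preview which files in the data directory would be picked up, a small helper like the one below may be useful. It is only an illustration: the set of extensions, the `plan_outputs` name, and the `<name>.txt` output naming are assumptions, and src/ocr.py may select and name files differently.

```python
from pathlib import Path

# Assumed set of input types; the actual script may accept others.
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def plan_outputs(data_dir, output_dir):
    """Pair each supported input file with a hypothetical .txt output path."""
    data_dir, output_dir = Path(data_dir), Path(output_dir)
    plan = []
    for path in sorted(data_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            plan.append((path, output_dir / (path.stem + ".txt")))
    return plan
```

For example, `plan_outputs("data", "output")` lists each PDF or image in `data` alongside the text file name this sketch would expect in `output`.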