An OCR application for Farsi/ Persian documents.

amir2mi/FarsiOCR

 
 


FarsiOCR

An OCR application for Farsi/Persian documents. It uses the open-source text recognition engine Tesseract 5.1.0 and Python 3.

Preprocessing is applied to each image before it is passed to Tesseract. This improves recognition accuracy and, if needed, corrects the rotation angle of the image. After the image is converted to a txt file, the OCR quality can be measured with the Levenshtein distance metric by placing the original.docx for that image in the Data directory.
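To make the quality metric concrete, here is a minimal pure-Python sketch of the Levenshtein distance between a reference text and the OCR output. It is an illustration only, not the code used in this repository; the `similarity` helper and its normalization by the longer string are assumptions added for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(reference: str, ocr_text: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(reference), len(ocr_text))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(reference, ocr_text) / longest
```

A lower distance (or a similarity closer to 1.0) means the OCR text is closer to the reference text taken from original.docx.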

Installation

  1. Install Tesseract

You can either install Tesseract from a pre-built binary package or build it from source.

  2. Install Farsi language data for Tesseract

Download the language training data (fas.traineddata) and move the file to the following directory:

mv fas.traineddata /usr/local/share/tessdata

  3. Install poppler (PDF rendering library) for your OS

  • Ubuntu-based Linux: apt-get install -y poppler-utils
  • macOS: brew install poppler
  • Windows: download the poppler build for Windows and install it

  4. Install dependencies via requirements.txt

pip install -r requirements.txt
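
After the steps above, a quick check can confirm that the external tools are visible. The following is a hedged sketch, not part of this repository: `parse_langs` and `check_setup` are hypothetical helper names, and the check assumes that `tesseract --list-langs` prints a "List of ..." header line followed by one language code per line, and that poppler provides the `pdftoppm` binary.

```python
import shutil
import subprocess


def parse_langs(list_langs_output: str) -> set:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = (line.strip() for line in list_langs_output.splitlines())
    # Skip blank lines and the header, e.g. 'List of available languages (2):'.
    return {line for line in lines if line and not line.startswith("List of")}


def check_setup() -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if shutil.which("tesseract") is None:
        problems.append("tesseract not found on PATH")
    else:
        result = subprocess.run(
            ["tesseract", "--list-langs"], capture_output=True, text=True
        )
        # Some Tesseract versions print the list to stderr, so read both streams.
        if "fas" not in parse_langs(result.stdout + "\n" + result.stderr):
            problems.append("fas.traineddata is not visible to tesseract")
    if shutil.which("pdftoppm") is None:
        problems.append("poppler (pdftoppm) not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("WARNING:", problem)
```

If the script prints nothing, Tesseract, the Farsi language data, and poppler were all found.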

Installation via Docker

docker build -t farsiocr .
docker run --name ocr -it --rm -v $PWD/data:/app/data -v $PWD/output:/app/output farsiocr

How to use?

Copy your PDF or image files into the data directory. (The sample image in the Data directory was downloaded from the internet.)

Run src/ocr.py; the results are written to the output directory.
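
For readers who want a feel for what such a run involves, here is a rough sketch of collecting input files and passing them to Tesseract. It does not reproduce src/ocr.py and leaves out the preprocessing step: `find_inputs`, `ocr_file`, and the list of image extensions are hypothetical, and the use of the `pytesseract`, `Pillow`, and `pdf2image` packages is an assumption suggested by the Tesseract and poppler requirements.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def find_inputs(data_dir):
    """Return the PDF and image files in data_dir, sorted by path."""
    wanted = IMAGE_SUFFIXES | {".pdf"}
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


def ocr_file(path, output_dir):
    """OCR one file with the Farsi model and write <name>.txt into output_dir."""
    # Assumed third-party packages, imported lazily so find_inputs works without them.
    import pytesseract
    from PIL import Image
    from pdf2image import convert_from_path

    path = Path(path)
    if path.suffix.lower() == ".pdf":
        pages = convert_from_path(str(path))  # renders each page; needs poppler
    else:
        pages = [Image.open(path)]
    text = "\n".join(pytesseract.image_to_string(page, lang="fas") for page in pages)
    out_path = Path(output_dir) / (path.stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    Path("output").mkdir(exist_ok=True)
    for input_path in find_inputs("data"):
        print("wrote", ocr_file(input_path, "output"))
```

Each input file produces one text file with the same base name in the output directory.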
