Text OCR Scanner: a software tool for extracting text from images and PDF files.
This software is UNDER HEAVY DEVELOPMENT. Use this OCR software for now.
v0.1.01
# get all dependencies of the project
go mod tidy
# build: produce an executable file/CLI app named "ocr"
go build -o ocr src/*
# run the app to identify a sample image
./ocr --lang=eng --img=img/default.png
# deps, build, run
go mod tidy && go build -o ocr src/* && ./ocr --lang=eng --img=img/default.png
# detect data races using Go's built-in race detector
go run -race .
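The run command above passes `--lang` and `--img`. As a minimal sketch of how such flags could be read with Go's standard `flag` package (the `config` type, the `parseArgs` helper and the default values are hypothetical and are not taken from this project's source):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config holds the two options shown in the run command (hypothetical type).
type config struct {
	lang string
	img  string
}

// parseArgs reads --lang and --img from args. It is an illustrative helper,
// not the project's actual argument handling.
func parseArgs(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("ocr", flag.ContinueOnError)
	fs.StringVar(&c.lang, "lang", "eng", "OCR language code (assumed default)")
	fs.StringVar(&c.img, "img", "", "path to the image to scan")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.img == "" {
		return c, fmt.Errorf("missing --img")
	}
	return c, nil
}

func main() {
	c, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("lang=%s img=%s\n", c.lang, c.img)
}
```

Keeping the parsing in a function that takes a slice, rather than reading `os.Args` directly, makes it easy to test and to reuse from scripts.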
Done & to-do list of features, goals and values. Tiny steps to the goal.
- the OCR library used is Tesseract OCR, via the gosseract v2 wrapper.
- image to text
- PDF to text
- PDF to docx
- PDF to selectable-text PDF
- scalable : take advantage of all CPU cores to get the job done faster
- bulk / batch processing : goroutines and parallelism for tasks / jobs
- composable CLI app for scripts and automation
- UX / easy to use / user friendly
- support English OCR
- support Arabic OCR
- add tests
- use some test images from renard314/textfairy
- cut the image into pieces/segments and concurrently OCR them. (performance)
- available on Debian & Debian-based distros
- available as snap
- available as flatpak
- available on elementary OS
- available on macOS (likely via Homebrew)
- available on Windows
- scan history : { original image, processed image, extracted text }
- migrate from OpenCV to Go libraries
- 3b1b : what is a convolution?
- train / refine Tesseract OCR
- tesseractfonts : fine-tune Tesseract for new fonts
The source code of the OCR project can be found on:
- GitHub: https://github.com/abanoubha/ocr.git
- GitLab: https://gitlab.com/abanoubha/ocr.git
- Codeberg: https://codeberg.org/abanoubha/ocr.git