Spell check with aspell #106
Comments
I believe the script you mention is an older version of OCRmyPDF. At the least, OCRmyPDF contains all of the ideas in that script and many additions as well. I can't tell in what sense "it doesn't work with GNU parallel" without more information. Perhaps try keeping the temporary files around (argument -k) and then testing your command on the .hocr file to get it to work standalone. My guess is that something is wrong with redirecting stdin.

As for spell check as a feature, Tesseract already does spell check internally. When it produces gibberish, it is because Tesseract could not decide. For example, "w_rd" could be "word" or, more rarely, "ward", and if the letter _ is corrupted then it has no business deciding. It does not use NLP to try to figure out which word would fit grammatically. Spell check will help you filter out gibberish to generate a better list of keywords, but it will not extract more information from a bad OCR result. It removes noise, but doesn't add signal. Make sense?
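To illustrate the "removes noise, but doesn't add signal" point, here is a minimal sketch of dictionary-based filtering. The function name `extract_keywords` and the tiny hand-written word set are assumptions made for this example; a real setup would more likely ask a spell checker such as aspell which words it recognizes.

```python
import re

def extract_keywords(ocr_text, dictionary):
    """Return distinct tokens of ocr_text found in dictionary, in first-seen order.

    Unknown tokens are dropped, not corrected: nothing here tries to guess
    what a damaged word was supposed to be.
    """
    # Letters and underscores form a token, so "w_rd" stays in one piece.
    tokens = re.findall(r"[^\W\d]+", ocr_text)
    seen = set()
    keywords = []
    for token in tokens:
        word = token.lower()
        if word in dictionary and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords

# Hypothetical stand-in for a real dictionary.
known_words = {"the", "scanner", "reads", "page", "word", "ward"}
keywords = extract_keywords("The w_rd scanner reads the page qzxv", known_words)
```

Both "w_rd" and "qzxv" are simply discarded. Although "word" and "ward" are in the dictionary, the filter has no basis for choosing between them, so the keyword list gets cleaner but not richer.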
Yes, the script on http://www.konradvoelkel.com/2013/03/scan-to-pdfa/ is an older version of The call of To use I have never used
I can't see that happening in the current shell-script version of this project (v2.x). It's a desirable feature, but the problem is that the script is currently set up to parallelize.

There's a newer Python-based version in my fork that I'm in the process of merging into the mainline. That framework could accommodate interactive prompts a lot more easily. It represents the script as a pipeline instead, and you'd insert a stage into the pipeline that acquires a semaphore and prompts for input. It could provide a GUI.

If you want to try that, as a very rough sketch you'd write a rule in ocrpage.py that transforms .hocr files:

    import shutil
    import subprocess
    from multiprocessing import Lock

    # Only one worker may talk to the terminal at a time.
    tty_lock = Lock()

    @transform(ocr_tesseract, suffix(".hocr"), ".hocr.checked")
    def spell_check_hocr(input_file, output_file):
        # spell_check_enabled stands in for however the option gets exposed.
        if not spell_check_enabled:
            shutil.copy2(input_file, output_file)
            return
        with tty_lock:
            # Popen, not call: communicate() needs a process object.
            # The aspell arguments and the tty handling are left as a sketch.
            p = subprocess.Popen(['aspell', ...], stdin=subprocess.PIPE)
            out, err = p.communicate('/dev/tty')

And then change the other dependent rules that involve ".hocr" to look for ".hocr.checked" instead.
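The "acquire a semaphore, then prompt" stage can also be sketched independently of the pipeline framework. This is only an illustration under stated assumptions: it uses threads rather than the project's own pipeline, and `process_page`, `run_pipeline`, and the `ocr` and `review` callables are hypothetical names, not part of OCRmyPDF.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Guards the interactive step so that only one page prompts at a time.
prompt_lock = threading.Lock()

def process_page(page, ocr, review):
    """OCR a page concurrently with others, then review it while holding the lock."""
    text = ocr(page)  # may overlap with other pages
    with prompt_lock:  # serialized: one review at a time
        return review(page, text)

def run_pipeline(pages, ocr, review, workers=4):
    """Process all pages with a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: process_page(p, ocr, review), pages))
```

Here `review` is where an interactive spell-check session would go. The OCR work still runs in parallel, and only the step that needs the terminal is forced to run one page at a time.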
Just a side note: you say you started a fork with a Python-based script. I'm currently writing ocrmypdf.py. I hope you were referring to ocrpage.py.
@zorglups: ocrpage.py is in the "develop" branch of the main repository now. I haven't forgotten that you expressed interest in writing the Python version (issue #94). We should probably discuss and share some ideas; it's probably better to merge earlier rather than later. I wrote some comments in issue #94.
Hi,
Nice script, I use it with another script from http://www.konradvoelkel.com/2013/03/scan-to-pdfa/
Can you enhance your script with a call to aspell?
I have tried it within src/ocrPage.sh on line 198, but it doesn't work with the GNU Parallel tool.
Thank you
Andre