Spell check with aspell #106

Closed
witchi opened this issue Mar 23, 2015 · 5 comments

Comments

@witchi

witchi commented Mar 23, 2015

Hi,

Nice script, I use it with another script from http://www.konradvoelkel.com/2013/03/scan-to-pdfa/
Can you enhance your script with a call to aspell?

I have tried it within src/ocrPage.sh on line 198:

# perform spell check
[ "$VERBOSITY" -ge "$LOG_DEBUG" ] && echo "Page $page: Performing spell check"
! aspell --dont-backup --lang=de_DE --mode=sgml -c "${curHocr}" < /dev/tty \
        && echo "Could not spell check file \"${curHocr}\". Exiting..." && exit $EXIT_OTHER_ERROR

but it doesn't work with the GNU Parallel tool.

Thank you
Andre

@jbarlow83
Collaborator

I believe the script you mention is an older version of OCRmyPDF. At the least, OCRmyPDF contains all of the ideas in that script and many additions as well.

I don't know in what sense "it doesn't work with GNU parallel" without some other information. Perhaps try keeping the temporary files around (argument -k) and then testing your command with the .hocr file to get it to work standalone. My guess is that something is wrong with redirecting stdin < /dev/tty.

As far as spell check as a feature, Tesseract already does spell check internally. When it produces gibberish, Tesseract could not decide. For example "w_rd" could be "word" or more rarely "ward", and if the letter _ is corrupted then it has no business deciding. It does not use NLP to try to figure out which word would fit grammatically. Spell check will help you filter out gibberish to generate a better list of keywords but it will not extract more information from a bad OCR result. Removes noise, but doesn't add signal. Make sense?
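A minimal Python sketch of the "removes noise, but doesn't add signal" point, under stated assumptions: the function name `filter_keywords` and the toy word list are hypothetical, and a real run would use a full dictionary such as the one aspell ships. Tokens that are not in the dictionary are dropped, and nothing tries to guess what a damaged token was meant to be.

```python
import string

def filter_keywords(ocr_text, dictionary):
    """Return the lowercased tokens of ocr_text that appear in dictionary."""
    known = {w.lower() for w in dictionary}
    kept = []
    for token in ocr_text.split():
        # Trim surrounding punctuation only; damage inside a token stays.
        word = token.strip(string.punctuation).lower()
        if word and word in known:
            kept.append(word)
    return kept

# Hypothetical OCR output with three damaged tokens: w_rd, tbe, nolsy
sample = "The w_rd was scanned, but tbe page is nolsy."
toy_dictionary = ["the", "word", "ward", "was", "scanned",
                  "but", "page", "is", "noisy"]

print(filter_keywords(sample, toy_dictionary))
```

The damaged token "w_rd" is discarded rather than resolved to "word" or "ward", so the keyword list gets cleaner but carries no extra information.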

@witchi
Author

witchi commented Mar 23, 2015

Yes, the script on http://www.konradvoelkel.com/2013/03/scan-to-pdfa/ is an older version of OCRmyPDF, but OCRmyPDF doesn't use a scanner to create an initial PDF. So I have combined both scripts to get a PNM file from scanimage, convert this into TIFF with scantailor and tiffcp and into a PDF with tiff2pdf, which I use in OCRmyPDF to get a PDF-A.

The call to parallel within OCRmyPDF.sh processes every page of the provided PDF in its own job (the jobs can run in parallel), but no job has access to a terminal. You can use --tmux to get a terminal, but it will be closed before you can use it. aspell needs a terminal to display its internal editor, which lets you correct the words provided by tesseract. The output of tesseract is an SGML-style file (hocr), which aspell can parse and compare with language-specific dictionaries. If aspell finds an unknown word, it suggests similar words from the dictionary, and the user can correct the word manually or replace it with one of the suggestions. The OCR output improves because aspell lets the user decide between "word" and "ward" and stores the decision in the hocr file.
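As a rough illustration of what a spell checker sees in an hOCR file, the following Python sketch pulls the recognized words out of `ocrx_word` spans and lists those missing from a dictionary. It is not how aspell works internally; the class `HocrWords`, the function `unknown_words`, the sample markup, and the two-word dictionary are all assumptions made for the example.

```python
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect the text of every <span class="ocrx_word"> element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0  # > 0 while inside an ocrx_word span

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested markup such as <strong> or <em>
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "ocrx_word" in classes:
                self._depth = 1
                self.words.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.words[-1] += data

def unknown_words(hocr, dictionary):
    """Return the hOCR words that are not in dictionary (case-insensitive)."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    known = {w.lower() for w in dictionary}
    stripped = (w.strip() for w in parser.words)
    return [w for w in stripped if w and w.lower() not in known]

# Hypothetical hOCR fragment with one misrecognized word
sample_hocr = (
    "<span class='ocr_line' title='bbox 0 0 100 20'>"
    "<span class='ocrx_word' title='bbox 0 0 40 20'>Guten</span> "
    "<span class='ocrx_word' title='bbox 45 0 100 20'>Mor9en</span>"
    "</span>"
)

print(unknown_words(sample_hocr, ["guten", "morgen"]))
```

Only the flagged words would need the user's attention; deciding the correct replacement is still the interactive step that requires a terminal.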

To use aspell I have removed the usage of parallel from OCRmyPDF.sh and replaced it with a for-loop, which processes all pages of the provided PDF sequentially. With this trick, I call src/ocrPage.sh as a normal shell script (within the loop) and I can use aspell as described above (because the shell script is bound to a terminal instead of to a background job queue). The redirection of stdin to /dev/tty was necessary to display the internal editor of aspell (without the redirection aspell returns the error code 255).

I have never used parallel, so I don't know a way to use aspell and parallel together. Therefore I have started this issue.

@jbarlow83
Collaborator

I can't see that happening in the current, shell script version of this project (v2.x). It's a desirable feature, but the problem is that the script is currently set up to parallelize tesseract and you need serialized interactive input from /dev/tty. Even if it's technically possible to coordinate access to a shared resource in a shell script, I wouldn't want to go there.

There's a newer Python-based version in my fork that I'm in the process of merging to the mainline. That framework could accommodate interactive prompts a lot more easily. It represents the script as a pipeline instead, and you'd insert a stage into the pipeline that acquires a semaphore and prompts for input. It could provide a GUI.

If you want to try that, as a very rough sketch you'd write a rule in ocrpage.py that transforms .hocr files:

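import shutil
import subprocess
from multiprocessing import Lock

tty_lock = Lock()  # only one page job may use the terminal at a time

@transform(ocr_tesseract, suffix(".hocr"), ".hocr.checked")
def spell_check_hocr(input_file, output_file):
    # aspell edits its file in place, so always work on the copy
    shutil.copy2(input_file, output_file)
    if not spell_check_enabled:  # placeholder for the real option check
        return
    with tty_lock, open('/dev/tty') as tty:
        # '...' is left open: the aspell options plus the path of the copy
        subprocess.call(['aspell', ...], stdin=tty)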

And then change the other dependent rules that involve ".hocr" to look for ".hocr.checked" instead.
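The locking idea in the sketch above can be tried in isolation. The following is a minimal, self-contained Python example, not the project's code: threads stand in for the pipeline's worker processes, and `run_pages`, `process_page`, and the `review` callback are hypothetical names. The OCR step runs concurrently, while the review step (where aspell would be started with stdin attached to /dev/tty) is entered by one job at a time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_pages(pages, review, workers=4):
    """Process pages in parallel, but run review() for one page at a time.

    Returns the reviewed texts in page order and the peak number of jobs
    that were inside the review stage simultaneously.
    """
    tty_lock = threading.Lock()
    stats = {"active": 0, "max_active": 0}

    def process_page(page):
        # Stand-in for the parallel part of the job (the OCR itself).
        text = f"page {page} text"
        # Interactive part: guarded so prompts never overlap.
        with tty_lock:
            stats["active"] += 1
            stats["max_active"] = max(stats["max_active"], stats["active"])
            checked = review(page, text)
            stats["active"] -= 1
        return checked

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_page, pages))
    return results, stats["max_active"]

# Demo with a non-interactive review function in place of aspell.
results, peak = run_pages(range(1, 6), lambda page, text: text.upper())
print(results)
print("jobs in review stage at once:", peak)
```

With real worker processes the lock would have to be a `multiprocessing.Lock` created before the workers start, so that every job shares the same lock object.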

@hilsonp

hilsonp commented Mar 29, 2015

Just a side note: you say you started a fork with a Python-based script. I'm currently writing ocrmypdf.py. I hope you were referring to ocrpage.py.


@jbarlow83
Collaborator

@zorglups: ocrpage.py is in the "develop" branch of the main repository now. I haven't forgotten you expressed interest in writing the Python version (issue #94). We should probably discuss and share some ideas - probably better to merge earlier rather than later. I wrote some comments in issue #94.
