This repository has been archived by the owner on Jan 6, 2024. It is now read-only.

Enhance multilingual processing/options #7

Closed
popnt opened this issue Aug 27, 2016 · 32 comments

@popnt

popnt commented Aug 27, 2016

Both Tesseract and ABBYY support multilingual OCR; however, including multiple dictionaries for a single scan increases processing time and (at least for Tesseract) makes the language optimization less effective, since pattern matching breaks down as you increase the variety of patterns to match against.

I would argue that unless you work in the translation field, it's unlikely that a single document will contain more than one language, and it's more likely that you will have a variety of documents to scan that will be one of a handful of languages.

This happens frequently in countries that have more than one official language. Canada is one example, where documentation can arrive in either English or French but rarely both together on the same printed page. The United States has a high percentage of Spanish speakers, although Spanish is only unofficially a second language, while Switzerland has four official languages: German, French, Italian and Romansh.

The script does have language settings; however, unless you edit those parameters each time you scan a given document, all specified languages will be included in all scans, and the end result may not be as accurate or effective as if a particular language had been specified for each document.

What I propose is that the script be modified slightly to monitor subfolders within the current set of monitored folders and then adjust the language parameter accordingly. Example:

  • /storage/service_ocr/PDF/eng --> document contains only English, only English dictionary used
  • /storage/service_ocr/PDF/fra --> document contains only French, only French dictionary used
  • /storage/service_ocr/PDF/eng+fra --> English and French both appear within the same document
  • /storage/service_ocr/PDF/eng+fra+spa --> English, French, Spanish all appear in the same document

This can be especially useful if you send scans from a scanner over something like FTP: changing the target location of a scan then changes the language.
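The folder-to-language mapping proposed above could look roughly like the following. This is only a sketch: `lang_from_path` and `is_valid_lang` are hypothetical helper names, not part of pmocr, and the check assumes three-letter language codes joined by `+`, which does not cover every Tesseract language name.

```bash
# Hypothetical helper: derive the OCR language string from the name of the
# subfolder a file was dropped into, e.g. .../PDF/eng+fra/scan.pdf -> eng+fra.
lang_from_path() {
    local file="$1" dir
    dir="$(dirname -- "$file")"
    basename -- "$dir"
}

# Assumption: accept only three-letter codes joined by "+" (eng, eng+fra, ...).
is_valid_lang() {
    local re='^[a-z]{3}(\+[a-z]{3})*$'
    [[ "$1" =~ $re ]]
}

lang="$(lang_from_path "/storage/service_ocr/PDF/eng+fra/scan_001.pdf")"
if is_valid_lang "$lang"; then
    echo "OCR languages: $lang"
fi
```

A folder name that fails the check (for example the top-level `PDF` folder itself) would fall back to whatever default language the script is configured with.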

@deajan
Owner

deajan commented Aug 27, 2016

I think it would add a lot of complexity to support an undefined number of languages.
The script already supports multiple instances (I run two instances on the same server for different purposes).
Thus, you could have one instance per language(s) you want with folders like:

/storage/service_ocr/eng/PDF
/storage/service_ocr/fra/PDF
/storage/service_ocr/eng+fra/PDF
/storage/service_ocr/eng+fra+spa/PDF

I'll add some instructions on how to run multiple service instances.

@popnt
Author

popnt commented Aug 29, 2016

I can appreciate that adding specialised language triaging would make the script more complex. I was hoping to keep it manageable with a single abstraction layer tied to the pathnames, mapping those directly to the language parameter. Maybe it's more trouble than it's worth :)

Your workaround of running multiple scripts sounds very reasonable; however, wouldn't doing it that way make the multicore routines in v1.5 independent for each script? That is, there would not be any pooling of available cores, right?

@deajan
Owner

deajan commented Aug 29, 2016

You are right, each script could use the given number of cores. So having 6 scripts each allocating 4 cores could lead to 24 cores in use when all folders get fed at the same time.
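One way to keep independent instances from oversubscribing the machine is to give each one an even share of the cores. A minimal sketch, assuming a hypothetical helper `cores_per_instance` (the name and approach are illustrative, not pmocr's):

```bash
# Hypothetical helper: split the machine's cores evenly between instances,
# always leaving each instance at least one core.
# Assumes the instance count is 1 or more.
cores_per_instance() {
    local total_cores="$1" instances="$2"
    local share=$(( total_cores / instances ))
    if (( share < 1 )); then
        share=1
    fi
    echo "$share"
}

cores_per_instance 4 2            # prints 2
cores_per_instance "$(nproc)" 6   # share for 6 instances on this machine
```

The result would then be used as the per-instance core setting, so that the total never exceeds what the machine has unless there are more instances than cores.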

Let me think about a config array in order to launch one monitor per dir with different OCR options.
I'm thinking of something like:

serviceConfig=(
"/storage/service_ocr/eng/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_ENG"
"/storage/service_ocr/fra/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_FRA"
"/storage/service_ocr/eng+fra/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_ENG_FRA"
"/storage/service_ocr/eng+fra+spa/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_ENG_FRA_SPA"
)

This way, I could loop over the array and create a monitor per dir with different OCR parameters.
There would still be shared parameters for all monitored dirs, like CHECK_PDF or DELETE_ORIGINAL.
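Looping over such an array could be sketched as follows. The variable values, `start_monitor` and `launch_monitors` are placeholders invented for illustration; each entry is written as one comma-separated string, which assumes directory names never contain a comma.

```bash
# Placeholder values; in the proposal these come from the script's settings.
PDF_EXTENSION=".pdf"
OCR_ENGINE_ARGS_ENG="-l eng"
OCR_ENGINE_ARGS_FRA="-l fra"

serviceConfig=(
    "/storage/service_ocr/eng/PDF,$PDF_EXTENSION,$OCR_ENGINE_ARGS_ENG"
    "/storage/service_ocr/fra/PDF,$PDF_EXTENSION,$OCR_ENGINE_ARGS_FRA"
)

# Hypothetical stand-in for whatever starts a monitor for one directory.
start_monitor() {
    echo "monitor dir=$1 ext=$2 args=$3"
}

launch_monitors() {
    local entry dir ext args
    for entry in "${serviceConfig[@]}"; do
        # Split on the first two commas; the rest (OCR args) stays intact.
        IFS=',' read -r dir ext args <<< "$entry"
        start_monitor "$dir" "$ext" "$args"
    done
}

launch_monitors
```

Because `read` puts everything after the second comma into the last variable, OCR arguments containing spaces or commas survive unchanged.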

@deajan
Owner

deajan commented Aug 29, 2016

Actually, I've been thinking about this for some time now.
I should split pmocr.sh into a conf file and the executable, so you can run different configs depending on what you need.
In batch mode, you'd have to pass a config file. In service mode, it'll be one instance per config file.

@popnt
Author

popnt commented Aug 29, 2016

Perhaps keep the default options in the main pmocr.sh but allow individual configs to override them so you don't end up with the same settings repeated in 10 different configs?

@popnt
Author

popnt commented Aug 29, 2016

Actually, in batch mode wouldn't it make more sense to pass the options as parameters from the CLI? That way, if you have a special batch to process, you don't need to have a config already built.

Passing a config as a parameter would be useful too, but adjusting individual settings without a config is much quicker.

@deajan
Owner

deajan commented Aug 29, 2016

There's no way you can pass OCR options from the CLI, because there are way too many. All other options except the new multicore variable can already be passed as CLI arguments in batch mode.

The point of removing default options from the main pmocr.sh is to be able to upgrade without losing the config.
I'll go for a default /etc/pmocr/default.conf file which is loaded unless another conf file is given as an argument or via a service file.
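The fallback described here can be sketched as below. `resolve_config` is a hypothetical helper; the `--config=` form matches the option seen later in the service's ExecStart line, but the parsing shown is an assumption, not pmocr's actual code.

```bash
DEFAULT_CONFIG="/etc/pmocr/default.conf"

# Hypothetical helper: use the file given with --config=..., otherwise
# fall back to the default config.
resolve_config() {
    local config="$DEFAULT_CONFIG" arg
    for arg in "$@"; do
        case "$arg" in
            --config=*) config="${arg#--config=}" ;;
        esac
    done
    echo "$config"
}

resolve_config --service                               # prints /etc/pmocr/default.conf
resolve_config --service --config=/etc/pmocr/fr.conf   # prints /etc/pmocr/fr.conf
```

Keeping the settings in a separate file means an upgrade can replace the executable without touching them.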

@popnt
Author

popnt commented Aug 29, 2016

sounds good..! Eager to see what you come up with :)

@popnt popnt closed this as completed Aug 29, 2016
@popnt popnt reopened this Aug 29, 2016
@deajan
Owner

deajan commented Sep 6, 2016

Finished moving the config to default.conf and adapted the service files.
Also improved idle CPU usage and made minor fixes.
Care to review?

@popnt
Author

popnt commented Sep 8, 2016

I updated via git pull, and after running install.sh, then systemctl start pmocr-srv@default.service, then systemctl status pmocr-srv@default.service, I receive the following:

   Loaded: loaded (/lib/systemd/system/pmocr-srv@.service; disabled)
   Active: failed (Result: exit-code) since Wed 2016-09-07 22:29:11 EDT; 6s ago
  Process: 354 ExecStart=/usr/local/bin/pmocr.sh --service --config=/etc/pmocr/%i (code=exited, status=1/FAILURE)
 Main PID: 354 (code=exited, status=1/FAILURE)

Important to note is that I had made manual changes from v1.4 to monitor multiple paths. I have removed pmocr-instance.sh and also /storage/storage_ocr and re-ran install.sh, but the folders are not re-created. So, without looking too deeply into the matter, it seems there may be some inconsistencies in 1.5.

Also, the README only lists instructions for how to run multiple configs with systemd. Will the initV style no longer support multiple configs?

@deajan
Owner

deajan commented Sep 8, 2016

I've worked a bit too fast and made an error in the default.conf file.
Please update to the latest commit, then manually delete the file and run install.sh again (a new install without prior deletion will not overwrite the default config).

install.sh won't create any folders other than /etc/pmocr. You're supposed to have folders to monitor, which you set up in default.conf.

The README states that running InitV style automatically creates an instance per config file, so yes, initV supports multiple configs :)
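Since the systemd unit passes `--config=/etc/pmocr/%i`, the instance name is simply the config file name. The following sketch lists the instance that would correspond to each config file; `list_instances` is a hypothetical helper, and `CONFIG_DIR` is made overridable only so it can be tried on a scratch directory.

```bash
# Directory holding one .conf file per instance (overridable for testing).
CONFIG_DIR="${CONFIG_DIR:-/etc/pmocr}"

# Hypothetical helper: print the systemd unit name for every config file.
list_instances() {
    local dir="$1" conf
    for conf in "$dir"/*.conf; do
        [ -e "$conf" ] || continue   # no config files: the glob stayed literal
        echo "pmocr-srv@$(basename -- "$conf").service"
    done
}

list_instances "$CONFIG_DIR"
```

Each printed name can be handed to `systemctl start` or `systemctl status`.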

If you have other failures with the new version, please give me the output of

systemctl status pmocr-srv@default.conf

and /var/log/pmocr.log

@popnt
Author

popnt commented Sep 9, 2016

I updated to the last commit, re-added the /storage/service_ocr/* paths and ran systemctl start pmocr-srv@default.conf; however, I'm still getting a similar error. Here is the full output:

   Loaded: loaded (/lib/systemd/system/pmocr-srv@.service; disabled)
   Active: failed (Result: exit-code) since Thu 2016-09-08 21:44:18 EDT; 13min ago
  Process: 3735 ExecStart=/usr/local/bin/pmocr.sh --service --config=/etc/pmocr/%i (code=exited, status=1/FAILURE)
 Main PID: 3735 (code=exited, status=1/FAILURE)

Sep 08 21:44:17 host systemd[1]: Started pmocr - monitors a local directory and gives any new file to an OCR ... file.
Sep 08 21:44:18 host pmocr.sh[3735]: CRITICAL:/usr/local/bin/abbyyocr11 not present.
Sep 08 21:44:18 host systemd[1]: pmocr-srv@default.conf.service: main process exited, code=exited, status=1/FAILURE
Sep 08 21:44:18 host systemd[1]: Unit pmocr-srv@default.conf.service entered failed state.

The /var/log/pmocr.log file only contains one line at the end, which is the following:

CRITICAL:/usr/local/bin/abbyyocr11 not present.

Given that this version is still not quite ready and I do not have a test environment set up, I'll revert to the previous working version, run just two instances and specify only 2 cores for each instance on my 4-core machine.

@deajan
Owner

deajan commented Sep 9, 2016

Do you use tesseract or abbyyocr?
I've tested with both OCR tools: pmocr works and my tests ran successfully, in batch and in service mode.

But you have to comment out the lines in default.conf that don't correspond to your OCR tool, and uncomment those that do. Have you done this step?
Maybe this has to be documented more clearly in the default.conf file, but I think this is no longer a development issue, only a config issue.

@popnt
Author

popnt commented Sep 9, 2016

I had reviewed default.conf and saw

OCR_ENGINE=tesseract3

which is the OCR engine I'm using, so I'm not sure why other lines corresponding to ABBYY should actually be commented out. I do not recall having to comment out those lines in v1.4.

I had a look at the README again and there is no indication to perform the step you describe. I can go through the remainder of default.conf, but I have no idea what other lines should be edited or commented out.

@deajan
Owner

deajan commented Sep 9, 2016

OCR_ENGINE=xxxxxxx helps the program to decide what type of special code it has to run.

But OCR_ENGINE_EXEC is defined twice if one of the sections is not commented out.
Just comment all the abbyy11 lines out of your default.conf and you should be running.

I've commented both sections out by default, and added clearer instructions in default.conf.
Sometimes things seem clear to the developer who designed something, even if they're not clear to anyone else :)

[Edit] In v1.4 I had logic to remove lines depending on OCR_ENGINE=xxxxx, but I cannot add logic code to a conf file :)[/Edit]
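Why a doubly defined variable causes the failure seen earlier can be shown with a small sketch. It assumes the conf file is read by the shell as plain variable assignments, and the Tesseract path is a made-up example; only the `abbyyocr11` path comes from the log above.

```bash
# Write a throwaway config with both engine sections left active.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
OCR_ENGINE=tesseract3
OCR_ENGINE_EXEC=/usr/bin/tesseract
OCR_ENGINE_EXEC=/usr/local/bin/abbyyocr11
EOF

# Load it the way a shell script would; the later assignment wins.
. "$conf"
echo "OCR_ENGINE=$OCR_ENGINE"
echo "OCR_ENGINE_EXEC=$OCR_ENGINE_EXEC"   # prints OCR_ENGINE_EXEC=/usr/local/bin/abbyyocr11
rm -f "$conf"
```

So even with OCR_ENGINE=tesseract3, the executable that gets checked is whichever one was assigned last, which is consistent with the "abbyyocr11 not present" message.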

@popnt
Author

popnt commented Sep 9, 2016

I see that pmocr.sh originally had an IF wrapper which prevented the declarations for tesseract3 and abbyyocr11 from affecting each other (line-59).

Personally I think that was a more elegant solution than commenting out entire sections, as this can lead to human error, and it's much easier to control which OCR to use with one global variable, OCR_ENGINE, instead.

Nevertheless I commented out the abbyyocr11 settings from default.conf and the service appears to start properly now. I have not tested with multiple configs yet.

When I moved one single jpg into a monitored path, the OCR process began as expected. However, when I then moved a different jpg into the same directory while the first process was still running, the OCR process picked up the new jpg but also started a new process to OCR the previous jpg. So there were 3 processes running concurrently, 2 of them for the same jpg, started at different times.

@deajan
Owner

deajan commented Sep 10, 2016

As I said, I cannot add IF logic directly into a config file.
Having commented out settings is a small price to pay to have an upgrade path that doesn't overwrite the configuration.
Also, it's way more elegant to only duplicate a config file than to have to duplicate the main executable and the service files for each instance.

I'm not sure about your problem with the double OCR process: from a code point of view, no other OCR session should launch until the first is finished. I'll run some tests.

@deajan
Owner

deajan commented Sep 10, 2016

Having done multiple tests, I can't see a way for the same file to get processed multiple times.
I think that you already had another pmocr service running while trying the new one.
You should check with

ps aux | grep inotifywait

There shouldn't be more than one inotifywait instance with the same path.
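To make that check less error-prone by eye, the process list can be filtered for paths that are watched more than once. A rough sketch, assuming the watched path is the last argument of each inotifywait command line (as in the status output in this thread) and contains no spaces; `duplicate_watch_paths` is a hypothetical helper.

```bash
# Read process command lines on stdin and print every watched path that
# appears in more than one inotifywait invocation.
duplicate_watch_paths() {
    awk '/inotifywait/ { print $NF }' | sort | uniq -d
}

# Typical use (pgrep -a prints each PID with its full command line):
#   pgrep -a inotifywait | duplicate_watch_paths
```

No output means no path has more than one watcher.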

@popnt
Author

popnt commented Sep 10, 2016

I re-read the previous notes and I cannot find where you indicated the config cannot contain an IF, but your explanation makes sense, and I agree that having multiple configs is a better tradeoff than duplicating the executable.

As for the multiple instances, I ran ps aux|grep inotify and there is only one process running. I repeated my earlier test: first copying a single file to the monitored path, waiting a second for the OCR to start, then copying a second, different file. Unfortunately three processes were running again, with the same pattern as before: 2 OCR processes running for the same file. I am not sure what might be causing this, but even after rebooting I am able to repro.

I am monitoring by running screen, then top in one window and bash in another. Then from the bash window I copy a single jpg, wait for tesseract to appear, then copy another jpg. It does successfully output three PDFs and there are no other jpgs in the monitored path prior to the test procedure.

Can you think of anything else other than multiple inotify running that might cause this?

@deajan
Owner

deajan commented Sep 11, 2016

I stated that "I cannot add logic to a config file" (the comment with the EDIT tag), which includes any instructions other than variable assignments, including IF.

Anyway, I've tried to reproduce your problem, without success.
While doing so, I identified two other potential bugs, one of which made me modify the code that handles file monitoring to make it asynchronous in order to catch all files (files added while an OCR process was already running weren't caught, leaving them unprocessed until the next file was added).
Please update to the latest code and check again.
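For illustration only, one common way to avoid missing files that arrive during a conversion is to keep the watcher running and feed its events through a pipe to a separate consumer loop. This is a sketch of that general pattern, not the actual pmocr implementation; `handle_file`, `process_queue` and `watch_dir` are hypothetical names.

```bash
# Hypothetical handler; a real one would invoke the OCR engine.
handle_file() {
    echo "processing $1"
}

# Consume one path per line. Paths that arrive while a file is being
# handled wait in the pipe instead of being lost.
process_queue() {
    local path
    while IFS= read -r path; do
        case "$path" in
            *_OCR.pdf|*_OCR_ERR.pdf) continue ;;   # skip the OCR output files
        esac
        handle_file "$path"
    done
}

# Watcher: -m keeps inotifywait running after each event, and
# --format '%w%f' prints the full path of the created file.
watch_dir() {
    inotifywait -m -q -r -e create --format '%w%f' "$1" | process_queue
}

# Example (runs until interrupted):
#   watch_dir /storage/service_ocr/PDF/fra
```

The watcher and the consumer run as two processes connected by the pipe, so monitoring continues while a conversion is in progress.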

If you still experience your problem, I'll need you to follow the instructions below:

Stop the service, double check there is no inotifywait and no pmocr running with

ps aux | egrep "pmocr|inotify"

Then launch the service manually with the following command (using the debug version instead of the normal version of the program):

_PARANOIA_DEBUG=yes bash -x ./debug_pmocr.sh --service --verbose > /var/log/pmocr_debug.log 2>&1

Then I'll need the output of /var/log/pmocr.log and /var/log/pmocr_debug.log (pasted as gist if possible).

@deajan deajan added this to the v1.5 milestone Sep 11, 2016
@popnt
Author

popnt commented Sep 11, 2016

In the interest of keeping the testing as clean as possible, I'll set up a test server via QEMU. My platform is a Raspberry Pi, which is ARM, and apparently QEMU is the easiest virtualisation method. I'll let you know soon when I have that set up.

I had also noticed the bug which skipped newly added files if there was an OCR process currently running; however, since the next batch of files would then include the one skipped, I didn't think it was too critical. Nice to see you caught that one too :)

Please allow me a few days to set up the virtual machine. I'll reply with the results after following the debug instructions above.

@deajan
Owner

deajan commented Sep 12, 2016

Corrected two other bugs this morning and improved tesseract support.
Please only test with the latest master. Eager to have your test results.

Btw, I'm thinking of including an OCR preprocessor for tesseract. Do you use any tools like OpenCV or ImageMagick to deskew / clear background / remove noise from your images prior to handling them with tesseract?

@popnt
Author

popnt commented Sep 15, 2016

I manually cleared off previous pmocr tests, and updated to master, and I am no longer able to repro. I'm not sure what might've changed but with each test I performed there was no duplication of processes.

I have not yet attempted to roll back to previous pmOCR versions to see if the behaviour can be found in a previous version or if it was simply my local environment. However I did complete tests using QEMU running Wheezy 3.1.9 using latest master and again I could not find any duplicate process. If I notice the behaviour occurring again I'll update this bug but I'm fine with moving on and closing the issue.

The only behaviour I noticed that I am not able to reconcile is that when I get the pmocr service status, there are two pmocr.sh processes running, as shown below. Notice there are two pmocr.sh running but like I said above there is no duplication of process, each jpg is scanned only once.

   Loaded: loaded (/lib/systemd/system/pmocr-srv@.service; disabled)
   Active: active (running) since Wed 2016-09-14 22:09:10 EDT; 43min ago
 Main PID: 7734 (bash)
   CGroup: /system.slice/system-pmocr\x2dsrv.slice/pmocr-srv@fr.conf.service
           ├─ 7734 bash /usr/local/bin/pmocr.sh --service --config=/etc/pmocr/fr.conf
           ├─ 7749 bash /usr/local/bin/pmocr.sh --service --config=/etc/pmocr/fr.conf
           ├─16735 inotifywait --exclude (.*)_OCR.pdf --exclude (.*)_OCR_ERR.pdf -qq -r -e create /storage/service_ocr/PDF/fra
           └─19268 sleep 1

Regarding your question about deskewing and noise reduction, my scanner has both built in, so I have not yet had the need to perform these as part of the OCR process, but I can certainly see the benefit for others. If a new issue is opened regarding ImageMagick integration, I'll add feedback if any comes to mind :)

@deajan
Owner

deajan commented Sep 15, 2016

Glad to hear everything worked for you.
I still don't know what could have been triggering the double conversion, and I never saw it in my tests, but since I rewrote the code that triggers conversion in order to get async monitoring, I had to redo all tests anyway.
There are now two pmocr.sh processes running in order to keep monitoring even while converting, so that's pretty normal.

About the preprocessing, I already integrated ImageMagick as an optional preprocessor for Tesseract in the latest commits.
If you are happy with the new functionality regarding multiple monitoring with different OCR options, feel free to close this issue.

@popnt
Author

popnt commented Sep 16, 2016

Yes, I saw the new settings. I haven't experimented with them, but I'll give them a shot when I get a chance.

One question, though I'm not sure whether it's related to this release or was already the behaviour before: the source files are added to the monitored path by a regular non-root user, but the output PDF is owned by root. This doesn't seem normal. On my system I am the only user, so it's no big deal, but shouldn't the output files be owned by the same user as the source files?

@deajan
Owner

deajan commented Sep 16, 2016

Good point. Actually the files are created as the user who runs the service.
There's no easy way to preserve file ownership, as the files are not really transformed but rather newly created.
Three solutions:

  • Get the files' permissions before processing them, and apply them later to the output files (bulky and inelegant solution)
  • Launch the service as another user (makes everything more complex)
  • Add ACL inheritance on the monitored folders (simple as long as the FS has ACL support)

What do you think about the ACL solution?

@popnt
Author

popnt commented Sep 16, 2016

For ACL inheritance, just so I understand the suggestion correctly, you're saying that the current folder's owner is what would be inherited by the files created? I suspect I have misunderstood something, but if that's what you're saying then yeah, maybe that would be best.

I agree the first two points are probably not ideal, although I'm not sure retaining all the file permissions would be necessary rather than just the owner. How much more or less complex would that be versus retaining all the permissions, or is this just semantics? Launching the service as another user could end up being a mess and is probably not useful for an admin.

@deajan
Owner

deajan commented Sep 16, 2016

The owner of new files would still be root, but an ACL on the parent folder would allow other users or groups than root to have the same privileges as root on the file.

Getting and setting ownership and/or permissions on files isn't really elegant, even if it doesn't take much coding effort.

Launching the service as another user is what other daemons often do, but it would just shift the problem you describe from root to another user.

@popnt
Author

popnt commented Sep 16, 2016

Setting the ACL on the parent folder would require the person installing the service to set the permissions correctly and then manage the user/group rights, which I imagine would probably be very reasonable from an admin perspective. However, for a less experienced admin it might be an annoyance.

Conceptually speaking, if I were running a service for a group of users and I wanted them to benefit from the OCR service but didn't want to allow each one to read or edit any of the other users' content, then maybe managing groups in the parent folder would be less convenient than making the output accessible only by the originating user. I really don't have enough experience managing file rights for groups of users like this, so I'm not sure what sort of headache this might become.

Perhaps I am naive, but if assigning the same ownership/permissions to the output file is technically trivial, albeit not super elegant, that may be a worthwhile sacrifice. I only asked this question under this issue to confirm whether this is the intended behaviour for v1.5, but we can open a new issue and discuss it in more detail there? The multilingual/multiconfig aspects of v1.5 seem to be pretty solid AFAIC.

@deajan
Owner

deajan commented Sep 16, 2016

Well there aren't like 1000 issues open on pmocr, so we might just continue to talk here.

I added an optional parameter to keep ownership of the files, as long as the service is executed as root so it can chown.
I also added a chmod mask parameter so new files will have the permissions set in the config file instead of the default ones.
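As a rough picture of what such options involve, the sketch below copies the owner and group from the source file to the output and optionally applies a fixed mode. `apply_ownership` and its arguments are hypothetical and are not the names used in pmocr's config; `chown --reference` is a GNU coreutils option.

```bash
# Hypothetical helper: give the OCR output the same owner/group as its
# source file, then optionally apply a fixed mode such as 640.
# Changing the owner to a different user only succeeds when run as root.
apply_ownership() {
    local source_file="$1" output_file="$2" mode="${3:-}"
    chown --reference="$source_file" "$output_file"
    if [ -n "$mode" ]; then
        chmod "$mode" "$output_file"
    fi
}

# Example:
#   apply_ownership scan_001.jpg scan_001_OCR.pdf 640
```

When the service does not run as root, the chown step can only keep the files owned by the service user, which is why the option depends on root.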

Let me know if this works out for you.

@popnt
Author

popnt commented Sep 20, 2016

I've been running the most recent master with new ownership/permissions settings and everything seems to be running smoothly! I haven't noticed anything out of the ordinary, all around job well done I would say :)

@deajan
Owner

deajan commented Sep 20, 2016

Thanks for the feedback.
I'll have to review the code before going for a release.
Feel free to open another issue if you think of other improvements.

@deajan deajan closed this as completed Sep 20, 2016