Enhance multilingual processing/options #7
Comments
I think it would add big complexity to include an undefined number of languages.
I'll add some instructions on how to multiply service instances. |
I can appreciate that adding specialised language triaging would make the script more complex. I was hoping to keep it manageable with a single abstraction layer tied to the pathnames, mapping those directly to the language parameter. Maybe it's more trouble than it's worth :) Your workaround of running multiple scripts sounds very reasonable. However, wouldn't doing it that way make the multicore routines in v1.5 independent for each script? There would not be any pooling of available cores, right? |
You are right, each script could use the given number of cores. So having 6 scripts each allocating 4 cores could lead to 24 cores in use when all folders get fed at the same time. Let me think about a config array in order to launch one monitor per dir with different OCR options.
This way, I could loop over the array and create a monitor per dir with different OCR parameters. |
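The config-array idea described above could look roughly like the following. This is only a minimal sketch, not pmocr's actual code: the `MONITORS` array, the `start_monitor` and `launch_all` functions, and the subfolder names are all hypothetical, and the sketch only prints what it would start instead of launching a real folder watcher.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: one "directory|language" entry per monitored folder.
# MONITORS, start_monitor, launch_all and the paths are illustrative names.
MONITORS=(
  "/storage/service_ocr/english|eng"
  "/storage/service_ocr/french|fra"
)

start_monitor() {
  local dir="$1" lang="$2"
  # A real implementation would start a folder watcher here;
  # this sketch only reports what it would launch.
  echo "monitor dir=${dir} lang=${lang}"
}

launch_all() {
  local entry
  for entry in "${MONITORS[@]}"; do
    # Split "dir|lang" at the last "|".
    start_monitor "${entry%|*}" "${entry##*|}"
  done
}

launch_all
```

Because every monitor is started from one loop, a shared core budget could be applied in the same place rather than per script.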
Actually, I've been thinking about this for some time now. |
Perhaps keep the default options in the main pmocr.sh but allow individual configs to override them so you don't end up with the same settings repeated in 10 different configs? |
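The override idea suggested here can be sketched as follows. This is an illustration under assumptions, not how pmocr loads its settings: `load_config`, `OCR_LANGUAGE` and `NUMBER_OF_PROCESSES` are hypothetical names, and the default values are made up.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: defaults are set first, then a per-instance config
# file overrides only the settings it mentions.
load_config() {
  local conf_file="$1"
  # Illustrative defaults (names and values are assumptions).
  OCR_LANGUAGE="eng"
  NUMBER_OF_PROCESSES=1
  # Sourcing the config after the defaults lets its assignments win.
  if [ -n "$conf_file" ] && [ -f "$conf_file" ]; then
    # shellcheck disable=SC1090
    . "$conf_file"
  fi
}
```

A config containing only `OCR_LANGUAGE="fra"` would then change the language and leave every other setting at its default.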
Actually, in batch mode wouldn't it make more sense to pass the options as parameters from the CLI? That way, if you have a special batch to process, it doesn't require a config that has already been built. Passing a config as a parameter would be useful too, but adjusting individual settings without a config is much quicker. |
There's no way you can pass OCR options from the CLI, because there are way too many. All other options except the new multicore variable can already be passed as CLI arguments in batch mode. The point of removing default options from the main pmocr.sh is to be able to upgrade without losing config. |
sounds good..! Eager to see what you come up with :) |
Finished moving config to default.conf and adapting the service files. |
I updated via git pull and after running install.sh and using systemctl start pmocr-srv@default.service then systemctl status pmocr-srv@default.service I receive the following
Important to note is that I had made manual changes from v1.4 to monitor multiple paths. I have removed pmocr-instance.sh and also /storage/storage_ocr and re-ran install.sh, but the folders are not re-created, so without looking too deeply into the matter it seems there may be some inconsistencies in 1.5. Also, the README only lists instructions for how to run multiple configs with systemd. Will the initV style no longer support multiple configs? |
I've worked a bit too fast and made an error in the default.conf file. Install.sh won't create any other folders than /etc/pmocr. You're supposed to have folders to monitor which you setup in default.conf. The README states that running InitV style automatically creates an instance per config file, so yes, initV supports multiple configs :) If you have other failures with the new version, please give me the output of
and /var/log/pmocr.log |
I updated to the last commit, re-added the /storage/service_ocr/* paths and ran systemctl start pmocr-srv@default.conf, however I'm still getting a similar error. Here is the full output:
the /var/log/pmocr.log only contains one line at the end which is the following:
Given that this version is still not quite ready and I do not have a test environment setup, I'll revert back to the previous working version and just run two instances and only specify 2 cores for each instance on my 4 core machine. |
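The workaround of splitting cores between independent instances comes down to simple arithmetic, sketched below. The `cores_per_instance` helper is hypothetical and not part of pmocr; it just divides a core count evenly and never returns less than one.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a fixed number of cores evenly between
# independent instances, giving each instance at least one core.
cores_per_instance() {
  local total="$1" instances="$2"
  local share=$(( total / instances ))
  if [ "$share" -lt 1 ]; then
    share=1
  fi
  echo "$share"
}
```

For example, `cores_per_instance "$(nproc)" 2` would give each of two instances half of the machine's cores, which on a 4 core machine matches the 2 cores per instance mentioned above.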
Do you use tesseract or abbyyocr? You have to comment out the lines in default.conf that don't correspond to your OCR tool, and uncomment those that do. Have you done this step? |
I had reviewed default.conf and saw
which is the OCR engine I'm using, so I'm not sure why other lines corresponding to ABBYY should actually be commented out. I do not recall having to comment out those lines in v1.4. I had a look at the README again and there is no indication that the step you describe is needed. I can go through the remainder of default.conf, but I have no idea what other lines should be edited or commented out. |
OCR_ENGINE=xxxxxxx helps the program decide what type of special code it has to run. But OCR_ENGINE_EXEC is defined twice if one of the sections is not commented out. I've commented both sections out by default, and added clearer instructions in default.conf. [Edit] In v1.4 I had logic to remove lines depending on OCR_ENGINE=xxxxx, but I cannot add logic code to a conf file :)[/Edit] |
I see that pmocr.sh originally had an IF wrapper which prevented the declarations for tesseract3 and abbyyocr11 from affecting each other (line-59). Personally I think that was a more elegant solution than commenting out entire sections, as this can lead to human error, and it's much easier to control which OCR to use through one global variable, OCR_ENGINE. Nevertheless I commented out the abbyyocr11 settings in default.conf and the service appears to start properly now. I have not tested with multiple configs yet. When I moved one single jpg into a monitored path, the OCR process began as expected. However, when I then moved a different jpg into the same directory while the first process was still running, the OCR process picked up the new jpg but also started a new process to OCR the previous jpg. So there were 3 processes running concurrently, 2 of them for the same jpg started at different times. |
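Since the config file itself cannot contain logic, one way to catch the human error discussed here is a check on the script side. The following is a hedged sketch, not something pmocr is known to do: `count_active_assignments` is a hypothetical helper that counts how many uncommented lines assign a given variable in a config file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count uncommented assignments of a variable in a
# config file, so a caller can warn when e.g. OCR_ENGINE_EXEC appears twice.
count_active_assignments() {
  local var="$1" conf_file="$2"
  # Match lines that start (after optional whitespace) with VAR=.
  # Lines commented out with '#' do not match. grep -c exits non-zero
  # when the count is 0, hence the '|| true'.
  grep -c -E "^[[:space:]]*${var}=" "$conf_file" || true
}
```

A result greater than 1 for `OCR_ENGINE_EXEC` would indicate that both engine sections were left uncommented.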
As I said, I cannot add IF logic directly into a config file. I'm not sure about your problem with the double ocr process as no other OCR session should launch until the first is finished, from a code point of view. I'll have some tests. |
Having done multiple tests, I see no way the same file could get processed multiple times.
There shouldn't be more than one inotifywait instance with the same path. |
I re-read previous notes and I cannot find where you indicated the config cannot contain an IF, but your explanation makes sense though and I agree the tradeoff to having multiple configs is better than to duplicate the executable. As for the multiple instances, I ran ps aux|grep inotify and there is only one process running. I repeated my earlier test of first copying a single file to the monitored path, wait a second for the ocr to start, then copy a second different file, and unfortunately three processes were running again, same pattern as before: 2 OCR running for the same file. I am not sure what might be causing this but even after rebooting I am able to repro. I am monitoring by running screen, then top in one window and bash in another. Then from the bash window I copy a single jpg, wait for tesseract to appear, then copy another jpg. It does successfully output three PDFs and there are no other jpgs in the monitored path prior to the test procedure. Can you think of anything else other than multiple inotify running that might cause this? |
I stated that "I cannot add logic to a config file" (the comment with the EDIT tag), which includes any instructions other than variable assignments, including IF. Anyway, I've tried to reproduce your problem, without success. If you still experience your problem, I'll need you to follow the instructions below: Stop the service, double check there is no inotifywait and no pmocr running with
Then launch the service manually with the following command (using the debug version instead of the normal version of the program):
Then I'll need the output of /var/log/pmocr.log and /var/log/pmocr_debug.log (pasted as gist if possible). |
In the interest of keeping the testing as clean as possible I'll setup a test server via QEMU.. my platform is a Raspberry Pi which is ARM and apparently QEMU is the easiest virtualisation method. I'll let you know soon when I have that setup. I had also noticed that bug which skipped newly added files if there was a current OCR process running, however since the next batch of files would then include the one skipped I didn't think it was too critical, but nice to see you caught that one too :) Please allow me a few days to setup the virtual machine. I'll reply back with the results after following the debug instructions above. |
Corrected two other bugs this morning and improved tesseract support. Btw, I'm thinking of including an OCR preprocessor for tesseract. Do you use any tools like OpenCV or ImageMagick to deskew / clear background / remove noise from your images prior to handling them with tesseract? |
I manually cleared off previous pmocr tests, and updated to master, and I am no longer able to repro. I'm not sure what might've changed but with each test I performed there was no duplication of processes. I have not yet attempted to roll back to previous pmOCR versions to see if the behaviour can be found in a previous version or if it was simply my local environment. However I did complete tests using QEMU running Wheezy 3.1.9 using latest master and again I could not find any duplicate process. If I notice the behaviour occurring again I'll update this bug but I'm fine with moving on and closing the issue. The only behaviour I noticed that I am not able to reconcile is that when I get the pmocr service status, there are two pmocr.sh processes running, as shown below. Notice there are two pmocr.sh running but like I said above there is no duplication of process, each jpg is scanned only once.
Regarding your question about deskew and reduce noise, my scanner does have built-in deskew and noise reduction so I have not yet had the need to perform these as part of the OCR process, but I can certainly see the benefit for others. If a new issue is opened regarding ImageMagick integration I'll add in feedback if any come to mind :) |
Glad to hear everything worked for you. About the preprocessing, I already integrated ImageMagick as optional preprocessor for Tesseract in latest commits. |
Yes I saw the new settings, I haven't experimented with them but I'll give them a shot when I get a chance. One question, though I'm not sure whether it's related to this release or was the behaviour before: the source files are added to the monitored path by a regular non-root user, however the output pdf is actually owned by root. This doesn't seem normal. Since I am the only user on my system it's no big deal, but shouldn't the output files be owned by the same user as the source files? |
Good point. Actually the files are created as the user who runs the service.
What do you think about the ACL solution ? |
For ACL inheritance, just so I understand the suggestion correctly, you're saying that the current folder's owner is what would be inherited by the files created? I suspect I have misunderstood something, but if that's what you're saying then yeah maybe that would be best. I agree the first two points are probably not ideal, although I'm not sure retaining all the file permissions would be necessary but rather just the owner -- how much more or less complex would that be versus retaining all the permissions, or is this just semantics? Launching the service as another user could end up being a mess and probably not useful for an admin. |
The owner of new files would still be root, but an ACL on the parent folder would allow users or groups other than root to have the same privileges as root on the file. Getting and setting ownership and / or permissions on files isn't really elegant, even if it doesn't take much coding effort. Launching the service as another user is what other daemons often do, but it would just shift the problem you describe from root to another user. |
Setting the ACL on the parent folder would require the person installing the service to set the permissions correctly and then manage the user/group rights, which I imagine would probably be very reasonable from an admin perspective. However, for a less experienced admin it might be an annoyance. Conceptually speaking, if I were running a service for a group of users and I wanted them to benefit from the OCR service but I didn't want to allow each one to read or edit any of the other users' content, then maybe managing groups in the parent folder would be less convenient than making the output accessible only by the originating user. I really don't have enough experience managing file rights for groups of users like this, so I'm not sure what sort of headache this might become. Perhaps I am naive, but if assigning the same ownership/permissions to the output file is technically trivial albeit not super elegant, that may be a worthwhile sacrifice. I only asked this question under this issue to confirm if this is the intended behaviour for v1.5, but we can open up a new Issue and discuss it in more detail there? The multilingual/multiconfig aspects of v1.5 seem to be pretty solid AFAIC. |
Well there aren't like 1000 issues open on pmocr, so we might just continue to talk here. I added an optional parameter to keep ownership of the files, as long as the service is executed as root so it can chown. Let me know if this works out for you. |
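Keeping the ownership of the source file on the output could be done along these lines. This is a sketch under assumptions and not necessarily how the option was implemented: `preserve_ownership` is a hypothetical name, and `chown --reference` assumes GNU coreutils. Changing a file to another user's ownership only works when the caller is root, which matches the constraint stated above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: give the OCR output the same owner and group as the
# source file. Assumes GNU chown; needs root to hand files to another user.
preserve_ownership() {
  local source_file="$1" output_file="$2"
  # --reference copies owner and group from the given file.
  chown --reference="$source_file" "$output_file"
}
```

It would be called once per generated file, for example `preserve_ownership "$input" "$output_pdf"`, where both variable names are placeholders.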
I've been running the most recent master with new ownership/permissions settings and everything seems to be running smoothly! I haven't noticed anything out of the ordinary, all around job well done I would say :) |
Thanks for the feedback. |
Both Tesseract and ABBYY support multilingual OCR. However, including multiple dictionaries for a single scan increases processing time and (at least for Tesseract) makes the language optimization less effective, since pattern matching breaks down as you increase the variety of patterns to match against.
I would argue that unless you work in the translation field, it's unlikely that a single document will contain more than one language, and it's more likely that you will have a variety of documents to scan that will be one of a handful of languages.
This happens frequently in countries that have more than one official language. Canada is one example, where documentation can arrive in either English or French but rarely both together on the same printed page. The United States has a high percentage of Spanish speakers, although Spanish is only unofficially a second language, while Switzerland has four official languages: German, French, Italian and Romansh.
The script does have language settings. However, unless you edit those parameters each time you scan a given document, all specified languages will be included in all scans, and the end result may not be as accurate or effective as when a particular language is specified for each document.
What I propose is that the script be modified slightly to monitor subfolders within the current set of monitored folders and then adjust the language parameter accordingly. Example:
This can be especially useful if you use something like FTP from a scanner so you can change the target location of a scan and that will change the language.
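The subfolder-to-language mapping proposed above could be sketched as follows. The folder names, the `language_for_path` function and the fallback to `eng` are assumptions made for illustration; the language codes are the ones Tesseract uses for its `-l` option.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: derive the OCR language from the name of the
# subfolder a file was dropped into.
language_for_path() {
  local file="$1"
  local folder
  folder="$(basename "$(dirname "$file")")"
  case "$folder" in
    english) echo "eng" ;;
    french)  echo "fra" ;;
    german)  echo "deu" ;;
    italian) echo "ita" ;;
    *)       echo "eng" ;;  # fallback default, an assumption
  esac
}
```

With this in place, a scanner that uploads to a `french` subfolder would have its files processed with `fra` without any settings being edited.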