Consumer is not detecting files from 120kb? #378
Comments
Does the consumer output anything at all when you add a large doc to the consumption dir? If you first feed the consumer a big file and then a small file (both similarly named), will the consumer only consume the small file? Does the consumer use inotify? |
@erikarvstedt No, it does not. When I compress a PDF that is too big until it is small enough, the consumer detects it perfectly. I am running Paperless and the Paperless consumer on Docker. And yes, it uses inotify! |
This seems really odd. On my native setup (no Docker) my biggest PDF is 20 MB and it was consumed with no issues. |
@ddddavidmartin Everything out of the box or have you tweaked some settings? |
@MaartenMol96 If you can share the log of the consumer as @erikarvstedt suggested, that'll give us more to work with, as would sharing the file that can't be consumed so we can try it in our test environments. Something else worth looking into is whether the document in question has already been consumed. Paperless checks every candidate for consumption: if the file has the same hash as one already in the db, it explicitly skips it (this is mentioned in the log output). If that's the case, it may seem that Paperless isn't detecting the file, when in fact it's ignoring it as redundant. |
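To make the duplicate check described above concrete, here is a minimal sketch of hash-based skipping. It is not Paperless's actual code: the function names, the in-memory set of known checksums, and the choice of SHA-256 are all assumptions for illustration (the thread only says "the same hash").

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks.

    SHA-256 is an assumption; the thread does not name the hash used.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_checksums: set[str]) -> bool:
    """True if the file's checksum is already in the known set.

    `known_checksums` stands in for the checksums stored in the db.
    """
    return file_checksum(path) in known_checksums
```

If a check like this returned true, the file would be skipped with a log message, which is why an empty log points away from the duplicate explanation.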
@danielquinn There is no logging of the file because the consumer doesn't do anything with it. The files that get consumed are logged properly and work fine. One file that doesn't work, for example: Paperless Latest Documentation PDF. It's also not a duplicate in any form. |
Well that's annoying. I can't reproduce it. Here's what I did and it worked fine:
If your server isn't outputting anything (even the "duplicate found, skipping" message), then I'm guessing that the folder you're writing to isn't the same folder Paperless is monitoring for changes. Is it possible that your Docker setup isn't mounting the right directory, or it's somehow un-mounting at some point? Without knowing much about your setup, it's hard to know what's going on. Paperless can definitely handle that file though. |
@danielquinn Why are other files processed perfectly then? Same directory, no unmounting is happening. I am using the example docker compose. Running Docker CE on CentOS on an LVM-partitioned SSD. |
Sorry @MaartenMol96, without the ability to reproduce it, I'm afraid I can't help. My recommendation would be to poke at the code a bit. Have a look at Outside of that, I'm afraid I'm tapped. |
This may sound like a crazy idea, but maybe increasing the inotify watchers helps? @SummittDweller also had a strange issue on CentOS. He has documented how to increase the inotify watchers. The way he does it is only a temporary change, which makes it ideal for testing: #370 (comment) I would be interested in the result if you can try that. |
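Before raising the limit, it can help to see what it currently is. A small sketch, assuming a Linux host where the limit is exposed at the standard procfs path `/proc/sys/fs/inotify/max_user_watches`; the function name is made up for this example.

```python
from pathlib import Path
from typing import Optional

# Standard Linux procfs location for the per-user inotify watch limit.
DEFAULT_LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def read_inotify_watch_limit(limit_file: Path = DEFAULT_LIMIT_FILE) -> Optional[int]:
    """Return the inotify watch limit, or None if it cannot be read."""
    try:
        return int(limit_file.read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux host) or content not a number.
        return None


if __name__ == "__main__":
    limit = read_inotify_watch_limit()
    print("max_user_watches:", limit if limit is not None else "unavailable")
```

Running this both on the Docker host and inside the container shows whether both see the same value.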
@Strubbl This fix doesn't help me: I already applied it months ago because of an issue with the Plex Media Server container, which needed a lot more watchers. So I upped it to the max back then. @danielquinn I'll try some things when I have the time; otherwise I'll have to look for an alternative. |
@danielquinn I have tried a few things now, but one thing is really strange: when I reboot my Docker host, the Paperless container starts automatically (not strange, of course), but then it consumes all large files without a problem! What's going on here? |
That makes me think that this has something to do with inotify. On start, Paperless will take a first look at the directory and consume what's in it, and then wait for instructions from inotify. Therefore, if it's consuming the files on-start, then inotify is likely the culprit. Try starting the consumer with |
I'm also trying to run paperless in a Docker container (under RancherOS) and I'm having the same problem that new files aren't consumed but when I restart the container all documents are consumed without any problem. Where would I put the |
Sorry, I guess I wasn't clear. |
@danielquinn I get the following error from the docker container: Following in my docker-compose.yml:
The following runs just normal (except for the problems described above):
|
The inotify portion is an option to the command. You can't put them in quotes like that. Try this:
|
Alright I think we can close this issue. This command is the fix! Now all documents are accepted and consumed directly. |
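For anyone wondering what consuming without inotify amounts to: the thread describes a scan of the directory on start, after which the consumer waits for inotify events. A polling consumer simply repeats that scan on a timer. The sketch below illustrates the idea only; it is not Paperless's implementation, and `find_new_files`, `poll_directory`, `handle`, and `max_rounds` are hypothetical names.

```python
import time
from pathlib import Path


def find_new_files(directory: Path, seen: set[Path]) -> list[Path]:
    """Return files in `directory` not yet in `seen`, and record them."""
    new_files = sorted(
        entry for entry in directory.iterdir()
        if entry.is_file() and entry not in seen
    )
    seen.update(new_files)
    return new_files


def poll_directory(directory: Path, handle, interval: float = 10.0, max_rounds=None) -> None:
    """Scan `directory` every `interval` seconds and pass new files to `handle`.

    `max_rounds` limits the number of scans (None means loop forever).
    """
    seen: set[Path] = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for path in find_new_files(directory, seen):
            handle(path)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval)
```

Polling does not depend on the filesystem delivering change events, which would explain why it works where inotify stays silent. A real consumer would also need to make sure a file has finished being written before processing it.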
Would you mind writing something for the troubleshooting section before you close this? |
I couldn't reproduce the bug in CentOS or RancherOS (the OSes of @MaartenMol96 and @Blackhawk92). As it looks like an inotify bug it would be valuable not just for paperless users to find out the cause. |
I had the same issue when testing paperless locally using docker within a vagrant test environment. It would consume items when first spinning the docker container up. It also consumed items when manually attaching to the container and invoking The reason in this specific case seems to be that virtualbox environments do not trigger inotify updates. I am not sure if @MaartenMol96 @Blackhawk92 are also running paperless under virtual machines, but if so that could be the central cause. It seems to specifically not trigger updates for any shared volumes - but may happen for the whole virtual machine file-system, I am not sure. For those who still need inotify updates in this case, another docker application seems to provide a band-aid for the issue - but really it just does the same as not using inotify. And of course, supplying |
I experience the same problem as described in the original post. PDF files larger than around 120KB are not processed by the consumer, while smaller files are processed immediately. On a restart of the consumer, all files in the consumer directory are processed, no matter what size. I am running on Ubuntu 18.04 (not in Docker), on a Google Cloud Compute instance. Could it be something about the virtualization as suggested by the virtualbox bug report? @MaartenMol : Are you in a virtualized environment? |
This issue is pretty old with no updates, so I'm going to close it for now. If there are new developments, feel free to re-open. |
Hello,
Just testing this nice application, however... I have imported a number of PDF files, but the consumer can't detect larger files. For example, I downloaded the Paperless documentation PDF and it wasn't detected.
I am trying to scan a few documents with Android apps, but I can't import them. Is that because of the file size?