This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Consumer is not detecting files from 120kb? #378

Closed
MaartenMol opened this issue Jul 11, 2018 · 23 comments

@MaartenMol

Hello,

Just testing this nice application, however... I have imported a number of PDF files, but the consumer can't detect larger files. For example, I downloaded the paperless documentation PDF and it didn't detect it.

I am trying to scan a few documents with Android apps, but I can't import them because of the file size?

@erikarvstedt
Contributor

Does the consumer output anything at all when you add a large doc to the consumption dir?
Please check the standard output of the consumer process or the logs at http://{YOUR_PAPERLESS_SERVER}/admin/documents/log/.

If you feed the consumer first a big file and then a small file (both similarly named), will the consumer only consume the small file?

Does the consumer use inotify?
You can check that by looking for a line in the consumer logs that says
Starting document consumer at {CONSUMPTION_DIR} with inotify

@MaartenMol
Author

MaartenMol commented Jul 12, 2018

@erikarvstedt No it does not. When I compress a too big PDF and it gets small enough it detects the PDF just perfectly. I am running Paperless and Paperless Consumer on Docker. And it uses inotify yes!

@ddddavidmartin
Contributor

This seems really odd. On my native setup (no docker) my biggest pdf is 20mb and it was consumed with no issues.

@MaartenMol
Author

@ddddavidmartin Everything out of the box or have you tweaked some settings?

@danielquinn
Collaborator

@MaartenMol96 If you can share the log of the consumer as @erikarvstedt suggested, that'll give us more to work with, as would sharing the file that can't be consumed so we can try it with our test environments.

Something else worth looking into is whether the document in question has already been consumed. Paperless checks every candidate for consumption: if the file has the same hash as one already in the db, it explicitly skips it (this is mentioned in the log output). If that's the case, it may seem that Paperless isn't detecting it, but in fact it's ignoring it as redundant.
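
To make the duplicate check concrete, here is a minimal sketch of hash-based skipping. It is not the code from Paperless itself: the use of MD5, the helper names `file_checksum` and `should_consume`, and the wording of the log line are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Set


def file_checksum(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()  # assumption: the real hash algorithm may differ
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_consume(path: Path, known_checksums: Set[str]) -> bool:
    """Hypothetical helper: skip files whose checksum is already known."""
    checksum = file_checksum(path)
    if checksum in known_checksums:
        # A skipped duplicate still produces a log line, unlike an unseen file.
        print(f"Skipping {path.name}: duplicate of an already-consumed document")
        return False
    known_checksums.add(checksum)
    return True
```

The point for debugging is that a duplicate leaves a trace in the log, so complete silence suggests the file never reached this check at all.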

@MaartenMol
Author

MaartenMol commented Jul 12, 2018

@danielquinn There is no logging of the file because it doesn't do anything with it. The files that get consumed are logged properly and work fine. The file that doesn't work, for example: Paperless Latest Documentation PDF

It's also not a duplicate in any form.

@danielquinn
Collaborator

Well that's annoying. I can't reproduce it. Here's what I did and it worked fine:

  1. Deleted my test db: rm data/db.sqlite3
  2. Set up the db: cd src && ./manage.py migrate
  3. Fetched your test PDF into the consumption directory: cd /tmp/paperless/consume && wget https://media.readthedocs.org/pdf/paperless/latest/paperless.pdf
  4. Ran the consumer:
$ ./manage.py document_consumer
Starting document consumer at /tmp/paperless/consume
Parsers available: RasterisedDocumentParser
Consuming /tmp/paperless/consume/paperless.pdf
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/paperless/paperless-dfxzmu2l/convert-%04d.png' @ warning/png.c/MagickPNGWarningHandler/1744.
Skipping OCR, using Text from PDF
Detected document date 2018-07-08T00:00:00+00:00 based on string Jul 08, 2018
Completed
Document 20180708000000: paperless consumption finished

If your server isn't outputting anything (even the "duplicate found, skipping" message), then I'm guessing that the folder you're writing to isn't the same folder Paperless is monitoring for changes. Is it possible that your Docker setup isn't mounting the right directory, or it's somehow un-mounting at some point? Without knowing much about your setup, it's hard to know what's going on.

Paperless can definitely handle that file though.

@MaartenMol
Author

@danielquinn Why are other files processed perfectly then? Same directory, no unmounting is happening. I am using the example docker compose. Running Docker CE on CentOS on an LVM-partitioned SSD.

@danielquinn
Collaborator

Sorry @MaartenMol96, without the ability to reproduce it, I'm afraid I can't help. My recommendation would be to poke at the code a bit. Have a look at consumer.py and add some print() lines to help debug your particular situation. Maybe it's that the inotify events aren't triggering? To test for that, simply restarting the consumer should do the job. Maybe the file(s) are somehow not visible to the consumer? To test for that, add a print() line around here to see what's going on.

Outside of that, I'm afraid I'm tapped.
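
For the visibility check suggested above, a standalone script run inside the same container or environment as the consumer can show which files that process can actually see. This is only a sketch: `list_visible_files` is a hypothetical helper, and `CONSUMPTION_DIR` is a placeholder environment variable, not necessarily the name Paperless uses.

```python
import os
from pathlib import Path


def list_visible_files(consumption_dir):
    """Return (name, size_in_bytes) for every regular file the process can see."""
    entries = []
    for entry in sorted(Path(consumption_dir).iterdir()):
        if entry.is_file():
            entries.append((entry.name, entry.stat().st_size))
    return entries


if __name__ == "__main__":
    # CONSUMPTION_DIR is a placeholder; point it at the directory the consumer watches.
    directory = os.environ.get("CONSUMPTION_DIR", ".")
    for name, size in list_visible_files(directory):
        print(f"{size:>12} bytes  {name}")
```

If the large file shows up here with its full size but the consumer stays silent, the mount is probably fine and the change notification is the more likely suspect.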

@Strubbl
Contributor

Strubbl commented Jul 12, 2018

This may sound like a crazy idea, but maybe increasing the inotify watchers helps?

@SummittDweller also had a strange issue on CentOS. He has documented how to increase the inotify watchers. The way he does it is only a temporary change, but that makes it ideal for testing: #370 (comment)

I would be interested in the result if you can try that.
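
Before changing anything, it can help to see what the current limit is. On Linux the inotify limits are exposed as files under `/proc/sys/fs/inotify`; the sketch below only reads them and returns `None` where they are unavailable (for example on non-Linux hosts). The function name `read_inotify_limit` is made up for this example.

```python
from pathlib import Path

# Linux exposes the inotify limits as plain-text files in this directory.
INOTIFY_DIR = Path("/proc/sys/fs/inotify")


def read_inotify_limit(name="max_user_watches", base=INOTIFY_DIR):
    """Return the integer value of an inotify limit, or None if unavailable."""
    try:
        return int((Path(base) / name).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for limit in ("max_user_watches", "max_user_instances", "max_queued_events"):
        print(limit, read_inotify_limit(limit))
```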

@MaartenMol
Author

@Strubbl This fix doesn't help me, as I already applied it months ago because of an issue with the Plex Media Server container that needed a lot more watchers. So I upped it to the max back then.

@danielquinn I'll try some things when I have the time, else I have to look for an alternative.

@MaartenMol
Author

@danielquinn I have tried a few things now, but one thing is really strange: when I reboot my docker host, the paperless container starts automatically (not strange, of course), but then it consumes all large files without a problem!

What's going on here?

@danielquinn
Collaborator

That makes me think that this has something to do with inotify. On start, Paperless will take a first look at the directory and consume what's in it, and then wait for instructions from inotify. Therefore, if it's consuming the files on-start, then inotify is likely the culprit.

Try starting the consumer with --no-inotify and see what happens?
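
To illustrate the difference, here is a rough sketch of a consumer loop that relies on periodic directory scans instead of inotify events. It is not the Paperless implementation: `scan_once`, `poll_forever`, the `handle` callback and the 10-second interval are all assumptions chosen for the example.

```python
import time
from pathlib import Path


def scan_once(directory, seen):
    """Return files in `directory` not yet in `seen`, and record them."""
    new_files = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def poll_forever(directory, handle, interval=10.0):
    """Poll `directory` every `interval` seconds, with no inotify involved."""
    seen = set()
    while True:
        # The first pass plays the role of the on-start scan; later passes
        # pick up anything added since, whether or not an event was delivered.
        for path in scan_once(directory, seen):
            handle(path)
        time.sleep(interval)
```

A loop like this would also explain why a restart picks up every pending file: the first scan does not depend on any notification. A real consumer would additionally need to make sure a file has finished being written before handling it.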

@alx-k

alx-k commented Jul 15, 2018

I'm also trying to run paperless in a Docker container (under RancherOS) and I'm having the same problem that new files aren't consumed but when I restart the container all documents are consumed without any problem. Where would I put the --no-inotify when running in a container? Into scripts/paperless-consumer.service?

@danielquinn
Collaborator

Sorry, I guess I wasn't clear. --no-inotify should be added to how you execute the consumer, so ./manage.py document_consumer --no-inotify. This will likely require modifying your docker-compose.yml file, but I'm on mobile, so I can't be sure where it's invoked in the Docker case.

@MaartenMol
Author

@danielquinn I get the following error from the docker container: Unknown command: 'document_consumer --no-inotify'

Following in my docker-compose.yml:

command: ["document_consumer --no-inotify"]

The following runs normally (except for the problems described above):

command: ["document_consumer"]

@danielquinn
Collaborator

danielquinn commented Jul 15, 2018

The inotify portion is an option to the command. You can't put them in quotes like that. Try this:

["document_consumer", "--no-inotify"]
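
The difference between the two forms can be shown with plain Python lists. In the list form of `command`, each element is passed on as one argument with no shell word splitting, so the fused string arrives as a single command name containing a space. That is consistent with the "Unknown command" error above, assuming the image's entrypoint forwards these arguments to manage.py.

```python
import shlex

# The option fused into the same string: one argument.
fused = ["document_consumer --no-inotify"]

# The option as its own list element: two arguments.
split = ["document_consumer", "--no-inotify"]

print(len(fused), len(split))  # prints "1 2"

# shlex.split applies shell-style word splitting, which the list form skips.
print(shlex.split(fused[0]) == split)  # prints "True"
```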

@MaartenMol
Author

Alright I think we can close this issue. This command is the fix! Now all documents are accepted and consumed directly.

@danielquinn
Collaborator

Would you mind writing something for the troubleshooting section before you close this?

@erikarvstedt
Contributor

I couldn't reproduce the bug in CentOS or RancherOS (the OSes of @MaartenMol96 and @Blackhawk92).

As it looks like an inotify bug it would be valuable not just for paperless users to find out the cause.
If you're affected – could you please share the steps to reproduce the bug or a virtual machine? I'd love to look into this.

@marty-oehme

I had the same issue when testing paperless locally using docker within a vagrant test environment.

It would consume items when first spinning the docker container up. It also consumed items when manually attaching to the container and invoking document_consumer again. But in each case, it only did so once and never seemed to run again.

The reason in this specific case seems to be that virtualbox environments do not trigger inotify updates. I am not sure if @MaartenMol96 @Blackhawk92 are also running paperless under virtual machines, but if so that could be the central cause. It seems to specifically not trigger updates for any shared volumes - but may happen for the whole virtual machine file-system, I am not sure.

For those who still need inotify updates in this case, another docker application seems to provide a band-aid for the issue - but really it just does the same as not using inotify. And of course, supplying --no-inotify also provided a quick fix for the issue.

@rhclayto

rhclayto commented Dec 31, 2018

I experience the same problem as described in the original post. PDF files larger than around 120KB are not processed by the consumer, while smaller files are processed immediately. On a restart of the consumer, all files in the consumer directory are processed, no matter what size. I am running on Ubuntu 18.04 (not in Docker), on a Google Cloud Compute instance. Could it be something about the virtualization as suggested by the virtualbox bug report? @MaartenMol : Are you in a virtualized environment?

@danielquinn
Collaborator

This issue is pretty old with no updates, so I'm going to close it for now. If there are new developments, feel free to re-open.
