Consumer is not detecting files from 120kb? #378

MaartenMol · 2018-07-11T21:14:44Z

Hello,

Just testing this nice application, however... I have imported a number of PDF files, but the consumer can't detect larger files. For example, I downloaded the paperless documentation PDF and it wasn't detected.

I am trying to scan a few documents with Android apps, but I can't import them because of the file size?

erikarvstedt · 2018-07-11T23:51:49Z

Does the consumer output anything at all when you add a large doc to the consumption dir?
Please check the standard output of the consumer process or the logs at http://{YOUR_PAPERLESS_SERVER}/admin/documents/log/.

If you feed the consumer first a big file and then a small file (both similarly named), will the consumer only consume the small file?

Does the consumer use inotify?
You can check that by looking for a line in the consumer logs that says
Starting document consumer at {CONSUMPTION_DIR} with inotify

MaartenMol · 2018-07-12T07:19:29Z

@erikarvstedt No, it does not. When I compress a PDF that is too big until it is small enough, the consumer detects it perfectly. I am running Paperless and the Paperless consumer on Docker. And yes, it uses inotify!

ddddavidmartin · 2018-07-12T07:38:09Z

This seems really odd. On my native setup (no docker) my biggest pdf is 20mb and it was consumed with no issues.

MaartenMol · 2018-07-12T09:37:42Z

@ddddavidmartin Everything out of the box or have you tweaked some settings?

danielquinn · 2018-07-12T09:41:06Z

@MaartenMol96 If you can share the log of the consumer as @erikarvstedt suggested, that'll give us more to work with, as would sharing the file that can't be consumed so we can try it with our test environments.

Something else worth looking into is whether the document in question has already been consumed. Paperless checks every candidate for consumption: if the file has the same hash as one already in the db, it will explicitly skip it (this is mentioned in the log output). If that's the case, it may seem that Paperless isn't detecting it, but in fact it's ignoring it as redundant.
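
To illustrate the kind of check described above, here is a minimal sketch of hash-based duplicate detection. It is not the Paperless implementation: the function names, the use of MD5, and the in-memory set standing in for the database are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return a hex digest of the file's bytes (MD5 is an assumption here)."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_checksums: set) -> bool:
    """True if a file with identical content has already been consumed."""
    return file_checksum(path) in known_checksums
```

A check like this depends only on file content, so renaming a document that was already consumed would still cause it to be skipped.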

MaartenMol · 2018-07-12T09:45:03Z

@danielquinn There is no logging for the file because the consumer doesn't do anything with it. The files that do get consumed are logged properly and work fine. An example of a file that doesn't work: Paperless Latest Documentation PDF

It's also not a duplicate in any form.

danielquinn · 2018-07-12T09:54:57Z

Well that's annoying. I can't reproduce it. Here's what I did and it worked fine:

Deleted my test db: rm data/db.sqlite3
Set up the db: cd src && ./manage.py migrate
Fetched your test PDF into the consumption directory: cd /tmp/paperless/consume && wget https://media.readthedocs.org/pdf/paperless/latest/paperless.pdf
Ran the consumer:

$ ./manage.py document_consumer
Starting document consumer at /tmp/paperless/consume
Parsers available: RasterisedDocumentParser
Consuming /tmp/paperless/consume/paperless.pdf
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/paperless/paperless-dfxzmu2l/convert-%04d.png' @ warning/png.c/MagickPNGWarningHandler/1744.
Skipping OCR, using Text from PDF
Detected document date 2018-07-08T00:00:00+00:00 based on string Jul 08, 2018
Completed
Document 20180708000000: paperless consumption finished

If your server isn't outputting anything (even the "duplicate found, skipping" message), then I'm guessing that the folder you're writing to isn't the same folder Paperless is monitoring for changes. Is it possible that your Docker setup isn't mounting the right directory, or it's somehow un-mounting at some point? Without knowing much about your setup, it's hard to know what's going on.

Paperless can definitely handle that file though.

MaartenMol · 2018-07-12T10:30:05Z

@danielquinn Why are other files processed perfectly then? Same directory, no unmounting is happening. I am using the example docker compose. Running Docker CE on CentOS on an LVM partitioned SSD.

danielquinn · 2018-07-12T11:28:09Z

Sorry @MaartenMol96, without the ability to reproduce it, I'm afraid I can't help. My recommendation would be to poke at the code a bit. Have a look at consumer.py and add some print() lines to help debug your particular situation. Maybe it's that the inotify events aren't triggering? To test for that, simply restarting the consumer should do the job. Maybe the file(s) are somehow not visible to the consumer? To test for that, add a print() line around here to see what's going on.

Outside of that, I'm afraid I'm tapped.

Strubbl · 2018-07-12T12:39:13Z

This may sound like a crazy idea, but maybe increasing the inotify watchers helps?

@SummittDweller also had a strange issue on CentOS. He has documented how to increase the inotify watchers. The way he does it is only a temporary change, which makes it ideal for testing: #370 (comment)

I would be interested in the result if you can try that.
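
Before changing anything, it may help to see what the limits currently are. The sketch below reads them from /proc/sys/fs/inotify, which is where Linux exposes these settings; the helper name is made up for this example, and it returns None on systems where the file cannot be read.

```python
from pathlib import Path
from typing import Optional

# Linux exposes the inotify limits as plain-text files in this directory.
INOTIFY_DIR = Path("/proc/sys/fs/inotify")


def read_inotify_limit(name: str = "max_user_watches") -> Optional[int]:
    """Return the current kernel limit, or None if it cannot be read."""
    try:
        return int((INOTIFY_DIR / name).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for key in ("max_user_watches", "max_user_instances"):
        print(key, read_inotify_limit(key))
```

These are kernel-wide settings, so the values that matter for a containerised consumer are the ones on the Docker host.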

MaartenMol · 2018-07-12T23:01:40Z

@Strubbl This fix doesn't help in my case: I already applied it months ago because of an issue with the Plex Media Server container, which needed a lot more watchers. So I upped it to the max back then....

@danielquinn I'll try some things when I have the time, else I have to look for an alternative.

MaartenMol · 2018-07-13T10:46:35Z

@danielquinn I have tried a few things now, but one thing is really strange: when I reboot my docker host, the paperless container starts automatically (not strange, of course) but then it consumes all large files without a problem!

What's going on here?

danielquinn · 2018-07-13T11:05:18Z

That makes me think that this has something to do with inotify. On start, Paperless will take a first look at the directory and consume what's in it, and then wait for instructions from inotify. Therefore, if it's consuming the files on-start, then inotify is likely the culprit.

Try starting the consumer with --no-inotify and see what happens?
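
For intuition, a consumer that does not rely on inotify could simply re-scan the directory on a timer, which makes the start-up scan and the steady state the same operation. The following is a rough sketch under that assumption, not the actual Paperless code; scan_once, poll, and the interval and cycle parameters are hypothetical names chosen for the example.

```python
import time
from pathlib import Path


def scan_once(directory: Path, seen: set) -> list:
    """Return files that have not been seen before, and remember them."""
    fresh = sorted(
        p for p in directory.iterdir() if p.is_file() and p not in seen
    )
    seen.update(fresh)
    return fresh


def poll(directory: Path, interval: float = 10.0, cycles: int = 3) -> None:
    """Re-scan the directory a fixed number of times instead of waiting for events."""
    seen = set()
    for _ in range(cycles):
        for path in scan_once(directory, seen):
            # A real consumer would hand the file to its parser here.
            print("would consume", path)
        time.sleep(interval)
```

Polling trades a short delay for not depending on the filesystem delivering change events, which is the part that appears to be failing here.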

alx-k · 2018-07-15T11:42:33Z

I'm also trying to run paperless in a Docker container (under RancherOS) and I'm having the same problem that new files aren't consumed but when I restart the container all documents are consumed without any problem. Where would I put the --no-inotify when running in a container? Into scripts/paperless-consumer.service?

danielquinn · 2018-07-15T11:51:47Z

Sorry, I guess I wasn't clear. --no-inotify should be added to how you execute the consumer, so ./manage.py document_consumer --no-inotify. This will likely require modifying your docker-compose.yml file, but I'm on mobile, so I can't be sure where it's invoked in the Docker case.

MaartenMol · 2018-07-15T17:15:25Z

@danielquinn I get the following error from the docker container: Unknown command: 'document_consumer --no-inotify'

Following in my docker-compose.yml:

command: ["document_consumer --no-inotify"]

The following runs just normal (except for the problems described above):

command: ["document_consumer"]

danielquinn · 2018-07-15T18:09:30Z

The inotify portion is an option to the command. You can't put them in quotes like that. Try this:

["document_consumer", "--no-inotify"]
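
The reason is that in the array form each element is handed over as exactly one argument, with no shell-style word splitting in between. A small Python illustration of the difference (shlex is used here only to mimic how a shell would split the string):

```python
import shlex

# One list element: the whole string, space included, is a single argument.
wrong = ["document_consumer --no-inotify"]

# Two list elements: the command name and its option are separate arguments.
right = ["document_consumer", "--no-inotify"]

# Splitting the single string the way a shell would yields the correct form.
print(shlex.split(wrong[0]) == right)
```

This matches the error above: the whole string, option included, was looked up as the name of a command.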

MaartenMol · 2018-07-15T20:56:55Z

Alright I think we can close this issue. This command is the fix! Now all documents are accepted and consumed directly.

danielquinn · 2018-07-15T21:54:16Z

Would you mind writing something for the troubleshooting section before you close this?

erikarvstedt · 2018-07-16T07:15:12Z

I couldn't reproduce the bug in CentOS or RancherOS (the OSes of @MaartenMol96 and @Blackhawk92).

As it looks like an inotify bug it would be valuable not just for paperless users to find out the cause.
If you're affected – could you please share the steps to reproduce the bug or a virtual machine? I'd love to look into this.

marty-oehme · 2018-08-01T14:13:13Z

I had the same issue when testing paperless locally using docker within a vagrant test environment.

It would consume items when first spinning the docker container up. It also consumed items when manually attaching to the container and invoking document_consumer again. But in each case, it only did so once and never seemed to run again.

The reason in this specific case seems to be that virtualbox environments do not trigger inotify updates. I am not sure if @MaartenMol96 @Blackhawk92 are also running paperless under virtual machines, but if so that could be the central cause. It seems to specifically not trigger updates for any shared volumes - but may happen for the whole virtual machine file-system, I am not sure.

For those who still need inotify updates in this case, another docker application seems to provide a band-aid for the issue - but really it just does the same as not using inotify. And of course, supplying --no-inotify also provided a quick fix for the issue.

rhclayto · 2018-12-31T02:11:00Z

I experience the same problem as described in the original post. PDF files larger than around 120KB are not processed by the consumer, while smaller files are processed immediately. On a restart of the consumer, all files in the consumer directory are processed, no matter what size. I am running on Ubuntu 18.04 (not in Docker), on a Google Cloud Compute instance. Could it be something about the virtualization as suggested by the virtualbox bug report? @MaartenMol : Are you in a virtualized environment?

danielquinn · 2019-05-07T07:26:13Z

This issue is pretty old with no updates, so I'm going to close it for now. If there are new developments, feel free to re-open.

erikarvstedt mentioned this issue Aug 23, 2018

Suggestion: Enable document_consumer command to be run from webinterface #388

Closed

danielquinn closed this as completed May 7, 2019
