New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consumer improvements, inotify support #351

Merged
merged 18 commits into from May 28, 2018

Conversation

Projects
None yet
3 participants
@erikarvstedt
Contributor

erikarvstedt commented May 11, 2018

Please see the individual commit messages for in-depth explanations of all changes.

I've been using inotify in production for a few weeks now and it all works perfectly.

erikarvstedt added some commits May 11, 2018

Mail fetching: Only catch internal errors
Previously, all errors raised during mail fetching were silently caught
and printed without backtrace.

To increase robustness and ease debugging, we now fail with a backtrace
on unexpected errors.
Set default empty PAPERLESS_EMAIL_SECRET
Previously, if the user didn't set PAPERLESS_EMAIL_SECRET, Paperless
failed with an error in check_body() because self.SECRET was None.
Fix list out of bounds error in mail message parsing
Check list length before accessing the first two elements of
'dispositions'.
The list may have only a single element ('inline') or may be empty in
mailformed emails.
Consume documents in order of increasing mtime
This increases overall usability, especially for multi-page scans.
Previously, the consumption order was undefined (see os.listdir())
Use os.scandir instead of os.listdir
It's simpler and better suited for use cases introduced in later commits.
Refactor: extract fn try_consume_file
The main purpose of this change is to make the following commits more
readable.
Ensure docs have been unmodified for some time before consuming
Previously, the second mtime check for new files usually happened right
after the first one, which could have caused consumption of docs that
were still being modified.

We're now waiting for at least FILES_MIN_UNMODIFIED_DURATION (0.5s).

This also cleans up the logic by eliminating the consumer.stats attribute
and the weird double call to consumer.run().

Additionally, this a fixes memory leak in consumer.stats where paths could be
added but never removed if the corresponding files disappeared from
the consumer dir before being considered ready.
Consider mtime of ignored files, garbage-collect ignore list
1. Store the mtime of ignored files so that we can reconsider them if
they have changed.

2. Regularly reset the ignore list to files that still exist in the
consumption dir. Previously, the list could grow indefinitely.

@erikarvstedt erikarvstedt changed the title from Consumer improvments, inotify support to Consumer improvements, inotify support May 11, 2018

erikarvstedt added some commits May 11, 2018

Refactor: renamings, extract fn 'loop'
Renamings:
loop -> loop_step
delta -> next_mail_time (this variable names a point in time, not a duration)

Extracting the 'loop' fn is a preparation for later commits where a
second type of loop is added.
Consumer loop: make sleep duration dynamic
Make the sleep duration dynamic to account for the time spent in
loop_step.
This improves responsiveness when repeatedly consuming newly
arriving docs.

Use float epoch seconds (time.time()) as the time type for
MailFetcher.last_checked to allow for natural time arithmetic.
@danielquinn

This is amazing work, thank you! I've added a few things that need fixing before I can merge it, but all-in-all, it's a great addition to the project.

def make_dirs(*dirs):
for dir in dirs:
try:
os.makedirs(dir)

This comment has been minimized.

@danielquinn

danielquinn May 20, 2018

Owner

You don't need to do a try/except here. os.makedirs() has an exists_ok=True kwarg you can use instead :-) In fact, I wouldn't even bother with this function (I see you've imported it elsewhere) as it can now be boiled down to:

for d in dirs:
    os.makedirs(d, exists_ok=True)

(note that dir is a reserved word, so don't use that ;-)

This comment has been minimized.

@erikarvstedt

erikarvstedt May 20, 2018

Contributor

Yeah, with this kwarg we don't need the helper fn.

@@ -42,7 +42,7 @@ class Message(Loggable):
and n attachments, and that we don't care about the message body.
"""
SECRET = os.getenv("PAPERLESS_EMAIL_SECRET")
SECRET = os.getenv("PAPERLESS_EMAIL_SECRET", "")

This comment has been minimized.

@danielquinn

danielquinn May 20, 2018

Owner

This is probably a Bad Idea, as testing for "" in any string will always return True, negating the value of the secret in the first place. I figure you did this 'cause later down we test if self.SECRET not in self.body and None in this case will break things, but the better way to account for this would be to have a test in __init__() for whether self.SECRET is None and raise an exception instead.

This comment has been minimized.

@erikarvstedt

erikarvstedt May 20, 2018

Contributor

I actually did this on purpose.
For use cases where no email secret is needed the empty string is an elegant way to disable it.
Why should a secret be mandatory?
(It's not a secret by the way, it's more like a marker, but I don't think it's worth renaming it.)

This comment has been minimized.

@danielquinn

danielquinn May 21, 2018

Owner

That's the thing, it's meant to be a secret. It prevents other people from stuffing your server with bogus (or even potentially nefarious) documents.

For example, if I know your target email address, I can send a bunch of PDF documents to your server and do any or all of the following:

  1. Fill up your hard drive (best case)
  2. Use a 0-day exploit that triggers something scary on your server.
  3. Use an anonymous email account to fill your personal document store with child porn.

...all by just emailing you something

The secret was added to make sure that emails are from a source you know you can safely attempt to import. It doesn't have to be an unmemorable string like cae3Mai1xi?uSood, but even insisting that every email must contain parse-all-the-things will protect you against a, b, and c above.

This comment has been minimized.

@erikarvstedt

erikarvstedt May 21, 2018

Contributor

I envisioned a use case where the email address is the secret.
But yeah, most people won't set it up that way and it may be better to nudge them towards a safer configuration.
Fixed.

@@ -75,8 +75,9 @@ def __init__(self, data, group=None):
continue
dispositions = content_disposition.strip().split(";")
if not dispositions[0].lower() == "attachment" and \
"filename" not in dispositions[1].lower():
if len(dispositions) < 2 or \

This comment has been minimized.

@danielquinn

danielquinn May 20, 2018

Owner

3-line if statements make me cringe a little. I didn't much like it when it was 2-lines, but at 3 we're pushing it. Can you maybe break this up into multiple if statements just for readability?

This comment has been minimized.

@erikarvstedt

erikarvstedt May 20, 2018

Contributor

Done.

@erikarvstedt

This comment has been minimized.

Contributor

erikarvstedt commented May 20, 2018

When all looks good to you I'll apply the fixup commits and we're ready to merge!

@danielquinn danielquinn merged commit f96e7f7 into danielquinn:master May 28, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls First build on master at 62.934%
Details
@danielquinn

This comment has been minimized.

Owner

danielquinn commented May 28, 2018

So it turns out that the argument is exist_ok= and not exists_ok=. My bad, sorry. I fixed the kwarg and merged your changes. Thank you so much! I figured there was a way to do this better, but I didn't know how when I first wrote this thing.

@erikarvstedt

This comment has been minimized.

Contributor

erikarvstedt commented May 28, 2018

Those fixup commits were meant to be squashed to avoid polluting the history.
If you don't mind the git push --force, we could still fix this. Would that be ok?

@danielquinn

This comment has been minimized.

Owner

danielquinn commented May 28, 2018

Normally, I'd be cool with that, but as a lot of people are running instances by checking out master, this could confuse/upset an indeterminite number of people. I'm less concerned about history (we've got merge commits anyway) and more concerned with user experience.

You seem to be a lot more adept at git than I am though, so advice for future stuff is appreciated :-)

@ovv

This comment has been minimized.

Contributor

ovv commented May 28, 2018

My opinion is to not do it :)

@erikarvstedt

This comment has been minimized.

Contributor

erikarvstedt commented May 28, 2018

Alright, case closed. Thanks for reviewing and the helpful suggestions! ☺️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment