scan-mailbox task looping on first 50 items (#551)
Comments
Also, the output for a task should probably have a …
Hi @bjeanes, thanks for reporting! I'm looking into it! I suspect that the exclusion filter is not taken into account when going to the next batch. And yeah, the whole job/queue view is still in its earliest version… I do want to improve that. Some CSS can already help a little; I'll see that I add it right away. It is currently not designed for large output, so there will be some deeper changes in the future.
What kind of settings did you use (subject filter? are mails moved or deleted? etc.)? Thanks!
Anything I didn't mention is blank.
Thank you very much! The glob … I think part of the problem is that you currently need to move the "done" mails out of the way. But the thing is that the subject filter is applied before they are moved or deleted, so this doesn't help in your case :( Going to fix that. But I think you still need to define a target folder to move processed mails to. It is currently implemented by searching again and again, so as not to hold onto a connection for too long (some mail servers are picky, but the defaults may be too small?). It expects not to find the previous mails again. However, you can set this limit yourself in the config file (see …).
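A minimal sketch of that re-search loop, in Python for illustration (the `Folder` class and its `search`/`move_away` methods are hypothetical stand-ins for the real IMAP/JavaMail calls, and the limits are made up):

```python
# A stand-in, in-memory "folder"; the real task talks IMAP via JavaMail.
class Folder:
    def __init__(self, mails):
        self.mails = list(mails)

    def search(self, limit):
        # Re-running the same search returns the first `limit` mails
        # still present in the folder.
        return list(self.mails[:limit])

    def move_away(self, mail):
        self.mails.remove(mail)


def scan_in_batches(folder, process, batch_size=50, max_mails=200):
    """Search repeatedly in fixed-size batches; stop when the folder has
    no matches left or the configurable limit is reached. The loop only
    terminates if processed mails are moved out of the way."""
    handled = 0
    while handled < max_mails:
        batch = folder.search(limit=batch_size)
        if not batch:
            break
        for mail in batch:
            process(mail)
            folder.move_away(mail)  # without this, the same batch reappears
            handled += 1
            if handled >= max_mails:
                break
    return handled
```

If `move_away` is skipped, `search` keeps returning the same first batch, which is exactly the looping behaviour this issue describes.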
Gotcha... You might need to make the target folder required in that case. Unfortunately, that's also kind of a deal breaker, as this is my personal inbox, not just a dummy receptacle for documents. I don't want it moving/re-organising things at all. My expectation as a new user was that this would work by using send dates as a type of cursor.
This may be a misunderstanding of the capabilities of IMAP on my part, but it was my understanding that it was designed to constrain results in such a way, even if it doesn't have official "pagination" per se.
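The date-as-cursor idea could be sketched like this (a hypothetical in-memory model; a real implementation would issue an IMAP `SEARCH SINCE`, keeping in mind that `SINCE` is day-granular and not implemented uniformly across servers):

```python
from datetime import date

def mails_since(mailbox, cursor):
    """Return mails received strictly after `cursor`, plus the new cursor.

    `mailbox` is a list of (received_date, subject) tuples standing in
    for a real IMAP folder; no mails are moved or deleted."""
    fresh = [m for m in mailbox if m[0] > cursor]
    # Advance the cursor to the newest date seen, so the next run
    # only picks up mails that arrived afterwards.
    new_cursor = max((m[0] for m in fresh), default=cursor)
    return fresh, new_cursor
```

Run once, persist `new_cursor`, and the next scheduled run skips everything already imported without touching the mailbox.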
Ack!
Perhaps I can use GMail filters to match on my search terms and apply a label, configure Docspell to search the "folder" for that label, and essentially remove the label after a mail is processed. This could work, but would be highly GMail-specific, and I don't plan to stay with GMail if I can help it.
Yes, I see your concern. The date filter is certainly an option. IMAP supports this in the protocol afaik, but not all IMAP servers implement it. Those servers just send all mails and you have to filter in memory… I guess (or hope) this is rare; it could simply be an option that users decide on. It complicates things because the task then needs to remember something between runs (the state is not externalized). I will create a separate ticket to find a solution for better bulk-importing.

Another workaround for now would be to use real sync tools like mbsync or offlineimap and then upload the files via ….

You currently have the option to delete the mail or to move it away; or to let the "received since" filter and the timer expression work together, e.g. scan mails every 6 hours and search for received > now-6hours… That's why it's not required, but I do see that this is quite inconvenient… The idea was more to periodically fetch the newest mails, not so much bulk-importing.
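The "timer plus received-since" combination amounts to deriving the search bound from the schedule. A sketch (the `overlap_minutes` parameter is my own addition, not something the source describes, to guard against mails arriving while a run is in flight; duplicates would have to be filtered downstream):

```python
from datetime import datetime, timedelta

def received_since_bound(now, period_hours=6, overlap_minutes=0):
    """For a task scheduled every `period_hours`, search for mails
    received after now - period, optionally padded by a small overlap
    so nothing falls between two runs."""
    return now - timedelta(hours=period_hours, minutes=overlap_minutes)
```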
I've noticed, based on the Elm and Scala use, that you're of the FP persuasion. Thinking in an FP way… one option is to have the scan job do only its own batch, and then enqueue the next batch with a parameter. That way, you don't need state. Each scheduled task always starts from the beginning of the mailbox. The "Scan Mailbox" settings page already has a …. For my use case, I'd do a single run with ….
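The "enqueue the next batch as a new job" pattern could look like this (a toy queue standing in for the real job executor; `submit` and all other names are hypothetical):

```python
from collections import deque

def scan_batch(queue, mailbox, offset, submit, batch_size=50):
    """Process one batch, then enqueue a follow-up job carrying the next
    offset. The cursor travels in the job parameter, so the executor
    itself stays stateless."""
    batch = mailbox[offset:offset + batch_size]
    for mail in batch:
        submit(mail)                 # hand the mail to processing
    if len(batch) == batch_size:     # a full batch: more may remain
        queue.append(offset + batch_size)

# Drive the chain of jobs to completion:
mailbox = list(range(120))
seen = []
queue = deque([0])
jobs = 0
while queue:
    scan_batch(queue, mailbox, queue.popleft(), seen.append)
    jobs += 1
```

Each job is short, so the executor can interleave other work between batches, and a crash only loses the batch in flight, not the whole scan.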
Yes, thanks, it works like this already, as you said. I can either emit new jobs with the next batch, or re-search in a loop (it is done like this currently). I can still think of problems: it may occur that I don't get an ordered list of mails back, such that I don't know for sure whether the last mail is the one with the "newest" date. And even when I'm submitting new jobs or re-searching, the time will come when the job gets submitted anew and starts over, since it is designed to be a periodic job. Then it will re-read all mails from the beginning. So I was thinking of providing a separate task for bulk-importing mails as a one-time job instead?
That seems useful.
Yeah, until/unless you do something like a specific one-off job as you suggest, I think it's still workable, as I said, by blanking the … field.

I'd emit new jobs for each new batch instead of looping, because for VERY large inboxes looping ties up the job executor for a long time, and a shutdown of the executor would lose the state. Lots of smaller jobs are more flexible for execution and scheduling.
Yeah, that will be a bit annoying. Ideally, if you are getting, say, 50 things back, you can check whether they are sorted. If not, assume you can't trust dates as a cursor and perhaps raise an error. Or fall back to a more naive import.
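That sortedness check is cheap; a minimal sketch:

```python
def dates_usable_as_cursor(received_dates):
    """Dates can only serve as a cursor if the server returned the batch
    in non-decreasing order; otherwise fall back to a naive import (or
    raise an error, as suggested above)."""
    return all(a <= b for a, b in zip(received_dates, received_dates[1:]))
```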
Yes, definitely! I would do the bulk-import a bit differently, exactly as you said. The existing job is meant to be restartable and called periodically. It currently "loops" over batches but stops after a certain number of mails have been submitted to processing. The next periodic run can then start from there.
Yes, it could work like this, but it has the problems mentioned before. If it starts new jobs, it can easily interfere with the periodic run. You would need to regularly monitor the processing to see whether the "whole" job is done or not, because if not, it is unsafe to enable the same periodic job. Then it might mean that "run once" works a bit differently (submits jobs) compared to "periodic" (doesn't submit jobs). That would be confusing, I think…

On the other hand, if it does everything in one run, it can easily block the executor for a long time, meaning that other jobs can't run on lower-end hardware (if there is just one joex thread). Submitting many smaller jobs still means depending on some stored criterion (whether it is FP or not doesn't matter much imho). It must be ensured that mails are retrieved in some order, so it might be unsafe to give the user too many filtering options (when thinking of running periodically). In the current situation it is okay, because the "central" mailbox itself is synchronized. But this is more of a feeling right now; I might be wrong here and have to think a bit more :). In my view, "import all" and things like "import the newest mails daily" are very different scenarios.
When skipping mails due to a filter, it must still enter the post-handling step. Otherwise it will be seen again on next run. Issue: #551
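The fix described in that commit message can be illustrated as follows (hypothetical names, not Docspell's actual code):

```python
def handle_mail(mail, matches_filter, submit, post_handle):
    """Mails excluded by the subject filter must still be post-handled
    (moved or deleted); otherwise the next search finds them again and
    the task loops over the same first batch (issue #551)."""
    if matches_filter(mail):
        submit(mail)
    post_handle(mail)  # runs even when the filter excluded the mail
```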
I closed this because it should now work when moving mails away or when choosing a large batch size via the config. I created #557 for a better bulk-import.
Job output makes this most clear. As you can see, while it states
Searching next 50 mails
it always reads the first 50 again and again.