Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rrule.between is quite slow #654

Open
jelly opened this issue Mar 26, 2018 · 7 comments
Open

rrule.between is quite slow #654

jelly opened this issue Mar 26, 2018 · 7 comments

Comments

@jelly
Copy link

jelly commented Mar 26, 2018

The following snippet takes ~ almost 0.1 seconds to run, this seems to be due to dateutil iterating over all the dates from 2000 till 2020. There may be an possibility to not loop over all the days, speeding the calculation up but only for this simple case I guess. For example starting at the between startdate, but this doesn't seem easy to implement or would have to be special cased for 'easy' recurrences. Thoughts?

from datetime import datetime
from dateutil.rrule import DAILY, rrule
output = rrule(DAILY, dtstart=datetime(2000,1,1), until=datetime(2030,1,1)).between(datetime(2020,1,1), datetime(2020,2,1))

Tested on:
Dateutil 2.7.0
Python 3.6.4

Background
Our application might need to create ~ 3/4 rrule objects per request which would take up a considerable amount of time.

@pganssle
Copy link
Member

pganssle commented Mar 26, 2018

@jelly Yes, I'm very interested in performance enhancements to rrule.

There are definitely some easy ones to be had with simple rules like DAILY, YEARLY etc. One problem with implementing such enhancements is that rrule maintains a cache so that this expensive cost is paid only one time. For "simple" rules you can probably skip around, but doing so means that you don't populate the cache with each query.

I think it requires some careful thinking, unfortunately, so it's probably not a quick job.

In the meantime, I'm not sure about your specific requirements or where these rrules are coming from, but my recommendations for performance enhancement would be:

  1. Try to keep the dtstart and until close to the dates of interest, so you don't end up doing a lot of unnecessary calculations.
  2. Try to re-use rrule objects - if the rrule objects you create tend to be the same ones over and over again, you can use some sort of cache (maybe an lru_cache) to get the exact same rrule object. You'll have to pay the performance cost the first time you iterate over the rrule, but subsequent lookups will come from the cache and will be much faster.
  3. Be aware of how many "misses" occur. This one doesn't apply much in your situation, but for example, rrule(DAILY, bymonthday=15) will be much slower than rrule(MONTHLY, bymonthday=15), the reason being that the first one goes day by day and checks if the monthday is 15 - constructing ~30 datetime objects that don't fit the rule for every one that does. The second one will jump month to month and then select the 15th of the month, with 0 misses each time.

Hopefully this helps, and if you'd like to dig deeper into this and think there are some easy performance wins that don't significantly impact code clarity or the public API, please do make a PR (or just point them out in this issue so that anyone who wants to make a PR will benefit from your work).

Thanks for your report, good luck!

@srepmub
Copy link

srepmub commented Mar 26, 2018

what about using something like cython, to get these tight loops potentially much faster, for any pattern..?

@pganssle
Copy link
Member

@srepmub A compiled backend may help, but in some ways that's beside the point. Because of the way rrule is written, @jelly's code will always generate 7305 unused dates (every day between 1 Jan 2000 and 1 Jan 2020), just to generate the 30 dates it's actually intending to use, but you could make it >= 250x faster in this instance if rrule could detect ahead of time that this is one of those cases where it's easily to predict what all the dates that match the rule in that interval are, even without calculating all the dates in the rule.

@srepmub
Copy link

srepmub commented Mar 27, 2018

sure, you could just optimize a few simple cases, but I think you may be underestimating the potential speedup of using cython or a manually written extension module (which I would guess to be something like a factor of 10 to 100 in this case).

it seems natural to me for a much-used, library like dateutil to not do its core calculations in pure python..

@pganssle
Copy link
Member

@srepmub I am not underestimating it at all, and in fact have plans for a compiled backend. It's not an amazingly simple undertaking because dateutil is used incredibly widely, meaning that any binaries need to be built for all supported versions of Python on all supported platforms.

Additionally, I am hoping to write the backend in Rust rather than C, but the bindings between Rust and the CPython API don't expose the datetime API yet (I'm working on a patch for this), so there's some work to be done on that side as well.

None of this is something that you can just slap on to an existing project without some serious potential compatibility problems, and also it's irrelevant to the problem that causes rrule slowness in this case, which is that the actual algorithm used is slow and will also be slow relative to a fast algorithm if you use a compiled backend.

@srepmub
Copy link

srepmub commented Mar 27, 2018

okay, sorry, guess I misinterpreted you saying 'it may help', as english is not my first language.. :) in any case, great to hear you are working on a compiled backend!

@pgcd
Copy link

pgcd commented Mar 26, 2019

I'm not sure of the implications but wouldn't it be better to use an iterator instead of a list?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants