Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Page.get_absolute_url() is too expensive due to the involved of home_pk_cache (.is_home() check) #1623

Closed
BinaryKhaos opened this issue Feb 1, 2013 · 15 comments

Comments

@BinaryKhaos
Copy link

Since I have discussed this problem and possible solutions for two days on the irc channel, I thought it was time to make it into a proper issue for tracking.

example explaining the problem

Imagine a page which has several gallery-like areas that contain images for example which are mostly links to some cms pages. The models involved have a PageField. In the template, each linked image needs to call get_absolute_url().

What happens now is, in the associated Page.get_absolute_url(), a check is performed if that Page is a home page (to cut its path). Unfortunately the way this is done, is rather expensive: It checks for the topmost root node in the database (the root node with the smallest tree_id) and caches that result internally for subsequent lookups. This is done for every single Page (meaning for every PageField link).

On my very small test frontpage (and my client will have quite a bit more on it): The query count is 35. Commenting the .is_home() check brings the sql query count down to 17. The query counts grows with O(n), meaning every page link adds further queries to the list which is rather bad.

possible solutions

  1. Replace django-mptt with something more tailored to the needs of the cms but this is more a long-term goal I guess. ;-)
  2. Cache the pks per thread with thread locals. This could be done by introducing a custom Middleware that takes care of the initial setup (init) and invalidation after a request has been processed. The cache could then be used by Page or Permissions, thus moving the current thread local part out of Permissions.
  3. Cache the pks through django.core.cache across process boundaries. This would involve making the caching optional and instructing the user that activating the cache is only safe if he is using a fast and inter-process caching mechanism like memcache. The default cache is per process and thus not suitable.
  4. Simplify the marking and finding of home pages. For example, instead of relying on mptt, the Title.path could be set to '' before saving or introduce a additional boolean attribute. Thus, a simple test for the path or the new attribute would suffice.

IMHO, (4) is the solution which would cause the most headache. Great care would have to be taken that all cases where the home page changes, all db objects get updated. For example (all within the same site and public): Page1.tree_id < Page2.tree_id. Thus, Page1 is the home page. Now Page1 gets deleted, thus Page2 needs to be updated.

Solution (3) has the greatest potential in terms of saving queries and performance gains. For this to work, every time a Page is saved, the cache not only needs to be invalidated (easy) but also updated with the newest home pages. This could lead to races between processes / threads because we can no longer rely on a db transaction. This would need to be solved properly.

Solution (2) would also greatly benefit performance when there are multiple Page instances involved while requesting their absolute urls. It also has the benefit that across all Page instances, a consistent data set is used. Currently, every Page instance has its own cache which could in theory lead to a situation where two Pages think they are the home page. Again, said cache needs to be invalidated as well when the request is over (done in the Middleware) or when a page is moved/deleted/added which would be done in the Page itself.

So the purpose of this issue is a RFC with the goal to find a suitable solution to improve the current situation and hopefully get it into 2.4 (or even 2.3.6).

@digi604
Copy link
Contributor

digi604 commented Feb 2, 2013

So i think we will need to make at least one query per request to determine which page is home. Is it to get it from cache or from the DB. We could make sure that home is written to thread_local... but when do we write it to there?

@ojii
Copy link
Contributor

ojii commented Feb 2, 2013

I wish we could go with option 1, but unfortunately, this didn't find a majority, see https://groups.google.com/forum/?fromgroups=#!topic/django-cms-developers/S3jc69mug7A

I'm not a fan of using thread locals for this, as it makes things messy. They make testing a lot harder, debugging a lot harder and reading the code a lot harder.

I'm always a proponent to fixing stuff properly, instead of patching workarounds together. So this problem should be addressed on page-save, not on page-read. Then a page (after selection) should already know whether it is home or not.

@BinaryKhaos
Copy link
Author

I agree with Jonas but the problem with doing this in the page-save is, we find ourselves in a race and could, if not carefully crafted, end up with the wrong or duplicate home page(s).

IMHO, the following should work:

  1. save the model through MPTT by invoking the base save() like usual
  2. get a queryset that has the model's db row as well as the current home page row locked (QuerySet.select_for_update()), so no other threads / processes can modify it and block
  3. check if the model has become the new home page and change both rows accordingly and commit the transaction

(2) + (3) are the critical section which is nicely protected by the "db lock". The performance impact (if any at all) should be minimal since not much is being done in the critical section, only two rows are locked and this only happens on saves.

For this to work, we have have to add a new boolean to the Page model (which defaults to false) and is not touched when the model is initially saved (thus preserving its original home page status which can only be updated in the critical section). Also we would need to craft a data migration that properly sets the new home page for each site and draft/public state. Also, since select_for_update() was added with Django 1.4, we could either increase the Django requirement to 1.4 w/ the 2.4 release and only implement this for 2.4 or detect the availability during runtime and use either the old pk_cache or new method. Or do it with raw sql, naturally. Personally, I am all for increasing the Django requirement but that's just me.

Something similar is required for a Page delete/(move/copy?) naturally.

With this, no extra queries are needed per Page and we can drop the whole pk cache altogether (yay!). That solution would nicely decrease the query count all over the place and is the best so far imho.

@BinaryKhaos
Copy link
Author

Off-Topic: One thing I forgot: Personally I really would like to raise the question again if, in the long-run, MPTT should be replaced by something that is tailored to the needs of Django CMS and very efficient at it. Without having looked at the Mezzanine codebase, I know they pride themselves for their efficient solution of exactly this. But again, this is best suited for the mailing list and also nothing that could be done safely for 2.3 or 2.4, imho.

@ojii
Copy link
Contributor

ojii commented Feb 2, 2013

Why would this require select_for_update? Can't we lock the table (with transactions or something)? (I'm not a database export!)

On your OT: Why replace it? I'm still in favor of "vendoring" mptt into the cms (under the cms namespace) and then gradually alter it to fit our needs. MPTT (the algorithm) is still very suitable for our needs in my opinion.

@BinaryKhaos
Copy link
Author

Honestly, I have a bad feeling (tm) relying on transactions for this. A transaction could fail with some RDBMS. And I have a hard time thinking about some flow of statements that will ensure that when the transaction goes through, we will in all cases have a consistent state (one hp that is also the correct one). With transactions, things still happen in parallel, they are just applied at a later stage.

The Django ORM unfortunately has no support to lock a table explicitly. If you wanted to do something that drastic, you would have to resort to raw sql and as far as I know those statements are not 100% compatible amongst the different systems. But I am no expert as well, even though I had several semesters of fun with Oracle and database theory at university some years back. :)

With select_for_update, we are already ruling out sqlite because there is no support for this as far as I know. But on the other hand, nobody should really even consider using sqlite for production or even development. So this is, in my opinion, not really a problem, as long as it is properly communicated to the user.

@ojii
Copy link
Contributor

ojii commented Feb 2, 2013

I've not looked into this too much, but couldn't we just post-save set the new is_home field to something using a (rather complex) update statement? Page.objects.filter(our complex is-home filter).update(is_home=True) (and then somehow un-set is_home=True on the other pages)?

@BinaryKhaos
Copy link
Author

This smells like trouble. Imagine you have two transactions running in parallel, both insert pages and both qualify to be the new home page. You will end up with two records both marked as home page (even though you both update the old home page, which does not really matter in this case).

The trouble is, both transactions cannot see what the other one is doing (naturally) and all data they select is what is already committed in the database. So once you check for the smallest tree_id (+ additional checks), you can only do this against what is currently there. That data won't magically change because during the cause of your transaction, some other transaction has been committed which would change your picture entirely. The same also goes for sub-selects and so on...

Constructing a single update statement that does not use sub-selects to query for the smallest tree_id and satisfies all constraints, is (imho) not possible.

With the select_for_update things are a lot better but not perfect, though. You have to be careful here too, especially if there is no home page yet (e.g. no pages exist yet). Now you imagine you have two parallel transactions running, both only seeing themselves as the new home page. Since both should end up with the same tree_id, the save will fail and django-mptt will try again if I remember the code correctly... but just saying, this is a special case where there are no rows to lock which can act as a nice synchronization point.

@BinaryKhaos
Copy link
Author

Hm... thinking about it, if we make the the new field a Field.unique, both the model validation and the database will check for the integrity of this. This will raise a nice integrity exception which we could catch and handle if a second home page already exists (race condition vs another thread/process).

@BinaryKhaos
Copy link
Author

So, while I was doing some reading up on some specifics to figure out how this could be properly and safely implemented, I looked around in django-mptt's source to figure out its exact behaviour and what I saw quite surprised me: tree_id is not unique and some parts are rather optimistically implemented (e.g. race conditions) which saves some error handling at the cost of (in the worst case) corrupted structures or whatever.

For example, someone here complained that he got duplicated root nodes every once in a while:
http://stackoverflow.com/questions/9894199/can-mysql-constrain-based-on-a-null-value-combined-with-an-id

Please keep in mind that I have not studied the source for hours but at first sight it appears to me that django-mptt is based on the assumption that the tree_id is always unique if you look at comparisons done with it.

Generally, and please correct me if I am wrong and take this with a huge grain of salt, I am getting the impression that as concurrency grows, so does the likelihood of ending up with a corrupted database (and I define everything as corrupted that does not fall in line within the [unfortunately implicitly] defined data constraints).

So... back on topic: We could add a boolean field which is null=true and add a Options.unique_together over whatever fields we need unique as a combination (e.g. draft/public status, is_home, ...). That way, we can never end up with two home pages because we get an integrity error during commit.

That brings me to the next problem: If the database throws an integrity error, at least PostgreSQL will mark the current transaction as erroneous and expects a rollback. Nothing else will work, except for a rollback. We can get around this by being more careful with the transaction by either using/combining:

https://docs.djangoproject.com/en/dev/topics/db/transactions/#savepoints
https://docs.djangoproject.com/en/dev/topics/db/transactions/#controlling-transaction-management-in-views (used as context manager though)

Important to note: Savepoints are only supported from PostgreSQL 8.2 (I believe) upwards, Oracle and MySQL (I don't know the exact versions there). If it is not supported, those methods are noops which is a pity. And for MySQL it is apparently only supported for InnoDB.

So our best bet would probably be: Let django-mptt do its job (by saving the model) and do our work in a transactional safe way and check for an integrity key error and try again in that case (only our part since the model has already been saved).

@BinaryKhaos
Copy link
Author

Just a quick note: I'm currently busy because I've a deadline to meet for a project, so if someone wants to have a go at this, please go ahead... otherwise I'll work on it after my project's current iteration (hopefully in a week or so).

I still think fixing this for 2.4 would be a good idea and not introduce such a change as a simple point release.

@SpComb
Copy link

SpComb commented Mar 23, 2013

Cache the pks through django.core.cache across process boundaries. This would involve making the caching optional and instructing the user that activating the cache is only safe if he is using a fast and inter-process caching mechanism like memcache. The default cache is per process and thus not suitable.

django-cms already relies on consistent/invalidating inter-process caches for the menu tree cache, IMO this should just be a documented configuration requirement.

Globally caching Page.home_pk by draft/public + site would probably be the easiest fix for this? You would need to invalidate the cache on page editing, but option 4 would also involve that ("caching"/denormalizing in the db model)?

@digi604
Copy link
Contributor

digi604 commented Jul 11, 2013

3.1 will drop mptt in favor of materialized trees. this should speed things up considerable.

@yakky
Copy link
Member

yakky commented Dec 12, 2013

Isn't this solved by #2323 ?

@digi604
Copy link
Contributor

digi604 commented Jan 27, 2014

this is fixed

@digi604 digi604 closed this as completed Jan 27, 2014
yakky added a commit that referenced this issue Jul 23, 2014
Performance improvement for get_absolute_url (#1623, #1671)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants