Locking removed in session::read #27

Closed
melvyn-sopacua opened this issue Oct 13, 2013 · 8 comments

Comments

@melvyn-sopacua

Hi Colin,

We've removed the locking in session::read for a client of ours and are currently running a battery of tests.
The rationale is this:

  • All single operations in Redis are atomic.
  • In the current code, the locking mechanism itself is what introduces the need for transactional operations: the expiry information must be updated together with the lock info (pid and lock reference accounting) in a single transaction (see the sketch after this list).
  • In the current code, the session data is read regardless of whether the lock was obtained, so the read is already not an isolated transaction.
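
A minimal phpredis sketch of the bookkeeping the second point refers to; the key and field names are illustrative, not necessarily the module's actual schema:

    <?php
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    $sessionKey = 'sess_example';
    // Taking the lock touches several fields plus the TTL, so it has to be
    // wrapped in MULTI/EXEC to stay atomic:
    $redis->multi()
        ->hIncrBy($sessionKey, 'lock', 1)       // lock reference accounting
        ->hSet($sessionKey, 'pid', getmypid())  // record the lock holder
        ->expire($sessionKey, 3600)             // refresh the session TTL
        ->exec();
    // A plain read, by contrast, is a single, inherently atomic command:
    $data = $redis->hGet($sessionKey, 'data');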

By removing the locking in session::read we are no longer seeing slowdowns there (reads previously ran upwards of 40 seconds on the site in question).
So far we've seen no drawbacks. A test is currently running which does the following:

  1. Log in the user.
  2. Verify the login status on every subsequent pageview.
  3. Add one item to the cart.
  4. Verify the cart content (number of items and item price) on every subsequent pageview.
  5. Add another item to the cart.
  6. Verify that the cart now contains two items with the expected price total.
  7. Run a loop of 20 iterations in which categories and products are visited and the cart contents verified.
  8. Delete the items from the cart.
  9. Go back to step 2.

This site is load balanced across 2 web servers, with a shared Redis session store, a shared load-balanced Redis cache, and a single shared database. 50 users with different login IDs are fired at the site. The testing software is Neoload 4.1.
So far we have seen no errors, and each virtual user has now completed 2 full iterations.
Browsing the session store in phpRedisAdmin, we also see the sessions with a TTL of 3599 or 3600 seconds (the session lifetime is set to 1 hour).

There's also a lot of custom code in this Magento installation, but the sessions still correctly identify cart ownership. I can also see things like the last visited URL change correctly over time in the session data.

I think our tests cover the breakage that could plausibly happen, but I'm open to other tests we should run to verify session accuracy. Do you have any concerns about removing the locking code, and how should we test for them?

@colinmollenhour
Owner

Hi Melvyn. I agree that the merit of locking is arguable given the performance/complexity downsides. However, if one wants a Redis session store without locking, they could just use the handler built into phpredis, which I know some people are already doing. So although I hadn't strongly considered supporting a non-locking version, I am not opposed to it (as long as it isn't the default).

The only two benefits I can see over the built-in handler are support without the phpredis module (no idea how many people go this route) and compression of the session data. I think the data compression is very valuable, since it lets you store more sessions, or store them for longer, in the same amount of memory, which is a limited resource on a large site.
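
For reference, a minimal sketch of switching to the built-in phpredis handler (host and port are placeholders):

    <?php
    // Requires the phpredis extension; no locking and no compression.
    ini_set('session.save_handler', 'redis');
    ini_set('session.save_path', 'tcp://127.0.0.1:6379');
    session_start();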

Regarding the use case for locking: the typical basic ecommerce site would work fine without it for the most part. But if a user were to, for example, add two products to the cart in quick succession (or your site was really slow ;) then they might see two success messages on one page and none on the other, or only one success message, or possibly worse (only one product gets added?). Another example is the order success page: if another page load occurred while an order was processing, it could clobber the "last order id", and the success page would then throw an error and redirect, so the user would never know whether their order completed. I haven't tested all of these scenarios and have instead just opted for the safe/proper route. It is also quite possible that some other feature or third-party module really depends on stable session data, and you wouldn't know it until it started causing problems. For what it's worth, the "files" session handler does use locking (with no break/timeout mechanism), while the other handlers do not support locking at all.

In short: if I receive a well-formatted pull request that adds a 'use_locking' option that defaults to '1' (for all of the people out there expecting to get locking), then I will merge it. :)

@melvyn-sopacua
Author

Hi Colin,

The session data can become quite big because of site-specific data stored in it, so compression is needed. For adding two products to the cart, I can see how that's an issue with an Ajax-based cart add. As for page loads during order completion: since the write locking is still in place, I don't think a conflict would occur, but it's an interesting test.
I'll work on a properly formatted patch and perhaps revisit my initial idea about rwlocks, so we only have to do accounting for the write locks.
I'm still not sure why we're seeing queues between 6 and 32 deep (upping the concurrency to 32 got rid of the 503s), but due to time constraints I'm currently not able to dive into that.

@colinmollenhour
Owner

The checkout success scenario:

  • User starts load of slow page, session is read
  • User completes checkout, session is read again, last order id is written to session
  • Slow page load completes, session is written, clobbers last order id
  • Checkout success page is rendered, last order id doesn't exist

It is a very unlikely race condition, but a race condition with negative consequences nevertheless.
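
A minimal sketch of that lost update with two plain, non-locking clients; the key and field names are made up for illustration:

    <?php
    // Two connections stand in for the two concurrent requests.
    $slow = new Redis();     $slow->connect('127.0.0.1', 6379);
    $checkout = new Redis(); $checkout->connect('127.0.0.1', 6379);
    // Both requests read the same session blob.
    $slowData     = json_decode($slow->get('sess_example'), true) ?: array();
    $checkoutData = json_decode($checkout->get('sess_example'), true) ?: array();
    // Checkout finishes first and records the order id...
    $checkoutData['last_order_id'] = 1001;
    $checkout->set('sess_example', json_encode($checkoutData));
    // ...then the slow page writes its stale copy, clobbering last_order_id.
    $slow->set('sess_example', json_encode($slowData));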

The adminhtml probably has more such scenarios, with all of its "form_data", grids, and Ajax features. Perhaps the configuration setting could be separate for each "area"?

@colinmollenhour
Owner

Added a config option and also a constant to allow disabling locking in 0a03d8c.
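
For anyone landing here later, a sketch of the two mechanisms; the names below are taken from later versions of the module, so check 0a03d8c for the exact ones:

    <?php
    // Either define the constant early in the bootstrap (e.g. index.php)...
    define('CM_REDISSESSION_LOCKING_ENABLED', false);
    // ...or set <disable_locking>1</disable_locking> under the
    // <redis_session> node in app/etc/local.xml.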

@melvyn-sopacua
Author

Today we replicated the issue at another site. It was caused by a security scanner, which fires requests in parallel, triggering the locking. The thing to solve here is the lock wait timeout: when two requests are fired simultaneously, this wait is what causes the problem:

        if ($tries == 1 /* TODO - $tries % 10 == 0 ? */) {
            $detectZombies = TRUE;
            // TODO: allow configuration of sleep period?
            usleep(1500000); // 1.5 seconds
        }

If this were a spinlock that only bailed out to a timed wait after a number of spins, the problem would be solved, as the read timings below suggest:

2014-07-11T06:14:16+00:00 DEBUG (7): web1.example.org|22261: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00224 seconds
2014-07-11T06:14:28+00:00 DEBUG (7): web1.example.org|22261: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00102 seconds
2014-07-11T06:14:34+00:00 DEBUG (7): web1.example.org|22104: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00097 seconds
2014-07-11T06:14:49+00:00 DEBUG (7): web1.example.org|22261: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00107 seconds
2014-07-11T06:15:01+00:00 DEBUG (7): web1.example.org|21734: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00112 seconds
2014-07-11T06:15:15+00:00 DEBUG (7): web1.example.org|22104: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00493 seconds
2014-07-11T06:15:17+00:00 DEBUG (7): web1.example.org|22104: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00133 seconds
2014-07-11T06:15:23+00:00 DEBUG (7): web1.example.org|22104: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00105 seconds
2014-07-11T06:15:43+00:00 DEBUG (7): web1.example.org|22104: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00108 seconds
2014-07-11T06:15:49+00:00 DEBUG (7): web1.example.org|21734: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00435 seconds
2014-07-11T06:15:53+00:00 DEBUG (7): web1.example.org|22104: Data read for ID sess_p9i3vvr885k8548pf5ghvu3ke1 after 0.00533 seconds

These are all reads without locking. They complete in a few milliseconds, yet the first waiter sleeps for 1.5 seconds. That's enough time for additional requests to come in, so the waits pile up.
While this situation was artificially created by a security scanner, the same thing happens when asynchronous requests come in, for example with Varnish/ESI blocks or Ajax requests for product data on a category page.
I see two solutions:

  1. A configurable sleep time, as the TODO in the code already suggests (if you've already done that: this is the Redis session module bundled with Magento 1.8.1, and we'll merge in the upstream code).
  2. Spinlocks with a configurable maximum number of spins that fall back to the sleeping code on failure (sketched below).

Or even both. What do you think?
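
A rough sketch of option 2; tryAcquireLock() is a hypothetical stand-in for the module's INCR-based lock attempt, and all of the numbers would be configurable:

    <?php
    // Hypothetical stand-in for the real lock attempt.
    function tryAcquireLock() { return false; }

    $acquired = false;
    // Spin first: cheap, short retries while the lock is likely held briefly.
    for ($spin = 0; $spin < 100 && ! $acquired; $spin++) {
        $acquired = tryAcquireLock();
        if ( ! $acquired) {
            usleep(1000); // 1 ms between spins
        }
    }
    // Only fall back to the existing long-sleep wait loop if spinning failed.
    for ($tries = 0; $tries < 10 && ! $acquired; $tries++) {
        usleep(500000); // 0.5 seconds
        $acquired = tryAcquireLock();
    }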

@colinmollenhour
Owner

You have to be careful not to ignore other situations when trying to optimize one. IIRC, the motivation behind the 1.5-second timeout is:

  1. Resources, namely the number of requests/sec that the Redis server/network can handle, are obviously limited, so from this perspective longer wait times mean fewer polling requests (in theory, based on the assumption that the lock is held by a long-running process).
  2. If there is a session locking issue, it is due to either a long-running process (e.g. rendering a cart with many shipping quotes) or multiple parallel requests. In eCommerce it seems that the former is more likely.
  3. On the first spin, wait a little longer than on subsequent spins to allow time for all of the other waiters to increment the lock. That is, if 'waiting' is 5, then by the time the newest process checks again, 'lock' should be old_lock + waiting, or else something is wrong.

All of that said, this was already addressed (the sleep time is now 0.51 seconds), so you should be using the GitHub version. :)

@colinmollenhour
Owner

I just found a bug: there should have been a "continue" after the sleep, so it was actually sleeping twice. Fixing this should further help your issue with high concurrency, so give the latest version a go.
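
Roughly, the control flow in question (names and values are illustrative, not the module's actual code):

    <?php
    // Hypothetical stand-in for the real lock attempt.
    function tryAcquireLock() { return false; }

    for ($tries = 1; $tries <= 10; $tries++) {
        if (tryAcquireLock()) {
            break;
        }
        if ($tries == 1) {
            usleep(510000); // longer first wait so other waiters can register
            continue;       // the missing 'continue': without it, execution
                            // fell through and slept a second time below
        }
        usleep(100000); // regular retry interval
    }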

@melvyn-sopacua
Author

That may explain why I always saw a minimum delay of 3 seconds. Good catch.
