
Unable to write session after 85.14168 seconds, another process took the lock for ID 83ra6bnjs887hq2l7928mc2as1 #35

Closed
sofam opened this issue May 20, 2014 · 7 comments


@sofam

sofam commented May 20, 2014

Hi,

I am seeing lots of messages similar to these across my 16 web nodes during high load, and there are quite a few sessions in the Redis database (3.9 million). I am using PHP-FPM 5.4.4.

This is our configuration file:

        <session_save><![CDATA[db]]></session_save>
        <redis_session>                       <!-- All options seen here are the defaults -->
            <host>10.254.253.9</host>            <!-- Specify an absolute path if using a unix socket -->
            <port>6379</port>
            <password></password>             <!-- Specify if your Redis server requires authentication -->
            <timeout>3.5</timeout>            <!-- This is the Redis connection timeout, not the locking timeout -->
            <persistent></persistent>         <!-- Specify unique string to enable persistent connections. E.g.: sess-db0; bugs with phpredis and php-fpm are known: https://github.com/nicolasff/phpredis/issues/70 -->
            <db>1</db>                        <!-- Redis database number; protection from accidental loss is improved by using a unique DB number for sessions -->
            <compression_threshold>2048</compression_threshold>  <!-- Set to 0 to disable compression (recommended when suhosin.session.encrypt=on); known bug with strings over 64k: https://github.com/colinmollenhour/Cm_Cache_Backend_Redis/issues/18 -->
            <compression_lib>lzf</compression_lib>              <!-- gzip, lzf or snappy -->
            <log_level>4</log_level>               <!-- 0 (emergency: system is unusable), 4 (warning: additional information, recommended), 5 (notice: normal but significant condition), 6 (info: informational messages), 7 (debug: the most information for development/testing) -->
            <max_concurrency>5000</max_concurrency>                 <!-- maximum number of processes that can wait for a lock on one session; for large production clusters, set this to at least 10% of the number of PHP processes -->
            <break_after_frontend>6</break_after_frontend>       <!-- seconds to wait for a session lock in the frontend; not as critical as admin -->
            <break_after_adminhtml>30</break_after_adminhtml>
            <bot_lifetime>7200</bot_lifetime>                    <!-- Bots get shorter session lifetimes. 0 to disable -->
        </redis_session>
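The compression settings above boil down to a simple rule: a threshold of 0 disables compression entirely, and otherwise serialized session data larger than compression_threshold bytes is compressed with compression_lib. A minimal sketch of that decision, assuming the semantics described in the config comments (the function name is illustrative, not the module's actual API):

```python
def should_compress(session_data: bytes, compression_threshold: int = 2048) -> bool:
    """Decide whether serialized session data would be compressed.

    Mirrors the compression_threshold semantics from the config above:
    0 disables compression; otherwise payloads larger than the threshold
    are compressed with compression_lib (gzip, lzf or snappy).
    Illustrative sketch only, not Cm_RedisSession's real code path.
    """
    if compression_threshold <= 0:
        return False
    return len(session_data) > compression_threshold
```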
@sofam
Author

sofam commented May 20, 2014

Just an update: I upgraded to the latest version of Cm_RedisSession. If I switch one of the web nodes to memcached as its session backend, that node gets more requests through (12/sec instead of 2/sec). Redis is configured not to save to disk and not to appendfsync. I am using separate Redis instances for sessions and cache.

@colinmollenhour
Owner

Wow, that is a lot of nodes. I am really curious, how many key requests per second does your Redis server get?

Some notes:

I don't see any reason for max_concurrency to be so high. I don't remember writing the 10% guidance, but I think 5 is generally a good max. If you have some special need (e.g. loading a huge gallery of images that are generated in real time) you could raise it further. This variable mainly prevents one user from finding a slow page on your site and hammering it until it locks up your web server. Once they surpass max_concurrency they just start getting 503 errors.
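The gate described here can be sketched as: each process waiting on a session's lock bumps a counter, anything past max_concurrency is rejected immediately with a 503, and a surviving waiter gives up and breaks the lock after break_after seconds. A minimal in-memory sketch under those assumptions (the real module tracks the counter in Redis; all names here are illustrative):

```python
import time

# In-memory stand-in for the per-session waiter counter Redis would hold.
waiters: dict[str, int] = {}

def acquire_lock(session_id: str, has_lock, max_concurrency: int = 5,
                 break_after: float = 6.0, poll: float = 0.1) -> str:
    """Wait for a session lock, enforcing the max_concurrency gate.

    Returns "acquired" once the current holder releases the lock,
    "broken" after waiting break_after seconds (the stale lock is
    broken), or "503" when too many processes are already queued.
    """
    count = waiters.get(session_id, 0) + 1
    if count > max_concurrency:
        return "503"                      # too many waiters: reject immediately
    waiters[session_id] = count
    try:
        deadline = time.monotonic() + break_after
        while time.monotonic() < deadline:
            if not has_lock():            # holder finished; take over the lock
                return "acquired"
            time.sleep(poll)
        return "broken"                   # waited break_after seconds: break it
    finally:
        waiters[session_id] = waiters.get(session_id, 1) - 1
```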

Since you can switch a single node to memcached, it sounds like your cluster uses sticky sessions? If so, one possibility is to run a separate Redis process for each app node. E.g. if you have 16 app nodes with 16 GB of RAM each, run Redis on localhost on each one, dedicate 1 GB of RAM to it, and you have enough for around 11 million sessions. Of course, if one node goes down you lose 1/16th of your sessions.

Serving 3.9 million sessions with no persistence would scare the crap out of me. IMO session persistence is an absolute must, since the cost of losing all of your sessions is annoyed customers and a lot of lost sales. If you aren't using SSDs, persisting to a location on an SSD should help. There are also some code tweaks you can make to Magento to remove some data duplication, which will shrink your storage footprint and thereby reduce your time to persist to disk.

If your Redis backend is simply too large to persist to disk (without locking up your site) then you are approaching the limits of a single Redis node. Options:

  1. Shard using your load balancer's sticky sessions feature. That is, have each node connect to a separate Redis process whether those are on separate machines or one machine.
  2. Add a clustering feature using Credis_Cluster and run say 16 separate Redis processes and have them persist to disk in alternating time increments.
  3. Port the Redis session code to a database that is better at persistence, such as MongoDB.
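Options 1 and 2 both come down to deterministically mapping each session to one of N Redis processes, so that a given session's reads and writes always hit the same process and each process can persist on its own schedule. A minimal sketch of such a shard map (the host list is hypothetical; Credis_Cluster and load-balancer stickiness each implement their own variant of this):

```python
import zlib

# Hypothetical shard list: 16 separate Redis processes, one per app node.
SHARDS = [("10.254.253.%d" % (10 + i), 6379) for i in range(16)]

def shard_for(session_id: str) -> tuple[str, int]:
    """Map a session id to one Redis process, stably.

    The same id always lands on the same shard, so lock contention and
    writes for one session stay on a single process, and persistence
    (RDB/AOF) can be staggered across shards in alternating increments.
    """
    return SHARDS[zlib.crc32(session_id.encode()) % len(SHARDS)]
```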

The log messages include a lock wait count. If the lock wait count is high (e.g. number of seconds minus 1) then there is a locking contention issue. If it is low then the delay is due to something other than locking.

@parhamr
Contributor

parhamr commented May 20, 2014

I added the 10% bit to the config, based upon some rough testing. It’s guidance, but I haven’t had the chance to verify whether things go wrong when it’s set too high (I only know it’s problematic for search bots when set too low).

@sofam
Author

sofam commented May 20, 2014

Hi, first of all thanks for the quick and in-depth reply. We're actually not using sticky sessions; the reason for 16 nodes is so we could scale out horizontally instead of running fewer, larger nodes.
We're also using Varnish in front of the web nodes, with the Turpentine module for Magento handling the ESI includes etc.

During peak hours I can see upwards of 4,500 instantaneous ops/sec on Redis. Is there another metric I should be looking at?
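That number comes from the instantaneous_ops_per_sec field in Redis's INFO stats section, which is plain key:value text; sampling total_commands_processed over time gives a steadier view of throughput. A small parser for that output (the field names are real INFO fields; the sample values are invented for illustration):

```python
def parse_info(info_text: str) -> dict[str, str]:
    """Parse `redis-cli INFO` output: '#'-prefixed section headers,
    then one 'key:value' pair per line."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats

# Example INFO fragment (values invented for illustration):
sample = """# Stats
total_connections_received:812341
total_commands_processed:98123441
instantaneous_ops_per_sec:4500
"""
stats = parse_info(sample)
```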

I'll try adjusting max_concurrency down to 5 and start from there.

Have you seen anything similar to this behaviour before? The Redis node isn't even that loaded, really, and momentarily disabling saving doesn't affect performance at all, so saving to disk isn't the issue.

All other backend services seem to be operating normally as well, but when a traffic peak hits it really starts hurting, and the only lead I've found so far pointing in any direction is the Redis session handling.

@parhamr
Contributor

parhamr commented May 20, 2014

Redis readily scales to hundreds of thousands of operations per second.

Here’s an example cluster with 11 web nodes that has been observed at a combined throughput of 11,500 operations per second (typically with more than 6,000,000 sessions in Redis).

@sofam
Author

sofam commented May 20, 2014

Oh, and of course I missed the last part: the lock wait count is pretty high, typically around 4-11 and upwards of 60-90 at worst. What can I do to fix the lock contention?

@colinmollenhour
Owner

Looking at the code, it seems the timing reported differed depending on your log level, so I just pushed a fix for this. If you were below debug, it was reporting the write time as the total time since the class was instantiated, rather than just the time to write the session.

Lowering max_concurrency is definitely a good idea, IMO. Most bots (including Googlebot) don't even use cookies, so max_concurrency won't apply to them. If a user has a really fast finger and you have a really slow page load, they might be able to reach 5 concurrent lock waiters with some practice.

Given that break_after_frontend is 6 in your config, I don't see how more than ~6 seconds of the request time can be attributed to the session backend. If the lock number is 90, I would think that means there are 15 or more processes fighting for the lock. Reducing max_concurrency will result in some 503s for such a user but should reduce lock contention. If your "attempts" number is greater than 6, then there is a bug.

The good thing is that these locking issues are mostly harmless. Unless you have a high percentage of issues to users I would just make the changes I suggested and ignore the lock issue reports.
