Skip to content

Connection Storm

Felipe Megale edited this page May 18, 2026 · 85 revisions

Introduction

Connection Storms may happen in any relational database. Here I will be more specific about Db2.

The intention of this post is to add information about:

  • how to identify the the symptoms of a Connection Storm.
  • how to make Db2 more resilient when facing Connection Storms.

What is a Connection Storm?

There are many articles about Connection Storm in the internet. I am referencing one in Reference section for more information about the Behavior.

But, a brief about what Connection Storm is:

The short answer

It is when in fact the database suddenly gets a lot of connection in a very short time. Let's say it went from 300 connections to 3000 in 5 minutes or even few seconds.

A more complete answer

Connection Storm may be summarized in few phases:

1- Something gets slow in some layer that answers the application. It could be the Application Server, the network, the database server. The slowdown could even include the Operation System from the Servers involved. It could be bad queries with slow response time. Anyway, something gets slow.

2- The slowdown in some of the mentioned layers make the Application Server to open more connections to attend new request. The Application Server may also terminate the connection that was slow (by its an internal timeout) and open a new connection with the same request that was slow before.

3- About the terminated connections: the first connection that was terminated by the Application Server is gone in the point of view of the Application Server. But in the Database Server that connection should be processing yet, maybe finishing the authentication, maybe rolling back the transaction. And those processes gets CPU. But Application Server keeps opening more connections than terminating those. That causes the Connection Pool from the Application Sever to get full and Database Server to get very busy.

4- That behavior becomes a loop. Actually, much more like a snow ball where things gets worst each second:

  • the App. Server opens a connection to the database
  • the slowdown makes the request to not answer in time to the App. Server
  • App. Server opens more connections and times out others (but opens more than terminates)
  • Database gets busy processing new connections plus finishing connections that the Applications Server timed out.
  • The new connection isn't answered in time to the App. Server again. The App. Server then opens even more new ones.
  • And this Opening and Terminating Connections (especially Opening) increases very very fast like snow ball

What happens in the database server?

Actually, let's talk about a more specific database: Db2 Server.

When a connection comes to the Db2 database a daemon called db2ckpwd is called.

That is the daemon responsible to talk to the Operation System to authenticate the user/password. That causes CPU use.

Let's add, for example, a bad query/UOW (Unit of Work) consuming CPU by the App. Server request.

Add a rollback to that bad UOW caused by the App. Server terminating the connection. And that consumes CPU as well.

Now, add the App. Server sending several New Connections, making the daemons (that are limited - default: 3) to consume CPU, sending the same bad transaction (consuming CPU), terminating those transactions (consuming CPU), and then sending even more connections and making all of that again.

Possible Results/Symptoms in the Database Server

  • entire Server hang
  • very very slowdown in entire server
  • CPU topped 100%
  • Load Average more than 100%

How to diagnose a Connection Storm?

After your server came back, that means.. If you got a Connection Storm your Database Server had a very very slowdown or hang. So, you will start a RCA (Root Cause Analysis) with the Server working (out of the Connection Storm).

So, after your server came back you need to check the Connection history. Maybe from DMC (now, Db2 Genius) graphs or maybe any monitoring you have.

A graph like this is a clear Connection Storm:

image


If you don't have such monitoring (you should), you can go to db2diag.log.

Remember App. Server may terminate the connections? That logs to the db2diag.log like this:

2012-10-07-03.31.44.543149+480 I560509E538 LEVEL: Error
PID : 2160 TID : 140574002767616PROC : db2sysc
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-4510 APPID: XX.XX.XX.XX.2286.121006192634
AUTHID : DB2INST1
EDUID : 570 EDUNAME: db2agent (SAMPLE)
FUNCTION: DB2 UDB, common communication, sqlcctcptest, probe:11
MESSAGE : Detected client termination
DATA #1 : Hexdump, 2 bytes
0x00007FD9EF7F6748 : 3600

Also, it may log messages with function AgentBreathingPoint, like this:

2021-06-07-16.35.03.960357-180 I130974058A552       LEVEL: Error
PID     : 44389                TID : 4327502506256  PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : XXXXXXX
APPHDL  : 0-28469              APPID: XX.XX.XX.XX.46374.210627173418
AUTHID  : XXXXXX               HOSTNAME: XXXXXXXX
EDUID   : 25871                EDUNAME: db2agent (XXXXXXX) 0
FUNCTION: DB2 UDB, base sys utilities, sqeAgent::AgentBreathingPoint, probe:10
CALLED  : DB2 UDB, common communication, sqlcctest
RETCODE : ZRC=0x00000036=54

So you can grep your db2diag.log to count those messages in a short time (some minutes). Lots of those messages in short time indicates a Connection Storm.

Example about to grep:

db2diag -t 2026-03-10-03.20:2026-03-10-03.28  | grep  -i "AgentBreathingPoint" |wc -l  
db2diag -t 2026-03-10-03.20:2026-03-10-03.28 | grep -i "Detected client termination" |wc -l  

Another possible message is this one, indicating the connections increased (see the last row "New Total").

2025-05-09-15.43.31.566822-180 I439866318A457       LEVEL: Warning
PID     : 7304                 TID : 4395865467072  PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000
HOSTNAME: XXXXXXXX
EDUID   : 18                   EDUNAME: db2tcpcm 0
FUNCTION: DB2 UDB, fast comm manager, sqkfDynamicResourceMgr::AdjustResources, probe:175
MESSAGE : FCM Automatic/Dynamic Resource Adjustment (Session): 895 successfully
          allocated. New total is 3580

But remember that those are possible messages. Maybe you will see only one type (only "Detected Client Termination", or only "AgentBreathingPoint" or just the sqkfDynamicResourceMgr::AdjustResources") of those messages in your db2diag.log.

How to Prevent a Connection Storm

This is a really complex question.

Websphere Application Server has a mechanism called Surge Protection that is controlled by two properties, "Surge Threshold" and "Surge Creation Interval" (see the References section for more information).

But I never saw that configured by myself and I am not a Websphere Application Server specialist. But infrastructure of configured DataSources in the Websphere Application Server may or may not allow that protection to be configured. Talk to your Application Server team.

Anyway, that Surge Protection or some configuration like tuning Connection Pool in the Application Server would prevent a Connection Storm.

What about the Database Server?

Best practices performance (that includes the application consuming the database) will help to prevent the Connection Storm.

Remember that Something must goes slow to trigger a Connection Storm. If WAS (Websphere Application Server) is good, database is good but the application, then a storm may happen. If those are good but the Network, again, storm may happen... You got it..

How to make the Db2 Database resilient to the Connection Storm?

For a while what I've been seeing implemented was:

  • to increase DB2_NUM_CKPW_DAEMONS.

The default is 3. Usually increased to 10, 20 or even 30. That makes more db2ckpwd daemons to be created. It makes sense to increase it once the documentation (see Reference section) says:

" Increasing the setting for DB2_NUM_CKPW_DAEMONS can decrease the time required to establish a database connection, but this is only effective in scenarios where many connections are being made simultaneously and where authentication is expensive. "

  • limit MAXAPPLS and MAX_CONNECTIONS database configuration

That is to limit the number of applications and connections. Makes sense... But, think that you limited it to 2.000 connections and you saw your database working thing with that quantity. But now, think all those 2.000 coming to database in seconds or few minutes. It's different if the connections were opened in hours.

  • Use WLM (Workload Manager)

Configure, for example, the following in WLM:

• CONCURRENTDBCOORDACTIVITES Controls the number of concurrent activities that can run on the database at a time

• UOWTOTALTIME Specifies the maximum amount of time that a unit of work may spend in the DB2 engine. Minimum time unit – 10 secs (9.7FP6+), 1 min for earlier versions

• ACTIVITYTOTALTIME Specifies the maximum amount of time the data server can spend on an activity. Granularity – 10 secs (9.7FP5+), 1 min for earlier versions

All of those makes sense...

But I saw Db2 Servers getting very very slow with CPU 100% and Load Average more than 100% during a Connection Storm, preventing DBAs to run commands in the server.

What really helped?

The only Connection Storm event I saw where the Db2 Server got resilient, that means.. A Connection Storm started and the Db2 Server didn't get topped CPU for much time was after enabling User Authentication Cache (see documentation about in the Reference sections).

Note* : Authentication Cache is available since Db2 v11.5.3.0

With the Cache enabled there isn't a need to increase the db2ckpwd daemons (DB2_NUM_CKPW_DAEMONS) (not for a first action at least) once those will not be used when the authentication is cached.

So CPU is alleviated for authentication during the storm. The authentication seems to be worst in such event with no cached users.

The Cache is enabled with AUTHN_CACHE_USERS and AUTHN_CACHE_DURATION (db cfg).

It is a online change. Let's say you would cache 1000 users for 24 hours. It will be:

db2 update db cfg using AUTHN_CACHE_USERS 1000  IMMEDIATE
db2 update db cfg using AUTHN_CACHE_DURATION 1440 IMMEDIATE

With that configuration I saw a Db2 Server during a Connection Storm working fine.

The graphs below from a real production environment with Authentication Cache enabled shows the Total Connections increasing when the Authentication Time was decreasing (AVG_CONNECT_REQUEST_TIME is milliseconds):

image

That graph is from a Production system - Real Word, with Db2 v11.5.9.0 Linux.

Now, a test in a test environment

Db2 v12.1.3.0 Linux 4 java application opening 1000 connections each (total: 4.000 connections)
The host was doing nothing else.
The java application was build to open connections, only.

NO Cache

The output below is from sar.
We see Idle CPU column in the very right side.
CPU gets topped:

image

Average time each application spent to get their 1.000 connections: 58.1 seconds
Max time an application spent to get its 1.000 connections: 58.8 seconds

image

Cache Enabled

We see Idle CPU. CPU gets topped too. It is for shorter time, around 24% of the time of the first test:

image

Average time each application spent to get their 1.000 connections: 10.2 seconds
Max time an application spent to get its 1.000 connections: 10.9 seconds

image

Comparing

image

How to monitor Authentication/Connections?

You can monitor the Authentication Cache with the following queries:

  • Cache Hit Ratio:
SELECT 
    authn_cache_lookups, 
    authn_cache_hits,
    CASE 
        WHEN authn_cache_lookups > 0 
        THEN decimal((CAST(authn_cache_hits AS FLOAT) / authn_cache_lookups) * 100,9,2) 
        ELSE 0 
    END AS cache_hit_ratio
FROM TABLE(MON_GET_DATABASE(-2));

Output example with Cache Enabled:

image

Whit this query you can see the benefits of the cache. You can get some outputs of it before and after enabling the cache to compare. - Check Connect Request Time, Authentication Time, Authentication Processing Time and Authentication Waits.

WITH
CONN_LIST AS (
    SELECT
        T1.CONNECTION_START_TIME,
        T1.UOW_START_TIME,
        T1.UOW_STOP_TIME,
        T1.SESSION_AUTH_ID                                                              AS USERNAME,
        T1.CLIENT_HOSTNAME,
        T1.CLIENT_IPADDR,
        T1.CLIENT_APPLNAME,
        DECIMAL(T1.TOTAL_CONNECT_REQUEST_TIME        / 1000., 14, 3)                   AS TOTAL_CONNECT_REQUEST_TIME,
        DECIMAL(T1.TOTAL_CONNECT_AUTHENTICATION_TIME / 1000., 14, 3)                   AS TOTAL_CONNECT_AUTHENTICATION_TIME,
        DECIMAL(T1.TOTAL_CONNECT_AUTHENTICATION_PROC_TIME / 1000., 14, 3)                   AS TOTAL_CONNECT_AUTHENTICATION_PROC_TIME,
        DECIMAL((T1.TOTAL_CONNECT_AUTHENTICATION_TIME
               - T1.TOTAL_CONNECT_AUTHENTICATION_PROC_TIME) / 1000., 14, 3)            AS AUTH_WAIT_S
    FROM TABLE(MON_GET_CONNECTION(CAST(NULL AS BIGINT), -2)) AS T1
)
SELECT
    SUBSTR(CHAR(CONNECTION_START_TIME), 1, 26)  AS CONN_START,
    SUBSTR(CHAR(UOW_START_TIME),        1, 26)  AS UOW_START,
    SUBSTR(CHAR(UOW_STOP_TIME),         1, 26)  AS UOW_STOP,
    SUBSTR(USERNAME,                    1, 20)  AS USERNAME,
    SUBSTR(CLIENT_HOSTNAME,             1, 30)  AS HOSTNAME,
    SUBSTR(CLIENT_IPADDR,               1, 15)  AS IP,
    SUBSTR(CLIENT_APPLNAME,             1, 30)  AS APPLNAME,
    TOTAL_CONNECT_REQUEST_TIME                  AS TCONN_REQ_TIME,
    TOTAL_CONNECT_AUTHENTICATION_TIME           AS TCONN_AUTH_TIME,
    TOTAL_CONNECT_AUTHENTICATION_PROC_TIME      AS TCONN_AUTH_PROC_TIME,
    AUTH_WAIT_S
FROM CONN_LIST
WHERE 1 = 1
--AND USERNAME NOT IN ('usr1', 'usr2')
--AND AUTH_WAIT_S > 0
--ORDER BY AUTH_WAIT_S DESC
FETCH FIRST 6000 ROWS ONLY
WITH UR;

Output example with Cache Enabled (you can open the screenshot in a new tab for better view):
output with "ORDER by TCONN_REQ_TIME DESC"

image


Output example with NO Cache (you can open the screenshot in a new tab for better view):

output with "ORDER by TCONN_REQ_TIME DESC"

image

Conclusion

Connection Storm may happen and it will impact the environment anyway, but there are actions DBA may take to make the Database more resilient during this kind of event.

Your team (DBAs, Application Developers, Application Server Administrators, Networks Administrators, Operation System Administrators) should work together for Performance Best Practices being in use.

Once there are so many layers involved, DBA should try to make the database more resilient to Connection Storm. That will avoid is most cases the system to hang and allow DBAs to take actions with application/application server team to reestablish the system.

A good (and easy) way to try to make database more resilient to Connection Storms is enabling the Authentication Cache.

It is a Gold Bullet? There are several configurations and environments but it should help.

The RCA for the Connection Storm should be investigated to prevent it to happen again in that way again.


References:

Beware of Connection Storm
https://www.ibm.com/support/pages/beware-connection-storm

"Detected client termination" in db2diag.log file.
https://www.ibm.com/support/pages/detected-client-termination-db2diaglog-file

Authentication and group cache
https://www.ibm.com/docs/en/db2/11.5.x?topic=details-authentication-group-cache

authn_cache_users - Authentication cache users configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=dcp-authn-cache-users-authentication-cache-users-configuration-parameter

authn_cache_duration - Authentication cache duration configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=dcp-authn-cache-duration-authentication-cache-duration-configuration-parameter

DB2_NUM_CKPW_DAEMONS
https://www.ibm.com/docs/en/db2/11.5.x?topic=variables-miscellaneous

MON_GET_CONNECTION table function - Get connection metrics
https://www.ibm.com/docs/en/db2/11.5.x?topic=routines-mon-get-connection-get-connection-metrics

max_connections - Maximum number of client connections configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=parameters-max-connections-maximum-number-client-connections

maxappls - Maximum number of active applications configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=parameters-maxappls-maximum-number-active-applications

Connection pool settings
https://www.ibm.com/docs/en/was-nd/8.5.5?topic=applications-connection-pool-settings

Connection pool advanced settings - Surge Protection
https://www.ibm.com/docs/en/was-nd/8.5.5?topic=applications-connection-pool-advanced-settings

Clone this wiki locally