Skip to content

Connection Storm

Felipe Megale edited this page May 14, 2026 · 85 revisions

Introduction

Connection Storms may happen in any relational database. Here I will be more specific about Db2.

The intention of this post is to add information about:

  • how to identify the the symptoms of a Connection Storm.
  • how to make Db2 more resilient when facing Connection Storms.

What is a Connection Storm?

There are articles about Connection Storm in the internet. I am referencing one in Reference section for more information about the Behavior.

But, a breath about what Connection Storm is:

The short answer

It is when in fact the database suddenly gets a lot of connection in a very few time. Let's say it went from 300 connections to 3000 in 5 minutes.

A more complete answer

Connection Storm may be summarized in few phases:

1- Something gets slow in some layer that answers the application. It could be the Application Server, the network, the database server. The slowdown could even include the Operation System from the Servers involved. It could be bad queries with slow response time. Anyway, something gets slow.

2- The slowdown in some of the mentioned layers make the Application Server to terminate the connection and open a new connection with the same request that was slow before.

3- The first connection that was terminated by the Application Server is gone in the point of view of the Application Server. But in the Database Server that connection should be processing yet, maybe finishing the authentication, maybe rolling back the transaction. And those processes gets CPU.

4- That behavior becomes a loop. Actually, much more like a snow ball where things gets worst each second:

  • the App. Server opens a connection to the database
  • the slow down makes the request to not answer in time to the App. Server
  • App. Server terminates the connection
  • App. Server opens a new connection with the same request as the first connection that was terminated
  • This time it could be now processing 2 connections in the database (the one terminated and the new one just opened)
  • The new connection isn't answered in time to the App. Server again. The App. Server terminates the connection and opens a new one
  • And this Opening and Terminating Connections increases very very fast like snow ball

What happens in the database server?

Actually, let's talk about a more specific database: Db2 Server.

When a connection comes to the Db2 database a daemon called db2ckpwd is called.

That is the daemon responsible to talk to the Operation System to authenticate the user/password. That causes CPU use.

Let's add, for example, a bad query/UOW (Unit of Work) consuming CPU by the App. Server request.

Add a rollback to that bad UOW caused by the App. Server terminating the connection. And that consumes CPU as well.

Now, add the App. Server sending several New Connections, making the daemons (that are limited - default: 3) to consume CPU, sending the same bad transaction (consuming CPU), terminating those transactions (consuming CPU), and then sending even more connections and making all of that again.

Possible Results/Symptoms in the Database Server

  • entire Server hang
  • very very slowdown in entire server
  • CPU topped 100%
  • Load Average more than 100%

How to diagnose a Connection Storm?

After your server came back, that means.. If you got a Connection Storm your Database Server had a very very slowdown or hang. So, you will start a RCA (Root Cause Analysis) with the Server working (out of the Connection Storm).

So, after your server came back you need to check the Connection history. Maybe from DMC (Now, Db2 Genius) graphs or maybe any monitoring you have.

A graph like this is a clear Connection Storm:

image


If you don't have such monitoring (you should), you can go to db2diag.log.

Remember App. Server may terminate the connections? That logs to the db2diag.log like this:

2012-10-07-03.31.44.543149+480 I560509E538 LEVEL: Error
PID : 2160 TID : 140574002767616PROC : db2sysc
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-4510 APPID: XX.XX.XX.XX.2286.121006192634
AUTHID : DB2INST1
EDUID : 570 EDUNAME: db2agent (SAMPLE)
FUNCTION: DB2 UDB, common communication, sqlcctcptest, probe:11
MESSAGE : Detected client termination
DATA #1 : Hexdump, 2 bytes
0x00007FD9EF7F6748 : 3600

Also, it may log messages with function AgentBreathingPoint, like this:

2021-06-07-16.35.03.960357-180 I130974058A552       LEVEL: Error
PID     : 44389                TID : 4327502506256  PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : XXXXXXX
APPHDL  : 0-28469              APPID: XX.XX.XX.XX.46374.210627173418
AUTHID  : XXXXXX               HOSTNAME: XXXXXXXX
EDUID   : 25871                EDUNAME: db2agent (XXXXXXX) 0
FUNCTION: DB2 UDB, base sys utilities, sqeAgent::AgentBreathingPoint, probe:10
CALLED  : DB2 UDB, common communication, sqlcctest
RETCODE : ZRC=0x00000036=54

So you can grep your db2diag.log to count those messages in a short time (some minutes). Lots of those messages in short time indicates a Connection Storm.

Example about to grep:

db2diag -t 2026-03-10-03.20:2026-03-10-03.28  | grep  -i "AgentBreathingPoint" |wc -l  
db2diag -t 2026-03-10-03.20:2026-03-10-03.28 | grep -i "Detected client termination" |wc -l  

Another possible message is this one, indicating the connections increased (see the last row "New Total").

2025-05-09-15.43.31.566822-180 I439866318A457       LEVEL: Warning
PID     : 7304                 TID : 4395865467072  PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000
HOSTNAME: XXXXXXXX
EDUID   : 18                   EDUNAME: db2tcpcm 0
FUNCTION: DB2 UDB, fast comm manager, sqkfDynamicResourceMgr::AdjustResources, probe:175
MESSAGE : FCM Automatic/Dynamic Resource Adjustment (Session): 895 successfully
          allocated. New total is 3580

But remember that those are possible messages. Maybe you will see only one type (only "Detected Client Termination", or only "AgentBreathingPoint" or just the sqkfDynamicResourceMgr::AdjustResources") of those messages in your db2diag.log.

How to Prevent a Connection Storm

This is a really complex question.

Websphere Application Server has a mechanism called Surge Protection that is controlled by two properties, surge threshold and surge creation interval (see the References section for more information).

But I am never saw that configured by myself and I am not a Websphere Application Server support. But infrastructure of configured DataSources in the Websphere Application may or may not allow that protection to be configured. Talk to your Application Server team.

Anyway, that Surge Protection or some configuration like tuning Connection Pool in the Application Server would prevent a Connection Storm.

What about the Database Server?

Best practices performance (that includes the application consuming the database) will help to prevent the Connection Storm.

Remember that Something must goes slow to trigger a Connection Storm. If WAS (Websphere Application Server) is good, database is good but the application. Then a storm may happen. If those are good but the Network. Again, storm may happen... You got it..

How to make the Db2 Database resilient to the Connection Storm?

For a while what I've been seeing implemented was:

  • to increase DB2_NUM_CKPW_DAEMONS.

The default is 3. Usually increased to 10, 20 or even 30. That makes more db2ckpwd daemons to be created. It makes sense to increase it once the documentation (see Reference section) says:

" Increasing the setting for DB2_NUM_CKPW_DAEMONS can decrease the time required to establish a database connection, but this is only effective in scenarios where many connections are being made simultaneously and where authentication is expensive. "

  • limit MAXAPPLS and MAX_CONNECTIONS database configuration

That is to limit the number of applications and connections. Makes sense...

  • Use WLM (Workload Manager)

Configure, for example, the following in WLM:

• CONCURRENTDBCOORDACTIVITES Controls the number of concurrent activities that can run on the database at a time

• UOWTOTALTIME Specifies the maximum amount of time that a unit of work may spend in the DB2 engine. Minimum time unit – 10 secs (9.7FP6+), 1 min for earlier versions

• ACTIVITYTOTALTIME Specifies the maximum amount of time the data server can spend on an activity. Granularity – 10 secs (9.7FP5+), 1 min for earlier versions

All of those makes sense...

But I saw Db2 Servers getting very very slow with CPU 100% and Load Average more than 100% during a Connection Storm, preventing DBAs to run commands in the server.

What really helped?

The only Connection Storm event I saw where the Db2 Server got resilient, that means.. A Connection Storm started and the Db2 Server didn't get topped CPU was after enabling User Authentication Cache (see documentation about in the Reference sections).

With the Cache enabled there is (not for a first action at least) a need to increase the db2ckpwd daemons (DB2_NUM_CKPW_DAEMONS) once those will not be used when the authentication is cached.

So no CPU is used for authentication during the storm. The authentication seems to be worst in such event with no cached users.

The Cache is enabled with AUTHN_CACHE_USERS and AUTHN_CACHE_DURATION (db cfg).

It is a online change. Let's say you would cache 1000 users for 24 hours. It will be:

db2 update db cfg using AUTHN_CACHE_USERS 1000  IMMEDIATE
db2 update db cfg using AUTHN_CACHE_DURATION 1440 IMMEDIATE

With that configuration I saw a Db2 Server during a Connection Storm working fine. And the graphs shows the Total Connections increasing when the Authentication Time was decreasing (AVG_CONNECT_REQUEST_TIME is milliseconds):

image


You can monitor the Authentication Cache with the following queries:

  • Cache Hit Ratio:

SELECT authn_cache_lookups, authn_cache_hits, CASE WHEN authn_cache_lookups > 0 THEN (CAST(authn_cache_hits AS FLOAT) / authn_cache_lookups) * 100 ELSE 0 END AS cache_hit_ratio FROM TABLE(MON_GET_DATABASE(-2));

  • Check Connect and Authentication Time

Whit this query you can see the benefits of the cache. You can get some outputs of it before and after enabling the cache to compare.

WITH
CONN_LIST AS (
    SELECT
        T1.CONNECTION_START_TIME,
        T1.UOW_START_TIME,
        T1.UOW_STOP_TIME,
        T1.SESSION_AUTH_ID                                                              AS USERNAME,
        T1.CLIENT_HOSTNAME,
        T1.CLIENT_IPADDR,
        T1.CLIENT_APPLNAME,
        DECIMAL(T1.TOTAL_CONNECT_REQUEST_TIME        / 1000., 14, 3)                   AS TOTAL_CONNECT_REQUEST_TIME,
        DECIMAL(T1.TOTAL_CONNECT_AUTHENTICATION_TIME / 1000., 14, 3)                   AS TOTAL_CONNECT_AUTHENTICATION_TIME,
        DECIMAL(T1.TOTAL_CONNECT_AUTHENTICATION_PROC_TIME / 1000., 14, 3)                   AS TOTAL_CONNECT_AUTHENTICATION_PROC_TIME,
        DECIMAL((T1.TOTAL_CONNECT_AUTHENTICATION_TIME
               - T1.TOTAL_CONNECT_AUTHENTICATION_PROC_TIME) / 1000., 14, 3)            AS AUTH_WAIT_S
    FROM TABLE(MON_GET_CONNECTION(CAST(NULL AS BIGINT), -2)) AS T1
)
SELECT
    SUBSTR(CHAR(CONNECTION_START_TIME), 1, 26)  AS CONN_START,
    SUBSTR(CHAR(UOW_START_TIME),        1, 26)  AS UOW_START,
    SUBSTR(CHAR(UOW_STOP_TIME),         1, 26)  AS UOW_STOP,
    SUBSTR(USERNAME,                    1, 20)  AS USERNAME,
    SUBSTR(CLIENT_HOSTNAME,             1, 30)  AS HOSTNAME,
    SUBSTR(CLIENT_IPADDR,               1, 15)  AS IP,
    SUBSTR(CLIENT_APPLNAME,             1, 30)  AS APPLNAME,
    TOTAL_CONNECT_REQUEST_TIME                  AS TCONN_REQ_TIME,
    TOTAL_CONNECT_AUTHENTICATION_TIME           AS TCONN_AUTH_TIME,
    TOTAL_CONNECT_AUTHENTICATION_PROC_TIME      AS TCONN_AUTH_PROC_TIME,
    AUTH_WAIT_S
FROM CONN_LIST
WHERE 1 = 1
--AND USERNAME NOT IN ('DB2MON', 'DB2INST1')
--AND AUTH_WAIT_S > 0
--ORDER BY AUTH_WAIT_S DESC
FETCH FIRST 6000 ROWS ONLY
WITH UR;

Conclusion

Connection Storm may happen. Your team (DBAs, Application Developers, Application Server Administrators, Networks Administrators, Operation System Administrators) should work together for Performance Best Practices being in use.

Once there are so many layers involved, DBA should try to make the database more resilient to Connection Storm.

A good (and easy) way to try to reach that is enabling the Authentication Cache.

It is a Gold Bullet? There are several configurations and environments but it should help for sure.


References:

Beware of Connection Storm
https://www.ibm.com/support/pages/beware-connection-storm

"Detected client termination" in db2diag.log file.
https://www.ibm.com/support/pages/detected-client-termination-db2diaglog-file

Authentication and group cache
https://www.ibm.com/docs/en/db2/11.5.x?topic=details-authentication-group-cache

authn_cache_users - Authentication cache users configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=dcp-authn-cache-users-authentication-cache-users-configuration-parameter

authn_cache_duration - Authentication cache duration configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=dcp-authn-cache-duration-authentication-cache-duration-configuration-parameter

DB2_NUM_CKPW_DAEMONS
https://www.ibm.com/docs/en/db2/11.5.x?topic=variables-miscellaneous

MON_GET_CONNECTION table function - Get connection metrics
https://www.ibm.com/docs/en/db2/11.5.x?topic=routines-mon-get-connection-get-connection-metrics

max_connections - Maximum number of client connections configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=parameters-max-connections-maximum-number-client-connections

maxappls - Maximum number of active applications configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=parameters-maxappls-maximum-number-active-applications

Connection pool settings
https://www.ibm.com/docs/en/was-nd/8.5.5?topic=applications-connection-pool-settings

Connection pool advanced settings - Surge Protection
https://www.ibm.com/docs/en/was-nd/8.5.5?topic=applications-connection-pool-advanced-settings

Clone this wiki locally