-
Notifications
You must be signed in to change notification settings - Fork 2
Connection Storm
Connection Storms may happen in any relational database. Here I will be more specific about Db2.
The intention of this post is to add information about:
- how to identify the the symptoms of a Connection Storm.
- how to make Db2 more resilient when facing Connection Storms.
There are many articles about Connection Storm in the internet. I am referencing one in Reference section for more information about the Behavior.
But, a breath about what Connection Storm is:
It is when in fact the database suddenly gets a lot of connection in a very short time. Let's say it went from 300 connections to 3000 in 5 minutes.
Connection Storm may be summarized in few phases:
1- Something gets slow in some layer that answers the application. It could be the Application Server, the network, the database server. The slowdown could even include the Operation System from the Servers involved. It could be bad queries with slow response time. Anyway, something gets slow.
2- The slowdown in some of the mentioned layers make the Application Server to terminate the connection and open a new connection with the same request that was slow before.
3- The first connection that was terminated by the Application Server is gone in the point of view of the Application Server. But in the Database Server that connection should be processing yet, maybe finishing the authentication, maybe rolling back the transaction. And those processes gets CPU.
4- That behavior becomes a loop. Actually, much more like a snow ball where things gets worst each second:
- the App. Server opens a connection to the database
- the slow down makes the request to not answer in time to the App. Server
- App. Server terminates the connection
- App. Server opens a new connection with the same request as the first connection that was terminated
- This time it could be now processing 2 connections in the database (the one terminated and the new one just opened)
- The new connection isn't answered in time to the App. Server again. The App. Server terminates the connection and opens a new one
- And this Opening and Terminating Connections increases very very fast like snow ball
Actually, let's talk about a more specific database: Db2 Server.
When a connection comes to the Db2 database a daemon called db2ckpwd is called.
That is the daemon responsible to talk to the Operation System to authenticate the user/password. That causes CPU use.
Let's add, for example, a bad query/UOW (Unit of Work) consuming CPU by the App. Server request.
Add a rollback to that bad UOW caused by the App. Server terminating the connection. And that consumes CPU as well.
Now, add the App. Server sending several New Connections, making the daemons (that are limited - default: 3) to consume CPU, sending the same bad transaction (consuming CPU), terminating those transactions (consuming CPU), and then sending even more connections and making all of that again.
- entire Server hang
- very very slowdown in entire server
- CPU topped 100%
- Load Average more than 100%
After your server came back, that means.. If you got a Connection Storm your Database Server had a very very slowdown or hang. So, you will start a RCA (Root Cause Analysis) with the Server working (out of the Connection Storm).
So, after your server came back you need to check the Connection history. Maybe from DMC (Now, Db2 Genius) graphs or maybe any monitoring you have.
A graph like this is a clear Connection Storm:
If you don't have such monitoring (you should), you can go to db2diag.log.
Remember App. Server may terminate the connections? That logs to the db2diag.log like this:
2012-10-07-03.31.44.543149+480 I560509E538 LEVEL: Error
PID : 2160 TID : 140574002767616PROC : db2sysc
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-4510 APPID: XX.XX.XX.XX.2286.121006192634
AUTHID : DB2INST1
EDUID : 570 EDUNAME: db2agent (SAMPLE)
FUNCTION: DB2 UDB, common communication, sqlcctcptest, probe:11
MESSAGE : Detected client termination
DATA #1 : Hexdump, 2 bytes
0x00007FD9EF7F6748 : 3600
Also, it may log messages with function AgentBreathingPoint, like this:
2021-06-07-16.35.03.960357-180 I130974058A552 LEVEL: Error
PID : 44389 TID : 4327502506256 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : XXXXXXX
APPHDL : 0-28469 APPID: XX.XX.XX.XX.46374.210627173418
AUTHID : XXXXXX HOSTNAME: XXXXXXXX
EDUID : 25871 EDUNAME: db2agent (XXXXXXX) 0
FUNCTION: DB2 UDB, base sys utilities, sqeAgent::AgentBreathingPoint, probe:10
CALLED : DB2 UDB, common communication, sqlcctest
RETCODE : ZRC=0x00000036=54
So you can grep your db2diag.log to count those messages in a short time (some minutes). Lots of those messages in short time indicates a Connection Storm.
Example about to grep:
db2diag -t 2026-03-10-03.20:2026-03-10-03.28 | grep -i "AgentBreathingPoint" |wc -l
db2diag -t 2026-03-10-03.20:2026-03-10-03.28 | grep -i "Detected client termination" |wc -l
Another possible message is this one, indicating the connections increased (see the last row "New Total").
2025-05-09-15.43.31.566822-180 I439866318A457 LEVEL: Warning
PID : 7304 TID : 4395865467072 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
HOSTNAME: XXXXXXXX
EDUID : 18 EDUNAME: db2tcpcm 0
FUNCTION: DB2 UDB, fast comm manager, sqkfDynamicResourceMgr::AdjustResources, probe:175
MESSAGE : FCM Automatic/Dynamic Resource Adjustment (Session): 895 successfully
allocated. New total is 3580
But remember that those are possible messages. Maybe you will see only one type (only "Detected Client Termination", or only "AgentBreathingPoint" or just the sqkfDynamicResourceMgr::AdjustResources") of those messages in your db2diag.log.
This is a really complex question.
Websphere Application Server has a mechanism called Surge Protection that is controlled by two properties, surge threshold and surge creation interval (see the References section for more information).
But I am never saw that configured by myself and I am not a Websphere Application Server support. But infrastructure of configured DataSources in the Websphere Application may or may not allow that protection to be configured. Talk to your Application Server team.
Anyway, that Surge Protection or some configuration like tuning Connection Pool in the Application Server would prevent a Connection Storm.
Best practices performance (that includes the application consuming the database) will help to prevent the Connection Storm.
Remember that Something must goes slow to trigger a Connection Storm. If WAS (Websphere Application Server) is good, database is good but the application. Then a storm may happen. If those are good but the Network. Again, storm may happen... You got it..
For a while what I've been seeing implemented was:
- to increase DB2_NUM_CKPW_DAEMONS.
The default is 3. Usually increased to 10, 20 or even 30. That makes more db2ckpwd daemons to be created. It makes sense to increase it once the documentation (see Reference section) says:
" Increasing the setting for DB2_NUM_CKPW_DAEMONS can decrease the time required to establish a database connection, but this is only effective in scenarios where many connections are being made simultaneously and where authentication is expensive. "
- limit MAXAPPLS and MAX_CONNECTIONS database configuration
That is to limit the number of applications and connections. Makes sense...
- Use WLM (Workload Manager)
Configure, for example, the following in WLM:
• CONCURRENTDBCOORDACTIVITES Controls the number of concurrent activities that can run on the database at a time
• UOWTOTALTIME Specifies the maximum amount of time that a unit of work may spend in the DB2 engine. Minimum time unit – 10 secs (9.7FP6+), 1 min for earlier versions
• ACTIVITYTOTALTIME Specifies the maximum amount of time the data server can spend on an activity. Granularity – 10 secs (9.7FP5+), 1 min for earlier versions
All of those makes sense...
But I saw Db2 Servers getting very very slow with CPU 100% and Load Average more than 100% during a Connection Storm, preventing DBAs to run commands in the server.
The only Connection Storm event I saw where the Db2 Server got resilient, that means.. A Connection Storm started and the Db2 Server didn't get topped CPU was after enabling User Authentication Cache (see documentation about in the Reference sections).
With the Cache enabled there is (not for a first action at least) a need to increase the db2ckpwd daemons (DB2_NUM_CKPW_DAEMONS) once those will not be used when the authentication is cached.
So no CPU is used for authentication during the storm. The authentication seems to be worst in such event with no cached users.
The Cache is enabled with AUTHN_CACHE_USERS and AUTHN_CACHE_DURATION (db cfg).
It is a online change. Let's say you would cache 1000 users for 24 hours. It will be:
db2 update db cfg using AUTHN_CACHE_USERS 1000 IMMEDIATE
db2 update db cfg using AUTHN_CACHE_DURATION 1440 IMMEDIATE
With that configuration I saw a Db2 Server during a Connection Storm working fine. And the graphs shows the Total Connections increasing when the Authentication Time was decreasing (AVG_CONNECT_REQUEST_TIME is milliseconds):
You can monitor the Authentication Cache with the following queries:
- Cache Hit Ratio:
SELECT
authn_cache_lookups,
authn_cache_hits,
CASE
WHEN authn_cache_lookups > 0
THEN (CAST(authn_cache_hits AS FLOAT) / authn_cache_lookups) * 100
ELSE 0
END AS cache_hit_ratio
FROM TABLE(MON_GET_DATABASE(-2));
- Check Connect and Authentication Time
Whit this query you can see the benefits of the cache. You can get some outputs of it before and after enabling the cache to compare.
WITH
CONN_LIST AS (
SELECT
T1.CONNECTION_START_TIME,
T1.UOW_START_TIME,
T1.UOW_STOP_TIME,
T1.SESSION_AUTH_ID AS USERNAME,
T1.CLIENT_HOSTNAME,
T1.CLIENT_IPADDR,
T1.CLIENT_APPLNAME,
DECIMAL(T1.TOTAL_CONNECT_REQUEST_TIME / 1000., 14, 3) AS TOTAL_CONNECT_REQUEST_TIME,
DECIMAL(T1.TOTAL_CONNECT_AUTHENTICATION_TIME / 1000., 14, 3) AS TOTAL_CONNECT_AUTHENTICATION_TIME,
DECIMAL(T1.TOTAL_CONNECT_AUTHENTICATION_PROC_TIME / 1000., 14, 3) AS TOTAL_CONNECT_AUTHENTICATION_PROC_TIME,
DECIMAL((T1.TOTAL_CONNECT_AUTHENTICATION_TIME
- T1.TOTAL_CONNECT_AUTHENTICATION_PROC_TIME) / 1000., 14, 3) AS AUTH_WAIT_S
FROM TABLE(MON_GET_CONNECTION(CAST(NULL AS BIGINT), -2)) AS T1
)
SELECT
SUBSTR(CHAR(CONNECTION_START_TIME), 1, 26) AS CONN_START,
SUBSTR(CHAR(UOW_START_TIME), 1, 26) AS UOW_START,
SUBSTR(CHAR(UOW_STOP_TIME), 1, 26) AS UOW_STOP,
SUBSTR(USERNAME, 1, 20) AS USERNAME,
SUBSTR(CLIENT_HOSTNAME, 1, 30) AS HOSTNAME,
SUBSTR(CLIENT_IPADDR, 1, 15) AS IP,
SUBSTR(CLIENT_APPLNAME, 1, 30) AS APPLNAME,
TOTAL_CONNECT_REQUEST_TIME AS TCONN_REQ_TIME,
TOTAL_CONNECT_AUTHENTICATION_TIME AS TCONN_AUTH_TIME,
TOTAL_CONNECT_AUTHENTICATION_PROC_TIME AS TCONN_AUTH_PROC_TIME,
AUTH_WAIT_S
FROM CONN_LIST
WHERE 1 = 1
--AND USERNAME NOT IN ('DB2MON', 'DB2INST1')
--AND AUTH_WAIT_S > 0
--ORDER BY AUTH_WAIT_S DESC
FETCH FIRST 6000 ROWS ONLY
WITH UR;
Connection Storm may happen. Your team (DBAs, Application Developers, Application Server Administrators, Networks Administrators, Operation System Administrators) should work together for Performance Best Practices being in use.
Once there are so many layers involved, DBA should try to make the database more resilient to Connection Storm.
A good (and easy) way to try to reach that is enabling the Authentication Cache.
It is a Gold Bullet? There are several configurations and environments but it should help for sure.
Beware of Connection Storm
https://www.ibm.com/support/pages/beware-connection-storm
"Detected client termination" in db2diag.log file.
https://www.ibm.com/support/pages/detected-client-termination-db2diaglog-file
Authentication and group cache
https://www.ibm.com/docs/en/db2/11.5.x?topic=details-authentication-group-cache
authn_cache_users - Authentication cache users configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=dcp-authn-cache-users-authentication-cache-users-configuration-parameter
authn_cache_duration - Authentication cache duration configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=dcp-authn-cache-duration-authentication-cache-duration-configuration-parameter
DB2_NUM_CKPW_DAEMONS
https://www.ibm.com/docs/en/db2/11.5.x?topic=variables-miscellaneous
MON_GET_CONNECTION table function - Get connection metrics
https://www.ibm.com/docs/en/db2/11.5.x?topic=routines-mon-get-connection-get-connection-metrics
max_connections - Maximum number of client connections configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=parameters-max-connections-maximum-number-client-connections
maxappls - Maximum number of active applications configuration parameter
https://www.ibm.com/docs/en/db2/11.5.x?topic=parameters-maxappls-maximum-number-active-applications
Connection pool settings
https://www.ibm.com/docs/en/was-nd/8.5.5?topic=applications-connection-pool-settings
Connection pool advanced settings - Surge Protection
https://www.ibm.com/docs/en/was-nd/8.5.5?topic=applications-connection-pool-advanced-settings
Felipe Megale - www.linkedin.com/in/fsmegale