Emit OpenTelemetry metrics #2211

roji · 2023-11-07T16:39:58Z

OpenTelemetry has become the de-facto standard for distributed, cross-platform tracing; it is being rapidly adopted both inside the .NET ecosystem and outside. OTel defines semantic conventions for how a database client emits metrics, which are aggregated, numeric counters about the state and performance of the driver (e.g. number of commands executed, idle/busy connections in the pool, etc.). these specifications are currently in experimental state, but stabilization is likely start happening soon.

Note that unlike with tracing, AFAIK there's no way to get data like this out of SqlClient at the moment. This sort of data can be invaluable in diagnosing performance problems (e.g. by inspecting real-time connection pool state) and tracking what's going on.

JRahnama · 2023-11-07T16:44:54Z

~~Is this a duplicate of #2210 ?~~

Forget my comment. I just noticed the difference.

roji · 2023-11-08T14:01:17Z

Thanks @gfoidl, corrected the link.

edwardneal · 2024-02-29T23:43:28Z

SqlClient actually already emits some metrics, but their reporting is a little spotty. On .NET Framework in Windows, every process which uses the library has a set of performance counters (listed here.) On .NET Core and .NET Standard 2.1 and above, Microsoft.Data.SqlClient.EventSource has a number of event counters (listed here.) Although the counter names are slightly different, their meanings are the same:

Performance counter	Event counter	Unit	Description
(none)	active-hard-connections	count	Actual active connections current made to servers
HardConnectsPerSecond	hard-connects	count / sec	The number of actual connections per second that are being made to servers
HardDisconnectsPerSecond	hard-disconnects	count / sec	The number of actual disconnects per second that are being made to servers
(none)	active-soft-connects	count	Active connections retrieved from the connection pool
SoftConnectsPerSecond	soft-connects	count / sec	The number of connections we get from the pool per second
SoftDisconnectsPerSecond	soft-disconnects	count / sec	The number of connections we return to the pool per second
NumberOfNonPooledConnections	number-of-non-pooled-connections	count	The number of connections that are not using connection pooling
NumberOfPooledConnections	number-of-pooled-connections	count	The number of connections that are managed by the connection pooler
NumberOfActiveConnectionPoolGroups	number-of-active-connection-pool-groups	count	The number of unique connection strings
NumberOfInactiveConnectionPoolGroups	number-of-inactive-connection-pool-groups	count	The number of unique connection strings waiting for pruning
NumberOfActiveConnectionPools	number-of-active-connection-pools	count	The number of active connection pools
NumberOfInactiveConnectionPools	number-of-inactive-connection-pools	count	The number of inactive connection pools
NumberOfActiveConnections	number-of-active-connections	count	The number of connections currently in-use
NumberOfFreeConnections	number-of-free-connections	count	The number of connections currently available for use
NumberOfStasisConnections	number-of-stasis-connections	count	The number of connections currently waiting to be made ready for use
NumberOfReclaimedConnections	number-of-reclaimed-connections	count	The number of connections we reclaim from GC'd external connections

The SoftConnectsPerSecond, SoftDisconnectsPerSecond, NumberOfActiveConnections and NumberOfFreeConnections performance counters are only created if the ConnectionPoolPerformanceCounterDetail trace switch is set to Verbose.

Having a consistent interface to increment/decrement the metrics would be helpful for code merging, but the real value would probably come from making System.Diagnostics.Metrics the primary metrics source. I think the right thing to do would be to also add an AppContext switch to control the legacy metrics sources - but I'm not sure whether it'd be better to have 6.0 disable the legacy metrics sources by default and drop them in 7.0, or to have 6.0 implement the new metrics source, 7.0 disable the legacy ones and 8.0 to drop these legacy ones.

For the current metrics collection, perhaps OpenTelemetry could listen to these event counters or performance counters?

roji · 2024-03-01T00:47:58Z

@edwardneal there's generally no specific need to drop the old metrics or introduce an app context switch for them; as long as they're not listened to, their perf overhead is negligible. So this really is about emitting OpenTelemetry metrics via System.Diagnostics.Metrics.

edwardneal · 2024-03-01T22:45:43Z

That makes sense, thanks. There's only one mandatory counter (db.client.connections.usage), which maps neatly to the existing counters of number-of-active-connections and number-of-free-connections for state attributes of "used" and "idle" respectively. If the specification allows us to add extra state attribute values, I'd add "stasis" and "reclaimed" for the number-of-stasis-connections and number-of-reclaimed-connections counters. The concept of a connection which is currently being brought back into an issuable state might make sense for the upstream specification.

Besides the mandatory, other standard metrics are:

db.client.connections.idle.max: I read this to be a configuration parameter: the maximum number of idle connections allowed, rather than the maximum number of idle connections which the pool has historically possessed. I think this'll always be the value of DbConnectionPool.MaxPoolSize.
db.client.connections.idle.min: Similar principle as above, but it'd be DbConnectionPool.MinPoolSize.
db.client.connections.max: DbConnectionPool has a maximum capacity, and any connection in its backing store can be idle or used, so I'd say that this should be the same as db.client.connections.idle.max. I'm a little unsure on whether to include this though - the .NET connection pool has minimum + maximum sizes, while I'm a bit unclear on what the semantics expect. This comment seems to suggest that clients should expect .max (or .limit) to be the sum of .usage. That won't be true in our implementation.
db.client.connections.pending_requests, db.client.connections.timeouts, db.client.connections.create_time: Not currently instrumented, but probably should be. The integration points for them exist.
db.client.connections.wait_time: As I understand it, this duration will essentially start at the same time as db.client.connections.create_time and end when SqlConnection.Open() completes
db.client.connections.use_time: We already instrument something similar to this and make it available via SqlConnection.RetrieveStatistics().

Outside the standard metrics, the rest of the existing counters should also be mapped. I'd add these as:

dotnet.db.client.connection_pools.usage, with a state attribute of "used" and "idle" and the same meaning of pool.name as db.client.connections.usage: number-of-active-connection-pools and number-of-inactive-connection-pools.
dotnet.db.client.connection_pool_groups.usage, with a state attribute of "used" and "idle": number-of-active-connection-pool-groups and number-of-inactive-connection-pool-groups
dotnet.db.client.connections.hard.usage, with a type attribute of "pooled" and "unpooled": number-of-pooled-connections and number-of-non-pooled-connections.
dotnet.db.client.connections.hard.total: active-hard-connections. This is tentative - at a glance, I think this works out as a sum of the metric above. I'm not absolutely certain though
dotnet.db.client.connection.operations.connect, with a type attribute of "hard" and "soft": hard-connects and soft-connects. This is a behaviour change: it'll contain a counter, rather than a rate
dotnet.db.client.connection.operations.disconnect, with a type attribute of "hard" and "soft": hard-disconnects and soft-disconnects. Same behaviour change as above.

Finally, there are some interesting fields in SqlConnection.RetrieveStatistics():

dotnet.db.client.connections.command_time: This comes from the ExecutionTime key. It describes the total time spent executing a possible server-side command. That server-side command would be a user SQL statement, but it could also be SqlConnection.ChangeDatabase, SqlCommand.Prepare, any kind of transaction, and getting a field's value (whether that field's value is part of a packet which has already been received or not.)
dotnet.db.client.connections.network_wait_time: This comes from the NetworkServerTime key. It describes the amount of time spent waiting for the TDS response from the remote server.
dotnet.db.client.connections.io, with a direction attribute of "receive" and "transmit": BytesReceived and BytesSent.
dotnet.db.client.connections.buffers, with a direction attribute of "receive" and "transmit": BuffersReceived and BuffersSent.
dotnet.db.client.commands.count, with a type attribute of "insert", "delete", "update", "merge" and "select": CursorOpens, IduCount and SelectCount. I've pulled these apart quite a bit, largely because I don't think "IDU" is particularly useful to clients.
dotnet.db.client.commands.rows, with the type attribute above: IduRows and SelectRows.

I started looking to see what other metrics might be available or tracked within the library for exposure, and this response ballooned. If the metrics in the OTel specification can be checked for correctness, there's at least a place to start.

roji · 2024-03-01T22:55:10Z

db.client.connections.max: [...] open-telemetry/semantic-conventions#211 (comment) seems to suggest that clients should expect .max (or .limit) to be the sum of .usage. That won't be true in our implementation.

The definition in the specs seems pretty clear and unambiguous to me: "The maximum number of open connections allowed". Max/min connections is a really standard thing across DB connection pools, I certainly implemented it as MaxPoolSize in the Npgsql driver.

In general, metrics on commands are currently missing in the specs. I went ahead and implemented some in Npgsql (code) - if implementing something similar, I'd call these out as experimental and subject to change, since I'm assuming some command-related metrics will make it in to the final specs etc.

edwardneal · 2024-03-02T00:01:55Z

I was thinking of a situation where a connection could have been allocated by the client, added to the connection pool, (to meet MinPoolSize's constraint) but not yet opened, because it's either not yet been used or it's still in the process of opening. In that situation, it would be neither idle nor used, but its existence in the pool would mean that MaxPoolSize wasn't really the maximum number of open connections, because this closed (or still-opening) connection would reduce the number of slots available for completely opened connections.

From re-checking the logic which creates a connection, it looks like a background thread synchronously opens the connection and adds it to the connection pool pre-opened (rather than being created closed and opened dynamically or asynchronously) so my confusion's cleared now: MaxPoolSize is the correct approach. If that creation methodology changes in the future, I think the ambiguity would return though.

I agreed on the point around command-related metrics. It'd be useful to be able to assign some kind of tag (or dictionary of tags) to a SqlCommand and aggregate at that tag level, but that's probably a better fit for DbCommand. I've prefixed the existing metrics with dotnet because it seems to fit the pattern here but this'll no doubt need to change when the specs are finalised.

roji · 2024-03-02T09:22:39Z

I was thinking of a situation where a connection could have been allocated by the client, added to the connection pool, (to meet MinPoolSize's constraint) but not yet opened, because it's either not yet been used or it's still in the process of opening. In that situation, it would be neither idle nor used [...]

I'm not aware of such a state - I'm also not sure what it would mean for a connection to be "added but not idle". If a physical connection has been opened (which it must have been) and is in the pool in order to meet MinPoolSize, then it seems to me that it's just idle. Also, if there were such a special opened-for-MinPoolSize state, and then it's rented out by the application and then returned, does it then become just regular-idle? That would seem to be very weird, since the pool is exactly in the same state as it originally was.

In any case, I don't think any legitimate usage scenario out there would care (or want to deal with the added complexity) of an additional 3rd state... The important things to track with metrics are how many total physical connections the application is using, and to have a breakdown of how many of those are actually in-use vs. not in-use (to track application efficiency with regards to connection management etc.).

edwardneal · 2024-03-02T15:21:00Z

No, I fully agree - a background thread synchronously opens the connections, pushing them onto the pool. By the time the pool knows about them, the connections are in exactly the state you're describing.

For what it's worth though, I was thinking about two scenarios:

Large-scale expansion of the connection pool when the connections are slow to open. In that scenario, a two second connection time would take 40s to add twenty connections. My first thought there would have been to add the still-opening connection to the pool so that the pool expansion could continue apace, and the first client to actually use the connection would pay any remaining time required to open it. The three states we'd care about in the pool would be "active", "idle" and "starting".
Throttling connections to the SQL Server. In database-constrained scenarios such as Azure SQL Edge, we might want to allocate the memory for a connection on the client end, but not want to consume resources on the SQL Server. There's no ambiguity or background state transitions going on there: the connections added to the pool would simply be closed. The three states here would be "active", "idle" and "closed".

Neither of these scenarios are supported by the pool at the moment, so the point's academic. I can imagine situations where someone might want them though, so it's worth noting that the metrics might need some adjustment if that's implemented in the future.

I think we could match most of the command metrics in Npgsql in advance of the specs. By way of comparison, I think the full list of metrics should be:

Proposed metric	Npgsql metric
OpenTelemetry specifications
db.client.connections.usage	db.client.connections.usage	Already reported as a counter
db.client.connections.idle.max	db.client.connections.idle.max	Report DbConnectionPool's MaxPoolSize
db.client.connections.idle.min	db.client.connections.idle.min	Report DbConnectionPool's MinPoolSize
db.client.connections.idle.max	db.client.connections.max	Report DbConnectionPool's MaxPoolSize
db.client.connections.pending_requests	db.client.connections.pending_requests	To be instrumented
db.client.connections.timeouts	db.client.connections.timeouts	To be instrumented
db.client.connections.create_time	db.client.connections.create_time	To be instrumented, with units in milliseconds to match spec. Npgsql reports seconds - is this deliberate?
db.client.connections.wait_time		To be instrumented
db.client.connections.use_time		Report based on SqlStatistics' ConnectionTime key
Existing counters
dotnet.db.client.connection_pools.usage		Reported as counters
dotnet.db.client.connection_pool_groups.usage		Reported as counters
dotnet.db.client.connections.hard.usage		Reported as counters
dotnet.db.client.connections.connects		Reported as counters
dotnet.db.client.connections.disconnects		Reported as counters
SqlStatistics keys
dotnet.db.client.connections.server_command_time		Report ExecutionTime key, and document that this is more than just the time to execute SqlCommands directly
dotnet.db.client.connections.network_wait_time		Reported as NetworkServerTime key
dotnet.db.client.connections.io	db.client.commands.bytes_written, db.client.commands.bytes_read	BytesReceived and BytesSent. I include the bytes required to set up the connection, which is why it's in the .connections namespace. I've picked the .io name to match OTel's metric guidelines
dotnet.db.client.connections.buffers		Report BuffersReceived and BuffersSent
dotnet.db.client.commands.count		Expand CursorOpens, IduCount and SelectCount
dotnet.db.client.commands.rows		Expand IduRows and SelectRows
Npgsql metrics
dotnet.db.client.commands.pending_requests	db.client.commands.executing	To be instrumented. Synced name to connection metric names
dotnet.db.client.commands.failed	db.client.commands.failed	To be instrumented
dotnet.db.client.commands.use_time	db.client.commands.duration	To be instrumented. Ambivalent on the name, but have tried to sync it to the connection metric names
dotnet.db.client.commands.prepared_ratio	db.client.commands.prepared_ratio	Reported as PreparedExecs, UnpreparedExecs
Additional mopup metric
dotnet.db.client.commands.total		To be instrumented - total commands executed
New transactions metrics
dotnet.db.client.transactions.committed		To be instrumented
dotnet.db.client.transactions.active		To be instrumented
dotnet.db.client.transactions.rolled_back		To be instrumented
dotnet.db.client.transactions.total		To be instrumented
dotnet.db.client.transactions.commit_time		To be instrumented
dotnet.db.client.transactions.rollback_time		To be instrumented
dotnet.db.client.transactions.active_time		To be instrumented

Most of these are pretty generic across different implementations of DbCommand, so might be a useful model for other providers to follow. There's nothing stopping them from being implemented in SqlCommand and being moved into core later. They might also be informative to the rest of the specification's development.

A few points remain:

We've currently got commands.[total|failed|pending_requests] as three metrics. I'm thinking about whether it'd be better semantics to have commands.count{status=failed|pending|succeeded} and commands.total as rollups
It'd be good to have a PoolName property or similar in the connection string, so pool.name doesn't need to just be the redacted connection string. This is definitely limited to the wish list
A way to group specific commands. This might be a list of KeyValuePairs on the SqlCommand, so that commands which we expect to have a different performance profile can have their metrics tracked independently

If you and someone from the SqlClient team are happy with the approach, I can start work on the basic infrastructure, implement the first mandatory metric and then start trickling the rest of the metrics in piecemeal.

roji · 2024-03-02T16:40:25Z

FWIW I'd prefix the non-standard metrics with sqlclient rather than dotnet - I think the idea is for that to represent the specific component emitting the non-standard metric (so you know where it's coming from). In this case it's SqlClient rather than the .NET runtime etc.

Other than that, I'd recommend concentrating on the basic support as a first step, and then iterate from there, rather than thinking too much up-front about e.g. a new PoolName in the connection string or command groupings. There's definitely a lot that's possible with OpenTelemetry instrumentation, but it's probably a good idea to design the advanced stuff separately and as they become relevant etc.

JRahnama added the untriaged label Nov 7, 2023

JRahnama added 💡 Enhancement New feature request and removed untriaged labels Nov 7, 2023

This comment was marked as outdated.

Sign in to view

Wraith2 mentioned this issue Nov 16, 2023

Changing Microsoft.Data.SqlClient diagnostic format open-telemetry/opentelemetry-dotnet-contrib#1732

Open

roji mentioned this issue Mar 12, 2024

Remove dependencies on unstable OpenTelemetry packages in Aspire Components dotnet/aspire#2819

Closed

edwardneal mentioned this issue Mar 16, 2024

Proof of concept: OpenTelemetry metrics #2409

Draft

David-Engel mentioned this issue May 28, 2024

Add a method to dump connection pool information #655

Closed

edwardneal mentioned this issue Jul 22, 2024

Core pooling logic #2686

Merged

edwardneal mentioned this issue Aug 24, 2024

Merged DbConnectionClosed, DbConnectionFactory, SqlAuthenticationProviderManager #2410

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Emit OpenTelemetry metrics #2211

Emit OpenTelemetry metrics #2211

roji commented Nov 7, 2023 •

edited

Loading

JRahnama commented Nov 7, 2023 •

edited

Loading

This comment was marked as outdated.

roji commented Nov 8, 2023

edwardneal commented Feb 29, 2024

roji commented Mar 1, 2024

edwardneal commented Mar 1, 2024

roji commented Mar 1, 2024

edwardneal commented Mar 2, 2024

roji commented Mar 2, 2024

edwardneal commented Mar 2, 2024

roji commented Mar 2, 2024

Emit OpenTelemetry metrics #2211

Emit OpenTelemetry metrics #2211

Comments

roji commented Nov 7, 2023 • edited Loading

JRahnama commented Nov 7, 2023 • edited Loading

This comment was marked as outdated.

roji commented Nov 8, 2023

edwardneal commented Feb 29, 2024

roji commented Mar 1, 2024

edwardneal commented Mar 1, 2024

roji commented Mar 1, 2024

edwardneal commented Mar 2, 2024

roji commented Mar 2, 2024

edwardneal commented Mar 2, 2024

roji commented Mar 2, 2024

roji commented Nov 7, 2023 •

edited

Loading

JRahnama commented Nov 7, 2023 •

edited

Loading