Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture document size, RU usage, retry stats for get and put in CosmosDB #4652

Merged
merged 9 commits into from Oct 1, 2019

Conversation

chetanmeh
Copy link
Member

@chetanmeh chetanmeh commented Sep 30, 2019

This PR enables capturing stats related to document size and RU used for get and put operation.

Description

Document Size Stats

With this PR the get and put operation would include the document size and RU used

[2019-09-30T11:29:53.553Z] [INFO] [#tid_1] [CosmosDBArtifactStore] [PUT] 'subjects' completed document: 'id: anon-OIPNjW7IigJetLxDwwlPwlcG2Fl, rev: null', size=235, ru=7.05 [marker:database_saveDocument_finish:9308:685]

Further the PR also add a histogram for document size.

# TYPE histogram_cosmosdb_doc_size_bytes histogram
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="512.0"} 0.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="1024.0"} 0.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="2048.0"} 0.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="4096.0"} 0.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="16384.0"} 0.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="65536.0"} 3815.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="524288.0"} 3815.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="1048576.0"} 3815.0
histogram_cosmosdb_doc_size_bytes_bucket{collection="activations",le="+Inf"} 3815.0
histogram_cosmosdb_doc_size_bytes_count{collection="activations"} 3815.0
histogram_cosmosdb_doc_size_bytes_sum{collection="activations"} 192398080.0

Client Diagnostic Stats

If the TransactionId has debug mode enabled (extension of support added in #4275) then it would also include some client side diagnostic info generated by CosmosDB when using Direct mode

[2019-09-30T11:50:34.472Z] [INFO] [#tid_42] [CosmosDBArtifactStore] [PUT] 'subjects' completed document: 'id: anon-MZIVqUOiMiyxxYXJTyLxYnihVNI, rev: null', size=235, ru=7.05 RequestStartTime: "30 Sep 2019 11:50:30.908", RequestEndTime: "30 Sep 2019 11:50:34.464", Duration: 3556 ms, Number of regions attempted: 1 StoreResponseStatistics{requestResponseTime="30 Sep 2019 11:50:33.281", storeResult=storePhysicalAddress: rntbd://cdb-ms-prod-eastus1-fd18.documents.azure.com:14047/apps/a12ab0b2-98f6-4f5a-a7d4-f67e3a08d496/services/e50f5eae-0700-4539-9323-e4a587786587/partitions/eaa72284-4d0b-4ddf-a919-c890c12803a3/replicas/132138294859720735p/, lsn: 1177, globalCommittedLsn: 1176, partitionKeyRangeId: 0, isValid: true, statusCode: 201, subStatusCode: 0, isGone: false, isNotFound: false, isInvalidPartition: false, requestCharge: 7.05, itemLSN: -1, sessionToken: -1#1177, exception: null, requestResourceType=Document, requestOperationType=Create} AddressResolutionStatistics{startTime="30 Sep 2019 11:50:31.471", endTime="30 Sep 2019 11:50:31.705", targetEndpoint='https://foo-eastus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FXk1oAA%3D%3D%2Fcolls%2FXk1oAKUKbiU%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=0'} StoreResponseStatistics{requestResponseTime="30 Sep 2019 11:50:34.464", storeResult=storePhysicalAddress: rntbd://cdb-ms-prod-eastus1-fd18.documents.azure.com:14064/apps/a12ab0b2-98f6-4f5a-a7d4-f67e3a08d496/services/e50f5eae-0700-4539-9323-e4a587786587/partitions/eaa72284-4d0b-4ddf-a919-c890c12803a3/replicas/132119752170912464s/, lsn: 1177, globalCommittedLsn: 1177, partitionKeyRangeId: 0, isValid: true, statusCode: 204, subStatusCode: 0, isGone: false, isNotFound: false, isInvalidPartition: false, requestCharge: 0.0, itemLSN: 660, sessionToken: -1#1177, exception: null, requestResourceType=DocumentCollection, requestOperationType=Head} [marker:database_saveDocument_finish:15080:3568]

Retry stats

The PR also enabled to capture the internal retry done by SDK upon throttling. Currently there is no direct support to capture such stats (Azure/azure-cosmosdb-java#262). So this PR makes use of Logback to capture the stats by intercepting the warning log message generated upon a 429 response

[2019-09-27T07:57:15.161Z] [WARN] Operation will be retried after 6 milliseconds. [2019-09-27T07:57:15.161Z] [WARN] Operation will be retried after 6 milliseconds. Current attempt 1, Cumulative delay PT0.006S
com.microsoft.azure.cosmosdb.DocumentClientException: Message: {"Errors":["Request rate is large"]}
ActivityId: 28ebd9fc-2615-4e9f-b4a7-ec3ebb464b36, Request URI: /apps/cebf41ab-a2c2-4055-b592-9887b5c5a634/services/91918f88-9f66-459c-93cf-53644d05f280/partitions/2baa1e8f-3751-444c-8be7-f831dc692a63/replicas/132121471945325759p/, RequestStats:
RequestStartTime: 2019-09-27T07:57:14.6731085Z, RequestEndTime: 2019-09-27T07:57:14.6731085Z, Number of regions attempted: 1
ResponseTime: 2019-09-27T07:57:14.6731085Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-eastus1-fd34.documents.azure.com:16701/apps/cebf41ab-a2c2-4055-b592-9887b5c5a634/services/91918f88-9f66-459c-93cf-53644d05f280/partitions/2baa1e8f-3751-444c-8be7-f831dc692a63/replicas/132121471945325759p/, LSN: 1898, GlobalCommittedLsn: 1897, PartitionKeyRangeId: , IsValid: True, StatusCode: 429, SubStatusCode: 3200, RequestCharge: 0.38, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: null, ResourceType: Document, OperationType: Create
, SDK: Microsoft.Azure.Documents.Common/2.5.1, StatusCode: TooManyRequests
    at com.microsoft.azure.cosmosdb.rx.internal.RxGatewayStoreModel.validateOrThrow(RxGatewayStoreModel.java:444)
    at com.microsoft.azure.cosmosdb.rx.internal.RxGatewayStoreModel.lambda$null$8(RxGatewayStoreModel.java:382)

A typical log message looks like above which has info related to

  1. Attempt count
  2. If retry would be done or it was the last retry
  3. Operation type like Get or Create etc

Prometheus metrics for retry

counter_cosmosdb_retry_success_total{type="Create"} 164.0

Prometheus histogram for retry attempt

# TYPE histogram_cosmosdb_retry_success histogram
histogram_cosmosdb_retry_success_bucket{type="Create",le="1.0"} 0.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="2.0"} 0.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="3.0"} 0.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="5.0"} 1.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="7.0"} 1.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="10.0"} 1.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="12.0"} 1.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="15.0"} 1.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="20.0"} 1.0
histogram_cosmosdb_retry_success_bucket{type="Create",le="+Inf"} 1.0
histogram_cosmosdb_retry_success_count{type="Create"} 1.0
histogram_cosmosdb_retry_success_sum{type="Create"} 4.0

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@codecov-io
Copy link

codecov-io commented Sep 30, 2019

Codecov Report

Merging #4652 into master will decrease coverage by 6.03%.
The diff coverage is 20.58%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4652      +/-   ##
==========================================
- Coverage   84.46%   78.42%   -6.04%     
==========================================
  Files         183      184       +1     
  Lines        8377     8433      +56     
  Branches      566      604      +38     
==========================================
- Hits         7076     6614     -462     
- Misses       1301     1819     +518
Impacted Files Coverage Δ
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.89%) ⬇️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 7.69% <100%> (-48.31%) ⬇️
...core/database/cosmosdb/RetryMetricsCollector.scala 35.13% <35.13%> (ø)
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0% <0%> (-100%) ⬇️
...tabase/cosmosdb/cache/CacheInvalidatorConfig.scala 0% <0%> (-94.74%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.6%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedListener.scala 0% <0%> (-86.67%) ⬇️
...e/database/cosmosdb/cache/KafkaEventProducer.scala 0% <0%> (-76.48%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0% <0%> (-74.08%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a581bea...20069e7. Read the comment docs.

@@ -476,10 +488,11 @@ class CosmosDBArtifactStore[DocumentAbstraction <: DocumentSerializer](protected
(clusterId, clusterIdValue))
val fieldsToRemove = Seq(_id, _rev)
val mapped = transform(json, fieldsToAdd, fieldsToRemove)
val jsonString = mapped.compactPrint
val doc = new Document(mapped.compactPrint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe should avoid calling mapped.compactPrint twice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed that while refactoring. Reused existing computed json

Copy link
Contributor

@tysonnorris tysonnorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - one minor suggestion

@chetanmeh chetanmeh merged commit 3f9a780 into apache:master Oct 1, 2019
@chetanmeh chetanmeh deleted the cosmosdb-metrics-v3 branch October 1, 2019 17:13
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019
…sDB (apache#4652)

* Enable collection of retry stats
* Log document size and RU used for get and put operation
* Include extra logs for transaction with debug mode enabled
* Histogram metric for document size
* Add custom histogram buckets for retry stats
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants