Validate ShardingKey against Entity tags#1069

Open
aliyasirnac wants to merge 19 commits into apache:main from
aliyasirnac:feature/13814-add-validation-for-banyandb/measure

Conversation

@aliyasirnac
Contributor

Fix missing validation between Measure.ShardingKey and Measure.Entity

ShardingKey must be a superset of Entity to preserve entity locality;
otherwise TopN results become incorrect. This PR enforces the rule at
schema validation time.

@aliyasirnac aliyasirnac changed the title from "Validate ShardingKey against Entity tags #13814" to "Validate ShardingKey against Entity tags" Apr 14, 2026
@aliyasirnac aliyasirnac marked this pull request as draft April 14, 2026 14:42
@aliyasirnac aliyasirnac marked this pull request as ready for review April 14, 2026 15:06
@ButterBright ButterBright requested a review from hanahmily April 14, 2026 16:35
@ButterBright ButterBright added the enhancement New feature or request label Apr 14, 2026
@ButterBright ButterBright added this to the 0.11.0 milestone Apr 14, 2026
@ButterBright
Member

LGTM

@codecov-commenter

codecov-commenter commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.04%. Comparing base (3530dd9) to head (94618be).
⚠️ Report is 223 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| banyand/metadata/schema/property/client.go | 0.00% | 2 Missing and 2 partials ⚠️ |
| banyand/measure/metadata.go | 0.00% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1069      +/-   ##
==========================================
+ Coverage   45.97%   51.04%   +5.07%     
==========================================
  Files         328      417      +89     
  Lines       55505    68041   +12536     
==========================================
+ Hits        25520    34734    +9214     
- Misses      27909    30342    +2433     
- Partials     2076     2965     +889     

| Flag | Coverage Δ |
| --- | --- |
| banyand | 52.38% <0.00%> (?) |
| bydbctl | 82.35% <ø> (?) |
| fodc | 70.53% <ø> (?) |
| integration-distributed | 98.40% <ø> (?) |
| integration-standalone | 97.99% <ø> (?) |
| pkg | 31.02% <ø> (?) |

@ButterBright
Member

@aliyasirnac Please fix the CI failures caused by the new restriction.

@aliyasirnac
Contributor Author

Hello, following your feedback, I reviewed and tested my changes which only add a schema-level validation ensuring ShardingKey contains all Entity tags. The TestCollectWithPartialClosedSegments test in the storage package fails both on my branch and on main, so the issue is pre-existing. Setting SegmentIdleTimeout to 300ms resolves it, pointing to a timing sensitivity in the segment lifecycle.

Any suggestions on the right approach here? @ButterBright

@hanahmily
Contributor

> Hello, following your feedback, I reviewed and tested my changes which only add a schema-level validation ensuring ShardingKey contains all Entity tags. The TestCollectWithPartialClosedSegments test in the storage package fails both on my branch and on main, so the issue is pre-existing. Setting SegmentIdleTimeout to 300ms resolves it, pointing to a timing sensitivity in the segment lifecycle.
>
> Any suggestions on the right approach here? @ButterBright

It looks good to me

Comment thread api/validate/validate.go Outdated
@ButterBright
Member

> Hello, following your feedback, I reviewed and tested my changes which only add a schema-level validation ensuring ShardingKey contains all Entity tags. The TestCollectWithPartialClosedSegments test in the storage package fails both on my branch and on main, so the issue is pre-existing. Setting SegmentIdleTimeout to 300ms resolves it, pointing to a timing sensitivity in the segment lifecycle.
>
> Any suggestions on the right approach here? @ButterBright

Test data under /pkg/test/measure (e.g., service_instance_cpm_minute.json) should be updated to pass the validation checks.

Comment thread api/validate/validate.go Outdated
	}
	for i, tag := range measure.ShardingKey.TagNames {
		if measure.Entity.TagNames[i] != tag {
			return errors.New("ShardingKey must be a prefix of Entity tags to guarantee entity locality")
Contributor

The e2e test failed. Could you print measure.ShardingKey.TagNames and measure.Entity.TagNames so that I can see the details of the error?

Contributor Author

I added debug logging as suggested. The validation passes and no errors are returned; here is the output:
{"level":"info","time":"2026-04-16T18:53:23+03:00","message":"Full ShardingKey.TagNames: [service_id entity_id]"}
{"level":"info","time":"2026-04-16T18:53:23+03:00","message":"Full Entity.TagNames: [service_id entity_id]"}

Contributor

Based on that, the validation should pass when ShardingKey is identical to the Entity, not just a subset.

@aliyasirnac
Contributor Author

I investigated the CI failures. These are not caused by the validation changes in this PR; the same tests also fail on the main branch.

The failures are infrastructure/timing issues in the distributed integration tests:

  • stat /tmp/.../discovery.yaml: no such file or directory
  • rpc error: code = Unavailable desc = connection refused
  • fail to locate test-trace-group/sw(1,0): no nodes available

All measure schemas pass the new ShardingKey validation without issues.

@hanahmily
Contributor

> I investigated the CI failures. These are not caused by the validation changes in this PR, the same tests also fail on the main branch.
>
> The failures are infrastructure/timing issues in the distributed integration tests:
>
>   • stat /tmp/.../discovery.yaml: no such file or directory
>   • rpc error: code = Unavailable desc = connection refused
>   • fail to locate test-trace-group/sw(1,0): no nodes available
>
> All measure schemas pass the new ShardingKey validation without issues.

The e2e failed due to the validation. I have provided a suggestion. Please follow it to improve the validation logic.

@hanahmily
Contributor

@aliyasirnac The OAP gets this error:

2026-04-22 01:11:10,484 org.apache.skywalking.oap.server.starter.OAPServerBootstrap 64 [main] ERROR [] - fail to create schema endpoint_cpm
org.apache.skywalking.oap.server.library.module.ModuleStartException: fail to create schema endpoint_cpm
	at org.apache.skywalking.oap.server.storage.plugin.banyandb.BanyanDBStorageProvider.start(BanyanDBStorageProvider.java:243)
	at org.apache.skywalking.oap.server.library.module.BootstrapFlow.start(BootstrapFlow.java:46)
	at org.apache.skywalking.oap.server.library.module.ModuleManager.init(ModuleManager.java:75)
	at org.apache.skywalking.oap.server.starter.OAPServerBootstrap.start(OAPServerBootstrap.java:52)
	at org.apache.skywalking.oap.server.starter.OAPServerStartUp.main(OAPServerStartUp.java:23)
Caused by: org.apache.skywalking.oap.server.core.storage.StorageException: fail to create schema endpoint_cpm
	at org.apache.skywalking.oap.server.storage.plugin.banyandb.BanyanDBIndexInstaller.createTable(BanyanDBIndexInstaller.java:262)
	at org.apache.skywalking.oap.server.core.storage.model.ModelInstaller.whenCreating(ModelInstaller.java:66)
	at org.apache.skywalking.oap.server.core.storage.model.StorageModels.addModelListener(StorageModels.java:197)
	at org.apache.skywalking.oap.server.storage.plugin.banyandb.BanyanDBStorageProvider.start(BanyanDBStorageProvider.java:241)
	... 4 more
Caused by: org.apache.skywalking.library.banyandb.v1.client.grpc.exception.UnknownException: io.grpc.StatusRuntimeException: UNKNOWN: ShardingKey must be a prefix of Entity tags to guarantee entity locality
	at org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBApiExceptionFactory.createException(BanyanDBApiExceptionFactory.java:61)
	at org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBGrpcApiExceptionFactory.create(BanyanDBGrpcApiExceptionFactory.java:52)
	at org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBGrpcApiExceptionFactory.createException(BanyanDBGrpcApiExceptionFactory.java:40)
	at org.apache.skywalking.library.banyandb.v1.client.grpc.HandleExceptionsWith.callAndTranslateApiException(HandleExceptionsWith.java:47)
	at org.apache.skywalking.library.banyandb.v1.client.grpc.MetadataClient.execute(MetadataClient.java:97)
	at org.apache.skywalking.library.banyandb.v1.client.metadata.MeasureMetadataRegistry.create(MeasureMetadataRegistry.java:39)
	at org.apache.skywalking.library.banyandb.v1.client.BanyanDBClient.define(BanyanDBClient.java:395)
	at org.apache.skywalking.oap.server.storage.plugin.banyandb.BanyanDBIndexInstaller.createTable(BanyanDBIndexInstaller.java:224)
	... 7 more
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: ShardingKey must be a prefix of Entity tags to guarantee entity locality
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:351)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:332)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:174)
	at org.apache.skywalking.banyandb.database.v1.MeasureRegistryServiceGrpc$MeasureRegistryServiceBlockingStub.create(MeasureRegistryServiceGrpc.java:428)
	at org.apache.skywalking.library.banyandb.v1.client.metadata.MeasureMetadataRegistry.lambda$create$0(MeasureMetadataRegistry.java:40)
	at org.apache.skywalking.library.banyandb.v1.client.grpc.HandleExceptionsWith.callAndTranslateApiException(HandleExceptionsWith.java:45)
	... 11 more
2026-04-22T01:11:10.487063473Z pool-1-thread-1 INFO Stopping configuration XmlConfiguration[location=/skywalking/config/log4j2.xml, lastModified=2026-04-22T01:08:59.742Z]...
2026-04-22T01:11:10.487307938Z pool-1-thread-1 INFO Configuration XmlConfiguration[location=/skywalking/config/log4j2.xml, lastModified=2026-04-22T01:08:59.742Z] stopped

To analyze why it failed, append measure.ShardingKey.TagNames and measure.Entity.TagNames to the error message. You can run the e2e test locally to debug the root cause.
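
For illustration, a minimal sketch of an error that carries both tag lists. It assumes the tag names are plain string slices; checkShardingKeyPrefix is a hypothetical helper, not the function in api/validate/validate.go.

```go
package main

import "fmt"

// checkShardingKeyPrefix is a hypothetical helper (not the PR's actual code).
// It returns an error when shardingKey is not a prefix of entity, and puts
// both tag lists into the message so a rejected schema can be diagnosed
// directly from the client-side stack trace.
func checkShardingKeyPrefix(shardingKey, entity []string) error {
	// A ShardingKey longer than Entity can never be a prefix of it.
	mismatch := len(shardingKey) > len(entity)
	// Compare position by position; the length check above keeps i in range.
	for i := 0; !mismatch && i < len(shardingKey); i++ {
		mismatch = shardingKey[i] != entity[i]
	}
	if mismatch {
		return fmt.Errorf("ShardingKey %v must be a prefix of Entity tags %v", shardingKey, entity)
	}
	return nil
}
```

With ShardingKey [service_id] and Entity [entity_id], this sketch reports "ShardingKey [service_id] must be a prefix of Entity tags [entity_id]"; identical lists pass.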

@aliyasirnac
Contributor Author

I ran into a Docker-related issue while trying to run the E2E tests locally. I've been quite busy with work over the past few days, but I plan to dedicate some time this weekend to resolve it.

@ButterBright
Member

Here is a guide about debugging e2e tests locally that might help. Feel free to share the errors if you encounter any issues.

@aliyasirnac
Contributor Author

> Here is a guide about debugging e2e tests locally that might help. Feel free to share the errors if you encounter any issues.

Thanks.

@aliyasirnac
Contributor Author

Got the e2e working locally and tracked down the root cause. Legacy schemas like endpoint_cpm have a ShardingKey that isn't a prefix of their Entity tags, so BanyanDB rejects them on startup and OAP can't boot.

Instead of hard-failing, the validation now logs a warning rather than returning an error. Schemas register fine, OAP starts up, and operators still see the bad locality in the logs. Let me know what you think.
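
Roughly, the advisory behaviour can be sketched like this. It is an illustration only: validateAdvisory, logWarn, and errLocality are hypothetical names, and the standard library logger stands in for whatever logger the project actually uses.

```go
package main

import (
	"errors"
	"log"
)

// errLocality is a stand-in for the error the strict check would return.
var errLocality = errors.New("ShardingKey must be a prefix of Entity tags to guarantee entity locality")

// validateAdvisory is a hypothetical wrapper: it runs the strict check, and
// if the check fails it reports the problem through warn and still returns
// nil, so the schema is registered and the caller can continue starting up.
func validateAdvisory(name string, check func() error, warn func(format string, args ...any)) error {
	if err := check(); err != nil {
		warn("measure %q: %v (schema accepted)", name, err)
	}
	return nil
}

// logWarn adapts the standard library logger to the warn callback.
func logWarn(format string, args ...any) {
	log.Printf("WARN "+format, args...)
}
```

A caller would pass logWarn in normal operation and a counting stub in tests, which keeps the advisory path easy to verify.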

@hanahmily
Contributor

> Got the e2e working locally and tracked down the root cause. Legacy schemas like endpoint_cpm have a ShardingKey that isn't a prefix of their Entity tags, so BanyanDB rejects them on startup and OAP can't boot.
>
> Instead of hard-failing, I changed the validation to log a warning instead of returning an error. Schemas register fine, OAP starts up, and operators still see the bad locality in the logs. Let me know what you think.

Could you show me the ShardingKey and Entity for endpoint_cpm? We might enhance the validation logic.

@hanahmily
Contributor

hanahmily commented Apr 27, 2026

Tested this PR against the standard OAP compose (test/e2e-v2/cases/storage/banyandb/, OAP @ b4a8811d). The advisory check fires 174 times across 87 unique measures at OAP startup — every one of them with the same mismatch:

ShardingKey [service_id] must be a prefix of Entity tags [entity_id]

The affected measures are the entire endpoint-scoped, browser-page, service-relation, instance-relation, and kubernetes-endpoint metric families. The schema OAP installs:

entity:      { tagNames: [entity_id] }
shardingKey: { tagNames: [service_id] }

This is OAP's intentional composite-id pattern: entity_id is a base64-encoded value that already contains service_id, and service_id is exposed separately so that service-level TopN aggregation is correct. The current "prefix of Entity tags" rule can't express this because it compares tag names literally.

Two other issues with the rule as written:

  1. The PR description says "ShardingKey must be a superset of Entity," but the code checks for a prefix. The real invariant for entity locality is ShardingKey ⊆ Entity (set-wise).
  2. The test case "invalid sharding key (subset with different tag)" (Entity=[s,i], ShardingKey=[i]) is rejected, although locality is preserved in that case.

Proposal

// (a) Skip when len(Entity.TagNames) == 1 — composite-id pattern (OAP's entity_id).
// (b) Otherwise require ShardingKey ⊆ Entity AND shared tags in the same relative order.

(a) eliminates the 87 false positives; (b) catches reorder bugs and the superset case (ShardingKey tag absent from Entity), both of which the current rule either misses or over-rejects.
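
The two rules could be sketched roughly as follows. This is only an illustration under stated assumptions: validateShardingKey is a hypothetical name, tag names are plain string slices rather than the real schema types, and the error text is invented for the example.

```go
package main

import "fmt"

// validateShardingKey is a hypothetical sketch of the proposed rule.
func validateShardingKey(shardingKey, entity []string) error {
	// (a) A single-tag Entity is treated as a composite id (for example
	// OAP's entity_id), so tag names cannot be compared literally: skip.
	if len(entity) == 1 {
		return nil
	}
	// (b) Every ShardingKey tag must appear in Entity, and the shared tags
	// must keep the same relative order. Scanning Entity left to right and
	// never moving backwards checks both conditions at once.
	next := 0 // index in entity where the search for the next tag resumes
	for _, tag := range shardingKey {
		found := false
		for next < len(entity) {
			next++
			if entity[next-1] == tag {
				found = true
				break
			}
		}
		if !found {
			return fmt.Errorf("ShardingKey %v must be an ordered subset of Entity tags %v", shardingKey, entity)
		}
	}
	return nil
}
```

Under this sketch, Entity=[entity_id] with ShardingKey=[service_id] is skipped by (a), Entity=[s,i] with ShardingKey=[i] passes, and both a reordered key (ShardingKey=[i,s]) and a key with a tag absent from Entity are rejected by (b).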

@aliyasirnac
Contributor Author

aliyasirnac commented Apr 27, 2026 via email

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

feat(banyandb/measure): add validation to enforce ShardingKey compatibility with Entity

4 participants