SQL: Add is_active to sys.segments, update examples and docs. #11550

Merged May 19, 2022 (6 commits)

@gianm (Contributor) commented Aug 4, 2021:

is_active is short for:

(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1

It's important because this represents "all the segments that should
be queryable, whether or not they actually are right now". Most of the
time, this is the set of segments that people will want to look at.

The web console already adds this filter to a lot of its queries,
proving its usefulness.

This patch also reworks the caveat at the bottom of the sys.segments
section, so its information is mixed into the description of each result
field. This should make it more likely for people to see the information.
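
As a rough illustration of the predicate above, here is a minimal Python sketch. The function name `is_active` and the 0/1 integer arguments are assumptions chosen to mirror the `sys.segments` columns; this is not Druid's implementation.

```python
def is_active(is_published: int, is_overshadowed: int, is_realtime: int) -> int:
    """Return 1 if the segment should be queryable, else 0.

    Mirrors: (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1
    """
    published_and_current = is_published == 1 and is_overshadowed == 0
    return 1 if (published_and_current or is_realtime == 1) else 0


# Hypothetical flag combinations: (is_published, is_overshadowed, is_realtime)
examples = {
    "published, not overshadowed": (1, 0, 0),
    "published, overshadowed": (1, 1, 0),
    "realtime only": (0, 0, 1),
    "unpublished, not realtime": (0, 0, 0),
}
for label, flags in examples.items():
    print(f"{label}: is_active = {is_active(*flags)}")
```

Only the second and fourth combinations come out inactive, which matches the "should be queryable" reading in the description.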

@paul-rogers (Contributor) left a comment:

The doc part is quite clear and helpful, thanks. Suggested a few refinements.

|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process (Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|

Contributor:

Change the wording a bit? Seems the key bit for a user to know is: For a published segment, the number will either be null or accurate. If null, then the Broker has not received the row count yet. For an unpublished segment, the number will be slightly out of date as new data arrives. (Assuming this is an accurate statement.)

Contributor Author:

There's a short delay between when a segment is published and when num_rows becomes fully accurate, because the value is fetched by querying a data server rather than read from the published segment descriptor. I updated the wording to the following, which is hopefully clearer:

Number of rows in this segment, or zero if the number of rows is not known.

This row count is gathered by the Broker in the background. It will be zero if the Broker has not gathered a row count for this segment yet. For segments ingested from streams, the reported row count may lag behind the result of a count(*) query because the cached num_rows on the Broker may be out of date. This will settle shortly after new rows stop being written to that particular segment.

(I also changed "null" to "zero" because that's what it actually is.)
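
For readers consuming `num_rows` programmatically, a small Python sketch of treating zero as "not gathered yet" follows. The helper names `known_row_count` and `summarize_rows` are hypothetical. Note the sketch cannot tell an uncounted segment apart from one that genuinely holds no rows, since both report zero.

```python
from typing import Iterable, Optional, Tuple


def known_row_count(num_rows: int) -> Optional[int]:
    """Map the 0 sentinel to None so 'not gathered yet' is explicit."""
    return None if num_rows == 0 else num_rows


def summarize_rows(row_counts: Iterable[int]) -> Tuple[int, int]:
    """Return (total of known row counts, number of segments reporting 0)."""
    total = 0
    unknown = 0
    for n in row_counts:
        count = known_row_count(n)
        if count is None:
            unknown += 1
        else:
            total += count
    return total, unknown


# Hypothetical num_rows values read from sys.segments
total, unknown = summarize_rows([10, 0, 5, 0])
print(f"known rows: {total}, segments without a count: {unknown}")
```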

@loquisgon commented May 19, 2022:

It is unfortunate that the state of "we don't know the number of rows yet because we haven't finished checking" is zero. I'd rather have it be null or some other indication ("?" or "processing"... I know it is not easy to find an alternative). Until now I was puzzled about why there is a lag during which rows are zero in the web console and then suddenly they are not. This happened while I was working on tombstones, because "zero" rows was an indication to me that the segment might be a tombstone, until it was not lol....

|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and queryable. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|

Contributor:

This is the second (third) place in the docs that emphasizes "should". Is this notion explained anywhere? Does this mean that the segment is scheduled to load into a Historical, but has not yet done so? Or does it mean there is some kind of problem that the user must resolve?

Contributor Author:

The context with the "should be" is that everything with regard to ingestion and segment availability happens in the background and is asynchronous. So some segments maybe should be available, but aren't right now, and the system will work to make them available. Some others maybe are available, but shouldn't be (because they were dropped or replaced), and the system will work to make them unavailable.

I changed the wording to hopefully be more clear:

True for segments that represent the latest state of a datasource.

Equivalent to (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1. In steady state, when no ingestions or data management operations are happening, is_active will be equivalent to is_available. However, they may differ from each other when ingestions or data management operations have executed recently. In these cases, Druid will load and unload segments appropriately to bring actual availability in line with the expected state given by is_active.
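
To make the difference between `is_active` and `is_available` concrete, the Python sketch below splits rows into segments that are expected but not yet served, and segments that are served but no longer expected. The function name `availability_drift` and the dictionary row shape are assumptions for illustration; only the column names come from `sys.segments`.

```python
from typing import Any, Dict, Iterable, List


def availability_drift(segments: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group segment IDs by how is_active and is_available disagree."""
    pending_load: List[str] = []    # active but not yet available
    pending_unload: List[str] = []  # available but no longer active
    for seg in segments:
        active = seg["is_active"] == 1
        available = seg["is_available"] == 1
        if active and not available:
            pending_load.append(str(seg["segment_id"]))
        elif available and not active:
            pending_unload.append(str(seg["segment_id"]))
    return {"pending_load": pending_load, "pending_unload": pending_unload}


# Hypothetical rows shaped like a sys.segments result
rows = [
    {"segment_id": "seg_a", "is_active": 1, "is_available": 1},
    {"segment_id": "seg_b", "is_active": 1, "is_available": 0},
    {"segment_id": "seg_c", "is_active": 0, "is_available": 1},
]
print(availability_drift(rows))
```

In steady state both lists would be empty, which is the case the wording above describes as `is_active` being equivalent to `is_available`.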

|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Contributor:

Presumably "published to the metadata store" means "by the MiddleManager at the completion of ingestion"?

Contributor Author:

Yes.

|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
|is_available|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process (Historical or realtime). See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
|is_realtime|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
|is_overshadowed|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always 0 for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Contributor:

Nit: consistent use of code font: is_overshadowed.

Contributor Author:

Thanks, fixed.

|shard_spec|STRING|JSON-serialized form of the segment `ShardSpec`|
|dimensions|STRING|JSON-serialized form of the segment dimensions|
|metrics|STRING|JSON-serialized form of the segment metrics|
|last_compaction_state|STRING|JSON-serialized form of the compaction task's config (compaction task which created this segment). May be null if segment was not created by compaction task.|

-For example to retrieve all segments for datasource "wikipedia", use the query:
+For example to retrieve all currently-active segments for datasource "wikipedia", use the query:

Member:

Suggested change:
-For example to retrieve all currently-active segments for datasource "wikipedia", use the query:
+For example to retrieve all currently active segments for datasource "wikipedia", use the query:
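
The query that this sentence introduces is not shown in the excerpt. As a hedged sketch, the Python below builds a JSON request body of the kind Druid's SQL API accepts at `/druid/v2/sql`. The SQL text itself is an assumption based on the `is_active` column described in this PR, and the helper name `active_segments_request` is hypothetical.

```python
import json


def active_segments_request(datasource: str) -> str:
    """Build a JSON body that selects the active segments of one datasource."""
    body = {
        # Assumed query text; `?` is bound from the parameters list below.
        "query": "SELECT * FROM sys.segments WHERE datasource = ? AND is_active = 1",
        "parameters": [{"type": "VARCHAR", "value": datasource}],
    }
    return json.dumps(body)


print(active_segments_request("wikipedia"))
```

Passing the datasource as a query parameter rather than splicing it into the SQL string avoids quoting problems with unusual datasource names.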

@techdocsmith (Contributor) commented:

@vtlim, I think this might have merge conflicts due to the SQL refactor. Is there any way we can get @gianm's updates into the current structure?

@gianm (Contributor Author) commented May 14, 2022:

I've merged master with this branch and re-pushed it. The doc changes are now made in querying/sql-metadata-tables.md.

|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|shard_spec|STRING|JSON-serialized form of the segment `ShardSpec`|
|num_rows|LONG|Number of rows in this segment, or zero if the number of rows is not known.<br /><br />This row count is gathered by the Broker in the background. It will be zero if the Broker has not gathered a row count for this segment yet. For segments ingested from streams, the reported row count may lag behind the result of a `count(*)` query because the cached `num_rows` on the Broker may be out of date. This will settle shortly after new rows stop being written to that particular segment.|
|is_active|LONG|True for segments that represent the latest state of a datasource.<br /><br />Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`. In steady state, when no ingestion or data management operations are happening, `is_active` will be equivalent to `is_available`. However, they may differ from each other when ingestion or data management operations have executed recently. In these cases, Druid will load and unload segments appropriately to bring actual availability in line with the expected state given by `is_active`.|

@loquisgon commented May 19, 2022:

Very minor nit: ... At the end of this great explanation, just to repeat it so it sticks: "given by is_active. In other words, a segment that is in the is_active state may not be available, not queryable, yet, but it will be in the near future".

"might be"? I guess it is possible that due to some other activities (segment was overshadowed before being available for instance) a segment in is_active may never make it to is_available....

Contributor Author:

Yeah: there's a couple reasons a segment in is_active state won't eventually become is_available. Maybe it's dropped before that happens. Or maybe something is broken. In the interest of keeping the doc from getting too long I'm thinking to leave it as-is. But I invite follow-up patches that improve things 🙂

@@ -313,7 +316,10 @@ public Enumerable<Object[]> scan(DataContext root)
(long) segment.getShardSpec().getPartitionNum(),
numReplicas,
numRows,
IS_PUBLISHED_TRUE, //is_published is true for published segments
//is_active is true for published segments that are not overshadowed
val.isOvershadowed() ? IS_ACTIVE_FALSE : IS_ACTIVE_TRUE,

@loquisgon commented May 19, 2022:

mmm.. isn't it a requirement for being active that is_overshadow and is_publish both be true? Oh...got it. We already know that it is published if we are here. So it is fine...never mind.

Contributor Author:

Yes, that's the idea. The branch is for published segments only.

@loquisgon left a comment:

LGTM. Thanks @gianm, this PR will be very helpful to the community.

@gianm (Contributor Author) commented May 19, 2022:

Thanks for reviewing, @loquisgon!

@gianm merged commit 65a1375 into apache:master on May 19, 2022
@gianm deleted the sql-sys-segments-examples branch on May 19, 2022, 21:23
@abhishekagarwal87 added this to the 24.0.0 milestone on Aug 26, 2022