Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building #15817

findingrish · 2024-02-01T04:53:27Z

Description

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). The remaining goal is to eliminate the need to periodically run segment metadata queries to obtain segment schema information.

This is the final change: tasks publish the schema of finalized segments, and the Coordinator periodically polls it from the metadata database.

Design

Database

Schema Table

Table Name: SegmentSchema
Purpose: Store each unique segment schema.

Columns

| Column Name | Data Type | Description |
| --- | --- | --- |
| id | autoincrement | primary key |
| created_date | varchar | creation time, allows filtering schema created after a point |
| datasource | varchar | datasource |
| fingerprint | varchar | unique identifier for the schema, sha-256 hash of payload, datasource & version |
| payload | blob | includes rowSignature, aggregatorFactories |
| used | boolean | true if the schema is referenced by `used` segments |
| used_status_last_updated | varchar | timestamp when the used status was last updated |
| version | int | schema version |
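
To illustrate the `fingerprint` column, here is a minimal sketch of hashing the payload, datasource and version together with SHA-256. It is not the PR's `FingerprintGenerator`: the class name `FingerprintSketch`, the use of the JDK `MessageDigest`, the pre-serialized `byte[]` payload and the length-prefixing of fields are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; names and byte layout are assumptions, not the PR's code.
final class FingerprintSketch
{
  private FingerprintSketch() {}

  /** Returns the lowercase hex SHA-256 of (serialized payload, dataSource, version). */
  static String fingerprint(byte[] serializedPayload, String dataSource, int version)
  {
    final MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-256");
    }
    catch (NoSuchAlgorithmException e) {
      // SHA-256 must be supported by every Java platform implementation.
      throw new IllegalStateException(e);
    }
    // Length-prefix the variable-length fields so that field boundaries are unambiguous.
    updateWithLength(digest, serializedPayload);
    updateWithLength(digest, dataSource.getBytes(StandardCharsets.UTF_8));
    digest.update(ByteBuffer.allocate(Integer.BYTES).putInt(version).array());

    final StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  private static void updateWithLength(MessageDigest digest, byte[] bytes)
  {
    digest.update(ByteBuffer.allocate(Integer.BYTES).putInt(bytes.length).array());
    digest.update(bytes);
  }
}
```

Because the datasource and version are part of the hash input, the same payload published for a different datasource or schema version gets a different fingerprint, which is what lets the table store one row per unique (payload, datasource, version) combination.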

Segments Table

New columns will be added to the already existing Segments table.

Columns

| Column Name | Data Type | Description |
| --- | --- | --- |
| num_rows | long | number of rows in the segment |
| schema_fingerprint | string | schema fingerprint |

Task

Changes are required in the task to publish schema along with segment metadata.

Introduce a new class SchemaPayload to encapsulate RowSignature and AggregatorFactories.
Introduce a new class SegmentSchemaMetadata to encapsulate SchemaPayload and numRows.
Introduce a new class SegmentSchemaMapping to encapsulate schema and numRows information for multiple segments.
Update SegmentInsertAction, SegmentTransactionalReplaceAction, SegmentTransactionalAppendAction & SegmentTransactionalInsertAction to take in segment schema.
Changes in AbstractBatchIndexTask#buildPublishAction to take segment schema.
Changes in SegmentAndCommitMetadata to take segment schema.
Changes in TransactionalSegmentPublisher to take segment schema for publishing to the DB.
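
The relationship between the three new classes can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the `*Sketch` class names are hypothetical, plain string maps stand in for `RowSignature` and the aggregator factories, and deduplicating payloads by fingerprint inside the mapping is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for SchemaPayload: column types plus aggregators (both simplified to strings).
final class SchemaPayloadSketch
{
  final Map<String, String> rowSignature;
  final Map<String, String> aggregatorFactories;

  SchemaPayloadSketch(Map<String, String> rowSignature, Map<String, String> aggregatorFactories)
  {
    this.rowSignature = Collections.unmodifiableMap(new LinkedHashMap<>(rowSignature));
    this.aggregatorFactories = Collections.unmodifiableMap(new LinkedHashMap<>(aggregatorFactories));
  }
}

// Stand-in for SegmentSchemaMetadata: one segment's payload and row count.
final class SegmentSchemaMetadataSketch
{
  final SchemaPayloadSketch schemaPayload;
  final long numRows;

  SegmentSchemaMetadataSketch(SchemaPayloadSketch schemaPayload, long numRows)
  {
    this.schemaPayload = schemaPayload;
    this.numRows = numRows;
  }
}

// Stand-in for SegmentSchemaMapping: schema and numRows information for many segments.
final class SegmentSchemaMappingSketch
{
  final int schemaVersion;
  final Map<String, String> segmentIdToFingerprint = new HashMap<>();
  final Map<String, Long> segmentIdToNumRows = new HashMap<>();
  final Map<String, SchemaPayloadSketch> fingerprintToPayload = new HashMap<>();

  SegmentSchemaMappingSketch(int schemaVersion)
  {
    this.schemaVersion = schemaVersion;
  }

  void addSchema(String segmentId, String fingerprint, SegmentSchemaMetadataSketch metadata)
  {
    segmentIdToFingerprint.put(segmentId, fingerprint);
    segmentIdToNumRows.put(segmentId, metadata.numRows);
    // Assumption: segments sharing a fingerprint share one stored payload.
    fingerprintToPayload.putIfAbsent(fingerprint, metadata.schemaPayload);
  }
}
```

With this shape, a task that pushes many segments with an identical schema carries the payload once and only a fingerprint and row count per segment.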

Streaming

Changes in StreamAppenderator to get the RowSignature, AggregatorFactories and numRows for the segment.

Batch

AppenderatorImpl#push to build the segment schema and add it to SegmentsAndCommitMetadata.
BatchAppenderator#push to build the segment schema and add it to SegmentsAndCommitMetadata.

IndexTask

Changes in BatchAppenderatorDriver#publishAll to pass segment schema for publishing.
Change in IndexTask#generateAndPublishSegments to fetch segment schema from pushed segments and publish.

ParallelIndexSupervisorTask

Changes in ParallelIndexSupervisorTask#publishSegments to combine segment schema from segments and publish them.
SinglePhaseSubTask
PartialSegmentMergeTask

MSQ

Changes in SegmentGeneratorFrameProcessor to return segment schema along with segment metadata.
Changes in SegmentGeneratorFrameProcessorFactory and ControllerImpl.
Note, these changes are reverted for now.

Overlord

Changes are required in the Overlord (IndexerSQLMetadataStorageCoordinator) to persist the schema along with segment metadata in the database.

Coordinator

Schema Poll

Changes in SqlSegmentsMetadataManager to poll schema along with segments.
Additionally poll schema_fingerprint and num_rows from the segments table.
Update schema cache.

Schema Caching

Maintain a cache of segment schema; refer to SegmentSchemaCache.
It caches the following information (SMQ stands for segment metadata query):

| Information | Writer | Cleanup |
| --- | --- | --- |
| SegmentMetadata: SegmentId -> schema fingerprint, numRows | Replaced on each DB poll. | Not required. |
| Schema for finalized segments: schema fingerprint -> SchemaPayload | Replaced on each DB poll. | Not required. |
| Realtime segment schema: SegmentId -> SegmentSchemaMetadata | Whenever Peons push a schema update. | When the segment is removed. |
| SMQResults which are not published: SegmentId -> SegmentSchemaMetadata | Added after the SMQ is executed. | Removed from this map once the SegmentSchemaBackFill queue successfully writes the schema to the database. |
| SMQResults which have been published: SegmentId -> SegmentSchemaMetadata | Added after the segment schema is published to the DB. | Cleared after each DB poll. |
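
The writer and cleanup rules in the table above can be sketched as a small class. This is a hedged illustration rather than the real `SegmentSchemaCache`: the class and method names are hypothetical, payloads are simplified to strings, numRows is omitted, and the lookup order in `getSchema` is an assumption. The finalized maps are held in one immutable snapshot behind a `volatile` reference so that a reader never sees the two maps from different polls.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the cache layout described in the table.
final class SegmentSchemaCacheSketch
{
  /** Both finalized maps are swapped together on each DB poll. */
  private static final class FinalizedSnapshot
  {
    final Map<String, String> segmentIdToFingerprint;
    final Map<String, String> fingerprintToPayload;

    FinalizedSnapshot(Map<String, String> segmentIdToFingerprint, Map<String, String> fingerprintToPayload)
    {
      this.segmentIdToFingerprint = Collections.unmodifiableMap(new HashMap<>(segmentIdToFingerprint));
      this.fingerprintToPayload = Collections.unmodifiableMap(new HashMap<>(fingerprintToPayload));
    }
  }

  private volatile FinalizedSnapshot finalized =
      new FinalizedSnapshot(Collections.emptyMap(), Collections.emptyMap());
  private final Map<String, String> realtime = new ConcurrentHashMap<>();
  private final Map<String, String> smqResults = new ConcurrentHashMap<>();
  private final Map<String, String> smqPublishedResults = new ConcurrentHashMap<>();

  void onDatabasePoll(Map<String, String> segmentIdToFingerprint, Map<String, String> fingerprintToPayload)
  {
    finalized = new FinalizedSnapshot(segmentIdToFingerprint, fingerprintToPayload);
    smqPublishedResults.clear(); // published SMQ results are cleared after each DB poll
  }

  void onRealtimeSchemaUpdate(String segmentId, String payload) { realtime.put(segmentId, payload); }

  void onSegmentRemoved(String segmentId) { realtime.remove(segmentId); }

  void onSmqResult(String segmentId, String payload) { smqResults.put(segmentId, payload); }

  void onBackfillPublished(String segmentId)
  {
    final String payload = smqResults.remove(segmentId);
    if (payload != null) {
      smqPublishedResults.put(segmentId, payload);
    }
  }

  /** Assumed lookup order: realtime, finalized, unpublished SMQ results, published SMQ results. */
  Optional<String> getSchema(String segmentId)
  {
    String payload = realtime.get(segmentId);
    if (payload == null) {
      final FinalizedSnapshot snapshot = finalized; // read the reference once
      final String fingerprint = snapshot.segmentIdToFingerprint.get(segmentId);
      if (fingerprint != null) {
        payload = snapshot.fingerprintToPayload.get(fingerprint);
      }
    }
    if (payload == null) {
      payload = smqResults.get(segmentId);
    }
    if (payload == null) {
      payload = smqPublishedResults.get(segmentId);
    }
    return Optional.ofNullable(payload);
  }
}
```

The published-results map bridges the gap between a successful backfill write and the next DB poll, after which the same schema is served from the finalized snapshot.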

SegmentMetadataCache changes

Changes in AbstractSegmentMetadataCache class to add new methods that child classes override:

additionalInitializationCondition
removeSegmentAction
segmentMetadataQueryResultHandler

Changes in CoordinatorSegmentMetadataCache to override methods from AbstractSegmentMetadataCache:

Implement additionalInitializationCondition to wait for the segmentSchemaCache to be initialized.
Implement removeSegmentAction to remove the schema from the schema cache.
Override segmentMetadataQueryResultHandler to additionally publish and cache the schema.
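
The hook pattern described above can be sketched as follows. Only the three hook names come from this PR; the signatures, the `*Sketch` classes, the boolean-returning initialization check and the `events` list are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: common flow calls hooks that subclasses may override.
abstract class AbstractCacheSketch
{
  final List<String> events = new ArrayList<>();

  final boolean tryInitialize()
  {
    if (!additionalInitializationCondition()) {
      return false; // a subclass is still waiting on something
    }
    events.add("initialized");
    return true;
  }

  final void removeSegment(String segmentId)
  {
    events.add("removed " + segmentId);
    removeSegmentAction(segmentId);
  }

  final void onSegmentMetadataQueryResult(String segmentId, String schema)
  {
    events.add("cached " + segmentId);
    segmentMetadataQueryResultHandler(segmentId, schema);
  }

  // Default hooks do nothing extra.
  protected boolean additionalInitializationCondition() { return true; }

  protected void removeSegmentAction(String segmentId) { }

  protected void segmentMetadataQueryResultHandler(String segmentId, String schema) { }
}

// Hypothetical Coordinator-side subclass wiring the hooks to a schema cache.
final class CoordinatorCacheSketch extends AbstractCacheSketch
{
  boolean schemaCacheInitialized = false;

  @Override
  protected boolean additionalInitializationCondition()
  {
    return schemaCacheInitialized; // wait for the segment schema cache
  }

  @Override
  protected void removeSegmentAction(String segmentId)
  {
    events.add("schema cache evicted " + segmentId);
  }

  @Override
  protected void segmentMetadataQueryResultHandler(String segmentId, String schema)
  {
    events.add("queued for backfill " + segmentId);
  }
}
```

Keeping the hooks as no-ops in the base class lets a cache that has no schema cache reuse the common flow unchanged.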

Schema Backfill

Added a new class SegmentSchemaBackFillQueue which accepts segment schemas and publishes them in batches.
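
A minimal sketch of the batching idea, assuming a bounded batch size and a caller-supplied publisher. The class name, the `maxBatchSize` parameter and the string entries are hypothetical; scheduling and any grouping of entries are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of a queue that publishes accumulated entries in batches.
final class BackfillQueueSketch
{
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final int maxBatchSize;
  private final Consumer<List<String>> batchPublisher;

  BackfillQueueSketch(int maxBatchSize, Consumer<List<String>> batchPublisher)
  {
    this.maxBatchSize = maxBatchSize;
    this.batchPublisher = batchPublisher;
  }

  /** Called when a segment metadata query has produced a schema to persist. */
  void add(String segmentSchemaEntry)
  {
    queue.add(segmentSchemaEntry);
  }

  /** Drains the queue in batches of at most maxBatchSize; returns the number of entries published. */
  int publishPending()
  {
    int published = 0;
    while (true) {
      final List<String> batch = new ArrayList<>();
      queue.drainTo(batch, maxBatchSize);
      if (batch.isEmpty()) {
        return published;
      }
      batchPublisher.accept(batch);
      published += batch.size();
    }
  }
}
```

Batching amortizes the cost of a database transaction over many segments instead of issuing one write per segment metadata query result.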

Schema Cleanup

Added a CoordinatorDuty to clean up schemas that are not referenced by any segment.
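
One way such a duty could work is sketched below, assuming it uses the `used` and `used_status_last_updated` columns together with a retention duration. The class name, the in-memory rows and the exact order of steps are assumptions, not the PR's implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: in-memory rows stand in for the schema table.
final class SchemaCleanupSketch
{
  static final class SchemaRow
  {
    final String fingerprint;
    boolean used;
    Instant usedStatusLastUpdated;

    SchemaRow(String fingerprint, boolean used, Instant usedStatusLastUpdated)
    {
      this.fingerprint = fingerprint;
      this.used = used;
      this.usedStatusLastUpdated = usedStatusLastUpdated;
    }
  }

  private SchemaCleanupSketch() {}

  /** Updates the used flag from current references and deletes long-unused rows; returns deleted fingerprints. */
  static List<String> run(
      List<SchemaRow> rows,
      Set<String> referencedFingerprints,
      Instant now,
      Duration durationToRetain
  )
  {
    final List<String> deleted = new ArrayList<>();
    final Instant cutoff = now.minus(durationToRetain);
    for (Iterator<SchemaRow> it = rows.iterator(); it.hasNext(); ) {
      final SchemaRow row = it.next();
      final boolean referenced = referencedFingerprints.contains(row.fingerprint);
      if (referenced != row.used) {
        // Flip the flag and restart the retention clock.
        row.used = referenced;
        row.usedStatusLastUpdated = now;
      } else if (!row.used && row.usedStatusLastUpdated.isBefore(cutoff)) {
        // Unused for longer than the retention period: safe to delete.
        it.remove();
        deleted.add(row.fingerprint);
      }
    }
    return deleted;
  }
}
```

Marking a schema unused before deleting it leaves a window in which a newly published segment that references the same fingerprint can bring it back into use.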

Coordinator leader flow changes

CoordinatorSegmentMetadataCache refresh is executed only on the leader node.
CoordinatorSegmentMetadataCache timeline callback continues to function on all Coordinator nodes.
SegmentSchemaCache is populated only on the leader node, except for the realtime schema information which is updated on all Coordinator nodes.
SegmentSchemaBackFillQueue functions only on the leader node.

Testing

The changes have been tested locally with the wikipedia dataset.
Unit tests have been added.
All of the existing integration tests have been run with the feature enabled (e8a6d9b).
Integration test with the group name centralized-table-schema runs successfully.
The changes have also been tested in a Druid cluster.

Upgrade considerations

The general upgrade order should be followed. The new code is behind a feature flag, so it is compatible with existing setups. Tasks with the new changes can communicate with an older version of the Overlord.

Release Notes

This feature addresses multiple challenges outlined in the linked issue. To enable it, set druid.centralizedDatasourceSchema.enabled.
If MiddleManagers (MM) are used, set druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled.

When the feature is enabled:

Realtime segment schema changes are periodically pushed to the Coordinator.
Finalized segment schema is written to the metadata database.
The Coordinator polls the schema along with segment metadata.
The Coordinator builds the datasource schema, and the Broker fetches it from the Coordinator.

To roll back, turn off the feature flag. The database schema changes are not rolled back when the feature is turned off.

New configs:

| Name | Purpose |
| --- | --- |
| `druid.coordinator.kill.segmentSchema.on` | Enables the kill segment schema Coordinator duty. |
| `druid.coordinator.kill.segmentSchema.period` | Period of the kill segment schema Coordinator duty. |
| `druid.coordinator.kill.segmentSchema.durationToRetain` | Duration to retain a segment schema after it is marked as unused. |

Important metrics to track:

| Metric | Purpose |
| --- | --- |
| `metadatacache/schemaPoll/count` | Number of coordinator polls to fetch datasource schema. |
| `metadatacache/schemaPoll/failed` | Number of failed coordinator polls to fetch datasource schema. |
| `metadatacache/schemaPoll/time` | Time taken for coordinator polls to fetch datasource schema. |
| `metadatacache/init/time` | Time taken to initialize the coordinator segment metadata cache. Depends on the number of segments. |
| `metadatacache/refresh/count` | Number of segments to refresh in coordinator segment metadata cache. |
| `metadatacache/refresh/time` | Time taken to refresh segments in coordinator segment metadata cache. |
| `metadatacache/backfill/count` | Number of segments for which schema was backfilled in the DB. |
| `schemacache/realtime/size` | Number of realtime segments for which schema is cached. |
| `schemacache/finalizedSegmentMetadata/size` | Number of finalized segments for which schema is cached. |
| `schemacache/finalizedSchemaPayload/size` | Number of distinct schema payloads cached. |
| `schemacache/inTransitSMQResults/size` | Number of segment schemas cached as a result of SMQ. |
| `schemacache/inTransitSMQPublishedResults/size` | Number of segment schemas backfilled in the DB. |


cryptoe

Left a comment. The PR seems very close to merge. I would review the SQL changes again.

cryptoe · 2024-04-22T05:16:17Z

...main/java/org/apache/druid/msq/indexing/processor/SegmentGeneratorFrameProcessorFactory.java

@@ -192,7 +193,8 @@ public Pair<Integer, ReadableInput> apply(ReadableInput readableInput)
                  frameContext.indexMerger(),
                  meters,
                  parseExceptionHandler,
-                  true
+                  true,
+                  CentralizedDatasourceSchemaConfig.create(false)


MSQ does not support centralized data source schema yet. I think we should put this comment here.

Better to have this comment in the javadoc of CentralizedDatasourceSchemaConfig itself.

cryptoe · 2024-04-22T05:19:23Z

services/src/main/java/org/apache/druid/cli/CliPeon.java

@@ -220,31 +217,10 @@ protected List<? extends Module> getModules()
          @Override
          public void configure(Binder binder)
          {
+


MM less ingestion would need this check.

cryptoe · 2024-04-22T05:21:46Z

services/src/main/java/org/apache/druid/cli/ServerRunnable.java

@@ -197,4 +204,31 @@ public void stop()
      return new Child();
    }
  }
+
+  protected void validateCentralizedDatasourceSchemaConfig(Properties properties)


This can be static as well.

cryptoe · 2024-04-22T05:24:55Z

server/src/main/java/org/apache/druid/segment/metadata/FingerprintGenerator.java

@@ -45,18 +51,27 @@ public FingerprintGenerator(ObjectMapper objectMapper)
  /**
   * Generates fingerprint or hash string for an object using SHA-256 hash algorithm.
   */
-  public String generateFingerprint(Object payload)
+  @SuppressWarnings("UnstableApiUsage")
+  public String generateFingerprint(SchemaPayload schemaPayload, String dataSource, int version)


This should have UTs so that we can assert the changes.

Dependency update might cause issues.

cryptoe · 2024-04-23T10:21:10Z

docs/operations/metrics.md

@@ -75,6 +75,12 @@ Most metric values reset each emission period, as specified in `druid.monitoring
 |`metadatacache/schemaPoll/count`|Number of coordinator polls to fetch datasource schema.||
 |`metadatacache/schemaPoll/failed`|Number of failed coordinator polls to fetch datasource schema.||
 |`metadatacache/schemaPoll/time`|Time taken for coordinator polls to fetch datasource schema.||
+|`metadatacache/backfill/count`|Number of segments for which schema was back filled in the database.|`dataSource`|
+|`schemacache/realtime/size`|Number of realtime segments for which schema is cached.||Depends on the number of realtime segments.|
+|`schemacache/finalizedSegmentMetadata/size`|Number of finalized segments for which schema metadata is cached.||Depends on the number of segments in the cluster.|


Nit: this should be count and not size. We can do this change as a followup.

cryptoe · 2024-04-23T10:29:28Z

server/src/main/java/org/apache/druid/indexing/overlord/IndexerMetadataStorageCoordinator.java

   */
  SegmentPublishResult commitAppendSegments(
      Set<DataSegment> appendSegments,
      Map<DataSegment, ReplaceTaskLock> appendSegmentToReplaceLock,
-      @Nullable MinimalSegmentSchemas minimalSegmentSchemas
+      String taskAllocatorId,


Why is this required ?
There are no javadocs for this.

It is there in the master https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java#L498.
It seems you are looking at a subset of commits from PR, check this diff https://github.com/apache/druid/pull/15817/files#diff-519b0b98ee6a12cbb850f2f27fb6947e86e9353c79373ccb8d78d6113d1304b5R551.

cryptoe · 2024-04-23T10:30:32Z

server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java

-          "Schema version [%d] doesn't match the current version [%d], dropping the schema [%s].",
-          minimalSegmentSchemas.getSchemaVersion(),
+          "Schema version [%d] doesn't match the current version [%d]. Not persisting this schema [%s]. "
+          + "Schema for this segment will be poppulated by the schema backfill job in Coordinator.",


Suggested change

+ "Schema for this segment will be poppulated by the schema backfill job in Coordinator.",

+ "Schema for this segment will be populated by the schema back-fill job in Coordinator.",

cryptoe · 2024-04-23T11:16:33Z

server/src/main/java/org/apache/druid/segment/metadata/CoordinatorSegmentMetadataCache.java

@@ -175,7 +175,7 @@ public void stop()

  public void leaderStart()
  {
-    log.info("%s starting cache initialization.", getClass().getSimpleName());
+    log.info("Initializing cache.");


I would recommend adding the name of the class so that I can search for
cache %s in the logs :). It makes it easier for me to search through tons of logs, since each logger can have its own format.
Nit: can be done in a followup.

cryptoe · 2024-04-23T11:17:30Z

server/src/main/java/org/apache/druid/segment/metadata/SegmentSchemaCache.java

-    finalizedSegmentStats = ImmutableMap.of();
-    finalizedSegmentSchema.clear();
+    finalizedSegmentMetadata = ImmutableMap.of();
+    finalizedSegmentSchema = ImmutableMap.of();


Can this cause thread-safety issues if we change the reference?

cryptoe · 2024-04-23T11:18:59Z

server/src/main/java/org/apache/druid/segment/metadata/SegmentSchemaCache.java

@@ -106,10 +108,10 @@ public SegmentSchemaCache(ServiceEmitter emitter)

  public void setInitialized()
  {
-    log.info("[%s] initializing.", getClass().getSimpleName());
+    log.info("Initializing SegmentSchemaCache.");


This should have an isInitalized() check at the top, no?


findingrish · 2024-04-24T07:10:21Z

I have addressed the feedback on the PR. I will raise a followup PR to enable schema publish in MSQ and address any feedback meant for later.

cryptoe

Changes lgtm. There are some rough edges that can be taken care of as part of a follow-up PR.
Thanks @findingrish for taking up this monumental effort.

cryptoe · 2024-04-24T11:10:31Z

server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java

+      log.error(
+          "Schema version [%d] doesn't match the current version [%d]. Not persisting this schema [%s]. "
+          + "Schema for this segment will be populated by the schema backfill job in Coordinator.",
+          segmentSchemaMapping.getSchemaVersion(),


This check should be outside the transaction. Let's create a follow-up patch for that.

cryptoe · 2024-04-24T11:15:35Z

server/src/main/java/org/apache/druid/metadata/SQLMetadataConnector.java

+
+    Set<String> columnsToAdd = new HashSet<>();
+
+    for (String columnName : columnNameTypes.keySet()) {


We should add a test case where we are checking this logic.

Note: Followup item.

kfaraz

+1 to what @cryptoe said, I agree that we can refine this as we go but overall the changes seem okay.

Thanks for your patience on this, @findingrish !!

kfaraz · 2024-04-22T06:27:35Z

server/src/main/java/org/apache/druid/segment/metadata/SegmentSchemaManager.java

+  {
+    log.debug("Updating segment with schema and numRows information: [%s].", batch);
+
+    // update schemaId and numRows in segments table


Suggested change

// update schemaId and numRows in segments table

// update fingerprint and numRows in segments table

kfaraz · 2024-04-24T14:44:18Z

docs/configuration/index.md

@@ -1435,6 +1435,7 @@ MiddleManagers pass their configurations down to their child peons. The MiddleMa
 |`druid.worker.baseTaskDirs`|List of base temporary working directories, one of which is assigned per task in a round-robin fashion. This property can be used to allow usage of multiple disks for indexing. This property is recommended in place of and takes precedence over `${druid.indexer.task.baseTaskDir}`.  If this configuration is not set, `${druid.indexer.task.baseTaskDir}` is used. For example, `druid.worker.baseTaskDirs=[\"PATH1\",\"PATH2\",...]`.|null|
 |`druid.worker.baseTaskDirSize`|The total amount of bytes that can be used by tasks on any single task dir. This value is treated symmetrically across all directories, that is, if this is 500 GB and there are 3 `baseTaskDirs`, then each of those task directories is assumed to allow for 500 GB to be used and a total of 1.5 TB will potentially be available across all tasks. The actual amount of memory assigned to each task is discussed in [Configuring task storage sizes](../ingestion/tasks.md#configuring-task-storage-sizes)|`Long.MAX_VALUE`|
 |`druid.worker.category`|A string to name the category that the MiddleManager node belongs to.|`_default_worker_category`|
+|`druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled`| This config should be set when CentralizedDatasourceSchema feature is enabled. |false| 


For follow-up PR:
The config description should be more like "Indicates whether centralized schema management is enabled." The description should also link to the page which contains the details of the feature.

kfaraz · 2024-04-24T14:45:13Z

docs/operations/metrics.md

@@ -75,6 +75,12 @@ Most metric values reset each emission period, as specified in `druid.monitoring
 |`metadatacache/schemaPoll/count`|Number of coordinator polls to fetch datasource schema.||
 |`metadatacache/schemaPoll/failed`|Number of failed coordinator polls to fetch datasource schema.||
 |`metadatacache/schemaPoll/time`|Time taken for coordinator polls to fetch datasource schema.||
+|`metadatacache/backfill/count`|Number of segments for which schema was back filled in the database.|`dataSource`|
+|`schemacache/realtime/count`|Number of realtime segments for which schema is cached.||Depends on the number of realtime segments.|


For follow-up PR
Do these rows render correctly? The preceding rows have only 3 columns, this one seems to have 4.

kfaraz · 2024-04-24T14:46:15Z

docs/operations/metrics.md

+|`schemacache/finalizedSegmentMetadata/count`|Number of finalized segments for which schema metadata is cached.||Depends on the number of segments in the cluster.|
+|`schemacache/finalizedSchemaPayload/count`|Number of finalized segment schema cached.||Depends on the number of distinct schema in the cluster.|
+|`schemacache/inTransitSMQResults/count`|Number of segments for which schema was fetched by executing segment metadata query.||Eventually it should be 0.|
+|`schemacache/inTransitSMQPublishedResults/count`|Number of segments for which schema is cached after back filling in the database.||Eventually it should be 0.|


For follow-up PR:
Is schemacache/ not the same as metadatacache? The similar yet different names can be confusing.

kfaraz · 2024-04-24T14:50:20Z

server/src/main/java/org/apache/druid/metadata/SQLMetadataConnector.java

+
+    for (String column : columns) {
+      createStatementBuilder.append(column);
+      createStatementBuilder.append(",");


For follow-up PR:
Nit: We seem to have removed the new line characters. They formatted the statement nicely in case we wanted to debug it.

kfaraz · 2024-04-24T15:59:52Z

indexing-service/src/main/java/org/apache/druid/indexing/common/task/IndexTask.java

@@ -905,7 +907,7 @@ private TaskStatus generateAndPublishSegments(
    try (final BatchAppenderatorDriver driver = BatchAppenderators.newDriver(appenderator, toolbox, segmentAllocator)) {
      driver.startJob();

-      SegmentsAndCommitMetadata pushed = InputSourceProcessor.process(
+      Pair<SegmentsAndCommitMetadata, SegmentSchemaMapping> commitMetadataAndSchema = InputSourceProcessor.process(


For follow-up PR
Why is SegmentSchemaMapping not included inside the SegmentsAndCommitMetadata object itself?

kfaraz · 2024-04-24T16:00:56Z

indexing-service/src/main/java/org/apache/druid/indexing/common/task/InputSourceProcessor.java

@@ -58,7 +61,7 @@ public class InputSourceProcessor
   *
   * @return {@link SegmentsAndCommitMetadata} for the pushed segments.
   */
-  public static SegmentsAndCommitMetadata process(
+  public static Pair<SegmentsAndCommitMetadata, SegmentSchemaMapping> process(


For follow-up PR
We should include SegmentSchemaMapping inside the SegmentsAndCommitMetadata itself.

kfaraz · 2024-04-24T16:04:22Z

indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SequenceMetadata.java

+                 ? SegmentTransactionalAppendAction.forSegmentsAndMetadata(segmentsToPush, startMetadata, endMetadata,
+                                                                           segmentSchemaMapping
+        )
+                 : SegmentTransactionalInsertAction.appendAction(segmentsToPush, startMetadata, endMetadata,
+                                                                 segmentSchemaMapping
+                 );


For follow-up PR
Please fix the formatting here.

kfaraz · 2024-04-24T16:04:59Z

indexing-service/src/test/java/org/apache/druid/indexing/common/TestIndexTask.java

@@ -110,4 +115,14 @@ public TaskStatus runTask(TaskToolbox toolbox)
  {
    return status;
  }
+
+  public TaskAction<SegmentPublishResult> testBuildPublishAction(


For follow-up PR

Suggested change

public TaskAction<SegmentPublishResult> testBuildPublishAction(

public TaskAction<SegmentPublishResult> buildPublishAction(

kfaraz · 2024-04-24T16:06:15Z

...vice/src/test/java/org/apache/druid/indexing/common/actions/RetrieveSegmentsActionsTest.java

@@ -65,7 +65,7 @@ public static void setup() throws IOException
    expectedUnusedSegments.add(createSegment(Intervals.of("2017-10-07/2017-10-08"), UNUSED_V1));

    actionTestKit.getMetadataStorageCoordinator()
-                 .commitSegments(expectedUnusedSegments);
+                 .commitSegments(expectedUnusedSegments, null);


For follow-up PR
Since passing null is a very common usage right now, it would be better to keep two variants of the new methods. It would be easier to identify the usages which pass non-null values and we could also avoid passing nulls all over the place.

findingrish added 30 commits September 12, 2023 21:37

Remove enabled field from SegmentMetadataCacheConfig

33a8dd5

Add class to manage druid table information in SegmentMetadataCache, …

7a7ca55

…add javadocs for classes

Merge remote-tracking branch 'origin/master' into coordinator_builds_…

eb6a145

…ds_schema

Minor refactoring in SegmentMetadataCache

b9fb83d

Make SegmentMetadataCache generic

aa2bfe7

Add a generic abstract class for segment metadata cache

e97dcda

Rename SegmentMetadataCache to CoordinatorSegmentMetadataCache

7badce1

Rename PhysicalDataSourceMetadataBuilder to PhysicalDataSourceMetadat…

25cdce6

…aFactory

Fix json property key name in DataSourceInformation

5f5ad18

Add validation in MetadataResource#getAllUsedSegments, update javadocs

08e949e

Minor changes

80fc09d

Minor change

4217cd8

Merge remote-tracking branch 'upstream/master' into coordinator_build…

8b7e483

…s_ds_schema

Update base property name for query config classes in Coordinator

d6ac350

Log ds schema change when polling from coordinator

533236b

update the logic to determine is_active status in segments table for …

70f0888

…segment polled from coordiantor

Merge remote-tracking branch 'upstream/master' into coordinator_build…

a176bfe

…s_ds_schema

Update the logic to set numRows in the sys segments table, add comments

b32dfd6

Rename config druid.coordinator.segmentMetadataCache.enabled to druid…

17417b5

….coordinator.centralizedSchemaManagement.enabled

Merge remote-tracking branch 'upstream/master' into coordinator_build…

6a395a9

…s_ds_schema

Report cache init time irrespective of the awaitInitializationOnStart…

907ace3

… config

Merge remote-tracking branch 'upstream/master' into coordinator_build…

cf68c38

…s_ds_schema

Report metric for fetching schema from coordinator

441f37a

Add auth check in api to return dataSourceInformation, report metrics…

bd5b048

… for broker-coordinator communication

Fix bug in Coordinator api to return dataSourceInformation

933d8d1

Minor change

9e7e364

Merge remote-tracking branch 'upstream/master' into coordinator_build…

e7356ce

…s_ds_schema

Address comments around docs, minor renaming

5d16148

Remove null check from MetadataResource#getDataSourceInformation

d8884be

Merge remote-tracking branch 'upstream/master' into coordinator_build…

0f0805a

…s_ds_schema

findingrish added 16 commits April 22, 2024 00:07

Change schemaMap in SegmentSchemaCache from ConcurrentMap to Immutabl…

ca571ee

…eMap

Update docs

4277760

Update test

5e7a6a6

Fix checkstyle

86351ae

Add test for FingerprintGenerator

a5827a6

Add config validation in CliPeon

5383967

Initialize schema cache

5bf86be

Enable feature in IngestionTestBase

a07eccf

Fix SqlSegmentsMetadataManagerSchemaPollTest

0023d24

Minor code changes

85208f2

Poll schema for all datasources in the inventory in Broker

11e5ca7

Tests for coverage

e78fb65

Minor change in SegmentSchemaCache

bb7e98b

Tests to meet coverage

ae43a83

Expose a method in AbstractSegmentMetadataCache to fetch aggregators …

26a6472

…in segment metadata query. Changes in BrokerSegmentMetadataCache to refresh even if no new segments are added to the inventory.

Merge remote-tracking branch 'upstream/master' into coordinator_schem…

e65ff4d

…a_read_write

cryptoe reviewed Apr 23, 2024


findingrish added 7 commits April 23, 2024 17:25

Add a note

a05043b

nit changes

c44b289

Rename some methods, add null check when accessing cacheExecFuture

9033a46

Enable CentralizedDatasourceSchema config in ParallelIndexSupervisorT…

7e349ce

…askResourceTest

Encapsulate finalizedSegmentStats & finalizedSegmentSchema in a class…

472bce9

… to avoid race

Fix potential npe while accessing segmentMetadataInfo

bc1e131

Merge remote-tracking branch 'upstream/master' into coordinator_schem…

4adfc77

…a_read_write

cryptoe approved these changes Apr 24, 2024


kfaraz approved these changes Apr 24, 2024


cryptoe merged commit e30790e into apache:master Apr 24, 2024
87 checks passed

findingrish mentioned this pull request May 2, 2024

Followup changes to 15817 (Segment schema publishing and polling) #16368

Merged

adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
