From c0dbe668b1959251fc95de705fb87e45fef397bc Mon Sep 17 00:00:00 2001
From: Jaikumar V <41040911+Jaivarathan@users.noreply.github.com>
Date: Wed, 26 Mar 2025 14:48:46 +0530
Subject: [PATCH] Create SQL-based ingestion API

Add new SQL-based ingestion API
---
 docs/SQL-based ingestion API | 98 ++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 docs/SQL-based ingestion API
diff --git a/docs/SQL-based ingestion API b/docs/SQL-based ingestion API
new file mode 100644
index 000000000000..8a887dfdfee4
--- /dev/null
+++ b/docs/SQL-based ingestion API	
@@ -0,0 +1,98 @@
+The Query view in the web console provides a friendly experience for the multi-stage query task engine (MSQ task engine) and multi-stage query architecture. We recommend using the web console if you don't need a programmatic interface.
+
+When using the API for the MSQ task engine, the action you want to take determines the endpoint you use:
+
+/druid/v2/sql/task: Submit a query for ingestion.
+/druid/indexer/v1/task: Interact with a query, including getting its status or details, or canceling the query. This page describes a few of the Overlord Task APIs that you can use with the MSQ task engine.
+
+Submit a query
+Submits queries to the MSQ task engine.
+
+The /druid/v2/sql/task endpoint accepts the following:
+
+SQL requests in the JSON-over-HTTP form using the query, context, and parameters fields. The endpoint ignores the resultFormat, header, typesHeader, and sqlTypesHeader fields.
+INSERT and REPLACE statements.
+SELECT queries (experimental feature). SELECT query results are collected from workers by the controller, and written into the task report as an array of arrays. The behavior and result format of plain SELECT queries (without INSERT or REPLACE) is subject to change.
+URL
+POST /druid/v2/sql/task
+
+Responses
+200 SUCCESS
+400 BAD REQUEST
+500 INTERNAL SERVER ERROR
+Successfully submitted query.
+Sample response
+View the response
+Response fields
+
+Field	Description
+taskId	Controller task ID. You can use Druid's standard Tasks API to interact with this controller task.
+state	Initial state for the query.
+
+Get the status for a query task
+Retrieves the status of a query task. It returns a JSON object with the task's status code, runner status, task type, datasource, and other relevant metadata.
+
+URL
+GET /druid/indexer/v1/task/{taskId}/status
+
+Responses
+200 SUCCESS
+404 NOT FOUND
+
+Successfully retrieved task status
+
+Sample response
+The response shows an example report for a query.
+
+View the response
+The following table describes the response fields when you retrieve a report for a MSQ task engine using the /druid/indexer/v1/task/{taskId}/reports endpoint:
+
+Field	Description
+multiStageQuery.taskId	Controller task ID.
+multiStageQuery.payload.status	Query status container.
+multiStageQuery.payload.status.status	RUNNING, SUCCESS, or FAILED.
+multiStageQuery.payload.status.startTime	Start time of the query in ISO format. Only present if the query has started running.
+multiStageQuery.payload.status.durationMs	Milliseconds elapsed after the query has started running. -1 denotes that the query hasn't started running yet.
+multiStageQuery.payload.status.workers	Workers for the controller task.
+multiStageQuery.payload.status.workers.<workerNumber>	Array of worker tasks including retries.
+multiStageQuery.payload.status.workers.<workerNumber>[].workerId	Id of the worker task.
+multiStageQuery.payload.status.workers.<workerNumber>[].status	RUNNING, SUCCESS, or FAILED.
+multiStageQuery.payload.status.workers.<workerNumber>[].durationMs	Milliseconds elapsed between when the worker task was first requested and when it finished. It is -1 for worker tasks with status RUNNING.
+multiStageQuery.payload.status.workers.<workerNumber>[].pendingMs	Milliseconds elapsed between when the worker task was first requested and when it fully started RUNNING. Actual work time can be calculated using actualWorkTimeMS = durationMs - pendingMs.
+multiStageQuery.payload.status.pendingTasks	Number of tasks that are not fully started. -1 denotes that the number is currently unknown.
+multiStageQuery.payload.status.runningTasks	Number of currently running tasks. Should be at least 1 since the controller is included.
+multiStageQuery.payload.status.segmentLoadStatus	Segment loading container. Only present after the segments have been published.
+multiStageQuery.payload.status.segmentLoadStatus.state	Either INIT, WAITING, SUCCESS, FAILED or TIMED_OUT.
+multiStageQuery.payload.status.segmentLoadStatus.startTime	Time since which the controller has been waiting for the segments to finish loading.
+multiStageQuery.payload.status.segmentLoadStatus.duration	The duration in milliseconds that the controller has been waiting for the segments to load.
+multiStageQuery.payload.status.segmentLoadStatus.totalSegments	The total number of segments generated by the job. This includes tombstone segments (if any).
+multiStageQuery.payload.status.segmentLoadStatus.usedSegments	The number of segments which are marked as used based on the load rules. Unused segments can be cleaned up at any time.
+multiStageQuery.payload.status.segmentLoadStatus.precachedSegments	The number of segments which are marked as precached and served by historicals, as per the load rules.
+multiStageQuery.payload.status.segmentLoadStatus.onDemandSegments	The number of segments which are not loaded on any historical, as per the load rules.
+multiStageQuery.payload.status.segmentLoadStatus.pendingSegments	The number of segments remaining to be loaded.
+multiStageQuery.payload.status.segmentLoadStatus.unknownSegments	The number of segments whose status is unknown.
+multiStageQuery.payload.status.segmentReport	Segment report. Only present if the query is an ingestion.
+multiStageQuery.payload.status.segmentReport.shardSpec	Contains the shard spec chosen.
+multiStageQuery.payload.status.segmentReport.details	Contains further reasoning about the shard spec chosen.
+multiStageQuery.payload.status.errorReport	Error object. Only present if there was an error.
+multiStageQuery.payload.status.errorReport.taskId	The task that reported the error, if known. May be a controller task or a worker task.
+multiStageQuery.payload.status.errorReport.host	The hostname and port of the task that reported the error, if known.
+multiStageQuery.payload.status.errorReport.stageNumber	The stage number that reported the error, if it happened during execution of a specific stage.
+multiStageQuery.payload.status.errorReport.error	Error object. Contains errorCode at a minimum, and may contain other fields as described in the error code table. Always present if there is an error.
+multiStageQuery.payload.status.errorReport.error.errorCode	One of the error codes from the error code table. Always present if there is an error.
+multiStageQuery.payload.status.errorReport.error.errorMessage	User-friendly error message. Not always present, even if there is an error.
+multiStageQuery.payload.status.errorReport.exceptionStackTrace	Java stack trace in string form, if the error was due to a server-side exception.
+multiStageQuery.payload.stages	Array of query stages.
+multiStageQuery.payload.stages[].stageNumber	Each stage has a number that differentiates it from other stages.
+multiStageQuery.payload.stages[].phase	Either NEW, READING_INPUT, POST_READING, RESULTS_COMPLETE, or FAILED. Only present if the stage has started.
+multiStageQuery.payload.stages[].workerCount	Number of parallel tasks that this stage is running on. Only present if the stage has started.
+multiStageQuery.payload.stages[].partitionCount	Number of output partitions generated by this stage. Only present if the stage has started and has computed its number of output partitions.
+multiStageQuery.payload.stages[].startTime	Start time of this stage. Only present if the stage has started.
+multiStageQuery.payload.stages[].duration	The number of milliseconds that the stage has been running. Only present if the stage has started.
+multiStageQuery.payload.stages[].sort	A boolean that is set to true if the stage does a sort as part of its execution.
+multiStageQuery.payload.stages[].definition	The object defining what the stage does.
+multiStageQuery.payload.stages[].definition.id	The unique identifier of the stage.
+multiStageQuery.payload.stages[].definition.input	Array of inputs that the stage has.
+multiStageQuery.payload.stages[].definition.broadcast	Array of input indexes that get broadcasted. Only present if there are inputs that get broadcasted.
+multiStageQuery.payload.stages[].definition.processor	An object defining the processor logic.
+multiStageQuery.payload.stages[].definition.signature	The output signature of the stage.