Singular druid 30 changes by RonShub · Pull Request #18987 · apache/druid

RonShub · 2026-02-04T17:30:30Z

Fixes #XXXX.

Description

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

Release note

Key changed/added classes in this PR

MyFoo
OurBar
TheirBaz

This PR has:

…16510) * initial commit * add Javadocs * refine JSON input config * more test and fix build * extract existing behavior as default strategy * change template mapping fallback * add docs * update doc * fix doc * address comments * define Matcher interface * fix test coverage * use lower case for endpoint path * update Json name * add more tests * refactoring Selector class

* Update docs for K8s TaskRunner Dynamic Config * touchups * code review * npe * oopsies

* Add annotation for pod template * pr comments * add test cases * add tests

…e selector (apache#17400) * handling empty sets for dataSourceCondition and taskTypeCondition * using new HashSet<>() to fix forbidden api error in testCheck * fixing style issues

…ctor config (apache#17464)

Switch to using the reports file from the reports writer when pushing the reports. This fixes a bug where the AbstractTask could be unable to find the report.

…ic Pod Template Selection Logging

Problem: streamLogs() was using getPeonLogs() which returns a static snapshot via getLogInputStream(). This caused logs to only appear after task completion. Solution: Changed to use getPeonLogWatcher() which uses Kubernetes' watchLog() API for continuous streaming. Impact: - Live logs now stream in real-time during task execution - API endpoint GET /druid/indexer/v1/task/{id}/log works properly - Improves monitoring experience for long-running tasks See LIVE_LOG_STREAMING_FIX.md for detailed explanation and testing guide.

Added detailed diagnostic logging at every layer of the log streaming stack: 1. KubernetesTaskRunner.streamTaskLog() - API entry point - Logs request details and work item lookup 2. KubernetesWorkItem.streamTaskLogs() - Work item layer - Logs lifecycle availability and delegation 3. KubernetesPeonLifecycle.streamLogs() - Lifecycle layer - Logs state checks (RUNNING requirement) - Logs LogWatch creation - Enhanced state transition logging in join() 4. KubernetesPeonClient.getPeonLogWatcher() - K8s API layer - Logs Kubernetes API calls - Logs namespace, container, job name details - Enhanced error messages Log markers: - 📺 [STREAM] - API entry and work item - 📺 [WORKITEM] - Work item delegation - 📺 [LIFECYCLE] - State checks and streaming - 📺 [K8S-CLIENT] - Kubernetes API calls - 📊 [LIFECYCLE] - State transitions - ✅ Success, ⚠️ Warnings, ❌ Errors This enables complete tracing of the log streaming flow for debugging. See LIVE_LOG_STREAMING_DIAGNOSTICS.md for usage guide.

…apache#17431) * Fix save logs error * Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * make things final * fix merge conflicts --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

Subsequent calls to KubernetesWorkItem.shutdown() should not block and return immediately.

Fixed compilation errors from cherry-picked commits: 1. Removed DruidException import (doesn't exist in Druid 30.0.0) 2. Reverted to KubernetesResourceNotFoundException (original 30.0.0 exception) 3. Fixed log.error() call to not duplicate exception message These changes maintain the improved error handling from upstream while being compatible with Druid 30.0.0 APIs.

…tch for saveLogs() CRITICAL FIX: Hybrid log streaming approach to prevent production outage Problem: - streamLogs() used .watchLog() which kept HTTP connections open for task duration - Long-running tasks blocked HTTP threads for hours - Risk of thread pool exhaustion and Overlord becoming unresponsive Solution: - streamLogs() now uses .getPeonLogs() (snapshot) - returns in 1-2 seconds - saveLogs() still uses LogWatch for complete, reliable S3 log capture - Best of both worlds: Fast HTTP responses + complete S3 logs Impact: - HTTP endpoint safe for concurrent use ✅ - No thread exhaustion risk ✅ - S3 logs still complete (OOM errors captured) ✅ - Production safe ✅ Files: - KubernetesPeonLifecycle.java: Modified streamLogs() method - HYBRID_LOG_STREAMING_APPROACH.md: Complete implementation guide - HYBRID_FIX_SUMMARY.md: Quick reference - LIVE_LOG_STREAMING_UPSTREAM_ANALYSIS.md: Upstream investigation results - deploy_hybrid_fix.sh: Deployment script

CRITICAL FIXES: 1. Exception Handling Bug - Fixed RuntimeException → IOException - KubernetesTaskRunner now throws IOException on HTTP failures - Allows SwitchingTaskLogStreamer to fall back to S3 properly - Fixes 500 errors when pods are terminated/unreachable 2. S3 Fallback Now Works - Completed tasks correctly fall back to S3 deep storage - Seamless transition from live reports to historical reports - No more 404 errors for completed tasks COMPREHENSIVE DIAGNOSTIC LOGGING: - Added logging to 3 layers: API → Switching → TaskRunner - Emoji markers for easy grep: 🌐 📊 🔀 ✅ ⚠️ ❌ - Shows exact flow: live reports attempt → S3 fallback → result - Logs which provider succeeded and why others failed Files Changed: - KubernetesTaskRunner.java - Fixed exception handling + added logs - SwitchingTaskLogStreamer.java - Added fallback flow logging - OverlordResource.java - Enhanced API endpoint logging - TASK_REPORTS_COMPREHENSIVE_FIX.md - Complete documentation - LIVE_REPORTS_DEBUG_GUIDE.md - Debugging guide Impact: - ✅ Live reports work for running tasks - ✅ S3 fallback works for completed tasks - ✅ Proper HTTP status codes (404 vs 500) - ✅ Production-safe graceful degradation - ✅ Full observability of the entire flow Deployment: Overlord only (indexing-service + k8s-overlord-extensions JARs)

CRITICAL DOCUMENTATION: MiddleManager Safety Analysis: - Analyzed Guice dependency injection bindings - Verified SwitchingTaskLogStreamer NOT instantiated on MMs - Confirmed OverlordResource NOT registered on MMs - Proven KubernetesTaskRunner NOT loaded on MMs Evidence: - CliOverlord.java binds TaskLogStreamer → SwitchingTaskLogStreamer - CliMiddleManager.java does NOT bind TaskLogStreamer - MMs use ForkingTaskRunner with separate logging - MMs expose /druid/worker/v1/* (not /druid/indexer/v1/*) Conclusion: ✅ 100% SAFE to deploy to Overlord ❌ DO NOT deploy to MiddleManagers (unnecessary and risky) ✅ Production workloads completely isolated See MIDDLEMANAGER_SAFETY_ANALYSIS.md for complete proof.

PMD Error: - UnnecessaryImport: Unused import 'com.google.common.base.Throwables' Fix: - Removed unused import from KubernetesTaskRunner.java - Import was left over from previous exception handling code

Added two deployment scripts: 1. rons_deploy_to_prodft.sh (semi-automated) - Copies JARs to Overlord - Shows commands for manual execution - Recommended for first-time deployment 2. rons_deploy_to_prodft_auto.sh (fully automated) - End-to-end automation via SSH - Automatic backups with timestamp - Zero manual intervention - Recommended for routine deployments Both scripts: ✅ Deploy ONLY to Overlord (safe for MiddleManagers) ✅ Create automatic backups ✅ Include rollback procedures ✅ Provide testing instructions Documentation: - DEPLOYMENT_SCRIPTS_README.md - Complete guide - Includes testing procedures - Safety guarantees - Rollback instructions

…amTaskReports is not being delegated to KubernetesTaskRunner

…askRunner Root cause: KubernetesAndWorkerTaskRunner implements TaskLogStreamer but was missing the streamTaskReports() method implementation. This caused all live task report requests to return Optional.absent() without ever delegating to the underlying KubernetesTaskRunner. The method now follows the same pattern as streamTaskLog(): 1. Try kubernetesTaskRunner first (for K8s tasks) 2. Fall back to workerTaskRunner if available (for MiddleManager tasks) 3. Return Optional.absent() if neither has reports This was the missing piece preventing live task reports from working!

…ions PROBLEM: - Overlord cannot reach pod IPs (172.31.x.x) directly when running outside K8s cluster - HTTP requests to /liveReports endpoint timeout after 10-20 seconds - Causes 'Faulty channel in resource pool' errors and blocks HTTP threads - S3-based reports work correctly for completed tasks SOLUTION: - Disable direct HTTP calls to pod /liveReports endpoint - Return Optional.absent() immediately to allow fast S3 fallback - Prevents HTTP thread exhaustion and connection timeout errors - Remove verbose diagnostic logging that is no longer needed IMPACT: - Live task reports during execution: NO (requires pod network access) - Task reports after completion: YES (via S3, works correctly) - MiddleManager tasks: UNAFFECTED (still get live reports) - Kubernetes tasks: Fall back to S3 reports immediately TO RE-ENABLE: Revert this commit if pod network becomes reachable, OR implement one of: - Kubernetes Services for each task pod - Network routing from Overlord to pod network - kubectl port-forward automation

This commit adds non-blocking I/O support for Kubernetes API calls by backporting the Vertx HTTP client from Druid 35. This addresses thread pool exhaustion issues when running many concurrent tasks. Changes: - Add kubernetes-httpclient-vertx dependency (v6.7.2) - Add HttpClientType enum (VERTX default, OKHTTP fallback) - Add DruidKubernetesHttpClientFactory interface - Add DruidKubernetesVertxHttpClientConfig for thread pool configuration - Add DruidKubernetesVertxHttpClientFactory for Vertx instance management - Modify DruidKubernetesClient to accept custom HTTP client factory - Modify KubernetesTaskRunnerConfig with httpClientType and vertxHttpClientConfig - Modify KubernetesOverlordModule to select HTTP client based on config - Fix buildJob() method to accept taskType parameter (pre-existing bug) - Add comprehensive logging for debugging and verification Configuration: - Vertx is enabled by default (no config needed) - Fallback: druid.indexer.runner.httpClientType=OKHTTP - Optional tuning: druid.indexer.runner.vertxHttpClientConfig.* See VERTX_HTTP_CLIENT_BACKPORT.md for full documentation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Verified implementation against actual upstream code (commit cabada6): - Use VertxOptions.DEFAULT_* constants instead of hardcoded values - Align createVertxInstance() method with upstream exactly - Add TYPE_NAME constant to factory - Remove unnecessary toString() from config - Remove conditional eventLoopPoolSize check (always set like upstream) Keep our additions (not in upstream, but useful): - Logging for debugging and verification - close() method for clean Vertx shutdown Updated documentation with upstream comparison section. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add context parameters workerStatusPollIntervalMs and workerStatusPollIntervalHighMs to MultiStageQueryContext for tuning polling frequency - Update MSQWorkerTaskLauncher to use configurable intervals instead of hardcoded values - Add comprehensive INFO-level logging for polling statistics: - Log configured intervals on launcher initialization - Log mode transitions (high→low frequency) - Log periodic stats every 60 seconds (total/high/low poll counts) - Log final stats on shutdown - Add documentation for configuration and verification Defaults match upstream Druid: 100ms high-freq, 2000ms low-freq, 10s switch threshold Configuration via cluster defaults: druid.query.default.context.workerStatusPollIntervalMs=5000 druid.query.default.context.workerStatusPollIntervalHighMs=500 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

adarshsanjeev and others added 30 commits June 8, 2024 07:12

[maven-release-plugin] prepare release druid-30.0.0-rc3

09d36ee

Update docs for K8s TaskRunner Dynamic Config (apache#16600)

7b0bea8

* Update docs for K8s TaskRunner Dynamic Config * touchups * code review * npe * oopsies

Docs: Fix k8s dynamic config URL (apache#16720)

4437051

Add annotation for pod template (apache#16772)

663213f

* Add annotation for pod template * pr comments * add test cases * add tests

Handle empty values for task and datasource conditions in pod templat…

8edf21b

…e selector (apache#17400) * handling empty sets for dataSourceCondition and taskTypeCondition * using new HashSet<>() to fix forbidden api error in testCheck * fixing style issues

updated docs with behavior for empty collections in pod template sele…

af07d43

…ctor config (apache#17464)

md files

5536fb0

added logs

bf74439

Get reports file from file writer (apache#18379)

0132823

Switch to using the reports file from the reports writer when pushing the reports. This fixes a bug where the AbstractTask could be unable to find the report.

final changes to Task Reports Fix, MSQ Context Tags Support and Dynam…

09ab468

…ic Pod Template Selection Logging

remove unused import

12ec0c5

another leftover

d2f0362

Make KubernetesWorkItem.shutdown idempotent (apache#18576)

087882a

Subsequent calls to KubernetesWorkItem.shutdown() should not block and return immediately.

Add documentation for cherry-picked improvements

b3805dc

Fix PMD violation: Remove unused Throwables import

ffced7a

PMD Error: - UnnecessaryImport: Unused import 'com.google.common.base.Throwables' Fix: - Removed unused import from KubernetesTaskRunner.java - Import was left over from previous exception handling code

Add diagnostic logging to TaskRunnerTaskLogStreamer to debug why stre…

682b7c8

…amTaskReports is not being delegated to KubernetesTaskRunner

github-actions bot added Area - Documentation Area - Batch Ingestion Area - Querying Area - Metrics/Event Emitting Area - Streaming Ingestion labels Feb 4, 2026

RonShub closed this Feb 4, 2026

github-actions bot added Area - Lookups Area - Web Console Area - Dependencies Area - Ingestion labels Feb 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Singular druid 30 changes#18987

Singular druid 30 changes#18987
RonShub wants to merge 30 commits intoapache:30.0.1from
singular-labs:singular-druid-30-changes

RonShub commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

RonShub commented Feb 4, 2026

Description

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

Release note

Key changed/added classes in this PR

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants