Auto-reload pipelines on TLS certificate rotation #18978

Merged
kaisecheng merged 37 commits into elastic:main from kaisecheng:pipeline-reload-certs
Apr 23, 2026

@kaisecheng
Contributor

@kaisecheng kaisecheng commented Apr 10, 2026

Release notes

Logstash can automatically reload affected pipelines when TLS certificate files change on disk. This removes the need for a manual restart or explicit reload trigger after certificate rotation.

Adds opt-in automatic pipeline reload on TLS certificate rotation through a new ssl.reload.automatic option. The option takes effect only when config.reload.automatic is enabled, and accepts the following values:

  • true: reloads pipelines whose SSL certificate or key files have changed on disk
  • false (default): does not watch SSL files

What does this PR do?

Introduces the new ssl.reload.automatic setting plus three layered components that together enable TLS certificate hot-reload:

1. ssl.reload.automatic setting
New boolean in logstash.yml (default false). When true, Logstash starts the file watcher and SSL tracker and reloads affected pipelines on certificate change. Requires either config.reload.automatic: true or Centralized Pipeline Management (xpack.management.enabled: true).

2. FileWatchService (Java)
Uses Java NIO WatchService to detect file modifications. Supports per-file callbacks within watched directories. Lazily starts the watcher thread on first registration and cleans up watch keys when all files in a directory are deregistered.

When a target file has changed, it fires callbacks to update the SHA-256 checksum.
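
The per-file callback bookkeeping described above can be sketched in a few lines. This is a simplified Ruby model, not the Java FileWatchService itself: it starts no OS watcher, and the class and method names (CallbackRegistry, dispatch, watched_dirs) are hypothetical. It only illustrates how callbacks might be keyed by directory and file name, and how a directory could be dropped once its last file is deregistered.

```ruby
# Hypothetical in-memory model of the bookkeeping only; no real watcher thread.
class CallbackRegistry
  def initialize
    # directory => { file name => [callbacks] }
    @dirs = Hash.new { |h, dir| h[dir] = Hash.new { |f, name| f[name] = [] } }
  end

  # Adds a callback for one file; the parent directory becomes "watched".
  def register(path, &callback)
    dir, name = File.split(File.expand_path(path))
    @dirs[dir][name] << callback
  end

  # Drops the file's callbacks and forgets the directory once it has no files left.
  def deregister(path)
    dir, name = File.split(File.expand_path(path))
    return unless @dirs.key?(dir)
    @dirs[dir].delete(name)
    @dirs.delete(dir) if @dirs[dir].empty?
  end

  # Takes what a directory-level event would report (directory plus file name)
  # and returns the number of callbacks fired.
  def dispatch(dir, name)
    return 0 unless @dirs.key?(dir) && @dirs[dir].key?(name)
    @dirs[dir][name].each { |cb| cb.call(File.join(dir, name)) }.size
  end

  def watched_dirs
    @dirs.keys
  end
end
```

Events for files that were never registered in a watched directory are ignored, which is the point of dispatching per file rather than per directory.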

3. SslFileTracker (Ruby)
Wraps FileWatchService and tracks which SSL cert/key paths belong to which pipelines. A path shared by multiple pipelines is only deregistered when the last pipeline releases it. Chooses the detection strategy at registration time:

  • Regular files: OS kernel events via FileWatchService, gated by SHA-256 checksum to suppress duplicate notifications
  • Symlinks: mtime polling on each converge cycle, because NIO WatchService tracks the symlink pointer rather than the link target. This covers the Kubernetes double-symlink scenario.
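
A minimal sketch of the two ideas in this list, under stated assumptions: the strategy is picked with File.symlink? at registration time, and a SHA-256 digest gates duplicate events for regular files. The helper names (detection_strategy, ChecksumGate, prime) are illustrative and are not taken from the PR code; only the :watch and :poll labels appear in the commit messages.

```ruby
require "digest"

# Symlinks are polled; everything else relies on watcher events.
def detection_strategy(path)
  File.symlink?(path) ? :poll : :watch
end

# Remembers the last SHA-256 per path and reports a change only when it differs.
class ChecksumGate
  def initialize
    @digests = {}
  end

  # Records the baseline digest at registration time.
  def prime(path)
    @digests[path] = Digest::SHA256.file(path).hexdigest
  end

  # True only for a real content change; repeated events for the same content return false.
  def changed?(path)
    current = Digest::SHA256.file(path).hexdigest
    return false if @digests[path] == current
    @digests[path] = current
    true
  end
end
```

With this gate, a watcher that fires twice for a single write would report a change once, since the second event sees the digest already recorded.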

SslFileTracker registers SSL paths by scanning plugin config entries whose names start with ssl_ and validate as :path, plus explicit allowlisted SSL path settings such as certificate authorities and truststores.
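
The scanning rule could look roughly like the following. This sketch assumes a plugin's config schema is available as a plain Hash of setting name to validation type, which is a simplification of how Logstash plugins declare config. The allowlist entries (cacert, truststore) are placeholders chosen for illustration; the PR text does not list the actual names.

```ruby
# Hypothetical allowlist for SSL path settings that do not start with "ssl_".
EXTRA_SSL_PATH_SETTINGS = %w[cacert truststore].freeze

# schema: setting name => validation type
# values: setting name => configured value (a String or an Array of Strings)
def ssl_paths(schema, values)
  names = schema.select { |name, type| type == :path && name.start_with?("ssl_") }.keys
  names |= EXTRA_SSL_PATH_SETTINGS & values.keys
  names.flat_map { |name| Array(values[name]) }
       .compact
       .map { |path| File.expand_path(path) }
       .uniq
end
```

A setting such as path (validated as :path but without the ssl_ prefix) is skipped, and so is ssl_enabled (ssl_ prefix but not a path).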

4. Agent wiring

  • Agent#initialize constructs FileWatchService and SslFileTracker only when both config.reload.automatic and ssl.reload.automatic are enabled; otherwise agent.ssl_file_tracker is nil and all call sites become no-ops
  • converge_state_and_update calls refresh_pipeline_symlink_stamps on each cycle and queues Reload actions for affected pipelines (stale_pipeline_ids)
  • Pipeline actions keep SSL tracking in sync with the live pipeline set
  • On reload failure, SSL tracking remains registered so a later certificate repair can still be detected and trigger another reload
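
As a rough illustration of the wiring above, the sketch below shows how a nil tracker turns the call site into a no-op via safe navigation, and how stale pipeline ids might become Reload actions. ReloadAction, StubTracker and ssl_reload_actions are stand-ins invented for this example; the real Agent and pipeline action classes are more involved.

```ruby
# Stand-ins for illustration only.
ReloadAction = Struct.new(:pipeline_id)
StubTracker = Struct.new(:stale_pipeline_ids)

# tracker is nil when the feature is off, so no SSL-driven reloads are queued.
# Only pipelines that are still running get a Reload action.
def ssl_reload_actions(tracker, running_ids)
  stale = tracker&.stale_pipeline_ids || []
  (stale & running_ids).map { |id| ReloadAction.new(id) }
end
```

For example, with a tracker reporting main and removed as stale and running pipelines main and beats, only main would be queued for reload.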

Why is it important/What is the impact to the user?

Without this change, rotating a TLS certificate requires restarting Logstash or manually triggering a reload. This causes an interruption of ingestion.

With this change, when ssl.reload.automatic: true is set and config.reload.automatic: true is enabled, Logstash detects the file change and reloads only the affected pipeline automatically. Pipelines using unrotated certs are unaffected.

The feature is off by default, so upgrading users see no behaviour change. Users who want it opt in via ssl.reload.automatic: true in logstash.yml (or SSL_RELOAD_AUTOMATIC=true for docker).

For Kubernetes environments, cert-manager rotates certificates via Secret volume mounts (atomic symlink swaps). The symlink-aware mtime polling strategy handles this pattern explicitly.
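
To make the symlink case concrete, here is a small sketch of why polling the resolved target works where watching the link does not. It assumes the poll compares File.stat(path).mtime, which follows symlinks; whether the PR uses exactly this call is an assumption. The helper names are hypothetical.

```ruby
# Stamp of whatever the path currently resolves to. File.stat follows symlinks,
# unlike File.lstat, which describes the link itself.
def target_stamp(path)
  File.stat(path).mtime
end

# Swaps a symlink the way an atomic rotation does: create a temporary link,
# then rename it over the old one (rename replaces the link atomically on POSIX).
def swap_symlink(link, new_target)
  tmp = "#{link}.tmp"
  File.symlink(new_target, tmp)
  File.rename(tmp, link)
end
```

In a Secret volume layout, tls.crt points at ..data/tls.crt and ..data points at a timestamped directory. Swapping ..data leaves the tls.crt link untouched, so File.lstat on it reports nothing new, while target_stamp picks up the mtime of the new file.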

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • Integration tests: 7 scenarios
    • regular file rotation triggers exactly one reload then stays stable
    • symlink target swap is detected via mtime poll and reloads the pipeline
    • rotating one pipeline cert does not reload the other (isolation)
    • shared cert rotation reloads all pipelines referencing it
    • ES output truststore rotation is detected, pipeline reloads, and events continue flowing to ES
    • ES output CA rotation reloads the pipeline and events continue flowing to ES
    • invalid CA cert rotation triggers reload failure, and restoring a valid CA triggers a later successful reload and ingestion recovery
  • Verified in ECK by updating the certificate in a Kubernetes Secret and observing automatic pipeline reload without restarting Logstash

How to test this PR locally

echo "ssl.reload.automatic: true" >> config/logstash.yml
echo "config.reload.automatic: true" >> config/logstash.yml

Regular file rotation

# Replace server.crt with a new cert on disk
cp new-server.crt /path/to/server.crt

# Observe pipeline reloads automatically (check logs or _node/stats API)
curl localhost:9600/_node/stats/pipelines/main | jq '.pipelines.main.reloads'

Symlink rotation (Kubernetes-style)

# Atomically swap the symlink to point at a new cert file
ln -sf new-server.crt /path/to/symlink.tmp && mv /path/to/symlink.tmp /path/to/server.crt

# Observe pipeline reloads on the next converge cycle

Related issues

@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)
  • run exhaustive tests : Run the exhaustive tests Buildkite pipeline.

@mergify
Contributor

mergify Bot commented Apr 10, 2026

This pull request does not have a backport label. Could you fix it @kaisecheng? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

Comment thread logstash-core/lib/logstash/ssl_file_tracker.rb Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR adds automatic pipeline reloads when TLS certificate/key/CA files change on disk, enabling hot-reload of pipelines affected by certificate rotation (including Kubernetes-style symlink swaps) without a Logstash restart.

Changes:

  • Introduces a Java FileWatchService (NIO WatchService) with per-file callbacks and accompanying unit tests.
  • Adds a Ruby LogStash::SslFileTracker to register/deregister SSL-related paths per pipeline and detect staleness via checksum (regular files) or mtime polling (symlinks), with specs.
  • Wires the tracker into LogStash::Agent and all pipeline lifecycle actions; adds QA integration coverage including a TLS-enabled Elasticsearch fixture/service.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| qa/integration/specs/tls_hot_reload_spec.rb | New integration suite covering reload behavior for regular files, symlinks, shared certs, and ES output CA changes. |
| qa/integration/services/service_locator.rb | Updates service class name resolution to support underscored service names. |
| qa/integration/services/http_proxy_service.rb | Renames service class to match updated locator naming (HttpProxyService). |
| qa/integration/services/elasticsearch_tls_service.rb | Adds a TLS-enabled Elasticsearch service wrapper for integration tests. |
| qa/integration/services/elasticsearch_setup.sh | Adds TLS-capable ES startup path and copies certs into ES config for entitlement constraints. |
| qa/integration/services/elasticsearch_teardown.sh | Cleans ES data/logs and TLS-specific artifacts/users on teardown. |
| qa/integration/framework/cert_helpers.rb | Adds helper functions to generate/write CA/leaf certs for integration testing. |
| qa/integration/fixtures/tls_hot_reload_spec.yml | Adds fixture definition enabling elasticsearch_tls alongside logstash. |
| logstash-core/src/main/java/org/logstash/common/FileWatchService.java | New core Java file watcher with callback dispatching and lazy watcher thread. |
| logstash-core/src/test/java/org/logstash/common/FileWatchServiceTest.java | JUnit tests validating modify/rename events, multi-callback behavior, deregistration, and multi-dir support. |
| logstash-core/lib/logstash/ssl_file_tracker.rb | New tracker that maps SSL file paths to pipelines and detects changes (checksum vs symlink mtime). |
| logstash-core/lib/logstash/agent.rb | Creates/owns watcher+tracker when auto-reload is enabled; checks for stale pipelines each converge and triggers reload actions. |
| logstash-core/lib/logstash/pipeline_action/create.rb | Registers SSL paths before pipeline start; deregisters on create failure. |
| logstash-core/lib/logstash/pipeline_action/reload.rb | Deregisters old pipeline paths, registers new pipeline paths, and cleans up on failed start. |
| logstash-core/lib/logstash/pipeline_action/recover.rb | Mirrors reload behavior for recover: deregister old, register new, deregister on failed start. |
| logstash-core/lib/logstash/pipeline_action/stop.rb | Deregisters SSL paths when stopping a pipeline. |
| logstash-core/lib/logstash/pipeline_action/stop_and_delete.rb | Deregisters SSL paths when stopping/deleting a pipeline. |
| logstash-core/lib/logstash/pipeline_action/delete.rb | Deregisters SSL paths after successful delete. |
| logstash-core/spec/logstash/ssl_file_tracker_spec.rb | New unit specs for registration, shared paths, stale detection, and symlink polling behaviors. |
| logstash-core/spec/logstash/agent_spec.rb | Adds specs for tracker presence/absence, SSL converge behavior, and converge-result merging. |
| logstash-core/spec/logstash/pipeline_action/create_spec.rb | Stubs agent.ssl_file_tracker and keeps existing create behavior tests passing. |
| logstash-core/spec/logstash/pipeline_action/reload_spec.rb | Adds expectations for tracker deregister/register ordering on reload success/failure. |
| logstash-core/spec/logstash/pipeline_action/stop_spec.rb | Stubs agent.ssl_file_tracker for updated stop action behavior. |
| logstash-core/spec/logstash/pipeline_action/stop_and_delete_spec.rb | Stubs agent.ssl_file_tracker for updated stop-and-delete action behavior. |
| logstash-core/spec/logstash/pipeline_action/delete_spec.rb | Stubs agent.ssl_file_tracker for updated delete action behavior. |

Comment thread logstash-core/lib/logstash/ssl_file_tracker.rb Outdated
Comment thread logstash-core/lib/logstash/ssl_file_tracker.rb
@kaisecheng kaisecheng marked this pull request as ready for review April 10, 2026 23:14
@kaisecheng kaisecheng force-pushed the pipeline-reload-certs branch from bfc0f16 to dc75eea Compare April 11, 2026 01:54
@kaisecheng kaisecheng marked this pull request as draft April 13, 2026 10:29
@kaisecheng kaisecheng marked this pull request as ready for review April 13, 2026 13:47
@andsel andsel self-requested a review April 15, 2026 14:36
new_pipeline = LogStash::JavaPipeline.new(@pipeline_config, @metric, agent)
agent.ssl_file_tracker&.register(new_pipeline)
success = new_pipeline.start # block until the pipeline is correctly started or crashed
# Keep the SSL file registered on failure so subsequent certificate recovery can be detected. Do not deregister it.
Contributor Author

We do not deregister SSL tracking on reload failure.
Reload may fail because the new TLS material is temporarily invalid. In that situation we still want to observe future file changes, because the next cert update may repair the problem.

Member

@andsel andsel left a comment

Given that the PR touches three layers, I've started with the first one, the file watch service. I'll look into the rest as we progress; otherwise the review becomes a nightmare.

Comment thread logstash-core/src/main/java/org/logstash/common/FileWatchService.java Outdated
Comment thread logstash-core/src/main/java/org/logstash/common/FileWatchService.java Outdated
Comment thread logstash-core/src/main/java/org/logstash/common/FileWatchService.java Outdated
Comment thread logstash-core/src/main/java/org/logstash/common/FileWatchService.java Outdated
Comment thread logstash-core/src/main/java/org/logstash/common/FileWatchService.java Outdated
Comment thread logstash-core/src/main/java/org/logstash/common/FileWatchService.java Outdated
Implement FileWatchService using Java NIO WatchService to detect file
modifications via OS kernel notifications. Supports registering
per-file callbacks within watched directories, lazy-starts the watcher
thread on first registration, and deregisters watch keys when all files
in a directory are removed.

SslFileTracker wraps FileWatchService to track SSL certificate files
across pipelines. It maintains reference counts per path so shared
certs are only deregistered when the last pipeline releases them.

Detection uses a hybrid strategy chosen at registration time:
- Regular files: OS kernel events via FileWatchService, gated by
  SHA-256 checksum to suppress spurious notifications
- Symlinks: mtime polling on each converge cycle, since NIO WatchService
  tracks the symlink pointer rather than the link target

Wire SslFileTracker into Agent so that when a TLS certificate file
changes, the affected pipelines are automatically reloaded without
requiring a full Logstash restart.

- Agent.execute injects SslFileTracker when auto-reload is enabled
- converge_state_and_update calls stale_pipelines on each cycle and
  queues Reload actions for pipelines with changed certs
- Pipeline actions (Create, Reload, Stop, Delete, StopAndDelete) call
  register/deregister on SslFileTracker to keep tracked paths in sync
  with the live pipeline set

- regular file rotation triggers exactly one pipeline reload then stays stable
- symlink target swap is detected via mtime poll and reloads the pipeline
- rotating one pipeline cert does not reload the other (isolation)
- shared cert rotation reloads all pipelines referencing it
- ES output CA rotation reloads the pipeline and events continue flowing to ES
- invalid CA cert rotation triggers reload failure and stops sending events to ES

…ch cycle

Previously, refresh_pipeline_symlink_stamps compared each pipeline's
registered baseline stamp against the latest stamp for every tracked SSL
file on every converge cycle — O(pipelines × SSL files) work repeated
each interval.

Now, SslFileTracker maintains a @stale_pipeline_ids Set. Both :watch
paths (FileWatchService callbacks) and :poll paths (symlink mtime polls)
write directly to this Set when a change is detected. Each converge reads
the Set in O(1) instead of scanning all stamps.
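
The Set-based design in this commit message might be pictured as follows. This is a sketch under assumptions: StaleIndex, mark_changed and drain are illustrative names, and the real tracker also handles locking and watcher registration, which are left out here.

```ruby
require "set"

# Hypothetical model: both detection modes write into one Set of stale ids.
class StaleIndex
  def initialize
    @path_ids = Hash.new { |h, path| h[path] = Set.new } # path => pipeline ids
    @stale_ids = Set.new
  end

  def register(pipeline_id, paths)
    paths.each { |path| @path_ids[path] << pipeline_id }
  end

  # Called by a watcher callback or by a symlink poll when a path changes.
  # Unknown paths are ignored.
  def mark_changed(path)
    @stale_ids.merge(@path_ids.fetch(path, []))
  end

  # Converge reads and clears the Set; no per-stamp scan is needed.
  def drain
    ids = @stale_ids.to_a.sort
    @stale_ids.clear
    ids
  end
end
```

The work moves to the moment a change is detected, so a converge cycle with no certificate changes only checks an empty Set.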
Two plugins referencing the same cert file via different path styles
(one relative, one absolute) would previously create two separate
@watched_files entries and register the file with FileWatchService twice.
ssl_file_paths now expands every path with File.expand_path before
deduplication, so the same file is always tracked as a single entry
regardless of how it was declared in config.
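
A short sketch of the deduplication step, assuming relative paths are resolved against a known base directory. The dedupe_paths helper and its base_dir parameter are illustrative, not the signature of ssl_file_paths.

```ruby
# Expands every path to an absolute, normalized form before removing duplicates.
# File.expand_path is purely lexical: it does not resolve symlinks.
def dedupe_paths(paths, base_dir = Dir.pwd)
  paths.map { |path| File.expand_path(path, base_dir) }.uniq
end
```

For instance, with a base directory of /opt/ls, both certs/server.crt and /opt/ls/certs/server.crt would collapse into the single entry /opt/ls/certs/server.crt.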
…d stamps

- Remove baseline check from watch callback
- Replace @registered_stamps with @id_paths ({ id => [path] })
  baseline stamps are no longer needed
- Rename @stale_pipeline_ids to @stale_ids
- Rename @watched_files to @path_watched

Without this, a failed reload stops future cert change recovery
because the tracker is deregistered.

Added integration test for invalid cert recovery.
Shortened the integration test wait time.

Add a new `ssl.reload.automatic` boolean (default false) that controls
whether Logstash watches SSL cert and key files referenced by pipelines
and triggers a pipeline reload when they change. The watcher and file
tracker only start when both `ssl.reload.automatic` and
`config.reload.automatic` are enabled, or when Centralized Pipeline
Management is in use (CPM flips config.reload.automatic on at boot).

A bootstrap check fails fast if `ssl.reload.automatic: true` is set
without a compatible reload mode. The setting is exposed through the
docker env2yaml mapping and documented in the sample logstash.yml.
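
The fail-fast rule described in this commit message can be sketched as a simple guard. The settings are modeled as a plain Hash here, and BootstrapCheckError is a local stand-in class; the actual Logstash settings object and bootstrap check interface differ.

```ruby
# Local stand-in for the error a bootstrap check would raise.
class BootstrapCheckError < StandardError; end

# Raises when ssl.reload.automatic is on but nothing can trigger a reload.
def check_ssl_reload!(settings)
  return unless settings["ssl.reload.automatic"]
  return if settings["config.reload.automatic"] || settings["xpack.management.enabled"]

  raise BootstrapCheckError,
        "ssl.reload.automatic requires config.reload.automatic or Centralized Pipeline Management"
end
```

Failing at startup makes the misconfiguration visible immediately, rather than leaving a setting that silently does nothing.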
@kaisecheng
Contributor Author

@andsel Changes after applying your suggestion

  • 1fe365b: Fixes an issue where the Java input plugin fails to respond to get_config.
  • c4db02f8: Handles NIO watch service IOException (e.g. inotify exhausted). Propagates the error upward and returns FailedAction as the convergence result.
  • 69edc251: New setting ssl.reload.automatic (default false) gates the feature.

@kaisecheng kaisecheng requested a review from andsel April 18, 2026 00:10
Member

@andsel andsel left a comment

Left some suggestions to avoid exposing too much internal state and to improve readability.

Comment thread logstash-core/lib/logstash/pipelines_registry.rb Outdated
Comment thread logstash-core/lib/logstash/agent.rb Outdated
Comment thread logstash-core/lib/logstash/pipeline_action/create.rb Outdated
Comment thread qa/integration/services/elasticsearch_setup.sh Outdated
Comment thread qa/integration/services/elasticsearch_setup.sh Outdated
Comment thread qa/integration/specs/tls_hot_reload_spec.rb Outdated
Comment thread qa/integration/specs/tls_hot_reload_spec.rb Outdated
@kaisecheng
Contributor Author

@andsel this is ready for another look. The red CI is caused by the latest jruby-openssl update.

@kaisecheng kaisecheng requested a review from andsel April 20, 2026 21:00
@andsel
Member

andsel commented Apr 21, 2026

@kaisecheng I think we can rebase to main since #19026 skipped those failing tests.

Member

@andsel andsel left a comment

LGTM

Tested creating key and cert with

openssl req -x509 -newkey rsa:2048 \
  -keyout key_new.pem -out cert_new.pem \
  -days 365 -nodes \
  -subj "/CN=localhost" \
  -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"

I then tinkered with copying files, overwriting symlinks, and invalid cert/key pairs, and it worked as expected.

Good work @kaisecheng 👏

Just one ask: rebase on main and show that CI is green; then the PR is fine by me.

@kaisecheng
Contributor Author

run exhaustive tests

@kaisecheng
Contributor Author

run exhaustive tests

@kaisecheng kaisecheng force-pushed the pipeline-reload-certs branch from a3658fb to 4c53832 Compare April 22, 2026 20:15
@kaisecheng
Contributor Author

run exhaustive tests

@kaisecheng kaisecheng force-pushed the pipeline-reload-certs branch from 4c53832 to c4f0538 Compare April 23, 2026 17:41
Three cases cannot run reliably on Windows under JRuby:

* The poll-mode and shared-symlink contexts depend on File.symlink plus
  File.utime to advance mtime, which is unreliable on NTFS via JRuby.
* The relative-vs-absolute dedup spec uses Pathname#relative_path_from,
  which raises when Tempfile and the checkout live on different drives
  (Buildkite Windows runners put Tempfile on C: and checkout on A:).
@kaisecheng
Contributor Author

run exhaustive tests

@elasticmachine
💚 Build Succeeded

@kaisecheng
Contributor Author

The fixes after approval are for test cases only.

  • The amazonlinux2023 VM uses tmpfs, which sometimes reports the same nanosecond mtime even after the file content has changed. The fix is to touch the file (bump its mtime) after updating it.
  • JRuby on Windows doesn't report a reliable mtime for a symlink's target file. The feature is not designed for Windows, so some test cases are skipped.

@kaisecheng kaisecheng merged commit 91739d7 into elastic:main Apr 23, 2026
12 checks passed
Development

Successfully merging this pull request may close these issues.

Reload pipelines on SSL file change
Implement general purpose FileWatchService

4 participants