
Strip userinfo from ES host URL before using it as task-log label #65349

Merged
potiuk merged 2 commits into apache:main from potiuk:redact-es-host-credentials-in-logs on Apr 16, 2026
Conversation

@potiuk
Member

potiuk commented Apr 16, 2026

Summary

The Elasticsearch task-log handler groups log hits by host and falls back to the raw [elasticsearch] host config value when a hit does not carry a host field. That config is commonly set to a URL that embeds credentials:

[elasticsearch]
host = https://user:password@elk.example.com:9200

As a result, the full URL — including the user:password@ userinfo — appeared as a dictionary key in the task-log output, visible to any user with task-log read permission.
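
The leak described above can be reproduced with a minimal sketch. This is not the Airflow handler code: `group_hits_by_host`, the shape of the hit dictionaries, and `CONFIGURED_HOST` are hypothetical stand-ins chosen only to show how a raw fallback value ends up as a visible dictionary key.

```python
from collections import defaultdict

# Hypothetical stand-in for the [elasticsearch] host config value.
CONFIGURED_HOST = "https://user:password@elk.example.com:9200"


def group_hits_by_host(hits: list[dict], fallback: str) -> dict[str, list[dict]]:
    """Group log hits by their "host" field, using `fallback` when it is missing."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for hit in hits:
        # A hit without a "host" field is filed under the fallback label.
        grouped[hit.get("host") or fallback].append(hit)
    return dict(grouped)


hits = [
    {"host": "worker-1", "message": "task started"},
    {"message": "task finished"},  # no host field, so the fallback is used
]
grouped = group_hits_by_host(hits, CONFIGURED_HOST)

# The second key is the raw config URL, credentials included.
for label in grouped:
    print(label)
```

Because the group labels are rendered alongside the log lines, anything used as a key here is effectively part of the log output.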

This PR adds a small _strip_userinfo helper that removes the userinfo portion of a URL, and uses it in ElasticsearchRemoteLogIO._group_logs_by_host for the host fallback value. The Elasticsearch client itself is still connected using the full unredacted URL, so authentication is unaffected.
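
One way such a helper could look is sketched below. It is an illustration under assumptions, not the code merged in this PR: the name `strip_userinfo` (without the leading underscore) and the choice of `urllib.parse.urlsplit` are mine, and the real `_strip_userinfo` may handle edge cases differently.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(url: str) -> str:
    """Return `url` with any `user:password@` portion removed from the authority.

    Inputs that are not URLs, or that carry no userinfo, are returned unchanged.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed input (for example an unbalanced IPv6 bracket): leave as-is.
        return url
    if "@" not in parts.netloc:
        return url
    # Split on the last "@" so a password containing "@" is still removed whole.
    host_port = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit(parts._replace(netloc=host_port))


print(strip_userinfo("https://user:password@elk.example.com:9200"))
# https://elk.example.com:9200
print(strip_userinfo("https://user@elk.example.com"))
# https://elk.example.com
print(strip_userinfo("https://elk.example.com:9200"))
# https://elk.example.com:9200
```

Only the label shown in the task log would pass through a function like this; the client connection keeps using the original URL, which is why authentication is unaffected.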

Test plan

  • New test_strip_userinfo parametrized across 7 input URL shapes (with userinfo, without userinfo, username-only, non-URL, empty) — all pass
  • Full existing test_es_task_handler.py suite continues to pass (71/71)

Changelog

Added a "Bug fixes" section to providers/elasticsearch/docs/changelog.rst describing the redaction.

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Opus 4.6 (1M context)

Generated-by: Claude Opus 4.6 (1M context) following the guidelines at
https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions

Comment thread providers/elasticsearch/docs/changelog.rst Outdated
@eladkal
Contributor

eladkal commented Apr 16, 2026

cc @Owen-CH-Leung

@potiuk
Member Author

potiuk commented Apr 16, 2026

Yeah, would love to see your comments @Owen-CH-Leung :)

@potiuk
Member Author

potiuk commented Apr 16, 2026

We can always post-review :)

potiuk merged commit f924406 into apache:main on Apr 16, 2026
102 checks passed
potiuk deleted the redact-es-host-credentials-in-logs branch on April 16, 2026 at 18:02
karenbraganz pushed a commit to karenbraganz/airflow that referenced this pull request Apr 16, 2026
…ache#65349)

* Strip userinfo from ES host URL before using it as task-log label

* Apply suggestion from @potiuk
@Owen-CH-Leung
Contributor

Great catch! The OpenSearch provider has the same issue. It's worth a follow-up PR to apply the same fix there.

potiuk added a commit to potiuk/airflow that referenced this pull request Apr 19, 2026
…abel

Follow-up to apache#65349 — OpenSearch's `_group_logs_by_host` had the same
credential-leak as Elasticsearch: the raw `[opensearch] host` config
value (which commonly embeds `user:password@...`) was used as a
log-source dictionary key, exposing credentials in task logs. Apply the
same `_strip_userinfo` helper; the OpenSearch client still connects
with the full URL so auth is unaffected. Both `OpensearchTaskHandler`
and `OpensearchRemoteLogIO` sites are patched.

Also add `AGENTS.md` to both `providers/opensearch` and
`providers/elasticsearch` noting that the two providers are forks and
most task-log-handler fixes should be cross-applied.
potiuk added a commit that referenced this pull request Apr 19, 2026
…abel (#65509)
