fix(amazon): EksPodOperator deferrable mode fails on remote triggerers#63020

Open
akhilesharora wants to merge 3 commits into apache:main from akhilesharora:fix/issue-61736-eks-deferrable-credentials

Conversation

@akhilesharora akhilesharora commented Mar 7, 2026

Summary

Fix EksPodOperator with deferrable=True failing with 401 Unauthorized when the triggerer runs on a different host from the worker.

Root Cause: The kubeconfig exec block references a temp file path (/tmp/tmpXYZ) that only exists on the worker. When the trigger is serialized and sent to the triggerer, the exec block tries to source a file that doesn't exist.

Solution: Generate a kubeconfig with an embedded bearer token instead of an exec block with temp file references.
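
For illustration, the token-based kubeconfig shape described above can be sketched roughly as follows. This is a minimal sketch and not the code in this PR: the helper name `build_token_kubeconfig`, its parameters, the `aws` entry names, and the placeholder values are all assumptions. Only the general kubeconfig layout, with a `users` entry that carries `token` rather than an `exec` block, reflects the approach.

```python
import json


def build_token_kubeconfig(cluster_name, endpoint, ca_data, token, namespace="default"):
    """Return a kubeconfig dict whose user entry carries a bearer token.

    Hypothetical helper for illustration only; names and parameters are assumptions.
    """
    if not token:
        raise ValueError("refusing to build a kubeconfig with an empty token")
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {
                "name": cluster_name,
                "cluster": {"server": endpoint, "certificate-authority-data": ca_data},
            }
        ],
        "contexts": [
            {
                "name": "aws",
                "context": {"cluster": cluster_name, "namespace": namespace, "user": "aws"},
            }
        ],
        "current-context": "aws",
        # An exec-based config would carry {"exec": {...}} here, pointing at
        # commands or files that must exist on the host that loads the config.
        "users": [{"name": "aws", "user": {"token": token}}],
    }


config = build_token_kubeconfig(
    cluster_name="example-cluster",
    endpoint="https://example.invalid",
    ca_data="BASE64_CA_PLACEHOLDER",
    token="placeholder-token",
)
# The dict holds only plain strings, so it survives a JSON round trip.
round_tripped = json.loads(json.dumps(config))
```

Because nothing in the dict refers to a local path, the same structure can be loaded on any host, which is the property the triggerer needs.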

Changes

  • Added EksHook.generate_config_dict_for_deferral() - generates kubeconfig with embedded token
  • Override EksPodOperator.invoke_defer_method() to use token-based config for triggerer
  • Added comprehensive error handling for cluster lookup and token fetch failures
  • Added 5 new tests covering success and error scenarios

Security Considerations

  • Token is encrypted at rest (Fernet encryption in trigger serialization)
  • Token has short lifespan (~14 minutes for EKS)
  • Token is never logged
  • Robust error handling with actionable messages

Test Plan

  • test_generate_config_dict_for_deferral - verifies embedded token config
  • test_generate_config_dict_for_deferral_cluster_not_found - error handling
  • test_generate_config_dict_for_deferral_empty_token - security validation
  • test_generate_config_dict_for_deferral_token_fetch_failure - error handling
  • test_invoke_defer_method_generates_token_based_config - operator integration
  • All existing EKS tests pass

Closes #61736


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.5)

Generated-by: Claude Code (Opus 4.5) following the guidelines

@akhilesharora akhilesharora requested a review from o-nikolas as a code owner March 7, 2026 00:05

boring-cyborg bot commented Mar 7, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Mar 7, 2026
@akhilesharora akhilesharora force-pushed the fix/issue-61736-eks-deferrable-credentials branch from fdad29e to dc9c61f Compare March 7, 2026 00:08
Contributor

@SameerMesiah97 SameerMesiah97 left a comment

Looks fine. But there was one critical issue that I have commented on already. And I do have one other concern:

Since the EKS authentication token generated by fetch_access_token_for_cluster typically expires after approx. 15 minutes, I’m wondering how this behaves for longer-running triggers. If the trigger polls the Kubernetes API for longer than the token lifetime, could the embedded token expire and cause authentication failures?

CI needs to be run to see if there are any other issues.


:param eks_cluster_name: The name of the cluster to generate kubeconfig for.
:param pod_namespace: The namespace to run within kubernetes.
:return: A kubeconfig dict with embedded bearer token.
Contributor

This docstring is too long. The emphasis should be on description not justification. Something like this would be better:

"""
Generate a kubeconfig dict with an embedded bearer token for deferrable execution.

The token-based config avoids the exec credential plugin so it can be safely
serialized and used by the triggerer process.

:param eks_cluster_name: The name of the EKS cluster.
:param pod_namespace: The Kubernetes namespace.
:return: Kubeconfig dictionary with embedded bearer token.
"""

The additional content where you explain why this function exists might be better as a comment.

sts_url = f"{StsHook(region_name=session.region_name).conn_client_meta.endpoint_url}/?Action=GetCallerIdentity&Version=2011-06-15"
finally:
del os.environ["AWS_STS_REGIONAL_ENDPOINTS"]

Contributor

Modifying environment variables is not acceptable here. os.environ is process-global so setting and deleting AWS_STS_REGIONAL_ENDPOINTS like this could interfere with other tasks running in the same worker process. It could also remove a value that was already set by the environment.

I am not sure why you need to manipulate environment variables because my understanding is that the url construction would default to 'regional' without explicitly setting AWS_STS_REGIONAL_ENDPOINTS to regional.
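
The clobbering concern can be shown with a small sketch. It uses a plain dict as a stand-in for `os.environ` so that it has no side effects, and both function names are hypothetical.

```python
def set_then_delete(env):
    """Mimic the pattern under review: set the variable, do work, delete it."""
    env["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"
    try:
        pass  # URL construction would happen here
    finally:
        del env["AWS_STS_REGIONAL_ENDPOINTS"]


def set_then_restore(env):
    """Save and restore the prior value instead of deleting unconditionally."""
    missing = object()
    previous = env.get("AWS_STS_REGIONAL_ENDPOINTS", missing)
    env["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"
    try:
        pass  # URL construction would happen here
    finally:
        if previous is missing:
            del env["AWS_STS_REGIONAL_ENDPOINTS"]
        else:
            env["AWS_STS_REGIONAL_ENDPOINTS"] = previous


# Stand-in for an environment where the variable had already been configured.
clobbered = {"AWS_STS_REGIONAL_ENDPOINTS": "legacy"}
set_then_delete(clobbered)

preserved = {"AWS_STS_REGIONAL_ENDPOINTS": "legacy"}
set_then_restore(preserved)
```

After the calls, `clobbered` no longer contains the variable while `preserved` still holds its original value. Even the restoring variant only addresses the clobbering: other code running in the same process would still observe the temporary value, so avoiding the mutation altogether remains the safer option.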

Author

You're right, I've removed the env var manipulation entirely. Now constructing the regional STS URL directly: https://sts.{region}.amazonaws.com/.... Also applied this fix to the existing generate_config_file method for consistency.
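
A minimal sketch of that direct construction, assuming the standard `aws` partition. The helper name `regional_sts_url` is hypothetical; the host pattern comes from the reply above and the query string from the hunk under review.

```python
def regional_sts_url(region_name):
    """Build a regional STS GetCallerIdentity URL without touching os.environ.

    Assumes the standard `aws` partition; other partitions (for example the
    China regions) use a different DNS suffix and are not handled here.
    """
    if not region_name:
        raise ValueError("region_name is required to build a regional STS URL")
    return f"https://sts.{region_name}.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15"


url = regional_sts_url("eu-west-1")
```

The function is pure, so it cannot interfere with other tasks sharing the worker process.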

sts_url=sts_url,
region_name=session.region_name,
)
except Exception as e:
Contributor

The Exception here is too broad. Can you perhaps narrow it to the errors you would expect when invoking fetch_access_token_for_cluster?
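
One possible shape for the narrowed handler is sketched below. The exception tuple mirrors the one listed in the follow-up commit message in this PR (BotoCoreError, ClientError, ValueError); the wrapper name `fetch_token_or_raise`, the `fetch_token` callable, and the use of `RuntimeError` are assumptions made for illustration.

```python
try:
    from botocore.exceptions import BotoCoreError, ClientError
except ImportError:
    # Stand-ins so the sketch runs where botocore is not installed.
    class BotoCoreError(Exception):
        pass

    class ClientError(Exception):
        pass


def fetch_token_or_raise(fetch_token, cluster_name):
    """Call `fetch_token` and translate only the expected failures.

    `fetch_token` is a hypothetical zero-argument callable standing in for the
    hook's token fetch; unexpected exception types propagate unchanged.
    """
    try:
        token = fetch_token()
    except (BotoCoreError, ClientError, ValueError) as e:
        raise RuntimeError(f"Failed to fetch access token for EKS cluster {cluster_name!r}") from e
    if not token:
        raise RuntimeError(f"Empty access token returned for EKS cluster {cluster_name!r}")
    return token
```

Catching a fixed tuple keeps programming errors such as `KeyError` or `AttributeError` visible instead of folding them into a generic authentication failure.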


This override generates a kubeconfig with an embedded bearer token instead of an exec
block, allowing the config to work on the triggerer without requiring local temp files.
"""
Contributor

This docstring is too long as well. I would suggest the below:

"""
Override to generate a token-based kubeconfig for the triggerer.

EKS kubeconfigs use an exec credential plugin that references temporary
files created on the worker. These files are not available on the triggerer,
so this override embeds a bearer token instead.
"""

Same as above, I would leave the truncated content to be included in a comment instead.

import datetime

from airflow.providers.cncf.kubernetes.triggers.pod import ContainerState, KubernetesPodTrigger
from airflow.providers.common.compat.sdk import AirflowNotFoundException, BaseHook
Contributor

Why are these imports within the function? Not necessarily an issue but can you explain why?

Author

I placed them locally, but they could be moved to module level to match the parent class pattern.


This test verifies that the method generates a kubeconfig dict with a bearer token
embedded directly (instead of an exec block that references temp files), allowing
the config to be serialized and used on the triggerer.
Contributor

Remove everything from this docstring except the first line.

This test verifies that EksPodOperator.invoke_defer_method() generates a kubeconfig
with an embedded bearer token (instead of an exec block with temp file references)
so that the triggerer can authenticate without requiring files that only exist on the worker.
"""
Contributor

Remove everything from this docstring except the first line.

@@ -0,0 +1 @@
Fix EksPodOperator deferrable mode failing on remote triggerers with 401 Unauthorized by embedding bearer token in kubeconfig instead of using exec block with temp file references
Contributor

I am not sure if a news fragment is necessary for this but let's see what a committer/maintainer has to say.

Author

Will defer to committer guidance on this. Included it as it's a user-facing bugfix.

When using EksPodOperator with deferrable=True, the triggerer fails with
401 Unauthorized because credential temp files created on the worker don't
exist on the triggerer host.

The root cause is that the kubeconfig exec block references a temp file
path that only exists on the worker. When the trigger is serialized and
sent to the triggerer (on a different host), the exec block tries to
source a file that doesn't exist.

This fix adds a new method `generate_config_dict_for_deferral()` that
creates a kubeconfig with an embedded bearer token instead of an exec
block. The EksPodOperator.invoke_defer_method() is overridden to use
this token-based config for the triggerer.

Security considerations:
- Token is encrypted at rest (Fernet encryption in trigger serialization)
- Token has short lifespan (~14 minutes)
- Token is never logged
- Robust error handling with clear messages

Closes apache#61736
- Remove os.environ manipulation for AWS_STS_REGIONAL_ENDPOINTS, construct
  regional STS URL directly to avoid process-global side effects
- Trim verbose docstrings to focus on description rather than justification
- Narrow exception handling from broad Exception to specific
  (BotoCoreError, ClientError, ValueError)
- Add comment explaining why imports are inside invoke_defer_method
- Apply same env var fix to existing generate_config_file method
@akhilesharora akhilesharora force-pushed the fix/issue-61736-eks-deferrable-credentials branch from dc9c61f to cfcd906 Compare March 8, 2026 08:39
@akhilesharora
Author

> Looks fine. But there was one critical issue that I have commented on already. And I do have one other concern:
>
> Since the EKS authentication token generated by fetch_access_token_for_cluster typically expires after approx. 15 minutes, I’m wondering how this behaves for longer-running triggers. If the trigger polls the Kubernetes API for longer than the token lifetime, could the embedded token expire and cause authentication failures?
>
> CI needs to be run to see if there are any other issues.

Yes, if the trigger polls for longer than ~14 minutes, the token could expire. I anticipate that most pod operations (startup, completion monitoring) will finish well within this window, and trigger_reentry() generates fresh credentials when the trigger completes.

For very long-running pods, token refresh in the trigger could be a future enhancement.
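
As a rough idea of what such a refresh could look like, here is a minimal sketch of a holder that refetches a short-lived token shortly before its assumed lifetime ends. Everything in it is hypothetical: the class name `RefreshingToken`, the `fetch` callable, and the default lifetime and margin. It also does not address how the triggerer would obtain the AWS credentials needed to fetch a new token.

```python
import time


class RefreshingToken:
    """Cache a short-lived token and refetch it shortly before it expires.

    Hypothetical sketch: `fetch` is any zero-argument callable returning a token
    string, and `ttl_seconds` is the lifetime the caller assumes for it.
    """

    def __init__(self, fetch, ttl_seconds=14 * 60, margin_seconds=60, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return the cached token, refetching once it is within the safety margin."""
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._ttl - self._margin
        if stale:
            self._token = self._fetch()
            self._fetched_at = now
        return self._token
```

A polling loop would call `get()` before each Kubernetes API request, so a pod that runs longer than one token lifetime keeps authenticating. The injectable `clock` makes the expiry logic testable without sleeping.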

Labels

area:providers provider:amazon AWS/Amazon - related issues

Development

Successfully merging this pull request may close these issues.

EksPodOperator deferrable mode fails on remote triggerers — credential temp file not available

2 participants