fix: EksPodOperator 401 with cross-account AssumeRole via aws_conn_id #64749

Open: anmolxlight wants to merge 1 commit into apache:main from anmolxlight:fix/eks-pod-operator-cross-account-401

@anmolxlight

Fixes #64657

Problem

When using EksPodOperator with aws_conn_id pointing to a cross-account IAM role (via AssumeRole), pods fail with 401 Unauthorized:

pods "simple-http-server" is forbidden: User "" cannot create resource "pods" in API group "" in the namespace "default"

The audit log shows an empty user identity: "user":{}.

Root Cause

The kubeconfig exec plugin COMMAND template in EksHook had two critical fragility points:

  1. stderr merged into stdout via 2>&1: Python warnings, deprecation notices, or log output from eks_get_token contaminated the stdout that the bash token parsing relies on. The last_line extraction could then grab the wrong line, producing empty or invalid timestamp and token values.

  2. No token validation: if parsing failed, an ExecCredential JSON with an empty token was still emitted, so the Kubernetes client sent an empty bearer token to the EKS API server, resulting in a 401 with an empty user identity.

Same-account usage worked by accident because default MWAA execution role credentials were already in the environment, so eks_get_token produced valid output regardless of credential file sourcing.
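The parsing failure in point 1 can be reproduced outside bash with a small Python sketch. It assumes eks_get_token prints a final line of the form `expirationTimestamp: <ts>, token: <token>` (an assumption based on the `token: {access_token}` output mentioned in the review), and `parse_last_line` is a hypothetical helper that only mimics the template's tail/awk extraction.

```python
def parse_last_line(output: str) -> tuple[str, str]:
    """Hypothetical helper mimicking the template's `tail -n 1` + awk parsing."""
    lines = output.splitlines()
    last_line = lines[-1] if lines else ""
    # Timestamp: text after "expirationTimestamp: " and before ", token: "
    after_ts = last_line.partition("expirationTimestamp: ")[2]
    timestamp = after_ts.partition(", token: ")[0]
    # Token: text after the first "token: " (empty if the prefix is missing)
    token = last_line.partition("token: ")[2]
    return timestamp, token


clean = "expirationTimestamp: 2026-04-05T22:00:00Z, token: k8s-aws-v1.example"
# With 2>&1, a stderr line that lands after the token line becomes the "last" line.
merged = clean + "\nDeprecationWarning: some option is deprecated"

print(parse_last_line(clean))   # ('2026-04-05T22:00:00Z', 'k8s-aws-v1.example')
print(parse_last_line(merged))  # ('', '')
```

Whether a stray line actually ends up last depends on how the child process buffers stdout and stderr, which is why the failure can be intermittent rather than constant.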

Changes

providers/amazon/src/airflow/providers/amazon/aws/hooks/eks.py

  • Redirect stderr to /dev/null (2>/dev/null) instead of merging with stdout (2>&1) to ensure clean token output for bash parsing
  • Add token validation: exit with error if token extraction fails
  • Add error messages to stderr for debugging credential issues
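Expressed in Python, the intended behaviour of the fixed template is roughly the following sketch: refuse to emit an ExecCredential when the token is empty instead of handing the client an empty bearer token. The helper name `build_exec_credential`, the `apiVersion` value, and the input line format are assumptions for illustration, not the provider's actual code.

```python
import json


def build_exec_credential(last_line: str) -> str:
    """Hypothetical helper: turn an eks_get_token line into ExecCredential JSON.

    Raises ValueError instead of emitting a credential with an empty token.
    """
    after_ts = last_line.partition("expirationTimestamp: ")[2]
    timestamp = after_ts.partition(", token: ")[0]
    token = last_line.partition("token: ")[2]
    if not token:
        # Mirrors the template's `if [ -z "$token" ]` ... `exit 1` branch.
        # The raw output is deliberately not echoed, since it may hold a token.
        raise ValueError("Failed to extract token from eks_get_token output.")
    return json.dumps(
        {
            "kind": "ExecCredential",
            # Assumed API version; the real template may use a different one.
            "apiVersion": "client.authentication.k8s.io/v1beta1",
            "spec": {},
            "status": {"expirationTimestamp": timestamp, "token": token},
        }
    )


print(build_exec_credential("expirationTimestamp: 2026-04-05T22:00:00Z, token: k8s-aws-v1.example"))
```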

providers/amazon/tests/unit/amazon/aws/hooks/test_eks.py

  • Add test_command_template_redirects_stderr: verifies stderr is redirected to /dev/null and not merged with stdout
  • Add test_command_template_validates_token: verifies the token validation check and error exit

Testing

# Verify the COMMAND template structure
python -c "
import sys
sys.path.insert(0, 'providers/amazon/src')
from airflow.providers.amazon.aws.hooks.eks import COMMAND
assert '2>/dev/null' in COMMAND
assert '2>&1' not in COMMAND
assert 'if [ -z \"\$token\" ]' in COMMAND
assert 'exit 1' in COMMAND
print('All checks passed')
"

Verification

To verify the fix works with cross-account AssumeRole:

  1. Set up two AWS accounts: Account A (MWAA) and Account B (EKS)
  2. Create an IAM role in Account B that trusts Account A's execution role
  3. Create a connection in MWAA using Account B's role ARN
  4. Run an EksPodOperator task with aws_conn_id set to the cross-account connection
  5. Verify the pod is created successfully without 401 errors
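Steps 2 and 3 can be sketched as data. The trust policy uses standard IAM policy syntax, and `role_arn` / `region_name` are extras understood by the Airflow AWS connection; the account IDs, role names, region, and helper functions below are hypothetical placeholders.

```python
import json

# Hypothetical placeholder ARNs for Account A (MWAA) and Account B (EKS).
MWAA_EXECUTION_ROLE_ARN = "arn:aws:iam::111111111111:role/mwaa-execution-role"
EKS_ACCESS_ROLE_ARN = "arn:aws:iam::222222222222:role/eks-pod-operator-role"


def trust_policy(trusted_principal_arn: str) -> dict:
    """Trust policy for the Account B role, letting Account A's role assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def connection_extra(role_arn: str, region_name: str) -> str:
    """JSON extra for an Airflow AWS connection that assumes role_arn."""
    return json.dumps({"role_arn": role_arn, "region_name": region_name})


print(json.dumps(trust_policy(MWAA_EXECUTION_ROLE_ARN), indent=2))
print(connection_extra(EKS_ACCESS_ROLE_ARN, "us-east-1"))
```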

anmolxlight requested a review from o-nikolas as a code owner on April 5, 2026 21:48

boring-cyborg bot commented Apr 5, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts, contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels on Apr 5, 2026
eladkal requested review from ferruzzi and vincbeck on April 7, 2026 05:42

if [ "$status" -ne 0 ]; then
printf '%s' "$output" >&2
printf 'eks_get_token failed with exit code %s' "$status" >&2
Contributor

Should we not pipe stderr output above to a durable location (perhaps something in /tmp) instead of /dev/null and then combine it with the stdout here? The status code alone is not very helpful.
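The suggestion above (keep stdout clean for parsing, but park stderr somewhere durable and surface it only on failure) can be sketched in Python with subprocess. This illustrates the idea rather than the template's bash; `run_token_command` is a hypothetical name, and the child commands in the usage lines merely stand in for eks_get_token.

```python
import subprocess
import sys
import tempfile


def run_token_command(cmd: list[str]) -> str:
    """Run cmd with stdout and stderr kept apart; return stdout on success.

    stderr goes to a temporary file instead of /dev/null, so it can be
    attached to the error message when the command fails.
    """
    with tempfile.TemporaryFile(mode="w+") as err_file:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=err_file, text=True)
        if result.returncode != 0:
            err_file.seek(0)  # rewind to read what the child wrote
            raise RuntimeError(
                f"command failed with exit code {result.returncode}: {err_file.read().strip()}"
            )
        return result.stdout


# A warning on stderr no longer reaches the parsed stdout.
ok_child = "import sys; print('warning', file=sys.stderr); print('expirationTimestamp: T, token: tok')"
print(run_token_command([sys.executable, "-c", ok_child]).strip())
```

A failing child (for example one that prints to stderr and exits with status 3) raises RuntimeError whose message includes both the exit code and the captured stderr.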

kaxil requested a review from Copilot on April 10, 2026 19:55

Copilot AI left a comment

Pull request overview

Fixes EksPodOperator authorization failures in cross-account AssumeRole setups by making the EKS kubeconfig exec-plugin token generation/parsing more robust.

Changes:

  • Adjust the EksHook kubeconfig exec COMMAND to avoid stderr/stdout mixing and add an explicit empty-token failure path.
  • Add unit tests asserting the COMMAND template no longer merges stderr into stdout and that it contains token validation logic.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

  • providers/amazon/src/airflow/providers/amazon/aws/hooks/eks.py: Updates the kubeconfig exec-plugin shell template to avoid stdout contamination and fail fast on empty token extraction.
  • providers/amazon/tests/unit/amazon/aws/hooks/test_eks.py: Adds string-based assertions over the COMMAND template to guard against regressions in redirection and validation logic.

Comment on lines +85 to 90
+ # Redirect stderr to /dev/null to prevent Python warnings, deprecation
+ # notices, or other log output from contaminating stdout. The token
+ # output must be the ONLY thing on stdout for bash parsing to work.
  output=$({python_executable} -m airflow.providers.amazon.aws.utils.eks_get_token \
- --cluster-name {eks_cluster_name} --sts-url '{sts_url}' {args} 2>&1)
+ --cluster-name {eks_cluster_name} --sts-url '{sts_url}' {args} 2>/dev/null)

Copilot AI Apr 10, 2026

Redirecting eks_get_token stderr to /dev/null discards useful error output (stack traces, botocore messages) and makes kubeconfig exec failures hard to debug. Since the original parsing issue was caused by merging stderr into stdout (2>&1), consider removing the redirection entirely (let stderr pass through) or capture stderr separately and only surface it on non-zero exit, while keeping stdout clean for parsing.

Copilot uses AI. Check for mistakes.

if [ "$status" -ne 0 ]; then
printf '%s' "$output" >&2
printf 'eks_get_token failed with exit code %s' "$status" >&2

Copilot AI Apr 10, 2026

On non-zero exit you now only print the exit code, but not the captured stdout from eks_get_token. This is a regression in diagnostics compared to printing the output, and it will make credential/STS issues much harder to troubleshoot. Consider also emitting $output (and/or captured stderr if you keep it) when status != 0.

Suggested change
- printf 'eks_get_token failed with exit code %s' "$status" >&2
+ printf 'eks_get_token failed with exit code %s' "$status" >&2
+ if [ -n "$output" ]; then
+     printf '. Output was: %s' "$output" >&2
+ fi

Comment on lines +116 to +117
printf 'Failed to extract token from eks_get_token output. ' >&2
printf 'Output was: %s' "$output" >&2

Copilot AI Apr 10, 2026

The empty-token branch prints the full eks_get_token output to stderr. That output includes the EKS bearer token (see eks_get_token.py printing token: {access_token}), so this will leak credentials into task logs when parsing fails. Please redact the token before logging (or omit the output entirely) to avoid exposing bearer tokens.

Suggested change
- printf 'Failed to extract token from eks_get_token output. ' >&2
- printf 'Output was: %s' "$output" >&2
+ printf 'Failed to extract token from eks_get_token output.' >&2
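A middle ground between dropping the output and logging it verbatim is to redact the bearer token before writing diagnostics. A minimal Python sketch, assuming the token always follows a `token: ` prefix as described in the comment above; `redact_token` is a hypothetical helper, not part of the provider.

```python
import re

# Matches "token: <value>", where the value runs up to whitespace or a comma.
_TOKEN_RE = re.compile(r"(token: )[^\s,]+")


def redact_token(output: str) -> str:
    """Replace any bearer token value following 'token: ' with a placeholder."""
    return _TOKEN_RE.sub(r"\1<redacted>", output)


print(redact_token("expirationTimestamp: 2026-04-05T22:00:00Z, token: k8s-aws-v1.secret"))
# expirationTimestamp: 2026-04-05T22:00:00Z, token: <redacted>
```

This keeps the timestamp and any surrounding error text visible for debugging while the credential itself never reaches task logs.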

Comment on lines +1277 to +1286
"""Verify COMMAND template redirects stderr to /dev/null to prevent
Python warnings/log output from contaminating stdout and breaking
bash token parsing. This is critical for cross-account AssumeRole
scenarios where the kubeconfig exec plugin must produce a clean token."""
from airflow.providers.amazon.aws.hooks.eks import COMMAND

# Verify stderr is redirected to /dev/null, not merged with stdout
assert "2>/dev/null" in COMMAND, (
"COMMAND must redirect stderr to /dev/null to prevent output contamination"
)

Copilot AI Apr 10, 2026

This test hard-codes that stderr must be redirected to /dev/null. The actual requirement for correctness is that stderr must not be merged into stdout (i.e., avoid 2>&1) so stdout remains parseable; discarding stderr is an implementation choice and reduces debuggability. Consider relaxing the assertion to only require that 2>&1 is not present, so future changes can keep stderr visible while still fixing the parsing issue.

Suggested change
- """Verify COMMAND template redirects stderr to /dev/null to prevent
- Python warnings/log output from contaminating stdout and breaking
- bash token parsing. This is critical for cross-account AssumeRole
- scenarios where the kubeconfig exec plugin must produce a clean token."""
- from airflow.providers.amazon.aws.hooks.eks import COMMAND
- # Verify stderr is redirected to /dev/null, not merged with stdout
- assert "2>/dev/null" in COMMAND, (
- "COMMAND must redirect stderr to /dev/null to prevent output contamination"
- )
+ """Verify COMMAND template keeps stderr separate from stdout so
+ Python warnings/log output cannot contaminate stdout and break
+ bash token parsing. This is critical for cross-account AssumeRole
+ scenarios where the kubeconfig exec plugin must produce a clean token."""
+ from airflow.providers.amazon.aws.hooks.eks import COMMAND
+ # Verify stderr is not merged with stdout. Whether stderr is discarded
+ # or left visible is an implementation choice.

Comment on lines +1302 to +1304
# Verify it exits with error on empty token
assert "exit 1" in COMMAND or 'exit "$' in COMMAND, (
"COMMAND must exit with error when token extraction fails"

Copilot AI Apr 10, 2026

assert "exit 1" in COMMAND or 'exit "$' in COMMAND can pass even if the empty-token validation never exits, because 'exit "$' matches the existing exit "$status" earlier in the script. Tighten this assertion to specifically verify that the empty-token block exits (e.g., by checking ordering relative to if [ -z "$token" ] or matching the exit 1 inside that block).

Suggested change
- # Verify it exits with error on empty token
- assert "exit 1" in COMMAND or 'exit "$' in COMMAND, (
- "COMMAND must exit with error when token extraction fails"
+ # Verify the empty-token validation block exits with an error
+ assert 'if [ -z "$token" ]; then\n exit 1\nfi' in COMMAND, (
+ "COMMAND must exit with error in the empty-token validation block when token extraction fails"
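Both test comments can be addressed without pinning an exact whitespace layout. The sketch below assumes the template contains an `if [ -z "$token" ]; then ... fi` block; the hypothetical helper `check_command_template` asserts only that stderr is not merged into stdout and that `exit 1` appears inside that specific block, so an earlier `exit "$status"` cannot satisfy it. `GOOD_TEMPLATE` is an illustrative stand-in, not the real COMMAND.

```python
import re

# Captures the body of the empty-token check, from "then" up to the first "fi".
_EMPTY_TOKEN_BLOCK = re.compile(r'if \[ -z "\$token" \]; then(.*?)\bfi\b', re.DOTALL)


def check_command_template(command: str) -> None:
    """Raise AssertionError if streams are merged or the empty-token exit is missing."""
    assert "2>&1" not in command, "stderr must not be merged into stdout"
    match = _EMPTY_TOKEN_BLOCK.search(command)
    assert match is not None, "missing empty-token validation block"
    assert "exit 1" in match.group(1), "empty-token block must exit with an error"


# Illustrative template only; the real COMMAND lives in the EKS hook module.
GOOD_TEMPLATE = """output=$(eks_get_token_placeholder 2>/dev/null)
status=$?
if [ "$status" -ne 0 ]; then
    exit "$status"
fi
if [ -z "$token" ]; then
    printf 'Failed to extract token from eks_get_token output.' >&2
    exit 1
fi
"""

check_command_template(GOOD_TEMPLATE)  # passes silently
```

Removing the `exit 1` line from the empty-token block makes the check fail even though `exit "$status"` is still present earlier in the script.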


Development

Successfully merging this pull request may close these issues.

EksPodOperator returns 401 Unauthorized when using cross-account AssumeRole via aws_conn_id
