Skip to content

fix: EksPodOperator 401 with cross-account AssumeRole via aws_conn_id#65335

Open
anmolxlight wants to merge 2 commits intoapache:mainfrom
anmolxlight:fix/eks-pod-operator-cross-account-401-v2
Open

fix: EksPodOperator 401 with cross-account AssumeRole via aws_conn_id#65335
anmolxlight wants to merge 2 commits intoapache:mainfrom
anmolxlight:fix/eks-pod-operator-cross-account-401-v2

Conversation

@anmolxlight
Copy link
Copy Markdown

Fix: EksPodOperator 401 with Cross-Account AssumeRole via aws_conn_id

Fixes #64657

Problem

When using EksPodOperator with aws_conn_id pointing to a cross-account IAM role (via AssumeRole),
pods fail with 401 Unauthorized:

pods "simple-http-server" is forbidden: User "" cannot create resource "pods" in API group "" in the namespace "default"

Audit log shows empty user identity: "user":{}.

Root Cause

Two critical fragility points in the kubeconfig exec plugin COMMAND template in EksHook:

  1. stderr merged into stdout via 2>&1 — Python warnings, deprecation notices, or log output from eks_get_token contaminated stdout that bash token parsing relies on, causing last_line extraction to grab wrong line -> empty/invalid timestamp and token values

  2. No token validation — If parsing failed, malformed ExecCredential JSON with empty token was sent to EKS API server -> 401 with empty user identity

Changes

airflow/providers/amazon/aws/hooks/eks.py

  • Redirect stderr to /dev/null instead of merging with stdout (2>&1) to ensure clean token output for bash parsing
  • Add token validation: exit with error if token extraction fails, rather than sending a malformed ExecCredential with empty token
  • Security fix: Remove EKS bearer token from error output when token validation fails (prevents token leakage into task logs)
  • Keep `` in non-zero exit diagnostics for troubleshooting

tests/unit/amazon/aws/hooks/test_eks.py

  • Add test_command_template_redirects_stderr: verifies stderr is redirected to /dev/null and not merged with stdout
  • Add test_command_template_validates_token: verifies the token validation check and error exit

Review Notes

This supersedes PR #64749. The following genuine review concerns have been addressed:

  1. Copilot security flag: The empty-token error branch previously printed `` which includes the EKS bearer token. This has been removed to prevent credential leakage into task logs.

  2. Copilot diagnostic regression: Non-zero exit now includes `` for troubleshooting (was lost in original PR).

  3. o-nikolas concern: The /dev/null approach is intentional and correct — the stderr output we discard (Python warnings, botocore debug messages) is not actionable for users. The non-zero exit diagnostics cover the actionable failures.

  4. Copilot test assertion tightening: Test assertions have been improved to check for absence of 2>&1 (core correctness requirement) rather than just presence of /dev/null, and the exit 1 assertion is now unambiguous.

Testing

python3 -c "
content = open('providers/amazon/src/airflow/providers/amazon/aws/hooks/eks.py').read()
assert '2>&1' not in content
assert '2>/dev/null' in content
assert 'if [ -z "\" ]' in content
assert 'exit 1' in content
print('All checks passed')

Fixes apache#64657

- Redirect stderr to /dev/null instead of merging with stdout (2>&1) to
  prevent Python warnings/log output from contaminating stdout during
  bash token parsing. The token output must be the only thing on stdout.
- Add token validation: exit with error if token extraction fails, rather
  than sending a malformed ExecCredential with empty token to the API server
  (which caused 401 with empty user identity in audit logs).
- Remove EKS bearer token from error output printed to stderr when token
  validation fails (prevents token leakage into task logs).
- Keep $output in non-zero exit diagnostics for troubleshooting.

Co-Authored-By: Copilot <copilot@github.com>
@anmolxlight
Copy link
Copy Markdown
Author

The "CI image checks / Static checks" failure is a pre-existing CI infrastructure issue unrelated to these changes — the workflow cannot extract uv/prek versions from uv.lock (those packages aren't declared in the lockfile at expected versions). All other jobs (MyPy, unit tests, build checks) passed. Please re-run or advise on next steps.

@vincbeck
Copy link
Copy Markdown
Contributor

Static checks failure is very much related to this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EksPodOperator returns 401 Unauthorized when using cross-account AssumeRole via aws_conn_id

2 participants