Validate destination paths derived from GCS object names #67667

Open
potiuk wants to merge 2 commits into apache:main from potiuk:validate-gcs-object-name-paths

potiuk (Member) commented May 28, 2026

Two operators in the Google provider compute filesystem destination paths from GCS object names returned by list*() / list_by_timespan(), without normalizing the join result. Because GCS object names are arbitrary UTF-8 strings, a name containing .. segments (or an absolute-path prefix) escapes the configured destination when the path lands on the SFTP server or the local worker:

  • GCSToSFTPOperator._resolve_destination_path (providers/google/.../transfers/gcs_to_sftp.py) — joins source_object to self.destination_path via os.path.join, which preserves .. segments and lets absolute source_object values absorb the destination prefix entirely.
  • GCSTimeSpanFileTransformOperator._download (providers/google/.../operators/gcs.py, line 899) — joins blob_name to temp_input_dir_path via pathlib.Path.__truediv__, with the same lack of normalization.
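
The join behaviour described in the two bullets above can be reproduced with the standard library alone. The snippet below is a minimal illustration, not code from the PR: the directory names are hypothetical, and `posixpath` / `PurePosixPath` are used so the result does not depend on the platform running it.

```python
import posixpath
from pathlib import PurePosixPath

destination = "/srv/upload"  # hypothetical configured destination

# os.path.join (posixpath.join on POSIX) keeps ".." segments verbatim ...
joined = posixpath.join(destination, "../../etc/passwd")
# ... so normalising the result afterwards lands outside the destination.
normalized = posixpath.normpath(joined)

# An absolute second argument discards everything joined before it.
absorbed = posixpath.join(destination, "/etc/passwd")

# pathlib's "/" operator (Path.__truediv__) behaves the same way.
pure_joined = PurePosixPath("/tmp/input") / "../outside.txt"
pure_absorbed = PurePosixPath("/tmp/input") / "/etc/passwd"

print(joined)         # /srv/upload/../../etc/passwd
print(normalized)     # /etc/passwd
print(absorbed)       # /etc/passwd
print(pure_joined)    # /tmp/input/../outside.txt
print(pure_absorbed)  # /etc/passwd
```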

This PR adds a post-join validation to both call sites: after computing the destination path, normalize/resolve it and verify it remains within the configured destination (or the temp input directory). When it does not, raise AirflowException with a descriptive message identifying the offending object/blob name.
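
For the local temp-directory case, a post-join check of this kind could look roughly like the sketch below. It is an assumption about the general shape of such a check, not the PR's actual code: `safe_join` and `PathEscapeError` are hypothetical names, and `PathEscapeError` stands in for the `AirflowException` the PR raises so that the example runs without Airflow installed.

```python
import tempfile
from pathlib import Path


class PathEscapeError(ValueError):
    """Hypothetical stand-in for the AirflowException raised in the PR."""


def safe_join(base_dir: Path, blob_name: str) -> Path:
    """Join blob_name onto base_dir and refuse results outside base_dir."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks; it does not
    # require the target to exist (strict=False is the default).
    candidate = (base / blob_name).resolve()
    # The candidate must sit strictly below the base directory. A name that
    # resolves to the base itself is also refused, since it is not a file.
    if base not in candidate.parents:
        raise PathEscapeError(
            f"Blob name {blob_name!r} resolves outside {str(base)!r}"
        )
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    print(safe_join(Path(tmp), "nested/data.csv"))  # stays inside tmp
    try:
        safe_join(Path(tmp), "../outside.txt")
    except PathEscapeError as err:
        print("refused:", err)
```

Resolving both sides before comparing keeps the check meaningful when the temp directory itself is reached through a symlink.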

The allowlist in scripts/ci/prek/known_airflow_exceptions.txt is bumped for the two new raise AirflowException sites (one per operator).

Test plan

  • New unit tests: ..-segment object/blob name → AirflowException; absolute path → AirflowException; benign nested path → unchanged behavior.
  • Existing tests pass (82 passed in test_gcs_to_sftp.py + test_gcs.py).
  • prek run green on touched files.
Was generative AI tooling used to co-author this PR?
  • Yes — Claude Opus 4.7 (1M context)

Generated-by: Claude Opus 4.7 (1M context) following the guidelines at https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions

boring-cyborg Bot added the area:dev-tools, area:providers, backport-to-v3-2-test, and provider:google labels May 28, 2026
shahar1 (Contributor) commented May 29, 2026

Conflict :(

potiuk and others added 2 commits May 29, 2026 20:42
Two operators in the Google provider compute filesystem destination
paths from GCS object names returned by ``list*()`` /
``list_by_timespan()``, without normalising the join result. GCS
object names are arbitrary UTF-8/Unicode strings written by whoever
can produce to the source bucket — frequently a lower-trust external
producer rather than the DAG author — so a name containing ``..``
segments or an absolute-path prefix would otherwise canonicalise
outside the configured destination on the SFTP server or the local
worker filesystem.

- GCSToSFTPOperator._resolve_destination_path: normalise the joined
  destination path and refuse any path that does not stay within the
  configured destination_path.
- GCSTimeSpanFileTransformOperator._download: resolve the joined
  blob destination and refuse any path that does not stay within the
  temp input directory.

The allowlist in scripts/ci/prek/known_airflow_exceptions.txt is
bumped for the two new ``raise AirflowException`` sites (one per
operator).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The post-join validation added in the previous commit rejected
benign uploads when ``destination_path`` was ``"."`` or ``""`` — both
of which are valid SFTP destinations referring to the user's
login / current directory. The startswith check on
``base.rstrip(os.sep) + os.sep`` evaluated to ``"./"`` for a
``"."`` base, and ``"file.txt"`` does not start with ``"./"``, so a
benign ``file.txt`` upload was misclassified as an escape.
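
The false positive can be reproduced with a small reconstruction of the prefix check. This is a hypothetical sketch based only on the description above, not the removed code: `naive_is_within` is an invented name, and `"/"` is assumed as the separator in place of ``os.sep``.

```python
import posixpath


def naive_is_within(base: str, source_object: str) -> bool:
    """Hypothetical reconstruction of the single startswith check."""
    resolved = posixpath.normpath(posixpath.join(base, source_object))
    prefix = base.rstrip("/") + "/"
    return resolved.startswith(prefix)


# With an absolute base the check behaves as intended.
print(naive_is_within("/srv/upload", "file.txt"))  # True
print(naive_is_within("/srv/upload", "../x"))      # False

# With "." the prefix is "./", but normpath("./file.txt") is "file.txt",
# so a benign upload is reported as an escape.
print(naive_is_within(".", "file.txt"))            # False

# With "" the prefix is "/", which a relative result never starts with.
print(naive_is_within("", "file.txt"))             # False
```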

Replace the single startswith check with three explicit conditions:

* a leading ``..`` segment after normalisation -> escapes any base;
* an absolute resolved path under a relative configured base ->
  ``source_object`` was itself absolute and absorbed the join;
* with an absolute base, the resolved path must stay within the
  base subtree.
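
Read literally, the three conditions could be sketched as follows. This is an illustration under stated assumptions rather than the committed implementation: `resolve_destination` is a hypothetical name, `ValueError` stands in for ``AirflowException``, and POSIX separators are assumed.

```python
import posixpath


def resolve_destination(base: str, source_object: str) -> str:
    """Join and normalise, then apply the three checks in order."""
    resolved = posixpath.normpath(posixpath.join(base, source_object))

    # 1. A leading ".." segment after normalisation escapes any base.
    if resolved == ".." or resolved.startswith("../"):
        raise ValueError(f"{source_object!r} escapes {base!r}")

    base_is_absolute = posixpath.isabs(base)

    # 2. An absolute result under a relative base means source_object was
    #    itself absolute and absorbed the join.
    if not base_is_absolute and posixpath.isabs(resolved):
        raise ValueError(f"{source_object!r} is absolute")

    # 3. With an absolute base, the result must stay within the base subtree.
    if base_is_absolute:
        norm_base = posixpath.normpath(base)
        prefix = norm_base.rstrip("/") + "/"
        if resolved != norm_base and not resolved.startswith(prefix):
            raise ValueError(f"{source_object!r} escapes {base!r}")

    return resolved


print(resolve_destination(".", "file.txt"))             # file.txt
print(resolve_destination("", "file.txt"))              # file.txt
print(resolve_destination("/srv/upload", "a/b.txt"))    # /srv/upload/a/b.txt
```

In this sketch, a relative base other than ``"."`` or ``""`` is only protected by the first two checks: ``uploads`` joined with ``../x`` normalises to ``x`` and passes. The commit message does not say how the PR treats that case.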

Add regression tests covering ``destination_path="."`` and
``destination_path=""`` (allowed) as well as the same relative
base with a ``..`` escape or an absolute ``source_object``
(refused).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@potiuk potiuk force-pushed the validate-gcs-object-name-paths branch from 6bb44ed to e4f3049 Compare May 29, 2026 18:48