Validate destination paths derived from GCS object names #67667
Two operators in the Google provider compute filesystem destination paths from GCS object names returned by ``list*()`` / ``list_by_timespan()``, without normalising the join result. GCS object names are arbitrary UTF-8/Unicode strings written by whoever can produce to the source bucket (frequently a lower-trust external producer rather than the DAG author), so a name containing ``..`` segments or an absolute-path prefix would otherwise canonicalise outside the configured destination on the SFTP server or the local worker filesystem.

- GCSToSFTPOperator._resolve_destination_path: normalise the joined destination path and refuse any path that does not stay within the configured destination_path.
- GCSTimeSpanFileTransformOperator._download: resolve the joined blob destination and refuse any path that does not stay within the temp input directory.

The allowlist in scripts/ci/prek/known_airflow_exceptions.txt is bumped for the two new ``raise AirflowException`` sites (one per operator).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
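As an illustration of the resolve-and-contain check described for the temp input directory, here is a minimal sketch. It is not the operator's actual code: the function name ``resolve_blob_destination`` is hypothetical, and it raises ``ValueError`` instead of ``AirflowException`` so that it runs without Airflow installed. It assumes Python 3.9+ for ``Path.is_relative_to``.

```python
from pathlib import Path


def resolve_blob_destination(temp_input_dir: Path, blob_name: str) -> Path:
    """Join blob_name onto temp_input_dir and refuse results outside it.

    Hypothetical helper for illustration only; the real operator raises
    AirflowException rather than ValueError.
    """
    # Resolve the base first so symlinked temp directories compare correctly.
    base = temp_input_dir.resolve()
    # Path "/" keeps ".." segments and lets an absolute right-hand side
    # replace the base entirely, so the join result must be resolved
    # before it can be compared against the base.
    candidate = (base / blob_name).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(
            f"Blob name {blob_name!r} resolves outside the temp input directory"
        )
    return candidate
```

With this sketch, a blob name such as ``a/b.csv`` resolves under the base directory, while ``../x`` or ``/etc/passwd`` raises before anything is written.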
The post-join validation added in the previous commit rejected benign uploads when ``destination_path`` was ``"."`` or ``""``, both of which are valid SFTP destinations referring to the user's login / current directory. The startswith check on ``base.rstrip(os.sep) + os.sep`` evaluated to ``"./"`` for a ``"."`` base, and ``"file.txt"`` does not start with ``"./"``, so a benign ``file.txt`` upload was misclassified as an escape.

Replace the single startswith check with three explicit conditions:

* a leading ``..`` segment after normalisation -> escapes any base;
* an absolute resolved path under a relative configured base -> ``source_object`` was itself absolute and absorbed the join;
* with an absolute base, the resolved path must stay within the base subtree.

Add regression tests covering ``destination_path="."`` and ``destination_path=""`` (allowed) as well as the same relative base with a ``..`` escape or an absolute ``source_object`` (refused).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
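The three conditions above can be sketched roughly as follows. This is an illustrative reading of the commit message, not the code from the PR: the function name ``is_within_destination`` is hypothetical, and ``posixpath`` is used in place of ``os.path`` (an assumption) so the behaviour is the same on every platform.

```python
import posixpath


def is_within_destination(destination_path: str, source_object: str) -> bool:
    """Return True if the joined path stays within destination_path.

    Hypothetical sketch of the three conditions from the commit message.
    """
    # Treat "" like "." (the login / current directory on the SFTP server).
    base = posixpath.normpath(destination_path or ".")
    resolved = posixpath.normpath(posixpath.join(base, source_object))

    # Condition 1: a leading ".." segment after normalisation escapes any base.
    if resolved == ".." or resolved.startswith("../"):
        return False

    base_is_absolute = posixpath.isabs(base)

    # Condition 2: an absolute result under a relative base means the
    # source object was itself absolute and absorbed the join.
    if posixpath.isabs(resolved) and not base_is_absolute:
        return False

    # Condition 3: with an absolute base, the result must stay in its subtree.
    if base_is_absolute:
        prefix = base.rstrip("/") + "/"
        return resolved == base or resolved.startswith(prefix)

    return True
```

In this sketch ``is_within_destination(".", "file.txt")`` and ``is_within_destination("", "file.txt")`` are both accepted, which is the case the earlier single startswith check got wrong, while ``is_within_destination(".", "../x")`` and ``is_within_destination(".", "/etc/passwd")`` are refused.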
Two operators in the Google provider compute filesystem destination paths from GCS object names returned by ``list*()`` / ``list_by_timespan()``, without normalizing the join result. Because GCS object names are arbitrary UTF-8 strings, a name containing ``..`` segments (or an absolute-path prefix) escapes the configured destination when the path lands on the SFTP server or the local worker:

- ``GCSToSFTPOperator._resolve_destination_path`` (``providers/google/.../transfers/gcs_to_sftp.py``): joins ``source_object`` to ``self.destination_path`` via ``os.path.join``, which preserves ``..`` segments and lets absolute ``source_object`` values absorb the destination prefix entirely.
- ``GCSTimeSpanFileTransformOperator._download`` (``providers/google/.../operators/gcs.py``, line 899): joins ``blob_name`` to ``temp_input_dir_path`` via ``pathlib.Path.__truediv__``, with the same lack of normalization.

This PR adds a post-join validation to both call sites: after computing the destination path, normalize/resolve it and verify it remains within the configured destination (or the temp input directory). When it does not, raise ``AirflowException`` with a descriptive message identifying the offending object/blob name.

The allowlist in ``scripts/ci/prek/known_airflow_exceptions.txt`` is bumped for the two new ``raise AirflowException`` sites (one per operator).

Test plan

- ``..``-segment object/blob name → ``AirflowException``; absolute path → ``AirflowException``; benign nested path → unchanged behavior.
- Tests in ``test_gcs_to_sftp.py`` + ``test_gcs.py``.
- ``prek run`` green on touched files.

Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.7 (1M context) following the guidelines at https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions