Track parse_url Spark compatibility work #4156

@andygrove

Background

parse_url was added in #4152. The native implementation (datafusion-spark's spark_parse_url) diverges from Spark on several edge cases, so CometParseUrl is marked Incompatible and falls back to Spark by default. Users opt into the native path with `spark.comet.expression.ParseUrl.allowIncompatible=true`.

This issue tracks the work to bring the native implementation closer to Spark and reduce the surface that requires opt-in.

Known divergences

Tracked upstream at apache/datafusion#21943.

  1. Empty-string URL returns NULL where Spark returns `""`. This affects every part key.
  2. `FILE` on a URL without an explicit path inserts a leading `/` (e.g. `/?foo=bar`) where Spark returns the bare query (`?foo=bar`).
  3. `PATH` on a URL with a bare trailing slash returns `""` where Spark returns `/`.
  4. 3-arg `parse_url(_, 'QUERY', key)` form-decodes the value (`x%20y` → `x y`) where Spark preserves the raw percent-encoding.

The 2-arg `parse_url(_, 'QUERY')` does match Spark; only the 3-arg form with a key diverges.
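
To make divergence 4 concrete, here is a minimal Python sketch of the two behaviors. It is not the Spark or datafusion-spark implementation; the helper names `query_value_raw` and `query_value_decoded` are hypothetical, and the splitting logic is a simplification that only illustrates the difference between preserving and decoding the percent-encoded value.

```python
from urllib.parse import unquote_plus


def query_value_raw(query, key):
    """Return the value for `key` exactly as it appears in the query string.

    Hypothetical helper mirroring the behavior the issue attributes to Spark:
    percent-encoding is preserved.
    """
    for pair in query.split("&"):
        name, sep, value = pair.partition("=")
        if sep and name == key:
            return value
    return None


def query_value_decoded(query, key):
    """Return the form-decoded value for `key`.

    Hypothetical helper mirroring the behavior the issue attributes to the
    native path: `%20` (and `+`) become a space.
    """
    raw = query_value_raw(query, key)
    return None if raw is None else unquote_plus(raw)


query = "a=x%20y&b=2"
print(query_value_raw(query, "a"))      # x%20y
print(query_value_decoded(query, "a"))  # x y
```

Only the keyed lookup differs here; returning the whole query string untouched corresponds to the 2-arg form, which already matches.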

Suggested work plan

  1. Drive fixes upstream in DataFusion for divergences 1 through 4 above so we can pick them up via a `datafusion-spark` bump.
  2. Once divergences 1 through 4 are resolved, change `CometParseUrl.getSupportLevel` from `Incompatible` to `Compatible` and remove the `expect_fallback` queries in `spark/src/test/resources/sql-tests/expressions/url/parse_url.sql`.
  3. Expand the native test (`parse_url_native.sql`) to cover the URL shapes currently in the fallback file (e.g. trailing-slash PATH, percent-encoded query value) so we have positive coverage once each divergence is fixed.
  4. Consider extending `parse_url` test coverage to invalid URLs in non-ANSI mode (Spark returns NULL for everything).
