Skip to content

feat: make parse_url compatible#4413

Draft
parthchandra wants to merge 1 commit into
apache:mainfrom
parthchandra:parse_url_2
Draft

feat: make parse_url compatible#4413
parthchandra wants to merge 1 commit into
apache:mainfrom
parthchandra:parse_url_2

Conversation

@parthchandra
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #4150.

Rationale for this change

PR #4350 wired parse_url through Comet's serde layer to the upstream datafusion-spark UDFs but marked it Incompatible due to divergences between the WHATWG URL Standard (url crate) and Spark's java.net.URI (RFC 3986). This PR replaces the upstream implementation with a local RFC 3986 regex-based
parser that matches Spark's behavior, promoting parse_url from Incompatible to Compatible.

What changes are included in this PR?

New: native/spark-expr/src/url_funcs/parse_url.rs

  • CometParseUrl and CometTryParseUrl UDFs using RFC 3986 Appendix B regex decomposition instead of the WHATWG url crate
  • Fixes all tracked divergences from Spark:
    1. Empty-string URL with PATH/FILE now returns "" (was NULL)
    2. FILE on URL without explicit path (http://host?foo=bar) returns ?foo=bar (was /?foo=bar)
    3. PATH on trailing slash (http://host/) returns "/" (was "")
    4. 3-arg QUERY with key returns raw percent-encoded value (was decoded)
  • Additional correctness fixes found during adversarial review:
    • Query values containing = (e.g., ?a=b=c) now correctly return b=c
    • 3-arg call with non-QUERY part (e.g., parse_url(url, 'HOST', 'key')) returns NULL
    • Empty port (http://host:/path) strips trailing colon
    • Empty authority (http:///path) returns NULL instead of ""
  • ANSI mode: raises [INVALID_URL] error for malformed URLs (spaces, control chars, missing scheme with ://)

Modified: native/core/src/execution/jni_api.rs

  • Registers CometParseUrl/CometTryParseUrl from spark-expr instead of upstream SparkParseUrl/SparkTryParseUrl

Modified: spark/.../serde/url.scala

  • Removes Incompatible override and incompatibleReason (now Compatible by default)

Modified: SQL test files

  • parse_url.sql: expanded from 4 fallback queries to 30+ native-execution queries covering all components, edge cases, and divergence fixes
  • parse_url_ansi.sql: enables previously-ignored ANSI error tests with expect_error(INVALID_URL)

How are these changes tested?

  • 25 Rust unit tests in parse_url.rs covering all URL components, divergence fixes, error handling, null propagation, and edge cases (query values with =, 3-arg non-QUERY, empty port, empty authority, IPv6, malformed URLs)
  • SQL file tests (parse_url.sql, parse_url_ansi.sql) verified against Spark on both spark-4.0 and spark-4.1 profiles
  • Adversarial code review identified and fixed 4 additional correctness issues before merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support expressions already implemented in datafusion-spark crate

1 participant