feat: expose spark-compatible functions#1564
Draft
timsaucer wants to merge 12 commits into
Draft
Conversation
Add `datafusion.functions.spark` module exposing the upstream `datafusion-spark` crate's UDF/UDAF library (~87 functions across string, math, datetime, hash, array, aggregate, bitwise, bitmap, conditional, collection, conversion, json, map, url categories). For DataFrame use, import the typed Python wrappers from `datafusion.functions.spark`. For SQL use, call `SessionContext.enable_spark_functions()` to register the Spark UDFs by name (overriding DataFusion built-ins of the same name with their Spark semantics — NULL-propagating `concat`, 1-indexed `substring`, HALF_UP `round`, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven `#[allow(clippy::borrow_deref_ref)]` attributes on module declarations in `crates/core/src/lib.rs` had become stale — the only remaining lint hit was a redundant `&*x.as_str()` pattern in `parse_file_compression_type`. Rewriting that call to `&x.unwrap_or_default()` lets every allow come off, removing noise that new modules were copying without need. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch most spark wrappers from UDF-direct path (which forced `spark_udf_fixed!(name, fn_category::name, args...)` repetition) to a `spark_expr_fn!` macro that mirrors the existing `expr_fn!` macro in `functions.rs`, so calls collapse to `spark_expr_fn!(sha2, arg1 bit_length);`. UDF-direct retained for genuinely variadic functions whose upstream `expr_fn` wrappers were generated with a single-`Expr` arm by `export_functions!` (concat, array, xxhash64, parse_url family, etc.) so that the Python side keeps its `*args` ergonomics. Aggregates collapse the same way via `spark_aggregate!` mirroring `aggregate_function!`. Net 173 lines removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The intro wording implied "SQL functions" only; the same wrappers are the primary entry point for the DataFrame API as well. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace API-speak ("Import the submodule", "Returned values are Expr
instances that compose") with a concrete description of where users can
actually drop these calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hand-maintained category list would drift from the actual module as upstream `datafusion-spark` adds/removes functions. Replace with a pointer to the AutoAPI-generated reference, which renders from the module itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
38 wrappers carried `# doctest: +SKIP` because outputs weren't verified at authoring time. Run each with concrete inputs, capture actual outputs, and inline the values so the doctests execute and stay correct. Covers datetime (20), URL (5), bitmap (3), map (3), and remaining hash, JSON, math, string, conversion, and format_string cases. Net new doctest coverage: 65 examples now run that were skipped before; total skipped across the suite drops from 53 to 12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Align positional parameter names in `functions.spark` with pyspark.sql.functions: - aggregate first positional → `col` (avg, try_sum, collect_list, collect_set) - unary `arg` → `col` across math/string/byte/datetime helpers - multi-arg renames: array_contains (col, value), array (*cols), shuffle (col), array_repeat (col, count), slice (x, start, length), shiftleft/right/rightunsigned (col, numBits), add_months (start, months), date_add/sub (start, days), date_diff (end, start), date_trunc (format, timestamp), time_trunc (unit, time), trunc (date, format), next_day (date, dayOfWeek), from/to_utc_timestamp (timestamp, tz), sha2 (col, numBits), xxhash64 (*cols), map_from_arrays (col1, col2), width_bucket (v, min, max, numBucket), substring (str, pos, len), concat (*cols), elt (*inputs), is_valid_utf8/make_valid_utf8 (str) Bodies updated to reference the new names; positional callers unaffected. This finishes Category 1 / Category 4 (spark-side BOTH-bucket) renames from PYSPARK_ALIGNMENT_PLAN.md PR 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match pyspark's optional-parameter surface in the spark namespace: - make_dt_interval, make_interval: all parts default to zero (int32 0 / lit 0.0) - str_to_map: pair_delim defaults to ',', key_value_delim defaults to ':' - round: scale defaults to 0 (HALF_UP rounding to nearest integer) - shuffle: accepts `seed` kwarg for pyspark parity; raises NotImplementedError for non-None values until the Rust binding supports it - like, ilike: accept `escapeChar` for pyspark parity; same NotImplementedError guard; first positional renamed `string` → `str` to match pyspark ceil/floor `scale=` deferred — the underlying Rust expr_fn is single-arg. Added a module-level `_ZERO_I32` literal to avoid rebuilding the pyarrow int32 zero scalar on every call. Tests: positional-compat coverage for aggregates (`spark.avg(col)` etc.), defaults-omitted cases for the optional-arg functions, and NotImplementedError cases for `shuffle(seed=)` and `like/ilike(escapeChar=)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace generic ``*args`` with explicit pyspark-style signatures: - json_tuple(col, *fields) — first positional is the JSON expr - format_string(format, *cols) — `format` is the printf template; a plain ``str`` is auto-promoted to a literal - parse_url(url, partToExtract, key=None) — `key` is optional and only meaningful with ``partToExtract='QUERY'`` - try_parse_url(url, partToExtract, key=None) — same shape - url_decode(str), try_url_decode(str), url_encode(str) — single-argument forms (multi-arg calls were always semantically wrong) Tests cover the three-arg parse_url path and the plain-str format_string auto-promotion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`functions.spark` mirrors `pyspark.sql.functions` and now ships on this branch. Update every skill that references the function surface: - skills/datafusion_python/SKILL.md (user-facing): add an import reference, a Core Abstractions row, and a "Spark-Compatible Functions" subsection listing coverage by category, the SQL-vs-DataFrame usage (`enable_spark_functions`), and the divergent-semantics table (concat NULL, round HALF_UP, trunc) so callers know which namespace to pick. - .ai/skills/check-upstream/SKILL.md: new area for the `datafusion-spark` crate with the coverage policy (parity with pyspark, extras allowed when positional pyspark calls still work). Hygiene check also now spans `functions/spark.py`'s `__all__`. - .ai/skills/audit-skill-md/SKILL.md: add `functions.spark` to the surface table and a `spark-functions` scope so this audit also validates the new subsection and divergent-semantics table. - .ai/skills/make-pythonic/SKILL.md: explicit scope note that the spark namespace is a deliberate pyspark mirror — generic native-type coercion does not apply there. Path references updated to the new `functions/__init__.py` module layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #1482
Rationale for this change
This expands the pool of available functions for users. Some replace existing functions and others are new.
What changes are included in this PR?
Are there any user-facing changes?
No existing functions are impacted. New APIs and functions exposed.