chore: audit date/time expressions by andygrove · Pull Request #4448 · apache/datafusion-comet

andygrove · 2026-05-27T14:03:51Z

Which issue does this PR close?

Closes #.

Rationale for this change

Only from_utc_timestamp and to_utc_timestamp had per-version audit sub-bullets in spark_expressions_support.md. The rest of the datetime_funcs section was implemented but undocumented as to which Spark versions it had been validated against. This PR brings every implemented [x] date/time function (38 SQL names backed by 33 Comet serde objects) under the same audit format, captures the three correctness divergences found during that audit as ignored regression tests, and applies the support-level / reason consistency fixes the audit surfaced.

What changes are included in this PR?

Documentation

Adds sub-bullets to every [x] entry under ### datetime_funcs in docs/source/contributor-guide/spark_expressions_support.md, except the five entries that have no dedicated Comet serde (current_timezone, date_part, datepart, extract, localtimestamp route through Spark optimizer rewrites to other expressions or evaluate to constants).

For each function, the sub-bullets record:

Whether the Spark class is identical across 3.4.3, 3.5.8, and 4.0.1.
Spark 4.0 changes. Universally the NullIntolerant trait migrated to a nullIntolerant: Boolean field override; many string-typed inputTypes widened to StringTypeWithCollation(supportsTrimCollation = true); ANSI error helpers were renamed.
Known divergences between Comet and Spark, with tracking-issue links where applicable.

The work was driven by 8 parallel agents using the audit-comet-expression skill, each handling a related group of expressions (codegen-dispatched, date field extractors, Hour/Minute/Second, scalar function wrappers, timezone/unix, truncation, format, Iceberg transforms).

Captured regression tests for correctness findings

Three correctness divergences were found and captured as query ignore(<issue-url>) SQL tests. When the underlying bug is fixed, removing the ignore(...) activates the assertion.

Test	Bug	Issue
`next_day.sql` (appended)	Comet trims `' MO '` to match `Monday`; Spark does not trim and returns NULL	#4450
`next_day_ansi.sql` (new)	`next_day` returns NULL where Spark throws under `spark.sql.ansi.enabled=true`	#4449
`make_date_ansi.sql` (new)	`make_date` returns NULL where Spark throws under `spark.sql.ansi.enabled=true`	#4451

A fourth audit finding (make_date year 0 / negative years) was checked against Spark's own MakeDate.nullSafeEval, which uses LocalDate.of(year, month, day) and accepts the full Java date range. Spark and chrono agree; the existing make_date.sql tests at lines 133/136 already exercise it. Issue #4452 was closed as not-a-bug, no test added.

Why not fix instead of ignore? Both SparkNextDay and SparkMakeDate live in the upstream datafusion-spark crate (pinned at 53.1.0 in Cargo.lock). Fixing the ANSI throw and the trim behaviour requires changes there, which is outside the scope of this PR.

Support-level consistency fixes

Mechanical fixes for the alignment issues the audit surfaced. No behaviour changes; the only observable effect is that EXPLAIN-time fallback messages now include the specific reason instead of a generic "not fully compatible with Spark":

New TimeFieldSerde private companion hoists the shared TimestampNTZ reason used by CometHour, CometMinute, CometSecond (mirrors UTCTimestampSerde).
CometTruncDate: extracted reasons to private vals; corrected "Invalid" vs "Non-literal" wording mismatch.
CometTruncTimestamp: added the missing non-literal-format reason to getIncompatibleReasons; added missing getUnsupportedReasons override; reasons shared via private vals.
CometSecondsToTimestamp: added missing getUnsupportedReasons override.
CometHours / CometDays: added getSupportLevel and getUnsupportedReasons so the unsupported-input-type fallback surfaces in EXPLAIN/Compatibility Guide instead of being silently swallowed via withInfo in convert.
CometFromUnixTime: moved format-pattern check from convert into getSupportLevel (Unsupported for non-default, Incompatible for default); both reason methods now populated. Updated the existing from_unix_time.sql expect_fallback assertions to match the new, more specific fallback message.

Out of scope here: CometDateFormat (needs semantics decision about gating refactor) and CometUnixTimestamp (predicate-vs-reason disagreement needs verification of what TimestampNTZ actually does). Both are listed in the consistency-findings section below for follow-up.

Remaining consistency findings (follow-up)

These were surfaced by the audit but are out of scope here because they need semantics decisions, not mechanical edits:

CometUnixTimestamp: getUnsupportedReasons claims TimestampNTZType is not supported, but isSupportedInputType returns true for it. Either the predicate or the reason is wrong; needs verification of what the Rust path actually does with TimestampNTZ.
CometDateFormat: getSupportLevel returns Compatible() while convert reads allowIncompatible and branches on UTC vs non-UTC. The new skill rule prefers getSupportLevel gating, but moving it here changes the dispatch flow slightly and needs a separate review.
The 9 codegen-dispatched expressions in Group A (informational): getSupportLevel returns Compatible() while convert returns None when spark.comet.exec.scalaUDF.codegen.enabled=false. The dispatcher flag is not surfaced in the compatibility guide.

How are these changes tested?

Documentation: prettier --check passes on spark_expressions_support.md.

Captured regression tests: the new SQL files parse against SqlFileTestParser (IgnorePattern, ConfigPattern, MinSparkVersionPattern regexes all match). The query ignore(...) queries are skipped at runtime per the documented ignore mode contract in sql-file-tests.md, so they do not affect suite outcomes today and will activate automatically when the upstream datafusion-spark fixes land.

Support-level fixes: ran ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite datetime/" -Dtest=none locally. All 90 datetime SQL file tests pass.

…omatically Tightens Step 6 / Step 7 of the audit-comet-expression skill so the workflow produces concrete output without pausing for user approval on mechanical steps: - Step 6 reorganises the priorities into three named buckets: correctness divergences, missing coverage, and consistency issues. - Step 7 requires correctness findings to be captured as SQL file tests in the same PR as the audit. Walks through the "search for existing issue, file if missing" workflow and the trivial-fix-vs-`query ignore(<url>)` decision rule, with a concrete example. - Step 7 also applies every Step 5 consistency finding automatically. These are mechanical edits (extract `private val`, switch `Incompatible(None)` to `Some(reason)`, add missing `get*Reasons`, move a check from `convert` into `getSupportLevel`, hoist a shared reason into a private companion) and do not need user approval. - Only "missing test coverage" still pauses for user input, because adding tests for cases that already work is incremental polish. Surfaced while applying the skill to a multi-expression audit in apache#4448; both behaviours felt obviously right when the audit ran on a larger surface and several recurring patterns made the asking step feel like friction rather than safety.

Add per-version audit sub-bullets to every implemented date/time expression in spark_expressions_support.md using the audit-comet-expression skill. Covers 38 SQL function names across the 33 backing Comet serde objects (some serdes back multiple SQL names, e.g. day/dayofmonth, date_add/dateadd, date_diff/datediff). For each function, the sub-bullets record: - Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1 - Spark 4.0 changes (universally the NullIntolerant trait / nullIntolerant: Boolean refactor, plus StringTypeWithCollation widening on string inputs and some error-helper renames) - Known divergences between Comet and Spark, with tracking-issue links The audit was driven by 8 parallel agents, each handling a related group of expressions (codegen-dispatched, date field extractors, Hour/Minute/Second, scalar function wrappers, timezone/unix, truncation, format, Iceberg transforms). Out of scope: current_timezone, date_part, datepart, extract, localtimestamp route through Spark optimizer rewrites or evaluate to constants and do not have dedicated Comet serdes; days and hours are V2 partition transforms with no SQL function name and so do not appear in this section.

Captured tests for the three correctness divergences found during the datetime audit. Each test is in query ignore(<issue-url>) mode and will activate when the corresponding upstream fix lands. - next_day.sql gains a divergence test for whitespace trimming (Comet trims ' MO '; Spark does not). ignore(apache#4450). - next_day_ansi.sql is new and asserts that next_day throws under spark.sql.ansi.enabled=true for malformed dayOfWeek. Comet currently returns NULL. ignore(apache#4449). - make_date_ansi.sql is new and asserts that make_date throws under spark.sql.ansi.enabled=true for invalid (year, month, day). Comet currently returns NULL. ignore(apache#4451). A fourth audit finding (make_date year 0 / negative years) was verified against Spark's own implementation and turned out to be a non-divergence; the issue was closed and no test added. None of the three remaining bugs are trivial to fix here: both SparkNextDay and SparkMakeDate live upstream in the datafusion-spark crate, so the fixes need to flow through that project. The captured tests will switch from ignore(...) to their intended assertion mode when the upstream changes land.

Mechanical fixes for the support-level / reason alignment issues found during the datetime audit. No behavioural changes; the only observable effect is that EXPLAIN-time fallback messages now include the specific reason instead of a generic "not fully compatible with Spark". - TimeFieldSerde companion (new) hoists the shared TimestampNTZ reason string used by CometHour, CometMinute, and CometSecond, mirroring the existing UTCTimestampSerde pattern. The three serdes now share one reason and one support-level helper. - CometTruncDate extracts the duplicated reason strings into private vals and corrects the wording drift (the inline reason said "Invalid" while the docs reason said "Non-literal"; they now match). - CometTruncTimestamp adds the missing non-literal-format reason to getIncompatibleReasons, adds the missing getUnsupportedReasons override for unsupported format literals, and extracts both reasons into private vals. - CometSecondsToTimestamp adds the missing getUnsupportedReasons override so the Compatibility Guide reflects which input types are supported. - CometHours and CometDays add getSupportLevel and getUnsupportedReasons overrides so the unsupported-input-type fallback surfaces in EXPLAIN output and the Compatibility Guide; the dispatcher now handles the fall-back uniformly and the withInfo call in convert is no longer needed for those branches. - CometFromUnixTime moves the format-pattern check out of convert into getSupportLevel (returning Unsupported for non-default patterns and Incompatible for the DataFusion timestamp-range issue on default patterns). Reasons are shared via private vals; getUnsupportedReasons and getIncompatibleReasons both populated. As a side effect the fallback message for non-default formats now includes the specific reason ("Only the default datetime format pattern...") rather than the generic "not fully compatible with Spark"; updated the existing from_unix_time.sql expect_fallback assertions accordingly. The CometDateFormat and CometUnixTimestamp findings need deeper semantics analysis and are left for follow-up.

…e/TruncTimestamp wording The audit found that the TruncDate / TruncTimestamp non-literal-format reason was using two different wordings ("Invalid" in the inline support-level branch, "Non-literal" in getIncompatibleReasons). The preceding commit picked "Non-literal" as the canonical wording. CometTemporalExpressionSuite was asserting against the old "Invalid" wording in three tests; update those assertions to match.

andygrove marked this pull request as draft May 27, 2026 14:04

andygrove changed the title ~~docs: audit date/time expressions across Spark 3.4.3, 3.5.8, 4.0.1~~ docs+test: audit date/time expressions; capture findings as ignored tests; update audit skill May 27, 2026

andygrove changed the title ~~docs+test: audit date/time expressions; capture findings as ignored tests; update audit skill~~ test: audit date/time expressions; capture findings as ignored tests; update audit skill May 27, 2026

andygrove added 3 commits May 27, 2026 08:46

andygrove force-pushed the audit-datetime-expressions branch from 9f3ee71 to 8fb7905 Compare May 27, 2026 14:47

andygrove changed the title ~~test: audit date/time expressions; capture findings as ignored tests; update audit skill~~ docs+test+fix: audit date/time expressions May 27, 2026

andygrove mentioned this pull request May 27, 2026

docs: clarify support-level and reason consistency in audit-comet-expression skill #4447

Merged

andygrove changed the title ~~docs+test+fix: audit date/time expressions~~ chore: audit date/time expressions May 27, 2026

andygrove marked this pull request as ready for review May 27, 2026 14:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: audit date/time expressions#4448

chore: audit date/time expressions#4448
andygrove wants to merge 4 commits into
apache:mainfrom
andygrove:audit-datetime-expressions

andygrove commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andygrove commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Documentation

Captured regression tests for correctness findings

Support-level consistency fixes

Remaining consistency findings (follow-up)

How are these changes tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

andygrove commented May 27, 2026 •

edited

Loading