Skip to content

chore: audit date/time expressions#4448

Open
andygrove wants to merge 4 commits into
apache:mainfrom
andygrove:audit-datetime-expressions
Open

chore: audit date/time expressions#4448
andygrove wants to merge 4 commits into
apache:mainfrom
andygrove:audit-datetime-expressions

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented May 27, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

Only from_utc_timestamp and to_utc_timestamp had per-version audit sub-bullets in spark_expressions_support.md. The rest of the datetime_funcs section was implemented but undocumented as to which Spark versions it had been validated against. This PR brings every implemented [x] date/time function (38 SQL names backed by 33 Comet serde objects) under the same audit format, captures the three correctness divergences found during that audit as ignored regression tests, and applies the support-level / reason consistency fixes the audit surfaced.

What changes are included in this PR?

Documentation

Adds sub-bullets to every [x] entry under ### datetime_funcs in docs/source/contributor-guide/spark_expressions_support.md, except the five entries that have no dedicated Comet serde (current_timezone, date_part, datepart, extract, localtimestamp route through Spark optimizer rewrites to other expressions or evaluate to constants).

For each function, the sub-bullets record:

  • Whether the Spark class is identical across 3.4.3, 3.5.8, and 4.0.1.
  • Spark 4.0 changes. Universally the NullIntolerant trait migrated to a nullIntolerant: Boolean field override; many string-typed inputTypes widened to StringTypeWithCollation(supportsTrimCollation = true); ANSI error helpers were renamed.
  • Known divergences between Comet and Spark, with tracking-issue links where applicable.

The work was driven by 8 parallel agents using the audit-comet-expression skill, each handling a related group of expressions (codegen-dispatched, date field extractors, Hour/Minute/Second, scalar function wrappers, timezone/unix, truncation, format, Iceberg transforms).

Captured regression tests for correctness findings

Three correctness divergences were found and captured as query ignore(<issue-url>) SQL tests. When the underlying bug is fixed, removing the ignore(...) activates the assertion.

Test Bug Issue
next_day.sql (appended) Comet trims ' MO ' to match Monday; Spark does not trim and returns NULL #4450
next_day_ansi.sql (new) next_day returns NULL where Spark throws under spark.sql.ansi.enabled=true #4449
make_date_ansi.sql (new) make_date returns NULL where Spark throws under spark.sql.ansi.enabled=true #4451

A fourth audit finding (make_date year 0 / negative years) was checked against Spark's own MakeDate.nullSafeEval, which uses LocalDate.of(year, month, day) and accepts the full Java date range. Spark and chrono agree; the existing make_date.sql tests at lines 133/136 already exercise it. Issue #4452 was closed as not-a-bug, no test added.

Why not fix instead of ignore? Both SparkNextDay and SparkMakeDate live in the upstream datafusion-spark crate (pinned at 53.1.0 in Cargo.lock). Fixing the ANSI throw and the trim behaviour requires changes there, which is outside the scope of this PR.

Support-level consistency fixes

Mechanical fixes for the alignment issues the audit surfaced. No behaviour changes; the only observable effect is that EXPLAIN-time fallback messages now include the specific reason instead of a generic "not fully compatible with Spark":

  • New TimeFieldSerde private companion hoists the shared TimestampNTZ reason used by CometHour, CometMinute, CometSecond (mirrors UTCTimestampSerde).
  • CometTruncDate: extracted reasons to private vals; corrected "Invalid" vs "Non-literal" wording mismatch.
  • CometTruncTimestamp: added the missing non-literal-format reason to getIncompatibleReasons; added missing getUnsupportedReasons override; reasons shared via private vals.
  • CometSecondsToTimestamp: added missing getUnsupportedReasons override.
  • CometHours / CometDays: added getSupportLevel and getUnsupportedReasons so the unsupported-input-type fallback surfaces in EXPLAIN/Compatibility Guide instead of being silently swallowed via withInfo in convert.
  • CometFromUnixTime: moved format-pattern check from convert into getSupportLevel (Unsupported for non-default, Incompatible for default); both reason methods now populated. Updated the existing from_unix_time.sql expect_fallback assertions to match the new, more specific fallback message.

Out of scope here: CometDateFormat (needs semantics decision about gating refactor) and CometUnixTimestamp (predicate-vs-reason disagreement needs verification of what TimestampNTZ actually does). Both are listed in the consistency-findings section below for follow-up.

Remaining consistency findings (follow-up)

These were surfaced by the audit but are out of scope here because they need semantics decisions, not mechanical edits:

  • CometUnixTimestamp: getUnsupportedReasons claims TimestampNTZType is not supported, but isSupportedInputType returns true for it. Either the predicate or the reason is wrong; needs verification of what the Rust path actually does with TimestampNTZ.
  • CometDateFormat: getSupportLevel returns Compatible() while convert reads allowIncompatible and branches on UTC vs non-UTC. The new skill rule prefers getSupportLevel gating, but moving it here changes the dispatch flow slightly and needs a separate review.
  • The 9 codegen-dispatched expressions in Group A (informational): getSupportLevel returns Compatible() while convert returns None when spark.comet.exec.scalaUDF.codegen.enabled=false. The dispatcher flag is not surfaced in the compatibility guide.

How are these changes tested?

Documentation: prettier --check passes on spark_expressions_support.md.

Captured regression tests: the new SQL files parse against SqlFileTestParser (IgnorePattern, ConfigPattern, MinSparkVersionPattern regexes all match). The query ignore(...) queries are skipped at runtime per the documented ignore mode contract in sql-file-tests.md, so they do not affect suite outcomes today and will activate automatically when the upstream datafusion-spark fixes land.

Support-level fixes: ran ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite datetime/" -Dtest=none locally. All 90 datetime SQL file tests pass.

@andygrove andygrove marked this pull request as draft May 27, 2026 14:04
@andygrove andygrove changed the title docs: audit date/time expressions across Spark 3.4.3, 3.5.8, 4.0.1 docs+test: audit date/time expressions; capture findings as ignored tests; update audit skill May 27, 2026
@andygrove andygrove changed the title docs+test: audit date/time expressions; capture findings as ignored tests; update audit skill test: audit date/time expressions; capture findings as ignored tests; update audit skill May 27, 2026
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request May 27, 2026
…omatically

Tightens Step 6 / Step 7 of the audit-comet-expression skill so the
workflow produces concrete output without pausing for user approval on
mechanical steps:

- Step 6 reorganises the priorities into three named buckets:
  correctness divergences, missing coverage, and consistency issues.
- Step 7 requires correctness findings to be captured as SQL file
  tests in the same PR as the audit. Walks through the
  "search for existing issue, file if missing" workflow and the
  trivial-fix-vs-`query ignore(<url>)` decision rule, with a concrete
  example.
- Step 7 also applies every Step 5 consistency finding automatically.
  These are mechanical edits (extract `private val`, switch
  `Incompatible(None)` to `Some(reason)`, add missing `get*Reasons`,
  move a check from `convert` into `getSupportLevel`, hoist a shared
  reason into a private companion) and do not need user approval.
- Only "missing test coverage" still pauses for user input, because
  adding tests for cases that already work is incremental polish.

Surfaced while applying the skill to a multi-expression audit in
apache#4448; both behaviours felt obviously right when the audit ran on a
larger surface and several recurring patterns made the asking step
feel like friction rather than safety.
andygrove added 3 commits May 27, 2026 08:46
Add per-version audit sub-bullets to every implemented date/time
expression in spark_expressions_support.md using the
audit-comet-expression skill. Covers 38 SQL function names across the
33 backing Comet serde objects (some serdes back multiple SQL names,
e.g. day/dayofmonth, date_add/dateadd, date_diff/datediff).

For each function, the sub-bullets record:
- Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1
- Spark 4.0 changes (universally the NullIntolerant trait /
  nullIntolerant: Boolean refactor, plus StringTypeWithCollation
  widening on string inputs and some error-helper renames)
- Known divergences between Comet and Spark, with tracking-issue links

The audit was driven by 8 parallel agents, each handling a related
group of expressions (codegen-dispatched, date field extractors,
Hour/Minute/Second, scalar function wrappers, timezone/unix,
truncation, format, Iceberg transforms).

Out of scope: current_timezone, date_part, datepart, extract,
localtimestamp route through Spark optimizer rewrites or evaluate to
constants and do not have dedicated Comet serdes; days and hours
are V2 partition transforms with no SQL function name and so do not
appear in this section.
Captured tests for the three correctness divergences found during the
datetime audit. Each test is in query ignore(<issue-url>) mode and
will activate when the corresponding upstream fix lands.

- next_day.sql gains a divergence test for whitespace trimming
  (Comet trims ' MO '; Spark does not). ignore(apache#4450).
- next_day_ansi.sql is new and asserts that next_day throws under
  spark.sql.ansi.enabled=true for malformed dayOfWeek. Comet
  currently returns NULL. ignore(apache#4449).
- make_date_ansi.sql is new and asserts that make_date throws under
  spark.sql.ansi.enabled=true for invalid (year, month, day). Comet
  currently returns NULL. ignore(apache#4451).

A fourth audit finding (make_date year 0 / negative years) was
verified against Spark's own implementation and turned out to be a
non-divergence; the issue was closed and no test added.

None of the three remaining bugs are trivial to fix here: both
SparkNextDay and SparkMakeDate live upstream in the datafusion-spark
crate, so the fixes need to flow through that project. The captured
tests will switch from ignore(...) to their intended assertion mode
when the upstream changes land.
Mechanical fixes for the support-level / reason alignment issues found
during the datetime audit. No behavioural changes; the only observable
effect is that EXPLAIN-time fallback messages now include the specific
reason instead of a generic "not fully compatible with Spark".

- TimeFieldSerde companion (new) hoists the shared TimestampNTZ reason
  string used by CometHour, CometMinute, and CometSecond, mirroring
  the existing UTCTimestampSerde pattern. The three serdes now share
  one reason and one support-level helper.
- CometTruncDate extracts the duplicated reason strings into private
  vals and corrects the wording drift (the inline reason said "Invalid"
  while the docs reason said "Non-literal"; they now match).
- CometTruncTimestamp adds the missing non-literal-format reason to
  getIncompatibleReasons, adds the missing getUnsupportedReasons
  override for unsupported format literals, and extracts both reasons
  into private vals.
- CometSecondsToTimestamp adds the missing getUnsupportedReasons
  override so the Compatibility Guide reflects which input types are
  supported.
- CometHours and CometDays add getSupportLevel and getUnsupportedReasons
  overrides so the unsupported-input-type fallback surfaces in EXPLAIN
  output and the Compatibility Guide; the dispatcher now handles the
  fall-back uniformly and the withInfo call in convert is no longer
  needed for those branches.
- CometFromUnixTime moves the format-pattern check out of convert into
  getSupportLevel (returning Unsupported for non-default patterns and
  Incompatible for the DataFusion timestamp-range issue on default
  patterns). Reasons are shared via private vals; getUnsupportedReasons
  and getIncompatibleReasons both populated. As a side effect the
  fallback message for non-default formats now includes the specific
  reason ("Only the default datetime format pattern...") rather than
  the generic "not fully compatible with Spark"; updated the existing
  from_unix_time.sql expect_fallback assertions accordingly.

The CometDateFormat and CometUnixTimestamp findings need deeper
semantics analysis and are left for follow-up.
@andygrove andygrove force-pushed the audit-datetime-expressions branch from 9f3ee71 to 8fb7905 Compare May 27, 2026 14:47
@andygrove andygrove changed the title test: audit date/time expressions; capture findings as ignored tests; update audit skill docs+test+fix: audit date/time expressions May 27, 2026
@andygrove andygrove changed the title docs+test+fix: audit date/time expressions chore: audit date/time expressions May 27, 2026
@andygrove andygrove marked this pull request as ready for review May 27, 2026 14:58
…e/TruncTimestamp wording

The audit found that the TruncDate / TruncTimestamp non-literal-format
reason was using two different wordings ("Invalid" in the inline
support-level branch, "Non-literal" in getIncompatibleReasons). The
preceding commit picked "Non-literal" as the canonical wording.

CometTemporalExpressionSuite was asserting against the old "Invalid"
wording in three tests; update those assertions to match.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant