chore: audit date/time expressions #4448
Open
andygrove wants to merge 4 commits into
This was referenced May 27, 2026
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request May 27, 2026
…omatically
Tightens Step 6 / Step 7 of the audit-comet-expression skill so the workflow produces concrete output without pausing for user approval on mechanical steps:
- Step 6 reorganises the priorities into three named buckets: correctness divergences, missing coverage, and consistency issues.
- Step 7 requires correctness findings to be captured as SQL file tests in the same PR as the audit. Walks through the "search for existing issue, file if missing" workflow and the trivial-fix-vs-`query ignore(<url>)` decision rule, with a concrete example.
- Step 7 also applies every Step 5 consistency finding automatically. These are mechanical edits (extract `private val`, switch `Incompatible(None)` to `Some(reason)`, add missing `get*Reasons`, move a check from `convert` into `getSupportLevel`, hoist a shared reason into a private companion) and do not need user approval.
- Only "missing test coverage" still pauses for user input, because adding tests for cases that already work is incremental polish.
Surfaced while applying the skill to a multi-expression audit in apache#4448; both behaviours felt obviously right when the audit ran on a larger surface and several recurring patterns made the asking step feel like friction rather than safety.
Add per-version audit sub-bullets to every implemented date/time expression in spark_expressions_support.md using the audit-comet-expression skill. Covers 38 SQL function names across the 33 backing Comet serde objects (some serdes back multiple SQL names, e.g. day/dayofmonth, date_add/dateadd, date_diff/datediff).
For each function, the sub-bullets record:
- Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1
- Spark 4.0 changes (universally the NullIntolerant trait / nullIntolerant: Boolean refactor, plus StringTypeWithCollation widening on string inputs and some error-helper renames)
- Known divergences between Comet and Spark, with tracking-issue links
The audit was driven by 8 parallel agents, each handling a related group of expressions (codegen-dispatched, date field extractors, Hour/Minute/Second, scalar function wrappers, timezone/unix, truncation, format, Iceberg transforms).
Out of scope: current_timezone, date_part, datepart, extract, localtimestamp route through Spark optimizer rewrites or evaluate to constants and do not have dedicated Comet serdes; days and hours are V2 partition transforms with no SQL function name and so do not appear in this section.
Captured tests for the three correctness divergences found during the datetime audit. Each test is in query ignore(<issue-url>) mode and will activate when the corresponding upstream fix lands.
- next_day.sql gains a divergence test for whitespace trimming (Comet trims ' MO '; Spark does not). ignore(apache#4450).
- next_day_ansi.sql is new and asserts that next_day throws under spark.sql.ansi.enabled=true for malformed dayOfWeek. Comet currently returns NULL. ignore(apache#4449).
- make_date_ansi.sql is new and asserts that make_date throws under spark.sql.ansi.enabled=true for invalid (year, month, day). Comet currently returns NULL. ignore(apache#4451).
A fourth audit finding (make_date year 0 / negative years) was verified against Spark's own implementation and turned out to be a non-divergence; the issue was closed and no test added.
None of the three remaining bugs are trivial to fix here: both SparkNextDay and SparkMakeDate live upstream in the datafusion-spark crate, so the fixes need to flow through that project. The captured tests will switch from ignore(...) to their intended assertion mode when the upstream changes land.
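The next_day findings above can be modelled in a few lines. The following is an illustrative Python sketch, not Comet or Spark code: the alias table and the case-insensitive matching are assumptions about how Spark resolves dayOfWeek, and all function names are hypothetical.

```python
from typing import Optional

# Accepted spellings per day. Assumption: this mirrors Spark's lookup,
# which is taken here to be case-insensitive and to perform no trimming.
_ALIASES = {
    "SUNDAY": ("SU", "SUN", "SUNDAY"),
    "MONDAY": ("MO", "MON", "MONDAY"),
    "TUESDAY": ("TU", "TUE", "TUESDAY"),
    "WEDNESDAY": ("WE", "WED", "WEDNESDAY"),
    "THURSDAY": ("TH", "THU", "THURSDAY"),
    "FRIDAY": ("FR", "FRI", "FRIDAY"),
    "SATURDAY": ("SA", "SAT", "SATURDAY"),
}
_LOOKUP = {alias: day for day, aliases in _ALIASES.items() for alias in aliases}


def parse_day_strict(text: str) -> Optional[str]:
    """Spark-like, non-ANSI: no trimming, unknown names become None (SQL NULL)."""
    return _LOOKUP.get(text.upper())


def parse_day_trimming(text: str) -> Optional[str]:
    """Behaviour reported for Comet: surrounding whitespace is trimmed first."""
    return _LOOKUP.get(text.strip().upper())


def parse_day_ansi(text: str) -> str:
    """Spark-like, ANSI mode: a malformed name raises instead of yielding NULL."""
    day = parse_day_strict(text)
    if day is None:
        raise ValueError(f"illegal day-of-week input: {text!r}")
    return day
```

With the input ' MO ', the strict lookup yields None while the trimming variant resolves to Monday, which is the whitespace divergence the next_day.sql test captures; the ANSI variant raises where the non-ANSI one returns None, which is the shape of the next_day_ansi.sql finding.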
Mechanical fixes for the support-level / reason alignment issues found
during the datetime audit. No behavioural changes; the only observable
effect is that EXPLAIN-time fallback messages now include the specific
reason instead of a generic "not fully compatible with Spark".
- TimeFieldSerde companion (new) hoists the shared TimestampNTZ reason
string used by CometHour, CometMinute, and CometSecond, mirroring
the existing UTCTimestampSerde pattern. The three serdes now share
one reason and one support-level helper.
- CometTruncDate extracts the duplicated reason strings into private
vals and corrects the wording drift (the inline reason said "Invalid"
while the docs reason said "Non-literal"; they now match).
- CometTruncTimestamp adds the missing non-literal-format reason to
getIncompatibleReasons, adds the missing getUnsupportedReasons
override for unsupported format literals, and extracts both reasons
into private vals.
- CometSecondsToTimestamp adds the missing getUnsupportedReasons
override so the Compatibility Guide reflects which input types are
supported.
- CometHours and CometDays add getSupportLevel and getUnsupportedReasons
overrides so the unsupported-input-type fallback surfaces in EXPLAIN
output and the Compatibility Guide; the dispatcher now handles the
fallback uniformly and the withInfo call in convert is no longer
needed for those branches.
- CometFromUnixTime moves the format-pattern check out of convert into
getSupportLevel (returning Unsupported for non-default patterns and
Incompatible for the DataFusion timestamp-range issue on default
patterns). Reasons are shared via private vals; getUnsupportedReasons
and getIncompatibleReasons both populated. As a side effect the
fallback message for non-default formats now includes the specific
reason ("Only the default datetime format pattern...") rather than
the generic "not fully compatible with Spark"; updated the existing
from_unix_time.sql expect_fallback assertions accordingly.
The CometDateFormat and CometUnixTimestamp findings need deeper
semantics analysis and are left for follow-up.
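The pattern these fixes converge on (one shared reason constant, returned by the support-level check and listed by the reason getters) can be sketched as follows. This is a simplified Python model rather than the Scala serde code: the class, the method names in snake case, the placeholder reason strings, and the assumed default pattern "yyyy-MM-dd HH:mm:ss" are all illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed default format pattern for from_unixtime; not taken from this PR.
DEFAULT_PATTERN = "yyyy-MM-dd HH:mm:ss"


@dataclass(frozen=True)
class SupportLevel:
    kind: str  # "compatible", "incompatible", or "unsupported"
    reason: Optional[str] = None


class FromUnixTimeSketch:
    """Hypothetical stand-in for a serde such as CometFromUnixTime."""

    # Each reason is defined once so the support-level check and the
    # documentation getters cannot drift apart. Placeholder wording only.
    _UNSUPPORTED_REASON = "placeholder: non-default format pattern"
    _INCOMPATIBLE_REASON = "placeholder: timestamp-range difference"

    def get_support_level(self, pattern: str) -> SupportLevel:
        # The check lives here rather than in convert, so the fallback
        # message can carry the specific reason.
        if pattern != DEFAULT_PATTERN:
            return SupportLevel("unsupported", self._UNSUPPORTED_REASON)
        return SupportLevel("incompatible", self._INCOMPATIBLE_REASON)

    def get_unsupported_reasons(self) -> List[str]:
        return [self._UNSUPPORTED_REASON]

    def get_incompatible_reasons(self) -> List[str]:
        return [self._INCOMPATIBLE_REASON]
```

Because both getters return the same constants the check uses, the kind of wording drift found in CometTruncDate ("Invalid" versus "Non-literal") cannot occur in this arrangement.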
9f3ee71 to
8fb7905
…e/TruncTimestamp wording
The audit found that the TruncDate / TruncTimestamp non-literal-format
reason was using two different wordings ("Invalid" in the inline
support-level branch, "Non-literal" in getIncompatibleReasons). The
preceding commit picked "Non-literal" as the canonical wording.
CometTemporalExpressionSuite was asserting against the old "Invalid"
wording in three tests; update those assertions to match.
Which issue does this PR close?
Closes #.
Rationale for this change
Only `from_utc_timestamp` and `to_utc_timestamp` had per-version audit sub-bullets in `spark_expressions_support.md`. The rest of the `datetime_funcs` section was implemented but undocumented as to which Spark versions it had been validated against. This PR brings every implemented `[x]` date/time function (38 SQL names backed by 33 Comet serde objects) under the same audit format, captures the three correctness divergences found during that audit as ignored regression tests, and applies the support-level / reason consistency fixes the audit surfaced.
What changes are included in this PR?
Documentation
Adds sub-bullets to every `[x]` entry under `### datetime_funcs` in `docs/source/contributor-guide/spark_expressions_support.md`, except the five entries that have no dedicated Comet serde (`current_timezone`, `date_part`, `datepart`, `extract`, `localtimestamp` route through Spark optimizer rewrites to other expressions or evaluate to constants).
For each function, the sub-bullets record:
- Whether the Spark class is identical across 3.4.3, 3.5.8, and 4.0.1.
- Spark 4.0 changes: the `NullIntolerant` trait migrated to a `nullIntolerant: Boolean` field override; many string-typed `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; ANSI error helpers were renamed.
- Known divergences between Comet and Spark, with tracking-issue links.
The work was driven by 8 parallel agents using the audit-comet-expression skill, each handling a related group of expressions (codegen-dispatched, date field extractors, Hour/Minute/Second, scalar function wrappers, timezone/unix, truncation, format, Iceberg transforms).
Captured regression tests for correctness findings
Three correctness divergences were found and captured as `query ignore(<issue-url>)` SQL tests. When the underlying bug is fixed, removing the `ignore(...)` activates the assertion.
- `next_day.sql` (appended): Comet trims `' MO '` to match `Monday`; Spark does not trim and returns NULL. Tracked in #4450.
- `next_day_ansi.sql` (new): `next_day` returns NULL where Spark throws under `spark.sql.ansi.enabled=true`. Tracked in #4449.
- `make_date_ansi.sql` (new): `make_date` returns NULL where Spark throws under `spark.sql.ansi.enabled=true`. Tracked in #4451.
A fourth audit finding (`make_date` year 0 / negative years) was checked against Spark's own `MakeDate.nullSafeEval`, which uses `LocalDate.of(year, month, day)` and accepts the full Java date range. Spark and chrono agree; the existing `make_date.sql` tests at lines 133/136 already exercise it. Issue #4452 was closed as not-a-bug, no test added.
Why not fix instead of ignore? Both `SparkNextDay` and `SparkMakeDate` live in the upstream `datafusion-spark` crate (pinned at 53.1.0 in `Cargo.lock`). Fixing the ANSI throw and the trim behaviour requires changes there, which is outside the scope of this PR.
Support-level consistency fixes
Mechanical fixes for the alignment issues the audit surfaced. No behaviour changes; the only observable effect is that EXPLAIN-time fallback messages now include the specific reason instead of a generic "not fully compatible with Spark":
- `TimeFieldSerde` private companion hoists the shared TimestampNTZ reason used by `CometHour`, `CometMinute`, `CometSecond` (mirrors `UTCTimestampSerde`).
- `CometTruncDate`: extracted reasons to `private val`s; corrected "Invalid" vs "Non-literal" wording mismatch.
- `CometTruncTimestamp`: added the missing non-literal-format reason to `getIncompatibleReasons`; added missing `getUnsupportedReasons` override; reasons shared via `private val`s.
- `CometSecondsToTimestamp`: added missing `getUnsupportedReasons` override.
- `CometHours` / `CometDays`: added `getSupportLevel` and `getUnsupportedReasons` so the unsupported-input-type fallback surfaces in EXPLAIN/Compatibility Guide instead of being silently swallowed via `withInfo` in `convert`.
- `CometFromUnixTime`: moved format-pattern check from `convert` into `getSupportLevel` (Unsupported for non-default, Incompatible for default); both reason methods now populated. Updated the existing `from_unix_time.sql` `expect_fallback` assertions to match the new, more specific fallback message.
Out of scope here: `CometDateFormat` (needs semantics decision about gating refactor) and `CometUnixTimestamp` (predicate-vs-reason disagreement needs verification of what TimestampNTZ actually does). Both are listed in the consistency-findings section below for follow-up.
Remaining consistency findings (follow-up)
These were surfaced by the audit but are out of scope here because they need semantics decisions, not mechanical edits:
- `CometUnixTimestamp`: `getUnsupportedReasons` claims `TimestampNTZType` is not supported, but `isSupportedInputType` returns `true` for it. Either the predicate or the reason is wrong; needs verification of what the Rust path actually does with TimestampNTZ.
- `CometDateFormat`: `getSupportLevel` returns `Compatible()` while `convert` reads `allowIncompatible` and branches on UTC vs non-UTC. The new skill rule prefers `getSupportLevel` gating, but moving it here changes the dispatch flow slightly and needs a separate review.
- `getSupportLevel` returns `Compatible()` while `convert` returns `None` when `spark.comet.exec.scalaUDF.codegen.enabled=false`. The dispatcher flag is not surfaced in the compatibility guide.
How are these changes tested?
Documentation: `prettier --check` passes on `spark_expressions_support.md`.
Captured regression tests: the new SQL files parse against `SqlFileTestParser` (`IgnorePattern`, `ConfigPattern`, `MinSparkVersionPattern` regexes all match). The `query ignore(...)` queries are skipped at runtime per the documented `ignore` mode contract in `sql-file-tests.md`, so they do not affect suite outcomes today and will activate automatically when the upstream `datafusion-spark` fixes land.
Support-level fixes: ran `./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite datetime/" -Dtest=none` locally. All 90 datetime SQL file tests pass.
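For readers unfamiliar with the `ignore` mode, the skip behaviour can be pictured with a small Python model. It is a sketch under stated assumptions: the regular expression and the helper names are hypothetical and are not the `IgnorePattern` used by `SqlFileTestParser`; only the `query ignore(<issue-url>)` directive shape comes from this PR.

```python
import re
from typing import Optional

# Hypothetical pattern for a "query ignore(<issue-url>)" directive line.
IGNORE_DIRECTIVE = re.compile(r"^query\s+ignore\((?P<url>[^)]+)\)\s*$")


def ignored_issue(directive: str) -> Optional[str]:
    """Return the tracking-issue URL if the directive is in ignore mode."""
    match = IGNORE_DIRECTIVE.match(directive.strip())
    return match.group("url") if match else None


def should_run(directive: str) -> bool:
    """A query in ignore mode is skipped; every other query runs."""
    return ignored_issue(directive) is None
```

Under this model a directive such as `query ignore(https://example.invalid/issues/1)` is skipped and reports its tracking URL, while a plain `query` line runs as usual.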