Skip to content

docs: improve review skill and contributor guide for serde patterns#4132

Merged
andygrove merged 4 commits intoapache:mainfrom
andygrove:docs-scalar-fn-return-type
Apr 29, 2026
Merged

docs: improve review skill and contributor guide for serde patterns#4132
andygrove merged 4 commits intoapache:mainfrom
andygrove:docs-scalar-fn-return-type

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Apr 28, 2026

Which issue does this PR close?

N/A. Documentation-only follow-up to issues observed while reviewing scalar function PRs (e.g. #4105 for levenshtein).

Rationale for this change

Two related cleanups to the contributor guide and the local review-comet-pr skill:

  1. Document when a scalar function serde must set its return type explicitly. When a Comet scalar function shares its name with a datafusion-functions built-in (e.g. levenshtein, concat, coalesce, sha2, regexp_replace), the serde must use scalarFunctionExprToProtoWithReturnType, not scalarFunctionExprToProto or the bare CometScalarFunction(name) shortcut. Without an explicit return type the native planner consults DataFusion's UDF registry first to resolve the return type, and any arity or input-type difference between the Spark and DataFusion versions fails native execution at runtime with errors like:

    org.apache.comet.CometNativeException: Error from DataFusion:
    Function 'levenshtein' expects 2 arguments but received 3.
    

    This is subtle because the matching-arity path happens to work even without an explicit return type. It only blows up when Spark's signature is wider than DF's, which is exactly when a contributor is most likely to hit it without warning. The contributor guide already mentioned scalarFunctionExprToProtoWithReturnType in passing but did not explain when it is required versus optional.

  2. Refresh the review skill's documentation-check guidance. The existing instructions told reviewers to flag missing hand edits to compatibility.md and configs.md. Both of those paths are stale. The compatibility guide pages (docs/source/user-guide/latest/compatibility/expressions/*.md) and the configuration reference (docs/source/user-guide/latest/configs.md) are now generated by GenerateDocs from getIncompatibleReasons / getUnsupportedReasons on the serde and republished by CI on every merge to main. Asking contributors to edit them by hand is wrong. The skill now distinguishes generated docs (where reviewers should check the serde reason methods instead) from the still hand-edited pages like expressions.md.

What changes are included in this PR?

  • docs/source/contributor-guide/adding_a_new_expression.md: new "When to set the return type explicitly" subsection that explains the failure mode, lists known colliding names, and shows a CometLevenshtein-style example using scalarFunctionExprToProtoWithReturnType.
  • .claude/skills/review-comet-pr/SKILL.md:
    • New item under "Common Comet Review Issues" pointing reviewers at the same return-type pitfall.
    • Rewritten "Documentation Check" section that splits docs into generated (verify serde reason methods, do not ask for hand edits) and hand-edited (still flag missing updates to expressions.md, etc.), with concrete things to look for in reason strings.

How are these changes tested?

Documentation-only change. No tests required. The Sphinx build will validate the Markdown on the next docs publish.

Document the failure mode where a Comet scalar function name collides with a
DataFusion built-in that has different arity or input types. Without an
explicit return type on the proto, the native planner consults DataFusion's
UDF registry first and signature mismatches fail before Comet's UDF is
swapped in. The classic case is levenshtein, where Spark's optional 3-arg
threshold form fails because DF's built-in is 2-arg only.

Add the same pointer to the review skill so reviewers catch this on
incoming PRs.
The old guidance pointed reviewers at compatibility.md and configs.md as
files contributors should hand-edit, but both are now generated by
GenerateDocs from getIncompatibleReasons / getUnsupportedReasons on the
serde and republished by CI on every merge to main.

Reframe the section so reviewers check the source (serde reason
methods) for generated docs and only ask for hand edits to the still
manual pages like expressions.md and the non-expression compatibility
pages.
@andygrove andygrove changed the title docs: clarify when scalar function serde must set return type docs: improve review skill and contributor guide for serde patterns Apr 28, 2026
@andygrove andygrove marked this pull request as ready for review April 28, 2026 16:48
@andygrove andygrove requested a review from coderfender April 28, 2026 16:48
@coderfender
Copy link
Copy Markdown
Contributor

LGTM @andygrove . We might need to run prettier to fix markdown formatting ?

@andygrove andygrove merged commit ade0c60 into apache:main Apr 29, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants