feat: add AI skill to find and improve the Pythonic interface to functions by timsaucer · Pull Request #1484 · apache/datafusion-python

timsaucer · 2026-04-09T15:51:32Z

Which issue does this PR close?

None

Rationale for this change

This adds an AI agent skill that can be used to search the repository and identify cases where we can make our interface more intuitive to users. The PR also includes the diff the agent recommended when this skill was used together with our existing agent directives about how to write functions.

What changes are included in this PR?

Add a skill for searching the repository for functions, investigating their upstream equivalents, and updating the function inputs where appropriate.

I ran the skill and updated many function signatures.

Are there any user-facing changes?

Improved type hints and a wider range of accepted inputs in Python.

…uiring lit()

Update 47 functions in functions.py to accept native Python types (int, float, str) for arguments that are contextually literals, eliminating verbose lit() wrapping. For example, users can now write split_part(col("a"), ",", 2) instead of split_part(col("a"), lit(","), lit(2)). All changes are backward compatible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ions

Update instr and position (aliases of strpos) to accept Expr | str for the substring parameter, matching the updated primary function signature. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Alias functions that delegate to a primary function must have their type hints updated to match, even though coercion logic is only added to the primary. Added a new Step 3 to the implementation workflow for this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
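The alias rule described in the commit messages above can be sketched roughly as follows. This is a minimal illustration only: the `Expr` class here is a hypothetical stand-in for `datafusion.Expr`, and the function bodies are assumptions rather than the code in `functions.py`. Only the names `strpos`, `instr`, and `Expr.literal` come from the PR text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for datafusion.Expr, for illustration only."""

    repr_: str

    @staticmethod
    def literal(value: Any) -> Expr:
        return Expr(f"lit({value!r})")


def col(name: str) -> Expr:
    """Stand-in for a column reference."""
    return Expr(f"col({name!r})")


def strpos(string: Expr, substr: Expr | str) -> Expr:
    """Primary function: the coercion logic lives here."""
    if not isinstance(substr, Expr):
        substr = Expr.literal(substr)
    return Expr(f"strpos({string.repr_}, {substr.repr_})")


def instr(string: Expr, substr: Expr | str) -> Expr:
    """Alias: no coercion of its own, but the type hint matches strpos."""
    return strpos(string, substr)


# The old explicit-literal call and the new native-string call build the same expression.
old_style = strpos(col("a"), Expr.literal(","))
new_style = instr(col("a"), ",")
print(old_style == new_style)
```

If the alias kept the narrower `substr: Expr` hint, the call would still work at runtime, but a type checker would flag `instr(col("a"), ",")` even though the primary function accepts it.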

timsaucer · 2026-04-09T16:04:25Z

Since @kevinjqliu asked about it regarding the last skill I wrote, here is an export of the chat history:
chat-export-2026-04-09.md

One important thing to note in the chat history is that I intentionally exited the session and started a new one so that the skill would be applied without prior context. Then, as I reviewed the code it generated, I gave it feedback that it had missed the aliases, and I had the agent update the skill it was using.

I've found that this has to be an iterative process. The next step I'm going to take is to start a fresh session and have it review this PR, both the skill and the updates it makes. I'll keep iterating on the process of having the agent and myself review the code suggestions and update the skill.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds a new AI “make-pythonic” skill and applies its recommendations to make datafusion-python function wrappers accept native Python literals (e.g., str, int, float) in places where callers previously had to wrap values with lit().

Changes:

Added a reusable AI skill definition documenting how to audit and “pythonicize” function signatures.
Updated many python/datafusion/functions.py APIs to accept native types and internally coerce them to Expr.literal(...).
Updated doctest examples to demonstrate the simplified calling convention.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| python/datafusion/functions.py | Broad signature + coercion updates to accept native Python types; doctest examples updated accordingly. |
| .ai/skills/make-pythonic/SKILL.md | New skill documentation describing how to identify and implement pythonic argument coercions. |


Update SKILL.md to prevent three classes of issues: clarify that float already accepts int per PEP 484 (avoiding redundant int | float that fails ruff PYI041), add backward-compat rule for Category B so existing Expr params aren't removed, and add guidance for inline coercion with many optional nullable params instead of local helpers. Replace regexp_instr's _to_raw() helper with inline coercion matching the pattern used throughout the rest of the file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…erns

Introduce coerce_to_expr() and coerce_to_expr_or_none() in expr.py as the complement to ensure_expr(): where ensure_expr rejects non-Expr values, these helpers wrap them via Expr.literal(). Replaces ~60 inline isinstance checks in functions.py with single-line helper calls, and updates the make-pythonic skill to document the new pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
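A rough sketch of how the three helpers named in this commit might relate to each other. The helper names come from the commit message; the signatures, bodies, and the `Expr` stand-in class are assumptions made for illustration and may differ from what is in `expr.py`.

```python
from __future__ import annotations

from typing import Any


class Expr:
    """Hypothetical stand-in for datafusion.Expr, for illustration only."""

    def __init__(self, value: Any) -> None:
        self.value = value

    @classmethod
    def literal(cls, value: Any) -> Expr:
        return cls(value)


def ensure_expr(value: Any) -> Expr:
    """Reject anything that is not already an Expr (assumed behavior)."""
    if not isinstance(value, Expr):
        msg = f"expected Expr, got {type(value).__name__}"
        raise TypeError(msg)
    return value


def coerce_to_expr(value: Any) -> Expr:
    """Pass Expr through unchanged; wrap anything else as a literal."""
    return value if isinstance(value, Expr) else Expr.literal(value)


def coerce_to_expr_or_none(value: Any | None) -> Expr | None:
    """Like coerce_to_expr, but keep None as None for optional arguments."""
    return None if value is None else coerce_to_expr(value)


existing = Expr("already an expression")
print(coerce_to_expr(existing) is existing)
print(coerce_to_expr(",").value)
print(coerce_to_expr_or_none(None))
```

With helpers of this shape, each wrapper function needs a single call per argument instead of its own inline `isinstance` branch.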

Add Technique 1a to detect literal-only arguments in aggregate functions. Unlike scalar UDFs which enforce literals in invoke_with_args(), aggregate functions enforce them in accumulator() via get_scalar_value(), validate_percentile_expr(), or downcast_ref::<Literal>(). Without this technique, the skill would incorrectly classify arguments like approx_percentile_cont's percentile as Category A (Expr | float) when they should be Category B (float only). Updates the decision flow to branch on scalar vs aggregate before checking for literal enforcement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add Technique 1b to detect literal-only arguments in window functions. Window functions enforce literals in partition_evaluator() via get_scalar_value_from_args() / downcast_ref::<Literal>(), not in invoke_with_args() (scalar) or accumulator() (aggregate). Updates the decision flow to branch on scalar vs aggregate vs window. Known window functions with literal-only arguments: ntile (n), lead/lag (offset, default_value), nth_value (n). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


Replace 7 fragile truthiness checks (x.expr if x else None) with explicit is not None checks to prevent silent None when zero-valued literals are passed. Widen log/power/pow type hints to Expr | int | float with noqa: PYI041 for clarity. Add unit tests for coerce_to_expr helpers and integration tests for pythonic calling conventions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
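To make the truthiness problem concrete, here is a small sketch that does not use the library at all. The function names and the tuple standing in for a literal expression are hypothetical; the point is only the difference between `if value` and `if value is not None`.

```python
from __future__ import annotations


def fragile_optional(value: int | None) -> tuple[str, int] | None:
    # Truthiness check: 0 is falsy, so a legitimate zero is dropped.
    return ("literal", value) if value else None


def explicit_optional(value: int | None) -> tuple[str, int] | None:
    # Identity check: only a missing argument maps to None.
    return ("literal", value) if value is not None else None


print(fragile_optional(0))   # None: the zero was silently lost
print(explicit_optional(0))  # ('literal', 0)
```

The same reasoning applies to any falsy value a caller may legitimately pass, such as `0.0` or an empty string.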

Add FBT003 (boolean positional value) to the per-file-ignores for python/tests/* in pyproject.toml, and remove the 6 now-redundant inline noqa: FBT003 comments across test_expr.py and test_context.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace hardcoded "Known aggregate/window functions with literal-only arguments" lists with instructions to discover them dynamically by searching the upstream crate source. Keeps a few examples as validation anchors so the agent knows its search is working correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ntjohnson1

I see this is still a draft, so it might change, but I saw you referenced it on the other PR, so I did a quick skim.

ntjohnson1 · 2026-04-14T13:36:31Z

python/datafusion/functions.py

    """Truncates the date to a specified level of precision.

+    Args:
+        part: The precision to truncate to. Must be one of ``"year"``,


Would be nice to just use Literal instead here since that's more specific

Let's open a follow up PR with deprecation warning for now
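One way the two suggestions in this thread could fit together, sketched under heavy assumptions: the `DatePart` alias, the `check_part` helper, and the `Expr` stand-in are all hypothetical, and the set of allowed values is an illustrative subset (only `"year"` is visible in the docstring excerpt above).

```python
from __future__ import annotations

import warnings
from typing import Any, Literal, get_args

# Illustrative subset only; not the library's actual list of parts.
DatePart = Literal["year", "month", "day"]


class Expr:
    """Hypothetical stand-in for datafusion.Expr wrapping a literal value."""

    def __init__(self, value: Any) -> None:
        self.value = value


def check_part(part: DatePart | Expr) -> str:
    """Accept a plain string, and still accept Expr with a deprecation warning."""
    if isinstance(part, Expr):
        warnings.warn(
            "Passing an Expr for `part` is deprecated; pass a string instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        part = part.value
    if part not in get_args(DatePart):
        msg = f"unsupported part: {part!r}"
        raise ValueError(msg)
    return part


print(check_part("year"))
```

A `Literal` hint lets type checkers and editors reject unsupported strings before the code runs, while the warning path keeps existing `lit("year")` callers working during a deprecation period.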

ntjohnson1 · 2026-04-14T13:39:07Z

python/datafusion/functions.py



-def encode(expr: Expr, encoding: Expr) -> Expr:
+def encode(expr: Expr, encoding: Expr | str) -> Expr:


NIT: I wonder if it makes sense to make type aliases for Expr | str, Expr | int

I think that makes it slightly less user friendly for the type hints.

PyThreadState_SetAsyncExc only delivers exceptions when the thread is executing Python bytecode, not while in native (Rust/C) code. The previous test had two issues causing flakiness on Python 3.11:

1. The interrupt fired before df.collect() entered the UDF, while the thread was still in native code where async exceptions are ignored.
2. time.sleep(2.0) is a single C call where async exceptions are not checked; they're only checked between bytecode instructions.

Fix by adding a threading.Event so the interrupt waits until the UDF is actually executing Python code, and by sleeping in small increments so the eval loop has opportunities to check for pending exceptions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
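The fix described in this commit can be approximated with the standard library alone. This is a sketch, not the test from the PR: the `worker` function stands in for the Python UDF, there is no DataFusion query involved, and it assumes CPython, since `ctypes.pythonapi` is CPython-specific.

```python
import ctypes
import threading
import time

started = threading.Event()
result = {}


def worker() -> None:
    """Stand-in for a Python UDF that native code would call into."""
    try:
        # Signal that this thread is now executing Python bytecode.
        started.set()
        deadline = time.monotonic() + 5.0
        while time.monotonic() < deadline:
            # Short sleeps: a pending async exception is raised between
            # iterations instead of waiting out one long C-level sleep.
            time.sleep(0.01)
        result["outcome"] = "timed out"
    except KeyboardInterrupt:
        result["outcome"] = "interrupted"


thread = threading.Thread(target=worker)
thread.start()

# Do not fire the interrupt until the worker is known to be in Python code.
started.wait(timeout=5.0)
affected = ctypes.pythonapi.PyThreadState_SetAsyncExc(
    ctypes.c_ulong(thread.ident), ctypes.py_object(KeyboardInterrupt)
)
thread.join(timeout=10.0)
print(affected, result["outcome"])
```

Setting the event inside the `try` block means that an exception delivered immediately after the signal is still caught by the handler.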

timsaucer and others added 3 commits April 9, 2026 11:44

timsaucer requested a review from Copilot April 9, 2026 16:05

Copilot AI reviewed Apr 9, 2026

Copilot started reviewing on behalf of timsaucer April 9, 2026 16:20

timsaucer and others added 4 commits April 9, 2026 12:40

timsaucer requested a review from Copilot April 14, 2026 12:15

Copilot started reviewing on behalf of timsaucer April 14, 2026 12:16

timsaucer mentioned this pull request Apr 14, 2026: Allow plain Python literals in regexp function wrappers #1493 (Open)

Copilot AI reviewed Apr 14, 2026

timsaucer and others added 3 commits April 14, 2026 08:38

ntjohnson1 reviewed Apr 14, 2026

timsaucer marked this pull request as ready for review April 14, 2026 13:44

timsaucer mentioned this pull request Apr 14, 2026: Deprecate Expr for functions that take only specific literals #1496 (Open)



3 participants