feat: add AI skill to find and improve the Pythonic interface to functions#1484
feat: add AI skill to find and improve the Pythonic interface to functions#1484timsaucer wants to merge 11 commits intoapache:mainfrom
Conversation
…uiring lit()
Update 47 functions in functions.py to accept native Python types (int, float,
str) for arguments that are contextually literals, eliminating verbose lit()
wrapping. For example, users can now write split_part(col("a"), ",", 2) instead
of split_part(col("a"), lit(","), lit(2)). All changes are backward compatible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions Update instr and position (aliases of strpos) to accept Expr | str for the substring parameter, matching the updated primary function signature. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alias functions that delegate to a primary function must have their type hints updated to match, even though coercion logic is only added to the primary. Added a new Step 3 to the implementation workflow for this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Since @kevinjqliu asked about it regarding the last skill I wrote, here is an export of the chat history: One of the important things to note in the chat history is that I intentionally exited the session and started a new session so that the skill would be applied without prior context. Then as I reviewed the code it generated I gave it feedback on the fact that it missed the aliases. And so then I had the agent update the skill it was using. I've found that this has to be an iterative process. The next step I'm going to take is to start a fresh session and have it review this PR, both the skill and the updates it makes. I'll keep iterating on the process of having the agent and myself review the code suggestions and update the skill. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds a new AI “make-pythonic” skill and applies its recommendations to make datafusion-python function wrappers accept native Python literals (e.g., str, int, float) in places where callers previously had to wrap values with lit().
Changes:
- Added a reusable AI skill definition documenting how to audit and “pythonicize” function signatures.
- Updated many
python/datafusion/functions.pyAPIs to accept native types and internally coerce them toExpr.literal(...). - Updated doctest examples to demonstrate the simplified calling convention.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| python/datafusion/functions.py | Broad signature + coercion updates to accept native Python types; doctest examples updated accordingly. |
| .ai/skills/make-pythonic/SKILL.md | New skill documentation describing how to identify and implement pythonic argument coercions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Update SKILL.md to prevent three classes of issues: clarify that float already accepts int per PEP 484 (avoiding redundant int | float that fails ruff PYI041), add backward-compat rule for Category B so existing Expr params aren't removed, and add guidance for inline coercion with many optional nullable params instead of local helpers. Replace regexp_instr's _to_raw() helper with inline coercion matching the pattern used throughout the rest of the file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…erns Introduce coerce_to_expr() and coerce_to_expr_or_none() in expr.py as the complement to ensure_expr() — where ensure_expr rejects non-Expr values, these helpers wrap them via Expr.literal(). Replaces ~60 inline isinstance checks in functions.py with single-line helper calls, and updates the make-pythonic skill to document the new pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Technique 1a to detect literal-only arguments in aggregate functions. Unlike scalar UDFs which enforce literals in invoke_with_args(), aggregate functions enforce them in accumulator() via get_scalar_value(), validate_percentile_expr(), or downcast_ref::<Literal>(). Without this technique, the skill would incorrectly classify arguments like approx_percentile_cont's percentile as Category A (Expr | float) when they should be Category B (float only). Updates the decision flow to branch on scalar vs aggregate before checking for literal enforcement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Technique 1b to detect literal-only arguments in window functions. Window functions enforce literals in partition_evaluator() via get_scalar_value_from_args() / downcast_ref::<Literal>(), not in invoke_with_args() (scalar) or accumulator() (aggregate). Updates the decision flow to branch on scalar vs aggregate vs window. Known window functions with literal-only arguments: ntile (n), lead/lag (offset, default_value), nth_value (n). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Replace 7 fragile truthiness checks (x.expr if x else None) with explicit is not None checks to prevent silent None when zero-valued literals are passed. Widen log/power/pow type hints to Expr | int | float with noqa: PYI041 for clarity. Add unit tests for coerce_to_expr helpers and integration tests for pythonic calling conventions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add FBT003 (boolean positional value) to the per-file-ignores for python/tests/* in pyproject.toml, and remove the 6 now-redundant inline noqa: FBT003 comments across test_expr.py and test_context.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded "Known aggregate/window functions with literal-only arguments" lists with instructions to discover them dynamically by searching the upstream crate source. Keeps a few examples as validation anchors so the agent knows its search is working correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ntjohnson1
left a comment
There was a problem hiding this comment.
I see this is still draft so might change but saw you referenced it on the other pr so did a quick skim
| """Truncates the date to a specified level of precision. | ||
|
|
||
| Args: | ||
| part: The precision to truncate to. Must be one of ``"year"``, |
There was a problem hiding this comment.
Would be nice to just use Literal instead here since that's more specific
There was a problem hiding this comment.
Let's open a follow up PR with deprecation warning for now
|
|
||
|
|
||
| def encode(expr: Expr, encoding: Expr) -> Expr: | ||
| def encode(expr: Expr, encoding: Expr | str) -> Expr: |
There was a problem hiding this comment.
NIT: I wonder if it makes sense to make type aliases for Expr | str, Expr | int
There was a problem hiding this comment.
I think that makes it slightly less user friendly for the type hints.
PyThreadState_SetAsyncExc only delivers exceptions when the thread is executing Python bytecode, not while in native (Rust/C) code. The previous test had two issues causing flakiness on Python 3.11: 1. The interrupt fired before df.collect() entered the UDF, while the thread was still in native code where async exceptions are ignored. 2. time.sleep(2.0) is a single C call where async exceptions are not checked — they're only checked between bytecode instructions. Fix by adding a threading.Event so the interrupt waits until the UDF is actually executing Python code, and by sleeping in small increments so the eval loop has opportunities to check for pending exceptions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
None
Rationale for this change
This adds an AI agent skill that can be used to search the repository and identify cases where we can make our interface more intuitive to users. Attached is also the diff recommended when using this skill in coordination with our existing agent directives about how to write functions.
What changes are included in this PR?
Add skill for searching repository for functions, investigating their upstream equivalent, and update the function inputs where appropriate.
I ran the skill and updated many function signatures.
Are there any user-facing changes?
Improved type hints and inputs allowed in Python.