Skip to content

feat: add AI skill to find and improve the Pythonic interface to functions#1484

Open
timsaucer wants to merge 11 commits intoapache:mainfrom
timsaucer:feat/pythonic-skill
Open

feat: add AI skill to find and improve the Pythonic interface to functions#1484
timsaucer wants to merge 11 commits intoapache:mainfrom
timsaucer:feat/pythonic-skill

Conversation

@timsaucer
Copy link
Copy Markdown
Member

Which issue does this PR close?

None

Rationale for this change

This adds an AI agent skill that can be used to search the repository and identify cases where we can make our interface more intuitive to users. Attached is also the diff recommended when using this skill in coordination with our existing agent directives about how to write functions.

What changes are included in this PR?

Add skill for searching repository for functions, investigating their upstream equivalent, and update the function inputs where appropriate.

I ran the skill and updated many function signatures.

Are there any user-facing changes?

Improved type hints and inputs allowed in Python.

timsaucer and others added 3 commits April 9, 2026 11:44
…uiring lit()

Update 47 functions in functions.py to accept native Python types (int, float,
str) for arguments that are contextually literals, eliminating verbose lit()
wrapping. For example, users can now write split_part(col("a"), ",", 2) instead
of split_part(col("a"), lit(","), lit(2)). All changes are backward compatible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions

Update instr and position (aliases of strpos) to accept Expr | str for
the substring parameter, matching the updated primary function signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alias functions that delegate to a primary function must have their type
hints updated to match, even though coercion logic is only added to the
primary. Added a new Step 3 to the implementation workflow for this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timsaucer
Copy link
Copy Markdown
Member Author

Since @kevinjqliu asked about it regarding the last skill I wrote, here is an export of the chat history:
chat-export-2026-04-09.md

One of the important things to note in the chat history is that I intentionally exited the session and started a new session so that the skill would be applied without prior context. Then as I reviewed the code it generated I gave it feedback on the fact that it missed the aliases. And so then I had the agent update the skill it was using.

I've found that this has to be an iterative process. The next step I'm going to take is to start a fresh session and have it review this PR, both the skill and the updates it makes. I'll keep iterating on the process of having the agent and myself review the code suggestions and update the skill.

@timsaucer timsaucer requested a review from Copilot April 9, 2026 16:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds a new AI “make-pythonic” skill and applies its recommendations to make datafusion-python function wrappers accept native Python literals (e.g., str, int, float) in places where callers previously had to wrap values with lit().

Changes:

  • Added a reusable AI skill definition documenting how to audit and “pythonicize” function signatures.
  • Updated many python/datafusion/functions.py APIs to accept native types and internally coerce them to Expr.literal(...).
  • Updated doctest examples to demonstrate the simplified calling convention.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

File Description
python/datafusion/functions.py Broad signature + coercion updates to accept native Python types; doctest examples updated accordingly.
.ai/skills/make-pythonic/SKILL.md New skill documentation describing how to identify and implement pythonic argument coercions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

timsaucer and others added 4 commits April 9, 2026 12:40
Update SKILL.md to prevent three classes of issues: clarify that float
already accepts int per PEP 484 (avoiding redundant int | float that
fails ruff PYI041), add backward-compat rule for Category B so existing
Expr params aren't removed, and add guidance for inline coercion with
many optional nullable params instead of local helpers.

Replace regexp_instr's _to_raw() helper with inline coercion matching
the pattern used throughout the rest of the file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…erns

Introduce coerce_to_expr() and coerce_to_expr_or_none() in expr.py as the
complement to ensure_expr() — where ensure_expr rejects non-Expr values,
these helpers wrap them via Expr.literal(). Replaces ~60 inline isinstance
checks in functions.py with single-line helper calls, and updates the
make-pythonic skill to document the new pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Technique 1a to detect literal-only arguments in aggregate functions.
Unlike scalar UDFs which enforce literals in invoke_with_args(), aggregate
functions enforce them in accumulator() via get_scalar_value(),
validate_percentile_expr(), or downcast_ref::<Literal>(). Without this
technique, the skill would incorrectly classify arguments like
approx_percentile_cont's percentile as Category A (Expr | float) when they
should be Category B (float only). Updates the decision flow to branch on
scalar vs aggregate before checking for literal enforcement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Technique 1b to detect literal-only arguments in window functions.
Window functions enforce literals in partition_evaluator() via
get_scalar_value_from_args() / downcast_ref::<Literal>(), not in
invoke_with_args() (scalar) or accumulator() (aggregate). Updates the
decision flow to branch on scalar vs aggregate vs window.

Known window functions with literal-only arguments: ntile (n), lead/lag
(offset, default_value), nth_value (n).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

timsaucer and others added 3 commits April 14, 2026 08:38
Replace 7 fragile truthiness checks (x.expr if x else None) with
explicit is not None checks to prevent silent None when zero-valued
literals are passed. Widen log/power/pow type hints to Expr | int | float
with noqa: PYI041 for clarity. Add unit tests for coerce_to_expr helpers
and integration tests for pythonic calling conventions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add FBT003 (boolean positional value) to the per-file-ignores for
python/tests/* in pyproject.toml, and remove the 6 now-redundant
inline noqa: FBT003 comments across test_expr.py and test_context.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded "Known aggregate/window functions with literal-only
arguments" lists with instructions to discover them dynamically by
searching the upstream crate source. Keeps a few examples as validation
anchors so the agent knows its search is working correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ntjohnson1 ntjohnson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is still draft so might change but saw you referenced it on the other pr so did a quick skim

"""Truncates the date to a specified level of precision.

Args:
part: The precision to truncate to. Must be one of ``"year"``,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to just use Literal instead here since that's more specific

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's open a follow up PR with deprecation warning for now

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def encode(expr: Expr, encoding: Expr) -> Expr:
def encode(expr: Expr, encoding: Expr | str) -> Expr:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I wonder if it makes sense to make type aliases for Expr | str, Expr | int

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes it slightly less user friendly for the type hints.

PyThreadState_SetAsyncExc only delivers exceptions when the thread is
executing Python bytecode, not while in native (Rust/C) code. The
previous test had two issues causing flakiness on Python 3.11:

1. The interrupt fired before df.collect() entered the UDF, while the
   thread was still in native code where async exceptions are ignored.
2. time.sleep(2.0) is a single C call where async exceptions are not
   checked — they're only checked between bytecode instructions.

Fix by adding a threading.Event so the interrupt waits until the UDF is
actually executing Python code, and by sleeping in small increments so
the eval loop has opportunities to check for pending exceptions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants