sryza (Contributor) commented Jan 6, 2026

What changes were proposed in this pull request?

This is a WIP change that adds support for using eagerly analyzed DataFrame APIs like DataFrame.schema and DataFrame.columns inside pipeline query functions.

The change makes graph resolution partially asynchronous.

Many of the data structures that were previously maintained as local variables inside transformDownNodes have been moved to a GraphAnalysisContext object. Moving them into a separate object makes them accessible from Spark Connect RPC handlers that:

  • Register query function results
  • Poll for query functions to execute
  • Analyze within the context of the graph
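To make the three handler roles above concrete, here is a minimal Python sketch of a shared context object. It is not the code in this PR: the class only borrows the GraphAnalysisContext name, and every method name, the string-typed plans, and the column lists are assumptions made for illustration.

```python
import queue
import threading
from typing import Dict, List, Optional


class GraphAnalysisContext:
    """Hypothetical stand-in for the shared analysis state.

    Every method name here is an assumption, not the PR's actual API.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Flows whose query functions the client still has to run.
        self._pending: "queue.Queue[str]" = queue.Queue()
        # Signalled once a flow's query function result is registered.
        self._ready: Dict[str, threading.Event] = {}
        self._results: Dict[str, str] = {}
        # Columns of datasets that have already been resolved.
        self._resolved_columns: Dict[str, List[str]] = {}

    def request_query_function(self, flow: str) -> None:
        """Resolution side: ask the client to execute a query function."""
        with self._lock:
            self._ready[flow] = threading.Event()
        self._pending.put(flow)

    def poll_query_function(self) -> Optional[str]:
        """RPC handler: return the next flow to execute, or None."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None

    def register_result(self, flow: str, plan: str) -> None:
        """RPC handler: store the plan the query function returned."""
        with self._lock:
            self._results[flow] = plan
            ready = self._ready[flow]
        ready.set()

    def await_result(self, flow: str, timeout: Optional[float] = None) -> Optional[str]:
        """Resolution side: block until the result arrives or time runs out."""
        with self._lock:
            ready = self._ready[flow]
        if not ready.wait(timeout):
            return None
        with self._lock:
            return self._results[flow]

    def mark_resolved(self, dataset: str, columns: List[str]) -> None:
        """Resolution side: record the columns of a resolved dataset."""
        with self._lock:
            self._resolved_columns[dataset] = list(columns)

    def analyze_columns(self, dataset: str) -> List[str]:
        """RPC handler: answer a columns request against the graph so far."""
        with self._lock:
            return list(self._resolved_columns[dataset])
```

Keeping this state in one lock-protected object, rather than in local variables of the resolution routine, is what lets handlers running on other threads read and update it while resolution is still in progress.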

We're also essentially introducing a new state that flows can be in during resolution: “waiting for query function result”.
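One way to picture the extra state is as a small state machine. The sketch below is illustrative only: the "waiting for query function result" state comes from the description above, while the other state names, the transition table, and the transition helper are assumptions rather than the PR's implementation.

```python
from enum import Enum, auto


class FlowState(Enum):
    UNRESOLVED = auto()  # assumed name
    WAITING_FOR_QUERY_FUNCTION_RESULT = auto()  # state named in this PR
    RESOLVED = auto()  # assumed name
    FAILED = auto()  # assumed name


# Assumed transition table: a flow pauses while its query function runs
# on the client, then either resolves or fails.
_ALLOWED = {
    FlowState.UNRESOLVED: {
        FlowState.WAITING_FOR_QUERY_FUNCTION_RESULT,
        FlowState.FAILED,
    },
    FlowState.WAITING_FOR_QUERY_FUNCTION_RESULT: {
        FlowState.RESOLVED,
        FlowState.FAILED,
    },
    FlowState.RESOLVED: set(),
    FlowState.FAILED: set(),
}


def transition(current: FlowState, target: FlowState) -> FlowState:
    """Return the new state, rejecting moves the table does not allow."""
    if target not in _ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

In this sketch a flow cannot jump from UNRESOLVED straight to RESOLVED, which reflects the idea that resolution has to wait for the query function result to be registered first.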

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

github-actions bot commented Jan 6, 2026

⚠️ Pull Request Title Validation

This pull request title does not contain a JIRA issue ID.

Please update the title to either:

  • Include a JIRA ID: [SPARK-12345] Your description
  • Mark as minor change: [MINOR] Your description

For minor changes that don't require a JIRA ticket (e.g., typo fixes), please prefix the title with [MINOR].


This comment was automatically generated by GitHub Actions

sryza changed the title from "Analyze in query function" to "Prototype: support eager analysis inside Pipelines query functions" on Jan 6, 2026