Skip to content

Conversation

@tustvold
Copy link
Contributor

@tustvold tustvold commented Sep 6, 2023

Which issue does this PR close?

Closes #7357

Rationale for this change

This was added by @yahoNanJing in #5572. It was then evolved in #7175 to try to avoid strongly coupling to AvroExec. However, this is causing issues trying to split apart the crates (#7357) and it is unclear how best to adapt the design #7425 (comment).

Given this API does not appear to be being used, I wonder if we can do the simplest possible thing, and just remove it

What changes are included in this PR?

Removes the API

Are these changes tested?

Are there any user-facing changes?

@tustvold tustvold added the api change Changes the API exposed to users of the crate label Sep 6, 2023
@github-actions github-actions bot added the core Core DataFusion crate label Sep 6, 2023
Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Let's get @not-my-profile's view on that, i.e. if he has a good reason to keep this API.

@not-my-profile
Copy link
Contributor

get_scan_files is public API. So just because it's not used internally doesn't mean that it's not used at all.

The question is if @yahoNanJing's motivation still applies:

Currently the TreeNodeRewriter is as a visitor to transform a node to another. However, sometimes we don't need to do the transformation and what we want is only to collect some info from the node. To achieve this, it's better to introduce another visitor for collecting info and keep the node unchanged.

As I said in #7425 (comment) I'm not familiar enough with datafusion to determine that.

it is unclear how best to adapt the design

I think that #7485 is a decent solution if we decide to keep the function.

alamb
alamb previously approved these changes Sep 6, 2023
@alamb alamb dismissed their stale review September 6, 2023 16:26

It is used in ballista

@alamb
Copy link
Contributor

alamb commented Sep 6, 2023

It appears that get_scan_files is used in ballista: https://github.com/search?q=repo%3Aapache%2Farrow-ballista%20get_scan_files&type=code

@tustvold
Copy link
Contributor Author

tustvold commented Sep 6, 2023

It appears to only be used as an argument for https://github.com/apache/arrow-ballista/blob/948d0777f972144cc242d0398fd61fadf34cec73/ballista/scheduler/src/cluster/mod.rs#L679 which seems like it could definitely be achieved in a different way

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked more into the ballista code and I think the files are used for more than just their file count. Specifically the URL is used here:

https://github.com/apache/arrow-ballista/blob/948d0777f972144cc242d0398fd61fadf34cec73/ballista/scheduler/src/cluster/mod.rs#L625C1-L626C1

I am thinking, however, that we could effectively port get_files_for_scan into Ballista... Let me give that a try

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this API into Ballista apache/datafusion-ballista#877 so I think it is ok to remove from DataFusion. @yahoNanJing please advise if there are reasons to avoid this

@tustvold tustvold merged commit 63ccd4a into apache:main Sep 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Detach ExecutionPlan and DataSource

4 participants