Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFusion Configuration Consolidation #4349

Closed
4 of 6 tasks
alamb opened this issue Nov 23, 2022 · 10 comments
Closed
4 of 6 tasks

DataFusion Configuration Consolidation #4349

alamb opened this issue Nov 23, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Nov 23, 2022

Related

apache/datafusion-ballista#479
#3885

TLDR Recommendations

This is a complicated issue and I don't have a magic answer. However I have some concrete suggestions

Some suggested steps:

I think consolidating SessionConfig/Config options is likely to be the most controversial / cause the most chrun but it will provide immense benefits I think (like runtime visibility into the current settings)

Then we can further improve from there

Introduction

"Configuration" in DataFusion has a few usecases:

There are also two overlapping "levels" of configuration that are needed

  • Session level (e.g. that can be reused from one query execution to the next)
  • Statement/Task level (e.g. that is needed to plan a query and doesn't change for the duration of a statement such as "the value of now()" and target batch sizes, etc). Statement level configuration is typically a superset of the session level configuration

Current state of configuration in DataFusion

The current state is .... inconsistent to put it mildly.

The core structure is SessionContext which is the final glue and entry point to interacting with datafusion (e.g. tables provided, etc).

Within the SessionContext there is the some combination of SessionState, SessionConfig, ConfigOptions. Part of the hierarchy is like this:

SessionContext
 -- Has a SessionState
 -- SessionStartTime
 -- SessionConfig
   -- ConfigOptions

SessionConfig is effectively the Session level configuration I describe above.

TaskContext is the statement level (aka per task / per query) level context. If you look hard you can see has a copy of the SessionConfig (buried in TaskPropertoes) or also maybe is backed by KVPairs.

Desire

I would like to have a clear configuration system that cleanly separates the statement level config from the task level config and allows configuration values to be set in a uniform manner and that are easy to view programmatically

@alamb alamb added the enhancement New feature or request label Nov 23, 2022
@alamb
Copy link
Contributor Author

alamb commented Nov 23, 2022

However, one question, raised by @avantgardnerio in apache/datafusion-ballista#479, is how to structure the interactions with the configuration code.

It turns out that SessionConfig has a more rust struct style and ConfigOptions is a more dynamic hashmap style. I honestly don't have a preference as long as the introspection (with docstrings) features of ConfigOptions are preserved. Maybe this is possible via macro magic

@alamb
Copy link
Contributor Author

alamb commented Nov 23, 2022

🤔 there is also ExecutionProps which is some subset of the TaskProperties 🤔

@mingmwang
Copy link
Contributor

I think consolidating SessionConfig/Config options is likely to be the most controversial / cause the most chrun but it will provide immense benefits I think (like runtime visibility into the current settings)

I think consolidating SessionConfig/ConfigOptions is definitely the right direction. The debate should be how to consolidate them, either keep SessionConfig or keep ConfigOptions. I do not have a clear preference on this either. maybe we can have a vote.

@mingmwang
Copy link
Contributor

🤔 there is also ExecutionProps which is some subset of the TaskProperties 🤔

We should consolidate ExecutionProps and TaskProperties also.

@mingmwang
Copy link
Contributor

I'm OK with the change in this PR. Ballista should still work after this PR.

#4382

@alamb
Copy link
Contributor Author

alamb commented Nov 29, 2022

Here is my next contribution to clean up configuration: #4427 (slowly consolidating the configurations)

@Cheappie
Copy link
Contributor

If a TaskContext represents a query/task, then could TableProvider receive in scan method TaskContext as replacement for SessionState ? I would like to have a way to identify task in TableProvider scan fn. If I am not wrong, the only way right now to identify task in TableProvider is to create new SessionContext for each query and use session_id from SessionState.

@tustvold
Copy link
Contributor

tustvold commented Dec 14, 2022

If a TaskContext represents a query/task

I think it might be more precise to describe TaskContext as the state for query execution, with SessionState used for query planning.

the only way right now to identify task in TableProvider

Correct, at the point in planning where TableProvider::scan is called, the task conceptually doesn't exist yet

create new SessionContext for each query

FWIW this is what IOx does

Edit: I wrote some docs in #2655 last time I got thoroughly confused by this system that I think are still in date. I see we have grown a few more config structs since then though

Edit Edit: It may also be the case that a single query is broken into multiple TaskContext - I think Ballista does this, but I'm not sure.

@Cheappie
Copy link
Contributor

Cheappie commented Dec 14, 2022

Thank you @tustvold for clarification, in such case I will follow SessionContext per query approach.

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Dec 21, 2022
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Dec 21, 2022
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Dec 21, 2022
tustvold added a commit that referenced this issue Dec 22, 2022
* Push SessionState into FileFormat (#4349)

* Rename ctx to state

* More renames
@alamb
Copy link
Contributor Author

alamb commented Jul 25, 2023

I think this is largely complete and we don't have any additional work planned here, so closing

@alamb alamb closed this as completed Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants