Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple FileFormat from datafusion_data_access #2572

Merged
merged 4 commits into from
May 24, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented May 18, 2022

Which issue does this PR close?

Part of #2489

Rationale for this change

The IOx ObjectStore does not have an equivalent notion of ObjectReader, decoupling FileFormat from such notions is therefore a necessary precondition of any migration. I also think it makes the API significantly easier to use

What changes are included in this PR?

Reworks FileFormat to be in terms of FileMeta and ObjectStore, instead of ObjectReader and ObjectReaderStream. These concepts translate cleanly, with FileMeta -> ObjectMeta.

Are there any user-facing changes?

Yes, FileFormat is a public API

@tustvold tustvold added the api change Changes the API exposed to users of the crate label May 18, 2022
file_meta: local::local_unpartitioned_file(file),
partition_values: vec![],
range: None,
impl From<FileMeta> for PartitionedFile {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This not only makes for reduced verbosity, but avoids a confusing situation where there are two local_unpartitioned_file, one which returns FileMeta and one which returns PartitionedFile

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate labels May 18, 2022
@tustvold tustvold marked this pull request as ready for review May 18, 2022 19:07
)
.await?;
Ok(exec)
get_exec_format(&format, &store_root, file_name, projection, limit).await
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a drive-by-fix to extract the common scanning logic that was duplicated for each format

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for other readers, it is common logic to all tests)

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look reasonable to me but I am not very familiar with this part of the code

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me; I had a question about &dyn ObjectStore vs &Arc<dyn ObjectStore>

Also since several other systems are starting to use the object store, and the parquet reader, it might make sense for someone from those projects to take a quick look and make sure this doesn't mess up what they are doing.

cc @yahoNanJing @rdettai @thinkharderdev and @Igosuki

datafusion/core/src/datasource/file_format/mod.rs Outdated Show resolved Hide resolved
async fn infer_schema(&self, readers: ObjectReaderStream) -> Result<SchemaRef>;
async fn infer_schema(
&self,
store: &dyn ObjectStore,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
store: &dyn ObjectStore,
store: &Arc<dyn ObjectStore>,

datafusion/core/src/datasource/file_format/mod.rs Outdated Show resolved Hide resolved
@thinkharderdev
Copy link
Contributor

Seems reasonable to me

@yjshen
Copy link
Member

yjshen commented May 22, 2022

cc @richox

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending a small conflict to resolve and switching back to &Arc<dyn ObjectStore> 👍

@github-actions github-actions bot added the development-process Related to development process of DataFusion label May 24, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's :shipit: 🚀

@@ -25,3 +25,23 @@ If there are user-facing changes then we may require documentation to be updated
<!--
If there are any breaking changes to public APIs, please add the `api change` label.
-->

# Does this PR break compatibility with Ballista?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it doesn't belong in the PR 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's from the merge commit where ballista was removed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate development-process Related to development process of DataFusion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants