Add register_object_store to SessionContext #55

Merged (20 commits) Nov 8, 2022

Conversation

@wseaton (Contributor) commented Sep 23, 2022

Related to #22

With this function added, I can now run (with the proper environment variables set):

```python
ctx = datafusion.SessionContext()
ctx.register_object_store("s3://{bucket}")
ctx.register_parquet(
    "table_name",
    "s3://{bucket}/{dataset}/created_year=2022/created_month=9/created_day=19/",
    [],
    True,
    ".parquet",
)

df = ctx.sql("""
    SELECT
        first,
        second,
        sum(third)
    FROM table_name
    GROUP BY
        first, second, third
    LIMIT 20000
""")
df.show()
# <data prints>
```

I'm currently unable to test the gcp integration, may need some help there.

@wseaton (Contributor, Author) commented Sep 23, 2022

It would be nice to figure out a cleaner abstraction for configuring the different providers in the object_store crate; unfortunately their APIs for client configuration are very different. Because of this I just default to the existing credential providers.

Maybe a 3-variant enum with fields for the exposed config? That could potentially be represented on the Python side as `**storage_kwargs`.
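For illustration, the 3-variant-enum idea could be sketched on the Python side roughly like this. All names here (`S3Config`, `GcsConfig`, `AzureConfig`, `make_config`) are hypothetical, not part of the actual datafusion-python API:

```python
# Hypothetical sketch of per-provider config objects; names are
# illustrative, not the actual datafusion-python API.
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class S3Config:
    region: str
    bucket_name: Optional[str] = None
    endpoint: Optional[str] = None


@dataclass
class GcsConfig:
    bucket_name: str
    service_account_path: Optional[str] = None


@dataclass
class AzureConfig:
    container_name: str
    account: Optional[str] = None


StoreConfig = Union[S3Config, GcsConfig, AzureConfig]


def make_config(scheme: str, **storage_kwargs) -> StoreConfig:
    """Dispatch on the URI scheme, mirroring the 3-variant enum idea."""
    providers = {"s3": S3Config, "gs": GcsConfig, "az": AzureConfig}
    return providers[scheme](**storage_kwargs)
```

A caller would then write, e.g., `make_config("s3", region="us-west-2", endpoint="http://localhost:9000")` and get back the S3 variant.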

@matthewmturner commented:

It's possible things have changed since I was working on the ObjectStore abstraction, but I was expecting something closer to:

```rust
let s3 = AmazonS3Builder::new();
ctx.register_object_store(s3, "bucket", "s3");
```

I believe this is needed because you still need to create the ObjectStore that will be registered with the context, and I don't see that in your example above.

Then you could register files the way you showed with register_parquet.

A quick peek at the DataFusion cookbook example for S3 shows how that's done on the Rust side.

Take what I'm saying with a grain of salt, though, as I haven't been as involved lately.

@wseaton (Contributor, Author) commented Sep 23, 2022

@matthewmturner I'm not quite sure what changes you're suggesting; do you mind elaborating? The PR does basically the equivalent of the cookbook example, except it generates the clients using default credential providers (like `::from_env`) and dispatches based on the URI scheme. This seems like the most straightforward approach.

@matthewmturner commented:

@wseaton I thought the function signature would be similar to the Rust API. That said, I'm not sure how closely the existing Python function signatures match their equivalents on the Rust side, so maybe this isn't a design goal of the Python bindings.

I also think it would be good to allow users to create an ObjectStore with more than just the default credential providers, so that S3-compatible API providers, such as MinIO, could be used.

@wseaton (Contributor, Author) commented Sep 23, 2022

> im not sure how close the existing python function signatures match their equivalent on the rust side so maybe this isnt a design goal of the python bindings

Yeah, the builder pattern doesn't really map well into Python. I'll do some research to see how it's done elsewhere.

> I also think that it would be good to allow users to create ObjectStore with more than just the default credential providers so that s3 API providers, such as MinIO, could be used.

MinIO and Ceph (which I need as well) should be supported by setting AWS_ENDPOINT, but I agree it'd be nice to let users generate this config via the Python APIs instead of it being purely environment-driven.
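For reference, pointing the default environment-based S3 credential provider at a MinIO or Ceph endpoint might look like the following. The endpoint URL and credentials here are placeholders, not values from this PR:

```shell
# Placeholder values; substitute your own endpoint and credentials.
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
export AWS_ENDPOINT="http://localhost:9000"
export AWS_REGION="us-east-1"
```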

@matthewmturner commented:

Agree on the builder pattern not mapping as well to Python. I was focusing more on the signature for register_object_store. But of course, what I'm suggesting would require figuring out how to create the object store.

If I recall correctly, some of the read options and other configs used with the session context also use the builder pattern, and some of those may already be exposed in our Python bindings somewhere; their implementation could be used as inspiration for how to approach creating the object store.

@wseaton wseaton changed the title Add register_object_store to SessionContext WIP: Add register_object_store to SessionContext Sep 23, 2022
@wseaton wseaton marked this pull request as draft September 23, 2022 03:24
@turbo1912 commented:

@wseaton have you seen https://github.com/roeap/object-store-python (cc: @roeap)? It looks like not all services are fully supported yet, but that project can be used to construct an object_store which is then registered with a SessionContext through a simple binding exposed from this package. You wouldn't have to parse URLs again to create the right object store.

@wseaton (Contributor, Author) commented Sep 23, 2022

@turbo1912 oh wow, thanks for the heads up! I had just refactored to remove the parsing redundancy and externalize some of the config, and am catching up on this. It looks like object-store-python is pretty far along.

@wseaton (Contributor, Author) commented Sep 23, 2022

The new API from the last commit looks like this:

```python
from datafusion.store import AmazonS3

s3 = AmazonS3(region="us-west-2", bucket_name="my_bucket", endpoint="https://myendpoint.url")
ctx.register_object_store("s3", "my_bucket", s3)
```

@matthewmturner is this closer to what you were thinking?

@matthewmturner commented:

@wseaton yes! Exactly what I had in mind. A nit would be to name it datafusion.object_store instead, to be consistent with the crate name. I confess I haven't looked at the implementation, but the API is what I had in mind. Are there feature flags for the object store functionality?

@wseaton (Contributor, Author) commented Sep 23, 2022

> are there feature flags for the object store functionality

@matthewmturner it'd be pretty easy to add Cargo features on our side, since each provider is gated behind an object_store crate feature, but I don't think maturin currently supports passing Cargo features through to pip extras and vice versa (I did a bit of digging last night). Open to ideas on that, though!
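For illustration, gating each provider behind a Cargo feature that forwards to the corresponding object_store crate feature might look roughly like this. This is a sketch, not the actual manifest of this repository, and the version number is a placeholder:

```toml
[dependencies]
object_store = { version = "0.5", optional = true }

[features]
default = []
s3 = ["object_store/aws"]
gcs = ["object_store/gcp"]
azure = ["object_store/azure"]
```

The open problem discussed above is mapping these Cargo features onto pip extras at wheel-build time, which maturin did not support at the time of this conversation.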

@matthewmturner commented:

Ah, that's unfortunate. I suppose the alternative would be a totally separate object_store module, but I'm not even sure that would solve it.

Do you have an idea how much larger this makes the datafusion Python package? Last I checked it was quite large because it was datafusion + pyarrow (which is large), and now we're adding the full object_store crate.

@wseaton (Contributor, Author) commented Sep 23, 2022

Not sure how scientific this is (it's just output from ls), but:

Before (master), release and debug builds:

```
-rwxr-xr-x. 1 weaton weaton  27M Sep 23 17:21 _internal.abi3.so
-rwxr-xr-x. 1 weaton weaton 553M Sep 23 17:19 _internal.abi3.so
```

After (this branch), release and debug builds:

```
-rwxr-xr-x. 1 weaton weaton  32M Sep 23 17:17 _internal.abi3.so
-rwxr-xr-x. 1 weaton weaton 624M Sep 23 17:18 _internal.abi3.so
```

@matthewmturner commented:

I was referring more to the size after installing the wheel. For example, right now when I pip install datafusion it comes out to 141 MB (datafusion itself is only 25 MB; the remainder is pyarrow / numpy). So I'm curious how big datafusion will be with the full-featured object store bundled in.

@wseaton (Contributor, Author) commented Sep 26, 2022

@matthewmturner honestly, I'm not quite sure how to reproduce those numbers locally. When I build with maturin build --release I see an 8 MB -> 11 MB bump in wheel size on this branch compared to master. The wheel is a compressed format, so I'd think the uncompressed binary comparison above is more apt (the 27 MB quoted above for master is pretty close to the 25 MB you mention).

So anywhere from 5-20% bigger would be my guess. Including object storage drivers seems like a good idea and is basically standard for other tools; Spark, for example, bundles them with its distributions via Hadoop (which runs to hundreds of megabytes).

@wseaton wseaton changed the title WIP: Add register_object_store to SessionContext Add register_object_store to SessionContext Sep 26, 2022
@wseaton wseaton marked this pull request as ready for review September 26, 2022 17:54
@wseaton (Contributor, Author) commented Sep 26, 2022

In addition to unifying the API, I also made bucket_name an optional kwarg, because for most providers it can be elided from the user's point of view and simply inferred from the connection.
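The "inferred from the connection" idea can be sketched in plain Python. The helper below is hypothetical (not part of the actual bindings): it uses the explicit kwarg when given and otherwise pulls the bucket name out of the object-store URL:

```python
from typing import Optional
from urllib.parse import urlparse


def infer_bucket_name(url: str, bucket_name: Optional[str] = None) -> str:
    """Hypothetical helper: prefer the explicit kwarg if supplied,
    otherwise take the bucket from the URL's authority component."""
    if bucket_name is not None:
        return bucket_name
    parsed = urlparse(url)
    if parsed.scheme not in {"s3", "gs", "az"}:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    return parsed.netloc
```

With this, `infer_bucket_name("s3://my_bucket/data/")` yields `"my_bucket"`, while an explicit `bucket_name=` kwarg always wins.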

@andygrove (Member) left a comment:

Thank you, @wseaton! This is great. 🎉

@wseaton (Contributor, Author) commented Oct 10, 2022

@francis-du @andygrove is there a chance this can get in w/ the 13.0.0 release?

@francis-du (Contributor) commented:

I think it's okay, but it depends on Andy.

@francis-du (Contributor) commented:

@wseaton Hi Will, I think you have a cargo fmt check error, so this can't be merged yet.

@wseaton (Contributor, Author) commented Oct 13, 2022

@francis-du should be good now, lmk how you want to handle the lock conflict

@francis-du (Contributor) commented:

> @francis-du should be good now, lmk how you want to handle the lock conflict

I usually rebase locally with the master branch.

@andygrove (Member) commented:

> @francis-du @andygrove is there a chance this can get in w/ the 13.0.0 release?

I would love to release a new version of the Python bindings soon but could use some help, since I am not familiar with releasing Python packages. Here is the issue tracking the next release (0.7.0, because the version numbers for the Python bindings are independent of the DataFusion version): #7

@wseaton (Contributor, Author) commented Oct 13, 2022

@francis-du should be good now

edit: rebasing again after #61

@francis-du (Contributor) commented:

Hi Will, there is a good practice to share with you.

If you don't like waiting for maintainers to approve your commits, you can try adding GPG signing; your commits will then trigger the CI directly. Like this:

[screenshot: verified commit triggering CI]

FYI: https://docs.github.com/en/authentication/managing-commit-signature-verification/adding-a-gpg-key-to-your-github-account

@wseaton (Contributor, Author) commented Oct 13, 2022

@francis-du ah great, I didn't realize it was commit verification that disabled the CI. I'll be sure to set that up for future contributions.

@isidentical (Contributor) commented:

This PR looks pretty straightforward; amazing work @wseaton 🎊 On the question of testing, another relatively simple approach might be using LocalFileSystem from object_store (and actually implementing a backend for it) for testing. It is generally impossible to test every possible path / variation across all object store implementations, but having a single simple test that exercises an object store is, I think, way better than having none.
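A minimal sketch of the local-filesystem testing idea, using a temporary directory to stand in for a remote object store. This is pure stdlib Python; the class name and put/get interface are illustrative, not the object_store crate's actual API:

```python
import tempfile
from pathlib import Path


class LocalStore:
    """Toy object-store backend over the local filesystem,
    mimicking a put/get key-value interface."""

    def __init__(self, root: str):
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def list(self, prefix: str = ""):
        base = self.root / prefix
        return sorted(
            str(p.relative_to(self.root)) for p in base.rglob("*") if p.is_file()
        )


# A test against this kind of backend exercises the registration and
# read paths without needing any cloud credentials.
with tempfile.TemporaryDirectory() as d:
    store = LocalStore(d)
    store.put("dataset/part-0.parquet", b"\x00\x01")
    assert store.get("dataset/part-0.parquet") == b"\x00\x01"
    assert store.list("dataset") == ["dataset/part-0.parquet"]
```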

@andygrove (Member) commented:

@wseaton We just merged #63 to enable auditing of license headers, in preparation for releasing the Python bindings. Could you please merge that into this PR (or rebase)? Thanks!

@wseaton (Contributor, Author) commented Oct 25, 2022

@andygrove done. I can also work on @isidentical's suggestion of testing against LocalFileSystem, but I'm also happy to do that in a follow-up PR if a release is imminent.

@andygrove andygrove merged commit 9d2b974 into apache:master Nov 8, 2022