
Advanced service-side predicate filtering using S3 Select #678

Closed
mchowdry opened this issue May 5, 2021 · 4 comments · Fixed by #722
Assignees: jaidisido
Labels: feature, minor release (Will be addressed in the next minor release), ready to release
Milestone: 2.9.0

Comments

@mchowdry

mchowdry commented May 5, 2021

Is your feature request related to a problem? Please describe.
I need to load a subset of a large dataset in S3, with filters that are more granular than the partition keys.

Right now, AWS Data Wrangler only lets me select individual partitions. I would like to pass additional filters or predicates and have them evaluated by the storage layer, so that I don't have to filter out the extra records in my own code.

Describe the solution you'd like
When using any Data Wrangler read function, I would like to be able to pass an SQL query that is then forwarded to S3 Select. The query would be similar to the examples here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
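For reference, a minimal sketch of what such a query looks like today when issued directly against a single object through boto3's select_object_content (the bucket, key, and column names are made up for illustration):

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, key, and columns, purely for illustration.
response = s3.select_object_content(
    Bucket="my-bucket",
    Key="dataset/year=2021/part-0000.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.amount FROM s3object s WHERE s.amount > 100",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)

# S3 Select streams the result back as an event stream of Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))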

P.S. Don't attach files. Please add code snippets directly in the message body.

@jaidisido
Contributor

jaidisido commented May 6, 2021

Thanks for raising this. I can see value in leveraging S3 Select within Wrangler. However, I would first like to highlight some current limitations of the feature:

  • It operates on a single S3 object
  • The maximum length of a record in the input or result is 1 MB
  • It can only emit nested data in JSON format

I don't consider these showstoppers, though.

My suggestion would be to create a new method, wr.s3.read_sql_query, instead of bloating the existing wr.s3.read_parquet/csv methods. The latter already handle too much in my opinion, and adding a sql argument that operates on a single S3 object would not make much sense. Thoughts @igorborgest, @maxispeicher, @kukushking?
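For illustration only, one possible shape for such a method, wrapping select_object_content on a single object. The read_sql_query name comes from the suggestion above; every other name and parameter here is an assumption rather than the final API:

import io
from typing import Any, Dict, Optional

import boto3
import pandas as pd


def read_sql_query(
    sql: str,
    path: str,
    input_serialization: Optional[Dict[str, Any]] = None,
    boto3_session: Optional[boto3.Session] = None,
) -> pd.DataFrame:
    # Hypothetical sketch: run an S3 Select query against a single S3 object
    # and return the result as a DataFrame.
    session = boto3_session or boto3.Session()
    client = session.client("s3")

    # Split "s3://bucket/key" into bucket and key.
    bucket, key = path.replace("s3://", "", 1).split("/", 1)

    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        InputSerialization=input_serialization or {"Parquet": {}},
        OutputSerialization={"JSON": {}},
    )

    # S3 Select returns newline-delimited JSON records in an event stream.
    payload = b"".join(
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    )
    return pd.read_json(io.StringIO(payload.decode("utf-8")), lines=True)

Compression options, CSV/JSON input formats, and chunked output would still need to be worked out on top of a sketch like this.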

@jaidisido self-assigned this May 6, 2021
@maxispeicher
Contributor

I also think it makes sense to put this functionality in a separate function, as it is quite different from a default read operation, especially given the limitation to a single S3 object.

@igorborgest
Contributor

This is an interesting S3 feature... For sure there is some space in the library to take advantage of that.

I also prefer to move forward with the wr.s3.read_sql_query idea. The wr.s3.read_parquet/csv methods are indeed already overloaded enough.

In addition to the limitations listed by @jaidisido, there is another significant one:

  • The maximum uncompressed row group size is 256 MB (for Parquet files).

@jaidisido added the "WIP (Work in progress)" label May 15, 2021
@jaidisido added this to the 2.9.0 milestone May 15, 2021
@jaidisido mentioned this issue May 27, 2021
@jaidisido added the "minor release (Will be addressed in the next minor release)" and "ready to release" labels and removed the "WIP (Work in progress)" label Jun 11, 2021
@jaidisido linked a pull request Jun 11, 2021 that will close this issue
@jaidisido
Contributor

This is now merged into main. To test it until it's available in the next release:

pip uninstall awswrangler -y
pip install git+https://github.com/awslabs/aws-data-wrangler.git@main
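Once installed from main, a quick smoke test could look like the following, assuming the method landed under the wr.s3.read_sql_query name proposed above (the path and predicate are placeholders, and the merged signature may differ):

import awswrangler as wr

# Placeholder path and predicate; argument names may differ in the merged API.
df = wr.s3.read_sql_query(
    sql="SELECT s.id, s.amount FROM s3object s WHERE s.amount > 100",
    path="s3://my-bucket/dataset/year=2021/part-0000.parquet",
)
print(df.head())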
