
Advanced service-side predicate filtering using S3 Select #678

Closed
mchowdry opened this issue May 5, 2021 · 4 comments · Fixed by #722
Assignees: jaidisido
Labels: feature, minor release (Will be addressed in the next minor release), ready to release
Milestone: 2.9.0

Comments

@mchowdry

mchowdry commented May 5, 2021

Is your feature request related to a problem? Please describe.
I need to load a subset of a large dataset in S3, with filters that are more granular than the partition keys.

Right now, AWS Data Wrangler only lets me select individual partitions. I would like to pass additional filters or predicates and have them evaluated by the storage layer, so that I don't have to filter out the extra records in my own code.

Describe the solution you'd like
When using any Data Wrangler read function, I would like to be able to pass an SQL query that is then forwarded to S3 Select. The query would be similar to the examples here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
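For reference, a minimal sketch of what such a query looks like today when issued directly against a single object through boto3's select_object_content (the bucket, key, and column names are made up for illustration):

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, key, and columns, purely for illustration.
response = s3.select_object_content(
    Bucket="my-bucket",
    Key="dataset/year=2021/part-0000.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.amount FROM s3object s WHERE s.amount > 100",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)

# S3 Select streams the result back as an event stream of Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))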

P.S. Don't attach files. Please add code snippets directly in the message body.

@jaidisido
Contributor

jaidisido commented May 6, 2021

Thanks for raising this. I can see value in leveraging S3 Select within Wrangler. However, I would first like to highlight some current limitations of the feature:

  • It operates on a single S3 object
  • The maximum length of a record in the input or result is 1 MB
  • It can only emit nested data in JSON format

I don't consider these showstoppers, though.

My suggestion would be to create a new method, wr.s3.read_sql_query, instead of bloating the existing wr.s3.read_parquet/csv methods. The latter already handle too much in my opinion, and adding a sql argument that operates on a single S3 object would not make much sense. Thoughts @igorborgest, @maxispeicher, @kukushking?
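For illustration only, one possible shape for such a method, wrapping select_object_content on a single object. The read_sql_query name comes from the suggestion above; every other name and parameter here is an assumption rather than the final API:

import io
from typing import Any, Dict, Optional

import boto3
import pandas as pd


def read_sql_query(
    sql: str,
    path: str,
    input_serialization: Optional[Dict[str, Any]] = None,
    boto3_session: Optional[boto3.Session] = None,
) -> pd.DataFrame:
    # Hypothetical sketch: run an S3 Select query against a single S3 object
    # and return the result as a DataFrame.
    session = boto3_session or boto3.Session()
    client = session.client("s3")

    # Split "s3://bucket/key" into bucket and key.
    bucket, key = path.replace("s3://", "", 1).split("/", 1)

    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        InputSerialization=input_serialization or {"Parquet": {}},
        OutputSerialization={"JSON": {}},
    )

    # S3 Select returns newline-delimited JSON records in an event stream.
    payload = b"".join(
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    )
    return pd.read_json(io.StringIO(payload.decode("utf-8")), lines=True)

Compression options, CSV/JSON input formats, and chunked output would still need to be worked out on top of a sketch like this.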

@jaidisido self-assigned this May 6, 2021
@maxispeicher
Contributor

I also think it makes sense to put this functionality in a separate function, as it is quite different from a default read operation, especially given the limitation to a single S3 object.

@igorborgest
Contributor

This is an interesting S3 feature... For sure there is some space in the library to take advantage of that.

I also prefer to move forward with the wr.s3.read_sql_query idea. The wr.s3.read_parquet/csv methods are indeed already overloaded enough.

In addition to the limitations listed by @jaidisido, there is another significant one:

  • The maximum uncompressed row group size is 256 MB (for Parquet files).

@jaidisido added the "WIP (Work in progress)" label May 15, 2021
@jaidisido added this to the 2.9.0 milestone May 15, 2021
@jaidisido mentioned this issue May 27, 2021
@jaidisido added the "minor release (Will be addressed in the next minor release)" and "ready to release" labels and removed the "WIP (Work in progress)" label Jun 11, 2021
@jaidisido linked a pull request Jun 11, 2021 that will close this issue
@jaidisido
Contributor

This is now merged into main. To test it until it's available in the next release:

pip uninstall awswrangler -y
pip install git+https://github.com/awslabs/aws-data-wrangler.git@main
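Once installed from main, a quick smoke test could look like the following, assuming the method landed under the wr.s3.read_sql_query name proposed above (the path and predicate are placeholders, and the merged signature may differ):

import awswrangler as wr

# Placeholder path and predicate; argument names may differ in the merged API.
df = wr.s3.read_sql_query(
    sql="SELECT s.id, s.amount FROM s3object s WHERE s.amount > 100",
    path="s3://my-bucket/dataset/year=2021/part-0000.parquet",
)
print(df.head())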
