Is your feature request related to a problem? Please describe.
I need to load a subset of a large dataset in S3, with filters that are more granular than the partition keys.
Right now, AWS Data Wrangler will only let me select individual partitions. I would like to pass additional filters or predicates and have them evaluated by the storage layer, so that I don't have to remove additional records within my own code.
Thanks for raising this. I can see value in leveraging S3 Select within Wrangler. However, I would like to highlight some current limitations of the feature first:
It operates on a single S3 object
The maximum length of a record in the input or result is 1 MB
It can only emit nested data in JSON format
I don't consider these showstoppers though.
My suggestion would be to create a new method, wr.s3.read_sql_query, instead of bloating the existing wr.s3.read_parquet/csv methods. The latter already handle too much in my opinion, and adding a sql argument that operates on a single S3 object would not make much sense. Thoughts, @igorborgest, @maxispeicher, @kukushking?
I also think it makes sense to put this functionality in a separate function, since it differs from a default read operation, especially given the limitation to a single S3 object.
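For illustration, here is a minimal sketch of what a separate function built on S3 Select could look like. It is not Wrangler's API: the names select_records and select_records_many are hypothetical, and Parquet input with JSON output is an assumption. The only service call used is select_object_content, as exposed by the boto3 S3 client. Because S3 Select operates on a single object, the second helper simply repeats the query per key.

```python
import json
from typing import Any, Dict, Iterable, List


def select_records(client: Any, bucket: str, key: str, sql: str) -> List[Dict[str, Any]]:
    """Run one S3 Select query against a single Parquet object (hypothetical helper)."""
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=sql,
        ExpressionType="SQL",
        InputSerialization={"Parquet": {}},  # assumption: the object is Parquet
        OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
    )
    # The payload is an event stream; only "Records" events carry data.
    chunks = [
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    ]
    # Join before decoding, because a record can be split across two events.
    body = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in body.split("\n") if line]


def select_records_many(
    client: Any, bucket: str, keys: Iterable[str], sql: str
) -> List[Dict[str, Any]]:
    """Apply the same query to several objects, one request per object."""
    rows: List[Dict[str, Any]] = []
    for key in keys:
        rows.extend(select_records(client, bucket, key, sql))
    return rows
```

With boto3 installed, the client would be boto3.client("s3"), and a query could look like SELECT * FROM S3Object s WHERE s.price > 10, where price is a made-up column name. The resulting list of dicts can be passed to pandas.DataFrame if a DataFrame is needed.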
Describe the solution you'd like
When using any Data Wrangler read function, I would like to pass a SQL query that is then forwarded to S3 Select. The query would be similar to the examples here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html