S3 Destination: Support partition keys #6981
Comments
We are looking into Airbyte and this functionality would really help us. We are in the advertising industry and will be using Airbyte to pull advertising reports daily for a 30-day window. In order to assemble this into a longer view (say a year's worth of data), we need to find the latest version of each day from a year's worth of 30-day reports. If these reports partitioned their data into 30 separate S3 files, one for each day in the report, then this would be a fairly simple task: we would just loop over the S3 files, finding the latest S3 file for each day. (Hopefully this description made sense.) The reason for this is that the reports are constantly being updated for old days.
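That "latest file per day" scan is straightforward once each day lands in its own object. A minimal sketch, assuming a hypothetical `reports/<export>/<day>.jsonl` key layout and plain boto3 (nothing here is Airbyte-specific):

```python
import boto3

def latest_key_per_day(bucket: str, prefix: str = "reports/") -> dict[str, str]:
    """Return, for each report day, the key of the most recently written file."""
    s3 = boto3.client("s3")
    latest: dict[str, tuple] = {}  # day -> (LastModified, key)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # e.g. "reports/2021-10-30/2021-10-05.jsonl" -> day "2021-10-05"
            day = obj["Key"].rsplit("/", 1)[-1].removesuffix(".jsonl")
            if day not in latest or obj["LastModified"] > latest[day][0]:
                latest[day] = (obj["LastModified"], obj["Key"])
    return {day: key for day, (_, key) in latest.items()}
```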
Eagerly waiting for it. We are already maintaining incoming data in an S3 directory hierarchy such as:
I am also very keen to see some kind of roadmap/plan for supporting this, if possible.
Issue was linked to Harvestr Discovery: Destination Amazon S3: Support partition keys
Veeery interested in this feature!
Any updates here, guys?
S3 Destination has a "filename pattern" option. From our docs:
I think this is complete!
That filename pattern feature does not really enable partitions as they are normally defined and used, i.e. based on source data. One example above was something like:
@evantahler If I understood correctly, I don't think what you linked answers the original request, which is to be able to create partitions based on the content of the extracted data. For example: if I extracted the following data in a sync:
and wanted to partition it by
This is not possible with the current destination, but I think it's a common use-case.
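For concreteness, what's being asked for is roughly the following, sketched here with a hypothetical `partition_key` field and a Hive-style `key=value` path layout; this is what the connector would need to do internally, not an existing Airbyte option:

```python
import json
from collections import defaultdict

import boto3

def write_partitioned(bucket: str, stream: str, records: list[dict], partition_key: str) -> None:
    # Bucket records by the value of the chosen field.
    groups: defaultdict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[str(record[partition_key])].append(record)

    s3 = boto3.client("s3")
    for value, rows in groups.items():
        body = "\n".join(json.dumps(r) for r in rows)
        # One object per partition value, e.g. ad_reports/date=october/part-0.jsonl
        s3.put_object(
            Bucket=bucket,
            Key=f"{stream}/{partition_key}={value}/part-0.jsonl",
            Body=body.encode("utf-8"),
        )
```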
Reopening!
Tell us about the problem you're trying to solve
To achieve optimal performance, one often needs to partition data into S3 directories based on the value of a field in the incoming records. E.g. based on the `date` field, records with date `October` should go into the `october/` directory, those with value `november` should go into the `november/` directory, etc. This is pretty important for performance downstream: oftentimes, S3-based solutions (e.g. Redshift, Hive, Trino, etc.) leverage the partition key structure to optimize queries.
This is an interesting problem for us to solve in that it is very connection-specific. Currently, no configuration can be customized on a per-connection basis, and the partition key for a particular stream may differ from that of other streams, as well as across connections.
Describe the solution you’d like
I would like to be able to set the partition key for each stream in my connection.
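To illustrate (purely hypothetical, since no such option exists today), per-stream configuration might look something like this, with each stream naming the record field to partition on and the destination deriving the S3 prefix from it:

```python
# Hypothetical per-stream partition configuration; Airbyte does not expose
# this today. Stream names and fields are made up for illustration.
partition_config = {
    "ad_reports": {"partition_key": "date"},
    "campaigns": {"partition_key": "region"},
}

def s3_prefix(stream: str, record: dict) -> str:
    key = partition_config[stream]["partition_key"]
    # e.g. s3_prefix("ad_reports", {"date": "2021-10-05"}) -> "ad_reports/date=2021-10-05/"
    return f"{stream}/{key}={record[key]}/"
```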