
S3 Destination: Support partition keys #6981

Open
sherifnada opened this issue Oct 12, 2021 · 10 comments
Labels: area/connectors (Connector related issues), area/platform (issues related to the platform), autoteam, connectors/destination/s3, connectors/destinations-files, frozen (Not being actively worked on), team/destinations (Destinations team's backlog), type/enhancement (New feature or request)

Comments

@sherifnada
Contributor

Tell us about the problem you're trying to solve

To achieve optimal performance, one often needs to partition data into S3 directories based on the value of a field in the incoming records. For example, partitioning on a date field: records dated in October should go into the october/ directory, records dated in November into the november/ directory, and so on.

This is pretty important for downstream performance. S3-based query engines (e.g. Redshift, Hive, Trino) often leverage the partition key structure to optimize queries.

This is an interesting problem for us to solve because it is very connection-specific: currently no configuration can be customized on a per-connection basis, and the partition key for a given stream may differ both from other streams and across connections.
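To make the request concrete, here is a minimal sketch (not Airbyte code) of what routing records into field-based S3 prefixes could look like; the bucket, stream, and field names are hypothetical:

```python
# Minimal sketch (not Airbyte code): route records into S3 prefixes based on
# the value of a partition field, here a hypothetical date column.
import json
from collections import defaultdict

import boto3

def write_partitioned(records, bucket, stream, partition_field):
    """Group records by the partition field and write one object per partition."""
    partitions = defaultdict(list)
    for record in records:
        # e.g. "2021-10-05" -> partition directory "2021-10"
        partitions[str(record[partition_field])[:7]].append(record)

    s3 = boto3.client("s3")
    for key, rows in partitions.items():
        body = "\n".join(json.dumps(r) for r in rows)
        # Objects land under e.g. s3://<bucket>/<stream>/2021-10/data.jsonl
        s3.put_object(Bucket=bucket, Key=f"{stream}/{key}/data.jsonl", Body=body.encode("utf-8"))
```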

Describe the solution you’d like

I would like to be able to set the partition key for each stream in my connection
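As a purely hypothetical illustration (no such option exists today), the per-stream setting might look something like:

```python
# Hypothetical shape of a per-stream setting, expressed as a Python dict:
# each stream in the connection names the field(s) to partition its output by.
stream_partition_config = {
    "orders": {"partition_keys": ["order_date"]},
    "events": {"partition_keys": ["customer_id", "event_date"]},
}
```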

@sherifnada sherifnada added type/enhancement New feature or request area/connectors Connector related issues area/core labels Oct 12, 2021
@agroh1
Contributor

agroh1 commented Oct 14, 2021

We are looking into Airbyte and this functionality would really help us.

We are in the advertising industry and will be using Airbyte to pull advertising reports over a 30-day window, daily. So each day, we pull the last 30 days.

To assemble this into a longer view (say, a year's worth of data), we need to find the latest version of each day across a year's worth of 30-day reports. If these reports partitioned their data into 30 separate S3 files, one for each day in the report, this becomes a fairly simple task: we just loop over the S3 files and keep the latest file for each day. The reason we need this is that the reports are constantly being updated for old days.
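A rough sketch of that downstream task, assuming hypothetical keys like reports/&lt;report_date&gt;/&lt;day&gt;.csv where the last path segment encodes the day a row belongs to:

```python
# Sketch (assumed key layout: reports/<report_date>/<day>.csv): for each day,
# keep only the most recently written object across all 30-day reports.
import boto3

def latest_object_per_day(bucket, prefix="reports/"):
    s3 = boto3.client("s3")
    latest = {}  # day -> (last_modified, key)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            day = obj["Key"].rsplit("/", 1)[-1].removesuffix(".csv")
            if day not in latest or obj["LastModified"] > latest[day][0]:
                latest[day] = (obj["LastModified"], obj["Key"])
    return {day: key for day, (_, key) in latest.items()}
```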

@ameyabapat-bsft

ameyabapat-bsft commented Mar 1, 2022

Eagerly waiting for this. We already maintain incoming data in an S3 directory hierarchy such as customer/yyyy/mm/dd. The current S3 connector dumps everything in a single place, so we would need extra processing to handle an Airbyte-specific single dump location. Date-wise partitioning of the dump would fit exactly into our existing flow and unify the data-ingestion process.
Any estimated timeline for this?

@Thelin90

I am also very keen to see some kind of roadmap/plan for supporting this, if possible.

@misteryeo
Contributor

Issue was linked to Harvestr Discovery: Destination Amazon S3: Support partition keys

@grishick grishick added the team/destinations Destinations team's backlog label Sep 27, 2022
@gilbertovilarunc

Very interested in this feature!

@gilbertovilarunc

Any updates here guys?

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024
@evantahler
Contributor

S3 Destination has a "filename pattern" option. From our docs:

S3 Filename pattern

The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Please don't use spaces or unsupported placeholders, as they won't be recognized.

...

But it is possible to further customize by using the available variables to format the bucket path:

${NAMESPACE}: Namespace the stream comes from, or the one configured via the connection namespace fields.
${STREAM_NAME}: Name of the stream.
${YEAR}: Year in which the sync wrote the output data.
${MONTH}: Month in which the sync wrote the output data.
${DAY}: Day on which the sync wrote the output data.
${HOUR}: Hour in which the sync wrote the output data.
${MINUTE}: Minute in which the sync wrote the output data.
${SECOND}: Second in which the sync wrote the output data.
${MILLISECOND}: Millisecond in which the sync wrote the output data.
${EPOCH}: Milliseconds since the Epoch at which the sync wrote the output data.
${UUID}: Random UUID string.
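For example (illustrative only, using the variables above), setting the bucket path format to ${NAMESPACE}/${STREAM_NAME}/${YEAR}/${MONTH}/${DAY} yields one directory per stream and per sync date.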

I think this is complete!

@jaakla

jaakla commented Oct 8, 2024

That filename pattern feature does not really enable partitions as they are normally defined and used, i.e. based on the source data. One example above was customer/yyyy/mm/dd, where customer refers to a customer name/id field in the source data, and yyyy/mm/dd is not just the processing time but a specific date-typed field from the source database. The processing date may also make sense as a special case, but normally the partition is based on a source event date.
For proper partitioning, the writer would need to re-partition the data based on the given field(s), not just set file names.

@FredericoCoelhoNunes

FredericoCoelhoNunes commented Oct 11, 2024

@evantahler If I understood correctly, I don't think what you linked answers the original request, which is to be able to create partitions based on the content of the extracted data.

For example: if I extracted the following data in a sync:

| customer_id | day | value |
| --- | --- | --- |
| 1 | 10 | x |
| 2 | 20 | y |

and wanted to partition it by customer_id/day, resulting in the following s3 paths:

s3://my-bucket/my-stream/1/10/data.parquet (content: value=x)
s3://my-bucket/my-stream/2/20/data.parquet (content: value=y)

This is not possible with the current destination, but I think it's a common use-case.
Thanks!
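As an aside, for anyone needing this today outside of Airbyte, the layout above is roughly what hive-style partitioning produces; a minimal sketch with pyarrow (the table and paths are hypothetical, and the directory names come out as key=value rather than bare values):

```python
# Sketch (not the Airbyte destination): write one parquet file per
# (customer_id, day) combination using hive-style partitioning.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "customer_id": [1, 2],
    "day": [10, 20],
    "value": ["x", "y"],
})

# Produces directories such as
#   my-stream/customer_id=1/day=10/<file>.parquet  (content: value=x)
#   my-stream/customer_id=2/day=20/<file>.parquet  (content: value=y)
pq.write_to_dataset(table, root_path="my-stream", partition_cols=["customer_id", "day"])
```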

@evantahler
Contributor

Reopening!

@evantahler evantahler reopened this Oct 11, 2024