
S3 Destination: Support partition keys #6981

Open
sherifnada opened this issue Oct 12, 2021 · 10 comments
Labels: area/connectors (Connector related issues), area/platform (issues related to the platform), autoteam, connectors/destination/s3, connectors/destinations-files, frozen (Not being actively worked on), team/destinations (Destinations team's backlog), type/enhancement (New feature or request)

Comments

@sherifnada
Contributor

Tell us about the problem you're trying to solve

To achieve optimal performance, one often needs to partition data into S3 directories based on the value of a field in the incoming records. For example, partitioning on a date field: records dated in October should go into the october/ directory, records dated in November into the november/ directory, and so on.

This is pretty important for downstream performance. S3-based query engines (e.g. Redshift, Hive, Trino) often leverage the partition key structure to optimize queries.

This is an interesting problem for us to solve because it is very connection-specific: currently no configuration can be customized on a per-connection basis, and the partition key for a given stream may differ both from other streams and across connections.
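To make the request concrete, here is a minimal sketch (not Airbyte code) of what routing records into field-based S3 prefixes could look like; the bucket, stream, and field names are hypothetical:

```python
# Minimal sketch (not Airbyte code): route records into S3 prefixes based on
# the value of a partition field, here a hypothetical date column.
import json
from collections import defaultdict

import boto3

def write_partitioned(records, bucket, stream, partition_field):
    """Group records by the partition field and write one object per partition."""
    partitions = defaultdict(list)
    for record in records:
        # e.g. "2021-10-05" -> partition directory "2021-10"
        partitions[str(record[partition_field])[:7]].append(record)

    s3 = boto3.client("s3")
    for key, rows in partitions.items():
        body = "\n".join(json.dumps(r) for r in rows)
        # Objects land under e.g. s3://<bucket>/<stream>/2021-10/data.jsonl
        s3.put_object(Bucket=bucket, Key=f"{stream}/{key}/data.jsonl", Body=body.encode("utf-8"))
```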

Describe the solution you’d like

I would like to be able to set the partition key for each stream in my connection
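As a purely hypothetical illustration (no such option exists today), the per-stream setting might look something like:

```python
# Hypothetical shape of a per-stream setting, expressed as a Python dict:
# each stream in the connection names the field(s) to partition its output by.
stream_partition_config = {
    "orders": {"partition_keys": ["order_date"]},
    "events": {"partition_keys": ["customer_id", "event_date"]},
}
```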

@sherifnada sherifnada added type/enhancement New feature or request area/connectors Connector related issues area/core labels Oct 12, 2021
@agroh1
Contributor

agroh1 commented Oct 14, 2021

We are looking into Airbyte and this functionality would really help us.

We are in the advertising industry and will be using Airbyte to pull advertising reports over a 30-day window, daily. So each day, we pull the last 30 days.

To assemble this into a longer view (say, a year's worth of data), we need to find the latest version of each day across a year's worth of 30-day reports. If these reports partitioned their data into 30 separate S3 files, one for each day in the report, this becomes a fairly simple task: we just loop over the S3 files and keep the latest file for each day. The reason we need this is that the reports are constantly being updated for old days.
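A rough sketch of that downstream task, assuming hypothetical keys like reports/&lt;report_date&gt;/&lt;day&gt;.csv where the last path segment encodes the day a row belongs to:

```python
# Sketch (assumed key layout: reports/<report_date>/<day>.csv): for each day,
# keep only the most recently written object across all 30-day reports.
import boto3

def latest_object_per_day(bucket, prefix="reports/"):
    s3 = boto3.client("s3")
    latest = {}  # day -> (last_modified, key)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            day = obj["Key"].rsplit("/", 1)[-1].removesuffix(".csv")
            if day not in latest or obj["LastModified"] > latest[day][0]:
                latest[day] = (obj["LastModified"], obj["Key"])
    return {day: key for day, (_, key) in latest.items()}
```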

@ameyabapat-bsft

ameyabapat-bsft commented Mar 1, 2022

Eagerly waiting for this. We already maintain incoming data in an S3 directory hierarchy such as customer/yyyy/mm/dd. The current S3 connector dumps everything in a single place, so we would need extra processing to handle an Airbyte-specific single dump location. Date-wise partitioning of the dump would fit exactly into our existing flow and unify the data-ingestion process.
Any estimated timeline for this?

@Thelin90

I am also very keen to see some kind of roadmap/plan for supporting this, if possible.

@misteryeo
Contributor

Issue was linked to Harvestr Discovery: Destination Amazon S3: Support partition keys

@grishick grishick added the team/destinations Destinations team's backlog label Sep 27, 2022
@gilbertovilarunc

Very interested in this feature!

@gilbertovilarunc

Any updates here guys?

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024
@evantahler
Contributor

S3 Destination has a "filename pattern" option. From our docs:

S3 Filename pattern

The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Please don't use spaces or unsupported placeholders, as they won't be recognized.

...

But it is possible to further customize by using the available variables to format the bucket path:

${NAMESPACE}: Namespace the stream comes from, or the one configured via the connection namespace fields.
${STREAM_NAME}: Name of the stream.
${YEAR}: Year in which the sync wrote the output data.
${MONTH}: Month in which the sync wrote the output data.
${DAY}: Day on which the sync wrote the output data.
${HOUR}: Hour in which the sync wrote the output data.
${MINUTE}: Minute in which the sync wrote the output data.
${SECOND}: Second in which the sync wrote the output data.
${MILLISECOND}: Millisecond in which the sync wrote the output data.
${EPOCH}: Milliseconds since the Epoch at which the sync wrote the output data.
${UUID}: Random UUID string.
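For example (illustrative only, using the variables above), setting the bucket path format to ${NAMESPACE}/${STREAM_NAME}/${YEAR}/${MONTH}/${DAY} yields one directory per stream and per sync date.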

I think this is complete!

@jaakla

jaakla commented Oct 8, 2024

That filename pattern feature does not really enable partitions as they are normally defined and used, i.e. based on the source data. One example above was customer/yyyy/mm/dd, where customer refers to a customer name/id field in the source data, and yyyy/mm/dd is not just the processing time but a specific date-typed field from the source database. The processing date may also make sense as a special case, but normally the partition is based on a source event date.
For proper partitioning, the writer would need to re-partition the data based on the given field(s), not just set file names.

@FredericoCoelhoNunes

FredericoCoelhoNunes commented Oct 11, 2024

@evantahler If I understood correctly, I don't think what you linked answers the original request, which is to be able to create partitions based on the content of the extracted data.

For example: if I extracted the following data in a sync:

| customer_id | day | value |
| --- | --- | --- |
| 1 | 10 | x |
| 2 | 20 | y |

and wanted to partition it by customer_id/day, resulting in the following s3 paths:

s3://my-bucket/my-stream/1/10/data.parquet (content: value=x)
s3://my-bucket/my-stream/2/20/data.parquet (content: value=y)

This is not possible with the current destination, but I think it's a common use-case.
Thanks!
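As an aside, for anyone needing this today outside of Airbyte, the layout above is roughly what hive-style partitioning produces; a minimal sketch with pyarrow (the table and paths are hypothetical, and the directory names come out as key=value rather than bare values):

```python
# Sketch (not the Airbyte destination): write one parquet file per
# (customer_id, day) combination using hive-style partitioning.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "customer_id": [1, 2],
    "day": [10, 20],
    "value": ["x", "y"],
})

# Produces directories such as
#   my-stream/customer_id=1/day=10/<file>.parquet  (content: value=x)
#   my-stream/customer_id=2/day=20/<file>.parquet  (content: value=y)
pq.write_to_dataset(table, root_path="my-stream", partition_cols=["customer_id", "day"])
```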

@evantahler
Contributor

Reopening!

@evantahler evantahler reopened this Oct 11, 2024