S3 as destination #573
@deepakksahu thanks for making this request! Is there a specific data format you want to send to S3 (e.g. CSV), or are there different formats you would be interested in sending?
The most commonly used file formats for OLAP workloads are Parquet and ORC. In some cases it can be Avro or (rarely) CSV as well. In my opinion, supporting all four would be great; otherwise, we could start with the first two.
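To make the format discussion concrete, here is a minimal sketch (in Python, not from this thread) of writing a batch of synced records to Parquet with pyarrow; the record shape and the output path are hypothetical.

```python
# Minimal sketch: writing a batch of records to a Parquet file with pyarrow.
# The record shape and the output path are made up for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"id": 1, "name": "alice", "emitted_at": "2021-05-01T00:00:00Z"},
    {"id": 2, "name": "bob", "emitted_at": "2021-05-01T00:00:01Z"},
]

# Convert the list of dicts into a columnar Arrow table.
table = pa.Table.from_pylist(records)

# Write it out as Parquet. An S3 filesystem (e.g. s3fs) could be passed via
# the `filesystem=` argument instead of writing locally.
pq.write_table(table, "records.parquet", compression="snappy")
```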
We'll prioritize S3 as a destination in our roadmap shortly. Thanks for the clarification!
Hello guys, Parquet is a good one. :)
Anyone working on a Delta Lake destination? We are looking forward to this.
Any progress on this one? A CSV destination for things like the Sheets source would be nice.
Hi @kyprifog! This is not on Airbyte's roadmap for April, but it is definitely top of mind for us, and we will update you as we form our May roadmap. We'll also create some documentation for how to create a destination yourself in case you need to unblock yourself in the meantime.
@sherifnada any update on that documentation? (Apologies in advance, I haven't actually tried it, so it may be self-apparent.)
+1. Would be great to have an S3 destination that supports multiple formats, ideally Delta Lake/CSV/Parquet.
Hi everyone! We're going to start work on the S3 destination very soon and would love to hear your feedback on the features you need. Could you comment on this thread about any of the following:

- How much data do you expect to sync to S3?
- Which output formats do you need (e.g. CSV, JSONL, Avro, Parquet)?
- Do you need the data partitioned in S3? If so, how?
- Will the data be processed by downstream applications, and which ones?
- How would you like to configure the connector?
Hi @sherifnada
We've estimated about 5-10 million records on each connection, with dozens of connections (based on sync settings).
JSONL, Avro. If possible, having something like Delta Lake and "upserts" would be great!
Yes. Partitioned by stream name and load timestamp, plus some custom partitions to be filled in by the user.
Yes, using Redshift, Athena, and other downstream apps to process this data.
A form to configure the credentials and the S3 path, in which we could include dynamic values (something like parsing …).
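A rough sketch of what such dynamic path configuration could look like, assuming a simple template string; the placeholder names ({stream}, {date}, {timestamp}) are hypothetical, not an actual Airbyte syntax.

```python
# Sketch: a user-configurable S3 path template with dynamic values.
# The placeholder names ({stream}, {date}, {timestamp}) are hypothetical.
from datetime import datetime, timezone

def render_s3_path(template: str, stream: str) -> str:
    """Fill a path template with the stream name and the load time."""
    now = datetime.now(timezone.utc)
    return template.format(
        stream=stream,
        date=now.strftime("%Y-%m-%d"),
        timestamp=now.strftime("%Y%m%dT%H%M%SZ"),
    )

print(render_s3_path("raw/{stream}/dt={date}/part-{timestamp}.jsonl", "users"))
# e.g. raw/users/dt=2021-05-20/part-20210520T120000Z.jsonl
```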
I had a very simple use case: I wanted to create a quick and easy sync between Google Sheets and S3. There are other ways to do this, but I wanted to see if I could knock it out with Airbyte since I was using it for other things. I have other tools that I use instead for large volumes of data or partitioned data.
Hi @sherifnada
CSV (maybe gzipped). Later support for Parquet, Avro, and Delta Lake would be nice.
Yes. The Hive data model (see wiki) is probably very popular and is also the one I use. So it would be nice to have the option to specify one or more columns in the source that are used to partition the data on S3 (e.g. …).
Data is further transformed and queried with Trino (f.k.a. PrestoSQL) using the Hive connector. I'd like to add the CSV data written by Airbyte as an external table in Hive (see wiki). From there, the data can be queried and transformed, e.g. converted to Parquet, directly with Trino (or with something like dbt).
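As an illustration of that workflow (a sketch only, not the actual setup from this thread), the following uses the trino Python client to register Airbyte's CSV output as an external Hive table and convert it to Parquet; the catalog, schema, bucket, and column names are all made up.

```python
# Sketch: register CSV files on S3 as an external Hive table via Trino, then
# convert them to Parquet with CTAS. Names are hypothetical; assumes the
# `trino` Python client and a Hive metastore behind the `hive` catalog.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="airbyte", catalog="hive", schema="raw")
cur = conn.cursor()

# Trino's Hive CSV format only supports varchar columns, so declare them as
# varchar here and cast during the conversion step.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw.users_csv (id varchar, name varchar)
    WITH (external_location = 's3://my-bucket/airbyte/users/', format = 'CSV')
""")

# Convert to Parquet for efficient downstream querying (or do this in dbt).
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw.users_parquet
    WITH (format = 'PARQUET')
    AS SELECT CAST(id AS bigint) AS id, name FROM raw.users_csv
""")
```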
Thanks @sherifnada
Hi @sherifnada, I was planning to build an S3 destination connector myself as well, and I came across Airbyte's Redshift connector, which has logic that puts data into S3 and then copies it into Redshift from there. So what I can do is reverse-engineer the Redshift connector up to the S3 step and build it in my dev version. However, it would be great if you could guide me as well. Thanks and regards.
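For reference, the stage-then-COPY pattern described in the comment above can be sketched roughly as follows, assuming boto3 and psycopg2; the bucket, table, and IAM role are hypothetical placeholders.

```python
# Sketch of the stage-then-COPY pattern: upload a file to S3, then load it
# into Redshift with COPY. All names and credentials are hypothetical.
import boto3
import psycopg2

s3 = boto3.client("s3")
s3.upload_file("records.csv", "my-staging-bucket", "staging/records.csv")

conn = psycopg2.connect(host="redshift.example.com", port=5439,
                        dbname="warehouse", user="loader", password="...")
with conn, conn.cursor() as cur:
    # Redshift pulls the staged file directly from S3.
    cur.execute("""
        COPY public.records
        FROM 's3://my-staging-bucket/staging/records.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        CSV
    """)
```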
@janhavi-debug no need! We should have a PR for the S3 destination by EOW :) Though we'd love it if you could help us understand your use case by answering some of the questions above!
Hi @sherifnada, just adding my two cents on some of your questions, as we're waiting for this S3 destination to start using Airbyte.
Definitely Parquet!
Yes, and I agree with Hive partitioning. One key feature for us is the ability to use partition values which are not in the data (e.g. we partition by processing time, not event time).
Data will be read by Redshift Spectrum. Thanks, and looking forward to the release!
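A small sketch of the idea above: building a Hive-style S3 key from processing time rather than from a value in the record; the path layout is hypothetical.

```python
# Sketch: a Hive-style partition path derived from processing time, not from
# any field in the data itself. The path layout is made up for illustration.
from datetime import datetime, timezone

def object_key(stream: str, filename: str) -> str:
    processed = datetime.now(timezone.utc)  # processing time, not event time
    return (f"{stream}/"
            f"year={processed:%Y}/month={processed:%m}/day={processed:%d}/"
            f"{filename}")

print(object_key("users", "part-0000.parquet"))
# e.g. users/year=2021/month=05/day=20/part-0000.parquet
```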
Update on this topic:
Tell us about the new integration you’d like to have
AWS S3 is a heavily used destination for data lakes. If we can sink data to it, that would be great.
Describe the context around this new integration
Our data engineering team does the daily ingestion, and the sync destination is almost always S3 for me.
Describe the alternative you are considering or using
A custom Spark program.