
S3 as destination #573

Closed
deepakksahu opened this issue Oct 15, 2020 · 21 comments · Fixed by #3672

Comments

@deepakksahu

deepakksahu commented Oct 15, 2020

Tell us about the new integration you’d like to have

AWS S3 is a heavily used destination for data lakes. If we could sink data to it, that would be great.

Describe the context around this new integration

Our Data Engineering team does the daily ingestion, and the sync destination is always S3 for me.

Describe the alternative you are considering or using

Custom spark program.


@deepakksahu added the area/connectors and new-connector labels on Oct 15, 2020
@sherifnada
Contributor

@deepakksahu thanks for making this request! Is there a specific data format you want to send to S3 (e.g. CSV), or are there different formats you would be interested in sending?

@deepakksahu
Author

The most commonly used file formats for OLAP are Parquet and ORC. In some cases it can be Avro or CSV (rarely) as well.

In my opinion, supporting all four would be great. Otherwise we can start with the first two.

@sherifnada
Contributor

We'll prioritize S3 as a destination in our roadmap shortly. Thanks for the clarification!

@rubenssoto

Hello guys,

Parquet is a good one. :)

@dnskr

dnskr commented Jan 26, 2021

Hi!
Delta Lake is a very good and popular candidate: Home and Docs

@brucemen711

brucemen711 commented Feb 9, 2021

Is anyone working on a Delta Lake destination? We are looking forward to this.

@kyprifog

kyprifog commented Apr 1, 2021

Any progress on this one? CSV destination for things like sheets sources would be nice.

@sherifnada
Contributor

Hi @kyprifog! This is not on Airbyte's roadmap for April, but it is definitely top of mind for us, and we will update you as we form our May roadmap. We'll also create some documentation on how to create a destination yourself in case you need to unblock yourself in the meantime.

@kyprifog

kyprifog commented Apr 13, 2021

@sherifnada any update on that documentation?

(Apologies in advance, I haven't actually tried it, so it may be self-apparent.)

@sherifnada
Contributor

sherifnada commented Apr 13, 2021

@kyprifog we'll have it this week! Follow along here: #2641

@eduardgruy

+1. It would be great to have an S3 destination that supports multiple formats, ideally Delta Lake/CSV/Parquet.
I suppose S3 would be used as a kind of landing zone, so appending changes to a Delta Lake table would be perfect.

@sherifnada
Contributor

Hi everyone! We're going to start work on the S3 destination very soon and would love to hear your feedback on features you need. Could you comment on this thread about any of the following:

  • What volume of data do you anticipate replicating? An estimate in bytes or number of records is helpful.
  • Data format you'd like the connector to support
  • Does the data you write need to be partitioned? What partitioning scheme do you need?
  • How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?
  • What is the UX you want for interacting with the S3 connector? What inputs/configuration do you want to give it, and what do you want the output to look like?

@mmolimar
Contributor

Hi @sherifnada
This is great news! My thoughts below :-)

Hi everyone! We're going to start work on the S3 destination very soon and would love to hear your feedback on features you need. Could you comment on this thread about any of the following:

  • What volume of data do you anticipate replicating? An estimate in bytes or number of records is helpful.

We've estimated about 5-10 million records per connection, with dozens of connections (depending on sync settings).

  • Data format you'd like the connector to support

JSONL and Avro. If possible, something like Delta Lake with "upserts" would be great!

  • Does the data you write need to be partitioned? What partitioning scheme do you need?

Yes. Partitioned by stream name and load timestamp, plus some custom partitions to be filled in by the user.

  • How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?

Yes, using Redshift, Athena, and other downstream apps to process this data.

  • What is the UX you want for interacting with the S3 connector? What inputs/configuration do you want to give it, and what do you want the output to look like?

A form to configure the credentials and an S3 path that could include dynamic values (something like parsing DateTimeFormatter patterns to generate dynamic URIs).
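For illustration, this is the kind of dynamic path generation I have in mind, as a rough sketch using plain java.time (the pattern, bucket, and stream names are made up, not an existing Airbyte option):

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class DynamicS3Path {
  public static void main(String[] args) {
    // Hypothetical user-supplied pattern; the formatted timestamp becomes part of the object key.
    DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");
    ZonedDateTime loadedAt = ZonedDateTime.now(ZoneOffset.UTC);

    String bucket = "my-data-lake";   // made-up bucket
    String stream = "orders";         // stream name from the sync
    String key = String.format("airbyte/%s/%s/part-0001.jsonl", stream, loadedAt.format(pattern));

    // Prints something like: s3://my-data-lake/airbyte/orders/2021/05/24/13/part-0001.jsonl
    System.out.println("s3://" + bucket + "/" + key);
  }
}
```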

@kyprifog

kyprifog commented May 24, 2021

I had a very simple use case: I wanted to create a quick and easy sync between Google Sheets and S3. There are other ways to do this, but I wanted to see if I could knock it out with Airbyte since I was using it for other things. I have other tools that I use instead for large volumes of data or partitioned data.

@sherifnada
Contributor

@mmolimar for the initial release we are going to support CSV as the data format, but we will quickly start adding other formats like JSONL, Parquet, and Avro. We will also separately release a Delta Lake connector (#2075) in the future to make importing into Delta Lake a seamless process.

@bo5o

bo5o commented May 25, 2021

Hi @sherifnada

Data format you'd like the connector to support?

CSV (maybe gzipped). Later support for Parquet, Avro, Delta Lake would be nice.

Does the data you write need to be partitioned? What partitioning scheme do you need?

Yes. The Hive data model (see wiki) is probably very popular and is also the one I use. So it would be nice to be able to specify one or more columns in the source that are used to partition the data on S3 (e.g. s3a://mybucket/myschema/mytable/country=ABC/region=DEF/data.part1.csv). The country and region columns don't have to be in data.part1.csv if partitioning is used like that. Additional data could be added in the future using the correct object/partition key (e.g. /country=ABC/region=DEF/data.part2.csv). This scheme is often used with date columns, so that queries targeting a specific date range in the WHERE clause can be answered more efficiently.
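To make the layout concrete, a Hive-style object key like the one above could be derived roughly as follows (a hypothetical helper, not Airbyte code; it assumes the partition columns are dropped from the file itself):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class HivePartitionKey {
  // Builds e.g. "myschema/mytable/country=ABC/region=DEF/data.part1.csv"
  static String objectKey(String schema, String table,
                          Map<String, String> partitionValues, String fileName) {
    String partitions = partitionValues.entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining("/"));
    return String.join("/", schema, table, partitions, fileName);
  }

  public static void main(String[] args) {
    Map<String, String> parts = new LinkedHashMap<>(); // insertion order = partition order
    parts.put("country", "ABC");
    parts.put("region", "DEF");
    System.out.println("s3a://mybucket/" + objectKey("myschema", "mytable", parts, "data.part1.csv"));
  }
}
```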

How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?

Data is further transformed and queried with Trino (f.k.a. PrestoSQL) using the Hive connector. I'd like to add the CSV data written by Airbyte as an external table in Hive (see wiki). From there on, the data can be queried and transformed, e.g. converted to Parquet, directly with Trino (or with something like dbt).

What is the UX you want for interacting with the S3 connector? What inputs/configuration do you want to give it, and what do you want the output to look like?

  • Specify S3 credentials
  • Specify additional S3 settings like host, region and URL scheme, to be able to use Airbyte with alternative S3-compatible object storage implementations like MinIO (see the sketch after this list)
  • Define partition keys to be used from the source data
  • Set compression on/off
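For the MinIO point above, the connector essentially just needs to allow overriding the endpoint and enabling path-style access. A minimal sketch with the AWS SDK for Java v2 (the endpoint, region, and credentials are placeholders for a local MinIO instance):

```java
import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Configuration;

public class S3CompatibleClient {
  public static void main(String[] args) {
    S3Client s3 = S3Client.builder()
        .endpointOverride(URI.create("http://localhost:9000"))   // placeholder MinIO endpoint
        .region(Region.US_EAST_1)                                // a region is still required by the SDK
        .credentialsProvider(StaticCredentialsProvider.create(
            AwsBasicCredentials.create("minio-access-key", "minio-secret-key")))
        // MinIO is usually addressed with path-style URLs rather than virtual-hosted-style ones.
        .serviceConfiguration(S3Configuration.builder().pathStyleAccessEnabled(true).build())
        .build();

    s3.listBuckets().buckets().forEach(b -> System.out.println(b.name()));
  }
}
```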

@mmolimar
Contributor

@mmolimar for the initial release we are going to support CSV as the data format, but we will quickly start adding other formats like JSONL, Parquet, and Avro. We will also separately release a Delta Lake connector (#2075) in the future to make importing into Delta Lake a seamless process.

Thanks @sherifnada
Are there connectors which retrieve data with nested structures? If so, how would they work in CSV?

@an0nym3sh

Hi @sherifnada, I was planning to build an S3 destination connector myself and came across Airbyte's Redshift connector, which has logic that puts data into S3 and then copies it into Redshift from there. What I can do is reverse-engineer the Redshift connector up to the S3 step and build it in my dev version. However, it would be great if you could guide me as well.
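For context, the stage-then-load pattern I'm referring to boils down to something like this (a rough sketch over JDBC, not the actual connector code; the table, bucket, IAM role, and connection details are placeholders, and the Redshift JDBC driver is assumed to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyStagedFiles {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details for a Redshift cluster.
    String url = "jdbc:redshift://example-cluster.example.com:5439/dev";
    try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
         Statement stmt = conn.createStatement()) {
      // Step 1 (not shown): the connector writes CSV files to S3 -- the staging step
      // that could be reused as a standalone S3 destination.
      // Step 2: issue a COPY so Redshift bulk-loads the staged files.
      stmt.execute(
          "COPY public.my_table "
          + "FROM 's3://my-staging-bucket/airbyte/my_table/' "
          + "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
          + "CSV GZIP");
    }
  }
}
```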

Thanks and regards.

@sherifnada
Contributor

@janhavi-debug no need! We should have a PR for the S3 destination by EOW :) Though we'd love it if you helped us understand your use case by answering some of the questions above!

@olivermeyer
Contributor

Hi @sherifnada, just adding my two cents on some of your questions, as we're waiting for this S3 destination before starting to use Airbyte.

Data format you'd like the connector to support

Definitely Parquet!

Does the data you write need to be partitioned? What partitioning scheme do you need?

Yes, and I agree with Hive partitioning. One key feature for us is the ability to use partition values which are not in the data (e.g. we partition by processing time, not event time).
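To illustrate what I mean, the partition value would come from the write path rather than from the record, roughly like this (sketch only; the names are made up):

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ProcessingTimePartition {
  public static void main(String[] args) {
    // The record only carries an event_time; the partition value is the processing date,
    // taken at write time and never stored inside the file.
    String record = "{\"event_time\": \"2021-05-20T08:15:00Z\", \"user_id\": 42}";

    String processingDate = LocalDate.now(ZoneOffset.UTC)
        .format(DateTimeFormatter.ISO_LOCAL_DATE);   // e.g. 2021-05-27

    String key = "myschema/mytable/processing_date=" + processingDate + "/part-0000.parquet";
    System.out.println("s3://my-bucket/" + key + "  <-  " + record);
  }
}
```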

How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?

Data will be read by Redshift Spectrum.

Thanks and looking forward to the release!

@tuliren mentioned this issue on May 27, 2021
@tuliren linked a pull request on May 27, 2021 that will close this issue
@tuliren
Contributor

tuliren commented Jun 3, 2021

Update on this topic:

  1. The S3 destination connector has been merged to master. Here is the documentation.
  2. Currently it only supports the CSV format.
  3. We are working on supporting the Parquet format right now (Destination S3: support writing Parquet data format #3642).
  4. More formats and configurations will be supported in the future. Feel free to create new tickets for feature and configuration requests.
