
S3 as destination #573

Closed
deepakksahu opened this issue Oct 15, 2020 · 21 comments · Fixed by #3672

Comments

@deepakksahu

deepakksahu commented Oct 15, 2020

Tell us about the new integration you’d like to have

AWS S3 is a heavily used destination for data lakes. If we could sink data to it, that would be great.

Describe the context around this new integration

Our Data Engineering team does the daily ingestion, and the sync destination is always S3 for me.

Describe the alternative you are considering or using

Custom spark program.


@deepakksahu added the area/connectors and new-connector labels on Oct 15, 2020
@sherifnada
Contributor

@deepakksahu thanks for making this request! Is there a specific data format you want to send to S3 (e.g. CSV), or are there different formats you would be interested in sending?

@deepakksahu
Author

The most commonly used file formats for OLAP are Parquet and ORC. In some cases it can be Avro or CSV (rarely) as well.

In my opinion, supporting all four would be great. Otherwise we can start with the first two.

@sherifnada
Contributor

We'll prioritize S3 as a destination in our roadmap shortly. Thanks for the clarification!

@rubenssoto

Hello guys,

Parquet is a good one. :)

@dnskr

dnskr commented Jan 26, 2021

Hi!
Delta Lake is a very good and popular candidate: Home and Docs

@brucemen711

brucemen711 commented Feb 9, 2021

Is anyone working on a Delta Lake destination? We are looking forward to this.

@kyprifog

kyprifog commented Apr 1, 2021

Any progress on this one? CSV destination for things like sheets sources would be nice.

@sherifnada
Contributor

Hi @kyprifog! This is not on Airbyte's roadmap for April, but it is definitely top of mind for us, and we will update you as we form our May roadmap. We'll also create some documentation on how to create a destination yourself in case you need to unblock yourself in the meantime.

@kyprifog

kyprifog commented Apr 13, 2021

@sherifnada any update on that documentation?

(Apologies in advance, I haven't actually tried it, so it may be self-apparent.)

@sherifnada
Contributor

sherifnada commented Apr 13, 2021

@kyprifog we'll have it this week! Follow along here: #2641

@eduardgruy

+1. It would be great to have an S3 destination that supports multiple formats, ideally Delta Lake/CSV/Parquet.
I suppose S3 would be used as a kind of landing zone, so appending changes to a Delta Lake table would be perfect.

@sherifnada
Contributor

Hi everyone! We're going to start work on the S3 destination very soon and would love to hear your feedback on features you need. Could you comment on this thread about any of the following:

  • What volume of data do you anticipate replicating? An estimate in bytes or number of records is helpful.
  • Data format you'd like the connector to support
  • Does the data you write need to be partitioned? What partitioning scheme do you need?
  • How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?
  • What is the UX you want for interacting with the S3 connector? What inputs/configuration do you want to give it, and what do you want the output to look like?

@mmolimar
Contributor

Hi @sherifnada
This is great news! My thoughts below :-)

Hi everyone! We're going to start work on the S3 destination very soon and would love to hear your feedback on features you need. Could you comment on this thread about any of the following:

  • What volume of data do you anticipate replicating? An estimate in bytes or number of records is helpful.

We've estimated about 5-10 million records per connection, with dozens of connections (depending on sync settings).

  • Data format you'd like the connector to support

JSONL and Avro. If possible, something like Delta Lake with "upserts" would be great!

  • Does the data you write need to be partitioned? What partitioning scheme do you need?

Yes. Partitioned by stream name and load timestamp, plus some custom partitions to be filled in by the user.

  • How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?

Yes, using Redshift, Athena, and other downstream apps to process this data.

  • What is the UX you want for interacting with the S3 connector? What inputs/configuration do you want to give it, and what do you want the output to look like?

A form to configure the credentials and an S3 path that could include dynamic values (something like parsing DateTimeFormatter patterns to generate dynamic URIs).
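For illustration, this is the kind of dynamic path generation I have in mind, as a rough sketch using plain java.time (the pattern, bucket, and stream names are made up, not an existing Airbyte option):

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class DynamicS3Path {
  public static void main(String[] args) {
    // Hypothetical user-supplied pattern; the formatted timestamp becomes part of the object key.
    DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");
    ZonedDateTime loadedAt = ZonedDateTime.now(ZoneOffset.UTC);

    String bucket = "my-data-lake";   // made-up bucket
    String stream = "orders";         // stream name from the sync
    String key = String.format("airbyte/%s/%s/part-0001.jsonl", stream, loadedAt.format(pattern));

    // Prints something like: s3://my-data-lake/airbyte/orders/2021/05/24/13/part-0001.jsonl
    System.out.println("s3://" + bucket + "/" + key);
  }
}
```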

@kyprifog

kyprifog commented May 24, 2021

I had a very simple use case: I wanted to create a quick and easy sync between Google Sheets and S3. There are other ways to do this, but I wanted to see if I could knock it out with Airbyte since I was using it for other things. I have other tools that I use instead for large volumes of data or partitioned data.

@sherifnada
Contributor

@mmolimar for the initial release we are going to support CSV as the data format, but we will quickly start adding other formats like JSONL, Parquet, and Avro. We will also separately release a Delta Lake connector (#2075) in the future to make importing into Delta Lake a seamless process.

@bo5o

bo5o commented May 25, 2021

Hi @sherifnada

Data format you'd like the connector to support?

CSV (maybe gzipped). Later support for Parquet, Avro, Delta Lake would be nice.

Does the data you write need to be partitioned? What partitioning scheme do you need?

Yes. The Hive data model (see wiki) is probably very popular and is also the one I use. So it would be nice to be able to specify one or more columns in the source that are used to partition the data on S3 (e.g. s3a://mybucket/myschema/mytable/country=ABC/region=DEF/data.part1.csv). The country and region columns don't have to be in data.part1.csv if partitioning is used like that. Additional data could be added in the future using the correct object/partition key (e.g. /country=ABC/region=DEF/data.part2.csv). This scheme is often used with date columns, so that queries targeting a specific date range in the WHERE clause can be answered more efficiently.
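To make the layout concrete, a Hive-style object key like the one above could be derived roughly as follows (a hypothetical helper, not Airbyte code; it assumes the partition columns are dropped from the file itself):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class HivePartitionKey {
  // Builds e.g. "myschema/mytable/country=ABC/region=DEF/data.part1.csv"
  static String objectKey(String schema, String table,
                          Map<String, String> partitionValues, String fileName) {
    String partitions = partitionValues.entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining("/"));
    return String.join("/", schema, table, partitions, fileName);
  }

  public static void main(String[] args) {
    Map<String, String> parts = new LinkedHashMap<>(); // insertion order = partition order
    parts.put("country", "ABC");
    parts.put("region", "DEF");
    System.out.println("s3a://mybucket/" + objectKey("myschema", "mytable", parts, "data.part1.csv"));
  }
}
```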

How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?

Data is further transformed and queried with Trino (f.k.a. PrestoSQL) using the Hive connector. I'd like to add the CSV data written by Airbyte as an external table in Hive (see wiki). From there on, the data can be queried and transformed, e.g. converted to Parquet, directly with Trino (or with something like dbt).

What is the UX you want for interacting with the S3 connector? What inputs/configuration do you want to give it, and what do you want the output to look like?

  • Specify S3 credentials
  • Specify additional S3 settings like host, region and URL scheme, to be able to use Airbyte with alternative S3-compatible object storage implementations like MinIO (see the sketch after this list)
  • Define partition keys to be used from the source data
  • Set compression on/off
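For the MinIO point above, the connector essentially just needs to allow overriding the endpoint and enabling path-style access. A minimal sketch with the AWS SDK for Java v2 (the endpoint, region, and credentials are placeholders for a local MinIO instance):

```java
import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Configuration;

public class S3CompatibleClient {
  public static void main(String[] args) {
    S3Client s3 = S3Client.builder()
        .endpointOverride(URI.create("http://localhost:9000"))   // placeholder MinIO endpoint
        .region(Region.US_EAST_1)                                // a region is still required by the SDK
        .credentialsProvider(StaticCredentialsProvider.create(
            AwsBasicCredentials.create("minio-access-key", "minio-secret-key")))
        // MinIO is usually addressed with path-style URLs rather than virtual-hosted-style ones.
        .serviceConfiguration(S3Configuration.builder().pathStyleAccessEnabled(true).build())
        .build();

    s3.listBuckets().buckets().forEach(b -> System.out.println(b.name()));
  }
}
```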

@mmolimar
Contributor

@mmolimar for the initial release we are going to support CSV as the data format, but we will quickly start adding other formats like JSONL, Parquet, and Avro. We will also separately release a Delta Lake connector (#2075) in the future to make importing into Delta Lake a seamless process.

Thanks @sherifnada
Are there connectors which retrieve data with nested structures? If so, how would they work in CSV?

@an0nym3sh

Hi @sherifnada, I was planning to build an S3 destination connector myself and came across Airbyte's Redshift connector, which has logic that puts data into S3 and then copies it into Redshift from there. What I can do is reverse-engineer the Redshift connector up to the S3 step and build it in my dev version. However, it would be great if you could guide me as well.
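For context, the stage-then-load pattern I'm referring to boils down to something like this (a rough sketch over JDBC, not the actual connector code; the table, bucket, IAM role, and connection details are placeholders, and the Redshift JDBC driver is assumed to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyStagedFiles {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details for a Redshift cluster.
    String url = "jdbc:redshift://example-cluster.example.com:5439/dev";
    try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
         Statement stmt = conn.createStatement()) {
      // Step 1 (not shown): the connector writes CSV files to S3 -- the staging step
      // that could be reused as a standalone S3 destination.
      // Step 2: issue a COPY so Redshift bulk-loads the staged files.
      stmt.execute(
          "COPY public.my_table "
          + "FROM 's3://my-staging-bucket/airbyte/my_table/' "
          + "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
          + "CSV GZIP");
    }
  }
}
```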

Thanks and regards.

@sherifnada
Contributor

@janhavi-debug no need! We should have a PR for the S3 destination by EOW :) Though we'd love it if you helped us understand your use case by answering some of the questions above!

@olivermeyer
Contributor

Hi @sherifnada, just adding my two cents on some of your questions, as we're waiting for this S3 destination before starting to use Airbyte.

Data format you'd like the connector to support

Definitely Parquet!

Does the data you write need to be partitioned? What partitioning scheme do you need?

Yes, and I agree with Hive partitioning. One key feature for us is the ability to use partition values which are not in the data (e.g. we partition by processing time, not event time).
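To illustrate what I mean, the partition value would come from the write path rather than from the record, roughly like this (sketch only; the names are made up):

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ProcessingTimePartition {
  public static void main(String[] args) {
    // The record only carries an event_time; the partition value is the processing date,
    // taken at write time and never stored inside the file.
    String record = "{\"event_time\": \"2021-05-20T08:15:00Z\", \"user_id\": 42}";

    String processingDate = LocalDate.now(ZoneOffset.UTC)
        .format(DateTimeFormatter.ISO_LOCAL_DATE);   // e.g. 2021-05-27

    String key = "myschema/mytable/processing_date=" + processingDate + "/part-0000.parquet";
    System.out.println("s3://my-bucket/" + key + "  <-  " + record);
  }
}
```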

How are you using the data in S3? Are you importing it into Redshift, Athena, dashboards, etc.?

Data will be read by Redshift Spectrum.

Thanks and looking forward to the release!

@tuliren mentioned this issue on May 27, 2021
@tuliren linked a pull request on May 27, 2021 that will close this issue
@tuliren
Contributor

tuliren commented Jun 3, 2021

Update on this topic:

  1. The S3 destination connector has been merged to master. Here is the documentation.
  2. Currently it only supports the CSV format.
  3. We are working on supporting the Parquet format right now (Destination S3: support writing Parquet data format #3642).
  4. More formats and configurations will be supported in the future. Feel free to create new tickets for feature and configuration requests.
