Destination Azure Blob Storage: Support writing timestamps #6063

Closed
Tracked by #6996
ghost opened this issue Sep 14, 2021 · 5 comments · Fixed by #9682
Comments

ghost commented Sep 14, 2021

Tell us about the problem you're trying to solve

I'd like to be able to use Azure Blob Storage (or S3 / GCS) as a durable data lake while also facilitating quick loads into a DW, like Snowflake and BigQuery.

Describe the solution you’d like

The option to append the current timestamp (_airbyte_emitted_at) to the resulting filename in Cloud Storage. This would allow incremental reads to create individual files that can be loaded, queried, and managed efficiently.

Describe the alternative you’ve considered or used

An alternative would be to manage a larger workflow outside of Airbyte that loads the file, copies it to a durable location, and then removes the original.

Another alternative would be to enhance DW destinations that leverage Cloud Storage by allowing the user to retain the staged data, as opposed to removing it automatically. I could see value in both enhancements.

Additional context

Similar to #4610

Are you willing to submit a PR?

Perhaps. :)

andriikorotkov commented Jan 19, 2022

Hello, @dsdorazio and @sherifnada. I would like to share my vision of this task. I have opened a pull request in which S3 and GCS can be used as staging. The data from each synchronization is saved as a separate file in staging, while Azure Blob Storage always holds a single blob, named after the Airbyte stream, containing the current data. Is this solution right for you?

If the solution that I described is not suitable, please describe in more detail the solution that you propose.

tuliren commented Jan 20, 2022

Let's move our discussion from #9336 to here.

I have opened a pull request, in which S3 and GCS can be used as staging.

I don't think this is the right solution. The purpose of the Azure Blob Storage destination is to store objects directly on Azure. Adding S3 or GCS as a staging area unnecessarily copies the data first to S3 or GCS, and then to Azure.

What's the current filename output by the Azure destination? It seems to me that if the Azure output filename follows a pattern similar to S3 or GCS, a timestamp will already be included in the filename.

Here is what the S3 destination filename looks like:

https://docs.airbyte.com/integrations/destinations/s3#configuration

testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv
↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
|              |                |      |     |          |             | format extension
|              |                |      |     |          |             partition id
|              |                |      |     |          upload time in millis
|              |                |      |     upload date in YYYY-MM-DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name

Same for GCS:
https://docs.airbyte.com/integrations/destinations/gcs#configuration
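
As a rough illustration (and not the actual connector code), a name following this pattern could be assembled like the sketch below; the helper and parameter names are hypothetical and chosen only to mirror the annotated example above.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of assembling an S3-style object name; not the actual connector code.
public class TimestampedNameSketch {

  private static final DateTimeFormatter DATE_FORMAT =
      DateTimeFormatter.ofPattern("yyyy_MM_dd").withZone(ZoneOffset.UTC);

  // e.g. objectName("public", "users", Instant.ofEpochMilli(1609541171643L), 0, "csv")
  //   -> public/users/2021_01_01_1609541171643_0.csv
  static String objectName(String namespace, String stream, Instant uploadTime,
                           int partitionId, String extension) {
    String prefix = (namespace == null || namespace.isEmpty()) ? "" : namespace + "/";
    return prefix + stream + "/"
        + DATE_FORMAT.format(uploadTime) + "_"   // upload date in yyyy_MM_dd
        + uploadTime.toEpochMilli() + "_"        // upload time in millis
        + partitionId + "."                      // partition id
        + extension;                             // format extension
  }

  public static void main(String[] args) {
    System.out.println(objectName("public", "users",
        Instant.ofEpochMilli(1609541171643L), 0, "csv"));
  }
}
```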

andriikorotkov commented Jan 20, 2022

@tuliren, the current filename output by the Azure destination is the stream name. Also, all blobs are stored in a single container, which is specified by the user when creating the destination.

testing_container/blob_name
↑                 ↑
|                 |
|                 stream name
|
user container name

Are you suggesting changing this to -

container_with_stream_name/2021_01_01_1609541171643_0

Or is the next option better?

testing_container/stream_name__2021_01_01_1609541171643_0

tuliren commented Jan 20, 2022

Got it. I think keeping the same pattern as S3 should be good:

<bucket>/<output_path>/<namespace-if-there-is-one>/<stream-name>/2021_01_01_1609541171643_0.csv

tuliren commented Jan 24, 2022

@andriikorotkov, was a previous comment deleted? Although folders are not supported in Azure, the object path can have / in it so that it looks like a traditional path.
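
To illustrate that last point, here is a minimal, hypothetical sketch using the Azure Storage Blob SDK for Java (v12); the connection string, container name, and blob path are placeholders, and this is not the destination's actual code. It simply shows that a blob name containing `/` can be uploaded as-is, so the proposed S3-style path works on Azure.

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class SlashBlobNameSketch {
  public static void main(String[] args) {
    // Placeholder credentials and container name.
    BlobContainerClient container = new BlobContainerClientBuilder()
        .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
        .containerName("testing_container")
        .buildClient();

    // Azure Blob Storage has no real folders, but '/' is legal in a blob name,
    // so the S3-style path from the comment above can be used directly.
    String blobName = "data_output_path/public/users/2021_01_01_1609541171643_0.csv";

    byte[] data = "id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8);
    BlobClient blob = container.getBlobClient(blobName);
    blob.upload(new ByteArrayInputStream(data), data.length, true); // overwrite if present

    // Tools that list the container with a '/' delimiter (e.g. Azure Storage Explorer)
    // will render the name as a folder hierarchy.
  }
}
```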
