[destinations] Databricks #762

Closed · AstrakhantsevaAA opened this issue on Nov 13, 2023 · 6 comments · Fixed by #892

AstrakhantsevaAA (Contributor) commented on Nov 13, 2023

Feature description

Implement a Databricks destination. It will be quite similar to the Snowflake implementation (a rough configuration sketch follows the lists below).

Authorization:

  • Databricks personal access token
  • OAuth with refresh token

Support data sources (staging storage):

  • Managed tables
  • Amazon S3 (External storage)
  • Azure Blob Storage (External storage)
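
A hypothetical sketch of what pipeline setup could look like once this destination exists, modeled on the Snowflake destination. The destination name "databricks", the staging wiring, and the credential variable names are assumptions, not an existing API:

```python
import dlt

@dlt.resource(table_name="events")
def events():
    # toy data; in practice this would be any dlt source/resource
    yield [{"id": 1, "name": "example"}]

pipeline = dlt.pipeline(
    pipeline_name="databricks_demo",
    destination="databricks",   # assumed destination name (does not exist yet)
    staging="filesystem",       # stage files in S3 / Azure Blob before loading
    dataset_name="raw_events",
)

# Credentials would live in secrets.toml or env vars following dlt's naming
# convention; the exact Databricks credential fields are assumptions:
#   DESTINATION__DATABRICKS__CREDENTIALS__ACCESS_TOKEN=<personal access token>
#   DESTINATION__FILESYSTEM__BUCKET_URL=s3://my-staging-bucket
info = pipeline.run(events())
print(info)
```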

Tests

  • We should enable it as a destination in tests; all tests using ALL_DESTINATIONS must pass.
  • It should pass the same common tests that the BigQuery job client and sql client pass.
  • Note that to pass the sql_client tests you'll need to map DBApi exceptions into the 3 categories that dlt needs: relation not found, terminal exceptions, and transient exceptions (see the sketch after this list). See the bigquery and postgres implementations; this is manual work.
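
A minimal sketch of the exception mapping those sql_client tests expect, modeled on the postgres and bigquery clients. The exception classes are the ones dlt uses elsewhere; the client class name and the Databricks error messages matched below are assumptions and would have to be taken from the real DBApi driver:

```python
from dlt.destinations.exceptions import (
    DatabaseTerminalException,
    DatabaseTransientException,
    DatabaseUndefinedRelation,
)

class DatabricksSqlClient:  # hypothetical client, name is an assumption
    @staticmethod
    def _make_database_exception(ex: Exception) -> Exception:
        msg = str(ex).lower()
        # 1. relation not found: tells dlt a table/schema is missing
        if "table or view not found" in msg or "schema not found" in msg:
            return DatabaseUndefinedRelation(ex)
        # 2. transient errors: dlt may retry the load job
        if "temporarily unavailable" in msg or "timeout" in msg:
            return DatabaseTransientException(ex)
        # 3. everything else is terminal: fail the job
        return DatabaseTerminalException(ex)
```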
AstrakhantsevaAA changed the title from "[destinations] Databricks Azure" to "[destinations] Databricks" on Nov 13, 2023

rudolfix (Collaborator) commented:

@AstrakhantsevaAA let's sync on Slack as well. We should invite the implementer as a contributor so it is easy to run our CI.

phillem15 (Contributor) commented on Nov 13, 2023

@AstrakhantsevaAA, this looks great. Under "Support data sources", does this mean that S3 and ADLS would only be supported for the staging part? Or would this be available for the destination dataset as well? We would likely want the option to have the destination dataset created as an "external table".

Also, Auto Loader for Databricks SQL is currently in public preview (see here). I typically use COPY INTO, but having Auto Loader as an option would be nice.
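
For reference, a COPY INTO statement issued through the Databricks SQL connector looks roughly like this; the table, bucket, and connection values are placeholders:

```python
from databricks import sql  # databricks-sql-connector

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
    access_token="dapi...",                                        # placeholder PAT
) as conn:
    with conn.cursor() as cursor:
        # load staged parquet files from external storage into a table
        cursor.execute("""
            COPY INTO my_catalog.my_schema.events
            FROM 's3://my-staging-bucket/events/'
            FILEFORMAT = PARQUET
            COPY_OPTIONS ('mergeSchema' = 'true')
        """)
```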

AstrakhantsevaAA (Contributor, Author) commented on Nov 14, 2023

Under "Support data sources"

@phillem15 yes, I meant staging data sources. Users can use the filesystem destination to load files directly to S3, Azure, or GCS storage, or use the filesystem verified source as a source to load files from buckets into the Databricks workspace.
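
A minimal sketch of that first option (the existing filesystem destination landing files in a bucket that Databricks can read as external/staging storage); the bucket URL and table name are placeholders:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="stage_to_bucket",
    destination="filesystem",
    dataset_name="staged_events",
)
# the bucket normally comes from config/secrets, e.g.
#   DESTINATION__FILESYSTEM__BUCKET_URL=s3://my-staging-bucket
pipeline.run(
    [{"id": 1, "name": "example"}],
    table_name="events",
    loader_file_format="parquet",
)
```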

rudolfix moved this from Todo to In Progress in dlt core library on Nov 15, 2023

rudolfix (Collaborator) commented on Nov 15, 2023

@phillem15 are people using the Databricks file system the way they use S3 for Athena? We could actually support it in our filesystem destination and source, so people can send files to it as staging (and also read them back).
https://filesystem-spec.readthedocs.io/en/stable/_modules/fsspec/implementations/dbfs.html
UPDATE: it seems that dbfs is an abstraction over bucket storage, so the question is whether going via dbfs is something people do, or whether they just interact with bucket storage directly.
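
A quick sketch of what talking to DBFS through that fsspec implementation looks like; the workspace host and token are placeholders, and whether staging should go via dbfs at all (vs. the underlying bucket) is exactly the open question above:

```python
import fsspec

fs = fsspec.filesystem(
    "dbfs",
    instance="adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace host
    token="dapi...",                                         # placeholder personal access token
)
print(fs.ls("/"))  # list the DBFS root
```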

mrhopko commented on Dec 6, 2023

Is this being worked on? I am happy to contribute if required.

AstrakhantsevaAA (Contributor, Author) commented:

> Is this being worked on? I am happy to contribute if required.

@phillem15 is working on it. I'll check the development status, and if it turns out that we need help, I'll get back to you. Thank you for your interest!
