
Databricks destination #892

Merged: 55 commits into devel from sthor/databricks on Feb 5, 2024

Conversation

steinitzu (Collaborator) commented Jan 12, 2024:

Description

Draft of databricks destination #762

Related Issues

Additional Context

At the moment COPY INTO is always used and requires an s3/azure staging bucket.
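
For context, a minimal usage sketch with a staging bucket (pipeline, dataset, and table names are made up; the staging argument follows dlt's usual staging pattern and is an assumption here, not code from this PR):

```python
import dlt

# Databricks loads via COPY INTO from a staged location (s3:// or azure),
# so the pipeline is configured with a filesystem staging destination.
pipeline = dlt.pipeline(
    pipeline_name="databricks_demo",
    destination="databricks",
    staging="filesystem",  # bucket_url for the staging bucket comes from config/secrets
    dataset_name="demo_data",
)

load_info = pipeline.run([{"id": 1, "name": "example"}], table_name="items")
print(load_info)
```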

netlify bot commented Jan 12, 2024:

Deploy Preview for dlt-hub-docs ready!
🔨 Latest commit: dc984df
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65bc422f2014e1000869ce00
😎 Deploy Preview: https://deploy-preview-892--dlt-hub-docs.netlify.app

def capabilities() -> DestinationCapabilitiesContext:
    caps = DestinationCapabilitiesContext()
    caps.preferred_loader_file_format = "parquet"
    caps.supported_loader_file_formats = ["parquet"]
steinitzu (Collaborator, Author):

I could not get jsonl to work with COPY INTO. I suspect it's because of gzip: either Databricks doesn't support reading gzipped JSON or, more likely, it requires the file extension json.gz.

Collaborator:

can you confirm if using a different extension works?

sh-rp (Collaborator) commented Jan 21, 2024:

I had a look at the Databricks documentation. Since we have jsonl and not json in our staging location, you probably have to set the multiLine mode to true. Maybe that'll help, let me know.
https://docs.databricks.com/en/sql/language-manual/delta-copy-into.html
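
For reference, a rough sketch of the kind of statement the copy job would issue for a staged jsonl file (table, path, and options are illustrative, not the exact SQL generated in this PR):

```python
# Illustrative only: builds a Databricks COPY INTO statement for a staged jsonl file;
# "multiLine" is one of the JSON format options from the linked docs.
qualified_table = "my_catalog.my_dataset.my_table"
staged_file = "s3://staging-bucket/my_dataset/my_table/1234.jsonl"

copy_sql = f"""
COPY INTO {qualified_table}
FROM '{staged_file}'
FILEFORMAT = JSON
FORMAT_OPTIONS ('multiLine' = 'false')
"""
```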

steinitzu (Collaborator, Author):

The multiline mode applies to records that span multiple lines; jsonl otherwise works by default, but only with gz compression disabled.
I have to try a script with a jsonl.gz filename; it looks like a lot of work to actually change the extensions in dlt.

steinitzu (Collaborator, Author):

Confirmed that gzip works with the jsonl.gz extension.
It might be good to change this so dlt creates files with the gz extension, which is pretty universally supported, but I would leave that for another PR.
I just have parquet enabled for now.

rudolfix (Collaborator):

@steinitzu my 2 cents:
Is there really no way to set the compression explicitly, i.e. in copy or format options?
If not, then please do not disable JSONL; just detect that compression is enabled and fail the job with an exception (i.e. ask to disable compression).
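
A minimal sketch of the guard being asked for here (the exception class appears in this PR, but its import path and constructor arguments are assumptions):

```python
from dlt.destinations.exceptions import LoadJobTerminalException  # import path assumed

def ensure_jsonl_not_compressed(file_path: str, compression_disabled: bool) -> None:
    # COPY INTO cannot read the gzipped files without a .gz extension, so fail
    # the job with a clear message instead of silently dropping jsonl support.
    if file_path.endswith(".jsonl") and not compression_disabled:
        raise LoadJobTerminalException(
            file_path,
            "Databricks loader does not support gzip compressed jsonl files. "
            "Disable compression in the data writer configuration.",
        )
```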

Collaborator:

Another ask @steinitzu: could you write a ticket to add gz extensions to zipped files in dlt load packages?

steinitzu (Collaborator, Author):

> @steinitzu my 2 cents: is there really no way to set the compression explicitly, i.e. in copy or format options?

Hey @rudolfix, no, there's no documented option for it.

> If not, then please do not disable JSONL; just detect that compression is enabled and fail the job with an exception (i.e. ask to disable compression).

Hmm yeah, that makes sense for now, until we change the extensions.

steinitzu (Collaborator, Author) commented Feb 1, 2024:

It fails on empty jsonl files too. Produced by this test: https://github.com/dlt-hub/dlt/blob/sthor%2Fdatabricks/tests/load/pipeline/test_replace_disposition.py#L154-L163
Is this supposed to produce a stage file? Not sure if I'm doing something wrong.

Edit: workaround https://github.com/dlt-hub/dlt/blob/sthor%2Fdatabricks/dlt/destinations/impl/databricks/databricks.py#L187-L190

Collaborator:

Yes, empty jsonl files are correct: they mean that the content of the table should be dropped, as there are no rows in the current load. We generate them e.g. to delete data from child tables on replace when there was no data at all for a given table. Your workaround is good and will do what we expected! Thanks.
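
A minimal sketch of what such a workaround might look like (hypothetical; the actual code lives in the linked databricks.py lines):

```python
import os
from typing import Optional

# Hypothetical sketch: dlt emits empty jsonl files to signal "no rows for this
# table in this load", so the copy job can simply skip COPY INTO for them.
def build_copy_sql(qualified_table: str, local_file_path: str, staged_file_url: str) -> Optional[str]:
    if os.path.getsize(local_file_path) == 0:
        return None  # nothing to copy; replace/truncate is handled elsewhere
    return f"COPY INTO {qualified_table} FROM '{staged_file_url}' FILEFORMAT = JSON"
```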

steinitzu force-pushed the sthor/databricks branch 2 times, most recently from 9b712aa to e45ce07 on January 18, 2024 00:51
sh-rp (Collaborator) left a comment:

very nice work, I have a few comments

dlt/common/data_writers/writers.py (outdated review thread, resolved)
"schema",
]

def __post_init__(self) -> None:
Collaborator:

We also have the on_resolved function that gets called after the configuration is resolved; maybe you can use that, or do you need this to run before?
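
For illustration, a rough sketch of the two hooks on a dlt configuration spec (the class, fields, and import paths here are assumptions, not code from this PR):

```python
from typing import Optional

from dlt.common.configuration import configspec
from dlt.common.configuration.specs import CredentialsConfiguration

@configspec
class ExampleCredentials(CredentialsConfiguration):
    host: Optional[str] = None
    http_path: Optional[str] = None

    def on_resolved(self) -> None:
        # Runs after all config values have been injected, unlike __post_init__,
        # which fires on instantiation before resolution.
        if self.http_path and not self.http_path.startswith("/"):
            self.http_path = "/" + self.http_path
```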

steinitzu (Collaborator, Author):

Yeah, on_resolved should work. I didn't touch these classes much from the previous PR since they seemed to work; they need some tweaks.

steinitzu (Collaborator, Author):

For now I removed a lot of this and am only supporting the "personal access token" auth.
I don't have a way to test OAuth at the moment; it will need to be done somehow so we're not prompting on every connection.
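
For reference, a sketch of what secrets.toml might look like for token auth (the section path and field names are assumptions based on the Databricks SQL connector's connection parameters, not necessarily the final spec in this PR):

```toml
# hypothetical secrets.toml entries for personal access token auth
[destination.databricks.credentials]
server_hostname = "adb-1234567890123456.7.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/abcdef1234567890"
access_token = "dapi-xxxxxxxx"
catalog = "my_catalog"
```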

f"Databricks cannot load data from staging bucket {bucket_path}. Only s3 and azure buckets are supported",
)
else:
raise LoadJobTerminalException(
Collaborator:

Somehow I think it would be nice to be able to define in the capabilities which staging filesystem types are supported by each destination. Not sure whether to add this now or later, but it would save some of this code in a few places and we could also catch config errors earlier.

steinitzu (Collaborator, Author):

Good idea to have this in capabilities. Can do that and refactor in another PR.
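
A rough sketch of that idea (field and function names are invented for illustration; this is not the capabilities API as it exists today):

```python
from typing import List

# Hypothetical: declare the supported staging bucket protocols on the
# destination capabilities and validate once, instead of per-job checks.
class StagingCapabilitiesSketch:
    supported_staging_protocols: List[str] = ["s3", "az", "abfss"]

def validate_staging_bucket(caps: StagingCapabilitiesSketch, bucket_url: str) -> None:
    protocol = bucket_url.split("://", 1)[0]
    if protocol not in caps.supported_staging_protocols:
        raise ValueError(
            f"Staging bucket {bucket_url} uses unsupported protocol '{protocol}'; "
            f"supported protocols: {caps.supported_staging_protocols}"
        )
```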

dlt/destinations/impl/databricks/databricks.py (two outdated review threads, resolved)

        return sql

    def _execute_schema_update_sql(self, only_tables: Iterable[str]) -> TSchemaTables:
Collaborator:

why do you need to override this method? won't the one in the baseclass work?

steinitzu (Collaborator, Author):

Now it does. I updated the base class to use execute_many. The base sql client implements it, depending on capabilities, here:
https://github.com/dlt-hub/dlt/blob/sthor%2Fdatabricks/dlt/destinations/sql_client.py#L122-L139
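
Roughly, the linked execute_many behaves like the sketch below: one multi-statement script when the capabilities allow it, otherwise statement-by-statement execution (simplified, not the actual implementation):

```python
from typing import Any, List, Optional, Sequence

def execute_many_sketch(
    sql_client: Any, statements: Sequence[str], supports_multiple_statements: bool
) -> Optional[List[Any]]:
    # Simplified: the real method also handles fragments and result collection differently.
    if supports_multiple_statements:
        return sql_client.execute_sql("\n".join(statements))
    return [sql_client.execute_sql(statement) for statement in statements]
```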

dlt/destinations/impl/databricks/sql_client.py (outdated review thread, resolved)
tests/load/test_job_client.py (outdated review thread, resolved)
phillem15 and others added 25 commits January 26, 2024 16:31, including:
- …configuration and rename classes and methods
- …instead of database. Add has_dataset method to check if dataset exists.
- …method to use correct query based on presence of catalog in db_params.
- …readability and add optional parameters.
steinitzu marked this pull request as ready for review on January 30, 2024 20:51
rudolfix previously approved these changes Feb 1, 2024

rudolfix (Collaborator) left a comment:

LGTM! thanks everyone for the quality work! if @sh-rp agrees we should merge this.

        super().__init__(
            credentials=credentials,
            stage_name=stage_name,
            keep_staged_files=keep_staged_files,
Collaborator:

is this used anywhere?

sh-rp (Collaborator) left a comment:

a few small comments again :)

Comment on lines +58 to +63
The `jsonl` format has some limitations when used with Databricks:

1. Compression must be disabled to load jsonl files in Databricks. Set `data_writer.disable_compression` to `true` in the dlt config when using this format (see the config sketch below).
2. The following data types are not supported when using the `jsonl` format with `databricks`: `decimal`, `complex`, `date`, `binary`. Use `parquet` if your data contains these types.
3. The `bigint` data type with precision is not supported with the `jsonl` format.
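
For point 1 above, a minimal config sketch (assuming the data writer options live under the usual `[normalize.data_writer]` section):

```toml
# disable gzip compression for the intermediary jsonl files
[normalize.data_writer]
disable_compression=true
```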

steinitzu (Collaborator, Author):

@sh-rp @rudolfix
JSON support is even more limited. All of these came up in tests/load/pipeline/test_stage_loading.py::test_all_data_types:

  • Binary in base64 does not work; not sure if it works at all in some other encoding.
  • Databricks parses date strings to timestamp and fails.
  • There is no general JSON or equivalent type (I think); you need to create a struct column with a defined schema, and JSON objects/arrays don't convert automatically to string.
  • Somehow a 16-bit int in JSON is detected as LONG.

Collaborator:

I definitely think we should contact Databricks about this and see what they say. @rudolfix who do we know there?

sh-rp merged commit d2dd951 into devel on Feb 5, 2024
59 checks passed
rudolfix deleted the sthor/databricks branch on February 26, 2024 18:55
Successfully merging this pull request may close these issues: [destinations] Databricks

4 participants