Add Dremio as a destination #1026

Merged
merged 74 commits into from Apr 8, 2024

maxfirman
Contributor

Description

This PR adds Dremio as a destination to DLT

Additional Context

  • Tested locally against OSS Dremio running in Docker and against a Dremio Enterprise cluster.
  • dockercompose.yaml provided for creating a local Dremio environment with MinIO as an object store for staging.

Some details for future PRs:

  • implement merge write disposition using MERGE INTO
  • implement table partition evolution


netlify bot commented Feb 28, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit db91346
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66142e1f18d3f100080bfe35

@rudolfix
Collaborator

@maxfirman this is amazing! we'll assign a reviewer tomorrow and add tests to our CI jobs after merging

tests/utils.py Outdated
@@ -60,7 +61,7 @@

# filter out active destinations for current tests
ACTIVE_DESTINATIONS = set(dlt.config.get("ACTIVE_DESTINATIONS", list) or IMPLEMENTED_DESTINATIONS)

ACTIVE_DESTINATIONS = {"dremio"}
Collaborator

note to self: remove!
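
The same narrowing can be done for a local run without editing tests/utils.py. A minimal sketch, assuming the override arrives as a JSON list in an ACTIVE_DESTINATIONS environment variable; the helper name `resolve_active_destinations` and the shortened `IMPLEMENTED_DESTINATIONS` set are hypothetical and only illustrate the idea:

```python
import json
import os

# Hypothetical, shortened stand-in for the real set defined in tests/utils.py.
IMPLEMENTED_DESTINATIONS = {"duckdb", "postgres", "dremio"}


def resolve_active_destinations(environ=None):
    """Return the destinations to test, narrowed by ACTIVE_DESTINATIONS if it is set."""
    environ = os.environ if environ is None else environ
    raw = environ.get("ACTIVE_DESTINATIONS")
    if not raw:
        # No override: run against everything that is implemented.
        return set(IMPLEMENTED_DESTINATIONS)
    requested = set(json.loads(raw))
    unknown = requested - IMPLEMENTED_DESTINATIONS
    if unknown:
        raise ValueError(f"unknown destinations: {sorted(unknown)}")
    return requested


# Narrow the run to Dremio only, without touching the source file.
print(resolve_active_destinations({"ACTIVE_DESTINATIONS": '["dremio"]'}))
```

Keeping the override outside the source file means there is no debugging line to remember to remove before merging.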

rudolfix
rudolfix previously approved these changes Apr 7, 2024
Collaborator

@maxfirman @sh-rp this is so good! this destination was quite different from the others but you were able to fit it perfectly into the existing model and test structure (thx for getting the tests working)

I opened a PR that will be used to check that the newest code works: #1194

We upgraded how we deal with configurations, parquet files and redid our CI structure so when all tests in the branch above are passing I'll push the newest version here and then we merge it.

IMO this destination is really good even now (maybe I'm missing something that dremio does and we do not support :))

  • we already have a ticket to start using MERGE SQL statements
  • adding gcp and azure blob storage is mostly a matter of config and we can do that later
  • using a sentinel table for destinations that have no schemas/datasets actually makes sense. we do that for vector databases. we also support empty dataset names so tables are not prefixed (not sure it applies here after all)
  • we can implement a dremio adapter to expose all the partition/clustering/retention features (some are implemented but adapters let users add the hints conveniently to the resources)

I also found the following comment in the tests

    # schemaless destinations allow adding of root key without the pipeline failing
    # for now this is only the case for dremio
    # doing this will result in somewhat useless behavior
    destination_allows_adding_root_key = destination_config.destination == "dremio"

tbh, to me it looks like dremio does not enforce NOT NULL and allows adding a root_id column to an existing table that already contains data, even if the column is declared NOT NULL. @maxfirman could that be the case?
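
For contrast, here is roughly what an engine that does enforce NOT NULL does in the same situation. This is a sketch using SQLite from the Python standard library as a stand-in; it says nothing about Dremio's internals, and the table name `child` is made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE child (value TEXT)")
conn.execute("INSERT INTO child VALUES ('existing row')")


def try_add_root_key(connection):
    """Try to add a NOT NULL column without a default to a table that already has rows."""
    try:
        connection.execute("ALTER TABLE child ADD COLUMN root_id TEXT NOT NULL")
    except sqlite3.OperationalError as exc:
        # An enforcing engine refuses: existing rows would violate the constraint.
        return f"rejected: {exc}"
    return "accepted"


result = try_add_root_key(conn)
print(result)
```

On an enforcing destination the pipeline fails at this point, which is the behaviour the test expects everywhere except Dremio.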

anyway thanks again for your amazing work!

@maxfirman
Contributor Author

@rudolfix thanks for the time and effort reviewing this PR!

Indeed Dremio does not support NOT NULL constraints on columns. I think you are right that this is the cause of the test failure.

I agree with all of your points. I have a few other thoughts as well for follow-up PRs:

  • Dremio recently fixed a bug that had prevented use of the ADBC client (see "[Python] Querying Dremio with the ADBC Flight SQL client", apache/arrow-adbc#1559). We could therefore remove the pydremio.py file and cut over to ADBC. I don't think there is any rush on this.
  • I'd like to test a wider selection of destination data source types in Dremio. We are somewhat hamstrung by the lack of CREATE SCHEMA support, which is why the CI tests run against a "NAS" (really local file system). I know that Dremio does support CREATE FOLDER sql statements for Nessie catalogs, so Nessie might be next logical thing to add to the testing pipelines as it should be possible to get all the tests passing against it. It might also be worth running a subset of the tests that don't require CREATE SCHEMA against a Hive Metastore data source.
  • I can foresee that we may want to expose a "path_prefix" config option to build tables at arbitrarily nested levels in a folder structure.
  • MERGE INTO sql support is vital given that Dremio doesn't support transactions, so I'm really glad to hear that is on your backlog!
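
To make the MERGE INTO point concrete: without transactions, a delete-then-insert merge cannot be made atomic, whereas a single MERGE statement does the upsert in one step. Below is a minimal sketch of a statement builder. The function name, the `t`/`s` aliases and the double-quote identifier quoting are all assumptions for illustration, and the output is generic SQL MERGE syntax that would need checking against what Dremio actually accepts:

```python
def build_merge_sql(target, staging, key_columns, columns):
    """Build a generic SQL MERGE statement that upserts `staging` rows into `target`."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Rows match when every key column is equal.
    on_clause = " AND ".join(f't."{k}" = s."{k}"' for k in key_columns)
    # Only non-key columns are overwritten on a match.
    non_keys = [c for c in columns if c not in key_columns]
    set_clause = ", ".join(f'"{c}" = s."{c}"' for c in non_keys)
    insert_cols = ", ".join(f'"{c}"' for c in columns)
    insert_vals = ", ".join(f's."{c}"' for c in columns)
    parts = [
        f"MERGE INTO {target} AS t",
        f"USING {staging} AS s",
        f"ON ({on_clause})",
    ]
    if set_clause:
        parts.append(f"WHEN MATCHED THEN UPDATE SET {set_clause}")
    parts.append(
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
    return "\n".join(parts)


# Hypothetical table names, for illustration only.
print(build_merge_sql("events", "events_staging", ["id"], ["id", "name"]))
```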

@rudolfix
Collaborator

rudolfix commented Apr 8, 2024

@maxfirman one question about this:

I'd like to test a wider selection of destination data source types in Dremio. We are somewhat hamstrung by the lack of CREATE SCHEMA support, which is why the CI tests run against a "NAS" (really local file system). I know that Dremio does support CREATE FOLDER sql statements for Nessie catalogs, so Nessie might be next logical thing to add to the testing pipelines as it should be possible to get all the tests passing against it. It might also be worth running a subset of the tests that don't require CREATE SCHEMA against a Hive Metastore data source.

we will be able to test against dremio cloud right? this "NAS" is also available there - the "local" filesystem is in the cloud then, right?

also: #1199

@maxfirman
Contributor Author

this "NAS" is also available there - the "local" filesystem is in the cloud then, right?

@rudolfix I'm not 100% sure as I'm only familiar with their on-prem offering.

If we are running tests against Dremio Cloud it may make sense to use the built in "Arctic" catalog (which is basically managed Nessie) instead.

Nessie can be run in a docker container, so we could also extend the docker-compose.yaml to stand up a Nessie server if we wanted to test this out against a Dockerised deployment.

@rudolfix rudolfix requested a review from sh-rp April 8, 2024 18:49
@rudolfix rudolfix merged commit cf3e8fc into dlt-hub:devel Apr 8, 2024
46 checks passed
Labels
community: This issue came from slack community workspace
destination: Issue related to new destinations
Projects
Status: Done

3 participants