
[2024-04-22] [BUG] Databricks Pipelines No Longer Work #4

Closed
JessicaLHartog opened this issue Apr 23, 2024 · 0 comments

Expected Behavior

  • Databricks profiling pipeline should be able to successfully profile Databricks tables for sensitive data.
  • Databricks masking pipeline should be able to successfully mask Databricks tables that contain sensitive data and copy tables that don't contain sensitive data.

Actual Behavior

  • Databricks profiling pipeline fails because it is unable to find table storage paths.
  • Databricks masking pipeline fails because it is unable to find table storage paths.

Steps To Reproduce the Problem

After importing and configuring the Databricks pipelines per their READMEs, trigger the pipelines with the relevant parameters.

Version

Latest versions of pipelines in dcsazure_Databricks_to_Databricks folder, following import.

Additional Context

Both of these pipelines had been relying on the INFORMATION_SCHEMA.TABLES table: specifically, they pulled and modified the value of its STORAGE_SUB_DIRECTORY column to determine each table's storage path in Unity Catalog.
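As a rough illustration, the lookup the pipelines depended on would have looked something like the query below. The table and column names come from the dependency described above; the exact query shape, and the use of the catalog-scoped `information_schema`, are assumptions, not the pipelines' actual code.

```sql
-- Hypothetical reconstruction of the old lookup (not the pipelines' exact query).
-- STORAGE_SUB_DIRECTORY now always returns NULL per updated Databricks documentation.
SELECT table_catalog,
       table_schema,
       table_name,
       storage_sub_directory
FROM   <catalog>.information_schema.tables
WHERE  table_schema = '<schema>';
```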

Recently, per updated Databricks documentation, that value is now always NULL.

This information will instead have to be pulled on a per-table basis, from either `DESCRIBE DETAIL <catalog>.<schema>.<table>` or `DESCRIBE TABLE EXTENDED <catalog>.<schema>.<table>`. However, the current strategy of using pipeline variables to pull and parse this value will no longer work, since variables cannot be modified in parallel; it may therefore be necessary to break these queries and their parsing out into a separate pipeline that is invoked once per table.
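A minimal sketch of the per-table parsing step described above, assuming the `DESCRIBE TABLE EXTENDED` output is available as `(col_name, data_type, comment)` rows, which is the standard shape of that command's result set. The helper name and the example rows are illustrative only, not taken from the pipelines.

```python
# Hypothetical helper: pull the storage path out of DESCRIBE TABLE EXTENDED rows.
# Rows are (col_name, data_type, comment) tuples; in the "Detailed Table
# Information" section, the path appears in the second column of the
# "Location" row.

def extract_storage_location(describe_rows):
    """Return the 'Location' value from DESCRIBE TABLE EXTENDED output, or None."""
    for col_name, value, _comment in describe_rows:
        if col_name == "Location":
            return value
    return None

# Example rows shaped like DESCRIBE TABLE EXTENDED output (values are made up):
rows = [
    ("id", "int", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Location", "abfss://container@account.dfs.core.windows.net/path/to/table", ""),
]
print(extract_storage_location(rows))
```

Running this query-and-parse step inside a child pipeline, once per table, sidesteps the limitation that pipeline variables cannot be safely modified by parallel activities.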

Other options may be available; we will need to reach out to Databricks support to determine what, if any, those are.
