
Fix dbt incremental_strategy behavior by fixing schema table existing check #530

Merged

Conversation

@case-k-git (Contributor) commented Dec 9, 2023

Resolves #523

Description

I have made a modification so that case sensitivity is taken into account when checking whether a schema exists. Databricks manages schema names in lowercase, so if there isn't an exact match, it causes problems in subsequent processes. For example, when setting materialized='incremental' in the SQL config for an append operation, dbt checks whether the target table exists before appending the new data. If the schema is passed in uppercase, then since the schema on Databricks is in lowercase, the check for the table's existence fails, and instead of appending, dbt recreates the existing table even though the incremental option is used. This change ensures an exact match during the initial schema check.

{{ config(
    materialized='incremental',
    incremental_strategy='append',
) }}
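
In rough terms, the behavior described above comes down to the following decision (an illustrative Python sketch with a hypothetical helper, not dbt-databricks code):

# Illustrative sketch only (hypothetical helper, not dbt-databricks code):
# the incremental materialization only appends when the target table is found.
def choose_incremental_action(schema: str, table: str, existing: set) -> str:
    # Databricks reports relation names lower-cased, e.g. ("schema_demo", "daily_model_2"),
    # so an exact-string lookup with "Schema_Demo" misses and triggers a rebuild.
    if (schema, table) in existing:
        return "append"            # expected: write only the new rows
    return "create_or_replace"     # what actually happens with an uppercase schema

choose_incremental_action("Schema_Demo", "daily_model_2", {("schema_demo", "daily_model_2")})
# -> "create_or_replace", even though the table already exists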

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@case-k-git case-k-git changed the title update-schema-check-validation update schema existing check validation considering the distinction between uppercase and lowercase Dec 9, 2023
@benc-db (Collaborator) commented Dec 11, 2023

It's unclear how the change you're proposing accomplishes your stated goal. If anything, it seems that your change will try to compare a potentially mixed-case schema against a list that I think is natively lower-cased.

@case-k-git (Contributor, Author) commented Dec 12, 2023

@benc-db
Thank you for your reply. Let me explain both the normal (expected) case and the abnormal (unexpected) case.

🔹 Normal (expected) case: defining the schema in lowercase
We use the following profiles.yml and execute the incremental SQL. When the schema is defined in lowercase, there is no problem.

databricks_demo:
  target: dev
  outputs:
    dev:
      catalog: catalog_demo
      host: <host>
      http_path: <http_path>
      schema: schema_demo
      threads: 1
      token:  "<token>"
      type: databricks
  • Execution Log
    Checking the target table shows that it has been incremented.
dbt run
01:48:33  2 of 2 START sql incremental model schema_demo.daily_model_2 ................... [RUN]
01:48:39  2 of 2 OK created sql incremental model schema_demo.daily_model_2 .............. [OK in 6.12s]
  • Table History
    Checking the history shows that the data was appended (a WRITE operation is recorded).

🔹 Abnormal (unexpected) case: defining the schema in uppercase
We use the following profiles.yml and execute the incremental SQL. When the schema is defined in uppercase, it behaves unexpectedly.

databricks_demo:
  target: dev
  outputs:
    dev:
      catalog: catalog_demo
      host: <host>
      http_path: <http_path>
      schema: Schema_Demo
      threads: 1
      token:  "<token>"
      type: databricks
  • Execution Log
    Checking the run output, the model appears to have been built incrementally, but the table history tells a different story.
dbt run
01:48:33  2 of 2 START sql incremental model Schema_Demo.daily_model_2 ................... [RUN]
01:48:39  2 of 2 OK created sql incremental model Schema_Demo.daily_model_2 .............. [OK in 6.12s]
  • Table History
    However, checking the history shows that the table was recreated (a CREATE OR REPLACE operation is recorded), when it should have been appended to.

This happens because the incremental update first checks whether the target table exists. When the schema is passed in uppercase, the check concludes that the table does not exist, so a recreate is triggered. While it would be possible to fix this on the incremental side, I thought it would be more considerate to verify up front that the schema defined in profiles.yml matches exactly. That is what I would like to solve, and that is the change I made.

@benc-db (Collaborator) left a comment

I believe you that the problem exists, but the code you're proposing to remove should be exactly preventing what you describe, as it compares the fully lower-cased version of the schema in question to the lower-cased name of every schema in the database. Given that, as you said, Databricks returns the schema name lower-cased, how does removing schema.lower() achieve your ends? Have you tested that using your local version fixes your problem?
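
In rough form, the check I am referring to looks something like this (a simplified sketch, not the exact adapter code):

# Simplified sketch of the existing check (not the exact adapter code).
def schema_exists(requested: str, schemas_in_database: list) -> bool:
    # Both sides are lower-cased, so "Schema_Demo" still matches "schema_demo";
    # dropping .lower() from the requested side would make this lookup fail instead.
    return requested.lower() in (s.lower() for s in schemas_in_database)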

@case-k-git (Contributor, Author)

Thank you for your reply. I am attempting to run the integration tests, but I have not yet been able to set up the environment for that. I am following the approach described in this document:
https://github.com/databricks/dbt-databricks/blob/main/CONTRIBUTING.MD#functional--integration-tests

What I want to achieve is that if the profiles schema is passed in uppercase, it is treated as a non-existent schema, so the connection fails with a clear error. Currently, even when the schema is passed in uppercase, the connection does not fail, which causes issues in subsequent processes, as I shared above. By removing the lower(), I believe the uppercase schema from profiles will be compared against the lowercase schema retrieved from Databricks, and the connection will fail because the schema is treated as non-existent.

@benc-db (Collaborator) commented Dec 14, 2023

If the schema is passed in uppercase, then since the schema on Databricks is in lowercase, the check for the table's existence fails

So, I'm now understanding that you are trying to make it fail at the schema step so that you don't recreate the table. I didn't realize that was your goal initially. OK, so I think the fix that is more in line with our principles is to find out why the table check is failing and fix that instead. In general we try to make everything lowercase for comparison because, to my knowledge, Databricks is not supposed to be case sensitive. I think getting rid of the lowercasing here will be too disruptive. Can we instead identify why the table check is failing?

@case-k-git (Contributor, Author) commented Dec 16, 2023

I think getting rid of the lowercasing here will be too disruptive. Can we instead identify why the table check is failing?

Yes, and I apologize for the confusion. Of course we can investigate this and find the ideal way to solve the issue. Users could potentially lose history data, so I think it is better to solve it somehow. Thank you!

@case-k-git (Contributor, Author) commented Dec 19, 2023

I have created a new PR to investigate this schema issue. Could you also check that PR?
#538

Currently I am investigating by passing the schema explicitly. That PR makes it possible to pass a default schema option and will be useful for investigating this issue.

@case-k-git case-k-git changed the title update schema existing check validation considering the distinction between uppercase and lowercase Fix dbt incremental_strategy behavior by fixing schema table existing check Dec 29, 2023
@case-k-git (Contributor, Author) commented Dec 29, 2023

Hi @benc-db

I have fixed this issue; could you please check this change? I have tested it locally and confirmed the issue is resolved: when I pass the schema in uppercase, the rows are appended to the existing table instead of the table being recreated.

To pass an uppercase schema from the dbt integration tests, I needed to fix the existing integration environment. I have created that change as a separate PR, so please check it as well:
#541

  • Test Results
 .tox/integration-databricks-uc-sql-endpoint/bin/python -m pytest -v -m profile_databricks_uc_sql_endpoint -n4 tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
=========================================== test session starts ===========================================
platform darwin -- Python 3.11.5, pytest-7.4.3, pluggy-1.3.0 -- /Users/user/Library/CloudStorage/OneDrive/Desktop/workspace/dbt-databricks/.tox/integration-databricks-uc-sql-endpoint/bin/python
cachedir: .pytest_cache
rootdir: /Users/user/Library/CloudStorage/OneDrive/Desktop/workspace/dbt-databricks
configfile: pytest.ini
plugins: csv-3.0.0, flaky-3.7.0, dotenv-0.5.2, xdist-3.5.0
4 workers [1 item]      
scheduling tests via LoadScheduling

tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint 

[gw0] [100%] PASSED tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint 

============================================ warnings summary =============================================
tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
tests/integration/incremental_strategies/test_incremental_strategies.py::TestDeltaStrategiesWarehouse::test_delta_strategies_databricks_uc_sql_endpoint
  /Users/user/Library/CloudStorage/OneDrive/Desktop/workspace/dbt-databricks/dbt/adapters/databricks/connections.py:693: UserWarning: The cursor was closed by destructor.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ 1 passed, 7 warnings in 85.58s (0:01:25) =================================
  • Query History (※ image updated 2024/01/03)

Checked that the uppercase schema is passed from the integration test and that the list-tables check is performed with the schema converted to lowercase. The table existence check now passes, and with this change the rows are appended to the existing table instead of the table being recreated.
(screenshot: query_history_logs)

  • Table History (※ image updated 2024/01/03)

Checked that the rows are appended as expected: in the second entry of the table history, "WRITE" is recorded.
(screenshot: append_delta_write)

def list_tables(self, database: str, schema: str, identifier: Optional[str] = None) -> Table:
    database = database.strip("`")
    schema = schema.strip("`").lower()  # with .lower(): the test passed successfully

I also tested without the lowercase conversion in the list_tables method and confirmed that, instead of appending rows, the table is recreated and the integration test fails, as expected. In the second entry of the table history, "CREATE OR REPLACE" is recorded.
(screenshot: append_delta_create)

def list_tables(self, database: str, schema: str, identifier: Optional[str] = None) -> Table:
    database = database.strip("`")
    schema = schema.strip("`")  # without .lower(): the test failed because the table was recreated
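
In other words, the two variants boil down to this comparison (illustrative values only, not the real test schema):

# Illustrative values only: Databricks returns schema names lower-cased.
schemas_on_databricks = ["schema_demo"]
schema_from_profile = "Schema_Demo".strip("`")

print(schema_from_profile.lower() in schemas_on_databricks)  # True  -> table found, rows appended
print(schema_from_profile in schemas_on_databricks)          # False -> treated as missing, table recreated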

@case-k-git (Contributor, Author)

I think this is the same issue:
#523

@benc-db (Collaborator) commented Jan 2, 2024

Sorry for the delay, was taking holidays. Will take a look today.

@case-k-git (Contributor, Author) commented Jan 3, 2024

@benc-db Thank you! I modified the query and table history images to make the verification results easier to understand.

@case-k-git (Contributor, Author)

Hello @benc-db, thank you for reviewing the other PR I requested! I would appreciate it if you could check this PR as well when you have time. Thank you.

@benc-db (Collaborator) commented Jan 11, 2024

@case-k-git please resolve the conflict, and then I will run tests and review :)

@case-k-git (Contributor, Author)

@benc-db Thank you, I have fixed the conflict.

benc-db previously approved these changes Jan 12, 2024
@benc-db (Collaborator) commented Jan 12, 2024

Thanks for your help and patience. Will merge today, and this will be in the next release (which I expect will come out next week).

@benc-db benc-db merged commit 7421491 into databricks:main Jan 12, 2024
18 checks passed
@case-k-git (Contributor, Author)

@benc-db Thank you !!
