Refactor DatabricksHook #19835

eskarimov · 2021-11-26T08:25:48Z

related: PR #19736 and feature request #18999

The PR refactors DatabricksHook and related classes as a preparatory step before introducing a Deferrable Operator for Databricks.

Main points:

Use int type for run_id and job_id. The Databricks docs specify them as integers, and actual responses from the Databricks API return integers, not strings.
Fix filling AAD headers.
Move AAD token validation into a separate function, covered with tests.
Change initialisation of RunState to make it explicit that result_state might be None. Currently only a small comment in the code says that result_state might be None while a job is still running.

eskarimov · 2021-11-26T08:26:50Z

@alexott could you check this PR, please?

uranusjr · 2021-11-26T09:04:25Z

Could you revert the unnecessary double-quote-to-single-quote changes? They make the patch very difficult to review.

eskarimov · 2021-11-26T09:23:02Z

Could you revert the unnecessary double-quote-to-single-quote changes? They make the patch very difficult to review.

Reverted

alexott · 2021-11-26T19:01:20Z

airflow/providers/databricks/hooks/databricks.py

@@ -64,10 +66,12 @@
 class RunState:
    """Utility class for the run state concept of Databricks runs."""

-    def __init__(self, life_cycle_state: str, result_state: str, state_message: str) -> None:
+    def __init__(
+        self, life_cycle_state: str, state_message: str, result_state: str = None, *args, **kwargs


why do we need *args, **kwargs? Also, why change the order of the parameters? They are logical now: lifecycle -> result state -> state message.

It needs to be reviewed together with the changes for initialising the class instance out of API response:

Current version:

    state = response['state']
    life_cycle_state = state['life_cycle_state']
    # result_state may not be in the state if not terminal
    result_state = state.get('result_state', None)
    state_message = state['state_message']
    return RunState(life_cycle_state, result_state, state_message)

Proposed version:

    state = response['state']
    return RunState(**state)

The current version is basically an intermediate layer between the API response and the class: it extracts values from the API response and initialises the class instance. But the response should already represent a state, so why do we need this layer?
I see the following drawbacks with it:

The class signature doesn't tell that result_state might be missing if the state is not terminal. Currently that is described only by a comment deep in the code.

It tends to increase repeated code. Let's say we want to introduce an async class for DatabricksHook: this logic would need to be written twice. Also, if we want to change the class in the future, say add a new property user_cancelled_or_timedout (which is already part of the API response), then we need to change the class arguments, the response-parsing logic and the class instance initialisation everywhere it's used.
With the proposed version, we only need to change the class arguments.

With all the above, answering the questions:

why do we need *args, **kwargs?

It shows that RunState might receive other init arguments (since we don't have control over the API response); see the example above with user_cancelled_or_timedout in the response.

why change the order of the parameters? They are logical now: lifecycle -> result state -> state message.

Just because of Python syntax, we need to put arguments with default values after required arguments.

Just thinking - should we initialize result_state to an empty string? If we leave it as None, then we need to adjust get_run_state_str to use an empty string instead when result_state is None. WDYT?

Makes sense. When it's None, it might also be read as if the class argument wasn't set at all, but we do set it in the init, so an empty string sounds like a better option. Plus we'd avoid checking the type wherever it's used later. Will change it, thanks! 👍

Thinking out loud: does it make sense to assume that state_message is an empty string by default? Then we'd keep the order of the arguments the same.

yes, it's safe to assume that it's an empty string

Changed it back to life_cycle_state, result_state, state_message, with the latter two defaulting to an empty string
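For readers following the thread, here is a minimal sketch of where the discussion landed, assuming the state payload carries the keys mentioned above. The class body, the helper name `run_state_from_response` and the sample payloads are illustrative assumptions, not the merged code. Extra keys from the API are swallowed via `**kwargs`, as proposed earlier in the thread.

```python
from typing import Any, Dict


class RunState:
    """Illustrative stand-in for the RunState class discussed above."""

    def __init__(
        self,
        life_cycle_state: str,
        result_state: str = '',
        state_message: str = '',
        **kwargs: Any,  # unknown keys from the API response are accepted and ignored
    ) -> None:
        self.life_cycle_state = life_cycle_state
        self.result_state = result_state
        self.state_message = state_message


def run_state_from_response(response: Dict[str, Any]) -> RunState:
    """Hypothetical helper: build a RunState directly from the 'state' payload."""
    return RunState(**response['state'])


# A run that is still in progress: no result_state key in the payload.
running = run_state_from_response(
    {'state': {'life_cycle_state': 'RUNNING', 'state_message': 'In run'}}
)

# A finished run, with an extra key the class does not know about.
finished = run_state_from_response(
    {
        'state': {
            'life_cycle_state': 'TERMINATED',
            'result_state': 'SUCCESS',
            'state_message': '',
            'user_cancelled_or_timedout': False,
        }
    }
)
```

With both defaults set to an empty string, the parameter order stays lifecycle -> result state -> state message, and callers never need to check for None.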

alexott · 2021-11-26T19:04:21Z

airflow/providers/databricks/hooks/databricks.py

        """
-        # SP is outside of the workspace
-        if 'azure_resource_id' in self.databricks_conn.extra_dejson:


I would keep this check inside the function, because it could be called by accident (in the future). Maybe call it _fill_aad_headers_if_needed?

What do you think about calling it _get_aad_headers(), which would return either an empty dict or a filled dict? We also wouldn't need the input arg headers in this case.

Then we could construct headers like:

    aad_headers = self._get_aad_headers()
    headers = {**USER_AGENT_HEADER.copy(), **aad_headers}

yes, I thought of something like this. It's easier to use because the logic of adding headers is incorporated inside the function...
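A standalone sketch of the idea agreed on here, assuming the check on azure_resource_id lives inside the function. The function signature, the `get_management_token` callable, the header names and the placeholder user-agent value are assumptions made for illustration; the real hook reads these from its connection.

```python
from typing import Callable, Dict

# Placeholder value; the real constant is defined by the provider.
USER_AGENT_HEADER = {'user-agent': 'airflow-example'}


def get_aad_headers(
    extra: Dict[str, str],
    get_management_token: Callable[[], str],
) -> Dict[str, str]:
    """Return AAD headers when the SP is outside the workspace, else an empty dict."""
    headers: Dict[str, str] = {}
    # The check stays inside the function, so calling it "by accident" is harmless.
    if 'azure_resource_id' in extra:
        # Header names below are assumptions for this sketch.
        headers['X-Databricks-Azure-Workspace-Resource-Id'] = extra['azure_resource_id']
        headers['X-Databricks-Azure-SP-Management-Token'] = get_management_token()
    return headers


# Dict unpacking already builds a new dict, so USER_AGENT_HEADER is not mutated.
plain = {**USER_AGENT_HEADER, **get_aad_headers({}, lambda: 'token')}
with_aad = {
    **USER_AGENT_HEADER,
    **get_aad_headers({'azure_resource_id': '/subscriptions/example'}, lambda: 'token'),
}
```

Since the function returns a fresh dict and never modifies its input, the caller decides how to merge the result, which is what removes the need for a headers argument.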

alexott · 2021-11-26T19:05:47Z

airflow/providers/databricks/hooks/databricks.py

-    def run_now(self, json: dict) -> str:
+    def run_now(self, json: dict) -> int:


Can it break existing code? For example, if people are concatenating this result with a log string without using str?

It won't break existing code. Actually it's the opposite: if someone assumes the output is str because of the function signature, their code is already broken, because the actual returned type is int.
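The point can be illustrated with a small sketch, assuming (as stated in the PR description) that the API response carries an integer run_id. The stub function and the sample response are hypothetical; they only show that a return annotation does not change the runtime type.

```python
def run_now_stub() -> str:  # the annotation claims str, as the old signature did
    response = {'run_id': 42}  # hypothetical API response with an integer run_id
    return response['run_id']  # the value returned is still an int


run_id = run_now_stub()

try:
    # Plain concatenation fails for an int, whatever the annotation says.
    message = 'Started run ' + run_id
except TypeError:
    # Formatting works for both int and str, so it is the safer choice for logs.
    message = f'Started run {run_id}'
```

Changing the annotation to int therefore only documents the existing behaviour; code that relied on string concatenation was already failing at runtime.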

alexott · 2021-11-26T19:07:01Z

airflow/providers/databricks/hooks/databricks.py

@@ -522,6 +515,20 @@ def uninstall(self, json: dict) -> None:
        """
        self._do_api_call(UNINSTALL_LIBS_ENDPOINT, json)

+    @staticmethod
+    def _is_aad_token_valid(aad_token: dict) -> bool:


why do we need a separate function that is called from one place?

Mainly for readability: it hides the details of checking that the token is valid in a separate function, because that's not the main purpose of the parent function _get_aad_token.

There's also a mistake in the current implementation: it subtracts TOKEN_REFRESH_LEAD_TIME from the current time, while it should actually add it. With this change we fix it and cover the function with tests.
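A minimal sketch of the corrected check next to the faulty one, assuming the token dict stores its expiry as a Unix timestamp under an `expires_on` key. The constant's value, the `now` parameter and both function names are assumptions added for illustration and testability.

```python
import time
from typing import Optional

TOKEN_REFRESH_LEAD_TIME = 120  # seconds; illustrative value


def is_aad_token_valid(aad_token: dict, now: Optional[float] = None) -> bool:
    """Corrected check: the token must outlive now PLUS the lead time."""
    now = time.time() if now is None else now
    return int(aad_token['expires_on']) > now + TOKEN_REFRESH_LEAD_TIME


def is_valid_with_subtracted_lead(aad_token: dict, now: Optional[float] = None) -> bool:
    """Faulty variant: subtracting the lead time accepts tokens that are about to expire."""
    now = time.time() if now is None else now
    return int(aad_token['expires_on']) > now - TOKEN_REFRESH_LEAD_TIME


# A token that expires 60 seconds from "now" should be refreshed, not reused.
expiring_soon = {'expires_on': 1060}
fresh = {'expires_on': 1200}
```

With `now=1000`, the corrected function rejects `expiring_soon` and accepts `fresh`, while the subtracting variant accepts both. It would even accept a token that expired up to 120 seconds ago.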

alexott · 2021-11-28T11:18:52Z

Otherwise, it looks good. I need to test changes on the real Databricks instances

eskarimov · 2021-11-28T12:00:58Z

Otherwise, it looks good. I need to test changes on the real Databricks instances

Perfect, thanks a lot! I've just pushed the changes that make result_state default to an empty string and refactor the function for AAD header retrieval.

uranusjr · 2021-11-29T17:26:43Z

airflow/providers/databricks/hooks/databricks.py

-    def __init__(self, life_cycle_state: str, result_state: str, state_message: str) -> None:
+    def __init__(
+        self, life_cycle_state: str, state_message: str, result_state: str = '', *args, **kwargs
+    ) -> None:


Why flip the arguments? This feels like a source of unnecessary trouble. Also, why take *args, **kwargs and ignore them? This is usually an anti-pattern.

Please refer to the comment above; I replied there earlier about the same question.

eskarimov · 2021-12-05T07:29:41Z

@alexott may I kindly request your review please? (the button for re-requesting somehow doesn't work, nothing happens when I click on it)

alexott

lgtm

boring-cyborg bot added the area:providers label Nov 26, 2021

eskarimov force-pushed the 18999-refactor-databrickshook branch from ff936d2 to bd6c7d5 Compare November 26, 2021 09:22

alexott reviewed Nov 26, 2021

uranusjr reviewed Nov 29, 2021

eskarimov force-pushed the 18999-refactor-databrickshook branch from e88b278 to ee49b03 Compare December 2, 2021 09:28

eskarimov added 4 commits December 5, 2021 08:02

Refactor DatabricksHook

f770b76

Refactor _fill_add_headers

98d91c0

Extract check for Azure Metadata Service into a separate function and…

6234779

… cover with tests

Place _is_aad_token_valid() together with other internal functions

9d08a12

eskarimov force-pushed the 18999-refactor-databrickshook branch from ee49b03 to 9d08a12 Compare December 5, 2021 07:10

eskarimov requested a review from uranusjr December 5, 2021 07:11

alexott approved these changes Dec 5, 2021

potiuk approved these changes Dec 5, 2021

potiuk merged commit 728e94a into apache:main Dec 5, 2021

eskarimov deleted the 18999-refactor-databrickshook branch December 6, 2021 16:24

kaxil mentioned this pull request Dec 9, 2021

Status of testing Providers that were prepared on December 07, 2021 #20097

Closed

9 tasks

josh-fell mentioned this pull request Dec 9, 2021

Remove db call from DatabricksHook.__init__() #20180

Merged

josh-fell mentioned this pull request Dec 28, 2021

Remove host as an instance attr in DatabricksHook #20540

Merged
