Skip to content

Comments

Databricks ClusteState & Clsuter GET API#34071

Closed
Seokyun-Ha wants to merge 26 commits intoapache:mainfrom
Seokyun-Ha:databricks-hook-cluster-manage
Closed

Databricks ClusteState & Clsuter GET API#34071
Seokyun-Ha wants to merge 26 commits intoapache:mainfrom
Seokyun-Ha:databricks-hook-cluster-manage

Conversation

@Seokyun-Ha
Copy link
Contributor

@Seokyun-Ha Seokyun-Ha commented Sep 4, 2023

closes: #19490

  • Implement Databricks ClusterState class
  • Databricks cluster relating functions
    • get_cluster()
    • async_get_cluster()
    • activate_cluster()
  • renaming a_get_... methods to async_get_...
  • Handle unexpected states of RunState, ClusterState on__init__() level.

Organization: @bagelcode-data
Co-workers: @Seokyun-Ha, @bskim45, @kyeonghoon-kim


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg
Copy link

boring-cyborg bot commented Sep 4, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@Seokyun-Ha Seokyun-Ha force-pushed the databricks-hook-cluster-manage branch from a5c4beb to 77376fb Compare September 4, 2023 09:29
Copy link
Member

@Lee-W Lee-W left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need some test cases as well

@hussein-awala hussein-awala self-requested a review September 4, 2023 11:20
@Seokyun-Ha Seokyun-Ha requested a review from Lee-W September 5, 2023 05:12
@Seokyun-Ha Seokyun-Ha changed the title Databricks ClusteState & Clsuter get, activate Databricks ClusteState & Clsuter GET API Sep 6, 2023
@Seokyun-Ha
Copy link
Contributor Author

Seokyun-Ha commented Sep 7, 2023

We are making mock testing code. After the task, we will let you know 😄 w/ @kyeonghoon-kim

@Seokyun-Ha
Copy link
Contributor Author

Hello, @Lee-W @hussein-awala, We've applied all your feedbacks! 😄 Also, we implemented mock testing code as well. Please, take a look 🙏 Thanks!

@potiuk potiuk force-pushed the databricks-hook-cluster-manage branch from 1e921cc to 1b30c4e Compare September 7, 2023 18:07
@Seokyun-Ha
Copy link
Contributor Author

We checked some test are failed :(
We will take a look and fix them!

kyeonghoon-kim and others added 3 commits September 8, 2023 12:05
* fix lint

* fix lint
* fix test

* fix test

* fix line too long
@Seokyun-Ha
Copy link
Contributor Author

Seokyun-Ha commented Sep 8, 2023

We checked our code works well using mock test and validation 😄 Please, take a look at it 🙏

@potiuk
Copy link
Member

potiuk commented Sep 11, 2023

cc: @alexott ?

# Conflicts:
#	airflow/providers/databricks/hooks/databricks.py
@Seokyun-Ha
Copy link
Contributor Author

Hello, @alexott . We wrote some code for get ClusterState and others. Please take a look at :)

api_called = False
time_start = time.time()

while True:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add an option for users to trigger the API simply but not wait for it?

Copy link
Contributor

@kyeonghoon-kim kyeonghoon-kim Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just triggering the API is already implemented in start_cluster method on line 562

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, should we consolidate them into one method? For me, I can not understand the logic difference by reading the names start_cluster and activate_cluster. But this's not a huge issue though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When start a Databricks Cluster, it tunrs on as PENDING state, so we cannot use the cluster until it's on ready state. That's the what start_cluster() method does.

Meanwhile, active_cluster() method guarantees the cluster is on Running State, so after the completed, we can utilize the cluster immediately.

Sometimes users want just starting the cluster, or want making the cluster into the usable status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I know it's a desired behavior. Just wondering if it would be possible for us to merge the method and toggle the waiting behavior through a parameter

Copy link
Contributor Author

@Seokyun-Ha Seokyun-Ha Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe. I now understand your suggestion. I think callingactive_cluster() method without polling and timeout is same as calling start_cluster(). So, if I change the code it looks like this

def start_cluster(self, json: dict, polling: int | None = None, timeout: int | None = None) -> None:
        cluster_id = json["cluster_id"]

        api_called = False
        time_start = time.time()

        while True:
            run_state = self.get_cluster_state(cluster_id)

            if run_state.is_running:
                return
            elif run_state.is_terminal:
                if api_called:
                    raise AirflowException(
                        f"Cluster {cluster_id} start failed with '{run_state.state}' "
                        f"state: {run_state.state_message}"
                    )

                # This part changed
                # self.start_cluster(json)
                self._do_api_call(START_CLUSTER_ENDPOINT, json)
                api_called = True

           # This part changed
            if polling:
              # wait for cluster to start
              time.sleep(polling)
           else:
             return

            elapsed_time = time.time() - time_start
            if timeout and elapsed_time > timeout:
                raise AirflowException(f"Cluster {cluster_id} start timed out after {timeout} seconds")

Please, take a look and get some opinions 🙏

+ I think then we can apply this to restart_cluster() method too!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have one method only, with default to return immediately, but allow to wait until start/rester when specifying options.

Comment on lines -315 to +366
async def a_get_run_page_url(self, run_id: int) -> str:
async def async_get_run_page_url(self, run_id: int) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this & other similar changes could be considered as breaking. There is quite heavy direct use of the hook. I suggest that we rename, but leave functions with original names calling the new names, but adding deprecation warnings to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the best approach is by renaming it as you did, then create a new method deprecated method with the old name which call the new name method:

async def a_get_run_page_url(self, run_id: int) -> str:
    warnings.warn(
        "This method is deprecated. Please use `<path to the new method>` instead.",
        AirflowProviderDeprecationWarning,
        stacklevel=2,
    )
    return await async_get_run_page_url(run_id= run_id)

Then we can remove it in the next major version

"""
self._do_api_call(START_CLUSTER_ENDPOINT, json)

def activate_cluster(self, json: dict, polling: int, timeout: int | None = None) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call it start_cluster_and_wait ?

api_called = False
time_start = time.time()

while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have one method only, with default to return immediately, but allow to wait until start/rester when specifying options.

@Seokyun-Ha
Copy link
Contributor Author

Seokyun-Ha commented Sep 27, 2023

Hello, guys, I and @kyeonghoon-kim discussed about this active_cluster and this PR. We noticed this PR has too many changes on DatabricksHook and other. So, we decided to separate this PR into smaller ones.

Kind of changes

  • RunState, ClusterState and get_cluster_state()
    • We will open PR for this as soon as possible
  • active_cluster() method
    • This method can be replaced into Airflow Sensor. We will implement a sensor for Databricks Cluster State in future PR. It will monitor cluster state until the desired state met.
  • async_get_xxx()
    • This change are dropped, because they make breaking changes.

Thanks for your kind and sincere reviews 🙏 : We will close this PR after we make a new separated PR 😄
cc. @Lee-W @hussein-awala @alexott

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 16, 2023
@github-actions github-actions bot closed this Dec 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:databricks stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

provider Databricks : add cluster API

6 participants