Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provider Databricks add jobs create operator. #29790

Closed
wants to merge 2 commits into from
Closed

Provider Databricks add jobs create operator. #29790

wants to merge 2 commits into from

Conversation

kyle-winkelman
Copy link
Contributor

@kyle-winkelman kyle-winkelman commented Feb 27, 2023

Add the DatabricksJobsCreateOperator for use cases where the DatabricksSubmitRunOperator is insufficient.
closes: #29733

@boring-cyborg
Copy link

boring-cyborg bot commented Feb 27, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@potiuk
Copy link
Member

potiuk commented Mar 4, 2023

cc: @alexott ?

@alexott
Copy link
Contributor

alexott commented Mar 4, 2023

@kyle-winkelman what are the use cases for that operator?

@kyle-winkelman
Copy link
Contributor Author

My specific use case is that I want to run multiple tasks in the same job and on a single job_clusters. This is not supported by the DatabricksSubmitRunOperator because it doesn't support the parameter job_clusters, so the only way to support it is to create a new_cluster per task I want to run.

The other approach is to use the DatabricksRunNowOperator which has the limitation that you have to define your Databricks Job somewhere else and in some different manner (i.e. manually in Databricks UI, custom CI/CD pipeline, etc.). My team doesn't like having the definition of the job be separated from the use of it from Airflow. In my opinion the DatabricksRunNowOperator with just a single job_id lacks information about what is actually happening.

So to sum up, it is useful to be paired with the DatabricksRunNowOperator to define a job and run it in the same DAG.

Copy link
Contributor

@alexott alexott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll discuss this further with the Databricks Workflows team

Comment on lines +218 to +226
tasks: list[object] | None = None,
job_clusters: list[object] | None = None,
email_notifications: object | None = None,
webhook_notifications: object | None = None,
timeout_seconds: int | None = None,
schedule: dict[str, str] | None = None,
max_concurrent_runs: int | None = None,
git_source: dict[str, str] | None = None,
access_control_list: list[dict[str, str]] | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For new operators I would really like to use real data structures (for example, data classes), not the simple objects as it doesn't provide users with auto-completion, etc. - it's easy to make mistake in the untyped JSON definition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of a different operator that follows this pattern? It would be helpful to see an example.

Any thoughts on a tool to generate such data structures from the Databricks OpenAPI Spec? There are a lot of data structures that would need to be created and I don't want to do so manually.

My initial thought was to rely on Databricks Python API, but it doesn't have these data structures or validations either.

Copy link
Contributor

@alexott alexott Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t have existing example, it was just thoughts for the new operators. New Python API will be available soon, that will provide access to the latest APIs. I need to ask dev team when it’s coming

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this new operator should wait for the new Python API that you expect to be available soon?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m waiting for answer from product management - maybe we won’t need this operator…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you heard anything from the Databricks product management @alexott?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no decision yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexott lets please get to a decision in the upcoming days (before next wave of providers)... if not i will accept the PR as is.
Should in the future Databricks decide against these operators and present arguments why we shouldn't have them we can always remove with a major release.

include_prior_dates=True,
)
if self.job_id:
self._hook.reset(self.job_id, self.json)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of I would recommend to get the current job definition and compare with the new definition, and reset only if this definition changes (that should happen relatively rare).

Also, you need to handle following edge cases:

  • Job is deleted via UI - the reset will fail because job ID doesn't exist
  • We can lose XCom information, so we'll create a duplicate for the job.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexott convinced me that it needs changes.

@potiuk
Copy link
Member

potiuk commented Apr 23, 2023

@alexott convinced me that it needs changes.

@alexott - are there any news from Databricks on that one? We are considering merging it regardless, but it seems you wanted to have some input from the team first?

@alexott
Copy link
Contributor

alexott commented Apr 23, 2023

@potiuk Unfortunately no news. The decision takes longer than planned. If we can deprecate it in the later versions, let merge it.

@potiuk
Copy link
Member

potiuk commented Apr 23, 2023

@potiuk Unfortunately no news. The decision takes longer than planned. If we can deprecate it in the later versions, let merge it.

OK. Cool. I am a big fan of having tactical solutions that solve part of the probem or a problem for smaller group of people and releasing them early, as long as they are not preventing strategic long term solutions that needs a longer debate and far more work to be implemented later. This one looks like one of those.

I re-reviewed it and it looks cool. @kyle-winkelman -> can you please rebase and fix the static check problem, then I am happy to merge it (@eladkal - WDYT?).

@lennartkats-db
Copy link

@potiuk @kyle-winkelman This seems like a useful addition from the Databricks perspective. Thanks for contributing, Kyle! 🙏

I do think it would be good to look at the XCOM issue that Alex called out. Wouldn't it be better to just use the API to find an existing job for the cases that Alex called out where the current approach doesn't work? To avoid creating duplicate jobs or having the operator fail? The DatabricksRunNowOperator already applies logic like this, as it allows users to run a given job based on the name of that job (rather than the job id).

@stikkireddy
Copy link
Contributor

hey @kyle-winkelman do you have bandwidth to move forward with this PR?

@stikkireddy
Copy link
Contributor

@potiuk is it okay if someone else forks Kyle's repo and fix the static check to not lose @kyle-winkelman's credit/commits and also address the previous comments made by @alexott? I am also okay to wait a bit for Kyle to respond.

@potiuk
Copy link
Member

potiuk commented Jun 23, 2023

@potiuk is it okay if someone else forks Kyle's repo and fix the static check to not lose @kyle-winkelman's credit/commits and also address the previous comments made by @alexott? I am also okay to wait a bit for Kyle to respond.

Fine for me.

@github-actions
Copy link

github-actions bot commented Aug 8, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 8, 2023
@github-actions github-actions bot closed this Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:databricks stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Databricks create/reset then run-now
6 participants