[AIRFLOW-5716][part of AIRFLOW-5697] Simplify DataflowJobsController logic #6386
potiuk merged 1 commit into apache:master
Conversation
airflow/gcp/hooks/dataflow.py
The clean-up in this PR LGTM. My only thought for further clean-up: IMO this function is a misnomer. It is called `_start_dataflow`, but it actually does two things, start and wait_for_done. Detangling this so the hook provides one function for starting and one for waiting, leaving the details to the operator's `execute`, would make it simpler once we sort out my reschedule poking operator PR.
Another place it could be useful: we could allow the hook to start a Dataflow streaming job without waiting on it, until some other system cancels it. I think this could be cool for streaming jobs we only need running at certain times of day. Of course, we'd have to add a function to the hook to stop/drain a Dataflow streaming job. This could be interesting if you are using a Dataflow job to do streaming analytics on IoT data, but only during an 8-hour working day: your DAG could be `@daily`, start the Dataflow job, and then have a stop-Dataflow-job task that reschedules itself for 8 hours after the start task succeeds. This "ephemeral streaming job" is a rather contrived use case, but it demonstrates the additional value of separating the start and wait_for_done operations in hooks like this one.
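A minimal sketch of the split suggested above, assuming a hook whose start and wait steps are decoupled. The class, method names, and the fake client below are all illustrative inventions for this comment, not the actual Airflow hook API.

```python
import time


class FakeDataflowClient:
    """Stand-in for the Dataflow API client; reports a terminal state after a few polls."""

    def __init__(self):
        self._polls = 0

    def submit(self, body):
        return {"id": "job-123", "currentState": "JOB_STATE_RUNNING"}

    def get_state(self, job_id):
        self._polls += 1
        return "JOB_STATE_DONE" if self._polls >= 3 else "JOB_STATE_RUNNING"


class DataflowHookSketch:
    """Hypothetical hook where starting and waiting are separate methods,
    so an operator (or a rescheduling sensor) can call either independently."""

    def __init__(self, client, poll_interval=0.01):
        self.client = client
        self.poll_interval = poll_interval

    def start_job(self, body):
        # Submit the job and return its id immediately, without blocking.
        return self.client.submit(body)["id"]

    def wait_for_done(self, job_id):
        # Poll until the job reaches a terminal state.
        while True:
            state = self.client.get_state(job_id)
            if state in ("JOB_STATE_DONE", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
                return state
            time.sleep(self.poll_interval)


hook = DataflowHookSketch(FakeDataflowClient())
job_id = hook.start_job({"name": "example"})
print(job_id)                      # job-123
print(hook.wait_for_done(job_id))  # JOB_STATE_DONE
```

With this shape, an operator's `execute` could call `start_job` and return, while a separate task (or sensor) calls `wait_for_done` later, which is what the ephemeral-streaming-job idea needs.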
Circulating this bounded streaming pipeline idea internally it doesn't seem like there's been real use cases for it in the field.
We can't split this one method into two, because a local process is run that supervises the task. Unfortunately, this is a limitation of Apache Beam, which does not offer the option of handing supervision to an external system. In any case, we must wait until the Apache Beam system process completes to be sure the job is finished.
This operator can also be used to initiate streaming jobs, but we lack an operator to stop the task if we want to fully manage its lifecycle.
https://github.com/apache/airflow/blob/master/tests/gcp/hooks/test_dataflow.py#L490-L520
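To illustrate the constraint described above: when the runner launches the pipeline as a local child process, the hook has to block on that process to know the submission finished. A tiny self-contained sketch (the `python -c` child stands in for the actual Beam invocation):

```python
import subprocess
import sys

# The hook spawns the runner as a local child process. Beam provides no way to
# hand supervision to an external system, so the only reliable signal that the
# submission completed is the child's exit.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('pipeline submitted')"],
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate()  # block until the child process exits
print(out.strip())       # pipeline submitted
print(proc.returncode)   # 0
```

This is why start and wait cannot simply be split for these "normal" jobs: the waiting is tied to the lifetime of the supervising process, not just to the remote job state.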
My understanding is that this client-side controller process that supervises the job only applies to "normal" jobs that would be submitted with DataflowPythonOperator or DataflowJavaOperator. Templates, however, can be instantiated and then polled for completion separately (see running templates).
You added a comment about the function that is responsible for running tasks on the local machine, so I was slightly confused.
If you would like to create an asynchronous operator, you would have to change the `_start_template_dataflow` method so that it does not start the waiting process. In the next step, you would use the `is_job_dataflow_running` method to poll the job status. Currently, most hook methods for the GCP integrations are synchronous, because that was part of the practice my team used. I don't think this is explicitly written in the integration guide, and we should update it for this issue.
https://docs.google.com/document/d/1_rTdJSLCt0eyrAylmmgYc3yZr-_h51fVlnvMmWqhCkY/edit?ts=5bb72dfd#
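The asynchronous pattern described above could be sketched like this. Everything here is illustrative: the fake API, the function names, and the poke loop are assumptions standing in for a modified `_start_template_dataflow` and an `is_job_dataflow_running`-style check, not the real hook code.

```python
class FakeJobsApi:
    """Stand-in for the Dataflow jobs API; the job finishes after two status checks."""

    def __init__(self):
        self._checks = 0

    def launch_template(self, name):
        return {"id": "tmpl-job-1"}

    def state(self, job_id):
        self._checks += 1
        return "JOB_STATE_RUNNING" if self._checks < 3 else "JOB_STATE_DONE"


def start_template_dataflow_async(api, name):
    # Hypothetical variant of _start_template_dataflow that skips the waiting loop.
    return api.launch_template(name)["id"]


def is_job_dataflow_running(api, job_id):
    # Poke-style check, the kind a sensor's poke() method would call.
    return api.state(job_id) == "JOB_STATE_RUNNING"


api = FakeJobsApi()
job_id = start_template_dataflow_async(api, "my-template")
pokes = 0
while is_job_dataflow_running(api, job_id):
    pokes += 1  # a real sensor would reschedule itself here instead of looping
print(job_id, pokes)  # tmpl-job-1 2
```

Because templates are submitted server-side, no local supervising process is needed, so this start-then-poke split works for them even though it does not for locally-run pipelines.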
Codecov Report
```diff
@@            Coverage Diff            @@
##           master    #6386     +/-  ##
=========================================
- Coverage    83.8%   83.74%   -0.07%
=========================================
  Files         635      635
  Lines       36750    36743       -7
=========================================
- Hits        30800    30769      -31
- Misses       5950     5974      +24
```
Continue to review full report at Codecov.
potiuk left a comment
I like it. Really nice simplification!
This PR is one of a series that aims to improve this integration:
https://issues.apache.org/jira/browse/AIRFLOW-5697
Make sure you have checked all steps below.
Jira
Description
Tests
Commits
Documentation