
[AIRFLOW-5716][part of AIRFLOW-5697] Simplify DataflowJobsController logic#6386

Merged
potiuk merged 1 commit into apache:master from PolideaInternal:AIRFLOW-5716
Nov 10, 2019

Conversation

@mik-laj
Member

@mik-laj mik-laj commented Oct 22, 2019

This PR is one of a series that aims to improve this integration
https://issues.apache.org/jira/browse/AIRFLOW-5697


  • Avoid sending requests in the constructor.
  • Extract two methods, which makes the code easier to read.
  • Choose one way to update the jobs class field.
  • Remove redundant loops.
  • Refresh job status explicitly.
  • Change method names to more descriptive ones.
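The first, third, and fifth points above can be sketched as one pattern: a minimal, illustrative controller (the names below are assumptions for illustration, not the actual DataflowJobsController code) whose constructor sends no requests, whose jobs field is updated in exactly one place, and whose status is refreshed explicitly.

```python
from typing import Any, Dict, List, Optional


class JobsController:
    """Illustrative sketch of the pattern; names are hypothetical."""

    def __init__(self, client: Any, project_id: str, name: str) -> None:
        # The constructor only stores its arguments; it sends no requests.
        self._client = client
        self._project_id = project_id
        self._name = name
        self._jobs: Optional[List[Dict[str, Any]]] = None

    def _refresh_jobs(self) -> None:
        # The single, explicit place where job state is (re)loaded --
        # the only code path that updates the ``_jobs`` field.
        self._jobs = self._client.list_jobs(self._project_id, self._name)

    def get_jobs(self) -> List[Dict[str, Any]]:
        # Fetch lazily on first use instead of in __init__.
        if self._jobs is None:
            self._refresh_jobs()
        return self._jobs
```

A caller that wants fresh state calls `_refresh_jobs()` explicitly instead of relying on hidden side effects.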

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards-incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

@mik-laj mik-laj added the provider:google Google (including GCP) related issues label Oct 22, 2019
@mik-laj mik-laj changed the title [AIRFLOW-5716][part of AIRFLOW-5697][[depends on AIRFLOW-5711] Simplify DataflowJobsController logic [AIRFLOW-5716][part of AIRFLOW-5697][depends on AIRFLOW-5711] Simplify DataflowJobsController logic Oct 22, 2019
Contributor

The clean-up in this PR LGTM. My only thought for further clean-up: IMO this function is a misnomer. It is called "_start_dataflow", but it actually does two things, start and wait_for_done. Detangling this so the hook provides one function for starting and one for waiting, leaving the details to the operator's execute, would make things simpler once we sort out my reschedule poking operator PR. Another place it could be useful: we could allow the hook to start a Dataflow streaming job without waiting on it, until some other system cancels it. I think this could be cool for streaming jobs we only need running at certain times of day. Of course, we would have to add a function to the hook to stop / drain a Dataflow streaming job. This could be interesting if you are using a Dataflow job to do streaming analytics on IoT data, but only during an 8-hour working day: your DAG could be @daily, start the Dataflow job, and then have a stop-Dataflow-job task that reschedules itself for 8 hours after the start task succeeds. This "ephemeral streaming job" is a rather contrived use case, but it demonstrates the additional value of separating the start and wait_for_done operations in hooks like this one.
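The split described above could look roughly like this. All names below are hypothetical, a sketch of the proposed shape rather than the hook's real API; the `client` is any object exposing simple `create_job` / `get_job` / `update_job` calls.

```python
import time
from typing import Any, Dict

# Terminal states; the real Dataflow API defines more JOB_STATE_* values.
TERMINAL_STATES = {"JOB_STATE_DONE", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def start_job(client: Any, project_id: str, body: Dict) -> str:
    """Hypothetical 'start only' hook method: submit and return the job id."""
    job = client.create_job(project_id, body)
    return job["id"]


def wait_for_done(client: Any, project_id: str, job_id: str,
                  poll_seconds: float = 10.0) -> str:
    """Hypothetical 'wait only' hook method: poll until a terminal state."""
    while True:
        state = client.get_job(project_id, job_id)["currentState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


def drain_job(client: Any, project_id: str, job_id: str) -> None:
    """Hypothetical stop/drain method for the 'ephemeral streaming' DAG."""
    client.update_job(project_id, job_id,
                      {"requestedState": "JOB_STATE_DRAINING"})
```

With this shape, a synchronous operator calls `start_job` then `wait_for_done`, while the ephemeral-streaming DAG calls `start_job` in one task and `drain_job` in a later one.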

Contributor


Circulating this bounded streaming pipeline idea internally, it doesn't seem like there have been real use cases for it in the field.

Member Author

@mik-laj mik-laj Nov 8, 2019


We can't split this one method into two, because a local process is run that supervises the task. Unfortunately, this is a limitation of Apache Beam, which does not offer a way to hand supervision off to an external system. In any case, we must wait until the Apache Beam process completes to be sure the job has finished.

This operator can also be used to initiate streaming jobs, but we lack an operator to stop the job if we want to handle the full lifecycle.
https://github.com/apache/airflow/blob/master/tests/gcp/hooks/test_dataflow.py#L490-L520
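The constraint above can be sketched as follows. This is a minimal illustration of supervising a local Beam driver process, assuming nothing about the real hook's internals; the function name and error message are made up.

```python
import subprocess
from typing import List


def run_and_supervise(cmd: List[str]) -> None:
    """Run a pipeline as a local process and wait for it to finish.

    Because the local process itself drives the job, 'start' and
    'wait' cannot be separated: abandoning the process would abandon
    supervision of the job.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    assert proc.stdout is not None
    for _line in proc.stdout:
        # The real hook parses job metadata out of log lines like these
        # while the process runs; here we just drain the stream.
        pass
    proc.wait()
    if proc.returncode != 0:
        raise RuntimeError(
            "job process failed with exit code %s" % proc.returncode)
```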

Contributor


My understanding is that this client-side controller process that supervises the job is only the case for "normal" jobs submitted with DataflowPythonOperator or DataflowJavaOperator. Templates, however, can be instantiated and polled for completion separately (see running templates).

Member Author


You added a comment about the function that is responsible for running tasks on the local machine, so I was slightly confused.

If you would like to create an asynchronous operator, you would have to change the _start_template_dataflow method so that it does not start the waiting process. As a next step, you would use the is_job_dataflow_running method to poke the job status. Currently, most hook methods for GCP integrations are synchronous, because that was part of the practice my team used. I think this is not written explicitly in the integration guide, and we should update it to cover this.
https://docs.google.com/document/d/1_rTdJSLCt0eyrAylmmgYc3yZr-_h51fVlnvMmWqhCkY/edit?ts=5bb72dfd#
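A poke-style check like the one mentioned could look roughly like this. The helper below is illustrative, only loosely modeled on is_job_dataflow_running rather than its real signature, and the state set is an assumption.

```python
from typing import Any, Dict, List

# States treated as "still active"; the real hook may use a different set.
ACTIVE_STATES = {"JOB_STATE_RUNNING", "JOB_STATE_PENDING", "JOB_STATE_QUEUED"}


def is_job_running(client: Any, project_id: str, name: str) -> bool:
    """Poke-style check: True while any job with this name is active."""
    jobs: List[Dict[str, Any]] = client.list_jobs(project_id, name)
    return any(job.get("currentState") in ACTIVE_STATES for job in jobs)
```

A reschedule-mode sensor would call this repeatedly and finish once it returns False.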

@codecov-io

codecov-io commented Oct 26, 2019

Codecov Report

Merging #6386 into master will decrease coverage by 0.06%.
The diff coverage is 85.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6386      +/-   ##
==========================================
- Coverage    83.8%   83.74%   -0.07%     
==========================================
  Files         635      635              
  Lines       36750    36743       -7     
==========================================
- Hits        30800    30769      -31     
- Misses       5950     5974      +24
Impacted Files Coverage Δ
airflow/gcp/hooks/dataflow.py 91.59% <85.29%> (-0.65%) ⬇️
airflow/kubernetes/volume_mount.py 44.44% <0%> (-55.56%) ⬇️
airflow/kubernetes/volume.py 52.94% <0%> (-47.06%) ⬇️
airflow/kubernetes/pod_launcher.py 45.25% <0%> (-46.72%) ⬇️
airflow/kubernetes/kube_client.py 33.33% <0%> (-41.67%) ⬇️
...rflow/contrib/operators/kubernetes_pod_operator.py 70.14% <0%> (-28.36%) ⬇️
airflow/models/taskinstance.py 93.28% <0%> (-0.51%) ⬇️
airflow/utils/dag_processing.py 58.48% <0%> (+0.32%) ⬆️
airflow/hooks/dbapi_hook.py 91.52% <0%> (+0.84%) ⬆️
airflow/models/connection.py 65% <0%> (+1.11%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33ddcd9...679ae91. Read the comment docs.

@mik-laj mik-laj changed the title [AIRFLOW-5716][part of AIRFLOW-5697][depends on AIRFLOW-5711] Simplify DataflowJobsController logic [AIRFLOW-5716][part of AIRFLOW-5697] Simplify DataflowJobsController logic Nov 8, 2019
Member

@potiuk potiuk left a comment


I like it. Really nice simplification!

@potiuk potiuk merged commit 56dd819 into apache:master Nov 10, 2019

4 participants