
Add DataflowStartFlexTemplateOperator #8550

Merged: 4 commits, Oct 16, 2020

Conversation

mik-laj
Member

@mik-laj mik-laj commented Apr 25, 2020


Make sure to mark the boxes below before creating the PR:

  • [x] Description above provides context of the change
  • [x] Unit test coverage for changes (not needed for documentation changes)
  • [x] Target GitHub ISSUE in description if exists
  • [x] Commits follow "How to write a good git commit message"
  • [x] Relevant documentation is updated, including usage instructions.
  • [x] I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added the provider:google Google (including GCP) related issues label Apr 25, 2020
@mik-laj mik-laj marked this pull request as ready for review April 26, 2020 04:05
@mik-laj
Member Author

mik-laj commented Apr 26, 2020

@jaketf Can I ask for a review? I know that you are also interested in the Dataflow integration. Will this fully solve your clients' problems? If they use it, they will no longer have to use KubernetesPodOperator.

Member

@aaltay aaltay left a comment

/cc: @azurezyq

@@ -93,15 +93,30 @@ def inner_wrapper(self: "DataflowHook", *args, **kwargs) -> RT:
class DataflowJobStatus:
"""
Helper class with Dataflow job statuses.
Reference: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job.JobState
Member

This file is also modified in #8553. I am assuming it is the same changes, skipping this file for the review.

Contributor

I replied here: #8553 (comment), and changed it here: 92858eb

def on_kill(self) -> None:
self.log.info("On kill.")
if self.job_id:
self.hook.cancel_job(job_id=self.job_id, project_id=self.project_id)
Member

Do you need to call this if the job is no longer running?

Member Author

Good point. I will change it.

Contributor

I changed it here 1eac772

Contributor

Improved it here 4b78b27
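The resolved behavior can be sketched roughly as follows. This is a simplified, hypothetical version of the guarded on_kill, not the exact committed code; the attribute names follow the snippet quoted above:

```python
# Simplified sketch of an on_kill that cancels only when a job was
# actually started. Hypothetical illustration, not the committed code.
class DataflowStartFlexTemplateOperator:
    def __init__(self):
        self.job_id = None      # set only after the job is submitted
        self.project_id = None  # GCP project of the submitted job
        self.hook = None        # DataflowHook instance, set in execute()

    def on_kill(self) -> None:
        # Guard: if no job was started (or the hook was never created),
        # there is nothing to cancel.
        if self.job_id and self.hook:
            self.hook.cancel_job(job_id=self.job_id, project_id=self.project_id)
```

With the guard, calling on_kill before any job was submitted becomes a no-op instead of an API call with a missing job id.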

f.write(
textwrap.dedent(
"""\
steps:
Member

What is being built in this cloud image pipeline?


Member

Ack. Does it require this many containers in the image?

I am not very familiar. Is this a VM image with containers?

Member Author

I removed some build steps.
aa29389

In order to build this Docker image I used Google Cloud Build. The Cloud Build service provisions a new virtual machine for each build. Each build contains many steps, and each step is described by a Docker image. The result of the build is a new Docker image.

Personally, I would not worry about the large number of images, because the virtual machine is deleted after the build is completed. The build does not run on the local machine.

@@ -15,9 +15,20 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
import os
Member

This file reads more like an end-to-end test of various Dataflow system things. Should it be a different PR?

Member Author

We always try to write system tests and integrations simultaneously to ensure the best integration reliability. Thanks to this, every person on my team can easily test every integration.

schedule_interval=None, # Override to match your needs
) as dag_flex_template:
start_flex_template = DataflowStartFlexTemplateOperator(
task_id="start_flex_template_java",
Member

maybe drop java from the name? (Given a template structure, what language was used to write it matters less.)

Contributor

I renamed it to start_flex_template_streaming_beam_sql here d788bab

"""
Starts flex templates with the Dataflow pipeline.

:param body: The request body
Contributor

nit: use a more descriptive name, launch_flex_template_parameters

If we are not going to construct this body here and document the parameters, then this should include a link to the API spec for this model.

Unfortunately, it looks like we mention this endpoint in the flex templates docs but not in the REST API docs.

I've opened an internal docs bug on this.

In the meantime, I'm pretty sure this body is just the same model as the "old" templates endpoint uses, LaunchTemplateParameters.
@aaltay can you confirm?

Looks like LaunchTemplateParameters is missing the container_spec_gcs_path key, which is necessary for this endpoint.
Why is this key snake_case in the API while all others (e.g. jobName) are camelCase anyway?

Contributor

I see documentation of the request body is available now:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.flexTemplates/launch#request-body
Thanks for taking care of it, @jaketf.
I added links to it in the hook and operator.

Regarding naming, I think snake_case is used as the standard for Python, and camelCase where it is required by external APIs (usually within dictionaries). I am not sure I understand your question correctly. Could you elaborate, please?
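For reference, the request body documented at the link above wraps everything in a launchParameter object. A minimal example might look like the sketch below; the bucket path, job name, and pipeline parameters are placeholders, not values from this PR:

```python
# Minimal flex template launch request body, following the REST API docs
# for projects.locations.flexTemplates.launch. All values are placeholders.
body = {
    "launchParameter": {
        "jobName": "example-flex-job",
        # Field names are camelCase, matching the REST API.
        "containerSpecGcsPath": "gs://example-bucket/templates/spec.json",
        "parameters": {
            "inputSubscription": "projects/example/subscriptions/example-sub",
            "outputTable": "example:dataset.table",
        },
    }
}
```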

"""
Starts flex templates with the Dataflow pipeline.

:param body: The request body
Contributor

see comment on same param in hook.

get_method = (
self.mock_dataflow.projects.return_value.
locations.return_value.
jobs.return_value.
get
)
get_method.return_value.execute.return_value = job
get_method.return_value.execute.side_effect = [
{"id": TEST_JOB_ID, "name": TEST_JOB_NAME, "currentState": DataflowJobStatus.JOB_STATE_RUNNING},
Contributor

Note from the other PR, #8553: see how JobStatus.JOB_STATE is a redundant stutter?
Especially if we allow the user to control failed states, we should remove the stutter.

Contributor

Created a separate issue for that: #11205

location=location,
poll_sleep=self.poll_sleep,
num_retries=self.num_retries)
jobs_controller.wait_for_done()
Contributor

@mik-laj was reflecting on this in light of the Data Fusion operator issue.
This "start" naming is confusing.

This method (and DataflowStartFlexTemplateOperator) are called "start" flex template, but it appears you are waiting for the job to complete.

The existing Dataflow operators do not have this "start" word, and I think the user expectation is that they poll the job to completion. Otherwise you can't do much useful downstream in the DAG without some sensor that waits on this job's completion.

If we want to support blocking or non-blocking behavior, I'd suggest a wait_for_done kwarg that defaults to True (the expected behavior based on similar operators). This might mean we need a new method in the controller, wait_for_running, that blocks until the pipeline enters the RUNNING state.

What do you think?

The same applies to #8553.

Member Author

I am trying to imitate the naming conventions that we currently have in this integration.
DataflowTemplatedJobStartOperator
It uses the verb "Start" to describe identical operations.

I plan to add a blocking and non-blocking mode to all operators in a separate PR, because this requires the development of sensors. For now, I only wanted to focus on one issue.
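The blocking/non-blocking split discussed above might eventually look something like this sketch. The kwarg name and the wait_for_running controller method are hypothetical here, since blocking mode was deferred to a separate PR:

```python
# Hypothetical sketch of the wait_for_done kwarg discussed above; not the
# API merged in this PR.
class StartFlexTemplateSketch:
    def __init__(self, wait_for_done: bool = True):
        # Default True matches the expectation set by the existing
        # operators, which poll the job to completion.
        self.wait_for_done = wait_for_done

    def execute(self, jobs_controller):
        if self.wait_for_done:
            # Block until the job reaches a terminal state.
            jobs_controller.wait_for_done()
        else:
            # Only block until the pipeline enters the RUNNING state.
            jobs_controller.wait_for_running()
```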

@syucream
Member

What's the current status? Is there any progress? I want this patch. Can I help with anything? @mik-laj

@mik-laj
Member Author

mik-laj commented Jun 10, 2020

Hello.
I had urgent tasks that prevented me from continuing my work. Now I am continuing work on Dataflow, but for now I have to fix problems with system tests. I hope to finish it next week.
Best regards

@mik-laj mik-laj added the pinned Protect from Stalebot auto closing label Jun 21, 2020
@syucream
Member

syucream commented Aug 3, 2020

@mik-laj Any update? I created pull request PolideaInternal#952 to fix the conflict and some of the points discussed here. Can you take a look and see if it's helpful?

f.write(
textwrap.dedent(
"""\
steps:
Contributor

Have you considered using a standard dict and json.dumps to generate the file content? You don't use any YAML-specific syntax, and JSON is a subset of YAML. I believe that a dict is easier to maintain long term than a hard-coded text block.

Contributor

I think it was used for the sake of simplicity. It is easier to take a look at the multiline string, which represents quite simple YAML. Am I right, @mik-laj?

Contributor

Is it easier to read a YAML dictionary inserted as text (with no formatting in most IDEs) than a Python dictionary with correct editor formatting?

Contributor

You are right, I changed my mind. I think the dictionary looks better, especially since Cloud Build accepts JSON config files:

cloud_build_config = {
    'steps': [
        {'name': 'gcr.io/cloud-builders/git', 'args': ['clone', "$_EXAMPLE_REPO", "repo_dir"]},
        {
            'name': 'gcr.io/cloud-builders/git',
            'args': ['checkout', '$_EXAMPLE_COMMIT'],
            'dir': 'repo_dir',
        },
        {
            'name': 'maven',
            'args': ['mvn', 'clean', 'package'],
            'dir': 'repo_dir/$_EXAMPLE_SUBDIR',
        },
        {
            'name': 'gcr.io/cloud-builders/docker',
            'args': ['build', '-t', '$_TEMPLATE_IMAGE', '.'],
            'dir': 'repo_dir/$_EXAMPLE_SUBDIR',
        },
    ],
    'images': ['$_TEMPLATE_IMAGE'],
}
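Because JSON is a subset of YAML, the dict above can then be written out with json.dumps and still be accepted as a Cloud Build config, roughly like this (abbreviated to one build step for brevity):

```python
import json

# Serializing a Cloud Build config dict instead of hard-coding a YAML
# string. Abbreviated to a single step; the full dict is shown above.
cloud_build_config = {
    'steps': [
        {'name': 'gcr.io/cloud-builders/git',
         'args': ['clone', '$_EXAMPLE_REPO', 'repo_dir']},
    ],
    'images': ['$_TEMPLATE_IMAGE'],
}

config_text = json.dumps(cloud_build_config, indent=4)
# f.write(config_text) then replaces the textwrap.dedent() block.
```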

@TobKed
Contributor

TobKed commented Oct 2, 2020

I replied to comments and made requested changes. Recently rebased on the latest master.

cc @mik-laj @syucream @ad-m @aaltay @jaketf

Contributor

@ad-m ad-m left a comment

Could you rebase?

@TobKed TobKed force-pushed the dataflow-flex branch 2 times, most recently from af0da63 to a9ba302 Compare October 7, 2020 04:31
@TobKed
Contributor

TobKed commented Oct 7, 2020

> Could you rebase?

@ad-m I rebased on the latest master and squashed commits.

@TobKed
Contributor

TobKed commented Oct 7, 2020

@mik-laj could you remove the dont-merge label, please?
cc @potiuk @turbaszek

"""
if not self._jobs:
raise ValueError("The _jobs should be set")
while True:
Member

Should we consider implementing a timeout?


Contributor

PTAL: 144b63f

Since the cancel method is executed only by the operator (on_kill), I didn't allow the user to configure this timeout, because I don't think it is worth adding complexity for the user and in the code itself. It is added to prevent _wait_for_states from hanging forever when execution_timeout is not set, and to provide a more meaningful log message if the timeout eventually occurs.

I proposed a default timeout of 5 minutes. @aaltay @kamilwu what do you think about this value?
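The bounded wait added in 144b63f can be sketched as a poll loop with a deadline. The names below are illustrative, not the exact helper from the commit:

```python
import time

# Illustrative sketch: poll a job's state until it reaches one of the
# expected states, or give up after a hard timeout. Names are hypothetical.
DEFAULT_CANCEL_TIMEOUT = 5 * 60  # the proposed 5-minute default

def wait_for_states(get_state, expected_states,
                    timeout=DEFAULT_CANCEL_TIMEOUT, poll_sleep=1):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in expected_states:
            return state
        time.sleep(poll_sleep)
    # A meaningful error instead of hanging forever when
    # execution_timeout is not set on the task.
    raise TimeoutError(
        f"Job did not reach any of {expected_states} within {timeout}s"
    )
```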

Member

I think this is reasonable.

Side note: cancel usually completes quickly. On the other hand, drain is the safer way to stop streaming pipelines, and it can take a long time depending on the state of the pipeline.

Member

If we use this only in on_kill, should we even bother with waiting? I would assume that a successful API call should be enough for us

Contributor

@mik-laj what do you think about removing the waiting? I lean towards @turbaszek's opinion.

Contributor

@aaltay I created a separate draft PR for handling draining on kill: #11374. WDYT?

Contributor

Since it is not strictly related to this change and I don't want to block this PR, I deleted this part of the logic and created a separate PR for handling waiting and timeout in job cancel: #11501 (draft, because I still need to add tests).

Contributor

@ad-m ad-m left a comment

LGTM

@potiuk
Member

potiuk commented Oct 13, 2020

Hey @TobKed. Can you please rebase this one onto the latest master? We fixed (hopefully) a problem with job queues for GitHub Actions, and I think when you rebase it should run much faster (more info on the devlist shortly).

@TobKed TobKed force-pushed the dataflow-flex branch 4 times, most recently from c57cd16 to ea04ead Compare October 15, 2020 23:36
@TobKed
Contributor

TobKed commented Oct 16, 2020

rebased on the latest master
cc @potiuk @turbaszek

@potiuk potiuk merged commit 3c10ca6 into apache:master Oct 16, 2020
@potiuk
Member

potiuk commented Oct 16, 2020

Thanks @TobKed !

@ad-m
Contributor

ad-m commented Oct 17, 2020

I, too, thank @TobKed for the development of the Airflow project!

Labels: area:providers, pinned (Protect from Stalebot auto closing), provider:google (Google (including GCP) related issues)

8 participants