
Wait for pipeline state in Data Fusion operators #8954

Merged: turbaszek merged 3 commits into apache:master on Jun 15, 2020

Conversation

@turbaszek (Member) commented May 21, 2020:

Closes: #8673


Make sure to mark the boxes below before creating PR:

  • [x] Description above provides context of the change
  • [x] Unit tests coverage for changes (not needed for documentation changes)
  • [x] Target GitHub ISSUE in description if exists
  • [x] Commits follow "How to write a good git commit message"
  • [x] Relevant documentation is updated including usage instructions.
  • [x] I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg bot added the provider:google label (Google (including GCP) related issues) on May 21, 2020
[Seven resolved review threads on airflow/providers/google/cloud/hooks/datafusion.py (outdated)]
pipeline_name=self.pipeline_name,
instance_url=api_url,
namespace=self.namespace,
runtime_args=self.runtime_args,
)
self.log.info("Pipeline started")
hook.wait_for_pipeline_state(
success_states=[PipelineStates.COMPLETED],
Contributor:

This operator is named CloudDataFusionStartPipelineOperator (key word: start).
IMO that means PipelineStates.RUNNING should be a success state by default.
However, it might be worth allowing the user to optionally control these success states to span more use cases.

Member Author:

I've added success_states as an optional argument. If not provided, the operator will wait for the pipeline to be RUNNING.
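For illustration, a minimal usage sketch with the new argument. The DAG context and the parameter values are assumed examples, not taken from this PR's diff; only success_states, pipeline_timeout, and PipelineStates.COMPLETED appear in the discussion itself.

from airflow.providers.google.cloud.hooks.datafusion import PipelineStates
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)

start_pipeline = CloudDataFusionStartPipelineOperator(
    task_id="start_pipeline",
    pipeline_name="example_pipeline",  # assumed example value
    instance_name="example-instance",  # assumed example value
    location="europe-west1",           # assumed example value
    # Optional: if omitted, the operator waits only for the RUNNING state.
    success_states=[PipelineStates.COMPLETED],
    pipeline_timeout=10 * 60,
)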

[Two more resolved review threads on airflow/providers/google/cloud/hooks/datafusion.py (outdated)]
runtime_args = json.loads(pipe["properties"]["runtimeArgs"])
if runtime_args[job_id_key] == faux_pipeline_id:
    return pipe["runid"]
sleep(10)
Contributor:

How often do you notice the first request failing? If it's more than 50% of the time, then the program run doesn't show up for some time (seconds), the first iteration can be expected to fail, and this loop will likely run at least 2-3 times.
Could we move the sleep to the beginning of the loop body, to increase the likelihood that the loop exits on an earlier iteration? That should save API calls we don't expect to yield a successful GET on the program run.

Member Author:

Usually it's successful on the 2nd iteration, but sometimes I get to the 3rd. I will move the sleep to the beginning as you suggested, which should decrease the number of requests.
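A sketch of the adjusted loop, with the sleep moved to the top of the body as agreed. This paraphrases the hook fragments quoted in this thread, so the method name, the response payload handling, and the final exception are assumptions rather than the merged code.

import json
from time import sleep

from airflow.exceptions import AirflowException


def _get_workflow_run_id(self, url: str, job_id_key: str, faux_pipeline_id: str) -> str:
    """Hook-method sketch: poll the .../runs endpoint until our run shows up."""
    for _ in range(5):
        # The program run may not be present instantly, so wait *before* each
        # request rather than after it; this usually saves one API call.
        sleep(10)
        response = self._cdap_request(url=url, method="GET")
        if response.status != 200:
            continue
        for pipe in json.loads(response.data):
            runtime_args = json.loads(pipe["properties"]["runtimeArgs"])
            # Match the run by the faux pipeline id injected into its
            # runtime arguments when the pipeline was started.
            if runtime_args.get(job_id_key) == faux_pipeline_id:
                return pipe["runid"]
    raise AirflowException("Could not retrieve the pipeline run id")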

# may not be present instantly
for _ in range(5):
    response = self._cdap_request(url=url, method="GET")
    if response.status != 200:
Contributor:

I'm not sure this will have the behavior you expect.

What happens when you do a GET on a program run id that isn't present yet? A 404 or an empty body?

I personally get a 404 on a 6.1.1 instance for a random UUID I generated. Has the API behavior changed?

[Screenshot: a CDAP REST API call for a random run id returning 404]

Contributor:

FYI, you can get to this convenient UI by clicking System Admin > Configuration > Make HTTP Calls.
It's very useful for learning and testing the CDAP REST API.

Member Author:

You are requesting .../runs/run-id; the code here is calling .../runs to get the list of all runs, because we don't yet know the proper CDAP run-id. I assume this request should be successful unless something is wrong with the API or network.

Contributor:

Ah I see, my mistake.
You'll get a 200 and an empty collection if there are no runs.

@turbaszek (Member Author) commented May 25, 2020:

Yes, so I will retry the request. I am not sure there's anything more we can do about this; when we call this method we are expecting to see some runs.

Contributor:

@jaketf the API that CDAP exposes covers the basic building blocks of programs, which are workflows, Spark jobs, MapReduce jobs, etc. Data Fusion pipelines use workflows for batch jobs and Spark streaming jobs for realtime. The operators should wait for the batch jobs and not wait for the streaming ones.

Member Author:

@sreevatsanraman how can we distinguish those two types? According to the Data Fusion CDAP API reference, users should use the same endpoint to start both batch and streaming pipelines:
https://cloud.google.com/data-fusion/docs/reference/cdap-reference#start_a_batch_pipeline

Contributor:

I think we should just handle batch pipelines in this PR (as this is implicitly all the current operator does). Also, anecdotally, I think this covers 90% of use cases for Airflow; in the field I have not seen a lot of streaming orchestration with Airflow.

namespace,
"apps",
pipeline_name,
"workflows",
Contributor:

This CDAP API is very convoluted.

It seems there are several program types, and this is hard-coding workflows and DataPipelineWorkflow.

I think DataPipelineWorkflow will not be present for streaming pipelines; instead you have to poll a Spark program.

I'm not sure how many other scenarios require other program types. It would be good to get someone from the CDAP community to review this.

[Two screenshots of CDAP API responses showing program types]
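To make the distinction concrete, here is a sketch of how the polling path would differ per program type. The batch path (workflows / DataPipelineWorkflow) matches the code under review; the streaming program type and name ("spark" / "DataStreamsSparkStreaming") are my assumption about CDAP realtime pipelines and are not implemented in this PR.

from urllib.parse import quote


def runs_url(instance_url: str, namespace: str, pipeline_name: str,
             streaming: bool = False) -> str:
    """Build the CDAP .../runs URL for a pipeline's underlying program."""
    if streaming:
        # Assumption: realtime pipelines run a Spark streaming program.
        program_type, program_id = "spark", "DataStreamsSparkStreaming"
    else:
        # Batch pipelines run a workflow, as hard-coded in this PR.
        program_type, program_id = "workflows", "DataPipelineWorkflow"
    path = "/".join(
        quote(part) for part in (
            "v3", "namespaces", namespace, "apps", pipeline_name,
            program_type, program_id, "runs",
        )
    )
    return f"{instance_url}/{path}"

Quoting each path segment matches the "Use quote to encode url parts" commit in this PR.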

@turbaszek (Member Author) commented May 24, 2020:

This seems to be a bigger change, because we would have to adjust each method that uses DataPipelineWorkflow in the URI. So I would say we can do this, but in a follow-up PR.

Btw, I was relying on the Google docs: https://cloud.google.com/data-fusion/docs/reference/cdap-reference#start_a_batch_pipeline

Contributor:

As long as we cover batch pipelines (with a Spark or MR backend), I think we should be good.

@turbaszek (Member Author):

Hi @jaketf @sreevatsanraman, what should we do to move this forward?

@jaketf (Contributor) commented Jun 9, 2020:

Thanks for following up @turbaszek.
tl;dr I think we should merge this PR as it fixes the immediate issue. We can file a lower-priority issue to handle streaming pipelines in the future; that can be an additional kwarg that accepts a streaming flag and uses different paths for polling.

I've updated the threads. I agree we should keep this PR small and focused on patching the existing operator for starting Data Fusion batch pipelines.

In general I think batch is more used than streaming, and Spark is more used than MR.
In batch, both MR and Spark can be polled at the .../DataPipelineWorkflow/runs/run_id endpoint.

@turbaszek turbaszek marked this pull request as ready for review June 10, 2020 13:33
@turbaszek turbaszek requested a review from mik-laj June 10, 2020 14:42
namespace: str = "default",
pipeline_timeout: int = 10 * 60,
Contributor:

If the success state is COMPLETED, we will time out before the pipeline run completes.
It can take more than 5 minutes just to provision a Data Fusion pipeline run, and some pipelines can take hours to complete.
Can we increase the default timeout to 1 hour?

Contributor:

The contract of this operator is to start a pipeline, not wait until pipeline completion; 10 minutes is a reasonable timeout.
COMPLETED is just a success state in case it's a super quick pipeline that completes between polls.
We can add a sensor for waiting on pipeline completion (which should use reschedule mode if it expects to wait that long).

Contributor:

As part of these changes you can now pass in a parameter to have the operator wait for pipeline completion (not just pipeline start).
A sensor + reschedule mode sounds like a good suggestion, thanks.
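A hypothetical sensor along the lines suggested here; nothing like it is added by this PR, and the class name, constructor, and get_pipeline_workflow call are illustrative assumptions about the hook's API.

from airflow.providers.google.cloud.hooks.datafusion import DataFusionHook, PipelineStates
from airflow.sensors.base_sensor_operator import BaseSensorOperator


class CloudDataFusionPipelineStateSensor(BaseSensorOperator):
    """Waits until a Data Fusion pipeline run reaches COMPLETED."""

    def __init__(self, *, pipeline_name: str, pipeline_id: str, instance_url: str,
                 namespace: str = "default", gcp_conn_id: str = "google_cloud_default",
                 **kwargs) -> None:
        # Reschedule mode frees the worker slot between pokes, which matters
        # for pipelines that may run for hours.
        kwargs.setdefault("mode", "reschedule")
        super().__init__(**kwargs)
        self.pipeline_name = pipeline_name
        self.pipeline_id = pipeline_id
        self.instance_url = instance_url
        self.namespace = namespace
        self.gcp_conn_id = gcp_conn_id

    def poke(self, context: dict) -> bool:
        hook = DataFusionHook(gcp_conn_id=self.gcp_conn_id)
        # Assumption: get_pipeline_workflow returns the run object,
        # including its "status" field.
        workflow = hook.get_pipeline_workflow(
            pipeline_name=self.pipeline_name,
            instance_url=self.instance_url,
            pipeline_id=self.pipeline_id,
            namespace=self.namespace,
        )
        return workflow.get("status") == PipelineStates.COMPLETED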

Member Author:

I've created an issue to limit the scope of this PR: #9300

@@ -616,6 +625,9 @@ class CloudDataFusionStartPipelineOperator(BaseOperator):
:type pipeline_name: str
:param instance_name: The name of the instance.
:type instance_name: str
:param success_states: If provided the operator will wait for pipeline to be in one of
the provided states.
:type success_states: List[str]
Contributor:

Missing info for the new pipeline_timeout parameter.
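The missing lines might look like the following (the wording is mine, not taken from the diff):

    :param pipeline_timeout: How long (in seconds) the operator should wait for the
        pipeline to be in one of the success states.
    :type pipeline_timeout: int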

"programId": "DataPipelineWorkflow",
"runtimeargs": runtime_args
}]
response = self._cdap_request(url=url, method="POST", body=body)
Contributor:

Just an FYI: this is the API request to start multiple pipelines.
There will eventually be a fix to return the run id as part of the API request to run a single pipeline; we can revert to your original URL when this is available. For context:
https://issues.cask.co/browse/CDAP-7641
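For context, a sketch of the full request that the snippet above belongs to, using CDAP's batch "start multiple programs" endpoint referenced in this thread. The URL construction and the surrounding method signature are assumptions rather than the exact merged code.

from urllib.parse import quote


def start_pipeline(self, pipeline_name: str, instance_url: str,
                   namespace: str = "default", runtime_args: dict = None):
    """Hook-method sketch: start a batch pipeline via the batch-start endpoint."""
    # Workaround until CDAP-7641 makes the single-program start call
    # return the run id.
    url = f"{instance_url}/v3/namespaces/{quote(namespace)}/start"
    body = [{
        "appId": pipeline_name,
        "programType": "workflow",
        "programId": "DataPipelineWorkflow",
        "runtimeargs": runtime_args or {},
    }]
    # Response parsing (recovering the run id) is omitted here.
    return self._cdap_request(url=url, method="POST", body=body)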

fixup! Wait for pipeline state in Data Fusion operators

fixup! fixup! Wait for pipeline state in Data Fusion operators

fixup! fixup! fixup! Wait for pipeline state in Data Fusion operators
@turbaszek turbaszek merged commit aee6ab9 into apache:master Jun 15, 2020
@turbaszek turbaszek deleted the improve-datafusion-ops branch June 15, 2020 10:09
kaxil pushed a commit to kaxil/airflow that referenced this pull request Jun 27, 2020
* Wait for pipeline state in Data Fusion operators

fixup! Wait for pipeline state in Data Fusion operators

fixup! fixup! Wait for pipeline state in Data Fusion operators

fixup! fixup! fixup! Wait for pipeline state in Data Fusion operators

* Use quote to encode url parts

* fixup! Use quote to encode url parts
Labels: provider:google (Google (including GCP) related issues)

Successfully merging this pull request may close these issues:

Data Fusion Hook Start pipeline will succeed before pipeline is in RUNNING state

5 participants