
[AIRFLOW-6212] SparkSubmitHook resolve connection #7075

Closed

Conversation

albertusk95

@albertusk95 albertusk95 commented Jan 6, 2020

Problem

I tried to use SparkSubmitOperator with a standalone cluster first. Unfortunately, the spark-submit task failed with the following exception.

airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]

The first thing that came to mind was: why does the master address exclude the spark:// prefix? It should be --master spark://host:port. I performed a quick check of the source code and found that the scheme addition hadn't been handled.

After reviewing the subsequent method calls, it turned out that the driver status tracking feature won't be utilised at all because of the above bug. Look at the following code snippet.

def _resolve_should_track_driver_status(self):
    """
    Determines whether or not this hook should poll the Spark driver status
    through subsequent spark-submit status requests after the initial
    spark-submit request.

    :return: True if the driver status should be tracked
    """
    return ('spark://' in self._connection['master'] and
            self._connection['deploy_mode'] == 'cluster')

The above method will always return False because the Spark master's address doesn't start with a scheme such as spark://.

Later on, I investigated the Connection module (airflow.models.connection) further and found that if we provide the URI (ex: spark://host:port), then the attributes of the Connection object will be derived via URI parsing.

When the host is parsed, the resulting value contains only the hostname, without the scheme. This is a critical bug in its own right.
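To illustrate the parsing behaviour, here is a minimal sketch using Python's standard urllib (this is not Airflow's actual Connection code, just a demonstration of how standard URI parsing splits off the scheme):

```python
from urllib.parse import urlsplit

# Standard URI parsing separates the scheme from the host.
parts = urlsplit("spark://host:7077")
print(parts.scheme)    # spark
print(parts.hostname)  # host
print(parts.port)      # 7077

# If only hostname and port are stored on the Connection, the scheme is
# lost, and the reconstructed master address is missing its spark:// prefix.
master = "{}:{}".format(parts.hostname, parts.port)
print(master)  # host:7077
```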

Proposed Solution

I think we don't really need the whole URI. When we store the connection data as an environment variable, we could just specify the URI parts in the form of JSON. This approach is mainly intended to tackle the URI parsing problem.

In this case, the conn_id will still be preserved.

Take a look at the following example (conn_id = "spark_default"). For simplicity, let's presume that extra is in JSON form.

AIRFLOW_CONN_SPARK_DEFAULT='{"conn_type": <conn_type>, "host":<host>, "port":<port>, "schema":<schema>, "extra":<extra>}'

Even though this solution avoids the incorrect results returned by URI parsing, one needs to ensure that each attribute (host, port, schema, etc.) stores the relevant value. I think that's much easier than writing a correct URI parser. Moreover, applying this technique gives the connection data builders for both the database and environment-variable modes the same pattern (both use a structured data specification).
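As a sketch of the proposed approach (illustrative only; this is not how Airflow currently parses AIRFLOW_CONN_* variables, and the attribute values here are made up), the JSON value would be decoded directly into connection attributes with no URI parsing involved:

```python
import json
import os

# Hypothetical: connection attributes stored as JSON in the env var.
os.environ["AIRFLOW_CONN_SPARK_DEFAULT"] = json.dumps(
    {"conn_type": "spark", "host": "spark://spark-master", "port": 7077}
)

# Each attribute is read back exactly as specified; nothing is lost to
# URI parsing, so the scheme survives in the host field.
conn = json.loads(os.environ["AIRFLOW_CONN_SPARK_DEFAULT"])
master = "{}:{}".format(conn["host"], conn["port"])
print(master)  # spark://spark-master:7077
```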


Link to JIRA issue: https://issues.apache.org/jira/browse/AIRFLOW-6212

  • Description above provides context of the change
  • Commit message starts with [AIRFLOW-NNNN], where AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

(*) For document-only changes, no JIRA issue is needed. Commit message starts [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg

boring-cyborg bot commented Jan 6, 2020

Congratulations on your first Pull Request and welcome to the Apache Airflow community!
If you have any issues or are unsure about anything, please check our
Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)

In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://apache-airflow-slack.herokuapp.com/

Contributor

@tooptoop4 tooptoop4 left a comment


The Spark hook works for me and the tests pass without this PR. Can you send the add-connection CLI command you used to hit the issue? I don't agree with removing the spark check on line 177.

@albertusk95
Author

albertusk95 commented Jan 7, 2020

@tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.

Or is this hook created only for standalone mode?

@albertusk95
Author

albertusk95 commented Jan 7, 2020

@tooptoop4 the tests pass only for the case where the connection information (host, port, conn_type, etc.) is stored in the database. I tried this hook by storing the connection info as an environment variable.

This failed because the URI parser returned incorrect results for all cluster deploy modes. For instance, URI=spark://host:port is parsed into host:port without the spark://. It then raises this exception:

airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]
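For context, a simplified sketch of how the final command ends up wrong (the function name and structure here are assumed for illustration; the real SparkSubmitHook appends many more options):

```python
def build_spark_submit_command(spark_binary, master, application):
    # Simplified sketch: binary, master flag, then the application file.
    return [spark_binary, "--master", master, application]

# Because URI parsing dropped the scheme, the master lacks spark://
# and spark-submit cannot connect to the standalone cluster.
cmd = build_spark_submit_command("path/to/spark-submit", "host:7077", "job_file.py")
print(cmd)  # ['path/to/spark-submit', '--master', 'host:7077', 'job_file.py']
```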

@tooptoop4
Contributor

tooptoop4 commented Jan 7, 2020

> @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
>
> Or is this hook created only for standalone mode?

Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.

@albertusk95
Author

> @tooptoop4 the tests pass only for the case where the connection information (host, port, conn_type, etc.) is stored in the database. I tried this hook by storing the connection info as an environment variable.
>
> This failed because the URI parser returned incorrect results for all cluster deploy modes. For instance, URI=spark://host:port is parsed into host:port without the spark://. It then raises this exception:
>
> airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]

@tooptoop4

@albertusk95
Author

> > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > Or is this hook created only for standalone mode?
>
> Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.

I couldn't find any information in the provided link stating that there's no async driver polling for YARN.

@albertusk95
Author

@tooptoop4 I think you might want to try this sample DAG to reproduce the issue.

a) Create an environment variable for the Spark connection:

export AIRFLOW_CONN_SPARK_DEFAULT='{"conn_type": "spark", "host": <host>, "port": <port>}'

b) Create a DAG file to run:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

job_file = 'path/to/job/file'

default_args = {
    'depends_on_past': False,
    'start_date': <fill_start_date>,
    'retries': <fill_retries>,
    'retry_delay': <fill_retry_delay>,
}

dag = DAG('spark-submit-hook', default_args=default_args,
          schedule_interval=<fill_interval>)

avg = SparkSubmitOperator(
    task_id=<fill_task_id>,
    dag=dag,
    application=job_file,
    spark_binary='path/to/spark-submit',
)

@tooptoop4
Contributor

> > > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > > Or is this hook created only for standalone mode?
> >
> > Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.
>
> I couldn't find any information in the provided link stating that there's no async driver polling for YARN.

There isn't async driver polling in YARN; I know Spark on YARN.

@@ -174,8 +174,7 @@ def _resolve_should_track_driver_status(self):
         subsequent spark-submit status requests after the initial spark-submit request
         :return: if the driver status should be tracked
         """
-        return ('spark://' in self._connection['master'] and
-                self._connection['deploy_mode'] == 'cluster')
+        return self._connection['deploy_mode'] == 'cluster'
Contributor


return (('spark://' in self._connection['master'] or conn_type starts with spark) and self._connection['deploy_mode'] == 'cluster')
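The suggested check could read roughly as follows (a sketch only; it assumes conn_type is carried in the resolved connection dict, which the current hook may not do):

```python
def should_track_driver_status(connection):
    """Poll the driver status only for Spark standalone cluster deploys."""
    is_spark_master = (
        "spark://" in connection["master"]
        or connection.get("conn_type", "").startswith("spark")
    )
    return is_spark_master and connection["deploy_mode"] == "cluster"

# True even when URI parsing stripped the spark:// prefix from master,
# because conn_type still identifies the connection as Spark standalone:
print(should_track_driver_status(
    {"master": "host:7077", "conn_type": "spark", "deploy_mode": "cluster"}
))  # True
```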

@@ -190,17 +189,22 @@ def _resolve_connection(self):
        # Master can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT and
        # k8s://https://<HOST>:<PORT>
        conn = self.get_connection(self._conn_id)
        if conn.port:
            conn_data['master'] = "{}:{}".format(conn.host, conn.port)
        if conn.conn_type in ['spark', 'mesos']:
Contributor


Revert this section (lines 192-200).

@tooptoop4
Contributor

tooptoop4 commented Jan 7, 2020

Existing tests for connections added via the DB/CLI need to work.

@albertusk95
Author

albertusk95 commented Jan 8, 2020

> Existing tests for connections added via the DB/CLI need to work.

Well, I guess the current tests don't cover connections added via the CLI, right?

@albertusk95
Author

albertusk95 commented Jan 8, 2020

> > > > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > > > Or is this hook created only for standalone mode?
> > >
> > > Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.
> >
> > I couldn't find any information in the provided link stating that there's no async driver polling for YARN.
>
> There isn't async driver polling in YARN; I know Spark on YARN.

How about using Livy to interact with the YARN cluster? I guess it supports both sync and async result retrieval.

@tooptoop4
Contributor

> > > > > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > > > > Or is this hook created only for standalone mode?
> > > >
> > > > Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.
> > >
> > > I couldn't find any information in the provided link stating that there's no async driver polling for YARN.
> >
> > There isn't async driver polling in YARN; I know Spark on YARN.
>
> How about using Livy to interact with the YARN cluster? I guess it supports both sync and async result retrieval.

I saw there is another active PR on Livy.

@stale

stale bot commented Feb 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 22, 2020
@stale stale bot closed this Feb 29, 2020