
[AIRFLOW-6212] SparkSubmitHook resolve connection #7075

Closed

Conversation

albertusk95

@albertusk95 albertusk95 commented Jan 6, 2020

Problem

I tried to use SparkSubmitOperator with a standalone cluster first. Unfortunately, the spark-submit task failed with the following exception.

airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]

The first thing that came to mind was: why does the master address exclude the spark:// prefix? It should be --master spark://host:port. I performed a quick check of the source code and found that the scheme addition hadn't been handled.

After reviewing the subsequent method calls, it turned out that the driver status tracking feature won't be utilised at all because of the above bug. Look at the following code snippet.

def _resolve_should_track_driver_status(self):
    """
    Determines whether or not this hook should poll the Spark driver status
    through subsequent spark-submit status requests after the initial
    spark-submit request.

    :return: True if the driver status should be tracked
    """
    return ('spark://' in self._connection['master'] and
            self._connection['deploy_mode'] == 'cluster')

The above method will always return False because the Spark master's address doesn't start with a scheme such as spark://.

Later on, I investigated the Connection module (airflow.models.connection) further and found that if we provide the URI (ex: spark://host:port), then the attributes of the Connection object will be derived via URI parsing.

When the host is parsed, the resulting value contains only the hostname, without the scheme. This is a critical bug in its own right.
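To illustrate the parsing behaviour, here is a minimal sketch using Python's standard urllib (this is not Airflow's actual Connection code, just a demonstration of how standard URI parsing splits off the scheme):

```python
from urllib.parse import urlsplit

# Standard URI parsing separates the scheme from the host.
parts = urlsplit("spark://host:7077")
print(parts.scheme)    # spark
print(parts.hostname)  # host
print(parts.port)      # 7077

# If only hostname and port are stored on the Connection, the scheme is
# lost, and the reconstructed master address is missing its spark:// prefix.
master = "{}:{}".format(parts.hostname, parts.port)
print(master)  # host:7077
```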

Proposed Solution

I think we don't really need the whole URI. When we store the connection data as an environment variable, we could just specify the URI parts in the form of JSON. This approach is mainly intended to tackle the URI parsing problem.

In this case, the conn_id will still be preserved.

Take a look at the following example (conn_id = "spark_default"). For simplicity, let's presume that extra is in JSON form.

AIRFLOW_CONN_SPARK_DEFAULT='{"conn_type": <conn_type>, "host":<host>, "port":<port>, "schema":<schema>, "extra":<extra>}'

Even though this solution avoids the incorrect results returned by URI parsing, one needs to ensure that each attribute (host, port, schema, etc.) stores the relevant value. I think that's much easier than writing a correct URI parser. Moreover, applying this technique gives the connection data builders for both the database and environment-variable modes the same pattern (both use a structured data specification).
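As a sketch of the proposed approach (illustrative only; this is not how Airflow currently parses AIRFLOW_CONN_* variables, and the attribute values here are made up), the JSON value would be decoded directly into connection attributes with no URI parsing involved:

```python
import json
import os

# Hypothetical: connection attributes stored as JSON in the env var.
os.environ["AIRFLOW_CONN_SPARK_DEFAULT"] = json.dumps(
    {"conn_type": "spark", "host": "spark://spark-master", "port": 7077}
)

# Each attribute is read back exactly as specified; nothing is lost to
# URI parsing, so the scheme survives in the host field.
conn = json.loads(os.environ["AIRFLOW_CONN_SPARK_DEFAULT"])
master = "{}:{}".format(conn["host"], conn["port"])
print(master)  # spark://spark-master:7077
```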


Link to JIRA issue: https://issues.apache.org/jira/browse/AIRFLOW-6212

  • Description above provides context of the change
  • Commit message starts with [AIRFLOW-NNNN], where AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

(*) For document-only changes, no JIRA issue is needed. Commit message starts [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg

boring-cyborg bot commented Jan 6, 2020

Congratulations on your first Pull Request and welcome to the Apache Airflow community!
If you have any issues or are unsure about anything, please check our
Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)

In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://apache-airflow-slack.herokuapp.com/

Contributor

@tooptoop4 tooptoop4 left a comment


The Spark hook works for me and the tests pass without this PR. Can you send the add-connection CLI command you used to hit the issue? I don't agree with removing the spark check on line 177.

@albertusk95
Author

albertusk95 commented Jan 7, 2020

@tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.

Or is this hook created only for standalone mode?

@albertusk95
Author

albertusk95 commented Jan 7, 2020

@tooptoop4 the tests pass only for the case where the connection information (host, port, conn_type, etc.) is stored in the database. I tried this hook by storing the connection info as an environment variable.

This failed because the URI parser returned incorrect results for all cluster deploy modes. For instance, URI=spark://host:port is parsed into host:port without the spark://. It then raises this exception:

airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]
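For context, a simplified sketch of how the final command ends up wrong (the function name and structure here are assumed for illustration; the real SparkSubmitHook appends many more options):

```python
def build_spark_submit_command(spark_binary, master, application):
    # Simplified sketch: binary, master flag, then the application file.
    return [spark_binary, "--master", master, application]

# Because URI parsing dropped the scheme, the master lacks spark://
# and spark-submit cannot connect to the standalone cluster.
cmd = build_spark_submit_command("path/to/spark-submit", "host:7077", "job_file.py")
print(cmd)  # ['path/to/spark-submit', '--master', 'host:7077', 'job_file.py']
```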

@tooptoop4
Contributor

tooptoop4 commented Jan 7, 2020

> @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
>
> Or is this hook created only for standalone mode?

Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.

@albertusk95
Author

> @tooptoop4 the tests pass only for the case where the connection information (host, port, conn_type, etc.) is stored in the database. I tried this hook by storing the connection info as an environment variable.
>
> This failed because the URI parser returned incorrect results for all cluster deploy modes. For instance, URI=spark://host:port is parsed into host:port without the spark://. It then raises this exception:
>
> airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]

@tooptoop4

@albertusk95
Author

> > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > Or is this hook created only for standalone mode?
>
> Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.

I couldn't find any information in the provided link stating that there's no async driver polling for YARN.

@albertusk95
Author

@tooptoop4 I think you might want to try this sample DAG to reproduce the issue.

a) Create an environment variable for the Spark connection:

export AIRFLOW_CONN_SPARK_DEFAULT='{"conn_type": "spark", "host": <host>, "port": <port>}'

b) Create a DAG file to run:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

job_file = 'path/to/job/file'

default_args = {
    'depends_on_past': False,
    'start_date': <fill_start_date>,
    'retries': <fill_retries>,
    'retry_delay': <fill_retry_delay>,
}

dag = DAG('spark-submit-hook', default_args=default_args,
          schedule_interval=<fill_interval>)

avg = SparkSubmitOperator(
    task_id=<fill_task_id>,
    dag=dag,
    application=job_file,
    spark_binary='path/to/spark-submit',
)

@tooptoop4
Contributor

> > > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > > Or is this hook created only for standalone mode?
> >
> > Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.
>
> I couldn't find any information in the provided link stating that there's no async driver polling for YARN.

There isn't async driver polling in YARN; I know Spark on YARN.

@@ -174,8 +174,7 @@ def _resolve_should_track_driver_status(self):
         subsequent spark-submit status requests after the initial spark-submit request
         :return: if the driver status should be tracked
         """
-        return ('spark://' in self._connection['master'] and
-                self._connection['deploy_mode'] == 'cluster')
+        return self._connection['deploy_mode'] == 'cluster'
Contributor


return (('spark://' in self._connection['master'] or conn_type starts with spark) and self._connection['deploy_mode'] == 'cluster')
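The suggested check could read roughly as follows (a sketch only; it assumes conn_type is carried in the resolved connection dict, which the current hook may not do):

```python
def should_track_driver_status(connection):
    """Poll the driver status only for Spark standalone cluster deploys."""
    is_spark_master = (
        "spark://" in connection["master"]
        or connection.get("conn_type", "").startswith("spark")
    )
    return is_spark_master and connection["deploy_mode"] == "cluster"

# True even when URI parsing stripped the spark:// prefix from master,
# because conn_type still identifies the connection as Spark standalone:
print(should_track_driver_status(
    {"master": "host:7077", "conn_type": "spark", "deploy_mode": "cluster"}
))  # True
```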

@@ -190,17 +189,22 @@ def _resolve_connection(self):
        # Master can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT and
        # k8s://https://<HOST>:<PORT>
        conn = self.get_connection(self._conn_id)
        if conn.port:
            conn_data['master'] = "{}:{}".format(conn.host, conn.port)
        if conn.conn_type in ['spark', 'mesos']:
Contributor


Revert this section (lines 192-200).

@tooptoop4
Contributor

tooptoop4 commented Jan 7, 2020

Existing tests for connections added via the DB/CLI need to work.

@albertusk95
Author

albertusk95 commented Jan 8, 2020

> Existing tests for connections added via the DB/CLI need to work.

Well, I guess the current tests don't cover connections added via the CLI, right?

@albertusk95
Author

albertusk95 commented Jan 8, 2020

> > > > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > > > Or is this hook created only for standalone mode?
> > >
> > > Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.
> >
> > I couldn't find any information in the provided link stating that there's no async driver polling for YARN.
>
> There isn't async driver polling in YARN; I know Spark on YARN.

How about using Livy to interact with the YARN cluster? I guess it supports both sync and async result retrieval.

@tooptoop4
Contributor

> > > > > @tooptoop4 if we don't remove the spark check on line 177, how can we use this hook to track the driver status when deploying on YARN, Mesos, or Kubernetes? I think spark:// is only for standalone mode.
> > > > > Or is this hook created only for standalone mode?
> > > >
> > > > Yes, this hook is only for standalone mode. There is no concept of an async driver status poll for the other modes; see https://spark.apache.org/docs/latest/running-on-yarn.html. In the other modes, the submit/launch is synchronous. I think you can cancel this, @albertusk95.
> > >
> > > I couldn't find any information in the provided link stating that there's no async driver polling for YARN.
> >
> > There isn't async driver polling in YARN; I know Spark on YARN.
>
> How about using Livy to interact with the YARN cluster? I guess it supports both sync and async result retrieval.

I saw there is another active PR on Livy.

@stale

stale bot commented Feb 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 22, 2020
@stale stale bot closed this Feb 29, 2020