The SparkSubmitOperator accepts a dictionary for its 'conf' property, while the SparkSqlOperator expects a string in the format PARAM=VALUE,PARAM2=VALUE2.
The dictionary form lets a value such as spark.jars.packages carry a comma-separated list of packages intact. The string form, however, is always split on the comma, so the packages end up as: --conf spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 --conf org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0.
This effectively makes it impossible to set any of Spark's comma-delimited configuration values.
The SparkSubmitOperator also exposes a larger set of properties, including the --packages flag, which the spark/bin/spark-sql script supports as well.
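The splitting problem can be shown in isolation. This is an illustrative sketch (the package names are made up, and this is not the provider's actual code): in the PARAM=VALUE,PARAM2=VALUE2 format, a ',' that separates pairs is indistinguishable from a ',' inside a value.

```python
# Illustrative sketch: the conf string format cannot distinguish a ','
# that separates PARAM=VALUE pairs from a ',' that is part of a value
# (e.g. a comma-separated list of packages).
conf_string = (
    "spark.jars.packages=pkg-a:1.0,pkg-b:2.0,"
    "spark.sql.shuffle.partitions=200"
)

# Splitting on ',' (as the hook does) yields three fragments, one of
# which ("pkg-b:2.0") is not a PARAM=VALUE pair at all.
fragments = conf_string.split(",")
print(fragments)
# ['spark.jars.packages=pkg-a:1.0', 'pkg-b:2.0', 'spark.sql.shuffle.partitions=200']
```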
What you think should happen instead
A dictionary allows more flexibility when adding configs and seems the right way to store them. Using it in both Spark operators would enforce the same behaviour, making them easier to adjust and maintain, and it would mean less documentation to keep up to date. :)
The config is split on ',' in the spark_sql hook; the exact file and line are referenced below.
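If the SparkSqlOperator accepted a dict, the hook could emit one --conf flag per entry, so commas inside values would survive. A minimal sketch of that idea (illustrative only, not the actual provider implementation; the helper name is made up):

```python
def build_conf_args(conf: dict) -> list:
    """Build spark-sql --conf arguments from a dict, one flag per
    entry, so commas inside values stay intact.
    (Illustrative sketch, not the actual provider code.)"""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

conf = {
    "spark.jars.packages": "pkg-a:1.0,pkg-b:2.0",  # hypothetical packages
    "spark.sql.shuffle.partitions": "200",
}
print(build_conf_args(conf))
# ['--conf', 'spark.jars.packages=pkg-a:1.0,pkg-b:2.0',
#  '--conf', 'spark.sql.shuffle.partitions=200']
```

Because each dict entry becomes its own --conf flag, the comma in spark.jars.packages never needs to be parsed as a delimiter.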
How to reproduce
Create a dag and task:
conf = {
    'spark.jars.packages': 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0',
    'spark.driver.extraJavaOptions': '-Divy.cache.dir=/tmp -Divy.home=/tmp',
    'spark.sql.extensions': 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions',
}
# Flatten the dict into the PARAM=VALUE,PARAM2=VALUE2 string the operator expects
config_string = ','.join(f"{key}={value}" for key, value in conf.items())

merge_branch = SparkSqlOperator(
    name="merge_branch",
    task_id="merge_branch",
    conf=config_string,  # requires a string instead of a dict
    conn_id='spark',
    dag=dag,  # `dag` and `ref` are defined elsewhere in the DAG file
    sql=f"MERGE BRANCH {ref} INTO main IN nessie",
    retries=0,
)
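Note that flattening the dict does not work around the bug: once the operator re-splits config_string on ',', the packages list is broken apart again. A quick illustrative check with just the packages entry:

```python
# Reproducing only the conf-string round trip from the task above.
conf = {
    'spark.jars.packages': 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0',
}
config_string = ','.join(f"{key}={value}" for key, value in conf.items())

# The hook splits on ',' again, so the second package is detached from
# its key and is no longer a PARAM=VALUE pair.
parts = config_string.split(',')
print(parts)
```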
Apache Airflow Provider(s)
apache-spark
Versions of Apache Airflow Providers
apache-airflow==2.9.2
apache-airflow-providers-apache-spark==4.8.2
Apache Airflow version
2.9.2
Operating System
MacOS
Deployment
Docker-Compose
Deployment details
No response
Where the config is split on ',': airflow/airflow/providers/apache/spark/hooks/spark_sql.py, line 146 (commit 54dfead).
Anything else
No response
Are you willing to submit PR?
Code of Conduct