The SparkSubmitOperator accepts a dictionary for its 'conf' property, while the SparkSqlOperator expects a string in the format PARAM=VALUE,PARAM2=VALUE2.
The dictionary form lets a value such as spark.jars.packages carry a comma-separated list of packages intact. The string form, however, is always split on the comma, so the packages end up as: --conf spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 --conf org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0.
This effectively makes it impossible to set any of Spark's comma-delimited configuration values.
The SparkSubmitOperator also exposes a larger set of properties, including the --packages flag, which the spark/bin/spark-sql script supports as well.
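The splitting problem can be shown in isolation. This is an illustrative sketch (the package names are made up, and this is not the provider's actual code): in the PARAM=VALUE,PARAM2=VALUE2 format, a ',' that separates pairs is indistinguishable from a ',' inside a value.

```python
# Illustrative sketch: the conf string format cannot distinguish a ','
# that separates PARAM=VALUE pairs from a ',' that is part of a value
# (e.g. a comma-separated list of packages).
conf_string = (
    "spark.jars.packages=pkg-a:1.0,pkg-b:2.0,"
    "spark.sql.shuffle.partitions=200"
)

# Splitting on ',' (as the hook does) yields three fragments, one of
# which ("pkg-b:2.0") is not a PARAM=VALUE pair at all.
fragments = conf_string.split(",")
print(fragments)
# ['spark.jars.packages=pkg-a:1.0', 'pkg-b:2.0', 'spark.sql.shuffle.partitions=200']
```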
What you think should happen instead
A dictionary allows more flexibility when adding configs and seems the right way to store them. Using it in both Spark operators would enforce the same behaviour, making them easier to adjust and maintain, and it would mean less documentation to keep up to date. :)
The config is split on ',' in the spark_sql hook; the exact file and line are referenced below.
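If the SparkSqlOperator accepted a dict, the hook could emit one --conf flag per entry, so commas inside values would survive. A minimal sketch of that idea (illustrative only, not the actual provider implementation; the helper name is made up):

```python
def build_conf_args(conf: dict) -> list:
    """Build spark-sql --conf arguments from a dict, one flag per
    entry, so commas inside values stay intact.
    (Illustrative sketch, not the actual provider code.)"""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

conf = {
    "spark.jars.packages": "pkg-a:1.0,pkg-b:2.0",  # hypothetical packages
    "spark.sql.shuffle.partitions": "200",
}
print(build_conf_args(conf))
# ['--conf', 'spark.jars.packages=pkg-a:1.0,pkg-b:2.0',
#  '--conf', 'spark.sql.shuffle.partitions=200']
```

Because each dict entry becomes its own --conf flag, the comma in spark.jars.packages never needs to be parsed as a delimiter.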
How to reproduce
Create a dag and task:
conf = {
    'spark.jars.packages': 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0',
    'spark.driver.extraJavaOptions': '-Divy.cache.dir=/tmp -Divy.home=/tmp',
    'spark.sql.extensions': 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions',
}
# Flatten the dict into the PARAM=VALUE,PARAM2=VALUE2 string the operator expects
config_string = ','.join(f"{key}={value}" for key, value in conf.items())

merge_branch = SparkSqlOperator(
    name="merge_branch",
    task_id="merge_branch",
    conf=config_string,  # requires a string instead of a dict
    conn_id='spark',
    dag=dag,  # `dag` and `ref` are defined elsewhere in the DAG file
    sql=f"MERGE BRANCH {ref} INTO main IN nessie",
    retries=0,
)
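Note that flattening the dict does not work around the bug: once the operator re-splits config_string on ',', the packages list is broken apart again. A quick illustrative check with just the packages entry:

```python
# Reproducing only the conf-string round trip from the task above.
conf = {
    'spark.jars.packages': 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0',
}
config_string = ','.join(f"{key}={value}" for key, value in conf.items())

# The hook splits on ',' again, so the second package is detached from
# its key and is no longer a PARAM=VALUE pair.
parts = config_string.split(',')
print(parts)
```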
Apache Airflow Provider(s)
apache-spark
Versions of Apache Airflow Providers
apache-airflow==2.9.2
apache-airflow-providers-apache-spark==4.8.2
Apache Airflow version
2.9.2
Operating System
MacOS
Deployment
Docker-Compose
Deployment details
No response
Where the config is split on ',': airflow/airflow/providers/apache/spark/hooks/spark_sql.py, line 146 (commit 54dfead).
Anything else
No response
Are you willing to submit PR?
Code of Conduct