[SPARK-52687][PYTHON] Fix SPARK_CONNECT_MODE parsing #54800
mukimasta wants to merge 1 commit into apache:master
Conversation
Thanks for making this PR @mukimasta :) I'll take a look right now (sorry my week got away from me :))
holdenk left a comment
This looks good to me :D Thank you so much for fixing this! Let's leave it open for a day and see if any other comments land :)
I would actually avoid allowing such values. Can we simply make the error message better?
Oh yeah, so this value is not allowed in the current version. The change here gets the return value from `$?`, so we don't have to worry about whitespace being output from Python/shell.
    import sys
    try:
        from pyspark.util import spark_connect_mode
        m = spark_connect_mode().strip()
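The diff above is partial; a self-contained sketch of the same exit-code mapping might look like the following (the `detect_mode` helper and the env-var fallback are illustrative, not the exact PR code, which calls `pyspark.util.spark_connect_mode()`):

```python
import os
import sys


def detect_mode(value):
    """Map a SPARK_CONNECT_MODE value to an exit code:
    0 = classic, 1 = connect, 2 = unknown/invalid.
    (Hypothetical helper; the actual patch inlines this logic.)"""
    v = value.strip()  # tolerate surrounding whitespace, \r, tabs
    if v == "0":
        return 0
    if v == "1":
        return 1
    return 2


if __name__ == "__main__":
    # The real patch imports pyspark.util; this sketch reads the
    # environment variable directly so it runs without pyspark installed.
    sys.exit(detect_mode(os.environ.get("SPARK_CONNECT_MODE", "")))
```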
Hm .. actually, wouldn't this make " 0 " work?
Lines 935 to 942 in b634429
I think we should probably raise a better error message here as well.
Fixes https://issues.apache.org/jira/browse/SPARK-52687
What changes were proposed in this pull request?
Replaced the stdout-based SPARK_CONNECT_MODE detection in bin/pyspark with an exit-code-based approach to avoid invisible characters (e.g. \r, spaces, tabs) causing false "unknown value" errors.
Changes:
- Run Python to determine the mode and exit with 0 (classic), 1 (connect), or 2 (error) instead of printing to stdout
- Use `.strip()` in Python to handle whitespace in the env var
- Add try/except to handle import failures and invalid values
- Update the bash logic to branch on exit code instead of parsing command output
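Sketched in shell, the new branching could look like this (a hypothetical stand-in: the real logic lives in bin/pyspark and imports pyspark.util rather than reading the env var inline):

```shell
# Determine the mode via exit code rather than stdout parsing.
# The inline Python one-liner is a stand-in for the real pyspark.util call.
python3 -c '
import os, sys
v = os.environ.get("SPARK_CONNECT_MODE", "").strip()
sys.exit({"0": 0, "1": 1}.get(v, 2))
'
case "$?" in
  0) echo "classic mode" ;;
  1) echo "connect mode" ;;
  *) echo "SPARK_CONNECT_MODE has an unknown value or pyspark.util is unavailable" >&2 ;;
esac
```

Branching on `$?` immediately after the command sidesteps any whitespace the interpreter might emit on stdout.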
Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-52687
The previous implementation used $(python -c "print(spark_connect_mode())") and compared the output to "0" or "1". This could fail when:
- Python output included \r (e.g. Windows CRLF)
Any such character made the comparison fail and triggered the "unknown value" message even when the value was logically correct.
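A minimal demonstration of the failure mode: command substitution strips trailing newlines but not a trailing carriage return, so the old string comparison fails on output like `0\r`.

```shell
# "0\r" survives command substitution, so it never equals "0".
out="$(printf '0\r')"
if [ "$out" = "0" ]; then
  echo "value recognized"
else
  echo "unknown value"   # this branch is taken because of the hidden \r
fi
```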
Does this PR introduce any user-facing change?
Yes. Behavior is unchanged for normal cases; the fix only affects previously broken scenarios.
Before: Users could see "The environment variable SPARK_CONNECT_MODE has unknown value or pyspark.util package is not available: 0" even when the value was valid.
After: Valid values (including with surrounding whitespace) are accepted; invalid values still produce an error.
How was this patch tested?
- Manual tests for exit codes 0, 1, and 2
- Manual test with SPARK_CONNECT_MODE=" 0 " (whitespace)
Was this patch authored or co-authored using generative AI tooling?
No
Thanks to @holdenk for guidance during the Spark Community Sprint.