[SPARK-22495] Fix setup of SPARK_HOME variable on Windows #19370
Conversation
ok to test

bin/pyspark2.cmd

set SPARK_HOME=%~dp0..
set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
if exist "%FIND_SPARK_HOME_SCRIPT%" (
for /f %%i in ('python %FIND_SPARK_HOME_SCRIPT%') do set SPARK_HOME=%%i
Mind adding some comments? I believe this resembles the logic here:
Lines 28 to 40 in 9244957
# If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
# need to search the different Python directories for a Spark installation.
# Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
# spark-submit in another directory we want to use that version of PySpark rather than the
# pip installed version of PySpark.
export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
# We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
# Default to standard python interpreter unless told otherwise
if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
fi
export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
which detects `find_spark_home.py`, which should be included in the pip installation:
Lines 143 to 145 in aad2125
# We add find_spark_home.py to the bin directory we install so that pip installed PySpark
# will search for SPARK_HOME with Python.
scripts.append("pyspark/find_spark_home.py")
It'd be nicer if the PR description explained this.
Thanks for the suggestion; I will add them.
cc @holdenk, @felixcheung and @ueshin who I believe are interested in this.
Test build #82258 has finished for PR 19370 at commit
Please include the background of the issue and the approach for the fix in this PR description.
bin/run-example.cmd

rem Figure out where the Spark framework is installed
set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
if exist "%FIND_SPARK_HOME_SCRIPT%" (
for /f %%i in ('python %FIND_SPARK_HOME_SCRIPT%') do set SPARK_HOME=%%i
I'm not sure if `python` would be the right one (python2, python3, is it in PATH)? And I don't think `cmd /c foo.py` works either.
The assumption was that if `find_spark_home.py` can be found in the local folder, this will be a good setup. While I think `find_spark_home.py` will work with any Python version (well, at least modern 2 and 3), indeed this does not take into account the case where python is just not there at all.
I guess python is generally expected on *nix but quite possibly absent on Windows.
I'll move the code out to a separate file.
I'm not sure about duplicating the entire logic in find_spark into a Windows version though - given how poorly maintained things are on Windows (to be honest) I worry about any addition of code.
I don't want to move the python logic into CMD. I'll move the current checks I've added to a separate file and extend them into one or two additional cases similar to the bash script `find-spark-home`. This will put the business logic of finding Spark on Windows into one file, which is better for maintenance. In the long run I'd suggest moving to a more portable solution like the Python console scripts I suggested above.
Moving it to a Python script sounds like a good idea to me if possible in the future. I did have the same concern as @felixcheung, but to me it's okay to add it after reading the comment above. Let's add more detailed comments in the new script for follow-ups in the future.
What is this Python console scripts idea, @jsnowacki?
I've explained it in the ticket https://issues.apache.org/jira/browse/SPARK-18136.
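For context, the console-scripts idea relies on setuptools entry points: pip then generates platform-native launchers (including `.exe` shims on Windows) instead of hand-maintained wrapper scripts. A minimal sketch, with module and function names that are purely illustrative (not Spark's actual layout):

```python
# Hypothetical sketch of the "console scripts" idea from SPARK-18136.
# Declared in setup.py via entry_points, pip generates platform-native
# launchers (e.g. pyspark.exe on Windows) instead of .cmd wrappers.
# The module paths below are illustrative, not Spark's real ones.
ENTRY_POINTS = {
    "console_scripts": [
        "pyspark=pyspark.launcher:main",         # hypothetical launcher module
        "spark-submit=pyspark.launcher:submit",  # hypothetical
    ]
}
```

With such a declaration, `pip install` would place working commands on PATH on every platform, which is why it is floated here as the long-term portable option.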
Force-pushed from 0b12975 to 3ac2c86
Test build #82380 has finished for PR 19370 at commit
I've incorporated all of your comments and moved the CMD logic of finding `SPARK_HOME` into a separate script. It has been tested locally and works fine. It will complain if Python is not installed, but I don't have a better way around it really. It will try to run it only when `SPARK_HOME` is not set.
Test build #82396 has finished for PR 19370 at commit
retest this please
bin/run-example.cmd

@@ -17,6 +17,13 @@ rem See the License for the specific language governing permissions and
rem limitations under the License.
rem

set SPARK_HOME=%~dp0..
rem Figure out where the Spark framework is installed
set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
Shouldn't we change this one too?
Yes, indeed, fixing.
Test build #82405 has finished for PR 19370 at commit
@@ -18,7 +18,7 @@ rem limitations under the License.
rem
it looks like we should add this file to the appveyor list...
Do I need to take any action on this, or just wait for it to be approved?
We could add `- bin/*.cmd` here:
Lines 27 to 35 in 828fab0

only_commits:
  files:
    - appveyor.yml
    - dev/appveyor-install-dependencies.ps1
    - R/
    - sql/core/src/main/scala/org/apache/spark/sql/api/r/
    - core/src/main/scala/org/apache/spark/api/r/
    - mllib/src/main/scala/org/apache/spark/ml/r/
    - core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
Done altering `appveyor.yml`.
add to whitelist
Test build #82414 has finished for PR 19370 at commit
Test build #82416 has finished for PR 19370 at commit
I've added `bin/*.cmd` to the AppVeyor file list.
@jsnowacki, would you mind if I ask you to squash those commits into a single one, so that we can check if the squashed commit, having the changes in `appveyor.yml`, triggers the AppVeyor build?
Otherwise, looks good to me although I have to double check and test.
Force-pushed from 5f52c79 to aec49a0
@HyukjinKwon Commit squashed into one as you've requested.
Yup, it looks like it's triggering fine - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1822-master - although I wonder why the check mark does not appear. I think it is not specific to this PR but rather AppVeyor itself though.
Test build #82508 has finished for PR 19370 at commit
Force-pushed from b829bc0 to c0138a9
LGTM, one nit/possible suggestion.
could we merge this to branch-2.2?
rem

rem Path to Python script finding SPARK_HOME
set FIND_SPARK_HOME_PYTHON_SCRIPT=%~dp0find_spark_home.py
I think this fails rather messily if python is not in the path? Should we add some checks for that?
I manually tested and it looks like it's going to give a message like this:

C:\...>pyspark
'python' is not recognized as an internal or external command, operable program or batch file.

which seems roughly fine, though it's a bit ugly.

So, I googled a possible approach, for example, https://superuser.com/a/718194. However, it seems `where` does not recognise an absolute path, for example, C:\...\Python27\python.exe. So, it looks like we should make a combination with the `exist` keyword.

@jsnowacki, if you are active now and know a simpler, better way, we could definitely try it. Or we could probably go ahead as is too.
The only idea I have at the moment is adding something along these lines:
rem If there is python installed, try to use the root dir as SPARK_HOME
where %PYTHON_RUNNER%
if %ERRORLEVEL% neq 0 (
echo %PYTHON_RUNNER% wasn't found
if "x%SPARK_HOME%"=="x" (
set SPARK_HOME=%~dp0..
)
)
I need to add it in the main code, though, not in the conditional sub-block, as it doesn't work inside a block of code due to CMD limitations regarding environment variable evaluation (variables in a block are expanded when the block is parsed, not when each line runs).
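To make the intent of that cmd fragment easier to follow, here is a rough Python analogue of the fallback (a sketch only, not part of the PR; `resolve_spark_home` is a made-up name): keep an existing `SPARK_HOME`, and default to the script directory's parent when the interpreter is missing.

```python
import os
import shutil

def resolve_spark_home(bin_dir, runner="python"):
    """Rough analogue of the cmd fallback above (illustrative only)."""
    if os.environ.get("SPARK_HOME"):
        return os.environ["SPARK_HOME"]  # an explicitly set value wins
    if shutil.which(runner) is None:
        # Python isn't on PATH: assume a plain unpacked layout, i.e. bin\..
        return os.path.abspath(os.path.join(bin_dir, os.pardir))
    return None  # otherwise defer to find_spark_home.py
```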
Test build #84004 has finished for PR 19370 at commit
Yup, +1 for going ahead with branch-2.2.
I think the concern is that we are adding a Python dependency even for non-Python use cases.
We could track handling missing Python better separately. My concern is that Python is not a standard component on Windows and the user might not have it installed for using spark-submit or spark-shell. And the message might not be obvious as to why.
Force-pushed from c0138a9 to 4b3a963
I've added this check using `where`.
Force-pushed from 4b3a963 to ed379b1
bin/find-spark-home.cmd

if "x%SPARK_HOME%"=="x" (
set SPARK_HOME=%~dp0..
)
)
Seems a tab is used instead. Let's match it with spaces.
Done
Test build #84038 has finished for PR 19370 at commit
Force-pushed from ed379b1 to 84d468b
bin/find-spark-home.cmd

if %ERRORLEVEL% neq 0 (
echo %PYTHON_RUNNER% wasn't found; Python doesn't seem to be installed
if "x%SPARK_HOME%"=="x" (
set SPARK_HOME=%~dp0..
Seems there is a hidden tab ..
Indeed, sorry. Now it should be fine.
Test build #84068 has finished for PR 19370 at commit
Force-pushed from 84d468b to a4d516f
Test build #84077 has finished for PR 19370 at commit
bin/find-spark-home.cmd

if "x%SPARK_HOME%"=="x" (
set SPARK_HOME=%~dp0..
)
)
I think the problem here from the last commit is:

- now `PYTHON_RUNNER` can't be an absolute path, as `where` does not work with it:
  ERROR: Invalid pattern is specified in "path:pattern". C:\Python27\python.exe wasn't found; Python doesn't seem to be installed
- it prints out the output from `where`:
  C:\...>pyspark C:\cygwin\bin\python C:\Python27\python.exe ...
- and the error message looks no more useful than the previous one:
  python wasn't found; Python doesn't seem to be installed
I suggest this:

rem Path to Python script finding SPARK_HOME
set FIND_SPARK_HOME_PYTHON_SCRIPT=%~dp0find_spark_home.py
rem Default to standard python interpreter unless told otherwise
set PYTHON_RUNNER=python
rem If PYSPARK_DRIVER_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_DRIVER_PYTHON%" =="x" (
set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON%
)
rem If PYSPARK_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_PYTHON%" =="x" (
set PYTHON_RUNNER=%PYSPARK_PYTHON%
)
rem If there is python installed, trying to use the root dir as SPARK_HOME
where %PYTHON_RUNNER% > nul 2>&1
if %ERRORLEVEL% neq 0 (
if not exist %PYTHON_RUNNER% (
if "x%SPARK_HOME%"=="x" (
echo Missing Python executable '%PYTHON_RUNNER%', defaulting to '%~dp0..' for SPARK_HOME ^
environment variable. Please install Python or specify the correct Python executable in ^
PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
set SPARK_HOME=%~dp0..
)
)
)
rem Only attempt to find SPARK_HOME if it is not set.
if "x%SPARK_HOME%"=="x" (
if not exist "%FIND_SPARK_HOME_PYTHON_SCRIPT%" (
rem If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
rem need to search the different Python directories for a Spark installation.
rem Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
rem spark-submit in another directory we want to use that version of PySpark rather than the
rem pip installed version of PySpark.
set SPARK_HOME=%~dp0..
) else (
rem We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
for /f "delims=" %%i in ('%PYTHON_RUNNER% %FIND_SPARK_HOME_PYTHON_SCRIPT%') do set SPARK_HOME=%%i
)
)

I manually tested each branch. This addresses the concern in #19370 (comment). The error message shows like:

C:\...>pyspark
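The interpreter-selection part of the suggested script boils down to a simple precedence chain. Here is a hedged Python mirror of just that precedence (`python_runner` is an illustrative name, not Spark code):

```python
def python_runner(env):
    """Mirror the cmd snippet's precedence:
    PYSPARK_PYTHON > PYSPARK_DRIVER_PYTHON > plain "python"."""
    runner = "python"
    if env.get("PYSPARK_DRIVER_PYTHON"):
        runner = env["PYSPARK_DRIVER_PYTHON"]
    if env.get("PYSPARK_PYTHON"):
        runner = env["PYSPARK_PYTHON"]  # last check wins, as in the cmd version
    return runner
```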
In Line 27 in a36a76a
Force-pushed from a4d516f to b58f740
Thanks for looking into it again. I've followed your suggestions and updated the PR. It also seems to work for me now.
Test build #84119 has finished for PR 19370 at commit
Build started: [SparkR]
Merged to master. |
Seems there is a conflict while backporting/cherry-picking to branch-2.2. @jsnowacki, mind opening a backporting PR against branch-2.2, please? I think this is important for many Windows users and I guess it is relatively low risk.
## What changes were proposed in this pull request?

This is a cherry-pick of the original PR 19370 onto branch-2.2, as suggested in #19370 (comment).

Fixing the way `SPARK_HOME` is resolved on Windows. While the previous version was working with the built release download, the set of directories changed slightly for the PySpark `pip` or `conda` install. This has been reflected in the Linux files in `bin` but not in the Windows `cmd` files.

The first fix improves the way the `jars` directory is found, as this was stopping the Windows version of the `pip/conda` install from working; JARs were not found on Session/Context setup.

The second fix adds a `find-spark-home.cmd` script, which uses the `find_spark_home.py` script, as the Linux version does, to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to the `cmd` script language limitations. If the environment variable is set, the Python script `find_spark_home.py` will not be run. The process can fail if Python is not installed, but this path will mostly be used when PySpark is installed via `pip/conda`, and thus there is some Python in the system.

## How was this patch tested?

Tested on a local installation.

Author: Jakub Nowacki <j.s.nowacki@gmail.com>

Closes #19807 from jsnowacki/fix_spark_cmds_2.
What changes were proposed in this pull request?

Fixing the way `SPARK_HOME` is resolved on Windows. While the previous version was working with the built release download, the set of directories changed slightly for the PySpark `pip` or `conda` install. This has been reflected in the Linux files in `bin` but not in the Windows `cmd` files.

The first fix improves the way the `jars` directory is found, as this was stopping the Windows version of the `pip/conda` install from working; JARs were not found on Session/Context setup.

The second fix adds a `find-spark-home.cmd` script, which uses the `find_spark_home.py` script, as the Linux version does, to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to the `cmd` script language limitations. If the environment variable is set, the Python script `find_spark_home.py` will not be run. The process can fail if Python is not installed, but this path will mostly be used when PySpark is installed via `pip/conda`, and thus there is some Python in the system.

How was this patch tested?

Tested on a local installation.
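The resolution flow described above can be summarised as follows. This is a hedged Python sketch of the described behaviour only (the real implementation is the `find-spark-home.cmd` batch script; `spark_home` is a made-up helper name):

```python
import os

def spark_home(bin_dir, env):
    """Summarise the described flow: an explicit SPARK_HOME wins; without
    find_spark_home.py next to the scripts we are not pip-installed and
    use bin_dir's parent; otherwise find_spark_home.py resolves it."""
    if env.get("SPARK_HOME"):
        return env["SPARK_HOME"]
    if not os.path.exists(os.path.join(bin_dir, "find_spark_home.py")):
        return os.path.abspath(os.path.join(bin_dir, os.pardir))
    return "<resolved by find_spark_home.py>"  # placeholder for the pip case
```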