
[SPARK-22495] Fix setup of SPARK_HOME variable on Windows #19370

Closed

Conversation

jsnowacki
Contributor

@jsnowacki jsnowacki commented Sep 27, 2017

What changes were proposed in this pull request?

This PR fixes how SPARK_HOME is resolved on Windows. While the previous version worked with the built release download, the directory layout changed slightly for the PySpark pip or conda install. This has been reflected in the Linux scripts in bin but not in the Windows cmd files.

The first fix improves how the jars directory is found, as this was stopping the Windows pip/conda install from working; JARs were not found on Session/Context setup.

The second fix adds a find-spark-home.cmd script which, like the Linux version, uses the find_spark_home.py script to resolve SPARK_HOME. It is based on the find-spark-home bash script, though some operations are done in a different order due to limitations of the cmd script language. If the SPARK_HOME environment variable is already set, the Python script find_spark_home.py will not be run. The process can fail if Python is not installed, but this path is mostly taken when PySpark is installed via pip/conda, in which case some Python is present on the system.
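For context, a minimal Python sketch (illustrative only, not the actual find_spark_home.py source; the logic is simplified) of the kind of resolution the script performs:

# Simplified illustration of SPARK_HOME resolution; the real find_spark_home.py
# shipped with PySpark checks more candidate locations and edge cases.
import os
import sys


def find_spark_home():
    # An explicitly set SPARK_HOME always wins; the bin scripts skip the
    # Python lookup entirely in that case.
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    # Candidate directories: next to this script and the installed pyspark package.
    candidates = [os.path.dirname(os.path.realpath(sys.argv[0]))]
    try:
        import pyspark
        candidates.append(os.path.dirname(pyspark.__file__))
    except ImportError:
        pass
    for path in candidates:
        # A usable SPARK_HOME is a directory that contains the Spark jars.
        if os.path.isdir(os.path.join(path, "jars")):
            return path
    sys.exit("Could not find a valid SPARK_HOME")


if __name__ == "__main__":
    # The bin scripts capture this printed path into the SPARK_HOME variable.
    print(find_spark_home())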

How was this patch tested?

Tested on local installation.

@HyukjinKwon
Member

ok to test

bin/pyspark2.cmd Outdated
set SPARK_HOME=%~dp0..
set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
if exist "%FIND_SPARK_HOME_SCRIPT%" (
for /f %%i in ('python %FIND_SPARK_HOME_SCRIPT%') do set SPARK_HOME=%%i
Member

Mind adding some comments? I believe we resemble this here:

  # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
  # need to search the different Python directories for a Spark installation.
  # Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
  # spark-submit in another directory we want to use that version of PySpark rather than the
  # pip installed version of PySpark.
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
  # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
  # Default to standard python interpreter unless told otherwise
  if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
    PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
  fi
  export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")

which detects find_spark_home.py, which should be included in the pip installation:

spark/python/setup.py

Lines 143 to 145 in aad2125

# We add find_spark_home.py to the bin directory we install so that pip installed PySpark
# will search for SPARK_HOME with Python.
scripts.append("pyspark/find_spark_home.py")

It'd be nicer if the PR description explained this.

Contributor Author

Thanks for the suggestion; I will add them.

@HyukjinKwon
Member

cc @holdenk, @felixcheung and @ueshin who I believe are interested in this.

@SparkQA

SparkQA commented Sep 28, 2017

Test build #82258 has finished for PR 19370 at commit 0b12975.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@felixcheung felixcheung left a comment

Please include the background of the issue and the approach for the fix in this PR description.

rem Figure out where the Spark framework is installed
set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
if exist "%FIND_SPARK_HOME_SCRIPT%" (
for /f %%i in ('python %FIND_SPARK_HOME_SCRIPT%') do set SPARK_HOME=%%i
Member

I'm not sure if python would be the right one (python2, python3, is it in PATH)?
and I don't think cmd /c foo.py works either

Contributor Author

The assumption was that if find_spark_home.py can be found in the local folder, this will be a good setup. While I think find_spark_home.py works with any Python version (well, at least modern 2 and 3), indeed this does not take into account the case where python is not there at all.

Member

I guess python is generally expected on *nix but quite possibly absent on Windows.

@jsnowacki
Contributor Author

I'll move the code out to a separate find_spark_home.cmd, similar to the bash version, which will take into account the above comments.

@felixcheung
Member

I'm not sure about duplicating the entire logic of find-spark-home into a Windows version though - given how poorly maintained things are on Windows (to be honest) I worry about any addition of code.

@jsnowacki
Contributor Author

I don't want to move the Python logic into CMD. I'll move the current checks I've added to a separate file and extend them to cover one or two additional cases, similar to the bash script find-spark-home. This will put the business logic of finding Spark on Windows into one file, which is better for maintenance. In the long run I'd suggest moving to a more portable solution like the Python console scripts I suggested above.

@HyukjinKwon
Member

Moving it to a Python script sounds like a good idea to me if possible in the future. I did have the same concern as @felixcheung, but to me it's okay to add it after reading the comment above.

Let's add more detailed comments in the new script for follow-ups in the future.

@felixcheung
Member

what is this python console scripts idea, @jsnowacki ?

@jsnowacki
Contributor Author

I've explained it in the ticket https://issues.apache.org/jira/browse/SPARK-18136.

I think in the long run it would be better to move to a Python packaging mechanism like console_scripts or similar, as it will provide better multi-platform support; see https://packaging.python.org/tutorials/distributing-packages/#scripts and https://setuptools.readthedocs.io/en/latest/setuptools.html#automatic-script-creation. I'll create a separate issue with the improvement proposal.
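For illustration, a minimal sketch of what such a console_scripts entry point could look like; the entry-point name and the pyspark.find_spark_home:main target are hypothetical, not the actual PySpark packaging configuration:

# Hypothetical setup.py fragment illustrating the console_scripts idea;
# the entry-point name and target function are examples, not what Spark ships.
from setuptools import setup

setup(
    name="pyspark-example",
    version="0.0.1",
    packages=["pyspark"],
    entry_points={
        "console_scripts": [
            # setuptools generates a platform-appropriate launcher for each entry,
            # including an .exe shim on Windows, so no hand-written .cmd wrapper
            # is needed.
            "find-spark-home = pyspark.find_spark_home:main",
        ],
    },
)

pip would then put a find-spark-home launcher (find-spark-home.exe on Windows) on PATH, which is what would make this approach more portable than per-platform shell and cmd wrappers.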

@SparkQA

SparkQA commented Oct 2, 2017

Test build #82380 has finished for PR 19370 at commit 3ac2c86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jsnowacki
Contributor Author

I've incorporated all of your comments and moved the CMD logic for finding SPARK_HOME into a separate script. It has been tested locally and works fine. It will complain if Python is not installed, but I don't have a better way around it really. It will try to run it only when find_spark_home.py is present in the current directory, so I assume this will only happen when it is installed via pip, in which case there should be a python somewhere on the system.

@SparkQA

SparkQA commented Oct 2, 2017

Test build #82396 has finished for PR 19370 at commit d62ae59.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@@ -17,6 +17,13 @@ rem See the License for the specific language governing permissions and
rem limitations under the License.
rem

set SPARK_HOME=%~dp0..
rem Figure out where the Spark framework is installed
set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
Member

Shouldn't we change this one too?

Contributor Author

Yes, indeed, fixing.

@SparkQA

SparkQA commented Oct 3, 2017

Test build #82405 has finished for PR 19370 at commit d62ae59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -18,7 +18,7 @@ rem limitations under the License.
rem
Member

@felixcheung felixcheung Oct 3, 2017

it looks like we should add this file to the appveyor list...

Contributor Author

@jsnowacki jsnowacki Oct 3, 2017

Do I need to take any action on this, or just wait for it to be approved?

Member

@HyukjinKwon HyukjinKwon Oct 3, 2017

We could add - bin/*.cmd here:

spark/appveyor.yml

Lines 27 to 35 in 828fab0

only_commits:
  files:
    - appveyor.yml
    - dev/appveyor-install-dependencies.ps1
    - R/
    - sql/core/src/main/scala/org/apache/spark/sql/api/r/
    - core/src/main/scala/org/apache/spark/api/r/
    - mllib/src/main/scala/org/apache/spark/ml/r/
    - core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala

@jsnowacki
Contributor Author

Done altering run-example.cmd.

@HyukjinKwon
Member

add to whitelist

@SparkQA

SparkQA commented Oct 3, 2017

Test build #82414 has finished for PR 19370 at commit 5ad1590.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 3, 2017

Test build #82416 has finished for PR 19370 at commit 5ad1590.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jsnowacki
Contributor Author

I've added - bin/*.cmd to the AppVeyor file. Please let me know if this is sufficient.

@HyukjinKwon
Member

@jsnowacki, would you mind if I ask you to squash those commits into a single one, so that we can check whether the squashed commit, having the changes in appveyor.yml and *.cmd, actually triggers the AppVeyor test?

@HyukjinKwon
Member

HyukjinKwon commented Oct 6, 2017

Otherwise, looks good to me although I have to double check and test.

@jsnowacki
Contributor Author

@HyukjinKwon Commit squashed to one as you've requested.

@HyukjinKwon
Member

Yup, it looks like it's triggering fine - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1822-master although I wonder why the check mark does not appear. I think it is not specific to this PR but rather AppVeyor itself though.

@SparkQA

SparkQA commented Oct 6, 2017

Test build #82508 has finished for PR 19370 at commit 5f52c79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@felixcheung felixcheung left a comment

LGTM, one nit/possible suggestion.
could we merge this to branch-2.2?

rem

rem Path to Python script finding SPARK_HOME
set FIND_SPARK_HOME_PYTHON_SCRIPT=%~dp0find_spark_home.py
Member

I think this fails rather ungracefully if python is not in the path? Should we add some checks for that?

Member

@HyukjinKwon HyukjinKwon Nov 20, 2017

I manually tested, and it looks like it's going to give a message like this:

C:\...>pyspark
'python' is not recognized as an internal or external command, operable program or batch file.

which seems roughly fine, though it's a bit ugly.

So, I googled a possible approach, for example, https://superuser.com/a/718194. However, it seems where does not recognise an absolute path, for example, C:\...\Python27\python.exe. So, it looks like we should combine it with the exist keyword.

@jsnowacki, if you are active now and know a simpler, better way, we could definitely try. Or, probably, we could go ahead as is too.

Contributor Author

@jsnowacki jsnowacki Nov 20, 2017

The only idea I have at the moment is adding something along the lines of:

rem If there is python installed, try to use the root dir as SPARK_HOME
where %PYTHON_RUNNER%
if %ERRORLEVEL% neq 0 (
  echo %PYTHON_RUNNER% wasn't found
  if "x%SPARK_HOME%"=="x" (
    set SPARK_HOME=%~dp0..
  )
)

I need to add it in the main code, though, not in the conditional sub-block, as it doesn't work inside a block of code due to CMD's limitations regarding environment variable evaluation (variables in a parenthesized block are expanded when the block is parsed, unless delayed expansion is enabled).

@SparkQA

SparkQA commented Nov 19, 2017

Test build #84004 has finished for PR 19370 at commit c0138a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Yup, +1 for going ahead with branch-2.2.

@felixcheung
Member

felixcheung commented Nov 20, 2017 via email

@jsnowacki
Contributor Author

I've added this check using where; it is at least some protection against missing python with some more meaningful information returned.

if "x%SPARK_HOME%"=="x" (
set SPARK_HOME=%~dp0..
)
)
Member

Seems tab is used instead. Let's match it with spaces.

Contributor Author

Done

@SparkQA

SparkQA commented Nov 20, 2017

Test build #84038 has finished for PR 19370 at commit ed379b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if %ERRORLEVEL% neq 0 (
echo %PYTHON_RUNNER% wasn't found; Python doesn't seem to be installed
if "x%SPARK_HOME%"=="x" (
set SPARK_HOME=%~dp0..
Member

Seems there is a hidden tab ..

Contributor Author

Indeed, sorry. Now it should be fine.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84068 has finished for PR 19370 at commit 84d468b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84077 has finished for PR 19370 at commit a4d516f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if "x%SPARK_HOME%"=="x" (
set SPARK_HOME=%~dp0..
)
)
Member

I think the problems here from the last commit are:

  • now PYTHON_RUNNER can't be an absolute path as where does not work with it

    ERROR: Invalid pattern is specified in "path:pattern".
    C:\Python27\python.exe wasn't found; Python doesn't seem to be installed
    
  • It prints out the output from where:

    C:\...>pyspark
    C:\cygwin\bin\python
    C:\Python27\python.exe
    ...
    
  • and the error message looks no more useful than the previous one:

    python wasn't found; Python doesn't seem to be installed
    

@HyukjinKwon
Member

HyukjinKwon commented Nov 22, 2017

I suggest this:

rem Path to Python script finding SPARK_HOME
set FIND_SPARK_HOME_PYTHON_SCRIPT=%~dp0find_spark_home.py

rem Default to standard python interpreter unless told otherwise
set PYTHON_RUNNER=python
rem If PYSPARK_DRIVER_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_DRIVER_PYTHON%" =="x" (
  set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON%
)
rem If PYSPARK_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_PYTHON%" =="x" (
  set PYTHON_RUNNER=%PYSPARK_PYTHON%
)

rem If there is python installed, trying to use the root dir as SPARK_HOME
where %PYTHON_RUNNER% > nul 2>&1
if %ERRORLEVEL% neq 0 (
  if not exist %PYTHON_RUNNER% (
    if "x%SPARK_HOME%"=="x" (
      echo Missing Python executable '%PYTHON_RUNNER%', defaulting to '%~dp0..' for SPARK_HOME ^
environment variable. Please install Python or specify the correct Python executable in ^
PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
      set SPARK_HOME=%~dp0..
    )
  )
)

rem Only attempt to find SPARK_HOME if it is not set.
if "x%SPARK_HOME%"=="x" (
  if not exist "%FIND_SPARK_HOME_PYTHON_SCRIPT%" (
    rem If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
    rem need to search the different Python directories for a Spark installation.
    rem Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
    rem spark-submit in another directory we want to use that version of PySpark rather than the
    rem pip installed version of PySpark.
    set SPARK_HOME=%~dp0..
  ) else (
    rem We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
    for /f "delims=" %%i in ('%PYTHON_RUNNER% %FIND_SPARK_HOME_PYTHON_SCRIPT%') do set SPARK_HOME=%%i
  )
)

I manually tested each branch. This addresses the concern in #19370 (comment). The error message shows up like this:

C:\...>pyspark
Missing Python executable 'C:\foo\bar.exe', defaulting to 'C:\Python27\Scripts\..' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.

C:\...>pyspark
Missing Python executable 'foo', defaulting to 'C:\Python27\Scripts\..' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.

@HyukjinKwon
Member

In if not exist "%FIND_SPARK_HOME_PYTHON_SCRIPT%" (, I switched the condition to be matched with

elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then

@jsnowacki
Contributor Author

Thanks for looking into it again. I've followed your suggestions and updated the PR. It also seems to work for me now.

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84119 has finished for PR 19370 at commit b58f740.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Build started: [SparkR] ALL PR-19370
Diff: master...spark-test:A5AAFE0C-D09F-4FBE-B834-811E8AC9FF06

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

HyukjinKwon commented Nov 23, 2017

Seems there is a conflict while backporting/cherry-picking to branch-2.2. @jsnowacki, mind opening a backporting PR to branch-2.2 please?

I think this is important for many Windows users and I guess relatively low-risk.

@asfgit asfgit closed this in b4edafa Nov 23, 2017
asfgit pushed a commit that referenced this pull request Nov 24, 2017
## What changes were proposed in this pull request?

This is a cherry-pick of the original PR 19370 onto branch-2.2 as suggested in #19370 (comment).

This fixes how `SPARK_HOME` is resolved on Windows. While the previous version worked with the built release download, the directory layout changed slightly for the PySpark `pip` or `conda` install. This has been reflected in the Linux scripts in `bin` but not in the Windows `cmd` files.

The first fix improves how the `jars` directory is found, as this was stopping the Windows `pip/conda` install from working; JARs were not found on Session/Context setup.

The second fix adds a `find-spark-home.cmd` script which, like the Linux version, uses the `find_spark_home.py` script to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to limitations of the `cmd` script language. If the `SPARK_HOME` environment variable is already set, the Python script `find_spark_home.py` will not be run. The process can fail if Python is not installed, but this path is mostly taken when PySpark is installed via `pip/conda`, in which case some Python is present on the system.

## How was this patch tested?

Tested on local installation.

Author: Jakub Nowacki <j.s.nowacki@gmail.com>

Closes #19807 from jsnowacki/fix_spark_cmds_2.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018