
[SPARK-2894] spark-shell doesn't accept flags #1825

Closed
wants to merge 7 commits

Conversation

@sarutak (Member) commented Aug 7, 2014

As @sryza reported, spark-shell doesn't accept any flags.
The root cause is incorrect usage of spark-submit in spark-shell, and it came to the surface with #1801.

@rxin (Contributor) commented Aug 7, 2014

Jenkins, add to whitelist.

@SparkQA commented Aug 7, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18100/consoleFull

@sryza (Contributor) commented Aug 7, 2014

This will allow spark-shell to take spark-submit options, but will remove its ability to take spark-shell-specific options (currently there's only one, "file"). I'm unclear on the best way to support both of these.

@pwendell (Contributor) commented Aug 7, 2014

Does the shell actually take flags? I didn't realize this when I OK'd #1801. If there are specific flags, we should trap them and pass them after spark-shell. For an example, see the code in the SQL shell:
https://github.com/apache/spark/blob/master/bin/spark-sql#L66
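
A minimal sketch of that trap-and-forward idea (this is not the actual bin/spark-sql code; the -i flag here stands in for whichever shell-specific flags exist):

SHELL_ARGS=()
SUBMIT_ARGS=()
while (($#)); do
  case "$1" in
    -i)                             # shell-specific flag that takes a value
      SHELL_ARGS+=("$1"); shift
      SHELL_ARGS+=("$1"); shift
      ;;
    *)                              # everything else is forwarded to spark-submit
      SUBMIT_ARGS+=("$1"); shift
      ;;
  esac
done
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "${SUBMIT_ARGS[@]}" spark-shell "${SHELL_ARGS[@]}"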

@andrewor14 (Contributor)

AFAIK spark-shell doesn't take any application-specific arguments, at least not documented ones. @sryza What file are you referring to?

@sryza (Contributor) commented Aug 7, 2014

org.apache.spark.repl.SparkRunnerSettings

@andrewor14 (Contributor)

Ah... looks like we need some special logic to filter that one out here.

@SparkQA commented Aug 7, 2014

QA results for PR 1825:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18100/consoleFull

@SparkQA commented Aug 7, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18118/consoleFull

@SparkQA commented Aug 7, 2014

QA results for PR 1825:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SPARK_SUBMIT_ARGS[@]} spark-shell ${SPARK_SHELL_ARGS[@]}
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SPARK_SUBMIT_ARGS[@]} spark-shell ${SPARK_SHELL_ARGS[@]}

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18118/consoleFull

@andrewor14 (Contributor)

It would be good to verify whether pyspark is affected too (see java_gateway.py)

@JoshRosen (Contributor)

There seems to be a similar PySpark issue, too (I just ran into it during testing):

https://issues.apache.org/jira/browse/SPARK-2110

args="$@"
inLoadFileFlag=0

for arg in $args;do
Contributor:

For consistency, could you use the same approach that we use in the spark-sql script?

Member Author:

O.K. I'll check it.

@liancheng (Contributor)

Maybe instead of filtering out all spark-shell (and pyspark) options, we can do the opposite of what I did in bin/spark-sql: filter out all spark-submit options, and pass all the others to the main class. At least we have control over the spark-submit option list.

I'll try to write a utility script to filter out spark-submit options today.
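
Roughly, such a utility would invert the earlier sketch: enumerate the spark-submit options and pass everything else through to the application. A sketch only, with a partial, hypothetical option list (the array names follow the ones that appear later in this thread):

SUBMISSION_OPTS=()
APPLICATION_OPTS=()
while (($#)); do
  case "$1" in
    --master | --deploy-mode | --class | --name | --conf)
      SUBMISSION_OPTS+=("$1"); shift    # a spark-submit option...
      SUBMISSION_OPTS+=("$1"); shift    # ...and its argument
      ;;
    --verbose | -v)
      SUBMISSION_OPTS+=("$1"); shift    # a spark-submit flag with no argument
      ;;
    *)
      APPLICATION_OPTS+=("$1"); shift   # anything else belongs to the application
      ;;
  esac
done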

@andrewor14 (Contributor)

@liancheng However, that means every time we want to change a spark-submit config, we need to change it in multiple places, which might make things harder to maintain. For spark-shell, aren't we only expecting one type of argument? I'm not super familiar with this functionality of passing a settings file to spark-shell, but I don't think this file path matches --*, so maybe we can filter this argument out that way. This might be simpler than enumerating all of spark-submit's flags.
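
A sketch of that heuristic (hypothetical; note that it would also misroute the values of flags such as --master local, which is part of why the PR ends up enumerating spark-submit's options instead):

SPARK_SUBMIT_ARGS=()
SPARK_SHELL_ARGS=()
for arg in "$@"; do
  if [[ $arg == --* ]]; then
    SPARK_SUBMIT_ARGS+=("$arg")   # looks like a spark-submit flag
  else
    SPARK_SHELL_ARGS+=("$arg")    # a bare path: treat it as the settings file
  fi
done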

@liancheng (Contributor)

@andrewor14 Ah OK, I had only glanced at repl.Main and SparkILoop before writing my last comment, and mistakenly thought spark-shell might accept all standard Scala REPL options.

@sarutak Thanks for helping to fix this. I'd also second what @pwendell suggested, i.e. using a while/case structure similar to the ones in bin/spark-sql and sbin/start-thriftserver.sh to filter out the -i <init-file> option; it can be easier to read. And we need to do the same thing for pyspark.

stty icanon echo > /dev/null 2>&1
else
extractLoadFileOpt "$@"
Contributor:

We should move this call before the if statement to cover the Cygwin branch. And please use 4-space indentation here :)

@liancheng (Contributor)

@JoshRosen I'm checking pyspark, but I'm not very familiar with it. Would you mind helping to confirm whether there are any other Python/IPython/PySpark-specific command line options that need to be handled?

@JoshRosen (Contributor)

@liancheng ipython has a ton of command line options that someone might hypothetically want to use (just take a look at ipython --help). Maybe try things out with a few of these options? I'm not sure which pyspark options users rely on.

@SparkQA commented Aug 8, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18190/consoleFull

@SparkQA commented Aug 8, 2014

QA results for PR 1825:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_ARGS[@]} spark-shell ${CLI_ARGS[@]}
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_ARGS[@]} spark-shell ${CLI_ARGS[@]}

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18190/consoleFull

case $1 in
-i)
CLI_ARGS+=($1); shift
CLI_ARGS+=($1); shift
Contributor:

Indentation is off.

Member Author:

Sorry, I included tabs.

@SparkQA commented Aug 8, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18197/consoleFull

@SparkQA commented Aug 8, 2014

QA results for PR 1825:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_ARGS[@]} spark-shell ${CLI_ARGS[@]}
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_ARGS[@]} spark-shell ${CLI_ARGS[@]}

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18197/consoleFull

@andrewor14 (Contributor)

This failed a flaky test. Retest this please.

@SparkQA commented Aug 8, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18211/consoleFull

@liancheng (Contributor)

@JoshRosen Thanks. Then I think gathering all spark-submit options, rather than the options of spark-shell/pyspark or any other potential application, is more scalable, since we have control over these options. I opened a WIP branch for this, and it seems we can fix both pyspark (Python, IPython, and the deprecated pyspark app.py style) and spark-shell easily.

@SparkQA commented Aug 9, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18254/consoleFull

exit 1;
fi
SUBMISSION_OPTS+=($1); shift
SUBMISSION_OPTS+=($1); shift
Contributor:

The following case fails because this line doesn't handle quoted strings with spaces properly:

./bin/pyspark app.py --master spark://lian-laptop.local:7077 --name "awesome name"

A possible fix is to replace this line with the trick bin/pyspark uses:

whitespace="[[:space:]]"
i=$1
if [[ $1 =~ \" ]]; then i=$(echo $1 | sed 's/\"/\\\"/g'); fi
if [[ $1 =~ $whitespace ]]; then i=\"$1\"; fi
SUBMISSION_OPTS+=($i); shift

Member Author:

Thanks for pointing that out; the newer commits address it.
I think the main reason is that $1 is not double-quoted in utils.sh.

Contributor:

Hmm, I'm afraid adding double quotes alone doesn't help; the case I mentioned above still fails, so the pyspark trick can be helpful. And we only need to deal with the argument part of those options that take an argument.

Contributor:

Actually, double quotes should be sufficient here, since they preserve whitespace by default. The only reason we needed the special handling for pyspark is that we have to pass the string literal in as an environment variable; otherwise Python won't know how to split our arguments. We don't need to do the same here because the arguments are already split by bash.
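
A standalone demonstration of the difference (hypothetical variable names):

set -- --name "awesome name"
UNQUOTED=()
QUOTED=()
for arg in "$@"; do
  UNQUOTED+=($arg)    # unquoted: "awesome name" word-splits into two elements
  QUOTED+=("$arg")    # quoted: whitespace is preserved
done
echo ${#UNQUOTED[@]}  # prints 3
echo ${#QUOTED[@]}    # prints 2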

@liancheng (Contributor)

Hey @sarutak, thanks for fixing this :) I did some tests locally, and the only issue I found is the quoted-string case I just commented on. Otherwise LGTM.

APPLICATION_OPTS=()
while (($#)); do
case "$1" in
--master | --deploy-mode | --class | --name | --jars | --py-files | --files | \
Contributor:

We should add a note here that says something like "All changes here must be reflected in SparkSubmitArguments.scala".

Contributor:

And add a similar comment in SparkSubmitArguments.scala too. It would be bad if these two went out of sync.
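
For concreteness, the suggested cross-reference might read something like this (hypothetical wording):

# NOTE: If you add, remove, or rename a spark-submit option below, make the
# matching change in SparkSubmitArguments.scala; the two lists must stay in sync.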

@andrewor14 (Contributor)

I have tested this locally and was able to get both spark-shell and pyspark working as before. I also tried it with quoted arguments and backslashes, and the arguments are still propagated properly. I haven't tried the settings file for spark-shell, however. Pending a few comments, this LGTM, and we should get it in quickly if possible.

Added comments in utils.sh and SparkSubmitArguments.scala noting that both files should be modified simultaneously
@SparkQA commented Aug 9, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18257/consoleFull

Added comments in utils.sh and SparkSubmitArguments.scala noting that both files should be modified simultaneously
@SparkQA commented Aug 9, 2014

QA tests have started for PR 1825. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18261/consoleFull

@andrewor14 (Contributor)

LGTM, thanks @sarutak and @liancheng.

@pwendell (Contributor) commented Aug 9, 2014

Okay cool - we can merge this once the tests pass.

@pwendell (Contributor)

Tests passed, so I'm going to merge it.

asfgit pushed a commit that referenced this pull request Aug 10, 2014
As sryza reported, spark-shell doesn't accept any flags.
The root cause is incorrect usage of spark-submit in spark-shell, and it came to the surface with #1801.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1715, Closes #1864, and Closes #1861

Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:

47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
@asfgit asfgit closed this in 4f4a988 Aug 10, 2014
@JoshRosen (Contributor)

Does this work with Python? I ask because

IPYTHON=1 ./bin/pyspark --no-banner --no-confirm-exit

doesn't seem to respect my ipython flags. Am I using this right?

@andrewor14 (Contributor)

You need to set them through IPYTHON_OPTS.
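
For example, assuming the 1.x bin/pyspark behavior of passing $IPYTHON_OPTS through to ipython:

IPYTHON=1 IPYTHON_OPTS="--no-banner --no-confirm-exit" ./bin/pyspark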

@JoshRosen (Contributor)

Doh, I forgot about that. I tried a bunch of other Python configurations and in every case the behavior seems to match 1.0.2, which is great. Thanks!

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@sarutak sarutak deleted the SPARK-2894 branch April 11, 2015 05:22