[SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf #14959

zjffdu · 2016-09-05T09:32:02Z

What changes were proposed in this pull request?

The root cause that we would ignore SparkConf when launching JVM is that SparkConf require JVM to be created first. https://github.com/apache/spark/blob/master/python/pyspark/conf.py#L106
In this PR, I would defer the launching of JVM until SparkContext is created so that we can pass SparkConf to JVM correctly.

How was this patch tested?

Use the example code in the description of SPARK-17387,

$ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark import SparkContext
>>> from pyspark import SparkConf
>>> conf = SparkConf().set("spark.driver.memory", "4g")
>>> sc = SparkContext(conf=conf)

And verify the spark.driver.memory is correctly picked up.

...op/ -Xmx4g org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=4g pyspark-shell

SparkQA · 2016-09-05T11:08:18Z

Test build #64933 has finished for PR 14959 at commit f977761.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-05T11:42:52Z

Test build #64935 has finished for PR 14959 at commit 8ade76d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-05T13:05:51Z

Test build #64937 has finished for PR 14959 at commit 83a2c0b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2016-09-06T18:58:51Z

python/pyspark/conf.py

@@ -101,13 +101,31 @@ def __init__(self, loadDefaults=True, _jvm=None, _jconf=None):
            self._jconf = _jconf
        else:
            from pyspark.context import SparkContext
-            SparkContext._ensure_initialized()


Just a thought, would it be possible to just add the --conf args to the env variable before this is called if the jvm hasn't been created yet? Then you wouldn't need to propagate the confs all the way through to launch_gateway.

Do you mean to set env variable PYSPARK_SUBMIT_ARGS ?

Nevermind, this wouldn't even be possible here. I didn't realize that SparkConf can be created first, launching the jvm, and then configuration is added after.

vanzin · 2016-09-12T21:09:55Z

The conf code looks kinda nasty with the checks for whether a JVM has been set or not... I guess part of it is mandatory because otherwise this wouldn't work, but in particular, I'm not so sure the _set_jvm code is necessary.

If you just say self._conf = SparkConf(_jvm=self._jvm) in SparkContext, it should maintain the current behavior. Especially since the Scala SparkContext clones the original user config - and if I read your code correctly, you're not doing that here.

vanzin · 2016-09-12T21:10:37Z

Also someone else more familiar with pyspark (I know Holden has already looked), maybe @davies, should take a look.

zjffdu · 2016-09-18T05:23:30Z

@vanzin Thanks for your reviews. I just update the PR, but don't get your following statement mean. Can you explain it ? Thanks

Especially since the Scala SparkContext clones the original user config - and if I read your code correctly, you're not doing that here.

SparkQA · 2016-09-18T05:47:15Z

Test build #65551 has finished for PR 14959 at commit d85bf36.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-09-19T17:23:06Z

It means that if you do this:

val conf = new SparkConf()
val sc = new SparkContext(conf)

The internal SparkConf of the context will not be the same instance as conf. With the changes I reviewed, in the python case, the internal conf of the context would be the same instance as the user's, which is different behavior.

vanzin · 2016-09-19T18:47:13Z

python/pyspark/java_gateway.py

@@ -50,13 +50,18 @@ def launch_gateway():
        # proper classpath and settings from spark-env.sh
        on_windows = platform.system() == "Windows"
        script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
+        command = [os.path.join(SPARK_HOME, script)]
+        if conf and conf.getAll():
+            conf_items = [['--conf', '%s=%s' % (k, v)] for k, v in conf.getAll()]


No point in using list comprehension if it requires more code...

for k, v in conf.getAll(): command += ['--conf', '%s=%s' % (k, v)]

Correct, will fix it.

zjffdu · 2016-09-20T05:57:25Z

The internal SparkConf of the context will not be the same instance as conf.

@vanzin This is the existing implementation that python is different from scala. But I think it is correct. I guess the reason why in scala the internal SparkConf of the SparkContext is not the same instance as conf is to make sure changing SparkConf after SparkContext is created would not take effect. pyspark is the same in this perspective. Although in pyspark the internal SparkConf of SparkContext is the same instance as conf, changing conf after SparkContext is created would not take effect as it is guaranteed in scala side.

SparkQA · 2016-09-20T06:19:12Z

Test build #65639 has finished for PR 14959 at commit f3287aa.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-20T06:54:04Z

Test build #65641 has finished for PR 14959 at commit b07c574.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-20T12:03:36Z

Test build #65649 has finished for PR 14959 at commit 5e524f1.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-09-20T15:59:38Z

Are these tests flaky or is the failure related to this change? (Other PRs seem to be passing, so probably the latter?)

SparkQA · 2016-09-21T04:12:22Z

Test build #65699 has finished for PR 14959 at commit ad47e3f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zjffdu · 2016-09-21T04:32:44Z

I just fixed the unti test, @vanzin @davies @holdenk Mind to help review it ? Thanks

vanzin · 2016-09-22T17:15:47Z

LGTM. Let's see if others have any comments.

BryanCutler · 2016-09-26T20:08:45Z

python/pyspark/conf.py

+            self._jconf.set(key, unicode(value))
+        else:
+            # Don't use unicode for self._conf, otherwise we will get exception when launching jvm.
+            self._conf[key] = value


Could you just cast the value to unicode here also? Then it would be consistent with the Java class and you wouldn't need to change the doctest above.

As I mentioned in the comment, unicode would cause exception when launching jvm using spark-submit

hmm, I didn't get any exception. Here is what I tried

$ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python Python 2.7.12 (default, Jul 1 2016, 15:12:24) [GCC 5.4.0 20160609] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from pyspark import SparkContext >>> from pyspark import SparkConf >>> conf = SparkConf().set("spark.driver.memory", "4g") >>> sc = SparkContext(conf=conf) Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). 16/09/26 16:53:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/09/26 16:53:58 WARN Utils: Your hostname, resolves to a loopback address: 127.0.1.1; using *.*.*.* instead (on interface ) 16/09/26 16:53:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address >>> conf.get("spark.driver.memory") u'4g' >>> sc.getConf().get("spark.driver.memory") u'4g'

How would you hit this line if you are using spark-submit? Maybe I'm missing something

Fixed, It might be my last previous commits' issue.

BryanCutler · 2016-09-26T20:11:10Z

python/pyspark/conf.py


    def set(self, key, value):
        """Set a configuration property."""
-        self._jconf.set(key, unicode(value))
+        # Try to set self._jconf first if JVM is created, set self._conf if JVM is not created yet.
+        if self._jconf:


minor: I think it's generally better and would be more consistent with the rest of the code if you made these checks against a None value, here and other places in this PR. For example:

if self._jconf is not None:

BryanCutler · 2016-09-26T20:26:54Z

python/pyspark/conf.py

+                pairs.append((elem._1(), elem._2()))
+        else:
+            for k, v in self._conf.items():
+                pairs.append((k, v))


This could be simplified:

if self._jconf is not None: return [(elem._1(), elem._2()) for elem in self._jconf.getAll()] else: return self._conf.items()

BryanCutler · 2016-09-26T20:28:07Z

python/pyspark/conf.py

+        if self._jconf:
+            return self._jconf.toDebugString()
+        else:
+            return '\n'.join('%s=%s' % (k, v) for k, v in self._conf.items())


maybe add a unit test to make sure the 2 ways to make a debug string are the same?

They may be different, because _jconf has the extra configuration in jvm side (like spark-defaults.conf), while self._conf only has the configuration in python side.

BryanCutler · 2016-09-26T20:33:59Z

python/pyspark/java_gateway.py

@@ -41,7 +41,7 @@ def can_convert_list(self, obj):
 ListConverter.can_convert = can_convert_list


-def launch_gateway():
+def launch_gateway(conf=None):


It might be helpful to have a docstring with param description of conf here

BryanCutler · 2016-09-26T20:37:53Z

python/pyspark/conf.py

        return self

    def setExecutorEnv(self, key=None, value=None, pairs=None):
        """Set an environment variable to be passed to executors."""
        if (key is not None and pairs is not None) or (key is None and pairs is None):
            raise Exception("Either pass one key-value pair or a list of pairs")
        elif key is not None:
-            self._jconf.setExecutorEnv(key, value)
+            self.set("spark.executorEnv." + key, value)


since you duplicated the property prefix "spark.executorEnv." from the Scala file, it might be good to make a unit test that ensures calling setExecutorEnv in both Python and Scala/Java actually sets a property with the same prefix.

BryanCutler · 2016-09-26T20:39:54Z

I added a few comments @zjffdu. I also tested this PR out and looks good

vanzin · 2016-10-04T16:43:42Z

@zjffdu could you update the PR? Thanks!

holdenk · 2016-10-05T19:19:29Z

I'd also love to see this updated so we can finish the review and make it easier for people to use PySpark this way :)

…ark-submit ignores user conf

SparkQA · 2016-10-09T02:13:31Z

Test build #66587 has finished for PR 14959 at commit dbc4bb4.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-09T02:56:04Z

Test build #66588 has finished for PR 14959 at commit 1972714.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zjffdu · 2016-10-09T03:07:10Z

@vanzin @holdenk @BryanCutler PR is updated, please help review.

holdenk · 2016-10-09T03:19:06Z

Awesome, thanks for updating. I'm at PyData this weekend so will be a bit slow on my end.

BryanCutler · 2016-10-10T17:36:17Z

python/pyspark/context.py

@@ -121,7 +121,15 @@ def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
    def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
                 conf, jsc, profiler_cls):
        self.environment = environment or {}
-        self._conf = conf or SparkConf(_jvm=self._jvm)
+        # java gateway must have been launched at this point.
+        if conf is not None and conf._jconf:


Maybe also change this to and conf._jconf is not None

BryanCutler · 2016-10-10T17:36:49Z

python/pyspark/conf.py

+        if self._jconf is not None:
+            self._jconf.set(key, unicode(value))
+        else:
+            # Don't use unicode for self._conf, otherwise we will get exception when launching jvm.


you can remove this comment now

BryanCutler · 2016-10-10T17:38:35Z

Thanks for the updates @zjffdu, just 2 minor comments otherwise LGTM!

SparkQA · 2016-10-11T01:56:52Z

Test build #66700 has finished for PR 14959 at commit d692e71.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-10-11T20:11:18Z

retest this please

SparkQA · 2016-10-11T20:55:59Z

Test build #66758 has finished for PR 14959 at commit d692e71.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-10-11T21:56:05Z

LGTM, merging to master.

…ark-submit ignores user conf ## What changes were proposed in this pull request? The root cause that we would ignore SparkConf when launching JVM is that SparkConf require JVM to be created first. https://github.com/apache/spark/blob/master/python/pyspark/conf.py#L106 In this PR, I would defer the launching of JVM until SparkContext is created so that we can pass SparkConf to JVM correctly. ## How was this patch tested? Use the example code in the description of SPARK-17387, ``` $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python Python 2.7.12 (default, Jul 1 2016, 15:12:24) [GCC 5.4.0 20160609] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from pyspark import SparkContext >>> from pyspark import SparkConf >>> conf = SparkConf().set("spark.driver.memory", "4g") >>> sc = SparkContext(conf=conf) ``` And verify the spark.driver.memory is correctly picked up. ``` ...op/ -Xmx4g org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=4g pyspark-shell ``` Author: Jeff Zhang <zjffdu@apache.org> Closes apache#14959 from zjffdu/SPARK-17387.

BryanCutler reviewed Sep 6, 2016
View reviewed changes

zjffdu force-pushed the SPARK-17387 branch from 83a2c0b to d85bf36 Compare September 18, 2016 05:23

vanzin reviewed Sep 19, 2016

View reviewed changes

zjffdu force-pushed the SPARK-17387 branch from d85bf36 to f3287aa Compare September 20, 2016 06:12

zjffdu force-pushed the SPARK-17387 branch from f3287aa to b07c574 Compare September 20, 2016 06:29

zjffdu closed this Sep 20, 2016

zjffdu reopened this Sep 20, 2016

zjffdu force-pushed the SPARK-17387 branch from b07c574 to 5e524f1 Compare September 20, 2016 11:41

BryanCutler reviewed Sep 26, 2016

View reviewed changes

zjffdu added 6 commits October 9, 2016 09:52

[SPARK-17387][PYSPARK] Creating SparkContext() from python without sp…

787ed07

…ark-submit ignores user conf

fix code style

19eb240

fix unit test

7bca619

address comment

325ba71

fix unit test

005a90e

address comments

dbc4bb4

zjffdu force-pushed the SPARK-17387 branch from ad47e3f to dbc4bb4 Compare October 9, 2016 02:07

fix code style

1972714

BryanCutler reviewed Oct 10, 2016

View reviewed changes

address comments

d692e71

asfgit closed this in 5b77e66 Oct 11, 2016

[SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf #14959

[SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf #14959

Conversation

zjffdu commented Sep 5, 2016 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented Sep 5, 2016

SparkQA commented Sep 5, 2016

SparkQA commented Sep 5, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vanzin commented Sep 12, 2016

vanzin commented Sep 12, 2016

zjffdu commented Sep 18, 2016

SparkQA commented Sep 18, 2016

vanzin commented Sep 19, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zjffdu commented Sep 20, 2016 • edited Loading

SparkQA commented Sep 20, 2016

SparkQA commented Sep 20, 2016

SparkQA commented Sep 20, 2016

vanzin commented Sep 20, 2016

SparkQA commented Sep 21, 2016

zjffdu commented Sep 21, 2016 • edited Loading

vanzin commented Sep 22, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BryanCutler commented Sep 26, 2016

vanzin commented Oct 4, 2016

holdenk commented Oct 5, 2016

SparkQA commented Oct 9, 2016

SparkQA commented Oct 9, 2016

zjffdu commented Oct 9, 2016

holdenk commented Oct 9, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BryanCutler commented Oct 10, 2016

SparkQA commented Oct 11, 2016

vanzin commented Oct 11, 2016

SparkQA commented Oct 11, 2016

vanzin commented Oct 11, 2016

zjffdu commented Sep 5, 2016 •

edited

Loading

vanzin commented Sep 19, 2016 •

edited

Loading

zjffdu commented Sep 20, 2016 •

edited

Loading

zjffdu commented Sep 21, 2016 •

edited

Loading