
[SPARK-5161] Parallelize Python test execution #7031

Closed
wants to merge 16 commits

Conversation

JoshRosen
Contributor

This commit parallelizes the Python unit test execution, significantly reducing Jenkins build times. Parallelism is now configurable by passing the -p or --parallelism flag to either dev/run-tests or python/run-tests (the default parallelism is 4, but I've successfully tested with higher parallelism).

To avoid flakiness, I've disabled the Spark Web UI for the Python tests, similar to what we've done for the JVM tests.
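The scheme can be sketched roughly as follows. This is an illustrative Python sketch, not the actual dev/run-tests code; the test commands are trivial stand-ins so the snippet runs anywhere, and SPARK_TESTING is used here only as an example of per-worker environment setup:

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_test(test_command, env):
    """Run one test module in a subprocess and return its exit code."""
    proc = subprocess.run([sys.executable, "-c", test_command],
                          env=env, capture_output=True)
    return test_command, proc.returncode

# Mirror the PR's approach: fork test modules concurrently, with the
# environment configured so concurrent SparkContexts don't fight over ports.
env = dict(os.environ, SPARK_TESTING="1")

# Stand-ins for real pyspark test module invocations.
tests = ["import sys; sys.exit(0)", "print('ok')"]

parallelism = 4  # the PR's default; configurable via -p / --parallelism
with ThreadPoolExecutor(max_workers=parallelism) as pool:
    results = list(pool.map(lambda t: run_test(t, env), tests))

failed = [cmd for cmd, code in results if code != 0]
```

Because each test module runs in its own subprocess, a thread pool (rather than a process pool) is enough: the threads spend their time blocked on subprocess I/O, not on Python bytecode.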

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35820 has finished for PR 7031 at commit 78fd0be.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35823 has finished for PR 7031 at commit 6120860.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35834 has finished for PR 7031 at commit 3064eb1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35838 has finished for PR 7031 at commit d81119e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

Interesting, it looks like the test failures are not a direct consequence of the parallelization since they still occur even if I only use one thread:

======================================================================
ERROR: test_termination_sigterm (__main__.DaemonTests)
Ensure that daemon and workers terminate on SIGTERM.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1451, in test_termination_sigterm
    self.do_termination_test(lambda daemon: os.kill(daemon.pid, SIGTERM))
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1424, in do_termination_test
    daemon = Popen([sys.executable, daemon_path], stdin=PIPE, stdout=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

======================================================================
ERROR: test_termination_stdin (__main__.DaemonTests)
Ensure that daemon and workers terminate when stdin is closed.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1446, in test_termination_stdin
    self.do_termination_test(lambda daemon: daemon.stdin.close())
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1424, in do_termination_test
    daemon = Popen([sys.executable, daemon_path], stdin=PIPE, stdout=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

----------------------------------------------------------------------
Ran 97 tests in 102.677s

FAILED (errors=2)
   Random listing order was used

@JoshRosen
Contributor Author

Also, it looks like my attempt to disable the Web UI in the Python tests was unsuccessful, since the unit-tests.log file still contains a bunch of port-contention errors while starting the UI:

java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:444)
    at sun.nio.ch.Net.bind(Net.java:436)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
    at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.server.Server.doStart(Server.java:293)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:228)
    at org.apache.spark.ui.JettyUtils$$anonfun$2.apply(JettyUtils.scala:238)
    at org.apache.spark.ui.JettyUtils$$anonfun$2.apply(JettyUtils.scala:238)
    at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1985)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1976)
    at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238)
    at org.apache.spark.ui.WebUI.bind(WebUI.scala:117)
    at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:448)
    at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:448)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
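The retry behavior visible in the trace (Utils.startServiceOnPort walking a range of ports until one binds) can be mimicked in plain Python; this is a simplified analogue for illustration, not Spark's actual code:

```python
import errno
import socket

def start_service_on_port(start_port, max_retries=16):
    """Try successive ports until one binds, like Spark's
    Utils.startServiceOnPort seen in the stack trace above."""
    for offset in range(max_retries):
        port = start_port + offset
        sock = socket.socket()
        try:
            sock.bind(("127.0.0.1", port))
            return sock, port
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise
    raise OSError("no free port after %d retries" % max_retries)

# Occupy an OS-chosen free port, then show the retry falling through past it.
holder = socket.socket()
holder.bind(("127.0.0.1", 0))
base = holder.getsockname()[1]
sock, port = start_service_on_port(base)  # base is taken, so a later port wins
sock.close()
holder.close()
```

When the UI is actually disabled (e.g. by setting spark.ui.enabled to false on the test SparkContexts), this bind never happens at all, which is the behavior the PR was aiming for; the BindExceptions in the log show the conf wasn't taking effect.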

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35841 has finished for PR 7031 at commit 29ab78d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35847 has finished for PR 7031 at commit 2b51724.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

These tests are failing because we're now attempting to run the PySpark ML tests with PyPy, which won't work since we don't have numpy for PyPy. Let me see if I can come up with a reasonably clean way to handle this.
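One clean way to handle it is a per-interpreter blacklist on the test module descriptor, roughly along the lines of the Module class that the test-build summaries mention. Field and method names here are a plausible sketch, not necessarily Spark's exact API:

```python
class Module(object):
    """Illustrative sketch of a per-module test descriptor; the real class
    in Spark's dev/ test support is more elaborate."""
    def __init__(self, name, python_test_goals=(),
                 blacklisted_python_implementations=()):
        self.name = name
        self.python_test_goals = list(python_test_goals)
        self.blacklisted_python_implementations = set(
            blacklisted_python_implementations)

    def runnable_under(self, implementation):
        """Skip this module on interpreters where its deps are unavailable."""
        return implementation not in self.blacklisted_python_implementations

mllib = Module("pyspark-mllib",
               python_test_goals=["pyspark.mllib.tests"],
               blacklisted_python_implementations=["PyPy"])  # needs numpy
```

The test scheduler would then consult runnable_under(platform.python_implementation()) before queueing a (module, interpreter) pair.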

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35866 has finished for PR 7031 at commit c022b47.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35871 has finished for PR 7031 at commit f960ee5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35880 has finished for PR 7031 at commit 5f2d295.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35877 has finished for PR 7031 at commit 3279c34.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35883 has finished for PR 7031 at commit facfafe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 27, 2015

Test build #35890 has finished for PR 7031 at commit 17e52c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

Yay, this passed tests! I think we should merge my other PR, #6967, first, then loop back to review the parallelism-specific changes here. With 4 threads / processes of parallelism, the Python tests ran in ~10 minutes, whereas they usually take 30+ minutes in Jenkins. In principle, our tests will support a much higher level of parallelism, so I bet we could cut this down to 5 minutes or less.

@JoshRosen
Contributor Author

It's also worth figuring out why the Spark UI doesn't seem to be disabled properly according to unit-tests.log, since problems there could be a source of flakiness once we aggressively bump up the parallelism.

@JoshRosen
Contributor Author

I wonder if I can just add a touch core/target/test-reports/dummy-test-report.xml step to Jenkins in order to prevent the build from failing for PRs that don't run JVM tests. Let me try...
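The workaround being described would amount to a Jenkins build step along these lines; this is a guess at the step, not the exact Jenkins configuration:

```shell
# Create an empty placeholder report so the 'Publish JUnit test result
# report' post-build step finds at least one file even when a PR runs
# no JVM tests. (Whether the publisher accepts an empty file is exactly
# what this experiment tests.)
mkdir -p core/target/test-reports
touch core/target/test-reports/dummy-test-report.xml
```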

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 28, 2015

Test build #35925 has finished for PR 7031 at commit 110cd9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Looks like that didn't help:

Build step 'Publish JUnit test result report' changed build result to UNSTABLE
Finished: UNSTABLE

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 28, 2015

Test build #35927 has finished for PR 7031 at commit 110cd9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 29, 2015

Test build #36011 has finished for PR 7031 at commit 110cd9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

duration = time.time() - start_time
with LOG_FILE_LOCK:
    with open(LOG_FILE, 'ab') as log_file:
        per_test_output.seek(0)

@JoshRosen
Contributor Author

TODO: only need to print the log when the test fails.
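The TODO amounts to buffering each test's output and only echoing it to the console on failure. A minimal sketch of that pattern (illustrative, not the final Spark code):

```python
import io
import sys

def run_and_report(test_fn):
    """Buffer a test's output; echo it to the console only if it fails."""
    per_test_output = io.StringIO()
    try:
        test_fn(per_test_output)
        failed = False
    except Exception as exc:
        per_test_output.write("FAILED: %s\n" % exc)
        failed = True
    if failed:
        # Only surface the captured log when the test actually failed.
        per_test_output.seek(0)
        sys.stderr.write(per_test_output.read())
    return not failed

def passing(out):
    out.write("quiet success\n")  # stays buffered, never printed

def failing(out):
    raise ValueError("boom")      # triggers the log dump

ok = run_and_report(passing)
bad = run_and_report(failing)
```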

@davies
Contributor

davies commented Jun 29, 2015

LGTM, once you've finished all the TODOs.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36056 has finished for PR 7031 at commit d4ded73.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen JoshRosen changed the title [SPARK-5161] [WIP] Parallelize Python test execution [SPARK-5161] Parallelize Python test execution Jun 30, 2015
@SparkQA

SparkQA commented Jun 30, 2015

Test build #36057 has finished for PR 7031 at commit feb3763.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36062 has finished for PR 7031 at commit feb3763.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36063 has finished for PR 7031 at commit feb3763.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Hey @davies, I've addressed the TODOs, so this should be ready for a final look and sign-off.

@davies
Contributor

davies commented Jun 30, 2015

LGTM, merging this into master, thanks!

Right now, all the Python tests finish in about 10 minutes, cool!

@asfgit closed this in 7bbbe38 Jun 30, 2015
@JoshRosen deleted the parallelize-python-tests branch June 30, 2015 04:38
asfgit pushed a commit that referenced this pull request Jun 30, 2015
This patch fixes a bug introduced in #7031 which can cause Jenkins to incorrectly report a build with failed Python tests as passing if an error occurred while printing the test failure message.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7112 from JoshRosen/python-tests-hotfix and squashes the following commits:

c3f2961 [Josh Rosen] Hotfix for bug in Python test failure reporting
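The bug class described above, and one way to guard against it, might look like this sketch: record the failure before attempting to print its log, so an exception raised while reporting (e.g. on undecodable log bytes) can't mask a failing build. Names and details here are illustrative, not the actual hotfix code:

```python
failures = []

def report_failure(name, log_bytes):
    """Printing the captured log can itself raise, e.g. on bad bytes."""
    print("Failed test: %s" % name)
    print(log_bytes.decode("utf-8"))

def record_result(name, retcode, log_bytes):
    if retcode != 0:
        failures.append(name)  # record the failure BEFORE any reporting
        try:
            report_failure(name, log_bytes)  # best-effort log printing
        except Exception as exc:
            print("could not print log for %s: %s" % (name, exc))

# 0xff is not valid UTF-8, so report_failure raises mid-print; the
# failure is still recorded and the build still exits non-zero.
record_result("pyspark.tests", 1, b"\xff\xfe not valid UTF-8")
exit_code = 1 if failures else 0
```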
@SparkQA

SparkQA commented Jun 30, 2015

Test build #981 timed out for PR 7031 at commit feb3763 after a configured wait of 175m.
