
[SPARK-5161] Parallelize Python test execution #7031

Closed
wants to merge 16 commits

Conversation

JoshRosen
Contributor

This commit parallelizes the Python unit test execution, significantly reducing Jenkins build times. Parallelism is now configurable by passing the -p or --parallelism flag to either dev/run-tests or python/run-tests (the default parallelism is 4, but I've successfully tested with higher parallelism).

To avoid flakiness, I've disabled the Spark Web UI for the Python tests, similar to what we've done for the JVM tests.
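The scheme can be sketched roughly as follows. This is an illustrative Python sketch, not the actual dev/run-tests code; the test commands are trivial stand-ins so the snippet runs anywhere, and SPARK_TESTING is used here only as an example of per-worker environment setup:

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_test(test_command, env):
    """Run one test module in a subprocess and return its exit code."""
    proc = subprocess.run([sys.executable, "-c", test_command],
                          env=env, capture_output=True)
    return test_command, proc.returncode

# Mirror the PR's approach: fork test modules concurrently, with the
# environment configured so concurrent SparkContexts don't fight over ports.
env = dict(os.environ, SPARK_TESTING="1")

# Stand-ins for real pyspark test module invocations.
tests = ["import sys; sys.exit(0)", "print('ok')"]

parallelism = 4  # the PR's default; configurable via -p / --parallelism
with ThreadPoolExecutor(max_workers=parallelism) as pool:
    results = list(pool.map(lambda t: run_test(t, env), tests))

failed = [cmd for cmd, code in results if code != 0]
```

Because each test module runs in its own subprocess, a thread pool (rather than a process pool) is enough: the threads spend their time blocked on subprocess I/O, not on Python bytecode.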

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35820 has finished for PR 7031 at commit 78fd0be.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35823 has finished for PR 7031 at commit 6120860.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35834 has finished for PR 7031 at commit 3064eb1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35838 has finished for PR 7031 at commit d81119e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

Interesting, it looks like the test failures are not a direct consequence of the parallelization since they still occur even if I only use one thread:

======================================================================
ERROR: test_termination_sigterm (__main__.DaemonTests)
Ensure that daemon and workers terminate on SIGTERM.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1451, in test_termination_sigterm
    self.do_termination_test(lambda daemon: os.kill(daemon.pid, SIGTERM))
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1424, in do_termination_test
    daemon = Popen([sys.executable, daemon_path], stdin=PIPE, stdout=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

======================================================================
ERROR: test_termination_stdin (__main__.DaemonTests)
Ensure that daemon and workers terminate when stdin is closed.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1446, in test_termination_stdin
    self.do_termination_test(lambda daemon: daemon.stdin.close())
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 1424, in do_termination_test
    daemon = Popen([sys.executable, daemon_path], stdin=PIPE, stdout=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

----------------------------------------------------------------------
Ran 97 tests in 102.677s

FAILED (errors=2)
   Random listing order was used

@JoshRosen
Contributor Author

Also, it looks like my attempt to disable the Web UI in the Python tests was unsuccessful, since the unit-tests.log file still contains a bunch of port-contention errors while starting the UI:

java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:444)
    at sun.nio.ch.Net.bind(Net.java:436)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
    at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.server.Server.doStart(Server.java:293)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:228)
    at org.apache.spark.ui.JettyUtils$$anonfun$2.apply(JettyUtils.scala:238)
    at org.apache.spark.ui.JettyUtils$$anonfun$2.apply(JettyUtils.scala:238)
    at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1985)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1976)
    at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238)
    at org.apache.spark.ui.WebUI.bind(WebUI.scala:117)
    at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:448)
    at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:448)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
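The retry behavior visible in the trace (Utils.startServiceOnPort walking a range of ports until one binds) can be mimicked in plain Python; this is a simplified analogue for illustration, not Spark's actual code:

```python
import errno
import socket

def start_service_on_port(start_port, max_retries=16):
    """Try successive ports until one binds, like Spark's
    Utils.startServiceOnPort seen in the stack trace above."""
    for offset in range(max_retries):
        port = start_port + offset
        sock = socket.socket()
        try:
            sock.bind(("127.0.0.1", port))
            return sock, port
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise
    raise OSError("no free port after %d retries" % max_retries)

# Occupy an OS-chosen free port, then show the retry falling through past it.
holder = socket.socket()
holder.bind(("127.0.0.1", 0))
base = holder.getsockname()[1]
sock, port = start_service_on_port(base)  # base is taken, so a later port wins
sock.close()
holder.close()
```

When the UI is actually disabled (e.g. by setting spark.ui.enabled to false on the test SparkContexts), this bind never happens at all, which is the behavior the PR was aiming for; the BindExceptions in the log show the conf wasn't taking effect.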

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35841 has finished for PR 7031 at commit 29ab78d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35847 has finished for PR 7031 at commit 2b51724.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

These tests are failing because we're now attempting to run the PySpark ML tests with PyPy, which won't work since we don't have numpy for PyPy. Let me see if I can come up with a reasonably clean way to handle this.
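One clean way to handle it is a per-interpreter blacklist on the test module descriptor, roughly along the lines of the Module class that the test-build summaries mention. Field and method names here are a plausible sketch, not necessarily Spark's exact API:

```python
class Module(object):
    """Illustrative sketch of a per-module test descriptor; the real class
    in Spark's dev/ test support is more elaborate."""
    def __init__(self, name, python_test_goals=(),
                 blacklisted_python_implementations=()):
        self.name = name
        self.python_test_goals = list(python_test_goals)
        self.blacklisted_python_implementations = set(
            blacklisted_python_implementations)

    def runnable_under(self, implementation):
        """Skip this module on interpreters where its deps are unavailable."""
        return implementation not in self.blacklisted_python_implementations

mllib = Module("pyspark-mllib",
               python_test_goals=["pyspark.mllib.tests"],
               blacklisted_python_implementations=["PyPy"])  # needs numpy
```

The test scheduler would then consult runnable_under(platform.python_implementation()) before queueing a (module, interpreter) pair.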

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35866 has finished for PR 7031 at commit c022b47.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35871 has finished for PR 7031 at commit f960ee5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35880 has finished for PR 7031 at commit 5f2d295.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35877 has finished for PR 7031 at commit 3279c34.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35883 has finished for PR 7031 at commit facfafe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@SparkQA

SparkQA commented Jun 27, 2015

Test build #35890 has finished for PR 7031 at commit 17e52c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Module(object):

@JoshRosen
Contributor Author

Yay, this passed tests! I think we should merge my other PR, #6967, first, then loop back to review the parallelism-specific changes here. With 4 threads / processes of parallelism, the Python tests ran in ~10 minutes, whereas they usually take 30+ minutes in Jenkins. In principle, our tests will support a much higher level of parallelism, so I bet we could cut this down to 5 minutes or less.

@JoshRosen
Contributor Author

It's also worth figuring out why the Spark UI doesn't seem to be disabled properly according to unit-tests.log, since problems there could be a source of flakiness once we aggressively bump up the parallelism.

@JoshRosen
Contributor Author

I wonder if I can just add a touch core/target/test-reports/dummy-test-report.xml step to Jenkins in order to prevent the build from failing for PRs that don't run JVM tests. Let me try...
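The workaround being described would amount to a Jenkins build step along these lines; this is a guess at the step, not the exact Jenkins configuration:

```shell
# Create an empty placeholder report so the 'Publish JUnit test result
# report' post-build step finds at least one file even when a PR runs
# no JVM tests. (Whether the publisher accepts an empty file is exactly
# what this experiment tests.)
mkdir -p core/target/test-reports
touch core/target/test-reports/dummy-test-report.xml
```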

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 28, 2015

Test build #35925 has finished for PR 7031 at commit 110cd9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Looks like that didn't help:

Build step 'Publish JUnit test result report' changed build result to UNSTABLE
Finished: UNSTABLE

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 28, 2015

Test build #35927 has finished for PR 7031 at commit 110cd9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 29, 2015

Test build #36011 has finished for PR 7031 at commit 110cd9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

duration = time.time() - start_time
with LOG_FILE_LOCK:
    with open(LOG_FILE, 'ab') as log_file:
        per_test_output.seek(0)

@JoshRosen
Contributor Author

TODO: only need to print the log when the test fails.
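The TODO amounts to buffering each test's output and only echoing it to the console on failure. A minimal sketch of that pattern (illustrative, not the final Spark code):

```python
import io
import sys

def run_and_report(test_fn):
    """Buffer a test's output; echo it to the console only if it fails."""
    per_test_output = io.StringIO()
    try:
        test_fn(per_test_output)
        failed = False
    except Exception as exc:
        per_test_output.write("FAILED: %s\n" % exc)
        failed = True
    if failed:
        # Only surface the captured log when the test actually failed.
        per_test_output.seek(0)
        sys.stderr.write(per_test_output.read())
    return not failed

def passing(out):
    out.write("quiet success\n")  # stays buffered, never printed

def failing(out):
    raise ValueError("boom")      # triggers the log dump

ok = run_and_report(passing)
bad = run_and_report(failing)
```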

@davies
Contributor

davies commented Jun 29, 2015

LGTM, once you've finished all the TODOs.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36056 has finished for PR 7031 at commit d4ded73.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen JoshRosen changed the title [SPARK-5161] [WIP] Parallelize Python test execution [SPARK-5161] Parallelize Python test execution Jun 30, 2015
@SparkQA

SparkQA commented Jun 30, 2015

Test build #36057 has finished for PR 7031 at commit feb3763.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36062 has finished for PR 7031 at commit feb3763.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36063 has finished for PR 7031 at commit feb3763.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Hey @davies, I've addressed the TODOs, so this should be ready for a final look and sign-off.

@davies
Contributor

davies commented Jun 30, 2015

LGTM, merging this into master, thanks!

Right now, all the Python tests finish in about 10 minutes, cool!

@asfgit closed this in 7bbbe38 Jun 30, 2015
@JoshRosen deleted the parallelize-python-tests branch June 30, 2015 04:38
asfgit pushed a commit that referenced this pull request Jun 30, 2015
This patch fixes a bug introduced in #7031 which can cause Jenkins to incorrectly report a build with failed Python tests as passing if an error occurred while printing the test failure message.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7112 from JoshRosen/python-tests-hotfix and squashes the following commits:

c3f2961 [Josh Rosen] Hotfix for bug in Python test failure reporting
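The bug class described above, and one way to guard against it, might look like this sketch: record the failure before attempting to print its log, so an exception raised while reporting (e.g. on undecodable log bytes) can't mask a failing build. Names and details here are illustrative, not the actual hotfix code:

```python
failures = []

def report_failure(name, log_bytes):
    """Printing the captured log can itself raise, e.g. on bad bytes."""
    print("Failed test: %s" % name)
    print(log_bytes.decode("utf-8"))

def record_result(name, retcode, log_bytes):
    if retcode != 0:
        failures.append(name)  # record the failure BEFORE any reporting
        try:
            report_failure(name, log_bytes)  # best-effort log printing
        except Exception as exc:
            print("could not print log for %s: %s" % (name, exc))

# 0xff is not valid UTF-8, so report_failure raises mid-print; the
# failure is still recorded and the build still exits non-zero.
record_result("pyspark.tests", 1, b"\xff\xfe not valid UTF-8")
exit_code = 1 if failures else 0
```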
@SparkQA

SparkQA commented Jun 30, 2015

Test build #981 timed out for PR 7031 at commit feb3763 after a configured wait of 175m.
