Bazel hangs forever #4322

Closed
snnn opened this issue Dec 19, 2017 · 4 comments
Labels
P1 I'll work on this now. (Assignee required) type: bug

Comments

@snnn
Contributor

snnn commented Dec 19, 2017

Description of the problem / feature request / question:

While running the TensorFlow Bazel CI build on Windows, Bazel hangs forever.

If possible, provide a minimal example to reproduce the problem:

I'm compiling meteorcloudy's TF code, with some private changes
meteorcloudy/tensorflow@7004ed6

Environment info

  • Operating System:
    Windows 2012

  • Bazel version (output of bazel info release):
    master

  • If bazel info release returns "development version" or "(@non-git)", please tell us what source tree you compiled Bazel from; git commit hash is appreciated (git rev-parse HEAD):
    1892677

Have you found anything relevant by searching the web?

No

Anything else, information or logs or outputs that would be helpful?

2017-12-18T15:29:10.7622256Z [22 / 223] [-----] Creating source manifest for //py_test_dir/tensorflow/contrib/tpu:tpu_sharding_test
2017-12-18T15:29:16.1979521Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 5s ... (3 actions, 0 running)
2017-12-18T15:29:21.7488945Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 10s ... (3 actions, 0 running)
2017-12-18T15:29:45.9451427Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 35s ... (3 actions, 0 running)
2017-12-18T15:30:02.1773225Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 51s ... (3 actions, 0 running)
2017-12-18T15:30:23.2801547Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 72s ... (3 actions, 0 running)
2017-12-18T15:30:50.7138377Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 99s ... (3 actions, 0 running)
2017-12-18T15:31:08.5452320Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 117s ... (3 actions, 0 running)
2017-12-18T15:31:49.5587383Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 158s ... (3 actions, 0 running)
2017-12-18T15:32:16.2173170Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 185s ... (3 actions, 0 running)
2017-12-18T15:32:46.8764355Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 215s ... (3 actions, 0 running)
2017-12-18T15:33:22.1321670Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 251s ... (3 actions, 0 running)
2017-12-18T15:34:43.2210535Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 332s ... (3 actions, 0 running)
2017-12-18T15:35:35.9291300Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 385s ... (3 actions, 0 running)
2017-12-18T15:36:36.5452699Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 445s ... (3 actions, 0 running)
2017-12-18T15:37:46.2516264Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 515s ... (3 actions, 0 running)
2017-12-18T15:39:06.4143889Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 595s ... (3 actions, 0 running)
2017-12-18T15:40:38.6017663Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 687s ... (3 actions, 0 running)
2017-12-18T15:42:24.6180003Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 793s ... (3 actions, 0 running)
2017-12-18T15:44:26.5353667Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 915s ... (3 actions, 0 running)
2017-12-18T15:46:46.7421908Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 1055s ... (3 actions, 0 running)
2017-12-18T15:49:27.9808387Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 1217s ... (3 actions, 0 running)
2017-12-18T15:52:33.4037328Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 1402s ... (3 actions, 0 running)
2017-12-18T15:56:06.6393597Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 1615s ... (3 actions, 0 running)
2017-12-18T16:00:11.8661423Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 1860s ... (3 actions, 0 running)
2017-12-18T16:04:53.8721997Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 2142s ... (3 actions, 0 running)
2017-12-18T16:10:18.1807411Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 2467s ... (3 actions, 0 running)
2017-12-18T16:16:31.1338106Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 2840s ... (3 actions, 0 running)
2017-12-18T16:23:40.0313931Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 3269s ... (3 actions, 0 running)
2017-12-18T16:31:53.2646125Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 3762s ... (3 actions, 0 running)
2017-12-18T16:41:20.4827648Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 4329s ... (3 actions, 0 running)
2017-12-18T16:52:12.7843389Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 4981s ... (3 actions, 0 running)
2017-12-18T17:04:42.9320509Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 5731s ... (3 actions, 0 running)
2017-12-18T17:19:05.6018694Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 6594s ... (3 actions, 0 running)
2017-12-18T17:35:37.6730599Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 7586s ... (3 actions, 0 running)
2017-12-18T17:54:38.5562300Z [328 / 913] [-----] Creating runfiles tree bazel-out/x64_windows-py3-opt/bin/py_test_dir/tensorflow/contrib/learn/saved_model_export_utils_test.exe.runfiles; 8727s ... (3 actions, 0 running)
2017-12-18T18:12:25.5424781Z 2017-12-18 10:12:25
2017-12-18T18:12:25.5424781Z Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode):
2017-12-18T18:12:25.5424781Z 
2017-12-18T18:12:25.5434771Z "skyframe-evaluator 46" #663 prio=5 os_prio=0 tid=0x000000005d18e000 nid=0x2974 waiting on condition [0x00000000606ae000]
2017-12-18T18:12:25.5434771Z    java.lang.Thread.State: WAITING (parking)
2017-12-18T18:12:25.5434771Z 	at sun.misc.Unsafe.park(Native Method)
2017-12-18T18:12:25.5444782Z 	- parking to wait for  <0x00000003c0521cd0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
2017-12-18T18:12:25.5444782Z 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
2017-12-18T18:12:25.5444782Z 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
2017-12-18T18:12:25.5444782Z 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
2017-12-18T18:12:25.5454794Z 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
2017-12-18T18:12:25.5454794Z 	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
2017-12-18T18:12:25.5454794Z 	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.vfs.JavaIoFileSystem.createDirectory(JavaIoFileSystem.java:230)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.vfs.Path.createDirectory(Path.java:1034)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.vfs.Path.createDirectory(Path.java:1019)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.vfs.FileSystemUtils.createDirectoryAndParentsWithCache(FileSystemUtils.java:703)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.vfs.FileSystemUtils.createDirectoryAndParents(FileSystemUtils.java:663)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.createOutputDirectories(SkyframeActionExecutor.java:703)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:800)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$900(SkyframeActionExecutor.java:111)
2017-12-18T18:12:25.5464783Z 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:678)
@aehlig
Contributor

aehlig commented Dec 19, 2017

Thanks for your bug report and the stack trace!

As it hangs waiting for a lock, we might have to reconsider the locking strategy in use (I didn't find an obvious lock order violation; it seems that we only ever ask for the parent lock while holding a lock on a child directory).

/cc @tomlu who last changed the locking strategy for that part of the code in 82e68b7.

This may or may not be related to #4306.
/cc @meteorcloudy as he is looking into the deadlocks during bootstrap on Windows.

@aehlig aehlig added P1 I'll work on this now. (Assignee required) type: bug labels Dec 19, 2017
@snnn
Contributor Author

snnn commented Dec 19, 2017

I can reproduce this bug 100% of the time. I can provide more information if you need it.

@buchgr
Contributor

buchgr commented Dec 19, 2017

@snnn does it happen more frequently if you increase --jobs (parallelism)?

@tomlu @aehlig @ulfjack

In the code we use Guava's Striped class instead of a hash map. This class can give you the same lock for different objects, i.e. the same lock for the child and the parent directory. From the documentation:

The guarantee provided by this class is that equal keys lead to the same lock (or semaphore), i.e. if
(key1.equals(key2)) then striped.get(key1) == striped.get(key2) (assuming Object.hashCode() is
correctly implemented for the keys). Note that if key1 is not equal to key2, it is not guaranteed that
striped.get(key1) != striped.get(key2); the elements might nevertheless be mapped to the same lock.
The lower the number of stripes, the higher the probability of this happening.

The locks are guaranteed to be re-entrant, but it's possible that on two different threads createDirectory gets two locks in reverse order i.e.
test1234

The number of stripes is set to 64, so if you run with lots of jobs for a while it may not be that unlikely to get into that state.

https://google.github.io/guava/releases/19.0/api/docs/com/google/common/util/concurrent/Striped.html
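
To illustrate the stripe collision described above, here is a minimal sketch in plain Java. It is not Bazel's or Guava's code: `StripedLockOrderDemo`, `get`, and `blockedThreads` are hypothetical names, integer keys stand in for directory paths, and the key-to-stripe mapping is a plain modulo rather than Guava's real hashing. Each thread locks a "child" key and then asks for its "parent" key, which is the ordering described earlier in the thread. The parent lock is requested with `tryLock` and a timeout so the demo reports the blocked state instead of hanging.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a striped lock table; not Bazel's or Guava's actual code. */
class StripedLockOrderDemo {
    static final int STRIPES = 64; // stripe count mentioned in the thread

    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];

    StripedLockOrderDemo() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    /** Simplified mapping (assumption): equal keys share a lock, unequal keys may too. */
    ReentrantLock get(int key) {
        return locks[Math.floorMod(key, STRIPES)];
    }

    /**
     * Runs two threads that each lock their child key, then try the parent key.
     * Returns how many threads could not get the parent lock within the timeout.
     * With a blocking lock() call, a result of 2 would be a real deadlock.
     */
    static int blockedThreads(int child1, int parent1, int child2, int parent2) {
        StripedLockOrderDemo striped = new StripedLockOrderDemo();
        if (striped.get(child1) == striped.get(child2)) {
            throw new IllegalArgumentException("child keys must map to different stripes");
        }
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger blocked = new AtomicInteger();
        Thread t1 = new Thread(() -> lockChildThenParent(striped, child1, parent1, barrier, blocked));
        Thread t2 = new Thread(() -> lockChildThenParent(striped, child2, parent2, barrier, blocked));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return blocked.get();
    }

    private static void lockChildThenParent(StripedLockOrderDemo striped, int child, int parent,
                                            CyclicBarrier barrier, AtomicInteger blocked) {
        ReentrantLock childLock = striped.get(child);
        ReentrantLock parentLock = striped.get(parent);
        childLock.lock();
        try {
            barrier.await(); // both threads now hold their child stripe
            if (parentLock.tryLock(200, TimeUnit.MILLISECONDS)) {
                parentLock.unlock();
            } else {
                blocked.incrementAndGet();
            }
            barrier.await(); // keep the child stripe held until both attempts finish
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new IllegalStateException(e);
        } finally {
            childLock.unlock();
        }
    }

    public static void main(String[] args) {
        // Keys 1, 2, 65, 66 are all distinct, but 65 shares a stripe with 1 and 66 with 2,
        // so the two threads request the same two stripes in opposite order.
        System.out.println(blockedThreads(1, 2, 66, 65));
        // Four different stripes: neither thread is blocked.
        System.out.println(blockedThreads(1, 2, 3, 4));
    }
}
```

In this sketch the first call reports both threads blocked even though no two keys are equal, while the second reports none. A child and parent that land on the same stripe within one thread are harmless because the lock is re-entrant; the problem only appears when two threads take two shared stripes in opposite order.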

@tomlu
Contributor

tomlu commented Dec 19, 2017 via email

luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
    Fixes #4322, #4306.

    *** Reason for rollback ***

    Introduces a deadlock (see bazelbuild/bazel#4322)

    *** Original change description ***

    Make FileSystem operate on LocalPath instead of Path.

    PiperOrigin-RevId: 179549866