Deadlock in indexer #81

Closed
jonahgraham opened this issue Oct 3, 2022 · 10 comments · Fixed by #173
@jonahgraham

Something in the recent changes to #79 has caused the indexer to deadlock more readily than before.

Running PDOMCPPBugsTest on the build machine locks up, and it also locks up locally a significant percentage of the time when running (but not when debugging) the tests.

@jonahgraham jonahgraham self-assigned this Oct 3, 2022
@jonahgraham jonahgraham added this to the 11.0.0 milestone Oct 3, 2022
@jonahgraham

I have had two runs now with identical stack traces, so I'm fairly sure this is what is going on.

Extracts of key threads:

"main" #1 prio=6 os_prio=0 cpu=6380.63ms elapsed=721.72s tid=0x00007f29ec0267b0 nid=0x1eaa7c waiting for monitor entry  [0x00007f29f128e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Object.wait(java.base@17.0.3/Native Method)
	- waiting on <no object reference available>
	at org.eclipse.cdt.internal.core.pdom.PDOMManager.joinIndexer(PDOMManager.java:1173)
	- locked <0x00000000e8e07aa0> (a [Z)
	at org.eclipse.cdt.core.testplugin.util.BaseTestCase5.waitForIndexer(BaseTestCase5.java:165)
	at org.eclipse.cdt.core.testplugin.util.BaseTestCase.waitForIndexer(BaseTestCase.java:234)
	at org.eclipse.cdt.internal.pdom.tests.PDOMCPPBugsTest.test191679(PDOMCPPBugsTest.java:227)


"Worker-5: Update Monitor" #56 prio=5 os_prio=0 cpu=19.67ms elapsed=712.00s tid=0x00007f2848000e70 nid=0x1eab17 waiting for monitor entry  [0x00007f2878487000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager.isIndexerIdle(PDOMManager.java:751)
	- waiting to lock <0x00000000e2035f38> (a java.util.ArrayDeque)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager$7.done(PDOMManager.java:1153)
	- locked <0x00000000e8e07aa0> (a [Z)
	at org.eclipse.core.internal.jobs.JobListeners$$Lambda$127/0x0000000800da52f0.notify(Unknown Source)

"Worker-0: Notify Index Change Listeners" #43 prio=5 os_prio=0 cpu=709494.63ms elapsed=716.35s tid=0x00007f29ec8e7520 nid=0x1eaadc runnable  [0x00007f29a49f3000]
   java.lang.Thread.State: RUNNABLE
	at org.eclipse.core.internal.jobs.JobManager.schedule(JobManager.java:1353)
	at org.eclipse.core.internal.jobs.InternalJob.schedule(InternalJob.java:390)
	at org.eclipse.core.runtime.jobs.Job.schedule(Job.java:654)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager.enqueue(PDOMManager.java:712)
	- locked <0x00000000e2035f38> (a java.util.ArrayDeque)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager$3.run(PDOMManager.java:1012)
	- locked <0x00000000e2036538> (a java.util.HashMap)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)


"Worker-2: Building" #52 prio=5 os_prio=0 cpu=28.64ms elapsed=712.01s tid=0x00007f28c0001820 nid=0x1eab13 waiting for monitor entry  [0x00007f287888b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager.handlePostBuildEvent(PDOMManager.java:1499)
	- waiting to lock <0x00000000e2036538> (a java.util.HashMap)
	at org.eclipse.cdt.internal.core.pdom.CModelListener.resourceChanged(CModelListener.java:166)
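The cycle in these traces is a classic lock-ordering inversion: the listener path holds one monitor (the idle flag) and wants the task queue's ArrayDeque monitor, while the enqueue path holds the ArrayDeque monitor when scheduling, which can re-enter the listener machinery. The following is a minimal sketch of the shape of the problem (lock and class names are hypothetical, not CDT code), written with the nesting removed on the enqueue side so that it terminates:

```java
import java.util.concurrent.CountDownLatch;

public class LockOrderSketch {
    // Hypothetical stand-ins for the two monitors seen in the dump:
    // the task queue (an ArrayDeque) and the idle/listener monitor.
    static final Object taskQueue = new Object();
    static final Object idleCondition = new Object();
    static volatile boolean scheduled;
    static volatile boolean notified;

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(2);

        // Enqueue analogue: mutates the queue under its monitor, but does
        // the listener-notification work only AFTER releasing it. Nesting
        // idleCondition inside the taskQueue block would recreate the cycle.
        Thread scheduler = new Thread(() -> {
            synchronized (taskQueue) {
                scheduled = true; // enqueue work
            }
            synchronized (idleCondition) {
                notified = true; // notify listeners, queue lock released
            }
            done.countDown();
        });

        // done()-listener analogue: takes the idle monitor, then briefly
        // inspects the queue (an isIndexerIdle()-style check). Because the
        // other thread never holds both locks at once, no cycle can form.
        Thread listener = new Thread(() -> {
            synchronized (idleCondition) {
                synchronized (taskQueue) {
                    // inspect queue state
                }
            }
            done.countDown();
        });

        scheduler.start();
        listener.start();
        if (!done.await(5, java.util.concurrent.TimeUnit.SECONDS)) {
            throw new AssertionError("deadlock");
        }
        System.out.println("completed: " + (scheduled && notified));
    }
}
```

With the original nesting (taking idleCondition while still inside the taskQueue block), the two threads acquire the same pair of monitors in opposite orders, which is exactly the cycle the thread dump shows.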

jonahgraham added a commit to jonahgraham/cdt that referenced this issue Oct 3, 2022
The 2022-12 M1 build from Platform may have a regression that is causing
CDT build to fail more often due to deadlock in the indexer.

This commit uses a post-M1 build of Platform to try to pick up fixes
that have been committed to Platform since M1 was released.

Fixes eclipse-cdt#81
@jonahgraham

I think this is the same regression identified in eclipse-platform/eclipse.platform#193 - I will try updating to a newer I-build, and if that works we may need to discuss what to do about the M1 release this week.


jonahgraham commented Oct 3, 2022

With the updated target platform it looks like the problem isn't resolved, but the Java deadlock detector code can identify it now at least:

Found one Java-level deadlock:
=============================
"main":
  waiting to lock monitor 0x00007f37957fa400 (object 0x00000000e9980f88, a [Z),
  which is held by "Worker-11: C/C++ Indexer"

"Worker-11: C/C++ Indexer":
  waiting to lock monitor 0x00007f358801b240 (object 0x00000000e09dbbf0, a java.util.ArrayDeque),
  which is held by "Worker-7: Notify Index Change Listeners"

"Worker-7: Notify Index Change Listeners":
  waiting to lock monitor 0x00007f359c000e70 (object 0x00000000e09dbcf8, a java.util.concurrent.ConcurrentLinkedQueue),
  which is held by "Worker-11: C/C++ Indexer"

Full jstack output attached: jstack.txt

jonahgraham added a commit to jonahgraham/cdt that referenced this issue Oct 3, 2022
The 2022-12 M1 build from Platform may have a regression that is causing
CDT build to fail more often due to deadlock in the indexer.

This commit uses 2022-09 R build of Platform to avoid the problem.

Note that this won't resolve the issue in Platform, but will allow
CDT to be built.

Fixes eclipse-cdt#81
@jonahgraham

CDT's immediate issue is (hopefully) fixed with #82 - but that really just hides the problem, in the hope that Platform resolves the issue in eclipse-platform/eclipse.platform#193


jukzi commented Oct 4, 2022

The deadlock described in #81 (comment)
is a deadlock between
org.eclipse.core.internal.jobs.InternalJob.eventQueue
and
org.eclipse.cdt.internal.core.pdom.PDOMManager.idleCondition

CDT should not synchronize both in the JobChangeAdapter
https://github.com/eclipse-cdt/cdt/blob/main/core/org.eclipse.cdt.core/parser/org/eclipse/cdt/internal/core/pdom/PDOMManager.java#L1152

and outside the while loop
https://github.com/eclipse-cdt/cdt/blob/main/core/org.eclipse.cdt.core/parser/org/eclipse/cdt/internal/core/pdom/PDOMManager.java#L1165

You could instead synchronize only while waiting
https://github.com/eclipse-cdt/cdt/blob/main/core/org.eclipse.cdt.core/parser/org/eclipse/cdt/internal/core/pdom/PDOMManager.java#L1173

to be notified.

"Worker-11: C/C++ Indexer":
  waiting to lock monitor 0x00007f358801b240 (object 0x00000000e09dbbf0, a java.util.ArrayDeque),
  which is held by "Worker-7: Notify Index Change Listeners"

"Worker-7: Notify Index Change Listeners":
  waiting to lock monitor 0x00007f359c000e70 (object 0x00000000e09dbcf8, a java.util.concurrent.ConcurrentLinkedQueue),
  which is held by "Worker-11: C/C++ Indexer"
===================================================

"Worker-11: C/C++ Indexer":
	at org.eclipse.cdt.internal.core.pdom.PDOMManager.isIndexerIdle(PDOMManager.java:751)
	- waiting to lock <0x00000000e09dbbf0> (a java.util.ArrayDeque)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager$7.done(PDOMManager.java:1153)
	- locked <0x00000000e9980f88> (a [Z)
	at org.eclipse.core.internal.jobs.JobListeners$$Lambda$127/0x0000000800da5228.notify(Unknown Source)
	at org.eclipse.core.internal.jobs.JobListeners.sendEvent(JobListeners.java:63)
	at org.eclipse.core.internal.jobs.JobListeners.sendEvents(JobListeners.java:49)
	- locked <0x00000000e09dbcf8> (a java.util.concurrent.ConcurrentLinkedQueue)
	at org.eclipse.core.internal.jobs.JobManager.endJob(JobManager.java:736)
	at org.eclipse.core.internal.jobs.WorkerPool.endJob(WorkerPool.java:117)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:83)
"Worker-7: Notify Index Change Listeners":
	at org.eclipse.core.internal.jobs.JobListeners.sendEvents(JobListeners.java:48)
	- waiting to lock <0x00000000e09dbcf8> (a java.util.concurrent.ConcurrentLinkedQueue)
	at org.eclipse.core.internal.jobs.JobManager.schedule(JobManager.java:1351)
	at org.eclipse.core.internal.jobs.InternalJob.schedule(InternalJob.java:392)
	at org.eclipse.core.runtime.jobs.Job.schedule(Job.java:654)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager.enqueue(PDOMManager.java:712)
	- locked <0x00000000e09dbbf0> (a java.util.ArrayDeque)
	at org.eclipse.cdt.internal.core.pdom.PDOMManager$3.run(PDOMManager.java:1012)
	- locked <0x00000000e09dc218> (a java.util.HashMap)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)
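The "synchronize only while waiting" suggestion can be sketched with plain wait()/notifyAll(): the waiter holds the condition monitor only around its wait loop, and the listener's critical section is reduced to flipping the flag and notifying, with no other locks taken. The names (JoinSketch, idle, onJobDone) are hypothetical stand-ins, not the actual PDOMManager code:

```java
public class JoinSketch {
    // Hypothetical analogue of idleCondition: a one-element boolean array
    // used as both the lock and the flag.
    static final boolean[] idle = new boolean[1];

    // Listener side: take the condition monitor only long enough to set the
    // flag and notify -- and hold no other locks while doing so.
    static void onJobDone() {
        synchronized (idle) {
            idle[0] = true;
            idle.notifyAll();
        }
    }

    // Joining side: synchronize only around the wait loop; spurious wakeups
    // are handled by re-checking the flag.
    static void joinIndexer(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (idle) {
            while (!idle[0]) {
                long left = deadline - System.currentTimeMillis();
                if (left <= 0) throw new IllegalStateException("timed out");
                idle.wait(left);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            onJobDone(); // simulate the indexer job completing
        });
        worker.start();
        joinIndexer(5000);
        worker.join();
        System.out.println("joined, idle=" + idle[0]);
    }
}
```

Because neither side holds the condition monitor while doing unrelated work, a job-manager callback that fires on an arbitrary thread cannot form a lock cycle through it.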

@jonahgraham

Thank you @jukzi for the analysis. I have no doubt that there are improvements to be made to the indexer joining code (and this issue needs to be reopened), but it was written more than a decade ago. The behaviour change in Platform has regressed CDT, and I can't see where CDT violated the API - perhaps it is because CDT depended on undocumented behaviour.

Anyway, I will see if I can provide a reduced test case, which will either show the remaining issue in Platform, or at least allow me to understand the problem more fully. The existing code within CDT is way too complicated, and frustratingly it tends not to lock up when running under the debugger. It fails quite often (not 100% of the time) when I run the tests in the IDE, but it has never failed when debugging.

@jonahgraham jonahgraham reopened this Oct 4, 2022

jukzi commented Oct 4, 2022

Well, the timing when debugging is different, but you can certainly reproduce this obvious deadlock with breakpoints in the synchronized blocks. I understand this deadlock. I will also think about how the Platform could get rid of its synchronization.


jukzi commented Oct 4, 2022

My blaming PDOMManager.idleCondition was wrong, sorry. The problem is
PDOMManager.fTaskQueue
which is held (synchronized) while the Job is scheduled, and is also used in the listeners.

The schedule should be done outside the synchronized block. The same pattern is already used in

probably the same problem in


jukzi commented Oct 20, 2022

BTW that code is totally wrong:
https://github.com/eclipse-cdt/cdt/blob/main/core/org.eclipse.cdt.core/parser/org/eclipse/cdt/internal/core/pdom/PDOMManager.java#L1149

		JobChangeAdapter listener = new JobChangeAdapter() {
			@Override
			public void done(IJobChangeEvent event) {
				synchronized (idleCondition) {
					if (isIndexerIdle()) {
						idleCondition[0] = true;
						idleCondition.notifyAll();
					}
				}
			}
		};
		Job.getJobManager().addJobChangeListener(listener);

It adds the listener to every job, so that any(!) job that completes will set the idleCondition, even if other jobs are still running. And it synchronizes every job's done(). That's why all these unrelated threads:
"Worker-0: Reporting encoding changes.",
"Worker-2: Notify Index Change Listeners",
"Worker-3: C/C++ Indexer"
"Worker-4: Update Monitor"
"Worker-6: Update Job"
"Worker-7: Java problems decoration calculation..."
...
wait for the same synchronized (idleCondition).

What is probably wanted here is to listen only for fIndexerJob to be done, so you could instead add and remove the listener to/from only that Job.
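The per-job listener idea can be illustrated with a tiny stand-in model of jobs (the names PerJobListenerSketch and Job are hypothetical; this is not the Eclipse Jobs API): a listener registered on one specific job is never invoked - and so never contends for the idle monitor - when unrelated jobs complete:

```java
import java.util.ArrayList;
import java.util.List;

public class PerJobListenerSketch {
    // Minimal job model: each job carries its own completion listeners,
    // plus an optional global list (the pattern being criticised above).
    static final List<Runnable> globalListeners = new ArrayList<>();

    static class Job {
        final String name;
        final List<Runnable> listeners = new ArrayList<>();
        Job(String name) { this.name = name; }
        void done() {
            listeners.forEach(Runnable::run);       // per-job listeners
            globalListeners.forEach(Runnable::run); // every-job listeners
        }
    }

    public static void main(String[] args) {
        boolean[] idle = new boolean[1];
        Job indexerJob = new Job("C/C++ Indexer");
        Job unrelatedJob = new Job("Update Monitor");

        // Register only on the indexer job, not globally: unrelated job
        // completions never touch the idle flag.
        indexerJob.listeners.add(() -> idle[0] = true);

        unrelatedJob.done();
        System.out.println("after unrelated job: " + idle[0]); // false
        indexerJob.done();
        System.out.println("after indexer job: " + idle[0]);   // true
    }
}
```

A global listener, by contrast, would run (and synchronize) on every completion, which is why so many unrelated worker threads queue up on the same monitor in the dump above.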

@jonahgraham

You may be right - that heavy joinIndexer code is mostly used in tests to make sure everything is idle.

jonahgraham added a commit to jonahgraham/cdt that referenced this issue Oct 26, 2022
This is an experiment as recommended in
eclipse-platform/eclipse.platform#193 (comment)
to see if latest I-build works
jonahgraham added a commit to jonahgraham/cdt that referenced this issue Nov 7, 2022
This is an experiment as recommended in
eclipse-platform/eclipse.platform#193 (comment)
to see if latest I-build works
jonahgraham added a commit to jonahgraham/cdt that referenced this issue Nov 21, 2022
jonahgraham added a commit to jonahgraham/cdt that referenced this issue Nov 21, 2022
jonahgraham added a commit to jonahgraham/cdt that referenced this issue Nov 21, 2022
Because of
[changes](https://www.eclipse.org/eclipse/news/4.26/platform_isv.php#JobManager_Implementation)
in Eclipse Platform, where the job manager's behaviour changed (within
the API), consumers of the job manager can deadlock due to incorrect
assumptions.

In particular, where we call job.schedule(), the callback can happen
on a different thread than the IJobChangeListeners. As CDT was holding
a lock while calling schedule() that is also required in those
listeners, we must no longer hold that lock when calling schedule().

As the code already dealt with the case where there was a delay
between job.schedule() and where and when the job was run, we can
move the schedule call out of the synchronized block.

Fixes eclipse-cdt#81
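The approach described in this commit message - mutate the queue under its lock, then call schedule() with no lock held - can be sketched with an ExecutorService standing in for the job manager. All names here are hypothetical illustrations, not the PDOMManager code:

```java
import java.util.ArrayDeque;
import java.util.concurrent.*;

public class ScheduleOutsideLockSketch {
    static final ArrayDeque<String> taskQueue = new ArrayDeque<>();
    static final ExecutorService jobManager = Executors.newSingleThreadExecutor();

    // Before the fix (sketch): submit() happened while taskQueue was held,
    // so a listener fired from inside the scheduling machinery could try to
    // re-take taskQueue and deadlock. After the fix: update the queue under
    // the lock, then schedule with no lock held.
    static Future<String> enqueue(String task) {
        boolean doSchedule;
        synchronized (taskQueue) {
            taskQueue.add(task);
            doSchedule = taskQueue.size() == 1; // first item: kick the job
        }
        if (!doSchedule) return null; // job already running; it will drain
        return jobManager.submit(() -> {
            synchronized (taskQueue) {
                return taskQueue.poll(); // process one queued task
            }
        });
    }

    public static void main(String[] args) throws Exception {
        Future<String> f = enqueue("parse file.cpp");
        System.out.println("ran: " + f.get(5, TimeUnit.SECONDS));
        jobManager.shutdown();
    }
}
```

This works precisely because the code already tolerates a gap between enqueueing and the job actually running, so nothing depends on schedule() happening atomically with the queue update.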
jonahgraham added a commit to jonahgraham/cdt that referenced this issue Nov 21, 2022
jonahgraham added a commit that referenced this issue Nov 21, 2022