Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock during shutdown #50

Closed
axeluhl opened this issue May 6, 2022 · 53 comments · Fixed by #51
Closed

Deadlock during shutdown #50

axeluhl opened this issue May 6, 2022 · 53 comments · Fixed by #51

Comments

@axeluhl
Copy link

axeluhl commented May 6, 2022

Sorry to report this against an old version, and please resolve if this has been fixed in newer versions which should then urge us to finally upgrade... We're still using 3.16.100v20201030-1916 and are experiencing an occasional deadlock during container shutdown. Here are the two competing threads:

Thread [main] (Suspended)
waiting for: AtomicReference (id=24)
Object.wait(long) line: not available [native method]
EquinoxBundle$SystemBundle$EquinoxSystemModule(SystemModule).waitForStop(long) line: 173
EquinoxBundle$SystemBundle.waitForStop(long) line: 306
Equinox.waitForStop(long) line: 217
EclipseStarter.shutdown() line: 457
EclipseStarter.run(String[], Runnable) line: 274
NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
NativeMethodAccessorImpl.invoke(Object, Object[]) line: 62
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 43
Method.invoke(Object, Object...) line: 498
Main.invokeFramework(String[], URL[]) line: 653
Main.basicRun(String[]) line: 590
Main.run(String[]) line: 1461
Main.main(String[]) line: 1434

and

Thread [Framework stop - Equinox Container: 8a2dbb34-2363-442e-84e4-4e7b4ea1ff4f] (Suspended)
Unsafe.park(boolean, long) line: not available [native method]
Unsafe.park(boolean, long) line: 1079
LockSupport.park(Object) line: 175
ReentrantLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 836
ReentrantLock$NonfairSync(AbstractQueuedSynchronizer).acquireQueued(AbstractQueuedSynchronizer$Node, int) line: 870
ReentrantLock$NonfairSync(AbstractQueuedSynchronizer).acquire(int) line: 1199
ReentrantLock$NonfairSync.lock() line: 245
ReentrantLock.lock() line: 321
ZipBundleFile(CloseableBundleFile).close() line: 340
ClasspathEntry.close() line: 257
ClasspathManager.close() line: 152
EquinoxClassLoader(ModuleClassLoader).close() line: 457
BundleLoader.close() line: 359
EquinoxContainerAdaptor.invalidateWiring(ModuleWiring, ModuleLoader) line: 264
ModuleWiring.invalidate0(boolean) line: 363
ModuleWiring.unload() line: 352
ModuleContainer.unloadModules() line: 1253
ModuleContainer.close() line: 1197
EquinoxBundle$SystemBundle$EquinoxSystemModule(SystemModule).stop(Module$StopOptions...) line: 220
EquinoxBundle$SystemBundle$EquinoxSystemModule$1.run() line: 220
Thread.run() line: 838

The former waits on the AtomicReference bound to variable stopEvent. The latter tries to obtain the openLock which is held by the former thread. The latter thread would notify the monitor the former thread is waiting on in the finally block that follows in the stop(...) method, but it doesn't reach there because it is hung up at the lock.

This avoids clean shut-down which in turn causes our tests to hang sporadically.

@axeluhl
Copy link
Author

axeluhl commented May 6, 2022

This was migrated from Bugzilla issue 579799. Thomas Watsons reaction there was:

"Can you get a jstack of the process when it is hung so we can get a complete picture of the locks involved here. The only way I can see this happening is if an InputStream from a bundle entry got leaked and not closed.

We've recently moved to github for the source/issues of the project. Could you move this to an issue at https://github.com/eclipse-equinox/equinox.framework?"

@axeluhl
Copy link
Author

axeluhl commented May 6, 2022

I'll do a "jstack -l" with the pid of the hung process the next time it occurs. Do you also need the -m option for mixed mode?

@laeubi
Copy link
Member

laeubi commented May 6, 2022

@axeluhl I'm wondering what holds you back in upgrading? Given we find the cause and fixing it, it is very unlikely it gets released as an "old" version, looking at stacks from older releases is also not very comfortable. So it seems at the end you will need to upgrade anyways, so I think best would be to take the opportunity to do so and report back with a fresh stack.

@axeluhl
Copy link
Author

axeluhl commented May 6, 2022

I certainly see (that's why I already apologized in the report ;-) ) that reporting against old versions per se is not ideal. However, over the past decade that our project runs, keeping bumping all components has created massive efforts, often due to incompatibilities, deprecations and sometimes even regressions. We have since focused those upgrades on security-related and feature-related releases. Apart from the issue reported here we were simply content with the Equinox release used. Things are further complicated by our platform requiring Java 8 for other reasons, with challenges in upgrading even to 11 for now, limiting our options with simply bumping the versions of everything. I'll still try an Equinox upgrade.

Do you have any indication as to whether the deadlock issue reported here may have been fixed in newer versions?

@laeubi
Copy link
Member

laeubi commented May 6, 2022

Not really but things are improved every day, and as no one else reported such an issue about 2 years... But that's just a guess :-)

@axeluhl
Copy link
Author

axeluhl commented May 6, 2022

Here goes the jstack output. Hope this helps.
jstack.txt

@axeluhl
Copy link
Author

axeluhl commented May 6, 2022

@axeluhl I'm wondering what holds you back in upgrading?

@laeubi, which release do you suggest we upgraded to? It seems 3.18.0.v20220426-2016 is what the latest Eclipse version is currently using.

@laeubi
Copy link
Member

laeubi commented May 7, 2022

You will find the latest releases here:
https://wiki.eclipse.org/Simultaneous_Release

@axeluhl
Copy link
Author

axeluhl commented May 7, 2022

I've managed to upgrade our product to use org.eclipse.osgi_3.18.0.v20220426-2016 after adding the new requirements such as *.util.{function|measurement|position|promise}. I've looked at the code that was causing the deadlock in 3.16 and it seems that nothing has changed there regarding how the "forStop" reference is used to wait/notify. Therefore I wouldn't be too surprised if we'll see the same problem with 3.18.

I'll trigger a few builds to see if the problem can be reproduced with 3.18, and if, I'll collect another "jstack -l" output.

@laeubi
Copy link
Member

laeubi commented May 7, 2022

@axeluhl thats great, I think this will make things much more easier to adress. Maybe you also like to use jstack -l -e in the command to get as much info as possible.

@axeluhl
Copy link
Author

axeluhl commented May 7, 2022

There is no "-e" in a Java 8's jstack as far as I can see...

@laeubi
Copy link
Member

laeubi commented May 7, 2022

Okay that might be new to java 11, but I htink you can use a J11 jstack to observer a J8 vm?

@axeluhl
Copy link
Author

axeluhl commented May 7, 2022

It happened again with Equinox 3.18. I captured jstack output with the SAP JVM 8's jstack as well as an openjdk-11's jstack with "-e -l".
jstack.txt
jstack-11.txt

@tjwatson
Copy link
Contributor

tjwatson commented May 9, 2022

Thanks for the jstack. The threads involved in the deadlock look identical between Java 11 and Java 8.

"main" #1 prio=5 os_prio=0 cpu=1090.00 [reset 1090.00] ms elapsed=885.78 [reset 885.78] s allocated=90798480 B (86.59 MB) [reset 90798480 B (86.59 MB)] defined_classes=2303
io= file i/o: 73092804/74954728 B, net i/o: 0/0 B, files opened:175, socks opened:0  [reset file i/o: 73092804/74954728 B, net i/o: 0/0 B, files opened:175, socks opened:0 ] 
tid=0x00007f8114167000 nid=0x51b2 / 20914  pthread-id=140192499160832 in Object.wait()   [_thread_blocked (_at_safepoint), stack(0x00007f811c0ce000,0x00007f811c1ce000)] [0x00007f811c1cb000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(J)V(Native Method)
	- waiting on <0x00000005f02c59f0> (a java.util.concurrent.atomic.AtomicReference)
	at org.eclipse.osgi.container.SystemModule.waitForStop(J)Lorg/eclipse/osgi/container/ModuleContainerAdaptor$ContainerEvent;(SystemModule.java:173)
	- waiting to re-lock in wait() <0x00000005f02c59f0> (a java.util.concurrent.atomic.AtomicReference)
	at org.eclipse.osgi.internal.framework.EquinoxBundle$SystemBundle.waitForStop(J)Lorg/osgi/framework/FrameworkEvent;(EquinoxBundle.java:312)
	at org.eclipse.osgi.launch.Equinox.waitForStop(J)Lorg/osgi/framework/FrameworkEvent;(Equinox.java:217)
	at org.eclipse.core.runtime.adaptor.EclipseStarter.shutdown()V(EclipseStarter.java:458)
	at org.eclipse.core.runtime.adaptor.EclipseStarter.run([Ljava/lang/String;Ljava/lang/Runnable;)Ljava/lang/Object;(EclipseStarter.java:274)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Method.java:498)
	at org.eclipse.equinox.launcher.Main.invokeFramework([Ljava/lang/String;[Ljava/net/URL;)V(Main.java:653)
	at org.eclipse.equinox.launcher.Main.basicRun([Ljava/lang/String;)V(Main.java:590)
	at org.eclipse.equinox.launcher.Main.run([Ljava/lang/String;)I(Main.java:1461)
	at org.eclipse.equinox.launcher.Main.main([Ljava/lang/String;)V(Main.java:1434)

   Locked ownable synchronizers:
	- <0x00000005f406a440> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)


"Framework stop - Equinox Container: 35a9b7e6-836f-4fde-8481-3ae56def1ccd" #50 prio=5 os_prio=0 cpu=80.00 [reset 80.00] ms elapsed=882.71 [reset 882.71] s allocated=10868992 B (10.37 MB) [reset 10868992 B (10.37 MB)] defined_classes=36
io= file i/o: 67235/1037677 B, net i/o: 0/0 B, files opened:-88, socks opened:0  [reset file i/o: 67235/1037677 B, net i/o: 0/0 B, files opened:-88, socks opened:0 ] 
tid=0x00007f811487d000 nid=0x51ef / 20975  pthread-id=140191663589120 waiting on condition   [_thread_blocked (_at_safepoint), stack(0x00007f80ea3f0000,0x00007f80ea4f1000)] [0x00007f80ea4ef000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(ZJ)V(Native Method)
	- parking to wait for  <0x00000005f406a440> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at sun.misc.Unsafe.park(ZJ)V(Unsafe.java:1079)
	at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt()Z(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Ljava/util/concurrent/locks/AbstractQueuedSynchronizer$Node;I)Z(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(I)V(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock()V(ReentrantLock.java:245)
	at java.util.concurrent.locks.ReentrantLock.lock()V(ReentrantLock.java:321)
	at org.eclipse.osgi.storage.bundlefile.CloseableBundleFile.close()V(CloseableBundleFile.java:343)
	at org.eclipse.osgi.internal.loader.classpath.ClasspathEntry.close()V(ClasspathEntry.java:257)
	at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.close()V(ClasspathManager.java:150)
	at org.eclipse.osgi.internal.loader.ModuleClassLoader.close()V(ModuleClassLoader.java:447)
	at org.eclipse.osgi.internal.loader.BundleLoader.close()V(BundleLoader.java:358)
	at org.eclipse.osgi.internal.framework.EquinoxContainerAdaptor.invalidateWiring(Lorg/eclipse/osgi/container/ModuleWiring;Lorg/eclipse/osgi/container/ModuleLoader;)V(EquinoxContainerAdaptor.java:264)
	at org.eclipse.osgi.container.ModuleWiring.invalidate0(Z)V(ModuleWiring.java:360)
	at org.eclipse.osgi.container.ModuleWiring.unload()V(ModuleWiring.java:349)
	at org.eclipse.osgi.container.ModuleContainer.unloadModules()V(ModuleContainer.java:1255)
	at org.eclipse.osgi.container.ModuleContainer.close()V(ModuleContainer.java:1199)
	at org.eclipse.osgi.container.SystemModule.stop([Lorg/eclipse/osgi/container/Module$StopOptions;)V(SystemModule.java:220)
	at org.eclipse.osgi.internal.framework.EquinoxBundle$SystemBundle$EquinoxSystemModule$1.run()V(EquinoxBundle.java:226)
	at java.lang.Thread.run()V(Thread.java:838)

   Locked ownable synchronizers:
	- None

The mystery is why the main thread still holds the 0x00000005f406a440 lock. The intention of the CloseableBundleFile is to only hold that lock for very short windows and to never call out to foreign code while holding the lock. There are two patterns in the code where the lock is held:

  1. locking the re-entrant lock directly like in CloseableBundleFile.incrementReference where it is immediately unlocked in a finally block:
	void incrementReference() {
		openLock.lock();
		try {
			referenceCount += 1;
		} finally {
			openLock.unlock();
		}
	}
  1. locking with a call to CloseableBundleFile.lockOpen where it is released also in a finally block with a call to releaseOpen if the bundle file was opened successfully like in `CloseableBundleFile.getEntry:
	public BundleEntry getEntry(String path) {
		if (!lockOpen()) {
			return null;
		}
		try {
			return findEntry(path);
		} finally {
			releaseOpen();
		}
	}

The main thread here is not in the middle of any of these methods that hold the lock for the "short" window of time. That means that somehow this lock is escaping out of the methods that acquire the lock somehow.

Some possible reasons I can think that this deadlock could be happening:

  1. There is a call to lockOpen that is not ensuring the lock is released in a finally block with a call to releaseOpen. All calls that I see to lockOpen are correctly released in a finally block.
  2. There is a bug in lockOpen that is causing it to return false (indicating the BundleFile was not opened and locked successfully) but the openLock was indeed acquired and kept by the calling thread.

The second reason seems more plausible because the lockOpen is not simple code, but I've yet to detect where the hole is in that method that would cause it to return false but still hold the lockOpen lock.

The 3rd, less likely, possibility is that the JVM has a bug where it is not correctly decrementing and releasing the re-entrant lock when we call java.util.concurrent.locks.ReentrantLock.unlock(). Do you only see this on the SAP JVM?

@axeluhl
Copy link
Author

axeluhl commented May 9, 2022

Good point, I can try a different JVM. Of course, the problem doesn't reproduce reliably, but I'd say that among 10 builds we have two or so locking up at this point. So as we get 20 or 30 builds through you could say the problem would likely have been caused by the VM.

However, I would be pretty surprised by this as apart from extended stack traces, better on-the-fly debugging and an improved approach to profiling (that gets activated only, AFAIK, if a profiler is actually attached) this JVM is using the upstream HotSpot code.

@tjwatson
Copy link
Contributor

tjwatson commented May 9, 2022

However, I would be pretty surprised by this as apart from extended stack traces, better on-the-fly debugging and an improved approach to profiling (that gets activated only, AFAIK, if a profiler is actually attached) this JVM is using the upstream HotSpot code.

I agree, but had to mention it anyway. Most all cases where I wanted to blame the JVM ended up not actually being a bug in the JVM. Especially around deadlocks.

@axeluhl
Copy link
Author

axeluhl commented May 9, 2022

I don't yet understand all the details of why the openLock is unlocked before calling mruListApplyBackPressure() in CloseableBundleFile.open(boolean), but what I can see is that (as per design of the ReentrantLock) there can be re-entrances, e.g., through extractDirectory(...) which obtains the openLock and then calls getFile(String, boolean) which again calls lockOpen and through it open(boolean).

When in such a re-entrant scenario the openLock.unlock() statement is called to apply back pressure, the thread may still hold the lock. If the back pressure application works only if the lock is entirely released by the current thread then this could violate this assumption.

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

I was able to remote-attach to another hung VM. The bundle file that locks up while trying to close is
.../bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar
All "Debug" flags are set to false.

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

I've created a build slave image using
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
now. Let's see if any of the upcoming build jobs will stall again. This would help exclude the "SAP JVM 8" suspicion.

If you think that any specific form of logging in a patched version could help understand the issue better, I could try to replace the original 3.18 JAR from our target platform and inject a patched on from a separate p2 repo which would then be used during the build.

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

However, I would be pretty surprised by this as apart from extended stack traces, better on-the-fly debugging and an improved approach to profiling (that gets activated only, AFAIK, if a profiler is actually attached) this JVM is using the upstream HotSpot code.

I double-checked the SAP JVM / JDK sources, and guess what: there are patches in the ReentrantLock class, both, in the tryRelease and the unlock methods. They seem minor and have to do with producing synchronization traces:

    public void unlock() {
        // SAPJVM JS 2010-02-04: Synchronization trace
        // Notify registered hooks that we have unlocked a reentrant lock
        if (ProfilingControl.isSynchronizationTraceEnabled() && sync.hasQueuedThreads()) {
            AccessController.doPrivileged(new PrivilegedAction<Object>() {
                public Object run() {
                    ProfilingControl.notifyReentrantLockUnlock(sync);
                    return null;
                }
            });
        }

        sync.release(1);
    }

and

        protected final boolean tryRelease(int releases) {
            int c = getState() - releases;
            // SAPJVM JS 2010-04-08: Add message to exception and allow thread dump to be triggered.
            if (Thread.currentThread() != getExclusiveOwnerThread()) {
                Thread owner = getExclusiveOwnerThread();
                String property = java.security.AccessController.doPrivileged(
                        new sun.security.action.GetPropertyAction("com.sap.jvm.concurrent.locks.stackdump"));
                if (property != null) {
                    final Thread t = Thread.currentThread();
                    AccessController.doPrivileged(new PrivilegedAction<Void>() {
                        @Override
                        public Void run() {
                            try {
                                VmNativesJavaBase.executeAttachOperation("stacktrace",
                                                                         "-l",
                                                                         "Illegal release of lock in thread \"" + t.getName() + "\":",
                                                                         "dump");
                            } catch (IOException e) {
                                // we have to ignore it here
                            }
                            return null;
                        }
                    });
                }
                String message;
                if (owner == null) {
                    message = "Not owner (currently no owner)";
                }
                else {
                    message = "Not owner (current owner thread \"" + owner.getName() + "\")";
                }
                throw new IllegalMonitorStateException(message);
            }
            boolean free = false;
            if (c == 0) {
                free = true;
                setExclusiveOwnerThread(null);
            }
            setState(c);
            return free;
        }

I think that neither the com.sap.jvm.concurrent.locks.stackdump nor the ProfilingControl.isSynchronizationTraceEnabled() are active here, but I'll still examine these patches more closely now.

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

The lock-up occurred with openjdk-8 now, too. Find attached the jstack outputs, once obtained with a Java-11 jstack including the -e and -l options, and once obtained with a Java-8 jstack (only with -l). It seems exactly the same pattern, so I guess we can say the SAP JVM 8 is not the culprit.
jstack.txt
jstack-11.txt

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

FYI, with openjdk-8 the problem showed "more consistently." Four out of four builds got stuck at the same place.

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

With all Debug flags being false, I cannot see how this could have happened. However, if a RuntimeException were thrown in

    if (debug.DEBUG_BUNDLE_FILE_OPEN) {
	Debug.println("OPENED bundle file - " + toString()); //$NON-NLS-1$
    }

then close already would have been set to false, the finally clause in open(boolean) would hence not unlock, and the catch (IOException e) in lockOpen wouldn't be triggered. Hence, the exception would propagate to a caller of lockOpen where we're still before the try/finally clause and hence the lock would remain locked as the releaseOpen() in the finally clause would not be reached.

It would probably also mean that such a RuntimeException bubbles up all the way and leaves the entire container in a somewhat unusable state. Not sure if this needs treatment here.

@tjwatson
Copy link
Contributor

If this is causing you particular pain in a continuous integration build you may want to consider disabling the bundle file closer to see if that helps. Set the config.ini setting osgi.bundlefile.limit=0. I'm not sure it will 100% of your problem, but it will reduce the number of open/close calls during the operation of the framework.

If you think that any specific form of logging in a patched version could help understand the issue better, I could try to replace the original 3.18 JAR from our target platform and inject a patched on from a separate p2 repo which would then be used during the build.

We should add some general trace to the CloseableBundleFile to see if we can detect where the lock is escaping. We can get that in an I-Build and then you could pick it up from there to test with?

I was able to remote-attach to another hung VM. The bundle file that locks up while trying to close is
.../bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar

Thanks, I figured this was from a .cp jar file which indicates that it was for a JAR on the Bundle-Classpath. These inner JAR bundle files get closed while calling ModuleWiring.unload() which shutsdown the bundle class loader. That is consistent with the stacktrace that shows the close operation.

I don't yet understand all the details of why the openLock is unlocked before calling mruListApplyBackPressure() in CloseableBundleFile.open(boolean), but what I can see is that (as per design of the ReentrantLock) there can be re-entrances, e.g., through extractDirectory(...) which obtains the openLock and then calls getFile(String, boolean) which again calls lockOpen and through it open(boolean).

When in such a re-entrant scenario the openLock.unlock() statement is called to apply back pressure, the thread may still hold the lock. If the back pressure application works only if the lock is entirely released by the current thread then this could violate this assumption.

That is a good observation, but still does not explain why the main thread is holding that re-entrant lock because it is not in the middle of any CloseableBundleFile method calls. The design is supposed to release any locks obtained on exit of the method.

There has to be some code path that is allowing a CloseableBundleFile method to exit while still holding the lock count it obtained on entry. With that in mind we can look to adding trace that tries to print out information on locks. I warn that the trace will likely be noisy and that in of itself may cause the problem to not re-occur.

@tjwatson
Copy link
Contributor

With all Debug flags being false, I cannot see how this could have happened. However, if a RuntimeException were thrown in

A runtime exception anywhere in the try block of org.eclipse.osgi.storage.bundlefile.CloseableBundleFile.open(boolean) would be pretty bad when true is passed. Do you have any eclipse log with some exceptions?

We may need to consider catching all exceptions in that try and forcing the lock to be released and then re-throw the exception to the caller.

@laeubi
Copy link
Member

laeubi commented May 10, 2022

We may need to consider catching all exceptions in that try and forcing the lock to be released and then re-throw the exception to the caller.

Probably also Errors (OOM?)

@tjwatson
Copy link
Contributor

Probably also Errors (OOM?)

Yes, I intended that to be all Throwable.

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

Do you have any eclipse log with some exceptions?

No, sorry, it was just a hypothesis based on inspecting the code.

@axeluhl
Copy link
Author

axeluhl commented May 10, 2022

We should add some general trace to the CloseableBundleFile to see if we can detect where the lock is escaping. We can get that in an I-Build and then you could pick it up from there to test with?

Please provide me with the link to the JAR, and I can try to set up our build environment such that it uses Equinox from a local p2 repo instead of the official Eclipse repo we're using normally for our build.

There has to be some code path that is allowing a CloseableBundleFile method to exit while still holding the lock count it obtained on entry. With that in mind we can look to adding trace that tries to print out information on locks. I warn that the trace will likely be noisy and that in of itself may cause the problem to not re-occur.

That is of course understood; any measurement may falsify your experiment ;-). Yet, I guess you could start by trying to trace the most likely cases first, such as when the open(boolean) method terminates with an exception and doesn't unlock due to the state of keepLock and close; or when lockOpen terminates abnormally but not with an IOException. Both should not occur under normal circumstances; if we then still see a lock-up and haven't seen any of these two traces, two possible reasons can already be excluded.

@laeubi
Copy link
Member

laeubi commented May 10, 2022

You might want to build the repro locally with

mvn clean package -Pbuild-individual-bundles -DskipTests=true

don't forget to increment the version of the bundle you change to make sure you don't accidentally use the wrong version in your tests.

@tjwatson
Copy link
Contributor

If you could apply #51 locally and build/test that would be helpful.

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

Trying to build I'm getting

Caused by: java.lang.UnsupportedClassVersionError: org/eclipse/tycho/p2maven/repository/P2ArtifactRepositoryLayout has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0

What is the minimum JDK version required to build this? Can I be sure to get a bundle JAR that works with Java 8 (byte code version 52)?

@vogella
Copy link
Contributor

vogella commented May 11, 2022

Trying to build I'm getting

What is the minimum JDK version required to build this? Can I be sure to get a bundle JAR that works with Java 8 (byte code version 52)?

Java 11 is requried to build. Individual bundles can still be at Java 8 but your runtime for the build needs to be on Java 11.

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

If this is causing you particular pain in a continuous integration build you may want to consider disabling the bundle file closer to see if that helps. Set the config.ini setting osgi.bundlefile.limit=0. I'm not sure it will 100% of your problem, but it will reduce the number of open/close calls during the operation of the framework.

This sounds like a possible workaround, getting our build failure rates down. But how can I specify a config.ini option for an eclipse-test-plugin test execution during a Tycho/Maven build? I think the config.ini file is generated by Tycho, and I haven't found a way yet to specify additional properties to generate into that file before launching the test environment.

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

Thanks, Lars. With 11 I get half way through, failing with the macosx stuff (not surprisingly ;-) ), but the org.eclipse.osgi bundle was built successfully, so I guess I have what I need to build a temporary target platform with this patched bundle JAR that I can use to try a build. Stay tuned.

main:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] rt.equinox.framework 4.24.0-SNAPSHOT ............... SUCCESS [05:41 min]
[INFO] org.eclipse.osgi 3.18.0-SNAPSHOT ................... SUCCESS [04:37 min]
[INFO] org.eclipse.osgi.compatibility.state 1.2.600-SNAPSHOT SUCCESS [  1.813 s]
[INFO] org.eclipse.osgi.services 3.10.200-SNAPSHOT ........ SUCCESS [  4.548 s]
[INFO] org.eclipse.osgi.util 3.7.0-SNAPSHOT ............... SUCCESS [  1.316 s]
[INFO] org.eclipse.equinox.supplement 1.10.500-SNAPSHOT ... SUCCESS [  5.005 s]
[INFO] org.eclipse.equinox.launcher 1.6.400-SNAPSHOT ...... SUCCESS [  4.240 s]
[INFO] org.eclipse.equinox.launcher.cocoa.macosx 1.2.500-SNAPSHOT FAILURE [  0.299 s]
[INFO] org.eclipse.equinox.launcher.cocoa.macosx.x86_64 1.2.500-SNAPSHOT SKIPPED
[INFO] org.eclipse.equinox.launcher.cocoa.macosx.aarch64 1.2.500-SNAPSHOT SKIPPED
[INFO] org.eclipse.equinox.launcher.gtk.linux.aarch64 1.2.500-SNAPSHOT SKIPPED
[INFO] org.eclipse.equinox.launcher.gtk.linux.ppc64le 1.2.500-SNAPSHOT SKIPPED
[INFO] org.eclipse.equinox.launcher.gtk.linux.x86_64 1.2.500-SNAPSHOT SKIPPED
[INFO] org.eclipse.equinox.launcher.win32.win32.x86_64 1.2.500-SNAPSHOT SKIPPED
[INFO] org.eclipse.osgi.tests 3.16.700-SNAPSHOT ........... SKIPPED
[INFO] org.eclipse.equinox.executable 3.8.1700-SNAPSHOT ... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  30:53 min
[INFO] Finished at: 2022-05-11T11:12:42+02:00
[INFO] ------------------------------------------------------------------------

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

I built a 3.18.1 version from 311f312 that I'm now using as part of my target platform. A product build is running in our CI, and I'm waiting for it to reach the critical eclipse-test-plugin.

Should it lock up again, where would you recommend I looked for the output generated by

	generation.getBundleInfo().getStorage().getAdaptor().publishContainerEvent(eventType, r.getRevisions().getModule(), e);

?

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

It didn't lock up. Could be conincidence, could be the fix. Of course, my question above was inappropriate. If the line I quoted was executed then the lock would now have been released, so no lock-up would be expected. Yet, a log may be expected. However, I'm yet to understand where to look for such a log. The build output says:

[INFO] --- tycho-surefire-plugin:1.6.0:test (default-test) @ {test-fragment-name} ---
[INFO] Expected eclipse log file: {path-to-test-fragment-folder}/bin/work/data/.metadata/.log

but I cannot even see the build produce the data or .metadata folder. Any hints?

@tjwatson
Copy link
Contributor

Perhaps @mickaelistria has some advise.

For our I-Builds we use --consoleLog option which pipes the log to standard out and the standard output is captured. For example, https://download.eclipse.org/eclipse/downloads/drops4/I20220510-1800/testresults/ep424I-unit-cen64-gtk3-java11_linux.gtk.x86_64_11/org.eclipse.osgi.tests.AutomatedTests.txt

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

I see -consolelog in your log output. I've added this option to our property for the tycho-surefire-plugin configuration. Let's see what comes out.

@tjwatson
Copy link
Contributor

My merge of #51 closed this issue. If the issue persists then please open a new issue. Thanks for testing. The I-Build tomorrow should have this fix included if you want to download that to test with also.

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

Does this project build a p2 update site from where the new release can be obtained? I guess it will be a while until the Eclipse release train picks this up, right? Where do the I-Build results come out?

@axeluhl
Copy link
Author

axeluhl commented May 11, 2022

Ah, I think I found it. https://download.eclipse.org/equinox/, and there the "Integration Builds" section.

@axeluhl
Copy link
Author

axeluhl commented May 12, 2022

Tested with the JAR built from the pull request, and tested with the JAR from the integration build https://download.eclipse.org/equinox/drops/I20220511-1800/index.php, both with -consolelog and bundle close debug output activated through a .options file. The problem hasn't occurred so far, but I neither have seen any exception log that would indicate that the change actually took effect.

In the meantime another build based on 3.18.0.v20220426-2016 got stuck, this time with bundle close debug output activated. The end of that log looks like this:

CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.osgi.util.function/1.2.0.202109301733/org.osgi.util.function-1.2.0.202109301733.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.ui.workbench/3.106.2.v20150204-1030/org.eclipse.ui.workbench-3.106.2.v20150204-1030.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.ui.navigator.resources/3.4.700.v20140407-1004/org.eclipse.ui.navigator.resources-3.4.700.v20140407-1004.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.ui.ide.application/1.0.501.v20141111-1238/org.eclipse.ui.ide.application-1.0.501.v20141111-1238.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.ui.views.properties.tabbed/3.6.0.v20140519-0906/org.eclipse.ui.views.properties.tabbed-3.6.0.v20140519-0906.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.hamcrest.core/1.3.0.v20180420-1519/org.hamcrest.core-1.3.0.v20180420-1519.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.swt.gtk.linux.x86/3.103.2.v20150203-1351/org.eclipse.swt.gtk.linux.x86-3.103.2.v20150203-1351.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.swt.gtk.linux.x86_64/3.103.2.v20150203-1351/org.eclipse.swt.gtk.linux.x86_64-3.103.2.v20150203-1351.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.swt.win32.win32.x86/3.103.2.v20150203-1351/org.eclipse.swt.win32.win32.x86-3.103.2.v20150203-1351.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.swt.win32.win32.x86_64/3.103.2.v20150203-1351/org.eclipse.swt.win32.win32.x86_64-3.103.2.v20150203-1351.jar
CLOSED bundle file - /home/hudson/.m2/repository/p2/osgi/bundle/org.eclipse.swt.cocoa.macosx.x86_64/3.103.2.v20150203-1351/org.eclipse.swt.cocoa.macosx.x86_64-3.103.2.v20150203-1351.jar
CLOSED bundle file - /home/hudson/.m2/repository/org/eclipse/tycho/org.eclipse.tycho.surefire.osgibooter/1.6.0/org.eclipse.tycho.surefire.osgibooter-1.6.0.jar
CLOSED bundle file - /home/hudson/.m2/repository/org/eclipse/tycho/org.eclipse.tycho.surefire.junit4/1.6.0/org.eclipse.tycho.surefire.junit4-1.6.0.jar
CLOSED bundle file - /home/hudson/workspace/bug5723/java/com.sap.sse.gwt.test/bin/work/configuration/org.eclipse.osgi/3/0/.cp/lib/gwt-dev.jar
CLOSED bundle file - /home/hudson/workspace/bug5723/java/com.sap.sse.gwt.test/bin/work/configuration/org.eclipse.osgi/6/0/.cp/lib/gwt-user.jar
CLOSED bundle file - /home/hudson/workspace/bug5723/java/com.sap.sse.gwt.test/bin/work/configuration/org.eclipse.osgi/5/0/.cp/lib/gwt-servlet.jar
CLOSED bundle file - /home/hudson/workspace/bug5723/java/com.sap.sse.gwt.test/bin/work/configuration/org.eclipse.osgi/5/0/.cp/lib/gwt-servlet-deps.jar

Attaching a debugger again shows that junit.jar from .../bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar being the culprit.

It does remain a bit mysterious, and I somehow feel I should try to branch from the old version and try to log or at least record a stack for where which lock has been obtained exclusively. While at it I could also try to record uncaught exceptions in lockOpen(). Maybe with some luck I can get to the bottom of this...

@axeluhl
Copy link
Author

axeluhl commented May 12, 2022

With the patched version I see a NoClassDefFoundError pretty much where "usually" things lock up:

Running .....client.async.AsyncActionsExecutorTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.302 s <<< FAILURE! - in ......client.async.AsyncActionsExecutorTest
......client.async.AsyncActionsExecutorTest  Time elapsed: 0.294 s  <<< ERROR!
java.lang.NoClassDefFoundError: junit/framework/TestFailure
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)

However, this time (and maybe it's a side effect of the log output and the timing change?) things don't lock up. Is it conceivable that a NoClassDefFoundError is thrown somewhere during the lockOpen() call path?

@tjwatson
Copy link
Contributor

However, this time (and maybe it's a side effect of the log output and the timing change?) things don't lock up. Is it conceivable that a NoClassDefFoundError is thrown somewhere during the lockOpen() call path?

I imagine that is happening as a side-effect of the framework shutting down and the bundle class loaders are getting disabled from loading new classes. That is assuming junit is getting loaded from a bundle. I don't see how that NCDFE could happen during lockOpen because that is framework code. That code does not call out to foreign code that maybe located in a bundle. So I don't see how such a class loading error could happen in that call.

@axeluhl
Copy link
Author

axeluhl commented May 12, 2022

I think I was able to recover the stack trace where the lock was obtained on the junit.jar bundle that now keeps it from closing:

CloseableBundleFile.open:162
CloseableBundleFile.lockOpen:79
CloseableBundleFile.getInputStream:479
ZipBundleEntry.getInputStream:57
BundleEntry.getBytes:94
ClasspathManager.findClassImpl:627
ClasspathManager.findLocalClassImpl:607
ClasspathManager.findLocalClassImpl:587
ClasspathManager.findLocalClass:566
ModuleClassLoader.findLocalClass:335
BundleLoader.findLocalClass:397
BundleLoader.findClass0:500
BundleLoader.findClass:416
ModuleClassLoader.loadClass:415
RunNotifier.fireTestRunFinished:100
JUnit4Provier.invoke:164
NativeMethodAccessorImpl.invoke0:-2
NativeMethodAccessorImpl.invoke:62
DelegatingMethodAccessorImpl.invoke:43
Method.invoke:498
ReflectionUtils.invokeMethodWithArray2:206
ProviderFactory.invoke:161
ProviderFactory.invokeProvider:84
OsgiSurefireBooter.run:113
HeadlessTestApplication.run:21
NativeMethodAccessorImpl.invoke0:-2
NativeMethodAccessorImpl.invoke:62
DelegatingMethodAccessorImpl.invoke:43
Method.invoke:498
EclipseAppContainer.callMethodWithException:593
EclipseAppHandle.run:205
EclipseAppLauncher.runApplication:136
EclipseAppLauncher.start:104
EclipseStarter.run:402
EclipseStarter.run:255
NativeMethodAccessorImpl.invoke0:-2
NativeMethodAccessorImpl.invoke:62
DelegatingMethodAccessorImpl.invoke:43
Method.invoke:498
Main.invokeFramework:653
Main.basicRun:590
Main.run:1461
Main.main:1434

Excuse the format, but I obtained this by stepping through a locked-up thread using an Eclipse remote debugger, noting frame by frame.

I then found this in the extended logs:

OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is falseCaught Throwable while trying to lockOpen /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar

Note the lack of a newline between the "false" the the "Caught Throwable". Here is the lockOpen() method with my extended Debug output:

	/**
	 * Checks if the bundle file is open
	 * @return true if the bundle file is open and locked
	 */
	protected boolean lockOpen() {
		try {
			open(true);
			return true;
		} catch (IOException e) {
			if (generation != null) {
				ModuleRevision r = generation.getRevision();
				if (r != null) {
					ContainerEvent eventType = ContainerEvent.ERROR;
					// If the revision has been removed from the list of revisions then it has been deleted
					// because the bundle has been uninstalled or updated
					if (!r.getRevisions().getModuleRevisions().contains(r)) {
						// instead of filling the log with errors about missing files from
						// uninstalled/updated bundles just give it an info level
						eventType = ContainerEvent.INFO;
					}
					generation.getBundleInfo().getStorage().getAdaptor().publishContainerEvent(eventType, r.getRevisions().getModule(), e);
				}
			}
			// TODO not sure if throwing a runtime exception is better
			// throw new RuntimeException("Failed to open bundle file.", e);
			return false;
		} catch (Throwable t) {
			if (debug.DEBUG_BUNDLE_FILE_OPEN) {
				Debug.println("Caught Throwable while trying to lockOpen " + toString()); //$NON-NLS-1$
				Debug.printStackTrace(t);
			}
			return false;
		}
	}

And here the extended open(boolean) method:

	/**
	 * Opens this bundle file.
	 * @param keepLock true if the open lock should be retained
	 * @throws IOException
	 */
	private void open(boolean keepLock) throws IOException {
		openLock.lock();
		try {
			if (closed) {
				boolean needBackPressure = mruListAdd();
				if (needBackPressure) {
					// release lock before applying back pressure
					openLock.unlock();
					try {
						mruListApplyBackPressure();
					} finally {
						// get lock back after back pressure
						openLock.lock();
					}
				}
				// check close again after getting open lock again
				if (closed) {
					// always add again if back pressure was applied in case
					// the bundle file got removed while releasing the open lock
					if (needBackPressure) {
						mruListAdd();
					}
					// This can throw an IO exception resulting in closed remaining true on exit
					doOpen();
					closed = false;
					if (debug.DEBUG_BUNDLE_FILE_OPEN) {
						Debug.println("OPENED bundle file - " + toString()); //$NON-NLS-1$
					}
				}
			} else {
				mruListUse();
			}
		} catch (Throwable t) {
			if (debug.DEBUG_BUNDLE_FILE_OPEN) {
				Debug.println("OPENED bundle file " + toString() + " with exception " + t.getMessage()); //$NON-NLS-1$ //$NON-NLS-2$
				Debug.printStackTrace(t);
			}
			throw t;
		} finally {
			if (!keepLock || closed) {
				openLock.unlock();
			} else {
				lastStackWhereOpenLockWasObtainedAndKept = Thread.currentThread().getStackTrace();
				if (debug.DEBUG_BUNDLE_FILE_OPEN) {
					Debug.println(
							"OPENED bundle file " + toString() + " and keeping lock because keepLock is " + keepLock //$NON-NLS-1$ //$NON-NLS-2$
						+ " and closed is " + closed); //$NON-NLS-1$
				}
			}
		}
	}

It's too bad the Debug.printStackTrace(t); doesn't seem to have produced any output. I wonder why. To at least get a hunch of the type of exception thrown I'll try to more carefully print at least the exception type, and then in a separate statement try to walk the stack frames one by one instead of logging with the Debug.printStackTrace(t). Maybe some StackOverflowError or OutOfMemoryError? Something that depletes resources to the point that even the Debug.printStackTrace(t) doesn't terminate normally.

@axeluhl
Copy link
Author

axeluhl commented May 12, 2022

It was a StackOverflowError. Here goes the log:

OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/143/0/.cp/jars/surefire-booter-2.22.0.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/143/0/.cp/jars/surefire-api-2.22.0.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/143/0/.cp/jars/surefire-api-2.22.0.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
Caught Throwable java.lang.StackOverflowError with message null trying to lockOpen /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jarOPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is falseCaught Throwable java.lang.StackOverflowError with message null trying to lockOpen /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jarOPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is falseCaught Throwable java.lang.StackOverflowError with message null trying to lockOpen /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar
java.lang.StackOverflowErrorOPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false
OPENED bundle file /home/hudson/workspace/equinox-3.18.1/java/....test/bin/work/configuration/org.eclipse.osgi/36/0/.cp/junit.jar and keeping lock because keepLock is true and closed is false

This explains why the pull request that handles any Throwable now properly cleans up the lock. It doesn't really explain why while running the Tycho eclipse-test-plugin tests the class loaders run astray like this. But that's really another issue to investigate.

Thanks again for the fix, and hoping to see this soon in a p2 repo we can include in our target platform.

@axeluhl
Copy link
Author

axeluhl commented May 12, 2022

FYI, the code with extensive debugging is here: https://github.com/axeluhl/equinox.framework/tree/closeable-bundle-logging

@axeluhl
Copy link
Author

axeluhl commented May 13, 2022

Increasing the stack size to 100m (-Xss100m) for the test run doesn't avoid the StackOverflowError, so this has to be something more fundamental about how class loading or bundle file handling recurses. I want to try to somehow record the full stack during the StackOverflowError in order to understand what's really causing this.

@axeluhl
Copy link
Author

axeluhl commented May 13, 2022

The call stack depth at the occurrence of the StackOverflowError is 102400. I forgot that when I'm looking at threads through the debugger and all threads are locked up, I cannot evaluate any expressions...

@tjwatson
Copy link
Contributor

Perhaps tycho is replacing System.out and that is causing some kind of endless recursion. I think we have some problematic capturing of System.out in org.eclipse.osgi.internal.debug.Debug:

	/**
	 * The PrintStream to print debug messages to.
	 */
	public static PrintStream out = System.out;

If someone calls java.lang.System.setOut(PrintStream) we would still use the previous PrintStream. Maybe the previous one starts causing issues?

But you have seen the issue without any trace enabled so that seems like it would rule out issues with accessing the out in Debug.

@axeluhl
Copy link
Author

axeluhl commented May 13, 2022

Finally:

java.lang.StackOverflowError
        at sun.nio.cs.SingleByte$Encoder.encodeLoop(SingleByte.java:219)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:579)
        at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:271)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
        at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:129)
        at java.io.PrintStream.write(PrintStream.java:526)
        at java.io.PrintStream.print(PrintStream.java:669)
        at java.io.PrintStream.println(PrintStream.java:823)
        at java.lang.Throwable$WrappedPrintStream.println(Throwable.java:784)
        at java.lang.Throwable.printStackTrace(Throwable.java:689)
        at java.lang.Throwable.printStackTrace(Throwable.java:677)
        at org.eclipse.osgi.internal.debug.Debug.printStackTrace(Debug.java:378)
        at org.eclipse.osgi.storage.bundlefile.CloseableBundleFile.open(CloseableBundleFile.java:128)
        at org.eclipse.osgi.storage.bundlefile.CloseableBundleFile.lockOpen(CloseableBundleFile.java:80)
        at org.eclipse.osgi.storage.bundlefile.CloseableBundleFile.getEntry(CloseableBundleFile.java:304)
        at org.eclipse.osgi.internal.loader.classpath.ClasspathEntry.findEntry(ClasspathEntry.java:195)
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findClassImpl(ClasspathManager.java:621)
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findLocalClassImpl(ClasspathManager.java:607)
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findLocalClassImpl(ClasspathManager.java:587)
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findLocalClass(ClasspathManager.java:566)
        at org.eclipse.osgi.internal.loader.ModuleClassLoader.findLocalClass(ModuleClassLoader.java:335)
        at org.eclipse.osgi.internal.loader.BundleLoader.findLocalClass(BundleLoader.java:397)
        at org.eclipse.osgi.internal.loader.BundleLoader.findClass0(BundleLoader.java:500)
        at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:416)
        at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:168)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:415)
        at junit.framework.TestResult.addError(TestResult.java:37)
        at junit.framework.TestResult.runProtected(TestResult.java:148)
        at junit.framework.TestResult.run(TestResult.java:125)
        at junit.framework.TestCase.run(TestCase.java:129)
        at com.google.gwt.junit.client.GWTTestCase.run(GWTTestCase.java:247)
        at junit.framework.TestCase.run(TestCase.java:121)
        ...
        at junit.framework.TestCase.run(TestCase.java:121)
        at ....client.async.AsyncActionsExecutorTest.runTest(AsyncActionsExecutorTest.java:63)
        at junit.framework.TestCase.runBare(TestCase.java:141)
        at junit.framework.TestResult$1.protect(TestResult.java:122)
        at junit.framework.TestResult.runProtected(TestResult.java:142)
        at junit.framework.TestResult.run(TestResult.java:125)
        at junit.framework.TestCase.run(TestCase.java:129)
        at com.google.gwt.junit.client.GWTTestCase.run(GWTTestCase.java:247)
        at junit.framework.TestCase.run(TestCase.java:121)
        at ....client.async.AsyncActionsExecutorTest.runTest(AsyncActionsExecutorTest.java:63)
        at junit.framework.TestCase.runBare(TestCase.java:141)
        at junit.framework.TestResult$1.protect(TestResult.java:122)
        at junit.framework.TestResult.runProtected(TestResult.java:142)
        at junit.framework.TestResult.run(TestResult.java:125)
        at junit.framework.TestCase.run(TestCase.java:129)
        at com.google.gwt.junit.client.GWTTestCase.run(GWTTestCase.java:247)
        at junit.framework.TestCase.run(TestCase.java:121)
        at ....client.async.AsyncActionsExecutorTest.runTest(AsyncActionsExecutorTest.java:63)
        at junit.framework.TestCase.runBare(TestCase.java:141)
        at junit.framework.TestResult$1.protect(TestResult.java:122)
        at junit.framework.TestResult.runProtected(TestResult.java:142)
        at junit.framework.TestResult.run(TestResult.java:125)
        at junit.framework.TestCase.run(TestCase.java:129)
        at com.google.gwt.junit.client.GWTTestCase.run(GWTTestCase.java:247)
        at junit.framework.TestCase.run(TestCase.java:121)
        at ....client.async.AsyncActionsExecutorTest.runTest(AsyncActionsExecutorTest.java:63)
        at junit.framework.TestCase.runBare(TestCase.java:141)
        at junit.framework.TestResult$1.protect(TestResult.java:122)
        at junit.framework.TestResult.runProtected(TestResult.java:142)
        at junit.framework.TestResult.run(TestResult.java:125)
        at junit.framework.TestCase.run(TestCase.java:129)
        at com.google.gwt.junit.client.GWTTestCase.run(GWTTestCase.java:247)

So this is really some JUnit problem we need to take a closer look at, fortunately independent of this bug. And it all makes sense now: The SOE causing the lock to not be released, eventually leading to the deadlock.

@tjwatson
Copy link
Contributor

@axeluhl thanks for all the analysis and testing on this one! Looking back over this issue I think the most likely place the SOE was happening was in the call to org.eclipse.osgi.storage.bundlefile.CloseableBundleFile.mruListUse() for the case where closed equals false. This seems more likely because in the default case you were seeing issues without any trace enabled. The call to CloseableBundleFile.mruListUse() adds a few more frames to the stack and, in my opinion, is the place where the SOE is more likely to happen.

I don't expect any more investigation is needed to confirm that. Just placing this comment here in case we need to remember some of the details later.

I'm not sure what project to suggest with a report on the actual cause of the SOE ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants