Skip to content
This repository has been archived by the owner on Nov 5, 2022. It is now read-only.

java.io.NotSerializableException at com.cloudbees.groovy.cps.impl.ForInLoopBlock$ContinuationImpl.itr #105

Closed
nevskrem opened this issue Mar 10, 2020 · 10 comments · Fixed by jenkinsci/workflow-cps-plugin#361
Labels

Comments

@nevskrem
Copy link

We do regularly face above issue in one of our pipelines.
Could you please have a look at #104

@bgshacklett
Copy link

A Jenkins environment which I'm currently working on has been experiencing this issue for the last two days after upgrading from version 2.78 to version 2.80 of the workflow-cps (Pipeline: Groovy) plug-in, which upgraded the version of groovy-cps from 1.31 to 1.32. Downgrading the plug-in fixed the problem.

The behavior has been quite strange, and I haven't been able to produce a reliable repro, yet. It was most notable around any step which touches the workspace. Steps like writeFile, fileExists, etc. were affected. It seems as if an error is being thrown close to the step, but is being caught and encapsulated within other exceptions, and the real stack trace is being lost.

I'll try to get a repro generated when I can.

@dwnusbaum
Copy link
Member

Feel free to post the full stack trace from the exception if either of you have it. I am not really sure what could have broken this in 1.32. Maybe it could be related to the fact that now we eagerly expand methods with default parameters so that they are CPS-transformed?

@bgshacklett
Copy link

bgshacklett commented Apr 10, 2020

We just ran into the problem again after downgrading, so it looks like associating it with 1.32 was premature. Restarting the Jenkins master seems to resolve the problem temporarily with no changes to the pipeline code.

Here's a link to the stack trace:
https://gist.github.com/bgshacklett/b9ef6f4f692cb40ff3df0939add61bb6

It seems like something might be happening prior to the CPS-transformation. Is there any way for us to get more visibility at that level?

@bgshacklett
Copy link

Got some more detail. It looks like this was being caused by running collect on a LinkedList. I changed it to use a generic Groovy List and it seems to be working better. I'm having trouble understanding why a reboot would change the behavior, or why it would be intermittent at all. I guess maybe the workflow-cps plugin is serializing in a non-deterministic fashion?

I don't necessarily believe the behavior is wrong in this case, but it would be great to get better output for troubleshooting.

@dwnusbaum
Copy link
Member

With the hint about LinkedList I was able to reproduce this in a test case in workflow-cps:

@Test public void forInLoopSerialization() throws Exception {
    WorkflowJob p = jenkins.createProject(WorkflowJob.class);
    p.setDefinition(new CpsFlowDefinition(
            "def col = new LinkedList<>([1, 2, 3])\n" +
            "for (Integer i in col) {\n" +
            "  sleep(time: 500, unit: 'MILLISECONDS')\n" + // Force the Pipeline to be serialized.
            "}\n", false));
    jenkins.buildAndAssertSuccess(p);
}

I haven't spent any time debugging to understand the root cause, but at least one problem is that IteratorHack is not intercepting this call. Maybe we need to add another overload of iterator or listIterator to IteratorHack, or we need to relax the bypass conditions like this one because of the way that LinkedList.iterator chains through to LinkedList.listIterator through AbstractSequentialList, but I'm not sure. I would step through the call to ScriptBytecodeAdapter.invokeMethod0(null/*unused*/, col, "iterator") to understand what is happening internally.

I'm having trouble understanding why a reboot would change the behavior, or why it would be intermittent at all. I guess maybe the workflow-cps plugin is serializing in a non-deterministic fashion?

What Pipeline durability level are you using? If you are using MAX_SURVIVABILITY I would expect the behavior to be deterministic, but with PERFORMANCE_OPTIMIZED, the Pipeline is not serialized after every step executes, so the behavior may appear random.

@dwnusbaum
Copy link
Member

dwnusbaum commented May 1, 2020

I spent a bit of time debugging this today, and realized that the above test is actually flaky, so it makes sense that you all only see the issue intermittently. Adding a sleep to the beginning of the test did not seem to affect the flakiness.

I patched groovy-cps and workflow-cps to see if IteratorHack was being called at all when the test fails, and from what I can tell, the proximate cause is that sometimes this line calls IteratorHack.iterator, and sometimes it does not. I thought that maybe the problem was that the category methods were not present on the current thread in the cases where IteratorHack.iterator is not called, but as far as I can tell they are present in both cases. I suspected some kind of issue in Groovy itself, so I checked if there were any fixes to GroovyCategorySupport or related code in newer versions of Groovy that looked related, but didn't find anything obvious. I checked the call stacks inside of ForInLoopBlock.loopHead and they are identical in the passing and non-passing cases.

The next thing I would try is to put some breakpoints in MetaClassImpl.invokeMethod and related methods and debug the test case to try to figure out why the category method is not being found in some cases.

@bgshacklett
Copy link

Apologies for the delay in response, but I did verify that we are using MAX_SURVIVABILITY for our pipelines.

@dwnusbaum dwnusbaum added the bug label May 6, 2020
@dwnusbaum
Copy link
Member

dwnusbaum commented Jun 9, 2020

I had a little more time to debug today and was able to narrow things down by adding some breakpoints to MetaClassImpl.invokeMethod and the methods it calls.

In cases where the test in #105 (comment) passes, entry on MetaClassImpl.getMethods line 693 refers to List.iterator, whereas when the test fails, it refers to Deque.iterator. When entry is List.iterator, filterMatchingMethodForCategory correctly replaces the List.iterator call with the IteratorHack.iterator(List) call, and when entry is Deque.iterator, it does not (This is the correct behavior as far as I can tell from reading filterMatchingMethodForCategory and getMatchKindForCategory, in particular these conditions are false because Deque and List are unrelated).

So, it looks like MetaMethodIndex.getMethods is nondeterministic. I'll check to see if there any any changes to this class in newer versions of Groovy that look relevant.

The good news though is that I am pretty sure we can add a workaround in IteratorHack to fix the issue, either by adding overloads of iterator for Deque or by replacing the existing overloads for List and Set with a single one based on Collection (not sure if that would work though)

@dwnusbaum
Copy link
Member

dwnusbaum commented Jun 10, 2020

I tried reproducing the nondeterminism in raw Groovy but was unable to do so. This is the test case I used, which attempts to mimic the code in ForInLoopBlock:

import java.util.Deque;
import java.util.List;
import java.util.LinkedList;
import org.codehaus.groovy.runtime.GroovyCategorySupport;
import org.codehaus.groovy.runtime.ScriptBytecodeAdapter;

public class MyCategory {
    public static <T> Iterator<T> iterator(List<T> c) {
        throw new Exception("Using List category method");
    }

    public static <T> Iterator<T> iterator(Deque<T> c) {
        throw new Exception("Using Deque category method");
    }
}

def list = new LinkedList(Arrays.asList(1, 2, 3));
GroovyCategorySupport.use(MyCategory.class, { ->
    println(ScriptBytecodeAdapter.invokeMethod0(null, list, "iterator"));
});

In Groovy 2.4.15 and older, this always calls the list category method, and in Groovy 2.4.16 and newer, it always calls the deque category method. I looked at the commits between those versions in Groovy and nothing really stands out. Changing the reflective call to list.iterator() changes the behavior. Perhaps the issue can only be reproduced when using embedded Groovy from a Java program, I'm not sure.

I think I am done trying to figure out the exact root cause at this point, so I'll just file a PR to workflow-cps to add a Deque overload to IteratorHack along with a new test.

@bgshacklett
Copy link

Thank you for your efforts on this!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
3 participants