Pool fails to process submissions > 16 #8
Hi Rob, could you try and run this using the new version? The only thing you'd have to change to make it work with this version is the initialization of the pool. It would have to look something like this:
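(The snippet originally posted here seems to have been lost; the following is only a rough reconstruction. The class names and constructor parameters are assumptions based on the library's impl package and may not match the snapshot build exactly.)

```java
// Rough reconstruction only; the types and constructor signature are assumptions,
// not necessarily the snapshot's actual API.
import net.viktorc.pp4j.impl.JavaProcessManagerFactory;
import net.viktorc.pp4j.impl.JavaProcessPoolExecutor;
import net.viktorc.pp4j.impl.SimpleJavaProcessConfig;

JavaProcessPoolExecutor pool = new JavaProcessPoolExecutor(
    new JavaProcessManagerFactory<>(new SimpleJavaProcessConfig()),
    8,   // minimum pool size
    8,   // maximum pool size
    0);  // reserve size
```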
If you encounter the same problem using the new version as well, what would help me identify the cause is trace logs. PP4J uses SLF4J, so if you have a compatible logging framework to bind to it, you can set the logging level to TRACE. I use Logback for the tests, which makes logging configuration really simple. This is a good description of how to configure logging settings using SLF4J and Logback: https://www.baeldung.com/logback.
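For reference, a minimal logback.xml that turns on trace logging for the library could look like this (assuming net.viktorc.pp4j is the package the library logs under):

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- Trace-level output for PP4J only; everything else stays at INFO. -->
  <logger name="net.viktorc.pp4j" level="TRACE"/>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```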
Hi Viktor,
So, 16 requests with a pool of 8. In the logs, a successful response is denoted:
and a failure:
FYI the correct output is:
The calculation takes up to 20 seconds.
For the other scenario:
So after 12 or so iterations it fails with a timeout:
I have some timeouts on the threads etc., but I've extended those and see no difference, so I assume they are coming from your code?
The reason for this whole app is that only a single Matlab calculation is possible at a time in a process, but Matlab only uses one CPU core per calc. So to scale you need to have one calc per CPU core and run them in separate processes. The calculations come via an ActiveMQ JMS server, so lots arrive there, and ActiveMQ distributes them to a pool of these matlab-calc servers. Since ActiveMQ already queues them, in an optimal design the matlab-calc server will signal that its pool is all busy when a new calc arrives, and ActiveMQ will try another server. Is it possible to find the current depth of submissions from PP4J? Also, I discovered that enabling debug in my code resulted in the calc outputting debug in the JavaProcess, which caused the process to assume it was complete and return. Turning it off solved that, but it's worth considering that a process may output several lines while running, then a final answer. Is there a way to access or control that?
Hi Rob, Thanks for giving it a go with the new version and for uploading the log file. I'm looking through it right now. It seems to be an issue with decoding the tasks sent to the Java processes. I am not sure what triggers this yet as some processes can apparently execute two tasks without any problems. Out of curiosity, does it work now when you don't reuse the processes? EDIT: Ah, nevermind. I forgot to refresh and just saw your comments.
I am not sure I know what you mean by the current depth of submissions. Can you elaborate on that, please? As for outputting stuff to the standard streams from tasks submitted to the Java process pool, it should not be a problem. The process output handler is invoked every time a line is output to the stream, but it will only consider the task complete if this line is a Base64 encoded, serialized instance of a special class used for encapsulating the results of tasks. If it is not, it should be just ignored. If that's not the case, there is probably a bug somewhere. I would be keen to take a look at the logs of that run, if you happen to have them.
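Roughly, the decision the output handler makes per line looks like this (just a sketch of the behaviour described above, not the library's actual code; TaskResult is a made-up stand-in for the internal result wrapper class):

```java
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.util.Base64;

// Sketch: called once for every line the slave process prints to stdout.
// Only a line that decodes to a serialized result wrapper completes the task;
// any other output (debug logging, etc.) is ignored.
static boolean isTaskResult(String line) {
  try {
    byte[] bytes = Base64.getDecoder().decode(line.trim());
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return in.readObject() instanceof TaskResult; // TaskResult is hypothetical
    }
  } catch (Exception e) {
    return false; // not a Base64-encoded serialized result, so not the task's response
  }
}
```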
I did consider it being a charset mismatch, but I don't see how that could happen. Both the pool and the processes are coded explicitly to use ISO_8859_1.
When I send a new submission I get a trace line:
Is there a way to get the number of queued submissions?
Ah, yes. You can invoke a method on the pool to get that.
Re ISO_8859_1: my JVM (on Linux) is probably using UTF-8 or UTF-16. But since it's sending the same job each time, that should cause the same failure each time. What about the 'space at the end', e.g. using trim() on the decoded string? Maybe there is a spurious CR/LF or space happening?
All lines are trimmed before decoding, so unfortunately that's probably not it either. By the way, the timeouts when terminating the processes after each submission are caused by decoding/serialization issues as well. I can see in the logs that a clearly Base64 encoded string is printed to the process' standard out, yet the process output handler does not recognize it as a legit response, and therefore the task is considered to be executing indefinitely.
Is this the same behaviour you observed when using v2.2? Because the serialization and encoding mechanism did change a bit, perhaps for the worse.
Yes, same with v2.2. That fits with something else I noticed before adding timeouts to the cleanup at my end: there were often 2-4 processes waiting indefinitely.
Cool, thanks. Then it has probably been the same problem all along. I'll keep digging tomorrow to identify what goes wrong.
Looking at https://github.com/ViktorC/PP4J/blob/master/src/main/java/net/viktorc/pp4j/impl/JavaObjectCodec.java, JavaObjectCodec is a singleton and has the encoder and decoder as attributes, so all callers share the same instances of them. Wondering if it's thread-safe. The standard way of using them may be worth testing with:
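(The snippet seems to have gone missing from the comment above; presumably it showed the plain per-call JDK usage, something like the sketch below. Worth noting that the JDK documents Base64.Encoder and Base64.Decoder instances as safe for concurrent use, so sharing them should in theory be fine.)

```java
import java.util.Base64;

// Standard per-call JDK usage: fetch the encoder/decoder where they are needed
// instead of sharing pre-built instances as fields of the singleton.
static String encode(byte[] payload) {
  return Base64.getEncoder().encodeToString(payload);
}

static byte[] decode(String line) {
  return Base64.getDecoder().decode(line.trim());
}
```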
That's what it was like in v2.2, which suffers from the same problem. I changed to having them as members of the codec singleton.
Nevertheless, that's a good lead. I'm pretty sure the problem is somewhere within that class.
I was trying various things yesterday. Made some progress.
Looks like this is due to the Matlab native code after all. Basically, I think it causes the JVM not to exit, causing the effects we see above. I have the Java wrapper source code for the Matlab natives, but not the deeper lib, so it's difficult to see what's causing it. One solution might be to add a different way to recognise that a JVM is free, e.g. execute the callable and forceCompletion when it returns, ignoring remnants?
Or maybe get the callable to output a flag to stdout in a finally clause?
I did try to recreate the problem, to no avail. I even used JNI (with that simple native code I use for the tests), but everything worked fine. I suspect JNI is the catalyst, though. The standard streams are redirected in the Java processes maintained by the pool, so if you submit a task that prints to the standard out stream, that output goes to the pool's output handler. JNI also has a peculiar way of crashing the JVM if there is an error in the native code. As you suggested, it might also just corrupt it instead of completely blowing it up.
The slave Java process sends back a response to the pool if either the Callable's execution completes or an exception is thrown. If the callable completes, the response will contain its result, and if an exception is thrown, it will contain the exception (which you can then access wrapped in an ExecutionException when calling get() on the returned future). So I guess this response functions as a flag. As for terminating the process, it is again done by the exchange of specific request and response objects.
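From the caller's side, that looks like the usual Future contract (sketch only; the pool is assumed to expose the standard ExecutorService-style submit method):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch: submit a task to the pool and unwrap its result or failure.
static String run(ExecutorService pool, Callable<String> calc) throws InterruptedException {
  Future<String> future = pool.submit(calc);
  try {
    // Completes once the slave process prints its encoded response line.
    return future.get();
  } catch (ExecutionException e) {
    // The exception thrown inside the slave JVM travels back as the cause.
    throw new IllegalStateException("Calculation failed in the slave process", e.getCause());
  }
}
```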
While JNI is a dodgy beast, the actual JNI code used here is quite robust. If I run it in a single JVM it rarely causes any problems. I suspect it starts an Executor or some other thread which holds the JVM open, or maybe it's an anomaly of the in and out stream redirection? How does your lib decide when the task is complete?
I built a new version that has somewhat better logging, which might help us identify when and where it all goes south. If you have some time to run it again with the new version, I would gladly inspect the logs.
Basically, it just waits for the callable to finish running or throw an exception. This is what happens in the Java process:
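(The snippet originally posted here appears to have been lost; below is a rough reconstruction of the loop described in this thread, not the library's actual source.)

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.concurrent.Callable;

// Sketch of the slave JVM's main loop: read a line from stdin, decode and
// deserialize the Callable, run it, then serialize, Base64-encode and print
// the result (or the caught exception) as a single line on stdout.
public class SlaveLoopSketch {

  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.ISO_8859_1));
    String line;
    while ((line = in.readLine()) != null) {
      Object response;
      try {
        byte[] bytes = Base64.getDecoder().decode(line.trim());
        Callable<?> task;
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
          task = (Callable<?>) ois.readObject();
        }
        response = task.call();
      } catch (Exception e) {
        response = e; // failures travel back the same way as results
      }
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
        oos.writeObject(response);
      }
      System.out.println(Base64.getEncoder().encodeToString(buffer.toByteArray()));
    }
  }
}
```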
Tried the new version - logging looks the same :-(
The difference is subtle but important. Every process executor uses multiple child threads (one for taking submissions off the queue and executing them, one for listening to the process' standard out stream, one for timing periods of idleness, etc.). With the new logging, we should be able to identify which threads belong to which process executor and therefore work out exactly what was sent to which process and what was sent back in return.
There should be log entries like
OK, attached.
Thank you for the logs. I found some interesting things. I looked at a process executor that stalled. It first executed your task successfully and returned your JSON response. Then it proceeded to try and terminate the process (as expected given the
When an exception is thrown in the process, it is caught and returned within a response object. This is definitely a resiliency issue that I can address in the library, but it only solves the problem of stalling executors. I am still not sure why the stream gets corrupted in the first place. 🤔
Wonder if the Matlab JNI is interfering with the redirectStdOut?
convert 057E7200 to UTF-8 and you get 爀 |
It's gotta go deeper than that. It's the Chinese disguised as Russians pretending to be Chinese. 💥 |
Hi Rob, I've built yet another version. If the stream corruption issue is deterministic, I reckon I could pinpoint the cause or at least identify the problematic submission from the logs of a run with this version. If the stream corruption is not specific to the tasks you submit to the pool, this version will allow all submissions to succeed at the cost of terminating the Java processes with the corrupted streams and spinning up new ones.
I've run it with the new jar. It now completes the jobs successfully. It still throws the stream errors but recovers OK. Log attached.
BTW, I've done some tests using the Matlab calc directly (so single thread/CPU) and initial calcs are ~3500ms, dropping quickly to ~40ms, so the HotSpot improvement is significant. Reusing the JVMs is going to be important, I think.
You can see the effect of HotSpot here: line *** drops to 500ms calcTime, but then reverts, presumably a new JVM.
Hi Rob, Sorry for the late reply. Thank you for the logs! I looked through them, but I couldn't find a pattern that would explain the decoding issues. It seems as if the processes were receiving input to their std in from sources other than the process pool. The error messages vary from Base64 decoding errors to invalid stream header exceptions. The invalid header exceptions are especially interesting. At any rate, I built another version (to nobody's surprise) that includes the messages read from the std in stream in the error responses of the Java processes. This should allow us to see the messages sent to the processes by third parties and help us figure out where they come from. If you don't mind the hassle, the logs of a new run would be much appreciated once again. I also made a lot of general improvements to the library, so as soon as we solve this issue or decide that the library is not at fault, I'll release a new, stable version.
Logs attached, same issue :-(
Some ideas:
The Javadocs mention you must read both out and err to avoid blocking and deadlocks.
Some interesting methods in ObjectOutputStream:
Also flush() and drain(); note close() does a flush().
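For illustration, the sort of thing I mean (a sketch; whether an unflushed stream is actually the cause here is pure speculation):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// ObjectOutputStream buffers internally, so the backing byte array is only
// guaranteed to be complete after flush() or close() (which flushes).
// Encoding the buffer before that could yield a truncated payload.
static String encodeToBase64(Serializable obj) throws IOException {
  ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
    oos.writeObject(obj);
    oos.flush(); // redundant right before close(), but makes the intent explicit
  }
  return Base64.getEncoder().encodeToString(buffer.toByteArray());
}
```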
This should not be a problem because both the Java process and the process executor use
Looking at the logs, the problem is not caused by additional messages printed to the process' std in; it's the opposite. Messages sent to the process are missing chunks of characters. Usually it is the first few dozen characters of the encoded message, but sometimes a similarly sized chunk is missing from the middle of the message. I tried to reproduce the issue by sending those same instructions to a Java process pool, but they always came through in full. The only explanation for those missing chunks I can think of is something reading bytes from the std in streams of the Java processes. Do you think the native code you are executing might be doing that?
I don't think it can be the Matlab native code reading the IO. The native code is called explicitly later in the call() method. It is a transient static object, so it is not used in the deserialization process, although it is instantiated during the first Calc dehydration. That said, it does write error messages to std out (err?). I will investigate further. If miscellaneous bits of the stream are missing, then buffer overflows could be the cause. I'm not sure, but I expect that the IO is javaObject > bytestream > base64String > System.out > process.out > process.in > etc. I may be able to extend the buffer size (in Linux) and see if that helps.
Yeah, you're right, it should not be able to interfere with the messages to stdin. 🤔 The IO chain looks something like this:
The process executor encodes the Callable and writes it to the stdin of the 'slave' process through a buffered writer instantiated right after the startup of the process. In the meantime, the slave process is blocking, waiting for a newline character to arrive on its stdin; it does so using a buffered reader. Given these incomplete messages, the slave process naturally fails to decode them. The decoding fails at one of two different points: either the messages are not valid Base64 strings anymore, so the string can't even be decoded, or by chance the messages are still valid Base64 strings and the deserialization fails because the byte stream is missing information.
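Sketched out, the two ends of that pipe look roughly like this (in the library the writer and reader are created once after startup; shown inline here just for illustration):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Pool side (sketch): write the encoded Callable plus a newline to the slave's stdin.
static void sendTask(Process slave, String encodedTask) throws IOException {
  BufferedWriter writer = new BufferedWriter(
      new OutputStreamWriter(slave.getOutputStream(), StandardCharsets.ISO_8859_1));
  writer.write(encodedTask);
  writer.newLine();
  writer.flush();
}

// Slave side (sketch): block until a full line arrives on stdin.
static String awaitTask() throws IOException {
  BufferedReader reader = new BufferedReader(
      new InputStreamReader(System.in, StandardCharsets.ISO_8859_1));
  return reader.readLine();
}
```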
Hi,
I've built a MatlabExecutor class that starts a static pool as follows:
It will eventually be a servlet, so it has a doPost(request, response) method, but at present it just runs via a test that calls the MatlabExecutor wrapped in a Runnable, e.g.:
The important part of the doPost method is:
Calc.java is:
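(The listing appears to have been lost; below is a rough reconstruction based on the description in this thread. MatlabNative and runCalculation are placeholders, not the real Matlab wrapper API.)

```java
import java.io.Serializable;
import java.util.concurrent.Callable;

// Reconstruction for illustration only; the Matlab wrapper type and its
// method names are placeholders.
public class Calc implements Callable<String>, Serializable {

  private static final long serialVersionUID = 1L;

  // Static so the native wrapper is never serialized with the task; it is
  // instantiated lazily on first use inside the slave JVM.
  private static MatlabNative matlab;

  private final String input;

  public Calc(String input) {
    this.input = input;
  }

  @Override
  public String call() throws Exception {
    if (matlab == null) {
      matlab = new MatlabNative(); // placeholder for the real Matlab JNI wrapper
    }
    return matlab.runCalculation(input); // returns the JSON result seen in the logs
  }
}
```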
So now the problems.
Ideally I want to reuse JVMs for efficiency to avoid startup delays.
It's possible there is a better way to do this; I'm open to suggestions.