Seemingly random crashes related to "OSError" "Undefined name: CON000000XX" #60

masternerdguy · 2023-12-22T00:18:56Z

I am very new to APL, but I have been working on a proof-of-concept web application to demonstrate the potential use of Dyalog / Jarvis to others. However, I keep getting these seemingly random crashes - sometimes it only takes a minute, sometimes it takes half an hour, but Jarvis always seems to exit with a cryptic "OS / Undefined Name" error. I could be wrong, but I don't think this is related to the code I am running. I've tried hitting my endpoints with good data and error cases, but I can't find a pattern.

I've attached the error I've been receiving along with the code I am running under Jarvis. I am using the Jarvis docker container, and have included the command invoking it. Is this an issue with Jarvis itself, or just a problem with the code I'm running?

apl-birdie-app.zip
output.txt

For what its worth, I've tried this both under WSL and Xubuntu and it seems to behave the same in both cases.

Let me know if you have any questions :)

bpbecker · 2023-12-22T19:55:33Z

Thanks for reporting this... can you give me more information about the client that's accessing Jarvis? For instance, in your example it's failing on CON00000026, which means that there were 26 previous connections (CON0000000-CON00000025). I'm trying to find a consistent reproduction of the error - so any additional information you can provide might help! Thanks again!

/Brian

masternerdguy · 2023-12-23T05:19:07Z

Brian

Here is a screenshot of the setup:

As you can see, I have Firefox browsing to Jarvis on port 7777 with the "ListBirdies" endpoint selected. This endpoint receives an empty JSON payload, and I pressed the "Send" button about a dozen times (a few seconds apart each time) and then waited a couple of minutes. After waiting a couple of minutes, the error occurred.

I did notice that error contained an additional line about a socket closed while receiving data, so I restarted Jarvis and repeated the same procedure, which gave the original error:

Something I have observed is that CON00000000 seems to be the most frequent one mentioned, regardless of the number of requests.

To summarize, it can be reproduced by pressing "Send" about a dozen times, waiting a few seconds between requests, with an empty JSON document on the "ListBirdies" endpoint. After doing so, and waiting, it eventually crashes.

Let me know if you have any more questions!

bpbecker · 2023-12-23T16:20:01Z

Thanks for the feedback! Curiously, this only seems to happen when Jarvis is running in the Docker container. I've been able to replicate the symptoms running the container under Docker for Windows. I can run Jarvis directly under Ubuntu, Windows, even the Raspberry Pi without this issue showing up. The investigation continues...

bpbecker · 2024-01-12T20:36:07Z

Fixed in 9cbec10 (released in v1.16.3)

This was an interesting problem - after pouring over the Jarvis code, I couldn't find anything that might cause the conditions that result in this error. I was unable to recreate it on my laptop. However, when I ran it in a container under Docker for Windows, I was able to recreate it. It turned out that it has to do with when a connection times out. The default is to time out and close a connection after 30 seconds of inactivity. What happens is:

We track connections by maintaining a namespace containing namespaces for each connection and with an index matrix that provides for fast lookup and management.
We identify timed out connection(s)
We close them with Conga's Close method
We expunge the namespace(s) associated with the connection(s) and remove the entries in the index matrix
We drop back in the server listening loop

In the container, I hacked in some additional debugging/diagnostics and was able to find that there's a delay between when we close the connection (Conga marks it for closure at that point) and when Conga has actually cleaned up its object tree to remove the connection. I believe what was happening was that we drop back into the listening loop before Conga removed the connection and when the connection is actually removed, it's signalling a "Closed" event on a connection that no longer exists. When we get a "Close" event, we remove the connection from the tracking mechanism. But since the connection was previously removed, we were getting the value error.

I was able to witness the problem occurring on another person's system that wasn't running Docker. The reason I wasn't able to reproduce the problem on my laptop (without Docker) is that my laptop is fast enough for Conga to clean things up in its own thread before we drop back into the listening loop. So, the problem was only related to Docker in that Docker container created an environment where the timing would cause the circumstances to more consistently occur.

So, there's a two-pronged approach to addressing this:

add some paranoia code in Jarvis to detect when we're trying to remove a connection that no longer exists. This has been done and is sufficient to prevent the problem from occurring again.
fix Conga to not signal a "Close" event on a connection that's already been closed. This is pending further investigation by the Conga development team.

The Dyalog/Jarvis container is updated whenever we make a commit to the Jarvis master branch and as such now contains this fix.

Thanks for your help and patience!
/Brian

mkromberg · 2024-01-13T06:58:59Z

I just wanted to add that, although I agree that Conga should not signal a closed event, there is a small timing gap which would allow this problem to occur very infrequently without this behaviour. If the client closed the connection in the split-second before the server decides to close the connection due to inactivity. In this case the Close event might legitimately be in the Wait queue AFTER the server has closed the connection.

masternerdguy · 2024-01-14T20:31:41Z

Just to close the loop, I can confirm the issue is no longer occurring:

Thank you so much for fixing it and for the excellent explanation of the root cause!

bpbecker pushed a commit that referenced this issue Jan 12, 2024

v1.16.2 - Address #60

9cbec10

bpbecker closed this as completed Jan 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seemingly random crashes related to "OSError" "Undefined name: CON000000XX" #60

Seemingly random crashes related to "OSError" "Undefined name: CON000000XX" #60

masternerdguy commented Dec 22, 2023

bpbecker commented Dec 22, 2023

masternerdguy commented Dec 23, 2023

bpbecker commented Dec 23, 2023

bpbecker commented Jan 12, 2024

mkromberg commented Jan 13, 2024

masternerdguy commented Jan 14, 2024

Seemingly random crashes related to "OSError" "Undefined name: CON000000XX" #60

Seemingly random crashes related to "OSError" "Undefined name: CON000000XX" #60

Comments

masternerdguy commented Dec 22, 2023

bpbecker commented Dec 22, 2023

masternerdguy commented Dec 23, 2023

bpbecker commented Dec 23, 2023

bpbecker commented Jan 12, 2024

mkromberg commented Jan 13, 2024

masternerdguy commented Jan 14, 2024