Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seemingly random crashes related to "OSError" "Undefined name: CON000000XX" #60

Closed
masternerdguy opened this issue Dec 22, 2023 · 6 comments

Comments

@masternerdguy
Copy link

I am very new to APL, but I have been working on a proof-of-concept web application to demonstrate the potential use of Dyalog / Jarvis to others. However, I keep getting these seemingly random crashes - sometimes it only takes a minute, sometimes it takes half an hour, but Jarvis always seems to exit with a cryptic "OS / Undefined Name" error. I could be wrong, but I don't think this is related to the code I am running. I've tried hitting my endpoints with good data and error cases, but I can't find a pattern.

I've attached the error I've been receiving along with the code I am running under Jarvis. I am using the Jarvis docker container, and have included the command invoking it. Is this an issue with Jarvis itself, or just a problem with the code I'm running?

apl-birdie-app.zip
output.txt
Screenshot_2023-12-21_19-03-44

For what its worth, I've tried this both under WSL and Xubuntu and it seems to behave the same in both cases.

Let me know if you have any questions :)

@bpbecker
Copy link
Contributor

Thanks for reporting this... can you give me more information about the client that's accessing Jarvis? For instance, in your example it's failing on CON00000026, which means that there were 26 previous connections (CON0000000-CON00000025). I'm trying to find a consistent reproduction of the error - so any additional information you can provide might help! Thanks again!

/Brian

@masternerdguy
Copy link
Author

Brian

Here is a screenshot of the setup:

Screenshot_2023-12-23_00-05-07

As you can see, I have Firefox browsing to Jarvis on port 7777 with the "ListBirdies" endpoint selected. This endpoint receives an empty JSON payload, and I pressed the "Send" button about a dozen times (a few seconds apart each time) and then waited a couple of minutes. After waiting a couple of minutes, the error occurred.

I did notice that error contained an additional line about a socket closed while receiving data, so I restarted Jarvis and repeated the same procedure, which gave the original error:

Screenshot_2023-12-23_00-11-13

Something I have observed is that CON00000000 seems to be the most frequent one mentioned, regardless of the number of requests.

To summarize, it can be reproduced by pressing "Send" about a dozen times, waiting a few seconds between requests, with an empty JSON document on the "ListBirdies" endpoint. After doing so, and waiting, it eventually crashes.

Let me know if you have any more questions!

@bpbecker
Copy link
Contributor

Thanks for the feedback! Curiously, this only seems to happen when Jarvis is running in the Docker container. I've been able to replicate the symptoms running the container under Docker for Windows. I can run Jarvis directly under Ubuntu, Windows, even the Raspberry Pi without this issue showing up. The investigation continues...

bpbecker pushed a commit that referenced this issue Jan 12, 2024
@bpbecker
Copy link
Contributor

Fixed in 9cbec10 (released in v1.16.3)

This was an interesting problem - after pouring over the Jarvis code, I couldn't find anything that might cause the conditions that result in this error. I was unable to recreate it on my laptop. However, when I ran it in a container under Docker for Windows, I was able to recreate it. It turned out that it has to do with when a connection times out. The default is to time out and close a connection after 30 seconds of inactivity. What happens is:

  • We track connections by maintaining a namespace containing namespaces for each connection and with an index matrix that provides for fast lookup and management.
  • We identify timed out connection(s)
  • We close them with Conga's Close method
  • We expunge the namespace(s) associated with the connection(s) and remove the entries in the index matrix
  • We drop back in the server listening loop

In the container, I hacked in some additional debugging/diagnostics and was able to find that there's a delay between when we close the connection (Conga marks it for closure at that point) and when Conga has actually cleaned up its object tree to remove the connection. I believe what was happening was that we drop back into the listening loop before Conga removed the connection and when the connection is actually removed, it's signalling a "Closed" event on a connection that no longer exists. When we get a "Close" event, we remove the connection from the tracking mechanism. But since the connection was previously removed, we were getting the value error.

I was able to witness the problem occurring on another person's system that wasn't running Docker. The reason I wasn't able to reproduce the problem on my laptop (without Docker) is that my laptop is fast enough for Conga to clean things up in its own thread before we drop back into the listening loop. So, the problem was only related to Docker in that Docker container created an environment where the timing would cause the circumstances to more consistently occur.

So, there's a two-pronged approach to addressing this:

  • add some paranoia code in Jarvis to detect when we're trying to remove a connection that no longer exists. This has been done and is sufficient to prevent the problem from occurring again.
  • fix Conga to not signal a "Close" event on a connection that's already been closed. This is pending further investigation by the Conga development team.

The Dyalog/Jarvis container is updated whenever we make a commit to the Jarvis master branch and as such now contains this fix.

Thanks for your help and patience!
/Brian

@mkromberg
Copy link
Contributor

I just wanted to add that, although I agree that Conga should not signal a closed event, there is a small timing gap which would allow this problem to occur very infrequently without this behaviour. If the client closed the connection in the split-second before the server decides to close the connection due to inactivity. In this case the Close event might legitimately be in the Wait queue AFTER the server has closed the connection.

@masternerdguy
Copy link
Author

Just to close the loop, I can confirm the issue is no longer occurring:

Screenshot_2024-01-14_15-29-34

Thank you so much for fixing it and for the excellent explanation of the root cause!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants