Interrupting stuck network processes #1356

Open
Mellvik opened this issue Jul 1, 2022 · 4 comments
Labels
bug Defect in the product

Comments

@Mellvik
Contributor

Mellvik commented Jul 1, 2022

My intention was to put this on the 0.7.0 list, but it may as well be considered a regular bug.

My memory tells me we've touched this one before, but searching the repository didn't give me any hints.

Anyway, stuck network processes hardly ever react to signals - kill -9 ... whatever. Say we do a 'net stop' while an incoming telnet session is active: the telnetd process is stuck forever, and a reboot is required.

There are many such examples, in particular when running elks-to-elks networking, because of timeouts that eventually end up in hung processes, but this is the easiest to reproduce.

Attacking this problem would be really helpful. And I'm interested in doing so. @ghaerr, any ideas about where to start?

-M

Mellvik added the bug (Defect in the product) label Jul 1, 2022
@ghaerr
Owner

ghaerr commented Jul 1, 2022

This is going to be a really tough one - the biggest reason is that even if we "fix" the interruptible_sleep() that the processes are likely waiting on so that they "return out of the kernel" (and are killed), the kernel network routines will be left with their semaphores in possibly incorrect/bad states (e.g. a semaphore may be left in the always off or on position, which will fail when networking is "restarted".)

To see this, I advise you to start by looking at elks/net/ipv4/af_inet.c: each of the "interruptible_sleep" routines will need to be unwound properly, on a case-by-case basis. Remember how sleep works: a process is effectively unscheduled until a wake_up call is made, except that if the process receives a signal, it will wake and its kernel task will continue. This means it will return from interruptible_sleep, and code can check current->signal to see whether the wakeup was the result of a wake_up call or a signal. THEN, each individually coded sleep can possibly be unwound properly.
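As a rough illustration of the check that would follow each sleep, here is a minimal userspace sketch. It is not ELKS kernel code: `struct task_model`, `classify_wakeup()` and the `after_sleep` values are hypothetical names, and returning `-EINTR` is an assumption based on the usual Unix convention rather than on what af_inet.c does today.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel task structure; only the
 * pending-signal bitmask matters for this sketch. */
struct task_model {
    unsigned long signal;   /* nonzero when a signal is pending */
};

enum after_sleep {
    REPLY_READY = 0,   /* woken by a wake_up call and the reply is there */
    SLEEP_AGAIN = 1    /* woken, but nothing to do yet: sleep again */
};

/* Models the decision made right after an interruptible sleep returns.
 * Returns -EINTR when only a signal explains the wakeup; the caller
 * would then unwind whatever it set up before sleeping. */
int classify_wakeup(const struct task_model *t, bool reply_ready)
{
    assert(t != NULL);
    if (reply_ready)        /* assumption: finish the call if we can */
        return REPLY_READY;
    if (t->signal)
        return -EINTR;
    return SLEEP_AGAIN;
}
```

Checking the reply before the signal is a design choice in this sketch: if both are true, completing the call normally leaves nothing to unwind.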

Let's take inet_bind() in the above file to start: the rwlock semaphore is DOWN, so that would have to be reversed, and ktcp has been given a request, for which there has (presumably) not been a response. We also need to cancel tcpdev_clear_data_avail(). After all this, we could then allow the process to be killed, which occurs after the current system call is completed. But if ktcp is still running, it will be put into a bad state, so it needs to be killed also, and it may be sleeping in a similar kernel routine, with the same or a different set of semaphores gated/unlocked/etc.

This whole dilemma is the reason many UNIXes and sometimes Linux still hang and require a reboot, despite repeated kill requests. In some sense, what is needed is a kind of "kernel reset" that resets all network variables to their starting state, and kills all associated processes - a big kluge, and not always possible since the kernel doesn't "know" which processes might be "network" processes.
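To make the unwinding concern concrete, here is a small userspace model of a bind-style call. Everything in it is hypothetical: `struct net_model`, `model_bind()` and its flags are illustration only and are not taken from the ELKS sources; the real inet_bind() may hold more state than this. It only shows why returning on a signal without undoing the lock leaves the next caller stuck.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state a bind-style call touches. */
struct net_model {
    int  rwlock;            /* binary semaphore: 1 = free, 0 = held */
    bool request_pending;   /* request handed to ktcp, no reply yet */
};

/* Models one bind-style call.
 *   interrupted      - the sleep ended because of a signal, not a reply
 *   unwind_on_signal - whether the signal path undoes its setup */
int model_bind(struct net_model *n, bool interrupted, bool unwind_on_signal)
{
    assert(n->rwlock == 0 || n->rwlock == 1);

    if (n->rwlock == 0)
        return -EAGAIN;     /* a real kernel would block here instead */

    n->rwlock = 0;              /* down(): take the lock */
    n->request_pending = true;  /* hand the request to ktcp */

    /* ... the interruptible sleep would happen here ... */

    if (interrupted) {
        if (unwind_on_signal) {
            n->request_pending = false;  /* drop the outstanding request */
            n->rwlock = 1;               /* up(): release the lock */
        }
        return -EINTR;
    }

    n->request_pending = false;  /* reply consumed */
    n->rwlock = 1;               /* up(): release the lock */
    return 0;
}
```

With `unwind_on_signal` false, a second call on the same `net_model` returns `-EAGAIN` in this model, which stands in for the permanent hang. With it true, the second call succeeds. The model says nothing about ktcp's own state, which is the harder half of the problem.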

Take a look and we'll go from here after you've looked further at it. Frankly, even if we "fix" it, we really can't guarantee kernel correctness afterwards, which might cause very strange problems for ourselves and users after a kill.

@Mellvik
Contributor Author

Mellvik commented Jul 2, 2022 via email

@ghaerr
Owner

ghaerr commented Jul 2, 2022

> enable ktcp to terminate all open connections when terminating, regardless of cause.

Well, all sockets opened are automatically closed by ktcp when it exits. I don't think this currently changes any processes sleeping on their own socket connections to the kernel though. However, knowing that there is no /dev/tcpdev open (meaning ktcp has exited) may allow for a killed process to exit its kernel task state more cleanly, without worry for ktcp corruption as mentioned above.

That would mean one would execute net stop, then possibly kill the remaining network processes... is that what you're doing now, when seeing the hanging processes?
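The idea of using "is /dev/tcpdev still open" as a hint could look roughly like the sketch below. This is a hypothetical decision helper, not existing ELKS code: `choose_exit_action()`, the `exit_action` values and the `tcpdev_open` flag are invented names, and how the kernel would actually learn that ktcp has exited is left open.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for a socket call sleeping in the kernel. */
enum exit_action {
    KEEP_WAITING,        /* no signal pending: go back to sleep */
    UNWIND_LOCAL_ONLY,   /* ktcp is gone: only local state needs resetting */
    UNWIND_AND_NOTIFY    /* ktcp is alive: it must also drop the request */
};

/* Decide what a sleeping call should do, given whether a signal is
 * pending and whether ktcp still holds /dev/tcpdev open. */
enum exit_action choose_exit_action(bool signal_pending, bool tcpdev_open)
{
    if (!signal_pending)
        return KEEP_WAITING;
    return tcpdev_open ? UNWIND_AND_NOTIFY : UNWIND_LOCAL_ONLY;
}

/* Usage: after 'net stop', a killed process only needs local cleanup. */
void example_after_net_stop(void)
{
    assert(choose_exit_action(true, false) == UNWIND_LOCAL_ONLY);
}
```

The simpler `UNWIND_LOCAL_ONLY` path is the case described above, where net stop has already run and there is no ktcp state left to corrupt.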

@Mellvik
Contributor Author

Mellvik commented Oct 11, 2022 via email
