Zombie ngrok processes #58

JennaSys · 2020-07-25T06:08:16Z

Not sure if this is a pyngrok issue or not, but it's something I'm running into and wanted to bring it to your attention. Feel free to close if it's a Python or ngrok binary issue, or if it's too much of a niche problem. It's more of just an OS housekeeping issue than a functional problem.

Describe the Bug
I have a long running application that always keeps an ngrok tunnel open using pyngrok. The tunnel occasionally drops and my application will automatically start a new one. Sometimes the ngrok process itself fails to respond, so the process is killed and then relaunched automatically with the next connect(). Over time, zombie ngrok processes accumulate (on linux) and remain in the process tree until the Python application (the parent process) is terminated or restarted.

Steps to Reproduce
I run a timer thread to check for connection every few minutes and restart the tunnel when necessary, but it can probably also be simulated with a loop and delay.
ngrok.connect()
kill ngrok process externally with linux kill statement (simulating a crash of the ngrok binary)
ngrok.disconnect() will result in a ngrok.exception.PyngrokNgrokURLError
ngrok.get_tunnels() will result in a ngrok.exception.PyngrokNgrokURLError
ngrok.kill() leaves defunct zombie process
ngrok.connect() starts new ngrok process

Restarting the Python program clears out all of the zombie ngrok processes (as it should).

Expected Behavior
Defunct ngrok processes should be removed by the calling process (Python?) at the time the child process exits.

Environment

OS: Linux Raspbian on ARM
Python Version: 3.5
pyngrok Version: 4.1.5

Additional Context
I'm not sure why the ngrok process randomly stops responding, but it happens at least every few days.
Also, this is running on a Raspberry Pi

The text was updated successfully, but these errors were encountered:

alexdlaird · 2020-07-25T17:40:18Z

I think I understand the scenario as it's being described, I'm not quite sure I understand the harm it's causing. Zombie processes aren't real processes, they're defunct and shouldn't consume any resources and should be harmless, and I don't think there's a good way for pyngrok to clean them up for you. Are you just saying you would like them cleaned up, or are they interfering with some other portion of your application?

Regarding the ngrok process dying every few days, my understanding is ngrok tunnels only live for some TTL that may be around that time (it may be different for paid versus free accounts). Perhaps that's what's happening? I cannot find this documented on their site now that I'm trying to find it, but I heard it mentioned in a tutorial fairly recently. Does this same behavior happen with tunnels created directly (i.e. outside of the pyngrok module)? If so, perhaps that's what happening.

alexdlaird · 2020-07-25T17:45:32Z

By the way, the test here (and the one below it) test for when an external command terminates our ngrok process. If the process is terminated elsewhere, I'm not sure we safely have much control over the zombie state within pyngrok, so we just move on and rebuild the state as it exists now. Perhaps you could work off this test though to validate a fix that works for you, if it's generic and safe enough, perhaps you could make a PR for it as well.

JennaSys · 2020-07-25T18:22:54Z

That's why I mentioned it's more of just an OS housekeeping issue than a functional problem. I just happen to have an application that could end up with hundreds of zombie processes before the parent app gets restarted. I doubt I'll run out of process IDs though, which I believe would be the only danger.

I don't think pyngrok is affecting the stability of ngrok itself. ngrok is what it is, and I've accepted to just live with it and work around it as it is worth the benefits it provides. The ngrok process seems to hang randomly from hours to days so I don't think it's related to a TTL setting.

Just for some background on how I noticed this, I used to just try and do a disconnect() then connect() again whenever the tunnel stopped working. But sometimes the connect() would not work again until I restarted the main app. Until recently I didn't realize the reason for this was that the ngrok process itself had stopped responding. I modified my code to specifically check for this condition by calling get_tunnels(), and do a kill() if get_tunnels() returns an error. So maybe pyngrok could return a more specific error if the ngrok process is not responding?

I'll look at the test code when I get a chance and see if there is anything that can be done.

JennaSys · 2020-07-25T18:36:03Z

Also fwiw, the code I currently have is working very well aside from the zombie processes accumulating. Here is the class method I use to stop the tunnels before starting them up again (I have 2 tunnels: http and ssh), it also addresses issue 53:

    def stop(self):
        log.info("Stopping web server at {}".format(self.public_url))

        # Shut down open tunnels
        for tunnel in (self.public_url, self.ssh_url):
            try:
                if len(tunnel) > 0:
                    ngrok.disconnect(tunnel, pyngrok_config=self.pyngrok_config)
            except (ngrok_exception.PyngrokNgrokHTTPError, ngrok_exception.PyngrokNgrokURLError) as e:
                log.error(e)
            except Exception as e:
                log.exception(e)

        # Test to see if ngrok process is active
        try:
            tunnels = ngrok.get_tunnels(pyngrok_config=self.pyngrok_config)
        except ngrok_exception.PyngrokNgrokURLError as e:
            log.warning("ngrok api is not responding!")
            log.error(e)
            tunnels = None

        # Kill ngrok process if it is not responding or if the tunnel list gets too big
        try:
            if tunnels is None:
                log.warning("Unable to obtain tunnel status - killing ngrok process")
                tunnels = []
                ngrok.kill(pyngrok_config=self.pyngrok_config)
            elif len(tunnels) > MAX_TUNNEL_COUNT:
                # There shouldn't be any open tunnels at this point so restart ngrok process to remove any remaining
                log.warning("Too many orphaned tunnels exist: {} - killing ngrok process".format(len(tunnels)))
                ngrok.kill(pyngrok_config=self.pyngrok_config)
        except Exception as e:
            log.exception(e)

        log.debug("ngrok tunnel count: {}".format(len(tunnels)))
        self.public_url = ''
        self.ssh_url = ''

On a side note, I started passing in the config to every pyngrok call to keep the log chatter down with
self.pyngrok_config = PyngrokConfig(monitor_thread=False)
Can this be a global setting?

alexdlaird · 2020-07-25T18:42:02Z

Yah, a few weeks after released that new feature, I wished I'd make monitor_thread default to False. But now it's out in the wild 🤷‍♂️. I'll consider changing its default on a future major release. However, you shouldn't need to pass it to every subsequent call, as after you create an NgrokProcess, the config is stored on that object. You'd only need to pass it to subsequent new NgrokProcess's. A global default config could be a reasonable consideration as well, I'll note that for the future.

Thanks for sharing the context. I'll close this ticket for now, but it's always helpful to have these threads for others to find if they experience similar issues!

JennaSys · 2020-07-25T18:49:54Z

I think because of this ngrok process issue, it kept turning back on when new processes were automatically started, so I just passed it in at every call.

alexdlaird · 2020-07-25T18:51:55Z

Oh that’s a good point. A global default wouldn’t be a bad idea.

alexdlaird · 2020-07-26T16:54:33Z

Hey, just doing some poking around on this today. Seems like we should be able to reproduce this in a until test using psutil, right? On OS X, however, I see the process disappear right away.

I can test further on a Linux machine at another time, but in the meantime, I wonder if this is the culprit here. Unsure if you have the code cloned locally, but if you made the change to use kill() instead of terminate() here and did a make local, it should install your local changes using pip. Then perhaps you could see if you're still getting zombie processes if we send SIGKILL instead.

Additionally, adding process.wait() after the terminate/kill may resolve the issue, per the Stack Overflow article I linked, as that should request the return code from the child process. You could also try adding the wait() here.

If you have the means to try that and either give you an luck, let me know and I can work on a more official solution!

JennaSys · 2020-07-26T17:28:54Z

Just can't let it go huh? 😁 I'll try some of those suggestions to see if it changes the behavior, and get back to you.

alexdlaird · 2020-07-26T17:37:17Z

🤷🏻‍♂️ Interesting as Ice Age: Dawn of the Dinosaurs is, I needed something else to occupy my brain whilst watching it with the little one.

JennaSys · 2020-07-26T20:18:01Z

I'm still working on getting my test environment fully set up , but I ran the process test suite and test_start_process_port_in_use actually results in a zombie ngrok process until the thread terminates. I haven't dug into the exact cause yet though. None of the other tests have that issue.

alexdlaird · 2020-07-26T20:34:08Z

Interesting. This leads me to suspect you're correct that a zombie process is the result of ngrok terminating itself early (i.e. port already in use on the second process for this test), or erroring/crashing. pyngrok then cleans up its own state but not the defunct process.

JennaSys · 2020-07-26T20:49:08Z

Adding ngrok_process.proc.wait() to the kill_process() method of process.py seemed to fix it in the test. I'll test that change in my app.

        try:
            ngrok_process.proc.kill()
            ngrok_process.proc.wait()
        except OSError as e:

alexdlaird · 2020-07-26T21:02:35Z

Awesome! Feel free to throw it in a PR if it works, I can cut a release later today if all looks good.

JennaSys · 2020-07-26T22:06:00Z

I think that fixes my original issue. It might ultimately depend on exactly how the ngrok process dies in production to determine for sure though, and that I'll have to test over the next few days by letting it run. I'll put together a PR in the meantime.
Prior to the fix, doing a sudo kill -SIGSTOP on the ngrok process left a zombie behind when ngrok was restarted. After the fix it cleans up properly.

alexdlaird · 2020-07-26T22:34:30Z

Adding some tests, then will cut a release this evening. But have a look at the changelog, also added the DEFAULT_PYNGROK_CONFIG, as requested. Huzzah!

alexdlaird · 2020-07-27T02:14:15Z

This has been resolved and merged. 4.1.8 has been released and should be available on PyPI shortly.

JennaSys added the bug Something isn't working label Jul 25, 2020

alexdlaird added invalid This doesn't seem right and removed bug Something isn't working labels Jul 25, 2020

alexdlaird closed this as completed Jul 25, 2020

JennaSys mentioned this issue Jul 26, 2020

Allow for cleanup of defunct ngrok process after kill() #59

Merged

11 tasks

alexdlaird added bug Something isn't working and removed invalid This doesn't seem right labels Jul 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zombie ngrok processes #58

Zombie ngrok processes #58

JennaSys commented Jul 25, 2020 •

edited

alexdlaird commented Jul 25, 2020

alexdlaird commented Jul 25, 2020 •

edited

JennaSys commented Jul 25, 2020

JennaSys commented Jul 25, 2020

alexdlaird commented Jul 25, 2020 •

edited

JennaSys commented Jul 25, 2020

alexdlaird commented Jul 25, 2020

alexdlaird commented Jul 26, 2020 •

edited

JennaSys commented Jul 26, 2020

alexdlaird commented Jul 26, 2020

JennaSys commented Jul 26, 2020 •

edited

alexdlaird commented Jul 26, 2020 •

edited

JennaSys commented Jul 26, 2020

alexdlaird commented Jul 26, 2020

JennaSys commented Jul 26, 2020

alexdlaird commented Jul 26, 2020

alexdlaird commented Jul 27, 2020

Zombie ngrok processes #58

Zombie ngrok processes #58

Comments

JennaSys commented Jul 25, 2020 • edited

alexdlaird commented Jul 25, 2020

alexdlaird commented Jul 25, 2020 • edited

JennaSys commented Jul 25, 2020

JennaSys commented Jul 25, 2020

alexdlaird commented Jul 25, 2020 • edited

JennaSys commented Jul 25, 2020

alexdlaird commented Jul 25, 2020

alexdlaird commented Jul 26, 2020 • edited

JennaSys commented Jul 26, 2020

alexdlaird commented Jul 26, 2020

JennaSys commented Jul 26, 2020 • edited

alexdlaird commented Jul 26, 2020 • edited

JennaSys commented Jul 26, 2020

alexdlaird commented Jul 26, 2020

JennaSys commented Jul 26, 2020

alexdlaird commented Jul 26, 2020

alexdlaird commented Jul 27, 2020

JennaSys commented Jul 25, 2020 •

edited

alexdlaird commented Jul 25, 2020 •

edited

alexdlaird commented Jul 25, 2020 •

edited

alexdlaird commented Jul 26, 2020 •

edited

JennaSys commented Jul 26, 2020 •

edited

alexdlaird commented Jul 26, 2020 •

edited