Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[watchdog] Provide additional watchdog actions and/or extension points #11388

Closed
antoniovicente opened this issue Jun 1, 2020 · 8 comments · Fixed by #12416 or #12636
Closed

[watchdog] Provide additional watchdog actions and/or extension points #11388

antoniovicente opened this issue Jun 1, 2020 · 8 comments · Fixed by #12416 or #12636
Assignees

Comments

@antoniovicente
Copy link
Contributor

The thread watchdog is already an important mechanism to detect and recover from coding errors that results in infinite loops, blocking API calls and very long computations in worker threads. There are a few simple improvements that would make the watchdog even more awesome:

  • Option to capture a 5sec to 10sec CPU profile after a series of watchdog misses or mega misses, and either write it to disk or make it available via admin interface. If writing to disk, provide parameter for max number of profiles to generate to avoid filling up the disk.
  • Option to capture and log the current stack of the watched thread or all thread stacks on mega miss.
  • Option to terminate the process by sending SIGABRT to the stuck thread instead calling PANIC on the guarddog thread.
  • Registration mechanism for additional callbacks to invoke on watchdog miss or megamiss which could be used to implement some of the prior ideas and/or integrate with third party systems. Callback arguments may include the list of threads that have experienced recent megamiss events and info about when they were last reported alive.
@KBaichoo
Copy link
Contributor

/assign @KBaichoo

@KBaichoo
Copy link
Contributor

For a general mechanism I'm proposing the following changes:

Bootstrap.proto

Message WatchDog {
        Enum WatchDogEvent {
                Abort
                Miss
                MegaMiss
        }
        
        Message WatchDogActions {
                String name;
                WatchDogEvent evt_type;
        }

        repeated WatchDogActions actions;
}

In GuardDogImpl:

  • Step() function needs to have vectors that’ll hold the (tid, last time touch) for the different miss categories
    • 3 vectors:
      • Miss
      • Mega Miss
      • MultiKill
    • Have the miss / megamiss callbacks occur outside the critical section, and the multi kill / kill inside
  • Need to have access to the registry mechanism and then create a mapping from ‘Event Type’: [Array of Callbacks]

The callbacks will have the following signature:
void Function(WatchDogEvent event, std::vector<std::pair<Thread::ThreadID, MonotonicTime>> tid_ltt_pairs, MonotonicTime now)

@antoniovicente
Copy link
Contributor Author

You may want to have the events be: MultiKill, Kill, Miss and Megamiss to match the current actions.

Getting the full list of threads that are in a miss/megamiss state seems very useful. Similarly, list of threads involved in the Kill or MultiKill event seems useful and would be consistent with the API used for miss/megamiss, even if the kind of operations I expect we would run on Kill/MultiKill possibly not requiring the thread id information.

@KBaichoo
Copy link
Contributor

Good Idea, that would add more granularity and be more consistent than Abort.

@KBaichoo
Copy link
Contributor

IIUC, the best way to provide the 'extensions' (for particular watchdog events) is via Factories and using utilities such as GetAndCheckFactory to then get the callbacks.

I don't see a class that derives from TypedFactory that best suits WatchDog, so I'll create a subclass off of that which all specific factories will derive from similar to NamedNetworkFilterConfigFactory.

@antoniovicente
Copy link
Contributor Author

@envoyproxy/api-shepherds for input

@mattklein123
Copy link
Member

+1 to using an extension/typed_config interface if we think that we will want this to be extensible in the future with different actions/events.

@KBaichoo
Copy link
Contributor

I've turned the Extension PR from a draft into an actual PR. Since it was getting quite big I've decided to implement one of the extensions in another PR.

For implementing CPU profiling based on WatchDogEvents:

  • Add WD’s dispatcher ref to the context that can be used by WDActions to plan callbacks

    • We’ll be able to trigger the action from the GuardDog’s step() which will StartProfiling, and can then schedule the stop() functionality on the GuardDog’s dispatcher. We need to schedule the stop() there as it’s possible we don’t invoke the action while profiling, and thus might not stop profiling.
  • A proto for the configuration for the WD action:

    • Profile durations -- duration on how long to run the profile for
    • Profile_path -- path to file to output the profiles on
    • Max_profiles_per_thread -- limits max number of profiles we’ll generate for a given thread to avoid filling disks
  • Implement a WatchDogProfileAction and Factory

    • Action tracks whether it’s running a profile current, as well as tids associated with profiles and counts

htuch pushed a commit that referenced this issue Aug 27, 2020
Added a watchdog extension that triggers profiling.

Risk Level: Medium (new extension that is optional)
Testing: Unit tests
Docs Changes: Included (added a reference to the generated extension proto.rst)
Release Notes: Included

Fixes #11388

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
clarakosi pushed a commit to clarakosi/envoy that referenced this issue Sep 3, 2020
Added a watchdog extension that triggers profiling.

Risk Level: Medium (new extension that is optional)
Testing: Unit tests
Docs Changes: Included (added a reference to the generated extension proto.rst)
Release Notes: Included

Fixes envoyproxy#11388

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Clara Andrew-Wani <candrewwani@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants