Lifecycle of an SDK agent

Ryan Madsen edited this page Dec 8, 2015 · 1 revision

EOS SDK's power derives from its direct connection to Sysdb. If you are not yet familiar with EOS, you may find it useful to read an overview of the system's architecture before continuing. Because agents interact with Sysdb by manipulating state rather than by calling remote functions, you can write clean, resilient, and reactive code far more naturally than with an RPC framework. This document explains an agent's lifecycle and the common paradigms that emerge from this state-driven architecture.

Note that this tutorial is focused on long-lived agents where EOS SDK controls the underlying event loop. To write SDK scripts that simply get or set some state and then exit, or for programs that maintain their own event loop, see documentation on alternate event loops. Even for these applications, most of this document will remain relevant.

In the beginning...

An SDK agent, like every agent written by Arista, is a standard Linux process. This means you can interact with your agent just as you would any other process: send it kill signals, run it under gdb or strace to watch its activity, constrain its resource usage, and more. Even when run by EOS's agent manager, ProcMgr, your agent is simply executed. Thus, the first step in an agent's life is its main() function (or, for Python agents, any code at the module level, which best practice says should be fenced off behind an if __name__ == "__main__": conditional). From this point on, the process has access to the full Linux system and is free to make system calls, connect to sockets, read environment variables, spawn threads, and so on. This is when the agent should grab a handle to the top-level eos::sdk object and create one or more event handlers. Once all the infrastructure your agent needs has been constructed, the program should call sdk.main_loop(...), thereby initializing itself as an agent and starting the perpetually running event loop.
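Putting those steps together, a minimal agent skeleton might look like the following. This is a sketch based on the EOS SDK's public headers; the my_agent class name and its empty on_initialized body are illustrative, and building it requires the EOS SDK to be installed:

```cpp
#include <eos/agent.h>
#include <eos/sdk.h>

// A minimal agent: construct the SDK object and any handlers first,
// then hand control to the event loop via main_loop().
class my_agent : public eos::agent_handler {
 public:
   explicit my_agent(eos::sdk & sdk)
         : eos::agent_handler(sdk.get_agent_mgr()) {
   }

   void on_initialized() {
      // Called once the SDK has synchronized state with Sysdb;
      // only now is it safe to talk to the _mgr objects.
   }
};

int main(int argc, char ** argv) {
   eos::sdk sdk;
   my_agent agent(sdk);       // Create handlers *before* entering the loop.
   sdk.main_loop(argc, argv); // Runs the event loop; does not return.
}
```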

The event loop

The event loop is the core that your agent is built around: it is a constantly running loop that manages file descriptors, timers, and your agent's connection to Sysdb. When your agent writes a piece of state, the SDK immediately transforms it into an internal representation and writes it to a local copy of the state hierarchy. The write is also added to a write queue. After you return control to the event loop, the queue is drained and the writes are serialized to Sysdb. Sysdb then pushes the new state to all other interested agents, which can asynchronously react to it to reprogram hardware, recalculate topologies, or simply set some other state in response.

Similarly, when Sysdb pushes out a notification about a state update to your agent, the event loop is notified and reads the state update off of the message queue. The SDK then transforms the message data into a stable representation and calls any of your agent's handlers that have subscribed to that state.

Your handlers are called directly from the event loop, which means they must not block: while a handler blocks, no other events are processed (and no other handlers are called) until it returns. If it blocks for long enough, the event loop's message queue can be overrun by incoming state updates from Sysdb, causing a fatal error. A handler should therefore perform long-running tasks asynchronously.

For example, a program written synchronously would perform a TCP request and wait until a response is received before continuing its logic. EOS agents, however, should use non-blocking sockets to first send the request, tell the event loop to watch the TCP socket's file descriptor, and then immediately return control to the event loop. When the file descriptor becomes readable, the event loop will notice, call your on_readable handler, and let you finish processing the response. Because the agent's logic is asynchronous, it is never "stuck" and can always react to other events occurring on the switch.

It is important to note that all agents are single threaded, meaning that the event loop will never call two of your handlers simultaneously. If two events are triggered during the same cycle, there is no guarantee which handler will be called first, but in no circumstance will one handler start while another is executing. This does not, however, prevent your program from starting additional threads. As long as only one thread interacts with the SDK at a time, other threads are free to make system calls, open network connections, and run as they please.

Initializing the loop

When your agent first calls sdk.main_loop(argc, argv), the SDK begins by connecting to Sysdb. It then checks which _mgr objects have been requested and synchronizes those managers' state. This means you must request all _mgrs before entering main_loop, so the SDK knows which state to synchronize. Once all relevant state has been loaded, the SDK calls the agent_handler's on_initialized method. Every agent should override this method, because it signals that managers and handlers can now communicate with Sysdb. Before this callback fires, no other _handler callbacks will be triggered, nor will calls on any _mgr succeed. Inside on_initialized, your agent should inspect any relevant pre-existing state in Sysdb, handle it appropriately, and then start watching for updates on any state it cares about.
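Concretely, an interface-watching agent's on_initialized might first sweep existing state and then subscribe to future updates. This is a sketch using the intf_handler API; the intf_watcher class name is illustrative and error handling is omitted:

```cpp
#include <eos/agent.h>
#include <eos/intf.h>
#include <eos/sdk.h>

class intf_watcher : public eos::agent_handler, public eos::intf_handler {
 public:
   explicit intf_watcher(eos::sdk & sdk)
         : eos::agent_handler(sdk.get_agent_mgr()),
           eos::intf_handler(sdk.get_intf_mgr()) {
   }

   void on_initialized() {
      // 1. Handle state that existed before this agent started.
      for (auto i = get_intf_mgr()->intf_iter(); i; ++i) {
         // inspect get_intf_mgr()->oper_status(*i), etc.
      }
      // 2. Start watching for future updates on all interfaces.
      watch_all_intfs(true);
   }
};
```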

And the loop keeps running

Once all handlers that have overridden on_initialized return, the event loop begins processing events: whether they are updates from Sysdb, timers firing, or file descriptors becoming readable. At this point, your agent is idling until the SDK calls one of your handlers.

Say, for example, you override intf_handler's on_oper_status method, a callback that lets you react to changes in an interface's operational status:

void on_oper_status(eos::intf_id_t intf, eos::oper_status_t status) {
   // Perform some action now that "intf" has changed to "status"
}

Now, let's pretend that your datacenter was infiltrated, and some nefarious ninja cut the cabling on Ethernet42. One of EOS's built-in agents will notice the lost signal and update Sysdb, setting the status of Ethernet42 to notconnected. Sysdb will propagate this state to all interested agents, and, since you overrode on_oper_status (and grabbed a reference to the intf_mgr in the process), your SDK instance will receive the new state as well. The SDK then normalizes this state and calls your on_oper_status with the parameters intf=Ethernet42 and status=INTF_OPER_DOWN. Your agent is now free to react to this state and take whatever action it deems necessary. In this case, it may want to email your sysadmin and security team, warning them that ninjas have possibly overrun the datacenter.