Scott Chiang edited this page May 4, 2014 · 13 revisions

Building fault-tolerant programs with Celluloid is a bit counterintuitive. This page explains the building blocks you need to understand to take full advantage of Celluloid's approach to fault tolerance, and then how to wire them together.

For starters, we suggest you acquaint yourself with Linking, Supervisors, and Supervision Groups. These are the foundation of fault-tolerant programs, but on their own they are only part of Celluloid's fault tolerance story. This page explains how to use these primitives to build fault-tolerant programs, even if you're too lazy to click all those links and read the associated pages.

The DeadActorError Problem

Many Celluloid programs will start off looking something like this:

require 'celluloid/autostart'

class Itchy
  include Celluloid

  def bite(thing)
    # ...
  end
end

class Scratchy
  include Celluloid

  def initialize
    @itchy = Actor[:itchy]
  end

  def bite(thing)
    @itchy.bite thing
  end
end

Celluloid::Actor[:itchy]    = Itchy.new
Celluloid::Actor[:scratchy] = Scratchy.new

However, there's a problem with this approach: the actor referenced as Actor[:itchy] can crash. This means when we call Scratchy#bite, there's a chance a DeadActorError can occur, especially since we've memoized @itchy as an instance variable and aren't looking it up using Actor[:itchy] every time. Fixing these errors in a deterministic way seems hard! How can we write code that is free of race conditions and can prevent these DeadActorErrors from occurring?
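The stale-handle problem can be illustrated without Celluloid at all. The following is a toy model in plain Ruby, not Celluloid's implementation: `ToyWorker`, `DeadWorkerError`, and the `registry` hash are hypothetical stand-ins for an actor, `DeadActorError`, and the `Actor[]` registry. It only shows why a memoized reference keeps pointing at the dead instance after a replacement has been registered under the same name.

```ruby
# Toy model in plain Ruby (no Celluloid). All names here are hypothetical.
class DeadWorkerError < StandardError; end

class ToyWorker
  def initialize
    @alive = true
  end

  # Simulate a crash: the object still exists but no longer answers calls.
  def crash!
    @alive = false
  end

  def bite(thing)
    raise DeadWorkerError, "worker is dead" unless @alive
    "bit #{thing}"
  end
end

registry = {}                      # stands in for the Actor[] registry
registry[:itchy] = ToyWorker.new

stale = registry[:itchy]           # memoized handle, like @itchy
stale.crash!                       # the worker dies...
registry[:itchy] = ToyWorker.new   # ...and a replacement takes over the name

# The memoized handle still points at the dead instance.
stale_result =
  begin
    stale.bite("mouse")
  rescue DeadWorkerError => e
    e.message
  end

# A fresh lookup by name finds the replacement.
fresh_result = registry[:itchy].bite("mouse")
```

Looking the name up on every call narrows the window for this error but does not close it, since the worker can still die between the lookup and the call.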

The Celluloid answer: give up on determinism. Let's take a pessimistic view that any part of our program can crash at any time. We need a way to not only tolerate DeadActorErrors, but minimize the number of times they occur in practice. This is the "Let It Crash" philosophy espoused by Erlang.

Tolerating Crashes Gracefully

To handle DeadActorErrors properly we need to build programs that are resilient to them.

In our previous example we had two actors: Itchy and Scratchy. The latter, Scratchy, depends on the former, Itchy.

To make our program more fault tolerant, we need to write Scratchy a little differently than before:

class Scratchy
  include Celluloid

  def initialize
    @itchy = Actor[:itchy]
    link @itchy
  end

  def bite(thing)
    @itchy.bite thing
  end
end

Next, we need to supervise Itchy and Scratchy using Celluloid Supervision Groups:

class MyGroup < Celluloid::SupervisionGroup
  supervise Itchy,    as: :itchy
  supervise Scratchy, as: :scratchy
end

MyGroup.run

We've basically done two things differently from the original example:

  1. Scratchy is now linked directly to Itchy, so that if Itchy crashes, Scratchy will die too. This should mostly eliminate DeadActorErrors that arise because Scratchy has a stale handle to Itchy. Hopefully Scratchy dies before it can even use the stale handle.
  2. Itchy and Scratchy are now supervised, so when they crash, they get restarted. Itchy is listed first in the SupervisionGroup, before Scratchy, which depends on Itchy. This way we ensure that they get started up in the correct dependency order. It also means that when we shut them down, Scratchy gets shut down first.
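The two points above can be traced with a small toy model. This is plain Ruby rather than Celluloid, and `ToySupervisor` with its `link`, `crash`, and `shutdown` methods is a hypothetical class written only for this illustration. It logs what happens when a linked dependency crashes under a supervisor that starts members in declaration order and stops them in reverse. The restart order it shows is an idealization: as the Future Work section notes, restart ordering is not something the current restart strategy ensures.

```ruby
# Toy model in plain Ruby (no Celluloid). ToySupervisor is hypothetical.
class ToySupervisor
  attr_reader :log

  def initialize(names)
    @names = names                          # declaration order = dependency order
    @links = Hash.new { |h, k| h[k] = [] }  # target => actors linked to it
    @log   = []
  end

  # Record that `dependent` should die whenever `target` dies.
  def link(dependent, target)
    @links[target] << dependent
  end

  # Start members in the order they were declared.
  def start
    @names.each { |n| @log << "start #{n}" }
  end

  # A crash takes down the actor and everything linked directly to it,
  # then each dead actor is restarted (here: in declaration order).
  def crash(name)
    dead = [name] + @links[name]
    dead.each { |n| @log << "exit #{n}" }
    @names.select { |n| dead.include?(n) }.each { |n| @log << "restart #{n}" }
  end

  # Stop members in reverse declaration order, dependents first.
  def shutdown
    @names.reverse_each { |n| @log << "stop #{n}" }
  end
end

sup = ToySupervisor.new([:itchy, :scratchy])
sup.link(:scratchy, :itchy)   # Scratchy dies if Itchy dies
sup.start
sup.crash(:itchy)
sup.shutdown
```

After this runs, `sup.log` records Itchy starting before Scratchy, both exiting when Itchy crashes, both being restarted, and Scratchy stopping before Itchy at shutdown.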

Future Work

To better ensure that actors in the same SupervisionGroup are restarted in the same order in which they were started, Celluloid should support "restart strategies" à la Erlang and Akka:

http://www.erlang.org/doc/design_principles/sup_princ.html#id71212

Celluloid presently supports only the "one for one" restart strategy, i.e. if a particular actor dies, only that actor is restarted:

[Figure: one for one]

However, to ensure that all services that are affected by a crash boot back up correctly, it could also support a "one for all" strategy where the entire SupervisionGroup is rebooted in the event of the crash of any member of the group:

[Figure: one for all]
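The difference between the two strategies comes down to which members get restarted after a crash. A minimal sketch in plain Ruby, assuming a group is just an ordered list of names; `actors_to_restart` and the `:one_for_one` / `:one_for_all` symbols are hypothetical and are not part of Celluloid's API:

```ruby
# Toy model in plain Ruby (no Celluloid). `actors_to_restart` is a
# hypothetical helper, not a Celluloid method.
def actors_to_restart(strategy, group, crashed)
  case strategy
  when :one_for_one then [crashed]   # only the actor that died
  when :one_for_all then group.dup   # every member, in declaration order
  else raise ArgumentError, "unknown strategy: #{strategy.inspect}"
  end
end

group = [:itchy, :scratchy]

one_for_one = actors_to_restart(:one_for_one, group, :itchy)
one_for_all = actors_to_restart(:one_for_all, group, :scratchy)
```

Under the one-for-all variant, a crash in Scratchy alone would still restart Itchy first and then Scratchy, which is what keeps dependents from coming back up with a handle to an older instance.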
