# Ray Actors in Detail

© 2025, Anyscale. All Rights Reserved

This document provides an introduction to Ray Actors, which extend the Ray API from functions (tasks) to classes.

<div class="alert alert-block alert-info">

<b> Here is the roadmap for this notebook </b>

<ol>
  <li>Overview and setup</li>
  <li>Simple actor submission (creating, executing, and getting results)</li>
  <li>Actor resource fulfillment and scheduling</li> 
  <li>Actor process failure</li>
  <li>Fault tolerance with Actors</li>
  <li>Multi-threading with Actors</li>
  <li>Asyncio with Actors</li>
  <li>Concurrency groups</li>
  <li>Actor pool abstraction</li>
</ol>
</div>

**Imports**


In [None]:
import asyncio
import json
import os
import sys
import tempfile
import time
import threading


import ray
from ray.util import ActorPool
from ray.util.placement_group import placement_group, remove_placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

## Simple actor submission (creating, executing, and getting results)

Actors extend the Ray API from functions (tasks) to classes.

An actor is a stateful worker. When you instantiate an actor, Ray starts a dedicated worker process for it; the actor's methods are scheduled on that worker and can read and mutate its state. Like Ray tasks, actors support CPU and GPU resources, including fractional resource requests.

Let's look at an example of an actor which maintains a running balance.


In [None]:
@ray.remote
class Accounting:
    def __init__(self):
        self._total = 0  # leading underscore keeps the attribute from shadowing total()

    def add(self, amount):
        self._total += amount

    def remove(self, amount):
        self._total -= amount

    def total(self):
        return self._total

<div class="alert alert-info">
  <strong><a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#actors" target="_blank">Actor</a></strong> is a remote, stateful Python class.
</div>

Define an actor with the `@ray.remote` decorator, then call `<class_name>.remote()` to ask Ray to construct an instance of this actor somewhere in the cluster.

The call returns an actor handle, which we can use to communicate with the actor or pass to other code, tasks, or actors.


In [None]:
acc = Accounting.remote()

We can send a message to an actor, with RPC semantics, by calling `<handle>.<method_name>.remote()`.


In [None]:
acc.total.remote()

Not surprisingly, we get an object ref back. Passing it to `ray.get` returns the actual value:


In [None]:
ray.get(acc.total.remote())

We can mutate the state inside this actor instance:


In [None]:
acc.add.remote(100)

In [None]:
acc.remove.remote(10)

In [None]:
ray.get(acc.total.remote())

### Activity: Linear Model Inference

<div class="alert alert-block alert-info">

__Activity: linear model inference__

* Create an actor which applies a model to convert Celsius temperatures to Fahrenheit
* The constructor should take model weights (w1 and w0) and store them as instance state
* A convert method should take a scalar, multiply it by w1 then add w0 (weights retrieved from instance state) and then return the result



In [None]:
# Hint: define the below as a remote actor
class LinearModel:
    def __init__(self, w0, w1):
        """Hint: store the weights"""

    def convert(self, celsius):
        """Hint: convert the celsius temperature to Fahrenheit."""

# Hint: create an instance of the LinearModel actor

# Hint: convert 100 Celsius to Fahrenheit

</div>


In [None]:
# Write your solution here

<div class="alert alert-block alert-info">

<details>

<summary> Click to see solution </summary>

```python
@ray.remote
class LinearModel:
    def __init__(self, w0, w1):
        self.w0 = w0
        self.w1 = w1

    def convert(self, celsius):
        return self.w1 * celsius + self.w0

model = LinearModel.remote(w1=9/5, w0=32)
ray.get(model.convert.remote(100))
``` 

</details>

</div>

## Actor resource fulfillment and scheduling

Actors reserve resources for their entire lifetime. Method calls (actor tasks) execute on the same worker process that hosts the actor.

In [None]:
@ray.remote(num_cpus=2)
class ModelServer:
    def __init__(self):
        self.ready = True

    def infer(self, x):
        return x * 2

    def task_a(self):
        return "a"

    def task_b(self):
        return "b"

srv = ModelServer.remote()
ray.get(srv.infer.remote(21))

### Actor creation process

When you call `.remote()` on an actor class, here's what happens:

1. **Registration**: The creating worker registers the actor with the GCS
   - Detached actors: synchronous registration (prevents name conflicts)
   - Non-detached actors: asynchronous registration (better performance)

2. **Scheduling**: Once dependencies are resolved, the GCS schedules the actor creation task using the distributed scheduling protocol (same as normal tasks)

3. **Buffering**: The creator can immediately submit method calls on the actor handle
   - Tasks are buffered locally until the actor is created
   - Handles can be passed to other tasks/actors before creation completes

4. **Notification**: When the actor is created, the GCS notifies all handle holders via pub-sub
   - Each handle caches the actor's RPC address
   - Buffered tasks are sent to the actor for execution

Here is a diagram illustrating the actor creation process:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/actor_creation.svg" alt="Actor Creation Process" style="width: 700px;"/>

### How actor placement is chosen

The GCS schedules actor creation using the same placement rules as normal tasks:

- **Data locality**: If constructor arguments include large `ObjectRef`s, prefer the node with the most bytes local
- **Default**: Use the caller's local raylet if resources are available

Once placed, the actor holds its resources for its entire lifetime.

### Actor task submission process

Actor handles contain the RPC address of the actor. Calling workers connect directly to this address to submit tasks.

Here is a diagram illustrating the actor task submission process:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/actor_task_submission.svg" alt="Actor Task Submission Process" style="width: 700px;"/>

**Execution ordering:**
- Tasks from the **same submitter** follow strict FIFO order
- Tasks from **different submitters** have no guaranteed order
- If a task is blocked on dependencies, the actor can still execute tasks from other submitters

**Example:**

In [None]:
# Driver process submits tasks
srv.task_a.remote()  # Executes first
srv.task_b.remote()  # Executes second (waits for task_a to finish)

@ray.remote
def f(srv):
    # Worker process submits tasks
    srv.task_a.remote()  # May execute before or after any of driver's tasks

f.remote(srv)

## Actor Process Failure

Ray can automatically restart an actor that crashes unexpectedly, up to `max_restarts` times. On restart, Ray recreates the actor's state by rerunning its constructor.

**Configuration options:**
- `max_restarts=0` (default): No restart
- `max_restarts=-1`: Infinite restarts
- `max_task_retries=0` (default): At-most-once execution - raises `RayActorError` immediately on actor failure
- `max_task_retries=-1`: At-least-once execution - retries tasks automatically once actor is restored

**Example: Actor that restarts after failure**

In [None]:
@ray.remote(max_restarts=4, max_task_retries=-1)
class Actor:
    def __init__(self):
        self.counter = 0

    def increment_and_possibly_fail(self):
        if self.counter == 10:
            os._exit(0)  # Exit after every 10 tasks
        self.counter += 1
        return self.counter

actor = Actor.remote()

# Executes 50 tasks across 5 actor lifetimes (10 tasks each)
for _ in range(50):
    counter = ray.get(actor.increment_and_possibly_fail.remote())
    print(counter)  # Prints 1-10 five times

# After 4 restarts, subsequent tasks raise RayActorError
for _ in range(10):
    try:
        ray.get(actor.increment_and_possibly_fail.remote())
    except ray.exceptions.RayActorError:
        print("FAILURE")  # Actor exhausted restarts

**Note:** With at-least-once semantics, a retried method may execute more than once: first on the failed actor and again on the restarted actor. Methods with side effects should therefore be idempotent.
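
The arithmetic behind the comments in the example above (50 successful calls, then failures) can be checked with a plain-Python model. This is only a sketch: it does not use Ray, `simulate_calls` is a hypothetical helper, and it assumes the behavior described in this section, namely that a crashed call is retried on the restarted actor while restarts remain.

```python
def simulate_calls(num_calls, max_restarts=4, tasks_per_lifetime=10):
    """Model the restart example: one entry per call, either the counter or "error"."""
    counter = 0          # actor state, reset whenever the constructor reruns
    restarts_used = 0
    alive = True
    outcomes = []
    for _ in range(num_calls):
        if not alive:
            outcomes.append("error")  # stands in for RayActorError
            continue
        if counter == tasks_per_lifetime:
            # The process exits here. If restarts remain, the call is retried
            # on a fresh actor; otherwise the actor is dead for good.
            if restarts_used == max_restarts:
                alive = False
                outcomes.append("error")
                continue
            restarts_used += 1
            counter = 0
        counter += 1
        outcomes.append(counter)
    return outcomes


outcomes = simulate_calls(60)
successes = [o for o in outcomes if o != "error"]
print(len(successes))            # 50
print(outcomes.count("error"))   # 10
print(outcomes[8:12])            # [9, 10, 1, 2]
```

Five lifetimes (the original plus four restarts) of ten calls each give the 50 successes; the 51st call uses up the restart budget and every later call fails.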

## Fault-tolerant actors with state restoration
Actor memory is process-local. After a restart, you must restore state explicitly.

Here is an example that uses file-based checkpoints to achieve stateful fault tolerance.


In [None]:
@ray.remote(max_restarts=-1, max_task_retries=-1)
class ImmortalActor:
    def __init__(self, checkpoint_file):
        self.checkpoint_file = checkpoint_file

        if os.path.exists(self.checkpoint_file):
            # Restore from a checkpoint
            with open(self.checkpoint_file, "r") as f:
                self.state = json.load(f)
        else:
            self.state = {}

    def flaky_update(self, key, value):
        import random

        if random.randrange(10) < 5:
            sys.exit(1)

        self.state[key] = value

        # Checkpoint the latest state
        with open(self.checkpoint_file, "w") as f:
            json.dump(self.state, f)

    def get(self, key):
        return self.state[key]


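# The checkpoint must be readable from whichever node the actor restarts on; adjust the path for your cluster.
checkpoint_dir = "/mnt/cluster_storage/"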
actor = ImmortalActor.remote(os.path.join(checkpoint_dir, "checkpoint.json"))
ray.get(actor.flaky_update.remote("1", 1))
ray.get(actor.flaky_update.remote("2", 2))

assert ray.get(actor.get.remote("1")) == 1
assert ray.get(actor.get.remote("2")) == 2
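
One possible hardening of the checkpoint step: opening the file with `"w"` truncates it first, so a crash in the middle of a write could leave a partial JSON file for the restarted constructor to load. The sketch below writes to a temporary file and renames it into place. The helpers `save_checkpoint_atomically` and `load_checkpoint` are hypothetical names, not part of Ray, and the rename is assumed to be atomic, which holds on a local POSIX filesystem but should be verified for network storage.

```python
import json
import os
import tempfile


def save_checkpoint_atomically(state, checkpoint_file):
    """Write state to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(checkpoint_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, checkpoint_file)  # atomic on POSIX within one filesystem
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(checkpoint_file):
    """Return the saved state, or an empty dict if no checkpoint exists yet."""
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file, "r") as f:
            return json.load(f)
    return {}


# Local demonstration in a throwaway directory (not the cluster path used above).
with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, "checkpoint.json")
    initial = load_checkpoint(path)
    save_checkpoint_atomically({"1": 1}, path)
    save_checkpoint_atomically({"1": 1, "2": 2}, path)
    restored = load_checkpoint(path)
    leftovers = [name for name in os.listdir(demo_dir) if name.endswith(".tmp")]

print(initial, restored, leftovers)  # {} {'1': 1, '2': 2} []
```

Inside the actor, the constructor would call `load_checkpoint` and the update method would call `save_checkpoint_atomically` after mutating `self.state`.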

### Detached actors
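A detached actor is not tied to the driver that created it: it keeps running after that driver exits and can be looked up by name from another driver. Use detached actors for long-lived, globally named services.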


In [None]:
svc = ModelServer.options(lifetime="detached", name="global_model", namespace="my-test").remote()
# Later (or from another driver):
svc = ray.get_actor(name="global_model", namespace="my-test")

### Killing actors

To kill an actor and prevent restarts, use `ray.kill` with `no_restart=True`.

In [None]:
ray.kill(svc, no_restart=True)

## Multithreaded actors

By default, an actor runs one method at a time. Setting `max_concurrency` above 1 lets Ray run up to that many method calls concurrently on threads, so the methods must be thread-safe.


In [None]:
@ray.remote(max_concurrency=8)
class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, x):
        time.sleep(0.1)  # simulate work
        with self._lock:
            self.value += x
            return self.value

c = Counter.remote()
refs = [c.add.remote(1) for _ in range(32)]
ray.get(refs)  # up to 8 run concurrently

Guidelines:
- Protect shared mutable state with locks or use immutable updates.
- Use higher `max_concurrency` for I/O-bound actors; keep modest for CPU-bound to avoid oversubscription.
- For CPU-heavy parallelism, prefer multiple actors to scale across cores/nodes.
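
To see why the first guideline matters, here is a local sketch that uses only the standard library. A `ThreadPoolExecutor` with 8 threads stands in for a threaded actor with `max_concurrency=8`; the classes `UnsafeCounter` and `SafeCounter` and the helper `run_concurrently` are hypothetical, and the short sleep is there only to widen the race window.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class UnsafeCounter:
    """Read-modify-write without a lock: concurrent calls can overwrite each other."""

    def __init__(self):
        self.value = 0

    def add(self, x):
        current = self.value
        time.sleep(0.001)  # widen the race window so lost updates are easy to observe
        self.value = current + x


class SafeCounter(UnsafeCounter):
    """Same logic, but the whole read-modify-write runs under a lock."""

    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def add(self, x):
        with self._lock:
            super().add(x)


def run_concurrently(counter, calls=32, threads=8):
    # The executor plays the role of the actor's threads (max_concurrency=8).
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(counter.add, [1] * calls))
    return counter.value


unsafe_total = run_concurrently(UnsafeCounter())
safe_total = run_concurrently(SafeCounter())
print(f"without lock: {unsafe_total}, with lock: {safe_total}")
```

The locked counter always reaches 32. The unlocked one typically ends well below that, because threads read the same stale value before writing.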


## Async actors

Async actors run an asyncio event loop; methods declared with `async def` can interleave via `await` points. Concurrency is bounded by `max_concurrency`.


In [None]:
@ray.remote(max_concurrency=16)
class AsyncWorker:
    async def work(self, i):
        await asyncio.sleep(0.2)
        return i * i

aw = AsyncWorker.remote()
results = ray.get([aw.work.remote(i) for i in range(20)])

<div class="alert alert-info">
  <b>Tip:</b> Async actors avoid Python thread contention and can scale high-concurrency I/O. Set <code>max_concurrency</code> to the target in-flight operations.
</div>


## Concurrency groups

Concurrency groups let you assign different concurrency limits to different methods within the same actor. This is useful when you want some methods (like health checks) to remain responsive even while other methods are busy.

**Key concepts:**
- Define groups with their concurrency limits in the `@ray.remote` decorator
- Assign methods to groups using `@ray.method(concurrency_group="name")`
- Methods without a group go to the default group (limit: 1000 for async actors, 1 for threaded actors)
- Works with both async and threaded actors

<div class="alert alert-info">
  <b>Note:</b> For async actors, Ray creates a separate event loop for each concurrency group, providing true isolation between groups.
</div>

**Example:**

In [None]:
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncWorker:
    @ray.method(concurrency_group="io")
    async def fetch_data(self):
        await asyncio.sleep(1)
        return "data"
    
    @ray.method(concurrency_group="compute")
    async def process_data(self, data):
        await asyncio.sleep(2)
        return f"processed: {data}"
    
    async def health_check(self):
        return "healthy"

worker = AsyncWorker.remote()

# "fetch_data" limited to 2 concurrent calls
# "process_data" limited to 4 concurrent calls  
# "health_check" uses default group (up to 1000 concurrent calls)

You can also override the concurrency group at runtime:

In [None]:
# Use defined group
worker.fetch_data.remote()

# Override to use different group
worker.fetch_data.options(concurrency_group="compute").remote()
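
The effect of per-group limits can be modeled locally with plain `asyncio`. The sketch below is not how Ray implements concurrency groups (the note above describes separate event loops); it uses one semaphore per group on a single loop, and `GroupLimiter` is a hypothetical helper. It shows that a saturated `io` group does not delay a call in the default group.

```python
import asyncio


class GroupLimiter:
    """Caps in-flight calls for one group and records the peak concurrency."""

    def __init__(self, limit):
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0

    async def run(self, coro_fn, *args):
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fetch_data():
    await asyncio.sleep(0.05)  # simulated I/O
    return "data"


async def health_check():
    return "healthy"


async def main():
    io = GroupLimiter(2)          # mirrors concurrency_groups={"io": 2}
    default = GroupLimiter(1000)  # mirrors the default group of an async actor
    fetches = [asyncio.create_task(io.run(fetch_data)) for _ in range(6)]
    await asyncio.sleep(0)  # let the fetches start and saturate the io group
    status = await default.run(health_check)  # not blocked by the busy io group
    pending_during_check = sum(not t.done() for t in fetches)
    await asyncio.gather(*fetches)
    return io.peak, status, pending_during_check


io_peak, status, pending_during_check = asyncio.run(main())
print(io_peak, status, pending_during_check)  # 2 healthy 6
```

The health check returns while all six fetches are still pending, and the `io` group never runs more than two of them at once.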

## ActorPool (simple worker pool over actors)

`ray.util.ActorPool` provides a lightweight way to manage a pool of homogeneous actors and submit many small jobs with automatic load balancing.

**Creating an actor pool:**

Here we create a pool of 4 actors, each with a `process` method that squares its input.

In [None]:
@ray.remote
class Worker:
    def process(self, x):
        return x * x

workers = [Worker.remote() for _ in range(4)]
pool = ActorPool(workers)

**Mapping over inputs:**

The `map` method distributes work across idle actors and returns an iterator that yields results in the order of the inputs. Use `map_unordered` to receive results as they complete instead.

In [None]:
inputs = range(10)
results = list(pool.map(lambda a, x: a.process.remote(x), inputs))

**Incremental submission and retrieval:**

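For more control, submit tasks one at a time and retrieve results with `get_next()`, which returns them in submission order. Use `get_next_unordered()` to take whichever result finishes first.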

In [None]:
for x in range(10):
    pool.submit(lambda a, v: a.process.remote(v), x)

ready = [pool.get_next() for _ in range(10)]

When to use:
- Many short, similar actor method calls; you want automatic fair scheduling across a fixed set of actors.
- Simple replacement for manual round-robin over actor handles.

Prefer alternatives when:
- You need heterogeneous actors or topology (use multiple actor pools).
- You need backpressure/windowed submission (combine with `ray.wait` and/or use actors with queues).
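
The windowed-submission pattern from the last bullet can be sketched with the standard library alone. Here `concurrent.futures.wait(..., return_when=FIRST_COMPLETED)` plays the role that `ray.wait(refs, num_returns=1)` would play with object refs; `windowed_map` and `process` are hypothetical names, and inputs are assumed to be unique and hashable because results are keyed by input.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def process(x):
    time.sleep(0.01)  # simulate a short job
    return x * x


def windowed_map(fn, inputs, max_in_flight=4):
    """Submit work lazily so at most max_in_flight calls are outstanding at once."""
    results = {}
    peak = 0
    with ThreadPoolExecutor(max_workers=max_in_flight) as executor:
        in_flight = {}  # future -> input that produced it
        for x in inputs:
            if len(in_flight) >= max_in_flight:
                # Backpressure: block until at least one outstanding call finishes.
                done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
                for fut in done:
                    results[in_flight.pop(fut)] = fut.result()
            in_flight[executor.submit(fn, x)] = x
            peak = max(peak, len(in_flight))
        # Drain whatever is still outstanding.
        for fut, x in in_flight.items():
            results[x] = fut.result()
    return results, peak


results, peak = windowed_map(process, range(10), max_in_flight=4)
print([results[x] for x in range(10)])  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(peak)                             # 4
```

The producer never has more than `max_in_flight` calls outstanding, which bounds memory use when the input stream is large or unbounded.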