# Exploring Ray API Calls

© 2019-2022, Anyscale. All Rights Reserved


This lesson explores a few other API calls you might find useful, as well as options that can be used with the API calls we've already learned. We will also walk through some tips and tricks for first-time users.

> **Tip:** The [Ray Package Reference](https://docs.ray.io/en/latest/package-ref.html) in the [Ray Docs](https://docs.ray.io/en/latest/) is useful for exploring the API features we'll learn.

In [3]:
import ray, time, sys, logging
import numpy as np 
from pprint import pprint

In [4]:
if ray.is_initialized():  # call the function; the bare function object is always truthy
    ray.shutdown()
context = ray.init(logging_level=logging.ERROR)
pprint(context)

RayContext(dashboard_url='127.0.0.1:8266', python_version='3.8.13', ray_version='1.12.1', ray_commit='4863e33856b54ccf8add5cbe75e41558850a1b75', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-05-24_17-27-19_715650_95151/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-05-24_17-27-19_715650_95151/sockets/raylet', 'webui_url': '127.0.0.1:8266', 'session_dir': '/tmp/ray/session_2022-05-24_17-27-19_715650_95151', 'metrics_export_port': 64244, 'gcs_address': '127.0.0.1:63793', 'address': '127.0.0.1:63793', 'node_id': '0402eca80ad929f1431dae0365addb2dd8b672a2d5b68b2f2afcf162'})


The Ray Dashboard URL is printed above. Use it on your laptop.

In [5]:
print(f"Dashboard url: http://{context.address_info['webui_url']}")

Dashboard url: http://127.0.0.1:8266


## ray.init()

When we used [`ray.init()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.init), we used it to start Ray on our local machine. When the optional `address=...` argument is specified, the driver connects to the corresponding Ray cluster.

There are a lot of optional keyword arguments you can pass to `ray.init()`. Here are some of them. All options are described in the [documentation](https://ray.readthedocs.io/en/latest/package-ref.html#ray.init). 

| Name | Type | Example | Description |
| :--- | :--- | :------ | :---------- |
| `address` | `str` | `address='auto'` | The address of the Ray cluster to connect to. If this address is not provided, then this command will start Redis, a raylet, a plasma store, a plasma manager, and some workers. It will also kill these processes when Python exits. If the driver is running on a node in a Ray cluster, using `auto` as the value tells the driver to detect the cluster, removing the need to specify a specific node address. |
| `num_cpus` | `int` | `num_cpus=4` | Number of CPUs the user wishes to assign to each _raylet_. |
| `num_gpus` | `int` | `num_gpus=1` | Number of GPUs the user wishes to assign to each _raylet_. |
| `resources` | `dictionary` | `resources={'resource1': 4, 'resource2': 16}` | Maps the names of custom resources to the quantities of those resources available. |
| `memory` | `int` | `memory=1000000000` | The amount of memory (in bytes) that is available for use by workers requesting memory resources. By default, this is automatically set based on the available system memory. |
| `object_store_memory` | `int` | `object_store_memory=1000000000` | The amount of memory (in bytes) for the object store. By default, this is automatically set based on available system memory, subject to a 20GB cap. |
| `log_to_driver` | `bool` | `log_to_driver=True` | If true, then the output from all of the worker processes on all nodes will be directed to the driver program. |
| `local_mode` | `bool` | `local_mode=True` | If true, the code will be executed serially. This is useful for debugging. |
| `ignore_reinit_error` | `bool` | `ignore_reinit_error=True` | If true, Ray suppresses errors from calling `ray.init()` a second time (as we've done in these notebooks). Ray won't be restarted. |
| `include_dashboard` | `bool` | `include_dashboard=False` | Boolean flag indicating whether or not to start the Ray Dashboard (named `include_webui` in older releases), which displays the status of the Ray cluster. By default, or if this argument is `None`, the dashboard will be started if the relevant dependencies are present. |
| `dashboard_host` | `str` | `dashboard_host='0.0.0.0'` | The host to bind the dashboard server to (named `webui_host` in older releases). Can either be `localhost` (or `127.0.0.1`) or `0.0.0.0` (available from all interfaces). By default, this is set to `localhost` to prevent access from external machines. |
| `configure_logging` | `bool` | `configure_logging=True` | If true (default), configuration of logging is allowed here. Otherwise, the user may want to configure it separately. |
| `logging_level` | _Flag_ | `logging_level=logging.INFO` | The logging level, defaults to `logging.INFO`. Ignored unless "configure_logging" is true. |
| `logging_format` | `str` | `logging_format='...'` | The logging format to use, defaults to a string containing a timestamp, filename, line number, and message. See the Ray source code `ray_constants.py` for details. Ignored unless "configure_logging" is true. |
| `temp_dir` | `str` | `temp_dir='/tmp/myray'` | If provided, specifies the root temporary directory for the Ray process. Defaults to an OS-specific conventional location, e.g., `/tmp/ray`. |

See also the documentation for [ray.shutdown()](https://ray.readthedocs.io/en/latest/package-ref.html#ray.shutdown), which is needed in some contexts.

## ray.is_initialized()

Is Ray [initialized](https://ray.readthedocs.io/en/latest/package-ref.html#ray.is_initialized)?

In [6]:
ray.is_initialized()

True

## @ray.remote()

We've used [@ray.remote](https://ray.readthedocs.io/en/latest/package-ref.html#ray.remote) a lot. You can pass arguments when using it. Here are some of them.

| Name | Type | Example | Description |
| :--- | :--- | :------ | :---------- |
| `num_cpus` | `int` | `num_cpus=4` | The number of CPU cores to reserve for this task or for the lifetime of the actor. |
| `num_gpus` | `int` | `num_gpus=1` | The number of GPUs to reserve for this task or for the lifetime of the actor. |
| `num_returns` | `int` | `num_returns=2` | (Only for tasks, not actors.) The number of object refs returned by the remote function invocation. |
| `runtime_env` | `dict` | `runtime_env={"working_dir": ".", "pip": ["requests"]}` | The runtime environment to use for this job (see [Runtime environments](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for details). |
| `max_calls` | `int` | `max_calls=5` | Only for *remote tasks*. This specifies the maximum number of times that a given worker can execute the given remote function before it must exit (this can be used to address memory leaks in third-party libraries or to reclaim resources that cannot easily be released, e.g., GPU memory that was acquired by TensorFlow). By default this is infinite. |
| `max_restarts` | `int` | `max_restarts=-1` | Only for *actors*. This specifies the maximum number of times that the actor should be restarted when it dies unexpectedly. The minimum valid value is 0 (default), which indicates that the actor doesn't need to be restarted. A value of -1 indicates that an actor should be restarted indefinitely. |
| `max_task_retries` | `int` | `max_task_retries=-1` | Only for *actors*. How many times to retry an actor task if the task fails due to a system error, e.g., the actor has died. If set to -1, the system will retry the failed task until the task succeeds, or the actor has reached its max_restarts limit. If set to n > 0, the system will retry the failed task up to n times, after which the task will throw a `RayActorError` exception upon `ray.get`. Note that Python exceptions are not considered system errors and will not trigger retries. |
| `max_retries` | `int` | `max_retries=-1` | Only for *remote functions*. This specifies the maximum number of times that the remote function should be rerun when the worker process executing it crashes unexpectedly. The minimum valid value is 0, the default is 4, and a value of -1 indicates infinite retries. |

Here's an example with and without `num_returns`:

In [7]:
@ray.remote(num_returns=3)
def tuple3(one, two, three):
    return (one, two, three)

x_ref, y_ref, z_ref = tuple3.remote("a", 1, 2.2)
x, y, z = ray.get([x_ref, y_ref, z_ref])
print(f'({x}, {y}, {z})')

@ray.remote
def tuple3(one, two, three):
    return (one, two, three)

xyz_ref = tuple3.remote("a", 1, 2.2)
x, y, z = ray.get(xyz_ref)
print(f'({x}, {y}, {z})')

(a, 1, 2.2)
(a, 1, 2.2)


### @ray.method()

Related to `@ray.remote()`, [@ray.method()](https://ray.readthedocs.io/en/latest/package-ref.html#ray.method) allows you to specify the number of return values for a method in an actor, by passing the `num_returns` keyword argument. None of the other `@ray.remote()` keyword arguments are allowed. Here is an example:

In [9]:
@ray.remote
class Tupleator:
    @ray.method(num_returns=3)
    def tuple3(self, one, two, three):
        return (one, two, three)
    
tupleator = Tupleator.remote()
x_ref, y_ref, z_ref = tupleator.tuple3.remote("a", 1, 2.2)
x, y, z = ray.get([x_ref, y_ref, z_ref])
print(f'({x}, {y}, {z})')   

(a, 1, 2.2)


## ray.put()

We used [`ray.get`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.get) a lot to retrieve objects, and we used actor methods to retrieve state from an actor. You can also put objects into the object store explicitly with [`ray.put`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.put), as shown in the following example:

In [10]:
ref = ray.put("Hello World!")
print(f'Object returned: {ray.get(ref)}')

Object returned: Hello World!


In [11]:
ref = ray.put(np.random.rand(2_000, 5_000))
print(f'Object returned: {ray.get(ref)}')

Object returned: [[0.92193964 0.851933   0.35130997 ... 0.12112416 0.20866389 0.07780157]
 [0.67046852 0.61188893 0.8130983  ... 0.03276641 0.54650956 0.38075173]
 [0.36338666 0.63422341 0.53683446 ... 0.83471161 0.6846704  0.00754574]
 ...
 [0.30825078 0.08873442 0.29099552 ... 0.45384649 0.64807062 0.01454769]
 [0.24976259 0.60217567 0.40592398 ... 0.11249347 0.58073364 0.72220796]
 [0.8478374  0.5418938  0.07727733 ... 0.35070766 0.20725226 0.15816054]]


Older Ray releases accepted an optional `weakref=True` flag (default `False`) in `ray.put`. When true, Ray was allowed to evict the object while a reference to the returned ref still existed. This was useful if you were putting a lot of objects into the object store and many of them might not be needed later, because it let Ray reclaim memory aggressively. Recent releases no longer accept this flag, so check the `ray.put` documentation for the version you are running.

## Fetching Information

Many methods return information:

| Method | Brief Description |
| :----- | :---------------- |
| [`ray.get_gpu_ids()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.get_gpu_ids) | GPUs |
| [`ray.nodes()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.nodes) | Cluster nodes |
| [`ray.cluster_resources()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.cluster_resources) | All the available resources, used or not |
| [`ray.available_resources()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.available_resources) | Resources not in use |

In [12]:
print(f"""
ray.get_gpu_ids():          {ray.get_gpu_ids()}
ray.nodes():                {ray.nodes()}
ray.cluster_resources():    {ray.cluster_resources()}
ray.available_resources():  {ray.available_resources()}
""")


ray.get_gpu_ids():          []
ray.nodes():                [{'NodeID': '5df18d5eeef2d75e4ddca098078a662c0a831077794089ed7482faf2', 'Alive': True, 'NodeManagerAddress': '127.0.0.1', 'NodeManagerHostname': 'Juless-MacBook-Pro-16-inch-2019', 'NodeManagerPort': 59828, 'ObjectManagerPort': 59827, 'ObjectStoreSocketName': '/tmp/ray/session_2022-04-06_17-09-07_012306_14342/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2022-04-06_17-09-07_012306_14342/sockets/raylet', 'MetricsExportPort': 60893, 'alive': True, 'Resources': {'CPU': 12.0, 'object_store_memory': 2147483648.0, 'memory': 14545055335.0, 'node:127.0.0.1': 1.0}}]
ray.cluster_resources():    {'node:127.0.0.1': 1.0, 'memory': 14545055335.0, 'CPU': 12.0, 'object_store_memory': 2147483648.0}
ray.available_resources():  {'CPU': 12.0, 'node:127.0.0.1': 1.0, 'object_store_memory': 2067483391.0, 'memory': 14545055335.0}



Recall that we used `ray.nodes()[0]['Resources']['CPU']` in the second lesson to determine the number of CPU cores on our machines:

In [13]:
ray.nodes()[0]['Resources']['CPU']

12.0

## Tips and Tricks for first-time users
First-time users can trip over certain API usage patterns. These short tips and tricks will guard you against unexpected results. Below is a brief exploration of a handful of API calls and their best practices.

### Tip 1: Delay ray.get()

With Ray, every `.remote()` call is asynchronous, meaning the call returns immediately with a promise/future object ID. This is key to achieving massive parallelism, as it allows a developer to launch many remote tasks, each returning a future object ID. The value behind an object ID can be fetched with `ray.get`. Because `ray.get` is a blocking call, where and how often you use it can affect performance.


In [5]:
@ray.remote
def do_some_work(x):
    time.sleep(1)
    return x

#### Bad usage
We are using `ray.get` inside a loop and blocking on each call of `.remote()`

In [11]:
%%time
results = [ray.get(do_some_work.remote(x)) for x in range(4)]
results

CPU times: user 22.9 ms, sys: 20.8 ms, total: 43.7 ms
Wall time: 4.02 s


[0, 1, 2, 3]

#### Good usage
We delay `ray.get` until all the tasks have been invoked and their references have been returned.


In [12]:
%%time
results = ray.get([do_some_work.remote(x) for x in range(4)])
results

CPU times: user 7.61 ms, sys: 6.99 ms, total: 14.6 ms
Wall time: 1 s


[0, 1, 2, 3]

#### Takeaway tip 1:
Since `ray.get` is a blocking call, postpone it until you actually need the object ID's value. If called eagerly, it can
reduce the parallelism you are trying to achieve.

### Tip 2: Avoid tiny remote tasks
Ray APIs are general and simple to use. As a result, a newcomer's natural instinct is to parallelize all tasks, including tiny ones, and the per-task overhead adds up over time. In short, if the Ray remote tasks are tiny, they may take longer to execute than their serial Python equivalents.

In [21]:
def tiny_task(x):
    time.sleep(0.0001)
    return x

In [22]:
%%time
results = [tiny_task(x) for x in range(100000)]

CPU times: user 392 ms, sys: 434 ms, total: 826 ms
Wall time: 13.4 s


Now convert this into a Ray remote task:

In [23]:
@ray.remote
def remote_tiny_task(x):
    time.sleep(0.0001)
    return x

In [24]:
%%time
result_ids = [remote_tiny_task.remote(x) for x in range(100000)]
results = ray.get(result_ids)

CPU times: user 30.1 s, sys: 8.54 s, total: 38.7 s
Wall time: 16.2 s


Surprisingly, not only did Ray fail to improve the execution time, but the Ray program is actually slower than the sequential program! What's going on, and what can we do to remedy it?

Well, the issue here is that every task invocation has a non-trivial overhead (e.g., scheduling, inter-process communication, updating the system state) and this overhead dominates the actual time it takes to execute the task.

One way to mitigate this is to make the remote tasks "larger" in order to amortize the invocation overhead. Here we do that by aggregating the tiny tasks into chunks of 1000.


In [28]:
@ray.remote
def mega_work(start, end):
    return [tiny_task(x) for x in range(start, end)]

In [29]:
%%time
result_ids = [mega_work.remote(x*1000, (x+1)*1000) for x in range(100)]
results = ray.get(result_ids)

CPU times: user 44.9 ms, sys: 18.7 ms, total: 63.6 ms
Wall time: 1.18 s


A huge difference in execution time!
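
When the number of items is not an exact multiple of the chunk size, the chunk boundaries need a little care, and the per-chunk result lists usually need flattening afterwards. The sketch below is plain Python with a hypothetical helper, `chunk_bounds`, and a stand-in for the lists that `ray.get` would return.

```python
from itertools import chain

def chunk_bounds(total, chunk_size):
    """Yield (start, end) pairs covering range(total) in chunks of at most chunk_size."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)

bounds = list(chunk_bounds(10, 4))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]

# Stand-in for ray.get(result_ids): one list of results per chunk.
per_chunk = [list(range(start, end)) for start, end in bounds]

# Flatten the per-chunk lists back into a single list of results.
flat = list(chain.from_iterable(per_chunk))
print(flat)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With Ray, each `(start, end)` pair would be passed to a chunked task such as `mega_work.remote(start, end)`.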

### Tip 3: Using ray.wait() with ray.get()

As we noted above, an idiomatic way of using `ray.get()` is to delay fetching objects until you need them. Another way is to use it with `ray.wait()` and only fetch values that are already available.

Let's look at a simple example.

In [30]:
import numpy as np
@ray.remote
def make_array(n):
    time.sleep(n/10.0)
    return np.random.standard_normal(n)

Now define a task that can add two NumPy arrays together. The arrays need to be the same size, but we'll ignore any checking for this requirement.

In [31]:
@ray.remote
def add_arrays(a1, a2):
    time.sleep(a1.size/10.0)
    return np.add(a1, a2)

Now let's use `ray.wait` and `ray.get`

In [36]:
%%time

array_refs = [make_array.remote(n*10) for n in range(6)]
added_array_refs = [add_arrays.remote(ref, ref) for ref in array_refs]

arrays = []
waiting_refs = list(added_array_refs)  # Assign a working list to the full list of refs
while len(waiting_refs) > 0:           # Loop until all tasks have completed
    # Call ray.wait with:
    #   1. the list of refs we're still waiting to complete,
    #   2. num_returns: return as soon as two of them complete (capped at the
    #      number of refs left, since ray.wait rejects a larger value),
    #   3. timeout: wait up to 10 seconds before returning whatever is ready.
    ready_refs, remaining_refs = ray.wait(
        waiting_refs, num_returns=min(2, len(waiting_refs)), timeout=10.0)
    new_arrays = ray.get(ready_refs)
    arrays.extend(new_arrays)
    for array in new_arrays:
        print(f'{array.size}: {array}')
    waiting_refs = remaining_refs  # Reset this list; don't include the completed refs in the list again!
    
# print(f"\nall arrays: {arrays}")

0: []
10: [-1.7781109   0.88573117  0.86475269 -4.57406466 -5.06717503 -2.29265262
  2.86728976 -1.06289166  1.24273203  2.59255893]
20: [-0.53915161 -0.34215306  2.24751532  2.20941352 -1.12683093 -1.27652722
  0.87340067 -0.40808127 -1.38325214 -3.54937726  2.68672585  1.05317156
 -0.86190813 -2.7779964  -1.15920784 -0.06553598 -0.01776134  2.06111113
 -1.31447923  2.34664896]
30: [ 2.86070977  1.54727811  0.27623417 -0.92417809 -4.9424149   3.51745492
  1.92693489  2.55436313 -0.30649244 -2.15913155  0.46628628  2.94806825
  1.96059157  2.93584918 -1.73202179 -0.1542359   1.08681397 -1.53762225
 -1.59247656 -0.76548422 -3.43715225 -1.23557788  3.51236954 -0.16132308
  1.06301097 -1.12866998 -0.63942135 -1.12765888 -1.34416691 -3.07318492]
40: [-1.60077828 -1.22828887  0.03049364 -1.18360244  3.99246954 -0.09282656
  0.30693929  0.38435625 -7.96970459  2.32797177 -1.07742711  0.65223155
 -0.70776815 -0.76976502 -1.82593219 -1.10776212 -2.02381039 -0.23749677
  2.05909044  1.54057844 

### Homework 

Read some more [tricks and tips](https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html) in the documentation