# What is Ray?

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute.

Today’s ML workloads are increasingly compute-intensive. As convenient as they are, single-node development environments such as your laptop cannot scale to meet these demands.

With Ray, you can seamlessly scale the same code from a laptop to a cluster. Ray is designed to be general-purpose, meaning that it can efficiently run a wide range of workloads. If your application is written in Python, you can scale it with Ray with no other infrastructure required.


### Basic Ray Tutorial 

In the first part of the tutorial, we show how Ray can speed up code and functions. We demonstrate how a simple decorator lets a standard Python function run in parallel and be distributed across nodes.

The second part of this tutorial focuses on the cart-pole problem (Barto, Sutton and Anderson, 1983). A pole is attached by a hinge to the middle of a cart, and the cart slides along a frictionless track. The goal is to keep the pole upright by pushing the cart back and forth: an episode ends when the pole tilts too far or the cart moves too far from the center, so success is measured by how long the cart keeps the pole from falling. The tutorial has been modified heavily so that it (i) runs in a Jupyter notebook, (ii) demonstrates the capabilities of Ray and Ray Tune, and (iii) breaks down the components of an RL project with expanded explanations of the code. We may modify this tutorial further to solve a different problem.
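The cart-pole description can be made concrete with a small simulation. This is a minimal sketch, assuming the equations of motion and constants commonly used in OpenAI Gym's `CartPole` environment (Euler integration every 0.02 s, a 10 N push left or right, failure beyond 12 degrees of tilt or 2.4 m from the center). The function names are hypothetical, and this is not the code the tutorial itself uses.

```python
import math
import random

# Constants as commonly used for cart-pole in OpenAI Gym (assumed here).
GRAVITY = 9.8
CART_MASS = 1.0
POLE_MASS = 0.1
TOTAL_MASS = CART_MASS + POLE_MASS
HALF_POLE_LENGTH = 0.5
POLE_MASS_LENGTH = POLE_MASS * HALF_POLE_LENGTH
FORCE_MAG = 10.0                        # size of each push, in newtons
TAU = 0.02                              # seconds between state updates
X_LIMIT = 2.4                           # cart position bound
THETA_LIMIT = 12 * 2 * math.pi / 360    # 12 degrees in radians


def cartpole_step(state, action):
    """Advance (x, x_dot, theta, theta_dot) one Euler step; action 1 pushes right, 0 left."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / TOTAL_MASS)
    )
    x_acc = temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    new_state = (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )
    # The episode fails if the cart leaves the track or the pole tilts too far.
    done = abs(new_state[0]) > X_LIMIT or abs(new_state[2]) > THETA_LIMIT
    return new_state, done


def random_episode(max_steps=500, seed=0):
    """Run one episode with a random policy; reward is +1 for every step taken."""
    rng = random.Random(seed)
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    total_reward = 0
    for _ in range(max_steps):
        state, done = cartpole_step(state, rng.choice([0, 1]))
        total_reward += 1
        if done:
            break
    return total_reward
```

A random policy usually drops the pole within a few dozen steps; learning a policy that keeps it up much longer is the job of the RL agent trained later in the tutorial.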

In the third part of the tutorial, we demonstrate how to create a custom reinforcement learning environment with the problem space of a robot walking down a corridor.
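To make the corridor idea concrete, here is a minimal, stdlib-only sketch of such an environment, loosely modeled on the corridor example in RLlib's custom-environment documentation. The class name `Corridor`, the corridor length, and the reward values (-0.1 per step, +1.0 at the goal) are assumptions for illustration, and `step` returns the classic Gym 4-tuple `(observation, reward, done, info)` used with Ray 1.x. A real RLlib environment would subclass `gym.Env` and declare its `action_space` and `observation_space`.

```python
class Corridor:
    """Hypothetical corridor: the robot starts at cell 0 and must reach cell `length`."""

    def __init__(self, length=5, step_penalty=-0.1, goal_reward=1.0):
        self.length = length
        self.step_penalty = step_penalty  # small cost per move encourages short paths
        self.goal_reward = goal_reward
        self.position = 0

    def reset(self):
        """Put the robot back at the start of the corridor and return the observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply action 0 (move left) or 1 (move right); return (obs, reward, done, info)."""
        if action not in (0, 1):
            raise ValueError(f"invalid action: {action}")
        if action == 0:
            self.position = max(0, self.position - 1)  # the start of the corridor is a wall
        else:
            self.position += 1
        done = self.position >= self.length
        reward = self.goal_reward if done else self.step_penalty
        return self.position, reward, done, {}
```

Walking straight down a five-cell corridor takes five right moves and earns 4 × (-0.1) + 1.0 = 0.6, so the step penalty rewards the agent for heading directly to the goal.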

#### References:

Barto, A. G., Sutton, R. S. and Anderson, C. (1983), ‘Neuron-like adaptive elements that can solve difficult learning control problems’, IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834–846.

Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E. and Stoica, I. (2018), ‘Tune: A Research Platform for Distributed Model Selection and Training’, arXiv preprint arXiv:1807.05118.

Ray RLlib Documentation: [Getting Started with RLlib](https://docs.ray.io/en/latest/rllib-training.html#getting-started)

Ray Tune Documentation: [Ray Tune Documentation](https://docs.ray.io/en/latest/tune/index.html)

Mastering Reinforcement Learning with Python, Enes Bilgin, Packt Publishing, 2020 [Buy MRL with Python](https://www.amazon.com/Mastering-Reinforcement-Learning-Python-next-generation/dp/1838644148/?tag=meastus-200)

Oakes, E., ‘How to scale Python multiprocessing to a cluster with one line of code’, Distributed Computing with Ray (example of calculating Pi using Ray): [How to scale Python multiprocessing to a cluster with one line of code](https://medium.com/distributed-computing-with-ray/how-to-scale-python-multiprocessing-to-a-cluster-with-one-line-of-code-d19f242f60ff)

#### Checking Ray Version, Instantiating Ray Instances and Looking at Node Parameters

It's typically helpful to check the node parameters to make sure the nodes are healthy. You can also open the tab labeled 'Ray Web UI' to browse the node pool, Ray actors, and memory usage. These are advanced topics intended mainly for troubleshooting.

In this notebook, we'll start by showing how easy it is to use Ray to convert regular functions into ones that run in parallel and are distributed across nodes. Before we do anything, though, let's check our version of Ray.

In [1]:
! ray --version

ray, version 1.9.2

In [2]:
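import os

import ray

# Connect to the Domino-managed Ray cluster only once per session.
if not ray.is_initialized():
    service_host = os.environ["RAY_HEAD_SERVICE_HOST"]
    service_port = os.environ["RAY_HEAD_SERVICE_PORT"]
    # _temp_dir='/domino/datasets/local/{}/'.format(os.environ['DOMINO_PROJECT_NAME'])  # set to a dataset
    # Ray Client address: "ray://<host>:<port>" (older code used ray.util.connect("<host>:<port>")).
    ray.init(f"ray://{service_host}:{service_port}")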

Now let's check the health of the nodes and look at the CPU and GPU resources on each one. Each entry below describes one node, including the head node. In this environment, the worker node shown reports one CPU, about 3 GB of memory, and no GPUs; your numbers will differ depending on your environment. It's a good idea to check this and plan memory usage with Ray: if there isn't enough memory headroom for the code as written, the job can fail with a data channel error and shut down. There are advanced techniques to prevent this. This can happen with any version of Ray, so check each time.

In [3]:
ray.nodes()

[{'NodeID': '6c1bd5d13c200f8a543af8fd923e5f221d8930a63f326c8a0d790c01',
  'Alive': True,
  'NodeManagerAddress': '10.0.122.170',
  'NodeManagerHostname': 'ray-633afdf3a2b933058843e442-ray-worker-0',
  'NodeManagerPort': 2385,
  'ObjectManagerPort': 2384,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-10-03_08-21-27_764991_1/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-10-03_08-21-27_764991_1/sockets/raylet',
  'MetricsExportPort': 62122,
  'alive': True,
  'Resources': {'memory': 2956748391.0,
   'CPU': 1.0,
   'node:10.0.122.170': 1.0,
   'object_store_memory': 1267177881.0}},
 {'NodeID': '99a7ad31904635a50ccfce44259e09f0df1de061c24095e1ebaa267d',
  'Alive': True,
  'NodeManagerAddress': '10.0.96.53',
  'NodeManagerHostname': 'ray-633afdf3a2b933058843e442-ray-head-0',
  'NodeManagerPort': 2385,
  'ObjectManagerPort': 2384,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-10-03_08-21-27_764991_1/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2
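
Reading the raw `ray.nodes()` output gets tedious as the cluster grows. The sketch below condenses each entry into one row. It assumes the keys visible in the output above (`NodeManagerHostname`, `Alive`, `Resources`) and that the memory values are in bytes, as their magnitudes suggest; `summarize_nodes` is a hypothetical helper, not part of Ray.

```python
def summarize_nodes(nodes):
    """Return one (hostname, alive, cpus, gpus, memory_gb, object_store_gb) row per node.

    `nodes` is assumed to be the list of dicts returned by ray.nodes().
    """
    rows = []
    for node in nodes:
        res = node.get("Resources", {})
        rows.append((
            node.get("NodeManagerHostname", "?"),
            node.get("Alive", False),
            res.get("CPU", 0.0),
            res.get("GPU", 0.0),  # the key is absent when a node has no GPUs
            round(res.get("memory", 0.0) / 1e9, 2),               # bytes -> GB
            round(res.get("object_store_memory", 0.0) / 1e9, 2),  # bytes -> GB
        ))
    return rows
```

In the notebook you would call `summarize_nodes(ray.nodes())`; for cluster-wide totals rather than per-node rows, `ray.cluster_resources()` returns the aggregated resource dictionary.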

### What is Ray and what can it do?



Ray is a flexible distributed computing system available on demand in Domino. With Ray, you can run code either in parallel or in distributed mode. Parallel mode means running a function in several worker processes at the same time; distributed mode does the same across multiple nodes. You will notice that the wall-clock time (which we measure below) differs from the compute time. When work runs on multiple workers or nodes, the compute time is split among them. For example, if a job contains 10 seconds of total 'sleep' spread across several workers, the wall-clock time (the time we actually wait) is shorter than the total compute time. This is the core benefit of parallel and distributed computing. Let's take a closer look below.
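The effect of splitting waiting time across workers can be seen without a cluster. This sketch uses the standard library's `ThreadPoolExecutor` as a stand-in for Ray workers (an assumption made so it runs anywhere): ten tasks each sleep for 0.2 seconds, so the summed task time is about 2 seconds, while the wall-clock time stays close to a single task's 0.2 seconds because the tasks overlap. The helper names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_sleep(seconds):
    """Sleep for `seconds` and return how long this task actually took."""
    start = time.perf_counter()
    time.sleep(seconds)
    return time.perf_counter() - start


def compare_times(num_tasks=10, seconds=0.2):
    """Run the sleeps concurrently; return (summed task time, wall-clock time)."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        durations = list(pool.map(timed_sleep, [seconds] * num_tasks))
    wall = time.perf_counter() - wall_start
    return sum(durations), wall


if __name__ == "__main__":
    task_time, wall_time = compare_times()
    print(f"summed task time: {task_time:.2f} s, wall-clock time: {wall_time:.2f} s")
```

Threads are enough here only because sleeping releases Python's global interpreter lock; CPU-bound Python code needs separate processes to run in parallel, which is what Ray workers are.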

## Creating remote objects

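Ray lets you put an object in its object store, get it back out, and run a function on it. To see why parallelism pays off, suppose we want to add up 10 million values and pause for 5 seconds after every million. Run sequentially, the pauses alone take 10 × 5 = 50 seconds.

Now spread the work with Ray so that each worker adds one chunk of 1 million values and then pauses for 5 seconds. With ten workers, all chunks run at once and the whole job finishes in a little over 5 seconds. With only three workers, the ten chunks run in four rounds (10 chunks over 3 workers, rounded up), so expect roughly 20 seconds.

ML workloads are already iterative, so the same idea applies to training: partitions of the data that used to be processed sequentially can be processed on separate workers in parallel. A good way to try this is to call the code without Ray and then with Ray: run it first on a small amount of data, then kick off the same code on a larger dataset, both locally and in the cloud. For example, use 10 workers that each sleep 2 seconds and compare the timings.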
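The 10-million-value thought experiment can be scaled down and run with the standard library. In this sketch, threads from `ThreadPoolExecutor` stand in for Ray workers (an assumption; with Ray you would decorate `add_chunk` with `@ray.remote`, launch it with `add_chunk.remote(start)`, and collect the results with `ray.get(...)`). The chunk size and the 0.1-second pause are scaled-down stand-ins for 1 million values and 5 seconds, and the names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_CHUNKS = 10        # stands in for "10 million values in chunks of 1 million"
CHUNK_SIZE = 100_000   # scaled down so the example runs quickly
PAUSE_SECONDS = 0.1    # stands in for the 5-second pause after each chunk


def add_chunk(start):
    """Sum one chunk of consecutive integers, then pause to simulate slow work."""
    total = sum(range(start, start + CHUNK_SIZE))
    time.sleep(PAUSE_SECONDS)
    return total


def run(num_workers):
    """Return (grand_total, wall_seconds) using num_workers threads; 1 worker is sequential."""
    starts = [i * CHUNK_SIZE for i in range(NUM_CHUNKS)]
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        grand_total = sum(pool.map(add_chunk, starts))
    return grand_total, time.perf_counter() - t0


if __name__ == "__main__":
    for workers in (1, 3, 10):
        total, seconds = run(workers)
        print(f"{workers:>2} workers: total={total}, wall time={seconds:.2f} s")
```

Expect roughly 1 s with one worker, about 0.4 s with three (the ten chunks need four rounds), and about 0.1 s with ten, while the grand total is identical in every case.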

### Calculate Pi

### How to estimate Pi using the Monte Carlo method

#### Monte Carlo estimation
Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. One of the basic examples of getting started with the Monte Carlo algorithm is the estimation of Pi. 

#### Estimation of Pi

The idea is to simulate random (x, y) points in a 2-D plane whose domain is a square of side 2r centered on (0, 0). Imagine a circle of the same radius r inscribed in that square.
We then calculate the ratio of the number of points that land inside the circle to the total number of generated points. Refer to the image below:

![image-2.png](attachment:image-2.png)

The area of the circle divided by the area of the square is π/4. For a very large number of generated points:

![image.png](attachment:image.png)
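Written out, the area argument gives the estimator that the code below implements, with $N_{\text{inside}}$ the number of points that land in the circle and $N_{\text{total}}$ the number of points generated. The code uses r = 1, so points are drawn from [-1, 1] × [-1, 1] and a point counts as inside when $\sqrt{x^2 + y^2} \le 1$.

$$
\frac{\text{area of circle}}{\text{area of square}} = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{N_{\text{inside}}}{N_{\text{total}}}
$$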


Reference: https://www.geeksforgeeks.org/estimating-value-pi-using-monte-carlo/

In [6]:
import math
import random


def sample(num_samples):
    """Count how many random points in [-1, 1] x [-1, 1] land inside the unit circle."""
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside


def approximate_pi(num_samples):
    """Serial baseline: draw every sample in this process."""
    num_inside = sample(num_samples)
    # The circle covers pi/4 of the square, so pi ~= 4 * inside / total.
    print("pi ~= {}".format((4 * num_inside) / num_samples))

In [7]:
%%time

approximate_pi(10**8)

pi ~= 3.14158932
CPU times: user 1min 5s, sys: 3.84 ms, total: 1min 5s
Wall time: 1min 5s
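Each run returns a slightly different estimate because the samples are random. The number of points inside the circle follows a binomial distribution with success probability p = π/4, so the standard error of the estimate 4 · inside / n is 4 · sqrt(p(1 - p) / n), which shrinks only as 1 / sqrt(n). The helper names below are hypothetical; `estimate_pi` repeats the same inside-the-circle test as `sample`, with an optional seed for reproducibility.

```python
import math
import random


def pi_standard_error(num_samples):
    """Standard error of 4 * inside / n when inside ~ Binomial(n, pi / 4)."""
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / num_samples)


def estimate_pi(num_samples, seed=None):
    """Serial Monte Carlo estimate using a private, optionally seeded generator."""
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(num_samples)
        if math.hypot(rng.uniform(-1, 1), rng.uniform(-1, 1)) <= 1
    )
    return 4 * inside / num_samples
```

For n = 10**8 the standard error is about 1.6e-4, so all three estimates in this notebook (3.14158932, 3.14185292 and 3.14152188) lie within two standard errors of π. Drawing more samples, not switching pools, is what tightens the estimate; parallelism only makes those samples cheaper in wall-clock time.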


In [8]:
import math
import random


def sample(num_samples):
    """Count how many random points in [-1, 1] x [-1, 1] land inside the unit circle."""
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside


def approximate_pi_parallel(num_samples):
    from multiprocessing.pool import Pool  # local processes, one per CPU by default

    sample_batch_size = 100000
    batches = [sample_batch_size] * (num_samples // sample_batch_size)
    with Pool() as pool:  # the pool is shut down when the block exits
        num_inside = sum(pool.map(sample, batches))
    # Divide by the number of samples actually drawn.
    print("pi ~= {}".format((4 * num_inside) / sum(batches)))

In [9]:
%%time

approximate_pi_parallel(10**8)

pi ~= 3.14185292
CPU times: user 276 ms, sys: 60.5 ms, total: 337 ms
Wall time: 1min 54s


In [10]:
import math
import random


def sample(num_samples):
    """Count how many random points in [-1, 1] x [-1, 1] land inside the unit circle."""
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside


def approximate_pi_distributed(num_samples):
    from ray.util.multiprocessing.pool import Pool  # NOTE: Only the import statement is changed.

    sample_batch_size = 100000
    batches = [sample_batch_size] * (num_samples // sample_batch_size)
    with Pool() as pool:  # batches are scheduled on the Ray cluster's workers
        num_inside = sum(pool.map(sample, batches))
    # Divide by the number of samples actually drawn.
    print("pi ~= {}".format((4 * num_inside) / sum(batches)))


In [11]:
%%time

approximate_pi_distributed(10**8)

pi ~= 3.14152188
CPU times: user 119 ms, sys: 20.4 ms, total: 140 ms
Wall time: 38.8 s
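Both pooled versions size their work with `num_samples // sample_batch_size`, so when `num_samples` is not a multiple of the batch size, the leftover samples are never drawn. A small helper (hypothetical, not part of either library) can keep the remainder as a final, smaller batch; its result can be passed straight to `pool.map(sample, ...)`.

```python
def make_batches(num_samples, batch_size):
    """Split num_samples into batch sizes, keeping any remainder as a final smaller batch."""
    full, remainder = divmod(num_samples, batch_size)
    batches = [batch_size] * full
    if remainder:
        batches.append(remainder)
    return batches
```

For the 10**8 samples used here the helper returns the same 1,000 batches of 100,000 as the code above, so the results are unchanged; it only matters for sample counts that do not divide evenly.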


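Notice that in the examples above, the CPU time reported by `%%time` and the wall-clock time differ. `%%time` only counts CPU time spent in the notebook's own process, so the pooled versions report very little CPU time: the sampling runs in other processes. Compare the wall times instead: in this run, the serial version took 1 min 5 s, the local multiprocessing pool took 1 min 54 s, and the Ray pool took 38.8 s. Keep in mind that Ray uses only three workers to calculate pi in this simple example because we started the cluster with three workers. Starting the cluster with more workers should speed up the calculation even further.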
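The wall times from this run can be turned into a rough speedup figure. This sketch divides the serial wall time by the parallel one and, for parallel efficiency, divides that speedup by the number of workers; the 65 s, 38.8 s and three-worker inputs are copied from the outputs above, and the helper name is hypothetical.

```python
def speedup_report(serial_seconds, parallel_seconds, workers):
    """Return (speedup, parallel efficiency) for one timing comparison."""
    speedup = serial_seconds / parallel_seconds
    return speedup, speedup / workers


# Wall times from this notebook: serial 1 min 5 s, Ray pool 38.8 s on three workers.
speedup, efficiency = speedup_report(65.0, 38.8, 3)
print(f"speedup ~= {speedup:.2f}x, efficiency ~= {efficiency:.0%}")
# prints: speedup ~= 1.68x, efficiency ~= 56%
```

An efficiency well below 100% is not surprising for a job this small, since shipping batches to remote workers and collecting the results adds overhead. Applied to the local pool's 1 min 54 s, the same formula gives a speedup below 1.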

In [14]:
ray.shutdown()

### A Note about Warnings

Version 1.6 is validated on the latest edition of Domino, but here we chose to use the latest stable version of Ray, 1.9. Because it is the newest stable version, it occasionally prints deprecation warnings about upcoming versions of Ray or PyTorch. Don't worry too much about these warnings; they do not change the steps needed to run the code.
