# Ray Tune Knowledge Share
___

## Requirements
* Access to a ROSA Cluster.
* This notebook and accompanying files in your workbench's directory.
* The CodeFlare SDK installed.

## Background
Ray Tune is a Python library for experiment execution and hyperparameter tuning at any scale. It integrates with popular frameworks like PyTorch and TensorFlow and provides state-of-the-art search algorithms, so users can run their experiments with ease.

By default, Ray Tune is installed with the CodeFlare SDK. It can utilise the SDK's job submission client to run tuning tasks via KubeRay in a RHOAI instance. This notebook demonstrates this using the client and a Python tuning script.

This notebook will feature a mixture of the required executable setup and accompanying snippets from the tuning script to illustrate all of the above.

## Setup
First, we'll need to import the CodeFlare SDK and the relevant functions.

In [None]:
# Import dependencies from codeflare-sdk
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication, RayJobClient

Next, we'll need to define our authentication using our OpenShift token and server address, then log in via `auth.login()`.

In [None]:
from codeflare_sdk import TokenAuthentication
auth = TokenAuthentication(
    token = "sha256~q1VwpQtDPzY5TkjqCOlhcCsvkbpO1Bk5PgLNSN63zPc",
    server = "https://api.n4e0h1y1r9l0d7z.kuuz.p3.openshiftapps.com:443",
    skip_tls=False
)
auth.login()
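Hardcoding a token in a notebook makes it easy to leak. One alternative is to read the credentials from environment variables. The following is a minimal sketch under stated assumptions: the helper `read_openshift_credentials` and the variable names `OPENSHIFT_TOKEN` and `OPENSHIFT_SERVER` are hypothetical and not part of the CodeFlare SDK.

```python
import os


def read_openshift_credentials(env=os.environ):
    """Return (token, server) from environment variables, failing loudly if unset.

    The variable names are illustrative assumptions, not an SDK convention.
    """
    required = ("OPENSHIFT_TOKEN", "OPENSHIFT_SERVER")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["OPENSHIFT_TOKEN"], env["OPENSHIFT_SERVER"]
```

The returned values can then be passed as the `token` and `server` arguments of `TokenAuthentication` in place of the string literals.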

Now we can define our Ray Cluster. This is the same Ray Cluster that we'll use for our Tune experiments.

In [None]:
cluster = Cluster(ClusterConfiguration(
    name='ray-tune-ks',
    head_cpu_requests='500m',
    head_cpu_limits='500m',
    head_memory_requests=2,
    head_memory_limits=5,
    num_workers=2,
    worker_cpu_requests='250m',
    worker_cpu_limits=3,
    worker_memory_requests=4,
    worker_memory_limits=5,
    head_extended_resource_requests={'nvidia.com/gpu':0}, 
    worker_extended_resource_requests={'nvidia.com/gpu':0},
    write_to_file=False, 
))

We can create this using `cluster.apply()`.

In [None]:
cluster.apply()

We can then verify its status via `cluster.details()`. The cluster may take some time to become active. Once it does, click the dashboard link (we'll need it later).

In [None]:
cluster.details()

Before we run the first job, let's break down some of its logic and how it relates to Ray Tune.

## Experiment 1

The `simple_mnist_job.py` file will serve to illustrate some simple machine learning concepts.

```python
def objective(config):
    for step in range(10):
        x, y = config["x"], config["y"]
        loss = (x - 3) ** 2 + (y + 1) ** 2 + random.random() * 0.1
        session.report({"loss": loss})
```

* It defines a simple function named `objective()` which takes the `config` dictionary from Ray Tune. This is a dict that contains the hyperparameters for a trial.
* To simulate iterating through epochs, we loop through 10 iterations.
* We define the loss via a quadratic equation with some added randomness.
* Then we report the loss for each iteration back to Ray Tune.
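To see what a single trial computes without a cluster, the same loop can be run locally. This is a minimal sketch rather than the contents of the job script: the `report` argument is a hypothetical stand-in for `session.report`, and the fixed seed is only there to make the run repeatable.

```python
import random


def objective(config, report):
    """Same loop as the job script, with the reporting call injected."""
    for step in range(10):
        x, y = config["x"], config["y"]
        loss = (x - 3) ** 2 + (y + 1) ** 2 + random.random() * 0.1
        report({"loss": loss})


random.seed(0)
reported = []
objective({"x": 3.0, "y": -1.0}, reported.append)

# At x=3, y=-1 both quadratic terms vanish, so only the noise term remains.
print(len(reported), max(r["loss"] for r in reported))
```

The loss is smallest near `x = 3`, `y = -1`, which is the point a tuning run should move towards.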

This essentially acts as a 'faked' training loop to demonstrate Ray Tune. It doesn't actually tune a model; it simply demonstrates the process. The snippet below is where our function is passed to Ray Tune.

```python
tune.run(
    objective,
    config={
        "x": tune.uniform(-10, 10),
        "y": tune.uniform(-10, 10),
    },
    num_samples=5,
    resources_per_trial={"cpu": 1},
)
```

* It defines the search space for the hyperparameters.
  * Both are sampled from a uniform distribution between -10 and 10.
  * This means each number has an equal chance to be picked.
* It sets `num_samples` to 5 in order to perform 5 trials of the full process (5 runs of our 10-step loop, each with freshly sampled `x` and `y`).
* And it sets the resources available to each trial to a single CPU.
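The sampling described above can be illustrated in plain Python. This is a sketch of the idea only, not Ray Tune's implementation: `random_search` and `noiseless_loss` are hypothetical helpers, the noise term is dropped for clarity, and trials run one after another here, whereas Ray Tune schedules them across the cluster.

```python
import random


def noiseless_loss(x, y):
    """The quadratic part of the objective, without the random term."""
    return (x - 3) ** 2 + (y + 1) ** 2


def random_search(num_samples, low=-10.0, high=10.0, seed=None):
    """Draw x and y uniformly for each trial and record the resulting loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_samples):
        config = {"x": rng.uniform(low, high), "y": rng.uniform(low, high)}
        trials.append((noiseless_loss(**config), config))
    return trials


trials = random_search(num_samples=5, seed=42)
best_loss, best_config = min(trials, key=lambda t: t[0])
print(best_config, best_loss)
```

With only 5 samples the best trial is unlikely to land close to the optimum. Increasing `num_samples` improves the odds at the cost of more compute.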

Now that we've got a basic understanding of this experiment, let's use the cluster job client to execute it as a Ray Job.
First, we'll need to initialise the client. The SDK will automatically gather the dashboard address and authenticate using the Ray Job Submission Client.

In [None]:
client = cluster.job_client

We can now assign the return value of `submit_job()` to `submission_id`. This function takes:

* An `entrypoint`, which is the command to run as the Ray Job. In this case, a Python execution of our script.
* A `runtime_env`, in this case our local working directory and the requirements file to pip install.

Once this cell has been executed, you should see some logging info from Ray and the `submission_id`.
This will look similar to: `raysubmit_sTV4SPtFup7JsxKH`.

In [None]:
submission_id = client.submit_job(
    entrypoint="python simple_mnist.py",
    runtime_env={"working_dir": "./","pip": "requirements.txt"},
)
print(submission_id)
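The job runs asynchronously, so it can be useful to wait for it to finish before reading results. The following is a minimal sketch of a polling loop: `wait_for_job` is a hypothetical helper, and it assumes the client exposes `get_job_status(job_id)` returning one of Ray's job states (`PENDING`, `RUNNING`, `SUCCEEDED`, `FAILED`, `STOPPED`), either as an enum or as a plain string.

```python
import time

# Job states after which the status will no longer change.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, submission_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll client.get_job_status until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_job_status(submission_id)
        # Accept either an enum member (use its value) or a plain string.
        name = getattr(status, "value", status)
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {submission_id} still {name} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

Calling `wait_for_job(client, submission_id)` in a notebook cell blocks until the job finishes and returns its final state. The dashboard remains the easiest place to browse the full logs.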

Now you can open the Ray Dashboard to observe the results of this experiment. In the dashboard, click the "jobs" tab and find your Ray Job. It will look similar to the image below.

![ray-job](images/ray-job.png)

Click "log" on the far right of the screen to view the full logs of the Job.
Early in the logs you should be able to see the pending trials. They will look similar to the image below.
Here you can observe the x and y values for each trial and their status. These details are part of the Ray Tune output.

![ray-job](images/pending-trials.png)

Further into the logs you'll be able to observe the results of each iteration. They will look similar to the below image, featuring details regarding the specific iteration including its duration and loss.

![ray-job](images/iteration-1.png)

Finally, at the end of the log you should be able to observe the list of completed trials with accompanying data on loss, duration and so on.

![ray-job](images/trials-completed.png)

## Experiment 2