## Agents on Kubeflow 🤓

In this tutorial we will be training a reinforcement learning agent from the `tensorflow/agents` project on Kubernetes using Kubeflow.

### Preliminaries

#### Agents

- A framework for building neural network-based agents that learn to perform tasks through interaction with an environment: https://github.com/tensorflow/agents
- These environments are provided through the OpenAI gym interface: https://github.com/openai/gym

#### What is Kubeflow?

- Kubeflow makes it easier to run distributed training of neural network models on Kubernetes.
- To get up to speed on distributed TensorFlow, start with [these docs](https://www.tensorflow.org/deploy/distributed).
- Kubeflow documentation: https://github.com/tensorflow/k8s

#### Overview

- Here we will train a reinforcement learning agent to walk around a domain. The result will look like the following: https://www.youtube.com/watch?v=UE7tvibbTDQ
- This tutorial consists of three phases:
    1. Deploying and configuring the Kubernetes cluster needed to run the training job.
    2. Training, in which the model parameters needed to perform the task are learned.
    3. Capturing a video of the trained model performing the task.

### Setup and Deployment

#### Google Cloud Platform

- What is the Google Cloud Platform?
    - https://cloud.google.com/getting-started/
- You will need to configure your cloud platform account and create a project before proceeding.
    - Which APIs must be enabled?
- The `gcloud` command-line tool: https://cloud.google.com/sdk/gcloud/

In [None]:
%%bash
# Replace the placeholder with your own project ID.
GCLOUD_PROJECT_ID="your-project-id"
gcloud auth login && gcloud config set project "${GCLOUD_PROJECT_ID}"

#### Obtain the code and dependencies

- Kubeflow GitHub: https://github.com/tensorflow/k8s
- The Kubeflow deployment tooling (`tfk8s`) and the `jinja2` command-line templating tool can be installed with pip:

In [1]:
%%bash
pip install tfk8s
pip install jinja2-cli  # provides the jinja2 command used for templating

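Since the cells in this tutorial call several command-line tools, it can help to confirm they are on the `PATH` before going further. The following is a minimal sketch, not part of the original tutorial; the helper name `missing_tools` is hypothetical, and the tool list is simply the set of commands that appear in the cells of this notebook.

```shell
# Print the name of every listed tool that is not found on PATH.
# Returns 0 only if all listed tools are available.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Commands invoked by the cells in this tutorial.
if missing_tools gcloud gsutil kubectl helm jinja2 python; then
  echo "all tools found"
else
  echo "install the tools listed above before continuing" >&2
fi
```

The check only establishes that a command exists, not that it is a compatible version or that it is authenticated against your project.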

#### Deploy the cluster

- Your very own Kubeflow cluster can be deployed on the Google Cloud Platform with the following command:

In [None]:
%%bash
GCLOUD_PROJECT_ID="your-project-id"  # replace with your project ID
SALT=$(date | shasum -a 256 | cut -c1-8)  # short suffix derived from the current time
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz

python -m tfk8s.deploy setup --project ${GCLOUD_PROJECT_ID} --cluster dev-${SALT} \
                             --zone us-central1-f --chart $CHART \
                             --junit_path /tmp/junit-info --initial_node_count 1

#### Test your deployment

- If your environment matches the one used above, the deployment should now be working.
- The deployment can be verified by running the Helm tests for the `tf-job` release:

In [38]:
%%bash
helm test tf-job

RUNNING: tf-job-tfjob-test-dlgd78
PASSED: tf-job-tfjob-test-dlgd78


#### Create output bucket

Lastly, we need a Google Cloud Storage bucket to store job logs. It can be created from the command line as follows:

In [None]:
%%bash
GCLOUD_PROJECT_ID="your-project-id"  # replace with your project ID
gsutil mb gs://${GCLOUD_PROJECT_ID}-k8s
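The bucket name above is built from the project ID, and not every project ID yields a legal bucket name (a domain-scoped ID containing a colon, for instance, does not). The following sketch checks a candidate name against a subset of the Cloud Storage naming rules: 3 to 63 characters, only lowercase letters, digits, dots, dashes, and underscores, and a letter or digit at each end. The function name `valid_bucket_name` is hypothetical, and the check deliberately ignores the remaining rules (such as reserved prefixes).

```shell
# Succeed if the argument satisfies a subset of the GCS bucket naming rules:
# 3-63 characters, [a-z0-9._-] only, starting and ending with a letter or digit.
valid_bucket_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

GCLOUD_PROJECT_ID="your-project-id"  # placeholder, replace with your project ID
BUCKET="${GCLOUD_PROJECT_ID}-k8s"
if valid_bucket_name "$BUCKET"; then
  echo "bucket name looks valid: gs://${BUCKET}"
else
  echo "bucket name is not valid: ${BUCKET}" >&2
fi
```

Passing this check does not guarantee the name is available, since bucket names are shared across all Cloud Storage users.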

### Training

#### Objectives

- The objective of the training phase is to learn model parameters that achieve a high level of performance on the provided task.

#### Parameterizing the run

The parameters for a TFJob are typically specified in a YAML file. The following cell renders a TFJob template to `/tmp/tfjob.yaml`, which we will subsequently run on Kubernetes.

In [2]:
%%bash

GCLOUD_PROJECT_ID="your-project-id"  # replace with your project ID
SALT=$(date | shasum -a 256 | cut -c1-8)
VERSION_TAG=cpu-${SALT}
AGENTS_CPU=gcr.io/${GCLOUD_PROJECT_ID}/agents:${VERSION_TAG}
LOG_DIR=gs://${GCLOUD_PROJECT_ID}-k8s/jobs/run-${SALT}
JOB_NAME=tfagents-${SALT}

echo '
apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "{{job_name}}"
  namespace: default
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: {{image}}
              name: tensorflow
              args:
              - --log_dir
              - {{log_dir}}
              - --config
              - {{environment}}
              - --mode
              - {{mode}}
              - --run_base_tag
              - {{job_name}}
          restartPolicy: OnFailure
  tensorBoard:
    logDir: {{log_dir}}
  tfImage: gcr.io/dev01-181118-181500/agents-base:cpu-tf-latest
' > /tmp/tfjob.template.yaml

jinja2 /tmp/tfjob.template.yaml \
   -D image=${AGENTS_CPU} \
   -D job_name=${JOB_NAME} \
   -D log_dir=${LOG_DIR} \
   -D environment=pybullet_ant \
   -D mode=train > /tmp/tfjob.yaml

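The templating cell interpolates several shell variables into the job spec, and an unset variable expands silently to an empty string, producing a malformed image name or log path. A small guard run before the `jinja2` call can catch this. The sketch below is one way to write such a guard; the helper name `unset_vars` is hypothetical.

```shell
# Print the names of any listed shell variables that are unset or empty.
# Returns 0 only when every listed variable has a non-empty value.
unset_vars() {
  rc=0
  for var in "$@"; do
    eval "val=\${$var:-}"   # indirect lookup of the variable named in $var
    if [ -z "$val" ]; then
      echo "$var"
      rc=1
    fi
  done
  return "$rc"
}

# Intended use just before templating; names match the variables in the cell above.
if ! unset_vars GCLOUD_PROJECT_ID AGENTS_CPU LOG_DIR JOB_NAME >&2; then
  echo "the variables listed above are empty; set them before templating" >&2
fi
```

In a real cell you would likely replace the final `echo` with `exit 1` so that templating does not proceed with missing values.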


#### Launching the TFJob

Once the TFJob YAML has been rendered by the above cell, we're ready to launch the TFJob:

In [None]:
%%bash
kubectl create -f /tmp/tfjob.yaml

The pods created for the TFJob, along with their availability and IDs, can be listed with `kubectl get pods`, for example yielding the following:

In [14]:
%%bash
kubectl get pods

NAME                                                  READY     STATUS             RESTARTS   AGE
tf-job-operator-59ffc48689-vtv6k                      1/1       Running            0          1d
tfagents-5e05463f-master-xhga-0-xzr9b                 0/1       CrashLoopBackOff   405        1d
tfagents-5e05463f-ps-xhga-0-bqfn9                     1/1       Running            0          1d
tfagents-5e05463f-ps-xhga-1-x98wn                     1/1       Running            0          1d
tfagents-5e05463f-tensorboard-xhga-557995b57f-6kx6j   1/1       Running            0          1d
tfagents-5e05463f-worker-xhga-0-p9q5x                 1/1       Running            417        1d
tfagents-5e05463f-worker-xhga-1-xl6st                 0/1       CrashLoopBackOff   408        1d
tfagents-5e05463f-worker-xhga-2-h947f                 1/1       Running            414        1d
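
In the listing above, two pods are in `CrashLoopBackOff`. With many pods, it can be handy to filter the listing down to those that need attention. The following sketch does this by reading the tabular output on standard input; the function name `pods_not_running` is hypothetical, and it assumes the default column layout shown above (name first, status third).

```shell
# Read `kubectl get pods` output on stdin and print the names of pods whose
# STATUS column is anything other than "Running" or "Completed".
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Typical use against a live cluster:
#   kubectl get pods | pods_not_running
```

A pod reported as `Running` can still have a high restart count, as some workers in the listing do, so this filter is a first pass rather than a health check.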


#### Monitoring training

Because we included the `tensorBoard` field in the TFJob, a TensorBoard instance was deployed along with it. Deployments on Kubernetes can be listed with the following:

In [15]:
%%bash
kubectl get deployments

NAME                                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
tf-job-operator                      1         1         1            1           1d
tfagents-5e05463f-tensorboard-xhga   1         1         1            1           1d


Once we have the ID of our TensorBoard deployment, we can start the Kubernetes proxy with `kubectl proxy` and then open TensorBoard in a browser with the following command (substituting your deployment ID):

In [12]:
%%bash
# `open` is the macOS launcher; on Linux, `xdg-open` is the usual equivalent.
TENSORBOARD_DEPLOYMENT_ID="your-tensorboard-deployment-id"  # replace with your deployment ID
open http://127.0.0.1:8001/api/v1/proxy/namespaces/default/services/${TENSORBOARD_DEPLOYMENT_ID}:80/

This will open TensorBoard in a new browser tab.
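
The proxy URL has a fixed shape, so it can be assembled by a small helper rather than edited by hand. The sketch below builds the same URL as the cell above from a deployment ID; the function names are hypothetical, and the namespace and proxy port default to the values used in this tutorial (`default` and `8001`).

```shell
# Build the kubectl-proxy URL for a TensorBoard service, in the form used above.
# Arguments: service name, namespace (default "default"), proxy port (default 8001).
tensorboard_url() {
  service="$1"
  namespace="${2:-default}"
  port="${3:-8001}"
  echo "http://127.0.0.1:${port}/api/v1/proxy/namespaces/${namespace}/services/${service}:80/"
}

# Open a URL with whichever launcher is available, or print it as a fallback.
open_url() {
  if command -v open >/dev/null 2>&1; then
    open "$1"
  elif command -v xdg-open >/dev/null 2>&1; then
    xdg-open "$1"
  else
    echo "open this URL in a browser: $1"
  fi
}

# Example with the deployment name from the listing above.
tensorboard_url "tfagents-5e05463f-tensorboard-xhga"
```

To launch the browser directly, pass the result to the second helper, for example `open_url "$(tensorboard_url "$TENSORBOARD_DEPLOYMENT_ID")"`, with `kubectl proxy` already running.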

### Simulation

#### Objectives

- Generate a gif of our parameterized model performing the task

#### Obtaining the model locally

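- The following will transfer the model checkpoint to our local machine so we can run the model in visualize mode:

In [None]:
%%bash
# Replace the placeholders with your run's GCS log directory and a local destination.
gsutil -m cp -r "<gcs logdir path>" "<local destination>"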

#### Simulating the model

- Using the local copy of the model checkpoint we can simulate the model performing the task with the following command:

In [None]:
%%bash

# TODO: Run the model inside of a container with model parameters mounted and container display captured
#  - E.g. over VNC...

# python task.py --mode simulate --log_dir <path to logs>

#### The result

- The above will open a display window showing the agent performing the task. Below is a screen capture gif showing the expected result.

In [None]:
(the resulting gif)

### Next actions

If this is your first time working with these technologies, you might be interested in some suggestions for good next steps. Here are some ideas:

- Fork the above code and run it (possibly with modification) on other learning environments.
- Take a shot at implementing your own (very simple) learning environment and use this agent to learn it.