# Vertex AI job submit

This tutorial extends [aif_training_job_tutorial](https://github.com/MayoNeurologyAI/aif_training_job_tutorial) with specifics for running a model folder that has multiple dependencies and additional requirements.


## 1. Install Docker
Make sure Docker is installed on your machine. You can download and install Docker from the official Docker website (https://www.docker.com/).

Once Docker is installed, start the Docker application. Docker commands only work while it is running.

## 2. Authorize access to Google Cloud and Docker images
<font color='red'>Please execute the commands in the cell below consecutively on your terminal or command prompt.</font>

Also make sure that the Mayo VPN is active when you run these commands.

```
# Authenticate with Google Cloud
gcloud auth login

# Set the active project to the fdgpet project
gcloud config set project ml-mps-aif-afdgpet01-p-6827

# Let Docker authenticate to the AIF registries (Artifact Registry and Container Registry)
gcloud auth configure-docker us-central1-docker.pkg.dev
gcloud auth configure-docker us.gcr.io
```
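Before building anything, a quick preflight check can confirm that the `docker` and `gcloud` command-line tools are on your `PATH`. This is a minimal Python sketch using only the standard library. It checks that the executables exist. It does not check that you are logged in or that the Docker application is running. `missing_tools` is a hypothetical helper name.

```python
import shutil

# Command-line tools this tutorial relies on.
REQUIRED_TOOLS = ("docker", "gcloud")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("docker and gcloud are available.")
```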

## 3. Create a Dockerfile

Create a Dockerfile modeled on `./Dockerfile`. This `Dockerfile` extends the AIF image `us.gcr.io/ml-mps-aif-afdgpet01-p-6827/speech:latest`. If necessary, it can also install additional requirements with pip or conda.

The Dockerfile used in this repository ([mayo-w2v2](https://github.com/dwiepert/mayo-w2v2)) has some specific commands to consider. Because of the repository structure, the entire `src` folder must be copied into the image and set as the working directory, and the script must be run from there so that relative imports work properly.

##### AIF image `us.gcr.io/ml-mps-aif-afdgpet01-p-6827/speech:latest`:

This is a PyTorch image that includes all packages listed in `base_requirements.txt` plus the dependencies from the base container (`gcr.io/deeplearning-platform-release/pytorch-gpu.1-13.py310`). Note that the base image also includes a conda install of ffmpeg.

For the dependencies from the base container, see also [Prebuilt container image dependencies](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#pytorch) and [Pre-installed software in containers](https://cloud.google.com/deep-learning-containers/docs/overview#pre-installed_software).


##### Script `run.py`:

This can be changed to whatever script runs your model; for our purposes, it is always `run.py`. The mayo-w2v2 `src` folder also contains a [test.py](https://github.com/dwiepert/mayo-w2v2/src/test.py) script that you can use to check that all relative imports work.
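The actual contents of mayo-w2v2's `test.py` are not reproduced here. As a hypothetical sketch of the same idea, a script like the one below, run with `python test.py` from the `src` working directory inside the container, tries to import each module and reports any that fail. The names in `MODULES_TO_CHECK` are placeholders; replace them with the modules in your own `src` folder.

```python
import importlib

# Hypothetical placeholder names: replace with the modules in your src folder.
MODULES_TO_CHECK = ["utilities", "models"]


def check_imports(module_names):
    """Import each module by name; return {name: error} for every failure."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # ImportError, or an error raised while importing
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    failed = check_imports(MODULES_TO_CHECK)
    for name, error in failed.items():
        print(f"FAILED {name}: {error}")
    print("All imports OK" if not failed else f"{len(failed)} import(s) failed")
```

Because Python adds the running script's own directory to `sys.path`, sibling modules in `src` resolve when the script lives in `src`.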


### 3.1. Build a Docker image locally
The following commands build a Docker image on your local machine from the specifications in the Dockerfile.


<font color='green'>Please note the following points regarding the usage of `LOCAL_IMAGE_NAME[:TAG]`</font>:

<ul>
    <li> You can choose any name and tag for the <code>LOCAL_IMAGE_NAME[:TAG]</code> parameter. It is only a local reference that helps you identify and manage your Docker images.</li>
    <li> If you use an existing local image name and tag, the new Docker image will replace the previous one with the same name and tag. Be cautious when reusing image names and tags to avoid unintentionally overwriting existing images.</li>
</ul>

<font color='red'> Note: If your Dockerfile builds on an AIF image, you need to be connected to the VPN. Also, depending on the size of the image, this initial local build can take a long time (e.g., a 25 GB image could take over 2 hours), so plan accordingly.</font>

```
# `speech_test_script` is the LOCAL_IMAGE_NAME and `latest` is the TAG for this example.
# Run these in a terminal or command shell.

# Option 1: the Dockerfile is in the current directory (`cd` into that directory first).
# Use either command; `docker buildx build` explicitly uses BuildKit instead of the deprecated legacy builder.
docker build -t speech_test_script:latest .
docker buildx build -t speech_test_script:latest .

# Option 2: pass the path to the Dockerfile with -f (relative to your current directory).
# The trailing `.` is the build context; COPY paths in the Dockerfile are resolved against it.
docker buildx build -t speech_test_script:latest -f PATH_TO_DOCKERFILE .
```
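If you prefer to drive the build from Python (for example, from this notebook), the sketch below assembles the same `docker buildx build` command as a list of arguments and only executes it when asked. `docker_build_cmd` and `run_cmd` are hypothetical helper names, not part of any library.

```python
import shlex
import subprocess


def docker_build_cmd(local_image, dockerfile=None, context="."):
    """Return the argument list for `docker buildx build`.

    local_image: LOCAL_IMAGE_NAME[:TAG], e.g. "speech_test_script:latest"
    dockerfile:  optional path to the Dockerfile, passed with -f
    context:     build context; COPY paths in the Dockerfile resolve against it
    """
    cmd = ["docker", "buildx", "build", "-t", local_image]
    if dockerfile is not None:
        cmd += ["-f", dockerfile]
    cmd.append(context)
    return cmd


def run_cmd(cmd, dry_run=True):
    """Print the shell-quoted command; execute it only if dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_cmd(docker_build_cmd("speech_test_script:latest"))
```

Calling `run_cmd(docker_build_cmd("speech_test_script:latest"), dry_run=False)` actually runs the build.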

### 3.2. Run Docker Image Locally
You can run this container locally, but if you are only submitting a job, skip this step and go directly to 3.3. There are two ways to work with the container:

1. If you specified a script to execute when the container starts (an `ENTRYPOINT` in the Dockerfile), pass the script's arguments after the image name:

```
docker run <LOCAL_IMAGE_NAME[:TAG]> --user-arg1 user-arg1-value --user-arg2 user-arg2-value
```

The user args are whatever arguments your script defines.
#### <font color="red"> Please note that in this situation, the Docker container does not have access to Google Cloud authentication. When the image runs as a training job, however, it does have access to Google Cloud resources.</font>

2. If you did not specify a script to execute, you can start the container interactively and run any script from inside it. This approach lets you authenticate with Google Cloud inside the container locally.

```
docker run -it <LOCAL_IMAGE_NAME[:TAG]> bash
```
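The two ways of running the container differ only in how the `docker run` arguments are assembled. A minimal Python sketch, where `docker_run_cmd` is a hypothetical helper and `user-arg1`/`user-arg2` stand in for whatever flags your script defines. The interactive variant uses `--entrypoint bash` so that it opens a shell even when the image defines an `ENTRYPOINT`.

```python
def docker_run_cmd(local_image, script_args=None, interactive=False):
    """Return the argument list for `docker run`.

    interactive=False: arguments after the image name are appended to the
        image's ENTRYPOINT (your script), e.g. --user-arg1 value.
    interactive=True: start a bash shell instead; --entrypoint overrides any
        ENTRYPOINT defined in the image.
    """
    if interactive:
        return ["docker", "run", "-it", "--entrypoint", "bash", local_image]
    cmd = ["docker", "run", local_image]
    for flag, value in (script_args or {}).items():
        cmd += [f"--{flag}", str(value)]
    return cmd


print(docker_run_cmd("speech_test_script:latest",
                     {"user-arg1": "user-arg1-value", "user-arg2": "user-arg2-value"}))
```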

#### <font color="red"> Note: Run the commands below inside a terminal window or command prompt. </font>

#### Running with entry point

```
docker run speech_test_script:latest
```

#### Running with bash

```
# --entrypoint bash opens a shell even if the image defines an ENTRYPOINT
docker run -it --entrypoint bash speech_test_script:latest
```

#### Access a running Docker container

Find the `container_id` with `docker ps -a` (which also lists stopped containers), then open a shell in the running container with:
```
docker exec -it container_id bash
```

```
docker ps -a
```

#### <font color="red"> Note: Run the below command inside a new terminal window or command prompt</font>

```
# Open a bash shell in the running container (replace a4652310229d with your container ID)
docker exec -it a4652310229d bash
```

#### Set up access within the running container

```
# Authenticate the container with your credentials (also updates Application Default Credentials)
gcloud auth login --update-adc

# Set the project ID
gcloud config set project ml-mps-aif-afdgpet01-p-6827
```

#### Run a script 

```
# Run test.py
python test.py

# Exit the container after successful execution
exit
```

#### Clean up the container locally
Stop and remove the container:

```
docker stop container_id
docker rm container_id
```

```
docker stop a4652310229d
docker rm a4652310229d
```
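If you started several containers while testing, you can clean up every container created from a given image at once. The sketch below relies on `docker ps -a -q --filter ancestor=IMAGE`, which prints only the IDs of matching containers, and then removes them with `docker rm -f` (which kills a running container before removing it). The helper names are hypothetical, and `dry_run` defaults to `True` so nothing is removed until you opt in.

```python
import subprocess


def container_ids_cmd(image):
    """Command that lists the IDs of all containers (running or stopped) from `image`."""
    return ["docker", "ps", "-a", "-q", "--filter", f"ancestor={image}"]


def parse_ids(stdout):
    """Split `docker ps -q` output into a list of container IDs."""
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def cleanup_cmd(container_ids):
    """Command that force-removes the given containers."""
    return ["docker", "rm", "-f", *container_ids]


def cleanup(image, dry_run=True):
    """Return the matching container IDs; remove them only if dry_run is False."""
    listed = subprocess.run(container_ids_cmd(image), capture_output=True,
                            text=True, check=True)
    ids = parse_ids(listed.stdout)
    if ids and not dry_run:
        subprocess.run(cleanup_cmd(ids), check=True)
    return ids
```

`cleanup("speech_test_script:latest")` returns the IDs it would remove; pass `dry_run=False` to remove them.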

### 3.3. Tag the Docker image
Use the docker tag command to assign a specific tag to your Docker image. This tag helps identify and manage different versions or variations of the image.

```
docker tag <LOCAL_IMAGE_NAME[:TAG]> TARGET_IMAGE[:TAG]
```

<font color='green'>Please note the following points regarding the usage of `TARGET_IMAGE[:TAG]`</font>:

<ul>
    <li> It should be of the format <font color="green">us-central1-docker.pkg.dev/[PROJECT-ID]/[DATASET-ID]/[REMOTE-IMAGE-NAME][:TAG]</font></li>
    <li> <code>[REMOTE-IMAGE-NAME][:TAG]</code> is the image name you want to have in AIF. If you use an existing AIF image name and tag, the new Docker image will replace the previous one with the same name and tag. Exercise caution when reusing image names and tags to prevent unintentionally overwriting existing images.</li>
    <li> To keep images on AIF organized, reuse the same <code>REMOTE-IMAGE-NAME</code> wherever possible rather than creating many images, but use a distinct TAG so you do not overwrite someone else's work. <font color="red">Please take care in this step so you don't accidentally overwrite someone else's image</font>.</li>
</ul>

```
# LOCAL_IMAGE_NAME[:TAG] is the one used in the previous step: speech_test_script:latest
# TARGET_IMAGE[:TAG] has the format `us-central1-docker.pkg.dev/[PROJECT-ID]/[DATASET-ID]/[IMAGE-NAME][:TAG]`, where
#   PROJECT-ID = ml-mps-aif-afdgpet01-p-6827
#   DATASET-ID = phi-main-us-central1-p
#   IMAGE-NAME = speech_test_w2v2
#   TAG = m144443-latest (a user-specific prefix keeps the tag unique)

docker tag speech_test_script:latest us-central1-docker.pkg.dev/ml-mps-aif-afdgpet01-p-6827/phi-main-us-central1-p/speech_test_w2v2:m144443-latest
```

### 3.4. Push the Docker image to AIF
<font color="red">Note: when pushing the Docker image to AIF, please double-check your TARGET-IMAGE-NAME and TAG to make sure they are unique to you.</font>

```
# docker push us-central1-docker.pkg.dev/[PROJECT-ID]/[DATASET-ID]/[TARGET-IMAGE-NAME][:TAG]
docker push us-central1-docker.pkg.dev/ml-mps-aif-afdgpet01-p-6827/phi-main-us-central1-p/speech_test_w2v2:m144443-latest
```
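Because pushing to an existing name and tag replaces that image, it can help to check the registry first. A hedged sketch: `docker manifest inspect` exits with status 0 when it can fetch the image's manifest, so a zero exit status means the tag is already taken. A non-zero status usually means the tag is free but can also indicate an authentication or network problem, so treat it as a hint. `tag_and_push_cmds`, `tag_exists`, and `push` are hypothetical helpers.

```python
import subprocess


def tag_and_push_cmds(local_image, target_image):
    """The `docker tag` and `docker push` commands from steps 3.3 and 3.4."""
    return [
        ["docker", "tag", local_image, target_image],
        ["docker", "push", target_image],
    ]


def tag_exists(target_image):
    """True if the registry already serves a manifest for target_image."""
    result = subprocess.run(["docker", "manifest", "inspect", target_image],
                            capture_output=True, text=True)
    return result.returncode == 0


def push(local_image, target_image, dry_run=True):
    """Refuse to overwrite an existing tag; otherwise tag and push."""
    if tag_exists(target_image):
        raise RuntimeError(f"{target_image} already exists; choose a new TAG")
    for cmd in tag_and_push_cmds(local_image, target_image):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```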

### You can view the published image in `Artifact Registry` on AI Factory

https://console.cloud.google.com/artifacts/docker/ml-mps-aif-afdgpet01-p-6827/us-central1/phi-main-us-central1-p?project=ml-mps-aif-afdgpet01-p-6827

## 4. Edit `config.json` 

Modify `imageUri` and `args` for your job.

Note:
  <ul>
  <li><font color="green"> The args parameter is a list of command line arguments that can be passed to the script. </font></li>
 
  <li><font color="green">The imageUri refers to the Docker image (TARGET_IMAGE[:TAG]) that you pushed to AIF in step 3.4. Please make sure the TARGET_IMAGE[:TAG] in config.json matches the image you pushed. </font></li>
  </ul>
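The repository's own `config.json` is not reproduced here. Assuming it follows the CustomJobSpec layout that `gcloud ai custom-jobs create --config` accepts (`workerPoolSpecs` with `machineSpec`, `replicaCount`, and `containerSpec`), a config could be generated from Python as below. The machine type, accelerator, and script arguments are placeholder assumptions; use whatever resources and arguments your job actually needs.

```python
import json


def make_config(image_uri, args, machine_type="n1-standard-8",
                accelerator_type="NVIDIA_TESLA_T4", accelerator_count=1):
    """Build a single-worker CustomJobSpec dict for Vertex AI."""
    machine_spec = {"machineType": machine_type}
    if accelerator_count:  # omit the GPU fields for CPU-only jobs
        machine_spec["acceleratorType"] = accelerator_type
        machine_spec["acceleratorCount"] = accelerator_count
    return {
        "workerPoolSpecs": [
            {
                "machineSpec": machine_spec,
                "replicaCount": 1,
                "containerSpec": {
                    "imageUri": image_uri,           # TARGET_IMAGE[:TAG] pushed in step 3.4
                    "args": [str(a) for a in args],  # command-line args for your script
                },
            }
        ]
    }


def write_config(config, path="config.json"):
    """Write the config dict to disk as indented JSON."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)


if __name__ == "__main__":
    config = make_config(
        "us-central1-docker.pkg.dev/ml-mps-aif-afdgpet01-p-6827/phi-main-us-central1-p/speech_test_w2v2:m144443-latest",
        ["--user-arg1", "user-arg1-value"],  # hypothetical script arguments
    )
    print(json.dumps(config, indent=2))
```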

### Passing files as arguments

When passing files as arguments, first copy any local files to a Google Cloud Storage bucket with `gsutil cp PATH_TO_FILE gs://bucket-name/PATH_TO_SAVE`, then reference the uploaded file's `gs://` path in `args`.
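A small sketch of that copy step that also returns the resulting `gs://` URI, so the same string can go into `args`. `bucket-name`, the destination folder, and `data/labels.csv` are placeholders, and `gcs_upload_cmd` is a hypothetical helper. It only builds the `gsutil cp` command, which you can run in a terminal or with `subprocess.run`.

```python
import posixpath
from pathlib import PurePath


def gcs_upload_cmd(local_path, bucket, dest_dir=""):
    """Return (gsutil cp command, gs:// URI where the file will be stored)."""
    filename = PurePath(local_path).name
    object_path = posixpath.join(dest_dir.strip("/"), filename)
    uri = f"gs://{bucket}/{object_path}"
    return ["gsutil", "cp", local_path, uri], uri


cmd, uri = gcs_upload_cmd("data/labels.csv", "bucket-name", "speech/inputs/")
print(" ".join(cmd))
print(uri)
```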

## 5. Run training job with `config.json`

<font color="red">Note: In the command below, the --display-name can be anything you choose, since it is used for identification. We suggest including your LAN ID in the display name so that, when multiple jobs are running, it is easy to tell yours apart from someone else's. </font>

```
# speech_timm_pt_embedding is the display name and embedding_config.json is the edited config file for this example

gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=speech_timm_pt_embedding \
  --config=embedding_config.json
```
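To follow the suggestion of putting your LAN ID in the display name, you could generate the name and the command together. A minimal sketch: the `LANID_job_timestamp` naming pattern is only an assumption, not an AIF convention, and `lanid` is a placeholder for your own ID. `display_name` and `custom_job_cmd` are hypothetical helpers.

```python
import datetime


def display_name(lan_id, job, when=None):
    """e.g. lanid_embedding_20240102_030405 (LAN ID first so your jobs are easy to spot)."""
    when = when or datetime.datetime.now()
    return f"{lan_id}_{job}_{when:%Y%m%d_%H%M%S}"


def custom_job_cmd(name, config_path, region="us-central1"):
    """Argument list for `gcloud ai custom-jobs create`."""
    return [
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={name}",
        f"--config={config_path}",
    ]


print(" ".join(custom_job_cmd(display_name("lanid", "embedding"), "embedding_config.json")))
```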

### Training job status
##### Please visit https://console.cloud.google.com/vertex-ai/training/custom-jobs?project=ml-mps-aif-afdgpet01-p-6827 to check the job status

##### Also check the Google Cloud Storage output path used by your script to confirm that the log file exists
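The job status is also available from the command line. `gcloud ai custom-jobs describe` prints the job (here narrowed to its `state` field with `--format`), and `gcloud ai custom-jobs stream-logs` follows its logs. Both take the job ID that `gcloud ai custom-jobs create` reports. A small sketch that builds those commands; `job_status_cmds` is a hypothetical helper and `JOB_ID` is a placeholder.

```python
import shlex


def job_status_cmds(job_id, region="us-central1"):
    """Commands that print a custom job's state and stream its logs."""
    return {
        "state": ["gcloud", "ai", "custom-jobs", "describe", job_id,
                  f"--region={region}", "--format=value(state)"],
        "logs": ["gcloud", "ai", "custom-jobs", "stream-logs", job_id,
                 f"--region={region}"],
    }


# shlex.join quotes the parentheses in --format so the printed lines paste into a shell.
for purpose, cmd in job_status_cmds("JOB_ID").items():
    print(f"{purpose}: {shlex.join(cmd)}")
```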