# Create and test an AML environment supporting Deepspeed training

This notebook shows how to build an AzureML environment that supports [Deepspeed] training. At the end of it you should have an environment, backed by a [Docker] image, that can train PyTorch and transformers models. We build our environment on a remote machine by default but also show how to build the image locally for debugging purposes. 

## An aside on how environments are structured and created 

AzureML environments are [designed] to house all the dependencies for a particular experiment and computation as well as providing some standardized interfaces to plug into the AzureML run and context system. They are ultimately provisioned as running Docker images on single VMs (AzureML compute instances) or Kubernetes clusters (AzureML compute clusters). They are versioned and you can keep images around indefinitely, enabling recreation of past work and reducing "it worked on my machine, I don't know" problems.

Environments can be created or accessed in a number of ways. You can access and use [curated environments]: prebuilt environments that support popular libraries such as scikit-learn and PyTorch. If these environments have what you need, they are convenient and well-maintained. You can also extend them by [cloning] one and adding further pip or Conda dependencies. These modified environments can be rebuilt each time they are used or cached as built Docker images in an attached Azure Container Registry (ACR). Ultimately, all persisted environments are stored as cached Docker images in ACR.

Instead of working with a curated environment, we can also create a Dockerfile defining the environment directly. This file can be built locally and pushed to ACR or uploaded to AzureML for remote compilation and storage. The former allows for local debugging but the latter is fastest and doesn't require any CLI usage. We'll explore the remote building option here and then describe the local process should you need to perform debugging. 

[Deepspeed]: http://deepspeed.ai
[Docker]: https://docs.docker.com/get-started/overview/
[designed]: https://docs.microsoft.com/en-us/azure/machine-learning/concept-azure-machine-learning-architecture#
[curated environments]: https://docs.microsoft.com/en-us/azure/machine-learning/resource-curated-environments
[cloning]: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-environments#use-a-curated-environment


## Registering and building an environment on AzureML

Throughout this notebook and those that follow we'll draw our configuration from `src/config.yml` where possible. This file contains most of the settings that will need to be customized for a new job type. Reading through it gives a sense of how to configure an experiment. 

First we'll import our needed libraries and load the configuration.

In [None]:
import subprocess
import yaml
from pathlib import Path
import azureml.core

with open('src/config.yml', 'r') as f:
    config = yaml.safe_load(f)
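
Because every later cell indexes into `config`, it can help to check up front that the keys this notebook reads are present. The following is a minimal sketch, not part of the original project: the helper name `missing_keys` and the example values are hypothetical, and the key list is simply the set of keys used in the cells of this notebook.

```python
# Keys this notebook reads from src/config.yml.
REQUIRED_KEYS = ("environment", "environment_dockerfile", "experiment", "source_directory")


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]


# Hypothetical, partially filled configuration used only for illustration.
example_config = {"environment": "deepspeed-env", "experiment": "demo"}
print(missing_keys(example_config))
```

Calling `missing_keys(config)` right after loading the file gives an early, readable list of what still needs to be filled in, rather than a `KeyError` several cells later.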

Now we'll connect to the AzureML [workspace]. The workspace can be thought of as the namespace that ties together all the models, runs, datasets, compute instances, cluster instances, and linked services we'll access. Each notebook will connect to this workspace before performing any operations with AzureML. 

We instantiate a connection to the workspace using a configuration file that is automatically provided on the AzureML compute instances but which [we must create] if we run this notebook on our own desktop or laptop machine. 

[workspace]: https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace
[we must create]: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#workspace

In [None]:
workspace = azureml.core.Workspace.from_config()

Now we'll register the environment. 

In [None]:
environment = azureml.core.Environment.from_dockerfile(
    name=config['environment'], 
    dockerfile=Path(config['environment_dockerfile']).read_text()
)
environment.python.user_managed_dependencies = True
environment.python.interpreter_path = "/opt/miniconda/bin/python"
environment = environment.register(workspace)

print(f"{environment.name} version {environment.version} registered in {workspace.name}")

Our environment is now registered in the workspace but hasn't been built. If we do nothing further, it will be built the first time it is used in a run. We can instead trigger the build now, either from this notebook or within the AzureML studio GUI, and check that it completes successfully. We'll trigger it from this notebook; the build itself still runs remotely on AzureML.

In [None]:
build = environment.build(workspace)

print(f"Current status is {build.status}. We can track the build process at \n {build.log_url}")

Now we wait: this build takes 30 to 60 minutes. We can watch the build log within AzureML studio in the environment's entry in the Environments tab, or by occasionally refreshing the URL above. `build.status` changes to `Succeeded` upon success.

In [None]:
build.status
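
Rather than re-running the cell above by hand, the wait can be automated with a small polling loop. This is a sketch under assumptions: `wait_for_status` is a hypothetical helper, it assumes that reading `build.status` returns the current state each time, and `Failed` as a terminal state is our guess (only `Succeeded` is confirmed above).

```python
import time


def wait_for_status(get_status, done=("Succeeded", "Failed"), interval=60.0, timeout=2 * 60 * 60):
    """Poll get_status() until it returns a value in `done` or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        time.sleep(interval)


# In this notebook the call would look like:
#   final_status = wait_for_status(lambda: build.status)
```

If your SDK version provides `build.wait_for_completion(show_output=True)`, that is likely the simpler route; the loop above is mainly useful when you want your own timeout or logging.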

With that complete, we can move on to the next phase and the next notebook: building our dataset.

# Optional: Should you need to debug your Docker image

The hardest part of getting this environment working on AzureML is the long debugging loop. Each time a Deepspeed-supporting Docker image is built from scratch it takes 30-60 minutes, so when debugging the image remotely you only get to change a couple of things a day. To shorten this loop we can build the Docker image locally. This lets subsequent builds reuse cached Docker layers and allows interactive exploration of the Conda environment within the image.

The locally built image can be used solely for debugging or pushed to ACR and AzureML as a full environment. 

## Process overview

The process for creating our local build and then transforming it into an environment looks like:

- Locally build the Deepspeed docker image
- Verify Deepspeed's installation and configuration by dropping into the image and running `ds_report`
- Optional
    - Push it to ACR
    - Create an environment pointing to that ACR-hosted image
    - Register the environment
    - Build the environment, pulling the built image from the ACR, adding AzureML environment and network details, and pushing the result as a new image back to ACR

Since most of the local work happens on the command line, we'll set some configuration options and then provide some example shell commands. 

In [None]:
acr_name = workspace.get_details()["containerRegistry"].split("/")[-1] 
container_name = f"{acr_name}.azurecr.io/{config['experiment']}/{config['environment']}"
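
To make the string handling above concrete: the `containerRegistry` detail is an Azure resource ID whose last path segment is the registry name, and the image repository is assembled from that name plus the experiment and environment names. The sketch below wraps the same logic in a hypothetical helper, `image_repository`, and uses made-up values so it runs without a workspace.

```python
def image_repository(registry_id, experiment, environment):
    """Build '<acr>.azurecr.io/<experiment>/<environment>' from an ACR resource ID."""
    acr_name = registry_id.split("/")[-1]  # last path segment is the registry name
    return f"{acr_name}.azurecr.io/{experiment}/{environment}"


# Hypothetical values, for illustration only.
example_id = (
    "/subscriptions/0000/resourceGroups/my-rg"
    "/providers/Microsoft.ContainerRegistry/registries/myacr"
)
print(image_repository(example_id, "demo", "deepspeed-env"))
# myacr.azurecr.io/demo/deepspeed-env
```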

Execute the commands printed by the next cell in the root of the cloned repo to build the Docker image locally and tag it for later retrieval. This will take between one minute and about an hour, depending on how many layers you've locally cached.

In [None]:
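cmds = (
    # Build from the Dockerfile named in the config, using the repo root as the build context
    f"docker build -f {config['environment_dockerfile']} .",
    # Capture the ID of the most recently created image
    'LATEST=`docker images --format "{{.ID}}" | head -n 1`',
    # Tag that image so it can be run and pushed by name
    f"docker tag $LATEST {container_name}",
)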

for line in cmds:
    print(line)
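
If you would rather run the commands from the notebook than paste them into a terminal, note that the `LATEST=...` line sets a shell variable, so all of the commands have to run in the same shell. A minimal sketch, assuming a POSIX shell is available; `run_in_one_shell` is a hypothetical helper, demonstrated here with harmless stand-in commands rather than the real Docker ones.

```python
import subprocess


def run_in_one_shell(commands):
    """Run shell commands in a single shell so variables persist between them."""
    script = " && ".join(commands)  # stop at the first command that fails
    result = subprocess.run(script, shell=True, check=True, capture_output=True, text=True)
    return result.stdout


# Stand-in commands mirroring the LATEST=`...` pattern used for the Docker build.
print(run_in_one_shell(["GREETING=`echo hello`", 'echo "$GREETING"']))
```

Because the output is captured and only returned at the end, a long Docker build would show nothing until it finishes; for that case, dropping `capture_output=True` lets the build log stream to the notebook instead.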

To drop into the container for debugging run:

In [None]:
print(f"docker run -it {container_name} bash")
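
The Deepspeed check from the process overview can also be run without an interactive shell. The sketch below only builds the command; `ds_report_command` is a hypothetical helper, the image name is made up, and it assumes Deepspeed's report command is on the image's `PATH`.

```python
def ds_report_command(image):
    """Return the argument list for a one-off container that prints Deepspeed's report."""
    # --rm removes the container once the report has been printed
    return ["docker", "run", "--rm", image, "ds_report"]


# Hypothetical image name, for illustration only.
print(" ".join(ds_report_command("myacr.azurecr.io/demo/deepspeed-env")))
# docker run --rm myacr.azurecr.io/demo/deepspeed-env ds_report
```

With Docker available locally, passing the list to `subprocess.run(ds_report_command(container_name), check=True)` would execute it.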

Finally, should you wish to build an environment from this locally built image, you may push it to the workspace's ACR by executing the following.

Note: after executing `az login`, follow the link it provides to sign in to your subscription.

In [None]:
cmds = (
    f"az login",
    f"az acr login --name {acr_name}",
    f"docker push {container_name}:latest"
)
for line in cmds:
    print(line)

We now have a Docker image in ACR. We can create an environment that uses this ACR-hosted image as its base Docker image and build a new environment from it. The environment build process appends some additional AzureML-specific layers to our pre-built Docker image. Building these additional layers, plus pulling the image from ACR, will take a few minutes to complete. You can watch the progress of the job from within the Environments tab of AzureML studio.

In [None]:
environment = azureml.core.Environment(config['environment'])
environment.docker.base_image = f"{container_name}:latest"
environment.python.user_managed_dependencies = True
environment.python.interpreter_path = "/opt/miniconda/bin/python"
environment = environment.register(workspace)
environment.build(workspace)

Now you've got an environment you can use and a head start on debugging it should you need to make changes in the future.