# Installing packages in a VM cluster on Azure
Expanding on the documentation here: https://cloudprovider.dask.org/en/latest/azure.html

The problem: any package your code imports locally must also be installed on the cluster VMs, otherwise tasks that use it will fail on the workers.
In this example, intake-esm is the package we want.

There are two ways to do this, with or without a Docker container. Both are shown below.

In [1]:
# Settings
env_name = "demo_env_name"

In [2]:
# Use this cell if the conda environment is not already set up
# You will then be able to select the env as a kernel in the jupyter notebook.
# This is controlled mainly by environment.yml,
# but env.sh installs the kernel for the jupyter notebook.
# You will probably not need to change env.sh.
!. env.sh {env_name} environment.yml

conda is /anaconda/envs/azureml_py36/bin/conda
demo_env_name            /anaconda/envs/demo_env_name
activating environment demo_env_name
Change kernel to demo_env_name, refresh browser if not available.


In [3]:
# Replace these with the names of your own Azure resources.
my_resource_group="PangeoHarvestOccupationalHeat"
my_vnet="PangeoHOH-vnet"
my_security_group="PangeoHOH-sg"
# See 

dockerhub_id = "charlessimpson" # You will need to set yours if using the docker method.

## Not using a docker container

In [4]:
# You will need to log in to Azure in order to launch the VMs.
!az login

In [5]:
# If you decide not to use Docker, you can pass the dependencies to
# the AzureVMCluster constructor with the 'env_vars' keyword.
# You can generate the package lists from your environment.yml,
# or list them manually.
# If you have a large number of dependencies this can be slow.
#
# If this doesn't work, check your dask-cloudprovider version.
# Versions from before 23 Feb 2021 have an issue,
# see https://github.com/dask/dask-cloudprovider/pull/258
#
# Parse the dependencies
import yaml

# Let a YAMLError propagate: nothing below can work without a parsed file.
with open("environment.yml", "r") as stream:
    env_yml = yaml.safe_load(stream)

EXTRA_CONDA_PACKAGES = []
EXTRA_PIP_PACKAGES = []
for entry in env_yml["dependencies"]:
    if isinstance(entry, str):
        # Plain strings are conda packages.
        EXTRA_CONDA_PACKAGES.append(entry)
    elif isinstance(entry, dict):
        # The nested {"pip": [...]} mapping lists the pip packages.
        EXTRA_PIP_PACKAGES.extend(entry.get("pip", []))
        
EXTRA_CONDA_PACKAGES = ' '.join(EXTRA_CONDA_PACKAGES)
EXTRA_PIP_PACKAGES = ' '.join(EXTRA_PIP_PACKAGES)

print("CONDA", EXTRA_CONDA_PACKAGES)
print("PIP", EXTRA_PIP_PACKAGES)

CONDA dask xarray pip
PIP dask-cloudprovider[azure] intake-esm
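
If you would rather list the packages by hand, the `env_vars` dictionary can be assembled from plain Python lists. The sketch below is one way to do it. `build_env_vars` is a hypothetical helper, not part of dask-cloudprovider, and it assumes the VM image reads space-separated package names from `EXTRA_CONDA_PACKAGES` and `EXTRA_PIP_PACKAGES`, as in this notebook.

```python
def build_env_vars(conda_packages=(), pip_packages=()):
    """Return an env_vars dict with space-separated package lists.

    Hypothetical helper for illustration; empty lists are left out so that
    no empty variable is set on the VMs.
    """
    env_vars = {}
    if conda_packages:
        env_vars["EXTRA_CONDA_PACKAGES"] = " ".join(conda_packages)
    if pip_packages:
        env_vars["EXTRA_PIP_PACKAGES"] = " ".join(pip_packages)
    return env_vars


env_vars = build_env_vars(
    conda_packages=["dask", "xarray", "pip"],
    pip_packages=["dask-cloudprovider[azure]", "intake-esm"],
)
print(env_vars)
```

The resulting dictionary can be passed as `env_vars=env_vars` when constructing `AzureVMCluster`.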


In [6]:
# Set up the Azure Dask cluster
from dask_cloudprovider.azure import AzureVMCluster
from dask.distributed import Client


cluster = AzureVMCluster(resource_group=my_resource_group,
                         vnet=my_vnet,
                         security_group=my_security_group,
                         location="UK South",
                         env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES, "EXTRA_CONDA_PACKAGES": EXTRA_CONDA_PACKAGES},
                         n_workers=1
                        )

client = Client(cluster)
client

Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-8b30dfd9-scheduler
Waiting for scheduler to run at 40.120.40.69:8786
Scheduler is running
Creating worker instance




Network interface ready
Creating VM
Created VM dask-8b30dfd9-worker-5a3e7000



+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| blosc   | None   | 1.9.2     | None    |
| lz4     | None   | 3.1.1     | None    |
| msgpack | 1.0.2  | 1.0.0     | None    |
+---------+--------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


Client
  Scheduler: tls://40.120.40.69:8786
  Dashboard: http://40.120.40.69:8787/status
Cluster
  Workers: 0
  Cores: 0
  Memory: 0 B
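
The summary reports `Workers: 0`, most likely because the worker VM was still starting up when the client connected. Below is a minimal sketch of blocking until workers have joined, assuming the `client` created in the cell above. `wait_until_ready` is a hypothetical helper name and the ten-minute timeout is an arbitrary choice; `Client.wait_for_workers` and `Client.scheduler_info` come from `dask.distributed`.

```python
def wait_until_ready(client, n_workers=1, timeout=600):
    """Block until n_workers have joined, then return their addresses.

    Hypothetical helper for illustration; timeout is in seconds.
    """
    client.wait_for_workers(n_workers, timeout=timeout)
    return sorted(client.scheduler_info()["workers"])


# Usage with the client created above
# (left commented out because it needs a live cluster):
# print(wait_until_ready(client, n_workers=1))
```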


## Using a docker container
This may be faster in the long run if you have a lot of dependencies, since the packages are installed once when the image is built rather than each time a VM starts.

In [7]:
# Build a Docker image to run on the cluster VMs.
# The build is controlled by the Dockerfile and environment.yml.
# You will probably not need to change the Dockerfile.
!docker build -t {env_name} . 

Sending build context to Docker daemon  120.3kB
Step 1/8 : FROM continuumio/miniconda3:4.8.2
 ---> b4adc22212f1
Step 2/8 : RUN mkdir /opt/app
 ---> Using cache
 ---> b5a2c9c628ad
Step 3/8 : COPY prepare.sh /usr/bin/prepare.sh
 ---> Using cache
 ---> d102a00b0315
Step 4/8 : COPY environment.yml /opt/app/environment.yml
 ---> eb1baf9ec09a
Step 5/8 : RUN conda install mamba -n base -c conda-forge
 ---> Running in 890331f030a5
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - mamba


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |            1_gnu          22 KB  conda-forge
    bzip2-1.0.8                |       h7f98852_4         484 KB 

In [8]:
# You will need to log in to Docker Hub. Open a terminal to do this,
# or use the --username and --password flags.
# Credentials are deliberately left out of this notebook.
!docker login

In [9]:
# This tags the local Docker image you just built
# with the name of a repository in your Docker Hub account.
!docker tag {env_name}:latest {dockerhub_id}/{env_name}:latest

In [10]:
# This pushes the Docker image to your Docker Hub account.
# This is necessary because the Dask VMs cannot access
# the local files of the machine running this notebook;
# they pull the image from Docker Hub instead.
!docker push {dockerhub_id}/{env_name}:latest

The push refers to repository [docker.io/charlessimpson/demo_env_name]

fae8c2c9: Preparing
91c16778: Preparing
b08efadf: Preparing
e9e21ff8: Preparing
12c0e863: Preparing
a11e566d: Preparing
d39597dd: Preparing
20aa853c: Preparing
91c16778: Pushed   1.028GB/1.001GB

In [11]:
# You will need to log in to Azure in order to launch the VMs.
!az login

In [12]:
# Set up the Azure Dask cluster
from dask_cloudprovider.azure import AzureVMCluster
from dask.distributed import Client


cluster = AzureVMCluster(resource_group=my_resource_group,
                         vnet=my_vnet,
                         security_group=my_security_group,
                         location="UK South",
                         docker_image=f'{dockerhub_id}/{env_name}:latest',
                         n_workers=1
                        )

client = Client(cluster)
client

Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-d4e576d8-scheduler
Waiting for scheduler to run at 40.120.40.80:8786
Scheduler is running
Creating worker instance
Network interface ready
Creating VM
Created VM dask-d4e576d8-worker-aaa9a7a4



+---------+-----------+-----------+---------+
| Package | client    | scheduler | workers |
+---------+-----------+-----------+---------+
| blosc   | None      | 1.10.2    | None    |
| dask    | 2021.05.0 | 2021.04.1 | None    |
| lz4     | None      | 3.1.3     | None    |
+---------+-----------+-----------+---------+


Client
  Scheduler: tls://40.120.40.80:8786
  Dashboard: http://40.120.40.80:8787/status
Cluster
  Workers: 0
  Cores: 0
  Memory: 0 B


## Prove it worked

In [13]:
# client.run takes a callable, not a string, and returns {worker_address: result}.
client.run(lambda: __import__("intake_esm").__version__)

{}
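
`Client.run` calls a function on every connected worker and returns a dictionary keyed by worker address, so an empty `{}` means that no worker answered, not that the import succeeded. A hedged sketch of a stricter check follows. `package_version` and `check_on_workers` are hypothetical helpers, and `client` is assumed to be the client created earlier.

```python
import importlib


def package_version(name="intake_esm"):
    """Import a package where this function runs and return its version."""
    module = importlib.import_module(name)
    return getattr(module, "__version__", "unknown")


def check_on_workers(client, func):
    """Run func on all workers and fail loudly if none replied."""
    results = client.run(func)
    if not results:
        raise RuntimeError("No workers replied; wait for them to join and retry.")
    return results


# Usage with a live cluster:
# print(check_on_workers(client, package_version))
```

An `ImportError` raised on a worker would indicate that the package is missing there. When you are done, closing the client and the cluster (`client.close()`, `cluster.close()`) should shut the VMs down so they stop incurring charges.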