## Debugging an AzureML remote run

This notebook shows how to debug an AzureML run that is running on an AMLCompute cluster.
For setup instructions, please see the [Readme](README.md).


First, we import the AzureML SDK classes we need and check the SDK version.

In [24]:
from azureml.core import Workspace, Experiment
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails
from azureml.core import VERSION
VERSION

'1.0.41'

## Start run with debugger enabled

Next, we get the workspace and the AML Compute cluster we are about to use. The cell below assumes that you have created a cluster with the name `cpucluster2`; if yours is named differently, change the name accordingly. **It is important that, when you created the cluster, you provided a username (I am using `debuguser`) and a password (an SSH key is optional), since you will need to log in to the worker nodes to establish the port forwarding to the Docker container.**

![create_cluster](img/create_cluster.png)


In [31]:
ws = Workspace.from_config()
cluster = ws.compute_targets['cpucluster2']
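
If the name does not match any cluster, the lookup above fails with a bare `KeyError`. Here is a minimal sketch of a lookup that lists the available names instead. `find_compute_target` is a hypothetical helper, not part of the AzureML SDK, and it only assumes that `ws.compute_targets` behaves like a dictionary keyed by cluster name.

```python
def find_compute_target(compute_targets, name):
    """Return the compute target called `name`, or fail with a readable message.

    `compute_targets` is any mapping of name -> target, for example
    `ws.compute_targets`. Hypothetical helper, not part of the AzureML SDK.
    """
    try:
        return compute_targets[name]
    except KeyError:
        # List what is actually there so a typo in the name is easy to spot.
        available = ", ".join(sorted(compute_targets)) or "(none)"
        raise KeyError(
            "No compute target named '{}'. Available: {}".format(name, available)
        ) from None
```

In the cell above it would be used as `cluster = find_compute_target(ws.compute_targets, 'cpucluster2')`.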

Next, we will start the job on the cluster. As the remote debugging agent, we will be using the Python Tools for Visual Studio debugger (ptvsd). As the debugging client, we will be using VSCode -- see [here](https://code.visualstudio.com/docs/python/debugging#_remote-debugging) for some documentation on how to use the two together.

In our case we are using the command line launch method, since it doesn't require us to change our Python code. Instead I am adding the file `launch_debug.py` to the project, which takes the name of the actual script as a parameter and launches it from the debug agent. The agent will wait for the client to connect before running the actual script. The command line that `launch_debug.py` creates and then executes will look like this:

    python -m ptvsd --host 10.0.0.4 --port 5678 --wait train.py
    
- `train.py` is the simple training script that we want to debug 
- `--host 10.0.0.4` means that the debug agent will listen on this IP address on the worker (which will be a Docker container in AML Compute)
- `--port 5678` determines the port the agent will use 
- `--wait` instructs the agent to wait for the debugger to attach before proceeding with execution of the script
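
To make the command line above concrete, here is a rough sketch of what a launcher like `launch_debug.py` could look like. This is not the file from the repo: the function names are made up for illustration, and looking up the container's address via `socket.gethostbyname` is an assumption about how the `--host` value is determined.

```python
import socket
import subprocess
import sys


def build_debug_command(script, host, port=5678):
    """Build the ptvsd command line described above as an argument list."""
    return [
        sys.executable, "-m", "ptvsd",
        "--host", host,
        "--port", str(port),
        "--wait",
        script,
    ]


def launch(script, port=5678):
    """Run `script` under the ptvsd debug agent and return its exit code."""
    # Assumption: the agent should listen on the container's own address.
    host = socket.gethostbyname(socket.gethostname())
    command = build_debug_command(script, host, port)
    print("Launching:", " ".join(command), flush=True)
    return subprocess.call(command)


# Ignore empty arguments, then treat the first remaining one as the script.
script_args = [arg for arg in sys.argv[1:] if arg]
if __name__ == "__main__" and script_args:
    sys.exit(launch(script_args[0]))
```

The command is built as a list rather than a string so that script names with spaces survive without shell quoting.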

The following cell submits the run with `launch_debug.py` as the entry script, which launches `train.py` as shown above.

For the pip/conda dependencies, make sure to include the package `ptvsd` in the environment.

In [41]:
est = Estimator('src',                                  # the directory where the launch and train scripts are
                compute_target=cluster,
                entry_script='launch_debug.py',
                pip_packages=['scikit-learn', 'ptvsd'], # scikit-learn is the PyPI name; ptvsd is required for debugging
                script_params={'': 'train.py'})         # empty key, so train.py is passed without a flag name

run = Experiment(ws, 'debug').submit(est)

The above cell will launch the run on the compute cluster, which will likely trigger a Docker image to be built and then a compute node to be provisioned on the cluster. After about 9 minutes, you should see the image below when you execute the next cell. This means that the process is now ready for the debugger to attach.

![](img/waiting_for_debugger.png)

In [42]:
RunDetails(run).show()

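
If you would rather not watch the widget while the image builds, you can poll the run status instead. This is only a sketch: `wait_for_status` is a hypothetical helper, and the status strings (`Running`, `Completed`, `Failed`, `Canceled`) are my assumption about what `run.get_status()` returns. Note that `Running` only tells you the script has started; you still need the widget or the logs to confirm that the agent is waiting for the debugger.

```python
import time


def wait_for_status(run, wanted="Running", poll_seconds=30, timeout_seconds=1800):
    """Poll `run.get_status()` until it equals `wanted`, the run ends, or we time out.

    Returns the last status seen. Works with any object that has get_status().
    """
    finished = {"Completed", "Failed", "Canceled"}  # assumed terminal states
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = run.get_status()
        print("Run status:", status, flush=True)
        if status == wanted or status in finished or time.monotonic() >= deadline:
            return status
        time.sleep(poll_seconds)
```

Call it as `wait_for_status(run)` after submitting the run.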

## Attaching the VSCode debugger

### Establish the port-forwarding from Notebook VM to worker node
Since the Notebook VM does not yet support VNets, you need to set up port forwarding through an SSH login to the worker node.

First, you need to find the IP address and port of the node that is currently waiting for the debugger to connect. In the Azure portal, open the Machine Learning workspace and find the nodes on the cluster by going to the nodes tab of the cluster. Note down the IP address and port -- in my case they are `40.74.20.244` and `50000`.

![compute_nodes](img/compute_nodes.png)

Then, open the terminal on the Notebook VM and type the following:  
    
    ssh debuguser@<clusternode IP> -p <clusternode port> -L 5678:<debugger IP>:5678

In my case it is:
    
    ssh debuguser@40.74.20.244 -p 50000 -L 5678:10.0.0.4:5678

Make sure to leave the terminal process running to keep the port-forward alive.
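
Since the command has several moving parts, here is a small sketch that assembles it from the values you noted down. `ssh_forward_command` is a hypothetical helper; it only builds the command and does not run it.

```python
import shlex


def ssh_forward_command(user, node_ip, node_port, debugger_ip, debug_port=5678):
    """Return the ssh port-forward command as an argument list."""
    # -L <local port>:<debugger IP>:<remote port>, using the same port on both ends.
    forward = "{0}:{1}:{0}".format(debug_port, debugger_ip)
    return ["ssh", "{}@{}".format(user, node_ip), "-p", str(node_port), "-L", forward]


command = ssh_forward_command("debuguser", "40.74.20.244", 50000, "10.0.0.4")
print(" ".join(shlex.quote(part) for part in command))
# prints: ssh debuguser@40.74.20.244 -p 50000 -L 5678:10.0.0.4:5678
```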

If you want to double-check that the port-forward is indeed forwarding to the debug agent waiting for a connection, run this in another terminal on the Notebook VM:

    telnet localhost 5678
    
This should return something like this:

    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    Content-Length: 131

    {"type": "event", "seq": 0, "event": "output", "body": {"category": "telemetry", "output": "ptvsd", "data": {"version": "4.2.10"}}}
    Connection closed by foreign host.
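
If `telnet` is not installed, the same check can be sketched in Python with a plain socket. `probe_port` is a hypothetical helper; it connects to the forwarded port and returns whatever the other side sends first, which in the case above should start with `Content-Length`.

```python
import socket


def probe_port(host="localhost", port=5678, timeout=5.0, max_bytes=1024):
    """Connect to host:port and return the first bytes the peer sends.

    Raises OSError if the connection is refused or times out.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(max_bytes)
```

Run `print(probe_port())` in a Python session on the Notebook VM while the SSH session is open.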
    
    
While setting up the port forward is tedious, you will only have to do it once for as long as your compute node stays up. If you set your cluster to have a higher value for **Idle seconds before scale down** (e.g. 3600, i.e. 1 hour), then you can work all day without having to re-establish the port-forward.

### Actually attaching the VSCode debugger

Now all the hard work is done, and all that's left is to open VSCode and remote into the Notebook VM (see [README.md](README.md) for setup instructions). Then open the `src` folder of this repo as the project folder -- **it is important that the `src` folder is the project root in VSCode, so the file names match the file names on the compute target**.

Open the file `train.py` and set a breakpoint somewhere in the middle:

![](img/set_breakpoint.png)

Now attach the debugger by clicking the debug icon on the left and then picking the debug configuration "Python: Remote Attach" at the top. The debugger should attach and allow you to step through your code.

![](img/debug.png)
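
In case your project has no "Python: Remote Attach" configuration yet, the sketch below prints a `launch.json` that you could save under `.vscode/` in the `src` folder. The values are assumptions on my part: `localhost` and port `5678` are the local end of the SSH port-forward, and the path mapping assumes that `src` is both the VSCode workspace folder and the working directory of the run on the compute target.

```python
import json

# Assumed attach settings; adjust host, port, and path mapping to your setup.
remote_attach = {
    "name": "Python: Remote Attach",
    "type": "python",
    "request": "attach",
    "host": "localhost",  # local end of the SSH port-forward
    "port": 5678,         # same port the debug agent was started with
    "pathMappings": [
        {"localRoot": "${workspaceFolder}", "remoteRoot": "."}
    ],
}

launch_config = {"version": "0.2.0", "configurations": [remote_attach]}
print(json.dumps(launch_config, indent=4))
```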

![](img/network.png)

In [48]:
run.cancel()
