# Train with PyTorch Lightning

description: Train PyTorch Lightning models on a single node, including single-node multi-GPU training.

In [1]:
%pip install --upgrade tensorboard azureml-tensorboard

Note: you may need to restart the kernel to use updated packages.


In [8]:
import azureml.core
print(azureml.core.VERSION)

1.48.0


In [10]:
%conda install pywin32

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: c:\Users\ziqiwang\Anaconda3

  added / updated specs:
    - pywin32


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-22.11.1              |   py39haa95532_4         892 KB
    ruamel.yaml-0.17.21        |   py39h2bbff1b_0         174 KB
    ruamel.yaml.clib-0.2.6     |   py39h2bbff1b_1         101 KB
    ------------------------------------------------------------
                                           Total:         1.1 MB

The following NEW packages will be INSTALLED:

  ruamel.yaml        pkgs/main/win-64::ruamel.yaml-0.17.21-py39h2bbff1b_0
  ruamel.yaml.clib   pkgs/main/win-64::ruamel.yaml.clib-0.2.6-py39h2bbff1b_1

The following packages will be UPDATED:

  conda                               4.14.0-py39haa95532_0 -->

In [2]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

Workspace.create(name='ZiqiPipelineTest', subscription_id='ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9', resource_group='ziqitest')

In [6]:
# training script
source_dir = "src"
script_name = "train.py"

# environment file
environment_file = "environment.yml"

# azure ml settings
environment_name = "pt-lightning"
experiment_name = "pt-lightning-tutorial"
compute_name = "gpu-cluster"

## Create environment

Define a conda environment YAML file with your training script dependencies and create an Azure ML environment. The dependencies for this tutorial include **torch**, **torchvision**, and **pytorch-lightning**.

Since this example is for GPU training, you will need to specify a GPU base image that has the necessary dependencies. Azure ML maintains a set of base images published on Microsoft Container Registry (MCR) that you can use; see the [Azure/AzureML-Containers](https://github.com/Azure/AzureML-Containers) GitHub repo for more information.

Azure ML then builds a conda environment from the dependencies in your .yml file on top of the base image.
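As a rough illustration, the conda specification described above might look like the file written by the sketch below. The channel, the unpinned package versions, the `python=3.7` pin (chosen to match the `python3.7` paths in the run logs), and the `azureml-defaults` entry are assumptions for illustration, not the tutorial's actual `environment.yml`.

```python
import tempfile
from pathlib import Path

# Hypothetical contents for environment.yml. Versions are left unpinned here;
# in practice, pin versions that match the CUDA version of your base image.
CONDA_SPEC = """\
name: pt-lightning
channels:
  - conda-forge
dependencies:
  - python=3.7
  - pip
  - pip:
      - azureml-defaults
      - torch
      - torchvision
      - pytorch-lightning
"""


def write_environment_file(path="environment.yml"):
    """Write the conda specification to disk and return its path."""
    target = Path(path)
    target.write_text(CONDA_SPEC, encoding="utf-8")
    return target


# Demonstration: write into a scratch directory rather than your project folder.
demo_path = write_environment_file(Path(tempfile.mkdtemp()) / "environment.yml")
print(demo_path.read_text(encoding="utf-8"))
```

Calling `write_environment_file()` with no arguments would place the file where the `environment_file` variable defined earlier expects it.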

In [4]:
from azureml.core import Environment

env = Environment.from_conda_specification(environment_name, environment_file)

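# specify a GPU base image
# note: `env.docker.enabled` is deprecated in recent SDK versions (see the warning
# printed below); the suggested replacement is a DockerConfiguration object from
# azureml.core.runconfig created with use_docker=True
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04"
)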

'enabled' is deprecated. Please use the azureml.core.runconfig.DockerConfiguration object with the 'use_docker' param instead.


Alternatively, you can capture all your dependencies directly in a custom Docker image or Dockerfile and create your environment from that. For more information, see [Train with custom image](https://docs.microsoft.com/azure/machine-learning/how-to-train-with-custom-image).

## Configure and run training job
Create a `ScriptRunConfig` to specify the training script and its arguments, the environment, and the cluster to run on.

For single-node, single-GPU training, pass `1` to the `--gpus` command-line argument expected by Lightning.
You do not need to define this flag manually in your training script, because Lightning can add it automatically. The training script parses the command-line arguments and passes them to the [`Trainer()`](https://pytorch-lightning.readthedocs.io/en/stable/trainer.html?highlight=Trainer).

Lightning handles all the NVIDIA flags for you; there's no need to set them yourself.
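To make the flow from `arguments` to the training script concrete, here is a minimal stand-in for the argument handling in `train.py`. It is a sketch, not the tutorial's script: the flags are declared by hand with `argparse` so that the example runs without Lightning installed, whereas a Lightning 1.x script would typically let `Trainer.add_argparse_args(parser)` register them and build the trainer with `Trainer.from_argparse_args(args)`. The function name `parse_args` and the defaults are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that ScriptRunConfig forwards to the training script."""
    parser = argparse.ArgumentParser(description="Stand-in for train.py argument handling")
    # Declared by hand for illustration; in a real Lightning script the
    # Trainer's own argparse helper would add these flags.
    parser.add_argument("--max_epochs", type=int, default=1)
    parser.add_argument("--gpus", type=int, default=0)
    parser.add_argument("--accelerator", type=str, default=None)
    return parser.parse_args(argv)


# Every item of `arguments` reaches the script as text on the command line,
# so the integers 25 and 1 arrive as the strings "25" and "1".
args = parse_args(["--max_epochs", "25", "--gpus", "1"])
trainer_kwargs = vars(args)  # in train.py these values would configure the Trainer
print(trainer_kwargs)
```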

In [7]:
from azureml.core import ScriptRunConfig, Experiment

src = ScriptRunConfig(
    source_directory=source_dir,
    script=script_name,
    arguments=["--max_epochs", 25, "--gpus", 1],  # single-node, single-GPU training
    compute_target=compute_name,
    environment=env,
)

run = Experiment(ws, experiment_name).submit(src)
run

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| pt-lightning-tutorial | pt-lightning-tutorial_1671166951_9fabed63 | azureml.scriptrun | Preparing | Link to Azure Machine Learning studio | Link to Documentation |


In [8]:
run.wait_for_completion(show_output=True)

RunId: pt-lightning-tutorial_1671166951_9fabed63
Web View: https://ml.azure.com/runs/pt-lightning-tutorial_1671166951_9fabed63?wsid=/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourcegroups/ziqitest/workspaces/ZiqiPipelineTest&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/20_image_build_log.txt

2022/12/16 05:02:37 Downloading source code...
2022/12/16 05:02:39 Finished downloading source code
2022/12/16 05:02:39 Creating Docker network: acb_default_network, driver: 'bridge'
2022/12/16 05:02:39 Successfully set up Docker network: acb_default_network
2022/12/16 05:02:39 Setting up Docker configuration...
2022/12/16 05:02:40 Successfully set up Docker configuration
2022/12/16 05:02:40 Logging in to registry: ziqitest.azurecr.io
2022/12/16 05:02:41 Successfully logged into ziqitest.azurecr.io
2022/12/16 05:02:41 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2022/12/16 05:02:41 Scanning for dependencies.

{'runId': 'pt-lightning-tutorial_1671166951_9fabed63',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2022-12-16T05:17:37.380317Z',
 'endTimeUtc': '2022-12-16T05:24:02.116676Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': 'fe46497b-7827-4dc4-bfc7-fef3c38f9463',
  'azureml.git.repository_uri': 'https://github.com/ZikeiWong/azureml-examples.git',
  'mlflow.source.git.repoURL': 'https://github.com/ZikeiWong/azureml-examples.git',
  'azureml.git.branch': 'ziqi/2022.11RampUp',
  'mlflow.source.git.branch': 'ziqi/2022.11RampUp',
  'azureml.git.commit': '7eedfdc455a1a67db6f89e032dda7fd95e0f7f3f',
  'mlflow.source.git.commit': '7eedfdc455a1a67db6f89e032dda7fd95e0f7f3f',
  'azureml.git.dirty': 'False',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  '

### Single-node multi-GPU training

Lightning supports several [distributed modes](https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#distributed-modes) for training. DistributedDataParallel (DDP) is recommended over DataParallel (DP) for training.

For multi-GPU training on a single node, specify the number of GPUs to train on and the distributed mode, in this case DistributedDataParallel (`"ddp"`). Lightning expects these as the `--gpus` and `--accelerator` arguments, respectively. The GPU count should not exceed the number of GPUs on a node of your cluster's VM SKU; if you request more GPUs than the node has, Lightning raises a `MisconfigurationException` and the run fails. The Lightning implementation of DDP manages starting the individual processes on each GPU under the hood. See the Lightning [Multi-GPU](https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html) training documentation for more information.
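One way to keep the `--gpus` value consistent with the cluster is to build the `arguments` list through a small helper that checks the request first. The helper below is hypothetical and not part of the Azure ML SDK or Lightning; `gpus_per_node` is a number you supply yourself from the VM size of your compute cluster.

```python
def lightning_arguments(max_epochs, gpus, gpus_per_node):
    """Build the ScriptRunConfig `arguments` list for a single-node run.

    Raises ValueError when the request cannot be satisfied by one node, so the
    mistake surfaces locally instead of after the job has been queued.
    """
    if gpus < 1:
        raise ValueError("request at least one GPU")
    if gpus > gpus_per_node:
        raise ValueError(
            f"requested {gpus} GPUs but each node only has {gpus_per_node}"
        )
    arguments = ["--max_epochs", max_epochs, "--gpus", gpus]
    if gpus > 1:
        # More than one GPU: ask Lightning for DistributedDataParallel.
        arguments += ["--accelerator", "ddp"]
    return arguments


print(lightning_arguments(max_epochs=25, gpus=2, gpus_per_node=2))
```

The returned list can be passed directly as the `arguments` parameter of `ScriptRunConfig`.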

In [12]:
from azureml.core import ScriptRunConfig, Experiment

src = ScriptRunConfig(
    source_directory=source_dir,
    script=script_name,
    arguments=["--max_epochs", 25, "--gpus", 2, "--accelerator", "ddp"],
    compute_target=compute_name,
    environment=env,
)

run = Experiment(ws, experiment_name).submit(src)
run

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| pt-lightning-tutorial | pt-lightning-tutorial_1671168920_dabd716e | azureml.scriptrun | Queued | Link to Azure Machine Learning studio | Link to Documentation |


You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [13]:
from azureml.widgets import RunDetails

RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

Performing interactive authentication. Please follow the instructions on the terminal.


In [14]:
run.wait_for_completion(show_output=True)

RunId: pt-lightning-tutorial_1671168920_dabd716e
Web View: https://ml.azure.com/runs/pt-lightning-tutorial_1671168920_dabd716e?wsid=/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourcegroups/ziqitest/workspaces/ZiqiPipelineTest&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming user_logs/std_log.txt

Global seed set to 1234
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to MNIST/raw/train-images-idx3-ubyte.gz

  0%|          | 0/9912422 [00:00<?, ?it/s]
 53%|█████▎    | 5239808/9912422 [00:00<00:00, 51424758.59it/s]
9913344it [00:00, 51836587.67it/s]                             
Extracting MNIST/raw/train-images-idx3-ubyte.gz to MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to MNIST/raw/train-labels-idx1-ubyte.gz

  0%|          | 0/28881 [00:00<?, ?it/s]
29696it [00:00, 108

ActivityFailedException: ActivityFailedException:
	Message: Activity Failed:
{
    "error": {
        "code": "UserError",
        "message": "{\n  \"code\": \"ExecutionFailed\",\n  \"category\": \"UserError\",\n  \"message\": {\n    \"NonCompliant\": \"Process '/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/bin/python' exited with code 1 and error message 'Execution failed. Process exited with status code 1. Error:     trainer = pl.Trainer.from_argparse_args(args)\\n  File \\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/trainer/properties.py\\\", line 148, in from_argparse_args\\n    return argparse_utils.from_argparse_args(cls, args, **kwargs)\\n  File \\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/utilities/argparse_utils.py\\\", line 50, in from_argparse_args\\n    return cls(**trainer_kwargs)\\n  File \\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py\\\", line 41, in overwrite_by_env_vars\\n    return fn(self, **kwargs)\\n  File \\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py\\\", line 360, in __init__\\n    deterministic,\\n  File \\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py\\\", line 104, in on_trainer_init\\n    self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)\\n  File \\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py\\\", line 78, in parse_gpu_ids\\n    gpus = _sanitize_gpu_ids(gpus)\\n  File \\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py\\\", line 142, in _sanitize_gpu_ids\\n    \\\"\\\"\\\")\\npytorch_lightning.utilities.exceptions.MisconfigurationException: \\n                You 
requested GPUs: [0, 1]\\n                But your machine only has: [0]\\n            \\n\\n'. Please check the log file 'user_logs/std_log.txt' for more details.\"\n  },\n  \"details\": [\n    {\n      \"name\": \"exit_codes\",\n      \"value\": {\n        \"Literal\": {\n          \"Compliant\": \"1\"\n        }\n      }\n    }\n  ],\n  \"error\": null,\n  \"node_info\": null\n}",
        "messageParameters": {},
        "details": []
    },
    "time": "0001-01-01T00:00:00.000Z"
}
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Activity Failed:\n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"{\\n  \\\"code\\\": \\\"ExecutionFailed\\\",\\n  \\\"category\\\": \\\"UserError\\\",\\n  \\\"message\\\": {\\n    \\\"NonCompliant\\\": \\\"Process '/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/bin/python' exited with code 1 and error message 'Execution failed. Process exited with status code 1. Error:     trainer = pl.Trainer.from_argparse_args(args)\\\\n  File \\\\\\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/trainer/properties.py\\\\\\\", line 148, in from_argparse_args\\\\n    return argparse_utils.from_argparse_args(cls, args, **kwargs)\\\\n  File \\\\\\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/utilities/argparse_utils.py\\\\\\\", line 50, in from_argparse_args\\\\n    return cls(**trainer_kwargs)\\\\n  File \\\\\\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py\\\\\\\", line 41, in overwrite_by_env_vars\\\\n    return fn(self, **kwargs)\\\\n  File \\\\\\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py\\\\\\\", line 360, in __init__\\\\n    deterministic,\\\\n  File \\\\\\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py\\\\\\\", line 104, in on_trainer_init\\\\n    self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)\\\\n  File \\\\\\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py\\\\\\\", line 78, in parse_gpu_ids\\\\n    gpus = _sanitize_gpu_ids(gpus)\\\\n  File 
\\\\\\\"/azureml-envs/azureml_06fef1d6564a13118c5eea3527d66aa8/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py\\\\\\\", line 142, in _sanitize_gpu_ids\\\\n    \\\\\\\"\\\\\\\"\\\\\\\")\\\\npytorch_lightning.utilities.exceptions.MisconfigurationException: \\\\n                You requested GPUs: [0, 1]\\\\n                But your machine only has: [0]\\\\n            \\\\n\\\\n'. Please check the log file 'user_logs/std_log.txt' for more details.\\\"\\n  },\\n  \\\"details\\\": [\\n    {\\n      \\\"name\\\": \\\"exit_codes\\\",\\n      \\\"value\\\": {\\n        \\\"Literal\\\": {\\n          \\\"Compliant\\\": \\\"1\\\"\\n        }\\n      }\\n    }\\n  ],\\n  \\\"error\\\": null,\\n  \\\"node_info\\\": null\\n}\",\n        \"messageParameters\": {},\n        \"details\": []\n    },\n    \"time\": \"0001-01-01T00:00:00.000Z\"\n}"
    }
}
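The failure above is easier to read once the nested JSON is unwrapped: Lightning raised a `MisconfigurationException` because GPUs `[0, 1]` were requested while the machine only has `[0]`. The sketch below shows one way to dig that message out of such a payload. It assumes you have the JSON error text as a string; the helper names and the shortened sample payload are hypothetical, and the sample keeps only the fields needed to mirror the nesting seen in the output.

```python
import json
import re


def innermost_message(payload):
    """Follow nested 'error'/'message' fields, decoding JSON strings on the way."""
    node = payload
    while True:
        if isinstance(node, str):
            try:
                node = json.loads(node)
            except json.JSONDecodeError:
                return node  # plain text: nothing left to unwrap
        elif isinstance(node, dict):
            for key in ("error", "message", "NonCompliant"):
                if node.get(key) is not None:
                    node = node[key]
                    break
            else:
                return json.dumps(node)  # no known field to descend into
        else:
            return str(node)


def gpu_mismatch(text):
    """Return (requested, available) GPU id lists from Lightning's message, if present."""
    requested = re.search(r"requested GPUs: (\[[^\]]*\])", text)
    available = re.search(r"only has: (\[[^\]]*\])", text)
    if requested and available:
        return json.loads(requested.group(1)), json.loads(available.group(1))
    return None


# Shortened sample with the same nesting as the error output above.
sample_text = (
    "Process exited with status code 1. Error: MisconfigurationException: \n"
    " You requested GPUs: [0, 1]\n But your machine only has: [0]\n"
)
inner = json.dumps(
    {"code": "ExecutionFailed", "message": {"NonCompliant": sample_text}, "error": None}
)
payload = json.dumps({"error": {"code": "UserError", "message": inner}})

message = innermost_message(payload)
print(gpu_mismatch(message))
```

For this run, the fix is either to request a single GPU or to use a cluster whose VM size provides at least two GPUs per node.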