# Create and test an AzureML environment supporting DeepSpeed training

__Goal__: Set up a DeepSpeed environment in Azure ML

This notebook shows how to build an AzureML environment that supports [DeepSpeed] training. At the end of it you should have an environment, backed by a [Docker] image, that can train PyTorch and HuggingFace transformer models. We build our environment on a remote machine by default but also describe the process of building it locally for debugging purposes. 

## Registering and building an environment on AzureML

The [Azure Container for PyTorch (ACPT) Docker image](https://learn.microsoft.com/en-us/azure/machine-learning/resource-azure-container-for-pytorch?view=azureml-api-2) is the recommended starting point for running DeepSpeed training and inference jobs. The image includes DeepSpeed 0.9 and needs only small modifications for this tutorial.

### Composition of the environment
To follow this tutorial, I recommend creating a new environment based on the existing curated ACPT environment. I've named mine `deepspeed-transformers-dataset`. Afterwards, modify the contents of the dockerfile and requirements.txt as follows:

### dockerfile

```dockerfile
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu117-py38-torch1131:biweekly.202311.2


COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir


COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888

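# Refresh the package index and install SSH in a single layer so the index cannot go stale
RUN apt-get update && \
    apt-get install -y openssh-server openssh-client
# Print the DeepSpeed environment report during the build
RUN ds_report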
```
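The final `RUN ds_report` step prints DeepSpeed's environment report while the image builds. As a complementary check from Python, the minimal sketch below lists which modules cannot be found in the running interpreter. The `REQUIRED_MODULES` list and the `missing_modules` helper are illustrative names, not part of DeepSpeed or Azure ML, and the list assumes the packages this tutorial installs.

```python
import importlib.util

# Top-level import names this tutorial relies on (an assumption; adjust to your image).
REQUIRED_MODULES = ["torch", "deepspeed", "transformers", "datasets", "accelerate"]


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running `missing_modules(REQUIRED_MODULES)` inside the built container should return an empty list. Any name it returns points to a package that the image build did not install.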

### requirements.txt

```
azureml-core==1.54.0
azureml-dataset-runtime==1.54.0
azureml-defaults==1.54.0
azure-ml==0.0.1
azure-ml-component==0.9.18.post2
azureml-mlflow==1.54.0
azureml-contrib-services==1.54.0
azureml-automl-common-tools==1.54.0
torch-tb-profiler~=0.4.0
azureml-inference-server-http~=0.8.0
inference-schema~=1.5.0
MarkupSafe==2.1.2
regex
pybind11
urllib3>=1.26.18
cryptography>=41.0.4
aiohttp>=3.8.5
transformers
datasets
scikit-learn
transformers[torch]
accelerate
```
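With both files saved in one folder, the environment can also be registered from code instead of the studio UI. The following is a minimal sketch, assuming the Azure ML Python SDK v2 (`azure-ai-ml`) and `azure-identity` are installed, and that the Docker file is saved under the name `Dockerfile`. The helper names `check_build_context` and `register_environment`, and the folder layout, are hypothetical. The workspace identifiers are values you must supply yourself.

```python
from pathlib import Path


def check_build_context(context_dir):
    """Return the context path if it contains both files the image build needs."""
    context = Path(context_dir)
    missing = [
        name
        for name in ("Dockerfile", "requirements.txt")
        if not (context / name).is_file()
    ]
    if missing:
        raise FileNotFoundError(
            f"build context {context} is missing: {', '.join(missing)}"
        )
    return context


def register_environment(context_dir, subscription_id, resource_group, workspace):
    """Register the build context as an Azure ML environment (sketch, SDK v2)."""
    # Imported lazily so check_build_context works without the Azure SDK installed.
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import BuildContext, Environment
    from azure.identity import DefaultAzureCredential

    context = check_build_context(context_dir)
    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace,
    )
    env = Environment(
        name="deepspeed-transformers-dataset",
        description="ACPT-based image with DeepSpeed, transformers and datasets",
        build=BuildContext(path=str(context), dockerfile_path="Dockerfile"),
    )
    # Submitting the environment triggers a remote image build in the workspace.
    return ml_client.environments.create_or_update(env)
```

Checking the folder first keeps a missing file from surfacing later as a failed remote build. `register_environment` itself needs valid Azure credentials and a reachable workspace.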