# Deploying MusicGen on SageMaker Hosting

This notebook deploys the MusicGen model from Facebook to a SageMaker endpoint using the Large Model Inference (LMI) container.

For faster startup, the model artifacts are uploaded to S3 ahead of time, so the endpoint loads them from S3 instead of downloading them from the HuggingFace Hub when it starts.

The endpoint needs a custom `requirements.txt` that installs a newer version of HuggingFace Transformers along with other dependencies such as Protobuf. The inference code uses the `MusicgenForConditionalGeneration` class, which is available in Transformers v4.31 and later; at the time of writing, the latest release on PyPI was v4.30, so Transformers is installed from GitHub. The installation instructions were taken from the Audiocraft repository.
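
The version requirement can be checked before deploying. The following is a minimal sketch, not part of the original notebook: `parse_version` and `has_musicgen_support` are hypothetical helper names, and the parser only looks at the leading numeric parts of the version string.

```python
import importlib.metadata


def parse_version(version: str) -> tuple:
    """Turn a string such as '4.31.0.dev0' into (4, 31, 0).

    Parsing stops at the first non-numeric component.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_musicgen_support(installed: str, minimum: str = "4.31") -> bool:
    """True if the installed Transformers version is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = importlib.metadata.version("transformers")
    print(installed, has_musicgen_support(installed))
except importlib.metadata.PackageNotFoundError:
    print("transformers is not installed in this environment")
```

This only inspects the local environment; the endpoint installs its own copy of Transformers from `requirements.txt`.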

Related resources:
- [Audiocraft](https://github.com/facebookresearch/audiocraft)
- Model card for [musicgen-small](https://huggingface.co/facebook/musicgen-small)

In [2]:
!pip install huggingface_hub --quiet

In [3]:
import sagemaker
from sagemaker.djl_inference.model import DJLModel

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

model_bucket = sess.default_bucket()  # bucket that holds the artifacts
s3_code_prefix = "hf-large-model-djl-/musicgen"  # prefix within the bucket for the code artifact
s3_model_prefix = "hf-large-model-djl-/musicgen"  # prefix within the bucket for the model artifact
region = sess.boto_region_name  # public attribute, instead of the private _region_name
account_id = sess.account_id()

In [4]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os

# Download the model into a local folder next to the running Jupyter notebook
local_model_path = Path("./musicgen")
local_model_path.mkdir(exist_ok=True)
model_name = "facebook/musicgen-small"
# Download only the PyTorch checkpoints plus the config and tokenizer files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]

# Use snapshot_download because the model files are stored in the repository with Git LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)


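The `allow_patterns` argument restricts the download to files whose names match one of the glob patterns. The sketch below approximates that filtering with the standard-library `fnmatch` module; it is an illustration only, `is_allowed` is a hypothetical helper, the file names are example inputs, and the matching inside `huggingface_hub` may differ in edge cases.

```python
from fnmatch import fnmatch

ALLOW_PATTERNS = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]


def is_allowed(filename: str, patterns=ALLOW_PATTERNS) -> bool:
    """Return True if the file name matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Example file names (hypothetical listing, for illustration only)
for name in ["config.json", "pytorch_model.bin", "spiece.model", "README.md"]:
    print(f"{name}: {'download' if is_allowed(name) else 'skip'}")
```
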
In [5]:
model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"We will set option.s3url={model_artifact}")

Model uploaded to --- > s3://sagemaker-us-east-1-485579262988/hf-large-model-djl-/musicgen
We will set option.s3url=s3://sagemaker-us-east-1-485579262988/hf-large-model-djl-/musicgen


In [6]:
!rm -rf musicgen

In [7]:
!mkdir -p code/musicgen

The `requirements.txt` below installs Transformers and Accelerate from GitHub, together with the other required packages.

In [8]:
%%writefile code/musicgen/requirements.txt
git+https://github.com/huggingface/transformers.git
git+https://github.com/huggingface/accelerate.git
google
protobuf
torch>=2.0.0
torchaudio>=2.0.0

Overwriting code/musicgen/requirements.txt


This is the `inference.py` file. It uses DeepSpeed for inference; in testing, this performed much better than using HuggingFace Transformers alone.

The response has the following structure:
```
{
    'outputs': {
        'audio': nested Python list with the model output, one entry per prompt
        'sampling_rate': audio sampling rate of the output
    }
}
```

An entry of `audio` can be passed, together with `sampling_rate`, to an IPython `Audio` element for playback in a notebook.
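
Outside a notebook, the same response can be written to a WAV file. The following is a minimal sketch using only the standard library, not part of the original notebook: `response_to_wav_bytes` is a hypothetical helper, and it assumes each entry of `audio` is laid out as `[channels][samples]` with float values in the range -1.0 to 1.0.

```python
import io
import struct
import wave


def response_to_wav_bytes(response: dict, item: int = 0) -> bytes:
    """Encode one generated clip as 16-bit PCM WAV bytes."""
    outputs = response["outputs"]
    channels = outputs["audio"][item]  # assumed layout: [channels][samples]
    rate = int(outputs["sampling_rate"])

    frames = bytearray()
    for i in range(len(channels[0])):
        for channel in channels:  # interleave the channels sample by sample
            clipped = max(-1.0, min(1.0, channel[i]))
            frames += struct.pack("<h", int(clipped * 32767))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(len(channels))
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()
```

For example, `open("clip.wav", "wb").write(response_to_wav_bytes(out))` would save the first clip of a response held in `out`.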

In [29]:
%%writefile code/musicgen/inference.py
import logging
import os
import torch
import deepspeed

from transformers import AutoProcessor, MusicgenForConditionalGeneration, AutoConfig
from djl_python.inputs import Input
from djl_python.outputs import Output
from typing import Optional



def get_torch_dtype_from_str(dtype: str):
    if dtype == "auto":
        return dtype
    if dtype == "fp32":
        return torch.float32
    if dtype == "fp16":
        return torch.float16
    if dtype == "bf16":
        return torch.bfloat16
    if dtype == "int8":
        return torch.int8
    if dtype is None:
        return None
    raise ValueError(f"Invalid data type: {dtype}")

class MusicGenService(object):
    def __init__(self):
        self.model = None
        self.initialized = False
        self.processor = None
        self.device = None
    
    def initialize(self, properties):
        model_id_or_path = properties.get("model_id") or properties.get(
            "model_dir")
        device_id = int(properties.get("device_id", "-1"))
        self.device = int(os.getenv("LOCAL_RANK", 0))
        task = properties.get("task")
        tp_degree = int(properties.get("tensor_parallel_degree", 1))

        ds_config = {
            "tensor_parallel": {
                "tp_size": tp_degree
            },
            "enable_cuda_graph": properties.get("enable_cuda_graph", "false").lower() == "true",
            "triangular_masking": properties.get("triangular_masking", "true").lower() == "true",
            "return_tuple": properties.get("return_tuple", "true").lower() == "true",
            "training_mp_size": int(properties.get("training_mp_size", 1)),
            "save_mp_checkpoint_path": properties.get("save_mp_checkpoint_path"),
            "replace_with_kernel_inject": False,
            'dtype': 'fp32'
        }
        if "checkpoint" in properties:
            # Resolve the checkpoint relative to the model location
            ds_config["checkpoint"] = os.path.join(
                model_id_or_path, properties.get("checkpoint"))
            ds_config["base_dir"] = model_id_or_path
            if properties.get("dtype") is None:
                raise ValueError(
                    "dtype should also be provided for checkpoint loading")
            
        self.config = AutoConfig.from_pretrained(model_id_or_path)
            
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = MusicgenForConditionalGeneration.from_pretrained(model_id_or_path)

        self.model = deepspeed.init_inference(self.model, config=ds_config)
        
        self.initialized = True
        
    def inference(self, inputs):
        # Fall back to an empty string so startswith() below cannot fail on None
        content_type = inputs.get_property('Content-Type') or ""
        accept = inputs.get_property('Accept')
        if not accept:
            accept = content_type if content_type.startswith(
                "tensor/") else "application/json"
        elif "*/*" in accept:
            accept = "application/json"
            
        data = inputs.get_as_json()
        
        input_data = data['inputs']

        # Generation parameters are optional; default to none
        params = data.get('parameters', {})

        logging.info(params)
        
        processed_inputs = self.processor(text=input_data, return_tensors='pt', padding=True).to(self.device)
        
        logging.info("Processed inputs")
        audio_values = self.model.generate(**processed_inputs, **params)
        
        logging.info("Processed values")
        
        out = audio_values.cpu().detach().numpy().tolist()
        
        return Output().add_as_json({
            'outputs': {
                'audio': out,
                'sampling_rate': self.model.config.audio_encoder.sampling_rate,
            }
        })
    
_service = MusicGenService()

def handle(inputs: Input) -> Optional[Output]:
    if not _service.initialized:
        _service.initialize(inputs.get_properties())

    if inputs.is_empty():
        return None

    return _service.inference(inputs)

Overwriting code/musicgen/inference.py


In [34]:
model = DJLModel(
    model_artifact,
    role=role,
    source_dir='code/musicgen',
    entry_point='inference.py',
    number_of_partitions=1
)

In [35]:
predictor = model.deploy(instance_type='ml.g5.8xlarge', initial_instance_count=1)

In [36]:
%%time
out = predictor.predict({
    'inputs': ["Country music to listen to on my dog walk"],
    'parameters': {'max_new_tokens': 1024},
})

CPU times: user 292 ms, sys: 39 ms, total: 331 ms
Wall time: 22.2 s
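
The `max_new_tokens` parameter determines the length of the generated clip. Below is a minimal sketch of the arithmetic, assuming the 50 Hz frame rate that the HuggingFace documentation gives for MusicGen's audio codec (check `config.audio_encoder.frame_rate` for the checkpoint actually deployed). `tokens_for_seconds`, `seconds_for_tokens`, and `build_payload` are hypothetical helper names, not part of any SDK.

```python
ASSUMED_FRAME_RATE_HZ = 50  # assumption: tokens generated per second of audio


def tokens_for_seconds(seconds: float, frame_rate: int = ASSUMED_FRAME_RATE_HZ) -> int:
    """Number of new tokens needed for a clip of the given length."""
    return round(seconds * frame_rate)


def seconds_for_tokens(tokens: int, frame_rate: int = ASSUMED_FRAME_RATE_HZ) -> float:
    """Approximate clip length produced by the given number of new tokens."""
    return tokens / frame_rate


def build_payload(prompts, seconds: float) -> dict:
    """Build a request body in the format expected by inference.py."""
    return {
        "inputs": list(prompts),
        "parameters": {"max_new_tokens": tokens_for_seconds(seconds)},
    }


payload = build_payload(["Country music to listen to on my dog walk"], seconds=10)
print(payload["parameters"])     # {'max_new_tokens': 500}
print(seconds_for_tokens(1024))  # 20.48
```

Under that assumption, a request for 1024 new tokens corresponds to roughly 20 seconds of audio.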


The generated audio is displayed below and can be played from within the notebook.

In [37]:
from IPython.display import Audio

# sampling_rate = model.config.audio_encoder.sampling_rate
Audio(out['outputs']['audio'][0], rate=out['outputs']['sampling_rate'])

Clean up the endpoint.

In [38]:
predictor.delete_endpoint()
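
Deleting the endpoint leaves the SageMaker model resource in place. As a hedged sketch, the helper below (a hypothetical name, not part of the SageMaker SDK) removes both through the predictor's `delete_model()` and `delete_endpoint()` methods. It calls `delete_model()` first on the assumption that the predictor looks the model up through the endpoint, which must still exist at that point.

```python
def clean_up(predictor) -> None:
    """Delete the model and then the endpoint behind a predictor."""
    # Model first: the lookup is assumed to go through the live endpoint
    predictor.delete_model()
    # Then the endpoint itself
    predictor.delete_endpoint()
```

Calling `clean_up(predictor)` in place of the cell above would remove both resources.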