Pytorch model deploy gives strange TorchServe errors #2357

@sivakhno

Description

Describe the bug
I create a model and run model.deploy(initial_instance_count=1, instance_type=instance_type) as per the standard docs. It completes fine, but when I try to run the predict command I get a timeout error, even though I expect predictions to take only a few seconds. Exactly the same code works well when run locally! Stranger still, I don't see any logs related to the failing predict; instead there are some hard-to-debug errors coming from the deploy step, which itself didn't report any errors.

2021-05-19 18:54:41,249 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2021-05-19 18:54:50,760 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
--- but then 
2021-05-19 18:54:52,112 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
2021-05-19 18:54:52,112 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 182, in <module>
2021-05-19 18:54:56,578 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.

This repeats several times in the logs, so it seems the actual model deploy fails, but I never get a report of it; the failure only surfaces when predict times out.
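
For reference, the full worker traceback (truncated in the snippet above) can be pulled from the endpoint's CloudWatch log group. A minimal sketch with boto3 (the endpoint name is a placeholder; use the one returned by model.deploy()):

import boto3

ENDPOINT_NAME = 'pytorch-inference-...'  # placeholder endpoint name
logs = boto3.client('logs')
group = '/aws/sagemaker/Endpoints/' + ENDPOINT_NAME

# Dump every event from every log stream of the endpoint's container.
for stream in logs.describe_log_streams(logGroupName=group)['logStreams']:
    events = logs.get_log_events(
        logGroupName=group,
        logStreamName=stream['logStreamName'],
        startFromHead=True,
    )
    for event in events['events']:
        print(event['message'])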

To reproduce
Use sagemaker-2.39.1
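
The snippet below assumes the standard imports and the usual role lookup (a sketch; get_execution_role() is the standard call from a SageMaker notebook):

import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()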

TORCH_VERSION='1.7.1'
s3_uri='s3://...model.tar.gz'
model = PyTorchModel(
             model_data=s3_uri,
             role=role,
             py_version='py3',
             framework_version=TORCH_VERSION,
             entry_point='serve.py',
             source_dir='s3://..../sourcedir.tar.gz',
)
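
Then deploy and predict as per the docs (a sketch; the instance type and sample payload are placeholders matching the text/csv handling in serve.py):

from sagemaker.serializers import CSVSerializer

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',  # placeholder instance type
    serializer=CSVSerializer(),    # sends text/csv, as input_fn expects
)
# Newline-separated records; this is the call that times out.
predictor.predict('first document\nsecond document')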

Here's serve.py.

"""
This is the SageMaker inference entry script.
"""
import json
import os

import torch
import numpy as np
from transformers import AutoTokenizer
from forms_sorter.modules.data_module import LabelMapper
from forms_sorter.modules.model import FormsSorterModel

CSV_CONTENT_TYPE = 'text/csv'
JSON_CONTENT_TYPE = 'text/json'
# Explicit dim: relying on the implicit Softmax dimension is deprecated.
softmax = torch.nn.Softmax(dim=1)


def model_fn(model_dir):
    """
    For serving with SageMaker.
    SageMaker deployment presumes that all required files are
    provided within model_dir.
    Here we require classes.txt for the index-to-label mapping and
    a checkpoint named last.ckpt.
    model_fn is a reserved name that SageMaker calls to load the model.
    """
    device = get_device()
    model_name = 'bert-base-cased'
    preprocessor = AutoTokenizer.from_pretrained(model_name)
    # Resolve the artifacts relative to model_dir; bare relative paths
    # are not guaranteed to resolve inside the serving container.
    label_mapper = LabelMapper(os.path.join(model_dir, 'classes.txt'))
    model = FormsSorterModel.load_from_checkpoint(
        os.path.join(model_dir, 'last.ckpt')
    )
    model = model.to(device)
    model.eval()
    return preprocessor, model, label_mapper


def predict(
    input,
    checkpoint_file='last.ckpt',
    model_name='bert-base-cased',
    labels='classes.txt'
    ):
    """
    For model serving outside SageMaker.
    """
    device = get_device()
    preprocessor = AutoTokenizer.from_pretrained(model_name)
    label_mapper = LabelMapper(labels)
    model = FormsSorterModel.load_from_checkpoint(checkpoint_file)
    model = model.to(device)
    model.eval()  # match model_fn: run in eval mode for inference
    model_artifacts = (preprocessor, model, label_mapper)
    results = predict_fn(input, model_artifacts)
    return results


def get_device():
    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    return device


def input_fn(input, content_type):
    if content_type == CSV_CONTENT_TYPE:
        records = input.split('\n')
        return records
    else:
        raise ValueError(
            'Content type {} not supported. The supported type is {}'.format(
                content_type, CSV_CONTENT_TYPE
            )
        )


def preprocess(input, preprocessor):
    r = []
    for i in input:
        x = preprocessor(i, padding='max_length', truncation=True)
        x = np.array(x['input_ids'])
        r.append(torch.tensor(x).unsqueeze(dim=0))
    result = torch.cat(r)
    return result


def predict_fn(input, model_artifacts):
    preprocessor, model, label_mapper = model_artifacts
    # Pre-process
    input_tensor = preprocess(input, preprocessor)
    # Copy input to gpu if available
    device = get_device()
    input_tensor = input_tensor.to(device=device)
    # Invoke
    with torch.no_grad():
        output_tensor = model(input_tensor)
        # Convert logits to class probabilities (module-level softmax, dim=1)
        output_tensor = softmax(output_tensor.logits)
        probs, predictions = torch.max(output_tensor, dim=1)
        classes = label_mapper.reverse_map(predictions)

    # Return a plain Python list for probs so output_fn can json.dumps it
    return classes, probs.tolist()


def output_fn(output, accept=JSON_CONTENT_TYPE):
    if accept == JSON_CONTENT_TYPE:
        prediction = json.dumps(output)
        return prediction, accept
    else:
        raise ValueError(
            'Accept type {} not supported. The only supported type is {}'.format(
                accept, JSON_CONTENT_TYPE
            )
        )
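
To rule out the handler logic itself, the whole chain can be smoke-tested locally (a sketch; './model' and the sample records are placeholders for an extracted model.tar.gz and real input):

# Local smoke test of the SageMaker handler chain, outside the container.
artifacts = model_fn('./model')  # dir containing last.ckpt and classes.txt
records = input_fn('first document\nsecond document', 'text/csv')
classes, probs = predict_fn(records, artifacts)
body, accept = output_fn((classes, probs), 'text/json')
print(accept, body)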

Expected behavior
Model deploy fails with a clear error instead of the endpoint silently timing out on predict.

System information

  • SageMaker Python SDK version: 2.39.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.7.1
  • Python version: 3.7
  • CPU or GPU: Both
  • Custom Docker image (Y/N): N
