Describe the bug
I create a model and run model.deploy(initial_instance_count=1, instance_type=instance_type) as per the standard docs. It completes fine, but when I then run a predict call I get a timeout error, even though I expect predictions to take only a few seconds. Exactly the same code works well when run locally! Stranger still, I don't see any logs related to the failing predict, but there are some hard-to-debug errors coming from the deploy step, which itself reported no errors:
2021-05-19 18:54:41,249 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2021-05-19 18:54:50,760 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
--- but then
2021-05-19 18:54:52,112 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
2021-05-19 18:54:52,112 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 182, in <module>
2021-05-19 18:54:56,578 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
These messages repeat several times in the logs, so it seems the model deployment actually fails, but nothing is reported back; the failure only surfaces when the predict call times out.
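For context, endpoint container output like the worker tracebacks above only shows up in the endpoint's CloudWatch log group, not in the notebook where deploy() ran. A minimal boto3 sketch for pulling those events (the endpoint name is a placeholder; substitute your own):

import boto3

logs = boto3.client('logs')

# SageMaker writes endpoint container output to this log group by convention.
endpoint_name = 'my-endpoint'  # placeholder: use the real endpoint name
group = '/aws/sagemaker/Endpoints/{}'.format(endpoint_name)

# Print recent events so the TorchServe worker traceback is visible.
for event in logs.filter_log_events(logGroupName=group, limit=100)['events']:
    print(event['message'])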
To reproduce
Use sagemaker==2.39.1:
TORCH_VERSION = '1.7.1'
s3_uri = 's3://...model.tar.gz'

model = PyTorchModel(
    model_data=s3_uri,
    role=role,
    py_version='py3',
    framework_version=TORCH_VERSION,
    entry_point='serve.py',
    source_dir='s3://..../sourcedir.tar.gz',
)
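Then deploy and invoke (a sketch: instance_type, the CSV serializer, and the sample payload are assumptions based on the text/csv handler in serve.py below):

from sagemaker.serializers import CSVSerializer

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,  # e.g. 'ml.m5.xlarge'
)
predictor.serializer = CSVSerializer()

# This call times out, even though deploy() completed without errors.
predictor.predict('some sample text')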
Here's serve.py:
"""
This is the SageMaker inference entry script.
"""
import json

import numpy as np
import torch
from transformers import AutoTokenizer

from forms_sorter.modules.data_module import LabelMapper
from forms_sorter.modules.model import FormsSorterModel

CSV_CONTENT_TYPE = 'text/csv'
JSON_CONTENT_TYPE = 'text/json'

# Softmax over the class dimension; passing dim explicitly avoids the
# implicit-dimension deprecation warning.
softmax = torch.nn.Softmax(dim=1)

def model_fn(model_dir):
    """
    For serving with SageMaker.
    SageMaker deployment presumes that all required files are provided
    within model_dir. Here we require classes.txt for index-to-label
    mapping and a checkpoint named last.ckpt.
    model_fn is a reserved function name that the serving stack looks up.
    """
    device = get_device()
    model_name = 'bert-base-cased'
    preprocessor = AutoTokenizer.from_pretrained(model_name)
    label_mapper = LabelMapper('classes.txt')
    model = FormsSorterModel.load_from_checkpoint('last.ckpt')
    model = model.to(device)
    model.eval()
    return preprocessor, model, label_mapper

def predict(
    input,
    checkpoint_file='last.ckpt',
    model_name='bert-base-cased',
    labels='classes.txt'
):
    """
    For model serving outside SageMaker.
    """
    device = get_device()
    preprocessor = AutoTokenizer.from_pretrained(model_name)
    label_mapper = LabelMapper(labels)
    model = FormsSorterModel.load_from_checkpoint(checkpoint_file)
    model = model.to(device)
    model_artifacts = (preprocessor, model, label_mapper)
    results = predict_fn(input, model_artifacts)
    return results

def get_device():
    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    return device

def input_fn(input, content_type):
    if content_type == CSV_CONTENT_TYPE:
        records = input.split('\n')
        return records
    else:
        raise ValueError(
            'Content type {} not supported. The supported type is {}'.format(
                content_type, CSV_CONTENT_TYPE
            )
        )

def preprocess(input, preprocessor):
    # Tokenize each record and stack the input-id tensors into a single batch.
    r = []
    for i in input:
        x = preprocessor(i, padding='max_length', truncation=True)
        x = np.array(x['input_ids'])
        r.append(torch.tensor(x).unsqueeze(dim=0))
    result = torch.cat(r)
    return result

def predict_fn(input, model_artifacts):
    preprocessor, model, label_mapper = model_artifacts
    # Pre-process
    input_tensor = preprocess(input, preprocessor)
    # Copy input to GPU if available
    device = get_device()
    input_tensor = input_tensor.to(device=device)
    # Invoke
    with torch.no_grad():
        output_tensor = model(input_tensor)
    # Convert logits to probabilities with the module-level softmax
    output_tensor = softmax(output_tensor.logits)
    probs, predictions = torch.max(output_tensor, dim=1)
    classes = label_mapper.reverse_map(predictions)
    return classes, probs

def output_fn(output, accept=JSON_CONTENT_TYPE):
    if accept == JSON_CONTENT_TYPE:
        classes, probs = output
        # Tensors are not JSON-serializable, so convert probabilities to a list.
        prediction = json.dumps({'classes': classes, 'probabilities': probs.tolist()})
        return prediction, accept
    else:
        raise ValueError(
            'Content type {} not supported. The only type supported is {}'.format(
                accept, JSON_CONTENT_TYPE
            )
        )
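For what it's worth, the handler chain can be exercised locally before deploying (model_fn -> input_fn -> predict_fn -> output_fn). A minimal sketch, assuming last.ckpt and classes.txt sit in the current directory and forms_sorter is importable:

import serve

artifacts = serve.model_fn('.')
records = serve.input_fn('first document text\nsecond document text', serve.CSV_CONTENT_TYPE)
classes, probs = serve.predict_fn(records, artifacts)
body, content_type = serve.output_fn((classes, probs), serve.JSON_CONTENT_TYPE)
print(content_type, body)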
Expected behavior
The model deployment should fail with a clear, surfaced error instead of silently producing an endpoint whose workers keep dying.
System information
- SageMaker Python SDK version: 2.39.1
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 1.7.1
- Python version: 3.7
- CPU or GPU: Both
- Custom Docker image (Y/N): N