<a href="https://colab.research.google.com/github/blurred421/LFD473-code/blob/main/notebooks/Chapter11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 11: Serving Models with TorchServe

In [None]:
!pip install torch-model-archiver torchserve captum pyngrok

## 11.2 Learning Objectives

By the end of this chapter, you should be able to:
- understand, build, and assemble the necessary components into a model archive
- serve a trained model locally using TorchServe

## 11.3 Archiving and Serving Models

In [1]:
!wget https://github.com/dvgodoy/assets/releases/download/model/fomo_model.pth

--2025-03-12 15:11:16--  https://github.com/dvgodoy/assets/releases/download/model/fomo_model.pth
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/662216076/ef1f7d06-df52-4aee-b442-caf89a66872c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250312%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250312T151116Z&X-Amz-Expires=300&X-Amz-Signature=202b4a365753c4aa54fa4bb24fb645acd6d578fa40843028c363e8af816e279d&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dfomo_model.pth&response-content-type=application%2Foctet-stream [following]
--2025-03-12 15:11:16--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/662216076/ef1f7d06-df52-4aee-b442-caf89a66872c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential

In [2]:
import torch
import torch.nn as nn

repo = 'pytorch/vision:v0.15.2'
model = torch.hub.load(repo, 'resnet18', weights=None)
model.fc = nn.Linear(512, 4)

state = torch.load('fomo_model.pth', map_location='cpu')
model.load_state_dict(state)

Downloading: "https://github.com/pytorch/vision/zipball/v0.15.2" to /root/.cache/torch/hub/v0.15.2.zip


<All keys matched successfully>

### 11.3.1 Model Archiver

Let's start with the model archive (`.mar`) file, a collection of files and folders zipped together that contains:
- a `MAR-INF` folder with a `MANIFEST.json` file inside that describes the contents of the model archive itself, such as model and archiver versions, and the files that make up the archive
- a serialized file containing the model's weights/state (`--serialized-file` argument)
- a Python file containing exactly one class definition, our model's class inherited from `nn.Module` (`--model-file` argument); it is only required if the model isn't scripted (more on that later)
- an optional Python file containing one class definition, the handler's class inherited from `ts.torch_handler.BaseHandler`, that performs the necessary pre- and post-processing transformations, OR the name of a predefined handler (`--handler` argument)
- an optional extra file, `index_to_name.json`, that maps predicted class indices to their corresponding category names and is used automatically by some predefined handlers (`--extra-files` argument)

It is typical to assemble the model archive file through the command line interface:

```
torch-model-archiver --model-name <your_model_name> \
                     --version <your_model_version> \
                     --model-file <your_model_file>.py \
                     --serialized-file <your_model_name>.pth \
                     --handler <handler-script OR name> \
                     --extra-files ./index_to_name.json
```

However, let's take a closer look at each one of its components and assemble it ourselves instead.

### 11.3.2 Model File

We need to:
- define our own class
- create an instance of an untrained ResNet18 model
- replace its head (`fc` layer) with our own
- update our own class internal dictionary with the entries from ResNet's dictionary
- set ResNet's forward pass to our own class using `setattr`

It looks like this:

In [8]:
# Wraps an untrained ResNet18 (with a replaced head) in our own FOMONet class

from torchvision.models import resnet18

class FOMONet(nn.Module):
    def __init__(self):
        super().__init__()

        # Create an instance of an untrained ResNet18
        resnet = resnet18(weights=None)
        # Modifies the architecture to our task
        resnet.fc = nn.Linear(512, 4)

        # Replicate ResNet's modified architecture to FOMONet
        self.__dict__.update(resnet.__dict__)
        # Replicate Resnet's forward method to FOMONet
        setattr(self, 'forward', resnet.forward)

In [9]:
fomo = FOMONet()
fomo.load_state_dict(model.state_dict())

<All keys matched successfully>

In [10]:
fomo.eval()
model.eval()

torch.manual_seed(32)
x = torch.randn(1, 3, 224, 224)

fomo(x), model.cpu()(x)

(tensor([[ 0.2412, -2.8556, -1.1869,  0.8597]], grad_fn=<AddmmBackward0>),
 tensor([[ 0.2412, -2.8556, -1.1869,  0.8597]], grad_fn=<AddmmBackward0>))

In [12]:
model_file_script = """
import torch.nn as nn
from torchvision.models import resnet18

class FOMONet(nn.Module):
    def __init__(self):
        super().__init__()

        # Create an instance of an untrained ResNet18
        resnet = resnet18(weights=None)
        # Modifies the architecture to our task
        resnet.fc = nn.Linear(512, 4)

        # Replicate ResNet's modified architecture to FOMONet
        self.__dict__.update(resnet.__dict__)
        # Replicate Resnet's forward method to FOMONet
        setattr(self, 'forward', resnet.forward)
"""

with open('model_file.py', 'w') as fp:
    fp.write(model_file_script)
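
Since the model file is expected to contain only one class definition, it can be worth checking the file before packaging it. The snippet below is a minimal sketch of such a check using Python's `ast` module; the helper name `top_level_classes` is hypothetical, and it only approximates whatever validation TorchServe itself performs when loading the file.

```python
import ast
from pathlib import Path


def top_level_classes(source: str) -> list:
    """Return the names of the classes defined at module level in `source`."""
    tree = ast.parse(source)
    return [node.name for node in tree.body if isinstance(node, ast.ClassDef)]


# Check the file written above, if it exists in the working directory
model_path = Path('model_file.py')
if model_path.exists():
    names = top_level_classes(model_path.read_text())
    print(names)
```

For the `model_file.py` written above, the list should contain a single name, `FOMONet`. Imported classes such as `resnet18` are not counted because they are not defined in the file itself.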

### 11.3.3 Scripted Models

"*TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.*"

Source: [TorchScript](https://pytorch.org/docs/stable/jit.html)

The key element here is "*no Python dependency*", meaning the model can be run in a standalone C++ program, for example. This preserves the best of both worlds: the ease and friendliness of the Python language for development, and the speed and reliability of the C++ language for deploying in production.

In [13]:
# once it is scripted, there is no need for the model class def anymore
scripted_model = torch.jit.script(model)
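
To illustrate why the class definition is no longer needed, here is a minimal sketch that scripts a tiny stand-in model, saves it to an in-memory buffer, and loads it back without referring to the class again. `TinyNet` is a hypothetical toy module used only to keep the example fast; it is not part of this chapter's model.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a real model."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 2)

    def forward(self, x):
        return self.fc(x)


tiny = TinyNet().eval()
scripted = torch.jit.script(tiny)

# Serialize the scripted module to a buffer instead of a file on disk
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)

# Loading needs only the serialized program, not the TinyNet class
restored = torch.jit.load(buffer)

x = torch.randn(4, 3)
with torch.no_grad():
    same = torch.allclose(tiny(x), restored(x))
print(same)
```

The same round trip applies to `scripted_model`, only with a file path instead of a buffer.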

### 11.3.4 Serialized File

In [14]:
# We already saved the model to disk in the previous chapter
# eager mode version
torch.save(model.state_dict(), 'fomo_model.pth')

# scripted version
scripted_model.save("fomo_model.pt")

### 11.3.5 Inference Handler

TorchServe ships with several [default handlers](https://pytorch.org/serve/default_handlers.html):
- `image_classifier`
- `object_detector`
- `text_classifier`
- `image_segmenter`

The first three handlers also map the predicted class to its corresponding name/category using a standard `index_to_name.json` extra file.

#### 11.3.5.1 Initialize

```python
def initialize(self, context):
    """Initialize function loads the model.pt file and initializes the model object.
       First try to load torchscript else load eager mode state_dict based model.
    """
    # Abridged excerpt: self.manifest, model_dir, self.model_pt_path, and
    # self.device are set earlier in the full method
    model_file = self.manifest["model"].get("modelFile", "")
    if model_file:
        self.model = self._load_pickled_model(model_dir, model_file, self.model_pt_path)
        self.model.to(self.device)
        self.model.eval()
    elif self.model_pt_path.endswith(".pt"):
        self.model = self._load_torchscript_model(self.model_pt_path)
        self.model.eval()
```

#### 11.3.5.2 Handle

```python
def handle(self, data, context):
    """Entry point for default handler. It takes the data from the input request and returns
       the predicted outcome for the input.
    """
    data_preprocess = self.preprocess(data)
    output = self.inference(data_preprocess)
    output = self.postprocess(output)

    return output
```

#### 11.3.5.3 Preprocess

```python
def preprocess(self, data):
    """
    Preprocess function to convert the request input to a tensor(Torchserve supported format).
    The user needs to override to customize the pre-processing
    """
    images = []

    for row in data:
        # Compat layer: normally the envelope should just return the data
        # directly, but older versions of Torchserve didn't have envelope.
        image = row.get("data") or row.get("body")
        if isinstance(image, str):
            # if the image is a string of bytesarray.
            image = base64.b64decode(image)

        # If the image is sent as bytesarray
        if isinstance(image, (bytearray, bytes)):
            image = Image.open(io.BytesIO(image))
            image = self.image_processing(image)
        else:
            # if the image is a list
            image = torch.FloatTensor(image)

        images.append(image)

    return torch.stack(images).to(self.device)
```

Let's take a quick look at the `image_processing()` function that's called by the `preprocess()` method:

In [20]:
!pip install torchserve captum


In [21]:
from ts.torch_handler.image_classifier import ImageClassifier

ImageClassifier.image_processing



Compose(
    Resize(size=256, interpolation=bilinear, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    ToTensor()
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)

In [22]:
from torchvision.models import get_weight

weights = get_weight('ResNet18_Weights.DEFAULT')
weights.transforms()

ImageClassification(
    crop_size=[224]
    resize_size=[256]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)

In [23]:
!wget https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch9/fig_0_100.jpg

--2025-03-12 15:29:45--  https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch9/fig_0_100.jpg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4106 (4.0K) [image/jpeg]
Saving to: ‘fig_0_100.jpg’


2025-03-12 15:29:45 (38.1 MB/s) - ‘fig_0_100.jpg’ saved [4106/4106]



In [24]:
from PIL import Image

img = Image.open('./fig_0_100.jpg')

(ImageClassifier.image_processing(img) == weights.transforms()(img)).all()

tensor(True)

#### 11.3.5.4 Inference

```python
def inference(self, data, *args, **kwargs):
    """
    The Inference Function is used to make a prediction call on the given input request.
    The user needs to override the inference function to customize it.
    """
    with torch.no_grad():
        marshalled_data = data.to(self.device)
        results = self.model(marshalled_data, *args, **kwargs)
    return results
```

#### 11.3.5.5 Postprocess

```python
def postprocess(self, data):
    """
    The post process function makes use of the output from the inference and converts into a
    Torchserve supported response output.
    """
    ps = F.softmax(data, dim=1)
    probs, classes = torch.topk(ps, self.topk, dim=1)
    probs = probs.tolist()
    classes = classes.tolist()
    return map_class_to_label(probs, self.mapping, classes)
```
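
To make the post-processing step concrete, the snippet below is a rough pure-Python approximation of what it computes: a softmax over the logits, a top-k selection, and a lookup of each class index in the mapping. The helper names are hypothetical, the logits are the ones produced by the random input earlier in this chapter, and the mapping uses the class names of this chapter's task.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, mapping, k):
    """Return {label: probability} for the k most likely classes."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # The mapping comes from a JSON file, so its keys are strings
    return {mapping[str(i)]: probs[i] for i in ranked}


mapping = {'0': 'Fig', '1': 'Mandarine', '2': 'Onion White', '3': 'Orange'}
logits = [0.2412, -2.8556, -1.1869, 0.8597]

result = top_k_labels(logits, mapping, k=4)
print(result)
```

The dictionary is ordered from the most to the least likely class, which mirrors the shape of the responses returned by the server later in this chapter.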

#### 11.3.5.6 Custom Handler

In [25]:
handler_file_script = """
from ts.torch_handler.image_classifier import ImageClassifier

class FOMOHandler(ImageClassifier):
    def __init__(self):
      super().__init__()

      # By default, ImageClassifier uses top-5 classes
      # but our task has only 4, so we need to tweak it
      self.set_max_result_classes(4)
"""

with open('handler_file.py', 'w') as fp:
    fp.write(handler_file_script)

### 11.3.6 Extra Files

In [26]:
# We didn't load the dataset in this chapter, so we're building the dict manually
# class_to_idx = datasets['train'].class_to_idx

class_to_idx = {'Fig': 0, 'Mandarine': 1, 'Onion White': 2, 'Orange': 3}

In [27]:
index_to_name = {v: k for k, v in class_to_idx.items()}
index_to_name

{0: 'Fig', 1: 'Mandarine', 2: 'Onion White', 3: 'Orange'}

In [28]:
import json

with open('index_to_name.json', 'w') as f:
    json.dump(index_to_name, f)
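
One detail worth keeping in mind is that JSON object keys are always strings, so the integer indices do not survive the round trip. The short sketch below shows this by serializing to a string rather than to a file.

```python
import json

index_to_name = {0: 'Fig', 1: 'Mandarine', 2: 'Onion White', 3: 'Orange'}

# Serialize to a string instead of a file to keep the example side-effect free
payload = json.dumps(index_to_name)
restored = json.loads(payload)

print(payload)
print(list(restored.keys()))
```

Any code that reads `index_to_name.json` back therefore has to look up `str(index)` rather than the integer itself.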

### 11.3.7 Packaging

```
torch-model-archiver --model-name FOMO \
                     --version 1.0 \
                     --model-file ./model_file.py \
                     --serialized-file fomo_model.pth \
                     --handler ./handler_file.py \
                     --extra-files ./index_to_name.json \
                     --export-path ./model_store
```

In [29]:
!mkdir ./model_store

In [31]:
!pip install torch-model-archiver

Collecting torch-model-archiver
  Downloading torch_model_archiver-0.12.0-py3-none-any.whl.metadata (1.4 kB)
Collecting enum-compat (from torch-model-archiver)
  Downloading enum_compat-0.0.3-py3-none-any.whl.metadata (954 bytes)
Downloading torch_model_archiver-0.12.0-py3-none-any.whl (16 kB)
Downloading enum_compat-0.0.3-py3-none-any.whl (1.3 kB)
Installing collected packages: enum-compat, torch-model-archiver
Successfully installed enum-compat-0.0.3 torch-model-archiver-0.12.0


In [32]:
import sys
from model_archiver.model_packaging import generate_model_archive

sys.argv = ['',
            '--model-name', 'FOMO',
            '--version', '1.0',
            '--model-file', 'model_file.py',
            '--serialized-file', 'fomo_model.pth',
            '--handler', 'handler_file.py',
            '--extra-files', 'index_to_name.json',
            '--export-path', './model_store',
            '--force']

generate_model_archive()
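
Because a model archive is a collection of zipped files, the result can be inspected with Python's `zipfile` module. The snippet below is a minimal sketch, assuming the archive was written to `./model_store/FOMO.mar` in the default format; the helper name `describe_archive` is hypothetical.

```python
import json
import zipfile
from pathlib import Path


def describe_archive(mar_path):
    """Return (file names, parsed manifest) for a .mar archive."""
    with zipfile.ZipFile(mar_path) as archive:
        names = archive.namelist()
        manifest = json.loads(archive.read('MAR-INF/MANIFEST.json'))
    return names, manifest


# Only runs if the archive was actually generated by the cell above
mar_path = Path('./model_store/FOMO.mar')
if mar_path.exists():
    names, manifest = describe_archive(mar_path)
    print(names)
    print(manifest['model'])
```

The file list should include the model file, the serialized weights, the handler, and the extra file, alongside the manifest itself.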

## 11.4 TorchServe

[TorchServe](https://pytorch.org/serve/) is a flexible and easy-to-use tool for serving and scaling PyTorch eager mode and scripted models in production. It offers APIs for querying, managing, and analyzing the performance of its served models (by default, they are only accessible from localhost):

- [Inference API](https://github.com/pytorch/serve/blob/master/docs/inference_api.md): it listens on port 8080 and offers the following services
  - description (`OPTIONS /`)
  - health check (`GET /ping`)
  - predictions (`POST /predictions/{model_name}`)
  - explanations (`POST /explanations/{model_name}`)
  - kserve (`POST /v1/models/{model_name}:predict`)
  - kserve explanations (`POST /v1/models/{model_name}:explain`)
  
- [Management API](https://github.com/pytorch/serve/blob/master/docs/management_api.md): it listens on port 8081 and offers the following services
  - description (`OPTIONS /`)
  - list models (`GET /models`)
  - describe a model (`GET /models/{model_name}`)
  - register a model (`POST /models`)
  - scale workers (`PUT /models/{model_name}`)
  - set default version (`PUT /models/{model_name}/{version}/set-default`)
  - unregister a model (`DELETE /models/{model_name}/{version}`)
  
- [Metrics API](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md): it listens on port 8082 and returns Prometheus-formatted frontend and backend metrics, such as number of requests, CPU and memory utilization, handler and prediction time, and many more.
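
The endpoints listed above can be expressed as plain HTTP requests. The snippet below is a minimal sketch that only builds the request objects with the standard library, assuming the default addresses; the helper names are hypothetical, and nothing is sent until a request is passed to `urllib.request.urlopen()` while the server is running.

```python
import urllib.request

# Default addresses; they change if config.properties overrides them
INFERENCE = 'http://127.0.0.1:8080'
MANAGEMENT = 'http://127.0.0.1:8081'


def ping_request(base=INFERENCE):
    """Health check on the Inference API."""
    return urllib.request.Request(f'{base}/ping', method='GET')


def list_models_request(base=MANAGEMENT):
    """List registered models on the Management API."""
    return urllib.request.Request(f'{base}/models', method='GET')


def predict_request(model_name, payload, base=INFERENCE):
    """Prediction request carrying raw bytes (e.g. an image file's contents)."""
    url = f'{base}/predictions/{model_name}'
    return urllib.request.Request(url, data=payload, method='POST')


for req in (ping_request(), list_models_request(), predict_request('fomo', b'')):
    print(req.get_method(), req.full_url)
```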

```
torchserve --start \
           --disable-token-auth \
           --model-store ./model_store \
           --models fomo=FOMO.mar \
           --ts-config config.properties
```

In [33]:
config_properties = """
inference_address=http://127.0.0.1:7777
"""

with open('config.properties', 'w') as fp:
    fp.write(config_properties)
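
Note that this configuration moves the Inference API from its default port to port 7777. The snippet below is a small sketch that reads the setting back from the same text; `parse_properties` is a hypothetical helper that handles only simple `key=value` lines, not the full properties format.

```python
from urllib.parse import urlparse


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        settings[key.strip()] = value.strip()
    return settings


config_text = """
inference_address=http://127.0.0.1:7777
"""

settings = parse_properties(config_text)
address = urlparse(settings['inference_address'])
print(address.hostname, address.port)
```

Requests to the Inference API must therefore target port 7777 instead of 8080 once the server is started with this file.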

In [34]:
from ts.model_server import start

sys.argv = ['',
            '--start',
            '--disable-token-auth',
            '--model-store', './model_store',
            '--models', 'fomo=FOMO.mar',
            '--ts-config', 'config.properties']
start()
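
The server may need a few moments before it accepts requests. One way to handle this is to poll the health check endpoint until it answers. The function below is a minimal sketch of that idea using only the standard library; `wait_until_healthy` is a hypothetical helper, and the timeout values are arbitrary.

```python
import time
import urllib.request


def wait_until_healthy(ping_url, timeout=30.0, interval=1.0):
    """Poll `ping_url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(ping_url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error responses
            pass
        time.sleep(interval)
    return False
```

With the configuration above, the call would be `wait_until_healthy('http://127.0.0.1:7777/ping')`, which returns `True` as soon as the server responds and `False` if it never does within the timeout.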

In [35]:
import requests

with open('./fig_0_100.jpg', 'rb') as f:
    data = f.read()

response = requests.put('http://127.0.0.1:7777/predictions/fomo', data=data)
response.json()

{'Fig': 0.9928925037384033,
 'Orange': 0.004503952339291573,
 'Onion White': 0.0016829799860715866,
 'Mandarine': 0.0009206130634993315}

In [36]:
#!torchserve --stop
sys.argv = ['', '--stop']
start()

TorchServe has stopped.


### 11.4.1 Ngrok (optional)

"*Online in One Line*" reads the [ngrok](https://ngrok.com/) website. It is an easy and convenient way of serving your model through a tunnel, thus allowing it to handle incoming requests from the outside world in your own Jupyter Notebook.

***
**DISCLAIMER**: You should NOT use Google Colab notebooks as a backend for your deployed models. This is just a proof-of-concept, and a way to make your model available to the world for a brief amount of time, so you can showcase it to your family, friends, or colleagues.
***

If you want to try the code below, you'll need to [sign up](https://dashboard.ngrok.com/signup) for a free account on [ngrok](https://ngrok.com/). The [pyngrok](https://pypi.org/project/pyngrok/) package, installed at the top of this notebook, takes care of downloading and installing ngrok.

You'll need to copy your [authorization token](https://dashboard.ngrok.com/get-started/your-authtoken) and paste it in the appropriate command below:

***
**DISCLAIMER**: The responsibility for keeping your credentials and/or authorization tokens safe and private is your own. Make sure to remove any credentials and/or authorization tokens from your notebook before saving or pushing it to public repositories, such as GitHub.
***

In [None]:
# Option 1
# You can call ngrok with your token
# Uncomment the line below and replace ... with your token
# !ngrok authtoken ...

# Option 2
# Or you can save it to a configuration file
# Uncomment the line below and replace ... with your token
# !echo "authtoken: ..." >> /root/.ngrok2/ngrok.yml

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


Once ngrok is set up, let's start TorchServe once again with a few modifications in the `config.properties` file:

***
**DISCLAIMER**: CORS stands for cross-origin resource sharing, and the configuration below makes TorchServe wide open to requests from anywhere. You SHOULD NOT use these configuration parameters in production as they're not safe. The responsibility for ensuring the security of your application, model, and data is your own.
***

In [None]:
config_properties = """
inference_address=http://127.0.0.1:7777
cors_allowed_origin=*
cors_allowed_methods=GET, POST, PUT, OPTIONS
"""

with open('config_cors.properties', 'w') as fp:
    fp.write(config_properties)

In [None]:
sys.argv = ['',
            '--start',
            '--disable-token-auth',
            '--model-store', './model_store',
            '--models', 'fomo=FOMO.mar',
            '--ts-config', 'config_cors.properties']
start()

In [None]:
from pyngrok import ngrok

# <NgrokTunnel: "http://<public_sub>.ngrok.io" -> "http://localhost:7777">
http_tunnel = ngrok.connect(7777, "http")



In [None]:
http_tunnel.public_url

'https://f295-35-202-252-169.ngrok-free.app'

In [None]:
with open('./fig_0_100.jpg', 'rb') as f:
    data = f.read()

response = requests.put(f'{http_tunnel.public_url}/predictions/fomo', data=data)
response.json()

{'Fig': 0.9934685230255127,
 'Orange': 0.004324017558246851,
 'Onion White': 0.0012627042597159743,
 'Mandarine': 0.0009447108022868633}

In [None]:
ngrok.disconnect(http_tunnel.public_url)

In [None]:
sys.argv = ['', '--stop']
start()

TorchServe has stopped.
