<img src="https://drive.google.com/uc?export=view&id=1JIIlkTWa2xbft5bTpzhGK1BxYL83bJNU" width="800"/>

# 🔥 Inference Acceleration Demo
---

<img src="https://drive.google.com/uc?export=view&id=1cyDL_idJ78Bjp7RdO1baTVZHekb9_OFJ" width="800"/>

For this demo, we’re going to show you how you can take a vanilla PyTorch model, have NOS accelerate it, and then scale it to speed up inference by **10x**, in just a few lines of code.

By the end, we’ll use NOS to build an end-to-end semantic video search engine that processes a 10-minute video of San Francisco in real time.

Here's the video clip we'll be using for the demos. 

In [None]:
from nos.common.io import VideoReader

FILENAME = "top-10-things-to-do-in-sf.mp4"
print(VideoReader(FILENAME))

In [None]:
from IPython.display import Video
Video(FILENAME, width=640)

### 🔥 1. Inference with 🤗 transformers (OpenAI CLIP)
---

Let's say we're using the popular OpenAI CLIP for extracting image and text embeddings, and we're using the 🤗 transformers library. 

In [None]:
from typing import Union, List
from PIL import Image

import numpy as np
import torch

class CLIP:
    """Text and image encoding using OpenAI CLIP"""
    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        from transformers import CLIPModel
        from transformers import CLIPProcessor, CLIPTokenizer
        
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.tokenizer = CLIPTokenizer.from_pretrained(model_name)
        
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(device)
        self.model.eval()
        self.device = self.model.device
        
    def encode_image(self, images: Union[Image.Image, np.ndarray, List[Image.Image], List[np.ndarray]]):
        """Encode image into an embedding."""
        with torch.inference_mode():
            inputs = self.processor(images=images, return_tensors="pt").to(self.device)
            return self.model.get_image_features(**inputs).cpu().numpy()

    def encode_text(self, texts: Union[str, List[str]]) -> np.ndarray:
        """Encode text into an embedding."""
        with torch.inference_mode():
            if isinstance(texts, str):
                texts = [texts]
            inputs = self.tokenizer(
                texts,
                padding=True,
                return_tensors="pt",
            ).to(self.device)
            text_features = self.model.get_text_features(**inputs)
            return text_features.cpu().numpy()

#### Naive Inference

Let's see what a naive implementation of inference would look like:

In [None]:
from nos.common import tqdm
from nos.common.io import VideoLoader

# Load the first image
images = VideoLoader(FILENAME, shape=(1, 224, 224, 3))
image = next(images)

# Load the PyTorch model
clip = CLIP()

In [None]:
for _ in tqdm(duration=5, desc="Naive implementation", unit=" images"):
    clip.encode_image(image)
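
The loop above counts how many `encode_image` calls complete in a 5-second window. For readers without NOS installed, a rough equivalent can be sketched with the standard library alone. `measure_throughput` and the stand-in workload are hypothetical names used for illustration; they are not part of NOS or of this notebook.

```python
import time
from typing import Callable


def measure_throughput(fn: Callable[[], object], duration: float = 1.0) -> float:
    """Call `fn` repeatedly for about `duration` seconds; return calls per second."""
    calls = 0
    start = time.perf_counter()
    while True:
        fn()
        calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= duration:
            return calls / elapsed


# Stand-in workload; in this notebook it would be `lambda: clip.encode_image(image)`.
rate = measure_throughput(lambda: sum(range(1000)), duration=0.2)
print(f"{rate:.1f} calls/s")
```

Because `encode_image` copies its result back to the CPU, each call finishes its GPU work before returning, so wall-clock timing of this kind is meaningful. Running a few warm-up calls before measuring is still a good idea.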

### PyTorch in /dev
<img src="https://drive.google.com/uc?export=view&id=1JQcd4hRBIBi77xKgypy-XwL9bwtgHDGB" width="800"/>

### PyTorch in /prod
<img src="https://drive.google.com/uc?export=view&id=1_ZqGkyGBBy22gtKFoyg-f6f-8qlIME16" width="800"/>

### ⚡️ 2. Optimizing Inference with NOS
---

**NOS** provides a convenient way to **compile**, **tune** and **auto-scale** PyTorch models for inference.

Let’s start the NOS backend. NOS can run locally or in the cloud, with access to hundreds of GPUs in a cluster.

In [None]:
import nos

nos.init(runtime="local")

Recall that we used `CLIPModel` from the 🤗 transformers library. 

```python
class CLIP:
    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        from transformers import CLIPModel
        ...
        self.model = CLIPModel.from_pretrained(model_name).to(device)
        ...
```

In [None]:
clip.model = nos.compile(clip.model, [image], precision=torch.float32)

In [None]:
for _ in tqdm(duration=5, desc="NOS optimized", unit=" images"):
    clip.encode_image(image)

**Key takeaway:** >60% better performance, identical API.
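
To make a figure like this concrete, the relative gain can be computed from the two throughput readings (images per second) reported by the progress bars. The sketch below uses made-up numbers purely for illustration; `relative_gain` is a hypothetical helper, and the values are not measurements from this demo.

```python
def relative_gain(baseline: float, optimized: float) -> float:
    """Percent throughput improvement of `optimized` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (optimized / baseline - 1.0) * 100.0


# Hypothetical readings in images/s, for illustration only.
baseline_ips, optimized_ips = 100.0, 165.0
gain = relative_gain(baseline_ips, optimized_ips)
print(f"{gain:.0f}% faster")
```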

### 🚀 3. Optimizing and Scaling Inference with NOS
---

Now let's compile and scale `CLIP` for production deployment. 

In [None]:
# Load the model for remote-execution
model = nos.load(CLIP, 
                 init_args=(), init_kwargs={"model_name": "openai/clip-vit-base-patch32"}, 
                 method_name="encode_image")
model

Now, let's optimize the model and scale the number of replicas to make full use of the underlying hardware. NOS automatically picks the number of replicas that gives the best performance on the hardware we have.

In [None]:
from nos.managers.model import ModelOptimizationPolicy

# Optimize (ahead-of-time compilation) + Scale the model for max throughput
model = model.scale(replicas="auto", policy=ModelOptimizationPolicy.MAX_THROUGHPUT)
model

Finally, let’s take this optimized and scaled model and build a video search engine.

### 🔍 4. Video Search Demo (in < 10 lines of code)

Let's put together a quick semantic search demo with OpenAI's CLIP model, using the NOS inference engine. 

The following snippet extracts embeddings from all the frames in the video. The cell after it uses CLIP to match text queries against those image embeddings.

In [None]:
from nos.common import tqdm
from nos.common.io import VideoLoader

# Load frames from the video lazily
B = model.batch_size()
images = VideoLoader(FILENAME, shape=(B, 224, 224, 3))
images = ({"images": img} for img in tqdm(images, unit_scale=B, unit="images"))

# Batch inference using auto-scaled model, then normalize embeddings
video_features = torch.from_numpy(np.vstack(list(model.imap(images))))
video_features /= video_features.norm(dim=-1, keepdim=True)
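
The search step later looks frames up by row position, so the stacked embeddings need to stay in frame order, and each row is scaled to unit length so that a dot product gives cosine similarity. The following is a minimal sketch of that batch, map, stack, and normalize pattern using only NumPy and the standard library. It assumes results come back in input order, as the cell above relies on; `fake_encode` is a hypothetical stand-in for the model, and the thread pool is not a description of how NOS implements `imap` or replicas.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator

import numpy as np


def batches(frames: np.ndarray, batch_size: int) -> Iterator[np.ndarray]:
    """Yield consecutive batches of frames."""
    for start in range(0, len(frames), batch_size):
        yield frames[start:start + batch_size]


def fake_encode(batch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the model: one 4-d 'embedding' per frame."""
    flat = batch.reshape(len(batch), -1).astype(np.float32)
    return flat[:, :4] + 1.0  # the offset keeps every row non-zero


frames = np.arange(10 * 2 * 2 * 3).reshape(10, 2, 2, 3)  # 10 tiny frames

# Executor.map returns results in input order, even if workers finish out of order.
with ThreadPoolExecutor(max_workers=2) as pool:
    outputs = list(pool.map(fake_encode, batches(frames, batch_size=4)))

features = np.vstack(outputs)  # shape (10, 4); row i belongs to frame i
features /= np.linalg.norm(features, axis=-1, keepdims=True)  # unit-length rows
print(features.shape)
```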

In [None]:
from IPython.display import HTML, display
from nos.common.io import VideoReader

encode_text = CLIP().encode_text
video = VideoReader(FILENAME)

def search_video(query: str, video_features: torch.Tensor, topk: int = 3):
    """Semantic video search demo in 8 lines of code"""
    # Encode text and normalize
    with torch.inference_mode():
        text_features = encode_text(texts=[query])
        text_features = torch.from_numpy(text_features)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute the similarity between the search query and each video frame
    similarities = (video_features @ text_features.T)
    _, best_photo_idx = similarities.topk(topk, dim=0)
    
    # Display the top k frames
    results = np.hstack([video[int(frame_id)] for frame_id in best_photo_idx])
    display(Image.fromarray(results).resize((600, 400)))
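
The ranking inside `search_video` is cosine similarity followed by a top-k selection. Here is a small NumPy-only sketch of the same idea on toy 2-d vectors, which can be run without a GPU or the video file. The function name `top_k_frames` and the toy embeddings are made up for illustration.

```python
import numpy as np


def top_k_frames(query: np.ndarray, frame_features: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k frames most similar to `query`, best first."""
    # Normalize both sides so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    f = frame_features / np.linalg.norm(frame_features, axis=-1, keepdims=True)
    similarities = f @ q  # shape: (num_frames,)
    return np.argsort(-similarities)[:k]


# Toy 2-d "embeddings" (made up for illustration)
frame_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
print(top_k_frames(query, frame_features, k=2))  # → [0 2]
```

Frame 0 points almost exactly along the query and ranks first; frame 2 shares part of that direction and ranks second.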

### 🔍 Semantic video search with text

In [None]:
search_video("golden gate bridge", video_features, topk=1)

In [None]:
search_video("alcatraz prison", video_features, topk=1)

In [None]:
search_video("fishermans wharf", video_features, topk=1)

In [None]:
search_video("golden gate park windmill", video_features, topk=1)

In [None]:
search_video("chinatown", video_features, topk=1)

In [None]:
search_video("lombard street", video_features, topk=1)

In [None]:
search_video("pier 39", video_features, topk=1)

In [None]:
search_video("riding the tram", video_features, topk=1)

In [None]:
search_video("ferry building", video_features, topk=1)

### How NOS works
<img src="https://drive.google.com/uc?export=view&id=1aHyc3pFX-wPtQbxKzo3SuuXP60KyWRbu" width="800"/>

### NOS Applications
<img src="https://drive.google.com/uc?export=view&id=1kRj_gEip4aZ7QIpiG_OSqev81nkeeWtQ" width="800"/>

Thanks for watching!

Reach out to us: **Sudeep**  (sudeep@autonomi.ai) | **Scott** (scott@autonomi.ai) | https://www.autonomi.ai