# Zero-Shot Text Classification with Triton Inference Server

The recent release of GPT-3 by OpenAI produced one of the largest NLP models to date, with a whopping 175 billion parameters. This gigantic model has achieved promising results in zero-shot, one-shot, and few-shot settings, and in some cases these techniques have even surpassed state-of-the-art models. All of this got me interested in digging deeper into zero-shot learning in NLP. Before the success of transformer models, most zero-shot learning research was concentrated in computer vision, but there is now a lot of interesting work in the NLP domain as well, thanks to improvements in the quality of sentence embeddings.

## What is Zero-Shot-Learning (ZSL)?

In short, ZSL is the ability to detect classes that the model has never seen during training. In this blog post, I use the latent embedding approach: we embed both the input sequence (the premise) and each candidate label (the hypothesis against which we want to classify the premise) into the same latent space with the same model, and then measure the distance between the two embeddings in that space.
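
Stated a bit more formally (the symbols are my own notation, not something the model or the library defines): if $\phi$ is the embedding model, $x$ the input sequence, and $\mathcal{L}$ the set of candidate labels, the predicted label is the one whose embedding lies closest to the embedding of $x$. Taking cosine similarity as the closeness measure, which is the choice made later in this post:

$$
\hat{y} = \arg\max_{l \in \mathcal{L}} \; \cos\big(\phi(x), \phi(l)\big)
$$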

# Client-Side Script to Interact with Triton Inference Server for Zero-Shot-Text Classification

In [19]:
import numpy as np
import tritonhttpclient
import torch
from transformers import AutoTokenizer
from torch.nn import functional as F




We fetch the tokenizer for the Sentence-BERT model from the Transformers library.

In [20]:
tokenizer = AutoTokenizer.from_pretrained('deepset/sentence_bert')
VERBOSE = False



## Let's test some input sentences and labels

In [21]:
sentence1 = 'Who are you voting for 2021?'
sentence2 = 'Jupiter’s Biggest Moons Started as Tiny Grains of Hail'
sentence3 = 'Hi Matt, your payment is one week past due. Please use the link below to make your payment'
labels = ['business', 'space and science', 'politics']
input_name = ['input__0', 'input__1']
output_name = 'output__0'



The run_inference function receives a sentence as input, preprocesses it (i.e., performs tokenization), sends the preprocessed inputs to the Triton server, and gets back the embeddings.

In [22]:
def run_inference(sentence, model_name='deepset', url='127.0.0.1:8000', model_version='1'):
    triton_client = tritonhttpclient.InferenceServerClient(
        url=url, verbose=VERBOSE)
    # Tokenize the sentence and all candidate labels as a single batch.
    # I have restricted the input sequence length to 256.
    inputs = tokenizer.batch_encode_plus([sentence] + labels,
                                     return_tensors='pt', max_length=256,
                                     truncation=True, padding='max_length')

    # The request declares INT32 inputs of shape (1 + number of labels, 256).
    input_ids = inputs['input_ids'].numpy().astype(np.int32)
    mask = inputs['attention_mask'].numpy().astype(np.int32)
    input0 = tritonhttpclient.InferInput(input_name[0], list(input_ids.shape), 'INT32')
    input0.set_data_from_numpy(input_ids, binary_data=False)
    input1 = tritonhttpclient.InferInput(input_name[1], list(mask.shape), 'INT32')
    input1.set_data_from_numpy(mask, binary_data=False)
    output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)
    response = triton_client.infer(model_name, model_version=model_version,
                                   inputs=[input0, input1], outputs=[output])
    embeddings = torch.from_numpy(response.as_numpy(output_name))
    # Mean-pool over the token dimension: row 0 is the sentence, the rest are the labels.
    sentence_rep = embeddings[:1].mean(dim=1)
    label_reps = embeddings[1:].mean(dim=1)
    similarities = F.cosine_similarity(sentence_rep, label_reps)
    # Print the labels from most to least similar.
    closest = similarities.argsort(descending=True)
    for ind in closest:
        print(f'label: {labels[ind]} \t similarity: {similarities[ind]}')
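
One thing to be aware of: because the inputs are padded to 256 tokens, the plain mean over the token dimension also averages the embeddings at padding positions. A common alternative is to weight the mean by the attention mask so that only real tokens contribute. Here is a minimal NumPy sketch, assuming the model output has shape (batch, sequence, hidden). The masked_mean_pool helper is hypothetical and not part of the original script, and I have not measured whether it changes the rankings reported in this post.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, ignoring padding.

    token_embeddings: float array of shape (batch, seq_len, hidden)
    attention_mask:   0/1 array of shape (batch, seq_len); 1 marks a real token
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for a row that is entirely padding.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: one sequence with two real tokens and one padding position.
toy_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])

plain = toy_embeddings.mean(axis=1)                    # padding vector is included
pooled = masked_mean_pool(toy_embeddings, toy_mask)    # padding vector is ignored
print(plain, pooled)
```

To try this inside run_inference, you would pass the embeddings returned by the server (converted to NumPy) together with the same mask array that was sent in the request.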




Here we use cosine similarity to find the label closest to the input sentence, based on their embeddings. For two embedding vectors a and b of dimension n, the metric is computed as:
$$
 \frac{
  \sum\limits_{i=1}^{n}{a_i b_i}
  }{
      \sqrt{\sum\limits_{j=1}^{n}{a_j^2}}
      \sqrt{\sum\limits_{k=1}^{n}{b_k^2}}
  }
$$



Once we have the cosine similarities, we can sort the labels from highest to lowest similarity.
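
To make the formula concrete, here is a small pure-Python sketch that computes the same quantity and sorts a set of labels by it. The three-dimensional vectors are invented for illustration only (real sentence embeddings have hundreds of dimensions), and the cosine_similarity and rank_labels helpers are hypothetical names, not part of the script above, which relies on torch.nn.functional.cosine_similarity instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_labels(sentence_vec, label_vecs):
    """Return (label, similarity) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(sentence_vec, vec))
              for label, vec in label_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, made up for this example.
sentence_vec = [1.0, 0.0, 1.0]
label_vecs = {
    'business': [0.0, 1.0, 0.0],
    'space and science': [1.0, 0.0, 0.0],
    'politics': [1.0, 0.0, 1.0],
}

ranking = rank_labels(sentence_vec, label_vecs)
for label, score in ranking:
    print(f'label: {label} \t similarity: {score:.3f}')
```

With these toy vectors, 'politics' points in the same direction as the sentence vector and ranks first, while 'business' is orthogonal to it and ranks last.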

In [23]:
print("Input sentence:", sentence1)
print('\n')
run_inference(sentence1)
print('\n')
print("Input sentence:", sentence2)
print('\n')
run_inference(sentence2)
print('\n')
print("Input sentence:", sentence3)
print('\n')
run_inference(sentence3)



Input sentence: Who are you voting for 2021?


label: politics 	 similarity: 0.23479251563549042
label: space and science 	 similarity: 0.13357104361057281
label: business 	 similarity: 0.03533152863383293


Input sentence: Jupiter’s Biggest Moons Started as Tiny Grains of Hail


label: space and science 	 similarity: 0.3903110921382904
label: business 	 similarity: 0.184669628739357
label: politics 	 similarity: 0.1614534705877304


Input sentence: Hi Matt, your payment is one week past due. Please use the link below to make your payment


label: business 	 similarity: 0.221212238073349
label: politics 	 similarity: 0.18082530796527863
label: space and science 	 similarity: 0.05963622406125069


## Further Reading

In addition to the Sentence-BERT model, the Transformers library provides much bigger and more advanced models, such as RoBERTa, which can dramatically improve the results for zero-shot text classification. I have written a blog post that walks you through the same process with a more advanced transformer model (RoBERTa), which shows much better results for the zero-shot learning approach. The post is linked here for further reference:

https://medium.com/nvidia-ai/how-to-deploy-almost-any-hugging-face-model-on-nvidia-triton-inference-server-with-an-8ee7ec0e6fc4