<img src="./images/DLI_Header.png" style="width: 400px;">

# 2. Unoptimized deployment of GPT-J 

In this lab, we will look at several strategies for deploying large models. This notebook starts with a basic example of how to run inference for GPT-J. We will not apply any optimizations to the model; instead, we will deploy this 6B parameter model using PyTorch and the out-of-the-box Transformers library. Although this approach is not the most performant option for a production system, it will allow us to run our first inference requests. We will demonstrate how to use Few-Shot Learning to turn our generic language model into a neural machine translation tool for English-to-French translation. We will conclude the notebook by measuring inference latency so that we can compare this baseline with a more optimized version of the model.

The goals of this notebook are to:
* Deploy the 6B parameter GPT-J model using nothing but PyTorch and the Transformers library.
* Learn the basics of prompt engineering, which will allow us to take advantage of the few-shot learning capability of large models.
* Measure the speed of inference to use it as a baseline for the next sections of this lab.

**[2.1 GPT-J 6B deployment with the Transformers library](#2.1)<br>**
**[2.2 Prompt engineering / Few-shot learning](#2.2)<br>**
**[2.3 Inference latency measurement](#2.3)<br>**

## 2.1 GPT-J 6B deployment with the Transformers library
### Transformers library

The Transformers library, developed by HuggingFace, is a toolkit for developing transformer-based architectures for NLP, CV and other machine learning applications. It is backed by a community-based repository hosting thousands of pretrained models from contributors across the globe, covering modalities such as text, vision, and audio. Besides training, the Transformers library can also be used for inference, including inference of large transformer-based architectures. This covers models trained with the Transformers library as well as external models, including those trained with Megatron-LM and other libraries. In this part of the class, we will use it to deploy GPT-J and execute it on a GPU.

### GPT-J 6B 

[GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) is a transformer model trained by the research collective EleutherAI. It was trained using the "Mesh Transformer JAX" library which provides the implementation of both model and pipeline parallelism for JAX. "GPT-J" refers to the class of models (GPT models trained with JAX), while "6B" represents the number of trainable parameters. The model consists of 28 layers with a hidden size of 4096 and a feedforward dimension of 16384. The attention mechanism is composed of 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257 tokens, using the same set of BPEs as GPT-2/GPT-3. GPT-J 6B was trained on "The Pile", a large-scale curated dataset constructed from 22 diverse high-quality subsets such as Wikipedia, Books3, Arxiv, PubMed, GitHub and more.
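
As a sanity check on the "6B" in the name, the figures above can be turned into a rough parameter count. The sketch below is a back-of-the-envelope estimate, not the exact count reported by the library: it ignores biases and layer normalization weights, and it assumes that the input embedding and the output projection are two separate matrices.

```python
# Back-of-the-envelope parameter count for GPT-J 6B from the figures above.
n_layers = 28
d_model = 4096
d_ff = 16384
n_heads = 16
head_dim = 256
vocab_size = 50257

# The attention heads together span the full hidden size.
assert n_heads * head_dim == d_model

# Per layer: four d_model x d_model attention projections (Q, K, V, output)
# plus two feedforward matrices (d_model x d_ff and d_ff x d_model).
attention_params = 4 * d_model * d_model
feedforward_params = 2 * d_model * d_ff
per_layer = attention_params + feedforward_params

# Assumption: input embedding and output projection are separate matrices.
embedding_params = 2 * vocab_size * d_model

total_params = n_layers * per_layer + embedding_params
print(f"Approximate parameter count: {total_params / 1e9:.2f}B")
```

The estimate lands close to 6 billion, which is consistent with the model name.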

### The lab environment
All steps will be carried out within a Docker container built using the following Dockerfile: https://github.com/triton-inference-server/fastertransformer_backend/blob/571a1fce438409087f5d3889237541828cc24ba5/docker/Dockerfile

Additionally, the following Python libraries were installed:
- transformers==4.18.0
- huggingface_hub==0.5.1
- tokenizers==0.12.1
- SentencePiece==0.1.96
- sacrebleu==2.0.0
- jaxlib==0.3.7
- jax==0.3.7

### Single GPU deployment

The 6B parameter model is small enough to fit into the memory of a 16GB V100. We will start with a single GPU deployment for now and move on to model parallel deployment in the next notebook. Let us begin by importing the key dependencies: PyTorch and the Transformers library.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

Let us initialize a pre-trained GPT-J 6B model and the required tokenizer. The model is big, so to limit the time required to download it, we have saved it in a local folder called `weights`. We will use the `from_pretrained()` function with a local path to load this copy of the model. The commented lines show how to download the model from the HuggingFace repository instead.

<b>When this Jupyter server launched, we began downloading the weights of the model in the background. If you have reached this point of the class quickly, it is possible that the download is still in progress. If you face an error in the next step that says ` We could not connect to 'https://huggingface.co' to load this model` or that model/weights can't be found, please wait a couple more minutes for the weights to finish downloading.</b> 

In [None]:
# The lines below demonstrate how to download the pretrained model on your own system. In this lab we have predownloaded the weights for you.
# model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# The model weights are already downloaded, so we load them from the local path.
model = AutoModelForCausalLM.from_pretrained("./weights/gpt-j/hf")
tokenizer = AutoTokenizer.from_pretrained("./weights/gpt-j/hf")

We will use our model in `fp16` format. The model has 6B weights, which in a 32-bit representation would exceed the memory capacity of a V100 16GB (i.e. `6 billion * 4 bytes ~ 24GB`). Using a 2-byte representation (fp16) reduces the size of the model weights by 50% (`6 billion * 2 bytes ~ 12GB`). In addition, fp16 allows us to take advantage of the Tensor Core acceleration available on the GPU. We can do the conversion using the `.half()` method.
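
The arithmetic behind this choice can be written out as a small helper. This is only a sketch of the weight footprint: the function name is ours, the parameter count is rounded to 6 billion, and activations and other runtime buffers need additional memory on top of the weights.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 10**9

NUM_PARAMS = 6_000_000_000  # rounded parameter count of GPT-J 6B
GPU_MEMORY_GB = 16          # V100 16GB used in this lab

for name, nbytes in [("fp32", 4), ("fp16", 2)]:
    size = weight_memory_gb(NUM_PARAMS, nbytes)
    fits = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{name}: {size:.0f} GB of weights -> {fits} in {GPU_MEMORY_GB} GB")
```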

We will also switch the model to evaluation mode with `.eval()`. Evaluation mode changes the behavior of layers such as dropout and batch normalization, which behave differently during training and inference. It does not disable gradient tracking by itself, so the common practice for evaluation/validation is to combine `model.eval()` with `torch.no_grad()` to turn off gradient computation:
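
To see the two mechanisms separately, here is a minimal sketch on a tiny stand-in network (not GPT-J; the layer sizes are arbitrary). It assumes only that PyTorch is installed and runs on the CPU.

```python
import torch

# A tiny stand-in model with dropout; GPT-J itself is not needed for this demo.
toy_model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(p=0.5))
x = torch.ones(2, 4)

# Training mode: dropout is active and the output tracks gradients.
toy_model.train()
train_out = toy_model(x)

# Evaluation mode plus no_grad: dropout is a no-op and no graph is recorded.
toy_model.eval()
with torch.no_grad():
    eval_out_a = toy_model(x)
    eval_out_b = toy_model(x)

print("Deterministic in eval mode:", torch.equal(eval_out_a, eval_out_b))
print("Gradients tracked in train mode:", train_out.requires_grad)
print("Gradients tracked under no_grad:", eval_out_a.requires_grad)
```

`.eval()` controls layer behavior, while `torch.no_grad()` controls whether the autograd graph is built, which is why the two are typically used together for inference.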

In [None]:
assert torch.cuda.is_available()
device = torch.device("cuda:0")
model.half().to(device)
model = model.eval()

Now that we have loaded our model, we are ready for inference. Since this is a generative model and we are not providing it with any prompts to guide its behavior, the model will generate random sentences. We will look at how to change that in just a minute.

In [None]:
# Generate a random sentence.
with torch.no_grad():
    output = model.generate(input_ids=None, max_length=128, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

The generated sentences cannot be read in their current format. We need to decode them back from tokens to text to be able to print them.

In [None]:
# Decoding the generated text
for sentence in output:
    sentence = sentence.tolist()
    text = tokenizer.decode(sentence, clean_up_tokenization_spaces=True)
    print(text)

## 2.2 Prompt engineering / Few-shot learning

During the lecture, we discussed the fact that the bigger the model gets, the more sample efficient it becomes. Bigger models, once pretrained, become Few-Shot learners that generalize exceptionally well: with just a few samples, they can adapt to new, previously unseen tasks. Few-Shot Learning refers to the practice of giving a machine learning model a small number of examples to guide its predictions. Large generative models can be provided with just a few examples of a new task at inference time, without changing any model weights. This contrasts with standard fine-tuning techniques, which require a large amount of training data for the pre-trained model to adapt to the desired task.

The text that carries those few examples is commonly referred to as a "prompt". A prompt typically consists of a text describing the problem together with zero, one or a few examples of the task we want the model to carry out (hence zero-, one- and few-shot learning). Few-Shot Learning can be used with Large Language Models because they have implicitly learned to perform a wide range of tasks during their pre-training on large text datasets. This enables the model to generalize, that is, to handle related but previously unseen tasks with just a few examples.

Let us try to do Few-Shot inference with the GPT-J model. We will attempt to adapt our model to carry out translation from <b>English</b> to <b>French</b>. We will do that by providing the model with three examples of translation. In the final part of the prompt, we will provide only the English sentence to be translated, which should trigger a translation aligned with the examples provided. E.g.:

<b>"English: What rooms do you have available? French:"</b> 

Using this prompt with multiple examples (Few-Shots), we “demonstrate” to the model what we expect to see in the generated output, and we expect that the model will complete the prompt with a French translation of the sentence provided.
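
The structure of such a prompt can be made explicit with a small helper. The function below is a hypothetical convenience, not part of the Transformers library, and separating the examples with newlines is our own choice; the notebook itself writes the prompt as one literal string.

```python
def build_few_shot_prompt(examples, query, source="English", target="French"):
    """Join (source, target) example pairs and finish with an open-ended query."""
    shots = [f"{source}: {src} {target}: {tgt}" for src, tgt in examples]
    shots.append(f"{source}: {query} {target}:")
    return "\n".join(shots)

examples = [
    ("I do not speak French.", "Je ne parle pas français."),
    ("See you later!", "À tout à l'heure!"),
    ("Where is a good restaurant?", "Où est un bon restaurant?"),
]
prompt = build_few_shot_prompt(examples, "What rooms do you have available?")
print(prompt)
```

Passing an empty list of examples yields a zero-shot prompt, and a single pair yields a one-shot prompt, so the same helper covers all three settings.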

In [None]:
input_ids = tokenizer.encode("English: I do not speak French. French: Je ne parle pas français." \
                             "English: See you later! French: À tout à l'heure!" \
                             "English: Where is a good restaurant? French: Où est un bon restaurant?" \
                             "English: What rooms do you have available? French:", return_tensors="pt").cuda(0)

In [None]:
# Generate translation.
output = model.generate(input_ids=input_ids, max_length=82, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

In [None]:
sentence = output[0].tolist()
text = tokenizer.decode(sentence, clean_up_tokenization_spaces=True)
print(text)

The model should have provided the following translation: `Quel est le nombre de chambres disponibles?` 

If you are a French speaker, you might have noticed that this is not the highest quality translation, as it translates to `How many rooms are available?` instead of the requested sentence: `What rooms do you have available?`

A likely cause is the greedy decoder we used for output generation. Our model generates one token at a time, and at each generation step we took the token with the maximum probability, which can lead to a suboptimal sequence overall. Greedy decoding is one of the simplest approaches, but many other techniques exist that can improve the quality of the generation, such as `Beam Search`, `Top-K` and `Top-P` sampling. Additionally, the output can be further controlled through hyperparameters such as the `Temperature` of the logits or a `Repetition penalty`.
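
The difference between greedy decoding and beam search can be illustrated without a neural network. The sketch below uses a hand-written table of next-token probabilities (entirely hypothetical numbers) in place of a language model, and simplified decoders that ignore end-of-sequence handling.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}

def greedy_decode(start, steps):
    """Pick the single most probable token at every step."""
    sequence, log_prob = [start], 0.0
    for _ in range(steps):
        candidates = NEXT_TOKEN_PROBS[sequence[-1]]
        token = max(candidates, key=candidates.get)
        log_prob += math.log(candidates[token])
        sequence.append(token)
    return sequence, math.exp(log_prob)

def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        expanded = [
            (seq + [token], score + math.log(prob))
            for seq, score in beams
            for token, prob in NEXT_TOKEN_PROBS[seq[-1]].items()
        ]
        beams = sorted(expanded, key=lambda item: item[1], reverse=True)[:num_beams]
    best_seq, best_score = beams[0]
    return best_seq, math.exp(best_score)

greedy_seq, greedy_prob = greedy_decode("The", steps=2)
beam_seq, beam_prob = beam_search("The", steps=2, num_beams=2)
print("greedy:", " ".join(greedy_seq), f"(p={greedy_prob:.2f})")
print("beam:  ", " ".join(beam_seq), f"(p={beam_prob:.2f})")
```

Greedy decoding commits to "nice" because it is the best first step, while beam search keeps "dog" alive long enough to discover that "The dog has" is the more probable sequence overall.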

With that in mind, let us adapt the decoding algorithm and change some of its hyperparameters. 

In [None]:
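# Beam search with 5 beams and a repetition penalty.
# Note: to our understanding, `temperature` only affects sampling-based decoding (do_sample=True),
# so it likely has no effect on this beam search call.
output = model.generate(input_ids=input_ids, max_length=80, num_return_sequences=1, num_beams=5, temperature=0.7, repetition_penalty=3.0, pad_token_id=tokenizer.eos_token_id)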
sentence = output[0].tolist()
text = tokenizer.decode(sentence, clean_up_tokenization_spaces=True)
print(text)

A more robust decoding algorithm creates an output of higher quality: `Quels sont vos chambres disponibles?` 

Learn more about decoding methods here: [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate) 
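
For intuition about the sampling-related knobs mentioned in this section, the sketch below applies temperature, Top-K and Top-P filtering to a handful of made-up logits. The helper functions are illustrative re-implementations in plain Python, not the Transformers internals. In `generate()`, the corresponding arguments are `temperature`, `top_k` and `top_p`, which, as far as we are aware, take effect when sampling is enabled with `do_sample=True`.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical logits for four candidate next tokens.
logits = {"chambres": 2.0, "pièces": 1.0, "salles": 0.5, "maisons": -1.0}
baseline = softmax_with_temperature(logits, temperature=1.0)
sharper = softmax_with_temperature(logits, temperature=0.7)
print("T=1.0:", {t: round(p, 3) for t, p in baseline.items()})
print("T=0.7:", {t: round(p, 3) for t, p in sharper.items()})
print("top-k (k=2):", list(top_k_filter(baseline, k=2)))
print("top-p (p=0.9):", list(top_p_filter(baseline, p=0.9)))
```

Lowering the temperature concentrates probability on the most likely token, while Top-K and Top-P remove the unlikely tail before a token is sampled.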

## Optional task

In the section above, we have demonstrated just a single example of prompt engineering. If you search for "prompt engineering GPT" or "prompt examples GPT" with your favorite search engine, you should find countless examples that can be adapted to your own problem. Below is an example of how to prompt this model to generate SQL. Do you think you can prompt it into writing Python code or solving mathematical equations? Experiment with the code below and use as many internet resources as you want to help you get started. For more detailed information on prompting, please refer to this review paper: https://arxiv.org/pdf/2107.13586.pdf.

In [5]:
input_ids = tokenizer.encode("Create an SQL request to find all users that live in California and have more than 1000 credits.", return_tensors="pt").cuda(0)

In [6]:
# Generate the SQL query with greedy decoding.
output = model.generate(input_ids=input_ids, max_length=82, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

In [None]:
output = model.generate(input_ids=input_ids, max_length=80, num_return_sequences=1, num_beams=5, temperature=0.7, repetition_penalty=3.0, pad_token_id=tokenizer.eos_token_id)
sentence = output[0].tolist()
text = tokenizer.decode(sentence, clean_up_tokenization_spaces=True)
print(text)
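
One way to extend this experiment is to turn the SQL instruction into a few-shot prompt, mirroring the translation example. The sketch below only builds the prompt string; the example questions, table names and column names are invented for illustration, and the `Question:` / `SQL:` labels are just one possible format.

```python
# Hypothetical few-shot examples; the table and column names are invented.
sql_examples = [
    ("Find all users that are older than 30.",
     "SELECT * FROM users WHERE age > 30;"),
    ("Count the orders placed in 2021.",
     "SELECT COUNT(*) FROM orders WHERE year = 2021;"),
]
question = ("Find all users that live in California "
            "and have more than 1000 credits.")

# Each example becomes a Question/SQL pair; the last SQL line is left open.
lines = [f"Question: {q}\nSQL: {sql}" for q, sql in sql_examples]
lines.append(f"Question: {question}\nSQL:")
sql_prompt = "\n\n".join(lines)
print(sql_prompt)
```

The resulting `sql_prompt` can be passed to `tokenizer.encode(...)` in place of the single-sentence instruction used in the cells of this section; remember to raise `max_length` accordingly, since the prompt is longer.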

## 2.3 Inference latency measurement

Now let's have a look at how fast our inference pipeline is. We will measure the time it takes to generate a sequence of 128 tokens.

In [None]:
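# Measure the average latency of generating a sequence of up to 128 tokens.
import time

execution_time = 0
num_iterations = 10
with torch.no_grad():
    for _ in range(num_iterations):
        torch.cuda.synchronize()  # make sure no earlier GPU work is still pending
        start = time.perf_counter()
        output = model.generate(input_ids=None, max_length=128, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id, eos_token_id=50256)
        torch.cuda.synchronize()  # wait for all GPU kernels to finish before stopping the clock
        end = time.perf_counter()
        execution_time += end - start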

In [None]:
print("Average inference time of 128 tokens is:",
     1000 * (execution_time/float(num_iterations)), "ms")
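
The same measurement pattern can be packaged into a reusable helper with warm-up runs and a throughput conversion. This is a sketch under our own naming (`benchmark`, `tokens_per_second` and the `sync` argument are not part of any library), and it times a trivial stand-in workload so that it runs without a GPU.

```python
import statistics
import time

def benchmark(fn, num_warmup=2, num_iterations=10, sync=None):
    """Time fn() repeatedly and return per-call latencies in milliseconds.

    sync is an optional callable run before each clock read; on a GPU it
    could be torch.cuda.synchronize so that queued kernels are included.
    """
    for _ in range(num_warmup):
        fn()  # warm-up runs are not timed
    latencies_ms = []
    for _ in range(num_iterations):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        latencies_ms.append(1000 * (time.perf_counter() - start))
    return latencies_ms

def tokens_per_second(num_tokens, latency_ms):
    """Convert a latency for num_tokens generated tokens into throughput."""
    return num_tokens / (latency_ms / 1000)

# Stand-in workload so the sketch runs without a GPU.
latencies = benchmark(lambda: sum(range(100_000)), num_iterations=5)
print(f"mean {statistics.mean(latencies):.2f} ms, "
      f"median {statistics.median(latencies):.2f} ms")
```

For the model in this notebook, `fn` could wrap the `model.generate(...)` call and `sync` could be `torch.cuda.synchronize`. Reporting the median alongside the mean makes the result less sensitive to a single slow iteration.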

With this unoptimized setup, generating 128 tokens takes about 6.3 seconds (the exact figure will vary from run to run). Let us move to the next notebook and test an optimized inference pipeline.

<h2 style="color:green;">Congratulations!</h2>

Great job finishing this notebook! Please proceed to: [Inference of the GPT-J 6B model with FasterTransformer.](03_FTRunInferenceOfTheGPT-J.ipynb)
