# Full Stack Practice of LLM Training - LLM Deployment @ RLChina 2024

- Authors: [Cheng Deng](https://www.cdeng.net/) [✉️](mailto:davendw49@gmail.com), [Jun Wang](http://www0.cs.ucl.ac.uk/staff/jun.wang/)

---
## Main Task

Given the high cost of pretraining, we will load pre-trained weights into our custom-built architecture. This section covers how to load pretrained weights from models such as Llama, Phi, Gemma, and Mistral, or from a custom LLM, using frameworks such as Transformers, vLLM, and the llama.c library. Before moving on to LLM deployment, we will go through the main methods of LLM evaluation.

![](https://www.cdeng.net/resources/imgs/RLChina24/d.png)

Here is the prerequisite knowledge required:

- `Transformers` model
- Basic OS knowledge

Now, let's start with the preparation: package installation, followed by package and utility imports.

In [None]:
!pip install vllm

And then import the essential packages.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

## 0. Warm-up

First, let's warm up by loading a model the usual way with the Hugging Face `Transformers` library.

In [None]:
checkpoint = "HuggingFaceTB/SmolLM-135M"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

In [None]:
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


def print_hello_world():
    print("Hello World!")

# print_hello_

In [None]:
# Free the warm-up model once generation is done
del model


## 1. Model Evaluation

### Introduction to LLM Evaluation

#### Dataset Preparation
   - **Multiple-Choice Format**: Create a dataset where each question has a set of possible answers (usually labeled as A, B, C, D, etc.).
   - **Correct Answer Annotation**: Ensure each question has the correct answer clearly marked.
   - Example format:
     ```json
     {
       "question": "What is the capital of France?",
       "choices": ["A. Paris", "B. London", "C. Rome", "D. Madrid"],
       "correct_answer": "A"
     }
     ```

#### Prompting the Model
   - You can prompt the LLM to select the correct answer by presenting the question and the available choices. Structure the input in a way that the LLM understands it is a multiple-choice task.
   - Example prompt for GPT-style models:
     ```
     Question: What is the capital of France?
     A) Paris
     B) London
     C) Rome
     D) Madrid
     Choose the correct answer:
     ```
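
The two formats above can be connected with a small helper. The following is a minimal sketch, assuming each record follows the JSON layout shown earlier and that every choice string starts with a label such as `A. `; the function name `format_prompt` is hypothetical.

```python
def format_prompt(record):
    """Render one multiple-choice record as a prompt string."""
    lines = [f"Question: {record['question']}"]
    for choice in record["choices"]:
        # "A. Paris" -> label "A", text "Paris"
        label, text = choice.split(". ", 1)
        lines.append(f"{label}) {text}")
    lines.append("Choose the correct answer:")
    return "\n".join(lines)


record = {
    "question": "What is the capital of France?",
    "choices": ["A. Paris", "B. London", "C. Rome", "D. Madrid"],
    "correct_answer": "A",
}
print(format_prompt(record))
```

Keeping the prompt template in one function makes it easier to evaluate every model with exactly the same wording.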

#### Model Inference
   - Run the model on each question. Depending on the design of the LLM, you can:
     - **Direct Answer Selection**: The model directly outputs one of the choices (e.g., "A" for Paris).
     - **Text Completion**: The model completes the sentence with the correct answer (e.g., "The capital of France is Paris.") and then maps the output back to the correct choice.
     - **Logits Comparison**: For models that output probability distributions over tokens, you can compare the logits or probabilities of each choice and select the one with the highest score.
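
For the first two strategies, the raw model output has to be mapped back to a choice label. The sketch below is one possible heuristic, not a standard parser: it first looks for a leading label such as `A`, `A)` or `(A)`, and otherwise checks whether exactly one choice text appears in the output. The name `extract_choice` and the `choices` dictionary layout are assumptions made for this example.

```python
import re


def extract_choice(output, choices):
    """Map free-form model output to a choice label, or None if unclear."""
    text = output.strip()

    # Direct answer selection: the output starts with a label like "A", "A)" or "(A)"
    match = re.match(r"\(?([A-D])[\).:]?(?:\s|$)", text)
    if match:
        return match.group(1)

    # Text completion: accept the answer only if exactly one choice text is mentioned
    hits = [label for label, answer in choices.items()
            if answer.lower() in text.lower()]
    return hits[0] if len(hits) == 1 else None


choices = {"A": "Paris", "B": "London", "C": "Rome", "D": "Madrid"}
print(extract_choice("A) Paris", choices))
print(extract_choice("The capital of France is Paris.", choices))
print(extract_choice("Paris or London", choices))
```

A heuristic like this can misfire (for example, on an output that begins with the article "A"), so it is worth logging the unparsed outputs and inspecting them by hand.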

#### Evaluation Metrics
   
- **Accuracy**: The most straightforward metric, simply the ratio of correctly answered questions to the total number of questions.

$$\text{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}}$$

- **Top-N Accuracy**: If evaluating with more complex answer spaces or ambiguous questions, you might use top-N accuracy, where the model is considered correct if the correct answer is within its top N predicted choices.
- **Log-Loss**: In cases where the model provides probabilities for each answer, log-loss (cross-entropy) can be used to evaluate how confident the model is about its predictions.
- **F1 Score**: If the questions involve multiple correct answers or require ranking, you can compute precision, recall, and the F1 score.
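
As a rough illustration of the first three metrics, the sketch below computes accuracy, top-N accuracy, and log-loss in plain Python. The per-question probabilities are made-up toy values, and the function names are hypothetical.

```python
import math


def accuracy(predictions, answers):
    """Fraction of questions whose predicted label equals the correct label."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def top_n_accuracy(prob_rows, answers, n):
    """Fraction of questions whose correct label is among the n most probable labels."""
    hits = 0
    for probs, answer in zip(prob_rows, answers):
        ranked = sorted(probs, key=probs.get, reverse=True)
        hits += answer in ranked[:n]
    return hits / len(answers)


def log_loss(prob_rows, answers, eps=1e-12):
    """Mean negative log-probability assigned to the correct label."""
    total = -sum(math.log(max(probs[a], eps)) for probs, a in zip(prob_rows, answers))
    return total / len(answers)


# Toy probabilities for two questions (illustrative values only)
prob_rows = [
    {"A": 0.70, "B": 0.10, "C": 0.10, "D": 0.10},
    {"A": 0.40, "B": 0.35, "C": 0.15, "D": 0.10},
]
answers = ["A", "B"]
predictions = [max(probs, key=probs.get) for probs in prob_rows]

print("accuracy:", accuracy(predictions, answers))
print("top-2 accuracy:", top_n_accuracy(prob_rows, answers, n=2))
print("log-loss:", round(log_loss(prob_rows, answers), 4))
```

In this toy case the second question is wrong under plain accuracy but counts as correct under top-2 accuracy, which shows how the two metrics can diverge.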

So, let's go through these three methods.

### Direct Answer Selection (aka Exact Match)

In [None]:
import json
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load MMLU dataset (example using "high_school_geography" topic from MMLU)
# On the Hugging Face Hub, MMLU is hosted as "cais/mmlu" with one config per subject
mmlu_dataset = load_dataset("cais/mmlu", "high_school_geography")


- For API-based models such as ChatGPT, Yiyan, etc.

In [None]:
import openai

# Define your OpenAI API key here
client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def get_model_prediction(prompt, model="gpt-3.5-turbo"):
    # Chat Completions request using the client interface of openai>=1.0
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

def evaluate_exact_match(dataset, model="gpt-3.5-turbo"):
    correct_predictions = 0
    total_questions = len(dataset)

    for idx, example in enumerate(dataset):
        # Construct the prompt for the model
        prompt = example['question']
        for i, choice in enumerate(example['choices']):
            prompt += f"\n{i + 1}: {choice}"
        prompt += "\nAnswer with the number corresponding to the correct choice."

        # Get model prediction
        prediction = get_model_prediction(prompt, model)

        # Check if the prediction matches the answer.
        # MMLU stores the answer as a 0-based index, while the prompt numbers the choices from 1.
        correct_answer = str(int(example['answer']) + 1)
        if prediction == correct_answer:
            correct_predictions += 1

        print(f"Question {idx + 1}/{total_questions} | Prediction: {prediction} | Correct Answer: {correct_answer}")

    # Calculate and return exact match accuracy
    accuracy = correct_predictions / total_questions
    return accuracy

# Evaluate the model on MMLU
dataset_split = mmlu_dataset['validation']  # or 'test' depending on the split
accuracy = evaluate_exact_match(dataset_split)
print(f"Exact Match Accuracy: {accuracy * 100:.2f}%")

- For models on Hugging Face.

In [None]:
# Define your Hugging Face model and tokenizer
model_name = "gpt2"  # Replace with the desired HF model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def get_model_prediction(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Generate at most 10 new tokens after the prompt
        output = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)
    # The decoded text contains the prompt followed by the generated continuation
    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    return prediction.strip()

def evaluate_exact_match(dataset, model, tokenizer):
    correct_predictions = 0
    total_questions = len(dataset)

    for idx, example in enumerate(dataset):
        # Construct the prompt for the model
        prompt = example['question']
        for i, choice in enumerate(example['choices']):
            prompt += f"\n{i + 1}: {choice}"
        prompt += "\nAnswer with the number corresponding to the correct choice."

        # Get model prediction
        prediction = get_model_prediction(prompt, model, tokenizer)

        # Extract the predicted answer (assuming it ends with the choice number)
        predicted_answer = prediction.split()[-1]

        # Check if the prediction matches the answer.
        # MMLU stores the answer as a 0-based index, while the prompt numbers the choices from 1.
        correct_answer = str(int(example['answer']) + 1)
        if predicted_answer == correct_answer:
            correct_predictions += 1

        print(f"Question {idx + 1}/{total_questions} | Prediction: {predicted_answer} | Correct Answer: {correct_answer}")

    # Calculate and return exact match accuracy
    accuracy = correct_predictions / total_questions
    return accuracy

# Evaluate the model on MMLU
dataset_split = mmlu_dataset['validation']  # or 'test' depending on the split
accuracy = evaluate_exact_match(dataset_split, model, tokenizer)
print(f"Exact Match Accuracy: {accuracy * 100:.2f}%")

### Text Completion

In [None]:
import json
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load MMLU dataset (example using "high_school_geography" topic from MMLU)
# On the Hugging Face Hub, MMLU is hosted as "cais/mmlu" with one config per subject
mmlu_dataset = load_dataset("cais/mmlu", "high_school_geography")

# Define your Hugging Face model and tokenizer
model_name = "gpt2"  # Replace with the desired HF model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def get_perplexity(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs.input_ids)
    # outputs.loss is the mean negative log-likelihood per predicted token,
    # so the perplexity is exp(loss)
    perplexity = np.exp(outputs.loss.item())
    return perplexity

def evaluate_perplexity(dataset, model, tokenizer):
    correct_predictions = 0
    total_questions = len(dataset)

    for idx, example in enumerate(dataset):
        question = example['question']
        min_perplexity = float('inf')
        best_choice = None

        # Calculate perplexity for each choice
        for choice in example['choices']:
            prompt = f"{question} {choice}"
            perplexity = get_perplexity(prompt, model, tokenizer)

            if perplexity < min_perplexity:
                min_perplexity = perplexity
                best_choice = choice

        # Check if the choice with lowest perplexity matches the answer
        # (MMLU stores the answer as a 0-based index into `choices`)
        correct_answer = example['choices'][int(example['answer'])]
        if best_choice == correct_answer:
            correct_predictions += 1

        print(f"Question {idx + 1}/{total_questions} | Best Choice: {best_choice} | Correct Answer: {correct_answer} | Min Perplexity: {min_perplexity}")

    # Calculate and return perplexity-based accuracy
    accuracy = correct_predictions / total_questions
    return accuracy

# Evaluate the model on MMLU for perplexity
dataset_split = mmlu_dataset['validation']  # or 'test' depending on the split
accuracy = evaluate_perplexity(dataset_split, model, tokenizer)
print(f"Perplexity-Based Accuracy: {accuracy * 100:.2f}%")
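
The ranking rule used in the cell above can be illustrated without loading a model. Perplexity is the exponential of the mean negative log-likelihood per token, and the candidate with the lowest perplexity is selected. The sketch below uses made-up per-token log-probabilities; the helper name `perplexity` and the numbers are assumptions for illustration only.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Sanity check: a uniform guess over 4 tokens has perplexity 4 (up to rounding)
print(perplexity([math.log(0.25)] * 5))

# Hypothetical per-token log-probabilities for two candidate completions
candidate_logprobs = {
    "Paris": [-0.2, -0.1, -0.3],
    "London": [-0.2, -0.1, -2.5],
}
best = min(candidate_logprobs, key=lambda c: perplexity(candidate_logprobs[c]))
print("Lowest-perplexity candidate:", best)
```

Note that the cell above scores the whole "question + choice" string, so the question tokens are part of the average; scoring only the choice tokens is a common alternative when the choices differ a lot in length.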

### Logits Comparison

In [None]:
import json
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

# Load MMLU dataset (example using "high_school_geography" topic from MMLU)
# On the Hugging Face Hub, MMLU is hosted as "cais/mmlu" with one config per subject
mmlu_dataset = load_dataset("cais/mmlu", "high_school_geography")

# Define your Hugging Face model and tokenizer
model_name = "gpt2"  # Replace with the desired HF model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def get_logits_probability(prompt, model, tokenizer, options):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Next-token distribution at the last position of the prompt (computed once)
    next_token_probs = F.softmax(outputs.logits[0, -1, :], dim=-1)

    probabilities = {}
    for option in options:
        # Use the first token of each option string as its label token
        option_token_id = tokenizer(option, add_special_tokens=False).input_ids[0]
        probabilities[option] = next_token_probs[option_token_id].item()
    return probabilities

def evaluate_logits(dataset, model, tokenizer):
    correct_predictions = 0
    total_questions = len(dataset)
    labels = ["A", "B", "C", "D"]

    for idx, example in enumerate(dataset):
        # Build a prompt that lists the labeled choices, so the next token can be an option letter
        prompt = example['question']
        for label, choice in zip(labels, example['choices']):
            prompt += f"\n{label}. {choice}"
        prompt += "\nAnswer:"
        # Note: with BPE tokenizers such as GPT-2's, " A" (leading space) is a different token from "A"
        options = ["A", "B", "C", "D", "a", "b", "c", "d"]

        # Calculate probabilities for each option
        probabilities = get_logits_probability(prompt, model, tokenizer, options)

        # Find the option with the highest probability
        best_option = max(probabilities, key=probabilities.get)

        # MMLU stores the answer as a 0-based index, which maps to the labels A-D
        correct_label = labels[int(example['answer'])]

        if best_option.lower() == correct_label.lower():
            correct_predictions += 1

        print(f"Question {idx + 1}/{total_questions} | Predicted: {best_option} | Correct Answer: {correct_label} | Probability: {probabilities[best_option]:.4f}")

    # Calculate and return accuracy
    accuracy = correct_predictions / total_questions
    return accuracy

# Evaluate the model on MMLU using logits comparison
dataset_split = mmlu_dataset['validation']  # or 'test' depending on the split
accuracy = evaluate_logits(dataset_split, model, tokenizer)
print(f"Logits-Based Accuracy: {accuracy * 100:.2f}%")
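
The probabilities printed above come from a softmax over the full vocabulary, so they are usually small and do not sum to one across the options. If a distribution over the four labels is needed (for example, to compute log-loss), one option is to renormalize over the option logits only. This is a minimal NumPy sketch with toy logits; `option_probabilities` is a hypothetical helper, not part of any library.

```python
import numpy as np


def option_probabilities(option_logits):
    """Softmax restricted to the option labels, returned as {label: probability}."""
    labels = list(option_logits)
    scores = np.array([option_logits[label] for label in labels], dtype=np.float64)
    scores -= scores.max()  # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return {label: float(p) for label, p in zip(labels, probs)}


# Toy logits for the four label tokens (illustrative values only)
toy_logits = {"A": 2.0, "B": 1.0, "C": 0.0, "D": -1.0}
probs = option_probabilities(toy_logits)
best = max(probs, key=probs.get)
print(best, {label: round(p, 3) for label, p in probs.items()})
```

The predicted label is unchanged by this renormalization, because softmax preserves the ordering of the logits; only the reported confidence changes.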

## 2. Model Service Deployment (GPU)

In [None]:
import torch
from vllm import LLM, SamplingParams


In [None]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# initialize
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model=checkpoint, dtype="float32")

# perform the inference
outputs = llm.generate(prompts, sampling_params)

# print outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


## 3. Model On-device Deployment (CPU)

In general, practical frameworks for CPU deployment include:

- ONNX Runtime with OpenVINO - Converts and optimizes models for efficient inference.
- DeepSparse by Neural Magic - Takes advantage of sparsity for ultra-fast inference on CPUs.
- TensorFlow Lite - Can be used for efficient CPU-based inference with optimizations.
- MLC & PowerInfer2 - Lightweight inference frameworks that can be adapted to a CPU-centric setting.

- First, we use MLC in Python for model deployment and inference, which involves tools like TVM and MLC LLM. The following is a step-by-step guide. Let's start by installing the required packages.

In [2]:
!python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu
!pip install onnx onnxruntime

Looking in links: https://mlc.ai/wheels


Then import these packages together with the other libraries we will use.

In [31]:
import tvm
from tvm import relay
from tvm.contrib import graph_executor
import numpy as np
import torch
import torch.onnx
from transformers import AutoTokenizer, AutoModelForCausalLM
import onnxruntime as ort

We take GPT-2 (124M parameters) as an example to see how inference works on a CPU. We start by exporting GPT-2 to ONNX.

In [5]:
# Define your Hugging Face model and tokenizer
model_name = "gpt2"  # Replace with the desired HF model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume you have a trained model saved in `model.pth`
# model = torch.load("path/to/your_model.pth")
model.eval()  # Switch to evaluation mode for exporting

# Prepare Dummy Input for Export:
# You need a dummy input that matches the expected input type of GPT-2:
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

input_ids = inputs["input_ids"]  # This is the tensor containing tokenized input



Export the GPT-2 model to ONNX format using torch.onnx.export:

In [7]:
torch.onnx.export(
    model,
    input_ids,
    "gpt2.onnx",
    input_names=["input_ids"],
    output_names=["logits"],  # The first output of a causal LM is the logits tensor
    dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence_length"},
                  "logits": {0: "batch_size", 1: "sequence_length"}},  # Dynamic batch and sequence length
    opset_version=14
)

Explanation:

- `input_ids`: This is the input tensor containing tokenized input. GPT-2 expects the input tensor to be of type LongTensor (int64).
- **Dynamic Axes**: GPT-2 can work with variable batch sizes and sequence lengths. Defining "batch_size" and "sequence_length" as dynamic helps to run inference on inputs of different lengths.

Then we will convert ONNX Model to TVM Relay.

Once you have the GPT-2 model in ONNX format, you can convert it to TVM's Relay IR and then compile it. For GPT-2, the inputs are different compared to image models, and special attention should be given to the input tensor types and shapes.

In [8]:
import onnx

# Load the ONNX model
onnx_model = onnx.load("gpt2.onnx")

# Print input names (avoid shadowing the built-in `input`)
for graph_input in onnx_model.graph.input:
    print(f"Input name: {graph_input.name}")

Input name: input_ids


This will help you determine the input names, which you need when creating shape_dict. GPT-2 expects a tokenized input of variable length. We'll need to define the shape based on your typical input length:

In [9]:
shape_dict = {"input_ids": (1, 10)}  # Batch size of 1, sequence length of 10 tokens (can be adjusted)
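
Because the shape is fixed at `(1, 10)` here, any tokenized input has to be padded or truncated to exactly 10 tokens before it is passed to the compiled model. The sketch below shows one way to do that with NumPy. The helper name `fit_to_length` is hypothetical, and using GPT-2's end-of-text id (50256) as the padding value is an assumption, since GPT-2 defines no dedicated pad token.

```python
import numpy as np


def fit_to_length(token_ids, length, pad_id):
    """Return an int64 array of shape (1, length) and the index of the last real token.

    Assumes token_ids is non-empty.
    """
    ids = list(token_ids)[-length:]               # keep the most recent tokens if too long
    last_real = len(ids) - 1                      # position whose logits predict the next token
    ids = ids + [pad_id] * (length - len(ids))    # right-pad if too short
    return np.array([ids], dtype=np.int64), last_real


padded, last_real = fit_to_length([15496, 11, 703], length=10, pad_id=50256)
print(padded.shape, last_real)
```

With right padding and causal attention, the logits at `last_real` should not depend on the padding tokens that follow, so the next-token prediction can be read from that position rather than from the final one.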

Then, we use the following script to convert the ONNX model to TVM:

In [10]:
import tvm
from tvm import relay

# Convert the ONNX model to Relay format
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Set target for CPU compilation
target = "llvm"
dev = tvm.cpu()

# Compile the model with TVM
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)



The `target = "llvm"` setting in TVM refers to **LLVM (Low-Level Virtual Machine)** as the compilation target, which is used to generate optimized code for **CPU** inference. Here’s a detailed explanation:

### Why Use `llvm` as the Target?
1. **General CPU Backend**:
   - In TVM, `"llvm"` is used to represent a **general-purpose CPU backend**. By setting the target to `"llvm"`, TVM will generate machine code optimized for your CPU. LLVM is a popular compiler infrastructure that allows TVM to generate highly optimized native code for a wide variety of CPU architectures, including x86, ARM, etc.
   - This is why `"llvm"` is commonly used when you want to run inference on a **CPU**, regardless of the specific type of CPU in your machine.

2. **Cross-Platform Compilation**:
   - LLVM can be used to generate code for different types of CPUs, which makes it flexible and well-suited for **cross-compilation**. This means you can compile a model on one type of CPU and run it on another type.
   - For example, you could use LLVM to compile on an x86 machine and then execute the code on an ARM-based CPU.

3. **Optimized Performance**:
   - LLVM enables various optimizations that are essential for deep learning workloads, such as **loop unrolling**, **vectorization**, and **instruction selection**. By using LLVM as the target, TVM can generate highly optimized machine-level code that takes advantage of CPU-specific features, such as vector instructions (e.g., AVX, NEON).
   - This optimization process helps improve the performance of the inference on CPU.

4. **Ease of Use**:
   - The `"llvm"` target is the most straightforward target for many developers because it’s CPU-agnostic and doesn’t require specifying complex hardware details.
   - It allows developers to use TVM without needing to worry about hardware-specific intricacies.

### Examples of TVM Targets
Besides `"llvm"`, TVM supports various targets, each suited to different hardware:

- **"cuda"**: Used for compiling models to run on **NVIDIA GPUs**. This enables the use of CUDA to leverage GPU acceleration.
- **"opencl"**: Used for **OpenCL-compatible GPUs**, FPGAs, or other hardware accelerators.
- **"rocm"**: Target for **AMD GPUs** that support ROCm.
- **"metal"**: Target for running models on **macOS and iOS** devices using Apple's Metal API for GPU support.
- **"llvm -mtriple=aarch64-linux-gnu"**: Can be used for **cross-compiling** to an ARM architecture CPU, typically for ARM-based devices such as embedded systems or mobile devices.

### Example Use Cases
- If you are deploying a model on a **desktop server or a cloud server** that uses CPUs, using `"llvm"` is appropriate.
- For deploying a model on an **NVIDIA GPU**, you would use `"cuda"` as the target to ensure the generated code takes full advantage of GPU capabilities.

***Back to the Code***

>Once compiled, we run inference using TVM's Graph Executor. You need to prepare an input tensor that matches the expected input format.

Let's look at the raw output logits first.

In [11]:
from tvm.contrib import graph_executor
import numpy as np

# Create runtime executor module
module = graph_executor.GraphModule(lib["default"](dev))

# Create input tensor (make sure it matches the expected type and shape)
dummy_input = np.random.randint(0, 50257, (1, 10)).astype("int64")  # GPT-2 has a vocabulary size of 50257 (token IDs 0-50256)

# Set the input and run the model
module.set_input("input_ids", dummy_input)
module.run()

# Get the output
output = module.get_output(0).numpy()
print("Model output:", output)

Model output: [[[ -30.774525  -30.635603  -33.38208  ...  -39.944157  -38.41505
    -31.39816 ]
  [-106.73038  -104.11879  -112.616196 ... -119.68652  -116.35712
   -110.7905  ]
  [ -90.46258   -92.32849   -99.19918  ...  -97.21706  -101.02731
    -94.703636]
  ...
  [ -84.6223    -85.16498   -86.904564 ...  -93.811134  -95.23654
    -86.65767 ]
  [ -80.349815  -79.701195  -80.61562  ...  -88.192     -87.88979
    -79.997856]
  [ -77.12928   -76.443954  -77.38651  ...  -82.53172   -85.08695
    -76.50649 ]]]


Then, we can predict the next token. This cell runs the exported ONNX model with ONNX Runtime.

In [33]:
# Load the pretrained GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Create a natural language input
input_text = "Hello, what is your name"

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="np")
input_ids = inputs["input_ids"]  # This will be a NumPy array of token IDs

# Load the ONNX model
ort_session = ort.InferenceSession("gpt2.onnx")

# Use the tokenized input IDs for inference
outputs = ort_session.run(None, {"input_ids": input_ids})

# Get the logits output from the ONNX model
logits = outputs[0]

next_token_logits = logits[:, -1, :]  # Get the logits for the last token in the sequence
next_token_id = np.argmax(next_token_logits, axis=-1)  # Choose the token with the highest probability
next_token = tokenizer.decode(next_token_id)
print("Generated word:", next_token)



Generated word: ?
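
Taking the `argmax` hides how confident the model is. To inspect the alternatives, the logits can be turned into probabilities with a softmax and the top candidates listed. The sketch below does this on a toy logits vector; with the real model, `next_token_logits[0]` from the cell above would take the place of `toy_logits`. The helper name `top_k_tokens` is hypothetical.

```python
import numpy as np


def top_k_tokens(logits, k):
    """Return the k most probable token ids with their softmax probabilities."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()               # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    top_ids = np.argsort(probs)[::-1][:k]         # ids sorted by descending probability
    return [(int(i), float(probs[i])) for i in top_ids]


toy_logits = np.array([1.0, 3.0, 0.5, 2.0])
for token_id, prob in top_k_tokens(toy_logits, k=2):
    print(token_id, round(prob, 3))
```

Each returned id could then be passed to `tokenizer.decode` to see the corresponding text.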


And, finally, we generate the whole sentence.

In [39]:
generated_ids = list(input_ids[0]) # Start with the original input

for _ in range(20):  # Generate up to 20 tokens
    # Convert to numpy and run inference
    input_array = np.array([generated_ids], dtype=np.int64)
    logits = ort_session.run(None, {"input_ids": input_array})[0]

    # Get logits for the last token
    next_token_logits = logits[:, -1, :]
    next_token_id = np.argmax(next_token_logits, axis=-1)

    # Append next token to generated sequence
    generated_ids.append(next_token_id[0])

    # Stop if end-of-sequence token is generated (for GPT-2, token 50256)
    if next_token_id[0] == 50256:
        break

# Convert generated token IDs back to text
generated_text = tokenizer.decode(generated_ids)
print("Generated text:", generated_text)

Generated text: Hello, what is your name?

I'm a student at the University of California, Berkeley. I'm a student at
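
The output above starts to repeat itself, which is a common side effect of greedy (`argmax`) decoding. The vLLM example earlier used `temperature=0.8` and `top_p=0.95`; the sketch below shows what those two parameters conceptually do to a logits vector. It is a simplified NumPy illustration, not vLLM's implementation, and `sample_top_p` is a hypothetical helper.

```python
import numpy as np


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample a token id using temperature scaling and nucleus (top-p) filtering."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()

    order = np.argsort(probs)[::-1]                # token ids by descending probability
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix of tokens whose total probability reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()   # renormalize over the kept tokens
    return int(rng.choice(keep, p=kept_probs))


toy_logits = np.array([2.0, 1.5, 0.3, -1.0])
rng = np.random.default_rng(0)
samples = [sample_top_p(toy_logits, rng=rng) for _ in range(5)]
print(samples)
```

In the generation loop above, replacing `np.argmax(next_token_logits, axis=-1)` with a call like `sample_top_p(next_token_logits[0])` would be one way to introduce sampling; the result then varies from run to run unless the random generator is seeded.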


## Reference