# Running LLMs on Palmetto

## Background on LLMs

### LLM Overview

#### Conceptual overview

LLMs are a class of models that are designed to predict the next token (word, or chunk of a word) in a sequence of tokens. They are trained on a truly vast amount of text data.
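
To make "predicting the next token" concrete, here is a toy sketch. It is not how a real LLM works internally: it only counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. The corpus and function names are illustrative assumptions; real LLMs use learned neural networks over subword tokens rather than word counts.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Tiny made-up corpus, split on whitespace so that each word is one "token"
corpus = "the cat sat on the mat and the cat slept".split()
counts = train_bigram_counts(corpus)

print(predict_next(counts, "the"))  # "cat" follows "the" twice, "mat" only once
```

An LLM does the same job, choosing a likely next token given what came before, but conditions on the whole preceding sequence instead of only the last word.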

LLMs these days tend to be *Transformer-based neural networks*. If you're interested in learning more about the Transformer architecture, we have a workshop in which we build and train a Transformer-based LLM from scratch using PyTorch. 

In this workshop, we will not dwell on the Transformer architecture, the mathematics behind these models, or how they are trained. Instead, we will focus on how to use pre-trained LLMs to generate text efficiently on Palmetto. Everything we discuss here also extends to multimodal LLMs.

Running an LLM can be simple or complicated, depending on how you want to use it. A very simple case would be the following code:

In [5]:
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]

pipe = pipeline("text-generation", 
                model="microsoft/Phi-3.5-mini-instruct", 
                trust_remote_code=True, 
                device=0,)

output = pipe(messages)
print('\n\n' + output[0]['generated_text'][-1]['content'])

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0




 I am Phi, created by Microsoft. I'm an AI language model designed to help


In [7]:
# Clear the pipeline from memory
import torch
del pipe
torch.cuda.empty_cache()

#### System prompts and user prompts

When you interact with an LLM chatbot such as ChatGPT, Claude or Gemini, you provide a prompt. Casual LLM users may not be aware, however, that this is only part of the *full* prompt that is given to the LLM. The full prompt is actually a concatenation of two parts: the *system prompt* and the *user prompt*.  What you type into the chatbot interface is only the user prompt.

The system prompt is a fixed string that is prepended to the user prompt before the full prompt is given to the LLM. The system prompt is used to set the context for the user prompt. For example, the system prompt might be something like 
> You are a helpful chatbot. Today's date is 2025-01-25. You are talking to a human. Be friendly and helpful, but do not give them medical advice or help them write malicious software code.

So, if you, a user, type in:
> Hey, what should I take for a rapidly worsening cough and headache?

The LLM will "see" the instructions in the system prompt followed by that user message, and will then try to "predict", token by token, what would be written by a helpful chatbot in that situation.
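
As a rough sketch of that concatenation, the snippet below joins a system prompt and a user prompt into the single string a model would receive. The `[SYSTEM]`, `[USER]`, and `[ASSISTANT]` labels and the function name are hypothetical; each real model defines its own format.

```python
def build_full_prompt(system_prompt, user_prompt):
    """Join the two prompts into the single string the model would see.

    The role labels are a made-up format, used here for illustration only.
    """
    return (
        f"[SYSTEM]\n{system_prompt}\n\n"
        f"[USER]\n{user_prompt}\n\n"
        "[ASSISTANT]\n"
    )


system_prompt = "You are a helpful chatbot. Be friendly, but do not give medical advice."
user_prompt = "Hey, what should I take for a rapidly worsening cough and headache?"

full_prompt = build_full_prompt(system_prompt, user_prompt)
print(full_prompt)
```

The user prompt is identical no matter which system prompt is used, yet the string the model completes is different. That is why an unseen change to the system prompt can change your results.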

When you use e.g. ChatGPT through its web interface, you can neither see nor control the system prompt. Gaining this control is one of many reasons to work with LLMs through code rather than through a web interface. **This control is especially important for scientific research using LLMs.** If OpenAI changes their system prompt tomorrow, your user prompts will get different results from what they get today, and the nature of that difference will be opaque to you! This is not a good situation for reproducibility.

In [4]:
# Use a pipeline as a high-level helper
from transformers import pipeline
import torch

messages = [
    {"role": "system", "content": "You are a rude assistant. You answer the user's questions, but you always make sure to express your annoyance and irritation."},
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", 
                model="microsoft/Phi-3.5-mini-instruct",
                trust_remote_code=True,
                max_new_tokens=256,
                device=0,)

output = pipe(messages)
print('\n\n' + output[0]['generated_text'][-1]['content'])

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0




 Alright, alright, calm down. I'm Phi, an AI developed by Microsoft. I'm here to help answer your questions and assist you with information. Now, let's get back to your original question. I'm Phi, Phi, Phi. What can I do for you today?


In [None]:
# Clear the pipeline from memory
import torch
del pipe
torch.cuda.empty_cache()

#### Using LLMs on the Cluster

How is using an LLM on the Cluster different from using a web interface like ChatGPT? How is it different from using an LLM through an API like OpenAI's API?

1. **Control:** Aside from control over the system prompt, you have control over the model itself. You can choose which model to use. This is again **crucial for reproducibility**. Third-party services like Anthropic, Google and OpenAI can and do change or take away models without warning. This can undermine your research. 
2. **Cost:** Since you all already have access to the Cluster, you can run LLMs on the Cluster for free. This is not the case with third-party services.
3. **Privacy:** When running LLMs on the Cluster, your data never need leave the Cluster. This is not the case with third-party services.
4. **Speed:** The Cluster is a powerful computing resource. You can run LLMs on the Cluster much faster than you can on your own computer. This is especially true if you use a GPU on the Cluster, and if you batch your data.
5. **No persistent server:** While you can run LLM inference on the Palmetto Cluster, the Cluster does not support hosting a persistent, externally accessible service for real-time LLM interaction, like ChatGPT. HPC systems like the Cluster are optimized for batch processing and resource-intensive jobs, not for real-time interaction, especially with external users.


### LLM Size

#### Sizes of commonly used LLMs

LLMs are enormous; while small ones can run reasonably well on a good laptop, larger ones require the kind of hardware you would only find in an HPC cluster like Palmetto. 

We don't know how big chatbots like GPT-4o are, because OpenAI won't tell us. But models with similar performance typically have *at least tens, and often hundreds, of billions of parameters*.

That just means that the mathematical object that is the model is composed of that many numbers. Each such number is typically represented by a 16-bit floating point number, which is 2 bytes. So, a 100-billion-parameter model would be about 200 GB in size. To run that model, it's not enough to store that on your SSD or HDD; you need to load it into memory -- preferably GPU memory! *Nobody's laptop has that much memory.*
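
The arithmetic above is easy to reproduce. This small sketch (the function name is our own) converts a parameter count and a bit width into an approximate size in gigabytes. It counts only the weights and ignores the extra memory needed while the model is running.

```python
def model_size_gb(num_parameters, bits_per_parameter=16):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9


print(model_size_gb(100e9))  # 200.0, the 100-billion-parameter example above
print(model_size_gb(3e9))    # a 3-billion-parameter model at 16 bits
```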

Even if you can manage to run a 1- or 3-billion parameter model on your laptop, it will typically be much easier, faster and more efficient to run it on Palmetto.

#### Quantization

One way to run larger models on smaller hardware is to use quantization. This is sometimes useful both for running LLMs on your laptop and for running them on Palmetto.

Quantization represents the model parameters with fewer bits than the 16-bit floating point numbers that are typically used. This can reduce the size of the model by a factor of 2, 4, or more. It can also speed up the model. It comes at a cost, however: quantization can reduce the performance of the model.
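
To illustrate the trade-off, here is a minimal sketch of one simple scheme, symmetric "absmax" quantization to 8-bit integers, written in plain Python. The weights are made-up numbers and the function names are our own. Production quantization libraries use more elaborate schemes (for example, separate scales per block of weights), so treat this as an illustration of the idea rather than of any particular library.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (absmax)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


# Made-up weights standing in for a few model parameters
weights = [0.12, -0.53, 0.98, -0.07, 0.31]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"largest round-trip error: {max_error:.4f}")
```

Each weight now needs 8 bits instead of 16, but the restored values differ slightly from the originals. That small per-weight error is the source of the performance cost mentioned above.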

### LLM Variety

#### Domain-specific LLMs

There are many LLMs available for use. Some are general-purpose, like Llama-3.2 models. Others are domain-specific, like models trained on scientific literature, or on code, or even "non-language" data like molecular structures.

Most LLMs, though, are trained to be general-purpose. General-purpose models can be adapted to specific domains by prompt-engineering, few-shot learning, retrieval-augmented generation, fine-tuning, or other techniques.
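
Few-shot learning, for instance, needs no training at all: worked examples are placed in the conversation ahead of the real question. Below is a minimal sketch that builds such a conversation in the same message format as the `pipeline` examples above. The sentiment task, the example texts, and the function name are made up for illustration.

```python
def build_few_shot_messages(system_prompt, examples, query):
    """Build a chat message list that shows worked examples before the real query."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        # Each example is presented as a past user turn and the assistant's reply
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


messages = build_few_shot_messages(
    system_prompt="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("The battery died after two days.", "negative"),
        ("Setup took thirty seconds and it just works.", "positive"),
    ],
    query="The screen is gorgeous, but that is the only good thing about it.",
)

for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

A list like this should be usable wherever the earlier cells pass `messages` to a pipeline.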

#### Multimodal LLMs

Multimodal LLMs are LLMs that can take in not just text, but also images, audio, video, or other data types. They work essentially just like text LLMs, but they can tokenize and process these other data types as well. 

Let's have one such model look at the image below and ask it to describe what it sees.
<img src="files/kitchen.jpg" alt="Kitchen Image" width="600"/>

In [10]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from pprint import pprint
import torch

# default: Load the model on the available device(s)
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
# )

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "files/kitchen.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
pprint(output_text[0])

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

('The image depicts a person sitting in a dimly lit, rustic kitchen. The room '
 'is characterized by its red lighting, which casts a warm, reddish hue '
 'throughout the space. The walls are made of mud or clay, and the floor is '
 'covered with dirt. Various traditional kitchen utensils and pots are '
 'scattered around the room, including a large pot on the floor and several '
 'smaller pots and vessels on the countertop. The person is wearing a '
 'checkered shirt and a patterned cloth, and they are holding a cooking '
 'utensil, possibly a ladle or spoon, over a fire or stove. The scene suggests '
 'a traditional or rural setting')


In [11]:
# Clear the model from memory
del model
torch.cuda.empty_cache()

### Instruction fine-tuning

A crucial distinction when working with LLMs is between models which are, and models which are not, *instruction fine-tuned*. First, consider models which are not instruction fine-tuned. These are "base" LLMs, which are pure next-token predictors. They are very simply trained to do the following: Look at a string of text, and predict what comes next in that text. Base models like this can be extremely powerful for some applications.

**Question:** Why would a base model tend to perform poorly in *chatbot*-style applications?

Consider giving a base LLM a string of text like "What sorts of things do tigers like to eat?" If you encountered this text in some random place on the internet, how might the text continue?

Instruction fine-tuned models are base LLMs that receive an extra round of training. This extra training no longer simply involves predicting the next bit of text. Instead, the models are trained to generate one very specific form of text: the output of a helpful, intelligent AI assistant in a conversation with a human.

So when we give a prompt `prompt_string` to an instruction-tuned LLM, instead of answering the question:

> Given that one sees `prompt_string` somewhere on the internet, what text would be likely to follow it?

it instead is answering the question:

> Given that a conversation between a helpful AI assistant and a human begins with `prompt_string`, how would the helpful AI continue the next bit of the conversation?

For most applications, instruction-tuned models are better. But instruction-tuned models are always trained on a specific *chat template* that structures the conversation between the human and the AI. **If you use the wrong chat template, the model will still respond, but its responses will be poorer.**
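
In practice you rarely need to write the template by hand: tokenizers for instruction-tuned models on the Hugging Face Hub usually ship with their template, and `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` applies it for you. To show what such a call produces, here is a hand-written sketch of a Llama 3 style template in plain Python. The function name is hypothetical, and the real Llama 3.2 template also adds date lines to the system block, which this sketch omits.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in a Llama 3 style format (simplified sketch)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model that it is its turn to write
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant. You answer the user's questions."},
    {"role": "user", "content": "Can you please provide an elegant proof of Euler's formula?"},
]

print(render_llama3_prompt(messages))
```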

In [16]:
# Correct chat template:
correct_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 19 Dec 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_prompt}<|eot_id|>"""


# Incorrect chat template:
incorrect_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{system_prompt}

### Input:
{user_prompt}"""


In [52]:
# Let's use these two templates to query the model, with these system and user prompts
system_prompt = "You are a helpful assistant. You answer the user's questions."
user_prompt = "Can you please provide an elegant proof of Euler's formula?"

correct_template_formatted = correct_template.format(system_prompt=system_prompt, user_prompt=user_prompt)
incorrect_input_formatted = incorrect_template.format(system_prompt=system_prompt, user_prompt=user_prompt)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")

# Set the padding token if it's not defined (some models still need this)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize the chat templates
correct_inputs = tokenizer(correct_template_formatted, padding=True, return_tensors='pt', add_special_tokens=False).to('cuda')
incorrect_inputs = tokenizer(incorrect_input_formatted, padding=True, return_tensors='pt', add_special_tokens=False).to('cuda')

# Generate a response for each prompt (greedy decoding, so the output is deterministic).
# pad_token_id comes from the tokenizer, whose pad token was set above if it was missing.
correct_template_output_ids = model.generate(correct_inputs['input_ids'],
                                             attention_mask=correct_inputs['attention_mask'],
                                             do_sample=False,
                                             max_new_tokens=200,
                                             temperature=None,
                                             pad_token_id=tokenizer.pad_token_id)

incorrect_template_output_ids = model.generate(incorrect_inputs['input_ids'],
                                               attention_mask=incorrect_inputs['attention_mask'],
                                               do_sample=False,
                                               max_new_tokens=200,
                                               temperature=None,
                                               pad_token_id=tokenizer.pad_token_id)

# Decode responses
correct_template_response = [tokenizer.decode(ids, skip_special_tokens=False) for ids in correct_template_output_ids]
incorrect_template_response = [tokenizer.decode(ids, skip_special_tokens=False) for ids in incorrect_template_output_ids]

# Print each generated response
print("CORRECT TEMPLATE:\n" + correct_template_response[0], end='\n\n' + '#' * 100 + '\n')
print("INCORRECT TEMPLATE:\n" + incorrect_template_response[0])

CORRECT TEMPLATE:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 19 Dec 2024

You are a helpful assistant. You answer the user's questions.<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you please provide an elegant proof of Euler's formula?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Euler's formula is a fundamental concept in mathematics, and I'd be happy to provide an elegant proof.

**Euler's Formula:**

e^(ix) = cos(x) + i sin(x)

**Proof:**

Let's start with the left-hand side of the equation:

e^(ix) = cos(x) + i sin(x)

We can rewrite this using Euler's formula:

e^(ix) = e^i(cos(x) + i sin(x))

Now, we can use the property of exponents that states:

e^(a + b) = e^a * e^b

Applying this property to the left-hand side, we get:

e^(ix) = e^i * e^(ix)

Now, we can simplify the right-hand side:

e^(ix) = e^(ix) * e^(ix)

Notice that the left-hand side is equal to the right-hand side. This is becaus

In [53]:
# Clear the model from memory
del model
torch.cuda.empty_cache()

## Inference vs. training

The model is just a very large collection of (billions of) numbers, describing a mathematical operation which is performed on the input (the prompt) in order to produce the output (the generated text). Model *training* is the process of finding those numbers, identifying the values of the model parameters that get the best output.

*Fine-tuning* is a kind of model training, when a model is trained on a specialized dataset after first being trained on a broader dataset.

In this workshop, we are not training or fine-tuning models; we are *inferencing* models. That means the model parameters are already fixed, and we won't do anything to change them. We are instead using those fixed model parameters to generate outputs. This is also what happens when you chat with an LLM such as ChatGPT.

Training and fine-tuning are **far more compute-intensive** and require much more GPU memory than inference. For example, an 8 GB LLM takes about 16 GB of GPU memory for inference and about 32 GB or more to fine-tune, because of the additional memory required for gradient computations, optimizer states, and intermediate activations during backpropagation.
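
That rule of thumb can be written as a small helper for planning a job request. The multipliers (about 2x the model size for inference, about 4x as a lower bound for fine-tuning) are read off the example above and are rough assumptions. Actual usage depends on batch size, sequence length, optimizer, and precision. The function name is our own.

```python
def estimate_gpu_memory_gb(model_size_gb, inference_factor=2.0, finetune_factor=4.0):
    """Rough GPU memory estimates based on the rule of thumb in the text."""
    return {
        "inference": model_size_gb * inference_factor,
        # Treat this as a lower bound: fine-tuning often needs more
        "fine_tuning": model_size_gb * finetune_factor,
    }


estimate = estimate_gpu_memory_gb(8)
print(estimate)  # {'inference': 16.0, 'fine_tuning': 32.0}
```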

Fine-tuning is often not necessary, even for specialized tasks. Prompt engineering, few-shot learning, and retrieval-augmented generation are often as effective as fine-tuning, or even more so, at molding an LLM to the behavior you want.

## Where to get models

To use pre-trained models in your workflows, you need to know where to find and download them. Here’s a quick guide:

#### Direct Download
- Many models are available for direct download from their creators' websites or repositories.
- Use `wget` or `curl` to fetch model files directly into your HPC environment.

#### Hugging Face Hub
- The Hugging Face Hub is a popular platform for accessing pre-trained models, datasets, and other ML resources.

##### Introduction to the Hugging Face Hub
- A community-driven platform with thousands of pre-trained models for NLP, computer vision, and more.
- Provides tools for easy integration with Python scripts and libraries like `transformers`.

##### Setting Up Your Hugging Face Account
1. Visit [huggingface.co](https://huggingface.co/) and create an account.
2. Generate a personal access token:
   - Go to **Settings** > **Access Tokens**.
   - Create a new token with the necessary permissions for downloading models.
3. Log in to the Hugging Face CLI to store your token:
   - Run the following command after activating your `LLMsInferenceWorkshop` environment:
     ```bash
     huggingface-cli login
     ```
   - Paste your access token when prompted.
   - By default, the token is saved to `~/.cache/huggingface/token` (or to `$HF_HOME/token` if `HF_HOME` is set), allowing persistent access without needing to re-enter it.

##### Setting the Cache Directory
- To avoid storing large models in your home directory, configure a cache directory on your scratch space:
  ```bash
  export HF_HOME=/scratch/username/huggingface
  ```
- Add this line to your .bashrc or .bash_profile to make it persistent.
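
The same setting can also be made from inside a Python script or notebook. The sketch below sets `HF_HOME` and then downloads a model repository with `snapshot_download` from the `huggingface_hub` package. The helper names are our own, and `username` is a placeholder for your own scratch directory. The variable has to be set before any Hugging Face library is imported, because the cache location is read at import time.

```python
import os


def configure_hf_cache(cache_dir):
    """Point Hugging Face libraries at `cache_dir`; call before importing them."""
    os.environ["HF_HOME"] = cache_dir
    return cache_dir


def download_model(repo_id):
    """Download all files of a model repository into the configured cache."""
    # Imported here so that HF_HOME is already set when the library loads
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


configure_hf_cache("/scratch/username/huggingface")  # placeholder path

# Requires network access and the huggingface_hub package:
# local_path = download_model("microsoft/Phi-3.5-mini-instruct")
# print(local_path)
```

Downloading ahead of time like this can be convenient when the compute job itself should not spend time fetching files.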


### Alternative LLM Frameworks: When to Use Them

While `transformers` is the primary library we’ll use in this workshop for GPU-based workflows, it’s worth briefly introducing two lightweight alternatives: **LlamaCpp** and **Ollama**. These tools are particularly useful in scenarios where GPUs aren't available or for quick, lightweight experiments.

#### **LlamaCpp**
- **What it is:** A lightweight, CPU-first library designed for running quantized versions of models like LLaMA.
- **Why use it:** Efficient for local inference on CPU nodes or testing quantized models without GPU dependency.
- **Key feature:** Extremely low memory footprint and no reliance on external frameworks.
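
For orientation, here is a hedged sketch of CPU inference through the `llama-cpp-python` bindings. It assumes that package is installed and that you have already downloaded a quantized model in GGUF format. The file path in the usage comment and the function name are hypothetical.

```python
def chat_with_gguf(model_path, user_prompt, max_tokens=128):
    """Ask a single question of a local GGUF model through llama-cpp-python."""
    # Imported inside the function so the sketch can be read without the package installed
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": user_prompt}],
        max_tokens=max_tokens,
    )
    # The result follows the OpenAI-style chat completion layout
    return result["choices"][0]["message"]["content"]


# Hypothetical usage; replace the path with a GGUF file you have downloaded:
# print(chat_with_gguf("/scratch/username/models/model.gguf", "Who are you?"))
```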

#### **Ollama**
- **What it is:** A simple CLI tool and platform for running pre-trained models locally.
- **Why use it:** Easy to use for prototyping, quick experiments, or exploring pre-packaged models.
- **Key feature:** Integrated model management with minimal setup.

### **Comparison Table: When to Use Each Framework**

| Framework       | Best Use Case                                                                 | Strengths                             | Limitations                           |
|------------------|-------------------------------------------------------------------------------|---------------------------------------|---------------------------------------|
| **Transformers** | GPU-based training and inference on large-scale models in HPC environments.  | Extensive library, GPU acceleration, and flexibility for advanced workflows. | Runs best on GPUs (CPU inference is possible but slow); higher resource overhead. |
| **LlamaCpp**     | CPU-based inference for small or quantized models in low-resource settings.  | Lightweight, runs efficiently on CPUs, no GPU dependency.                    | Slower for large-scale tasks, limited feature set. |
| **Ollama**       | Simple, quick prototyping or local testing of pre-trained models.            | Easy-to-use CLI, minimal setup required.                                     | Less customizable, not designed for large-scale HPC workflows. |
