# Running LLMs on Palmetto

## Instructor
- **Instructor**: Carl Ehrett
- **Office**: 2105 Barre Hall, Clemson University
- **Email**: cehrett AT clemson DOT edu

## Workshop Description
This workshop series introduces essential concepts related to LLMs and works through the common steps in an LLM inference workflow. It focuses on running LLMs efficiently, rather than on constructing, training, or fine-tuning them. Throughout the sessions, students will learn how to use the Hugging Face Transformers library to run LLMs on the Palmetto Cluster, including how to run them over large datasets and across multiple GPUs and multiple nodes.

## Prerequisites
* **All workshop participants should have a Palmetto Cluster account.** If you do not already have an account, you can visit our [getting started page](https://docs.rcd.clemson.edu/palmetto/starting).
* **Participants should be familiar with the Python programming language.** This requirement could be fulfilled by personal projects, coursework, or completion of the [Introduction to Python Programming workshop series](https://clemsonciti.github.io/rcde_workshops/python_programming/00-index.html).

## Environment
To run the code in this workshop, you will need a Python environment with the appropriate libraries installed. I have created such an environment in a shared space, but that environment will not always be available to you. You can create one yourself as follows.

To set up your Python environment, load the Anaconda module provided on Palmetto (e.g., `module load anaconda3/2023.09-0`), which gives you access to the `conda` tool. Then create and activate your own conda environment (for example, `conda create -n llm_workshop python=3.11 && conda activate llm_workshop`). After that, install any additional packages your code requires — either one by one using `pip install packagename`, or from a `requirements.txt` file if you have one.
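
Once the environment is active, a quick check like the following can confirm that the libraries are importable. This is a minimal sketch: the package list is an assumption inferred from the imports used in this notebook (plus `accelerate`, which `device_map="auto"` relies on), so adjust it to match your own requirements.

```python
import importlib.util

# Assumed package list (import names, not pip names); edit to suit your setup.
REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes", "PIL"]


def missing_packages(names):
    """Return the import names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Anything reported as missing can then be installed with `pip install` inside the activated environment (note that `PIL` is provided by the `pillow` package).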

## Hugging Face Hub
To use the code in the workshop notebooks, you will need a Hugging Face account. You can create one [here](https://huggingface.co/join).

## Where to get models

To use pre-trained models in your workflows, you need to know where to find and download them. Here’s a quick guide:

**NOTE: USE THE DATA TRANSFER NODE TO DOWNLOAD MODELS.** From the login node, `ssh hpcdtn01.rcd.clemson.edu`, or `ssh hpcdtn02.rcd.clemson.edu`.

#### Direct Download
- Many models are available for direct download from their creators' websites or repositories.
- Use `wget` or `curl` to fetch model files directly into your HPC environment.

#### Hugging Face Hub
- The Hugging Face Hub is a popular platform for accessing pre-trained models, datasets, and other ML resources.

##### What is the Hugging Face Hub?
- A community-driven platform with thousands of pre-trained models for NLP, computer vision, and more.
- Provides tools for easy integration with Python scripts and libraries like `transformers`.

##### Setting Up Your Hugging Face Account
1. Visit [huggingface.co](https://huggingface.co/) and create an account.
2. Generate a personal access token:
   - Go to **Settings** > **Access Tokens**.
   - Create a new token with the necessary permissions for downloading models.
3. Log in to the Hugging Face CLI to store your token:
   - Run the following command in the terminal after activating your workshop environment (for example, the `llm_workshop` environment from the Environment section above):
     ```bash
     huggingface-cli login
     ```
   - Paste your access token when prompted.
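   - By default, the token is saved to `~/.cache/huggingface/token` (or to `$HF_HOME/token` if `HF_HOME` is set; older versions of the CLI used `~/.huggingface/token`), allowing persistent access without needing to re-enter it.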

##### Setting the Cache Directory
- To avoid storing large models in your home directory, configure a cache directory on your scratch space:
  ```bash
  export HF_HOME=/scratch/username/huggingface
  ```
- Add this line to your `.bashrc` or `.bash_profile` to make it persistent, replacing `username` with your own Palmetto username.

##### Downloading models
- To download a model, run the command
  ```bash
  huggingface-cli download [model_name]
  ```
- For these notebooks, you will need to download `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-VL-4B-Instruct`, `Qwen/Qwen3-4B-Instruct-2507`, and `Equall/Saul-Instruct-v1`. 
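
The same downloads can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed (it comes with `transformers`) and uses its `snapshot_download` function; the helper name `download_models` is hypothetical. The download call is left commented out so that nothing is fetched until you run it deliberately on a data transfer node.

```python
# Model IDs listed above for the workshop notebooks.
MODEL_IDS = [
    "Qwen/Qwen3-0.6B",
    "Qwen/Qwen3-VL-4B-Instruct",
    "Qwen/Qwen3-4B-Instruct-2507",
    "Equall/Saul-Instruct-v1",
]


def download_models(model_ids):
    """Download each repository into the Hugging Face cache; return local paths."""
    # Imported here so that HF_HOME can be exported before the library loads.
    from huggingface_hub import snapshot_download

    return {model_id: snapshot_download(repo_id=model_id) for model_id in model_ids}


# Run on a data transfer node, after `huggingface-cli login`:
# paths = download_models(MODEL_IDS)
```

Files land in the cache directory configured through `HF_HOME`, so later calls to `from_pretrained` with the same model ID should find them without downloading again.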

## Background on LLMs

### LLM Overview

#### Conceptual overview

LLMs are a class of models that are designed to predict the next token (word, or chunk of a word) in a sequence of tokens. They are trained on a truly vast amount of text data.
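
To make "predict the next token" concrete, here is a deliberately tiny illustration. It is not how an LLM works internally: it simply counts which word follows which in a made-up corpus and always picks the most frequent follower. A real LLM replaces the lookup table with a neural network and operates over a vocabulary of sub-word tokens, but the generation loop has the same shape.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration.
corpus = "the cat sat on the mat and the cat ate the fish".split()

# For every token, count the tokens that immediately follow it.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1


def predict_next(token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    if token not in following:
        return None
    return following[token].most_common(1)[0][0]


def generate(start, steps):
    """Greedily extend `start` one token at a time, for at most `steps` tokens."""
    tokens = [start]
    for _ in range(steps):
        nxt = predict_next(tokens[-1])
        if nxt is None:
            break
        tokens.append(nxt)
    return tokens


print(generate("on", 2))
```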

LLMs these days tend to be *Transformer-based neural networks*. If you're interested in learning more about the Transformer architecture, we have a workshop in which we build and train a Transformer-based LLM from scratch using PyTorch. 

In this workshop, we will not dwell on the details of the Transformer architecture or how LLMs are trained. We will not focus on the mathematical details of these models or how they work at a fundamental level. Instead, we will focus on how to use pre-trained LLMs to generate text efficiently on Palmetto. Everything we discuss here also extends to using multimodal LLMs on Palmetto.

In [None]:

%env HF_HOME=/project/rcde/models   
%env HF_HUB_CACHE=/project/rcde/models   

# this sets the Huggingface model cache to be the directory listed above -- 
# where I have already placed some models. Ordinarily, you should set this
# to be somewhere on the scratch drive!

In [None]:
# Imports for notebook setup
import warnings
import logging
import os

# Suppress warnings and logging
warnings.filterwarnings('ignore')
logging.getLogger("transformers").setLevel(logging.ERROR)

# Print this notebook's process ID, e.g. to find this process in `nvidia-smi` or `top`
print(os.getpid())

In [None]:
from utils import create_answer_box

create_answer_box(
    question=(
        "Please describe your level of familiarity with the Python programming language."
    ),
    question_id="nb1-01"
)

create_answer_box(
    question=(
        "Please describe your level of previous experience running LLMs using code. (It's okay if the answer is \"none\".)"
    ),
    question_id="nb1-02"
)

create_answer_box(
    question=(
        "Please describe any present or planned use of LLMs in your research."
    ),
    question_id="nb1-03"
)

Running an LLM can be simple or complicated, depending on how you want to use it. A very simple case would be the following code:

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", 
                model="Qwen/Qwen3-0.6B",  # or "Qwen/Qwen3-4B-Instruct-2507"
                )

messages = ["The purpose of this workshop is"]

output = pipe(messages)
print(output[0][0]['generated_text'])

In [None]:
from utils import create_answer_box

create_answer_box(
    question=(
        "How much GPU memory (VRAM) do you think is needed to run the above code? "
        "How much CPU RAM do you think is needed? "
        "How could you ballpark, or check?"
    ),
    question_id="nb1-04"
)
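
One way to check the GPU side of this question is to query `nvidia-smi` while the pipeline is still loaded. The sketch below is an illustration rather than part of the workshop code: the helper name `gpu_memory_mib` is hypothetical, and it returns `None` when `nvidia-smi` is not available (for example, on a node without a GPU).

```python
import shutil
import subprocess


def gpu_memory_mib():
    """Return [(used_mib, total_mib), ...] per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    rows = []
    for line in result.stdout.strip().splitlines():
        try:
            used, total = (int(field) for field in line.split(","))
        except ValueError:
            continue  # skip rows that do not report numeric values
        rows.append((used, total))
    return rows


print(gpu_memory_mib())
```

These figures cover everything running on each GPU, not only this notebook. From inside Python, `torch.cuda.memory_allocated()` reports the memory held by tensors in the current process.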

In [None]:
# Clear the pipeline from memory
import torch
del pipe
torch.cuda.empty_cache()

#### Using LLMs on the Cluster

How is using an LLM on the Cluster different from using a web interface like ChatGPT? How is it different from using an LLM through an API like OpenAI's API?

1. **Control:** Aside from control over the system prompt, you have control over the model itself. You can choose which model to use. This is **crucial for reproducibility**. Third-party providers like Anthropic, Google and OpenAI can and do change or take away models without warning. This can undermine your research. 
2. **Cost:** Since you all already have access to the Cluster, you can run LLMs on the Cluster for free. This is not the case with third-party services.
3. **Privacy:** When running LLMs on the Cluster, your data never need leave the Cluster. This is not the case with third-party services.
4. **Speed:** The Cluster is a powerful computing resource. You can run LLMs on the Cluster much faster than you can on your own computer. This is especially true if you use a GPU on the Cluster, and if you batch your data.
5. **No persistent server:** While you can run LLM inference on the Palmetto Cluster, the Cluster does not support hosting a persistent, externally accessible service for real-time LLM interaction, like ChatGPT. HPC systems like the Cluster are optimized for batch processing and resource-intensive jobs, not for real-time interaction, especially with external users.


### LLM Size

#### Sizes of commonly used LLMs

LLMs are enormous; while small ones can run reasonably well on a good laptop, larger ones require the kind of hardware you would only find in an HPC cluster like Palmetto. 

We don't know how big chatbots like GPT-5.1 are, because OpenAI won't tell us. But models with similar performance typically have *at least hundreds of billions of parameters*.

That just means that the mathematical object which *is* the model is composed of that many numbers. Each such number is typically represented by a 16-bit floating point number, which is 2 bytes. So, a 100-billion-parameter model would be about 200 GB in size. To run that model, it's not enough to store that on your SSD or HDD; you need to load it into memory -- preferably GPU memory! *Nobody's laptop has that much memory.*
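
The arithmetic above can be written out as a short helper. This is a back-of-the-envelope sketch that counts only the weights (using 1 GB = 1e9 bytes); actually running a model needs additional memory for activations and the generation cache.

```python
def weights_size_gb(n_params, bits_per_param=16):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in [("0.6B", 0.6e9), ("8B", 8e9), ("100B", 100e9)]:
    print(f"{name} parameters: about {weights_size_gb(n_params):.1f} GB at 16 bits per parameter")
```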

Even if you can manage to run a 1- or 3-billion parameter model on your laptop, it will typically be much easier, faster and more efficient to run it on Palmetto, especially for large-scale workflows.

#### Quantization

One way to run larger models on smaller hardware is to use quantization. This is useful sometimes both for running LLMs on your laptop and for running them on Palmetto.

Quantization represents the model parameters with fewer bits than the 16-bit floating point numbers that are typically used. This can reduce the size of the model by a factor of 2, 4, or more. It can also speed up the model. It comes at a cost, however: quantization can reduce the performance of the model.
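
The basic idea can be illustrated with a toy example. The sketch below uses simple "absmax" rounding to 4-bit signed integers on a handful of made-up weights; it is an illustration of the principle only, not the scheme that `bitsandbytes` applies to real models.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integer codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [code * scale for code in codes]


weights = [0.12, -0.70, 0.33, 0.04, -0.28]  # made-up values
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("restored:", [round(r, 3) for r in restored])
print("largest rounding error:", round(max_error, 3))
```

Each weight is now stored as a small integer plus one shared scale factor, which is where the memory saving comes from. The rounding error is the price, and it is why quantized models can behave slightly differently from the originals.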

In [None]:
# Load a heavily quantized model to demonstrate size reduction
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline, set_seed
import torch

# Set seed for reproducibility
set_seed(355)

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the tokenizer and a 16-bit (bfloat16) copy of the model as a baseline
model_id = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)

# Load a second copy with 4-bit quantization applied at load time
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64)
pipe_4bit = pipeline("text-generation", model=model_4bit, tokenizer=tokenizer, max_new_tokens=64)

# Print model size info
print(f"Model loaded in 16-bit precision, approximate memory usage: {pipe.model.get_memory_footprint() / 1e9:.2f} GB")
print(f"Model loaded in 4-bit quantization, Approximate memory usage: {pipe_4bit.model.get_memory_footprint() / 1e9:.2f} GB")


In [None]:
message = "The Fibonacci sequence begins: "

print(f"16-bit model: {pipe(message)[0]['generated_text']}")
print(f"4-bit model: {pipe_4bit(message)[0]['generated_text']}")

In [None]:
# Clear the model from memory
import torch
del pipe
del pipe_4bit
del model
del model_4bit
torch.cuda.empty_cache()

### LLM Variety

#### Domain-specific LLMs

There are many LLMs available for use. Some are general-purpose, like Llama-3.2 models. Others are domain-specific, like models trained on scientific literature, or on code, or even "non-language" data like molecular structures.

Most LLMs, though, are trained to be general-purpose. General-purpose models can be adapted to specific domains by prompt-engineering, few-shot learning, retrieval-augmented generation, fine-tuning, or other techniques.
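
As an illustration of few-shot learning, the sketch below assembles a chat-style prompt in which a few worked examples precede the real query. The helper name `build_few_shot_messages` and the example texts are hypothetical. Note also that some chat templates do not accept a `system` role, in which case the instruction can be folded into the first user message instead.

```python
def build_few_shot_messages(instruction, examples, query):
    """Assemble chat messages: an instruction, worked examples, then the real query."""
    messages = [{"role": "system", "content": instruction}]
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": query})
    return messages


# Made-up examples for a simple classification task.
examples = [
    ("Clause: 'Tenant shall pay rent monthly.'", "Category: payment"),
    ("Clause: 'Either party may terminate with 30 days notice.'", "Category: termination"),
]
messages = build_few_shot_messages(
    instruction="Classify each contract clause into a single category.",
    examples=examples,
    query="Clause: 'Landlord may end the lease if rent is unpaid.'",
)

for message in messages:
    print(f"{message['role']}: {message['content']}")
```

A list like this can be passed through a tokenizer's `apply_chat_template` in the same way as the single-message example in the next cell.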

In [None]:
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="Equall/Saul-Instruct-v1", dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Please explain the key differences between common law and civil law systems."},
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False)
print(outputs[0]["generated_text"])


In [None]:
# Clear the model from memory
del pipe, prompt, outputs
torch.cuda.empty_cache()

#### Multimodal LLMs

Multimodal LLMs are LLMs that can take in not just text, but also images, audio, video, or other data types. They work essentially just like text LLMs, but they can tokenize and process these other data types as well. 

Let's see one such model take a look at the image below. We can ask the model a question about it.

<img src="files/kitchen.jpg" alt="Kitchen Image" width="600"/>

In [None]:
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_id = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image
image = Image.open("files/kitchen.jpg")

# Build chat prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Write a recipe that this woman might be using."},
        ],
    }
]

# Convert to text template
chat_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Preprocess (tokenize + image encoding)
inputs = processor(text=[chat_text], images=[image], return_tensors="pt").to(model.device)

# Generate
output = model.generate(**inputs, max_new_tokens=128)

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)


In [None]:
# Clear the model from memory
del model
torch.cuda.empty_cache()

### Code challenge

Copy the above code cell that runs inference with the VL (vision-language) model. Then consult the earlier section of this notebook that describes how to load a model in quantized form. Modify the VL code to use a 4-bit quantization of the model. When you're finished, report the resulting model size.

In [None]:
# Use this code cell to develop your code





In [None]:
from utils import create_answer_box

create_answer_box(
    question=(
        "How much GPU memory (VRAM) does your quantized version of the VL model consume?"
    ),
    question_id="nb1-05"
)

## Inference vs. training

The model is just a very large collection of (billions of) numbers, describing a mathematical operation which is performed on the input (the prompt) in order to produce the output (the generated text). Model *training* is the process of finding those numbers, identifying the values of the model parameters that get the best output.

*Fine-tuning* is a kind of model training, when a model is trained on a specialized dataset after first being trained on a broader dataset.

In this workshop, we are not training or fine-tuning models; we are *inferencing* models. That means the model parameters are already fixed, and we won't do anything to change them. We are instead using those fixed model parameters to generate outputs. This is also what happens when you chat with an LLM such as ChatGPT.

Training and fine-tuning are **far more compute-intensive** and require much more GPU memory than inferencing. For example, an 8B LLM takes about 16 GB of GPU memory for inference at 16-bit precision, whereas fine-tuning it takes at least 32 GB and often several times more, because of the additional memory required for gradients, optimizer states, and intermediate activations during backpropagation.
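
A rough way to see where the extra memory goes is to count bytes per parameter. The sketch below is an estimate under stated assumptions, not a measurement: it assumes full fine-tuning in mixed precision with an Adam-style optimizer, using the common rule of thumb of 2 bytes for weights, 2 bytes for gradients, and 12 bytes for optimizer state (a 32-bit copy of the weights plus two 32-bit moment estimates). Activations are ignored, and parameter-efficient methods need far less.

```python
def inference_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone at 16-bit precision, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def full_finetune_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Weights + gradients + optimizer state, ignoring activations (assumed byte counts)."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9


n_params = 8e9  # an 8B-parameter model
print(f"Inference (weights only): about {inference_gb(n_params):.0f} GB")
print(f"Full fine-tuning (before activations): about {full_finetune_gb(n_params):.0f} GB")
```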

Fine-tuning is often not necessary, even for specialized tasks. Prompt engineering, few-shot learning, and retrieval-augmented generation are often as effective as fine-tuning, or even more so, for steering an LLM toward the behavior you want.

### Alternative LLM Frameworks: When to Use Them

While `transformers` is the primary library we’ll use in this workshop for GPU-based workflows, it’s worth briefly introducing several alternative frameworks: **LlamaCpp**, **Ollama**, and **vLLM**.  
Each serves a different purpose — from lightweight local inference to high-throughput serving — and can complement your workflow depending on your goals and environment.

---

#### **LlamaCpp**
- **What it is:**  
  A lightweight, CPU-first library designed for running quantized versions of models.
- **Why use it:**  
  Efficient for local inference on CPU nodes or testing quantized models without GPU dependency.  
  Particularly useful for offline environments on laptops.
- **Key features:**  
  - Extremely low memory footprint.  
  - Efficient CPU-only execution with minimal dependencies.  
  - Supports quantized model formats that dramatically reduce memory use.  
- **Limitations:**  
  - CPU-only performance is slower for large-scale workloads.  
  - Limited to decoder-only models.  
  - GPU offloading is available in GPU-enabled builds, but there is little support for fine-tuning and limited integration with the broader Hugging Face ecosystem.

---

#### **Ollama**
- **What it is:**  
  A simple CLI tool and platform for running pre-trained models locally.  
- **Why use it:**  
  Excellent for quick experiments or exploring pre-packaged models on your own machine, with automatic model management and setup.  
- **Key features:**  
  - Run models with a single command (`ollama run llama3`).  
  - Supports quantized models for efficient local use.  
  - Simple API and local HTTP endpoints for easy integration into scripts and apps.  
- **Limitations:**  
  - Limited customization or experimental control.  
  - Restricted to supported models and formats curated by Ollama.  
  - Not designed for multi-GPU, HPC, or large-batch workflows.  
  - Lacks transparency and extensibility compared to `transformers`.

---

#### **vLLM**
- **What it is:**  
  A high-performance inference engine built for *maximum throughput* on GPUs.  
  It uses techniques like **PagedAttention** and **continuous batching** to minimize memory waste and maximize token generation speed.
- **Why use it:**  
  Ideal when the goal is *speed and scalability* — for example, running inference over large datasets or serving models to multiple users concurrently.  
- **Key features:**  
  - **High-throughput inference:** Faster batch and streaming generation than `transformers`.  
  - **Optimized GPU utilization:** Custom CUDA kernels and memory scheduling designed for decoding-heavy workloads.  
- **Limitations:**  
  - **Inference-only:** Cannot fine-tune or modify model internals.  
  - **Narrower model coverage:** Focused on *decoder-only* text-generation models (LLaMA, Mistral, Falcon, etc.). Support for encoder-decoder, multimodal, and embedding architectures (e.g., T5, BART, CLIP) is more limited and varies by release, so check the supported-models list for your vLLM version.  
  - **Partial ecosystem integration:** Loads Hugging Face models and tokenizers but lacks full compatibility with libraries like `datasets`, `evaluate`, and `peft`.  
  - **Reduced introspection:** Does not expose hidden states or attention weights, limiting interpretability.  
  - **Reproducibility caveats:** Asynchronous batching and kernel scheduling introduce slight nondeterminism in outputs and logprobs.  

---

**Summary:**  
For **research**, `transformers` remains the most appropriate choice. It's transparent, reproducible, and compatible with the full Hugging Face ecosystem.  
Use **vLLM** when maximum throughput or production-scale inference is the priority. But be mindful of its narrower model support and limited experimental control.  
Use **LlamaCpp** or **Ollama** for lightweight local experimentation when GPUs aren’t available or setup time is limited.
