# 2. Inference - using trained LLMs

## What is inference?

Inference refers to the process of using an AI model to generate predictions, produce outputs or perform specific tasks based on new input data.

Inference enables real-world applications of LLMs, allowing users to interact with the model by providing prompts — text inputs that guide the model’s responses.

During inference, the input text is tokenized (the text is split into pieces suitable for the model), processed by the model, and transformed into a meaningful response. This happens in real time, but its speed and efficiency depend on the underlying hardware and the model size.

To better understand how sentences are tokenized, one of the upcoming exercises will demonstrate tokenization using the Poro-34B model. Tokenization varies depending on the model and is more complex than simply splitting a sentence into words. During tokenization, the text is broken into smaller chunks, which are then converted into token IDs that map to the model's token vocabulary, since models operate on numbers rather than text. The web-based tool [Tiktokenizer](https://tiktokenizer.vercel.app/) lets you visualize and experiment with tokenization using OpenAI's tiktoken library.
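
As a rough illustration of the two steps described above (splitting text into chunks, then mapping chunks to token IDs), here is a minimal sketch. The vocabulary `TOY_VOCAB` and the functions `toy_tokenize` and `toy_encode` are made up for this example, and greedy longest-match is a simplification: real tokenizers, including the one used in the exercise, learn a much larger vocabulary from data and apply different splitting rules.

```python
# Toy vocabulary mapping text pieces to token IDs (hypothetical, illustration only).
TOY_VOCAB = {
    "token": 0,
    "ization": 1,
    "izer": 2,
    " is": 3,
    " fun": 4,
    "!": 5,
}
UNKNOWN_ID = -1  # used for characters the toy vocabulary does not cover


def toy_tokenize(text):
    """Split text into the longest matching vocabulary pieces, left to right."""
    pieces = []
    position = 0
    longest = max(len(piece) for piece in TOY_VOCAB)
    while position < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(longest, len(text) - position), 0, -1):
            candidate = text[position:position + length]
            if candidate in TOY_VOCAB:
                pieces.append(candidate)
                position += length
                break
        else:
            # Nothing matched: emit the single character as an unknown piece.
            pieces.append(text[position])
            position += 1
    return pieces


def toy_encode(text):
    """Convert text to token IDs using the toy vocabulary."""
    return [TOY_VOCAB.get(piece, UNKNOWN_ID) for piece in toy_tokenize(text)]


print(toy_tokenize("tokenization is fun!"))  # ['token', 'ization', ' is', ' fun', '!']
print(toy_encode("tokenization is fun!"))    # [0, 1, 3, 4, 5]
```

Note that the single word "tokenization" becomes two tokens, and that the leading space is part of the tokens " is" and " fun". Both effects are common in real tokenizers as well.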

**LLM inference process overview:**

![inference-process.png](./images/inference-process.png)

1. **Tokenization** 
    * The input text is broken down into smaller units (tokens).
2. **Processing** 
    * The model processes the tokens through its layers, analyzing context and predicting the most likely next token(s) based on prior information and learned patterns.
3. **Generation** 
    * The model generates the most probable continuation of the input, forming a coherent response based on learned probabilities.
4. **Decoding** 
    * The predicted tokens are decoded into human-readable text.
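
The four steps above can be sketched end to end with a toy example. Everything here is hypothetical: the five-word vocabulary, the hand-written probability table `NEXT_TOKEN_PROBS` standing in for a trained model, and the function names. A real LLM computes the probabilities with its neural network layers and takes the whole context into account, and many systems sample from the probabilities instead of always picking the most likely token.

```python
# Hypothetical vocabulary and "model" for illustration only.
VOCAB = ["<end>", "the", "cat", "sat", "down"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}

# Stand-in for a trained model: for each previous token ID, one probability
# per vocabulary entry. Only the last token is used as context here.
NEXT_TOKEN_PROBS = {
    1: [0.0, 0.0, 0.9, 0.1, 0.0],  # after "the": most likely "cat"
    2: [0.1, 0.0, 0.0, 0.8, 0.1],  # after "cat": most likely "sat"
    3: [0.2, 0.0, 0.0, 0.0, 0.8],  # after "sat": most likely "down"
    4: [1.0, 0.0, 0.0, 0.0, 0.0],  # after "down": end of text
}


def tokenize(text):
    """Step 1: split on spaces and map each word to its token ID."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def predict_next(token_ids):
    """Step 2: return next-token probabilities given the tokens so far."""
    return NEXT_TOKEN_PROBS[token_ids[-1]]


def generate(token_ids, max_new_tokens=10):
    """Step 3: repeatedly append the most probable next token (greedy choice)."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        probs = predict_next(token_ids)
        next_id = probs.index(max(probs))
        if next_id == TOKEN_TO_ID["<end>"]:
            break
        token_ids.append(next_id)
    return token_ids


def decode(token_ids):
    """Step 4: map token IDs back to readable text."""
    return " ".join(VOCAB[token_id] for token_id in token_ids)


prompt_ids = tokenize("the")
output_ids = generate(prompt_ids)
print(decode(output_ids))  # the cat sat down
```

The loop in `generate` shows why longer responses take longer to produce: each new token requires another pass through the model.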

## Hardware and memory requirements for inference

Inference is generally less computationally demanding than training. However, many large-scale inference tasks require GPU acceleration for efficiency. The computational requirements depend on the size and complexity of the model:

* Small models can run on a CPU, but this is generally slower.
* Larger models require GPU acceleration to achieve smooth performance. 

### Understanding memory requirements

This part is based on the text "Working with large language models on supercomputers" from [CSC Docs](https://docs.csc.fi/support/tutorials/ml-llm/). 

To run an LLM on a GPU, the entire model must fit into GPU memory (VRAM). The memory required depends on:
* Model size (number of parameters/weights).
* Precision of stored weights (floating-point format).  

Each model parameter (weight) is stored as a floating-point number. A regular floating-point value is typically stored in a format called fp32, which uses 32 bits of memory, or 4 bytes (remember, 8 bits = 1 byte). In deep learning, 16-bit floating-point formats (fp16 or bf16) have long been used to speed up parts of the computation. These use 2 bytes of memory per weight. The memory needed for inference can be estimated using the formula:

$$
\text{Memory Required} = \text{Number of Parameters} \times \text{Bytes per Weight}
$$


**Floating-point precision & memory usage**

Different floating-point formats affect memory usage:

| Floating-point format | Bits per weight | Bytes per weight |
| ----- | ----- | ----- |
| FP32 (Single Precision) | 32 bits | 4 bytes |
| FP16 (Half Precision) | 16 bits | 2 bytes |
| BF16 (Brain Floating Point) | 16 bits | 2 bytes |

LLMs usually require gigabytes of memory, since the number of parameters is typically in the billions and 1 GB = 1 000 MB = 1 000 000 000 bytes. (The binary unit, 1 GiB = 1 024 MiB, is about 7% larger.)

The model size in memory is then the number of parameters times the number of bytes needed to store a single weight. For example, a 30 billion parameter model in fp16 takes up 60 GB of memory (30 billion × 2 bytes). In practice, inference adds up to 20% overhead, so you might actually need around 70 GB of memory.
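
The estimate above can be turned into a few lines of Python. This is a minimal sketch under stated assumptions: the function name `estimate_inference_memory_gb` and its parameters are illustrative, the 20% overhead is the upper bound mentioned above rather than a measured value, and 1 GB is taken as 10^9 bytes.

```python
# Bytes needed to store one weight in each floating-point format.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "bf16": 2}
BYTES_PER_GB = 10**9


def estimate_inference_memory_gb(num_parameters, precision="fp16", overhead=0.20):
    """Estimate memory for inference: weights plus a fractional overhead."""
    weights_gb = num_parameters * BYTES_PER_WEIGHT[precision] / BYTES_PER_GB
    return weights_gb * (1 + overhead)


thirty_billion = 30_000_000_000

# Weights only (no overhead), then with the assumed 20% overhead.
print(f"{estimate_inference_memory_gb(thirty_billion, 'fp16', overhead=0.0):.1f} GB")  # 60.0 GB
print(f"{estimate_inference_memory_gb(thirty_billion, 'fp16'):.1f} GB")                # 72.0 GB
print(f"{estimate_inference_memory_gb(thirty_billion, 'fp32'):.1f} GB")                # 144.0 GB
```

The last line shows how much the choice of precision matters: the same model stored in fp32 needs twice the memory.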

## LUMI supercomputer & hardware for inference

LUMI is one of Europe’s most powerful AI supercomputers, designed to handle the massive computations required for AI and scientific research. LUMI is one of the hardware components behind the Aitta inference platform.

**Key hardware components:**
* AMD Instinct MI250X GPUs: High-performance graphics processors designed for AI and machine learning.
* Multi-GPU Setup: LUMI has 2,978 nodes, each equipped with 4 MI250X GPUs.
* AMD EPYC "Trento" CPU: A powerful processor with 64 cores that coordinates GPU operations.

**Understanding the MI250X GPU:**

Each MI250X GPU is not just one GPU but two in a single package (this design is called a multi-chip module). These two chips, known as Graphics Compute Dies (GCDs), work together to accelerate computations.

* Each MI250X GPU contains 220 compute units (110 per chip).
* It has a total of 128 GB of memory, split between the two chips (64 GB per chip).
* This high-speed memory (HBM) allows the GPU to quickly process large amounts of data.


Depending on its size, a model might not fit into the memory of a single GPU (one GCD in LUMI has 64 GB of memory). In that case, you need to split the model across multiple GPUs.
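
As a back-of-the-envelope check, the number of GCDs a model needs can be estimated by dividing its memory requirement by the 64 GB available per GCD and rounding up. The sketch below is an assumption-laden lower bound: the function name `gcds_needed` is illustrative, and in practice memory cannot always be split perfectly evenly and frameworks may require specific GPU counts.

```python
import math

GCD_MEMORY_GB = 64   # memory of one MI250X GCD
GCDS_PER_NODE = 8    # 4 MI250X GPUs per node, 2 GCDs per GPU


def gcds_needed(required_memory_gb):
    """Smallest number of GCDs whose combined memory covers the requirement."""
    return math.ceil(required_memory_gb / GCD_MEMORY_GB)


# A 30 billion parameter model in fp16: 60 GB of weights plus 20% overhead.
count = gcds_needed(72)
print(count)                    # 2
print(count <= GCDS_PER_NODE)   # True: fits within a single node
```

By this estimate, the 30 billion parameter example needs two GCDs, which is exactly one MI250X GPU.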

For further reading, see https://docs.lumi-supercomputer.eu/hardware/lumig/

## Inference deployment options

LLMs can be deployed in various ways, depending on the available hardware, performance needs, and privacy considerations.

* **Local execution** – Running models on your own hardware (laptop, workstation or local server).
* **Cloud-based execution** – Using cloud VMs, dedicated AI hardware, or supercomputers.
* **API access** – Calling models hosted by a third-party via specialized APIs.

Each approach has pros and cons:

|Deployment method|Pros|Cons|
|-----------------|----|----|
|**Local execution**|Full control, data privacy, lower/no recurring costs|Requires high-end hardware, limited scalability|
|**Cloud VMs & supercomputers**|Scalable, access to high-performance GPUs|Can be expensive, requires network connectivity|
|**API-based services**|Easy access, no hardware setup|Potential data privacy concerns, dependent on provider|


Each method serves different needs, from running models privately on personal machines to leveraging cloud infrastructure for large-scale AI applications. The right choice depends on factors like cost, privacy, ease of use, and computational power.   
For more information on inference using supercomputers, check out this [CSC Docs](https://docs.csc.fi/support/tutorials/ml-llm/#inference) page. You can also explore the [**ai-inference-examples**](https://github.com/CSCfi/ai-inference-examples) repository on CSC's GitHub for examples of using LLMs on CSC's supercomputers.

There are also **user interfaces** that give access to third-party models through user-friendly applications (e.g., ChatGPT, Copilot). They provide a polished experience, but offer limited control over model behavior and customization. These applications rely on third-party APIs and are subject to the provider's limitations and data privacy considerations.


## Conclusion

By now, you might see why running large language models on a personal laptop is often not feasible: the larger models require far more memory and computational power than a laptop provides. That’s why access to dedicated AI infrastructure is essential.

Next, we’ll explore Aitta, an AI inference platform that provides both a web-based user interface and API access for seamless model interaction. Let's move on to the next section, [**Aitta - AI Inference platform**](./03_aitta.ipynb).