# LLM Inference with Llama.cpp
## DeepSeek-R1-Distill-Llama-8B: GPU vs. CPU Inference Comparison

In this notebook, we compare the performance of running the [`DeepSeek-R1-Distill-Llama-8B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) model using **GPU acceleration** versus running it on **CPU only**. We will measure inference time in both cases to highlight the speedup achieved with GPU offloading.

The `DeepSeek-R1-Distill-Llama-8B` model is a distilled version of [`DeepSeek-R1`](https://huggingface.co/deepseek-ai/DeepSeek-R1) based on [`Llama-3.1-8B`](https://huggingface.co/meta-llama/Llama-3.1-8B).
Distillation is a process in which a smaller model is trained to mimic the behavior of a larger model. This lets the smaller model approach the larger model's quality while being much cheaper to run.

Quantization is a technique that reduces the **precision** of model weights (e.g., from 16-bit floating-point to 4-bit integers), making inference more memory-efficient and faster, especially on consumer hardware. We will use [`DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf`](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF), a quantized version of the `DeepSeek-R1-Distill-Llama-8B` model that significantly reduces memory requirements while maintaining good output quality.
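
The memory saving can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: the parameter count of exactly 8 billion is an assumption (the "8B" in the model name taken literally), and the 4.92 GB figure is the file size shown by the download progress bar later in this notebook. The helper names are made up for illustration.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(file_size_gb: float, n_params: float) -> float:
    """Average number of bits stored per weight, given the file size."""
    return file_size_gb * 1e9 * 8 / n_params


N_PARAMS = 8e9          # assumption: "8B" taken literally
GGUF_FILE_SIZE_GB = 4.92  # size reported by the download progress bar

fp16_gb = weight_size_gb(N_PARAMS, 16)
q4_bits = effective_bits_per_weight(GGUF_FILE_SIZE_GB, N_PARAMS)

print(f"16-bit weights: about {fp16_gb:.1f} GB")
print(f"Q4_K_M file:    {GGUF_FILE_SIZE_GB} GB, about {q4_bits:.2f} bits per weight")
```

Under these assumptions the quantized file is roughly a third of the size of the 16-bit weights. The effective rate comes out above 4 bits, which is expected if some tensors and metadata are stored at higher precision.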

`llama.cpp` is a lightweight and efficient C++ implementation whose goal is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware. In this notebook we use it through its Python bindings, [`llama-cpp-python`](https://llama-cpp-python.readthedocs.io/en/latest). `llama.cpp` requires the model to be stored in the GGUF file format.
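
As a quick sanity check that a downloaded file really is GGUF, its header can be inspected directly. This is a minimal sketch that assumes the header layout as I understand it from the GGUF format description: a 4-byte magic `GGUF` followed by a little-endian 32-bit version number. The function name `read_gguf_version` is hypothetical and is not part of `llama-cpp-python`.

```python
import struct

GGUF_MAGIC = b"GGUF"  # assumed magic bytes at the start of every GGUF file


def read_gguf_version(path: str) -> int:
    """Return the GGUF version stored in the file header.

    Raises ValueError if the file does not start with the GGUF magic.
    """
    with open(path, "rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version
```

Pointing this helper at the `.gguf` path printed by `llama_model_loader` in the load log should return the same version number that the log reports.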

In [1]:
import pathlib

from llama_cpp import Llama
from llama_cpp.llama_cpp import load_shared_library

## Inference with CPU

We first run inference on the CPU to measure the baseline performance. Setting `n_gpu_layers=0` will force the inference to run on the CPU only.

In [2]:
deepseek = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF",
    filename="DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
    n_gpu_layers=0,  # 0 = offload no layers, so inference runs on the CPU only
)

DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA A100 80GB PCIe) - 80611 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 292 tensors from /home/jovyan/.cache/huggingface/hub/models--unsloth--DeepSeek-R1-Distill-Llama-8B-GGUF/snapshots/4f9ea2b81efbfb1e2cf589261003b84ef3626d52/./DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Llama 8B
llama_mode

In [3]:
prompt = "What is Machine Learning?"
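# Wrap the question in the model's user/assistant chat markers.
prompt_formatted = f'<｜User｜>{prompt}<｜Assistant｜>'

output = deepseek(
    prompt_formatted,
    max_tokens=300,  # upper bound on the number of generated tokens
)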

print(output['choices'][0]['text'])

llama_perf_context_print:        load time =    4139.39 ms
llama_perf_context_print: prompt eval time =    4138.84 ms /     8 tokens (  517.36 ms per token,     1.93 tokens per second)
llama_perf_context_print:        eval time =  517293.12 ms /   299 runs   ( 1730.08 ms per token,     0.58 tokens per second)
llama_perf_context_print:       total time =  521938.55 ms /   307 tokens


<think>
Okay, so I'm trying to understand what machine learning is. I've heard the term before, but I'm not exactly sure how it works. Let me start by breaking down the words. "Machine learning" seems to combine "machine" and "learning." So, I guess it's about machines learning something, right?

Hmm, I remember hearing that computers can do tasks like recognizing images or understanding speech. Is that part of machine learning? Maybe. I think it's different from just programming because, in programming, we set explicit rules, but machine learning must be more about the computer learning on its own.

Wait, how does a computer learn without being explicitly programmed? I suppose it has something to do with data. Maybe the computer uses data to figure out patterns or make predictions. So, if I have a dataset, the machine can analyze it and find relationships or trends that I didn't explicitly set.

I'm trying to think of an example. Like, if I have a dataset of images of cats and dogs, m

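According to the `llama_perf_context_print` output above, the response took about **522 seconds** in total (roughly 0.58 tokens per second during generation) with 3 CPUs and 20 GB of RAM.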
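
Wall-clock time can also be measured directly in the notebook rather than read from the log. The sketch below uses a hypothetical helper, `timed`, built on `time.perf_counter`; a cheap stand-in workload is used so that the example runs without loading a model.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook, the model call would go here instead.
result, seconds = timed(sum, range(1_000_000))
print(f"finished in {seconds:.4f} s")
```

To time a generation, the same helper can wrap the `deepseek(...)` call from the cell above, passing the prompt and `max_tokens` as arguments.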

## Checking GPU Availability

Before loading the model on the GPU, we check whether GPU offloading is supported in the current environment.


In [4]:
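def is_gpu_available() -> bool:
    import llama_cpp  # local import, used only to locate the bundled library
    lib_dir = pathlib.Path(llama_cpp.__file__).parent / 'lib'  # avoids a hard-coded site-packages path
    lib = load_shared_library('llama', lib_dir)
    return bool(lib.llama_supports_gpu_offload())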

is_gpu_available()

True

## Inference using GPU


If GPU acceleration is available, we can reload the model with `n_gpu_layers=-1`, meaning that all the model layers will be offloaded to the GPU.
This should significantly speed up inference.
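
The choice between the two settings can be made automatically from the availability check. This is a small sketch with a hypothetical helper, `choose_n_gpu_layers`; it only encodes the two values used in this notebook (`-1` for all layers on the GPU, `0` for CPU only).

```python
def choose_n_gpu_layers(gpu_available: bool) -> int:
    """Return -1 (offload every layer) if a GPU is usable, else 0 (CPU only)."""
    return -1 if gpu_available else 0


print(choose_n_gpu_layers(True))   # value to use when offloading is supported
print(choose_n_gpu_layers(False))  # value to use on a CPU-only machine
```

In the notebook this could be passed as `n_gpu_layers=choose_n_gpu_layers(is_gpu_available())`, so that the same cell works in both environments.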


In [5]:
deepseek = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF",
    filename="DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 = offload all model layers to the GPU
)

llama_model_load_from_file_impl: using device CUDA0 (NVIDIA A100 80GB PCIe) - 79941 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 292 tensors from /home/jovyan/.cache/huggingface/hub/models--unsloth--DeepSeek-R1-Distill-Llama-8B-GGUF/snapshots/4f9ea2b81efbfb1e2cf589261003b84ef3626d52/./DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Llama 8B
llama_model_loader: - kv   3:                       general.organization str              = Deepseek Ai
llama_model_loader: - kv   4:                           general.basename str              = DeepSe

In [6]:
prompt = "What is Machine Learning?"
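# Same prompt and chat markers as in the CPU run, for a like-for-like comparison.
prompt_formatted = f'<｜User｜>{prompt}<｜Assistant｜>'

output = deepseek(
    prompt_formatted,
    max_tokens=300,  # upper bound on the number of generated tokens
)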

print(output['choices'][0]['text'])

llama_perf_context_print:        load time =     115.71 ms
llama_perf_context_print: prompt eval time =     115.53 ms /     8 tokens (   14.44 ms per token,    69.24 tokens per second)
llama_perf_context_print:        eval time =    2041.80 ms /   299 runs   (    6.83 ms per token,   146.44 tokens per second)
llama_perf_context_print:       total time =    2476.28 ms /   307 tokens


<think>
Okay, so I'm trying to understand what machine learning is. I've heard the term before, especially with all the talk about AI and neural networks. But I'm not entirely sure what it really is. Let me try to break it down.

From what I remember, machine learning is a type of artificial intelligence. So, it's about machines being able to learn from data, right? But how exactly does that work? I think it's different from regular programming where you set specific rules. Instead, machine learning allows the machine to figure things out on its own, maybe using data to make predictions or decisions.

I think there are different types of machine learning. I've heard terms like supervised learning, unsupervised learning, and reinforcement learning. I'm not exactly sure what each one does. Maybe I should start by understanding each type individually.

Supervised learning sounds like it has some sort of guidance. So, the machine is trained on data with labeled examples. The model then lea

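According to the `llama_perf_context_print` output above, the response took about **2.5 seconds** in total (roughly 146 tokens per second during generation) on the NVIDIA A100 80GB PCIe GPU reported in the load log.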

## Conclusion

This demonstrates the **significant speedup** achieved by using GPU acceleration with `llama.cpp`: in the logged runs above, total time drops from about 522 seconds on CPU to about 2.5 seconds on GPU. In this case, GPU offloading provides a speedup of roughly **200x**.
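
The speedup can be recomputed from the `llama_perf_context_print` lines of the two runs. The sketch below copies the total times and generation throughputs from the logs above; the helper name `ratio` is made up for illustration.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting non-positive denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Values copied from the llama_perf_context_print output of each run.
cpu_total_ms = 521938.55
gpu_total_ms = 2476.28
cpu_tokens_per_s = 0.58
gpu_tokens_per_s = 146.44

total_speedup = ratio(cpu_total_ms, gpu_total_ms)
throughput_gain = ratio(gpu_tokens_per_s, cpu_tokens_per_s)

print(f"total-time speedup:         {total_speedup:.0f}x")
print(f"generation throughput gain: {throughput_gain:.0f}x")
```

These figures come from a single run on each device, so they are best read as an order-of-magnitude indication rather than a benchmark.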
