# 📘 Gemma 2B – Quantizing GGUF

- **Author:** Ederson Corbari <e@NeuroQuest.ai>
- **Date:** February 01, 2026  

---

## Overview

This notebook documents the correct pipeline for converting a **LoRA-fine-tuned (QLoRA)** model into a **quantized GGUF** model, ready for use in local runtimes such as **llama.cpp**, **Jan**, **Ollama**, and **LM Studio**.

⚠️ An important note: **a LoRA adapter cannot be quantized directly**; it must first be merged into the base model.

The process requires a well-defined sequence of intermediate steps.

### Correct pipeline order

- **Merge the LoRA adapter with the base model**  
→ Produces a full model in **fp16 or bfloat16**

- **Convert Hugging Face to GGUF (fp16)**  
→ Using the official `llama.cpp` tooling

- **Quantize the GGUF model**  
→ Formats such as **Q4 / Q5 / Q8**, depending on the desired balance between quality and memory usage

In this workflow, the **GGUF model is first created in fp16**; quantization is applied only after this conversion step.

Throughout this notebook, this workflow is applied to a **Gemma 2B** model fine-tuned for empathetic and psychologically safe responses, covering everything from LoRA merging to final quantization, inference testing, and publication to Hugging Face.

The final quantized model is publicly available on the Hugging Face Hub at:

- **https://huggingface.co/ecorbari/Gemma-2b-it-Psych-GGUF**

### Alternative: GGUF Conversion Without Local Setup

If you prefer **not to perform the conversion and quantization locally**, Hugging Face provides a simple and fully managed alternative via the following Space:

- **https://huggingface.co/spaces/ggml-org/gguf-my-repo**

---

## 1️⃣ Prerequisites

If the previous notebook steps were executed successfully, the following model directory should already exist:

### Expected result

Gemma-2b-it-Psych-Merged/  
├─ config.json  
├─ model.safetensors  
└─ tokenizer.model / tokenizer.json  

If this directory does not exist, you must download the merged model from:

- https://huggingface.co/ecorbari/Gemma-2b-it-Psych-Merged

The `tokenizer.model` file is also required and can be obtained from:

- https://huggingface.co/google/gemma-2b-it/blob/main/tokenizer.model

After downloading, move the tokenizer file into the model directory:


In [None]:
mv tokenizer.model ~/fine-tune-llm/notebooks/Gemma-2b-it-Psych-Merged/
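
Before converting, it can help to confirm that the directory really contains the files described above. The helper below is a minimal sketch: `check_merged_model` is a hypothetical name, and it accepts any `*.safetensors` file on the assumption that some exports split the weights into shards.

```shell
# Hypothetical helper: verify that a merged model directory looks complete.
# Prints each missing item and returns non-zero if anything is absent.
check_merged_model() {
  dir=$1
  missing=0

  if [ ! -f "$dir/config.json" ]; then
    echo "missing: config.json"
    missing=1
  fi

  # Assumption: weights may be one file or several shards, so match any *.safetensors.
  if ! ls "$dir"/*.safetensors >/dev/null 2>&1; then
    echo "missing: *.safetensors"
    missing=1
  fi

  # The text above states that tokenizer.model is required.
  if [ ! -f "$dir/tokenizer.model" ]; then
    echo "missing: tokenizer.model"
    missing=1
  fi

  return "$missing"
}

# Assumed usage:
#   check_merged_model ~/fine-tune-llm/notebooks/Gemma-2b-it-Psych-Merged && echo "ready to convert"
```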

## 2️⃣ Install / Update llama.cpp

You can use the official `llama.cpp` repository as a reference to install the runtime on your machine.  
The build below targets a **CUDA-capable GPU**; conversion and quantization themselves run on the CPU, so the GPU matters mainly for fast inference.

A detailed guide on installing and configuring llama.cpp is available here:

- [Build llama-cpp](https://ecorbari.medium.com/running-local-llms-on-ubuntu-with-nvidia-gpu-using-llama-cpp-2ec2e010c040)

Below are the essential steps to build `llama.cpp` with CUDA support.

In [None]:
mkdir -p ~/projects && cd ~/projects
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

Configure and build with **CUDA** enabled (adjust the CUDA architecture according to your GPU compute capability):

In [None]:
cmake -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_CUDA_ARCHITECTURES=75  # Your GPU's compute capability without the dot (e.g., 7.5 -> 75)

cmake --build build -j$(nproc)
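
The architecture value is simply the compute capability with the dot removed. A small sketch of that conversion follows; `cc_to_arch` is a hypothetical helper, and the `nvidia-smi --query-gpu=compute_cap` query shown in the comment is an assumption that holds only on reasonably recent NVIDIA drivers.

```shell
# Hypothetical helper: turn a compute capability such as "7.5" into the
# value CMake expects for CMAKE_CUDA_ARCHITECTURES ("75").
cc_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Assumed usage on a machine with a recent NVIDIA driver:
#   cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(cc_to_arch "$cc")"
cc_to_arch 7.5   # prints 75
cc_to_arch 8.6   # prints 86
```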

To avoid dependency conflicts, create a Python virtual environment inside the `llama.cpp` directory:

In [None]:
python3 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install -r requirements.txt

## 3️⃣ Converting the Model to GGUF Format

With the merged model ready, the next step is to convert it to the **GGUF format**, which is required before any quantization can be applied.

Run the following command from the `llama.cpp` directory:

In [None]:
python3 convert_hf_to_gguf.py \
  ~/fine-tune-llm/notebooks/Gemma-2b-it-Psych-Merged/ \
  --outfile gemma-2b-it-psych-f16.gguf \
  --outtype f16

This command converts the merged Hugging Face model into a GGUF file in **fp16** precision.

**Output:**

After a successful conversion, the following file will be generated:

`gemma-2b-it-psych-f16.gguf`

You can verify the file size with:

In [None]:
du -sh gemma-2b-it-psych-f16.gguf

Example output:

- 4.7G gemma-2b-it-psych-f16.gguf
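
Besides the size, a quick sanity check is that the file starts with the four ASCII bytes `GGUF`, which is the magic number of the format. The sketch below reads only those bytes; `is_gguf` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the file begins with the GGUF magic bytes.
is_gguf() {
  # Read the first four bytes; drop NUL bytes so the shell can compare them as text.
  magic=$(head -c 4 "$1" 2>/dev/null | tr -d '\000')
  [ "$magic" = "GGUF" ]
}

# Assumed usage after conversion:
#   is_gguf gemma-2b-it-psych-f16.gguf && echo "looks like a GGUF file"
```

This only inspects the header, so it catches a failed or truncated-at-start conversion but says nothing about the tensors that follow.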

📌 This fp16 GGUF file serves as the base artifact for all subsequent quantization steps (e.g., Q4, Q5, Q8).

## 4️⃣ Quantizing to GGUF

With the fp16 GGUF file generated, we can now apply quantization to significantly reduce the model size while preserving most of its output quality.

**Recommended quantization:**

The following command applies **Q5_K_M quantization**, which offers an excellent balance between model quality and memory usage:

In [None]:
./build/bin/llama-quantize \
  gemma-2b-it-psych-f16.gguf \
  gemma-2b-it-psych-q5_k_m.gguf \
  Q5_K_M

After quantization, the resulting file will be noticeably smaller:

- 1.8G gemma-2b-it-psych-q5_k_m.gguf
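
To put the two `du` figures in perspective, the size ratio can be turned into an approximate number of bits stored per weight, since fp16 uses 16 bits. This is only a back-of-the-envelope sketch: `du -sh` rounds its output, the file also holds metadata, and `effective_bits` is a hypothetical helper name.

```shell
# Hypothetical helper: estimate the average bits per weight of a quantized file
# from its size and the size of the fp16 original (same unit for both).
effective_bits() {
  awk -v f16="$1" -v q="$2" 'BEGIN { printf "%.1f\n", 16 * q / f16 }'
}

effective_bits 4.7 1.8   # prints 6.1
```

A value somewhat above five bits is plausible for a mixed scheme such as Q5_K_M, where not every tensor is necessarily stored at the same precision.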

**Notes:**

Other quantization types (e.g., `Q4_K_M`, `Q6_K`, `Q8_0`) can also be used depending on your deployment constraints.

**Q5_K_M** is a common choice for instruction-following and alignment-sensitive models, as it provides a strong balance between compactness and output quality.
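
If several formats are wanted for comparison, the same fp16 file can be quantized in a loop. The following is a minimal sketch: `quantize_variants` is a hypothetical wrapper, the output naming simply mirrors the `-f16` / `-q5_k_m` pattern used above, and it assumes the input file name ends in `-f16.gguf`.

```shell
# Hypothetical wrapper: run the quantize tool once per requested type.
# $1 = path to llama-quantize, $2 = fp16 GGUF file, remaining args = types.
quantize_variants() {
  quantize_bin=$1
  src=$2
  shift 2
  for qtype in "$@"; do
    # Lower-case the type for the file name, e.g. Q4_K_M -> q4_k_m.
    suffix=$(printf '%s' "$qtype" | tr '[:upper:]' '[:lower:]')
    out="${src%-f16.gguf}-${suffix}.gguf"
    # Same argument order as the single command above: input, output, type.
    "$quantize_bin" "$src" "$out" "$qtype" || return 1
  done
}

# Assumed usage from the llama.cpp directory:
#   quantize_variants ./build/bin/llama-quantize gemma-2b-it-psych-f16.gguf Q4_K_M Q8_0
```

Passing the binary as an argument keeps the loop independent of where `llama.cpp` was built and stops at the first type that fails.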

📌 The quantized GGUF file is now ready for efficient local inference using llama.cpp, Jan, Ollama, or LM Studio.

## 5️⃣ Testing Model Inference

With the quantized GGUF model ready, the next step is to validate inference and ensure the model behaves as expected.

### Single-prompt inference

The following command runs a single inference pass using a fixed prompt:


In [None]:
./build/bin/llama-cli \
  -m gemma-2b-it-psych-q5_k_m.gguf \
  -p "I feel anxious and overwhelmed lately. What should I do?" \
  -n 256 \
  --temp 0.7 \
  --repeat-penalty 1.1

This allows you to quickly verify:

- Model loading and runtime compatibility
- Output coherence and alignment
- Basic response quality after quantization
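
The first of these checks can also be scripted. The sketch below treats a run as passing when the command exits successfully and prints some non-whitespace text; `smoke_test` is a hypothetical helper, and it says nothing about whether the answer is actually good.

```shell
# Hypothetical helper: run a command and succeed only if it exits with
# status 0 and writes at least one non-whitespace character to stdout.
smoke_test() {
  output=$("$@" 2>/dev/null) || return 1
  [ -n "$(printf '%s' "$output" | tr -d '[:space:]')" ]
}

# Assumed usage with the quantized model (flags taken from the command above):
#   smoke_test ./build/bin/llama-cli -m gemma-2b-it-psych-q5_k_m.gguf \
#     -p "I feel anxious and overwhelmed lately. What should I do?" -n 64 \
#     && echo "inference OK"
```

Depending on the llama.cpp version, `llama-cli` may remain in chat mode after answering, in which case the command waits for input instead of returning; coherence and tone still need a human read.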

### Interactive chat mode

For a real-time, conversational experience with the model, use interactive chat mode:

In [None]:
./build/bin/llama-cli \
  -m gemma-2b-it-psych-q5_k_m.gguf \
  -cnv

This mode enables continuous dialogue, making it easier to evaluate:

- Instruction-following behavior
- Conversational consistency
- Empathy and tone across multiple turns

📌 At this point, the model is fully operational and ready for practical use in local inference environments.

## 6️⃣ Publishing the Model to Hugging Face

After validating inference, the final step is to publish the quantized GGUF model to the **Hugging Face Hub**, making it easily accessible for download and integration with local runtimes.

Before uploading, authenticate with the Hugging Face CLI:

In [None]:
hf auth login

Follow the prompts and provide your Hugging Face access token.

**Upload the GGUF model**

Use the following command to upload the quantized model file to an existing Hugging Face repository:

In [None]:
hf upload \
  ecorbari/Gemma-2b-it-Psych-GGUF \
  ~/projects/llama.cpp/gemma-2b-it-psych-q5_k_m.gguf \
  --repo-type model

Once uploaded to a public repository, the model can be downloaded by anyone and consumed directly by tools that support GGUF models, such as llama.cpp, Ollama, LM Studio, and Jan (local mode).
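
Recording a SHA-256 checksum of the local file before uploading gives a simple way to confirm later that a downloaded copy is identical. A minimal sketch, assuming either `sha256sum` (GNU coreutils) or `shasum` is installed; `sha256_of` is a hypothetical helper name and `downloaded.gguf` is a placeholder.

```shell
# Hypothetical helper: print only the hex SHA-256 digest of a file.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    # Fallback for systems that ship shasum instead of sha256sum.
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# Assumed usage: compare the uploaded file with a fresh download.
#   [ "$(sha256_of gemma-2b-it-psych-q5_k_m.gguf)" = "$(sha256_of downloaded.gguf)" ] \
#     && echo "checksums match"
```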

📌 For additional details on the Hugging Face CLI, refer to the official documentation:

- https://huggingface.co/docs/huggingface_hub/en/guides/cli

**Using the Model with Jan**

To use the published GGUF model in Jan, configure it as a local model (GGUF / llama.cpp backend).
Jan does not list GGUF models under the **Hugging Face** provider, but it fully supports GGUF files in local mode.

For more information and downloads, visit:

- https://www.jan.ai/

At this point, the model is fully packaged, published, and ready for practical use.