# Gemma2 Fine-Tuning: From SFT and DPO to GGUF Deployment with Ollama

![](https://qudata.com/en/images/gemma.png)

This notebook series uses Hugging Face's [TRL](https://huggingface.co/docs/trl/index) (including its [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer)), [transformers](https://huggingface.co/docs/transformers/index), [datasets](https://huggingface.co/docs/datasets/index), and PEFT libraries to fine-tune Google's open model, Gemma2-2b, for instruction-based tasks.

Although google/gemma-2-2b-it is an instruction-tuned model, I am fine-tuning it using SFT and DPO to create a concise and focused Korean QA bot that provides answers with only the essential information.  

## Overview
The project consists of three parts:

#### 1. Part 1: SFT and Evaluation
Filename: 1_SFT_and_Evaluation.ipynb  
Contents:
- Set up the development environment
- Prepare and preprocess the dataset
- Fine-tune the Gemma2-2b model using TRL and SFTTrainer
- Evaluate the fine-tuned model (basic performance assessment)

#### 2. Part 2: DPO and Comparison
Filename: 2_DPO_and_Comparison.ipynb  
Contents:
- Apply DPO to the SFT fine-tuned model for additional tuning
- Compare the model's output before and after DPO

#### 3. Part 3: GGUF Conversion and Deployment
Filename: 3_GGUF_Conversion_and_Deployment.ipynb  
Contents:
- Convert the DPO fine-tuned model to GGUF format
- Apply quantization to GGUF

## Part 3: GGUF Conversion and Deployment

## 1. Set up the development environment

In [1]:
!pip install transformers==4.42.3 peft==0.10.0 trl==0.8.6 bitsandbytes==0.43.0 accelerate==0.29.0

Collecting transformers==4.42.3
  Downloading transformers-4.42.3-py3-none-any.whl.metadata (43 kB)
Collecting peft==0.10.0
  Downloading peft-0.10.0-py3-none-any.whl.metadata (13 kB)
Collecting trl==0.8.6
  Downloading trl-0.8.6-py3-none-any.whl.metadata (11 kB)
Collecting bitsandbytes==0.43.0
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl.metadata (1.8 kB)
Collecting accelerate==0.29.0
  Downloading accelerate-0.29.0-py3-none-any.whl.metadata (18 kB)
Collecting datasets (from trl==0.8.6)
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting tyro>=0.5.11 (from trl==0.8.6)
  Downloading tyro-0.8.11-py3-none-any.whl.metadata (8.4 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl==0.8.6)
  Downloading shtab-1

In [3]:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, PeftModel
import torch
from huggingface_hub import login
from pprint import pprint
import warnings

In [2]:
HF_token="YOUR TOKEN HERE"

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Merging the Base Model and Fine-Tuned Adapter

The DPO fine-tuning output that we saved contains only the LoRA adapter weights, not a full model.  
To get a standalone model, we therefore load the base model used during DPO fine-tuning and merge the LoRA adapter into it.

In [10]:
# Base model that the DPO adapter was trained on.
model_id = "acho98/gemma2-2b-it-tuned-and-merged"

# Load the base model in bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    token=HF_token,
)

# LoRA adapter checkpoint saved by the DPO run.
fine_tuned_dpo_adapter_path = '/content/drive/MyDrive/llm/gemma/gemma2-2b-it-dpo_output/checkpoint-80'

# Attach the adapter, then fold its weights into the base model.
base_and_adapter_model = PeftModel.from_pretrained(model, fine_tuned_dpo_adapter_path)
base_and_adapter_model = base_and_adapter_model.merge_and_unload()

# Reuse the tokenizer stored alongside the adapter checkpoint.
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_dpo_adapter_path)

# Upload the merged model and tokenizer to the Hugging Face Hub.
base_and_adapter_model.push_to_hub("acho98/gemma2-2b-it-dpo-tuned-and-merged", token=HF_token)
tokenizer.push_to_hub("acho98/gemma2-2b-it-dpo-tuned-and-merged", token=HF_token)
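
To make the merge step less of a black box, here is a toy sketch of what folding a LoRA adapter into a weight matrix means numerically. It assumes the standard LoRA formulation, where the merged weight is `W + (lora_alpha / r) * B @ A`; the matrix sizes, the helper functions, and the random values are all made up for illustration, and plain Python lists are used so the sketch has no dependencies. It is not the PEFT implementation.

```python
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Return a rows x cols matrix of Gaussian samples as nested lists."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    """Multiply matrix a (n x k) by vector v (length k)."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

d_in, d_out, r = 8, 6, 2        # toy sizes; r is the LoRA rank
lora_alpha = 4                  # toy scaling hyperparameter
scaling = lora_alpha / r

W = rand_matrix(d_out, d_in)    # frozen base weight
A = rand_matrix(r, d_in)        # LoRA down-projection
B = rand_matrix(d_out, r)       # LoRA up-projection
x = [random.gauss(0.0, 1.0) for _ in range(d_in)]

# Adapter kept separate: base path plus scaled low-rank path.
y_adapter = [w + scaling * l for w, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged: fold the low-rank update into the weight once.
BA = matmul(B, A)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

max_abs_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(f"max |y_adapter - y_merged| = {max_abs_diff:.2e}")
```

Both paths give the same output up to floating-point rounding, but the merged matrix has the same shape as the original weight, so the result is an ordinary model with no adapter attached.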

After merging the base model and the fine-tuned LoRA adapter, save the complete model to a local directory. The GGUF conversion script in the next section reads the model from this path.

In [6]:
output_dir = "/content/gemma2-2b-it-dpo-tuned-and-merged"

base_and_adapter_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('/content/gemma2-2b-it-dpo-tuned-and-merged/tokenizer_config.json',
 '/content/gemma2-2b-it-dpo-tuned-and-merged/special_tokens_map.json',
 '/content/gemma2-2b-it-dpo-tuned-and-merged/tokenizer.model',
 '/content/gemma2-2b-it-dpo-tuned-and-merged/added_tokens.json',
 '/content/gemma2-2b-it-dpo-tuned-and-merged/tokenizer.json')

## 3. Conversion to GGUF Format

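GGUF is the binary model file format used by llama.cpp. A GGUF file stores the model's tensors and metadata in a single file, which makes it convenient to run, and later quantize, the model for efficient deployment on low-resource devices.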

First, clone the llama.cpp repository from GitHub.

In [7]:
! git clone https://github.com/ggerganov/llama.cpp.git

Cloning into 'llama.cpp'...
remote: Enumerating objects: 34853, done.
remote: Counting objects: 100% (6808/6808), done.
remote: Compressing objects: 100% (395/395), done.
remote: Total 34853 (delta 6592), reused 6490 (delta 6410), pack-reused 28045 (from 1)
Receiving objects: 100% (34853/34853), 58.02 MiB | 16.30 MiB/s, done.
Resolving deltas: 100% (25268/25268), done.


In [1]:
%cd llama.cpp

/content/llama.cpp


Install the required Python libraries.

In [2]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu


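Convert the merged model to GGUF format. No `--outtype` is passed, so the script uses its default output type: the log below shows the large weight matrices written as F16 (and the norm weights as F32), even though the output file name says `f32`.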

In [3]:
!python3 convert_hf_to_gguf.py /content/gemma2-2b-it-dpo-tuned-and-merged/ --outfile /content/ggml-gemma2-2b-it-dpo-f32.gguf

INFO:hf-to-gguf:Loading model: gemma2-2b-it-dpo-tuned-and-merged
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00002.safetensors'
INFO:hf-to-gguf:token_embd.weight,                 torch.bfloat16 --> F16, shape = {2304, 256000}
INFO:hf-to-gguf:blk.0.attn_norm.weight,            torch.bfloat16 --> F32, shape = {2304}
INFO:hf-to-gguf:blk.0.ffn_down.weight,             torch.bfloat16 --> F16, shape = {9216, 2304}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,             torch.bfloat16 --> F16, shape = {2304, 9216}
INFO:hf-to-gguf:blk.0.ffn_up.weight,               torch.bfloat16 --> F16, shape = {2304, 9216}
INFO:hf-to-gguf:blk.0.post_attention_norm.weight,  torch.bfloat16 --> F32, shape = {2304}
INFO:hf-to-gguf:blk.0.post_ffw_norm.weight,        torch.bfloat16 --> F32, shape = {2304}
INFO:hf-to-gguf:
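
For a quick sanity check of a converted file, the fixed-size start of a GGUF file can be read with Python's `struct` module. This is a minimal sketch assuming the GGUF version 2 or 3 layout (a 4-byte magic `GGUF`, a little-endian `uint32` version, then `uint64` tensor and metadata key-value counts). `parse_gguf_header` is a hypothetical helper, not part of llama.cpp. The sample values (version 3, 288 tensors, 34 key-value pairs) are the ones `llama-quantize` reports for this model in its log.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size prefix of a GGUF v2/v3 file (little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes, got {len(data)}")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic is {data[:4]!r}")
    # "<" means little-endian with no padding: uint32, uint64, uint64.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header so the sketch runs without the multi-gigabyte model file.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 288, 34)
header = parse_gguf_header(sample)
print(header)

# Against a real file (path taken from the conversion command above):
# with open("/content/ggml-gemma2-2b-it-dpo-f32.gguf", "rb") as f:
#     print(parse_gguf_header(f.read(HEADER_SIZE)))
```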

## 4. Apply Quantization to GGUF: Q4_K_M
To build the executables required for quantization, including `llama-quantize`, run the `make` command in the llama.cpp directory.

In [4]:
!make

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 -g 
I LDFLAGS:    
I CC:     

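Quantize the GGUF file with the Q4_K_M method. No output path is given, so `llama-quantize` writes the result next to the input file as `ggml-model-Q4_K_M.gguf`.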

In [7]:
! ./llama-quantize /content/ggml-gemma2-2b-it-dpo-f32.gguf Q4_K_M

main: build = 3828 (95bc82fb)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/ggml-gemma2-2b-it-dpo-f32.gguf' to '/content/ggml-model-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from /content/ggml-gemma2-2b-it-dpo-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma2 2b It Tuned And Merged
llama_model_loader: - kv   3:                       general.organization str              = Acho98
llama_model_loader: - kv   4:                           general.finetune str              = it-tuned-and-merged
llama_mo
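
To give a feel for what 4-bit quantization does to the weights, the sketch below quantizes one block of values to signed 4-bit integers with a single shared scale. This is a deliberately simplified illustration and not the Q4_K_M algorithm, whose actual layout is more elaborate and is not reproduced here. The block size, the function names, and the random weights are assumptions made for the example.

```python
import random

BLOCK_SIZE = 32  # illustrative block size, chosen for this example

def quantize_block(values):
    """Map floats to signed 4-bit codes in [-8, 7] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the 4-bit codes."""
    return [scale * c for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]

scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)

# Rounding to the nearest code means the error is at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale = {scale:.6f}, max error = {max_error:.6f}")

# Storage cost of this toy scheme: 4 bits per code plus one 16-bit scale.
bits_per_weight = (BLOCK_SIZE * 4 + 16) / BLOCK_SIZE
print(f"bits per weight = {bits_per_weight}")
```

The trade-off is visible directly: each weight costs only a few bits, at the price of a bounded rounding error per block.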

After completion, verify the generated files and compare their sizes.

In [8]:
!ls -l /content

total 6781080
drwx------  5 root root       4096 Sep 27 04:43 drive
drwxr-xr-x  2 root root       4096 Sep 27 05:05 gemma2-2b-it-dpo-tuned-and-merged
-rw-r--r--  1 root root 5235213888 Sep 27 05:24 ggml-gemma2-2b-it-dpo-f32.gguf
-rw-r--r--  1 root root 1708582464 Sep 27 06:08 ggml-model-Q4_K_M.gguf
drwxr-xr-x 23 root root       4096 Sep 27 05:54 llama.cpp
drwxr-xr-x  1 root root       4096 Sep 20 13:22 sample_data
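
The size comparison can be made explicit with a few lines of arithmetic. The byte counts below are copied from the `ls -l` output above; the variable names are only for this example.

```python
# File sizes in bytes, taken from the `ls -l /content` listing.
unquantized_bytes = 5_235_213_888   # ggml-gemma2-2b-it-dpo-f32.gguf
quantized_bytes = 1_708_582_464     # ggml-model-Q4_K_M.gguf

ratio = unquantized_bytes / quantized_bytes
saved = 1 - quantized_bytes / unquantized_bytes

gib = 1024 ** 3
print(f"{unquantized_bytes / gib:.2f} GiB -> {quantized_bytes / gib:.2f} GiB")
print(f"compression ratio: {ratio:.2f}x, size reduction: {saved:.1%}")
```

The quantized file is roughly a third of the size of the converted one.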


## Conclusion

In this notebook, the DPO-tuned adapter was merged into its base model, and the merged model was converted to the GGUF format and then quantized with the Q4_K_M method, which reduced the file size from about 5.2 GB to about 1.7 GB. When the quantized model was deployed on Ollama, it responded quickly, which suggests that the conversion and compression worked as intended.
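
The Ollama step itself is not shown in this notebook. As a rough sketch, a local GGUF file can be registered with Ollama through a `Modelfile` whose `FROM` line points at the file. The helper below only builds that text. The model name `gemma2-2b-it-dpo-q4` is a hypothetical placeholder, and a prompt template or stop parameters for the Gemma chat format may also be needed; they are omitted here.

```python
def build_modelfile(gguf_path: str) -> str:
    """Return minimal Modelfile text that points Ollama at a local GGUF file."""
    return f"FROM {gguf_path}\n"

# Path of the quantized file produced earlier in this notebook.
modelfile_text = build_modelfile("/content/ggml-model-Q4_K_M.gguf")
print(modelfile_text, end="")

# To use it, save the text to a file named "Modelfile", for example:
#   from pathlib import Path
#   Path("Modelfile").write_text(modelfile_text)
# Then, on a machine with Ollama installed (model name is a placeholder):
#   ollama create gemma2-2b-it-dpo-q4 -f Modelfile
#   ollama run gemma2-2b-it-dpo-q4
```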