### Convert the model to GGUF format

In this notebook we save the model in the GGUF format. GGUF is a file format for storing models for inference with GGML, a tensor library for machine learning.

You can learn more about the format in the [GGUF specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md).
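To make the format less abstract, here is a minimal sketch that parses the fixed-size header at the start of a GGUF file. It assumes GGUF version 2 or later, where the header is little-endian and consists of the 4-byte magic `GGUF`, a `uint32` version, a `uint64` tensor count and a `uint64` metadata key/value count. The function name `read_gguf_header` is hypothetical and not part of the `gguf` Python package.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes GGUF version 2 or later).

    Layout (little-endian): 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count = 24 bytes.
    """
    if len(data) < 24:
        raise ValueError("buffer too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={data[:4]!r})")
    # "<" = little-endian, no padding; I = uint32, Q = uint64.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}
```

Once a model has been converted, you could pass the first 24 bytes of the output file (for example `open(path, "rb").read(24)`) to check that the conversion produced a valid GGUF header.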

In [2]:
from huggingface_hub import snapshot_download


In [3]:
model_name = 'Lajavaness/bilingual-embedding-large'

In [4]:
from pathlib import Path

In [5]:
model_repository = Path.cwd().joinpath("models")

In [6]:
model_repository.exists()

True

In [7]:
model_path = model_repository.joinpath(model_name)

### Download the model 

Uncomment the cell below to download the model. Note that `force_download=True` fetches every file again, even if it is already present locally.

In [8]:
snapshot_download(repo_id=model_name, local_dir=model_path,
                  force_download=True, revision="main")

Fetching 13 files: 100%|██████████| 13/13 [04:36<00:00, 21.29s/it]


'/Users/esp.py/Projects/Personal/end-to-end-rag/models/Lajavaness/bilingual-embedding-large'
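Instead of commenting the download cell in and out, one option is to download only when the files are missing. A minimal sketch, assuming that the presence of `config.json` marks a complete download (most Hugging Face model repositories ship one); the helper `needs_download` is hypothetical:

```python
from pathlib import Path

def needs_download(model_dir: Path, required=("config.json",)) -> bool:
    """Return True if model_dir is missing or lacks any of the required files.

    Treating config.json as the marker of a finished download is an
    assumption; adjust `required` to the files your repository contains.
    """
    if not model_dir.is_dir():
        return True
    return any(not (model_dir / name).is_file() for name in required)
```

In the notebook this would wrap the existing call: `if needs_download(model_path): snapshot_download(repo_id=model_name, local_dir=model_path, revision="main")`.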

After downloading the model, we convert it to a GGUF file, the format used by llama.cpp.

In [8]:
# Name the files after the model (last part of the repo id) rather than the organisation.
gguf_32_bits_path = model_path.parent.joinpath(f"{model_name.split('/')[-1]}_32.gguf")
gguf_16_bits_path = model_path.parent.joinpath(f"{model_name.split('/')[-1]}_16.gguf")
assert gguf_32_bits_path.parent.exists()
assert gguf_16_bits_path.parent.exists()
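The two target files differ in precision: `f32` stores each weight in 4 bytes and `f16` in 2, so the 16-bit file is roughly half the size. A rough sketch of that arithmetic, with a hypothetical helper `estimate_gguf_size` that ignores metadata, alignment padding, and any small tensors the converter keeps in F32:

```python
# Bytes per weight for the two output types used in this notebook.
BYTES_PER_WEIGHT = {"f32": 4, "f16": 2}

def estimate_gguf_size(n_params: int, outtype: str) -> int:
    """Rough estimate of the tensor data size, in bytes, for a given --outtype."""
    return n_params * BYTES_PER_WEIGHT[outtype]

# A 1-billion-parameter model: about 4 GB in f32 versus about 2 GB in f16.
for outtype in ("f32", "f16"):
    print(outtype, estimate_gguf_size(1_000_000_000, outtype) / 1e9, "GB")
```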

In [9]:
llama_cpp_path = Path.cwd().parent.joinpath("llama.cpp")
convert_script_path = str(llama_cpp_path.joinpath("convert_hf_to_gguf.py"))
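The `!python ...` cell relies on IPython shell magic and runs whichever `python` is first on the `PATH`, which is not necessarily the kernel's interpreter. A minimal sketch of the same call with `subprocess`, using `sys.executable` so the conversion runs in the same environment as the notebook; the helper names `build_convert_command` and `convert_to_gguf` are hypothetical, and only the `--outfile` and `--outtype` flags from the cell are used:

```python
import subprocess
import sys
from pathlib import Path

def build_convert_command(script: str, model_dir: Path, outfile: Path,
                          outtype: str = "f16") -> list[str]:
    """Assemble the same command line that the `!python ...` cell runs."""
    return [sys.executable, script, str(model_dir),
            "--outfile", str(outfile), "--outtype", outtype]

def convert_to_gguf(script: str, model_dir: Path, outfile: Path,
                    outtype: str = "f16") -> None:
    # check=True raises CalledProcessError if the conversion script exits non-zero.
    subprocess.run(build_convert_command(script, model_dir, outfile, outtype), check=True)
```

With the notebook's variables this would be `convert_to_gguf(convert_script_path, model_path, gguf_16_bits_path)`, and the same script also works outside Jupyter.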

In [10]:
!python $convert_script_path $model_path --outfile $gguf_16_bits_path --outtype f16

INFO:hf-to-gguf:Loading model: Qwen2.5-1.5B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> F16, shape = {1536, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.bfloat16 --> F16, shape = {8960, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.bfloat16 --> F16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.bfloat16 --> F16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_k.bias,         torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.0.attn_k.weight,       torch.bfloat16 --> F16, shape = {1536, 256}
INFO:hf-to-gguf:blk.0.attn_output.weight,  torch.bfloat16 --> F16, shape = {1536, 1536}
INFO:hf-