### Convert the model to GGUF format

In this notebook we will save the model in the GGUF format. GGUF is a file format for storing models for inference with GGML, a tensor library for machine learning.

You can learn more about the format [here](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md).
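To make the format a little more concrete, here is a minimal sketch that reads the fixed-size start of a GGUF file. It assumes a little-endian file of GGUF version 2 or later, where the 4-byte magic `GGUF` is followed by a `uint32` version and two `uint64` counts. The helper name `read_gguf_header` is hypothetical and not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of every GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes a little-endian file with 64-bit counts (GGUF version 2 or later).
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 version, uint64 tensor count,
    # uint64 metadata key/value count (20 bytes, no padding).
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count


# Example call once a converted file exists (path is illustrative):
# print(read_gguf_header("models/intfloat/intfloat_16.gguf"))
```

A check like this is a cheap sanity test after conversion: if the magic bytes are missing, the output file was not written correctly.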

In [1]:
from huggingface_hub import snapshot_download



In [2]:
model_name = 'intfloat/multilingual-e5-large-instruct'

In [3]:
from pathlib import Path

In [4]:
model_repository = Path.cwd().joinpath("models")

In [5]:
model_repository.exists()


True

In [6]:
model_path = model_repository.joinpath(model_name)

### Download the model 

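Run the cell below to download the model. It passes `force_download=True`, so the files are fetched again on every run; comment the call out once the model is on disk.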

In [7]:
snapshot_download(repo_id=model_name, local_dir=model_path,
                  force_download=True, revision="main")

Fetching 19 files: 100%|██████████| 19/19 [02:56<00:00,  9.30s/it]


'/Users/esp.py/Projects/Personal/end-to-end-rag/models/intfloat/multilingual-e5-large-instruct'
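Since the download above took about three minutes, it can be worth skipping when the files are already present. The following is a minimal sketch of such a guard. The helper `has_local_snapshot` is a hypothetical name, and the check is deliberately naive: it only tests that the directory contains at least one file, not that the snapshot is complete.

```python
from pathlib import Path


def has_local_snapshot(model_dir):
    """Return True if model_dir exists and contains at least one file."""
    model_dir = Path(model_dir)
    return model_dir.is_dir() and any(p.is_file() for p in model_dir.rglob("*"))


# Possible use in this notebook (left commented because it needs network access):
# if not has_local_snapshot(model_path):
#     snapshot_download(repo_id=model_name, local_dir=model_path, revision="main")
```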

After downloading the model, we need to save it as a GGUF file, the file format used by llama.cpp.

In [8]:
gguf_32_bits_path  = model_path.parent.joinpath(f"{model_name.split('/')[0]}_32.gguf")
gguf_16_bits_path  = model_path.parent.joinpath(f"{model_name.split('/')[0]}_16.gguf")
assert gguf_32_bits_path.parent.exists()

assert gguf_16_bits_path.parent.exists()
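Note that `model_name.split('/')[0]` yields the organisation part of the repository id, so the files above are named `intfloat_32.gguf` and `intfloat_16.gguf`. The sketch below shows how the name is derived and, as an optional alternative, how to use the model part instead. The helper `gguf_filename` and its `use_model_name` flag are hypothetical, introduced only for illustration.

```python
def gguf_filename(repo_id, bits, use_model_name=False):
    """Build an output file name such as 'intfloat_16.gguf' from a repo id."""
    org, _, name = repo_id.partition("/")
    # The notebook uses the organisation part; the model part is an alternative.
    stem = name if use_model_name else org
    return f"{stem}_{bits}.gguf"


repo_id = "intfloat/multilingual-e5-large-instruct"
print(gguf_filename(repo_id, 16))                       # intfloat_16.gguf
print(gguf_filename(repo_id, 16, use_model_name=True))  # multilingual-e5-large-instruct_16.gguf
```

Naming by model rather than organisation avoids collisions if several models from the same organisation are converted into the same directory.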

In [9]:
llama_cpp_path = Path.cwd().parent.joinpath("llama.cpp")
convert_script_path = str(llama_cpp_path.joinpath("convert_hf_to_gguf.py"))

In [10]:
!python $convert_script_path $model_path --outfile $gguf_16_bits_path --outtype f16

INFO:hf-to-gguf:Loading model: multilingual-e5-large-instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd_norm.bias,            torch.float16 --> F32, shape = {1024}
INFO:hf-to-gguf:token_embd_norm.weight,          torch.float16 --> F32, shape = {1024}
INFO:hf-to-gguf:position_embd.weight,            torch.float16 --> F32, shape = {1024, 512}
INFO:hf-to-gguf:token_types.weight,              torch.float16 --> F32, shape = {1024}
INFO:hf-to-gguf:token_embd.weight,               torch.float16 --> F16, shape = {1024, 250002}
INFO:hf-to-gguf:blk.0.attn_output_norm.bias,     torch.float16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_output_norm.weight,   torch.float16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_output.bias,          torch.float16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,        torch.float16 -
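The `!python ...` line only works inside IPython or Jupyter. As a rough sketch, the same conversion could be launched from a plain Python script with `subprocess`. The helper `build_convert_command` is a hypothetical name; the `--outfile` and `--outtype` flags are the ones used in the cell above.

```python
import subprocess
import sys


def build_convert_command(script, model_dir, outfile, outtype="f16"):
    """Return the argument list for running the llama.cpp conversion script."""
    return [
        sys.executable,  # reuse the interpreter that runs this script
        str(script),
        str(model_dir),
        "--outfile", str(outfile),
        "--outtype", outtype,
    ]


# Possible use with the variables defined earlier in this notebook:
# cmd = build_convert_command(convert_script_path, model_path, gguf_16_bits_path)
# subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.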

Next, we will try to download the 6-bit quantization of Qwen/Qwen2.5-14B-Instruct-GGUF and run it on a machine with 16 GB of RAM.
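A back-of-the-envelope estimate suggests why this is plausible but tight. The sketch below assumes a nominal 14 billion parameters at exactly 6 bits each; both numbers are simplifying assumptions. Real quantized files store extra scale data, and inference also needs memory for the context cache and the rest of the system.

```python
def estimate_weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values: 14e9 parameters, 6 bits per weight.
size_gb = estimate_weight_size_gb(14e9, 6)
print(f"{size_gb:.1f} GB")  # 10.5 GB
```

Under these assumptions the weights alone take roughly two thirds of the 16 GB available.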