Tutorial: How to convert HuggingFace model to GGUF format #2948

samos123 · 2023-09-01T02:02:05Z

samos123
Sep 1, 2023

Source: https://www.substratus.ai/blog/converting-hf-model-gguf-model/

I published this on our blog but thought others here might benefit as well, so I'm sharing the raw blog post here on GitHub too. I hope it's helpful to folks here, and feedback is welcome.

Downloading a HuggingFace model

There are various ways to download models, but in my experience the huggingface_hub
library has been the most reliable. The git clone method occasionally results in
OOM errors for large models.

Install the huggingface_hub library:

pip install huggingface_hub

Create a Python script named download.py with the following content:

from huggingface_hub import snapshot_download
model_id="lmsys/vicuna-13b-v1.5"
snapshot_download(repo_id=model_id, local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main")

Run the Python script:

python download.py

You should now have the model downloaded to a directory called
vicuna-hf. Verify by running:

ls -lash vicuna-hf

Converting the model

Now it's time to convert the downloaded HuggingFace model to a GGUF model.
Llama.cpp comes with a converter script to do this.

Get the script by cloning the llama.cpp repo:

git clone https://github.com/ggerganov/llama.cpp.git

Install the required python libraries:

pip install -r llama.cpp/requirements.txt

Verify the script is there and understand the various options:

python llama.cpp/convert.py -h

Convert the HF model to GGUF model:

python llama.cpp/convert.py vicuna-hf \
  --outfile vicuna-13b-v1.5.gguf \
  --outtype q8_0

In this case we're also quantizing the model to 8 bit by setting
--outtype q8_0. Quantizing helps improve inference speed, but it can
negatively impact quality.
You can use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve original
quality.

Verify the GGUF model was created:

ls -lash vicuna-13b-v1.5.gguf

Pushing the GGUF model to HuggingFace

You can optionally push back the GGUF model to HuggingFace.

Create a Python script with the filename upload.py that
has the following content:

from huggingface_hub import HfApi
api = HfApi()

model_id = "substratusai/vicuna-13b-v1.5-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="vicuna-13b-v1.5.gguf",
    path_in_repo="vicuna-13b-v1.5.gguf",
    repo_id=model_id,
)

Get a HuggingFace Token that has write permission from here:
https://huggingface.co/settings/tokens

Set your HuggingFace token:

export HUGGING_FACE_HUB_TOKEN=<paste-your-own-token>

Run the upload.py script:

python upload.py

KerfuffleV2 · 2023-09-01T02:20:36Z

KerfuffleV2
Sep 1, 2023
Collaborator

You might want to add a small note that requantizing to other formats from q8_0 will reduce quality a bit compared to quantizing from f16 or f32. It should be a pretty small difference, but could matter if the person is intending to widely distribute quantized models or aiming for the top of the HF leaderboard. In other words, someone like TheBloke wouldn't want to convert to q8_0 and then requantize to all the other formats to distribute them, they'd want to convert to f16 or f32 and quantize from that.
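To make the requantization point concrete, here is a toy sketch in plain Python. It assumes a simple symmetric round-to-nearest scheme with one shared scale, which is not the block-wise kernels llama.cpp actually uses, and the function names are made up for illustration. It compares quantizing straight to 4 bits against going through 8 bits first.

```python
import random

def absmax_quantize(values, bits):
    """Toy symmetric round-to-nearest quantization with a single shared scale."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / levels if amax else 1.0
    return [round(v / scale) * scale for v in values]

def rmse(a, b):
    """Root-mean-square error between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]    # stand-in for model weights

direct = absmax_quantize(weights, 4)                         # original -> 4 bit
two_step = absmax_quantize(absmax_quantize(weights, 8), 4)   # original -> 8 bit -> 4 bit

err_direct = rmse(weights, direct)
err_two_step = rmse(weights, two_step)
print(f"direct 4-bit RMSE:        {err_direct:.6f}")
print(f"8-bit then 4-bit RMSE:    {err_two_step:.6f}")
```

In this toy setup the two-step path can never beat the direct one, because direct rounding already picks the nearest 4-bit level for every weight. The gap is small, which matches the "pretty small difference" described above.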

3 replies

samos123 Sep 1, 2023
Author

Thanks, very helpful feedback! I'm just getting started and learning as I go here.

I added this as a note:
In this case we're also quantizing the model to 8 bit by setting
--outtype q8_0. Quantizing helps improve inference speed, but it can
negatively impact quality.
You can use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve original
quality.

KerfuffleV2 Sep 1, 2023
Collaborator

No problem. The convert.py tool is mostly just for converting models in other formats (like HuggingFace) to one that other GGML tools can deal with. I was actually the one who added the ability for that tool to output q8_0. What I was thinking is that for someone who just wants to do stuff like test different quantizations, being able to keep a nearly original quality model around at 1/2 the size is pretty useful.

The actual examples/quantize tool is what should be used most of the time for quantizing, because it supports many formats. Generally speaking, for quality you're better off running a model with more parameters than an unquantized or less quantized model. In other words, if I can run a 16bit 7B model or a 4bit (like q4_0, q4_k) 33B model I'm going to want to use the 4bit 33B. It's also a lot faster when you can run a model on the GPU so quantizing it so it can fit can make a big difference.

For those reasons, I think most people use fairly heavily quantized models (q4_k_m feels like the sweet spot to me) and normally would only use 16bit or q8_0 when running really small models because if you have lots of extra resources why not run the model in full quality?

Quantizing often does improve inference speed, but probably the main reason people use it is that you need so much memory to run big models without it. A 70B 16bit model takes about 140GB of RAM even if you're just running it on CPU, whereas I can run that same 70B model quantized to q4_k_m on my system with 64GB and still have memory left over to do other stuff. Disk space and download time are also a consideration. Even a q4_k_m 70B is about 40GB. Keeping a 16bit version around would be around 160GB and would take much longer to download in the first place.

Not telling you to write anything specific, hopefully this information is helpful.
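The size figures above follow from simple arithmetic: parameters times bits per weight, divided by 8. A rough sketch follows. The 8.5 bits for q8_0 assumes ggml stores blocks of 32 int8 weights plus one 16-bit scale, which is my understanding and not something stated in this thread. The helper names are hypothetical, and metadata and context memory are ignored.

```python
# Bits per weight for the three types convert.py can emit.
# q8_0 (assumed layout): 32 int8 weights + one 16-bit scale -> (32*8 + 16) / 32 = 8.5
BITS_PER_WEIGHT = {"f32": 32.0, "f16": 16.0, "q8_0": 8.5}

def size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

def effective_bits(file_gb, n_params):
    """Back out the average bits per weight from a file size in GB."""
    return file_gb * 1e9 * 8 / n_params

n = 70e9  # a 70B model
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"70B {name}: ~{size_gb(n, bpw):.0f} GB")
print(f"a 40 GB 70B file averages ~{effective_bits(40, n):.1f} bits per weight")
```

The last line only inverts the formula for the "about 40GB" q4_k_m figure quoted above. It is not an official bits-per-weight number for that format.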

Green-Sky Sep 2, 2023
Collaborator

[...] however I can run that same 70B model quantized to q4_k_m on my system with 64GB and still have memory left over to do other stuff.

You can even run 70B quantized to q3_k_s on 32GB ram + 8GB vram. (imho the most impressive feat of llama.cpp).
There is even enough ram left over to run the model with 8k context size, with a slight quality loss. (--rope-scale 2 -c 8192 -ngl 20)

szymonrucinski · 2023-09-01T14:11:39Z

szymonrucinski
Sep 1, 2023

I have a model trained using QLoRA, and the smallest quantization I can convert it to with GGUF is 8-bit. What about q4_K_S quantization? Why is it not available?

10 replies

ghost Sep 2, 2023

lol Just preparing for the general move to GGUF, so will be doing some experimentation on what is 'best' in my situation. Gotta love moving targets.

Kainat-R Oct 5, 2023

Hey @szymonrucinski, how did you convert a QLoRA-trained model into GGUF? QLoRA-trained models do not have a config.json file, which the GGUF conversion needs. Kindly refer to #3489 and help me if you can. Thanks!

francisco-lafe Jan 4, 2024

The Python convert tool is mostly for just converting models to GGUF/GGML compatible format. I actually added the q8_0 quantization to that recently since it's very close to the same quality as not quantizing. The idea is basically that it's an okay storage format to use for quantizing to others like q4_k_s and uses half as much space as 16bit.

For quantizing to all the formats llama.cpp supports, use the examples/quantize tool. Also keep in mind what I mentioned: If you quantize from 16bit to q4_k_s you'll get slightly better results than quantizing from q8_0.

Hi, maybe I'm missing something, but in that folder, examples/quantize, there is no binary or similar, just a CPP file.
How do you quantize to something smaller than q8_0?
My goal is to go to q4_k

kevinknights29 Mar 3, 2024

Hi @francisco-lafe, in order to run the quantize tool, you need to build the llama.cpp repo.
You can use the following instructions from the README:

mkdir build
cd build
cmake ..
cmake --build . --config Release

Then you can run the quantize tool from binary, located at llama.cpp/build/bin
Example:

cd llama.cpp/build/bin && \
   ./quantize ./models/Llama-2-7b-chat-hf/ggml-model-f16.gguf ./models/Llama-2-7b-chat-hf/ggml-model-q4_0.gguf q4_0

francisco-lafe Mar 4, 2024

@kevinknights29 Thanks, I ended up using the pre-compiled binaries as I'm on Windows.
Regards,

AIAnytime · 2023-09-06T05:59:17Z

AIAnytime
Sep 6, 2023

Thanks for the wonderful explanation. I am getting below error:
Loading model file refact-hf/pytorch_model.bin
Traceback (most recent call last):
  File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 1225, in <module>
    main()
  File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 1174, in main
    params = Params.load(model_plus)
  File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 304, in load
    params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
  File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 214, in loadHFTransformerJson
    n_embd = config["hidden_size"]
KeyError: 'hidden_size'

Can anyone help me debug this?

5 replies

samos123 Sep 6, 2023
Author

Can you show some more steps on what you did? Which model are you trying to load?

AIAnytime Sep 6, 2023

I think the convert.py inside llama.cpp is only for Llama based models? For example I am trying to convert a model whose config.json is below:

{
  "architectures": [
    "GPTRefactForCausalLM"
  ],
  "attention_softmax_in_fp32": false,
  "attn_pdrop": 0.1,
  "auto_map": {
    "AutoConfig": "configuration_gpt_refact.GPTRefactConfig",
    "AutoModelForCausalLM": "modeling_gpt_refact.GPTRefactForCausalLM"
  },
  "bos_token_id": -1,
  "do_sample": true,
  "embd_pdrop": 0.1,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt_refact",
  "multi_query": true,
  "n_embd": 2048,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 32,
  "n_positions": 4096,
  "resid_pdrop": 0.1,
  "scale_attention_softmax_in_fp32": false,
  "scale_attn_weights": true,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 49216
}

Now if you look at convert.py, it reads many additional params which might not be present in a non-LLaMA model. Can anyone confirm and help me here?

KerfuffleV2 Sep 6, 2023
Collaborator

convert.py is just for LLaMA models. There's also a converter for Falcon models in a separate script. It looks like you're trying to convert a Refact model? It's not supported yet, there's another discussion open though: #3013
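One way to spot this before running the converter is to check config.json for the keys that convert.py reads. The sketch below checks only the two keys that appear in tracebacks in this thread ("hidden_size" and "rms_norm_eps"), not the full set the script uses, and the helper names are hypothetical.

```python
import json

# Keys that tracebacks in this thread show convert.py reading from config.json.
# This is deliberately not the complete list the script uses.
KEYS_SEEN_IN_TRACEBACKS = ("hidden_size", "rms_norm_eps")

def missing_llama_keys(config):
    """Return the checked LLaMA-style keys that are absent from a config dict."""
    return [k for k in KEYS_SEEN_IN_TRACEBACKS if k not in config]

def describe(config):
    """One-line summary of whether a config looks LLaMA-style for these keys."""
    arch = ", ".join(config.get("architectures", ["unknown"]))
    missing = missing_llama_keys(config)
    if missing:
        return f"{arch}: missing {missing}, probably not a LLaMA-style config"
    return f"{arch}: has the LLaMA-style keys checked here"

# Trimmed copy of the Refact config posted above (it uses n_embd, not hidden_size).
refact = json.loads(
    '{"architectures": ["GPTRefactForCausalLM"], "n_embd": 2048, "n_head": 32, "n_layer": 32}'
)
print(describe(refact))
```

For a real model directory, the same check could be run on `json.load(open("refact-hf/config.json"))`.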

AIAnytime Sep 6, 2023

Thanks for your quick response, appreciate it! One question, is there any script available for Starcoder or any other bigcode LLM family? Or is it only Llama, Falcon?

KerfuffleV2 Sep 6, 2023
Collaborator

No problem. The main GGML repo has some examples and it seems like Starcoder is in there: https://github.com/ggerganov/ggml/tree/master/examples

Those examples tend to be relatively simple and don't have all the functionality of llama.cpp like advanced samplers, interactive mode, etc.

clearsitedesigns · 2023-11-10T20:03:00Z

clearsitedesigns
Nov 10, 2023

Is there a way to directly do this on colab?

1 reply

samos123 Nov 10, 2023
Author

I think it should work. Did you try?

daehuikim · 2023-11-15T00:33:31Z

daehuikim
Nov 15, 2023

This way I can only get one file, such as a .gguf. Is it possible to convert a model into a reproducible format like TheBloke's on HuggingFace?

I am curious whether I can produce these kinds of files with llama.cpp.

1 reply

dame-cell Dec 10, 2023

hmm maybe we can just store all the different sizes locally in our pc then just upload all of them manually

venturaEffect · 2023-12-22T10:24:06Z

venturaEffect
Dec 22, 2023

Does anyone know why I am getting this error?

(hf2gguf) zasear@zaesarius:~/convert2gguf$ python llama.cpp/convert.py ai_lawyer \ --outfile ai-lawyer-v1.gguf \ --outtype q8_0
usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR] [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY] [--bigendian] [--padvocab] model
convert.py: error: unrecognized arguments: --outfile ai-lawyer-v1.gguf --outtype q8_0

I've also tried without backslashes, but with no success:

(hf2gguf) zasear@zaesarius:~/convert2gguf$ python llama.cpp/convert.py ai_lawyer --outfile ai-lawyer-v1.gguf --outtype q8_0
Traceback (most recent call last):
  File "/home/zasear/convert2gguf/llama.cpp/convert.py", line 1279, in <module>
    main()
  File "/home/zasear/convert2gguf/llama.cpp/convert.py", line 1207, in main
    model_plus = load_some_model(args.model)
  File "/home/zasear/convert2gguf/llama.cpp/convert.py", line 1131, in load_some_model
    raise Exception(f"Can't find model in directory {path}")
Exception: Can't find model in directory ai_lawyer

This is what the structure looks like after following the steps here:

Maybe it doesn't work because it is a private model in HuggingFace? But it shouldn't because it is locally downloaded...

Appreciate your help.

3 replies

teleprint-me Dec 22, 2023

It can't find the path.

Exception: Can't find model in directory ai_lawyer

Solution is in the error.

If it is private, it will only work if it's a supported architecture. Otherwise, you'll need to integrate it yourself.

venturaEffect Dec 22, 2023

Thanks, I think I found a work around.

It seems that models finetuned with Autotrain don't convert into GGUF format. I will have to merge the finetuned model with the base model and convert it into GGUF from there. There is a Space for that on HF.

Appreciate

heyili Jan 17, 2024

Hey man I am facing the same problem. Were you able to merge the adapter weights back to the base model and create a new model ?

andrewtvuong · 2024-01-09T18:04:08Z

andrewtvuong
Jan 9, 2024

Hi @samos123, I'm only used to working with .gguf files for LLMs. I have no idea what to do with this kind of model, so I did a search and found your post.

Am I right to assume all models structured this way are HF models? Is there anywhere I can read more about this? It seems all YouTube videos go straight to the quantized .gguf version. Are HF models considered the raw models that can be further tuned into something else? I have lots of assumptions, but they are hard to verify.

3 replies

samos123 Jan 9, 2024
Author

I guess you could call them raw models. The HuggingFace model in pytorch_model.bin format can be converted to a GGUF model.

teleprint-me Jan 9, 2024

Any model (or model part) that ends with .pt, .pth, or .bin can be assumed to be a torch model. torch models are created using the PyTorch framework by Meta (formerly Facebook).

transformers is a framework created and maintained by HuggingFace and they typically will use any available framework to automate iterating through pre-training, fine-tuning, and other tasks for models, e.g. TensorFlow, Keras, Onnx, PyTorch, etc.

config.json or params.json usually have the Hyperparameters for a model.

tokenizer.json is a JSON serialization of the tokenizer (vocabulary plus pre- and post-processing settings) that is generated automatically by HuggingFace's tokenizers library.

tokenizer.model is a trained model created using sentencepiece that usually has all of the essential vocabulary for a model in NLP (Natural Language Processing) tasks.

special_tokens_map.json contains all the extra tokens added after pre-training a model.

You can learn all of this stuff by reading the documentation and playing around with the frameworks. They're all in Python.

andrewtvuong Jan 10, 2024

Thanks so much for the detailed explanation, very helpful to kick off my learning.

DevangPagare002 · 2024-01-17T11:53:41Z

DevangPagare002
Jan 17, 2024

3 replies

francisco-lafe Jan 17, 2024

That's a very basic error, and you should first learn how to view files in a Linux command line, as that's what's used in Python notebooks.
It's telling you that the program "quantize" is not in the current path you're in.

DevangPagare002 Jan 17, 2024

But I am on the quantize path.

!pwd

output - /tmp/llama.cpp/examples/quantize

francisco-lafe Jan 17, 2024

You should run an ls command.

Taikono-Himazin · 2024-01-22T05:27:11Z

Taikono-Himazin
Jan 22, 2024

Please tell me the difference between the roles of the following files.

convert.py
convert-hf-to-gguf.py
convert-llama2c-to-ggml
convert-llama-ggml-to-gguf.py
convert-lora-to-ggml.py
convert-persimmon-to-gguf.py

My predictions are as follows.

convert.py: Convert from HuggingFace format to gguf
convert-hf-to-gguf.py: Convert from HuggingFace format to gguf
convert-llama2c-to-ggml: Convert from llama2c format to ggml
convert-llama-ggml-to-gguf.py: Convert from ggml format to gguf
convert-lora-to-ggml.py: Add LORA to base model and convert to ggml
convert-persimmon-to-gguf.py: Convert from persimmon format to gguf

Why aren't convert.py and convert-hf-to-gguf.py the same?

Also, only convert-llama2c-to-ggml is not a Python file, why is this?

1 reply

oldmanjk Apr 16, 2024

This should be documented in the main readme

nan0bug00 · 2024-01-27T20:47:12Z

nan0bug00
Jan 27, 2024

Improved the download.py script:


import sys
from huggingface_hub import snapshot_download

if len(sys.argv) != 2:
    print("Usage: python download.py <model_id>")
    sys.exit(1)

model_id = sys.argv[1]
local_dir = model_id.replace("/", "-")
snapshot_download(repo_id=model_id, local_dir=local_dir,
                  local_dir_use_symlinks=False, revision="main")

This way you can just pass the model name on huggingface in the command line. It will remove the slash and replace it with a dash when creating the directory. Example:

python download.py lmsys/vicuna-13b-v1.5 will create a directory lmsys-vicuna-13b-v1.5 and place the model from huggingface within.

0 replies

nameless0704 · 2024-01-30T02:08:46Z

nameless0704
Jan 30, 2024

I'm getting a KeyError: 'transformer.h.0.attn.c_attn.bias' when converting a Qwen-14B-Chat model using convert.py with --outtype q8_0. Some say using convert_hf_to_gguf.py will fix it, but convert_hf_to_gguf.py doesn't have quantization options. So how do I quantize AND convert?

1 reply

francisco-lafe Jan 30, 2024

You can first convert and then quantize.
https://github.com/ggerganov/llama.cpp/releases/tag/b2008
Download any of the Windows builds, extract it into a folder (e.g. llamacppbinaries), and run the quantize binary from there:
.\llamacppbinaries\quantize.exe

IntrepidWanderer · 2024-01-31T22:51:47Z

IntrepidWanderer
Jan 31, 2024

Hi, I ran into an odd error and was really struggling to find any relevant information online. Hoping someone here can help. I know almost nothing about the technical side of things, just an average AI text gen user. I'm trying to convert GGUFs for models and checked out instructions both here and this guide on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/18av9aw/quick_start_guide_to_converting_your_own_ggufs/?rdt=48304

I managed to get convert.py working, can do FP16 and Q8 converts without issue, but ran into the same mysterious error repeatedly when trying to use quantize.exe to convert pretty much anything. I've tried with both this model Mixtral Erotic and this model CatPPT

The error message is always the same:

PS C:\TextGen\llama-b1983-bin-win-cublas-cu12.2.0-x64> ".\quantize.exe." C:\TextGen\text-generation-webui-main\models\cloudyu_Mixtral_Erotic_13Bx2_MOE_22B C:\Convert\22B.f16.Q5.gguf f16
At line:1 char:19
... ntize.exe." C:\TextGen\text-generation-webui-main\models\cloudyu_Mixt ...
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unexpected token 'C:\TextGen\text-generation-webui-main\models\cloudyu_Mixtral_Erotic_13Bx2_MOE_22B' in expression or
statement.
+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : UnexpectedToken

The processing always gets stuck on "line: 1 char:19", I'm not sure why and I can't really see what character it is specifically. BtW, I'm running in Powershell, just right clicked on the quantize.exe under Explorer and chose the option to auto navigate to that location. I'm not sure if that makes a difference.

I'm wondering if the error is because I don't have llama.cpp installed correctly. Running quantize.exe through CMD gives an error about cudart64_12.dll missing, but downloading the cudart files and putting them into the same folder doesn't stop the error. If I'm only using convert.py and quantize.exe, do I still need to follow the CMake instructions on the llama.cpp main page to "build Llama" from the source code? I've already run the requirements.txt through Python, which is why convert.py is working for me, I think. For some reason, it's just quantize.exe that doesn't work.

Edit (Update):
Just came back to post an update about this issue. I haven't found a way to make quantize.exe work with the CuBLAS version, even though I was using Nvidia 1080 GPU. In the end, resorted to trying CLBlast, and it worked where CuBLAS wouldn't. I'm really not sure why.

0 replies

shubhamraj216 · 2024-02-27T06:55:32Z

shubhamraj216
Feb 27, 2024

I am getting below error when converting phi-2 model to gguf format. I must be missing something. Please help.

python3 llama.cpp/convert.py myllama-hf --outfile myllama-7b-v0.1.gguf

Error -

Traceback (most recent call last):
  File "/Users/shubh/Work/Personal/llama.cpp/convert.py", line 1483, in <module>
    main()
  File "/Users/shubh/Work/Personal/llama.cpp/convert.py", line 1419, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubh/Work/Personal/llama.cpp/convert.py", line 1271, in load_some_model
    raise Exception(f"Found multiple models in {path}, not sure which to pick: {files}")
Exception: Found multiple models in myllama-hf, not sure which to pick: [PosixPath('myllama-hf/model-00001-of-00002.safetensors'), PosixPath('myllama-hf/model.safetensors')]

Below is what the myllama-hf directory looks like:

0 replies

phymbert · 2024-02-27T07:43:26Z

phymbert
Feb 27, 2024
Collaborator

I am getting below error when converting phi-2 model to gguf format. I must be missing something. Please help.

python3 llama.cpp/convert.py myllama-hf --outfile myllama-7b-v0.1.gguf

Error -

As the error states, you are mixing two safetensors layouts in myllama-hf. You cannot have both "model-00001-of-*.safetensors" and "model.safetensors".

Please properly download files from HF microsoft/phi-2.

Note: you can directly download GGUF quantized Microsoft Phi-2 models from HF with hf.sh, example for a Q4_K_M:

./scripts/hf.sh --repo TheBloke/phi-2-GGUF --file phi-2.Q4_K_M.gguf
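The layout conflict described above (a sharded "model-00001-of-*.safetensors" set sitting next to a single "model.safetensors") can be detected before running the converter. A minimal sketch follows; the function name and the shard filename pattern are assumptions based on the filenames in the error message.

```python
import re
import tempfile
from pathlib import Path

# Shard naming as seen in the error message: model-00001-of-00002.safetensors
SHARD_RE = re.compile(r"^model-\d{5}-of-\d{5}\.safetensors$")

def find_layout_conflict(model_dir):
    """Return (single_file, shard_files) if both layouts are present, else None."""
    model_dir = Path(model_dir)
    single = model_dir / "model.safetensors"
    shards = sorted(p.name for p in model_dir.iterdir() if SHARD_RE.match(p.name))
    if single.is_file() and shards:
        return single.name, shards
    return None

# Demo on a throwaway directory that reproduces the layout from the error message.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model.safetensors", "model-00001-of-00002.safetensors"):
        (Path(tmp) / name).touch()
    conflict = find_layout_conflict(tmp)
    print(conflict)
```

If the function returns something other than None, re-downloading into a clean directory is probably simpler than guessing which files to delete.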

0 replies

FantasiaFoundry · 2024-03-03T11:34:52Z

FantasiaFoundry
Mar 3, 2024

This might be useful. If anyone wants to help improve it, contributions are always welcome.

https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script

0 replies

DevangPagare002 · 2024-03-04T18:42:06Z

DevangPagare002
Mar 4, 2024

While converting a bigcode/starcoder2-7b into q8_0 quantization using convert.py, I got the following error.

Loading model file original_model/model-00001-of-00003.safetensors
Loading model file original_model/model-00001-of-00003.safetensors
Loading model file original_model/model-00002-of-00003.safetensors
Loading model file original_model/model-00003-of-00003.safetensors
Traceback (most recent call last):
  File "/content/llama.cpp/convert.py", line 1479, in <module>
    main()
  File "/content/llama.cpp/convert.py", line 1426, in main
    params = Params.load(model_plus)
  File "/content/llama.cpp/convert.py", line 317, in load
    params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
  File "/content/llama.cpp/convert.py", line 256, in loadHFTransformerJson
    f_norm_eps = config["rms_norm_eps"],
KeyError: 'rms_norm_eps'

Can anybody please help?

3 replies

francisco-lafe Mar 4, 2024

Not until you paste the entire command you used and the full text of the error.

DevangPagare002 Mar 7, 2024

command - !python llama.cpp/convert.py ./original_model/ --outtype q8_0 --outfile ./quantized_model/starcoder2_Q8_0.gguf

Full-text error -

Loading model file original_model/model-00001-of-00003.safetensors
Loading model file original_model/model-00001-of-00003.safetensors
Loading model file original_model/model-00002-of-00003.safetensors
Loading model file original_model/model-00003-of-00003.safetensors
Traceback (most recent call last):
  File "/content/llama.cpp/convert.py", line 1466, in <module>
    main()
  File "/content/llama.cpp/convert.py", line 1413, in main
    params = Params.load(model_plus)
  File "/content/llama.cpp/convert.py", line 317, in load
    params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
  File "/content/llama.cpp/convert.py", line 256, in loadHFTransformerJson
    f_norm_eps        = config["rms_norm_eps"],
KeyError: 'rms_norm_eps'

lukestanley Mar 18, 2024

@DevangPagare002 I've noticed that safetensors appeared to break converting to GGUF for me; maybe you ran into a similar problem. When saving for GGUF conversion, maybe try saving with safetensors turned off. I was using a save_pretrained method, which had started to default to safetensors output, and this conflicted with GGUF conversion, so I learned that I could set safe_serialization=False. I was using PeftModel, but this interface is probably used in a few places. Hope this helps!

OE-LUCIFER · 2024-03-24T14:12:29Z

OE-LUCIFER
Mar 24, 2024

Can I make a GGUF of a model that contains custom code?

1 reply

Green-Sky Mar 24, 2024
Collaborator

No.

Wauplin · 2024-03-28T15:32:39Z

Wauplin
Mar 28, 2024

Hi @samos123, maintainer of huggingface_hub here 🤗

Given the popularity of this post I think it'd be good to update it to showcase the huggingface-cli (not sure it was a fully featured tool back in Sept 23). So your post can be updated this way:

Downloading a HuggingFace model

To download the full model to local folder:

huggingface-cli download lmsys/vicuna-13b-v1.5 --local-dir vicuna-hf --local-dir-use-symlinks False --revision main

Or only a file:

huggingface-cli download lmsys/vicuna-13b-v1.5 <FILENAME> --local-dir vicuna-hf --local-dir-use-symlinks False --revision main

Pushing the GGUF model to HuggingFace

To upload a folder to the Hub:

huggingface-cli upload <MODEL-ID> vicuna-13b-v1.5-gguf

Finally, to set an HF token on a machine - it's best to set HF_TOKEN as environment variable (instead of HUGGING_FACE_HUB_TOKEN). It is also possible to do --token=hf_*** in the CLI but this is not the preferred way as it leaks the token in the command history on Unix.
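For scripts that want to fail early when no token is set, a small helper can apply the same preference order: HF_TOKEN first, then the older HUGGING_FACE_HUB_TOKEN. This is only a sketch with a hypothetical function name. huggingface_hub reads these variables on its own, so the helper is optional.

```python
import os

def resolve_hf_token(env=None):
    """Prefer HF_TOKEN, fall back to the older HUGGING_FACE_HUB_TOKEN, else None."""
    env = os.environ if env is None else env
    return env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN") or None

token = resolve_hf_token()
print("token found" if token else "no token set; export HF_TOKEN first")
```

The returned value could then be passed explicitly, for example as `HfApi(token=token)` in an upload script.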

Hope this will help you (and future readers) using the Hub in a more convenient way! 🤗

(thanks @julien-c for the friendly ping)

0 replies

thekevinscott · 2024-04-15T13:35:25Z

thekevinscott
Apr 15, 2024

When specifying --outtype q8_0 I see:

usage: convert-hf-to-gguf.py [-h] [--vocab-only] [--awq-path AWQ_PATH] [--outfile OUTFILE] [--outtype {f32,f16}] [--bigendian] [--use-temp-file] model
convert-hf-to-gguf.py: error: argument --outtype: invalid choice: 'q8_0' (choose from 'f32', 'f16')

It looks like only f32 and f16 are supported options in the script. Am I missing something?

4 replies

phymbert Apr 15, 2024
Collaborator

Q8_0 is not a pytorch type but a GGML/llama.cpp one after GGUF quantization.

You need to convert pytorch model to GGUF in F16, then use quantize to convert F16 to Q8_0

thekevinscott Apr 15, 2024

I see. So you're saying, (1) convert the model from HF to unquantized gguf, then (2) convert unquantized gguf to quantized gguf?

Like this?

python3 convert-hf-to-gguf.py models/phi-1_5_dev --outfile models/susnato_phi-1_5.gguf
python3 convert.py models/susnato_phi-1_5.gguf --outfile models/susnato_phi-1_5_q8_0.gguf --outtype q8_0

I'm guessing I'm missing something because the second command gives me:

    raise ValueError(f"unknown format: {path}")
ValueError: unknown format: models/susnato_phi-1_5.gguf

Maybe another pertinent bit of info is that I'm trying to convert from an ONNX model. I originally tried converting the ONNX directly with python3 code/llama.cpp/convert.py models/phi-1_5_dev/ --outfile models/susnato_phi-1_5_q8_0.gguf --outtype q8_0 but get:

    f_norm_eps        = config["rms_norm_eps"],
KeyError: 'rms_norm_eps'

phymbert Apr 15, 2024
Collaborator

No, quantize is a C++ binary; you cannot quantize in Python. You can see a full example in the tests here:

model: support arch DbrxForCausalLM #6515

thekevinscott Apr 15, 2024

Thank you very much for your help. After building I ran quantize with:

quantize models/susnato_phi-1_5.gguf models/susnato_phi-1_5_q8_0.gguf Q8_0

And it works nicely. Cheers!
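The two-step flow that worked here (convert to an unquantized GGUF, then run the quantize binary) can be scripted. The sketch below only builds the command lines and runs nothing. The function name and the output naming scheme (`<stem>.<TYPE>.gguf`) are my own assumptions; the script name, flags, and argument order are taken from the commands shown in this thread.

```python
import shlex

def build_pipeline(model_dir, out_stem, quant_types,
                   convert_script="convert-hf-to-gguf.py", quantize_bin="quantize"):
    """Return argv lists: one f16 conversion, then one quantize call per type."""
    f16_path = f"{out_stem}.f16.gguf"
    commands = [["python3", convert_script, model_dir,
                 "--outfile", f16_path, "--outtype", "f16"]]
    for qtype in quant_types:
        out_path = f"{out_stem}.{qtype}.gguf"       # hypothetical naming scheme
        commands.append([quantize_bin, f16_path, out_path, qtype])
    return commands

cmds = build_pipeline("models/phi-1_5_dev", "models/susnato_phi-1_5", ["Q8_0", "Q4_K_M"])
for argv in cmds:
    print(shlex.join(argv))     # shell-safe rendering of each command
```

To actually execute the steps, each list could be passed to `subprocess.run(argv, check=True)`. Every quantized file is derived from the f16 file, which follows the earlier advice not to requantize from q8_0.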

krecicki · 2024-04-18T00:43:36Z

krecicki
Apr 18, 2024

This

from huggingface_hub import snapshot_download
model_id="lmsys/vicuna-13b-v1.5"
snapshot_download(repo_id=model_id, local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main")

Helped me a ton. It downloaded my LoRA combined with the base model correctly. I was able to make my GGUF easily.

1 reply

PhilipAmadasun Apr 18, 2024

@krecicki How did you combine your LoRA with the base model correctly? What was your merging method?

MontassarTn · 2024-04-19T15:17:56Z

MontassarTn
Apr 19, 2024

Error when I try to convert my private Llama3 model

1 reply

teleshen0 Apr 25, 2024

I hit the same issue even with the Meta official released llama 3 70B as well. Any thoughts?

bluebot08 · 2024-04-26T03:17:01Z

bluebot08
Apr 26, 2024

Hello everyone, I hope someone can help me with this error I am getting.

I tried running the following:
python3 llama.cpp/convert.py idefics2-8b \
  --outfile idefics2-8b-v1.5.gguf \
  --outtype f32

However, I get this error:
KeyError: 'model.connector.modality_projection.down_proj.weight'

Anyone have any ideas? Is this a problem with the model I am trying to convert?

0 replies


Replies: 22 comments · 41 replies
