convert-hf : match model part name prefix and suffix #7687

compilade · 2024-06-02T00:57:51Z

TL;DR This should fix conversion of https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

(thanks to @teleprint-me for noticing this problem in #7379 (comment))

Before #7075, convert-hf-to-gguf.py checked for specific model part names according to the number of *.bin files or *.safetensors files.

llama.cpp/convert-hf-to-gguf.py

Lines 224 to 232 in bc4bba3

    
           def _get_part_names(self): 
        
               if self.is_safetensors: 
        
                   if self.num_parts == 1:  # there's only one .safetensors file 
        
                       return ("model.safetensors",) 
        
                   return (f"model-{n:05}-of-{self.num_parts:05}.safetensors" for n in range(1, self.num_parts + 1)) 
        
               if self.num_parts == 1:  # there's only one .bin file 
        
                   return ("pytorch_model.bin",) 
        
               return (f"pytorch_model-{n:05}-of-{self.num_parts:05}.bin" for n in range(1, self.num_parts + 1))

But then, #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part, simply used the same logic as the part count to get the part names. That is, using only the suffix of the files.

llama.cpp/convert-hf-to-gguf.py

Lines 285 to 293 in f98eb31

    
           def get_model_part_names(dir_model: Path, suffix: str) -> list[str]: 
        
               part_names: list[str] = [] 
        
               for filename in os.listdir(dir_model): 
        
                   if filename.endswith(suffix): 
        
                       part_names.append(filename) 
        
               part_names.sort() 
        
               return part_names

But this doesn't always work correctly, like when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.

Note that the previous way it was done would still have failed, since the model count still only relied on the suffixes.

I think matching both the prefix and the suffix of the model part names should fix this problem without breaking any previously-supported upstream models.

teleprint-me · 2024-06-02T01:05:44Z

It's an improvement, but the problem continues to persist. This is a good patch for the interim.

teleprint-me · 2024-06-09T06:24:41Z

I think this might be okay for the most part. I've been looking at a bunch of model repos and they all seem to be converging towards a consistent format. I only thought this might be an issue because a model creator could name the file anything they wanted. The only outlier I've been able to find is consolidated which is (kind of) a new one. My rationale for this potentially being an issue is that the raw models distributed by facebook are also named consolidated or prefixed with consolidated.

02:21:19 | ~
  λ tree -I papers /mnt/scsm/models/facebook                 
/mnt/scsm/models/facebook
├── llama-1
│   ├── 13B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   ├── consolidated.01.pth
│   │   ├── params.json
│   │   ├── tokenizer_checklist.chk
│   │   └── tokenizer.model
│   ├── 30B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   ├── consolidated.01.pth
│   │   ├── consolidated.02.pth
│   │   ├── consolidated.03.pth
│   │   ├── params.json
│   │   ├── tokenizer_checklist.chk
│   │   └── tokenizer.model
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   ├── params.json
│   │   ├── tokenizer_checklist.chk
│   │   └── tokenizer.model
│   └── llama.sh

That's all I have to say about this for now. I'm still researching and experimenting.

mofosyne · 2024-06-09T06:34:16Z

Do we have any contact with the safetensor devs etc... to encourage formalization or at least encourage best practices?

teleprint-me · 2024-06-09T07:04:37Z

The issue is that PyTorch users can do whatever they want. Safetensors is managed by HuggingFace. So, if I make some toy models, then I can technically name them whatever. I think @compilade is right about the naming not being an issue. The conversion script is solely geared toward huggingface, so it isolates our focus, which isn't necessarily a bad thing considering how much is going on in the repo. You can find the Safetensors specs in their repo which is actually nicely outlined.

https://github.com/huggingface/safetensors/blob/main/docs/source/metadata_parsing.mdx#usage

convert-hf : match model part name prefix and suffix

3af9371

compilade added bugfix fixes an issue or bug Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix python python script changes labels Jun 2, 2024

mofosyne approved these changes Jun 2, 2024

View reviewed changes

mofosyne added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Jun 2, 2024

mofosyne merged commit 5795b94 into master Jun 9, 2024
23 checks passed

mofosyne deleted the compilade/convert-hf-model-part-prefix branch June 9, 2024 07:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

convert-hf : match model part name prefix and suffix #7687

convert-hf : match model part name prefix and suffix #7687

compilade commented Jun 2, 2024 •

edited

Loading

teleprint-me commented Jun 2, 2024

teleprint-me commented Jun 9, 2024

mofosyne commented Jun 9, 2024 •

edited

Loading

teleprint-me commented Jun 9, 2024

	def _get_part_names(self):
	if self.is_safetensors:
	if self.num_parts == 1: # there's only one .safetensors file
	return ("model.safetensors",)
	return (f"model-{n:05}-of-{self.num_parts:05}.safetensors" for n in range(1, self.num_parts + 1))

	if self.num_parts == 1: # there's only one .bin file
	return ("pytorch_model.bin",)
	return (f"pytorch_model-{n:05}-of-{self.num_parts:05}.bin" for n in range(1, self.num_parts + 1))

	def get_model_part_names(dir_model: Path, suffix: str) -> list[str]:
	part_names: list[str] = []
	for filename in os.listdir(dir_model):
	if filename.endswith(suffix):
	part_names.append(filename)

	part_names.sort()

	return part_names

convert-hf : match model part name prefix and suffix #7687

convert-hf : match model part name prefix and suffix #7687

Conversation

compilade commented Jun 2, 2024 • edited Loading

teleprint-me commented Jun 2, 2024

teleprint-me commented Jun 9, 2024

mofosyne commented Jun 9, 2024 • edited Loading

teleprint-me commented Jun 9, 2024

compilade commented Jun 2, 2024 •

edited

Loading

mofosyne commented Jun 9, 2024 •

edited

Loading