Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert-hf : match model part name prefix and suffix #7687

Merged
merged 1 commit into from
Jun 9, 2024

Conversation

compilade
Copy link
Collaborator

@compilade compilade commented Jun 2, 2024

TL;DR This should fix conversion of https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

(thanks to @teleprint-me for noticing this problem in #7379 (comment))

Before #7075, convert-hf-to-gguf.py checked for specific model part names according to the number of *.bin files or *.safetensors files.

def _get_part_names(self):
if self.is_safetensors:
if self.num_parts == 1: # there's only one .safetensors file
return ("model.safetensors",)
return (f"model-{n:05}-of-{self.num_parts:05}.safetensors" for n in range(1, self.num_parts + 1))
if self.num_parts == 1: # there's only one .bin file
return ("pytorch_model.bin",)
return (f"pytorch_model-{n:05}-of-{self.num_parts:05}.bin" for n in range(1, self.num_parts + 1))

But then, #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part, simply used the same logic as the part count to get the part names. That is, using only the suffix of the files.

def get_model_part_names(dir_model: Path, suffix: str) -> list[str]:
part_names: list[str] = []
for filename in os.listdir(dir_model):
if filename.endswith(suffix):
part_names.append(filename)
part_names.sort()
return part_names

But this doesn't always work correctly, like when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.

Note that the previous way it was done would still have failed, since the model count still only relied on the suffixes.

I think matching both the prefix and the suffix of the model part names should fix this problem without breaking any previously-supported upstream models.

@compilade compilade added bugfix fixes an issue or bug Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix python python script changes labels Jun 2, 2024
@teleprint-me
Copy link
Contributor

It's an improvement, but the problem continues to persist. This is a good patch for the interim.

@mofosyne mofosyne added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Jun 2, 2024
@mofosyne mofosyne merged commit 5795b94 into master Jun 9, 2024
23 checks passed
@teleprint-me
Copy link
Contributor

I think this might be okay for the most part. I've been looking at a bunch of model repos and they all seem to be converging towards a consistent format. I only thought this might be an issue because a model creator could name the file anything they wanted. The only outlier I've been able to find is consolidated which is (kind of) a new one. My rationale for this potentially being an issue is that the raw models distributed by facebook are also named consolidated or prefixed with consolidated.

02:21:19 | ~
  λ tree -I papers /mnt/scsm/models/facebook                 
/mnt/scsm/models/facebook
├── llama-1
│   ├── 13B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   ├── consolidated.01.pth
│   │   ├── params.json
│   │   ├── tokenizer_checklist.chk
│   │   └── tokenizer.model
│   ├── 30B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   ├── consolidated.01.pth
│   │   ├── consolidated.02.pth
│   │   ├── consolidated.03.pth
│   │   ├── params.json
│   │   ├── tokenizer_checklist.chk
│   │   └── tokenizer.model
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   ├── params.json
│   │   ├── tokenizer_checklist.chk
│   │   └── tokenizer.model
│   └── llama.sh

That's all I have to say about this for now. I'm still researching and experimenting.

@mofosyne
Copy link
Collaborator

mofosyne commented Jun 9, 2024

Do we have any contact with the safetensor devs etc... to encourage formalization or at least encourage best practices?

@mofosyne mofosyne deleted the compilade/convert-hf-model-part-prefix branch June 9, 2024 07:00
@teleprint-me
Copy link
Contributor

The issue is that PyTorch users can do whatever they want. Safetensors is managed by HuggingFace. So, if I make some toy models, then I can technically name them whatever. I think @compilade is right about the naming not being an issue. The conversion script is solely geared toward huggingface, so it isolates our focus, which isn't necessarily a bad thing considering how much is going on in the repo. You can find the Safetensors specs in their repo which is actually nicely outlined.

https://github.com/huggingface/safetensors/blob/main/docs/source/metadata_parsing.mdx#usage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes an issue or bug merge ready indicates that this may be ready to merge soon and is just holding out in case of objections python python script changes Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants