
Standardized prompting metadata #774

Open
MoonRide303 opened this issue Mar 27, 2024 · 3 comments

@MoonRide303

It would be nice to have standardized prompting metadata defined within GGUF files.

Currently, when importing a GGUF model into tools like Ollama, it's necessary to explicitly provide prompting metadata - like the template and stopping sequences specific to a given model, and sometimes also a default system message. It would be useful to include the most commonly used prompting parameters in the GGUF specification, so they could be read and used by applications like llama.cpp or Ollama.

My proposal, based on the Ollama Modelfile definition (a rough write-side sketch follows the list):

  • prompting.template: string - default prompting template,
  • prompting.system: string - default system string,
  • prompting.stops: array[string] - default list of strings stopping generation.
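
For illustration, a minimal sketch of what embedding these keys could look like, assuming the gguf-py GGUFWriter API; the prompting.* key names are the proposed (hypothetical) ones, not part of the current GGUF spec, and the template/stop values are made up:

```python
# Hypothetical sketch: embedding the proposed prompting.* keys with
# gguf-py's GGUFWriter. The key names come from this proposal, not the spec.
from gguf import GGUFWriter

writer = GGUFWriter("model.gguf", arch="llama")  # illustrative path/arch
writer.add_string("prompting.template", "[INST] {prompt} [/INST]")     # proposed key
writer.add_string("prompting.system", "You are a helpful assistant.")  # proposed key
writer.add_array("prompting.stops", ["</s>"])                          # proposed key

# Metadata-only example; a real conversion would also add tensors.
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()
```
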
@ggerganov
Owner

There is already tokenizer.chat_template : string, which should contain all the necessary info
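
For context, tokenizer.chat_template holds a Jinja template that gets rendered against a list of messages. A minimal sketch, using an illustrative ChatML-style template rather than any particular model's:

```python
# Render a chat template the way consumers of tokenizer.chat_template do.
# The template string here is illustrative, not from a real model file.
from jinja2 import Template

chat_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = Template(chat_template).render(messages=messages, add_generation_prompt=True)
print(prompt)
```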

@MoonRide303
Author

@ggerganov Those seem to cover roughly the same area, but don't Jinja templates require pre-processing? Can they be used directly as input to -p in llama.cpp, for example? I didn't notice them anywhere in Ollama Modelfiles, either.

I am not sure if / how multiple types of templates should be supported. Maybe something like

  • prompting.template_type: string - type of prompting template ("simple", "jinja"), or
  • prompting.template_jinja: string - Jinja prompting template (standardizing tokenizer.chat_template)

would allow supporting both styles?

Or, as a more flexible approach, just allow any set of types and templates via a prompting.templates map. A rough dispatch sketch follows.
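
A minimal sketch of how a loader could dispatch on such a key; everything here (the key names, the "simple" {prompt} placeholder format) is assumed from the proposal above, not existing GGUF metadata:

```python
# Hypothetical dispatch on the proposed prompting.template_type key.
from jinja2 import Template

def render_prompt(metadata: dict, messages: list, user_prompt: str) -> str:
    template_type = metadata.get("prompting.template_type", "simple")
    if template_type == "jinja":
        # Jinja style, i.e. what tokenizer.chat_template already holds.
        return Template(metadata["prompting.template_jinja"]).render(
            messages=messages, add_generation_prompt=True
        )
    # "simple" style: a literal template with a placeholder, in the
    # spirit of an Ollama Modelfile TEMPLATE line.
    return metadata["prompting.template"].replace("{prompt}", user_prompt)
```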

Just an idea, but I feel like standardizing this could make experimenting with different types of models a bit easier.

@teleprint-me

teleprint-me commented Mar 27, 2024

@MoonRide303

The model files are converted from torch, which uses a zip format based on the Python pickle library. Most models are created using transformers, which is maintained by HuggingFace. The Python scripts used to convert the models to gguf embed data into the model files. There are docs, references, and posts that detail this already, e.g. PR #765 and #302.

Because HuggingFace uses Jinja, a templating engine commonly used for HTML, the "chat" template is usually in the automatically generated tokenizer_config.json file. It's surprisingly non-trivial to extract an instruct/chat model template from these files, as the information is usually inconsistent and unreliable. The reason for this is that the model creator decides how to configure all of these parameters before training begins; there is a deeper rationale for this, but it requires a more in-depth explanation that feels out of scope here. There are plenty of materials, including on HuggingFace, that break this down already.
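
For example, a minimal sketch of pulling the template out of such a file (the path is illustrative, and the key may simply be absent):

```python
# Sketch: extracting the chat template from tokenizer_config.json.
# Many configs don't include the chat_template key at all.
import json

with open("tokenizer_config.json") as f:
    config = json.load(f)

chat_template = config.get("chat_template")
if chat_template is None:
    print("no chat_template key; the template must come from somewhere else")
else:
    print(chat_template)
```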

When the models are converted from torch to gguf, the relevant metadata is read into memory and then written to the converted model file. You can use the GGUFReader class to inspect the metadata; there is already an example file showing how to do this.
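
Something along these lines (a sketch; the string-decoding detail follows how the bundled reader example accesses field parts):

```python
# Sketch: dumping GGUF metadata with gguf-py's GGUFReader.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # illustrative path
for name, field in reader.fields.items():
    print(name, field.types)

# e.g. decode tokenizer.chat_template as a string, if present:
field = reader.fields.get("tokenizer.chat_template")
if field is not None:
    print(bytes(field.parts[-1]).decode("utf-8"))
```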

You can get a more detailed feel for how a gguf model file is structured by referencing the gguf.constants module in llama.cpp/gguf-py.
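
For instance (assuming the current gguf-py package layout):

```python
# The canonical metadata key names live in gguf.constants.
from gguf.constants import Keys

print(Keys.Tokenizer.CHAT_TEMPLATE)  # -> "tokenizer.chat_template"
```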

I hope this provides some clarity into the current implementation.

As an aside, I personally would've used a Mapping, not Jinja2. I have no idea what the rationale behind this was; the consequence is that we're all stuck dealing with it now.
