
Integration of Jinja2 Templating #875

Merged: 13 commits merged into abetlen:main on Jan 17, 2024

Conversation

teleprint-me
Contributor

Refactor Chat Templating to Utilize Jinja2

Overview

This pull request introduces a significant refactor of the chat templating system within the llama-cpp-python project. The primary objective is to simplify template management, enhance flexibility, and minimize dependencies by leveraging Jinja2's templating engine.

Changes Introduced

  • Jinja2 Templating: Shifted the templating mechanism to use Jinja2, which allows for more sophisticated templating features such as inheritance, macros, and custom filters while maintaining simplicity.
  • Template Customization: Users can now specify a custom template or fallback to a default provided template, giving them the flexibility to tailor the chat output to their preferences.
  • Reduced Complexity: Simplifies the codebase by removing excessive dependencies and streamlining the chat formatting process.

Code Snippet

from typing import Any, Dict, List, Optional

import jinja2
from jinja2 import Template

# ChatFormatterInterface, ChatFormatterResponse, and llama2_template are defined
# elsewhere in llama_jinja_format.py.
class AutoChatFormatter(ChatFormatterInterface):
    def __init__(
        self,
        template: Optional[str] = None,
        template_class: Optional[Template] = None,
    ):
        # Fall back to the built-in llama2 template when none is supplied.
        if template is not None:
            self._template = template
        else:
            self._template = llama2_template  # default template

        self._renderer = jinja2.Environment(
            loader=jinja2.BaseLoader(),
            trim_blocks=True,
            lstrip_blocks=True,
        ).from_string(
            self._template,
            template_class=template_class,
        )

    def __call__(
        self,
        messages: List[Dict[str, str]],
        **kwargs: Any,
    ) -> ChatFormatterResponse:
        # Render the message list through the compiled Jinja2 template.
        formatted_sequence = self._renderer.render(messages=messages, **kwargs)
        return ChatFormatterResponse(prompt=formatted_sequence)

    @property
    def template(self) -> str:
        return self._template
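
For illustration, a minimal usage sketch of the class above, assuming the OpenAI-style role/content message dicts used throughout llama-cpp-python (this instantiation is illustrative only, not part of the diff):

formatter = AutoChatFormatter()  # falls back to llama2_template
response = formatter(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
)
print(response.prompt)  # the rendered prompt string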

Benefits

  • Consistency: Provides a consistent templating mechanism for formatting chat messages.
  • Performance: Reduces the overhead of managing templates and potentially improves performance due to a simpler processing pipeline.
  • User Experience: Offers an improved experience for developers who are already familiar with Jinja2's syntax and behavior.
  • Local and Remote Template Support: Enhances the system's ability to reference both local and remote templates, thus improving usability and flexibility.

Discussion Points

  • How we can further optimize the templating engine for our use case.
  • Potential additional features or customizations we might want to include in the future.
  • Feedback on the default template structure and suggestions for improvements.

Conclusion

This update aims to strike an ideal balance between the sophistication of the templating features and the maintenance simplicity desired by the project's contributors and users. I look forward to the community's input on this proposal.

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
- Simplify the `llama2_template` in `llama_jinja_format.py` by removing unnecessary line breaks for readability without affecting functionality.
- Update `ChatFormatterInterface` constructor to accept a more generic `Optional[object]` type for the template parameter, enhancing flexibility.
- Introduce a `template` property to `ChatFormatterInterface` for standardized access to the template string.
- Replace `MetaSingleton` metaclass with `Singleton` for the `ChatFormatterFactory` to streamline the singleton implementation.

These changes enhance code readability, maintain usability, and ensure consistency in the chat formatter's design pattern usage.
@earonesty
Contributor

This looks very clean and simple, and flexible enough to handle everything. And if we allow the user to include a template "as a chat format", in addition to registering names, then it can accommodate things we haven't thought of.
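
A hypothetical sketch of the two usage modes described above, registering a named format versus passing a raw Jinja2 template directly; the factory and method names here are assumptions loosely based on the ChatFormatterFactory mentioned in the commit notes, not the merged API:

# Assumed, illustrative API; not the actual llama-cpp-python interface.
factory = ChatFormatterFactory()

# Mode 1: register a formatter under a well-known name.
factory.register_formatter("llama-2", AutoChatFormatter(llama2_template))

# Mode 2: let the user pass an arbitrary Jinja2 template "as a chat format".
custom = AutoChatFormatter(
    template="{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}"
)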

@abetlen
Owner

abetlen commented Nov 21, 2023

Hey @teleprint-me this looks great! I think with the merging of ggerganov/llama.cpp#4125 we can now use your approach to automagically get a chat format without having to rely only on presets (assuming the chat formats are included for new quantized models).

@teleprint-me
Contributor Author

@abetlen

I left it separated to avoid conflicts with any changes you implemented.

How would you like to handle it?

@abetlen
Owner

abetlen commented Nov 22, 2023

@teleprint-me next steps I see:

  • Include jinja2 dependency in pyproject.toml
  • Change the behaviour of chat handler selection, order should be
    • Use chat_handler if set, else
    • Use chat_format if set (should be changed to default to None in Llama.__init__), else
    • Use chat format from gguf file if exists, else
    • Use llama2 chat format

I think this order makes the most sense and avoids breaking backwards compatibility.
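
A minimal Python sketch of that fallback order; the function and parameter names below are placeholders for illustration, not the actual llama-cpp-python API:

def resolve_chat_format(chat_handler=None, chat_format=None, gguf_template=None):
    """Illustrative sketch of the proposed selection order."""
    if chat_handler is not None:
        return chat_handler      # 1. an explicit chat_handler wins
    if chat_format is not None:
        return chat_format       # 2. then an explicit chat_format
    if gguf_template is not None:
        return gguf_template     # 3. then the template embedded in the gguf file
    return "llama-2"             # 4. finally, fall back to the llama2 chat format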

@teleprint-me
Contributor Author

@abetlen

Sounds good!

I'll do my best to get around to it over the weekend.

I haven't had as much free time lately, but this is a high priority for me.

I'll see what I can do and keep you in the loop.

Let me know if anything changes in the meantime.

@abetlen
Owner

abetlen commented Nov 22, 2023

@teleprint-me thank you so much, I think this will be very helpful for a lot of people, let me know if you need any extra help!

In terms of conflicts, there shouldn't be any; I'm just working on some performance features for batching / speculative decoding, and that should all be independent of the chat format work.

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
…ormers compatibility

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
@antcodd

antcodd commented Nov 27, 2023

It's good to see this supporting multiple system messages, that's one limitation of the current templates. Some clients insert system messages at positions other than the start to provide instructions when using the chat completion API (e.g. SillyTavern, and the Herika Skyrim mod).

Will this PR replace the existing templates? I fixed the system role mapping for one existing format locally but don't want to create unnecessary conflicts. I definitely think Jinja templates are a good way to go.

Function templates might also be worth considering for the future, at least for what is passed to the model. The current system for that is fairly complicated.

@teleprint-me
Contributor Author

@antcodd

If I understood correctly, no, it will not replace the currently existing templates seeing as @abetlen is looking to avoid breaking backwards compatibility with the current API.

This makes integration a bit more complicated which is why I'm taking my time with it.

That, and I'm in the middle of a bunch of personal projects, so I only have a limited amount of time to spend on each of them and this isn't including work.

@abetlen

How can I get the metadata from the model's gguf file so I can extract the chat template if it exists? I didn't see it in the spec... maybe I missed it?

@abetlen
Owner

abetlen commented Nov 27, 2023

@teleprint-me FWIW I think we can replace the implementation of some of the existing chat formats with this simpler Jinja2 approach, it shouldn't break anything on the API. The multi-modal models and the function calling models don't quite fit into this approach because they're not just using simple prompts but we can tackle those later.

How can I get the metadata from the models gguf file so I can extract the chat template if it exists? I didn't see in the spec... maybe I missed it?

If a model supports it you should just be able to call llama_model_meta_val_str from llama_cpp with the key tokenizer.chat_template, i.e. something like:

import ctypes

import llama_cpp

buflen = 2048  # should be enough, unsure
buf = (ctypes.c_char * buflen)(0)
# `llama.model` is the loaded model handle; the call returns -1 if the key is missing.
llama_cpp.llama_model_meta_val_str(llama.model, b"tokenizer.chat_template", buf, ctypes.sizeof(buf))

Unfortunately I just checked and it doesn't look to be too popular yet; in theory https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF should support it, but I believe the gguf files were generated with a llama.cpp version before the PR was merged. One way to test would be if someone re-quantized from the base model, which supports a chat_template in its tokenizer_config.json: https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B/blob/main/tokenizer_config.json

@teleprint-me
Contributor Author

Exploring llama.cpp and llama-cpp-python for Metadata Extraction

Objective: The main goal was to extract specific metadata from GGUF-formatted models using the llama.cpp and llama-cpp-python libraries, with a focus on keys like tokenizer.chat_template and tokenizer.huggingface.json.

Key Steps and Discoveries:

  1. Initial Experiments:

    • Utilized llama_cpp.llama_model_meta_val_str to extract metadata from models, focusing on keys such as general.name and tokenizer.ggml.model.
    • Discoveries: Some keys, notably tokenizer.huggingface.json, returned -1, suggesting their absence or non-implementation in tested models (see the sketch after this list).
  2. Deep Dive into Source Code:

    • Investigated the C++ and Python code of llama.cpp and llama-cpp-python for insights into metadata management and extraction.
    • Notable Findings:
      • Examined structures like llama_model, llama_context, and gguf_context.
      • Identified LLM_KV_NAMES in llama.cpp, mapping enums to string metadata keys.
      • Confirmed the non-implementation of tokenizer.huggingface.json in tested models.
  3. Documentation of Findings:

    • Uncovered the complexities in handling metadata extraction from GGUF files.
    • Emphasized the necessity of a comprehensive conversion script for model metadata inclusion.
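
For reference, a hedged sketch of the probing described in step 1, building on the llama_model_meta_val_str call shown earlier; the buffer size and the llama.model handle are assumptions:

import ctypes

import llama_cpp

def read_meta_key(model, key: bytes, buflen: int = 2048):
    # Returns the metadata value for `key`, or None if the key is absent
    # (llama_model_meta_val_str returns -1 in that case).
    buf = (ctypes.c_char * buflen)(0)
    n = llama_cpp.llama_model_meta_val_str(model, key, buf, ctypes.sizeof(buf))
    return None if n < 0 else buf.value.decode("utf-8")

for key in (b"general.name", b"tokenizer.ggml.model",
            b"tokenizer.chat_template", b"tokenizer.huggingface.json"):
    print(key, read_meta_key(llama.model, key))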

Challenges:

  • Absence of Key Metadata: Targeted metadata keys (tokenizer.chat_template and tokenizer.huggingface.json) were not found in tested models.
  • Complexity of Source Code: Navigating intricate C++ and Python code presented challenges in understanding metadata handling.

Future Steps:

  1. Conversion Script Development: Aim to create a script for converting models to include desired metadata, vital for requantizing models like openhermes and deepseek.

  2. Enhancing Conversion Script: Focus on improving the script's ability to handle gguf model file metadata effectively, given its potential impact on the project.

  3. Backend Integration with Llama-cpp-python: Concentrate on integrating Jinja2 templates with the backend to streamline the chat template system, which may offer a more immediate solution.

Tasks

  1. Include jinja2 Dependency in pyproject.toml:

    • Status: ✅ Completed
    • Successfully added jinja2 as a dependency in pyproject.toml. Adjusted the initial version constraint from >=2.11.3,<3.0.0 to >=2.11.3 to resolve a dependency conflict with mkdocs-material. This adjustment was thoroughly tested, and the project environment was updated accordingly.
  2. Modify Chat Handler Selection Logic:

    • Objective: Implement a new chat handler selection process as outlined by the project owner. The updated order should prioritize chat_handler if set, followed by chat_format (defaulting to None), then the chat format from the .gguf file, and finally defaulting to the llama2 chat format.
    • Status: 🚧 In Progress / To Do
    • This task involves modifying the existing code to adhere to the new selection criteria. It will require checking the specified conditions in order and applying the appropriate chat format based on the available settings or file configurations.

@lopagela

lopagela commented Dec 2, 2023

Hey guys, thanks a lot for working on this 🙏

While doing my investigation for this feature in https://github.com/imartinez/privateGPT I also came to read the parsing of the metadata of the GGUF files, as well as its creation, in the gguf-py part of llama.cpp.

While reading their implementation, I found out that tokenizer.chat_template is being added (see the usage of this enum: https://github.com/ggerganov/llama.cpp/blob/37c746d687d877bc11803e96b4dc5f378b83c0a0/gguf-py/gguf/constants.py#L73) in new GGUF files (the GGUF version hasn't been increased, but we should see it in gguf files from v3 onwards).

Looking at the usages of tokenizer.huggingface.json in the GGUF file writer, it seems it is simply not used, maybe a relic from the past.


Is there a discord or something like this for llama-cpp-python? If yes, I'd be interested in joining it ✌️

Do not hesitate to come and say hi on privateGPT's Discord (channel #contributors).

    messages: List[Dict[str, str]],
    **kwargs: Any,
) -> ChatFormatterResponse:
    formatted_sequence = self._environment.render(messages=messages, **kwargs)

During my exploration of existing chat_templates, I found out that they usually use functions such as raise_exception.

It looks like there might be some elegant solutions to define such a method by leveraging the Jinja env (see https://stackoverflow.com/a/29262304).
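
For illustration, a minimal sketch of exposing a raise_exception callable to the Jinja environment, along the lines of the Stack Overflow answer linked above (raising TemplateError is an assumption that mirrors what HF-style templates expect):

import jinja2
from jinja2.exceptions import TemplateError

def raise_exception(message: str):
    # Invoked from inside a template, e.g. {{ raise_exception('...') }}
    raise TemplateError(message)

env = jinja2.Environment(loader=jinja2.BaseLoader(), trim_blocks=True, lstrip_blocks=True)
env.globals["raise_exception"] = raise_exception  # available to every template rendered by env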

Otherwise, I guess you can draw heavy inspiration from HF's transformers implementation (cf. the usage guide: https://huggingface.co/docs/transformers/main/chat_templating) of AutoTokenizer.from_pretrained("xxx/model-name").apply_chat_template(chat, tokenize=False).

Examples of chat_templates:

@lopagela lopagela Dec 2, 2023

And here is a nice entry-point line in transformers to follow to see how they are rendering this jinja template (I basically did a Ctrl + F to find it): https://github.com/huggingface/transformers/blob/74a3cebfa51b539bfcfa79b33686cc090b7074e8/src/transformers/tokenization_utils_base.py#L1600

@teleprint-me
Contributor Author

teleprint-me commented Dec 14, 2023

@abetlen

I feel like I'm holding this PR hostage and I haven't had time to dig into it. It technically needs to be integrated, so I'm going to open it up for review so you can do what you need to do. Let me know if you need anything.

The chat templates should be integrated into the latest ggufs. I've been testing Mixtral, and they show up during the conversion process when the config.json is available.

@teleprint-me teleprint-me marked this pull request as ready for review December 14, 2023 04:13
@teleprint-me
Contributor Author

Thought I'd add this here because I was experimenting with the latest llama.cpp updates.

The new convert.py script has the Keys.Tokenizer values baked into it now.

The new gguf package makes it a lot easier to extract the templates from the model's metadata.

The following is a high-level example of how to do it.

"""
main.py - example file to experiment with extracting the chat template from the models metadata
"""
from __future__ import annotations

from gguf import GGUFReader, Keys


def get_chat_template(model_file: str) -> str:
    reader = GGUFReader(model_file)

    # Access the 'chat_template' field directly using its key
    chat_template_field = reader.fields[Keys.Tokenizer.CHAT_TEMPLATE]

    # Extract the chat template string from the field
    chat_template_memmap = chat_template_field.parts[-1]
    chat_template_string = chat_template_memmap.tobytes().decode("utf-8")

    return chat_template_string


def main() -> None:
    # this is just an exercise to determine how it might be done in practice
    model_file = "models/mistralai/Mixtral-8x7B-Instruct-v0.1/Mixtral-8x7B-Instruct-v0.1-q4_0.gguf"
    chat_template = get_chat_template(model_file)
    print(chat_template)


if __name__ == "__main__":
    main()

Which results in:

00:49:13 | ~/Local/llama_cpp_client
(.venv) git:(main | Δ) λ python main.py 
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

So, llama_cpp.llama_model_meta_val_str should work now, but this assumes the gguf model file has the embedded metadata.
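
Putting the pieces together, a hedged sketch of reading the embedded template through the ctypes binding and rendering it with Jinja2; the buffer size, the bos/eos token strings, and the llama.model handle are assumptions for illustration:

import ctypes

import jinja2
import llama_cpp

buf = (ctypes.c_char * 4096)(0)
n = llama_cpp.llama_model_meta_val_str(
    llama.model, b"tokenizer.chat_template", buf, ctypes.sizeof(buf)
)
if n >= 0:  # -1 means the key is missing from the gguf metadata
    chat_template = buf.value.decode("utf-8")
    env = jinja2.Environment(
        loader=jinja2.BaseLoader(), trim_blocks=True, lstrip_blocks=True
    )
    prompt = env.from_string(chat_template).render(
        messages=[{"role": "user", "content": "Hello!"}],
        bos_token="<s>",  # assumed token strings, for illustration only
        eos_token="</s>",
    )
    print(prompt)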

@abetlen abetlen merged commit 6bfe98b into abetlen:main Jan 17, 2024
@abetlen
Owner

abetlen commented Jan 17, 2024

Hey @teleprint-me thanks for all the work on this, I've merged it in now. I still need to do a little bit to adapt this to the CompletionChatHandler interface so we can have a single method for adding new chat formats using jinja, will likely migrate the existing templates over as well and add support for auto loading from the gguf files but I think this will be really useful.
