
[Feat]: Add Tokenizer Metadata in tokenizer.json to gguf Format for Enhanced llama.cpp Capabilities #4868

Closed
4 tasks done
snowyu opened this issue Jan 11, 2024 · 5 comments
Labels
enhancement New feature or request stale

Comments

@snowyu
Contributor

snowyu commented Jan 11, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Currently llama.cpp lacks support for HuggingFace's tokenization pipeline.

It is crucial to address llama.cpp's current limitation: the tokenization pipeline configurations from HuggingFace's Tokenizers library are stored in a separate JSON file named "tokenizer.json" and are not carried over into gguf. These configuration files contain essential information for implementing advanced features such as subword regularization and customizable pre-processing that improve language model performance.

By incorporating this metadata into the gguf format, llama.cpp can offer users a more seamless experience, providing access to HuggingFace's full tokenization pipeline within its single-file model format.

Motivation

Related Issues: #2872 #3502

Possible Implementation

We only need to add the contents of the relevant subkeys (`normalizer`, `pre_tokenizer`, `model`, `post_processor`, `decoder`) from the tokenizer.json file to the gguf metadata. Also don't forget the tokenizer_config.json file, which holds the special tokenizer configuration.
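As a rough sketch of this step, the snippet below collects those subkeys from a parsed tokenizer.json into flat metadata entries. The key prefix `tokenizer.huggingface.` and the choice to store each stage as a JSON string are assumptions for illustration, not an agreed gguf convention.

```python
import json

# The relevant pipeline stages, using the key names as they appear in tokenizer.json.
TOKENIZER_KEYS = ("normalizer", "pre_tokenizer", "model", "post_processor", "decoder")

def extract_tokenizer_metadata(tokenizer_json: dict,
                               prefix: str = "tokenizer.huggingface.") -> dict:
    """Collect tokenizer.json subkeys as flat gguf-style metadata entries.

    The prefix is hypothetical; each stage is serialized verbatim so a
    reader can reconstruct the pipeline exactly.
    """
    metadata = {}
    for key in TOKENIZER_KEYS:
        val = tokenizer_json.get(key)
        if val is not None:
            metadata[prefix + key] = json.dumps(val, ensure_ascii=False)
    return metadata

# Tiny inline example (a real conversion would json.load an actual tokenizer.json):
sample = {"normalizer": {"type": "Prepend"}, "pre_tokenizer": None}
print(extract_tokenizer_metadata(sample))
# {'tokenizer.huggingface.normalizer': '{"type": "Prepend"}'}
```

A converter could then emit each entry with the writer's string key-value method when building the gguf file.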

Subsequently, we can proceed to implement the tokenizer as HF (Hugging Face) does. The most effortless approach is to use the pre-existing tokenizers-cpp, which wraps both the HuggingFace tokenizers library and sentencepiece and offers a minimal common interface in C++ for seamless integration.

Alternatively, all tokenizer functionality could be implemented in pure C++ without any external libraries or dependencies. For a clear, single-file example of how these pre-tokenization steps can be implemented, see the source of transformers.js's tokenizer implementation: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
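To show how directly these configuration entries map to code, here is a minimal Python sketch (standing in for an eventual C++ port) that interprets the `Sequence`, `Prepend`, and `Replace` normalizer types used in the example below. It is an illustration only, not the llama.cpp implementation.

```python
def normalize(text: str, normalizer: dict) -> str:
    """Apply a tokenizer.json 'normalizer' config to a string.

    Only the three normalizer types from the example are handled.
    """
    ntype = normalizer["type"]
    if ntype == "Sequence":
        # Run each sub-normalizer in order.
        for sub in normalizer["normalizers"]:
            text = normalize(text, sub)
        return text
    if ntype == "Prepend":
        return normalizer["prepend"] + text
    if ntype == "Replace":
        return text.replace(normalizer["pattern"]["String"], normalizer["content"])
    raise NotImplementedError(ntype)

# The Llama-style normalizer: prepend a metaspace, then turn spaces into it.
cfg = {
    "type": "Sequence",
    "normalizers": [
        {"type": "Prepend", "prepend": "\u2581"},
        {"type": "Replace", "pattern": {"String": " "}, "content": "\u2581"},
    ],
}
print(normalize("hello world", cfg))  # ▁hello▁world
```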

Related JSON Example in tokenizer.json
{
  "normalizer": {
    "type": "Sequence",
    "normalizers": [
      {
        "type": "Prepend",
        "prepend": "▁"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": " "
        },
        "content": "▁"
      }
    ]
  },
  "pre_tokenizer": {
     "type": "Sequence",
     "pretokenizers": [
        {"type": "WhitespaceSplit"},
        {"type": "Metaspace", "replacement": "▁", ...}
     ]
  },
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": "<unk>",
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": true,
    "byte_fallback": true,
    "vocab": { ... }
  },

  "post_processor": {
    "type": "TemplateProcessing",
    "single": [
      {
        "SpecialToken": {
          "id": "<s>",
          "type_id": 0
        }
      },
      {
        "Sequence": {
          "id": "A",
          "type_id": 0
        }
      }
    ],
    "pair": [
      {
        "SpecialToken": {
          "id": "<s>",
          "type_id": 0
        }
      },
      {
        "Sequence": {
          "id": "A",
          "type_id": 0
        }
      },
      {
        "SpecialToken": {
          "id": "<s>",
          "type_id": 1
        }
      },
      {
        "Sequence": {
          "id": "B",
          "type_id": 1
        }
      }
    ],
    "special_tokens": {
      "<s>": {
        "id": "<s>",
        "ids": [
          1
        ],
        "tokens": [
          "<s>"
        ]
      }
    }
  },
  "decoder": {
    "type": "Sequence",
    "decoders": [
      {
        "type": "Replace",
        "pattern": {
          "String": "▁"
        },
        "content": " "
      },
      {
        "type": "ByteFallback"
      },
      {
        "type": "Fuse"
      },
      {
        "type": "Strip",
        "content": " ",
        "start": 1,
        "stop": 0
      }
    ]
  }
}
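The decoder section above can be replayed mechanically over a list of token strings. The sketch below is a simplified illustration: `ByteFallback` (mapping `<0xNN>` tokens back to raw bytes) is stubbed out, and `Strip` only handles the leading (`start`) side.

```python
def hf_decode(tokens: list[str], decoders: list[dict]) -> str:
    """Replay a tokenizer.json 'decoder' Sequence over token strings (simplified)."""
    for dec in decoders:
        dtype = dec["type"]
        if dtype == "Replace":
            # e.g. turn the metaspace back into a regular space in each token.
            tokens = [t.replace(dec["pattern"]["String"], dec["content"]) for t in tokens]
        elif dtype == "ByteFallback":
            pass  # real impl: convert "<0xNN>" tokens back into raw bytes
        elif dtype == "Fuse":
            tokens = ["".join(tokens)]  # merge everything into one string
        elif dtype == "Strip":
            # Strip up to `start` copies of `content` from the left (stop side omitted).
            s = tokens[0]
            for _ in range(dec["start"]):
                if s.startswith(dec["content"]):
                    s = s[len(dec["content"]):]
            tokens = [s]
        else:
            raise NotImplementedError(dtype)
    return "".join(tokens)

decoders = [
    {"type": "Replace", "pattern": {"String": "\u2581"}, "content": " "},
    {"type": "ByteFallback"},
    {"type": "Fuse"},
    {"type": "Strip", "content": " ", "start": 1, "stop": 0},
]
print(hf_decode(["\u2581Hello", "\u2581world"], decoders))  # Hello world
```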
@snowyu snowyu added the enhancement New feature or request label Jan 11, 2024
@sroussey
Contributor

What about items in tokenizer_config.json?

@snowyu
Contributor Author

snowyu commented Jan 13, 2024

Thanks, I forgot about the special configuration file for the tokenizer.

@ggerganov
Owner

> The most effortless approach is utilizing the pre-existing tokenizers-cpp, which encapsulates and binds both the HuggingFace tokenizers library and sentencepiece while offering a minimal common interface in C++ language for seamless integration with HF applications.

Using a 3rd party dependency is not desired. We should try to implement these within llama.cpp.

snowyu added a commit to snowyu/llama.cpp that referenced this issue Jan 26, 2024
The content of the OBJ type is actually a list of all key names of the object.

* GGUFWriter:
  * add `def add_kv(self, key: str, val: Any) -> None`:  This will be added based on the val type
  * add `def add_dict(self, key: str, val: dict) -> None`: add object(dict) value
* constants:
  * `GGUFValueType.get_type`: Added support for Numpy's integers and floating-point numbers, and selected the appropriate number of digits based on the size of the integer.
* gguf_reader:
  * add `ReaderField.get`: to return the value of the field
* Unit test added.

Related Issues: ggerganov#4868, ggerganov#2872
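A sketch of the flattening scheme this commit message describes: a dict (OBJ) value is stored as the list of its key names, with each sub-value written under `<key>.<subkey>`. Here `add_dict` writes into a plain dict for illustration; the real GGUFWriter methods and key layout may differ.

```python
from typing import Any

def add_dict(metadata: dict, key: str, val: dict) -> None:
    """Flatten a nested dict into gguf-style metadata entries.

    The OBJ entry for `key` holds the list of its sub-key names, so a
    reader can rebuild the object by looking up "<key>.<subkey>".
    """
    metadata[key] = list(val.keys())  # OBJ type: list of key names
    for subkey, subval in val.items():
        if isinstance(subval, dict):
            add_dict(metadata, f"{key}.{subkey}", subval)  # recurse into objects
        else:
            metadata[f"{key}.{subkey}"] = subval  # scalar/array leaf

meta: dict[str, Any] = {}
add_dict(meta, "tokenizer.normalizer", {"type": "Prepend", "prepend": "\u2581"})
print(meta)
# {'tokenizer.normalizer': ['type', 'prepend'],
#  'tokenizer.normalizer.type': 'Prepend',
#  'tokenizer.normalizer.prepend': '▁'}
```

A matching `ReaderField.get` on the reader side would invert this: read the key-name list, then fetch each `<key>.<subkey>` entry.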

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024