feat: Introduce new GGUFValueType.OBJ virtual type🌠 #5143

Open · snowyu wants to merge 1 commit into master from feat/obj_virtual_type
Conversation

@snowyu (Contributor) commented Jan 26, 2024

The content of the OBJ type is actually a list of all the object's key names, designed to preserve compatibility by using the simplest possible flat structure.

Here's an example demonstrating its usage:

{
  "obj": {
    "subKey1": 1,
    "subKey2": {"k": 2}
  }
}
  1. Write(OBJ): Key is obj, Object Value is ["subKey1", "subKey2"]
  2. Write the subkeys:
    1. Write(UINT8): Key is .obj.subKey1, Simple Value is 1
    2. Write(OBJ): Key is .obj.subKey2, Object Value is ["k"]
      1. Write(UINT8): Key is .obj.subKey2.k, Simple Value is 2
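
In the Python writer, this whole sequence can come from a single call; a minimal sketch, assuming the add_dict API this PR adds (the file name is illustrative):

# Sketch: one add_dict call emits the OBJ key list plus the dotted subkeys above.
from gguf import GGUFWriter

writer = GGUFWriter("example.gguf", "llama")  # illustrative file name
writer.add_dict("obj", {"subKey1": 1, "subKey2": {"k": 2}})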

All JSON structures are now supported:

  • objects
  • mixed-type arrays
  • nested arrays

The conventions are as follows:

  • Key convention: a leading "." indicates that the key is a subkey, not an independent key.
  • Subkey-name conventions in OBJ types (see the sketch after the tokenizer_config example below):
    • If the subkey name starts with "/", it references the full name of another key.
    • If there is a ":" delimiter, the string after the colon is the subkey's name within this object; otherwise the referenced subkey's own name is used.

For example,

// tokenizer.json
{
  ...,
  "pre_tokenizer": {
      "type": "Sequence",
       "pretokenizers": [
          {"type": "WhitespaceSplit"},
          {"type": "Metaspace","replacement": "▁", ...}
        ]
    }
}

// tokenizer_config.json
{
  ...
  "bos_token": "<s>",
  "eos_token": "</s>",
  "add_bos_token": true
}

Convert the JSON example above into a flat structure as follows:

// tokenizer.json
tokenizer = ["pre_tokenizer", ...], type OBJ
.tokenizer.pre_tokenizer.type = "Sequence", type STRING
.tokenizer.pre_tokenizer.pretokenizers = 2, type ARRAY, sub type OBJ
.tokenizer.pre_tokenizer.pretokenizers[0].type = "WhitespaceSplit", type STRING
.tokenizer.pre_tokenizer.pretokenizers[1].type = "Metaspace", type STRING
.tokenizer.pre_tokenizer.pretokenizers[1].replacement = "▁", type STRING

// tokenizer_config.json
// `tokenizer.ggml.bos_token_id`, `tokenizer.ggml.eos_token_id` and
// `tokenizer.ggml.add_bos_token` already exist in GGUF, so just reference them.
tokenizer_config = ["/tokenizer.ggml.bos_token_id:bos_token", 
  "/tokenizer.ggml.eos_token_id:eos_token", 
  "/tokenizer.ggml.add_bos_token",
  ...
], type OBJ
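
A hypothetical helper (not part of this PR's diff, the name is for illustration only) makes the resolution rule concrete:

# Sketch: resolving one OBJ subkey entry per the conventions above.
# resolve_subkey is a hypothetical helper, not code from this PR.
def resolve_subkey(entry: str, parent: str) -> tuple[str, str]:
    """Return (full_key, local_name) for one OBJ subkey entry."""
    if entry.startswith("/"):                     # reference to another key's full name
        ref, _, local = entry[1:].partition(":")
        return ref, (local or ref.rsplit(".", 1)[-1])
    return f".{parent}.{entry}", entry            # ordinary subkey under the parent

# resolve_subkey("/tokenizer.ggml.bos_token_id:bos_token", "tokenizer_config")
#   -> ("tokenizer.ggml.bos_token_id", "bos_token")
# resolve_subkey("/tokenizer.ggml.add_bos_token", "tokenizer_config")
#   -> ("tokenizer.ggml.add_bos_token", "add_bos_token")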

This change includes several improvements and additions to the codebase:

  • Python
    • gguf_writer.py:
      • Added def add_kv(self, key: str, val: Any) -> None: automatically determines the appropriate value type based on val.
      • Added def add_dict(self, key: str, val: dict, excludes: Sequence[str] = []) -> None: adds object (dict) values, recursively adding all subkeys (see the sketch after this list).
      • Added add_array_ex to support nested and mixed-type arrays.
    • constants.py:
      • Added GGUFValueType.get_type_ex(val): adds support for numpy integers and floating-point numbers, selecting the integer width according to the magnitude of the value.
    • gguf_reader.py:
      • Added a ReaderField.get() method to retrieve the value of a specific field.
    • Unit test added
  • CPP
    • ggml:
      • Added GGUF_TYPE_OBJ to the gguf_type enum type.
      • Use gguf_get_arr_n and gguf_get_arr_str to get the subKey names of GGUF_TYPE_OBJ.
      • Added gguf_set_obj_str function to set object subkey names
      • Added gguf_set_arr_obj function to set object array count
      • Added gguf_set_arr_arr function to set nested array count
    • llama:
      • Modified gguf_kv_to_str
      • Added LLAMA_API char * gguf_kv_to_c_str function to get the value as a JSON-formatted C string.
        • Maybe this API should be moved into ggml as gguf_get_val_json. (The problem is that ggml.c is written in C, while this code makes heavy use of C++ features.)
      • Added basic support to GGUF_TYPE_OBJ and nested array
    • Unit test added
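
A minimal end-to-end sketch of the new Python writer API listed above (the values are chosen to match the metadata dump quoted later in this thread, and the file name comes from the unit test):

# Sketch: exercising add_kv / add_dict / add_array_ex from this PR.
from gguf import GGUFWriter

w = GGUFWriter("tests/test_writer.gguf", "llama")
w.add_kv("answer", 42)                                      # type inferred as u32
w.add_dict("dict1", {"key1": 2, "key2": "hi", "obj": {"k": 1}})
w.add_array_ex("cArray", [3, "hi", [1, 2]])                 # mixed-type + nested
w.write_header_to_file()
w.write_kv_data_to_file()
w.close()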

Related Issues: #4868, #2872

@ggerganov (Owner)

I'll need a bit of time to consider this change - I like the implementation, but I'm not yet convinced it is necessary.

The way I'm thinking is that the Python writer can directly serialize complex dictionaries into array of KVs. In your example we would write straight up obj.subkey1 and obj.subkey2.k without the OBJ lists.

The composed keys (e.g. obj.subkey1) would be mapped to llama-cpp-known keys like we already do for all KV pairs and tensor names (see gguf/constants.py and gguf/tensor_mapping.py). This way in the C++ world, we straight up read the KVs that we are interested in, and don't deal with parsing dictionaries and synchronizing the key strings (the synchronization already happened thanks to the mapping during writing the GGUF file).

With the proposed approach here, I imagine that we would have to iteratively parse the OBJ KVs into std::maps for example and do some extra work I guess.

In your example:

{
  ...,
  "pre_tokenizer": {
      "type": "Sequence",
       "pretokenizers": [
          {"type": "WhitespaceSplit"},
          {"type": "Metaspace","replacement": "", ...}
        ]
    }
}

How do you imagine the C++ code would function to query the Metaspace replacement?

Without the GGUF.OBJ extension, this should ideally map to a string KV called pretokenizer.sequences.metaspace_replacement: "_" and in C++ we simply get this KV as usual.
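
(For illustration, a sketch of that direct read with the existing gguf-py reader; the key name is the hypothetical one above:)

# Sketch: reading a plain string KV, no OBJ parsing involved.
from gguf import GGUFReader

r = GGUFReader("model.gguf")  # illustrative file name
field = r.fields.get("pretokenizer.sequences.metaspace_replacement")
if field is not None:
    value = str(bytes(field.parts[-1]), encoding="utf-8")  # decode the string payload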

@snowyu (Contributor, Author) commented Jan 28, 2024

> I'll need a bit of time to consider this change - I like the implementation, but I'm not yet convinced it is necessary.
>
> The way I'm thinking is that the Python writer can directly serialize complex dictionaries into array of KVs. In your example we would write straight up obj.subkey1 and obj.subkey2.k without the OBJ lists.

Actually, at first I just wanted to implement flat-structure objects in Python using existing types, without introducing new types on the CPP side. However, I found that I had to add the OBJ type, for two reasons:

  1. Some tokenizer.json and tokenizer_config.json data is already embedded in GGUF files, e.g. tokenizer.ggml.bos_token_id, tokenizer.chat_template, etc. Therefore, key indexing must be maintained for backward compatibility.
    • This allows us to adopt the following convention to define "tokenizer_config": ["/tokenizer.ggml.bos_token_id:bos_token", "/tokenizer.ggml.add_bos_token"]
      • A leading slash (/) indicates a reference to the full name of another key.
      • A colon (:) separator indicates that the string after the colon is the name of a subkey in this object; otherwise the referenced subkey's own name is used.
    • If backward compatibility weren't a concern, we could instead store the number of subkeys as the content of the OBJ type, requiring the next specified number of keys to be the children of this object.
  2. The current GGUF ARRAY only supports arrays of a single simple type. To support object arrays, the OBJ type must be added; this also makes mixed-type arrays possible.

> How do you imagine the C++ code would function to query the Metaspace replacement?
>
> Without the GGUF.OBJ extension, this should ideally map to a string KV called pretokenizer.sequences.metaspace_replacement: "_" and in C++ we simply get this KV as usual.

Great question! This is exactly what I was considering when adding OBJ support to GGUF ARRAY. The key lies in object arrays: if the pre_tokenizer type is Sequence, then the pretokenizers should be executed in array order. So the JSON example above converts to a flat structure as follows:

.tokenizer.pre_tokenizer.type = "Sequence", type STRING
.tokenizer.pre_tokenizer.pretokenizers = 2, type ARRAY, sub type OBJ
.tokenizer.pre_tokenizer.pretokenizers[0].type = "WhitespaceSplit", type STRING
.tokenizer.pre_tokenizer.pretokenizers[1].type = "Metaspace", type STRING
.tokenizer.pre_tokenizer.pretokenizers[1].replacement = "▁", type STRING
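
On the read side, such a value could then be fetched by its flattened key; a sketch, assuming the ReaderField.get() method this PR adds (the file name comes from the unit test):

# Sketch: querying the Metaspace replacement via its flattened key.
from gguf import GGUFReader

r = GGUFReader("tests/test_writer.gguf")
field = r.fields.get(".tokenizer.pre_tokenizer.pretokenizers[1].replacement")
if field is not None:
    print(field.get())  # ReaderField.get() is added by this PR -> "▁"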

Thank you for your question! I hope this explanation clarifies the need for adding an OBJ type to GGUF arrays and how it can be implemented effectively with backward compatibility in mind. Please let me know if there's anything else I can help with.

@ggerganov (Owner)

Uh, I don't know. Curious if other devs have an opinion on this functionality.

> If pre_tokenizer type is Sequence, then pretokenizers should be executed in order of the array sequence.

I find this extremely complicated. Overall, I have a strong hesitation of supporting all these tokenizer options, templates, configs and what not in llama.cpp. It seems like an endless way of over-engineering something that should be very simple. (sorry for the rant, it's not towards this PR)

Let me think about this for a while, but right now I'd prefer if we just picked 1 or 2 items from the tokenizer options that are more important and useful and just support those with the existing GGUF types (like a boolean for whitespace split, etc.).

@slaren (Collaborator) commented Jan 29, 2024

To me this seems like too much effort to shoehorn a solution into the current implementation. We could just include the entire tokenizer JSON file as a string; we are not going to bundle a JSON parser in llama.cpp, but I think it is safe to assume that any application that wants to support templates has a JSON parser as well.
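
(For reference, a sketch of that approach with the existing writer API; the key name is illustrative:)

# Sketch: embed the raw tokenizer JSON as one string KV and leave parsing
# to downstream applications. "tokenizer.json" is an illustrative key name.
from gguf import GGUFWriter

w = GGUFWriter("model.gguf", "llama")
with open("tokenizer.json", encoding="utf-8") as f:
    w.add_string("tokenizer.json", f.read())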

@snowyu (Contributor, Author) commented Jan 29, 2024

@ggerganov

> I find this extremely complicated. Overall, I have a strong hesitation of supporting all these tokenizer options, templates, configs and what not in llama.cpp. It seems like an endless way of over-engineering something that should be very simple. (sorry for the rant, it's not towards this PR)

Given that, the easiest route is to use the ready-made tokenizer-cpp library. Of course, it should not be difficult to implement it piece by piece in C++; after all, it has already been done in JS, including a simple Jinja template engine.

The advantage of embedding the tokenizer, template, and other configurations is that they can then be consumed from JS, Python, etc.

This PR now fully supports the JSON format, including mixed-type and nested arrays. See the unit tests and updated documentation.

llama_model_loader: loaded meta data with 29 key-value pairs and 3 tensors from tests/test_writer.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:         general.architecture str              = "llama"
llama_model_loader: - kv   1:            llama.block_count u32              = 12
llama_model_loader: - kv   2:                       answer u32              = 42
llama_model_loader: - kv   3:              answer_in_float f32              = 42.000000
llama_model_loader: - kv   4:                        uint8 u8               = 1
llama_model_loader: - kv   5:                        nint8 i8               = 1
llama_model_loader: - kv   6:                        dict1 obj[str,3]       = {"key1":2, "key2":"hi", "obj":{"k":1}}
llama_model_loader: - kv  11:                       oArray arr[obj,2]       = [{"k":4, "o":{"o1":6}}, {"k":9}]
llama_model_loader: - kv  18:                       cArray arr[obj,3]       = [3, "hi", [1, 2]]
llama_model_loader: - kv  22:                 arrayInArray arr[arr,2]       = [[2, 3, 4], [5, 7, 8]]
llama_model_loader: - kv  25:  tokenizer.ggml.bos_token_id str              = "bos"
llama_model_loader: - kv  26: tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  27:             tokenizer_config obj[str,2]       = {"bos_token":"bos", "add_bos_token":t...
llama_model_loader: - kv  28:            general.alignment u32              = 64
llama_model_loader: Dumping metadata keys/values Done.

@snowyu force-pushed the feat/obj_virtual_type branch 2 times, most recently from 84dc536 to fe25927 on February 3, 2024 at 09:41

feat: add basic support for GGUF_TYPE_OBJ on cpp
feat(gguf.py): add OBJ and mixed-type array support to GGUF ARRAY
feat: add OBJ and mixed-type array support to GGUF ARRAY (CPP)
feat: add nested array support
feat: subkey name convention in OBJ types:
  * If the subkey name starts with "/", it references the full name of another key.
  * If there is a ":" delimiter, the string after the colon is the subkey name in this object; otherwise the referenced subkey name is used.
feat: add LLAMA_API gguf_kv_to_c_str to llama.h
test: write test gguf file to tests folder directly (py)
test: add test-gguf-meta.cpp
feat: key convention: "." indicates that the key is a subkey, not an independent key.
feat: add excludes argument to add_dict (gguf_writer.py)
feat: add add_array_ex to support nested and mixed-type arrays, keeping add_array unchanged
fix(constants.py): roll back the get_type function and add the new get_type_ex
test: add compatibility test
fix: use GGML_MALLOC instead of malloc