Automate vocab support and model conversion #7379

Draft · wants to merge 102 commits into base: master
Conversation

@teleprint-me (Contributor) commented May 19, 2024

Automate Vocab Support and Model Conversion

Motivation:

This PR aims to automate downloading supported model architectures from HuggingFace, handle conversions (if needed), and address existing fragile tokenizer implementations that affect training, finetuning, and inference. The goal is to streamline the process by reducing manual steps and making things easier for users.

Overview:

  1. The user downloads a supported model architecture (torch, safetensors, or GGUF) from huggingface.co. If the desired format (e.g., GGUF) is not available, we download one of the other formats for conversion.

  2. Download necessary files:

    • Vocab files: tokenizer.json, tokenizer_config.json, config.json, and optionally tokenizer.model
    • Model weights: The format may be *.pt(h), *.bin, or *.safetensors
  3. Convert the model files to a GGUF model file (consolidated by default, but split if needed). During conversion, metadata is extracted and written into the resulting GGUF model file. However, some metadata, such as normalizer and pre-tokenizer information, is currently missing from this process.

Ideally, users should only need to add the vocab type, model type, and the repository name.
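The steps above could be sketched as a small helper that splits a remote repo listing into vocab files and model weight files. The file names mirror those mentioned above; the helper itself is hypothetical and not part of this PR:

```python
# Hypothetical sketch of step 2: given a repo file listing, pick out the
# vocab files and the model weight files. Names match the lists above.
VOCAB_FILES = {"tokenizer.json", "tokenizer_config.json", "config.json", "tokenizer.model"}
WEIGHT_SUFFIXES = (".safetensors", ".bin", ".pt", ".pth")

def pick_files(remote_files):
    """Split a repo file listing into (vocab files, model weight files)."""
    vocab = [f for f in remote_files if f in VOCAB_FILES]
    weights = [f for f in remote_files if f.endswith(WEIGHT_SUFFIXES)]
    return vocab, weights

listing = ["config.json", "tokenizer.json", "model.safetensors", "README.md"]
vocab, weights = pick_files(listing)
```

With this split in hand, only the vocab type, model type, and repository name would need to be supplied by the user.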

Goals:

  • Automate downloading of models from HuggingFace
  • Handle conversions (if needed)
  • Fix issues related to hardcoded handling of tokenizers
  • Improve overall user experience by reducing manual implementations

Notes:

Model Paths:

Models are written to the models path by default unless an option explicitly directs them elsewhere. This leaves the choice to the user.

The models are organized in a way that mirrors the expectation of a HuggingFace user, e.g.:

(.venv) git:(auto-model-support | θ) λ tree models -I private 
models
├── mistralai
│   ├── Mistral-7B-Instruct-v0.2
│   │   ├── config.json
│   │   ├── ggml-vocab-mistral-7b-instruct-v0.gguf
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   └── tokenizer.model
│   └── Mixtral-8x7B-Instruct-v0.1
│       ├── config.json
│       ├── ggml-vocab-mixtral-8x7b-instruct-v0.gguf
│       ├── tokenizer_config.json
│       ├── tokenizer.json
│       └── tokenizer.model

If a user does not have access to the repo, the path will be empty. This is a bug I still need to fix. There are workarounds currently implemented to ignore this unintended side effect, e.g.:

├── databricks
│   └── dbrx-base

Advantages:

This improves organization, keeping the models path clean and easy to navigate. It also consolidates model paths, allowing us to reference a model path reliably and predictably.

What I like about this approach is that it lets us fetch the vocab, the model, or both, as desired. It also removes any concern about growing repo size, since we no longer need to include vocab files; we can generate them dynamically before or at runtime.

The model will be automatically downloaded if it is not locally available to the user. This allows us to fetch the model as needed.

Validation:

A JSON file, checksums.json, is automatically generated, removing the need to update the conversion scripts every time a hash changes. We can commit this file instead so we have a trusted reference we can all use.

This allows us to read from the dynamically generated JSON file as a source for validating tokenizers as well as other models. Using a regular expression to read, match, and then write to the file is error prone. This is a cleaner solution and I have considered other alternatives.
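A minimal sketch of how this could work, assuming a flat file-name-to-digest layout for checksums.json (the layout and helper names are illustrative, not the PR's exact implementation):

```python
# Sketch: hash each tokenizer/model file and write the digests to a JSON
# file that can be committed as a trusted reference. Validation then becomes
# a dictionary lookup instead of a regex read/match/write cycle.
import hashlib
import json
from pathlib import Path

def build_checksums(paths):
    """Map each file name to its sha256 hex digest."""
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}

def write_checksums(paths, out=Path("checksums.json")):
    out.write_text(json.dumps(build_checksums(paths), indent=2))

def validate(path, checksums):
    """Check a file against the committed reference digests."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == checksums.get(path.name)
```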

I'm open to feedback on how to handle this. E.g., it is possible to dynamically construct an AST, but this can introduce other issues without careful consideration.

Storing Tokenizer Models:

A key issue I've noticed is name conflicts: as more tokenizer models are added, the likelihood of collisions grows. Instead, we can rely upon the type of tokenizer that is available.

We already have an enum set up for the tokenizer type, with enumerations for BPE, SPM, and WPM. We can use a string or integer to identify the type of vocabulary, reducing the code and redundancy that would otherwise pile up. Personally, I'm leaning towards a string, and this would be the tokenizer.ggml.pre field value, e.g.

     16: STRING     |        1 | tokenizer.ggml.pre = 'jina-v2-en'

Would become

     STRING     |        1 | tokenizer.ggml.name = 'jina-v2-en'
     STRING     |        1 | tokenizer.ggml.type = 'WPM'
     ARRAY      |        1 | tokenizer.ggml.pre  = {"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"}

This is cleaner and easier to identify.

I've been considering how to do this for quite some time now. If this is undesirable, then I genuinely recommend adding a field for properly identifying the tokenizer model and properly labeling its type.

This is crucial to avoid confusion, stay in sync with modern terminology, and reduce redundancy in the code base. The behavior, use, and application of the "pre-tokenizer" term is in direct conflict with what I assume most of us would expect.

Identifying and Labeling Tokenizers:

Note that this is a rough sketch. I'm open to suggestions here:

    class Tokenizer:
        MODEL            = "tokenizer.ggml.model"   # We have enums for this
        HASH             = "tokenizer.ggml.hash"    # Embed the tokenizer vocab hash
        TYPE             = "tokenizer.ggml.type"    # BPE, SPM, WPM, etc.
        LIST             = "tokenizer.ggml.tokens"  # ...Rest remains the same for now
        TOKEN_TYPE       = "tokenizer.ggml.token_type"
        TOKEN_TYPE_COUNT = "tokenizer.ggml.token_type_count"  # ...BERT
        SCORES           = "tokenizer.ggml.scores"
        MERGES           = "tokenizer.ggml.merges"

Possible options might be:

  • tokenizer.ggml.pre: A member variable to set the regular expression.
  • tokenizer.ggml.model: A member variable that holds the model's name.
  • tokenizer.model.type: A member variable that holds the tokenizer type.

TODO

Adds the following support:

  • Vocab support for phi-1, phi-1.5, and phi-2
  • Vocab support for stablelm models
  • Vocab support for mistral v0.1 and v0.2
  • Vocab support for mixtral v0.1
  • Optionally generate a generate-vocab.sh script
  • Optionally generate tokenizer model tests
  • TODO

I have other plans as well. I will update this as I progress.

Tagging @CISC @compilade @ggerganov

- Add imports for json and hashlib
- Add missing models: phi, stablelm, mistral, and mixtral
- Fix constructor logic
- Fix how models are accessed
- Apply model schema to download_model method
@teleprint-me marked this pull request as draft May 19, 2024 02:56
@github-actions bot added the python (python script changes) label May 19, 2024
@teleprint-me (Contributor, Author)

Update:

The JSON structure is clear, but the hierarchical nature of associated metadata presents some challenges.

Challenges:

There are two paths I am considering to implement this solution effectively:

  1. Embed key-value pairs from tokenizer.json directly into GGUF tokenizers and create a simplified C API for handling these data partitions. This approach would require translating the C API, which is not trivial and may result in duplicating Rust APIs unnecessarily.

  2. Embed non-null JSON values themselves within the tokenizer model files. However, this method becomes complicated with hierarchical structures (e.g., Sequence or Metaspace) that are difficult to handle through the tokenizers API directly. In these cases, a custom API would be required to access additional configuration details.

Considerations:

I am currently developing a pseudo registry to aid in my understanding of the underlying implementation and making it easier to add new models using a CLI tool without any complications.

The sha256sum for encodings is primarily used to validate vocabularies, but I'm starting to question its usefulness as it feels more like a distraction than an essential feature.

To mitigate potential issues with regular expressions, validating them in aggregate might be beneficial. Regular expressions can be difficult to read and understand unless you are familiar with the author's intentions, making them write-only by nature.

@teleprint-me (Contributor, Author)

Notes:

Depending on the model used, the rules applied differ. Deepseek uses a type Sequence and each rule is applied in sequence.

  "normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{L}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{P}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

You could associate each normalizer and pre-tokenizer rule via an array and associated index, but this would be a lot more cumbersome than one might hope, even if it were simplified to the way it is currently implemented.

I like the apparent elegance of simply taking each pre-tokenizer and treating them as an array, but this is problematic at best because now we have no idea whether or not to trim, add spaces, or use the associated regular expression.

Some rules don't need one at all. A regex may be available, but that doesn't necessarily mean it should be used. I've noticed this is quite common in trained tokenizers.

This is obviously affecting the tokenized output.
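To make the problem concrete, here is a small sketch (hypothetical helper, not this PR's code) that walks a "Sequence" pre-tokenizer like the Deepseek one above and collects each rule's type and optional regex. It shows why flattening to an array of regexes loses information: several rules carry no pattern at all, only flags:

```python
# Walk a tokenizer.json-style "Sequence" pre-tokenizer and collect
# (rule type, optional regex) pairs. Rules like Digits and ByteLevel
# carry no pattern, only configuration flags, so a flat regex array
# cannot represent them faithfully.
def collect_pre_tokenizer_rules(pre_tokenizer):
    if pre_tokenizer.get("type") != "Sequence":
        return [(pre_tokenizer.get("type"), None)]
    rules = []
    for rule in pre_tokenizer.get("pretokenizers", []):
        pattern = rule.get("pattern", {}).get("Regex")  # None for Digits, ByteLevel, ...
        rules.append((rule["type"], pattern))
    return rules

pre = {
    "type": "Sequence",
    "pretokenizers": [
        {"type": "Split", "pattern": {"Regex": "[\r\n]"}, "behavior": "Isolated"},
        {"type": "Digits", "individual_digits": True},
        {"type": "ByteLevel", "add_prefix_space": False},
    ],
}
rules = collect_pre_tokenizer_rules(pre)
```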

@teleprint-me (Contributor, Author)

Notes:

Pair scores can be calculated using the formula:

freq / (n_1 * n_2)

This only applies to Word Piece Models.
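As a worked example of the formula (my reading of the note above, following the standard WordPiece likelihood score: freq is the pair's co-occurrence count, and n_1, n_2 are the counts of the individual tokens):

```python
# Pair score for WordPiece merging: freq / (n_1 * n_2).
# Variable meanings are an assumption based on the standard WordPiece
# training objective, not spelled out in the note above.
def pair_score(freq, n_1, n_2):
    return freq / (n_1 * n_2)

# e.g. a pair seen 20 times whose pieces occur 100 and 50 times:
score = pair_score(20, 100, 50)  # 0.004
```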

@teleprint-me (Contributor, Author) commented May 30, 2024

Notes:

The bare ::format is confusing and ambiguous. It should be renamed to llama_format or ggml_format for backward compatibility.

// NOTE: Renamed to prevent name conflicts and support backward compatibility
static std::string ggml_format(const char* fmt, ...) {
    va_list ap;
    va_list ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    int size = vsnprintf(NULL, 0, fmt, ap);
    GGML_ASSERT(size >= 0 && size < INT_MAX); // NOLINT
    std::vector<char> buf(size + 1);
    int               size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
    GGML_ASSERT(size2 == size);
    va_end(ap2);
    va_end(ap);
    return std::string(buf.data(), size);
}

I spent way more time on this than I would have liked. C++20 and higher have std::format, but earlier standards do not, and the default standard for llama.cpp is C++11. Renaming this function resolves the ambiguity.

A possible remedy is to utilize C++ namespaces and apply a proper prefix for each respective namespace, e.g. GGML and LLAMA namespaces.

@teleprint-me (Contributor, Author)

I can definitely do this. The only issue is how much I'm allowed to refactor to achieve this goal. It will break backwards compatibility but will end up with much cleaner code and a more automated process as a result. The workflow would remain the same.

@teleprint-me (Contributor, Author) commented Jun 1, 2024

02:22:41 | ~/Local/code/tokenizers
(.venv) git:(main | Δ) λ bpython
bpython version 0.24 on top of Python 3.12.3 /home/austin/Local/code/tokenizers/.venv/bin/python
>>> import logging
>>> import pprint
>>> from pathlib import Path
>>> from tok.gguf.huggingface_hub import HFHubModel
>>>
>>> auth_token = "<auth_read_token>"
>>> path = Path("models")
>>> logger = logging.getLogger('test')
>>> hub_model = HFHubModel(auth_token, path, logger)
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (cache).
Your token has been saved to /home/austin/.cache/huggingface/token
Login successful
>>> hub_model.request.list_remote_files
<bound method HFHubRequest.list_remote_files of <tok.gguf.huggingface_hub.HFHubRequest object at 0x787e7ab5e990>>
>>> model_repo = "stabilityai/stablelm-2-zephyr-1_6b"
>>> model_files = hub_model.request.list_remote_model_parts(model_repo)
>>> pprint.pprint(model_files, indent=2)
['model.safetensors']
>>> stablelm = "stabilityai/stablelm-2-zephyr-1_6b"
>>> llama3 = "meta-llama/Meta-Llama-3-8B"
>>> llama2 = "meta-llama/Llama-2-7b-hf"
>>> hub_model.request.list_remote_model_parts(llama2)
['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors']
>>> hub_model.request.list_remote_model_parts(llama3)
['model-00001-of-00004.safetensors', 'model-00002-of-00004.safetensors', 'model-00003-of-00004.safetensors', 'model-00004-of-00004.safetensors']
>>>

This works fine in most cases, but fails every now and then.

>>> falcon = "tiiuae/falcon-7b-instruct"
>>> hub_model.request.list_remote_model_parts(falcon)
['coreml/text-generation/falcon-7b-64-float32.mlpackage/Data/com.apple.CoreML/weights/weight.bin', 'pytorch_model-00001-of-00002.bin', 'pytorch_model-00002-of-00002.bin']

Not sure if I should hardcode the model parts on a model-by-model basis or use a selection method for edge cases, e.g. expected 2 model parts but got 3 instead, and then have a selection prompt pop up to mediate it.

This is nice because these aren't local; they are completely remote. So if you have read access on a gated model, it works. If it's public, no problem.

@compilade (Collaborator) commented Jun 1, 2024

Not sure if should hardcode the model parts on a model by model basis or use a selection method for edge cases, e.g. expected 2 model parts but got 3 instead and then have a selection pop up to mediate it.

Model parts are arbitrary. The same model with the same tokenizer and weights could be split into a different number of valid model parts. I don't think this should be hardcoded.

Missing model parts are already detected in convert-hf-to-gguf.py when the weight map doesn't match what was read from the model parts.

This works fine in most cases, but fails every now and then.

convert-hf-to-gguf.py only checks the top-level directory, so maybe ignore all deeper files when listing remote model parts with huggingface_hub?

@jaime-m-p mentioned this pull request Jun 1, 2024
@teleprint-me (Contributor, Author) commented Jun 1, 2024

@compilade Yes, I understand, but that's not always true.

>>> mistral3 = "mistralai/Mistral-7B-Instruct-v0.3"
>>> 
>>> model_parts = hub_model.request.list_remote_model_parts(mistral3)
>>> 
>>> model_parts
['consolidated.safetensors', 'model-00001-of-00003.safetensors', 'model-00002-of-00003.safetensors', 'model-00003-of-00003.safetensors']

The convert-hf-to-gguf.py script crashed until I removed the consolidated.safetensors model file. It could not intelligently handle this context.

This isn't the only situation like this.

For example, llama3 returns the expected results, but falcon doesn't, even though they both have nested paths (e.g. llama3 has the original model files and the HF files). The API's results are inconsistent.

Regardless, this might hold true depending on the user.

It's easier to have a list that a user can select from to verify whenever a deviation is encountered.

The only other mitigation I can think of is for the user to specify the model file types, parts, and number of parts on a model by model basis.

Another potential solution is to have a list of these files available in each of the model definitions.

If the desire is to not encounter any prompts and just "magically" convert, the only rational move here is to list the model parts on a model-by-model basis in each model definition and then reference that list as necessary. It's the most sane thing I can think of rather than coming up with hacky solutions to constantly work around the issue which will always cause issues unless the problem is known and understood by the user.

@compilade (Collaborator) commented Jun 2, 2024

If the desire is to not encounter any prompts and just "magically" convert, the only rational move here is to list the model parts on a model-by-model basis in each model definition and then reference that list as necessary. It's the most sane thing I can think of rather than coming up with hacky solutions to constantly work around the issue which will always cause issues unless the problem is known and understood by the user.

Do all valid model parts match the globs pytorch_model*.bin and model*.safetensors? If so, maybe this should be checked instead of listing everything.
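Such a glob check might look like this (a sketch only, not the convert script's actual logic), combined with the earlier suggestion to ignore files below the top level:

```python
# Accept only top-level files matching the two known weight globs.
# This would drop the CoreML weight.bin and the consolidated.safetensors
# files seen earlier in the thread.
from fnmatch import fnmatch

WEIGHT_GLOBS = ("model*.safetensors", "pytorch_model*.bin")

def filter_model_parts(remote_files):
    return [
        f for f in remote_files
        if "/" not in f  # top-level only, per the suggestion above
        and any(fnmatch(f, g) for g in WEIGHT_GLOBS)
    ]

parts = filter_model_parts([
    "coreml/text-generation/falcon-7b-64-float32.mlpackage/Data/com.apple.CoreML/weights/weight.bin",
    "pytorch_model-00001-of-00002.bin",
    "pytorch_model-00002-of-00002.bin",
    "consolidated.safetensors",
])
```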

I'm not sure listing model parts on a model-by-model basis is possible, since there are many different models using the same model arch. Some have multiple sizes, and some are used by completely different types of models (e.g. MixtralForCausalLM is used by some MoE merges too, and some use model-00001-of-00001.safetensors instead of model.safetensors when there's a single part). I don't think it's reasonable to expect such a list to be exhaustive. Model merges can change the size of a model, yet still remain compatible with llama.cpp. I think that's something that should continue to work.

@teleprint-me (Contributor, Author)

Hm. I agree that this can indeed be more challenging than initially anticipated, especially with the arbitrary nature of model files across different architectures. Most of the code for handling tokenizers is already set up on the Python end, and my goal here is to integrate the rest into the pipeline.

For simplicity's sake, let's address this issue by providing users a list of available file types to select from when fetching files remotely or using local ones if they exist. This way, we ensure that the user has control over the process and can easily handle edge cases as needed:

import inquirer  # third-party interactive prompt library


def select_file_type(self, tensor_types: set[str]) -> str:
    # If more than one type was found, ask the user which to read
    if len(tensor_types) > 1:
        questions = [
            inquirer.List(
                "reader",
                message="Which tensor type would you like to read?",
                choices=list(tensor_types),
            ),
        ]
        answers = inquirer.prompt(questions)
        if answers is None:  # prompt was cancelled (e.g. Ctrl-C)
            raise ValueError("No tensor type selected.")
        return answers["reader"]
    # Exactly one type found: no need to prompt
    if len(tensor_types) == 1:
        return next(iter(tensor_types))
    raise ValueError("No recognized tensor files found.")

This approach was used for reading metadata from Safetensors or Torch models, but the idea is similar: we read the repo file listing, list the available files, have the user select the ones they need, and proceed accordingly.

The tokenizer files are relatively straightforward as they're almost always the same aside from a few edge cases, but these usually don't cause significant issues due to their simplicity.

On the other hand, model files can be completely arbitrary with no way to determine the file name in advance. It is impressive that the convert script currently fails so infrequently for supported architectures given how difficult it is to guess the model file names accurately. However, we know from experience that this implementation isn't robust enough to handle all edge cases consistently.

By providing a user-friendly selection method, we can make the process more manageable and less prone to unexpected errors when dealing with arbitrary model files across various architectures.

@compilade (Collaborator) commented Jun 2, 2024

For simplicity's sake, let's address this issue by providing users a list of available file types to select from when fetching files remotely or using local ones if they exist. This way, we ensure that the user has control over the process and can easily handle edge cases as needed:

There are likely people using the convert scripts non-interactively, so I don't think asking the user a question like this would be appropriate for all cases. But I might be wrong.

For selecting the file type, a command-line flag should be appropriate, which could be used to override the default of "preferring model*.safetensors files over pytorch_model*.bin when safetensors model files are present".

On the other hand, model files can be completely arbitrary with no way to determine the file name in advance. It is impressive that the convert script currently fails so infrequently for supported architectures given how difficult it is to guess the model file names accurately.

I don't know if they really are that arbitrary, since models on HuggingFace pretty much all seem to fit in the model*.safetensors and pytorch_model*.bin globs, from what I've seen. Maybe it's defined somewhere in HuggingFace's transformers library. There are some exceptions, like the original LLaMa1/2 files from Meta, but that's handled by convert-legacy-llama.py.

It seems most model makers have converged on the HuggingFace naming of model files, and that's what convert-hf-to-gguf.py relies on, down to even reading the weight map (in either model.safetensors.index.json or pytorch_model.bin.index.json) to check for missing tensors during conversion from a multi-part model.
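The weight-map check described here can be sketched as follows (illustrative helper, not convert-hf-to-gguf.py's actual code): the index file maps tensor names to the part file containing them, so a missing part surfaces as tensors that were never read.

```python
# Sketch: compare the tensors listed in the HF index file's "weight_map"
# against the tensors actually read from the downloaded parts.
import json

def missing_tensors(index_json, tensors_read):
    weight_map = json.loads(index_json)["weight_map"]
    return set(weight_map) - set(tensors_read)

index = json.dumps({
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    }
})
missing = missing_tensors(index, {"model.embed_tokens.weight"})
```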

I agree that the errors still have a lot of room for user-friendliness, though, like in #7215 (comment), which was caused by missing files.

Comment on lines +960 to +973
UINT8 = auto()
INT8 = auto()
UINT16 = auto()
INT16 = auto()
UINT32 = auto()
INT32 = auto()
UINT64 = auto()
INT64 = auto()
FLOAT32 = auto()
FLOAT64 = auto()
BOOL = auto()
STRING = auto()
ARRAY = auto()
OBJECT = auto()
@compilade (Collaborator) commented Jun 2, 2024

These are stored in the GGUF model files, so I think they should stay explicitly defined with a number, to more easily match the C/C++ side.

Also note that they are explicitly defined in the GGUF spec: https://github.com/ggerganov/ggml/blob/2aae01fd9b8f9399f343cf18f46f38996ef52e2c/docs/gguf.md?plain=1#L155-L186

@teleprint-me (Contributor, Author)

Yeah, I knew this when I modified it. It was easier to parse and quickly scan, which is why I did it. Adding the object and grouping like objects helped reduce mental overhead. Technically, enumerations shouldn't be so coupled, but I understand why that's the case here.

Comment on lines +984 to +985
# NOTE: Maybe use numpy, e.g. np.dtypes to determine data type?
# Using base types is unreliable in python as all numbers in python are 64-bits.
@compilade (Collaborator)

Integers in Python 3 have arbitrary precision, they are not merely 64-bits wide. (for the float type, I don't know)

I think using Numpy number types could be done, if that kind of control is needed, but unfortunately the numbers have to be explicitly converted to the desired Numpy number type beforehand for type comparisons to work properly.
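Both points can be checked quickly (illustrative only): a Python int has arbitrary precision, so its magnitude alone cannot pin down a GGUF value type, whereas an explicit Numpy conversion fixes the width and makes type checks meaningful.

```python
# Python ints are arbitrary precision, not 64-bit.
import numpy as np

big = 2 ** 100                 # a perfectly valid Python int, 101 bits wide
assert big.bit_length() == 101

# An explicit Numpy conversion pins the width, enabling type-based checks.
x = np.uint32(7)
assert isinstance(x, np.unsignedinteger)
assert not isinstance(7, np.unsignedinteger)  # a plain int carries no width
```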

@teleprint-me (Contributor, Author)

This is something left for another PR, but I felt the note was worth adding. I'll dig up where I read that when I have some time.

Comment on lines 80 to 99
MODEL = "tokenizer.model" # STRING: e.g. llama, gpt2, etc...
TYPE = "tokenizer.type" # STRING: BPE, SPM, WPM, etc.
NORM = "tokenizer.norm" # OBJECT {"type": "ByteLevel", ...}
PRE = "tokenizer.pre" # OBJECT {"type": "ByteLevel", ...}
ADDED = "tokenizer.added" # ARRAY of OBJECTs: [{"id": 1, ...}, ...]
VOCAB = "tokenizer.vocab" # ARRAY of STRINGs: ["[BOS]", ...]
MERGES = "tokenizer.merges" # ARRAY of STRINGs: ["▁ t", ...]
TOKEN_TYPE = "tokenizer.token_type" # ARRAY of INT [2, ...]
TOKEN_TYPE_COUNT = "tokenizer.token_type_count" # BERT token types
SCORES = "tokenizer.scores" # WPM only
BOS_ID = "tokenizer.bos_token_id"
EOS_ID = "tokenizer.eos_token_id"
UNK_ID = "tokenizer.unknown_token_id"
SEP_ID = "tokenizer.seperator_token_id"
PAD_ID = "tokenizer.padding_token_id"
CLS_ID = "tokenizer.cls_token_id"
MASK_ID = "tokenizer.mask_token_id"
ADD_BOS = "tokenizer.add_bos_token"
ADD_EOS = "tokenizer.add_eos_token"
ADD_PREFIX = "tokenizer.add_space_prefix"
@compilade (Collaborator)

Right, so this is what you meant in #7379 (comment).

Renaming the GGUF metadata keys is definitely a breaking change. If you really want to rename these, you'll have to also rename them in the llama.cpp file, as well as in https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#tokenizer. This would also likely need a GGUF format version bump. There were probably reasons for using tokenizer.ggml. instead of simply tokenizer.. I don't know for sure, but the structure of the spec might mean it's intended to be possible to have multiple tokenizer types in the same model file.

(also note that tokenizer.ggml.scores are also used by SentencePiece models, so the comment # WPM only is probably incomplete)

@teleprint-me (Contributor, Author) commented Jun 2, 2024

I need consensus on this. If anyone disagrees, I'm open to alternatives, but this will become clearer once I'm ready to fully flesh everything out. I already looked at the C/C++ source code, but I'm waiting on @jaime-m-p's and @mofosyne's PRs because it will cause conflicts otherwise.

I published a "tokenizers" repo on my account if you want to play around with the toy examples. I have it "semi-functional" there. I needed to separate it from this branch though so I could freely experiment with some ideas without radically changing anything here.

The change here is "contained".

@compilade (Collaborator) commented Jun 2, 2024

I published a "tokenizers" repo on my account if you want to play around with the toy examples. I have it "semi-functional" there.

Interesting! I think there's likely a way to make your idea of the registry file work with the intentions of the existing GGUF metadata keys. It could even be included directly in the GGUF model files. Something similar to tokenizer.huggingface.json, but for a subset of it (to avoid including the vocab twice (or not?)), to include the pretokenizer, normalizer, and other relevant configs in the GGUF models.

From what I see in your vocab.cpp I think this is a great start! I didn't yet think about parsing JSON in C++. Sometimes I forget that common/json.hpp (aka nlohmann/json) is there.

I'm throwing some ideas here.

Maybe convert-hf-to-gguf.py should simply start including tokenizer.json (and tokenizer_config.json?) in the tokenizer.huggingface.json GGUF metadata key?

This way the tokenizer problems could be solved later without necessarily having to re-convert the model files. But I don't know to what extent parsing JSON (especially something like Llama 3's 9MB tokenizer.json) would affect the load time. If it's too slow, maybe subsets of it could be in different GGUF metadata keys?


Even though I tend to nitpick a lot, it's nice to see that you are experimenting with this.

@teleprint-me (Contributor, Author) commented Jun 2, 2024

Yeah, the refactored structure here is aligned with that idea: partition the most common key-value pairs so we only load, read, and write once. Isolating what's needed at runtime should have a negligible effect on load time.

I'm keeping in mind that we also rely on the SPM tokenizer.model files, which is why I kept the rest intact for the most part. I'm thinking about it in terms of compatibility.

I'd also like to eventually build a BPE tokenizer for llama.cpp that we can use natively. I think this would be cool and potentially useful for training. This would effectively reduce the reliance on third-party sources for in-house models.

I'm very interested in training my own models which is why this fascinates me. I have a few older cards, so I'm thinking I can rig them up and use vulkan, but that's another task entirely.

I digress, but in short, I think I may partially revert the structure for backwards compatibility, as I understand this is a valid concern. Maybe a set of transitional attributes could aid in streamlining both conversion and inference in the interim.

@Galunid mentioned this pull request Jun 15, 2024
Labels: build, devops, enhancement, examples, ggml, Nvidia GPU, python, Review Complexity: Medium, server, SYCL, testing, Vulkan

6 participants