Automate vocab support and model conversion #7379

teleprint-me · 2024-05-19T02:56:24Z

Automate Vocab Support and Model Conversion

Motivation:

This PR aims to automate the downloading of supported model architectures from HuggingFace, handle conversions (if needed), and address existing fragile implementations related to tokenizers for proper training, finetuning, and inference. The goal is to streamline the process by reducing manual implementations while making it easier for users.

Overview:

User navigates to huggingface.co website and downloads a supported model architecture (torch, safetensors, or GGUF). If the desired format (e.g., GGUF) is not available, we will download it in one of the other formats for conversion.
Download necessary files:
- Vocab files: tokenizer.json, tokenizer_config.json, config.json, and optionally tokenizer.model
- Model weights: The format may be *.pt(h), *.bin, or *.safetensors
Convert the model files to a GGUF model file (consolidated by default but can be split if needed). During conversion, metadata is extracted and written into the resulting GGUF model file. However, some metadata like normalizer and pre-tokenizer information are currently missing in this process.

Ideally, users should only need to add the vocab type, model type, and the repository name.

Goals:

Automate downloading of models from HuggingFace
Handle conversions (if needed)
Fix issues related to hardcoded handling of tokenizers
Improve overall user experience by reducing manual implementations

Notes:

This PR will focus on the conversion process for non-GGUF formats, such as torch and safetensors, into GGUF format.
Ref PR #6920 comment
This PR supersedes PR chore: Add hashsum for stablelm models #7018
This PR is a spinoff of PR chore: Add model vocab support #7117

Model Paths:

Models are still added to the models path by default unless explicitly passed as an option to be written somewhere else. This enables user preference.

The models are organized in a way that mirrors the expectation of a HuggingFace user, e.g.:

(.venv) git:(auto-model-support | θ) λ tree models -I private 
models
├── mistralai
│   ├── Mistral-7B-Instruct-v0.2
│   │   ├── config.json
│   │   ├── ggml-vocab-mistral-7b-instruct-v0.gguf
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   └── tokenizer.model
│   └── Mixtral-8x7B-Instruct-v0.1
│       ├── config.json
│       ├── ggml-vocab-mixtral-8x7b-instruct-v0.gguf
│       ├── tokenizer_config.json
│       ├── tokenizer.json
│       └── tokenizer.model

If a user does not have access to the repo, the path will be empty. This is a bug. I still need to fix that. There are workarounds currently implemented to ignore this unintended side-effect, e.g.:

├── databricks
│   └── dbrx-base

Advantages:

This increases and enables organization as a result. Keeping the models path clean and easy to navigate as needed. It also consolidates model paths as well. This allows us to reliably and predictably reference a model path.

What I like about this approach is that it will allow us to get the vocab, model, or both if desired. It also removes any concern of the growing repo size as we no longer need to include vocab files. We can dynamically generate them before or at runtime.

The model will be automatically downloaded if it is not locally available to the user. This allows us to fetch the model as needed.

Validation:

A JSON file, checksums.json, is automatically generated removing the need to update the conversion scripts every time a hash is generated. We can commit this file instead so we have a trusted reference we can all use.

This allows us to read from the dynamically generated JSON file as a source for validating tokenizers as well as other models. Using a regular expression to read, match, and then write to the file is error prone. This is a cleaner solution and I have considered other alternatives.

I'm open to feedback on how to handle this. e.g. It is possible to dynamically construct a AST, but this can introduce other issues without careful consideration.

Storing Tokenizer Models:

A key issue I've noticed is name conflicts. As more tokenizer models are added, the higher the likelihood we will run into issues. Instead, we can rely upon what type of tokenizer is available.

We already have an enum for the tokenizer type setup with enumerations set for BPE, SPM, and WPM. We can use a string or integer to identify the type of vocabulary, effectively reducing the amount of code and redundancy that would pile up as a result otherwise. Personally, I'm leaning towards a string and this would be the tokenizer.ggml.pre field value. e.g.

     16: STRING     |        1 | tokenizer.ggml.pre = 'jina-v2-en'

Would become

     STRING     |        1 | tokenizer.ggml.name = 'jina-v2-en'
     STRING     |        1 | tokenizer.ggml.type = 'WPM'
     ARRAY      |        1 | tokenizer.ggml.pre  = {"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"}

This is cleaner and easier to identify.

I've been considering how to do this for quite some time now. If this is undesirable, then I genuinely recommend adding a field for properly identifying the tokenizer model and properly labeling the type.

This is crucial to avoid confusion, keep in sync with modern terminology, and reduce redundancy in the code base that could result otherwise. The behavior, use, and application of this "pre-tokenizer" term is in direct conflict with what I assume most of us would think otherwise.

Identifying and Labeling Tokenizers:

Note that this is a rough sketch. I'm open to suggestions here:

    class Tokenizer:
        MODEL            = "tokenizer.ggml.model"   # We have enums for this
        HASH             = "tokenizer.ggml.hash"    # Embed the tokenizer vocab hash
        TYPE             = "tokenizer.ggml.type"    # BPE, SPM, WPM, etc.
        LIST             = "tokenizer.ggml.tokens"  # ...Rest remains the same for now
        TOKEN_TYPE       = "tokenizer.ggml.token_type"
        TOKEN_TYPE_COUNT = "tokenizer.ggml.token_type_count"  # ...BERT
        SCORES           = "tokenizer.ggml.scores"
        MERGES           = "tokenizer.ggml.merges"

Possible options might be:

tokenizer.ggml.pre: A member variable to set the regular expression.
tokenizer.ggml.model: A member variable that has the models name.
tokenizer.model.type: A member variable that has the tokenizer type.

TODO

Adds the following support:

Vocab support for phi-1, phi-1.5, and phi-2
Vocab support for stablelm models
Vocab support for mistral v0.1 and v0.2
Vocab support for mixtral v0.1
Optionally generate a generate-vocab.sh script
Optionally generate tokenizer model tests
TODO

I have other plans as well. I Will update this as I progress.

Tagging @CISC @compilade @ggerganov

…FaceHub

…igate confusion

…ing tokenizers

- Add imports for json and hashlib - Add missing models: phi, stablelm, mistral, and mixtral - Fix constructor logic - Fix how models are accessed - Apply model schema to download_model method

… to a json file

…ecksums and tests

… option

teleprint-me · 2024-05-27T19:50:28Z

Update:

The JSON structure is clear, but the hierarchical nature of associated metadata presents some challenges.

Challenges:

There are two paths I am considering to implement this solution effectively:

Embed key-value pairs from tokenizer.json directly into GGUF tokenizers and create a simplified C API for handling these data partitions. This approach would require translating the C API, which is not trivial and may result in duplicating Rust APIs unnecessarily.
Embed non-null JSON values themselves within the tokenizer model files. However, this method becomes complicated with hierarchical structures (e.g., Sequence or Metaspace) that are difficult to handle through the tokenizers API directly. In these cases, a custom API would be required to access additional configuration details.

Considerations:

I am currently developing a pseudo registry to aid in my understanding of the underlying implementation and making it easier to add new models using a CLI tool without any complications.

The sha256sum for encodings is primarily used to validate vocabularies, but I'm starting to question its usefulness as it feels more like a distraction than an essential feature.

To mitigate potential issues with regular expressions, validating them in aggregate might be beneficial. Regular expressions can be difficult to read and understand unless you are familiar with the author's intentions, making them write-only by nature.

teleprint-me · 2024-05-28T19:35:52Z

Notes:

Depending on the model used, the rules applied differ. Deepseek uses a type Sequence and each rule is applied in sequence.

  "normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{L}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{P}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

You could associate each normalizer and pre-tokenizer rule via an array and associated index, but this would be a lot more cumbersome than one might hope, even if it were simplified to the way it is currently implemented.

I like the apparent elegance of simply taking each pre-tokenizer and treating them as an array, but this is problematic at best because now we have no idea whether or not to trim, add spaces, or use the associated regular expression.

Some rules don't even need it at all. A reg ex could be available, but that doesn't necessarily mean it should be used. I've noticed this is quite common in trained tokenizers.

This is obviously affecting the tokenized output.

teleprint-me · 2024-05-28T20:07:08Z

Notes:

Pair scores can be calculated using the formula:

freq / (n_1 * n_2)

This only applies to Word Piece Models.

…rences

…in constants.py

teleprint-me · 2024-05-30T19:21:52Z

Notes:

::format is confusing and ambiguous. Should be renamed to llama_format or ggml_format for backward compat.

// NOTE: Renamed to prevent name conflicts and support backward compatibility
static std::string ggml_format(const char* fmt, ...) {
    va_list ap;
    va_list ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    int size = vsnprintf(NULL, 0, fmt, ap);
    GGML_ASSERT(size >= 0 && size < INT_MAX); // NOLINT
    std::vector<char> buf(size + 1);
    int               size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
    GGML_ASSERT(size2 == size);
    va_end(ap2);
    va_end(ap);
    return std::string(buf.data(), size);
}

I spent way more time on this than I would have liked. C++ 20 and higher has std::format, but any standards below do not and the default standard for llama.cpp is 11. Renaming this function resolves this ambiguity.

Possible remedy is to utilize C++ namespacing and apply proper prefix for each respective namespace, e.g. GGML and LLAMA namespaces.

teleprint-me · 2024-05-31T18:21:20Z

I can definitely do this. The only issue is how much I'm allowed to refactor to achieve this goal. It will break backwards compatibility but will end up with much cleaner code and a more automated process as a result. The workflow would remain the same.

…zers

teleprint-me · 2024-06-01T06:48:28Z

02:22:41 | ~/Local/code/tokenizers
(.venv) git:(main | Δ) λ bpython
bpython version 0.24 on top of Python 3.12.3 /home/austin/Local/code/tokenizers/.venv/bin/python
>>> import logging
>>> import pprint
>>> from pathlib import Path
>>> from tok.gguf.huggingface_hub import HFHubModel
>>>
>>> auth_token = "<auth_read_token>"
>>> path = Path("models")
>>> logger = logging.getLogger('test')
>>> hub_model = HFHubModel(auth_token, path, logger)
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (cache).
Your token has been saved to /home/austin/.cache/huggingface/token
Login successful
>>> hub_model.request.list_remote_files
<bound method HFHubRequest.list_remote_files of <tok.gguf.huggingface_hub.HFHubRequest object at 0x787e7ab5e990>>
>>> model_repo = "stabilityai/stablelm-2-zephyr-1_6b"
>>> model_files = hub_model.request.list_remote_model_parts(model_repo)
>>> pprint.pprint(model_files, indent=2)
['model.safetensors']
>>> stablelm = "stabilityai/stablelm-2-zephyr-1_6b"
>>> llama3 = "meta-llama/Meta-Llama-3-8B"
>>> llama2 = "meta-llama/Llama-2-7b-hf"
>>> hub_model.request.list_remote_model_parts(llama2)
['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors']
>>> hub_model.request.list_remote_model_parts(llama3)
['model-00001-of-00004.safetensors', 'model-00002-of-00004.safetensors', 'model-00003-of-00004.safetensors', 'model-00004-o
f-00004.safetensors']
>>>

This works fine in most cases, but fails every now and then.

>>> falcon = "tiiuae/falcon-7b-instruct"
>>> hub_model.request.list_remote_model_parts(falcon)
['coreml/text-generation/falcon-7b-64-float32.mlpackage/Data/com.apple.CoreML/weights/weight.bin', 'pytorch_model-00001-of-
00002.bin', 'pytorch_model-00002-of-00002.bin']

Not sure if should hardcode the model parts on a model by model basis or use a selection method for edge cases, e.g. expected 2 model parts but got 3 instead and then have a selection pop up to mediate it.

This is nice because these aren't local. These are completely remote. So if you have read access on a gated model, it works. If it's public, no problem.

compilade · 2024-06-01T14:17:11Z

Not sure if should hardcode the model parts on a model by model basis or use a selection method for edge cases, e.g. expected 2 model parts but got 3 instead and then have a selection pop up to mediate it.

Model parts are arbitrary. The same model with the same tokenizer and weights could be split into a different number of valid model parts. I don't think this should be hardcoded.

Missing model parts are already detected in convert-hf-to-gguf.py when the weight map doesn't match what was read from the model parts.

This works fine in most cases, but fails every now and then.

convert-hf-to-gguf.py only checks the toplevel directory, so maybe ignore all deeper files when listing remote model parts with huggingface_hub?

teleprint-me · 2024-06-01T23:52:53Z

@compilade Yes, I understand, but that's not always true.

>>> mistral3 = "mistralai/Mistral-7B-Instruct-v0.3"
>>> 
>>> model_parts = hub_model.request.list_remote_model_parts(mistral3)
>>> 
>>> model_parts
['consolidated.safetensors', 'model-00001-of-00003.safetensors', 'model-00002-of-00003.safetensors', 'model-00003-of-00003.
safetensors']

The convert-hf-to-gguf.py script crashed until I removed the consolidated.safetensors model file. It could not intelligently handle this context.

This isn't the only situation like this.

For example, llama3 returns the expected results, but falcon doesn't even though they both have nested paths (e.g. llama3 has the original model files and the HF files). The APIs results are inconsistent.

Regardless, this might hold true depending on the user.

It's easier to have a list that a user can select from to verify whenever a deviation is encountered.

The only other mitigation I can think of is for the user to specify the model file types, parts, and number of parts on a model by model basis.

Another potential solution have a list of these files available in each of the model definitions.

If the desire is to not encounter any prompts and just "magically" convert, the only rational move here is to list the model parts on a model-by-model basis in each model definition and then reference that list as necessary. It's the most sane thing I can think of rather than coming up with hacky solutions to constantly work around the issue which will always cause issues unless the problem is known and understood by the user.

compilade · 2024-06-02T00:14:34Z

If the desire is to not encounter any prompts and just "magically" convert, the only rational move here is to list the model parts on a model-by-model basis in each model definition and then reference that list as necessary. It's the most sane thing I can think of rather than coming up with hacky solutions to constantly work around the issue which will always cause issues unless the problem is known and understood by the user.

Do all valid model parts match the globs pytorch_model*.bin and model*.safetensors? If so, maybe this should be checked instead of listing everything.

I'm not sure listing model parts on a model-by-model is possible, since there are many different models using the same model arch. Some have multiple sizes, some are used by completely different types of models (e.g. MixtralForCausalLM is used by some MoE merges too (and some use model-00001-of-00001.safetensors instead of model.safetensors when there's a single part), and I don't think it's reasonable to expect such a list to be exhaustive. Model merges can change the size of a model, yet still remain compatible with llama.cpp. I think that's something that should continue to work.

teleprint-me · 2024-06-02T01:01:07Z

Hm. I agree that this can indeed be more challenging than initially anticipated, especially with the arbitrary nature of model files across different architectures. Most of the code for handling tokenizers is already set up on the Python end, and my goal here is to integrate the rest into the pipeline.

For simplicity's sake, let's address this issue by providing users a list of available file types to select from when fetching files remotely or using local ones if they exist. This way, we ensure that the user has control over the process and can easily handle edge cases as needed:

def select_file_type(self, tensor_types: set[str]) -> str:
    # If more than one type found, ask the user
    if len(tensor_types) > 1:
        questions = [
            inquirer.List(
                "reader",
                message="Which tensor type would you like to read?",
                choices=list(tensor_types),
            ),
        ]
        answers = inquirer.prompt(questions)
        return answers["reader"]
    elif len(tensor_types) == 1:
        return next(iter(tensor_types))
    else:
        raise ValueError("No recognized tensor files found.")

This approach was used for reading metadata from Safetensors or Torch models, but the idea is similar. We read the repo file listing, list the available files, have the user select the ones they need, and proceed accordingly.

The tokenizer files are relatively straightforward as they're almost always the same aside from a few edge cases, but these usually don't cause significant issues due to their simplicity.

On the other hand, model files can be completely arbitrary with no way to determine the file name in advance. It is impressive that the convert script currently fails so infrequently for supported architectures given how difficult it is to guess the model file names accurately. However, we know from experience that this implementation isn't robust enough to handle all edge cases consistently.

By providing a user-friendly selection method, we can make the process more manageable and less prone to unexpected errors when dealing with arbitrary model files across various architectures.

compilade · 2024-06-02T02:03:20Z

For simplicity's sake, let's address this issue by providing users a list of available file types to select from when fetching files remotely or using local ones if they exist. This way, we ensure that the user has control over the process and can easily handle edge cases as needed:

There are likely people using the convert scripts non-interactively, so I don't think asking the user a question like this would be appropriate for all cases. But I might be wrong.

For selecting the file type, a command-line flag should be appropriate, which could be used to override the default of "preferring model*.safetensors files over pytorch_model*.bin when safetensors model files are present".

On the other hand, model files can be completely arbitrary with no way to determine the file name in advance. It is impressive that the convert script currently fails so infrequently for supported architectures given how difficult it is to guess the model file names accurately.

I don't know if they really are that arbitrary, since models on HuggingFace pretty much all seem to fit in the model*.safetensors and pytorch_model*.bin globs, from what I've seen. Maybe it's defined somewhere in HuggingFace's transformers library. There are some exceptions, like the original LLaMa1/2 files from Meta, but that's handled by convert-legacy-llama.py.

It seems most model makers have converged on the HuggingFace naming of model files, and that's what convert-hf-to-gguf.py relies on, down to even reading to weight map (in either model.safetensors.index.json or pytorch_model.bin.index.json) to check for missing tensors during conversion from a multi-part model.

I agree that the errors still have a lot of room for user-friendliness, though, like in #7215 (comment), which was caused by missing files.

compilade · 2024-06-02T01:29:47Z

gguf-py/gguf/constants.py

+    UINT8   = auto()
+    INT8    = auto()
+    UINT16  = auto()
+    INT16   = auto()
+    UINT32  = auto()
+    INT32   = auto()
+    UINT64  = auto()
+    INT64   = auto()
+    FLOAT32 = auto()
+    FLOAT64 = auto()
+    BOOL    = auto()
+    STRING  = auto()
+    ARRAY   = auto()
+    OBJECT  = auto()


These are stored in the GGUF model files, so I think they should stay explicitly defined with a number, to more easily match the C/C++ side.

Also note that they are explicitly defined in the GGUF spec: https://github.com/ggerganov/ggml/blob/2aae01fd9b8f9399f343cf18f46f38996ef52e2c/docs/gguf.md?plain=1#L155-L186

Yeah, I knew this when I modified this. It was easier to parse and quickly scan which is why I did this. Adding the object and grouping like objects helped reduce mental overhead. Technically, enumerations shouldn't be so coupled, but I understand why that's the case here.

compilade · 2024-06-02T01:34:04Z

gguf-py/gguf/constants.py

+        # NOTE: Maybe use numpy, e.g. np.dtypes to determine data type?
+        # Using base types is unreliable in python as all numbers in python are 64-bits.


Integers in Python 3 have arbitrary precision, they are not merely 64-bits wide. (for the float type, I don't know)

I think using Numpy number types could be done, if that kind of control is needed, but unfortunately the numbers have to be explicitly converted to the desired Numpy number type beforehand for type comparisons to work properly.

This something left for another PR, but I felt the note was worth adding. I'll dig up where I read that if I can when I have some time.

compilade · 2024-06-02T01:51:44Z

gguf-py/gguf/constants.py

+        MODEL            = "tokenizer.model"  # STRING: e.g. llama, gpt2, etc...
+        TYPE             = "tokenizer.type"  # STRING: BPE, SPM, WPM, etc.
+        NORM             = "tokenizer.norm"  # OBJECT {"type": "ByteLevel", ...}
+        PRE              = "tokenizer.pre"  # OBJECT {"type": "ByteLevel", ...}
+        ADDED            = "tokenizer.added"  # ARRAY of OBJECTs: [{"id": 1, ...}, ...]
+        VOCAB            = "tokenizer.vocab"  # ARRAY of STRINGs: ["[BOS]", ...]
+        MERGES           = "tokenizer.merges"  # ARRAY of STRINGs: ["▁ t", ...]
+        TOKEN_TYPE       = "tokenizer.token_type"  # ARRAY of INT [2, ...]
+        TOKEN_TYPE_COUNT = "tokenizer.token_type_count"  # BERT token types
+        SCORES           = "tokenizer.scores"  # WPM only
+        BOS_ID           = "tokenizer.bos_token_id"
+        EOS_ID           = "tokenizer.eos_token_id"
+        UNK_ID           = "tokenizer.unknown_token_id"
+        SEP_ID           = "tokenizer.seperator_token_id"
+        PAD_ID           = "tokenizer.padding_token_id"
+        CLS_ID           = "tokenizer.cls_token_id"
+        MASK_ID          = "tokenizer.mask_token_id"
+        ADD_BOS          = "tokenizer.add_bos_token"
+        ADD_EOS          = "tokenizer.add_eos_token"
+        ADD_PREFIX       = "tokenizer.add_space_prefix"


Right, so this is what you meant in #7379 (comment).

Renaming the GGUF metadata keys is definitely a breaking change. If you really want to rename these, you'll have to also rename them in the llama.cpp file, as well as in https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#tokenizer. This would also likely need a GGUF format version bump. There were probably reasons for using tokenizer.ggml. instead of simply tokenizer.. I don't know for sure, but the structure of the spec might mean it's intended to be possible to have multiple tokenizer types in the same model file.

(also note that tokenizer.ggml.scores are also used by SentencePiece models, so the comment # WPM only is probably incomplete)

I need consensus on this. If anyone disagrees with this, I'm open to alternatives, but this will become more clear once I'm ready to full flesh everything out. I already looked at the C/C++ source code, but I'm waiting on @jaime-m-p and @mofosyne PRs because it will cause conflicts otherwise.

I published a "tokenizers" repo on my account if you want to play around with the toy examples. I have it "semi-functional" there. I needed to separate it from this branch though so I could freely experiment with some ideas without radically changing anything here.

The change here is "contained".

I published a "tokenizers" repo on my account if you want to play around with the toy examples. I have it "semi-functional" there.

Interesting! I think there's likely a way to make your idea of the registry file work with the intentions of the existing GGUF metadata keys. It could even be included directly in the GGUF model files. Something similar to tokenizer.huggingface.json, but for a subset of it (to avoid including the vocab twice (or not?)), to include the pretokenizer, normalizer, and other relevant configs in the GGUF models.

From what I see in your vocab.cpp I think this is a great start! I didn't yet think about parsing JSON in C++. Sometimes I forget that common/json.hpp (aka nlohmann/json) is there.

I'm throwing some ideas here.

Maybe convert-hf-to-gguf.py should simply start including tokenizer.json (and tokenizer_config.json?) in the tokenizer.huggingface.json GGUF metadata key?

This way the tokenizer problems could be solved later without necessarily having to re-convert the model files. But I don't know to what extent parsing JSON (especially something like Llama 3's 9MB tokenizer.json) would affect the load time. If it's too slow, maybe subsets of it could be in different GGUF metadata keys?

Even though I tend to nitpick a lot, it's nice to see that you are experimenting with this.

Yeah, the refactored structure here is aligned with that idea. Partition the most common key-value pairs so we only load, read, and write once. This should negligibly affect the load time isolating what's needed at runtime.

I'm keeping in mind that we also rely on the SPM tokenizer.model files as well which is why I kept the rest "in-tact" for the most part. I'm thinking about it in terms of compatibility.

I'd also like to eventually build a BPE tokenizer for llama.cpp that we can use natively. I think this would be cool and potentially useful for training. This would effectively reduce the reliance on third-party sources for in-house models.

I'm very interested in training my own models which is why this fascinates me. I have a few older cards, so I'm thinking I can rig them up and use vulkan, but that's another task entirely.

I digress, but in short, I think I may partially revert the structure for backwards compatibility as I understand this is a valid concern. Maybe a set of transitional attributes could aide in streamlining this more effectively in both conversion and inference for the interim.

teleprint-me added 23 commits May 17, 2024 20:00

feat: Add prototype for managing huggingface hub content

dbdf6c2

feat: Add utils for logging and writing when interacting with Hugging…

ba13d64

…FaceHub

refactor: Add log for status and fix url path variable name

742abeb

patch: Apply minor fixes for handling headers and writing content

98cf788

feat: Add prototype for requesting vocab related files

4790f76

feat: Add download_model method and fix references for clarity to mit…

5c8144e

…igate confusion

refactor: Clean up variable names and separate concerns when download…

1a286c8

…ing tokenizers

chore: Fix spacing

3ba01c7

feat: Add tokenizer types, model types, and model repos

f7515ab

refactor: Apply model schema to tokenizer downloads

3022587

- Add imports for json and hashlib - Add missing models: phi, stablelm, mistral, and mixtral - Fix constructor logic - Fix how models are accessed - Apply model schema to download_model method

feat: Add method for generating the checksums and writing the results…

b2ca23c

… to a json file

feat: Add pre-tokenizer logging

5eda2c9

refactor: Apply SoC for HF requests, vocab, and weights

2ef73ee

chore: Apply isort to package gguf init

04fb788

feat: Add pre-tokenizer CLI tooling

832b449

chore: Fix line spacing

b6f70b8

chore: Fix model path references

006bb60

feat: Add example script for automating generating tokenizer model ch…

1a82573

…ecksums and tests

chore: Remove cluttered vocab files

4b3735c

patch: Add exception handling for non-existent vocab related files

0479e96

feat: Add function for generating vocab script and fix CLI opts

bd32266

feat: Add vocab generation script

d02a0f4

Merge branch 'master' into auto-model-support

ce777c8

teleprint-me marked this pull request as draft May 19, 2024 02:56

github-actions bot added the python python script changes label May 19, 2024

teleprint-me mentioned this pull request May 19, 2024

chore: Add model vocab support #7117

Closed

teleprint-me added 4 commits May 18, 2024 23:34

fix: Apply fix to verbose help description and generating vocab tests…

da5deeb

… option

patch: Fix CLI option for generating vocab tests

316b404

refactor: Simplify the get_vocab_base_pre method

5840b6f

fix: Remove dangling if statement

dcc5d42

teleprint-me added 6 commits May 28, 2024 18:42

refactor: Simplify tokenizers implementation

9dbc957

chore: Fix import path, token comparisons, and update token type refe…

aa28cfe

…rences

refactor: Simplify huggingface hub api and update to reflect changes …

f1d067e

…in constants.py

refactor: Apply updates to example script for generating the registry

5c92809

Merge branch 'master' into auto-model-support

6a725cf

Merge branch 'master' into auto-model-support

de0f0d0

Merge branch 'master' into auto-model-support

c2e4897

teleprint-me mentioned this pull request May 31, 2024

feat: add changes to handle jina v2 base code #7596

Merged

teleprint-me added 3 commits May 31, 2024 20:35

refactor: Add prototyped bridge interface for tokenizers and llama.cpp

47ef615

refactor: Add prototyped bridge interface for transformers and tokeni…

c447010

…zers

patch: Apply fix for backward compat for source repo

647d252

jaime-m-p mentioned this pull request Jun 1, 2024

Per token attributes #7685

Merged

compilade mentioned this pull request Jun 2, 2024

convert-hf : match model part name prefix and suffix #7687

Merged

Merge branch 'master' into auto-model-support

250bddf

compilade reviewed Jun 2, 2024

View reviewed changes

teleprint-me added 3 commits June 1, 2024 22:23

chore: Add ignore rule for generated server themes

e2b7608

Merge branch 'ignore-gen-themes' into auto-model-support

ce8524a

refactor: Clean up constants and simplify the custom hf hub api

5836d6c

Galunid mentioned this pull request Jun 15, 2024

support MiniCPM-V-2.5 #7599

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automate vocab support and model conversion #7379

Automate vocab support and model conversion #7379

teleprint-me commented May 19, 2024 •

edited

Loading

teleprint-me commented May 27, 2024

teleprint-me commented May 28, 2024

teleprint-me commented May 28, 2024

teleprint-me commented May 30, 2024 •

edited

Loading

teleprint-me commented May 31, 2024

teleprint-me commented Jun 1, 2024 •

edited

Loading

compilade commented Jun 1, 2024 •

edited

Loading

teleprint-me commented Jun 1, 2024 •

edited

Loading

compilade commented Jun 2, 2024 •

edited

Loading

teleprint-me commented Jun 2, 2024

compilade commented Jun 2, 2024 •

edited

Loading

compilade Jun 2, 2024 •

edited

Loading

teleprint-me Jun 2, 2024

compilade Jun 2, 2024

teleprint-me Jun 2, 2024

compilade Jun 2, 2024

teleprint-me Jun 2, 2024 •

edited

Loading

compilade Jun 2, 2024 •

edited

Loading

teleprint-me Jun 2, 2024 •

edited

Loading

		# NOTE: Maybe use numpy, e.g. np.dtypes to determine data type?
		# Using base types is unreliable in python as all numbers in python are 64-bits.

Automate vocab support and model conversion #7379

Are you sure you want to change the base?

Automate vocab support and model conversion #7379

Conversation

teleprint-me commented May 19, 2024 • edited Loading

Automate Vocab Support and Model Conversion

TODO

teleprint-me commented May 27, 2024

teleprint-me commented May 28, 2024

teleprint-me commented May 28, 2024

teleprint-me commented May 30, 2024 • edited Loading

teleprint-me commented May 31, 2024

teleprint-me commented Jun 1, 2024 • edited Loading

compilade commented Jun 1, 2024 • edited Loading

teleprint-me commented Jun 1, 2024 • edited Loading

compilade commented Jun 2, 2024 • edited Loading

teleprint-me commented Jun 2, 2024

compilade commented Jun 2, 2024 • edited Loading

compilade Jun 2, 2024 • edited Loading

Choose a reason for hiding this comment

teleprint-me Jun 2, 2024

Choose a reason for hiding this comment

compilade Jun 2, 2024

Choose a reason for hiding this comment

teleprint-me Jun 2, 2024

Choose a reason for hiding this comment

compilade Jun 2, 2024

Choose a reason for hiding this comment

teleprint-me Jun 2, 2024 • edited Loading

Choose a reason for hiding this comment

compilade Jun 2, 2024 • edited Loading

Choose a reason for hiding this comment

teleprint-me Jun 2, 2024 • edited Loading

Choose a reason for hiding this comment

teleprint-me commented May 19, 2024 •

edited

Loading

teleprint-me commented May 30, 2024 •

edited

Loading

teleprint-me commented Jun 1, 2024 •

edited

Loading

compilade commented Jun 1, 2024 •

edited

Loading

teleprint-me commented Jun 1, 2024 •

edited

Loading

compilade commented Jun 2, 2024 •

edited

Loading

compilade commented Jun 2, 2024 •

edited

Loading

compilade Jun 2, 2024 •

edited

Loading

teleprint-me Jun 2, 2024 •

edited

Loading

compilade Jun 2, 2024 •

edited

Loading

teleprint-me Jun 2, 2024 •

edited

Loading