DARE TIES merge of mixed fp16 and bf16 models cannot be exl2 quanted #204

Closed
jim-plus opened this issue Mar 20, 2024 · 7 comments

jim-plus commented Mar 20, 2024

I did a DARE TIES merge of three fp16 models and one bf16 model, outputting fp16 or bf16. The result can be converted to a Q8_0 GGUF, but there are runtime errors when converting to 8.0bpw h8 exl2. This has happened with two different YAML formulas. An example formula for bf16 output is on this model card: https://huggingface.co/grimjim/kunoichi-lemon-royale-7B

@jim-plus (Author)

Can mergekit operate correctly on a mix of fp16 and bf16 models by automatically converting to the target dtype?

cg123 (Collaborator) commented Mar 22, 2024

It does convert everything to the target dtype you specify if you have dtype set in the configuration. The fact that GGUF quantization works makes me think this is probably not an issue on the mergekit side.
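
For concreteness, a minimal sketch of what such a configuration can look like, written out here with PyYAML so the structure is explicit. The model names, densities, and weights below are placeholders rather than the actual recipe from the linked model card; the relevant part is the top-level "dtype" key, which casts every input tensor to a single target dtype before merging.

# Sketch only: model names and parameter values are illustrative placeholders.
import yaml

merge_config = {
    "merge_method": "dare_ties",
    "base_model": "mistralai/Mistral-7B-v0.1",  # hypothetical base model
    "models": [
        {"model": "example/fp16-model-a", "parameters": {"density": 0.5, "weight": 0.3}},
        {"model": "example/fp16-model-b", "parameters": {"density": 0.5, "weight": 0.3}},
        {"model": "example/bf16-model-c", "parameters": {"density": 0.5, "weight": 0.4}},
    ],
    # All inputs are cast to this dtype before merging, regardless of each
    # checkpoint's own torch_dtype (fp16 or bf16).
    "dtype": "bfloat16",
}

with open("dare_ties_merge.yml", "w") as f:
    yaml.safe_dump(merge_config, f, sort_keys=False)

The resulting YAML file can then be run through the mergekit-yaml entry point as usual.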

I do know that DARE TIES with lower density values can result in an unusual distribution of parameter magnitudes. The rescaling step tends to introduce large outliers that are critical to the behavior of the model. I saw this happen more with densities around 0.1 but it could be happening here too. Maybe this is throwing exl2 for a loop? I'm not sure.
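
To make the rescaling effect concrete, here is a small standalone illustration (not mergekit's actual code) of the DARE drop-and-rescale step: each delta is kept with probability equal to the density and the survivors are scaled by 1/density, so at density 0.1 the kept deltas are multiplied by 10, which is where the large outliers come from.

# Standalone illustration of DARE drop-and-rescale; not mergekit's implementation.
import torch

def dare_rescale(delta: torch.Tensor, density: float) -> torch.Tensor:
    # Keep each element with probability `density`, then rescale survivors
    # by 1/density so the expected value of the delta is preserved.
    mask = torch.rand_like(delta) < density
    return delta * mask / density

torch.manual_seed(0)
delta = torch.randn(1_000_000) * 0.01  # small fine-tuning deltas
for density in (0.5, 0.1):
    out = dare_rescale(delta, density)
    print(f"density={density}: max |delta| {delta.abs().max().item():.4f} "
          f"-> {out.abs().max().item():.4f}")

At density 0.5 the surviving deltas only double; at 0.1 they grow tenfold, which is the kind of magnitude spread that could conceivably trip up a quantizer.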

@turboderp any thoughts?

@turboderp

ExLlama converts everything to FP16 when loading or quantizing, and the difference in dynamic range compared to BF16 could conceivably be an issue, but I've never seen it in practice. I tried downloading that model and converting it here, and it seems to work fine? (Both in 4bpw h6 and 8bpw h8.)
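
One way to rule out the dynamic-range concern for a specific merge is to check whether any tensor in the merged checkpoint actually exceeds what FP16 can represent before it gets cast. A rough sketch, assuming the merged model is stored as safetensors shards in a local directory (the path is a placeholder):

# Rough check for values that would overflow when a BF16 checkpoint is cast to FP16.
import glob
import torch
from safetensors.torch import load_file

FP16_MAX = torch.finfo(torch.float16).max  # 65504.0

for shard in sorted(glob.glob("merged-model/*.safetensors")):  # placeholder path
    for name, tensor in load_file(shard).items():
        peak = tensor.float().abs().max().item()
        if peak > FP16_MAX:
            print(f"{shard}: {name} has max |value| {peak:.1f}, outside FP16 range")

If nothing prints, the BF16-to-FP16 cast stays within range and the problem is likely elsewhere.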

jim-plus (Author) commented Mar 23, 2024

I updated to the current version and retried. The 8.0bpw h8 exl2 quant completed, but the truncation length that was read in was only 2048 when I loaded it with ooba. The Q8_0 GGUF truncation length remained at 8192. I forgot to mention that I ran this under Windows 11, on a Ryzen 3 2200G, if that matters.


turboderp commented Mar 24, 2024

I wonder if this is a bug in TGW. As of v0.0.15, EXL2 adds a "quantization_config" key to config.json, which is the only place a length of 2048 would be mentioned in the model. It only appears under that key, though, as the calibration length. The model itself still lists a "max_position_embeddings" of 8192.

There's also a "max_new_tokens" key in generation_config.json that TGW might be reading? Not sure why it would use that key, but it might explain it. (I was looking at a different model; scratch that.)
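
A quick way to see where the 2048 actually lives is to read the quant's config.json directly; a small sketch, assuming the quantized model sits in a local directory (path is a placeholder):

# Print the context-length-related fields from an EXL2 quant's config.json.
import json

with open("kunoichi-lemon-royale-7B-exl2/config.json") as f:  # placeholder path
    cfg = json.load(f)

print("max_position_embeddings:", cfg.get("max_position_embeddings"))
calibration = cfg.get("quantization_config", {}).get("calibration", {})
print("calibration length (quantization metadata only):", calibration.get("length"))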

jim-plus (Author) commented Mar 24, 2024

I looked at the config.json in the generated result for the exl2 quant. I'm not sure why the length under calibration ended up at 2048. Manually adjusting it to 8192 resulted in ooba reporting the truncation length as 8192 rather than 2048. The value 2048 does not show up at all in the config.json of the merged result output by mergekit, so I should probably take this issue up with exllamav2 at this point.

{
    "_name_or_path": "SanjiWatsuki/Kunoichi-7B",
    "architectures": [
        "MistralForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "max_position_embeddings": 8192,
    "model_type": "mistral",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000.0,
    "sliding_window": 4096,
    "tie_word_embeddings": false,
    "torch_dtype": "bfloat16",
    "transformers_version": "4.38.2",
    "use_cache": true,
    "vocab_size": 32000,
    "quantization_config": {
        "quant_method": "exl2",
        "version": "0.0.16",
        "bits": 8.0,
        "head_bits": 6,
        "calibration": {
            "rows": 100,
            "length": 2048,
            "dataset": "(default)"
        }
    }
}

@turboderp

The length listed under quantization config has nothing to do with inference. It's just metadata for troubleshooting purposes. You'd have to ask ooba what config option they're (not) reading to arrive at 2048 as the default.
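
For completeness, the fallback a frontend would want here looks roughly like the following; this is an illustrative sketch, not TGW's or ExLlamaV2's actual loading code. The context window should come from "max_position_embeddings", and the calibration length under "quantization_config" should be ignored entirely.

# Illustrative sketch: derive a truncation length from config.json while
# ignoring the quantization_config calibration metadata.
import json

def truncation_length(config_path: str, fallback: int = 2048) -> int:
    with open(config_path) as f:
        cfg = json.load(f)
    # quantization_config -> calibration -> length is quantization-time
    # metadata and says nothing about the model's usable context window.
    return int(cfg.get("max_position_embeddings", fallback))

print(truncation_length("kunoichi-lemon-royale-7B-exl2/config.json"))  # placeholder path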
