DARE TIES merge of mixed fp16 and bf16 models cannot be exl2 quanted #204

Closed
jim-plus opened this issue Mar 20, 2024 · 7 comments

jim-plus commented Mar 20, 2024

I did a DARE TIES merge of three fp16 models and one bf16 model, outputting fp16 or bf16. The result can be converted to a Q8_0 GGUF, but there are runtime errors when converting to 8.0bpw h8 exl2. This has happened with two different YAML formulas. An example formula for bf16 output is on this model card: https://huggingface.co/grimjim/kunoichi-lemon-royale-7B

@jim-plus (Author)

Can mergekit operate correctly on a mix of fp16 and bf16 models by automatically converting to the target dtype?

cg123 (Collaborator) commented Mar 22, 2024

It does convert everything to the target dtype you specify if you have dtype set in the configuration. The fact that GGUF quantization works makes me think this is probably not an issue on the mergekit side.
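
For concreteness, a minimal sketch of what such a configuration can look like, written out here with PyYAML so the structure is explicit. The model names, densities, and weights below are placeholders rather than the actual recipe from the linked model card; the relevant part is the top-level "dtype" key, which casts every input tensor to a single target dtype before merging.

# Sketch only: model names and parameter values are illustrative placeholders.
import yaml

merge_config = {
    "merge_method": "dare_ties",
    "base_model": "mistralai/Mistral-7B-v0.1",  # hypothetical base model
    "models": [
        {"model": "example/fp16-model-a", "parameters": {"density": 0.5, "weight": 0.3}},
        {"model": "example/fp16-model-b", "parameters": {"density": 0.5, "weight": 0.3}},
        {"model": "example/bf16-model-c", "parameters": {"density": 0.5, "weight": 0.4}},
    ],
    # All inputs are cast to this dtype before merging, regardless of each
    # checkpoint's own torch_dtype (fp16 or bf16).
    "dtype": "bfloat16",
}

with open("dare_ties_merge.yml", "w") as f:
    yaml.safe_dump(merge_config, f, sort_keys=False)

The resulting YAML file can then be run through the mergekit-yaml entry point as usual.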

I do know that DARE TIES with lower density values can result in an unusual distribution of parameter magnitudes. The rescaling step tends to introduce large outliers that are critical to the behavior of the model. I saw this happen more with densities around 0.1 but it could be happening here too. Maybe this is throwing exl2 for a loop? I'm not sure.
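
To make the rescaling effect concrete, here is a small standalone illustration (not mergekit's actual code) of the DARE drop-and-rescale step: each delta is kept with probability equal to the density and the survivors are scaled by 1/density, so at density 0.1 the kept deltas are multiplied by 10, which is where the large outliers come from.

# Standalone illustration of DARE drop-and-rescale; not mergekit's implementation.
import torch

def dare_rescale(delta: torch.Tensor, density: float) -> torch.Tensor:
    # Keep each element with probability `density`, then rescale survivors
    # by 1/density so the expected value of the delta is preserved.
    mask = torch.rand_like(delta) < density
    return delta * mask / density

torch.manual_seed(0)
delta = torch.randn(1_000_000) * 0.01  # small fine-tuning deltas
for density in (0.5, 0.1):
    out = dare_rescale(delta, density)
    print(f"density={density}: max |delta| {delta.abs().max().item():.4f} "
          f"-> {out.abs().max().item():.4f}")

At density 0.5 the surviving deltas only double; at 0.1 they grow tenfold, which is the kind of magnitude spread that could conceivably trip up a quantizer.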

@turboderp any thoughts?

@turboderp

ExLlama converts everything to FP16 when loading or quantizing, and the difference in dynamic range compared to BF16 could conceivably be an issue, but I've never seen it in practice. I tried downloading that model and converting it here, and it seems to work fine? (Both in 4bpw h6 and 8bpw h8.)
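
One way to rule out the dynamic-range concern for a specific merge is to check whether any tensor in the merged checkpoint actually exceeds what FP16 can represent before it gets cast. A rough sketch, assuming the merged model is stored as safetensors shards in a local directory (the path is a placeholder):

# Rough check for values that would overflow when a BF16 checkpoint is cast to FP16.
import glob
import torch
from safetensors.torch import load_file

FP16_MAX = torch.finfo(torch.float16).max  # 65504.0

for shard in sorted(glob.glob("merged-model/*.safetensors")):  # placeholder path
    for name, tensor in load_file(shard).items():
        peak = tensor.float().abs().max().item()
        if peak > FP16_MAX:
            print(f"{shard}: {name} has max |value| {peak:.1f}, outside FP16 range")

If nothing prints, the BF16-to-FP16 cast stays within range and the problem is likely elsewhere.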

jim-plus (Author) commented Mar 23, 2024

I updated to the current version and retried. The 8.0bpw h8 exl2 quant completed, but the truncation length that was read in was only 2048 when I loaded it with ooba. The Q8_0 GGUF truncation length remained at 8192. I forgot to mention that I ran this under Windows 11, on a Ryzen 3 2200G, if that matters.


turboderp commented Mar 24, 2024

I wonder if this is a bug in TGW. As of v0.0.15, EXL2 adds a "quantization_config" key to config.json, which is the only place a length of 2048 would be mentioned in the model. It only appears under that key, though, as the calibration length. The model itself still lists a "max_position_embeddings" of 8192.

There's also a "max_new_tokens" key in generation_config.json that TGW might be reading? Not sure why it would use that key, but it might explain it. (I was looking at a different model; scratch that.)
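
A quick way to see where the 2048 actually lives is to read the quant's config.json directly; a small sketch, assuming the quantized model sits in a local directory (path is a placeholder):

# Print the context-length-related fields from an EXL2 quant's config.json.
import json

with open("kunoichi-lemon-royale-7B-exl2/config.json") as f:  # placeholder path
    cfg = json.load(f)

print("max_position_embeddings:", cfg.get("max_position_embeddings"))
calibration = cfg.get("quantization_config", {}).get("calibration", {})
print("calibration length (quantization metadata only):", calibration.get("length"))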

jim-plus (Author) commented Mar 24, 2024

I looked at the config.json in the generated result for the exl2 quant. I'm not sure why the length under calibration ended up at 2048. Manually adjusting it to 8192 resulted in ooba reporting the truncation length as 8192 rather than 2048. The value 2048 does not show up at all in the config.json of the merged result output by mergekit, so I should probably take this issue up with exllamav2 at this point.

{
    "_name_or_path": "SanjiWatsuki/Kunoichi-7B",
    "architectures": [
        "MistralForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "max_position_embeddings": 8192,
    "model_type": "mistral",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000.0,
    "sliding_window": 4096,
    "tie_word_embeddings": false,
    "torch_dtype": "bfloat16",
    "transformers_version": "4.38.2",
    "use_cache": true,
    "vocab_size": 32000,
    "quantization_config": {
        "quant_method": "exl2",
        "version": "0.0.16",
        "bits": 8.0,
        "head_bits": 6,
        "calibration": {
            "rows": 100,
            "length": 2048,
            "dataset": "(default)"
        }
    }
}

@turboderp

The length listed under quantization config has nothing to do with inference. It's just metadata for troubleshooting purposes. You'd have to ask ooba what config option they're (not) reading to arrive at 2048 as the default.
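
For completeness, the fallback a frontend would want here looks roughly like the following; this is an illustrative sketch, not TGW's or ExLlamaV2's actual loading code. The context window should come from "max_position_embeddings", and the calibration length under "quantization_config" should be ignored entirely.

# Illustrative sketch: derive a truncation length from config.json while
# ignoring the quantization_config calibration metadata.
import json

def truncation_length(config_path: str, fallback: int = 2048) -> int:
    with open(config_path) as f:
        cfg = json.load(f)
    # quantization_config -> calibration -> length is quantization-time
    # metadata and says nothing about the model's usable context window.
    return int(cfg.get("max_position_embeddings", fallback))

print(truncation_length("kunoichi-lemon-royale-7B-exl2/config.json"))  # placeholder path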
