Add gemma model #5631

Merged: 1 commit merged into ggerganov:master on Feb 21, 2024

Conversation

@postmasters (Contributor, Author)

There are a couple of things in this architecture:

  1. Shared input and output embedding parameters.
  2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from https://huggingface.co/google.
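
To make point 2 concrete, here is a quick check using the hyperparameter values from the GGUF metadata dumped further down in this thread (a worked example, not converter code):

# Values taken from the gemma-7b metadata shown later in this thread.
n_embd  = 3072   # gemma.embedding_length
n_head  = 16     # gemma.attention.head_count
key_len = 256    # gemma.attention.key_length

# In most llama-style models the per-head key/value size is n_embd // n_head,
# which would be 192 here; Gemma instead declares an explicit key/value length of 256.
assert n_embd // n_head == 192
assert key_len == 256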

ggerganov merged commit 580111d into ggerganov:master on Feb 21, 2024
53 checks passed
@pablodz commented Feb 21, 2024

that was fast

@girmay commented Feb 21, 2024

Holy Moses. This was fast. Thank you.

@akx (Contributor) commented Feb 21, 2024

A model converted and quantized from the safetensors weights still fails for me with

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found

There is, however, a

[ 254/ 254]                   output_norm.weight - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB

tensor visible in the conversion and quantization output.

@sroecker (Contributor) commented Feb 21, 2024

More information about the models can be found at https://ai.google.dev/gemma. GGUFs can be downloaded from https://huggingface.co/google.

Interesting, is there a reason why the GGUF file is twice as large as the safetensors?

@postmasters (Contributor, Author)

A model converted and quantized from the safetensors weights still fails with

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found

This depends on how your conversion is done. There are two things to make sure of: 1) the arch must be gemma, and 2) there is no output weight in this arch because it shares the same embedding weights as the input layer. The error you see suggests that the arch is likely not set or copied correctly by the converter.
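
For illustration only, here is a minimal PyTorch sketch (not code from this PR or from llama.cpp) of the weight tying being described: the output projection reuses the token embedding matrix, so no separate output.weight has to exist in the checkpoint.

import torch
import torch.nn as nn

class TiedEmbeddingHead(nn.Module):
    # Toy module with hypothetical sizes: one matrix serves as both the input
    # embedding and the output (logits) projection, so a checkpoint only needs
    # to store it once.
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def embed_tokens(self, ids: torch.Tensor) -> torch.Tensor:
        return self.embed(ids)

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output projection = hidden states times the transposed embedding matrix.
        return hidden @ self.embed.weight.t()

model = TiedEmbeddingHead(vocab_size=1000, hidden_size=64)
hidden = torch.randn(1, 4, 64)
print(model.logits(hidden).shape)  # torch.Size([1, 4, 1000])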

Interesting, is there a reason why the GGUF file is twice as large as the safetensors?

The weights here are as close to the internal checkpoints as you can get. They are in float32. We are relying on the community to experiment with other quantized versions ;). For example, you could use the quantize tool included in this repository to produce an F16 version.
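
Rough arithmetic behind the size question, using the parameter count the loader reports later in this thread (a back-of-the-envelope sketch, not an exact accounting):

params = 8.54e9                      # gemma-7b parameter count from llm_load_print_meta
print(f"{params * 2 / 1e9:.0f} GB")  # ~17 GB at 2 bytes/param (bf16 safetensors)
print(f"{params * 4 / 1e9:.0f} GB")  # ~34 GB at 4 bytes/param (the released float32 GGUF)

This lines up with the 17 GB vs 34 GB figures mentioned further down in the thread.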

@cebtenzzre (Collaborator)

This PR doesn't make any changes to the convert scripts. How do I convert a Gemma model to GGUF?

@postmasters (Contributor, Author)

This PR doesn't make any changes to the convert scripts. How do I convert a Gemma model to GGUF?

You could simply download the models released on HuggingFace, for example https://huggingface.co/google/gemma-2b/blob/main/gemma-2b.gguf.
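
If you prefer to fetch it from a script rather than the browser, something along these lines should work (this assumes the huggingface_hub package is installed and that you have accepted the model license / logged in where required):

from huggingface_hub import hf_hub_download

# Download the released F32 GGUF into the local Hugging Face cache
# and return its local file path.
path = hf_hub_download(repo_id="google/gemma-2b", filename="gemma-2b.gguf")
print(path)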

@cebtenzzre (Collaborator) commented Feb 21, 2024

You could simply download the models released on HuggingFace

Are there plans to open-source the conversion scripts used, or will the community have to implement them? The safetensors checkpoint is a smaller download (presumably because of BF16 being converted to F32?) and one would imagine that people would like to be able to manipulate the Transformers weights (merge, finetune, etc.) before converting to GGUF, just as they do with other model architectures.

@postmasters (Contributor, Author)

I don't work with SafeTensors, so I can't promise I will take this up personally. I'm sure folks will contribute later though 🤞.

@ggerganov (Owner)

Yup, hope we get some insights. I tried updating convert-hf-to-gguf.py to support the conversion, but something is missing because the inference produces garbage:

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 9771fccf..d328e524 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -218,6 +218,8 @@ class Model:
             return BertModel
         if model_architecture == "NomicBertModel":
             return NomicBertModel
+        if model_architecture in "GemmaForCausalLM":
+            return GemmaModel
         return Model
 
     def _is_model_safetensors(self) -> bool:
@@ -277,6 +279,8 @@ class Model:
             return gguf.MODEL_ARCH.BERT
         if arch == "NomicBertModel":
             return gguf.MODEL_ARCH.NOMIC_BERT
+        if arch in "GemmaForCausalLM":
+            return gguf.MODEL_ARCH.GEMMA
 
         raise NotImplementedError(f'Architecture "{arch}" not supported!')
 
@@ -1785,6 +1789,24 @@ class NomicBertModel(BertModel):
             yield name, data
 
 
+class GemmaModel(Model):
+    def set_vocab(self):
+        self._set_vocab_sentencepiece()
+
+    def set_gguf_parameters(self):
+        hparams = self.hparams
+        block_count = hparams["num_hidden_layers"]
+
+        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
+        self.gguf_writer.add_embedding_length(hparams["hidden_size"])
+        self.gguf_writer.add_block_count(block_count)
+        self.gguf_writer.add_feed_forward_length(hparams["intermediate_size"])
+        self.gguf_writer.add_head_count(hparams["num_attention_heads"])
+        self.gguf_writer.add_layer_norm_rms_eps(self.hparams["rms_norm_eps"])
+        self.gguf_writer.add_head_count_kv(self.hparams["num_key_value_heads"] if "num_key_value_heads" in hparams else hparams["num_attention_heads"])
+
+

One strange thing is that the vocab size in config.json is specified as 256000, but the token_embd.weight tensor in the F32 GGUF files that are already provided has dimensions of [2048, 256128]. I tried padding with 128, but it still produces garbage.

The F32 GGUF files work as expected

I'm currently testing just with the 2B model

@akx (Contributor) commented Feb 21, 2024

if model_architecture in "GemmaForCausalLM":

That won't do the right thing: with a string on the right-hand side, Python's `in` performs a substring test, not an equality check.
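
A quick standalone illustration of the pitfall (unrelated to the converter itself):

# `x in s` on two strings tests substring containment, not equality.
print("Gemma" in "GemmaForCausalLM")             # True  (substring)
print("" in "GemmaForCausalLM")                  # True  (every string contains "")
print("GemmaForCausalLM" == "GemmaForCausalLM")  # True  (the intended check)
print("Gemma" == "GemmaForCausalLM")             # False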

@postmasters (Contributor, Author)

I would not be surprised if the Gemma implementation in HF Transformers requires different transposes of the weight tensors than the implementation in this PR.

@ggerganov (Owner) commented Feb 21, 2024

Huh very weird. I've been dumping the tensors from the locally converted models and comparing the values with the provided F32 GGUF models. The values are not transposed.

However, all norm tensors have values that are ~1.0f less compared to the F32 GGUF data.
So if I apply this change, the conversion starts working:

# Huh? Why is this needed?
if name.endswith(("norm.weight")):
    data_torch = data_torch + 1

Here is the full convert-hf-to-gguf.py patch that produces working models:

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 9771fccf..e88308dc 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -218,6 +218,8 @@ class Model:
             return BertModel
         if model_architecture == "NomicBertModel":
             return NomicBertModel
+        if model_architecture in "GemmaForCausalLM":
+            return GemmaModel
         return Model
 
     def _is_model_safetensors(self) -> bool:
@@ -277,6 +279,8 @@ class Model:
             return gguf.MODEL_ARCH.BERT
         if arch == "NomicBertModel":
             return gguf.MODEL_ARCH.NOMIC_BERT
+        if arch in "GemmaForCausalLM":
+            return gguf.MODEL_ARCH.GEMMA
 
         raise NotImplementedError(f'Architecture "{arch}" not supported!')
 
@@ -1785,6 +1789,64 @@ class NomicBertModel(BertModel):
             yield name, data
 
 
+class GemmaModel(Model):
+    def set_vocab(self):
+        self._set_vocab_sentencepiece()
+
+    def set_gguf_parameters(self):
+        hparams = self.hparams
+        block_count = hparams["num_hidden_layers"]
+
+        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
+        self.gguf_writer.add_embedding_length(hparams["hidden_size"])
+        self.gguf_writer.add_block_count(block_count)
+        self.gguf_writer.add_feed_forward_length(hparams["intermediate_size"])
+        self.gguf_writer.add_head_count(hparams["num_attention_heads"])
+        self.gguf_writer.add_layer_norm_rms_eps(self.hparams["rms_norm_eps"])
+        self.gguf_writer.add_head_count_kv(self.hparams["num_key_value_heads"] if "num_key_value_heads" in hparams else hparams["num_attention_heads"])
+
+    def write_tensors(self):
+        block_count = self.hparams.get("n_layers", self.hparams.get("num_hidden_layers", self.hparams.get("n_layer")))
+        tensor_map = gguf.get_tensor_name_map(self.model_arch, block_count)
+
+        for name, data_torch in self.get_tensors():
+            # we don't need these
+            if name.endswith((".attention.masked_bias", ".attention.bias", ".attention.rotary_emb.inv_freq", ".attn.bias", ".attn.masked_bias")):
+                continue
+
+            # Huh? Why is this needed?
+            if name.endswith(("norm.weight")):
+                data_torch = data_torch + 1
+
+            old_dtype = data_torch.dtype
+
+            # convert any unsupported data types to float32
+            if data_torch.dtype not in (torch.float16, torch.float32):
+                data_torch = data_torch.to(torch.float32)
+
+            data = data_torch.squeeze().numpy()
+
+            # map tensor names
+            new_name = tensor_map.get_name(name, try_suffixes=(".weight", ".bias"))
+            if new_name is None:
+                print(f"Can not map tensor {name!r}")
+                sys.exit()
+
+            n_dims = len(data.shape)
+            data_dtype = data.dtype
+
+            data = data.astype(np.float32)
+
+            # if f16 desired, convert any float32 2-dim weight tensors to float16
+            if self.ftype == 1 and data_dtype == np.float32 and name.endswith(".weight") and n_dims == 2:
+                data = data.astype(np.float16)
+
+            print(f"{new_name}, n_dims = {n_dims}, {old_dtype} --> {data.dtype}")
+
+            self.gguf_writer.add_tensor(new_name, data)
+
+

Edit: Ah, there it is:

https://github.com/huggingface/transformers/blob/fc37f38915372c15992b540dfcbbe00a916d4fc6/src/transformers/models/gemma/modeling_gemma.py#L89
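
The linked line is Gemma's RMSNorm in Transformers, which stores the scale with a zero initialization and applies it as (1 + weight); roughly (a paraphrased sketch, not a verbatim copy of the Transformers source):

import torch
import torch.nn as nn

class GemmaStyleRMSNorm(nn.Module):
    # Paraphrased sketch: the scale parameter starts at zero and the module
    # multiplies by (1 + weight), so an equivalent "plain" RMSNorm weight is
    # obtained by adding 1 -- exactly the adjustment made in the patch above.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm * (1 + self.weight)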

Edit2: here is a PR with the conversion script:

#5647

@alfred-liu96

so fast!!!

@yiakwy-xpu-ml-framework-team commented Feb 22, 2024

Gemma model to GGUF?

Google has already provided a GGUF (float32) in the HF repo.

@yiakwy-xpu-ml-framework-team

@ggerganov could we add a check against the bfloat16 dtype?

@ggerganov (Owner)

What checks do you have in mind specifically?

@yiakwy-xpu-ml-framework-team

Google uses bfloat16 for inference, while llama.cpp does not.

But I personally believe there is not much degradation after converting from bf16 to float16. I guess that in the converter, anything that is not float16 currently gets converted to float32 (it would be even better to keep float16 where no overflow happens).

If we check whether tensor.dtype is bf16 and keep it as fp16, we get a 17 GB GGUF file instead of a 34 GB GGUF file.
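
A minimal sketch of the kind of check being suggested (a hypothetical converter snippet, not code from this PR; bf16 values outside the f16 range would overflow, hence the caveat above):

import torch

def downcast_for_gguf(data_torch: torch.Tensor) -> torch.Tensor:
    # Hypothetical: keep 2-byte storage for bf16 weights by casting to f16
    # instead of letting them fall through to the generic f32 path.
    if data_torch.dtype == torch.bfloat16:
        return data_torch.to(torch.float16)
    if data_torch.dtype not in (torch.float16, torch.float32):
        return data_torch.to(torch.float32)
    return data_torch

print(downcast_for_gguf(torch.ones(4, dtype=torch.bfloat16)).dtype)  # torch.float16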

@fmichaelobrien

Team, thank you for integrating Gemma support into llama.cpp yesterday; this was an extremely fast and efficient alignment with a model that had come out only a couple of hours earlier.
I personally am very grateful for your efforts.
A wider community thank-you is in order for #5631.

@cebtenzzre (Collaborator) commented Feb 22, 2024

Trying out the F32 ggml-7b-it.gguf provided by Google, I'm getting a perplexity of "nan" at 2048 context - around 20 for the first few chunks. Also around 25 for the first chunk at 8192 context. For reference, llama-2-7b Q4_0 perplexity is about 5.16 at 4096 context.

@postmasters Are you sure the implementation is correct?

@postmasters (Contributor, Author)

DANtm in #5635 (comment) suggested that setting repeat-penalty would give better inference outputs.

@slaren (Collaborator) commented Feb 22, 2024

With the gemma-7b.gguf base model (without instruction tuning), converted to f16, I get 6.5376 PPL at 2048 context and 6.2240 at 8192 context.

@cebtenzzre (Collaborator) commented Feb 23, 2024

With the gemma-7b.gguf base model (without instruction tuning), converted to f16, I get 6.5376 PPL at 2048 context and 6.2240 at 8192 context.

Weird. Here is what I just tried:

  1. Download safetensors model from https://huggingface.co/google/gemma-7b
  2. Checkout latest llama.cpp master (commit 15499eb)
  3. ./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
  4. cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
  5. make -C build perplexity
  6. Run perplexity on my Tesla P40:
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,

And there's no point in running it longer than that because the running average will stay NaN.

@slaren (Collaborator) commented Feb 23, 2024

I didn't convert from the HF model, I downloaded the fp32 gguf and converted it to fp16 with the quantize tool.

@ggerganov (Owner) commented Feb 23, 2024

It works with Metal and CPU using the model converted from the HF data:

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   896.00 MiB, (17182.56 / 147456.00)
llama_kv_cache_init:      Metal KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    11.02 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   506.00 MiB, (17688.56 / 147456.00)
llama_new_context_with_model:      Metal compute buffer size =   506.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     6.00 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 680.214 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 2.20 seconds per pass - ETA 5.20 minutes
[1]5.6440,[2]6.8762,[3]7.2915,[4]6.5856,[5]6.1074,^C

I haven't tried CUDA

@slaren (Collaborator) commented Feb 23, 2024

With the 7B it model, both fp32 and fp16, I get a PPL of around 20, but I didn't get any nan. Using CUDA with a 3090 Ti.

./perplexity -m models/gemma-7b-it.gguf -ngl 16 -f wikitext-2-raw/wiki.test.raw -c 2048 -b 2048

[1]18.7209,[2]24.9808,[3]27.5477,[4]24.3773,[5]23.0816,[6]18.8618,[7]17.1438,[8]16.7933,[9]17.6730,[10]17.7396,[11]17.5431,[12]18.4702,[13]18.4808,[14]18.8670,[15]19.0879,[16]19.8607,[17]19.9045,[18]20.0591,[19]19.5361,[20]19.5234,[21]19.4770,[22]19.6030,[23]19.7331,[24]19.7435,[25]20.3328,[26]20.3213,[27]21.0581,[28]21.2913,[29]21.3306,[30]21.3418,[31]21.2371,[32]20.8257,[33]20.9147,[34]20.7695,[35]20.3417,[36]19.8174,[37]19.3932,[38]19.1021,[39]18.6685,[40]18.3326,[41]18.4187,[42]18.7619,[43]19.1801,[44]19.2661,[45]19.5229,[46]19.7044,[47]19.8079,[48]19.8868,[49]19.7144,[50]19.8063,[51]19.6146,[52]19.4874,[53]19.3591,[54]19.1463,[55]19.0401,[56]18.7300,[57]18.7026,[58]18.6677,[59]18.7592,[60]18.9321,[61]19.0706,[62]19.3321,[63]19.3754,[64]19.1777,[65]19.1974,[66]19.0348,[67]18.9782,[68]18.9456,[69]18.7255,[70]18.6829,[71]18.9174,[72]19.0924,[73]19.0105,[74]19.0003,[75]19.0218,[76]19.0347,[77]19.0468,[78]19.0965,[79]19.2280,[80]19.1060,[81]19.0114,[82]18.8994,[83]18.8548,[84]18.8064,[85]18.6885,[86]18.6932,[87]18.7991,[88]18.8360,[89]18.9913,[90]19.2310,[91]19.3942,[92]19.4716,[93]19.6063,[94]19.7548,[95]19.8119,[96]19.8152,[97]19.8260,[98]19.8841,[99]19.8545,[100]19.8863,[101]19.9525,[102]20.0558,[103]20.1068,[104]20.1235,[105]20.1545,[106]20.1087,[107]20.1187,[108]20.1356,[109]20.0118,[110]19.9838,[111]19.9104,[112]19.9309,[113]19.9644,[114]20.0094,[115]19.9969,[116]19.9885,[117]19.9382,[118]19.9754,[119]19.9458,[120]19.8893,[121]19.8274,[122]19.8208,[123]19.8898,[124]19.8708,[125]19.8470,[126]19.7959,[127]19.7669,[128]19.8034,[129]19.7301,[130]19.7389,[131]19.7456,[132]19.8056,[133]19.8714,[134]19.7818,[135]19.5798,[136]19.6104,[137]19.6654,[138]19.7277,[139]19.7196,[140]19.7987,[141]19.7966,[142]19.8815,
Final estimate: PPL = 19.8815 +/- 0.21118

./perplexity -m models/gemma-7b-it-f16.gguf -ngl 99 -f wikitext-2-raw/wiki.test.raw -c 2048 -b 2048

[1]20.1795,[2]26.9640,[3]29.7014,[4]26.0533,[5]24.6819,[6]20.1910,[7]18.3000,[8]17.9538,[9]18.9682,[10]19.0602,[11]18.8272,[12]19.8369,[13]19.8639,[14]20.2951,[15]20.5834,[16]21.4171,[17]21.4557,[18]21.5971,[19]21.0191,[20]21.0192,[21]20.9496,[22]21.0968,[23]21.2142,[24]21.2150,[25]21.8738,[26]21.8679,[27]22.6807,[28]22.9258,[29]22.9705,[30]22.9568,[31]22.8573,[32]22.4253,[33]22.5072,[34]22.3343,[35]21.8594,[36]21.2822,[37]20.8135,[38]20.5065,[39]20.0285,[40]19.6789,[41]19.7705,[42]20.1286,[43]20.5842,[44]20.6894,[45]20.9686,[46]21.1656,[47]21.2816,[48]21.3709,[49]21.1856,[50]21.2821,[51]21.0609,[52]20.9323,[53]20.8012,[54]20.5745,[55]20.4565,[56]20.1180,[57]20.0956,[58]20.0513,[59]20.1503,[60]20.3293,[61]20.4832,[62]20.7717,[63]20.8173,[64]20.6016,[65]20.6253,[66]20.4452,[67]20.3804,[68]20.3380,[69]20.0944,[70]20.0470,[71]20.3014,[72]20.4836,[73]20.3948,[74]20.3817,[75]20.4047,[76]20.4281,[77]20.4366,[78]20.4864,[79]20.6221,[80]20.4927,[81]20.3864,[82]20.2693,[83]20.2190,[84]20.1664,[85]20.0367,[86]20.0338,[87]20.1434,[88]20.1821,[89]20.3557,[90]20.6163,[91]20.7865,[92]20.8807,[93]21.0238,[94]21.1747,[95]21.2370,[96]21.2429,[97]21.2521,[98]21.3145,[99]21.2806,[100]21.3140,[101]21.3920,[102]21.5062,[103]21.5592,[104]21.5844,[105]21.6126,[106]21.5651,[107]21.5795,[108]21.5964,[109]21.4639,[110]21.4302,[111]21.3496,[112]21.3736,[113]21.4072,[114]21.4540,[115]21.4443,[116]21.4348,[117]21.3798,[118]21.4199,[119]21.3902,[120]21.3283,[121]21.2562,[122]21.2507,[123]21.3275,[124]21.3050,[125]21.2840,[126]21.2304,[127]21.2018,[128]21.2447,[129]21.1617,[130]21.1693,[131]21.1789,[132]21.2450,[133]21.3154,[134]21.2170,[135]20.9955,[136]21.0295,[137]21.0900,[138]21.1598,[139]21.1552,[140]21.2385,[141]21.2370,[142]21.3300,
PPL = 21.3300 +/- 0.22891

@cebtenzzre (Collaborator) commented Feb 26, 2024

I didn't convert from the HF model, I downloaded the fp32 gguf and converted it to fp16 with the quantize tool.

Fun fact: This will leave you with a Q6_K output tensor unless you pass --pure. @ikawrakow This probably isn't intended, right?

6. Run perplexity on my Tesla P40:

$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,

And there's no point in running it longer than that because the running average will stay NaN.

I've discovered that these NaNs occur with -ngl 2 and above, but not with -ngl 1 or with --no-kv-offload. I can reproduce them on my P40 with either the FP16 converted from safetensors, or the FP16 quantized from Google's provided GGUF.

@slaren I wonder if you can reproduce if you build with -DLLAMA_CUDA_FORCE_MMQ=ON? That's effectively always enabled on my P40.

@slaren (Collaborator) commented Feb 27, 2024

I tried with LLAMA_CUDA_FORCE_MMQ. With the FP32 model the results are identical, which is expected since MMQ is only used with quants. With FP16 they are slightly different, probably due to the Q6_K output tensor, but still no nan.

$ ./perplexity -m models/gemma-7b-it-f16.gguf -ngl 160 -f wikitext-2-raw/wiki.test.raw -c 2048 -b 2048
main: build = 2276 (b11a93df)
main: built with cc (Ubuntu 12.3.0-9ubuntu2) 12.3.0 for x86_64-linux-gnu
main: seed  = 1708998783
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 254 tensors from models/gemma-7b-it-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 28
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv   9:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  10:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256128]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256128]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256128]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 1
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type  f16:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 15.04 GiB (15.13 BPW)
llm_load_print_meta: general.name     = gemma-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.19 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   615.54 MiB
llm_load_tensors:      CUDA0 buffer size = 15400.21 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    56.04 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2025.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.00 MiB
llama_new_context_with_model: graph splits (measure): 2

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 699.902 ms
perplexity: calculating perplexity over 142 chunks, batch_size=2048
perplexity: 0.74 seconds per pass - ETA 1.73 minutes
[1]19.2199,[2]25.5085,[3]28.2863,[4]24.9815,[5]23.6883,[6]19.3050,[7]17.5079,[8]17.1901,[9]18.1072,[10]18.1731,[11]17.9541,[12]18.8803,[13]18.8942,[14]19.3041,[15]19.5662,[16]20.3611,[17]20.3960,[18]20.5681,[19]20.0183,[20]20.0213,[21]19.9828,[22]20.1108,[23]20.2374,[24]20.2426,[25]20.8742,[26]20.8718,[27]21.6406,[28]21.8823,[29]21.9235,[30]21.9278,[31]21.8361,[32]21.4179,[33]21.5008,[34]21.3394,[35]20.8930,[36]20.3415,[37]19.8962,[38]19.5981,[39]19.1508,[40]18.8111,[41]18.8966,[42]19.2478,[43]19.6744,[44]19.7701,[45]20.0372,[46]20.2223,[47]20.3328,[48]20.4142,[49]20.2323,[50]20.3273,[51]20.1208,[52]19.9948,[53]19.8622,[54]19.6426,[55]19.5273,[56]19.2086,[57]19.1838,[58]19.1437,[59]19.2485,[60]19.4229,[61]19.5662,[62]19.8336,[63]19.8759,[64]19.6712,[65]19.6941,[66]19.5257,[67]19.4613,[68]19.4237,[69]19.1952,[70]19.1513,[71]19.3910,[72]19.5703,[73]19.4842,[74]19.4706,[75]19.4894,[76]19.5041,[77]19.5165,[78]19.5623,[79]19.6986,[80]19.5719,[81]19.4743,[82]19.3572,[83]19.3111,[84]19.2598,[85]19.1350,[86]19.1354,[87]19.2384,[88]19.2723,[89]19.4347,[90]19.6804,[91]19.8457,[92]19.9296,[93]20.0646,[94]20.2134,[95]20.2746,[96]20.2791,[97]20.2875,[98]20.3473,[99]20.3163,[100]20.3460,[101]20.4164,[102]20.5213,[103]20.5729,[104]20.5908,[105]20.6238,[106]20.5783,[107]20.5890,[108]20.6058,[109]20.4771,[110]20.4473,[111]20.3707,[112]20.3964,[113]20.4287,[114]20.4769,[115]20.4692,[116]20.4649,[117]20.4149,[118]20.4545,[119]20.4266,[120]20.3705,[121]20.3041,[122]20.2998,[123]20.3681,[124]20.3486,[125]20.3271,[126]20.2752,[127]20.2481,[128]20.2866,[129]20.2116,[130]20.2200,[131]20.2285,[132]20.2901,[133]20.3579,[134]20.2678,[135]20.0590,[136]20.0929,[137]20.1507,[138]20.2150,[139]20.2084,[140]20.2878,[141]20.2857,[142]20.3748,
Final estimate: PPL = 20.3748 +/- 0.21724
$ sha1sum models/gemma-7b*                                                                                                                                       
2cf27aa925ef6bb98255232f85f0df9a43278f4a  models/gemma-7b-f16.gguf
dffe52093bed13608d55387cbebbb3861bd072ff  models/gemma-7b-it-f16.gguf
0476921538163089c7564854a7d89417fcdc3b21  models/gemma-7b-it.gguf
5fc6c1bddc756971a56672fe26638eed9bc30a67  models/gemma-7b.gguf
