Conversation

@philip-essential (Contributor) commented Dec 6, 2025

This adds support for Rnj-1, which is an 8B model we just released. We've been using llama.cpp to play around with the model internally, and we released a GGUF checkpoint for the instruction-tuned version.

The model architecture is similar enough to Gemma3 that in Transformers/VLLM/SGLang we can reuse the same model file. However, in llama.cpp we need some small changes, so I've added a new implementation, based closely on the Gemma3 one. The changes are:

  • All layers use global attention.
  • Long-context is via YaRN.
  • (edited to add:) Uses final_logit_softcapping

Because our huggingface config.json uses "Gemma3ForCausalLM" as the architecture, convert_hf_to_gguf.py is unable to tell that these configs are for Rnj-1. The solution I came up with is to manually change the architecture to Rnj1ForCausalLM before converting the checkpoint. I added a note in convert_hf_to_gguf.py about this. But perhaps there's a better solution?
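
For reference, a minimal sketch of that manual step, assuming a local copy of the Hugging Face checkpoint (the directory name below is illustrative):

```python
# Sketch: patch config.json so convert_hf_to_gguf.py takes the Rnj-1 code path.
# "architectures" is the standard HF config key; the checkpoint path is an example.
import json
from pathlib import Path

config_path = Path("rnj-1-instruct") / "config.json"  # illustrative local checkpoint dir

config = json.loads(config_path.read_text())
config["architectures"] = [
    "Rnj1ForCausalLM" if arch == "Gemma3ForCausalLM" else arch
    for arch in config.get("architectures", [])
]
config_path.write_text(json.dumps(config, indent=2) + "\n")
print("architectures is now:", config["architectures"])
```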

@CISC (Collaborator) commented Dec 6, 2025

> Because our huggingface config.json uses "Gemma3ForCausalLM" as the architecture, convert_hf_to_gguf.py is unable to tell that these configs are for Rnj-1. The solution I came up with is to manually change the architecture to Rnj1ForCausalLM before converting the checkpoint. I added a note in convert_hf_to_gguf.py about this. But perhaps there's a better solution?

Instead change llm_build_gemma3_iswa into a templated llm_build_gemma3, like f.ex. smallthinker and add support for YaRN and non-SWA in Gemma3Model conversion.
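
For context, the YaRN half of that conversion change could look roughly like the standalone sketch below; the output key names and the example values are assumptions for illustration, not the converter's actual GGUF keys.

```python
# Sketch: map an HF-style rope_scaling block to YaRN metadata.
# Output key names are illustrative, not the real GGUF keys.
from typing import Any

def yarn_metadata(hparams: dict[str, Any]) -> dict[str, Any]:
    meta: dict[str, Any] = {}
    rope_scaling = hparams.get("rope_scaling") or {}
    if rope_scaling.get("rope_type", rope_scaling.get("type")) == "yarn":
        meta["rope.scaling.type"] = "yarn"
        meta["rope.scaling.factor"] = rope_scaling["factor"]
        meta["rope.scaling.original_context_length"] = rope_scaling[
            "original_max_position_embeddings"
        ]
    return meta

# Example values only; Rnj-1's actual YaRN parameters live in its config.json.
print(yarn_metadata({"rope_scaling": {"rope_type": "yarn",
                                      "factor": 4.0,
                                      "original_max_position_embeddings": 8192}}))
```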

@faisal-fida commented Dec 7, 2025

@philip-essential Just following up on PR #17811 (Rnj-1 support).

Currently hitting an error, `unknown model architecture: 'rnj1'`, when trying to load the GGUF. Any chance we can prioritize merging this so the community can use Rnj-1?

@sirmo commented Dec 7, 2025

I tested the current state of this PR and it works pretty well with the published GGUF Q4 quants. The model follows OpenCode (TUI coding agent) instructions well in my brief testing. Neat model!

Though the 32K context size is a bit limiting for local coding agents, this might be a great agentic model for efficient execution. Thank you for all your work!

Hardware tested on: 7900xtx with the ROCm backend.

@philip-essential (Contributor, Author)

> Instead change llm_build_gemma3_iswa into a templated llm_build_gemma3, like f.ex. smallthinker and add support for YaRN and non-SWA in Gemma3Model conversion.

That makes sense. I can try and do that soon.

@philip-essential (Contributor, Author)

Q: There are a few GGUF quantizations of Rnj-1 out now that use rnj1 for the architecture. We can update the one in our official repo, but there are some created by 3rd parties. Do you prefer that we add rnj1 as an alias of gemma3, or only use gemma3? Since rnj1 was never on master, I would guess the latter.

@CISC (Collaborator) commented Dec 8, 2025

> Q: There are a few GGUF quantizations of Rnj-1 out now that use rnj1 for the architecture. We can update the one in our official repo, but there are some created by 3rd parties. Do you prefer that we add rnj1 as an alias of gemma3, or only use gemma3? Since rnj1 was never on master, I would guess the latter.

Yes, the latter please. It can't be helped that GGUFs pop up pre-merge that end up being incompatible; if they are not updated post-merge, so be it.

@philip-essential (Contributor, Author)

I just pushed a commit that refactors it as you describe. While doing this I also noticed a bug: Rnj-1 should use final_logit_softcapping, but it wasn't being applied. I fixed this as well, and I'm currently pushing updated weights to our HF repo.
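
For reference, final logit softcapping squashes the output logits with a scaled tanh. A tiny sketch, using an example cap value rather than Rnj-1's actual setting:

```python
# Sketch: final logit softcapping maps logits smoothly into (-cap, +cap).
import math

def softcap(logits: list[float], cap: float) -> list[float]:
    return [cap * math.tanh(x / cap) for x in logits]

# cap=30.0 is just an example; the real value comes from final_logit_softcapping.
print(softcap([-50.0, 0.0, 12.5, 80.0], cap=30.0))
```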

If you try to run Rnj-1 without this PR (exactly as though it were gemma3), it does start and gives superficially coherent responses at short context. As expected, at long context it breaks down. I tested this by attaching this paper and asking the model who wrote it; with this PR the responses are coherent, and without it they are not.

@philip-essential (Contributor, Author)

Thanks, I applied those changes and reconverted the checkpoint to get the metadata differences. It seems to run as expected.

@CISC (Collaborator) commented Dec 9, 2025

> Thanks, I applied those changes and reconverted the checkpoint to get the metadata differences. It seems to run as expected.

Just beware your previous one will run as SWA. :)

@philip-essential (Contributor, Author)

Ah yes, I missed that. The new checkpoint conversion ensures that sliding_window is not present at all in the metadata if sliding_window_pattern==1, and we use that to determine whether to run with sliding window. I can see in the logs of llama-server that sliding_window is no longer present, and I just ran a 28k-token example that seems correct.
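
As a rough illustration of that check, here is a standalone sketch; the metadata key name is illustrative, not the exact one used by the converter or loader.

```python
# Sketch of both halves of the logic described above.
from typing import Any

def converted_metadata(hparams: dict[str, Any]) -> dict[str, Any]:
    """Conversion side: omit sliding_window entirely when every layer is global."""
    meta: dict[str, Any] = {}
    if hparams.get("sliding_window_pattern", 1) != 1:
        meta["attention.sliding_window"] = hparams["sliding_window"]
    return meta

def uses_swa(meta: dict[str, Any]) -> bool:
    """Load side: enable sliding-window attention only if the key is present."""
    return "attention.sliding_window" in meta

# Rnj-1-style config: every layer global, so no sliding_window key is written.
print(uses_swa(converted_metadata({"sliding_window": 512, "sliding_window_pattern": 1})))  # False
# Gemma3-style config: interleaved local layers keep the key.
print(uses_swa(converted_metadata({"sliding_window": 512, "sliding_window_pattern": 6})))  # True
```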

@CISC (Collaborator) commented Dec 9, 2025

> Ah yes, I missed that. The new checkpoint conversion ensures that sliding_window is not present at all in the metadata if sliding_window_pattern==1, and we use that to determine whether to run with sliding window. I can see in the logs of llama-server that sliding_window is no longer present, and I just ran a 28k-token example that seems correct.

Yep, just be sure to update the GGUFs on HF; will merge in a bit.

@philip-essential (Contributor, Author)

Yes, this commit is the one that I tested; it's deployed to HF.

@CISC merged commit 1d2a1ab into ggml-org:master on Dec 9, 2025. 82 checks passed.