A pure Java implementation of OpenAI's gpt-oss inference in ~1000 lines of code optimized for CPU execution.
Inspired by llama.cpp and llama2.c, this repo ports the gpt-oss PyTorch model.py to efficient Java code, emphasizing minimalism, simplicity, and educational value.
- Pure Java - No native dependencies, runs anywhere with Java 23+
- Complete gpt-oss architecture - Full implementation of MoE transformer with GQA, sliding window attention, RoPE, and SwiGLU
- CPU inference - No GPU required, designed for consumer-grade commodity hardware on local machines or cloud compute instances
- Memory efficient - Run on 24GB+ RAM using memory-mapped weights via Java Foreign Memory API
- Performance optimized - Supports a KV cache and exploits modern JDK GC/JIT, parallel processing, the SIMD Vector API, and fused operations (see the sketch after this list)
- Educational - Clean, readable code for understanding LLM transformer internals
- Handy CLI - Interactive chat and single-shot generation modes
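As a taste of the Vector API usage mentioned above, here is a minimal sketch of a SIMD dot product with a fused multiply-add per lane. The class and method names are illustrative only and are not taken from this repo's source.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch: a SIMD dot product, the kind of kernel that backs
// matrix-vector multiplication on CPU. Requires --add-modules jdk.incubator.vector.
final class SimdExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);          // fused multiply-add per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {         // scalar tail for leftover elements
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```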
- Java 23+
- Minimum 24GB memory, ideally 48GB+
Download the gpt-oss model weights from Hugging Face Hub. Start with gpt-oss-20b as it runs efficiently on consumer CPU-based hardware.
pip install -U "huggingface_hub[cli]"
# gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
# gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir gpt-oss-120b/
The original gpt-oss model uses the safetensors format with MXFP4-quantized weights for the MoE experts and BF16 for the other tensors. Since Java and CPUs don't natively support MXFP4, I upcast all weights to a custom binary format with BF16 precision. This allows a simple, lightweight conversion from BF16 to in-memory FP32 during Java runtime inference, without dequantization overhead (see detailed reason).
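To illustrate why BF16 is convenient here: a BF16 value is just the upper 16 bits of an IEEE-754 FP32 value, so upcasting is a shift plus a bit reinterpretation rather than a real dequantization step. The helper below is a hypothetical sketch, not this repo's actual API.

```java
// Hypothetical sketch of BF16 <-> FP32 conversion.
final class Bf16 {
    // Convert one BF16 value (stored as a short) to FP32.
    static float toFloat(short bf16) {
        return Float.intBitsToFloat((bf16 & 0xFFFF) << 16);
    }

    // Convert FP32 back to BF16 by truncating the low 16 mantissa bits
    // (round-to-nearest-even is a common refinement, omitted here).
    static short fromFloat(float f) {
        return (short) (Float.floatToIntBits(f) >>> 16);
    }
}
```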
For conversion, install the Python dependencies and then use convert_model.py to convert the safetensors file to the custom binary format (e.g., a gpt-oss-20b.bin file).
pip install -r requirements.txt
python convert_model.py /path/to/gpt-oss-20b/original/model.safetensors /path/to/model-gpt-oss-20b.bin
Taking gpt-oss-20b as an example, the converted file is about 39GB. Inference performs best when the entire model fits in RAM: the memory-mapped MLP weights in the OS page cache and the other parameters in the JVM heap.
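For readers curious how memory-mapped weights avoid loading ~39GB onto the JVM heap, here is a rough sketch using the Java Foreign Function & Memory API; the class, method names, and layout are illustrative assumptions, not this repo's code.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: map a large weight file so the OS page cache serves
// weights on demand instead of copying them onto the JVM heap.
final class MappedWeights {
    static MemorySegment map(Path modelPath) throws IOException {
        try (FileChannel ch = FileChannel.open(modelPath, StandardOpenOption.READ)) {
            // The mapped segment stays valid for the lifetime of the arena,
            // even after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), Arena.ofShared());
        }
    }

    // Read one BF16 weight (stored as a short) at a given element index.
    static float weightAt(MemorySegment weights, long index) {
        short bf16 = weights.get(ValueLayout.JAVA_SHORT, index * Short.BYTES);
        return Float.intBitsToFloat((bf16 & 0xFFFF) << 16);
    }
}
```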
Build the project to generate the executable JAR at build/libs/gpt-oss-java-1.0.0-all.jar.
./gradlew build shadowJar
Note: you can download a JDK and configure the Java version using either method below:
- Create a gradle.properties file and add org.gradle.java.home=/path/to/jdk-23+
- Set the environment variable export JAVA_HOME=/path/to/jdk-23+
java --add-modules jdk.incubator.vector -jar build/libs/gpt-oss-java-1.0.0-all.jar /path/to/model-gpt-oss-20b.bin
Command Line Options:
Usage: java GPTOSSCli <model_path> [options]
Examples:
java --add-modules jdk.incubator.vector -jar gpt-oss-java-1.0.0-all.jar /path/to/model-gpt-oss-20b.bin -m generate -p "Hello world" -n 50
java --add-modules jdk.incubator.vector -jar gpt-oss-java-1.0.0-all.jar /path/to/model-gpt-oss-20b.bin -m chat -t 0.1
java --add-modules jdk.incubator.vector -jar gpt-oss-java-1.0.0-all.jar /path/to/model-gpt-oss-20b.bin -t 0.2 -n 32768 --multi-turn
Options:
-m <mode> Inference mode: 'generate' (single shot) | 'chat' (interactive multi-turn) [default: chat]
-p <prompt> Input prompt (required for generate mode)
-n <tokens> Maximum tokens to generate [default: 100]
-t <temperature> Sampling temperature (0 to inf) [default: 0.1]
-s <ids> Stop token IDs (comma-separated) [default: 0,199999,200002]
--debug Enable debug logging [default: false]
--multi-turn Enable multi-turn conversation (chat mode only) [default: false]
--model-size <size> gpt-oss model size: 20b or 120b [default: 20b]
Examples:
# Interactive chat (default if -m not set)
java --add-modules jdk.incubator.vector \
-jar gpt-oss-java-1.0.0-all.jar /path/to/model.bin \
-m chat
# Keeps conversation history in one session
java --add-modules jdk.incubator.vector \
-jar gpt-oss-java-1.0.0-all.jar /path/to/model.bin \
-m chat \
--multi-turn
# Single-shot generation with max of 100 tokens and temperature of 0.2
java --add-modules jdk.incubator.vector \
-jar gpt-oss-java-1.0.0-all.jar /path/to/model.bin \
-m generate \
-t 0.2 \
-p "Why do people use umbrellas when it rains?" \
-n 100
# Override stop IDs (default: 0,199999,200002)
java --add-modules jdk.incubator.vector \
-jar gpt-oss-java-1.0.0-all.jar /path/to/model.bin \
-m generate \
-p "Write a short story about AI." \
-s 3392
# Debug logging to show performance metrics
java --add-modules jdk.incubator.vector \
-jar gpt-oss-java-1.0.0-all.jar /path/to/model.bin \
-m generate \
-p "Explain TCP vs UDP" \
--debug
# Switch to 120B
java --add-modules jdk.incubator.vector \
-jar gpt-oss-java-1.0.0-all.jar /path/to/model-120b.bin \
--model-size 120b
Control the parallelism of matrix multiplication and scaled dot-product attention by setting the ForkJoinPool size. By default, it uses all available vCPU cores.
# Use 16 threads
java -Djava.util.concurrent.ForkJoinPool.common.parallelism=16 --add-modules jdk.incubator.vector \
-jar gpt-oss-java-1.0.0-all.jar /path/to/model.bin \
-m chat
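The reason this flag works: parallel streams and ForkJoinTask-based code run on the common ForkJoinPool, so the property above caps the worker threads available to row-parallel kernels like the one sketched below. The class and method names are hypothetical, not taken from this repo.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: rows of a matrix-vector multiply are independent,
// so they split cleanly across the common ForkJoinPool's workers.
final class ParallelMatVec {
    // out[r] = dot(weights row r, x)
    static void matVec(float[][] weights, float[] x, float[] out) {
        IntStream.range(0, weights.length).parallel().forEach(r -> {
            float sum = 0f;
            float[] row = weights[r];
            for (int c = 0; c < x.length; c++) {
                sum += row[c] * x[c];
            }
            out[r] = sum;
        });
    }
}
```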
You can specify -Xmx in the JVM options; normally less than 16GB of JVM heap memory is required.
The memory-mapped MLP weights require an additional ~30GB of system memory.
The KV cache allocation scales with the max-tokens -n CLI parameter, with a default lower bound of 4096 tokens.
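As a rough, back-of-the-envelope illustration of how the KV cache scales with -n, the sketch below multiplies out layers × KV heads × head dimension × tokens. The layer and head counts are assumptions for a gpt-oss-20b-like configuration, not values read from this repo; plug in the real model config.

```java
// Hypothetical KV cache sizing sketch: memory grows linearly with max tokens.
final class KvCacheSizing {
    public static void main(String[] args) {
        int layers = 24;        // assumed number of transformer layers
        int kvHeads = 8;        // assumed number of KV heads (GQA)
        int headDim = 64;       // assumed head dimension
        int maxTokens = 4096;   // lower bound used when -n is smaller
        int bytesPerValue = 4;  // assuming FP32 cache entries

        long bytes = 2L /* K and V */ * layers * kvHeads * headDim
                   * (long) maxTokens * bytesPerValue;
        System.out.printf("KV cache @ %d tokens: %.2f GiB%n",
                maxTokens, bytes / (1024.0 * 1024 * 1024));
    }
}
```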
This Java implementation delivers the following CPU inference performance on gpt-oss-20b:
- Apple M3 Pro (12 cores, 36GB RAM):
- Decode: avg ~3.3 tokens/sec
- Prefill: avg ~2.2 tokens/sec
- AWS EC2 m5.4xlarge (Intel Xeon Platinum 8175M, 8 physical cores, 16 vCPUs, 64GB RAM):
- Decode: avg ~7.0 tokens/sec
- Prefill: avg ~10.4 tokens/sec
For detailed benchmark results, hardware, and performance data, see benchmark/README.md.
This project is licensed under the Apache-2.0 License.