Try it live: https://dcarrascosa.github.io/TOPS_Calculator/ (no install needed, auto-deployed from main on every merge).
Interactive calculator that estimates how many TOPS (Tera Operations Per Second) and how much memory bandwidth you need to run local language models — Llama 3 & 4, Mistral, Qwen 2.5 & 3, Gemma 2 & 3, DeepSeek V3, Phi-3 & 4 — at different quantization levels (FP16, 8-bit, 4-bit, 3-bit, 2-bit) and target speeds, across Apple Silicon, NVIDIA, AMD and Copilot+ PC hardware.
Originally built to answer the question: are 38–50 TOPS enough for professional use on a Mac or a new Copilot+ PC? (spoiler: the answer depends more on memory bandwidth than on TOPS, and the calculator explains why).
flowchart LR
User([User opens TOPS Calculator]) --> SelectModel[Select model preset or HF URL or custom params]
SelectModel --> SelectHardware[Select hardware preset or custom TOPS + bandwidth]
SelectHardware --> SetQuantSpeed[Set quantization and target tokens per second]
SetQuantSpeed --> Compute[Calculator computes required TOPS and bandwidth]
Compute --> Verdict[Show verdict good / warn / bad and bandwidth ceiling]
Verdict --> Chart[Show comparison chart of which models fit selected hardware]
Verdict --> Share[Optional: share URL / copy markdown / download .md]
Pick a model, a quantization, a target speed and your hardware. The calculator gives you:
- Required raw and effective TOPS to hit that speed.
- Memory needed to hold the weights + KV cache (grows with context length).
- A verdict combining compute and memory bandwidth, classified good/warn/bad.
- A reminder that for batch-1 inference, the bottleneck is almost always memory bandwidth, not compute, so it also estimates the bandwidth ceiling for tokens/sec.
- A comparison chart showing which curated models fit on your selected hardware at your chosen quantization.
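As a rough illustration of the memory estimate, the sketch below adds up weights and KV cache for one configuration. The helper names are hypothetical and the calculator's own code may differ; it also ignores quantization overhead (scales, zero points) and assumes an FP16 KV cache. The layer and head counts are example values for a Llama-3-8B-like shape.

```javascript
// Hypothetical helpers, not the calculator's actual functions.
const GIB = 1024 ** 3;

// Bytes needed for the weights at a given quantization width.
function weightBytes(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8;
}

// Bytes needed for the KV cache: one key vector and one value vector
// per layer, per KV head, per token held in the context.
function kvCacheBytes({ layers, kvHeads, headDim, contextLength, bytesPerElement = 2 }) {
  return 2 * layers * kvHeads * headDim * contextLength * bytesPerElement;
}

// Example: 8B parameters at 4-bit, 32 layers, 8 KV heads, head dimension 128.
const shape = { layers: 32, kvHeads: 8, headDim: 128 };
const weights = weightBytes(8e9, 4);
const kv8k = kvCacheBytes({ ...shape, contextLength: 8192 });
const kv32k = kvCacheBytes({ ...shape, contextLength: 32768 });

console.log(`weights:         ${(weights / GIB).toFixed(2)} GiB`);
console.log(`KV cache @ 8k:   ${(kv8k / GIB).toFixed(2)} GiB`);
console.log(`KV cache @ 32k:  ${(kv32k / GIB).toFixed(2)} GiB`);
```

The KV cache scales linearly with context length, which is why a model that fits comfortably at 8k tokens can stop fitting at 32k.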
Built-in presets covering Llama 3.2 / 3.1 / 3 / 3.3 / 4 (Scout & Maverick MoE), Mistral / Mixtral / Nemo / Small, Qwen 2.5 / 3 (including 235B MoE), Gemma 2 / 3, DeepSeek V3 (671B MoE), Phi-3 / 4. Plus:
- Hugging Face URL import: paste any model URL and the calculator pulls architecture and parameter count straight from config.json + safetensors metadata. Share links round-trip via ?hf=org/repo.
- Gated-repo fallback: for meta-llama/*, google/gemma-*, mistralai/* and other gated repos, paste the model's config.json into the disclosure. No token, no network call.
- Custom: define a fully custom model by parameter count.
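For readers curious how a pasted URL can be reduced to an org/repo identifier, here is a minimal sketch. The function names are hypothetical and this is not necessarily how the calculator does it. It only handles the common https://huggingface.co/org/repo form, so repos without an org prefix, or dataset and space URLs, would need extra handling.

```javascript
// Hypothetical helper: extract "org/repo" from a Hugging Face model URL.
function parseHfRepo(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return null; // not a parseable URL
  }
  if (url.hostname !== "huggingface.co") return null;
  // Path looks like /org/repo or /org/repo/tree/main/...
  const [org, repo] = url.pathname.split("/").filter(Boolean);
  return org && repo ? `${org}/${repo}` : null;
}

// For public (non-gated) repos, config.json is served from this path.
function configUrl(repoId) {
  return `https://huggingface.co/${repoId}/resolve/main/config.json`;
}

const repoId = parseHfRepo("https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main");
console.log(repoId);            // Qwen/Qwen2.5-7B-Instruct
console.log(configUrl(repoId)); // https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/resolve/main/config.json
```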
Built-in presets across:
- Apple Silicon: M1 / M2 / M3 / M4 family (Pro, Max, Ultra).
- NVIDIA GeForce: RTX 3060 through RTX 5090.
- NVIDIA Datacenter: A100, H100.
- AMD Radeon: RX 7900 XT / XTX.
- Copilot+ PC NPUs: Snapdragon X Elite, Intel Core Ultra 200V (Lunar), AMD Ryzen AI (Strix, Strix Halo).
Or punch in custom TOPS + memory bandwidth.
- Full configuration encoded in the URL — share link button copies it to the clipboard.
- Copy as markdown or download .md for write-ups and slides.
- EN / ES language toggle (community translations welcome; see CONTRIBUTING).
- Auto / Light / Dark theme, persisted in localStorage.
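Encoding the configuration in the URL can be done with the standard URL and URLSearchParams APIs. The sketch below shows one possible round trip. The parameter names (model, quant, tps, hw) and the base URL are placeholders for illustration, not the calculator's real query keys.

```javascript
// Hypothetical query keys; the real share links may use different names.
function encodeConfig(baseUrl, config) {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(config)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Read the query string back into a plain object (all values are strings).
function decodeConfig(link) {
  return Object.fromEntries(new URL(link).searchParams);
}

const link = encodeConfig("https://example.com/calc/", {
  model: "llama-3.1-8b",
  quant: 4,
  tps: 20,
  hw: "m4-pro",
});
console.log(link);
console.log(decodeConfig(link));
```

Values come back as strings, so numeric fields such as quant and tps need to be parsed again after decoding.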
Easiest: open the live demo. Nothing to install.
To run it locally — it's pure HTML + CSS + JS. No build step. Just open it.
- Clone the repo.
- Double-click index.html.
Or run the bundled dev server (cross-platform):
bun install # only the first time
bun run serve
# then open http://localhost:8000
You need Bun installed. The project uses Bun as its runtime, package manager and test runner. npm is not supported.
Unit tests for the calculator math (bun:test):
bun test
End-to-end tests with Playwright (drives a real browser against the live page):
bun install # only the first time
bunx playwright install chromium # only the first time, downloads the browser
bun run test:e2e
Run both in one go:
bun run test:all
For each generated token, an LLM does roughly 2 × N operations, where N is the parameter count. So:
Required TOPS = (2 × parameters × tokens_per_second) / 10^12
That's the theoretical floor. Real hardware rarely hits more than 20–40% of its peak TOPS for LLM inference, so the calculator multiplies by an efficiency factor (configurable) to give a realistic number.
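A worked example of the formula, as a sketch. The function names are hypothetical, and it assumes the efficiency factor is expressed as the fraction of peak TOPS actually achieved (0 to 1), so the raw figure is divided by it. The calculator's own convention may differ.

```javascript
// Raw requirement: about 2 operations per parameter per generated token.
function requiredRawTops(paramCount, tokensPerSecond) {
  return (2 * paramCount * tokensPerSecond) / 1e12;
}

// Assumption: efficiency is the achieved fraction of peak TOPS (0 to 1),
// so the peak figure to look for on a spec sheet is raw / efficiency.
function requiredPeakTops(paramCount, tokensPerSecond, efficiency = 0.3) {
  return requiredRawTops(paramCount, tokensPerSecond) / efficiency;
}

// An 8B-parameter model at 20 tokens/sec:
console.log(requiredRawTops(8e9, 20));        // 0.32
console.log(requiredPeakTops(8e9, 20, 0.25)); // 1.28
```

Even at 25% efficiency, an 8B model at 20 tokens/sec needs only about 1.3 TOPS of peak compute, which hints at why compute is rarely the limiting factor.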
It also computes the memory-bandwidth ceiling:
Max tokens/sec ≈ memory_bandwidth / model_size_in_memory
For batch=1 inference (the common case on a laptop), this is almost always the real bottleneck.
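To see why, here is the ceiling computed for two model sizes. This is a sketch with a hypothetical function name. It counts only the weights as model size and uses illustrative bandwidth figures, not measurements of any particular chip.

```javascript
// Each generated token has to stream the whole model from memory once,
// so bandwidth divided by model size bounds the token rate.
function maxTokensPerSecond(bandwidthGBps, modelSizeGB) {
  return bandwidthGBps / modelSizeGB;
}

// Weights only: parameters * bits / 8, expressed in GB (10^9 bytes).
function weightsGB(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8 / 1e9;
}

const small = weightsGB(8e9, 4);  // 4 GB
const large = weightsGB(70e9, 4); // 35 GB

console.log(maxTokensPerSecond(120, small).toFixed(1));  // 30.0 tokens/sec at 120 GB/s
console.log(maxTokensPerSecond(120, large).toFixed(1));  // 3.4 tokens/sec at 120 GB/s
console.log(maxTokensPerSecond(1000, large).toFixed(1)); // 28.6 tokens/sec at 1000 GB/s
```

The 70B case at 120 GB/s tops out near 3 tokens/sec regardless of how many TOPS the chip advertises.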
- These are estimates, not benchmarks. Actual performance depends on the framework (llama.cpp, MLX, Ollama, vLLM…), the kernel implementations, thermals, and a dozen other things.
- Vendor-published TOPS numbers measure different things. Apple's headline TOPS (38 on M4) is the Neural Engine, but most LLM runtimes use the GPU (via Metal / MLX). NVIDIA quotes Tensor Core TOPS at INT8 (sparsity off in this calc), AMD at FP16, Copilot+ PC NPUs at INT8. The math is generic; pick the chip preset that matches the runtime path you actually use.
- MoE models (Mixtral, etc.) are tricky — the calculator uses active parameters per token, not total.
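The MoE point can be made concrete with Mixtral 8x7B, which has roughly 46.7B total parameters but activates about 12.9B per token. The sketch below is illustrative only. It assumes compute follows active parameters (as stated above) while the weights held in memory follow total parameters, since every expert must be resident. The second half is an assumption about how one would size memory, not a description of the calculator's code.

```javascript
// Approximate published figures for Mixtral 8x7B.
const mixtral = { totalParams: 46.7e9, activeParams: 12.9e9 };
const tokensPerSecond = 20;

// Compute per token only touches the active experts.
const rawTopsActive = (2 * mixtral.activeParams * tokensPerSecond) / 1e12;

// What a dense-model estimate would (wrongly) report for the same model.
const rawTopsIfDense = (2 * mixtral.totalParams * tokensPerSecond) / 1e12;

// Assumption: all experts stay in memory, so weights use the total count.
const weightsGBAt4Bit = (mixtral.totalParams * 4) / 8 / 1e9;

console.log(`raw TOPS (active params): ${rawTopsActive.toFixed(3)}`);
console.log(`raw TOPS (if dense):      ${rawTopsIfDense.toFixed(3)}`);
console.log(`weights at 4-bit:         ${weightsGBAt4Bit.toFixed(2)} GB`);
```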
MIT. Use it however you like.
See CONTRIBUTING.md. PRs welcome.
Ideated, designed and implemented by David Carrascosa Bolaños.
Built with the support of AI-assisted coding tools.
