dcarrascosa/TOPS_Calculator

LLM TOPS Calculator

Built with HTML5, CSS3, JavaScript, Bun, Playwright and GitHub Actions.

Try it live: https://dcarrascosa.github.io/TOPS_Calculator/ — no install needed, auto-deployed from main on every merge.

Interactive calculator that estimates how many TOPS (Tera Operations Per Second) and how much memory bandwidth you need to run local language models — Llama 3 & 4, Mistral, Qwen 2.5 & 3, Gemma 2 & 3, DeepSeek V3, Phi-3 & 4 — at different quantization levels (FP16, 8-bit, 4-bit, 3-bit, 2-bit) and target speeds, across Apple Silicon, NVIDIA, AMD and Copilot+ PC hardware.

Originally built to answer the question: are 38–50 TOPS enough for professional use on a Mac or a new Copilot+ PC? (spoiler: the answer depends more on memory bandwidth than on TOPS, and the calculator explains why).

Figure: TOPS aren't the whole story. Chips in the same 38–50 TOPS class reach very different tokens/sec because of memory bandwidth.

What it does

flowchart LR
  User([User opens TOPS Calculator]) --> SelectModel[Select model preset or HF URL or custom params]
  SelectModel --> SelectHardware[Select hardware preset or custom TOPS + bandwidth]
  SelectHardware --> SetQuantSpeed[Set quantization and target tokens per second]
  SetQuantSpeed --> Compute[Calculator computes required TOPS and bandwidth]
  Compute --> Verdict[Show verdict good / warn / bad and bandwidth ceiling]
  Verdict --> Chart[Show comparison chart of which models fit selected hardware]
  Verdict --> Share[Optional: share URL / copy markdown / download .md]

Pick a model, a quantization, a target speed and your hardware. The calculator gives you:

  • Required raw and effective TOPS to hit that speed.
  • Memory needed to hold the weights + KV cache (grows with context length).
  • A verdict combining compute and memory bandwidth, classified good / warn / bad.
  • A reminder that for batch-1 inference, the bottleneck is almost always memory bandwidth, not compute — so it also estimates the bandwidth ceiling for tokens/sec.
  • A comparison chart showing which curated models fit on your selected hardware at your chosen quantization.
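As a rough illustration of the memory estimate above, here is a minimal sketch. The function names are hypothetical and the calculator's own code may differ. It assumes weights take exactly `bitsPerWeight` bits each (real quantization formats add some overhead for scales) and uses the standard KV cache size for a transformer: one key and one value vector per layer, per KV head, per token.

```javascript
// Hypothetical helper names; the calculator's real implementation may differ.

// Weights: one value per parameter at the chosen bit width.
function weightBytes(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8;
}

// KV cache: a key and a value vector per layer, per KV head, per token.
// bytesPerElement = 2 assumes an FP16 cache.
function kvCacheBytes({ layers, kvHeads, headDim, contextLength, bytesPerElement = 2 }) {
  return 2 * layers * kvHeads * headDim * contextLength * bytesPerElement;
}

// Example: an 8B-parameter model at 4-bit with a Llama-3-8B-like shape
// (32 layers, 8 KV heads, head dimension 128) and an 8192-token context.
const weights = weightBytes(8e9, 4);
const kv = kvCacheBytes({ layers: 32, kvHeads: 8, headDim: 128, contextLength: 8192 });
const totalGiB = (weights + kv) / 2 ** 30;

console.log(totalGiB.toFixed(2)); // prints 4.73
```

In this example the KV cache comes to exactly 1 GiB. It scales linearly with `contextLength`, so doubling the context doubles the cache while the weights stay fixed.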

Models

Built-in presets covering Llama 3 / 3.1 / 3.2 / 3.3 / 4 (Scout & Maverick MoE), Mistral / Mixtral / Nemo / Small, Qwen 2.5 / 3 (including 235B MoE), Gemma 2 / 3, DeepSeek V3 (671B MoE), Phi-3 / 4. Plus:

  • Hugging Face URL import — paste any model URL and the calculator pulls architecture and parameter count straight from config.json + safetensors metadata. Share links round-trip via ?hf=org/repo.
  • Gated-repo fallback — for meta-llama/*, google/gemma-*, mistralai/* and other gated repos, paste the model's config.json into the disclosure. No token, no network call.
  • Custom — define a fully custom model by parameter count.
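The import path can be sketched roughly as follows. This is not the calculator's actual code: the helper names are hypothetical, the `resolve/main` URL assumes the repo's default branch is `main`, and the config keys are the ones Llama-style `config.json` files commonly use (other architectures may name them differently). The same `readArchitecture` step would apply to a pasted `config.json` from a gated repo, with no network call.

```javascript
// Hypothetical helpers; names and error handling are illustrative only.

// Accepts "https://huggingface.co/org/repo[/...]" or a bare "org/repo".
function parseRepoId(input) {
  const trimmed = input.trim();
  const path = trimmed.startsWith("http") ? new URL(trimmed).pathname : trimmed;
  const [org, repo] = path.split("/").filter(Boolean);
  if (!org || !repo) throw new Error(`Not a model reference: ${input}`);
  return `${org}/${repo}`;
}

// URL of the raw config.json, assuming the default branch is "main".
function configUrl(repoId) {
  return `https://huggingface.co/${repoId}/resolve/main/config.json`;
}

// Pull the fields a memory/compute estimate needs out of a parsed config.json.
function readArchitecture(config) {
  const heads = config.num_attention_heads;
  return {
    layers: config.num_hidden_layers,
    // A missing KV-head count is treated as plain multi-head attention.
    kvHeads: config.num_key_value_heads ?? heads,
    headDim: config.head_dim ?? config.hidden_size / heads,
  };
}

// Gated-repo style usage: parse pasted JSON instead of fetching it.
const pasted = '{"num_hidden_layers":32,"num_attention_heads":32,"num_key_value_heads":8,"hidden_size":4096}';
const arch = readArchitecture(JSON.parse(pasted));
console.log(parseRepoId("https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main"), arch);
```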

Hardware

Built-in presets across:

  • Apple Silicon: M1 / M2 / M3 / M4 family (Pro, Max, Ultra).
  • NVIDIA GeForce: RTX 3060 through RTX 5090.
  • NVIDIA Datacenter: A100, H100.
  • AMD Radeon: RX 7900 XT / XTX.
  • Copilot+ PC NPUs: Snapdragon X Elite, Intel Core Ultra 200V (Lunar), AMD Ryzen AI (Strix, Strix Halo).

Or punch in custom TOPS + memory bandwidth.

Sharing and UX

  • Full configuration encoded in the URL — share link button copies it to the clipboard.
  • Copy as markdown or download .md for write-ups and slides.
  • EN / ES language toggle (community translations welcome — see CONTRIBUTING).
  • Auto / Light / Dark theme, persisted in localStorage.
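One plausible way to round-trip a configuration through the URL is the standard `URLSearchParams` API. The sketch below is an assumption about the approach, not the calculator's code, and every parameter name except `hf` is made up for illustration.

```javascript
// Parameter names other than `hf` are hypothetical.

// Serialize a flat config object into a query string, skipping unset values.
function encodeState(state) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(state)) {
    if (value !== undefined && value !== null) params.set(key, String(value));
  }
  return params.toString();
}

// Parse a query string back into a plain object. All values come back as strings.
function decodeState(query) {
  return Object.fromEntries(new URLSearchParams(query));
}

const query = encodeState({ hf: "Qwen/Qwen2.5-7B-Instruct", quant: 4, tps: 20 });
const restored = decodeState(query);
console.log(query, restored);
```

Note that decoded values are strings, so numeric fields such as the hypothetical `quant` and `tps` need converting with `Number(...)` before use.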

How to run it

Easiest: open the live demo. Nothing to install.

To run it locally: it's pure HTML + CSS + JS with no build step, so you can just open it.

  1. Clone the repo.
  2. Double-click index.html.

Or run the bundled dev server (cross-platform):

bun install   # only the first time
bun run serve
# then open http://localhost:8000

You need Bun installed. The project uses Bun as its runtime, package manager and test runner. npm is not supported.

Tests

Unit tests for the calculator math (bun:test):

bun test

End-to-end tests with Playwright (drives a real browser against the live page):

bun install                              # only the first time
bunx playwright install chromium         # only the first time, downloads the browser
bun run test:e2e

Run both in one go:

bun run test:all

The math (short version)

For each generated token, an LLM does roughly 2 × N operations, where N is the parameter count. So:

Required TOPS = (2 × parameters × tokens_per_second) / 10^12

That's the theoretical floor. Real hardware rarely hits more than 20–40% of its peak TOPS for LLM inference, so the calculator scales the raw figure by a configurable efficiency factor to give a realistic number. At 25% efficiency, for example, you need four times the raw TOPS.
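The formula can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the efficiency factor is the fraction of peak TOPS actually achieved (so the raw figure is divided by it); the calculator may define or default it differently.

```javascript
// Hypothetical function; the 0.3 default efficiency is an assumption.
function requiredTops(paramCount, tokensPerSecond, efficiency = 0.3) {
  // Roughly 2 operations per parameter per generated token.
  const raw = (2 * paramCount * tokensPerSecond) / 1e12;
  // Peak TOPS the hardware must advertise to deliver `raw` in practice.
  return { raw, effective: raw / efficiency };
}

// An 8B-parameter model at 20 tokens/sec, assuming 25% efficiency.
const { raw, effective } = requiredTops(8e9, 20, 0.25);
console.log(raw.toFixed(2), effective.toFixed(2)); // prints 0.32 1.28
```

Even with a pessimistic efficiency, the compute requirement here is far below the 38–50 TOPS class, which is why the bandwidth side of the estimate matters.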

It also computes the memory-bandwidth ceiling:

Max tokens/sec ≈ memory_bandwidth / model_size_in_memory

For batch=1 inference (the common case on a laptop), this is almost always the real bottleneck.
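A minimal sketch of the ceiling, with a hypothetical function name and illustrative bandwidth figures rather than any specific chip preset. The intuition is that generating one token means reading every weight once, so memory traffic per token is roughly the model size.

```javascript
// Hypothetical function; units are GB/s and GB so the result is tokens/sec.
function maxTokensPerSecond(bandwidthGBps, modelSizeGB) {
  return bandwidthGBps / modelSizeGB;
}

// The same 4 GB model on two illustrative machines.
const slower = maxTokensPerSecond(120, 4);
const faster = maxTokensPerSecond(273, 4);
console.log(slower, faster); // prints 30 68.25
```

Two machines with similar TOPS but these two bandwidths would top out at very different speeds, regardless of how much compute sits idle.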

Caveats

  • These are estimates, not benchmarks. Actual performance depends on the framework (llama.cpp, MLX, Ollama, vLLM…), the kernel implementations, thermals, and a dozen other things.
  • Vendor-published TOPS numbers measure different things. Apple's headline TOPS (38 on M4) is the Neural Engine, but most LLM runtimes use the GPU (via Metal / MLX). NVIDIA quotes Tensor Core TOPS at INT8 (sparsity off in this calc), AMD at FP16, Copilot+ PC NPUs at INT8. The math is generic; pick the chip preset that matches the runtime path you actually use.
  • MoE models (Mixtral, etc.) are tricky — the calculator uses active parameters per token, not total.
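To see why the active-parameter choice matters, compare the raw compute estimate both ways for DeepSeek V3 (671B total parameters, about 37B active per token). The function name is hypothetical, and the weight-memory line assumes all experts stay resident in memory, which is this sketch's assumption rather than a statement about the calculator.

```javascript
// Raw TOPS per the 2 × parameters × tokens/sec rule (hypothetical helper).
function rawTops(paramCount, tokensPerSecond) {
  return (2 * paramCount * tokensPerSecond) / 1e12;
}

const totalParams = 671e9;  // DeepSeek V3 total parameters
const activeParams = 37e9;  // approximate parameters active per token

const computeActive = rawTops(activeParams, 10); // what an MoE-aware estimate uses
const computeTotal = rawTops(totalParams, 10);   // what a naive estimate would use

// Assumption: every expert must still be held in memory, so weight size
// follows the total count (4-bit here, ignoring quantization overhead).
const weightGB = (totalParams * 4) / 8 / 1e9;

console.log(computeActive.toFixed(2), computeTotal.toFixed(2), weightGB); // prints 0.74 13.42 335.5
```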

License

MIT. Use it however you like.

Contributing

See CONTRIBUTING.md. PRs welcome.

Author

Ideated, designed and implemented by David Carrascosa Bolaños.

Built with the support of AI-assisted coding tools.
