Releases: eladser/mtop

mtop 1.3.0

18 Jun 08:52

Watch more than one box, see more per request, and a couple of new numbers.

  • Multi-host: give -ollama a comma list and mtop stacks the models and GPUs from each machine, tagged by host. Handy if you run models on a couple of boxes.
  • GPU util and memory now draw as sparklines over time, next to the live numbers.
  • Request inspector: run with -inspect, press i, and you get the last request's prompt, completion, and a load/prompt/decode timing split. Off by default; the text it captures is stripped of control bytes so a model can't smuggle escape sequences into your terminal.
  • Session energy on the TOK/S line: watt-hours used and tokens per watt-hour. It's whole-GPU power, so read it as a rough efficiency number.
  • compare -openai <url> runs the comparison against llama.cpp, LM Studio or vLLM, not just ollama.
  • -mem-alert and -temp-alert set the alert thresholds, replacing the built-in 93% and 87°C.
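
For the curious, the control-byte stripping mentioned for the request inspector can be sketched roughly as below. This is a minimal illustration, not mtop's actual code: the function name sanitize and the choice to keep newlines and tabs are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitize (hypothetical name) drops control characters such as ESC and BEL
// so captured text cannot drive the terminal. Newlines and tabs are kept
// here because prompts and completions legitimately contain them; that
// choice is an assumption of this sketch.
func sanitize(s string) string {
	return strings.Map(func(r rune) rune {
		if r == '\n' || r == '\t' {
			return r
		}
		if unicode.IsControl(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	dirty := "hello\x1b[31m world\x07"
	fmt.Printf("%q\n", sanitize(dirty))
}
```

With the ESC byte gone, what is left of a colour sequence (here "[31m") is plain text and the terminal no longer interprets it.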

brew and scoop pick this up as usual; winget follows once Microsoft merges the bump.

mtop 1.2.0

11 Jun 09:07

A few new things.

mtop compare "<prompt>" model1 model2 runs the same prompt past a few ollama models one at a time and prints tok/s side by side, fastest first. Handy for deciding whether the bigger model is worth the wait.

-notify pops a desktop notification the first time a GPU crosses the alert line (high memory or temperature). No new dependencies, it just uses whatever the OS already ships: osascript on a Mac, notify-send on Linux, a powershell toast on Windows.

-history keeps recent requests in ~/.mtop/history.jsonl and reads them back on the next run, so the stats panes aren't empty after a restart.

On Apple Silicon, running mtop with sudo now gets you real GPU utilization from powermetrics, on top of the memory figure you already had.

The GPU pane shows how much VRAM the loaded models account for, and /metrics now includes the GPU gauges, not just the request counters.

mtop 1.1.3

10 Jun 08:54

Closes a hole in the request proxy.

The proxy forwards everything to ollama, including the endpoints that delete or pull models. A browser tab can reach 127.0.0.1, so a page you have open could hit the proxy and drive that API without you knowing, and use DNS rebinding to read the responses back. The proxy now only answers loopback callers and rejects anything carrying a cross-origin Origin. Normal ollama clients and SDKs don't send that header, so nothing changes for them.

While in there:

  • /metrics was building its per-model rows and its percentiles from two separate reads, so under load they could disagree. One read now.
  • the footer still said "proxy on ..." when started with -no-proxy.
  • the 1 MiB cap on stream reads wasn't applied to the final flush.

mtop 1.1.2

10 Jun 08:18

Hardening, most of it picked up from @tiberiuichim's fork — thanks.

  • The proxy stops buffering after 1 MiB while looking for the final chunk, so a misbehaving server can't grow memory without bound. Responses past that size still reach the client in full, they just don't get counted.
  • Values from ~/.mtop.conf no longer pass through the process environment, where child processes like nvidia-smi could see them.
  • Binding the proxy to anything beyond loopback now prints a warning: traffic through it is plain http, and prompts ride on it.
  • golang.org/x/text was pinned to a 2022 version with known CVEs; now current.
  • The response tap moved from ModifyResponse to a RoundTripper, which also drops the deprecated Director hook.

mtop 1.1.1

10 Jun 07:51

Model rows wrapped and broke the layout when more than one server was up (ollama next to llama.cpp, say). Rows now truncate to the pane width instead.
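
Truncating to the pane width looks roughly like this. It is a sketch under assumptions: the name truncate and the trailing ellipsis are illustrative, and it counts one rune as one terminal cell, which does not hold for wide characters.

```go
package main

import "fmt"

// truncate cuts s to at most width runes, marking the cut with an ellipsis.
func truncate(s string, width int) string {
	if width <= 0 {
		return ""
	}
	r := []rune(s)
	if len(r) <= width {
		return s
	}
	if width == 1 {
		return "…"
	}
	return string(r[:width-1]) + "…"
}

func main() {
	fmt.Println(truncate("llama3.1:70b-instruct-q4_K_M", 16))
}
```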

mtop 1.1.0

10 Jun 01:15

mtop now watches more than ollama.

  • llama.cpp, LM Studio and vLLM show up in the models pane next to ollama, detected on their usual ports (8080, 1234 and 8000; all configurable, and an empty flag value skips that server). llama.cpp wants --metrics on launch for the kv-cache numbers.
  • The proxy counts OpenAI-style requests too (/v1/chat/completions, /v1/completions), so clients of llama.cpp and LM Studio get tok/s in the requests pane. Their responses carry no timings, so that number is tokens over wall time — close to decode speed, not identical. Ollama requests still use ollama's own timings.
  • AMD GPUs through rocm-smi. Apple Silicon shows unified-memory use; real GPU utilization needs root for powermetrics, so that's still open.
  • Press c for a per-model table: requests, average tok/s, p50/p95, tokens out.
  • The proxy port serves /metrics in prometheus format.
  • Status line turns red when GPU memory passes 93% or temperature passes 87°C.
  • ~/.mtop.conf holds MTOP_* settings for hosts you don't want to retype.

The gif in the README is from a live run, as always.

mtop 1.0.0

09 Jun 21:08

First release.

mtop is one terminal window for your local AI: loaded models and the VRAM they hold, GPU state, every request with its tok/s, and a throughput sparkline.

What's in 1.0:

  • Models, GPU, requests and throughput panes. Zero config — run mtop, it finds Ollama on localhost.
  • Per-request tok/s via a pass-through proxy on 127.0.0.1:4321. Ollama has no metrics endpoint; the response stream is the only place those numbers exist, so mtop sits in the middle and reads them as they pass. Point OLLAMA_HOST at it.
  • Model unload: arrows to select, u to evict. Models that blow past their expiry get marked overdue.
  • -idle-unload 15m evicts anything that hasn't served a request in 15 minutes, for the times ollama forgets to.
  • Binaries for Windows, Linux and macOS below. GPU stats are NVIDIA-only for now; AMD and Apple Silicon are next on the roadmap, along with llama.cpp and LM Studio.

The gif in the README is a real run against a live model, not a mockup.