A search-enabled local CLI assistant for Ollama models like Gemma.
It keeps normal chat behavior, but automatically searches the web and reads pages when a question needs current information. Useful page content is also stored locally as embeddings, so follow-up questions can draw on it.
- Normal local chat through Ollama
- Automatic web search using DDGS
- Page reading and text extraction
- Local memory with embeddings
- Streaming responses
- Cross-platform support for Windows, macOS, and Linux
- Configurable chat and embedding models
Local models like Gemma are great for private, offline-style workflows, but they do not know live web information by default. This project adds a lightweight retrieval layer around Ollama so the assistant can:
- answer normal questions naturally
- search current information when needed
- read and extract text from pages
- remember useful sources for follow-up questions
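That last point, remembering useful sources, amounts to splitting extracted page text into overlapping chunks before embedding them. A minimal sketch of the chunking half; the chunk size and overlap here are illustrative assumptions, not the project's actual values:

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split extracted page text into overlapping chunks for embedding.

    Sizes are illustrative; overlap keeps sentences that straddle a
    boundary retrievable from either chunk.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so chunks overlap slightly
    return chunks
```

Each chunk is then embedded with the configured embedding model and stored locally for similarity search later.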
- Python 3.10 or newer
- Ollama installed and running
- A pulled chat model, such as `gemma4:26b`
- A pulled embedding model, such as `embeddinggemma`
```
git clone https://github.com/ghostmountain3/gemma_web_cli.git
cd gemma_web_cli
```

This is the cleanest way to install the CLI as a command:

```
pipx install git+https://github.com/ghostmountain3/gemma_web_cli.git
```

Then run:

```
gemma-web
```

Create and activate a virtual environment:
```
py -m venv .venv
.venv\Scripts\activate
```

Install the project:

```
pip install -e .
```

Run the CLI:

```
gemma-web
```

Create and activate a virtual environment:
```
python3 -m venv .venv
source .venv/bin/activate
```

Install the project:

```
pip install -e .
```

Run the CLI:

```
gemma-web
```

Before running the app, make sure you have a chat model and an embedding model.
Example:
```
ollama pull gemma4:26b
ollama pull embeddinggemma
```

Once Ollama is running and the models are pulled, launch the app:

```
gemma-web
```

Then type naturally, for example:
```
what is recursion
what is the current price of bitcoin
what changed in ollama recently
based on what you read earlier, summarize the main point
```
The chat and embedding models can be changed with environment variables.
```
$env:GEMMA_WEB_MODEL="gemma4:26b"
$env:GEMMA_WEB_EMBED_MODEL="embeddinggemma"
gemma-web
```

```
export GEMMA_WEB_MODEL="gemma4:26b"
export GEMMA_WEB_EMBED_MODEL="embeddinggemma"
gemma-web
```

You can replace `gemma4:26b` with another Ollama chat model if you want to test a different setup.
Examples:
```
$env:GEMMA_WEB_MODEL="gemma3"
gemma-web
```

```
export GEMMA_WEB_MODEL="gemma3"
gemma-web
```

At a high level, the app does this:
- Accepts a normal user message
- Routes the request as local-only, memory-only, web search, or web search plus memory
- Uses DDGS to search when fresh information is needed
- Reads and extracts text from selected pages
- Chunks and embeds useful content for local memory
- Sends the strongest evidence to the Ollama chat model
- Streams the final answer back in the terminal
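The routing step above could be sketched with simple keyword heuristics. This is an illustrative sketch, not the project's actual router; the route names and keyword lists are assumptions:

```python
from enum import Enum

class Route(Enum):
    LOCAL_ONLY = "local-only"
    MEMORY_ONLY = "memory-only"
    WEB_SEARCH = "web-search"
    WEB_PLUS_MEMORY = "web-search-plus-memory"

# Hypothetical trigger phrases; a real router would be more robust.
FRESHNESS_HINTS = ("current", "latest", "today", "recently", "price")
MEMORY_HINTS = ("you read", "earlier", "based on what")

def route_message(message: str) -> Route:
    """Pick one of the four routes described above using keyword heuristics."""
    text = message.lower()
    wants_fresh = any(hint in text for hint in FRESHNESS_HINTS)
    wants_memory = any(hint in text for hint in MEMORY_HINTS)
    if wants_fresh and wants_memory:
        return Route.WEB_PLUS_MEMORY
    if wants_fresh:
        return Route.WEB_SEARCH
    if wants_memory:
        return Route.MEMORY_ONLY
    return Route.LOCAL_ONLY
```

A production router would likely combine heuristics like these with model-based classification, but the four routes map directly onto the list above.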
- Ollama must be running locally
- Web lookups use DDGS
- The tool stores local memory in the `data/` folder
- Saved memory is not meant to be committed to Git
- Streaming may fall back to non-streaming if the connection is interrupted
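Since memory lives in `data/` and should not be committed, a `.gitignore` entry along these lines keeps it out of version control (assuming the default location):

```
# local memory store created by gemma-web
data/
```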
Make sure Ollama is installed and running, then test a model manually:

```
ollama run gemma4:26b
```

If the embedding model is missing, pull it manually:

```
ollama pull embeddinggemma
```

If you used pipx, make sure pipx is on your PATH. If you used developer mode, make sure you ran:

```
pip install -e .
```

Slow responses are normal for larger chat models and multi-page retrieval. You can reduce the number of pages read or lower the amount of page text kept per request in `config.py`.
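The relevant knobs in `config.py` might look something like this; the names below are hypothetical, so check the actual file for the real identifiers and defaults:

```python
# config.py (sketch) - hypothetical setting names, not the project's real ones
MAX_PAGES_READ = 3      # lower this to fetch and read fewer pages per question
MAX_PAGE_CHARS = 4000   # lower this to keep less extracted text per page
```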
- Better router confidence handling
- Improved page deduplication and refresh logic
- Similarity thresholds for memory filtering
- Source inspection commands
- Better support for dynamic pages
- Cleaner packaging and release workflow
Issues and pull requests are welcome.
If you want to contribute:
- keep changes focused
- include a clear description of what changed
- test on at least one platform before opening a PR
This project is licensed under the MIT License. See the LICENSE file for details.