AI BOB - The Serverless Offline Intelligent Desktop Agent for Windows

AI BOB is a fully local, serverless, and offline desktop assistant and agentic operating system manager designed specifically for Windows environments. It bridges a modern HTML5/JS/CSS graphical user interface rendered via PyWebView with a local Large Language Model (LLM) inference engine running quantized GGUF models (such as Qwen 2.5 and the new Google Gemma 4).

AI BOB features native OS integration, allowing it to monitor, control, and automate Windows tasks entirely offline without sending user data to external servers. It incorporates a dual-register Arabic linguistic engine that automatically switches between a warm, natural Iraqi dialect for daily chat and system management, and formal, structured Arabic for complex academic or technical explanations.


Technical Philosophy

AI BOB is built on the premise of "local-first, serverless desktop automation." Unlike cloud-based assistants (e.g., Copilot, Siri, Alexa) that rely on heavy APIs, network latency, and remote data processing, AI BOB runs locally on the user's CPU or GPU. This ensures:

  1. Absolute Privacy: No conversations, system telemetry, or files leave the host computer.
  2. No Network Latency: Intent processing and offline fallback loops run locally, with no network round trip.
  3. Hardware Accessibility: Dynamic performance scaling allows the agent to run efficiently on low-spec PCs with under 4GB of RAM.

Architectural Overview

The application is structured into decoupled, modular engines that handle frontend rendering, orchestration, model management, system automation, persistent storage, and linguistic post-processing.

+-------------------------------------------------------------+
|                        Web UI (HTML5/JS/CSS)                |
|               (Rendered locally inside PyWebView)           |
+-------------------------------------------------------------+
                               |
                               v (Serverless JS/Python Bridge)
+-------------------------------------------------------------+
|                      main.py (WebView API)                  |
+-------------------------------------------------------------+
                               |
                               v
+-------------------------------------------------------------+
|                 llm_agent.py (Coordinating Agent)           |
+-------------------------------------------------------------+
      |               |               |                |
      v               v               v                v
+-----------+   +-----------+   +-----------+    +------------+
| model_mgr |   | thinking  |   | system_mgr|    | memory_mgr |
|  (GGUF)   |   | (CoT/Pol) |   | (Win OS)  |    |   (JSON)   |
+-----------+   +-----------+   +-----------+    +------------+
                      |               |
                      v               v
                +---------------------------+
                |        task_planner       |
                |   (Compound Instructions) |
                +---------------------------+

1. Frontend & Communication Bridge (main.py / web_ui)

  • pywebview Integration: Instead of exposing local HTTP server ports (which can lead to security vulnerabilities and port collisions), the backend instantiates a native Edge Chromium webview instance. It exposes a direct, asynchronous Python API (WebViewApi) that can be invoked from JavaScript using window.pywebview.api.
  • Frontend Engine: A single-file GUI written in vanilla HTML5, modern CSS, and plain JavaScript, organized in a React-like component style. It implements a dark-mode glassmorphism interface, real-time system performance dials, scrollable chat logs, and sidebar session managers.
  • Asynchronous Threading: To prevent freezing the frontend thread during heavy LLM inference, the Python backend executes all model generation inside dedicated daemon threads, pushing streamed tokens to the GUI using direct JS execution callbacks (window.pywebviewStreamToken).
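
The threading pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual main.py: the method name send_message, the fake_generate stand-in, and the injected evaluate_js callable are assumptions. In a real pywebview app the callable would be the window's evaluate_js method, and the object would be passed as js_api to webview.create_window.

```python
import json
import threading
from typing import Callable, Iterable


class WebViewApi:
    """Sketch of a JS-callable API object (method names are hypothetical)."""

    def __init__(self, evaluate_js: Callable[[str], object],
                 generate: Callable[[str], Iterable[str]]):
        # evaluate_js: in a real app, the pywebview window's evaluate_js method.
        # generate: any callable that yields tokens for a prompt (assumed).
        self._evaluate_js = evaluate_js
        self._generate = generate
        self._last_worker = None

    def send_message(self, prompt: str) -> None:
        """Return immediately; tokens are streamed from a daemon thread."""
        worker = threading.Thread(target=self._stream, args=(prompt,), daemon=True)
        self._last_worker = worker
        worker.start()

    def _stream(self, prompt: str) -> None:
        for token in self._generate(prompt):
            # json.dumps turns the token into a safely escaped JS string literal.
            self._evaluate_js(f"window.pywebviewStreamToken({json.dumps(token)})")


def fake_generate(prompt: str):
    # Stand-in for the model: yields the prompt word by word.
    yield from prompt.split()


calls = []
api = WebViewApi(calls.append, fake_generate)
api.send_message('hi "bob"')
api._last_worker.join()  # only for the demo; the GUI never blocks on this
```

Serializing each token with json.dumps before embedding it in the script string keeps quotes and newlines in model output from breaking the JavaScript call.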

2. The Coordinating Agent (llm_agent.py)

The coordinating agent acts as the main orchestrator. It manages the multi-stage cognitive pipeline:

  • Stage 1 (Intent & Fallback Routing): Analyzes query metadata and forwards simple math or conversational prompts directly to rule-based engines to save CPU cycles.
  • Stage 2 (Context Gathering): Queries the SystemManager for real-time performance metrics (CPU, RAM, Temp) when relevant to the query, injecting this data into the prompt context.
  • Stage 3 (Inference Execution): Sends the formatted prompt to the ModelManager for streaming.
  • Stage 4 (Self-Correction & Refusal Override): Evaluates model responses for structural errors or refusals, rewriting commands if necessary so that malformed output does not block execution.
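
As an illustration of Stage 1, the sketch below routes plain arithmetic to a small rule-based evaluator and everything else to the model. The function name route_query, the llm callable, and the exact matching rule are assumptions made for this example; the project's real routing logic may differ.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}
_MATH_ONLY = re.compile(r"[\d\s.+\-*/()]+")


def _eval_node(node):
    """Evaluate a restricted arithmetic AST (numbers, + - * /, unary minus)."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError("unsupported expression")


def route_query(query, llm):
    """Return (route, answer); route is 'rule' or 'llm'."""
    text = query.strip()
    if _MATH_ONLY.fullmatch(text) and any(c.isdigit() for c in text):
        try:
            tree = ast.parse(text, mode="eval")
            return "rule", str(_eval_node(tree.body))
        except (SyntaxError, ValueError, ZeroDivisionError):
            pass  # not valid arithmetic after all: fall through to the model
    return "llm", llm(text)
```

Walking the parsed AST instead of calling eval keeps the shortcut limited to the four basic operators, so arbitrary code in a query is never executed.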

3. Model Manager & Inference Tuning (model_manager.py)

  • Local GGUF Loading: Uses llama-cpp-python to bind directly to llama.cpp shared libraries. It auto-detects local .gguf files, prioritizing active models stored in D:\mrbob_AI_Models\models\ or local fallback paths.
  • RAM & CPU Auto-Tuning: The manager queries physical system memory and CPU core counts on startup:
    • Ultra-Low Memory Mode (<4GB RAM): Caps context size at 1024 tokens, batch size at 256, and disables memory locks (mlock) to prevent paging issues.
    • Balanced Mode (4GB to <8GB RAM): Scales context to 2048 tokens, batch size to 256.
    • High-Performance Mode (>=8GB RAM): Allocates a full 4096 context window, batch size of 512, and enables speculative decoding depth of 5.
  • Thread Affinity Tuning: Auto-calculates optimal compute threads based on physical (not logical) cores to avoid cache thrashing.
  • Speculative Decoding (Prompt Lookup Decoding): Integrates speculative decoding via LlamaPromptLookupDecoding. Rather than running a second neural draft model, it drafts candidate tokens by finding token sequences in the prompt that repeat the most recent output, and the main model then verifies the draft in a single batch. This can speed up token generation by up to 2x on low-end CPUs, mainly when the output repeats text from the prompt.
  • Prompt Caching: Enables LlamaRAMCache to keep the evaluated state of prompt prefixes in RAM, reducing response times for recurring conversational threads.
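
The tiering rules above can be expressed as a small pure function. This is a sketch under stated assumptions: the InferenceProfile type and pick_profile name are hypothetical, the thread rule (physical cores minus one) is taken from the hardware section of this README, and the mlock and draft-token values for tiers where the README is silent are guesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceProfile:
    name: str
    n_ctx: int          # context window in tokens
    n_batch: int        # prompt batch size
    use_mlock: bool     # pin model pages in RAM
    n_threads: int      # compute threads
    draft_tokens: int   # speculative depth; 0 means disabled


def pick_profile(total_ram_gb: float, physical_cores: int) -> InferenceProfile:
    """Map detected hardware to one of the three documented tiers."""
    n_threads = max(1, physical_cores - 1)
    if total_ram_gb < 4:
        # Documented: 1024 context, batch 256, mlock disabled.
        return InferenceProfile("ultra_low", 1024, 256, False, n_threads, 0)
    if total_ram_gb < 8:
        # Documented: 2048 context, batch 256. mlock=False is an assumption.
        return InferenceProfile("balanced", 2048, 256, False, n_threads, 0)
    # Documented: 4096 context, batch 512, depth 5. mlock=True is an assumption.
    return InferenceProfile("high_performance", 4096, 512, True, n_threads, 5)


# With llama-cpp-python, a profile could be applied roughly like this
# (not executed here, since it needs a model file on disk):
#   from llama_cpp import Llama
#   from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
#   llm = Llama(model_path=path, n_ctx=p.n_ctx, n_batch=p.n_batch,
#               n_threads=p.n_threads, use_mlock=p.use_mlock, use_mmap=True,
#               draft_model=LlamaPromptLookupDecoding(num_pred_tokens=p.draft_tokens)
#               if p.draft_tokens else None)
```

Keeping the decision in a pure function makes the thresholds easy to unit test without loading a model.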

4. Cognitive Prompting & Linguistic Engine (thinking_engine.py)

The thinking engine handles prompt construction, output formatting, and dialect normalization.

  • Chain-of-Thought (CoT) Structuring: Guides the LLM to output its internal logic inside <think>...</think> blocks prior to producing the final user response.
  • Identity Protection: Implements filters that intercept output tokens, ensuring the model never leaks prompt templates or drifts into identity hallucinations.
  • Linguistic Dialect Switching: Determines if the query is conversational (where it triggers Iraqi dialect translations) or academic (where it forces high-quality formal Arabic).
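
To make the <think> convention concrete, here is a minimal sketch of separating the reasoning block from the user-facing answer. The function name split_reasoning and the handling of an unclosed tag are assumptions for illustration, not the project's actual thinking_engine.py.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str):
    """Return (reasoning, answer) from a raw model response."""
    thoughts = [m.strip() for m in _THINK.findall(raw)]
    answer = _THINK.sub("", raw)
    # An unclosed <think> (for example, generation stopped at the token
    # limit) is treated as reasoning all the way to the end of the text.
    open_idx = answer.find("<think>")
    if open_idx != -1:
        thoughts.append(answer[open_idx + len("<think>"):].strip())
        answer = answer[:open_idx]
    return "\n".join(thoughts), answer.strip()
```

For example, split_reasoning("<think>check RAM</think>RAM is fine.") yields the reasoning "check RAM" and the answer "RAM is fine.".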

5. OS Control & Automation (system_manager.py)

Provides direct, low-level Python bindings to Windows APIs:

  • Resource Monitoring: Measures CPU usage via kernel calls, RAM allocation via virtual memory counters, and disk space usage recursively.
  • Temperature Tracking: Queries thermal zones using a WMI PowerShell command (MSAcpi_ThermalZoneTemperature) and caches the result to limit polling overhead, falling back to a load-based heat approximation curve when no sensor reading is available.
  • System Automation: Calls Windows API libraries (such as ctypes.windll.user32.LockWorkStation) to lock the workstation, or uses subprocess execution for temp-file cleanup, system shutdowns, and application spawning.
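
A rough sketch of the temperature fallback and the lock call follows. MSAcpi_ThermalZoneTemperature reports CurrentTemperature in tenths of a kelvin, which is why the conversion divides by 10. The linear curve and its 40 °C and 85 °C endpoints are illustrative assumptions, as are all function names; the project's real approximation curve is not documented here.

```python
import ctypes
import shutil
import sys


def wmi_raw_to_celsius(raw_tenths_kelvin: int) -> float:
    """Convert a WMI CurrentTemperature reading (tenths of a kelvin) to Celsius."""
    return raw_tenths_kelvin / 10.0 - 273.15


def approx_temp_from_load(cpu_percent: float,
                          idle_c: float = 40.0, full_c: float = 85.0) -> float:
    """Fallback estimate: linear interpolation between assumed idle and full-load temps."""
    load = min(max(cpu_percent, 0.0), 100.0) / 100.0
    return round(idle_c + (full_c - idle_c) * load, 1)


def read_temperature_c(raw_wmi_value, cpu_percent: float) -> float:
    """Prefer the sensor value; fall back to the load-based estimate."""
    if raw_wmi_value is not None and raw_wmi_value > 0:
        return wmi_raw_to_celsius(raw_wmi_value)
    return approx_temp_from_load(cpu_percent)


def disk_usage_percent(path: str = ".") -> float:
    usage = shutil.disk_usage(path)
    return round(100.0 * usage.used / usage.total, 1)


def lock_workstation() -> bool:
    """Lock the session on Windows; do nothing and return False elsewhere."""
    if sys.platform != "win32":
        return False
    return bool(ctypes.windll.user32.LockWorkStation())
```

Guarding the ctypes.windll access behind a platform check lets the rest of the module be imported and tested on non-Windows machines.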

6. Recursive Task Planner (task_planner.py)

Decouples compound user requests (e.g., "Check my RAM, clean the system, and open notepad") into structured pipelines. It parses the intent list recursively, runs each corresponding system tool, aggregates the results, and feeds the combined observation back to the agent to form a unified final answer.
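
The decomposition step could look roughly like the sketch below. The separator rule (commas, "and", "then"), the keyword-based tool lookup, and the tool names are all assumptions for illustration; the real planner presumably relies on the agent's intent analysis rather than keywords.

```python
import re
from typing import Callable, Dict, List, Tuple

_SPLIT = re.compile(r"\s*(?:,|\band\b|\bthen\b)\s*", re.IGNORECASE)


def split_steps(request: str) -> List[str]:
    """Recursively peel off the first clause until no separator remains."""
    request = request.strip(" .")
    if not request:
        return []
    parts = _SPLIT.split(request, maxsplit=1)
    head = parts[0].strip()
    rest = split_steps(parts[1]) if len(parts) == 2 else []
    return ([head] if head else []) + rest


def run_plan(request: str, tools: Dict[str, Callable[[], str]]) -> List[Tuple[str, str]]:
    """Run the first tool whose keyword appears in each step; collect observations."""
    observations = []
    for step in split_steps(request):
        tool = next((fn for key, fn in tools.items() if key in step.lower()), None)
        observations.append((step, tool() if tool else "no matching tool"))
    return observations


def summarize(observations: List[Tuple[str, str]]) -> str:
    """Combine per-step results into one observation string for the agent."""
    return "; ".join(f"{step} -> {result}" for step, result in observations)


# Hypothetical tools returning canned observations.
demo_tools = {
    "ram": lambda: "RAM 41% used",
    "clean": lambda: "temp files removed",
    "notepad": lambda: "notepad started",
}
report = summarize(run_plan("Check my RAM, clean the system, and open notepad", demo_tools))
```

The combined report string is what would be fed back to the agent as a single observation before it writes the final answer.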

7. Memory & Semantic Storage (memory_engine.py)

  • Session Tracking: Persists chat history locally inside AppData\AIBOB\memory.json.
  • Log Compression (Context Pruner): When conversation history exceeds 1200 tokens, the memory engine compresses older messages into a dense, single-line text summary, keeping only the last 4 active messages in raw format. This fits the entire context within low-memory limits.
  • Semantic Cache: Matches user queries against normalized string hashes (removing diacritics, common Arabic variations, and punctuation). If a match is found, it returns the cached response almost instantly, without running the neural network.
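
A minimal sketch of the normalized-hash lookup is shown below. The specific normalization rules (which letter variants are folded together) and the SemanticCache class are assumptions; the project's memory_engine.py may normalize differently.

```python
import hashlib
import re
import unicodedata

# Arabic diacritics (U+064B to U+0652), dagger alef (U+0670), tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Fold common letter variants (an assumed rule set for this sketch).
_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})


def normalize(query: str) -> str:
    text = _DIACRITICS.sub("", query).translate(_VARIANTS)
    # Drop every character in a Unicode punctuation category (Latin and Arabic).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.lower().split())


def cache_key(query: str) -> str:
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


class SemanticCache:
    """Exact-match cache keyed on the hash of the normalized query."""

    def __init__(self):
        self._store = {}

    def get(self, query: str):
        return self._store.get(cache_key(query))

    def put(self, query: str, response: str) -> None:
        self._store[cache_key(query)] = response
```

Because the key is a hash of normalized text, two queries hit the same entry only when they differ in diacritics, folded letter variants, punctuation, case, or spacing.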

The Cognitive 5-Layer Reasoning Pipeline

AI BOB implements a strict reasoning template for every LLM invocation:

  1. Cognitive Understanding: Identify the explicit request, implicit intents, constraints, and system state requirements.
  2. Dependency Check: Inspect necessary files, system permissions, and verify tool parameters.
  3. Recursive Planning: Define the step-by-step procedure required to fulfill the user's intent.
  4. Execution Generation: Construct the precise system command tag ([RUN_COMMAND: ToolName: Params]) with well-formed JSON parameters.
  5. Safety & Quality Polish: Filter out system prompt leakage, eliminate repetitions, correct sentence boundaries, and apply the appropriate dialect polish.

User Query ---> [Intent Analysis] ---> [CoT Reasoning] ---> [Tool Output] ---> [Linguistic Polishing] ---> Clean Output
                     |                      |                      |                     |
               (NLP Routing)         (5-Layer Check)        (CMD Execution)       (10-Layer Filter)
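
To show how a [RUN_COMMAND: ToolName: Params] tag might be validated before execution, here is a minimal parser sketch. It assumes Params is a JSON object, which the README implies but does not spell out; the tool name OpenApp in the usage note is hypothetical.

```python
import json
import re

_HEAD = re.compile(r"\[RUN_COMMAND:\s*([A-Za-z_]\w*)\s*:\s*")
_DECODER = json.JSONDecoder()


def parse_command(text: str):
    """Return (tool_name, params_dict), or None if the tag is missing or malformed."""
    match = _HEAD.search(text)
    if not match:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # reports where it ended, so a "]" inside the JSON does not confuse us.
        params, end = _DECODER.raw_decode(text, match.end())
    except json.JSONDecodeError:
        return None
    if not isinstance(params, dict) or not text[end:].lstrip().startswith("]"):
        return None
    return match.group(1), params
```

For instance, parse_command('[RUN_COMMAND: OpenApp: {"name": "notepad"}]') returns the tool name and a dict, while a tag with unquoted keys returns None and can be sent back for self-correction.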

Advanced 10-Layer Linguistic Polisher

To ensure premium local-first presentation, the model's output passes through an advanced 10-layer post-processing filter:

  • Layer 1 (Token Cleanup): Strips residual ChatML tags (<|im_start|>, <|im_end|>) and unclosed formatting tags.
  • Layer 2 (Thought Isolation): Completely hides reasoning chains (<think>...</think>) from the final user panel.
  • Layer 3 (Emoji & Noise Removal): Automatically strips all emojis and corrupted binary bytes from the final text.
  • Layer 4 (Sentence Recovery): Gracefully repairs sentences that are cut off due to token limits by truncating the incomplete word and adding a polite ending.
  • Layer 5 (Dialect Harmonization): Translates dry standard Arabic phrases into warm, natural Iraqi dialect for daily tasks (e.g. converting "كيف حالك" to "شلونك عيني").
  • Layer 6 (Repetition Removal): Detects and eliminates consecutive loop words (e.g. reducing "my friend my friend my friend" to "my friend").
  • Layer 7 (Whitespace Normalization): Collapses redundant spaces and blank lines.
  • Layer 8 (Refusal Softener): Rewrites robotic refusal responses into friendly, helpful explanations.
  • Layer 9 (Punctuation Formatting): Standardizes spaces around commas, periods, and question marks.
  • Layer 10 (Tone Polish): Appends warm conversational touches (like ending sentences with polite particles).
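
The filter chain can be pictured as a list of small text-to-text functions applied in order. The sketch below implements only Layers 1, 6, 7, and 9 with simple regular expressions; the function names and exact patterns are assumptions, and the dialect and tone layers are omitted because their rules are not specified here.

```python
import re

_PUNCT = ",.?!\u060C\u061F"  # Latin marks plus Arabic comma and question mark


def strip_chat_tags(text: str) -> str:
    """Layer 1: remove leftover ChatML markers."""
    return re.sub(r"<\|im_(?:start|end)\|>", "", text)


def collapse_repeats(text: str, max_phrase_words: int = 3) -> str:
    """Layer 6: reduce consecutive repeats of a phrase of 1 to 3 words to one copy."""
    for n in range(max_phrase_words, 0, -1):
        pattern = re.compile(r"\b((?:\w+\s+){%d}\w+)(?:\s+\1\b)+" % (n - 1),
                             re.IGNORECASE)
        text = pattern.sub(r"\1", text)
    return text


def normalize_whitespace(text: str) -> str:
    """Layer 7: collapse runs of spaces and limit blank lines to one."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def fix_punctuation_spacing(text: str) -> str:
    """Layer 9: no space before punctuation, one space after (digits excepted)."""
    text = re.sub(r"\s+([%s])" % _PUNCT, r"\1", text)
    return re.sub(r"([%s])(?=[^\s\d%s])" % (_PUNCT, _PUNCT), r"\1 ", text)


LAYERS = [strip_chat_tags, collapse_repeats, normalize_whitespace,
          fix_punctuation_spacing]


def polish(text: str) -> str:
    for layer in LAYERS:
        text = layer(text)
    return text
```

Keeping each layer as an independent function makes it straightforward to reorder, disable, or unit test a single layer.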

Hardware Optimizations for Low-Spec Machines

  • Low RAM Handling: Restricts thread counts to Physical Cores - 1 to avoid CPU core starvation, and leverages OS memory mapping (use_mmap=True) so that model weights are paged in from disk on demand instead of being copied into RAM up front.
  • Fast Speculative Drafts: Prompt lookup speculative decoding proposes draft tokens cheaply and lets the main model verify several of them in one pass instead of generating them one at a time, saving up to 40% of CPU cycles on repetitive outputs.
  • Prompt Caching: By storing evaluated system prompt headers in RAM cache, the first-token latency (time-to-first-token) is reduced from seconds to milliseconds.
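
The idea behind prompt lookup drafting can be illustrated without any model at all. The sketch below is not the LlamaPromptLookupDecoding implementation from llama-cpp-python; it is a simplified version of the technique, with hypothetical function names and integer token IDs, showing how a draft is proposed from earlier context and how much of it survives verification.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], max_ngram: int = 3,
                  num_pred: int = 5) -> List[int]:
    """Find the latest earlier occurrence of the trailing n-gram and return
    up to num_pred tokens that followed it. Longer n-grams are tried first."""
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        tail = list(tokens[-n:])
        # Scan right to left, skipping the trailing n-gram itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == tail:
                return list(tokens[start + n:start + n + num_pred])
    return []


def accepted_prefix(draft: Sequence[int], model_tokens: Sequence[int]) -> List[int]:
    """Keep draft tokens only up to the first disagreement with the main model."""
    accepted = []
    for drafted, verified in zip(draft, model_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    return accepted


history = [1, 2, 3, 4, 5, 1, 2, 3]          # the sequence 1 2 3 appears twice
draft = propose_draft(history, num_pred=2)  # tokens that followed it last time
```

Every accepted draft token is one fewer sequential decoding step for the main model, which is where the speedup on repetitive text comes from; rejected drafts cost only the verification pass.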

Setup and Installation

Prerequisites

  • OS: Windows 10 or Windows 11 (64-bit).
  • Python: Version 3.10 to 3.12.
  • C++ Compiler: MSVC build tools installed (required to compile llama-cpp-python local bindings).

Step-by-Step Installation

  1. Clone the Repository:

    git clone https://github.com/gcp64/AI-BOB.git
    cd AI-BOB

  2. Install Requirements:

    pip install -r requirements.txt

    Note: If you have an NVIDIA GPU, reinstall llama-cpp-python with CUDA support to enable GPU acceleration (PowerShell syntax shown):

    $env:CMAKE_ARGS="-DGGML_CUDA=on"
    pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

  3. Deploy the LLM weights (GGUF):

    • Download a compatible model in GGUF format.
    • Recommended Model: Google Gemma 4 E4B Instruct (Q4_K_M GGUF). It can be downloaded from the Unsloth HuggingFace repository: unsloth/gemma-4-E4B-it-GGUF.
    • Save the downloaded .gguf file to the model directory: D:\mrbob_AI_Models\models\ or directly in the project folder under models/.

  4. Launch the Application:

    python main.py
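
If the application starts without finding a model, a quick check like the one below can confirm that a .gguf file sits in a directory the loader searches. This is a standalone sketch, not part of the project: find_gguf_models is a hypothetical helper, and the demo uses a temporary directory in place of the real model folders.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_gguf_models(search_dirs: Iterable[str]) -> List[Path]:
    """Return .gguf files from each existing directory, in priority order."""
    found: List[Path] = []
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # e.g. the D: drive path does not exist on this machine
        found.extend(sorted((p for p in root.glob("*.gguf") if p.is_file()),
                            key=lambda p: p.name.lower()))
    return found


# Demo with a throwaway folder standing in for the real model directories.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.gguf").write_bytes(b"")
    (Path(tmp) / "a.gguf").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a model")
    names = [p.name for p in find_gguf_models([str(Path(tmp) / "missing"), tmp])]
```

On a real installation the list passed in would contain D:\mrbob_AI_Models\models\ followed by the project's models folder, matching the search order described in the installation steps.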

Building a Standalone Executable

To compile AI BOB into a single, console-free Windows executable file, run the custom build script:

python build_exe.py

This script invokes PyInstaller using AIBOB.spec to pack the application. It dynamically excludes heavy, unused GUI dependencies (like PyQt5 and PyQt6 shared libraries and DLLs) to ensure the final output binary size is as small as possible.


Contributing and Model Feedback

Collaborators are welcome to contribute to the project. Areas of focus include:

  • Model Benchmark Reports: If you discover a lightweight GGUF model (under 5GB) with excellent Arabic instruction-following capabilities, please open a discussion.
  • OS Automation Tools: Contributions of new system automation commands inside system_manager.py are highly appreciated.
  • Bug Reports: If you encounter interface lags, thread deadlocks, or command parsing issues, submit a detailed report.

Developed by Mr. Bob - Principal AI Architect, ranked 7th globally on GitHub.
