# SatcomLLM: Fully Local RAG Pipeline Demo

Welcome to the **Fully Local** SatcomLLM RAG Demo! This notebook demonstrates how to build a complete Retrieval-Augmented Generation (RAG) system that runs **entirely on your local machine** - no external APIs required.

## What's Different from the Cloud Version?

This local version replaces all cloud services with local alternatives:

| Component | Cloud Version | Local Version |
|-----------|---------------|---------------|
| **Embeddings** | DeepInfra API | sentence-transformers (local) |
| **Vector DB** | Qdrant Cloud | Qdrant Local (in-memory or file-based) |
| **LLM Inference** | RunPod vLLM API | Local transformers (TinyLlama or similar) |

## What You'll Learn

This notebook demonstrates how to build a RAG system that:
- Runs completely offline once the models have been downloaded
- Uses local embedding models for semantic search
- Stores vectors in a local Qdrant instance
- Generates answers using a small local LLM (TinyLlama)
- Works on Apple Silicon with MPS acceleration
- **Saves and reuses processed chunks** to avoid re-processing

## Notebook Structure

The demo is divided into 4 main parts:

- **Setup and LLM Fundamentals**: Install dependencies, configure local models, and understand core concepts like tokenization and prompting. You'll test local model inference to ensure everything works.

- **Theory around RAG**: Understand the principles of Retrieval-Augmented Generation (RAG), including how retrieval complements generative models.

- **Embedding Vector Database Creation**: Load and chunk markdown documents, generate embeddings using local models, create a local Qdrant vector database, and populate it with embedded document chunks. **Chunks are saved locally for reuse.**

- **Retrieval and RAG Pipeline**: Implement semantic search to find relevant documents, build the complete RAG pipeline that combines retrieval with local generation, and test the system with real questions.

## Technology Stack

- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2 or similar)
- **Vector Database**: Qdrant Local (in-memory or persistent)
- **LLM**: TinyLlama-1.1B-Chat-v1.0 (or any small local model)
- **Document Processing**: LangChain's header-based chunking
- **Chunk Storage**: Local JSON files for persistence

## Prerequisites

- Python 3.8 or higher
- Apple Silicon Mac (for MPS) or CPU/GPU setup
- ~2-4GB RAM for models
- The complete demo takes approximately 5-10 minutes to run

---


# PART 1: Setup & LLM Fundamentals

Before building our local RAG system, let's set up the environment and understand key LLM concepts.

## What's in This Section?

### 1. Dependency Installation
Install all required Python packages:
- `transformers` & `torch`: LLM libraries
- `sentence-transformers`: Local embedding models
- `langchain`: RAG orchestration  
- `qdrant-client`: Local vector database
- `python-dotenv`: Environment management

### 2. Local Model Configuration
Set up local models:
- **Embedding Model**: sentence-transformers (runs locally)
- **LLM Model**: TinyLlama or similar small model (runs locally)
- **Vector Database**: Qdrant local instance (in-memory or file-based)

### 3. LLM Concepts
Learn the fundamentals:
- **Tokenization**: Text → Numbers
- **Prompting**: Crafting instructions
- **Generation**: Controlling output

### 4. Local Inference Testing
Verify your local model setup works before building RAG.

---


### 1.1. Setup Instructions

The cell below will automatically:
1. **Create a virtual environment** (`./venv/`) if it doesn't exist
2. **Install all required packages** into the venv
3. **Add the venv to Python path** so packages are available in this notebook

**First run**: This will take 2-5 minutes to download and install packages.
**Subsequent runs**: This will be much faster (packages are already installed).

Run the cell below to set up the environment:


In [2]:
import subprocess
import sys
import os
from pathlib import Path
import shutil

# Define venv path
VENV_PATH = Path("./venv")
VENV_PYTHON = VENV_PATH / "bin" / "python"
VENV_PIP = VENV_PATH / "bin" / "pip"

def find_system_python():
    """Find a working system Python 3 to create the venv."""
    python_candidates = [
        "/usr/bin/python3",
        "/usr/local/bin/python3", 
        "/opt/homebrew/bin/python3",
        shutil.which("python3"),
    ]
    
    for python_path in python_candidates:
        if python_path and Path(python_path).exists():
            try:
                result = subprocess.run(
                    [python_path, "--version"], 
                    capture_output=True, text=True, timeout=5
                )
                if result.returncode == 0 and "Python 3" in result.stdout:
                    return python_path
            except Exception:
                continue
    return sys.executable

# Step 1: Check if venv exists and is valid
venv_valid = False
if VENV_PATH.exists():
    if VENV_PYTHON.exists():
        try:
            result = subprocess.run(
                [str(VENV_PYTHON), "--version"], 
                capture_output=True, text=True, timeout=5
            )
            venv_valid = result.returncode == 0
        except Exception:
            venv_valid = False
    
    if not venv_valid:
        print("Existing venv is broken, removing it...")
        shutil.rmtree(VENV_PATH, ignore_errors=True)

# Step 2: Create virtual environment if needed
if not VENV_PATH.exists():
    print("Creating virtual environment...")
    system_python = find_system_python()
    print(f"  Using Python: {system_python}")
    subprocess.run([system_python, "-m", "venv", str(VENV_PATH)], check=True)
    print(f"Virtual environment created at {VENV_PATH}")
else:
    print(f"Virtual environment already exists at {VENV_PATH}")

# Step 3: Upgrade pip in venv
print("\nUpgrading pip...")
subprocess.run([str(VENV_PIP), "install", "--upgrade", "pip", "-q"], check=True)

# Step 4: Install required packages
print("\nInstalling required packages (this may take a few minutes on first run)...")
packages = [
    "torch",
    "transformers",
    "accelerate",
    "sentence-transformers",
    "qdrant-client",
    "langchain",
    "langchain-community",
    "langchain-text-splitters",
    "tqdm",
    "python-dotenv"
]

subprocess.run([str(VENV_PIP), "install", "-q"] + packages, check=True)
print("All packages installed successfully!")

# Step 5: Add venv to Python path for this notebook session
venv_site_packages = None
for python_dir in (VENV_PATH / "lib").glob("python*"):
    sp = python_dir / "site-packages"
    if sp.exists():
        venv_site_packages = sp
        break

if venv_site_packages and str(venv_site_packages) not in sys.path:
    sys.path.insert(0, str(venv_site_packages))
    print(f"Added {venv_site_packages} to Python path")

print("\nEnvironment setup complete! You can now run the following cells.")

Virtual environment already exists at venv

Upgrading pip...

Installing required packages (this may take a few minutes on first run)...
All packages installed successfully!

Environment setup complete! You can now run the following cells.


In [3]:
# Ensure venv is in path (in case this cell is run after kernel restart)
import sys
from pathlib import Path

VENV_PATH = Path("./venv")
for python_dir in (VENV_PATH / "lib").glob("python*"):
    sp = python_dir / "site-packages"
    if sp.exists() and str(sp) not in sys.path:
        sys.path.insert(0, str(sp))

# Verify imports work correctly
try:
    import transformers
    import torch
    import sentence_transformers
    import langchain
    import qdrant_client
    import json
    print("All required packages are installed correctly")
    print(f"PyTorch version: {torch.__version__}")
    print(f"Transformers version: {transformers.__version__}")
    
    # Check MPS availability
    if torch.backends.mps.is_available():
        print("MPS (Apple Silicon GPU) is available and will be used")
    else:
        print("MPS not available, will use CPU")
except ImportError as e:
    print(f"Missing package: {e}")
    print("\nPlease run the setup cell above first")
    print("This will create a virtual environment and install all packages.")




All required packages are installed correctly
PyTorch version: 2.8.0
Transformers version: 4.57.3
MPS (Apple Silicon GPU) is available and will be used


### 1.2 Configure Local Models

For this local demo, we'll use:

1. **Embedding Model**: `all-MiniLM-L6-v2` (80MB, fast, good quality)
   - Alternative: `all-mpnet-base-v2` (420MB, better quality)
   
2. **LLM Model**: `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (2.2GB, fast inference)
   - Alternative: Any small model that fits in your RAM

3. **Vector Database**: Qdrant Local (in-memory or file-based)

4. **Chunk Storage**: Local JSON files in `./chunks_cache/` folder

No API keys needed! Everything runs locally.


### Understanding Local Storage Locations

Before we proceed, it's important to understand **where everything is stored** on your local machine:

#### 1. **Model Storage (HuggingFace Cache)**

When you load models using `transformers` or `sentence-transformers`, they are automatically downloaded and cached locally:

- **Default Location**: `~/.cache/huggingface/hub/` (on macOS/Linux) or `C:\Users\<username>\.cache\huggingface\hub\` (on Windows)
- **What's Stored**:
  - **LLM Models**: TinyLlama (~2.2GB) stored in `models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/`
  - **Embedding Models**: all-MiniLM-L6-v2 (~80MB) stored in `models--sentence-transformers--all-MiniLM-L6-v2/` (older sentence-transformers releases may use a different folder)
  - **Tokenizers**: Vocabulary files and tokenizer configs

- **First Run**: Models are downloaded from HuggingFace Hub (requires internet)
- **Subsequent Runs**: Models are loaded from cache (no download needed)
- **Disk Space**: ~2.3GB total for both models

You can check your cache location:
```python
from huggingface_hub.constants import HF_HUB_CACHE
print(f"Models cached at: {HF_HUB_CACHE}")
```

#### 2. **Vector Database Storage (Qdrant)**

The vector database can be stored in two ways:

- **In-Memory** (Current Setup): 
  - Location: RAM only
  - **Pros**: Fast, no disk I/O
  - **Cons**: Data is lost when notebook restarts
  - Use: `QdrantClient(":memory:")`

- **Persistent Storage** (Recommended for Production):
  - Location: `./qdrant_db/` (or any path you specify)
  - **Pros**: Data persists between sessions
  - **Cons**: Slightly slower, uses disk space
  - Use: `QdrantClient(path="./qdrant_db")`
  - **Disk Space**: ~10-50MB per 1000 document chunks (depends on embedding dimension)

#### 3. **Chunk Storage (NEW - Local JSON Files)**

Processed document chunks are now saved locally to avoid re-processing:

- **Location**: `./chunks_cache/` folder
- **Format**: JSON files (one per document)
- **What's Stored**: 
  - Chunk text content
  - Metadata (headers, source file, chunk IDs)
  - Chunk size information
- **Benefits**: 
  - Skip re-chunking on subsequent runs
  - Fast loading of pre-processed chunks
  - Easy to inspect and debug
- **Disk Space**: ~1-5MB per 1000 chunks (text only, no embeddings)

#### 4. **Document Storage**

- **Source Documents**: Your markdown files in `data/` folder
- **Chunked Documents**: Stored in Qdrant payload (along with embeddings) AND saved as JSON files
- **Metadata**: Document headers, source files, chunk IDs stored in both places

#### Summary of Storage Locations:

```
Your Project/
├── data/                          # Your source markdown files
│   └── *.md
├── chunks_cache/                  # Saved processed chunks (NEW!)
│   └── <document_name>_chunks.json
└── qdrant_db/                     # Vector database (if using persistent storage)
    └── collections/
        └── satcom_rag_local/

~/.cache/huggingface/hub/          # Model cache (automatic, outside the project)
├── models--TinyLlama--.../        # LLM model (~2.2GB)
└── models--sentence-transformers--.../  # Embedding model (~80MB)
```

**Tip**: To free up space, you can delete the HuggingFace cache, but models will need to be re-downloaded on next use.
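
Before deleting anything, it can help to see how much space each location actually uses. The sketch below is a minimal helper that assumes the default paths listed above; `dir_size_bytes` is a hypothetical name, not part of any library. It skips symlinks because the HuggingFace cache links snapshot files to shared blobs, and counting both would inflate the total.

```python
import os
from pathlib import Path

def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            # Skip symlinks so files referenced through links are not counted twice.
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total

# Paths assumed from the layout above; adjust them if your setup differs.
locations = {
    "HuggingFace cache": Path.home() / ".cache" / "huggingface",
    "Chunk cache": Path("./chunks_cache"),
    "Qdrant storage": Path("./qdrant_db"),
}

for label, path in locations.items():
    print(f"{label}: {dir_size_bytes(path) / 1e6:.1f} MB ({path})")
```

A location that does not exist yet simply reports 0.0 MB.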


In [4]:
# Ensure venv is in path
import sys
from pathlib import Path

VENV_PATH = Path("./venv")
for python_dir in (VENV_PATH / "lib").glob("python*"):
    sp = python_dir / "site-packages"
    if sp.exists() and str(sp) not in sys.path:
        sys.path.insert(0, str(sp))

import os
import torch

# Configuration for local models
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # Small, fast embedding model
LLM_MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Small LLM

# Storage paths
CHUNKS_CACHE_DIR = Path("./chunks_cache")
CHUNKS_CACHE_DIR.mkdir(exist_ok=True)  # Create chunks cache directory

# Device configuration
if torch.backends.mps.is_available():
    DEVICE = "mps"
    print("Using MPS (Apple Silicon GPU)")
elif torch.cuda.is_available():
    DEVICE = "cuda"
    print("Using CUDA (NVIDIA GPU)")
else:
    DEVICE = "cpu"
    print("Using CPU")

print(f"\nConfiguration:")
print(f"  Embedding Model: {EMBEDDING_MODEL_NAME}")
print(f"  LLM Model: {LLM_MODEL_NAME}")
print(f"  Device: {DEVICE}")
print(f"  Chunks Cache: {CHUNKS_CACHE_DIR}")
print(f"\nAll models will run locally - no API calls needed!")


Using MPS (Apple Silicon GPU)

Configuration:
  Embedding Model: all-MiniLM-L6-v2
  LLM Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  Device: mps
  Chunks Cache: chunks_cache

All models will run locally - no API calls needed!


## 1.3 Tokenization: Breaking Text into Tokens

Before a language model can process text, the text must be converted into a form the model can understand. This is why **tokenization** is essential: it transforms raw text into numerical units that the model can operate on.

Let's see tokenization in action with our local model:


In [5]:
from transformers import AutoTokenizer

# Load tokenizer for TinyLlama (will be downloaded on first run)
print(f"Loading tokenizer for {LLM_MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL_NAME)

# Example text about satellite communications
text = "Satellite communications enable global connectivity through geostationary and low Earth orbit satellites."

# Tokenize the text
encoded = tokenizer(
    text,
    return_offsets_mapping=True,
    return_tensors="pt",
    add_special_tokens=True
)

# Convert token IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])

print(f"\nOriginal text: {text}\n")
print(f"Number of tokens: {len(tokens)}\n")
print(f"Tokens: {tokens}\n")
print(f"Token IDs: {encoded['input_ids'][0].tolist()}")


Loading tokenizer for TinyLlama/TinyLlama-1.1B-Chat-v1.0...

Original text: Satellite communications enable global connectivity through geostationary and low Earth orbit satellites.

Number of tokens: 22

Tokens: ['<s>', '▁Sat', 'ellite', '▁communic', 'ations', '▁enable', '▁global', '▁connect', 'ivity', '▁through', '▁ge', 'ost', 'ation', 'ary', '▁and', '▁low', '▁Earth', '▁orbit', '▁sat', 'ell', 'ites', '.']

Token IDs: [1, 12178, 20911, 7212, 800, 9025, 5534, 4511, 2068, 1549, 1737, 520, 362, 653, 322, 4482, 11563, 16980, 3290, 514, 3246, 29889]
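
To make the text-to-numbers mapping more concrete, here is a toy tokenizer that splits text by greedy longest match against a tiny, hypothetical vocabulary. It is only a sketch of the general idea of subword tokenization: TinyLlama's real tokenizer uses a learned vocabulary of tens of thousands of entries and a different algorithm, and `toy_tokenize` and `toy_vocab` are made up for this illustration.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a fixed vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until a vocab entry matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            # Nothing matches, not even a single character: emit the unknown token.
            ids.append(vocab["<unk>"])
            pos += 1
    return ids

# Hypothetical vocabulary, for illustration only.
toy_vocab = {"<unk>": 0, "sat": 1, "ell": 2, "ite": 3, "ites": 4, "s": 5, " ": 6, "orbit": 7}

print(toy_tokenize("satellites orbit", toy_vocab))  # → [1, 2, 4, 6, 7]
```

Note how "satellites" breaks into `sat`, `ell`, `ites`, mirroring the `▁sat`, `ell`, `ites` pieces in the real output above.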


## 1.4 Local LLM Inference

Now let's set up and test local LLM inference. We'll create a simple function to generate text using our local model.


### Model Loading Process Explained

When you run the cell below, here's what happens:

1. **Check Cache**: First, the code checks if the model exists in `~/.cache/huggingface/`
2. **Download (if needed)**: If not cached, downloads from HuggingFace Hub
   - TinyLlama: ~2.2GB download (one-time)
   - Progress bar shows download status
3. **Load to Memory**: Model weights loaded into RAM/GPU memory
4. **Device Assignment**: Model moved to MPS (Apple Silicon), CUDA (NVIDIA), or CPU

**Memory Usage**:
- **Model Size on Disk**: ~2.2GB (16-bit weights)
- **Model Size in RAM**: ~2.2GB in float16, ~4.4GB in float32
- **With MPS**: Weights live in Apple Silicon's unified memory, which the GPU shares with the CPU, so the model still counts against system memory (inference is faster than on CPU)

**First Run**: Expect 2-5 minutes for download (depends on internet speed)
**Subsequent Runs**: Expect 10-30 seconds for loading from cache
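
The memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below covers the weights only and assumes roughly 1.1 billion parameters, as the model name suggests; activations and the generation cache add to this at runtime. `model_memory_gb` is a hypothetical helper name.

```python
def model_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

TINYLLAMA_PARAMS = 1.1e9  # approximate parameter count implied by the model name

for dtype_name, bytes_per_param in [("float16", 2), ("float32", 4)]:
    print(f"{dtype_name}: ~{model_memory_gb(TINYLLAMA_PARAMS, bytes_per_param):.1f} GB")
```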


In [6]:
from transformers import AutoModelForCausalLM, pipeline

# Load local LLM model
print(f"Loading local LLM model: {LLM_MODEL_NAME}")
print("This may take a moment on first run (downloading model)...\n")

# Half precision on GPU backends (MPS/CUDA), full precision on CPU.
# `dtype` replaces the deprecated `torch_dtype` argument in recent
# transformers releases; on older releases, pass `torch_dtype` instead.
local_llm_model = AutoModelForCausalLM.from_pretrained(
    LLM_MODEL_NAME,
    dtype=torch.float16 if DEVICE != "cpu" else torch.float32,
)

# Move the whole model to the selected device ("mps", "cuda", or "cpu")
local_llm_model = local_llm_model.to(DEVICE)

# Create text generation pipeline on the same device as the model
local_text_generator = pipeline(
    "text-generation",
    model=local_llm_model,
    tokenizer=tokenizer,
    device=torch.device(DEVICE)
)

print("Local LLM model loaded successfully!")


Loading local LLM model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
This may take a moment on first run (downloading model)...



Device set to use mps


Local LLM model loaded successfully!


In [7]:
def chat_with_local_llm(prompt, max_new_tokens=256, temperature=0.2, top_p=0.9):
    """
    Generate text using local LLM.
    
    Args:
        prompt: Input text prompt
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        top_p: Top-p sampling parameter
    
    Returns:
        Generated text
    """
    # Format the prompt with the chat template that ships with the tokenizer,
    # so role markers and end-of-turn tokens match what the model was trained on
    messages = [{"role": "user", "content": prompt}]
    formatted_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Generate; return_full_text=False returns only the newly generated text
    outputs = local_text_generator(
        formatted_prompt,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        return_full_text=False
    )
    
    # Extract the answer
    answer = outputs[0]["generated_text"].strip()
    
    return answer

# Test local LLM
print("Testing local LLM inference...\n")
test_prompt = "What is satellite communications in one sentence?"
print(f"Question: {test_prompt}\n")

response = chat_with_local_llm(test_prompt, max_new_tokens=128, temperature=0.7)
print(f"Response: {response}\n")
print("Local LLM inference working correctly!")


Testing local LLM inference...

Question: What is satellite communications in one sentence?

Response: Satellite communications refers to the use of space-based communication systems to send and receive data over long distances.

Local LLM inference working correctly!
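
The `temperature` and `top_p` arguments shape how the next token is sampled. The pure-Python sketch below illustrates the two ideas on a hypothetical set of four token scores. It is a simplified illustration, not the transformers implementation, and the helper names are made up for this example.

```python
import math

def apply_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens

for temperature in (0.2, 0.7):
    probs = apply_temperature(toy_logits, temperature)
    print(f"temperature={temperature}: top token probability {probs[0]:.3f}")

print(top_p_filter(apply_temperature(toy_logits, 0.7), top_p=0.9))
```

At the lower temperature almost all probability mass sits on the top token, which is why low values give more deterministic answers. Top-p then discards the unlikely tail before sampling.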


---

# PART 2: Theory Around RAG

Now that the local model is working, let's look at how to implement RAG with **custom documents**.

## 2.1. The Problem with Standard LLMs

Large Language Models have limitations:

| Issue | Description | Impact |
|-------|-------------|--------|
| **Hallucinations** | Generate plausible but false info | Unreliable answers |
| **Knowledge Cutoff** | Training data has a date limit | Outdated information |
| **No Source** | Can't cite where info came from | Unverifiable |
| **Generic** | Lack domain-specific expertise | Poor specialized answers |
| **Static** | Can't update without retraining | Expensive to maintain |

## How RAG Solves These Problems

**Retrieval-Augmented Generation** adds a knowledge retrieval step:

```
Standard LLM:
Question → LLM → Answer (may hallucinate)

RAG System:
Question → Find Relevant Docs → LLM + Context → Grounded Answer
```

### Key Benefits:

- Grounded: Answers based on actual documents  
- Verifiable: Shows sources used  
- Up-to-date: Update docs without retraining model  
- Domain-specific: Add specialized knowledge  
- Cost-effective: Cheaper than fine-tuning  

## 2.2 RAG Architecture

### Two Main Phases:

#### Phase 1: Indexing (One-Time Setup)
```
Documents → Clean → Chunk → Embed → Store in Vector DB
```

#### Phase 2: Query (Every Question)
```
Question → Embed → Search Vector DB → Retrieve Docs → 
    Augment Prompt → LLM → Answer
```
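
The two phases can be sketched in a few lines of plain Python. This toy version uses word-count vectors as a stand-in for learned embeddings and a Python list as a stand-in for the vector database, so it only illustrates the flow. The `embed`, `cosine`, and `retrieve` helpers and the sample chunks are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1 (indexing): embed each chunk once and keep (vector, text) pairs.
toy_chunks = [
    "geostationary satellites orbit at roughly 35786 km altitude",
    "low earth orbit satellites have lower latency",
    "markdown headers split documents into sections",
]
index = [(embed(chunk), chunk) for chunk in toy_chunks]

def retrieve(question, k=2):
    """Phase 2 (query): embed the question and return the k most similar chunks."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _vec, text in ranked[:k]]

print(retrieve("which satellites have lower latency", k=1))
```

A learned embedding model would also match paraphrases that share no words with the chunk, which word counts cannot do.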

## 2.3 Architecture Overview

```
┌─────────────┐
│   User      │ Asks a question
│  Question   │
└──────┬──────┘
       ↓
┌──────────────────────────────────────────┐
│  1. EMBED QUERY (sentence-transformers)  │
│     Convert question → vector            │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  2. SEMANTIC SEARCH (Local Qdrant)       │
│     Find top-k similar document chunks   │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  3. RETRIEVE CONTEXT                     │
│     Get text + metadata from matches     │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  4. ENRICH PROMPT                        │
│     Question + Retrieved Context         │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  5. GENERATE ANSWER (Local LLM)          │
│     TinyLlama produces grounded response │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────┐
│   Answer     │ With source attribution
│ + Sources    │
└──────────────┘
```
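
Step 4 in the diagram is mostly string formatting. The sketch below shows one plausible way to combine retrieved chunks with the question. The prompt wording, the `build_rag_prompt` name, and the dictionary keys are assumptions made for illustration, not a fixed format.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the question into a single grounded prompt."""
    context_blocks = []
    for number, chunk in enumerate(retrieved_chunks, start=1):
        # Number each source so the answer can cite it.
        context_blocks.append(f"[{number}] ({chunk['source_file']})\n{chunk['text']}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks: text plus metadata, as a vector search might return them.
example_chunks = [
    {"source_file": "doc_a.md", "text": "LEO satellites offer lower latency than GEO satellites."},
    {"source_file": "doc_b.md", "text": "GEO satellites stay fixed relative to the ground."},
]

print(build_rag_prompt("Which orbit gives lower latency?", example_chunks))
```

Numbering the sources keeps attribution simple: the final answer can be shown next to the list of files that supplied the context.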

Let's start building!

---


# PART 3: Embedding & Vector Database

In this part, we'll build the knowledge base using **local** components:

1. **Load documents** from local files (Markdown).
2. **Chunk the documents** intelligently to preserve semantic structure.
3. **Save chunks locally** (NEW!) to avoid re-processing.
4. **Generate embeddings** for each chunk using local sentence-transformers.
5. **Upload the embeddings to a local Qdrant database**, making them ready for retrieval.
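
One common way to preserve context across chunk boundaries is to let neighbouring chunks overlap. The sketch below shows the idea with a plain character-level sliding window. It is a simplified stand-in, not the header-based splitting used in this notebook, and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
# → abcdefghij
# → ghijklmnop
# → mnopqrstuv
# → stuvwxyz
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.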

## 3.1. Document Loading and Chunking

Let's load and chunk our documents. **Chunks will be saved locally for reuse!**


In [8]:
import os
from pathlib import Path
import json

# Load all markdown documents from the data folder
data_folder = Path('data')

# Check if data folder exists
if not data_folder.exists():
    print(f"Data folder not found: {data_folder}")
    print("Creating sample data folder...")
    data_folder.mkdir(exist_ok=True)
    print("Please add your markdown files to the 'data' folder")
    documents = {}
else:
    # Find all markdown files
    markdown_files = list(data_folder.glob('*.md')) + list(data_folder.glob('*.markdown'))
    
    if not markdown_files:
        print(f"No markdown files found in {data_folder}")
        print("Please add markdown files to the 'data' folder")
        documents = {}
    else:
        print(f"Found {len(markdown_files)} markdown file(s) in '{data_folder}':")
        for f in markdown_files:
            print(f"  - {f.name}")
        
        # Load all documents
        documents = {}
        total_chars = 0
        
        for file_path in markdown_files:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
                documents[file_path.name] = content
                total_chars += len(content)
                print(f"\n{file_path.name}: {len(content)} characters")
        
        print(f"\nTotal characters across all documents: {total_chars}")
        if documents:
            print(f"\nPreview of first document:")
            first_doc = list(documents.values())[0]
            print(first_doc[:500])


Found 4 markdown file(s) in 'data':
  - dace33f5-f959-4955-bd68-00229c97599e.md
  - ecf0399d-1463-405b-99a3-6878f4828bd7.md
  - db5563aa-30f9-4b6d-a9de-f8a22f5d30f4.md
  - ea51d98e-f644-4fbb-8558-a661ce56cd9e.md

dace33f5-f959-4955-bd68-00229c97599e.md: 24472 characters

ecf0399d-1463-405b-99a3-6878f4828bd7.md: 59954 characters

db5563aa-30f9-4b6d-a9de-f8a22f5d30f4.md: 16223 characters

ea51d98e-f644-4fbb-8558-a661ce56cd9e.md: 21191 characters

Total characters across all documents: 121840

Preview of first document:
# Analyzing Multispectral Satellite Imagery of South American Wildfires Using Deep Learning

[PERSON]

_Monta Vista High School_

Cupertino, CA, United States

###### Abstract

Since frequent severe droughts are lengthening the dry season in the Amazon Rainforest, it is important to detect wildfires promptly and forecast possible spread for effective suppression response. Current wildfire detection models are not versatile enough for the low-technology conditions of South 

In [9]:
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def chunk_markdown_document(markdown_text, max_chunk_size=1000, chunk_overlap=200):
    """
    Chunk markdown document using header-based splitting.
    
    Args:
        markdown_text: Markdown formatted text
        max_chunk_size: Maximum chunk size in characters
        chunk_overlap: Overlap between chunks for context preservation
    
    Returns:
        List of Document objects with content and metadata
    """
    # Define headers to split on
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
        ("####", "Header 4"),
    ]
    
    # Initialize markdown splitter
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on
    )
    
    # Split by headers
    md_header_splits = markdown_splitter.split_text(markdown_text)
    
    # Further split large chunks if needed
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    
    # Process each header-based chunk
    all_chunks = []
    for idx, doc in enumerate(md_header_splits):
        # If chunk is too large, split further
        if len(doc.page_content) > max_chunk_size:
            sub_chunks = text_splitter.split_text(doc.page_content)
            for sub_idx, sub_chunk in enumerate(sub_chunks):
                chunk_doc = Document(
                    page_content=sub_chunk,
                    metadata={
                        **doc.metadata,
                        'chunk_id': f"{idx}_{sub_idx}",
                        'chunk_size': len(sub_chunk)
                    }
                )
                all_chunks.append(chunk_doc)
        else:
            doc.metadata['chunk_id'] = str(idx)
            doc.metadata['chunk_size'] = len(doc.page_content)
            all_chunks.append(doc)
    
    return all_chunks

def save_chunks_to_file(chunks, filename):
    """
    Save chunks to a JSON file for later reuse.
    
    Args:
        chunks: List of Document objects
        filename: Name of the file to save to
    """
    chunks_data = []
    for chunk in chunks:
        chunks_data.append({
            'page_content': chunk.page_content,
            'metadata': chunk.metadata
        })
    
    filepath = CHUNKS_CACHE_DIR / filename
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(chunks_data, f, indent=2, ensure_ascii=False)
    
    return filepath

def load_chunks_from_file(filename):
    """
    Load chunks from a saved JSON file.
    
    Args:
        filename: Name of the file to load from
    
    Returns:
        List of Document objects
    """
    filepath = CHUNKS_CACHE_DIR / filename
    if not filepath.exists():
        return None
    
    with open(filepath, 'r', encoding='utf-8') as f:
        chunks_data = json.load(f)
    
    chunks = []
    for chunk_data in chunks_data:
        chunk = Document(
            page_content=chunk_data['page_content'],
            metadata=chunk_data['metadata']
        )
        chunks.append(chunk)
    
    return chunks

# Chunk all documents (with caching!)
if 'documents' in locals() and documents:
    all_chunks = []
    
    for filename, markdown_text in documents.items():
        # Check if chunks are already cached
        cache_filename = f"{Path(filename).stem}_chunks.json"
        cached_chunks = load_chunks_from_file(cache_filename)
        
        if cached_chunks:
            print(f"\n✓ Loading cached chunks for {filename} ({len(cached_chunks)} chunks)")
            all_chunks.extend(cached_chunks)
        else:
            print(f"\nProcessing {filename}...")
            doc_chunks = chunk_markdown_document(markdown_text)
            
            # Add source filename to metadata
            for chunk in doc_chunks:
                chunk.metadata['source_file'] = filename
            
            # Save chunks to cache
            saved_path = save_chunks_to_file(doc_chunks, cache_filename)
            print(f"  Generated {len(doc_chunks)} chunks")
            print(f"  ✓ Saved to {saved_path}")
            
            all_chunks.extend(doc_chunks)
    
    print(f"\n{'='*60}")
    print(f"Total chunks across all documents: {len(all_chunks)}")
    print(f"{'='*60}")
    
    # Show sample chunks
    if all_chunks:
        print("\nSample chunks:")
        for i, chunk in enumerate(all_chunks[:3]):
            print(f"\n--- Chunk {i+1} ---")
            print(f"Source: {chunk.metadata.get('source_file', 'unknown')}")
            print(f"Headers: {chunk.metadata.get('Header 1', '')} > {chunk.metadata.get('Header 2', '')}")
            print(f"Content preview: {chunk.page_content[:150]}...")
    
    # Rename for consistency
    chunks = all_chunks
else:
    print("⚠ No documents loaded. Please add markdown files to the 'data' folder.")
    chunks = []



Processing dace33f5-f959-4955-bd68-00229c97599e.md...
  Generated 40 chunks
  ✓ Saved to chunks_cache/dace33f5-f959-4955-bd68-00229c97599e_chunks.json

Processing ecf0399d-1463-405b-99a3-6878f4828bd7.md...
  Generated 86 chunks
  ✓ Saved to chunks_cache/ecf0399d-1463-405b-99a3-6878f4828bd7_chunks.json

Processing db5563aa-30f9-4b6d-a9de-f8a22f5d30f4.md...
  Generated 22 chunks
  ✓ Saved to chunks_cache/db5563aa-30f9-4b6d-a9de-f8a22f5d30f4_chunks.json

Processing ea51d98e-f644-4fbb-8558-a661ce56cd9e.md...
  Generated 31 chunks
  ✓ Saved to chunks_cache/ea51d98e-f644-4fbb-8558-a661ce56cd9e_chunks.json

Total chunks across all documents: 179

Sample chunks:

--- Chunk 1 ---
Source: dace33f5-f959-4955-bd68-00229c97599e.md
Headers: Analyzing Multispectral Satellite Imagery of South American Wildfires Using Deep Learning > 
Content preview: [PERSON]  
_Monta Vista High School_  
Cupertino, CA, United States  
###### Abstract...

--- Chunk 2 ---
Source: dace33f5-f959-4955-bd68-00229c97599e.m

### Chunk Caching Explained

The chunking process now includes **local caching**:

#### How It Works:

1. **Check Cache**: Before processing, checks if chunks exist in `./chunks_cache/<filename>_chunks.json`
2. **Load from Cache**: If found, loads instantly (no re-processing!)
3. **Process & Save**: If not found, processes document and saves chunks to cache
4. **Reuse**: Next time you run the notebook, chunks load from cache

#### Benefits:

- **Fast**: Skip re-chunking on subsequent runs
- **Consistent**: Same chunks every time
- **Inspectable**: JSON files are human-readable
- **Portable**: Can share cache files between projects

#### Cache File Structure:

```json
[
  {
    "page_content": "Chunk text content...",
    "metadata": {
      "Header 1": "Section Title",
      "Header 2": "Subsection",
      "chunk_id": "0",
      "chunk_size": 856,
      "source_file": "document.md"
    }
  },
  ...
]
```

#### Managing Cache:

- **Clear cache**: Delete files from `./chunks_cache/` folder
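- **Force re-process**: Delete the specific cache file or the entire folder. Do this after editing a source document or changing the chunking parameters, because the cache is keyed by filename only and would otherwise return stale chunks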
- **Share cache**: Copy `./chunks_cache/` folder to another project
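
Because the cache lookup in this notebook depends only on the filename, edited documents keep returning old chunks until the cache file is deleted. One possible way to automate invalidation is to store a fingerprint of the document text and chunking parameters next to the cache. The sketch below assumes a sidecar `.sha256` file, and all helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def content_fingerprint(markdown_text, max_chunk_size, chunk_overlap):
    """Hash the document text together with the chunking parameters."""
    payload = json.dumps(
        {"text": markdown_text, "max_chunk_size": max_chunk_size, "chunk_overlap": chunk_overlap},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cache_is_fresh(fingerprint_path: Path, fingerprint: str) -> bool:
    """True when a stored fingerprint exists and matches the current one."""
    return fingerprint_path.exists() and fingerprint_path.read_text(encoding="utf-8") == fingerprint

def store_fingerprint(fingerprint_path: Path, fingerprint: str) -> None:
    """Write the fingerprint, creating the parent folder if needed."""
    fingerprint_path.parent.mkdir(parents=True, exist_ok=True)
    fingerprint_path.write_text(fingerprint, encoding="utf-8")

# Hypothetical sidecar file next to the cached chunks.
sidecar = Path("./chunks_cache") / "document_chunks.sha256"
fingerprint = content_fingerprint("# Title\nSome text", max_chunk_size=1000, chunk_overlap=200)

if cache_is_fresh(sidecar, fingerprint):
    print("Cache is up to date")
else:
    print("Document or parameters changed: re-chunk, then store the fingerprint")
    # store_fingerprint(sidecar, fingerprint)  # call this after saving the new chunks
```

Hashing the parameters along with the text means a change to `max_chunk_size` or `chunk_overlap` also triggers re-chunking.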


## 3.2 Initialize Local Embeddings

Now we'll set up our **local embedding model** using sentence-transformers. This runs entirely on your machine - no API calls!


### Embedding Model Storage

The embedding model (`all-MiniLM-L6-v2`) is stored similarly:

- **Cache Location**: typically `~/.cache/huggingface/hub/` with recent sentence-transformers releases (older releases used `~/.cache/torch/sentence_transformers/`)
- **Model Size**: ~80MB
- **First Download**: ~30 seconds
- **Subsequent Loads**: ~2-5 seconds

**Note on MPS**: MPS support in sentence-transformers depends on the installed version, so this notebook runs embeddings on CPU. This is usually fast enough because each document chunk is embedded only once.

**Embedding Dimensions**: 
- `all-MiniLM-L6-v2`: 384 dimensions
- Each document chunk → 384-dimensional vector
- Stored in Qdrant for similarity search


In [10]:
from sentence_transformers import SentenceTransformer

# Initialize local embedding model
print(f"Loading local embedding model: {EMBEDDING_MODEL_NAME}")
print("This may take a moment on first run (downloading model)...\n")

# Choose the embedding device explicitly: stay on CPU when the LLM uses MPS,
# otherwise reuse DEVICE (e.g. "cuda" or "cpu")
embedding_device = "cpu" if DEVICE == "mps" else DEVICE
embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=embedding_device)

if DEVICE == "mps":
    print("Note: Embeddings will use CPU (MPS support in sentence-transformers is limited)")

# Test embedding
sample_text = "The SatcomLLM pipeline generates synthetic QA pairs from documents"
sample_embedding = embedding_model.encode(sample_text)

print(f"✓ Local embedding model loaded successfully!")
print(f"✓ Model: {EMBEDDING_MODEL_NAME}")
print(f"✓ Embedding dimension: {len(sample_embedding)}")
print(f"✓ Sample embedding (first 10 values): {sample_embedding[:10]}")


Loading local embedding model: all-MiniLM-L6-v2
This may take a moment on first run (downloading model)...

Note: Embeddings will use CPU (MPS support in sentence-transformers is limited)
✓ Local embedding model loaded successfully!
✓ Model: all-MiniLM-L6-v2
✓ Embedding dimension: 384
✓ Sample embedding (first 10 values): [-0.10052735  0.00207785 -0.10086767  0.00358008 -0.09882836  0.01954412
 -0.10562673 -0.0159261   0.02350388  0.02032893]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## 3.3 Create Local Qdrant Vector Database

Time to set up our **local vector database** where we'll store all document embeddings.

### Collection Configuration

| Parameter | Value | Explanation |
|-----------|-------|-------------|
| **Name** | `satcom_rag_local` | Identifier for our knowledge base |
| **Vector Size** | 384 | Matches all-MiniLM-L6-v2 output |
| **Distance Metric** | Cosine | Best for normalized embeddings |
| **Storage** | Local (in-memory or file-based) | Runs entirely on your machine |
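To illustrate the "Cosine" row: cosine similarity ignores vector length, and for L2-normalized vectors it equals the plain dot product. The sketch below uses toy 3-dimensional vectors rather than real 384-dimensional embeddings, and the helper names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

cos = cosine_similarity(a, b)
# After normalization, the dot product alone gives the same value
dot_of_normalized = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
# Scaling one vector does not change the cosine similarity
cos_scaled = cosine_similarity([10 * x for x in a], b)

print(f"cosine={cos:.4f} dot(normalized)={dot_of_normalized:.4f} scaled={cos_scaled:.4f}")
```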


### Vector Database Storage Options

Qdrant's local mode can store data in two ways. The code cell below uses **persistent** file-based storage, and you can easily switch to **in-memory** storage for quick experiments:

#### Option 1: In-Memory (Fast but Temporary)
```python
qdrant_client = QdrantClient(":memory:")
```
- **Storage**: RAM only
- **Persistence**: Lost when notebook restarts
- **Speed**: Fastest (no disk I/O)
- **Use Case**: Quick experiments, testing

#### Option 2: Persistent File-Based (Current - Recommended for Production)
```python
qdrant_client = QdrantClient(path="./qdrant_db")
```
- **Storage**: Disk (`./qdrant_db/` folder)
- **Persistence**: Survives restarts
- **Speed**: Fast (local disk)
- **Use Case**: Production, reusing data between sessions

**To Switch to In-Memory Storage**:
1. Comment out the `QdrantClient(path="./qdrant_db")` line in the code cell below
2. Uncomment the `:memory:` line
3. Your vector database will live only in RAM and be rebuilt in each session

With persistent storage, the vector database is saved to the `./qdrant_db/` folder, and the next notebook run loads the existing data.
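One way to make the storage choice explicit, instead of commenting lines in and out, is a small helper that returns the constructor arguments. This is only a sketch: `qdrant_client_kwargs` is a hypothetical helper, not part of qdrant-client, and the notebook itself does not use it.

```python
def qdrant_client_kwargs(persistent, path="./qdrant_db"):
    """Return keyword arguments for QdrantClient for the chosen storage mode.

    Hypothetical helper: persistent=True selects file-based storage under
    `path`, persistent=False selects in-memory storage.
    """
    if persistent:
        return {"path": path}
    return {"location": ":memory:"}


print(qdrant_client_kwargs(persistent=True))
print(qdrant_client_kwargs(persistent=False))

# Usage (requires qdrant-client to be installed):
# from qdrant_client import QdrantClient
# qdrant_client = QdrantClient(**qdrant_client_kwargs(persistent=True))
```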

**Disk Space Estimate**:
- ~1000 chunks with 384-dim embeddings ≈ 10-20MB
- ~10,000 chunks ≈ 100-200MB
- Scales linearly with number of chunks
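As a rough lower bound behind these estimates, the raw vector data can be computed directly. The sketch assumes 4-byte float32 values per dimension; the remainder of the estimate is payload text, metadata, and storage overhead, which this hypothetical helper does not model.

```python
def raw_vector_bytes(num_chunks, dim=384, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (assumes float32 storage)."""
    return num_chunks * dim * bytes_per_value


for n in (179, 1_000, 10_000):
    mib = raw_vector_bytes(n) / 1024**2
    print(f"{n:>6} chunks: {mib:.2f} MiB of raw vectors")
```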


In [11]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Initialize local Qdrant client (persistent, file-based storage in ./qdrant_db)
print("Initializing local Qdrant vector database...")
qdrant_client = QdrantClient(path="./qdrant_db")
# For in-memory storage (fast, but not persistent), use instead:
# qdrant_client = QdrantClient(":memory:")

# Collection name
collection_name = "satcom_rag_local"

# Get embedding dimension
embedding_dim = len(embedding_model.encode("test"))

# Create the collection unless it already exists (e.g. from a previous run with persistent storage)
print(f"Creating collection: {collection_name}")
print(f"Vector dimension: {embedding_dim}")

if qdrant_client.collection_exists(collection_name):
    print(f"Collection '{collection_name}' already exists, using existing collection")
else:
    qdrant_client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE),
    )
    print(f"Collection '{collection_name}' created successfully!")


Initializing local Qdrant vector database...
Creating collection: satcom_rag_local
Vector dimension: 384
Collection 'satcom_rag_local' created successfully!


### Understanding the Embedding & Upload Process

When you run the cell below, here's the step-by-step process:

#### Step 1: Load Chunks (from cache or fresh)
- Input: Chunks from cache files OR freshly processed
- Process: Load from `./chunks_cache/` if available
- Output: List of Document objects

#### Step 2: Batch Embedding Generation
- Input: Text chunks (batch of 32 at a time)
- Process: Local embedding model converts text → vectors
- Output: NumPy arrays of 384-dimensional vectors
- **Storage**: Temporarily in RAM, then written to Qdrant

#### Step 3: Vector Storage
- Input: Embeddings + document text + metadata
- Process: Create Qdrant PointStruct objects
- Output: Points uploaded to Qdrant collection
- **Storage**: 
  - In-memory: RAM
  - Persistent: `./qdrant_db/` folder

#### Data Flow:
```
Cached Chunks → Embeddings (384-dim vectors) → Qdrant Points → Vector DB
```

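**Performance** (rough upper bounds; actual times depend on hardware and chunk length):
- Loading cached chunks: ~0.1-1 seconds
- Embedding 100 chunks: ~5-10 seconds
- Embedding 1000 chunks: ~1-2 minutes
- Upload to Qdrant: ~1-5 seconds (depends on storage type)
- In the run shown below, all 179 chunks were embedded in about one second

**Memory Usage During Processing**:
- Embeddings temporarily in RAM: ~1.5KB per chunk (384 float32 values × 4 bytes)
- Qdrant points in memory: ~10-20KB per chunk (vector plus text and metadata payload)
- Total for 1000 chunks: roughly 10-20MB RAM, in addition to the embedding model itself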
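The batching in Step 2 can be sketched in isolation. `iter_batches` is a hypothetical helper, and the strings stand in for real chunks; with 179 items and a batch size of 32 it yields six batches, the last one partial.

```python
def iter_batches(items, batch_size=32):
    """Yield (start_index, batch) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


# Stand-in data: 179 placeholder chunks, the count from this notebook's run
chunks_demo = [f"chunk {n}" for n in range(179)]
batches = list(iter_batches(chunks_demo))

print(f"{len(batches)} batches, last batch has {len(batches[-1][1])} chunks")
```

The start index is what lets each point fall back to a running chunk number when the metadata has no `chunk_id`.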


In [12]:
from tqdm import tqdm

# Upload chunks to local Qdrant with embeddings
if chunks:
    # Check if collection already has data (for persistent storage)
    collection_info = qdrant_client.get_collection(collection_name=collection_name)
    existing_points = collection_info.points_count
    
    if existing_points > 0:
        print(f"ℹ Collection '{collection_name}' already contains {existing_points} points.")
        print("  Skipping upload - using existing data.")
        print("  (To re-upload: delete ./qdrant_db/ folder and restart)")
        SKIP_UPLOAD = True
    else:
        SKIP_UPLOAD = False
    
    if not SKIP_UPLOAD:
        print(f"Embedding and uploading {len(chunks)} chunks to local Qdrant...")
        print("This may take a moment...\n")
        
        # Prepare points for batch upload
        points = []
        
        # Process chunks in batches
        batch_size = 32  # Larger batches for local processing
        for i in tqdm(range(0, len(chunks), batch_size), desc="Processing chunks"):
            batch_chunks = chunks[i:i+batch_size]
            
            # Get texts from batch
            texts = [chunk.page_content for chunk in batch_chunks]
            
            # Generate embeddings for batch (local, no API calls!)
            embeddings = embedding_model.encode(texts, show_progress_bar=False)
            
            # Create points
            for j, (chunk, embedding) in enumerate(zip(batch_chunks, embeddings)):
                point_id = str(uuid.uuid4())
                points.append(
                    PointStruct(
                        id=point_id,
                        vector=embedding.tolist(),
                        payload={
                            "text": chunk.page_content,
                            "metadata": chunk.metadata,
                            "chunk_id": chunk.metadata.get("chunk_id", f"{i+j}"),
                            "source": chunk.metadata.get("source_file")
                        }
                    )
                )
        
        # Upload all points to Qdrant
        print(f"\nUploading {len(points)} points to local Qdrant...")
        qdrant_client.upsert(
            collection_name=collection_name,
            points=points
        )
        
        print(f"✓ Successfully uploaded {len(points)} chunks to local Qdrant!")
    
    # Show final collection info
    collection_info = qdrant_client.get_collection(collection_name=collection_name)
    print(f"\nCollection info:")
    print(f"  - Points count: {collection_info.points_count}")
    print(f"  - Vector size: {collection_info.config.params.vectors.size}")
else:
    print("⚠ No chunks to upload. Please load documents first.")


Embedding and uploading 179 chunks to local Qdrant...
This may take a moment...



Processing chunks: 100%|██████████| 6/6 [00:00<00:00,  6.62it/s]



Uploading 179 points to local Qdrant...
✓ Successfully uploaded 179 chunks to local Qdrant!

Collection info:
  - Points count: 179
  - Vector size: 384


# Part 4: Retrieval, Prompt Enrichment, and Generation

In this part, we move from preparing the knowledge base to actively using it in a RAG workflow. Everything runs locally!

## 4.1 Implement Semantic Search

With our knowledge base populated, let's build the **search function** that retrieves the most relevant document chunks.


### How Semantic Search Works (Local)

Here's what happens when you search the knowledge base:

#### Step 1: Query Embedding
- **Input**: Your question (text string)
- **Process**: Local embedding model converts question → 384-dim vector
- **Location**: Generated in RAM, not stored
- **Time**: ~10-50ms

#### Step 2: Vector Similarity Search
- **Input**: Query vector (384 dimensions)
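- **Process**: Qdrant ranks the stored vectors by cosine similarity to the query vector
- **Algorithm**: A Qdrant server uses an Approximate Nearest Neighbor (ANN) index; the embedded local mode used here compares the query against the stored vectors directly, which is fast for a collection of this size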
- **Location**: Searches in Qdrant (RAM or disk, depending on storage type)
- **Time**: ~1-10ms per search (very fast!)

#### Step 3: Result Retrieval
- **Input**: Top-K similar vector IDs
- **Process**: Qdrant retrieves document text and metadata
- **Output**: List of documents with similarity scores
- **Storage**: Retrieved from Qdrant payload (text + metadata stored with each vector)

#### Data Flow:
```
Question → Embed (384-dim) → Search Qdrant → Retrieve Docs → Return Results
```

**Why It's Fast**:
- Embedding: Single vector generation (~10-50ms)
- Search: Optimized vector similarity search (~1-10ms)
- No network calls: Everything is local
- Total time: ~20-100ms per query
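Conceptually, Steps 2 and 3 amount to scoring every stored vector against the query and keeping the best `k`. The sketch below shows that idea with toy 3-dimensional vectors and made-up texts; it is not how Qdrant is implemented, and `top_k` is an illustrative name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, points, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in points]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "collection": (text, vector) pairs standing in for stored chunks
toy_points = [
    ("LEO satellite networks", [0.9, 0.1, 0.0]),
    ("Wildfire imagery", [0.1, 0.9, 0.1]),
    ("Radio map construction", [0.7, 0.2, 0.3]),
]

results = top_k([1.0, 0.0, 0.0], toy_points, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```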


In [13]:
def search_knowledge_base(query, top_k=3):
    """
    Search the knowledge base for relevant documents (fully local).
    
    Args:
        query: User's question
        top_k: Number of top results to return
    
    Returns:
        List of relevant documents with scores
    """
    # Embed the query (local, no API call!)
    query_embedding = embedding_model.encode(query)
    
    # Search in local Qdrant
    search_results = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embedding.tolist(),
        limit=top_k
    )
    
    # Format results
    results = []
    for result in search_results.points:
        results.append({
            "text": result.payload["text"],
            "metadata": result.payload["metadata"],
            "score": result.score
        })
    
    return results

# Test the search function
if chunks:
    test_query = "What is SatcomLLM?"
    print(f"Test Query: {test_query}\n")
    print("="*80)
    
    results = search_knowledge_base(test_query, top_k=3)
    
    for i, result in enumerate(results, 1):
        print(f"\n--- Result {i} (Score: {result['score']:.4f}) ---")
        print(f"Section: {result['metadata'].get('Header 1', '')} > {result['metadata'].get('Header 2', '')}")
        print(f"Text preview: {result['text'][:300]}...")
        
    print("\n" + "="*80)
    print("✓ Local search function working correctly!")
else:
    print("⚠ No chunks available for search. Please load documents first.")


Test Query: What is SatcomLLM?


--- Result 1 (Score: 0.5673) ---
Section: Analyzing Multispectral Satellite Imagery of South American Wildfires Using Deep Learning > References
Text preview: Politi Marcello is a machine learning scientist and the leader of the SatComLLM project, funded by ESA under the ARTES program. Riccardo Corrente is a deep learning scientist at Pi School and played a key role in building both the large and small versions of SatComLLM as part of the Pi School team. ...

--- Result 2 (Score: 0.2614) ---
Section: Constructing 4D Radio Map in LEO Satellite Networks with Limited Samples > References
Text preview: * [1] [PERSON]. [PERSON], [PERSON], [PERSON], [PERSON], [PERSON], [PERSON], and [PERSON], \"Leo satellite networks assisted geo-distributed data processing,\" _IEEE Wireless Communications Letters_, 2024.
* [2] [PERSON], [PERSON], [PERSON], [PERSON], [PERSON], [PERSON], [PERSON], and [PERSON], \"Gra...

--- Result 3 (Score: 0.2393) ---
Section: Constructing 4

## 4.2 Complete Local RAG Pipeline

Now we bring everything together! This is the **main RAG function** that combines local retrieval with local generation.


### Complete RAG Pipeline: Data Flow & Storage

Here's the complete end-to-end process when you ask a question:

#### Phase 1: Retrieval (Local)
```
User Question
    ↓
[Embedding Model] → 384-dim vector (generated in RAM, not stored)
    ↓
[Qdrant Search] → Find top-K similar vectors (searches local DB)
    ↓
[Retrieve Documents] → Get text + metadata from Qdrant payload
    ↓
Retrieved Context (in RAM)
```

**Storage Involved**:
- Query embedding: Generated on-the-fly, not stored
- Vector search: Searches Qdrant (RAM or `./qdrant_db/`)
- Retrieved docs: Loaded into RAM temporarily

#### Phase 2: Prompt Construction (In-Memory)
```
Retrieved Context + User Question
    ↓
[Format Prompt] → Create enriched prompt string
    ↓
Enriched Prompt (in RAM)
```

**Storage**: All in RAM, temporary

#### Phase 3: Generation (Local LLM)
```
Enriched Prompt
    ↓
[Tokenization] → Convert text to token IDs (in RAM)
    ↓
[Local LLM Model] → Generate tokens (uses MPS/CPU/GPU memory)
    ↓
[Detokenization] → Convert tokens back to text
    ↓
Final Answer (returned to user)
```

**Storage Involved**:
- Model weights: Loaded in RAM/GPU memory (from `~/.cache/huggingface/`)
- Tokens: Generated in RAM
- Final answer: Returned as string

#### Complete Data Flow:
```
Question → Embed → Search Qdrant → Retrieve → Format Prompt → 
    Local LLM → Generate → Answer
```

**Total Storage Summary**:
- **Models**: ~2.3GB in `~/.cache/huggingface/` (persistent)
- **Chunks Cache**: ~1-5MB in `./chunks_cache/` (persistent)
- **Vector DB**: Variable in RAM or `./qdrant_db/` (persistent if file-based)
- **Runtime Memory**: ~2-4GB RAM for models + ~100-500MB for processing

**Performance**:
- Retrieval: ~20-100ms
- Prompt formatting: ~1-5ms
- LLM generation: ~1-5 seconds (depends on model size and max_tokens)
- **Total**: ~1-6 seconds per query
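To check these figures on your own machine, each phase can be wrapped in a small timer. This is a sketch: `timed` is a hypothetical helper, and the `time.sleep` calls are stand-ins for the real retrieval and generation calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}
with timed("retrieval", timings):
    time.sleep(0.01)  # stand-in for search_knowledge_base(question)
with timed("generation", timings):
    time.sleep(0.02)  # stand-in for chat_with_local_llm(prompt)

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1000:.1f} ms")
```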


In [14]:
def ask_rag_question(question, top_k=3, max_new_tokens=256, temperature=0.7, show_sources=True):
    """
    Complete local RAG pipeline: Retrieve relevant documents and generate answer using local LLM.
    
    Args:
        question: User's question
        top_k: Number of documents to retrieve
        max_new_tokens: Maximum tokens for generation
        temperature: Sampling temperature
        show_sources: Whether to display retrieved sources
    
    Returns:
        Generated answer
    """
    print(f"\n{'='*80}")
    print(f"Question: {question}")
    print(f"{'='*80}\n")
    
    # Step 1: Retrieve relevant documents (local)
    print("Searching local knowledge base...")
    retrieved_docs = search_knowledge_base(question, top_k=top_k)
    
    if not retrieved_docs:
        return "No relevant documents found in the knowledge base."
    
    if show_sources:
        print(f"\nRetrieved {len(retrieved_docs)} relevant documents:")
        for i, doc in enumerate(retrieved_docs, 1):
            headers = doc['metadata'].get('Header 1', '')
            if doc['metadata'].get('Header 2'):
                headers += f" > {doc['metadata'].get('Header 2')}"
            print(f"  [{i}] {headers} (score: {doc['score']:.3f})")
    
    # Step 2: Format context from retrieved documents
    context_parts = []
    for i, doc in enumerate(retrieved_docs, 1):
        headers = []
        if doc['metadata'].get('Header 1'):
            headers.append(doc['metadata']['Header 1'])
        if doc['metadata'].get('Header 2'):
            headers.append(doc['metadata']['Header 2'])
        
        section_info = " > ".join(headers) if headers else "General"
        context_parts.append(f"[Document {i}] {section_info}\n{doc['text']}\n")
    
    context = "\n".join(context_parts)
    
    # Step 3: Build enriched prompt
    enriched_prompt = f"""You are a helpful assistant specializing in satellite communications.

Use the following context from the documentation to answer the question accurately and concisely.
If the answer cannot be found in the context, say so clearly.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
    
    # Step 4: Generate answer using local LLM
    print("\nGenerating answer with local LLM...")
    answer = chat_with_local_llm(
        enriched_prompt, 
        max_new_tokens=max_new_tokens, 
        temperature=temperature
    )
    
    # Display answer
    print(f"\n{'─'*80}")
    print(f"ANSWER:")
    print(f"{'─'*80}")
    print(answer)
    print(f"\n{'='*80}\n")
    
    return answer

# Test the complete local RAG pipeline
if chunks:
    print("Testing Complete Local RAG Pipeline")
    print("="*80)
    
    # Test question
    ask_rag_question(
        "What is SatcomLLM?",
        top_k=3,
        temperature=0.7,
        max_new_tokens=256
    )
else:
    print("⚠ No chunks available. Please load documents first.")


Testing Complete Local RAG Pipeline

Question: What is SatcomLLM?

Searching local knowledge base...

Retrieved 3 relevant documents:
  [1] Analyzing Multispectral Satellite Imagery of South American Wildfires Using Deep Learning > References (score: 0.567)
  [2] Constructing 4D Radio Map in LEO Satellite Networks with Limited Samples > References (score: 0.261)
  [3] Constructing 4D Radio Map in LEO Satellite Networks with Limited Samples > References (score: 0.239)

Generating answer with local LLM...

────────────────────────────────────────────────────────────────────────────────
ANSWER:
────────────────────────────────────────────────────────────────────────────────
SatcomLLM is a machine learning project that uses deep learning techniques to analyze satellite imagery and develop models for satellite communications. The project aims to develop advanced models for satellite communications, combining domain-specific reasoning and question-answering capabilities. The team emphasizes 

## 4.3 Testing with Custom Questions

Let's test our local RAG system with several questions:


In [15]:
# Test with more questions
if chunks:
    test_questions = [
        "What is a LEO satellite?",
        "What is deep learning?",
        "What is Internet of Things?"
    ]
    
    for question in test_questions:
        ask_rag_question(question, top_k=3, temperature=0.7, max_new_tokens=256)
        print("\n" + "="*80 + "\n")
else:
    print("⚠ No chunks available. Please load documents first.")



Question: What is a LEO satellite?

Searching local knowledge base...

Retrieved 3 relevant documents:
  [1] Constructing 4D Radio Map in LEO Satellite Networks with Limited Samples > References (score: 0.509)
  [2] Constructing 4D Radio Map in LEO Satellite Networks with Limited Samples (score: 0.503)
  [3] Constructing 4D Radio Map in LEO Satellite Networks with Limited Samples > I Introduction (score: 0.496)

Generating answer with local LLM...

────────────────────────────────────────────────────────────────────────────────
ANSWER:
────────────────────────────────────────────────────────────────────────────────
A LEO satellite is a type of non-terrestrial network (NTN) that orbits at an altitude of less than 3500 km above Earth's surface.





Question: What is deep learning?

Searching local knowledge base...

Retrieved 3 relevant documents:
  [1] Analyzing Multispectral Satellite Imagery of South American Wildfires Using Deep Learning > III Methods (score: 0.447)
  [2] Construct

### Storage Management & Cleanup

#### Checking Your Storage Usage

You can check where models are stored and how much space they use:

In [16]:
import os
from pathlib import Path

# Get HuggingFace cache location (compatible with all versions)
try:
    from huggingface_hub.constants import HF_HUB_CACHE
    cache_path = Path(HF_HUB_CACHE)
except ImportError:
    # Fallback for different API versions
    cache_path = Path.home() / ".cache" / "huggingface" / "hub"

print(f"HuggingFace cache: {cache_path}")
print(f"Cache exists: {cache_path.exists()}")

if cache_path.exists():
    # Calculate total size
    total_size = sum(f.stat().st_size for f in cache_path.rglob('*') if f.is_file())
    print(f"Total cache size: {total_size / (1024**3):.2f} GB")

# Check chunks cache
chunks_cache_size = sum(f.stat().st_size for f in CHUNKS_CACHE_DIR.rglob('*') if f.is_file())
print(f"\nChunks cache: {CHUNKS_CACHE_DIR}")
print(f"Chunks cache size: {chunks_cache_size / (1024**2):.2f} MB")

HuggingFace cache: /Users/riccardocorrente/.cache/huggingface/hub
Cache exists: True
Total cache size: 4.27 GB

Chunks cache: chunks_cache
Chunks cache size: 0.18 MB


In [17]:
import os
import subprocess

import torch

# Check if MPS is available
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"MPS built: {torch.backends.mps.is_built()}")

# Get total system memory (Apple Silicon uses unified memory - shared between CPU and GPU)
total_ram = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024**3)
print(f"\nTotal System RAM (shared with GPU): {total_ram:.1f} GB")

# Check current memory allocation on MPS
if torch.backends.mps.is_available():
    # Current MPS memory allocated by PyTorch
    allocated = torch.mps.current_allocated_memory() / (1024**3)
    print(f"MPS memory currently allocated: {allocated:.2f} GB")
    
    # Driver-level memory allocated
    driver_allocated = torch.mps.driver_allocated_memory() / (1024**3)
    print(f"MPS driver allocated memory: {driver_allocated:.2f} GB")

# Get detailed system memory info using macOS command
print("\n--- System Memory Details ---")
result = subprocess.run(['vm_stat'], capture_output=True, text=True)
print(result.stdout)

MPS available: True
MPS built: True

Total System RAM (shared with GPU): 18.0 GB
MPS memory currently allocated: 2.14 GB
MPS driver allocated memory: 6.07 GB

--- System Memory Details ---
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                                5232.
Pages active:                            187914.
Pages inactive:                          186178.
Pages speculative:                          693.
Pages throttled:                              0.
Pages wired down:                        200909.
Pages purgeable:                            696.
"Translation faults":                 658593749.
Pages copy-on-write:                   11289929.
Pages zero filled:                    325555858.
Pages reactivated:                    138520233.
Pages purged:                          29368298.
File-backed pages:                       100936.
Anonymous pages:                         273849.
Pages stored in compressor:             1531014.
Pages occupied by



#### Managing Storage

**To Free Up Space**:
1. **Delete specific models**: Remove folders matching `~/.cache/huggingface/hub/models--*/`
2. **Clear entire cache**: Delete `~/.cache/huggingface/` (models will re-download when needed)
3. **Clear chunks cache**: Delete `./chunks_cache/` folder (chunks will be re-processed)
4. **Clear Qdrant DB**: Delete `./qdrant_db/` folder (if using persistent storage)
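The cleanup steps above can also be scripted. The sketch below uses hypothetical helpers (`dir_size_bytes`, `clear_dir`) and defaults to a dry run, so nothing is deleted unless `dry_run=False` is passed explicitly.

```python
import shutil
from pathlib import Path


def dir_size_bytes(path):
    """Total size of all files under path, or 0 if path does not exist."""
    path = Path(path)
    if not path.exists():
        return 0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def clear_dir(path, dry_run=True):
    """Return the bytes used by path; delete it recursively unless dry_run."""
    size = dir_size_bytes(path)
    if not dry_run and Path(path).exists():
        shutil.rmtree(path)
    return size


# Report what clearing the notebook's local folders would free (no deletion)
for folder in ("./chunks_cache", "./qdrant_db"):
    print(f"{folder}: {clear_dir(folder) / 1024**2:.2f} MB would be freed")
```

Close the Qdrant client (or restart the kernel) before deleting `./qdrant_db/`, since the running notebook may still hold the folder open.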

**To Persist Data Between Sessions**:
1. Use persistent Qdrant storage: `QdrantClient(path="./qdrant_db")`
2. Models are automatically cached by HuggingFace
3. Chunks are automatically cached in `./chunks_cache/`
4. Documents in `data/` folder are your source files

#### Recommended Setup for Production

```python
# Persistent Qdrant storage
qdrant_client = QdrantClient(path="./qdrant_db")

# Chunks automatically cached in ./chunks_cache/
# Models automatically cached by HuggingFace
# Documents in data/ folder
```

This way:
- Models persist (HuggingFace cache)
- Chunks persist (`./chunks_cache/`)
- Vector database persists (`./qdrant_db/`)
- Documents persist (`data/` folder)
- Everything survives notebook restarts


## Summary: Complete Local RAG Pipeline

Congratulations! You've built a complete **fully local** RAG system with:

### Components:
1. **Document Processing**: Markdown chunking with header-based splitting
2. **Chunk Caching**: Local JSON files for fast re-loading (NEW!)
3. **Embeddings**: Local sentence-transformers (all-MiniLM-L6-v2)
4. **Vector Database**: Local Qdrant (in-memory or file-based)
5. **LLM Generation**: Local TinyLlama model
6. **RAG Orchestration**: Custom pipeline combining retrieval and generation

### Key Advantages of Local Setup:
- **No API costs** - Everything runs on your machine
- **Privacy** - Your data never leaves your computer
- **Offline capable** - Works without internet once the models have been downloaded
- **Fast** - No network latency
- **Customizable** - Full control over models and parameters
- **Persistent chunks** - Skip re-processing with local cache (NEW!)

### Storage Locations Summary:

| Component | Location | Size | Persistent? |
|-----------|----------|------|-------------|
| **LLM Model** | `~/.cache/huggingface/hub/models--TinyLlama--.../` | ~2.2GB | Yes |
| **Embedding Model** | `~/.cache/huggingface/hub/` (recent sentence-transformers releases) | ~80MB | Yes |
| **Chunks Cache** | `./chunks_cache/*.json` | ~1-5MB | Yes |
| **Vector DB** | RAM or `./qdrant_db/` | ~10-200MB | If file-based |
| **Source Docs** | `./data/*.md` | Variable | Yes |

### Next Steps:
- Try different questions
- Adjust top_k for more/less context
- Experiment with temperature settings
- Add more documents to the knowledge base
- Switch to larger models if you have more RAM
- Use persistent Qdrant storage for larger datasets
- Share your chunks cache between projects


---