# SatcomLLM: Live Chat & RAG Pipeline Demo

Welcome to the SatcomLLM Live Demo! This interactive notebook teaches you how to build a complete, production-ready Retrieval-Augmented Generation (RAG) system for satellite communications domain knowledge.

## What You'll Learn

This notebook demonstrates how to build a RAG system that combines semantic search with large language models to create grounded, source-attributed responses. You'll learn to store documents in a vector database, retrieve relevant context based on user queries, and generate answers using a cloud-hosted LLM.

## Notebook Structure

The demo is divided into 4 main parts:

- **Setup and LLM Fundamentals**: install dependencies, configure API keys, and understand core concepts like tokenization and prompting. You'll test the connection to the RunPod vLLM endpoint to ensure everything works before building the RAG system.

- **Theory around RAG**: understand the principles of Retrieval-Augmented Generation (RAG), including how retrieval complements generative models by providing up-to-date and domain-specific context.

- **Embedding Vector Database Creation**: load and chunk markdown documents, generate embeddings using DeepInfra, create a Qdrant vector database collection, and populate it with embedded document chunks. This builds the knowledge base that powers semantic search.

- **Retrieval and RAG Pipeline**: implement semantic search to find relevant documents, build the complete RAG pipeline that combines retrieval with generation, and test the system with real questions. You'll see how retrieved context improves answer quality and enables source attribution.

## Technology Stack

This system uses DeepInfra for embeddings (2560-dimensional vectors from `Qwen/Qwen3-Embedding-4B`), Qdrant Cloud for vector storage and similarity search, and RunPod vLLM for fast inference with the SatcomLLM model. Documents are processed with LangChain's header-based chunking to preserve semantic structure.

## Prerequisites

You'll need Python 3.8 or higher and API keys for DeepInfra, RunPod, and Qdrant Cloud. The complete demo takes approximately 10 minutes to run.

---


# PART 1: Setup & LLM Fundamentals

Before building our RAG system, let's set up the environment and understand key LLM concepts.

## What's in This Section?

### 1. Dependency Installation
Install all required Python packages:
- `transformers` & `torch`: LLM libraries
- `langchain`: RAG orchestration  
- `qdrant-client`: Vector database
- `requests`: API calls
- `python-dotenv`: Environment management

### 2. API Configuration
Set up three essential services:
- **DeepInfra**: For generating embeddings
- **RunPod**: For LLM inference with vLLM
- **Qdrant Cloud**: For vector storage

### 3. LLM Concepts
Learn the fundamentals:
- **Tokenization**: Text → Numbers
- **Prompting**: Crafting instructions
- **Generation**: Controlling output

### 4. API Testing
Verify your RunPod vLLM connection works before building RAG.


---



## 1.1 Setup Instructions


First, install all required packages:

In [1]:
%pip install -q -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## 1.2 Configure API Keys

Copy `env.example` to `.env` and fill in your API keys.

**Required API Keys:**

1. **DeepInfra API Key** (for embeddings):
   - Sign up at [deepinfra.com](https://deepinfra.com)
   - Get your API key from the dashboard
   - Free tier available with generous limits

2. **RunPod API Key** (for LLM inference):
   - Create account at [runpod.io](https://runpod.io)
   - Contact [SatComLLM team](https://github.com/esa-sceva) for the endpoint URL and API key

3. **Qdrant Cloud** (optional, for production):
   - Sign up at [qdrant.tech](https://qdrant.tech)
   - Create a cluster (free tier available)
   - Get your cluster URL and API key
   - *Note: The notebook will use in-memory storage if not configured*


In [2]:
# Verify imports work correctly
try:
    import transformers
    import langchain
    import qdrant_client
    import openai
    print("All required packages are installed correctly")
except ImportError as e:
    print(f"Missing package: {e}")
    print("\nPlease install dependencies:")
    print("  pip install -r requirements.txt")


All required packages are installed correctly


In [6]:
# Setup environment
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Check if required API keys are set
print("Environment Configuration Status:")
print("=" * 60)

# Required keys
required_keys = {
    "DEEPINFRA_API_KEY": "DeepInfra (Embeddings)",
    "RUNPOD_API_KEY": "RunPod (LLM Inference)",
    "RUNPOD_API_URL": "RunPod Endpoint URL"
}

all_set = True
for key, description in required_keys.items():
    status = "Set" if os.getenv(key) else "NOT SET (REQUIRED)"
    print(f"  {description:25} {status}")
    if not os.getenv(key):
        all_set = False

# Optional keys
print(f"  {'Qdrant Cloud':25} {'Set' if os.getenv('QDRANT_URL') else '○ Not set (will use in-memory)'}")

print("=" * 60)
if all_set:
    print("All required API keys are configured!")
else:
    print("\nMissing required API keys!")
    print("   Please copy env.example to .env and add your keys.")


Environment Configuration Status:
  DeepInfra (Embeddings)    Set
  RunPod (LLM Inference)    Set
  RunPod Endpoint URL       Set
  Qdrant Cloud              Set
All required API keys are configured!


## 1.3 Tokenization: Breaking Text into Tokens

Before a language model can process text, the text must be converted into a form the model can understand. This is why **tokenization** is essential: it transforms raw text into numerical units that the model can operate on. Modern LLMs do not rely on full words as atomic units; instead, they use **subword tokenization**, which divides text into smaller, reusable components called *tokens*.

### Why Subword Tokenization?

Subword tokenization is used because it strikes a balance between word-level and character-level representations. In particular, it offers several practical advantages:

* **Robust handling of rare or unseen words**: Words that do not appear in the training vocabulary can still be represented by combining known subword units.
* **Smaller, more efficient vocabularies**: Keeping the vocabulary compact reduces memory requirements and speeds up training and inference.
* **Better capture of linguistic structure**: Morphological patterns (such as prefixes, suffixes, and stems) are encoded naturally through shared subword units.

For example, the word *"satellite"* might be tokenized into `["sat", "ell", "ite"]`. These components can help the model generalize to related terms such as *"satellites"* or even domain-specific variations like *"satcom"*.
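
To make the idea concrete, here is a minimal sketch of subword splitting. It is a simplification, not how production tokenizers work: real tokenizers learn their vocabulary from data (for example with BPE or SentencePiece), while this sketch uses greedy longest-match over a tiny, hypothetical vocabulary (`toy_vocab`) chosen only for illustration.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest matching pieces from `vocab`, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink until a piece matches.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"sat", "ell", "ite", "ites", "com"}

for word in ["satellite", "satellites", "satcom"]:
    print(word, "->", greedy_subword_tokenize(word, toy_vocab))
```

Because `"sat"` is shared by all three words, a model sees a common unit across them, which is the generalization benefit described above.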

Let's see tokenization in action:


In [10]:
from transformers import AutoTokenizer

# Load a popular tokenizer
tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')

# Example text about satellite communications
text = "Satellite communications enable global connectivity through geostationary and low Earth orbit satellites."

# Tokenize the text
encoded = tokenizer(
    text,
    return_offsets_mapping=True,
    return_tensors="pt",
    add_special_tokens=True
)

# Convert token IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])

print(f"Original text: {text}\n")
print(f"Number of tokens: {len(tokens)}\n")
print(f"Tokens: {tokens}\n")
print(f"Token IDs: {encoded['input_ids'][0].tolist()}")


Original text: Satellite communications enable global connectivity through geostationary and low Earth orbit satellites.

Number of tokens: 22

Tokens: ['<s>', '▁Sat', 'ellite', '▁communic', 'ations', '▁enable', '▁global', '▁connect', 'ivity', '▁through', '▁ge', 'ost', 'ation', 'ary', '▁and', '▁low', '▁Earth', '▁orbit', '▁sat', 'ell', 'ites', '.']

Token IDs: [1, 12178, 20911, 7212, 800, 9025, 5534, 4511, 2068, 1549, 1737, 520, 362, 653, 322, 4482, 11563, 16980, 3290, 514, 3246, 29889]


### Understanding Token Representation

Notice that:

* Some words become **single tokens** (especially common words).
* Other words are **split into multiple subword tokens** (often technical terms or rare words).
* **Special tokens** may appear (e.g., `<s>` for start-of-sequence, `</s>` for end-of-sequence).
* Prefixes like `Ġ` or `▁` often indicate a **leading space** before the word in certain tokenizers.

Tokenization affects several aspects of LLM usage:

* **Context window**: each model has a maximum number of tokens it can process (e.g., 4096, 8192…).
* **API costs**: most LLM APIs charge based on the number of tokens processed.
* **Inference speed**: more tokens mean slower processing and higher latency.
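
The context-window constraint can be checked with simple arithmetic. The sketch below assumes a 4096-token window (one of the example sizes above, not a property of any specific model) and uses hypothetical helper names.

```python
def fits_context(prompt_tokens, max_new_tokens, context_window):
    """Return True if the prompt plus the requested completion fits the window."""
    return prompt_tokens + max_new_tokens <= context_window


def remaining_context(prompt_tokens, max_new_tokens, context_window):
    """Tokens left for extra context after reserving room for the answer."""
    return context_window - prompt_tokens - max_new_tokens


# Illustrative numbers: an 80-token prompt and a 512-token answer budget.
print(fits_context(80, 512, 4096))
print(remaining_context(80, 512, 4096))  # 4096 - 80 - 512 = 3504
```

In a RAG system, the remaining budget is what you can spend on retrieved document chunks.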


## 1.4 Prompting: Instructing Language Models

Prompting refers to the process of designing input instructions that guide a language model to produce the desired output. The way a prompt is structured can significantly influence the quality, relevance, and accuracy of the model’s responses. In essence, effective prompting bridges the gap between human intent and model understanding.

### Why Prompting Matters

* **Improves response quality**: Clear and specific prompts help the model generate accurate and relevant answers.
* **Guides model behavior**: Prompts can specify tone, style, format, or constraints for the response.
* **Enables few-shot learning**: Including examples in the prompt allows the model to learn patterns and apply them to new queries without retraining.

### Prompt Structure in Modern Chat Models

Most modern chat-oriented LLMs use a structured prompt format composed of multiple message types:

1. **System Message**
   Sets the model’s role, behavior, or overall context. For example, a system message might instruct the model to act as a tutor, an assistant, or a domain expert.

2. **User Message**
   Contains the main query, instruction, or request. This is the content that the model is expected to respond to.

3. **Assistant Message**
   Represents the model’s response, or can include few-shot examples demonstrating the expected behavior. These examples help the model generalize the instruction pattern to new queries.

Each message is internally encoded with special tokens that help the model distinguish between system instructions, user queries, and assistant outputs. This structure allows the model to maintain coherent multi-turn conversations and follow complex instructions more effectively.
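
Many chat APIs accept this structure as a list of dictionaries with `role` and `content` keys. The following sketch assembles such a list, including optional few-shot examples; `build_messages` is a hypothetical helper, not part of any library.

```python
def build_messages(system_prompt, question, few_shot_examples=()):
    """Assemble a chat history as a list of {'role', 'content'} dicts."""
    messages = [{"role": "system", "content": system_prompt}]
    # Each few-shot example is a (question, answer) pair shown as a prior turn.
    for example_question, example_answer in few_shot_examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    # The real question always comes last.
    messages.append({"role": "user", "content": question})
    return messages


messages = build_messages(
    system_prompt="You are a satellite communications expert. Answer concisely.",
    question="What does LEO stand for?",
    few_shot_examples=[("What does GEO stand for?", "Geostationary Earth orbit.")],
)
for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```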



In [11]:
# Example of a well-structured prompt
prompt_template = """<|system|>
You are a helpful assistant specializing in satellite communications and space technology. 
Provide accurate, technical information while remaining accessible to your audience.
<|end|>
<|user|>
{question}
<|end|>
<|assistant|>
"""

question = "What is the difference between GEO and LEO satellites?"
formatted_prompt = prompt_template.format(question=question)

print("Formatted Prompt:")
print(formatted_prompt)


Formatted Prompt:
<|system|>
You are a helpful assistant specializing in satellite communications and space technology. 
Provide accurate, technical information while remaining accessible to your audience.
<|end|>
<|user|>
What is the difference between GEO and LEO satellites?
<|end|>
<|assistant|>



### Prompting Best Practices

1. Be Specific: Clearly state what you want
2. Provide Context: Give relevant background information
3. Use Examples: Few-shot prompting improves accuracy
4. Set Constraints: Specify format, length, or style requirements
5. System Message: Use it to set expertise domain and behavior

### Key parameters for LLM text generation:

  - max_tokens: Maximum number of tokens to generate (e.g., 256, 512, 1024)
  - temperature: Controls randomness (0.0 = greedy and effectively deterministic; higher values such as 1.0 give more varied output)
  - top_p: Nucleus sampling; samples only from the smallest set of tokens whose cumulative probability reaches top_p (e.g., 0.9)
  - top_k: Samples only from the K most likely tokens at each step
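
To see how these parameters interact, here is a minimal sketch over a toy next-token distribution. The logits are made up, and the filter order (top-k, then top-p) is an assumption for illustration; inference engines may apply the filters differently.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; pick the argmax for greedy decoding")
    # Temperature rescales logits: values < 1 sharpen, values > 1 flatten.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, value / total) for tok, value in exps.items()),
                    key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the K most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:  # smallest set reaching the probability mass
                break
        ranked = kept
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}


# Made-up logits for the next token after "The satellite adjusts its ..."
toy_logits = {"orbit": 2.0, "antenna": 1.0, "beam": 0.5, "banana": -1.0}

print(sampling_distribution(toy_logits, temperature=1.0))
print(sampling_distribution(toy_logits, temperature=0.5))
print(sampling_distribution(toy_logits, top_p=0.9))
```

Lowering the temperature concentrates probability on `"orbit"`, while `top_p=0.9` drops the unlikely `"banana"` before renormalizing.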

Now let's see how the formatted prompt is tokenized before it is sent to a model:


In [12]:
# For this notebook, we'll use RunPod vLLM for all LLM inference
# This avoids downloading large models locally and provides production-grade performance

# Let's demonstrate how the prompt would be tokenized
print("Prompt tokenization example:\n")

# Tokenize our formatted prompt from the previous cell
prompt_tokens = tokenizer.encode(formatted_prompt)
print(f"Prompt: {formatted_prompt[:100]}...")
print(f"\nNumber of tokens in prompt: {len(prompt_tokens)}")
print(f"First 20 token IDs: {prompt_tokens[:20]}")

# This helps us understand:
# - How much of the model's context window the prompt uses
# - Why API costs are often measured in tokens
# - How to optimize prompts to fit within context limits

print("\nWe'll use RunPod vLLM for actual text generation")


Prompt tokenization example:

Prompt: <|system|>
You are a helpful assistant specializing in satellite communications and space technology...

Number of tokens in prompt: 80
First 20 token IDs: [1, 529, 29989, 5205, 29989, 29958, 13, 3492, 526, 263, 8444, 20255, 4266, 5281, 297, 28421, 7212, 800, 322, 2913]

We'll use RunPod vLLM for actual text generation


## 1.5 Testing the RunPod vLLM API

Before we proceed to RAG, it is important to ensure that our connection to the RunPod vLLM server is working correctly. This step verifies that the API endpoint is properly configured and that the SatcomLLM model hosted on the server is accessible.

### RunPod vLLM Endpoint

RunPod provides a scalable infrastructure for hosting large language models with low-latency inference. The vLLM API endpoint allows us to send text prompts to the hosted model and receive generated responses in real time. Testing this endpoint ensures that:

* Network connectivity to the server is functional.
* API authentication (if required) is correctly configured.
* The hosted model is running and ready to accept requests.

### Verifying the Hosted Model

The model hosted on RunPod is **[LLaMA3-Satcom-8B](https://huggingface.co/esa-sceva/llama3-satcom-8b)**. By sending a test prompt to this endpoint, we can:

* Confirm that the endpoint is reachable and responding correctly.
* Ensure that the SatcomLLM model interprets queries as intended and generates coherent, domain-specific outputs.
* Validate the setup before integrating the model into downstream workflows such as embeddings, retrieval, or RAG pipelines.

Performing this test gives us confidence that the RunPod-hosted SatcomLLM is ready for production use and that our API integration is functioning properly.




In [13]:
import requests
import json
import time

# Get API credentials from environment
API_URL = os.getenv("RUNPOD_API_URL")
API_KEY = os.getenv("RUNPOD_API_KEY")

def poll_job_status(job_id, max_attempts=30, delay=2):
    """Poll RunPod job status until completion."""
    status_url = f"{API_URL.rsplit('/', 1)[0]}/status/{job_id}"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    
    for attempt in range(max_attempts):
        response = requests.get(status_url, headers=headers)
        
        if response.status_code == 200:
            result = response.json()
            status = result.get("status")
            
            if status == "COMPLETED":
                output = result.get("output")
                
                # Parse vLLM response structure
                if isinstance(output, list) and len(output) > 0:
                    if "choices" in output[0]:
                        choices = output[0]["choices"]
                        if len(choices) > 0:
                            # Try tokens array first (vLLM format)
                            if "tokens" in choices[0]:
                                return "".join(choices[0]["tokens"])
                            # Try message format (OpenAI compatible)
                            elif "message" in choices[0]:
                                return choices[0]["message"]["content"]
                
                # OpenAI-style response
                if isinstance(output, dict) and "choices" in output:
                    return output["choices"][0]["message"]["content"]
                
                # Direct string output
                if isinstance(output, str):
                    return output
                    
                return str(output)
                
            elif status == "FAILED":
                return f"Job failed: {result.get('error', 'Unknown error')}"
            
            # Still running, wait and retry
            time.sleep(delay)
        else:
            return f"Status check error {response.status_code}: {response.text}"
    
    return "Timeout waiting for job completion"

def chat_with_vllm(messages, max_tokens=512, temperature=0.1):
    """
    Send chat request to RunPod vLLM server.
    
    Args:
        messages: List of message dicts with 'role' and 'content'
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0.0 = deterministic)
        
    Returns:
        Generated text response
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "input": {
            "messages": messages,
            "max_tokens": max_tokens,
            "max_new_tokens": max_tokens,  # Alternative parameter name
            "temperature": temperature
        }
    }
    
    response = requests.post(API_URL, headers=headers, json=payload)
    
    if response.status_code == 200:
        result = response.json()
        
        # RunPod returns a job ID for async execution
        if "id" in result:
            job_id = result["id"]
            print(f"Job submitted: {job_id}")
            print("Waiting for response...")
            return poll_job_status(job_id)
        
        # Direct response (synchronous mode)
        if "output" in result:
            return result["output"]
    
    return f"Error {response.status_code}: {response.text}"

print("RunPod vLLM client initialized")


RunPod vLLM client initialized


In [14]:
# Test the RunPod vLLM API with a sample question
test_messages = [
    {"role": "user", "content": "What is satellite communications in one sentence?"}
]

print("Testing RunPod vLLM API...")
print(f"Question: {test_messages[0]['content']}\n")

response = chat_with_vllm(test_messages, max_tokens=256, temperature=0.7)

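print(f"\nResponse: {response}")

# chat_with_vllm returns an error string on failure, so check before declaring success
failure_prefixes = ("Error", "Job failed", "Timeout", "Status check error")
if isinstance(response, str) and response.startswith(failure_prefixes):
    print("\nRunPod vLLM API test failed; check the endpoint URL and API key.")
else:
    print("\nRunPod vLLM API test successful!")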

Testing RunPod vLLM API...
Question: What is satellite communications in one sentence?

Job submitted: e663c325-0223-4bde-8749-e4c4dd630e9e-e2
Waiting for response...

Response: Satellite communications is a method of exchanging information through radio signals transmitted over long distances via artificial satellites orbiting the Earth, enabling global connectivity and wireless communication between remote or hard-to-reach areas.

RunPod vLLM API test successful!


---

# PART 2: Theory Around RAG

Now that we know the model is reachable on the RunPod endpoint and can answer questions, let's see how to implement RAG starting from **custom documents**.
Before we build, let's understand why and how to use RAG.

## 2.1. The Problem with Standard LLMs

Large Language Models have limitations:

| Issue | Description | Impact |
|-------|-------------|--------|
| **Hallucinations** | Generate plausible but false info | Unreliable answers |
| **Knowledge Cutoff** | Training data has a date limit | Outdated information |
| **No Source** | Can't cite where info came from | Unverifiable |
| **Generic** | Lack domain-specific expertise | Poor specialized answers |
| **Static** | Can't update without retraining | Expensive to maintain |

## How RAG Solves These Problems

**Retrieval-Augmented Generation** adds a knowledge retrieval step:

```
Standard LLM:
Question → LLM → Answer (may hallucinate)

RAG System:
Question → Find Relevant Docs → LLM + Context → Grounded Answer
```

### Key Benefits:

- Grounded: Answers based on actual documents  
- Verifiable: Shows sources used  
- Up-to-date: Update docs without retraining model  
- Domain-specific: Add specialized knowledge  
- Cost-effective: Cheaper than fine-tuning  

## 2.2 RAG Architecture

### Two Main Phases:

#### Phase 1: Indexing (One-Time Setup)
```
Documents → Clean → Chunk → Embed → Store in Vector DB
```
This is what we'll do in Part 3!

#### Phase 2: Query (Every Question)
```
Question → Embed → Search Vector DB → Retrieve Docs → 
    Augment Prompt → LLM → Answer
```
This is what we'll implement in Part 4!
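
The two phases can be sketched end to end with the standard library. Everything here is a stand-in: `toy_embed` is a bag-of-words counter rather than a real embedding model, and a plain Python list plays the role of the vector database.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Phase 1: indexing (done once). Embed every chunk and store it.
chunks = [
    "GEO satellites orbit at roughly 35786 km above the equator",
    "LEO satellites orbit much closer to Earth and have lower latency",
    "Rain fade attenuates Ka band signals during heavy rainfall",
]
index = [(toy_embed(chunk), chunk) for chunk in chunks]


# Phase 2: query (done for every question). Embed the question and rank chunks.
def search(question, top_k=1):
    query_vector = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]


print(search("Which satellites have lower latency"))
```

A real embedding model replaces word overlap with semantic similarity, so a question can match a chunk even when they share no exact words.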

## 2.3 When to Use RAG

| Use Case | RAG? | Why |
|----------|------|-----|
| Documentation Q&A | Yes | Need exact info from docs |
| Technical support | Yes | Must cite solutions |
| General chat | No | Model knowledge sufficient |
| Research queries | Yes | Need to reference papers |
| Identity info | No | Model can hallucinate |




## 2.4 What We'll Build

In the following cells, we will:

1. Load and Chunk Documents - Process satcom docs into semantic chunks
2. Set Up Embeddings - Embed the chunks for retrieval before populating the vector database
3. Create Vector Database on Qdrant
4. Upload Knowledge Base - Store all document chunks in the vector DB
5. Implement Search - Semantic retrieval from the vector database
6. Build RAG Pipeline - Combine retrieval with vLLM generation
7. Test & Interact - Ask questions and get grounded answers

## 2.5 Architecture Overview

```
┌─────────────┐
│   User      │ Asks a question
│  Question   │
└──────┬──────┘
       ↓
┌──────────────────────────────────────────┐
│  1. EMBED QUERY (DeepInfra)              │
│     Convert question → 2560-dim vector   │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  2. SEMANTIC SEARCH (Qdrant Cloud)       │
│     Find top-k similar document chunks   │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  3. RETRIEVE CONTEXT                     │
│     Get text + metadata from matches     │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  4. ENRICH PROMPT                        │
│     Question + Retrieved Context         │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────────────────────────────────┐
│  5. GENERATE ANSWER (RunPod vLLM)        │
│     SatcomLLM produces grounded response │
└──────┬───────────────────────────────────┘
       ↓
┌──────────────┐
│   Answer     │ With source attribution
│ + Sources    │
└──────────────┘
```
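
Step 4 of the diagram (enriching the prompt) can be sketched as follows. This is one possible layout, not the notebook's final implementation: `build_rag_messages` and the `text` key are hypothetical, while `source_file` mirrors a metadata field a chunk might carry.

```python
def build_rag_messages(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into chat messages."""
    context_blocks = []
    for number, chunk in enumerate(retrieved_chunks, start=1):
        # Number each chunk so the model can cite it as [n].
        context_blocks.append(f"[{number}] (source: {chunk['source_file']})\n{chunk['text']}")
    context = "\n\n".join(context_blocks)
    system_prompt = (
        "Answer using only the numbered context below. "
        "Cite sources as [n]. If the context is insufficient, say so."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


example_chunks = [
    {"source_file": "example.md", "text": "Rain fade attenuates Ka band signals during heavy rainfall."},
]
for message in build_rag_messages("What is rain fade?", example_chunks):
    print(f"--- {message['role']} ---\n{message['content']}\n")
```

Numbering the chunks is what makes source attribution possible: the answer can reference `[1]`, and the application can map that number back to the file.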


Let's start building!

---

# PART 3: Embedding & Vector Database

In a Retrieval-Augmented Generation (RAG) system, as we have seen before, the knowledge base is represented in a way that allows fast and semantically meaningful search. This is achieved by **embedding** text into high-dimensional vectors and storing them in a **vector database**.

In this part of the pipeline, we will:

1. **Load documents** from local files (Markdown or converted PDFs).
2. **Chunk the documents** intelligently to preserve semantic and hierarchical structure.
3. **Generate embeddings** for each chunk using a specialized embedding model.
4. **Upload the embeddings to a vector database**, making them ready for retrieval during RAG.



## 3.1. Document Loading and Chunking

Before we can build our RAG system, we need to prepare our knowledge base. This involves:

### Document Loading
We'll use a few Markdown files in the `data` folder as our example knowledge source; any other `.md` files work as well. We will also explore how to convert PDFs into `.md` with a simple Python library if needed.

### Intelligent Chunking
Then we use header-based markdown splitting (hierarchical chunking) to preserve semantic structure:
- Split on markdown headers (#, ##, ###, ####)
- Maintain parent header context in metadata
- Respect maximum chunk size (1000 characters)
- Add overlap between chunks for continuity

This approach is inspired by the SatcomLLM pipeline's own chunking strategies and ensures:
- Semantic coherence - Chunks respect document structure
- Context preservation - Headers provide hierarchical context  
- Optimal retrieval - Chunks are sized for effective embedding
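
To illustrate what overlap means, here is a minimal fixed-size sliding window. It is a simplification: a splitter such as LangChain's `RecursiveCharacterTextSplitter` prefers to cut at separators like paragraph breaks rather than at fixed offsets. The function name is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, overlap):
    """Split `text` into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


# Each chunk repeats the last character of the previous one.
print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1))
```

The repeated characters give each chunk a little of its neighbor's context, so a sentence cut at a boundary is still partly visible from both sides.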

Run the next cells to load and chunk documents:


In [16]:
import os
from pathlib import Path

# Load all markdown documents from the data folder
data_folder = Path('data')

# Check if data folder exists
if not data_folder.exists():
    raise FileNotFoundError(f"Data folder not found: {data_folder}")

# Find all markdown files
markdown_files = list(data_folder.glob('*.md')) + list(data_folder.glob('*.markdown'))

if not markdown_files:
    raise FileNotFoundError(f"No markdown files found in {data_folder}")

print(f"Found {len(markdown_files)} markdown file(s) in '{data_folder}':")
for f in markdown_files:
    print(f"  - {f.name}")

# Load all documents
documents = {}
total_chars = 0

for file_path in markdown_files:
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
        documents[file_path.name] = content
        total_chars += len(content)
        print(f"\n{file_path.name}: {len(content)} characters")

print(f"\nTotal characters across all documents: {total_chars}")
print(f"\nPreview of first document:")
first_doc = list(documents.values())[0]
print(first_doc[:500])


Found 4 markdown file(s) in 'data':
  - 01382b63-a19b-46af-b9ab-21e399f22d09.md
  - 0100a359-2176-4af9-8fc8-8b8a16988c33.md
  - 00ee777f-df00-4e9e-8493-d0356ffafef9.md
  - 00b5dfe7-904b-493b-bac8-b7ed98a65338.md

01382b63-a19b-46af-b9ab-21e399f22d09.md: 69292 characters

0100a359-2176-4af9-8fc8-8b8a16988c33.md: 25219 characters

00ee777f-df00-4e9e-8493-d0356ffafef9.md: 53416 characters

00b5dfe7-904b-493b-bac8-b7ed98a65338.md: 47048 characters

Total characters across all documents: 194975

Preview of first document:
Deep Learning-based Joint Channel Prediction and Multibeam Precoding for LEO Satellite Internet of Things

[PERSON], [PERSON], [PERSON], and [PERSON]

Part of this paper has been presented at the IEEE ICCC 2023 [1].[PERSON] and [PERSON] are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, 310027, China (e-mail:{ming_ying_, [EMAIL_ADDRESS]). [PERSON] is with the School of Information Science and Technology, Hangzhou Normal U

We'll use LangChain's `MarkdownHeaderTextSplitter` which aligns with the SatcomLLM chunking strategy from `synthetic_gen/chunker/`:


In [19]:
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def chunk_markdown_document(markdown_text, max_chunk_size=1000, chunk_overlap=200):
    """
    Chunk markdown document using header-based splitting.
    This approach is inspired by the SatcomLLM synthetic data generation pipeline.
    
    Args:
        markdown_text: Markdown formatted text
        max_chunk_size: Maximum chunk size in characters
        chunk_overlap: Overlap between chunks for context preservation
    
    Returns:
        List of Document objects with content and metadata
    """
    # Define headers to split on (matching SatcomLLM approach)
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
        ("####", "Header 4"),
    ]
    
    # Initialize markdown splitter
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on
    )
    
    # Split by headers
    md_header_splits = markdown_splitter.split_text(markdown_text)
    
    # Further split large chunks if needed
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    
    # Process each header-based chunk
    all_chunks = []
    for idx, doc in enumerate(md_header_splits):
        # If chunk is too large, split further
        if len(doc.page_content) > max_chunk_size:
            sub_chunks = text_splitter.split_text(doc.page_content)
            for sub_idx, sub_chunk in enumerate(sub_chunks):
                chunk_doc = Document(
                    page_content=sub_chunk,
                    metadata={
                        **doc.metadata,
                        'chunk_id': f"{idx}_{sub_idx}",
                        'chunk_size': len(sub_chunk)
                    }
                )
                all_chunks.append(chunk_doc)
        else:
            doc.metadata['chunk_id'] = str(idx)
            doc.metadata['chunk_size'] = len(doc.page_content)
            all_chunks.append(doc)
    
    return all_chunks


In [20]:
# Chunk all documents
all_chunks = []

for filename, markdown_text in documents.items():
    print(f"\nProcessing {filename}...")
    doc_chunks = chunk_markdown_document(markdown_text)
    
    # Add source filename to metadata
    for chunk in doc_chunks:
        chunk.metadata['source_file'] = filename
    
    all_chunks.extend(doc_chunks)
    print(f"  Generated {len(doc_chunks)} chunks")

print(f"\n{'='*60}")
print(f"Total chunks across all documents: {len(all_chunks)}")
print(f"{'='*60}")

# Show sample chunks from different documents
print("\nSample chunks:")
for i, chunk in enumerate(all_chunks[:3]):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Source: {chunk.metadata.get('source_file', 'unknown')}")
    print(f"Headers: {chunk.metadata.get('Header 1', '')} > {chunk.metadata.get('Header 2', '')}")
    print(f"Content preview: {chunk.page_content[:150]}...")

# Rename for consistency with rest of notebook
chunks = all_chunks



Processing 01382b63-a19b-46af-b9ab-21e399f22d09.md...
  Generated 103 chunks

Processing 0100a359-2176-4af9-8fc8-8b8a16988c33.md...
  Generated 31 chunks

Processing 00ee777f-df00-4e9e-8493-d0356ffafef9.md...
  Generated 73 chunks

Processing 00b5dfe7-904b-493b-bac8-b7ed98a65338.md...
  Generated 69 chunks

Total chunks across all documents: 276

Sample chunks:

--- Chunk 1 ---
Source: 01382b63-a19b-46af-b9ab-21e399f22d09.md
Headers:  > 
Content preview: Deep Learning-based Joint Channel Prediction and Multibeam Precoding for LEO Satellite Internet of Things  
[PERSON], [PERSON], [PERSON], and [PERSON]...

--- Chunk 2 ---
Source: 01382b63-a19b-46af-b9ab-21e399f22d09.md
Headers:  > 
Content preview: Low earth orbit (LEO) satellite internet of things (IoT) is a promising way achieving global Internet of Everything, and thus has been widely recogniz...

--- Chunk 3 ---
Source: 01382b63-a19b-46af-b9ab-21e399f22d09.md
Headers:  > 
Content preview: predicts the CSI of current time slot acco

## 3.2 Initialize DeepInfra Embeddings

Now we'll set up our embedding model using **DeepInfra's API**.

### What are Embeddings?
Embeddings convert text into dense numerical vectors that capture semantic meaning:
- Similar texts → Similar vectors
- Enable semantic search (not just keyword matching)
- Foundation for RAG retrieval

### Model Specifications
- **Model**: `Qwen/Qwen3-Embedding-4B` 
- **Dimensions**: 2560
- **Use Case**: High-quality semantic search and retrieval
- **Normalization**: Vectors are normalized for cosine similarity
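
Normalization matters because, for unit-length vectors, cosine similarity reduces to a plain dot product, which is cheaper to compute at search time. A small sketch with made-up 3-dimensional vectors (real embeddings here have 2560 dimensions):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

# Both values agree up to floating-point rounding.
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```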

Run the next cell to initialize the embedding model:


In [23]:
import requests
import json
import os

# Fixed DeepInfra Embeddings API wrapper
class DeepInfraEmbeddings:
    """Embeddings using DeepInfra API"""
    
    def __init__(self, api_key, model="Qwen/Qwen3-Embedding-4B"):
        self.api_key = api_key
        self.model = model
        self.api_url = "https://api.deepinfra.com/v1/openai/embeddings"  # Fixed URL
    
    def embed_documents(self, texts):
        """Embed a list of documents"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        payload = {
            "model": self.model,
            "input": texts
        }
        # A timeout keeps the notebook from hanging if the API does not respond
        response = requests.post(self.api_url, headers=headers, json=payload, timeout=60)
        if response.status_code == 200:
            result = response.json()
            return [item["embedding"] for item in result["data"]]
        else:
            raise Exception(f"DeepInfra API error: {response.status_code} - {response.text}")
    
    def embed_query(self, text):
        """Embed a single query"""
        return self.embed_documents([text])[0]

# Initialize the DeepInfra embedding model
print("Initializing DeepInfra embeddings with corrected settings...")
embedding_model = DeepInfraEmbeddings(
    api_key=os.getenv("DEEPINFRA_API_KEY"),
    model="Qwen/Qwen3-Embedding-4B"  # Returns 2560-dimensional vectors
)

# Test embedding
sample_text = "The SatcomLLM pipeline generates synthetic QA pairs from documents"
sample_embedding = embedding_model.embed_query(sample_text)

print(f"✓ DeepInfra embeddings initialized successfully!")
print(f"✓ Model: {embedding_model.model}")
print(f"✓ Embedding dimension: {len(sample_embedding)}")
print(f"✓ Sample embedding (first 10 values): {sample_embedding[:10]}")


Initializing DeepInfra embeddings with corrected settings...
✓ DeepInfra embeddings initialized successfully!
✓ Model: Qwen/Qwen3-Embedding-4B
✓ Embedding dimension: 2560
✓ Sample embedding (first 10 values): [-0.0002994537353515625, 0.0308837890625, -0.0194091796875, 0.016357421875, -0.00115966796875, 0.058349609375, 0.060302734375, 0.000591278076171875, 0.023193359375, -0.006500244140625]
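
The wrapper above raises on any non-200 response, so a transient error or rate limit stops the whole run. One possible mitigation is a small retry helper with exponential backoff. This is a sketch only: `with_retries` is a hypothetical helper, not part of `requests` or DeepInfra, and the attempt count and delays are arbitrary choices.

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on exception, wait and retry with exponential backoff.

    Hypothetical helper for illustration; `sleep` is injectable for testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...

# Usage with the notebook's wrapper (assumes `embedding_model` and `texts` exist):
# embeddings = with_retries(lambda: embedding_model.embed_documents(texts))
```

Catching every `Exception` keeps the sketch short; in practice you would likely retry only on rate-limit or server errors.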


## 3.3 Create Qdrant Vector Database Collection

Time to set up our **vector database** where we'll store all document embeddings for fast retrieval.

### What is Qdrant?
**Qdrant** is a high-performance vector database designed for:
- **Similarity Search**: Find semantically similar documents in milliseconds
- **Scalability**: Handle millions of vectors efficiently
- **Filtering**: Combine vector search with metadata filters
- **Production-Ready**: Built for real-world applications

### Collection Configuration

We're creating a collection with these specifications:

| Parameter | Value | Explanation |
|-----------|-------|-------------|
| **Name** | `satcom_rag_demo` | Identifier for our knowledge base |
| **Vector Size** | 2560 | Matches Qwen3-Embedding-4B output |
| **Distance Metric** | Cosine | Best for normalized embeddings |
| **Storage** | Qdrant Cloud | Persistent, managed storage |


Run the next cell to create the collection:


In [24]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Initialize Qdrant Cloud client
print("Connecting to Qdrant Cloud...")
qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY")
)

# Collection name
collection_name = "satcom_rag_demo"

# Create new collection
print(f"Creating collection: {collection_name}")
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=2560, distance=Distance.COSINE),
)

print(f"✓ Collection '{collection_name}' created successfully!")


Connecting to Qdrant Cloud...
Creating collection: satcom_rag_demo
✓ Collection 'satcom_rag_demo' created successfully!
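
Re-running the creation cell typically fails once the collection exists, because `create_collection` does not overwrite an existing collection. A minimal sketch of an idempotent variant follows. It assumes a qdrant-client version that provides `collection_exists` (available in recent releases), and `ensure_collection` is a hypothetical helper name.

```python
def ensure_collection(client, name, vectors_config):
    """Create the collection only if it does not exist yet.

    Returns True if a collection was created, False if it was already there.
    `client` is expected to be a QdrantClient (or any object with the same
    `collection_exists` / `create_collection` methods).
    """
    if client.collection_exists(name):
        return False
    client.create_collection(collection_name=name, vectors_config=vectors_config)
    return True

# Usage with the notebook's objects (assumes `qdrant_client`, `VectorParams`,
# and `Distance` from the cell above):
# ensure_collection(
#     qdrant_client,
#     "satcom_rag_demo",
#     VectorParams(size=2560, distance=Distance.COSINE),
# )
```

Note that this keeps any points already stored in the collection, so re-uploading the same chunks with fresh UUIDs would duplicate them.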


Next, embed the chunks and upload them to Qdrant:

In [25]:
from tqdm import tqdm

# Upload chunks to Qdrant with embeddings
print(f"Embedding and uploading {len(chunks)} chunks to Qdrant...")
print("This may take a moment...")

# Prepare points for batch upload
points = []

# Process chunks in batches to avoid API rate limits
batch_size = 10
for i in tqdm(range(0, len(chunks), batch_size), desc="Processing chunks"):
    batch_chunks = chunks[i:i+batch_size]
    
    # Get texts from batch
    texts = [chunk.page_content for chunk in batch_chunks]
    
    # Generate embeddings for batch
    embeddings = embedding_model.embed_documents(texts)
    
    # Create points
    for j, (chunk, embedding) in enumerate(zip(batch_chunks, embeddings)):
        point_id = str(uuid.uuid4())
        points.append(
            PointStruct(
                id=point_id,
                vector=embedding,
                payload={
                    "text": chunk.page_content,
                    "metadata": chunk.metadata,
                    "chunk_id": chunk.metadata.get("chunk_id", f"{i+j}"),
                    "source": chunk.metadata.get("source_file")
                }
            )
        )

# Upload all points to Qdrant
print(f"\nUploading {len(points)} points to Qdrant...")
qdrant_client.upsert(
    collection_name=collection_name,
    points=points
)

print(f"✓ Successfully uploaded {len(points)} chunks to Qdrant!")

# Verify collection info
collection_info = qdrant_client.get_collection(collection_name=collection_name)
print(f"\nCollection info:")
print(f"  - Points count: {collection_info.points_count}")


Embedding and uploading 276 chunks to Qdrant...
This may take a moment...


Processing chunks: 100%|██████████| 28/28 [00:42<00:00,  1.51s/it]



Uploading 276 points to Qdrant...
✓ Successfully uploaded 276 chunks to Qdrant!

Collection info:
  - Points count: 276
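
The cell above embeds in batches of 10 but sends all 276 points in a single `upsert` call. That works at this scale; for larger corpora, one option is to batch the upload as well. The sketch below assumes the `points` list from that cell, and the batch size of 100 is an arbitrary illustrative choice.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage (assumes `qdrant_client`, `collection_name`, and `points` from the cell above):
# for batch in batched(points, 100):
#     qdrant_client.upsert(collection_name=collection_name, points=batch)

# Batch sizes for a list of 276 items, matching the chunk count above
print([len(b) for b in batched(list(range(276)), 100)])  # [100, 100, 76]
```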


# Part 4: Retrieval, Prompt Enrichment, and Generation

In this part, we move from preparing the knowledge base to actively using it in a RAG workflow. The goal is to answer user queries by:

1. Retrieving relevant document chunks from the vector database using semantic similarity search.
2. Enriching the prompt by combining the retrieved context with the user query, ensuring the model has all the necessary information.
3. Generating responses with the RunPod vLLM-hosted SatcomLLM model, producing accurate and context-aware answers.


This part demonstrates the full RAG loop: query, retrieval, prompt construction, and answer generation with a domain-specific LLM. Grounding the prompt in retrieved context helps the model answer technical queries accurately and lets it say so when the context does not contain the answer.



## 4.1 Implement Semantic Search

With our knowledge base populated, first let's build the **search function** that retrieves the most relevant document chunks for any user query.

#### How Semantic Search Works

```
User Question
    ↓
1. Embed question using DeepInfra
    ↓
2. Query Qdrant with question vector
    ↓
3. Qdrant finds similar document vectors
    ↓
4. Return top-k most similar chunks
    ↓
Retrieved Documents + Similarity Scores
```

#### Understanding Similarity Scores

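The **cosine similarity score** can range from -1 to 1, but for text embeddings it usually falls between 0 and 1. As a rough, model-dependent rule of thumb: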
- **0.9 - 1.0**: Extremely similar (near-duplicate)
- **0.7 - 0.9**: Highly relevant (strong match)
- **0.5 - 0.7**: Moderately relevant (partial match)
- **< 0.5**: Weakly relevant (consider filtering out)
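
A threshold like the ones above can be applied after retrieval. The following minimal sketch filters results shaped like the dictionaries this section's search function returns (`text`, `metadata`, `score`). `filter_by_score` is a hypothetical helper, and the cutoff is an assumption to tune per model and corpus.

```python
def filter_by_score(results, min_score=0.5):
    """Keep only results whose similarity score is at least `min_score`."""
    return [r for r in results if r["score"] >= min_score]

# Illustrative results with shortened placeholder texts
sample_results = [
    {"text": "LEO satellite intro", "metadata": {}, "score": 0.7069},
    {"text": "Keywords", "metadata": {}, "score": 0.6497},
    {"text": "Paper outline", "metadata": {}, "score": 0.6066},
]

kept = filter_by_score(sample_results, min_score=0.65)
print([r["score"] for r in kept])  # [0.7069]
```

Qdrant's `query_points` also accepts a `score_threshold` argument, which should achieve the same effect on the server side.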

#### Function Features
- **Top-K Retrieval**: Get the k most similar documents
- **Score Transparency**: See exactly how relevant each result is
- **Metadata Preservation**: Returns headers and structure info
- **Fast**: Typically < 100ms for searches

Run the next cell to implement and test the search function:

In [27]:
def search_knowledge_base(query, top_k=3):
    """
    Search the knowledge base for relevant documents.
    
    Args:
        query: User's question
        top_k: Number of top results to return
    
    Returns:
        List of relevant documents with scores
    """
    # Embed the query
    query_embedding = embedding_model.embed_query(query)
    
    # Query Qdrant for the vectors nearest to the query embedding
    search_results = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=top_k
    )
    
    # Format results
    results = []
    for result in search_results.points:
        results.append({
            "text": result.payload["text"],
            "metadata": result.payload["metadata"],
            "score": result.score
        })
    
    return results

# Test the search function
test_query = "What is a LEO Satellite?"
print(f"Test Query: {test_query}\n")
print("="*80)

results = search_knowledge_base(test_query, top_k=3)

for i, result in enumerate(results, 1):
    print(f"\n--- Result {i} (Score: {result['score']:.4f}) ---")
    print(f"Section: {result['metadata'].get('Header 1', '')} > {result['metadata'].get('Header 2', '')}")
    print(f"Text preview: {result['text'][:300]}...")
    
print("\n" + "="*80)
print("✓ Search function working correctly!")



Test Query: What is a LEO Satellite?


--- Result 1 (Score: 0.7069) ---
Section:  > I Introduction
Text preview: Recently, low-earth-orbit (LEO) satellite with low transmission latency and small propagation loss has arisen public interest by its potential as an expansion of traditional terrestrial networks to address the above issues. It can provide continuous and sufficient connectivity for districts without ...

--- Result 2 (Score: 0.6497) ---
Section:  > 
Text preview: Deep learning, multibeam precoding, channel prediction, LEO satellite Internet of Things....

--- Result 3 (Score: 0.6066) ---
Section:  > I Introduction
Text preview: The remainder of this article is organized as follows. In Section II, we introduce the channel model and channel estimation method for LEO satellite IoT. Based on the system model, we propose a supervised channel prediction scheme to predict the current CSI using the CSI of previous time slots in Se...

✓ Search function working correctly!


## 4.2 Complete RAG Pipeline

Now we bring everything together! This is the **main RAG function** that combines retrieval with generation.

#### The Complete RAG Workflow

#### Phase 1: Retrieval
```
Question → Embed → Search Qdrant → Top-K Documents
```
- Convert user question to vector
- Find most similar chunks in vector database  
- Retrieve actual text content and metadata

#### Phase 2: Context Building
```
Retrieved Docs → Format with Headers → Enriched Context
```
- Format each document with its section headers
- Include document numbers for reference
- Combine into structured context string

#### Phase 3: Prompt Engineering
```
System Prompt + Context + Question → Enriched Prompt
```
- Instruct the model to use provided context
- Include all retrieved documents as context
- Add the user's original question
- Set clear expectations for the response
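
The "System Prompt + Context + Question" split can also be expressed as separate chat messages. The sketch below assumes the endpoint accepts OpenAI-style `system` and `user` roles, which depends on the model's chat template; `build_messages` is a hypothetical helper, and this notebook's own pipeline sends everything in a single user message instead.

```python
def build_messages(context, question):
    """Build an OpenAI-style chat message list with a separate system prompt."""
    system_prompt = (
        "You are a helpful assistant specializing in satellite communications. "
        "Answer using only the provided context. "
        "If the answer cannot be found in the context, say so clearly."
    )
    user_prompt = f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages(
    "[Document 1] I Introduction\nLEO satellites offer low latency.",
    "What is a LEO satellite?",
)
print([m["role"] for m in messages])  # ['system', 'user']
```

Keeping instructions in the system message makes it easier to vary the context and question without touching the instructions.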

#### Phase 4: Generation
```
Enriched Prompt → RunPod vLLM → Grounded Answer
```
- Send to SatcomLLM model on RunPod
- vLLM provides fast, optimized inference
- Model generates answer based on context
- Prints the retrieved sources alongside the answer for attribution

### Key Benefits

| Feature | Benefit |
|---------|---------|
| **Grounded** | Answers based on actual documents, not hallucinations |
| **Traceable** | Shows which documents were used |
| **Transparent** | Displays similarity scores for trust |
| **Configurable** | Adjust top_k, temperature, max_tokens |
| **Fast** | vLLM optimization for quick responses |

### Function Parameters

- `question`: Your query (string)
- `top_k`: Number of documents to retrieve (default: 3)
- `max_tokens`: Maximum answer length (default: 512)
- `temperature`: Sampling temperature (default: 0.1; 0.0 = near-deterministic, higher values = more varied output)
- `show_sources`: Display retrieved documents (default: True)

Run the next cell to implement the complete RAG pipeline and test it:

In [29]:
def ask_rag_question(question, top_k=3, max_tokens=512, temperature=0.1, show_sources=True):
    """
    Complete RAG pipeline: Retrieve relevant documents and generate answer using vLLM.
    
    Args:
        question: User's question
        top_k: Number of documents to retrieve
        max_tokens: Maximum tokens for generation
        temperature: Sampling temperature
        show_sources: Whether to display retrieved sources
    
    Returns:
        Generated answer
    """
    print(f"\n{'='*80}")
    print(f"Question: {question}")
    print(f"{'='*80}\n")
    
    # Step 1: Retrieve relevant documents
    print("Searching knowledge base...")
    retrieved_docs = search_knowledge_base(question, top_k=top_k)
    
    if show_sources:
        print(f"\nRetrieved {len(retrieved_docs)} relevant documents:")
        for i, doc in enumerate(retrieved_docs, 1):
            headers = doc['metadata'].get('Header 1', '')
            if doc['metadata'].get('Header 2'):
                headers += f" > {doc['metadata'].get('Header 2')}"
            print(f"  [{i}] {headers} (score: {doc['score']:.3f})")
    
    # Step 2: Format context from retrieved documents
    context_parts = []
    for i, doc in enumerate(retrieved_docs, 1):
        headers = []
        if doc['metadata'].get('Header 1'):
            headers.append(doc['metadata']['Header 1'])
        if doc['metadata'].get('Header 2'):
            headers.append(doc['metadata']['Header 2'])
        
        section_info = " > ".join(headers) if headers else "General"
        context_parts.append(f"[Document {i}] {section_info}\n{doc['text']}\n")
    
    context = "\n".join(context_parts)
    
    # Step 3: Build enriched prompt
    enriched_prompt = f"""You are a helpful assistant specializing in satellite communications.

Use the following context from the documentation to answer the question accurately and concisely.
If the answer cannot be found in the context, say so clearly.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
    
    # Step 4: Generate answer using vLLM
    print("\nGenerating answer with SatcomLLM...")
    messages = [
        {"role": "user", "content": enriched_prompt}
    ]
    
    answer = chat_with_vllm(messages, max_tokens=max_tokens, temperature=temperature)
    
    # Display answer
    print(f"\n{'─'*80}")
    print(f"ANSWER:")
    print(f"{'─'*80}")
    print(answer)
    print(f"\n{'='*80}\n")
    
    return answer

# Test the complete RAG pipeline
print("Testing Complete RAG Pipeline")
print("="*80)

# Test question 1
ask_rag_question(
    "Tell me about Deep Learning-based Joint Channel Prediction and Multibeam Precoding for LEO Satellite Internet of Things",
    top_k=3,
    temperature=0.1
)


Testing Complete RAG Pipeline

Question: Tell me about Deep Learning-based Joint Channel Prediction and Multibeam Precoding for LEO Satellite Internet of Things

Searching knowledge base...

Retrieved 3 relevant documents:
  [1]  (score: 0.936)
  [2]  (score: 0.901)
  [3]  (score: 0.852)

Generating answer with SatcomLLM...
Job submitted: ba98ab9e-425f-46ac-bbdc-47a13ee3c57f-e2
Waiting for response...

────────────────────────────────────────────────────────────────────────────────
ANSWER:
────────────────────────────────────────────────────────────────────────────────
Based on the provided context, here's a summary of Deep Learning-based Joint Channel Prediction and Multibeam Precoding for LEO Satellite Internet of Things:

- **Purpose**: Develop a deep learning (DL) based joint channel prediction and multibeam precoding scheme to address the challenges of high-speed movement of Low Earth Orbit (LEO) satellites, such as acquiring timely channel state information (CSI) and designing ef

"Based on the provided context, here's a summary of Deep Learning-based Joint Channel Prediction and Multibeam Precoding for LEO Satellite Internet of Things:\n\n- **Purpose**: Develop a deep learning (DL) based joint channel prediction and multibeam precoding scheme to address the challenges of high-speed movement of Low Earth Orbit (LEO) satellites, such as acquiring timely channel state information (CSI) and designing effective multibeam precoding for various IoT applications.\n\n- **Methods**:\n "

## 4.3 Testing with Custom Questions

Let's test our RAG system with several questions on topics covered by the knowledge base.


In [31]:
# Test with more questions
test_questions = [
    "What is a LEO satellite?",
    "What is deep learning?",
    "What is Internet of Things?"
]

for question in test_questions:
    ask_rag_question(question, top_k=3, temperature=0.1)
    print("\n" + "="*80 + "\n")



Question: What is a LEO satellite?

Searching knowledge base...

Retrieved 3 relevant documents:
  [1]  > I Introduction (score: 0.710)
  [2]  (score: 0.634)
  [3]  > I Introduction (score: 0.607)

Generating answer with SatcomLLM...
Job submitted: cc72930e-7c1a-4856-94da-42c95e7dfc6f-e1
Waiting for response...

────────────────────────────────────────────────────────────────────────────────
ANSWER:
────────────────────────────────────────────────────────────────────────────────
According to Document 1, a Low-Earth-Orbit (LEO) satellite has the following characteristics: 

It has low transmission latency and small propagation loss. It is used for continuous and sufficient connectivity for areas without adequate network coverage.





Question: What is deep learning?

Searching knowledge base...

Retrieved 3 relevant documents:
  [1]  (score: 0.579)
  [2]  > III DL-Based Channel Prediction (score: 0.512)
  [3]  > III DL-Based Channel Prediction (score: 0.481)

Generating answer with Sa

In [34]:
# Ask your own question!
my_question = "What is SatCom?"

# Adjust parameters as needed:
# - top_k: Number of documents to retrieve (3-5 recommended)
# - temperature: 0.0 = deterministic, 1.0 = creative
# - max_tokens: Maximum length of answer
ask_rag_question(
    question=my_question,
    top_k=3,
    temperature=0.1,
    max_tokens=512,
    show_sources=True
)



Question: What is SatCom?

Searching knowledge base...

Retrieved 3 relevant documents:
  [1]  (score: 0.461)
  [2]  > I Introduction (score: 0.457)
  [3]  > I Introduction (score: 0.455)

Generating answer with SatcomLLM...
Job submitted: 5dba8737-2940-4554-a04e-3019d38116e7-e1
Waiting for response...

────────────────────────────────────────────────────────────────────────────────
ANSWER:
────────────────────────────────────────────────────────────────────────────────
Based on your provided context, specifically from [Document 1] General and [Document 3] I Introduction, I cannot find the explicit definition of "SatCom" in the given information. However, it is probable that "SatCom" refers to Satellite Communications.




'Based on your provided context, specifically from [Document 1] General and [Document 3] I Introduction, I cannot find the explicit definition of "SatCom" in the given information. However, it is probable that "SatCom" refers to Satellite Communications.'
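
`ask_rag_question` prints the retrieved sources but returns only the answer string. If a caller needs the sources programmatically, one option is to summarize the retrieved documents into a small structure. This is a sketch under assumptions: `summarize_sources` is a hypothetical helper, and it relies on the `Header 1`, `Header 2`, and `source_file` metadata keys used earlier in the notebook.

```python
def summarize_sources(retrieved_docs):
    """Return a compact, JSON-friendly list describing each retrieved chunk."""
    sources = []
    for rank, doc in enumerate(retrieved_docs, 1):
        meta = doc.get("metadata", {})
        headers = [meta[k] for k in ("Header 1", "Header 2") if meta.get(k)]
        sources.append({
            "rank": rank,
            "section": " > ".join(headers) if headers else "General",
            "source_file": meta.get("source_file"),
            "score": round(doc["score"], 3),
        })
    return sources

# Usage inside a pipeline (assumes `answer` and `retrieved_docs` exist):
# result = {"answer": answer, "sources": summarize_sources(retrieved_docs)}
```

Low scores such as the roughly 0.46 values in the "What is SatCom?" run then become visible to calling code, which can decide whether to trust the answer.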

## Summary: Complete RAG Pipeline

Congratulations! You've built a complete RAG (Retrieval-Augmented Generation) system with:

### Components:
1. **Document Processing**: Markdown chunking with header-based splitting
2. **Embeddings**: DeepInfra API with Qwen/Qwen3-Embedding-4B (2560 dimensions)
3. **Vector Database**: Qdrant Cloud for scalable similarity search
4. **LLM Generation**: RunPod vLLM endpoint with SatcomLLM model
5. **RAG Orchestration**: Custom pipeline combining retrieval and generation


### Next Steps:
- Try different questions
- Adjust top_k for more/less context
- Experiment with temperature settings
- Add more documents to the knowledge base
- Implement re-ranking for better results


## Bonus: PDF to Markdown Conversion

If you have PDF documents, here's a utility function to convert them to markdown for processing:


In [16]:
import fitz  # PyMuPDF
from pathlib import Path

def pdf_to_markdown(pdf_path, output_path=None):
    """
    Convert PDF to markdown format.
    
    Args:
        pdf_path: Path to PDF file
        output_path: Optional path for markdown output
    
    Returns:
        Markdown text content
    """
    # Open PDF
    doc = fitz.open(pdf_path)
    
    markdown_content = []
    
    # Extract text from each page
    for page_num, page in enumerate(doc, 1):
        # Extract text
        text = page.get_text()
        
        # Add page marker
        markdown_content.append(f"\n## Page {page_num}\n")
        markdown_content.append(text)
    
    doc.close()
    
    # Combine all content
    full_markdown = "\n".join(markdown_content)
    
    # Save to file if output path provided
    if output_path:
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(full_markdown)
        print(f"Markdown saved to: {output_path}")
    
    return full_markdown

# Example usage:
# pdf_path = "./data/sample_document.pdf"
# markdown_text = pdf_to_markdown(pdf_path, "./data/sample_document.md")
# Then use chunk_markdown_document(markdown_text) to chunk it

print("PDF to Markdown converter ready!")


PDF to Markdown converter ready!
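
To convert a whole folder rather than a single file, a small driver can loop over the PDFs. The sketch below takes the converter as a parameter so it stays independent of PyMuPDF; `convert_pdf_folder` is a hypothetical helper and the directory names in the usage comment are placeholders.

```python
from pathlib import Path

def convert_pdf_folder(pdf_dir, out_dir, convert):
    """Convert every *.pdf in `pdf_dir`, targeting <name>.md files in `out_dir`.

    `convert` is any callable with the signature convert(pdf_path, output_path),
    such as the `pdf_to_markdown` function above.
    Returns the list of markdown paths passed to `convert`.
    """
    out_dir = Path(out_dir)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        md_path = out_dir / f"{pdf_path.stem}.md"
        convert(str(pdf_path), str(md_path))
        written.append(md_path)
    return written

# Usage (placeholder paths; assumes `pdf_to_markdown` is defined as above):
# convert_pdf_folder("./data/pdfs", "./data/markdown", pdf_to_markdown)
```

Because `pdf_to_markdown` inserts a `## Page N` heading per page, header-based chunking of its output will likely split on page boundaries rather than on the document's own sections.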
