## Part 1: Interview-Style Questions

### Q1. (Embeddings): What is a text embedding and why is it useful for RAG?

A text embedding is a mathematical representation that maps text into a dense vector in N-dimensional space, where semantically similar texts are positioned close together and dissimilar texts are far apart. For example, "I love dogs" and "I love cats" would have vectors that are much closer together than either would be to "The weather is nice today." This is fundamentally useful for RAG because it enables semantic search rather than just keyword matching—when a user asks a question, we can find relevant documents based on meaning rather than exact word overlap. The embedding captures the conceptual content of text, allowing us to retrieve passages that answer a question even if they don't share the same vocabulary. Without embeddings, a query for "automobile maintenance" might miss a highly relevant document about "car repair" because the keywords don't match. In a RAG pipeline, we pre-compute embeddings for all document chunks during indexing, then at query time we embed the user's question and find the chunks with the highest cosine similarity, which become the context for the LLM to generate an answer.
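
As a rough illustration of the query-time step, the sketch below ranks chunks by cosine similarity. The three-dimensional vectors are hand-made stand-ins rather than the output of a real embedding model, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk texts with the highest cosine similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Toy vectors (assumed): the first two axes loosely mean "pets", the third "weather".
chunk_vecs = {
    "I love dogs": [0.9, 0.8, 0.1],
    "I love cats": [0.8, 0.9, 0.1],
    "The weather is nice today": [0.1, 0.1, 0.9],
}
query_vec = [0.9, 0.7, 0.05]  # stand-in for an embedded pet-related question

print(top_k(query_vec, chunk_vecs, k=2))
```

In a real pipeline the vectors would come from the embedding model and the ranking from the vector index, but the comparison being made is the same.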





### Q2. (Dimension & trade-offs): What does the dimension of an embedding model mean, and why does it matter for storage, latency, and retrieval quality?

The dimension of an embedding model refers to the length of the output vector—for example, OpenAI's text-embedding-3-small produces 1536-dimensional vectors, while text-embedding-3-large produces 3072 dimensions. You can think of each dimension as a latent feature channel that encodes some aspect of semantic meaning, so higher dimensions provide more "directions" in the vector space to distinguish between different concepts. For storage, the cost scales linearly: each vector requires `dimension × 4 bytes` (for float32), so a 3072-dim model uses twice the storage of a 1536-dim model—this becomes significant when you have millions of chunks. For latency, similarity search must compute dot products across all dimensions, so higher dimensions mean more computation per query; additionally, Approximate Nearest Neighbor (ANN) indexes like HNSW can become less efficient in very high-dimensional spaces. For retrieval quality, higher dimensions generally capture more semantic nuance and can better distinguish subtle differences between concepts, leading to more accurate similarity rankings. In practice, I'd start with 768-1536 dimensions for most applications, as this balances quality with cost, and only move to higher dimensions if evaluation shows retrieval quality is a bottleneck.
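
The storage arithmetic can be checked with a few lines. This is only a back-of-the-envelope sketch: the corpus size of one million chunks is an assumption, and the figure covers raw float32 vectors only, not index overhead, metadata, or chunk text.

```python
def index_size_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw vector storage: num_vectors x dimension x bytes per value (float32 = 4)."""
    return num_vectors * dimension * bytes_per_value

num_chunks = 1_000_000  # assumed corpus size for illustration
for dim in (768, 1536, 3072):
    gb = index_size_bytes(num_chunks, dim) / 1e9
    print(f"{dim} dims: {gb:.2f} GB")
```

At this scale the 1536-dimension index needs about 6.1 GB and the 3072-dimension index about 12.3 GB, which is the doubling described above.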




### Q3. (Choosing an embedding model): How would you choose an embedding model for a RAG system, and what trade-offs would you consider?

When choosing an embedding model, I would start by checking the MTEB (Massive Text Embedding Benchmark) leaderboard to see how different models perform on retrieval tasks specifically, since RAG primarily needs good retrieval performance rather than clustering or classification. The key trade-offs I'd consider are: first, quality versus cost. Proprietary models such as OpenAI's or Cohere's embeddings tend to perform well but incur per-token API costs, while open-source models (for example, those distributed through the sentence-transformers library) carry no licensing fee but require you to run and maintain the serving infrastructure. Second, latency versus accuracy: larger models with more parameters generally produce better embeddings but take longer to run, so for real-time applications I might choose a smaller, faster model. Third, dimension size affects storage and search speed, as discussed earlier. Fourth, I'd consider the max token limit: some models only handle 512 tokens while others handle 8192, which affects chunking strategy. Fifth, domain specificity matters: a model trained on general web text might underperform on legal or medical documents compared to a domain-specific model. Finally, I'd consider whether I need multilingual support. In practice, I'd start with a well-performing general model like text-embedding-3-small, establish baseline metrics, and only switch if evaluation reveals problems.




### Q4. (Vector DB vs FAISS): Why would you use a vector database instead of just using FAISS directly?

FAISS is a powerful low-level library for similarity search, but it is a library, not a complete solution. When you use FAISS directly, you're responsible for managing persistence (saving and loading indexes), storing the original text and metadata separately, implementing any filtering logic, handling updates and deletions, and building a service layer if you need network access. A vector database like ChromaDB, Pinecone, or Qdrant handles all of this out of the box: it persists data, stores metadata alongside vectors, provides filtering capabilities (e.g., "only search documents from this user" or "only documents created after 2024"), offers a clean query API, and manages the complexity of updates. For production RAG systems, metadata filtering is particularly crucial, because you often need to enforce access control (only retrieve documents the user has permission to see) or filter by document type, date, or source. Many vector databases also offer hybrid search that combines vector similarity with keyword matching, which FAISS does not provide on its own. I would use FAISS directly only if I needed maximum performance and control, was willing to build the infrastructure myself, and didn't need rich metadata filtering. For most RAG applications, a vector database is the better choice for development speed and maintainability.




### Q5. (Metadata & access control): What role does metadata play in your retrieval system, and how does it help with access control and filtering?

Metadata is essential for making retrieval practical in production systems—vectors alone only tell you semantic similarity, but metadata tells you everything else about the source content. At the document level, we store fields like document_id, title, source_type (PDF, HTML, URL), source_url, uploaded_by, created_at, and crucially for multi-tenant systems, a tenant_id or list of allowed users/groups. At the chunk level, we store page_number, section heading, and block_ids so we can tell users exactly where an answer came from. For access control, when a user queries the system, we apply a metadata filter before or during the similarity search—for example, `WHERE tenant_id = 'user_org' AND user_has_access = true`—ensuring users only see documents they're authorized to access. This is non-negotiable for enterprise applications where different departments or clients have different document permissions. Beyond access control, metadata enables useful filtering like "only search policy documents updated in the last year" or "only search the engineering knowledge base, not HR documents." It also powers features like showing citations with page numbers and enabling users to click through to the source. Without rich metadata, you'd have a retrieval system that returns relevant content but can't tell you where it came from or enforce any business logic around who can see what.
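
A minimal sketch of filter-then-rank retrieval, assuming chunks are plain dictionaries held in memory. The field name `allowed_groups`, the helper `filtered_search`, and the sample data are hypothetical; a real vector database would apply the filter inside its own query engine.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query_vec, chunks, tenant_id, user_groups, k=3):
    """Apply the access-control filter first, then rank the survivors by similarity."""
    allowed = [
        c for c in chunks
        if c["tenant_id"] == tenant_id and set(c["allowed_groups"]) & set(user_groups)
    ]
    allowed.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return allowed[:k]

# Illustrative chunks: two tenants, two groups, toy 2-D vectors.
chunks = [
    {"chunk_id": "hr-1", "tenant_id": "acme", "allowed_groups": ["hr"],
     "page_number": 4, "vector": [1.0, 0.0]},
    {"chunk_id": "eng-1", "tenant_id": "acme", "allowed_groups": ["engineering"],
     "page_number": 12, "vector": [0.8, 0.6]},
    {"chunk_id": "other-1", "tenant_id": "globex", "allowed_groups": ["engineering"],
     "page_number": 1, "vector": [1.0, 0.0]},
]

hits = filtered_search([1.0, 0.0], chunks, tenant_id="acme", user_groups=["engineering"])
print([(h["chunk_id"], h["page_number"]) for h in hits])
```

Note that the two chunks most similar to the query are excluded: one belongs to another tenant and the other to a group the user is not in. The surviving hit still carries its `page_number`, which is what makes a citation possible.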





### Q6. (Naive parsing limitations): Why is naïve text extraction from PDFs/HTML not enough for a production RAG system?

Naive text extraction fundamentally destroys the structure that gives documents meaning. When you run a simple PDF-to-text tool on a multi-column document, the columns get interleaved—text from column A mixes with text from column B, creating nonsensical passages. Tables are particularly problematic: row and column relationships are lost, so a table showing "Q3 Revenue: $42M" might become "Q3 Revenue Q4 Revenue $42M $38M" with numbers separated from their headers. Headers, footers, and page numbers get mixed into the content, so "Confidential - Page 3" appears in the middle of paragraphs. Lists lose their structure, code blocks lose their formatting, and mathematical formulas become unreadable character sequences. Images and charts are either dropped entirely or become useless placeholders like "Figure 3." The consequence for RAG is severe: embeddings computed on this noisy, broken text become unreliable because the semantic meaning is corrupted. Even if similarity search happens to retrieve the right page, the chunk itself may not contain coherent information for the LLM to use. This is why we need advanced parsing that understands document layout—using vision models or hybrid approaches that can identify and properly extract different content types while preserving their structure and relationships.
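
The column-interleaving problem is easy to reproduce. In this sketch each fragment is a hypothetical `(x, y, text)` tuple such as a low-level PDF parser might emit, and the fixed column boundary at `x = 300` is an assumption; real layout detection is considerably harder.

```python
# Each tuple is (x, y, text) for one extracted line; x < 300 is the left column.
fragments = [
    (50, 100, "Embeddings map text"),
    (350, 100, "Tables need special"),
    (50, 120, "to dense vectors."),
    (350, 120, "handling in parsers."),
]

def naive_order(frags):
    """Read strictly top-to-bottom, then left-to-right, ignoring columns."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return " ".join(text for _, _, text in ordered)

def column_aware_order(frags, column_split=300):
    """Read the whole left column first, then the whole right column."""
    ordered = sorted(frags, key=lambda f: (f[0] >= column_split, f[1]))
    return " ".join(text for _, _, text in ordered)

print(naive_order(fragments))
print(column_aware_order(fragments))
```

The naive ordering splices the two sentences into each other, while the column-aware ordering keeps each sentence whole.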




### Q7. (Document → Page → Block → Chunk): How does your system internally represent documents before building the vector index?

Our system uses a four-level hierarchy to preserve document structure throughout the pipeline. At the top level, a Document represents the entire file (an 80-page PDF or complete web article) with metadata like document_id, title, source_type, and access control fields. Each Document contains multiple Pages, which correspond to physical pages in PDFs or logical sections in web content, with metadata including page_number and an optional rendered image for visual reference. Within each Page, we identify Blocks—these are the fundamental semantic units with types like Title, Section_Header, Paragraph, Table, Figure, List_Item, Code, Header (page header), and Footer. Each Block has a block_id, its type, the raw content, bounding box coordinates, and importantly a semantic_content field which is an LLM-generated description optimized for embedding (especially crucial for tables and charts). Finally, Blocks are combined into Chunks based on our chunking strategy—a Chunk is what actually gets embedded and indexed. Each Chunk maintains references back to its source blocks, page number, and document, enabling us to show users exactly where retrieved content came from. This hierarchy lets us apply intelligent chunking that respects logical boundaries, filter by structural elements, and provide rich citations in the final output.
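
One way to picture the hierarchy is as a set of dataclasses. This is a sketch, not the system's actual schema: the field names follow the description above where it gives them, while `block_type`, `bbox` ordering, `image_path`, and `chunk_id` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Block:
    block_id: str
    block_type: str                          # e.g. "Paragraph", "Table", "Figure"
    content: str                             # raw extracted content
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed convention
    semantic_content: Optional[str] = None   # LLM-generated description, if any

@dataclass
class Page:
    page_number: int
    blocks: List[Block] = field(default_factory=list)
    image_path: Optional[str] = None         # optional rendered page image

@dataclass
class Document:
    document_id: str
    title: str
    source_type: str                         # "PDF", "HTML", "URL"
    tenant_id: str                           # access-control field
    pages: List[Page] = field(default_factory=list)

@dataclass
class Chunk:
    chunk_id: str
    text: str                                # what actually gets embedded
    document_id: str
    page_number: int
    block_ids: List[str]                     # references back to source blocks

# Build a one-page document and a chunk that points back at its source block.
block = Block("b1", "Paragraph", "Embeddings map text to vectors.", (50, 100, 500, 130))
doc = Document("doc-1", "Course notes", "PDF", "acme", pages=[Page(1, [block])])
chunk = Chunk("c1", block.content, doc.document_id, 1, [block.block_id])
print(chunk)
```

Because the chunk stores `document_id`, `page_number`, and `block_ids`, a retrieved chunk can always be traced back to the exact blocks it was built from.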





### Q8. (Tables & charts): RAG systems often fail on tables and charts. How does your system handle them?

Tables and charts fail in naive RAG because embedding models cannot understand structural relationships—if you embed raw HTML like `<table><tr><th>Q3</th><th>Revenue</th></tr>...`, the model doesn't know that "Q3" is a column header related to specific values below it. Our system handles this with a dual-representation approach. First, we preserve the structured data in a format the LLM can interpret—HTML with proper tags for complex tables, Markdown for simpler ones, or JSON for programmatic access. Second, and crucially, we generate a semantic_content field using a vision-capable LLM that summarizes what the table or chart actually means: "Revenue comparison table showing Q3 2024 revenue of $42.5M with 35% year-over-year growth, compared to Q3 2023 revenue of $31.5M with 22% growth, indicating accelerating growth." This semantic description is what gets embedded for retrieval, because it captures the meaning in natural language that embedding models understand well. In experiments, the semantic description achieves 30-60% higher cosine similarity with relevant queries compared to raw table data. When a chunk containing a table is retrieved, we send both the semantic summary and the actual structured data to the LLM, so it has both the context for why it was retrieved and the precise data needed to answer accurately.
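
The dual-representation idea can be sketched as two small selection functions. Blocks are plain dictionaries here, and `STRUCTURED_TYPES`, `text_for_embedding`, `context_for_llm`, and the sample table are all hypothetical names and data chosen for illustration.

```python
STRUCTURED_TYPES = {"Table", "Figure"}  # assumed set of types that need a description

def text_for_embedding(block):
    """Embed the natural-language description for tables/figures, raw text otherwise."""
    if block["type"] in STRUCTURED_TYPES and block.get("semantic_content"):
        return block["semantic_content"]
    return block["content"]

def context_for_llm(block):
    """For tables/figures, send the summary followed by the structured data."""
    if block["type"] in STRUCTURED_TYPES and block.get("semantic_content"):
        return f"{block['semantic_content']}\n\n{block['content']}"
    return block["content"]

table_block = {
    "type": "Table",
    "content": "<table><tr><th>Quarter</th><th>Revenue</th></tr>"
               "<tr><td>Q3 2024</td><td>$42.5M</td></tr></table>",
    "semantic_content": "Revenue table showing Q3 2024 revenue of $42.5M.",
}
paragraph_block = {"type": "Paragraph", "content": "Revenue grew in Q3."}

print(text_for_embedding(table_block))
print(context_for_llm(table_block))
```

Falling back to the raw content when no description exists keeps the pipeline working for blocks the vision model did not annotate.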




### Q9. (Chunking strategies & example): How did you design your chunking strategy, and can you give an example where bad chunking hurt retrieval quality?

Our chunking strategy prioritizes logical coherence over fixed sizes. Rather than blindly splitting every N characters, we use structure-aware chunking that respects document hierarchy: section headings stay with their content, bullet lists aren't split mid-list, and tables remain intact with their captions. We target 400-600 tokens per chunk as a balance—large enough to contain complete ideas but small enough for precise retrieval—with 10-20% overlap to avoid cutting sentences. We use separators in priority order: first try to split on section breaks, then paragraphs, then sentences, then words.
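
The priority-ordered separators can be sketched as a small recursive splitter. This is a simplified illustration under stated assumptions: it measures length in characters rather than tokens, omits overlap, uses a blank line as a stand-in for a section break, and `recursive_split` is a hypothetical helper rather than the system's actual code.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that works, recursing into finer ones."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate          # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)       # flush what fits so far
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks

p1 = "Section heading\nShort intro line."
p2 = "Second paragraph with more words."
print(recursive_split(p1 + "\n\n" + p2, max_len=40))
```

With a 40-character limit the two paragraphs come back as separate chunks, and the heading stays attached to its intro line because the split on blank lines succeeded before the finer separators were ever tried.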

Here's a concrete example of bad chunking hurting retrieval. Consider a document section titled "Core causes of RAG failure" followed by four explanatory bullet points. With naive 500-character chunking: Chunk 1 ends with "...Core causes of RAG failure", Chunk 2 contains the first bullet point cut mid-sentence, and Chunk 3 has bullets 2-4 split across it. When a user asks "What are the main causes of RAG failure?", the query matches best with Chunk 1 (which has the heading but no actual causes) or partially with Chunk 2 (which has one incomplete cause). The LLM receives fragmented context and gives an incomplete answer. With structure-aware chunking, the heading and all four bullets form a single logical chunk. The same query retrieves complete, actionable information, and the LLM can enumerate all four causes accurately.





### Q10. (End-to-end indexing pipeline): Walk me through your indexing pipeline end-to-end.

The pipeline starts when a raw document—PDF, HTML, or URL—is uploaded. First, we run hybrid parsing: a low-level parser extracts text fragments with bounding boxes from each page, while simultaneously we render each page as an image and send both the image and raw text to a vision-capable LLM like GPT-4.1-mini. The LLM analyzes the visual layout and returns structured JSON identifying each block's type (title, paragraph, table, figure, etc.), its approximate location, and generates semantic descriptions for complex elements like tables and charts. We post-process this output to merge small spans, remove repeated headers/footers, and normalize formatting.

Next, we apply structure-aware chunking. Using the block types and heading hierarchy, we group related blocks into chunks—keeping headings with their content, lists intact, and tables with captions. We target 400-600 tokens with overlap at natural boundaries. Each chunk maintains metadata linking back to its source blocks, page numbers, and document.

Then we generate embeddings by calling our embedding model (e.g., text-embedding-3-small) on each chunk's text—for tables/charts, we embed the semantic_content rather than raw data. Finally, we store everything in our vector database (ChromaDB): the embedding vector, the chunk text, and all metadata including document_id, page_number, block_types, heading_path, tenant_id for access control, and timestamps. The vectors are indexed for fast ANN search, and we're ready to serve queries.

At query time, we embed the user's question, apply metadata filters (access control, document type), run top-K similarity search, optionally fetch neighboring chunks for context enrichment, and pass the retrieved context to the LLM for answer generation with citations.
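
The optional neighbor-fetching step might look like the sketch below. It assumes each chunk is keyed by a `(document_id, position)` pair, which is an illustrative convention and not necessarily how the real index stores chunks; `expand_with_neighbors` is a hypothetical helper.

```python
def expand_with_neighbors(hit_keys, chunks_by_key, window=1):
    """Add the chunks just before and after each hit, within the same document."""
    selected = set()
    for document_id, position in hit_keys:
        for offset in range(-window, window + 1):
            neighbor_key = (document_id, position + offset)
            if neighbor_key in chunks_by_key:
                selected.add(neighbor_key)
    # Return in reading order so the LLM sees coherent context.
    return [chunks_by_key[key] for key in sorted(selected)]

# Toy index: document "a" has four chunks, document "b" has one.
chunks_by_key = {
    ("a", 0): "a: intro",
    ("a", 1): "a: limitations heading",
    ("a", 2): "a: limitations bullets",
    ("a", 3): "a: next section",
    ("b", 0): "b: unrelated",
}

print(expand_with_neighbors([("a", 2)], chunks_by_key))
```

Using a set means overlapping windows from adjacent hits do not produce duplicate context, and keying on the document id keeps the expansion from crossing into a different document.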

---



## Part 2: Build a Tiny Vector Index & Experiment with Chunking


### 1. Document Used

I used the "Day 2 – Advanced Data Quality & Indexing Pipeline" course notes PDF (30 pages), which covers embeddings, vector search, document parsing, chunking strategies, and RAG optimization. This document has clear section headings, multiple subsections, code examples, tables, and bullet-point lists—ideal for comparing chunking strategies.


### 2. Chunking Implementation

**Naive Fixed-Size Chunking:**
```python
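def naive_chunk(text, chunk_size=1000, overlap=200):
    # The window must advance on every step, otherwise the loop never ends
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append({
            'text': text[start:end],
            'strategy': 'naive',
            'start_char': start
        })
        if end >= len(text):
            break  # Last window reached; avoid a trailing chunk of pure overlap
        start = end - overlap
    return chunks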
```
Emits 1000-character chunks that start every 800 characters, so consecutive chunks share a 200-character overlap, regardless of content structure.

**Structure-Aware Chunking:**
```python
import re


def structure_aware_chunk(text, max_chunk_size=1500):
    # Split on markdown headings
    sections = re.split(r'(^#{1,3}\s+.+$)', text, flags=re.MULTILINE)
    
    chunks = []
    current_heading = ""
    
    for section in sections:
        if re.match(r'^#{1,3}\s+', section):
            current_heading = section.strip()
        else:
            content = section.strip()
            if content:
                # Keep heading with content
                chunk_text = f"{current_heading}\n\n{content}" if current_heading else content
                
                # Split if too long, preserving paragraph boundaries
                if len(chunk_text) > max_chunk_size:
                    paragraphs = chunk_text.split('\n\n')
                    # Group paragraphs into chunks
                    ...
                else:
                    chunks.append({
                        'text': chunk_text,
                        'strategy': 'structured',
                        'heading': current_heading
                    })
    return chunks
```
Identifies section headings (lines starting with #), keeps each heading with its following content, and only splits within sections if they exceed the max size.
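
The paragraph-grouping step for oversized sections is not shown in the listing above. One possible way to do it is the greedy packer sketched here; `group_paragraphs` is a hypothetical helper, it expects the section body's paragraphs without the heading, and repeating the heading on every resulting chunk is a design assumption rather than something the original code specifies.

```python
def group_paragraphs(paragraphs, max_chunk_size=1500, heading=""):
    """Greedily pack paragraphs into chunks no longer than max_chunk_size.

    The heading is repeated at the top of every chunk so each piece keeps
    its context. A single paragraph longer than the limit is kept whole.
    """
    prefix = f"{heading}\n\n" if heading else ""
    texts, current = [], []
    for para in paragraphs:
        candidate = prefix + "\n\n".join(current + [para])
        if current and len(candidate) > max_chunk_size:
            texts.append(prefix + "\n\n".join(current))  # flush the full chunk
            current = [para]
        else:
            current.append(para)
    if current:
        texts.append(prefix + "\n\n".join(current))
    return [{"text": t, "strategy": "structured", "heading": heading} for t in texts]

section_body = ["a" * 50, "b" * 50, "c" * 50]
for chunk in group_paragraphs(section_body, max_chunk_size=130, heading="## H"):
    print(len(chunk["text"]), repr(chunk["text"][:10]))
```

The returned dictionaries use the same keys as the chunks built by `structure_aware_chunk`, so they could be appended to the same list.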

---

### 3. Query Results Comparison

#### Query 1: "What are the main limitations of naive RAG?"

**Naive Chunking - Top Result (similarity: 0.72):**
```
"...vectors stored in a vector database.

Parsing goals: from messy file → structured blocks
Instead of "just text," we want a structured representation of the document:
Document → Page → Block → Chunk
where:
Document = entire file (e.g., 80-page PDF or full web article).
Page = physical page for PDFs / logical sections for web pages.
Block = layout unit with a semantic type:
Title, Heading(level), Paragraph, List, Table, Figure, Code, Caption,
Header, Footer, etc.
Chunk = text unit prepared specifically for embeddings & retrieval.
This hierarchy is used throughout the course code..."
```
*Problem: Retrieved content about document hierarchy, not RAG limitations. The heading "Core Limitations of Naive RAG" got separated from its content.*

**Structure-Aware Chunking - Top Result (similarity: 0.81):**
```
"## Core Limitations of Naive RAG

• Single Modality: When an answer requires synthesizing information from a 
combination of images, tables, charts, and text within a document, a Naive 
RAG that only performs vector search on plain text will be ineffective.

• Insufficient Context Breadth: When a question requires synthesizing 
information from an entire document or multiple documents (e.g., "summarize 
this report"), retrieving only the Top-K most similar text chunks is far from 
sufficient.

• Limitations of Vector Search: Vector distance is not always a perfect indicator 
of semantic relevance. Sometimes, the most semantically relevant text chunk 
may not be the closest in the vector space (e.g., due to negation or complex 
logical relationships)."
```
*Success: All three limitations with explanations in a single coherent chunk.*

---

#### Query 2: "Why should I use semantic descriptions for tables instead of raw HTML?"

**Naive Chunking - Top Result (similarity: 0.68):**
```
"...search becomes unreliable.
Even if the right page is "retrieved", the chunk itself may not actually
contain the clean information the LLM needs.
This is why we treat advanced parsing and layout-aware extraction as a
first-class part of the pipeline.

VEHICLE CAPACITY
Preparations are underway across our existing factories for the launch of new products in 2025,
including more affordable models. New Model Y has launched on the configurator globally with
production ramp and deliveries beginning later in Q1..."
```
*Problem: Retrieved Tesla vehicle capacity content mixed with parsing discussion—completely off-topic due to document sections being interleaved.*

**Structure-Aware Chunking - Top Result (similarity: 0.79):**
```
"## Why Use semantic_content for Embedding?

For tables and figures, raw content (like HTML) does not embed well.
We generate a semantic description that captures the meaning.

Example comparison:
- Raw HTML table: similarity score 0.4312
- Semantic description: similarity score 0.5903  
- Improvement: +36.9% better

The semantic description captures the MEANING: "Revenue comparison showing 
Q3 2024 revenue of $42.5M with 35% YoY growth" rather than structural markup 
that embedding models cannot interpret."
```
*Success: Directly answers the question with the relevant section intact.*

---

#### Query 3: "How do I choose between FAISS and ChromaDB for my RAG system?"

**Naive Chunking - Top Result (similarity: 0.71):**
```
"...Vector DB Comparison: https://superlinked.com/vector-db-comparison

FAISS vs ChromaDB (and other vector DBs)
FAISS
A low-level C++/Python library for similarity search (Meta).
Provides many index types (Flat, IVF, HNSW, PQ, …).
You manage: storage, metadata mapping, sharding, and any network
service layer.
Great when you want maximum control and performance and are okay
with custom infra.
ChromaDB"
```
*Partial success: Got the right section but cut off before ChromaDB explanation.*

**Structure-Aware Chunking - Top Result (similarity: 0.78):**
```
"## FAISS vs ChromaDB (and other vector DBs)

**FAISS:**
- A low-level C++/Python library for similarity search (Meta)
- Provides many index types (Flat, IVF, HNSW, PQ, …)
- You manage: storage, metadata mapping, sharding, and any network service layer
- Great when you want maximum control and performance with custom infra

**ChromaDB:**
- A Python-first embedded vector database
- Stores vectors + metadata + original text, with automatic persistence
- Very easy to integrate into RAG prototypes
- Ideal for single-service or small/medium-scale apps where simplicity matters

**Key trade-offs:** control vs convenience, scale vs simplicity, infra cost vs vendor lock-in."
```
*Success: Complete comparison with both options and trade-offs summarized.*

---



### 4. Reflection (8-12 sentences)

**For which queries did structure-aware chunking give clearly better context?**

Structure-aware chunking performed significantly better on Query 1 and Query 2, where the answer was contained within a specific section with a clear heading. For Query 1 about RAG limitations, the naive approach retrieved a chunk that happened to contain related keywords but missed the actual list of limitations because the heading got separated from its bullet points. The structure-aware approach kept "Core Limitations of Naive RAG" together with all three explanatory bullets, providing complete and actionable information. Query 2 showed an even more dramatic difference—naive chunking retrieved completely irrelevant content about Tesla vehicles that happened to be near the table discussion in the raw text, while structure-aware chunking retrieved the exact section explaining why semantic descriptions matter.

**Did naive chunking ever "win" or look comparable?**

For Query 3, naive chunking performed reasonably well because the relevant content happened to start near a chunk boundary, so it captured most of the FAISS explanation. However, it still cut off before completing the ChromaDB section, requiring the user to potentially retrieve multiple chunks to get a complete answer. The cases where naive chunking looks comparable tend to be when the query terms appear early in what happens to be a logical section, or when the relevant information is short enough to fit within a single naive chunk. This is essentially luck—the chunking happened to align with the content structure by coincidence.

**How did chunk size and overlap affect results?**

For answer completeness, the larger structure-aware chunks (up to 1500 characters versus 1000) kept complete sections together, providing full context. The 200-character overlap in naive chunking meant that text cut at one chunk boundary still appeared intact in the neighboring chunk, but it could not prevent logical units from being split mid-thought. For noise, smaller naive chunks sometimes had less irrelevant content per chunk, but this was offset by the higher probability of retrieving chunks that were partially relevant but missing key context. For token cost, structure-aware chunks were on average 30% larger, meaning slightly higher LLM costs per retrieved chunk, but fewer chunks needed to be retrieved to get complete answers.

**What would I try next to improve retrieval quality?**

Given more time, I would implement context enrichment—automatically fetching the chunks immediately before and after the top retrieved chunk to provide surrounding context. I would also experiment with hybrid search combining vector similarity with BM25 keyword matching, which would help with queries containing specific terms like product codes or technical acronyms. Additionally, I would try generating hypothetical questions for each chunk during indexing (HyPE) to better bridge the gap between how users phrase questions and how documents phrase answers. Finally, I would implement a reranking step using a cross-encoder model to re-score the top-20 candidates from the initial retrieval, as cross-encoders are more accurate but too slow to run on the full corpus.
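
For the hybrid-search idea, one common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how the fusion could be done, not something already built or evaluated here; the chunk ids are made up, and `k=60` is the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists (each ordered best-first) into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); high ranks contribute more.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

vector_ranking = ["c2", "c7", "c1"]  # assumed output of vector search
bm25_ranking = ["c7", "c9", "c2"]    # assumed output of BM25 keyword search

print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. The fused list could then be passed to the cross-encoder reranker as its candidate set.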