# RAG System Implementation - Academic Research Assistant

This notebook implements a complete Retrieval-Augmented Generation (RAG) system designed for academic research workflows. The system combines local language models with a structured knowledge base to provide transparent and reliable AI-assisted research capabilities.

## System Overview

The RAG pipeline consists of several integrated components:

1. **Document Processing**: Intelligent chunking of academic texts with section-aware splitting and citation tracking
2. **Embedding Generation**: Dense 768-dimensional sentence embeddings from BGE-base-en-v1.5 for semantic search
3. **Vector Search**: Efficient similarity search using FAISS for document retrieval
4. **Response Generation**: Local LLM inference with proper source attribution

## Key Features

- **Local Inference**: Privacy-preserving processing using llama.cpp
- **Citation Tracking**: Automatic extraction and linking of academic references
- **Source Attribution**: Transparent identification of information sources
- **Modular Design**: Easy customisation and extension of components
- **Academic Focus**: Optimised for research workflows and scholarly content

## Usage Instructions

1. **Setup**: Install dependencies from requirements.txt
2. **Model Configuration**: Update LLM model path in the final cells
3. **Document Processing**: Run cells sequentially to build the knowledge base
4. **Query Processing**: Use the final cells to test queries and generate answers

## Technical Implementation

The system demonstrates practical implementation of:
- Semantic chunking with overlap preservation
- Dense vector retrieval with metadata tracking
- Structured prompt engineering for academic contexts
- Local LLM integration for controlled generation

*Note: This implementation focuses on dynamic pricing research content but can be adapted for other academic domains.*

In [None]:
# Make sure the notebook kernel is running inside the project's virtual environment
# %pip installs into the environment of the running kernel
%pip install -r requirements.txt

### Import the libraries

In [None]:
# Core libraries for RAG system implementation
import fitz  # PyMuPDF for PDF text extraction
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Text chunking utilities
from sentence_transformers import SentenceTransformer  # Embedding model for semantic search
import faiss  # Vector database for similarity search
import numpy as np  # Numerical operations
from sklearn.preprocessing import normalize  # Data preprocessing utilities
from llama_cpp import Llama  # Local LLM inference engine
from striprtf.striprtf import rtf_to_text  # RTF text extraction
import re  # Regular expressions for text processing
from typing import List, Dict  # Type hints for better code clarity



In [None]:
# Reference metadata manually entered
# This dictionary stores citation information for academic papers used in the thesis
# Each entry contains the citation key, formatted citation, title, and source link
references = [
    {
        "id": "(Chen & Guestrin, 2016)",
        "citation": "(Chen and Guestrin, 2016)",
        "title": "XGBoost: A Scalable Tree Boosting System",
        "link": "https://dl.acm.org/doi/10.1145/2939672.2939785"
    },
    {
        "id": "(Fiig et al., 2018)",
        "citation": "(Fiig et al., 2018)",
        "title": "Dynamic Pricing of Airline Offers",
        "link": "https://www.iata.org/contentassets/0688c780d9ad4a4fadb461b479d64e0d/dynamic-pricing--of-airline-offers.pdf"
    },
    {
        "id": "(Garbarino & Lee, 2003)",
        "citation": "(Garbarino and Lee, 2003)",
        "title": "Dynamic Pricing in Internet Retail: Effects on Consumer Trust",
        "link": "https://www.researchgate.net/publication/229740251_Dynamic_Pricing_in_Internet_Retail_Effects_on_Consumer_Trust"
    },
    {
        "id": "(Spagnuelo et al., 2017)",
        "citation": "(Spagnuelo et al., 2017)",
        "title": "Metrics for Transparency",
        "link": "https://core.ac.uk/download/pdf/78370936.pdf"
    }
]

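# Create a citation dictionary for fast lookups during text processing
# Each paper is registered under both spellings of its key ("&" in "id", "and" in
# "citation"), so thesis citations and reference source names both resolve
citation_index = {}
for entry in references:
    metadata = {
        "id": entry["id"],
        "citation": entry["citation"],
        "title": entry["title"],
        "link": entry["link"]
    }
    citation_index[entry["id"]] = metadata
    citation_index[entry["citation"]] = metadata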
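
The `references` entries spell the same key two ways ("&" in `id`, "and" in `citation`), so any lookup has to tolerate both spellings. The sketch below shows one possible way to match citations in running text against such an index. The helpers `normalise_key` and `find_citations` and the two-entry `demo_index` are hypothetical names used for illustration only; they are not part of the notebook.

```python
import re
from typing import Dict, List


def normalise_key(key: str) -> str:
    """Map '(Chen and Guestrin, 2016)' and '(Chen & Guestrin, 2016)' to one form."""
    key = re.sub(r"\s+and\s+", " & ", key.strip())
    return re.sub(r"\s+", " ", key)


def find_citations(text: str, index: Dict[str, Dict]) -> List[str]:
    """Return the index keys of every '(Author, Year)' citation found in text."""
    normalised = {normalise_key(k): k for k in index}
    # Match the whole parenthesised citation, brackets included
    candidates = re.findall(r"\([^()]+?,\s?\d{4}\)", text)
    return [
        normalised[normalise_key(c)]
        for c in candidates
        if normalise_key(c) in normalised
    ]


# Stand-in index for illustration only (assumption, not the notebook's data)
demo_index = {
    "(Chen & Guestrin, 2016)": {"title": "XGBoost: A Scalable Tree Boosting System"},
    "(Fiig et al., 2018)": {"title": "Dynamic Pricing of Airline Offers"},
}

sample = "Boosting (Chen and Guestrin, 2016) is used for pricing (Fiig et al., 2018)."
found = find_citations(sample, demo_index)
print(found)  # ['(Chen & Guestrin, 2016)', '(Fiig et al., 2018)']
```

Parenthesised text that merely looks like a citation, such as "(see Table 2, 2016)", also matches the pattern, which is why the result is filtered against the index.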

### Functions used for chunking

In [None]:
import re
from typing import List, Dict

def extract_citation_keys(text: str, citation_index: Dict[str, Dict]) -> List[str]:
    """
    Extract academic citation keys from text using regex patterns.
    
    This function identifies citation patterns like "(Author, Year)" and matches them
    against the citation index to find valid references.
    
    Args:
        text (str): The text to search for citations
        citation_index (Dict[str, Dict]): Dictionary mapping citation keys to metadata
    
    Returns:
        List[str]: List of valid citation keys found in the text
    """
    # Find all citation patterns in the format "(Author, Year)"
    # The whole match keeps its brackets, because the index keys include them
    matches = re.findall(r"\([^()]+?,\s?\d{4}\)", text)
    # Filter matches to only include valid citations from our index
    return [m.strip() for m in matches if m.strip() in citation_index]

def split_by_section_headings_with_meta(
    text: str,
    source_type: str,
    source_name: str = "",
    citation_index: Dict[str, Dict] = None,
    max_chunk_words: int = 200,
    overlap_words: int = 40
) -> List[Dict]:
    """
    Split text into semantic chunks based on section headings with metadata preservation.
    
    This function identifies numbered section headers (e.g., "1 Introduction", "2.1 Method")
    and splits the text accordingly. Each section is further divided into smaller overlapping
    chunks to maintain semantic coherence while fitting within LLM context windows.
    
    Args:
        text (str): The input text to be chunked
        source_type (str): Type of source ("thesis" or "reference")
        source_name (str, optional): Name of the source document
        citation_index (Dict[str, Dict], optional): Citation lookup dictionary
        max_chunk_words (int): Maximum words per chunk (default: 200)
        overlap_words (int): Number of overlapping words between chunks (default: 40)
    
    Returns:
        List[Dict]: List of chunk dictionaries with metadata including:
            - section_title: The section heading
            - text: The chunk content
            - type: Source type (thesis/reference)
            - source: Source name (for references)
            - citations: List of citations found (for thesis chunks)
    """
    # Match section headers like "1 Introduction" or "2.1 Method"
    # re.MULTILINE lets "^" match at every line start, including the first line of the text
    headings = list(re.finditer(r"^(\d{1,2}(?:\.\d{1,2})?)\s+([A-Z][^\n]*)", text, flags=re.MULTILINE))
    chunks = []

    # Process each section identified by headings
    for i, match in enumerate(headings):
        section_title = match.group().strip()
        start = match.end()
        # Determine section end (next heading or end of text)
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        content = text[start:end].strip()

        # Break content into smaller overlapping semantic chunks
        # This sliding window approach maintains context while controlling chunk size
        words = content.split()
        pointer = 0
        
        while pointer < len(words):
            # Extract chunk of maximum size
            sub_chunk_words = words[pointer:pointer + max_chunk_words]
            sub_text = " ".join(sub_chunk_words).strip()

            if sub_text:
                # Create chunk with metadata
                chunk = {
                    "section_title": section_title,
                    "text": sub_text,
                    "type": source_type
                }

                # Add source information for reference documents
                if source_type == "reference":
                    chunk["source"] = source_name

                # Extract and store citations for thesis chunks
                if source_type == "thesis" and citation_index:
                    chunk["citations"] = extract_citation_keys(sub_text, citation_index)

                chunks.append(chunk)

            # Stop once this window reaches the section end, so the tail is not repeated
            if pointer + max_chunk_words >= len(words):
                break
            pointer += max(1, max_chunk_words - overlap_words)  # sliding window with overlap

    return chunks
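
To make the sliding-window arithmetic concrete, the sketch below computes only the word offsets of each chunk for the default settings (200-word windows, 40-word overlap, so a step of 160 words). `window_bounds` is a hypothetical helper written for this illustration. It stops as soon as a window reaches the last word of the section, so it never emits a chunk that consists solely of the previous chunk's overlapping tail.

```python
from typing import List, Tuple


def window_bounds(n_words: int, max_chunk_words: int = 200,
                  overlap_words: int = 40) -> List[Tuple[int, int]]:
    """Return (start, end) word offsets for each overlapping chunk."""
    if overlap_words >= max_chunk_words:
        raise ValueError("overlap_words must be smaller than max_chunk_words")
    step = max_chunk_words - overlap_words
    bounds = []
    start = 0
    while start < n_words:
        end = min(start + max_chunk_words, n_words)
        bounds.append((start, end))
        if end == n_words:  # this window already reaches the last word
            break
        start += step
    return bounds


print(window_bounds(450))  # [(0, 200), (160, 360), (320, 450)]
print(window_bounds(168))  # [(0, 168)]
```

Each consecutive pair of windows shares exactly 40 words, for example words 160 to 199 in the first two windows of the 450-word section.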

### Load the texts using the functions defined

In [None]:
# Document loading utilities
# These functions handle loading and preprocessing of various document formats

def txt_to_plain_text(txt_path):
    """
    Convert a TXT file to plain text string.
    
    Args:
        txt_path (str): Path to the text file
        
    Returns:
        str: Plain text content with leading/trailing whitespace removed
    """
    with open(txt_path, 'r', encoding='utf-8') as file:
        plain_text = file.read()
    return plain_text.strip()

# Load the different source documents
# These include the main thesis and supporting reference papers
thesis = txt_to_plain_text("thesis/fyp_thesis.txt")

# Load reference papers that support the thesis research
fiig_reference = txt_to_plain_text("RAG_sources_cleaned/Dynamic_Pricing_of_Airline_Offers.txt")
garbarino_reference = txt_to_plain_text("RAG_sources_cleaned/Dynamic_Pricing_in_Internet_Retail_OCR.txt")
chen_reference = txt_to_plain_text("RAG_sources_cleaned/XGBoost_A_Scalable_Tree_Boosting_System copy.txt")
spagnuelo_reference = txt_to_plain_text("RAG_sources_cleaned/Metrics_for_Transparency.txt")

In [None]:
# Document chunking process
# Convert each document into structured chunks with metadata
# This enables granular retrieval and proper source attribution

# Process the main thesis document
# Include citation tracking to maintain academic integrity
thesis_chunks = split_by_section_headings_with_meta(
    text=thesis,
    source_type="thesis",
    citation_index=citation_index
)

# Process reference documents
# Each reference is tagged with its citation key for proper attribution
fiig_chunks = split_by_section_headings_with_meta(
    text=fiig_reference,
    source_type="reference",
    source_name="(Fiig et al., 2018)"
)

garbarino_chunks = split_by_section_headings_with_meta(
    text=garbarino_reference,
    source_type="reference",
    source_name="(Garbarino & Lee, 2003)"
)

chen_chunks = split_by_section_headings_with_meta(
    text=chen_reference,
    source_type="reference",
    source_name="(Chen & Guestrin, 2016)"
)

spagnuelo_chunks = split_by_section_headings_with_meta(
    text=spagnuelo_reference,
    source_type="reference",
    source_name="(Spagnuelo et al., 2017)"
)

In [None]:
# Combine all document chunks into a single corpus
# This creates a unified knowledge base for the RAG system
total_chunks = thesis_chunks + fiig_chunks + garbarino_chunks + chen_chunks + spagnuelo_chunks

# Extract text content for embedding generation
# The embedding model requires plain text input
chunks = [chunk["text"] for chunk in total_chunks]

# Initialize the embedding model
# BGE-base-en-v1.5 is a high-quality English embedding model
# suitable for semantic search and retrieval tasks
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
print("Model loaded successfully.")

# Generate embeddings for all text chunks
# This converts text into dense vector representations for similarity search
embeddings = model.encode(chunks)

Model loaded successfully.


In [None]:
# Vector database setup using FAISS
# FAISS (Facebook AI Similarity Search) provides efficient similarity search
# IndexFlatL2 uses L2 (Euclidean) distance for exact nearest neighbor search
index = faiss.IndexFlatL2(embeddings.shape[1])

# Add embeddings to the index
# Convert to float32 as required by FAISS
index.add(np.array(embeddings).astype(np.float32))

# Store chunk metadata for result retrieval
# This lookup table maps vector indices to original chunk metadata
chunk_lookup = total_chunks

In [None]:
# Retrieval function for finding relevant document chunks
def retrieve_relevant_chunks(query, embed_model, index, chunk_lookup, top_k=3):
    """
    Retrieve the most relevant document chunks for a given query.
    
    This function implements semantic search by:
    1. Encoding the query into an embedding vector
    2. Searching the vector index for similar chunks
    3. Returning the metadata of the most relevant chunks
    
    Args:
        query (str): The user's question or search query
        embed_model: The sentence transformer model for encoding
        index: The FAISS vector index
        chunk_lookup: List of chunk metadata for result mapping
        top_k (int): Number of most relevant chunks to return
    
    Returns:
        List[Dict]: List of chunk dictionaries with metadata
    """
    # Encode the query into an embedding vector
    query_embedding = embed_model.encode([query])
    
    # Search the vector index for similar chunks
    # D contains distances, I contains indices of nearest neighbors
    D, I = index.search(np.array(query_embedding).astype(np.float32), top_k)
    
    # Return the metadata of the most relevant chunks
    # FAISS pads the result with -1 when the index holds fewer than top_k vectors
    return [chunk_lookup[i] for i in I[0] if i >= 0]

# Example query demonstrating the retrieval system
query = "What are the benefits of dynamic pricing in retail?"

# Retrieve relevant chunks for the query
# This demonstrates the core retrieval functionality
relevant_chunks = retrieve_relevant_chunks(
    query=query,
    embed_model=model,
    index=index,
    chunk_lookup=chunk_lookup,
    top_k=3
)

# Display the retrieved chunks with their metadata
# This shows how the system finds and presents relevant information
for chunk in relevant_chunks:
    print(f"Section Title: {chunk['section_title']}")
    print(f"Text: {chunk['text']}\n")

Section Title: 2.2	Problem Outline
Text: 3rd party to benefit from dynamic pricing strategies.

Section Title: 1.3	Dynamic Pricing in Retail
Text: to market demand and sell products at the highest price customers are willing to pay. Secondly, it allows retailers to have efficient inventory management by having them adjust their prices based on surplus or scarcity of stock. Finally, dynamic pricing can boost customer satisfaction and trust through fair prices aligned with market conditions. Nevertheless, dynamic pricing, especially in retail is facing significant criticism, as some consumers find it unfair (Garbarino & Lee, 2003) to change prices for different conditions. For instance, car-service apps created a surge in price with their drivers in New York City during peak hours. This led to legal prosecutions by regulators to avoid such excess. Dynamic pricing can also have severe consequences for the company if it is misused. This can lead to a decrease of sales or brand reputation. 

In [None]:
# Response generation and context formatting functions
# These functions create structured prompts for the LLM with proper source attribution

def format_context_with_citation_keys(chunks, citation_index):
    """
    Format retrieved chunks with proper source attribution and citation information.
    
    This function creates a structured context string that clearly identifies:
    - Whether content comes from the thesis or reference papers
    - Which specific references are cited within thesis content
    - Full titles of reference papers for transparency
    
    Args:
        chunks (List[Dict]): Retrieved chunks with metadata
        citation_index (Dict): Citation lookup dictionary
    
    Returns:
        str: Formatted context string with source attribution
    """
    formatted_chunks = []

    for chunk in chunks:
        source_type = chunk["type"]
        text = chunk["text"]

        if source_type == "thesis":
            # For thesis content, identify any citations referenced
            cited_keys = chunk.get("citations", [])
            if cited_keys:
                # Create a readable list of cited sources
                # Fall back to the key itself if an entry has no "citation" field
                citation_labels = ", ".join(
                    citation_index[cid].get("citation", cid) for cid in cited_keys if cid in citation_index
                )
                formatted = f"(From Thesis, citing {citation_labels})\n{text}"
            else:
                formatted = f"(From Thesis)\n{text}"

        elif source_type == "reference":
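            # For reference content, show the full paper title
            # Look the key up directly, then fall back to matching the stored "id"
            citation_key = chunk.get("source", "")
            citation_info = citation_index.get(citation_key) or next(
                (info for info in citation_index.values() if info.get("id") == citation_key),
                {}
            )
            ref_title = citation_info.get("title", citation_key)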
            formatted = f"(From Reference: {ref_title})\n{text}"

        else:
            # Fallback for unknown source types
            formatted = f"(From Unknown)\n{text}"

        formatted_chunks.append(formatted)

    return "\n\n".join(formatted_chunks)


def build_prompt(query, retrieved_chunks, citation_index):
    """
    Build a structured prompt for the LLM with context and instructions.
    
    This function creates a comprehensive prompt that includes:
    - Clear instructions for the LLM's role
    - Properly formatted context with source attribution
    - The user's query
    - Guidelines for citation and reference handling
    
    Args:
        query (str): The user's question
        retrieved_chunks (List[Dict]): Relevant chunks from retrieval
        citation_index (Dict): Citation lookup dictionary
    
    Returns:
        str: Complete prompt for LLM processing
    """
    # Format context with proper source attribution
    context = format_context_with_citation_keys(retrieved_chunks, citation_index)
    
    # Create structured prompt with clear instructions
    return f"""[INST] You are an academic assistant helping to summarise and explain the content of a thesis.

Use the following context to answer the question. If the context comes from a reference or citation, clearly state which reference it is from (e.g. Fiig et al., 2018). Otherwise, assume it's from the thesis itself.

Context:
{context}

Question:
{query}
[/INST]"""


def generate_rag_answer(query, embed_model, index, chunk_lookup, citation_index, llm):
    """
    Generate a complete RAG (Retrieval-Augmented Generation) answer.
    
    This function orchestrates the entire RAG pipeline:
    1. Retrieve relevant chunks based on the query
    2. Format context with proper source attribution
    3. Build a structured prompt for the LLM
    4. Generate the final answer using the local LLM
    
    Args:
        query (str): The user's question
        embed_model: Sentence transformer for encoding
        index: FAISS vector index
        chunk_lookup: Chunk metadata lookup
        citation_index: Citation information
        llm: Local LLM instance
    
    Returns:
        tuple: (prompt, answer) - The full prompt and generated answer
    """
    # Step 1: Retrieve relevant chunks
    retrieved = retrieve_relevant_chunks(query, embed_model, index, chunk_lookup)
    
    # Step 2: Build structured prompt
    prompt = build_prompt(query, retrieved, citation_index)

    # Step 3: Generate answer using local LLM
    response = llm(prompt, max_tokens=512, temperature=0.3, top_p=0.9)
    answer = response['choices'][0]['text'].strip()
    
    return prompt, answer
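
The prompt and the generated answer share one context window, so the prompt has to leave room for `max_tokens`. The sketch below works through that budget for the values used in this notebook (`n_ctx=4096`, `max_tokens=512`). The helper names are hypothetical, and the 1.5 tokens-per-word ratio is a rough assumption rather than a measurement; an exact count would have to come from the model's own tokenizer.

```python
import math


def max_prompt_tokens(n_ctx: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the answer is reserved."""
    if max_new_tokens >= n_ctx:
        raise ValueError("max_new_tokens must be smaller than n_ctx")
    return n_ctx - max_new_tokens


def estimate_tokens(text: str, tokens_per_word: float = 1.5) -> int:
    """Rough token estimate from a word count (the ratio is an assumption)."""
    return math.ceil(len(text.split()) * tokens_per_word)


def fits_context(prompt: str, n_ctx: int = 4096, max_new_tokens: int = 512) -> bool:
    """Check whether the estimated prompt size leaves room for the answer."""
    return estimate_tokens(prompt) <= max_prompt_tokens(n_ctx, max_new_tokens)


budget = max_prompt_tokens(4096, 512)
three_chunks = " ".join(["word"] * 600)  # stand-in for three 200-word chunks
print(budget, fits_context(three_chunks))  # 3584 True
```

With three 200-word chunks the prompt stays well inside the budget. Raising `top_k` or `max_chunk_words` substantially is the point at which a check like this becomes useful.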

In [None]:
# Complete RAG system demonstration
# This example shows the full pipeline from query to answer

# Example query about a specific technical concept
query = "Describe what XGBoost is and how it is applied in the thesis"

# Initialize the local LLM
# Using llama.cpp for efficient local inference with a quantized model
# Parameters:
# - model_path: Path to the quantized model file
# - n_ctx: Context window size (4096 tokens)
# - n_threads: Number of CPU threads for inference
# - n_batch: Number of prompt tokens processed per batch (larger values usually speed up prompt evaluation)
llm = Llama(model_path="/Users/felix/Desktop/RAG_project/mistral-4bit.gguf", n_ctx=4096, n_threads=4, n_batch=4)

# Generate RAG answer using the complete pipeline
# This demonstrates the integration of retrieval and generation
prompt, answer = generate_rag_answer(
    query=query,
    embed_model=model,
    index=index,
    chunk_lookup=chunk_lookup,
    citation_index=citation_index,
    llm=llm
)

# Display the generated answer
print(answer)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/felix/Desktop/RAG_project/mistral-4bit.gguf (version GGUF V2)

XGBoost (Extreme Gradient Boosting) is a machine learning algorithm that builds upon gradient boosting techniques, which are used for additive optimization in functional space. It was first described in 2016 by Chen and Guestrin. XGBoost enhances the foundation of gradient boosting by introducing a regularized objective to prevent overfitting. The algorithm simplifies the formulation and optimizes it for parallel processing, making it efficient and scalable. Column sampling is another effective technique used in XGBoost to reduce overfitting and improve training efficiency. It is inspired by methods from Random Forests.

In the thesis, XGBoost was applied to reduce the root mean squared error (RMSE) score of a time series forecasting model. The final model used was XGBoost, which had been tested because of its increasing popularity and success in major data science competitions. No hyperparameter tuning was made in the implementation of the XGBoost model, and the accuracy of 0.08 in th


llama_print_timings:        load time =   12920.85 ms
llama_print_timings:      sample time =     130.43 ms /   232 runs   (    0.56 ms per token,  1778.73 tokens per second)
llama_print_timings: prompt eval time =  998744.51 ms /   732 tokens ( 1364.41 ms per token,     0.73 tokens per second)
llama_print_timings:        eval time =  268357.02 ms /   231 runs   ( 1161.72 ms per token,     0.86 tokens per second)
llama_print_timings:       total time = 1269510.58 ms
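
The timing lines above can be turned into more readable figures with a little arithmetic. The sketch below copies the numbers from this run and recomputes the throughput, the total run time in minutes, and the share of time spent evaluating the prompt. The variable and function names are hypothetical.

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return tokens / (elapsed_ms / 1000.0)


# Values copied from the llama_print_timings output of this run
prompt_ms, prompt_tokens = 998744.51, 732
eval_ms, eval_tokens = 268357.02, 231
total_ms = 1269510.58

prompt_rate = tokens_per_second(prompt_ms, prompt_tokens)
eval_rate = tokens_per_second(eval_ms, eval_tokens)
total_minutes = total_ms / 60000.0
prompt_share = prompt_ms / total_ms

print(f"{prompt_rate:.2f} {eval_rate:.2f} {total_minutes:.1f} {prompt_share:.0%}")
# 0.73 0.86 21.2 79%
```

The run took about 21 minutes, and roughly four fifths of that was spent reading the 732-token prompt rather than generating the answer. One plausible contributor is the small `n_batch=4`, since the prompt is processed in batches of that size, although this was not measured here.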


In [None]:
# Display both the prompt and answer for transparency
# This shows the complete interaction between the retrieval system and LLM

# The prompt shows:
# - Retrieved context with source attribution
# - The user's original question
# - Instructions for proper citation handling
print("=== PROMPT ===")
print(prompt)

print("\n=== ANSWER ===")
# The answer demonstrates:
# - How the LLM synthesizes information from multiple sources
# - Proper citation and reference handling
# - Coherent response generation based on retrieved context
print(answer)

[INST] You are an academic assistant helping to summarise and explain the content of a thesis.

Use the following context to answer the question. If the context comes from a reference or citation, clearly state which reference it is from (e.g. Fiig et al., 2018). Otherwise, assume it's from the thesis itself.

Context:
(From Reference: (Chen & Guestrin, 2016))
builds upon a long line of research in gradient boosting and scalable machine learning systems. This section highlights how XGBoost relates to, and improves upon, previous work. Gradient boosting, originally developed as a general technique for additive optimisation in functional space, has been successfully applied to a wide variety of tasks including classification, learning to rank, and structured prediction. XGBoost enhances this foundation by introducing a regularised objective that helps prevent overfitting. While similar ideas appear in earlier works such as Regularized Greedy Forest (RGF), XGBoost simplifies the formulati