One of the interesting things about embeddings[^1] is that they let you search documents by meaning, so the results reflect what your query means rather than the exact words and terms in it.

[^1]: AWS explains embeddings [here](https://aws.amazon.com/what-is/embeddings-in-machine-learning) and I play around with them [here](../2025-01-19-playing-with-embeddings/index.html).
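To make the idea concrete, here is a toy sketch. The three-dimensional vectors are made up by hand purely for illustration (real embeddings come from a model and have far more dimensions), and `cosine_similarity` is my own helper rather than a library call. The query shares no words with the first document, yet their vectors point in almost the same direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean 'similar direction'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (assumption: not produced by any real model).
query_text = "mending flat tyres"
query_vec = [0.9, 0.1, 0.0]

toy_documents = {
    "Repairing a punctured bicycle wheel": [0.8, 0.2, 0.1],
    "Best pasta recipes": [0.0, 0.1, 0.9],
}

# Score every document against the query, highest similarity first.
scores = {text: cosine_similarity(query_vec, vec) for text, vec in toy_documents.items()}
for text, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.2f}  {text}")
```

A keyword search for "mending flat tyres" would find nothing here, while the vector comparison ranks the bicycle document well above the pasta one.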

I am going to use Latent Space’s [The 2025 AI Engineer Reading List](https://www.latent.space/p/2025-papers) and search through the PDFs in their sections about RAG and Agents to see what happens when I search for RAG-related terms.

In [4]:
import os

from dotenv import load_dotenv
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings

load_dotenv()  # load environment variables (such as OPENAI_API_KEY) from .env

True

## LLM application development made easier with LangChain
I'm using LangChain, an open-source framework that makes life easier for people writing LLM applications.

In [5]:
pdf_directory = "../../sample_data/The 2025 AI Engineer Reading List/RAG/"
pdf_files = [f for f in os.listdir(pdf_directory) if f.endswith('.pdf')]

documents = []
for pdf_file in pdf_files:
    file_path = os.path.join(pdf_directory, pdf_file)
    loader = PyPDFLoader(file_path)
    documents.extend(loader.load())

# Optionally split the documents into manageable chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

The `RecursiveCharacterTextSplitter` produces a list of document objects, each storing its metadata (document name, page number, etc.) along with the actual chunk of content. Note that the content is still stored as normal text at this stage.
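To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch. It is not LangChain's implementation: the real splitter tries to break at paragraph, line, and word boundaries first, whereas this hypothetical `split_with_overlap` helper just slides a fixed-size window over the characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window repeats the last
    `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.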

## Create embeddings for the documents
We need our AI models to be able to work with the documents, so we create an embedding for each chunk. This is where we start hitting the OpenAI API. I would prefer to use a local LLM, but for now we want the best results possible so that we can tell where our code is doing a good job and where it isn't.

We use OpenAI's embedding model to create the vectors and Meta's FAISS library to search them for similar passages.

In [6]:
embeddings = OpenAIEmbeddings()  # needs the OPENAI_API_KEY environment variable
vectorstore = FAISS.from_documents(docs, embeddings)

If we explore the vector store, there are 362 embeddings in total, one per chunk:

In [29]:
vectorstore.index.ntotal

362

Each embedding has 1536 dimensions:

In [30]:
# look at the first embedding
vectorstore.index.reconstruct(0)

array([-0.02747275, -0.01618657, -0.0022142 , ..., -0.02544596,
       -0.00838481, -0.03731519], shape=(1536,), dtype=float32)
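Searching these vectors comes down to measuring how far the query's embedding is from each stored one. As far as I know, LangChain's FAISS wrapper defaults to a flat index with Euclidean (L2) distance, which compares the query against every stored vector; treat that as an assumption and check your installed version. The sketch below does the same thing in plain Python on a tiny, made-up index. `l2_distance` and `brute_force_search` are illustrative helpers, not FAISS functions.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query_vec, index_vectors, k):
    """Return (distance, position) pairs for the k stored vectors closest to the query."""
    scored = (
        (l2_distance(query_vec, vec), position)
        for position, vec in enumerate(index_vectors)
    )
    return heapq.nsmallest(k, scored)

# Tiny stand-in for the 362 x 1536 matrix in the vector store (values are made up).
index_vectors = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
query_vec = [0.78, 0.12, 0.1]

results = brute_force_search(query_vec, index_vectors, k=2)
for distance, position in results:
    print(f"chunk {position}: distance {distance:.3f}")
```

The positions returned play the same role as the row numbers passed to `vectorstore.index.reconstruct`: they identify which chunk each vector belongs to.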

## Do the actual search

In [7]:
query = "retreived augmented generation in large lanugage models, RAG"
retrieved_docs = vectorstore.similarity_search(query, k=5)  # adjust k as needed

In [9]:
# For each retrieved document, ask an LLM to summarize why it might be relevant.
llm = OpenAI(temperature=0)  # temperature 0 keeps the output as repeatable as possible

def summarize_relevance(doc, query):
    # Here we build a simple prompt to ask why the document is relevant to the query.
    prompt = (
        f"Given the following text from a document:\n\n"
        f"{doc.page_content}\n\n"
        f"And the search query:\n\n"
        f"{query}\n\n"
        f"Summarize why this text is relevant to the query."
    )
    summary = llm.invoke(prompt)
    return summary.strip()

# Process and print the results
for i, doc in enumerate(retrieved_docs):
    relevance_summary = summarize_relevance(doc, query)
    print(f"Document {i+1}:")
    print(f"--- Relevance Summary ---\n{relevance_summary}\n")
    print("Document Excerpt:")
    print(doc.page_content[:500], "...\n")  # print first 500 characters as an excerpt
    print("=" * 80)


Document 1:
--- Relevance Summary ---
The text discusses the use of RAG models, which utilize input sequences to retrieve text documents and generate responses. These models have achieved state-of-the-art results on various tasks, including fact verification and question generation. The text also mentions the use of non-parametric memory to update the models' knowledge as the world changes. This is relevant to the query as it discusses the use of "retreived augmented generation" in large language models, which is a key aspect of RAG models.

Document Excerpt:
without access to an external knowledge source. Our RAG models achieve state-of-the-art results
on open Natural Questions [29], WebQuestions [3] and CuratedTrec [2] and strongly outperform
recent approaches that use specialised pre-training objectives on TriviaQA [24]. Despite these being
extractive tasks, we ﬁnd that unconstrained generation outperforms previous extractive approaches.
For knowledge-intensive generation, we experi