In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

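# load_dotenv() has already copied the .env values into os.environ.
# Fail early with a clear message if a key is missing; assigning the None
# returned by os.getenv() to os.environ would raise a TypeError instead.
for key in ("OPENAI_API_KEY", "VOYAGE_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file")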

In [2]:
from typing_extensions import TypedDict
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

from langchain_voyageai import VoyageAIRerank
from langchain.retrievers import ContextualCompressionRetriever

from langgraph.graph import StateGraph, START, END


In [3]:
# Step 1: Load and split PDF
loader = PyPDFLoader("data/attention-is-all-you-need-Paper.pdf")
docs = loader.load()

# Check
print(f"Total pages loaded: {len(docs)}")
print(docs[5].page_content[:500])

Total pages loaded: 11
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. nis the sequence length, dis the representation dimension, kis the kernel
size of convolutions and rthe size of the neighborhood in restricted self-attention.
Layer Type Complexity per Layer Sequential Maximum Path Length
Operations
Self-Attention O(n2 ·d) O(1) O(1)
Recurrent O(n·d2) O(n) O(n)
Convolutional O(k·n·d2) O(1) O(logk(n))
Self-Attention (restricted) O(r·n·d) O(1) 


In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)

# Check
print(f"Total chunks: {len(chunks)}")
print(chunks[5].page_content)

Total chunks: 43
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for signiﬁcantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
[20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as basic building
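
The effect of chunk_size and chunk_overlap can be illustrated with a simplified fixed-window splitter. This is only a sketch: sliding_chunks is a hypothetical helper, and it ignores the separator-based splitting that RecursiveCharacterTextSplitter performs, so real chunk boundaries will differ.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window repeats the last four characters of the previous one, which is why a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.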


In [5]:
# Step 2: Create FAISS vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

In [6]:
# Step 3: Base retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
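
With k set to 5, the retriever returns the five chunks whose embeddings lie closest to the query embedding. A minimal sketch of that idea, assuming an exact brute-force search by Euclidean distance over made-up 2-D vectors; top_k and the toy vectors are illustrative only and do not model FAISS internals or real embedding dimensions.

```python
import math


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy "embeddings", one per chunk (assumed values for illustration)
toy_vectors = [
    (1.0, 0.0),  # chunk 0
    (0.9, 0.1),  # chunk 1
    (0.0, 1.0),  # chunk 2
    (0.7, 0.7),  # chunk 3
]
toy_query = (0.8, 0.3)

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```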

In [7]:
# Check
query = "What is transformer?"
retrieved_docs = retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()
print(f"Retrieved {len(retrieved_docs)} documents")

for i, doc in enumerate(retrieved_docs, start=1):
    print(f"Rank Doc {i}:\n{doc.page_content[:200]}")


Retrieved 5 documents
Rank Doc 1:
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz
Rank Doc 2:
aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [14, 15] and [8].
3 Model Architecture

Rank Doc 3:
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the 
Rank Doc 4:
6 Results
6.1 Machine Translation
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)
in Table 2) outperforms the best previously reported models (includin
Rank Doc 5:
Figure 1: The Transformer - model architecture.
wise fully connected feed-forward network. We employ a residual co

In [8]:
# Step 4: Add VoyageAI reranker
compressor = VoyageAIRerank(
    model="rerank-lite-1",             # or 'rerank-1', 'rerank-2', etc.
    top_k=3                            # number of top docs to keep
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

reranked_docs = compression_retriever.invoke(query)
print(f"Reranked to {len(reranked_docs)} documents")

for i, doc in enumerate(reranked_docs, start=1):
    print(f"Rank Doc {i}:\n{doc.page_content[:200]}")

Reranked to 3 documents
Rank Doc 1:
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the 
Rank Doc 2:
6 Results
6.1 Machine Translation
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)
in Table 2) outperforms the best previously reported models (includin
Rank Doc 3:
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz
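
The output above shows the two-stage pattern: the base retriever proposes five candidates, and the reranker re-scores them and keeps three. The sketch below mirrors that flow under loud assumptions: overlap_score is a hypothetical word-overlap stand-in, not the VoyageAI model, and the passages are invented.

```python
def overlap_score(query: str, passage: str) -> float:
    """Hypothetical relevance score: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query, candidates, top_k):
    """Re-order first-stage candidates by a second score and keep the best top_k."""
    ranked = sorted(candidates, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_k]


# Invented first-stage results, in the order a vector search might return them
first_stage = [
    "author names and email addresses",
    "the transformer is a model architecture based on attention",
    "results on machine translation with the big transformer",
]
kept = rerank("what is the transformer", first_stage, top_k=2)
for passage in kept:
    print(passage)
```

As in the real run, a passage that ranked first by embedding distance (the title page) can drop once a second scorer looks at the query and passage together.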


In [9]:
# Step 5: LLM
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)

context = "\n\n".join([d.page_content for d in reranked_docs])
prompt = f"Answer the following question based on the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

answer = llm.invoke(prompt)
print("Answer:", answer.content)

Answer: The Transformer is a model architecture designed for sequence modeling and transduction tasks that relies entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This architecture allows for the modeling of dependencies between input and output sequences without regard to their distance, enabling significant parallelization during training. The Transformer has demonstrated superior performance in machine translation tasks, achieving state-of-the-art results while requiring less training time and cost compared to previous models that utilized recurrent or convolutional neural networks.
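
Step 5 joins every reranked chunk into the prompt. For longer documents it may help to cap the context size; the following is a sketch only, where build_prompt and max_context_chars are hypothetical names and the character budget is a rough proxy for a token limit.

```python
def build_prompt(question: str, passages: list[str], max_context_chars: int = 4000) -> str:
    """Join passages into a context block, stopping before the budget is exceeded."""
    selected = []
    used = 0
    for passage in passages:
        if used + len(passage) > max_context_chars:
            break  # passages are assumed to be ordered best-first
        selected.append(passage)
        used += len(passage)
    context = "\n\n".join(selected)
    return (
        "Answer the following question based on the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


demo_prompt = build_prompt(
    "What is transformer?",
    ["first passage", "second passage"],
    max_context_chars=20,
)
print(demo_prompt)
```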


In [10]:
# Step 6: LangGraph pipeline
class RAGState(TypedDict, total=False):
    # Shared state passed between nodes; each node fills in one key
    question: str
    context: str
    answer: str

def retrieve_docs(state: RAGState) -> dict:
    # Retrieve with the base retriever, then rerank with VoyageAI
    docs = compression_retriever.invoke(state["question"])
    return {"context": "\n\n".join(d.page_content for d in docs)}

def generate_answer(state: RAGState) -> dict:
    prompt = f"Answer the question using the context:\n\nContext:\n{state['context']}\n\nQuestion:\n{state['question']}"
    return {"answer": llm.invoke(prompt).content}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve_docs)
graph.add_node("generate", generate_answer)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)

app_graph = graph.compile()

# Test full pipeline
state = {"question": query}
result = app_graph.invoke(state)
print("Pipeline Answer:", result["answer"])


Pipeline Answer: The Transformer is a model architecture that relies entirely on attention mechanisms to model dependencies between input and output sequences, without using recurrence or convolutions. This design allows for significantly more parallelization in processing, leading to improved efficiency and performance in tasks such as machine translation. The Transformer has been shown to achieve state-of-the-art results in translation quality, outperforming previous models while requiring less training time and cost.
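
The graph above is a straight line from START through retrieve and generate to END, with each node contributing keys to a shared state. A minimal sketch of that control flow in plain Python follows. It is not how LangGraph is implemented; run_pipeline, fake_retrieve, and fake_generate are hypothetical stand-ins that avoid any API calls.

```python
def run_pipeline(nodes, state):
    """Run node functions in order, merging each returned dict into the state."""
    state = dict(state)  # copy so the caller's dict is not mutated
    for node in nodes:
        state.update(node(state))
    return state


def fake_retrieve(state):
    # Stand-in for the retrieval node; returns a fixed context string.
    return {"context": f"notes about: {state['question']}"}


def fake_generate(state):
    # Stand-in for the LLM node; echoes the context instead of calling a model.
    return {"answer": f"Based on [{state['context']}]"}


final_state = run_pipeline(
    [fake_retrieve, fake_generate],
    {"question": "What is transformer?"},
)
print(final_state["answer"])
```

Keeping each node a function from state to a partial update makes it straightforward to test nodes in isolation before wiring them into a graph.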
