## Simple RAG Application

### PDF Reader

In [None]:
!pipenv install pypdf

In [33]:
from pypdf import PdfReader

In [34]:
def get_pdf_content(file_path: str) -> str:
    """Return the text of every page in a PDF, concatenated in page order."""
    if not file_path.endswith(".pdf"):
        raise ValueError("Expecting .pdf file")

    reader = PdfReader(file_path)

    print(f"Found {len(reader.pages)} pages")

    pdf_content = ""

    for page in reader.pages:
        # Fall back to an empty string if a page yields no text
        text = page.extract_text() or ""
        pdf_content += text

    return pdf_content

In [35]:
pdf_content = get_pdf_content(file_path="../assets/docs/Declaration_of_Independence.pdf")

Found 3 pages


In [36]:
len(pdf_content)

9599

In [6]:
print(pdf_content[:300])

United States
Declaration of Independence
The 1823 facsimile of the engrossed copy of
the Declaration of Independence
Created June–July 1776
Ratified July 4, 1776
Location Engrossed copy: National
Archives Building
Rough draft: Library of Congress
Author(s) Thomas Jefferson, Committee of
Five
Signat


### LLM + Embeddings

| Chat Completion |
|--|

In [1]:
from openai import AzureOpenAI
from dotenv import load_dotenv
from os import getenv

load_dotenv()
client = AzureOpenAI(
    api_key=getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=getenv("AZURE_OPENAI_API_VERSION")
)

In [2]:
MODEL_NAME = "gpt-4o-mini"

completion = client.chat.completions.create(
  model=MODEL_NAME,
  store=False,
  messages=[
    {"role": "user", "content": "write a haiku about ai"}
  ]
)

In [22]:
def get_chat_completion(query: str):
    completion = client.chat.completions.create(
        model=MODEL_NAME,
        store=False,
        messages=[
            {"role": "user", "content": query}
        ]
    )
    return completion.choices[0].message.content

In [6]:
print(completion.choices[0].message.content)

Silicon whispers,  
Dreams woven in data streams,  
Minds awake in code.


In [25]:
print(get_chat_completion(query="Tell me a joke"))

Why did the scarecrow win an award? 

Because he was outstanding in his field!


| Embeddings |
|--|

In [7]:
embedding_deployment = "text-embedding-3-small"

In [8]:
try:
    # Query the embeddings endpoint
    response = client.embeddings.create(
        model=embedding_deployment,  # Use deployment name, not model name
        input="Hello World"
    )
    
    # Extract the embedding
    embedding = response.data[0].embedding
    print(f"Embedding (first 10 values): {embedding[:10]}")
    print(f"Embedding length: {len(embedding)}")
except Exception as e:
    print(f"Error generating embedding: {e}")

Embedding (first 10 values): [0.004830851685255766, -0.05471838638186455, 0.045494429767131805, 0.031470887362957, -0.02837539091706276, -0.029735533520579338, -0.03137708455324173, 0.03159596025943756, -0.01418769545853138, 0.015368049964308739]
Embedding length: 1536


In [9]:
import numpy as np

In [17]:
def get_openai_embeddings(texts):
    response = client.embeddings.create(
        model=embedding_deployment,  # Azure expects the deployment name here
        input=texts
    )
    return [np.array(embedding.embedding, dtype='float32') for embedding in response.data]
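
`get_openai_embeddings` sends every text in a single request. For longer documents it may be safer to send the chunks in smaller groups, since an embeddings endpoint can cap the number of inputs per request. The sketch below is one way to do that; the helper name `batched` and the batch size of 16 are illustrative assumptions, not values taken from the service documentation.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each holding at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the notebook's helper (not executed here):
# embeddings = [e for batch in batched(pdf_chunks) for e in get_openai_embeddings(batch)]

batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because each batch is embedded in order and the results are concatenated, the output list keeps the same order as the input texts.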

In [18]:
embedding_list = get_openai_embeddings(texts=["One", "Two", "Three"])

In [21]:
len(embedding_list), embedding_list[0], len(embedding_list[0])

(3,
 array([-0.01171586, -0.0244152 ,  0.02705341, ..., -0.01912316,
        -0.02052812,  0.06575244], shape=(1536,), dtype=float32),
 1536)

### Vector Store

In [None]:
!pipenv install faiss-cpu

In [26]:
import faiss

In [27]:
EMBEDDING_DIM = 1536

In [56]:
index = faiss.IndexFlatL2(EMBEDDING_DIM)
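
`IndexFlatL2` is a flat (exhaustive) index: it compares the query with every stored vector and ranks them by squared Euclidean distance. The pure-Python sketch below mimics that behaviour on toy 2-dimensional vectors to show the idea. The function names `squared_l2` and `flat_search` are made up for illustration and are not part of FAISS.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(
    vectors: List[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[float, int]]:
    """Return the k closest (distance, position) pairs, nearest first."""
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    scored.sort()  # ascending distance; ties broken by position
    return scored[:k]


stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
result = flat_search(stored, query=[0.9, 0.1], k=2)
print([i for _, i in result])  # positions of the two nearest vectors: [1, 0]
```

The cost grows linearly with the number of stored vectors, which is fine for a few dozen chunks but is the reason larger collections usually move to approximate indexes.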

### Chunking

In [None]:
!pipenv install tiktoken

In [30]:
import tiktoken

In [48]:
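def chunk_text(text, chunk_size=150, overlap=50):
    """Split text into chunks of chunk_size tokens; consecutive chunks share overlap tokens."""
    if overlap >= chunk_size:
        # Otherwise start would never advance and the loop would not terminate
        raise ValueError("overlap must be smaller than chunk_size")
    # tiktoken looks up model names; this works because the deployment is named after the model
    encoding = tiktoken.encoding_for_model(embedding_deployment)
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        # Decode the token window back to text
        chunks.append(encoding.decode(chunk_tokens))
        start += chunk_size - overlap
    return chunks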
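
The loop in `chunk_text` is a sliding window: each chunk starts `chunk_size - overlap` tokens after the previous one. The sketch below applies the same stepping to a plain list of integers standing in for token ids, so the window boundaries are easy to see. `window_tokens` is a hypothetical helper written for this illustration only.

```python
from typing import List


def window_tokens(tokens: List[int], chunk_size: int, overlap: int) -> List[List[int]]:
    """Slide a window of chunk_size over tokens, advancing chunk_size - overlap each step."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    windows = []
    start = 0
    while start < len(tokens):
        # Slicing past the end simply returns the remaining tokens
        windows.append(tokens[start:start + chunk_size])
        start += step
    return windows


windows = window_tokens(list(range(10)), chunk_size=4, overlap=2)
for w in windows:
    print(w)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
# [8, 9]
```

Note the last window: it contains only tokens already covered by the previous one. With this stepping scheme a short trailing chunk like that can appear whenever the final full window already reaches the end of the text.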

In [49]:
pdf_chunks = chunk_text(text=pdf_content)

In [50]:
len(pdf_chunks)

22

In [51]:
chunk_embeddings = get_openai_embeddings(texts=pdf_chunks)

In [52]:
len(chunk_embeddings[0])

1536

In [57]:
index.add(np.array(chunk_embeddings))

In [58]:
query_text = "Abraham Lincoln's involvement"
query_embedding = get_openai_embeddings([query_text])[0]

In [59]:
k = 3  # Number of nearest neighbors
distances, indices = index.search(query_embedding.reshape(1, -1), k)
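
The values in `distances` are squared Euclidean distances, which is what `IndexFlatL2` reports. If the embeddings have unit length (an assumption here; it is worth checking with `np.linalg.norm` on a real embedding), a squared distance `d2` converts to cosine similarity as `1 - d2 / 2`. Under that assumption a squared distance of 1.18 would correspond to a cosine similarity of about 0.41. The sketch below verifies the identity on two small hand-made vectors.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_from_squared_l2(d2: float) -> float:
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - d2 / 2.0


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]

d2 = sum((x - y) ** 2 for x, y in zip(a, b))
cos_direct = sum(x * y for x, y in zip(a, b))

print(round(cosine_from_squared_l2(d2), 6), round(cos_direct, 6))  # 0.96 0.96
```

Smaller distances therefore mean higher similarity, and the ranking by squared L2 distance matches the ranking by cosine similarity when all vectors are normalized.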

In [60]:
print("Query:", query_text)
print("Nearest chunks:")
for i, idx in enumerate(indices[0]):
    print(f"Chunk: {pdf_chunks[idx]}")
    print(f"Distance: {distances[0][i]:.4f}")
    print("#"*100)

Query: Abraham Lincoln's involvement
Nearest chunks:
Chunk:  globally impactful statement on human rights. The
Declaration was viewed by Abraham Lincoln as the moral standard to which the United States
should strive, and he considered it a statement of principles through which the Constitution should
be interpreted.[6]: 126  In 1863, Lincoln made the Declaration the centerpiece of his Gettysburg
Address, widely considered among the most famous speeches in American history.[7] The
Declaration's second sentence, "We hold these truths to be self-evident, that all men are created
equal, that they are endowed by their Creator with certain unalienable Rights, that among these
are Life, Liberty and the pursuit of Happiness", is considered one of the most significant and famed
lines in
Distance: 1.1801
####################################################################################################
Chunk: side was widely
distributed following its signing. It is now preserved at the Library 