# **Retrieval Augmented Generation with LLMs**

# Import Libraries and Load an LLM

We will install updated versions of KerasHub and KerasNLP, the latest pre-release builds of Keras (keras-nightly), and PyPDF2, which we use to extract text from PDF files (the reference database for our RAG).

In [None]:
!pip install --upgrade --quiet keras-hub-nightly keras-nightly
!pip install --upgrade --quiet keras-hub keras-nlp
!pip install --quiet pypdf2


In [None]:
import tensorflow as tf
import keras
import keras_hub
import keras_nlp
import numpy as np
import pandas as pd
from PyPDF2 import PdfReader
from sklearn.metrics.pairwise import cosine_similarity # We will use cosine similarity to implement our document lookup.

Let's load a pre-trained LLM from Keras hub. We will use Falcon 1B (because it is not too big).

In [None]:
# Load a model with float16 precision. This step may take some time.
falcon_chat = keras_hub.models.CausalLM.from_preset(
    "falcon_refinedweb_1b_en",#"phi3_mini_4k_instruct_en",
    dtype="float16"
)

Downloading from https://www.kaggle.com/api/v1/models/keras/falcon/keras/falcon_refinedweb_1b_en/2/download/config.json...


100%|██████████| 508/508 [00:00<00:00, 1.31MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/falcon/keras/falcon_refinedweb_1b_en/2/download/model.weights.h5...


100%|██████████| 4.89G/4.89G [01:38<00:00, 53.5MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/falcon/keras/falcon_refinedweb_1b_en/2/download/tokenizer.json...


100%|██████████| 628/628 [00:00<00:00, 1.24MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/falcon/keras/falcon_refinedweb_1b_en/2/download/assets/tokenizer/vocabulary.json...


100%|██████████| 0.99M/0.99M [00:00<00:00, 2.76MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/falcon/keras/falcon_refinedweb_1b_en/2/download/assets/tokenizer/merges.txt...


100%|██████████| 446k/446k [00:00<00:00, 1.58MB/s]


We will ask Falcon to answer a question for us about Moby Dick, namely 'who was the first mate on Capt. Ahab's boat?' The correct answer is Starbuck.

In [174]:
prompt = "What was the name of the first mate on Captain Ahab's boat?"

print(falcon_chat.generate(prompt, max_length=64))

What was the name of the first mate on Captain Ahab's boat?
- print Print Print
- list Cite
1 Answer
The first mate was a man named Queequinne. He was a native of England and he was the son of a man named John Queinne and a native


# Implement RAG with this Model

Write a function to extract the text from some PDF files.

In [190]:
import re
from PyPDF2 import PdfReader

# Chunks of up to 500 characters (not tokens) of mutually exclusive text (no overlap between chunks by default)
def process_pdf(pdf_path, chunk_size=500, overlap=0):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # Guard against pages where no text is extracted
        text += (page.extract_text() or "") + " "

    # Preprocessing
    text = re.sub(r'\r\n|\r|\n', ' ', text)  # Replace carriage returns and newlines with spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation

    if len(text) <= chunk_size:
        return [text]

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text) and text[end] != ' ':
            # Back up to the last space so we do not split a word;
            # keep the hard cut if the window contains no usable space.
            last_space = text.rfind(' ', start, end)
            if last_space > start:
                end = last_space + 1
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Always move forward, even if overlap is large
        start = max(end - overlap, start + 1)

    return chunks
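
The `overlap` argument is left at 0 in this notebook, so neighbouring chunks share no text. As a rough sketch of what a non-zero overlap would do, here is a toy character-window splitter. The helper name `chunk_text` and the sample string are made up for illustration and are not part of the notebook's pipeline; unlike `process_pdf`, this version does not try to respect word boundaries.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into character windows; neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
        start += step
    return chunks


# Hypothetical sample text, only for demonstration
sample = "the quick brown fox jumps over the lazy dog and keeps running through the field"
for piece in chunk_text(sample, chunk_size=40, overlap=10):
    print(repr(piece))
```

With overlap, a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.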

Let's find a copy of Moby Dick that we can chunk and use for RAG.

In [191]:
# Download a PDF of Moby Dick and save it locally for processing.
!wget -O moby_dick.pdf https://uberty.org/wp-content/uploads/2015/12/herman-melville-moby-dick.pdf

!apt-get install -y poppler-utils

# Optionally extract just the first 100 pages to a text file... it's a long book. Starbuck is first mentioned in the table of contents (though not in reference to being the first mate) and then again on page 77.
# pdftotext takes the first and last page as separate flags (-f and -l).
!pdftotext -f 1 -l 100 moby_dick.pdf moby_dick.txt

# process_pdf reads the full PDF, so the chunks below cover the whole book.
pdf_path = "moby_dick.pdf"
chunks = process_pdf(pdf_path)

print(chunks[-2])

--2025-04-17 15:04:04--  https://uberty.org/wp-content/uploads/2015/12/herman-melville-moby-dick.pdf
Resolving uberty.org (uberty.org)... 68.66.200.199
Connecting to uberty.org (uberty.org)|68.66.200.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1544566 (1.5M) [application/pdf]
Saving to: ‘moby_dick.pdf’


2025-04-17 15:04:05 (3.80 MB/s) - ‘moby_dick.pdf’ saved [1544566/1544566]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.7).
0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded.

In [192]:
# Find every text chunk that mentions 'starbuck'

starbuck_indices = []
for i,chunk in enumerate(chunks):
    if 'starbuck' in chunk.lower():
        starbuck_indices.append(i)

# Starbuck is mentioned in the table of contents, so print the second match instead of the first.
print(starbuck_indices)
print(chunks[starbuck_indices[1]])

[4, 370, 372, 373, 375, 385, 386, 387, 388, 415, 417, 419, 420, 421, 422, 424, 437, 438, 555, 562, 565, 610, 614, 615, 618, 619, 621, 622, 623, 627, 628, 633, 636, 637, 640, 642, 647, 649, 706, 817, 818, 840, 843, 872, 873, 1049, 1070, 1107, 1146, 1149, 1194, 1195, 1196, 1197, 1294, 1302, 1304, 1309, 1321, 1326, 1327, 1329, 1332, 1431, 1441, 1444, 1445, 1452, 1617, 1619, 1749, 1752, 1753, 1756, 1758, 1759, 1760, 1761, 1779, 1821, 1851, 1852, 1854, 1855, 1856, 1857, 1858, 1859, 1861, 1862, 1866, 1867, 1876, 1877, 1879, 1883, 1885, 1887, 1893, 1894, 1898, 1902, 1903, 1905, 1915, 1916, 1917, 1923, 1942, 1953, 1962, 1963, 1965, 1978, 1979, 1980, 1982, 1983, 1984, 1985, 1989, 1990, 1996, 1997, 2025, 2026, 2027, 2036, 2037, 2038, 2041, 2042, 2045, 2047, 2054, 2055, 2065]
 with a sort of muﬄedness then seemed troubled in the nose then revolved over once or twice then sat up and rubbed his eyes Holloa he breathed at last who be ye smokers Shipped men answered I when does she sail Aye aye ye ar

We need to construct embeddings from our prompt and from our text chunks to be able to perform the lookup between our prompt and the Moby Dick text chunks. Here is how we can recover a Falcon embedding for the prompt...

In [193]:
import tensorflow as tf
from keras_hub.tokenizers import FalconTokenizer

# Load the tokenizer that matches the model (the model itself is already loaded as falcon_chat)
tokenizer = FalconTokenizer.from_preset("falcon_refinedweb_1b_en")

# Tokenize input (returns just token IDs, shape: (seq_len,))
token_ids = tokenizer(prompt)

# Add batch dimension (now shape: (1, seq_len))
token_ids = tf.expand_dims(token_ids, 0)

# Create dummy padding mask of ones (no padding here for a single prompt, but the LLM expects a masking tensor along with the input, so it knows what tokens it can ignore.)
padding_mask = tf.ones_like(token_ids)

# Pass both token_ids and padding_mask to the backbone
hidden = falcon_chat.backbone({
    "token_ids": token_ids,
    "padding_mask": padding_mask
})

# Mean-pool to get a sentence embedding
mask = tf.cast(padding_mask, tf.float16)
masked_hidden = hidden * tf.expand_dims(mask, axis=-1)
embedding = tf.reduce_sum(masked_hidden, axis=1) / tf.reduce_sum(mask, axis=1, keepdims=True)

print("Prompt embedding shape:", embedding.shape)  # (1, hidden_dim)
print(f"Prompt embedding: {embedding}")

Prompt embedding shape: (1, 2048)
Prompt embedding: [[ 0.956   3.248   2.383  ...  2.896  -1.5625 -3.219 ]]
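
To make the mean-pooling step concrete, here is a small sketch in plain Python (no TensorFlow) on a toy "hidden state" with two dimensions. The function name `masked_mean_pool` and the numbers are invented for illustration; the idea is the same as above: zero out masked positions, sum over the sequence, and divide by the number of real tokens.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors over the positions where mask == 1.

    hidden: list of token vectors, shape (seq_len, hidden_dim)
    mask:   list of 0/1 flags, shape (seq_len,)
    """
    kept = [vec for vec, keep in zip(hidden, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    # Sum each dimension over the kept tokens, then divide by the token count
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: the third row plays the role of a padding position
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding row would drag the average far away from the real tokens, which is why the mask matters once we start padding chunks to a common length.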


And we will now do the same thing for all chunks of text that came from our reference copy of Moby Dick. We will first tokenize each chunk... note that this may take some time because `process_pdf` read the full PDF, so the chunks cover the whole book :).

In [194]:
tokenized_ids = [tokenizer(chunk) for chunk in chunks]
max_len = max(len(ids) for ids in tokenized_ids)

print(f'Our longest chunk has {max_len} tokens')

Our longest chunk has 174 tokens


Now we will recover embeddings for each chunk, using Falcon's backbone, in batches of 4. We process 4 chunks at a time because pushing every chunk through the backbone in a single tensor would be huge and would result in GPU memory errors.

In [195]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def get_embeddings_in_batches(token_ids, padding_mask, model, batch_size=4):
    embeddings_list = []
    for i in range(0, token_ids.shape[0], batch_size):
        batch_ids = token_ids[i:i+batch_size]
        batch_mask = padding_mask[i:i+batch_size]

        inputs = {
            "token_ids": batch_ids,
            "padding_mask": batch_mask
        }

        hidden = model.backbone(inputs)  # shape: (batch_size, seq_len, hidden_dim)

        mask = tf.cast(batch_mask, tf.float16)
        masked_hidden = hidden * tf.expand_dims(mask, axis=-1)
        embedding = tf.reduce_sum(masked_hidden, axis=1) / tf.reduce_sum(mask, axis=1, keepdims=True)
        embeddings_list.append(embedding)

    return tf.concat(embeddings_list, axis=0)


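# Pad sequences to same length (pads with 0 at the end of each sequence)
padded_ids = pad_sequences(tokenized_ids, padding='post')  # shape: (num_chunks, max_seq_len)

# Create padding mask (1 where there's a token, 0 where there's padding).
# Build it from the true sequence lengths so a genuine token with ID 0 is never mistaken for padding.
lengths = np.array([len(ids) for ids in tokenized_ids])
padding_mask = (np.arange(padded_ids.shape[1])[None, :] < lengths[:, None]).astype("int32")

# Convert to tensors
token_ids = tf.convert_to_tensor(padded_ids, dtype=tf.int32)
padding_mask = tf.convert_to_tensor(padding_mask, dtype=tf.int32)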

# Now run batching
embeddings = get_embeddings_in_batches(token_ids, padding_mask, falcon_chat, batch_size=4)
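
As a toy illustration of the padding, masking, and batching steps, here is a sketch in plain Python. The helper names `pad_and_mask` and `batches` and the token IDs are made up. The mask here is derived from each sequence's length rather than by comparing against the pad value, so a real token that happens to have ID 0 stays unmasked (whether ID 0 ever appears as a real token in Falcon's output is not something this notebook checks).

```python
def pad_and_mask(sequences, pad_id=0):
    """Post-pad sequences to a common length and build a 0/1 padding mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Toy IDs; note the genuine 0 in the first row
token_lists = [[5, 0, 9], [7], [3, 4]]
padded, mask = pad_and_mask(token_lists)
print(padded)  # [[5, 0, 9], [7, 0, 0], [3, 4, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]

for batch in batches(padded, batch_size=2):
    print(batch)
```

The last batch may be smaller than `batch_size`, which is also true of the slicing in `get_embeddings_in_batches`.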

Here are our document chunk embeddings...

In [197]:
chunk_index = 100

# Access and print the chunk at chunk_index
print(f'CHUNK TEXT: \n\n{chunks[chunk_index]} \n\n')

# Count words directly
print(f'CHUNK DETAILS: \n\nThe chunk contains roughly {len(chunks[chunk_index].split(" "))} words (based on white spaces).')

# Look at embedding shape directly
print(f'Its embedding has {len(embeddings[chunk_index])} dimensions.\n')

# Print vector representation directly
print(f'VECTOR REPRESENTATION:\n')
print(embeddings[chunk_index])

CHUNK TEXT: 

uncomfortableness and seeing him now exhibiting strong symptoms of concluding his business operations and jump ing into bed with me I thought it was high time now or never before the light was put out to break the spell into which I had so long been bound But the interval I spent in deliberating what to say was a fatal one Taking up his tomahawk from the table he examined the head of it for an instant and then holding it to the light with his mouth at the handle he puﬀed out great clouds of  


CHUNK DETAILS: 

The chunk contains roughly 97 words (based on white spaces).
Its embedding has 2048 dimensions.

VECTOR REPRESENTATION:

tf.Tensor([ 1.287  2.217  2.96  ...  2.605 -2.014 -2.877], shape=(2048,), dtype=float16)


Let's combine text chunks (column 1) with associated embeddings (column 2) in a pandas dataframe and save it.


In [198]:
rag_df = pd.DataFrame({'chunk': chunks, 'embedding': embeddings.numpy().tolist()})

We can save it to our drive folder for later use (or we can load a previously stored file)...

In [202]:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

rag_df.to_pickle('/content/drive/MyDrive/Teaching/Courses/BA 865/BA865-2025/Lecture Materials/Week 5/rag_df.pkl')
#rag_df = pd.read_pickle('/content/drive/MyDrive/Teaching/Courses/BA 865/BA865-2025/Lecture Materials/Week 5/rag_df.pkl')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Now we need a method to perform a top-k similar chunk lookup (relative to the prompt).

In [215]:
from sklearn.metrics.pairwise import cosine_similarity

def rag_lookup(query, rag_data, top_k=2):

    token_ids = tokenizer(query)
    token_ids = tf.expand_dims(token_ids, 0)
    padding_mask = tf.ones_like(token_ids)

    hidden = falcon_chat.backbone({
        "token_ids": token_ids,
        "padding_mask": padding_mask
    })
    mask = tf.cast(padding_mask, tf.float16)
    masked_hidden = hidden * tf.expand_dims(mask, axis=-1)
    query_embedding = tf.reduce_sum(masked_hidden, axis=1) / tf.reduce_sum(mask, axis=1, keepdims=True)

    # Stack chunk embeddings into a matrix
    chunk_embeddings = np.vstack(rag_data['embedding'].values)

    # Compute cosine similarity between query and all chunks
    similarities = cosine_similarity(query_embedding, chunk_embeddings)[0]

    # Get top results
    temp_df = rag_data.copy()
    temp_df['similarity'] = similarities
    results = temp_df.sort_values('similarity', ascending=False).head(top_k)
    return results

lookup = rag_lookup("Who was the chief mate on ahab's ship?", rag_df, top_k = 10)
lookup['chunk'].iloc[0]

'and Star bucks coerced will were Ahabs so long as Ahab kept his magnet at Starbucks brain still he knew that for all this the chief mate in his soul abhorred his cap tains quest and could he would joyfully disintegrate himself from it or even frustrate it it might be that a long interval would elapse ere the White Whale was seen During that long interval Starbuck would ever be apt to fall into open relapses of rebellion against his captains leadership unless some ordinary pruden tial '
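
For intuition about what the lookup computes, here is a minimal sketch of cosine-similarity ranking in plain Python on two-dimensional toy vectors. The names `cosine` and `top_k` and the vectors are invented for illustration; the notebook itself relies on scikit-learn's `cosine_similarity` over the 2048-dimensional Falcon embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, vectors, k=2):
    """Return (index, similarity) pairs for the k vectors most similar to query."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


query = [1.0, 0.0]
vectors = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranked = top_k(query, vectors, k=2)
print(ranked)
```

Vector 1 points in the same direction as the query (similarity 1.0) even though it is twice as long, which shows that cosine similarity ignores magnitude and compares direction only.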

In [207]:
# RAG prompt creation
def create_rag_prompt(query, rag_data, top_k=2):
    results = rag_lookup(query, rag_data, top_k)
    prompt = f"Answer the following question: {query}\n\n"
    prompt += "Here are some recent relevant documents:\n"
    for i, (_, row) in enumerate(results.iterrows()):
        prompt += f"\n--- Source {i+1} ---\n{row['chunk']}\n"
    prompt += "\nPlease refer to the relevant documents provided above when answering."
    return prompt

In [217]:
rag_prompt = create_rag_prompt("Who was the chief mate on ahab's ship?", rag_df, top_k=3)
print(rag_prompt)

Answer the following question: Who was the chief mate on ahab's ship?

Here are some recent relevant documents:

--- Source 1 ---
and Star bucks coerced will were Ahabs so long as Ahab kept his magnet at Starbucks brain still he knew that for all this the chief mate in his soul abhorred his cap tains quest and could he would joyfully disintegrate himself from it or even frustrate it it might be that a long interval would elapse ere the White Whale was seen During that long interval Starbuck would ever be apt to fall into open relapses of rebellion against his captains leadership unless some ordinary pruden tial 

--- Source 2 ---
tormented chase of that demon phantom that some time or other swims before all human hearts while chasing such over this round globe they either lead us on in barren mazes or midway leave us whelmed e cabincompass is called the telltale because without going to the compass at the helm the Captain while below can inform himself of the course of the ship  CHAPTE

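Now we can see how the process works end to end: the retrieved chunk is added to the prompt as extra context before the LLM answers. Falcon still struggles, though. The context window is a challenge here: with `max_length=200`, the retrieved text appears to use up most of the length budget, leaving little room for an answer.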
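
One way to reason about the context-window problem is to budget tokens explicitly. The sketch below assumes that `max_length` bounds the prompt and the completion together, and it uses whitespace splitting as a crude stand-in for the real tokenizer (Falcon's tokenizer will generally produce more tokens than words). The helper names `remaining_budget` and `trim_context` and the `reserve` parameter are hypothetical and not part of KerasHub.

```python
def remaining_budget(prompt, max_length, tokenize=str.split):
    """Tokens left for the answer if max_length covers prompt plus completion."""
    return max(0, max_length - len(tokenize(prompt)))


def trim_context(context, question, max_length, reserve=50, tokenize=str.split):
    """Keep only as many leading context tokens as fit after reserving room for the answer."""
    budget = max_length - reserve - len(tokenize(question))
    tokens = tokenize(context)
    return " ".join(tokens[:max(0, budget)])


# Toy numbers, chosen only to make the arithmetic easy to follow
question = "Name the chief mate"
context = "one two three four five six seven eight nine ten"

print(remaining_budget(question + " " + context, max_length=12))  # 0
print(trim_context(context, question, max_length=12, reserve=5))  # one two three
```

In the real notebook, the same idea would mean either raising `max_length`, retrieving shorter chunks, or trimming the retrieved text so that some budget is left for the model's answer.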

In [230]:
# Without RAG
prompt = "Name the chief mate on Ahabs ship."

response_without_rag = falcon_chat.generate(prompt, max_length=200)

# With RAG
rag_prompt = create_rag_prompt(prompt, rag_df, top_k=1)

response_with_rag = falcon_chat.generate(rag_prompt, max_length=200)

# Print responses
print("WITHOUT RAG:")
print(response_without_rag)
print("\n\nWITH RAG:")
print(response_with_rag)

WITHOUT RAG:
Name the chief mate on Ahabs ship.
1) Captain Ahab
2) Captain Queeg
3) Captain Smith
4) First Officer Queeg
5) First Officer Queeg
6) Captain Smith
7) First Officer Smith
8) First Officer Smith
9) Captain Smith
10) First Officer
11) Captain Queeg
12) Captain Queeg
13) First Officer
14) First Officer
15) Captain Smith
16) Captain Queeg
17) First Officer
18) Captain Ahab
19) Captain Ahab
20) Captain Queeg
21) Captain Ahab
22) First Officer
23) Captain Ahab
24) Queeg
25) Queeg
26) Captain Ahab
27) Captain Ahab


WITH RAG:
Answer the following question: Name the chief mate on Ahabs ship.

Here are some recent relevant documents:

--- Source 1 ---
and Star bucks coerced will were Ahabs so long as Ahab kept his magnet at Starbucks brain still he knew that for all this the chief mate in his soul abhorred his cap tains quest and could he would joyfully disintegrate himself from it or even frustrate it it might be that a long interval would elapse ere the White Whale was seen Durin