# **GOAL:**

Create a custom RAG application that retrieves an answer when a user sends it a query.

## Steps:

Step 1: Input paragraph (your data)

Step 2: Clean the text and split into sentences

Step 3: Create dataset from individual sentences

Step 4: Load embedding model

Step 5: Embed each sentence and add to dataset

Step 6: Build FAISS index

Step 7: Perform semantic search query

Step 8: Retrieve top-1 most relevant sentence

Step 9: Show the clean, precise answer

Step 10: Summarize the answer if it's too long

# Import Data

In [16]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [1]:
!pip install transformers datasets sentence-transformers faiss-cpu

...


In [11]:
!pip install PyPDF2


In [25]:
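import PyPDF2

with open('The_Art_of_War-Sun_Tzu.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # extract_text() can come back empty for pages with no extractable text,
    # so fall back to '' before joining the pages into one string
    corpus = ''.join(page.extract_text() or '' for page in reader.pages)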

corpus

"T\x00\x00 \x00\x00\x00 \x00\x00 \x00 \x00\x00\nS\x00\x00  T \x00 \x00\nT \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 :  L \x00 \x00 \x00 ... with numbers likewise following in sequence for the remainder\nof the chapter .\nThis book has been downloaded from www .aliceandbooks.com . You\ncan find many more public domain books in our website"

# Step 2: Clean the text and split into sentences

In [26]:
import re

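def clean_text(text):
    # Keep only runs of letters, digits, and basic punctuation. Everything else
    # (null bytes, newlines, apostrophes) is dropped, and the runs are joined with spaces.
    text = " ".join(re.findall(r'[a-zA-Z0-9.,;:()"\-]+', text))
    return text.strip()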

clean_corpus = clean_text(corpus)
clean_corpus

'T S T T : L G P : - 5 1 4 This book has been downloaded from www .aliceandbooks.com . You can find many more public domain books in our websiteP F Lionel Giles ground-breaking edition ... with numbers following in sequence thereafter; and paragraph 13 in Chapter 2 ought to be marked 13, 14. with numbers likewise following in sequence for the remainder of the chapter . This book has been downloaded from www .aliceandbooks.com . You can find many more public domain books in our website'

### Splitting into sentences using NLTK's `sent_tokenize`

In [27]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(clean_corpus)
sentences

['T S T T : L G P : - 5 1 4 This book has been downloaded from www .aliceandbooks.com .',
 'You can find many more public domain books in our websiteP F Lionel Giles ground-breaking edition of Sun Tzu s ancient treatise on the Art of W ar was nothing short of a scholarly masterpiece.',
 '3.',
 'The art of war , then, is governed by five constant factors, to be taken into account in one s deliberations, when seeking to determine the conditions obtaining in the field.',
 '4.',
 'These are: ( ) the Moral Law; ( ) Heaven; ( ) Earth; ( ) the Commander; ( ) method and discipline.',
 '5.',
 'The Moral Law causes the people to be in complete accord with their ruler , so that they will follow him regardless of their lives, undismayed by any danger .',
 '6.',
 'Heaven signifies night and day , cold and heat, times and seasons.',
 '7.',
 'Earth comprises distances, great and small; danger and security; open ground and narrow passes; the chances of life and death.',
 '8.',
 'The Commander stands f

In [28]:
# Drop fragments shorter than 10 characters (e.g. bare paragraph numbers such as '3.')
sentences = [sentence for sentence in sentences if len(sentence) >= 10]
sentences

['T S T T : L G P : - 5 1 4 This book has been downloaded from www .aliceandbooks.com .',
 'You can find many more public domain books in our websiteP F Lionel Giles ground-breaking edition of Sun Tzu s ancient treatise on the Art of W ar was nothing short of a scholarly masterpiece.',
 'It contains the original Chinese text, an accurate and fancy-free yet highly readable translation, extensive annotations by both ancient Chinese commentators and Giles himself, and a vast introduction to provide an in-depth historical perspective to it all.',
'...',
 'This book has been downloaded from www .aliceandbooks.com .',
 'You can find many more public domain books in our website']
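The sentence list above contains split possessives such as "Sun Tzu s" and "one s", because the character class in `clean_text` does not admit apostrophes. The following is a minimal sketch of a variant that keeps them. The function name `clean_text_keep_apostrophes` and the sample string are hypothetical, and since it is an assumption whether the PDF uses straight or curly apostrophes, the class admits both.

```python
import re

def clean_text_keep_apostrophes(text):
    # Same idea as clean_text, but the character class also admits ' and ’
    tokens = re.findall(r"[a-zA-Z0-9.,;:()\"'’\-]+", text)
    return " ".join(tokens)

# Illustrative sample with an apostrophe, a null byte, and a newline
sample = "Sun Tzu's \x00Art\nof War"

# Behaviour of the original character class, for comparison
original_style = " ".join(re.findall(r'[a-zA-Z0-9.,;:()"\-]+', sample))

print(original_style)                       # possessive is split apart
print(clean_text_keep_apostrophes(sample))  # possessive is preserved
```

Changing the cleaning step changes the sentences and therefore the embeddings, so the later cells would need to be re-run.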

# Step 3: Create dataset from individual sentences

In [31]:
!pip install datasets

...

In [33]:
from datasets import Dataset

dataset = Dataset.from_dict({"context": sentences})
dataset

Dataset({
    features: ['context'],
    num_rows: 563
})

In [34]:
dataset['context']

['T S T T : L G P : - 5 1 4 This book has been downloaded from www .aliceandbooks.com .',
 'You can find many more public domain books in our websiteP F Lionel Giles ground-breaking edition of Sun Tzu s ancient treatise on the Art of W ar was nothing short of a scholarly masterpiece.',
'...',
 '1 This edition, due to technical limitations, uses simplified numbering for Chapters 1 and 2.',
 'Correctly , paragraph 5 in Chapter 1 ought to be marked 5, 6. with numbers following in sequence thereafter; and paragraph 13 in Chapter 2 ought to be marked 13, 14. with numbers likewise following in sequence for the remainder of the chapter .',
 'This book has been downloaded from www .aliceandbooks.com .',
 'You can find many more public domain books in our website']

# Step 4: Load embedding model

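This cell loads the pre-trained `all-MiniLM-L6-v2` sentence-transformer model, which is used to generate an embedding for each sentence. As the printed model shows, it accepts up to 256 tokens per input and produces 384-dimensional vectors.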

In [35]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Step 5: Embed each sentence and add to dataset

This cell generates an embedding for each sentence in the dataset and stores it in a new `embedding` column:

* `def get_embeddings(texts)`: defines a function named `get_embeddings` that takes a list of `texts` as input.

* `embeddings = model.encode(texts, normalize_embeddings=True)`: the core of the embedding step.

    - It uses the pre-trained model loaded in Step 4 to encode the input texts, that is, to convert each text into a numerical vector that represents its meaning.

    - `normalize_embeddings=True` scales each embedding to length 1. For unit-length vectors the dot product equals cosine similarity, and Euclidean distance ranks neighbours in the same order as cosine similarity. The model printed in Step 4 already ends with a `Normalize()` module, so the flag is redundant here but harmless.

* `return embeddings`: returns the generated embeddings.

* `dataset = dataset.map(lambda x: {"embedding": get_embeddings([x["context"]])[0]})`: adds the embeddings to the dataset.

    * `dataset.map()` applies a given function to each row of the dataset.

    * `lambda x: {"embedding": get_embeddings([x["context"]])[0]}` is the anonymous (lambda) function applied to each row.

    * `x` represents a single row. `x["context"]` accesses the "context" column of that row, which contains a sentence. `get_embeddings([x["context"]])[0]` calls `get_embeddings` on a one-element list and extracts the first (and only) embedding.

    * `{"embedding": ...}` is the dictionary returned for each row; `map` adds its key to the dataset as a new column.

* `dataset`: the last line displays the updated dataset, which now has an `embedding` column holding the embedding of each row's `context` sentence.
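To make the effect of normalization concrete, here is a small sketch in plain Python. The vectors are toy values rather than model output, and the helper names `l2_normalize` and `dot` are illustrative only. It checks that a normalized vector has length 1, that the dot product of two unit vectors is their cosine similarity, and that their squared Euclidean distance equals 2 - 2·cosine.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (what normalize_embeddings=True does per row)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dimensional "embeddings" (illustrative values, not model output)
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

length_a = math.sqrt(dot(a, a))   # 1.0 up to floating-point rounding
cosine = dot(a, b)                # for unit vectors, dot product == cosine similarity
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))

print(length_a, cosine, squared_l2)
print(math.isclose(squared_l2, 2 - 2 * cosine))
```

Because of this identity, ranking by smallest Euclidean distance and ranking by largest cosine similarity give the same order for normalized embeddings.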

In [36]:
def get_embeddings(texts):
    embeddings = model.encode(texts, normalize_embeddings=True)
    return embeddings

dataset = dataset.map(lambda x: {"embedding": get_embeddings([x["context"]])[0]})
dataset

Map:   0%|          | 0/563 [00:00<?, ? examples/s]

Dataset({
    features: ['context', 'embedding'],
    num_rows: 563
})

In [40]:
import pandas as pd

pd.DataFrame({'context':dataset['context'], 'embedding':dataset['embedding']})

| | context | embedding |
|---|---|---|
| 0 | T S T T : L G P : - 5 1 4 This book has been d... | [-0.059606559574604034, 0.02161659114062786, 0... |
| 1 | You can find many more public domain books in ... | [-0.07412464916706085, 0.05851251259446144, -0... |
| 2 | It contains the original Chinese text, an accu... | [-0.0948018953204155, 0.1110830307006836, 0.03... |
| 3 | Despite not having become the final word on Ar... | [-0.017075080424547195, 0.04146838188171387, -... |
| 4 | This edition aims to of fer the reader the ful... | [-0.024649500846862793, 0.052450671792030334, ... |
| ... | ... | ... |
| 558 | Spies are a most important element in war , be... | [-0.02499540150165558, -0.018303342163562775, ... |
| 559 | 1 This edition, due to technical limitations, ... | [-0.03652389347553253, -0.02410336397588253, 0... |
| 560 | Correctly , paragraph 5 in Chapter 1 ought to ... | [-0.09111081063747406, 0.0096738925203681, 0.0... |
| 561 | This book has been downloaded from www .alicea... | [0.008709556423127651, 0.015777170658111572, -... |


# Step 6: Build FAISS index

FAISS (Facebook AI Similarity Search) is a library for fast similarity search over dense vectors.

In simpler terms: the index stores the sentence embeddings in a structure that makes it fast to find the sentences semantically closest to a given query. This is the retrieval part of the RAG application. Called with only `column=`, `add_faiss_index` should build a flat index with FAISS's default L2 metric, so the scores returned later are distances and a lower score means a closer match.
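As a rough mental model of what the index does, the sketch below performs an exhaustive nearest-neighbour search in plain Python. It assumes a flat index that compares the query with every stored vector using squared L2 distance; FAISS does this far more efficiently, and the function names and toy vectors here are hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_examples(query, vectors, k=1):
    """Exhaustive search: score every stored vector, return the k closest."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]

# Toy unit vectors standing in for sentence embeddings (illustrative only)
stored = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]

scores, indices = nearest_examples([0.8, 0.6], stored, k=2)
print(scores, indices)  # smallest distance first
```

The return shape mirrors the `(scores, results)` pair used later in the notebook: scores are sorted ascending, so the first hit is the closest.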

In [41]:
import faiss
dataset.add_faiss_index(column="embedding")

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['context', 'embedding'],
    num_rows: 563
})


# Step 7: Perform semantic search query

In [43]:
query = "In attacking with fire, one should be prepared to how many possible developments?"
query_embedding = get_embeddings([query])
query_embedding

array([[ 5.12303486e-02,  4.31203991e-02, -1.20121837e-02,
        -2.85682101e-02,  1.31183062e-02,  7.11346418e-03,
        -5.82009554e-02, -3.98939848e-03,  3.45854717e-03,
..                                                 ...  
         2.65065469e-02,  4.01424728e-02,  8.60863328e-02,
         6.39863536e-02,  1.81283448e-02,  2.11451063e-03,
         6.53263256e-02,  1.92655798e-03,  7.07467422e-02,
        -4.79431786e-02, -2.64714956e-02,  1.81499142e-02]], dtype=float32)

# Step 8: Retrieve top-1 most relevant sentence

In [44]:
scores, results = dataset.get_nearest_examples("embedding", query_embedding, k=1)

# Step 9: Show the clean, precise answer

In [45]:
# Step 9: Show the clean, precise answer
print("scores:",scores)
print("\n Query:", query)
print("Most Relevant Answer:\n", results["context"][0])

scores: [0.29814604]

 Query: In attacking with fire, one should be prepared to how many possible developments?
Most Relevant Answer:
 In attacking with fire, one should be prepared to meet five possible developments: 6.
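The score is easier to read as a cosine similarity. The sketch below assumes that the index reports squared L2 distances and that all embeddings have unit length (both hold if the defaults discussed earlier apply, but treat them as assumptions). The helper name is hypothetical.

```python
def squared_l2_to_cosine(distance):
    """For unit-length vectors ||a - b||^2 = 2 - 2*cos, so cos = 1 - d / 2."""
    return 1.0 - distance / 2.0

score = 0.29814604  # score printed by the cell above
print(round(squared_l2_to_cosine(score), 3))  # → 0.851
```

Under these assumptions the top hit has a cosine similarity of about 0.85 with the query.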


# More Examples:

In [49]:
query = "what are five ways of attack?"
query_embedding = get_embeddings([query])
scores, results = dataset.get_nearest_examples("embedding", query_embedding, k=3)

# Combine the top-3 results into one paragraph
answer = " ".join(results["context"])

print("Query:", query)
print("\n Scores:", scores)
print("\n Most Relevant Answer:\n", answer)

Query: what are five ways of attack?

 Scores: [0.78094304 0.7871499  0.96002233]

 Most Relevant Answer:
 In attacking with fire, one should be prepared to meet five possible developments: 6. Sun Tzu said: There are five ways of attacking with fire. Strike at its head, and you will be attacked by its tail; strike at its tail, and you will be attacked by its head; strike at its middle, and you will be attacked by head and tail both.
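In the output above, the third hit has a noticeably larger score than the first two and does not address the query. One option is to drop hits beyond a distance cutoff before joining them. This is a minimal sketch: the cutoff value `MAX_DISTANCE = 0.9` and the helper name are hypothetical and would need tuning, the sentence boundaries are inferred from the joined answer, and the third sentence is abbreviated.

```python
def filter_by_distance(contexts, scores, max_distance):
    """Keep only hits whose distance score does not exceed max_distance."""
    return [c for c, s in zip(contexts, scores) if s <= max_distance]

MAX_DISTANCE = 0.9  # hypothetical cutoff, chosen only for this illustration

# Scores copied from the output above; sentences inferred from the joined answer
scores = [0.78094304, 0.7871499, 0.96002233]
contexts = [
    "In attacking with fire, one should be prepared to meet five possible developments: 6.",
    "Sun Tzu said: There are five ways of attacking with fire.",
    "Strike at its head, and you will be attacked by its tail; (abbreviated)",
]

kept = filter_by_distance(contexts, scores, MAX_DISTANCE)
print(" ".join(kept))
```

With this cutoff only the two fire-related sentences remain in the combined answer.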


# Step 10: Summarize answer if it's too long

In [50]:
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

summary = summarizer(answer, max_length=150, min_length=20, do_sample=False)
print("Summary:", summary[0]["summary_text"])

Device set to use cpu
Your max_length is set to 150, but your input_length is only 77. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=38)


Summary:  Sun Tzu said: There are five ways of attacking with fire . Strike at its head, and you will be attacked by its tail; strike at its middle, and attack head and tail both .
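The cell above summarizes unconditionally, and the warning in its output notes that the input is only 77 tokens long. The following sketch shows one way to summarize only when the answer is long. The `max_words=60` threshold and the function names are hypothetical, and `first_sentence` is a stand-in so the example runs without downloading a model.

```python
def maybe_summarize(answer, summarize, max_words=60):
    """Return the answer unchanged when short; otherwise pass it to `summarize`."""
    if len(answer.split()) <= max_words:
        return answer
    return summarize(answer)

# Stand-in summarizer for illustration. In the notebook this could wrap the pipeline:
# lambda text: summarizer(text, max_length=150, min_length=20, do_sample=False)[0]["summary_text"]
def first_sentence(text):
    return text.split(". ")[0] + "."

short_answer = "There are five ways of attacking with fire."
long_answer = " ".join(["Strike at its head, and you will be attacked by its tail."] * 10)

print(maybe_summarize(short_answer, first_sentence))  # returned as-is
print(maybe_summarize(long_answer, first_sentence))   # passed to the summarizer
```

A word count is only a rough proxy for the model's token count; it is used here to keep the sketch dependency-free.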


In [51]:
!pip install rouge-score


# Performance Metrics

In [57]:
from difflib import SequenceMatcher

# Hand-written sample evaluation data (not produced by the retriever above)
queries = [
    "What are the five ways of attack?",
    "How many developments in attacking with fire?"
]

expected_answers = [
    "There are five ways of attacking with fire.",
    "In attacking with fire, one should be prepared to meet five possible developments."
]

retrieved_sentences = [
    ["There are five ways of attacking with fire.", "Extra sentence"],
    ["In attacking with fire, one should be prepared to meet five possible developments.", "Another"]
]

generated_answers = [
    "There are five ways of attacking with fire.",
    "In attacking with fire, one should be prepared to meet five possible developments."
]

# Matchers
def is_exact_match(predicted, expected):
    return predicted.strip().lower() == expected.strip().lower()

def fuzzy_match(predicted, expected, threshold=0.8):
    return SequenceMatcher(None, predicted.lower(), expected.lower()).ratio() >= threshold

# Evaluation
total = len(queries)
exact_matches = fuzzy_matches = 0
precision_at_k = []
mrr = []

for i in range(total):
    expected = expected_answers[i]
    retrieved = retrieved_sentences[i]
    generated = generated_answers[i]

    if is_exact_match(generated, expected): exact_matches += 1
    if fuzzy_match(generated, expected): fuzzy_matches += 1

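    # Precision@k: fraction of the k retrieved sentences that match the expected answer
    precision = sum(1 for r in retrieved if fuzzy_match(r, expected)) / len(retrieved)
    precision_at_k.append(precision)

    # Reciprocal rank of the first matching sentence (0 if none matches)
    reciprocal_rank = next((1 / (pos + 1) for pos, r in enumerate(retrieved) if fuzzy_match(r, expected)), 0)
    mrr.append(reciprocal_rank)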

print(f"Total Queries: {total}")
print(f"Exact Match Accuracy: {exact_matches / total:.2f}")
print(f"Fuzzy Match Accuracy (≥80%): {fuzzy_matches / total:.2f}")
print(f"Avg Precision@k: {sum(precision_at_k) / total:.2f}")
print(f"Mean Reciprocal Rank (MRR): {sum(mrr) / total:.2f}")

Total Queries: 2
Exact Match Accuracy: 1.00
Fuzzy Match Accuracy (≥80%): 1.00
Avg Precision@k: 0.50
Mean Reciprocal Rank (MRR): 1.00
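In the output above, Avg Precision@k is 0.50 because each retrieved list holds two sentences of which one matches, and MRR is 1.00 because the match is always in first position. The sketch below shows how both metrics change when the matching sentence is ranked second. The helper names and sample data are illustrative, and exact string equality stands in for the fuzzy matcher.

```python
def precision_at_k(retrieved, is_relevant):
    """Fraction of the retrieved items judged relevant."""
    return sum(1 for r in retrieved if is_relevant(r)) / len(retrieved)

def reciprocal_rank(retrieved, is_relevant):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for position, item in enumerate(retrieved, start=1):
        if is_relevant(item):
            return 1 / position
    return 0.0

expected = "There are five ways of attacking with fire."
retrieved = ["Extra sentence", expected, "Another"]

def is_hit(sentence):
    return sentence == expected

print(precision_at_k(retrieved, is_hit))   # one relevant item out of three
print(reciprocal_rank(retrieved, is_hit))  # first relevant item is at rank 2
```

MRR is the mean of these reciprocal ranks over all queries, so it rewards placing the correct sentence near the top, while precision@k ignores order.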


# Save the Model

In [67]:
from datasets import Dataset
from sentence_transformers import SentenceTransformer
import faiss  # must be installed for add_faiss_index / save_faiss_index

# Paths (relative, so they match the reload cell below)
MODEL_SAVE_PATH = "my_rag_model"
DATASET_SAVE_PATH = "my_dataset"
FAISS_INDEX_PATH = "faiss_index.bin"

# Load your SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example input sentences (replace this with your actual corpus)
sentences = [
    "There are five ways of attacking with fire.",
    "In attacking with fire, one should be prepared to meet five possible developments.",
    "Sun Tzu was a Chinese military strategist.",
    "Strategy is the key to winning without fighting."
]

# Step 1: Create dataset
dataset = Dataset.from_dict({"context": sentences})

# Step 2: Embed the text
def get_embeddings(texts):
    return model.encode(texts, normalize_embeddings=True)

dataset = dataset.map(lambda x: {"embedding": get_embeddings([x["context"]])[0]})

# Step 3: Add FAISS index
dataset.add_faiss_index(column="embedding")

# Step 4: Save model
model.save(MODEL_SAVE_PATH)

# Step 5: Save FAISS index separately
dataset.save_faiss_index("embedding", FAISS_INDEX_PATH)

# Step 6: Drop the index, then save the dataset (a dataset with an attached
# index cannot be saved). drop_index() works in place and returns None,
# so its result must not be assigned back to `dataset`.
dataset.drop_index("embedding")
dataset.save_to_disk(DATASET_SAVE_PATH)

print("Model, FAISS index, and dataset saved successfully!")

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Model, FAISS index, and dataset saved successfully!


# Later: Reloading for Inference

In [None]:
from datasets import load_from_disk
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("my_rag_model")

# Load dataset
dataset = load_from_disk("my_dataset")

# Load FAISS index back
dataset.load_faiss_index("embedding", "faiss_index.bin")

# Now ready for semantic search again
query = "What are the five ways of attack?"
query_embedding = model.encode([query], normalize_embeddings=True)
scores, results = dataset.get_nearest_examples("embedding", query_embedding, k=3)
print("Answer:", " ".join(results["context"]))
