## **0. Install dependencies**

In [1]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
%pip install transformers sentence-transformers datasets evaluate langchain accelerate bitsandbytes
%pip install numpy faiss-cpu
%pip install absl-py nltk rouge_score
%pip install ipywidgets

Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.ngc.nvidia.com
Collecting torch
  Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp310-cp310-win_amd64.whl (2449.4 MB)
  

In [2]:
import json
import os
from collections import defaultdict
from functools import partial
from typing import Dict, List, Optional

import evaluate
import faiss
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, concatenate_datasets, load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
)

## **1. Create config**
Contains all the parameters needed to adjust the behaviour of the RAG system in one place.

In [3]:
class Cfg:
    # General configs
    SAVE_PATH: str = "./" 
    DEVICE: str = "cuda" if torch.cuda.is_available() else "cpu"

    # Dataset configs
    WIKIPEDIA_DATASET_NAME: str = "wikipedia"
    WIKIPEDIA_VERSION: str = "20220301.en"
    WIKIPEDIA_SPLIT: str = "train[:1%]"
    SHORT_RUN_DATASET_LEN: Optional[int] = 1000  # Number of articles for a short run. None: use the whole split.
    NUM_PROC: int = max((os.cpu_count() or 1) - 4, 1)  # Number of processes used to preprocess the dataset. os.cpu_count() may return None.

    # Chunking configs
    CHUNK_SIZE: int = 128  # Chunk size in tokens. Keeps the prompt short when the context length is very limited.
    CHUNK_OVERLAP: int = 0  # Overlap between consecutive chunks.
    SEPARATORS: List[str] = ["\n\n", "\n\t", "\n", ".", " "]  # Separators tried in this order when chunking.

    # Retrieve configs
    OVERRIDE_EXISTING_VECTOR_DB: bool = True  # Override the existing vector store or not.
    RETRIEVE_DISTANCE_THRESHOLD: float = 0.6  # Minimum cosine similarity a retrieved document needs to be kept.
    RETRIEVE_TOP_K: int = 5  # Maximum number of documents to retrieve and add to the context.
    RETRIEVE_MIN_K: int = 4  # Minimum number of documents to add to the context, even if they are below the threshold.

    # Embedding model configs
    EMBEDDING_BATCH_SIZE: int = 128  # Embedding model batch size to encode text.
    EMBEDDING_MODEL_NAME: str = "multi-qa-distilbert-cos-v1"

    # Generation model config
    GENERATION_MODEL_NAME: str = "google/flan-t5-large"  # encoder-decoder model, fine-tuned on instruction and Chain-of-thought datasets.
    QUANTIZE_GENERATION_MODEL: bool = True  # Quantize the generation model to reduce memory usage. Loads in int8 precision.

    # Generation strategies and configs
    GENERATION_BATCH_SIZE: int = 8
    MAX_NEW_TOKENS: int = 64  # Maximum number of tokens to generate. The prompt does not count.
    NUM_BEAMS: int = 1  # 1 disables beam search. With DO_SAMPLE = False this is greedy search: always take the most probable token, which makes the output reproducible.
    EARLY_STOPPING: bool = False  # Controls when beam search stops. Only makes sense if NUM_BEAMS > 1.
    DO_SAMPLE: bool = False  # False: deterministic decoding. True: randomly sample the next token from the predicted distribution.
    TEMPERATURE: Optional[float] = None  # Controls the randomness of sampling. Irrelevant if DO_SAMPLE is False. Lower value: more predictable output, higher value: more creative output.
    LENGTH_PENALTY: float = 1.0  # Exponential length penalty used by beam search. Values > 0 favour longer outputs, values < 0 favour shorter ones. No effect if NUM_BEAMS = 1.
    GENERATION_TOP_K: Optional[int] = 50  # If DO_SAMPLE = True, choose only from the k most likely tokens when sampling.
    GENERATION_TOP_P: Optional[float] = 1.0  # If DO_SAMPLE = True, sort the predicted tokens by their probability and keep the smallest set whose cumulative probability reaches TOP_P.

    # Test configs
    MAX_QUESTION_TO_EACH_TOPIC: int = 10  # Maximum number of questions per topic used to evaluate the RAG pipeline.
    TEST_DATASET_NAME: str = "rajpurkar/squad"
    TEST_COS_SIM_THRESHOLD: float = 0.6

cfg = Cfg()

Create config object and save as a json.

In [4]:
def config_to_dict(config_class: Cfg) -> Dict:
    config_dict = {}
    for attribute_name in dir(config_class):
        if not attribute_name.startswith("__") and not callable(
            getattr(config_class, attribute_name)
        ):
            config_dict[attribute_name] = getattr(config_class, attribute_name)
    return config_dict


def write_json(data_to_write: Dict, file_name: str) -> None:
    with open(cfg.SAVE_PATH + file_name, "w", encoding="utf-8") as f:
        json.dump(data_to_write, f, ensure_ascii=False, indent=4)


write_json(data_to_write=config_to_dict(cfg), file_name="cfg.json")

Create the embedding model and the tokenizer of the generation model. Set the generation config parameters.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(cfg.GENERATION_MODEL_NAME)
embedding_model = SentenceTransformer(cfg.EMBEDDING_MODEL_NAME)

generation_config = GenerationConfig(
    temperature=cfg.TEMPERATURE,
    max_new_tokens=cfg.MAX_NEW_TOKENS,
    num_beams=cfg.NUM_BEAMS,
    early_stopping=cfg.EARLY_STOPPING,
    do_sample=cfg.DO_SAMPLE,
    length_penalty=cfg.LENGTH_PENALTY,
    top_p=cfg.GENERATION_TOP_P,
    top_k=cfg.GENERATION_TOP_K,
)
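
The sampling parameters above (`temperature`, `top_k`, `top_p`) only matter when `do_sample=True`. The following is a minimal NumPy sketch of how such filters narrow the next-token distribution. The helper name `filter_next_token_probs` and the toy logits are made up for illustration, and the code is a simplification, not the exact routine used inside `transformers`.

```python
import numpy as np


def filter_next_token_probs(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return the renormalized sampling distribution after temperature, top-k and top-p."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(-probs)  # token ids sorted by descending probability
    keep = np.zeros(len(probs), dtype=bool)

    # top-k: keep only the k most likely tokens
    k = len(probs) if top_k is None else min(top_k, len(probs))
    keep[order[:k]] = True

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep[order[cutoff:]] = False

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
greedy_token = int(np.argmax(toy_logits))  # the choice made when do_sample=False and num_beams=1
top2 = filter_next_token_probs(toy_logits, top_k=2)
nucleus = filter_next_token_probs(toy_logits, top_p=0.8)
print(greedy_token, top2.round(3), nucleus.round(3))
```

With the configuration used in this notebook (`DO_SAMPLE = False`, `NUM_BEAMS = 1`) none of these filters are applied and the most probable token is always selected.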

## **2. Load and preprocess data**

In [6]:
def preprocess_text(example: Dict) -> Dict:
    """
    Cleans up the text by removing unwanted characters and extra whitespace.
    """
    import re

    text = example["text"]
    text = re.sub(r'(\\n)+', ' ', text).strip()  # Replaces escaped newlines (a literal backslash followed by "n") with a space
    text = re.sub(r"[^a-zA-Z0-9\s.,!?'\';:(){}[\]-]+", "", text)  # Removes unwanted characters
    text = re.sub(r"(^|\.\s+)[^a-zA-Z0-9]+", "", text)  # Keep only relevant characters in front of the text
    example["text"] = text
    return example


def remove_irrelevant_sections(example: Dict) -> Dict:
    """
    Removes irrelevant sections from an article.
    """
    import re

    sections = re.split(
        r"\b(References|External links|Further reading|See also|Notes|Bibliography|Sources|External references|Related topics|Image credits|Historical context)\b",
        example["text"],
        flags=re.IGNORECASE,
    )
    example["text"] = sections[0].strip()
    return example


def split_document(example: Dict, text_splitter: TextSplitter) -> Dict:
    """
    Splits the article into chunks.
    """
    from langchain.schema import Document as LangchainDocument

    doc = LangchainDocument(
        page_content=str(example["text"]), metadata={"title": example["title"]}
    )
    chunks = text_splitter.split_documents([doc])
    return {
        "text": [chunk.page_content for chunk in chunks],
        "title": [chunk.metadata for chunk in chunks],
    }


def split_to_chunks(dataset: Dataset) -> Dataset:
    """
    Splits long articles into chunks. This prevents overly long contexts at prediction time.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer=tokenizer,
        chunk_size=cfg.CHUNK_SIZE,
        chunk_overlap=cfg.CHUNK_OVERLAP,
        strip_whitespace=True,
        separators=cfg.SEPARATORS,
    )

    dataset = dataset.map(
        partial(split_document, text_splitter=text_splitter),
        batched=True,
        batch_size=1,
        remove_columns=dataset.column_names,
        num_proc=cfg.NUM_PROC,
    )

    return dataset


def load_and_preprocess_data() -> List[str]:
    """
    Loads and preprocesses the dataset. Saves the article titles as topics. Returns the preprocessed texts.
    """
    print("Loading Wikipedia dataset.")
    wikipedia = load_dataset(
        cfg.WIKIPEDIA_DATASET_NAME,
        cfg.WIKIPEDIA_VERSION,
        split=cfg.WIKIPEDIA_SPLIT,
    )

    if cfg.SHORT_RUN_DATASET_LEN is not None:
        wikipedia = wikipedia.select(range(cfg.SHORT_RUN_DATASET_LEN))

    write_json(data_to_write={"topics": wikipedia["title"]}, file_name="topics.json")

    print("Removing irrelevant sections from dataset.")
    wikipedia = wikipedia.map(remove_irrelevant_sections, num_proc=cfg.NUM_PROC)

    print("Splitting long texts into chunks.")
    wikipedia = split_to_chunks(dataset=wikipedia)

    print("Removing unnecessary tokens.")
    wikipedia = wikipedia.map(preprocess_text, num_proc=cfg.NUM_PROC)

    print("Deleting duplicates.")
    wikipedia = pd.DataFrame(wikipedia).drop_duplicates("text", ignore_index=True)

    wikipedia_texts = wikipedia["text"].tolist()  # Plain list, matching the declared return type
    print(f"Total documents: {len(wikipedia_texts)}")

    return wikipedia_texts
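
To illustrate how the section removal step behaves, here is a small sketch that uses a shortened version of the heading pattern from `remove_irrelevant_sections`. The sample texts are made up. It also shows a caveat: the pattern is not tied to heading lines, so a listed word inside a normal sentence truncates the article as well.

```python
import re

# Shortened version of the pattern used in remove_irrelevant_sections.
SECTION_PATTERN = r"\b(References|External links|See also|Notes)\b"

article = (
    "Alabama is a state in the Southeastern region of the United States.\n\n"
    "See also\n\nOutline of Alabama\n\n"
    "References\n\nSome citation"
)
# Everything from the first matching heading onwards is dropped.
kept = re.split(SECTION_PATTERN, article, flags=re.IGNORECASE)[0].strip()

# Caveat: "notes" inside a sentence matches too, so the text is cut there.
sentence = "The composer left notes on the score. The rest of the article follows."
kept_sentence = re.split(SECTION_PATTERN, sentence, flags=re.IGNORECASE)[0].strip()

print(kept)
print(kept_sentence)
```

Anchoring the pattern to whole lines would be one way to reduce such false matches, at the cost of missing headings that are formatted differently.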

## **3. Embedding generation**
Depending on the batch size, the embedding of the same input may differ slightly.

For example:
```python
e1 = embedding_model.encode(
    ["foo"] * 128, show_progress_bar=True, convert_to_numpy=True, batch_size=128
)
e2 = embedding_model.encode(
    ["foo"], show_progress_bar=True, convert_to_numpy=True, batch_size=1
)
(e1[0] == e2[0]).all() == False
```
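These differences come from floating-point arithmetic.
With a different batch size the inputs are batched and padded differently, and the underlying kernels may accumulate sums in a different order, so the results can differ in the last few digits. Note that LayerNorm, which DistilBERT uses, normalizes each sample on its own and is therefore not sensitive to the batch size by itself (unlike BatchNorm in training mode).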

**Numerical consistency check:**
```python
from numpy.testing import assert_allclose
assert_allclose(e1[0], e2[0], rtol=1e-3, atol=0)
```
Here "rtol" is the relative tolerance, and "atol" is the absolute tolerance.
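
As a minimal illustration of the floating-point effect mentioned above, the sketch below adds the same three numbers in two different orders. It is unrelated to the embedding model itself and only shows why exact equality is too strict while a tolerance-based check passes.

```python
import math

# Floating-point addition is not associative: the grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

exactly_equal = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)

print(left_to_right, right_to_left, exactly_equal, close_enough)
```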

In [7]:
def generate_embeddings(texts: List[str]) -> np.ndarray:
    """
    Generate embedding from texts.
    """
    embeddings = embedding_model.encode(
        texts,
        show_progress_bar=True,
        convert_to_numpy=True,
        batch_size=cfg.EMBEDDING_BATCH_SIZE,
    )
    return np.array(embeddings, dtype=np.float32)

Cosine similarity measures the direction of two vectors rather than the distance between them.

Pros:
- The magnitude of the vectors may vary significantly, but cosine similarity depends only on direction, so magnitude is less critical
- If a vector is scaled by a positive factor, cosine similarity remains unchanged because it is based on the cosine of the angle and not on the vector's length
- It directly measures how aligned two vectors are in their feature space

Interpretation of the score:
- 1: perfect alignment (high similarity)
- 0: orthogonal to each other (no similarity)
- -1: exactly opposite in direction (anti-similarity)
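
The values above can be checked with a short NumPy sketch. The vectors and helper names are made up for illustration. The last lines show why an inner-product index over L2-normalized vectors returns cosine similarity: after normalization the dot product and the cosine are the same number.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)


a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                 # scaled copy of a
opposite = -a                            # points the other way
orthogonal = np.array([3.0, 0.0, -1.0])  # 1*3 + 2*0 + 3*(-1) = 0
other = np.array([1.0, 0.0, 1.0])

sim_same = cosine_similarity(a, same_direction)
sim_opposite = cosine_similarity(a, opposite)
sim_orthogonal = cosine_similarity(a, orthogonal)

# Inner product of unit-length vectors equals the cosine similarity.
sim_other = cosine_similarity(a, other)
inner_product = float(np.dot(l2_normalize(a), l2_normalize(other)))

print(sim_same, sim_opposite, sim_orthogonal, sim_other, inner_product)
```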

In [8]:
def create_vector_db(embeddings: np.ndarray) -> faiss.Index:
    """
    Create vector db. The vector db uses cosine similarity.
    """
    embeddings = embeddings.astype("float32")

    faiss.normalize_L2(embeddings)
    vector_db = faiss.IndexFlatIP(embeddings.shape[1])  # cosine similarity
    vector_db.add(embeddings)

    return vector_db

## **4. Create retrieval**

In [9]:
class CustomRAGRetrieval:
    def __init__(self, vector_db: faiss.Index, texts: List[str]):
        super().__init__()
        self.vector_db = vector_db
        self.texts = texts

    def retrieve(
        self,
        query: str | List[str],
        top_k: int,
        retrieve_distance_threshold: float = 0.0,
        retrieve_min_k: int = 1,
    ):
        """
        Retrieve the relevant documents based on cosine similarity.
        """
        query = [query] if isinstance(query, str) else query
        query_embedding = embedding_model.encode(
            query, convert_to_numpy=True, precision="float32"
        )

        faiss.normalize_L2(query_embedding)
        similarity_scores, indices = self.vector_db.search(query_embedding, top_k)

        valid_mask = similarity_scores > retrieve_distance_threshold
        valid_indices = [
            indices[row][valid_mask[row]] for row in range(indices.shape[0])
        ]

        for idx in range(len(valid_indices)):
            if len(valid_indices[idx]) < retrieve_min_k:
                valid_indices[idx] = np.array(indices[idx, :retrieve_min_k])

        retrieved_texts = [[self.texts[idx] for idx in idxs] for idxs in valid_indices]
        return retrieved_texts
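
The threshold and minimum-k logic of `retrieve` can be followed on a toy example. The sketch below repeats the same selection on made-up scores and indices, without the embedding model or the index. The function name `select_indices` is hypothetical.

```python
import numpy as np


def select_indices(scores, indices, threshold, min_k):
    """Keep hits above the threshold, but never fewer than min_k per query."""
    selected = []
    for row_scores, row_indices in zip(scores, indices):
        kept = row_indices[row_scores > threshold]
        if len(kept) < min_k:
            kept = row_indices[:min_k]  # fall back to the best min_k hits
        selected.append(kept.tolist())
    return selected


# Made-up search results for two queries, sorted by descending similarity.
scores = np.array([[0.82, 0.71, 0.65, 0.40], [0.55, 0.41, 0.30, 0.12]])
indices = np.array([[7, 3, 9, 1], [4, 8, 2, 5]])

selected = select_indices(scores, indices, threshold=0.6, min_k=2)
print(selected)
```

For the first query three hits pass the threshold. For the second query none do, so the two best hits are used anyway, which is the role of `RETRIEVE_MIN_K`.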

## **5. Generate answer functions**

In [10]:
class CustomRAGModel:
    def __init__(
        self, generation_model: AutoModelForSeq2SeqLM, tokenizer: AutoTokenizer
    ):
        super().__init__()
        self.generation_model = generation_model
        self.tokenizer = tokenizer

    def _generate_outputs(
        self,
        input: Dict,
        max_input_length: int,
        generation_config: Optional[GenerationConfig] = None,
    ) -> List[str]:
        """
        Tokenizes the prompts, generates the outputs and decodes them to text.
        """
        inputs = self.tokenizer(
            input["prompt"],
            return_tensors="pt",
            truncation=True,
            max_length=max_input_length,
            padding=True,
        ).to(self.generation_model.device)

        # Pass the attention mask explicitly so that padding tokens are ignored.
        # generation_config=None falls back to the model's default generation config.
        outputs = self.generation_model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            generation_config=generation_config,
        )

        answer = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        return answer

    def generate(
        self,
        prompt: str | List[str],
        generation_config: Optional[GenerationConfig] = None,
        batch_size: int = 8,
    ) -> List[str]:
        """
        Generate answer based on the prompt.
        """
        # Wrap a single prompt so that len(prompt) counts prompts, not characters.
        prompt = [prompt] if isinstance(prompt, str) else prompt

        max_new_tokens = (
            generation_config.max_new_tokens
            if generation_config is not None
            and generation_config.max_new_tokens is not None
            else 20
        )
        max_input_length = self.tokenizer.model_max_length - max_new_tokens

        if len(prompt) > batch_size:
            from torch.utils.data import DataLoader

            dataloader = DataLoader(
                Dataset.from_dict({"prompt": prompt}), batch_size=batch_size
            )
            answers = []
            for batch in tqdm(dataloader, total=len(dataloader)):
                # Use the config passed to this method, not the global one.
                answer = self._generate_outputs(
                    input=batch,
                    max_input_length=max_input_length,
                    generation_config=generation_config,
                )
                answers.extend(answer)
        else:
            answers = self._generate_outputs(
                input={"prompt": prompt},
                max_input_length=max_input_length,
                generation_config=generation_config,
            )

        return answers

In [11]:
class CustomRAGPipeline:
    def __init__(
        self,
        retrieval: CustomRAGRetrieval,
        model: CustomRAGModel,
        tokenizer: AutoTokenizer,
    ):
        super().__init__()
        self.retrieval = retrieval
        self.model = model
        self.tokenizer = tokenizer

    def _create_prompt(
        self, context: List[str], question: str | List[str]
    ) -> List[str]:
        """
        Creates the final prompt.
        """
        if isinstance(question, str):
            # A single question comes with a single-element context list.
            return [f"Context: {context[0]} Question: {question} Answer: "]
        return [
            f"Context: {c} Question: {q} Answer: " for c, q in zip(context, question)
        ]

    def generate_answer(
        self,
        question: str | List[str],
        top_k: int = 5,
        retrieve_distance_threshold: float = 0.0,
        retrieve_min_k: int = 1,
        generation_config: Optional[GenerationConfig] = None,
        batch_size: int = 8,
    ) -> Dict | List[Dict]:
        """
        Generates answer to the given question.
        """
        retrieved_texts = self.retrieval.retrieve(
            query=question,
            top_k=top_k,
            retrieve_distance_threshold=retrieve_distance_threshold,
            retrieve_min_k=retrieve_min_k,
        )
        context = [" ".join(rt) for rt in retrieved_texts]
        prompt = self._create_prompt(context=context, question=question)
        answers = self.model.generate(
            prompt=prompt, generation_config=generation_config, batch_size=batch_size
        )
        if isinstance(question, str):
            # Use the same keys as in the batched case below.
            return {
                "answer": answers[0],
                "retrieved_texts": retrieved_texts[0],
                "prompt": prompt[0],
                "question": question,
            }
        return [
            {"answer": a, "retrieved_texts": rt, "prompt": p, "question": q}
            for a, rt, p, q in zip(answers, retrieved_texts, prompt, question)
        ]

## **6. Create pipeline for text data embedding and retrieval**

Load and preprocess data

In [12]:
texts = load_and_preprocess_data()

Loading Wikipedia dataset.
Removing irrelevant sections from dataset.
Splitting long texts into chunks.
Removing unnecessary tokens.
Deleting duplicates.
Total documents: 40034


Loading or creating a vector database from the embeddings

In [13]:
index_path = cfg.SAVE_PATH + "vector_db.index"
if os.path.exists(index_path) and not cfg.OVERRIDE_EXISTING_VECTOR_DB:
    print(f"Loading existing vector database from {index_path}.")
    vector_db = faiss.read_index(index_path)
else:
    print("Creating a new vector database.")
    embeddings = generate_embeddings(texts)
    vector_db = create_vector_db(embeddings)
    faiss.write_index(vector_db, index_path)

Creating a new vector database.



Create generation model

In [14]:
def get_generation_model() -> AutoModelForSeq2SeqLM:
    if cfg.QUANTIZE_GENERATION_MODEL:
        quantization_config = BitsAndBytesConfig(load_in_8bit=True)
        return AutoModelForSeq2SeqLM.from_pretrained(
            cfg.GENERATION_MODEL_NAME,
            quantization_config=quantization_config,
            device_map="auto",
        )
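    # Without quantization the model is loaded on the CPU by default, so move it to the configured device.
    return AutoModelForSeq2SeqLM.from_pretrained(cfg.GENERATION_MODEL_NAME).to(cfg.DEVICE)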

generation_model = get_generation_model()

Create the RAG pipeline

In [15]:
rag_retrieval = CustomRAGRetrieval(vector_db=vector_db, texts=texts)
rag_model = CustomRAGModel(generation_model=generation_model, tokenizer=tokenizer)
rag_pipeline = CustomRAGPipeline(retrieval=rag_retrieval, model=rag_model, tokenizer=tokenizer)

Generate some example answers

In [16]:
example_questions = [
    "What is ASCII?",
    "What is artificial intelligence?",
    "Who was Albert Einstein?",
    "Where is Alabama located?",
    "What is the function of an astronaut?",
    "What is the purpose of an albedo measurement?",
    "Who wrote Animal Farm?",
]

results = rag_pipeline.generate_answer(
    question=example_questions,
    top_k=cfg.RETRIEVE_TOP_K,
    retrieve_distance_threshold=cfg.RETRIEVE_DISTANCE_THRESHOLD,
    retrieve_min_k=cfg.RETRIEVE_MIN_K,
    generation_config=generation_config,
    batch_size=cfg.GENERATION_BATCH_SIZE,
)

for result in results:
    question = result["question"]
    answer = result["answer"]
    print(f"\n" + "=" * 50)
    print(f"Question: {question}\nAnswer: {answer}")


Question: What is ASCII?
Answer: American Standard Code for Information Interchange

Question: What is artificial intelligence?
Answer: intelligence demonstrated by machines

Question: Who was Albert Einstein?
Answer: theoretical physicist

Question: Where is Alabama located?
Answer: Southeastern region of the United States

Question: What is the function of an astronaut?
Answer: serve as a commander or crew member aboard a spacecraft

Question: What is the purpose of an albedo measurement?
Answer: energy estimates

Question: Who wrote Animal Farm?
Answer: George Orwell


### Bottleneck test
If the vector db does not contain information relevant to the question, the model cannot generate a good answer from the retrieved context.

In [17]:
q = "What is RAG system in computer science?"
res = rag_pipeline.generate_answer(
        question=q,
        top_k=cfg.RETRIEVE_TOP_K,
        retrieve_distance_threshold=cfg.RETRIEVE_DISTANCE_THRESHOLD,
        retrieve_min_k=cfg.RETRIEVE_MIN_K,
        generation_config=generation_config,
    )
ans = res["answer"]
print(f"\n" + "=" * 50)
print(f"Question: {q} \nAnswer: {ans}")


Question: What is RAG system in computer science? 
Answer: Structured systems analysis and design


## **7. Performance and testing**

To test the RAG pipeline, I used a test set that contains questions about Wikipedia articles.

"Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."

Dataset available: https://rajpurkar.github.io/SQuAD-explorer/

In [18]:
def read_json(file_path: str) -> Dict:
    with open(cfg.SAVE_PATH + file_path, "r", encoding="utf-8") as f:
        return json.load(f)

titles = set(read_json(file_path= "topics.json")["topics"])

test_dataset = load_dataset(cfg.TEST_DATASET_NAME)

def filter_by_topics(example: Dict, titles: set) -> bool:
    """
    Keeps only the examples whose title is in the vector db. If the vector db does not contain relevant information about the given topic, the answer will not be good.
    """
    return example["title"] in titles

def filter_by_topic_count(dataset, max_question_to_each_topic):
    """
    Selects at most 'max_question_to_each_topic' questions for each topic.
    """
    title_counts = defaultdict(int)
    filtered_data = []

    for example in dataset:
        if title_counts[example["title"]] < max_question_to_each_topic:
            filtered_data.append(example)
            title_counts[example["title"]] += 1

    return Dataset.from_list(filtered_data)

test_dataset = test_dataset.filter(partial(filter_by_topics, titles=titles), num_proc=cfg.NUM_PROC)
test_dataset = concatenate_datasets([test_dataset["train"], test_dataset["validation"]])
test_dataset = filter_by_topic_count(test_dataset, cfg.MAX_QUESTION_TO_EACH_TOPIC)

Generate answers to all the questions

In [19]:
questions = test_dataset["question"]
test_generated_answers = rag_pipeline.generate_answer(
    question=questions,
    top_k=cfg.RETRIEVE_TOP_K,
    retrieve_distance_threshold=cfg.RETRIEVE_DISTANCE_THRESHOLD,
    retrieve_min_k=cfg.RETRIEVE_MIN_K,
    generation_config=generation_config,
    batch_size=cfg.GENERATION_BATCH_SIZE,
)

100%|██████████| 10/10 [00:21<00:00,  2.18s/it]


In [20]:
test_results = {}
for test_data, test_gen in zip(test_dataset, test_generated_answers):
    test_results[test_data["id"]] = {
        "title": test_data["title"],
        "question": test_data["question"],
        "reference": test_data["answers"],
        "gt_context": [test_data["context"]],
        "response": test_gen["answer"],
        "retrieved_contexts": test_gen["retrieved_texts"]
    }

write_json(test_results, "test_results.json")

Evaluation of the results using different metrics
- squad_v2 metric: evaluates exact matching and f1-score
- rouge metric: evaluates the overlap between generated text and reference text by comparing common n-grams, sequences, or words
- meteor metric: evaluates the similarity between generated text and reference text by considering exact matches, synonyms, stemming, and word order
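
To make the exact match and F1 scores more concrete, here is a simplified sketch of how SQuAD-style scoring compares a prediction with a reference. The normalization (lowercasing, removing punctuation and articles) follows the general idea of the official script, but this is an illustrative reimplementation under that assumption, not the code behind `evaluate.load("squad_v2")`. The reference strings are made up.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("George Orwell.", "george orwell")
f1 = token_f1("theoretical physicist", "German-born theoretical physicist")
print(em, f1)
```

A short generated answer that is contained in a longer reference gets an exact match of 0 but a partial F1 score, which is why F1 is usually higher than exact match for generative models.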

In [21]:
GENERATION_METRICS = {
    "squad_v2": evaluate.load("squad_v2"),
    "rouge": evaluate.load("rouge", trust_remote_code=True),
    "meteor": evaluate.load("meteor", trust_remote_code=True)
    }

def calculate_metrics(metrics: Dict, results: Dict):
    metrics_result = {}
    pred_answers = [value["response"] for key, value in results.items()]
    gt_answers = [value["reference"]["text"][0] for key, value in results.items()]
    for metric_name, metric in metrics.items():
        if "squad_v2" == metric.name:
            predictions_squad_v2_format = [{'prediction_text': value["response"], 'id': key, 'no_answer_probability': 0.} for key, value in results.items()]
            references_squad_v2_format = [{"answers": value["reference"], 'id': key} for key, value in results.items()]
            metric_res = metric.compute(predictions=predictions_squad_v2_format, references=references_squad_v2_format)
        else:
            metric_res = metric.compute(predictions=pred_answers, references=gt_answers)
        metrics_result[metric_name] = metric_res

    return metrics_result

test_generation_metrics_result = calculate_metrics(metrics=GENERATION_METRICS, results=test_results)
write_json(test_generation_metrics_result, "test_generation_metrics_result.json")
for m_name, m_res in test_generation_metrics_result.items():
    print(f"{m_name}: {m_res}")
    print("="*50)


squad_v2: {'exact': 30.0, 'f1': 35.571022727272734, 'total': 80, 'HasAns_exact': 30.0, 'HasAns_f1': 35.571022727272734, 'HasAns_total': 80, 'best_exact': 30.0, 'best_exact_thresh': 0.0, 'best_f1': 35.571022727272734, 'best_f1_thresh': 0.0}
rouge: {'rouge1': 0.35419084589684685, 'rouge2': 0.1375, 'rougeL': 0.3490970807514925, 'rougeLsum': 0.34859474129338924}
meteor: {'meteor': 0.24793768635741764}


Test retrieval:

To evaluate the retrieval, the cosine similarity score is calculated between the ground truth context and each retrieved context.
Additional metrics are then computed from these scores: average maximum similarity, Precision@k and Recall@k.

In [22]:
def compute_cosine_similarity(gt_context: List, retrieved_contexts: List):
    """
    Computes cosine similarity score between gt context and retrieved context.
    """
    results = []
    for gt_cont, retrieved_cont in zip(gt_context, retrieved_contexts):
        gt_embeddings = embedding_model.encode(gt_cont, convert_to_tensor=True)
        tr = [
            util.pytorch_cos_sim(
                embedding_model.encode(ret_cont, convert_to_tensor=True),
                gt_embeddings
            ).mean().item()
            for ret_cont in retrieved_cont
        ]
        results.append(tr)
    return results

def precision_at_k(cosine_similarities: List, threshold: float = 0.6):
    """
    Computes precision based on cosine similarity scores using a specified threshold value.
    A retrieved context counts as relevant if its similarity is above the threshold.
    """
    precisions = []
    for sim_list in cosine_similarities:
        relevant_in_k = sum(1 for sim in sim_list if sim > threshold)
        precision = relevant_in_k / len(sim_list) if len(sim_list) > 0 else 0
        precisions.append(precision)
    return sum(precisions) / len(precisions)

def recall_at_k(cosine_similarities: List, threshold: float = 0.6):
    """
    Computes recall based on cosine similarity scores using a specified threshold value.
    Only the retrieved contexts are scored, so relevant items outside the top-k are unknown.
    The value per question is therefore 1 if at least one retrieved context is relevant,
    otherwise 0, and the result is the share of questions with at least one relevant context.
    """
    recalls = []
    for sim_list in cosine_similarities:
        total_relevant = sum(1 for sim in sim_list if sim > threshold)
        relevant_in_k = total_relevant  # every scored context is already inside the top-k
        recall = relevant_in_k / total_relevant if total_relevant > 0 else 0
        recalls.append(recall)
    return sum(recalls) / len(recalls)

gt_context = [value["gt_context"] for _, value in test_results.items()]
retrieved_contexts = [value["retrieved_contexts"] for _, value in test_results.items()]
contexts_cosine_similarity = compute_cosine_similarity(gt_context, retrieved_contexts)

max_similarity = [max(lst) for lst in contexts_cosine_similarity if len(lst) > 0]
average_max_similarity = sum(max_similarity) / len(max_similarity)

precision = precision_at_k(cosine_similarities=contexts_cosine_similarity, threshold=cfg.TEST_COS_SIM_THRESHOLD)
recall = recall_at_k(cosine_similarities=contexts_cosine_similarity, threshold=cfg.TEST_COS_SIM_THRESHOLD)

In [23]:
print(f"Average max similarity: {average_max_similarity}")
print(f"Precision@{cfg.RETRIEVE_TOP_K}: {precision}")
print(f"Recall@{cfg.RETRIEVE_TOP_K}: {recall}")

Average max similarity: 0.7396492242813111
Precision@5: 0.48625000000000007
Recall@5: 0.825


## **8. Conclusion, summary**

##### **Dataset**
I have chosen the Wikipedia dataset to create the RAG pipeline. It contains a huge amount of diverse data for the vector store.
The following steps were applied:
- removing irrelevant sections: sections such as references or external links add no useful information and are only overhead for the retrieval
- splitting: long texts cannot be fed to the embedding model without losing information, and the generative model cannot handle them because of its context window limit
- cleaning: after chunking, there are many leftover characters that are not useful (e.g. a chunk that starts like ". A ....", where the leading ". " carries no information)
- removing duplicates: duplicated chunks are removed so that each entry in the vector store is unique

##### **Huggingface**
I have chosen the Hugging Face libraries to load the models and create the RAG pipeline.
Hugging Face provides open-source libraries focused on NLP tasks, offering many useful APIs and supporting infrastructure.
They give access to many models, datasets and tokenizers that are easy to use.
Hugging Face also has strong community support, and many fine-tuned models are available for specialized tasks.

##### **Embedding model**

The embedding model I have chosen is: **multi-qa-distilbert-cos-v1**

This model is fine-tuned on QA datasets, which makes it a good choice when the query is a question. It is lightweight, making inference very fast.
It is based on DistilBERT, which means the teacher network for this model was BERT, a large, robust model with a significant number of parameters.
Distillation refers to a process where the original BERT model acts as a teacher model, and a smaller model with fewer parameters is trained to replicate the predictions of the original BERT. According to the original paper, DistilBERT reduces the size of BERT by 40% while retaining 97% of its language understanding capabilities.
The "multi-qa-distilbert-cos-v1" model is also specifically optimized for cosine similarity search, making it particularly useful for retrieval.

Original paper about distilbert: https://arxiv.org/pdf/1910.01108

Model: https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1

##### **Vector database**

The vector store I have chosen is: **Faiss**

Faiss is a fast and efficient library for vector similarity search, with index types that can reduce memory usage. It scales to large collections, and the flat index used here performs exact search. Faiss operates as an in-memory vector store (unlike, e.g., Chroma, which can persist its data), although it also offers on-disk index types for datasets that exceed the available RAM. Additionally, it supports both GPU and CPU for enhanced performance and flexibility.

##### **Generative model**

The generative model I have chosen is: **google/flan-t5-large**

T5 is an encoder-decoder model, meaning the input is encoded (enabling better input understanding), and then the output is generated using cross-attention between the encoded input and the output tokens.
The model generates the output in an autoregressive manner, with the decoder attending to the encoded input at each generation step.
"Flan" means that the model was instruction fine-tuned on a large collection of tasks, including QA datasets and chain-of-thought reasoning, to enhance its ability to provide accurate answers from the given context, making it a good fit for QA pipelines.
Encoder-only models are not the best for generating answers based on context; they are better at classification tasks, like sentiment analysis.
On the other hand, decoder-only models are effective for text generation tasks, but they use only self-attention to predict the next token from the prompt and the previously generated tokens.
As a result, if the question is too short or the context is too large, the generated answer may lack faithfulness or accuracy.
This is because these models do not use cross-attention over a separately encoded input.

Original paper: https://arxiv.org/pdf/2210.11416v5
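
The difference between self-attention and cross-attention described above can be sketched in a few lines of plain Python. This is generic scaled dot-product attention, not T5's actual implementation: there are no learned projections, no causal mask and no position bias, and the hidden states are toy 2-d vectors I made up. The only point is where the keys and values come from.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy hidden states (hypothetical values, not real T5 activations).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoded question + context
decoder_states = [[0.5, 0.5], [2.0, 0.0]]              # tokens generated so far

# Self-attention: queries, keys and values all come from the decoder itself.
self_out, self_w = attention(decoder_states, decoder_states, decoder_states)

# Cross-attention: decoder queries read the encoder's representation of the input.
cross_out, cross_w = attention(decoder_states, encoder_states, encoder_states)

print("self-attention weights: ", self_w)
print("cross-attention weights:", cross_w)
```

In the cross-attention call every generated token gets one weight per encoded input position, so each decoding step can look back at the full encoded context.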

##### **Test**

Test dataset: **rajpurkar/squad** (https://rajpurkar.github.io/SQuAD-explorer/)
Available on Huggingface: https://huggingface.co/datasets/rajpurkar/squad

SQuAD (Stanford Question Answering Dataset) contains questions and answers based on Wikipedia articles.
Some questions may be unanswerable in this setup: there is a version difference between SQuAD and the downloaded "20220301.en" Wikipedia dump, so the indexed text may not match the original SQuAD context for every question.
Because I used only a subset of the Wikipedia dataset, the vector store does not contain all the data needed to answer every question, so the test set is filtered to the topics that are present in the vector store.
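
The filtering step could look roughly like the sketch below. It assumes that topics are matched by article title and that titles need light normalization (SQuAD titles use underscores, Wikipedia titles use spaces); both are my assumptions for illustration, and the helper names and sample records are hypothetical. The notebook's actual filter may differ.

```python
def normalize_title(title):
    """Make titles comparable: underscores to spaces, trimmed, case-folded."""
    return title.replace("_", " ").strip().casefold()


def filter_by_indexed_titles(examples, indexed_titles):
    """Keep only QA examples whose article title exists in the vector store."""
    allowed = {normalize_title(t) for t in indexed_titles}
    return [ex for ex in examples if normalize_title(ex["title"]) in allowed]


# Hypothetical miniature data in the shape of SQuAD records.
squad_examples = [
    {"title": "University_of_Notre_Dame", "question": "q1"},
    {"title": "Super_Bowl_50", "question": "q2"},
    {"title": "Nikola_Tesla", "question": "q3"},
]
# Hypothetical titles of the articles that were embedded and indexed.
indexed_titles = ["University of Notre Dame", "Nikola Tesla"]

kept = filter_by_indexed_titles(squad_examples, indexed_titles)
print([ex["question"] for ex in kept])
```

With a Hugging Face `Dataset`, the same predicate can be passed to `Dataset.filter` instead of a list comprehension.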

**Retrieval test**

To test the retrieval, the retrieved contexts are compared with the ground truth context of each question using cosine similarity and several metrics derived from it: recall@top_k, precision@top_k and average maximum similarity.
- cosine similarity: measures the similarity between the retrieved context and the ground truth context by calculating the cosine of the angle between their vector representations. The vectors should be normalized to unit length so that the scores are comparable across queries.
- recall@top_k: evaluates the percentage of relevant contexts that are present within the top-k retrieved results
- precision@top_k: measures the proportion of retrieved contexts within the top-k results that are relevant
- average maximum similarity: computes the average of the highest similarity scores between the retrieved contexts and the ground truth for each query
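
The metrics listed above can be sketched as follows. Two things here are my assumptions, not necessarily what the notebook does: a retrieved chunk counts as "relevant" when its cosine similarity to the ground truth context reaches a threshold, and each query has exactly one ground truth context, so recall@top_k reduces to "was it found in the top k". The similarity scores are made-up numbers.

```python
def precision_at_k(similarities, k, threshold):
    """Share of the top-k retrieved chunks that count as relevant."""
    if k <= 0:
        return 0.0
    return sum(s >= threshold for s in similarities[:k]) / k


def recall_at_k(similarities, k, threshold):
    """With one ground truth context per query: 1.0 if it was found in the top k."""
    return 1.0 if any(s >= threshold for s in similarities[:k]) else 0.0


def average_max_similarity(per_query_similarities):
    """Mean over queries of the best similarity among that query's retrieved chunks."""
    best = [max(sims) for sims in per_query_similarities if sims]
    return sum(best) / len(best) if best else 0.0


# Hypothetical scores: one row per query, one value per retrieved chunk (best-first).
scores = [
    [0.91, 0.40, 0.85],
    [0.30, 0.20, 0.10],
]
k, threshold = 3, 0.8  # assumed values for illustration

mean_precision = sum(precision_at_k(s, k, threshold) for s in scores) / len(scores)
mean_recall = sum(recall_at_k(s, k, threshold) for s in scores) / len(scores)
avg_max = average_max_similarity(scores)
print(mean_precision, mean_recall, avg_max)
```

The threshold strongly affects precision and recall, so it is worth reporting it together with the scores.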


**Generated answers test**

To test the quality of the generated answers, I used the SQuAD metrics (exact match, F1-score), ROUGE and METEOR.
- Exact match: measures the percentage of responses that match the ground truth answer exactly
- F1-score: calculates the harmonic mean of token-level precision and recall, so partial matches receive credit
- ROUGE: evaluates the overlap between the generated answer and the reference text, focusing on the recall of n-grams, longest common subsequences, or word sequences
- METEOR: measures the similarity between the generated text and the reference, accounting for synonyms, stemming, and word order flexibility
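
To show what the two SQuAD metrics actually compute, here is a small self-contained sketch that follows the usual SQuAD-style normalization (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace). It is meant as an illustration only; the scores in this notebook come from the `evaluate` library, not from this code, and the example strings are made up.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = f1_score("in the city of Paris", "Paris, France")
print(em, round(f1, 4))
```

The first pair is an exact match after normalization; the second shares one token out of four predicted and two reference tokens, which gives partial credit under F1 but zero under exact match.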

**Other test metrics**

There are other tools, such as RAGAS and DeepEval, for evaluating RAG system performance, but they require some extra setup steps, such as configuring an API key.
These tools can evaluate the whole RAG pipeline and calculate metrics such as faithfulness, context recall, context precision, response relevancy and more.


##### **Bottlenecks**
- Data: preprocessing before embedding is a critical step. If the documents contain a lot of misinformation, are badly structured, or include duplicates, both the embedding and the retrieval process may fail, no matter how capable the embedding model is
- Retrieval: if the retrieved contexts are not representative, the answer will not be helpful
- Embedding: embedding quality is key to retrieving documents quickly and efficiently. If the model does not produce good embeddings, the retrieved documents can drift away from the topic of the query
- Hallucination: if the retrieved context is not informative or contains misinformation, the model can hallucinate
- Context window size: the Flan-T5-large model is a good choice for generating answers, but its context window is small: 512 tokens. Some questions may require long answers, which this system cannot produce because of its context window size
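
One simple way to live with the 512-token limit is to keep only as many retrieved contexts as fit into the budget. The sketch below is a hypothetical helper, not the notebook's code: it assumes the contexts arrive sorted best-first, reserves some tokens for the prompt template (the value 32 is an arbitrary assumption), and counts tokens by whitespace splitting as a stand-in for the real tokenizer.

```python
def count_tokens(text):
    """Stand-in tokenizer: whitespace split (a real subword tokenizer yields more tokens)."""
    return len(text.split())


def pack_contexts(question, contexts, max_tokens=512, reserved=32):
    """Greedily keep the highest-ranked contexts that fit the remaining token budget."""
    budget = max_tokens - reserved - count_tokens(question)
    kept = []
    for ctx in contexts:  # assumed to be sorted best-first
        cost = count_tokens(ctx)
        if cost > budget:
            break  # stop at the first context that no longer fits
        kept.append(ctx)
        budget -= cost
    return kept


# Hypothetical inputs: three contexts of 300, 150 and 100 whitespace tokens.
question = "Who designed the tower?"
contexts = ["a " * 300, "b " * 150, "c " * 100]

kept = pack_contexts(question, contexts)
print(len(kept), "of", len(contexts), "contexts fit")
```

With the real model, `count_tokens` could instead return the length of `tokenizer(text)["input_ids"]`, which gives the exact count the model sees.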

##### **Future work**
- improved data preprocessing: although the data is already preprocessed, further preprocessing methods can be applied.
    - removing low-quality or redundant chunks using embedding-based similarity measures such as cosine similarity or L2 distance
    - summarizing long texts before chunking, which prevents the chunking method from splitting a paragraph into too many pieces that are meaningless on their own
    - more precise chunking: logically coherent chunking
    - using text data cleaning tools such as Cleanlab, spaCy or NLTK
- advanced prompting: the prompt is very clean and simple, but with a more precise prompt the model may be able to generate better answers
- fine-tuning: fine-tune the generative model on a Wikipedia QA dataset and fine-tune the embedding model
- hybrid retrieval methods: combine dense vector retrieval (e.g., DPR-style embeddings) with sparse retrieval methods such as BM25 to improve the relevance of the retrieved documents
- using advanced quantization techniques to load a bigger model
- context summarization: after retrieval, summarize the documents to keep only the relevant parts of the long context
- using map-reduce or refinement techniques to handle longer contexts and questions
- more accurate testing: use external tools, for example RAGAS or DeepEval
    - LLM testing: ROUGE, METEOR and the SQuAD metrics are good, but they do not capture contextual meaning.
    - Retrieval testing: recall, precision and max similarity scores are a good baseline, but using rerank methods can be a big plus
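
As an illustration of the hybrid retrieval idea, the sketch below merges a dense ranking and a sparse ranking with reciprocal rank fusion (RRF). RRF is a technique I am swapping in for illustration; it is not something this notebook implements, and other fusion schemes exist. The document IDs and rankings are hypothetical, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ID rankings; a higher fused score means more relevant."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken by ID to keep the order deterministic.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], str(doc_id)))


# Hypothetical rankings for one query (best match first).
dense_ranking = ["d3", "d1", "d2"]   # e.g. from the vector store
sparse_ranking = ["d1", "d4", "d3"]  # e.g. from a BM25 index

fused_ranking = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused_ranking)
```

Because RRF uses only ranks, it does not need the dense and sparse scores to be on the same scale, which makes it a low-effort first experiment.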