## Overview: Evaluating a Quote-Based RAG System

This project builds and evaluates a Retrieval-Augmented Generation (RAG) system designed to answer user queries using a curated collection of philosophical quotes. The goal is to create a 'wise philosopher' AI that can leverage a knowledge base of quotes to provide insightful and relevant responses.

### The Process:

Our RAG system operates in several key steps:

1.  **Data Preparation (Offline/Pre-processing)**:
    *   **Embedding Generation**: Philosophical quotes are transformed into numerical vector representations (embeddings) using a fine-tuned `SentenceTransformer` model (`fine_tuned_qoute-retriever`). This process captures the semantic meaning of each quote.
    *   **Vector Database Creation**: These embeddings are then stored in a `FAISS` vector index (`quotes_vector_db.faiss`) for efficient similarity search. Associated metadata (the original quotes) is stored in a Pandas DataFrame (`quotes_metadata.pkl`).

2.  **Query Processing (During Runtime - `query_response` function)**:
    *   **User Query Embedding**: When a user submits a query, it is also converted into a vector embedding using the same `fine_tuned_qoute-retriever` model.
    *   **Context Retrieval**: The query embedding is used to search the `FAISS` vector index for the top `k` (in our case, `k=3`) most semantically similar quote embeddings. The corresponding original quotes are retrieved from the metadata.
    *   **Answer Generation**: The retrieved quotes are then provided as context to a Large Language Model (LLM). The LLM's task is to act as a 'wise philosopher' and synthesize an answer to the user's question, drawing inspiration from or directly using the provided quotes.

3.  **System Evaluation (Using Ragas)**:
    *   To assess the performance of our RAG system, we use the `Ragas` framework. Ragas helps us quantitatively measure how well our system retrieves relevant information and generates grounded, relevant answers.
    *   A set of `evaluation_queries` is used to simulate user interactions.
    *   For each query, the `query_response` function is called, and the question, generated answer, and retrieved contexts are logged.
    *   Ragas then calculates several metrics using a judge LLM and an embedding model. The judge is the same Groq-hosted model that generates the answers, so the scores are not fully independent of the system under test and are best read as indicative.
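
The context retrieval step in item 2 can be pictured with a small, dependency-free sketch. It is not the FAISS code the project uses: the three-dimensional vectors, the quote labels, and the `top_k` helper are made up for illustration, and brute-force cosine similarity stands in for the index search (the metric of the real index is not stated in this document).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, quote_vecs, quotes, k=3):
    """Return the k quotes whose vectors are most similar to query_vec."""
    scored = sorted(
        zip(quotes, quote_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [quote for quote, _ in scored[:k]]

# Toy "embeddings" (hypothetical values, not produced by the real model).
quotes = ["hope quote A", "wisdom quote", "courage quote", "hope quote B"]
quote_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0], [0.8, 0.2, 0.1]]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded query "Quotes about hope"

print(top_k(query_vec, quote_vecs, quotes, k=2))
```

With these toy vectors the two "hope" entries rank first because their vectors point in nearly the same direction as the query vector.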


### Large Language Models (LLMs) and Embedding Models Used:

1.  **For Answer Generation (within `query_response`)**:
    *   **Model**: `llama-3.3-70b-versatile` from Groq.
    *   **Why**: Groq's inference engine provides extremely fast response times, which is crucial for a responsive RAG system. The `llama-3.3-70b-versatile` model is chosen for its strong performance in complex reasoning and text generation tasks, making it suitable for generating philosophical answers.

2.  **For Retrieval (Embedding Model)**:
    *   **Model**: A fine-tuned `SentenceTransformer` model (`/content/drive/MyDrive/Colab Notebooks/fine_tuned_qoute-retriever`).
    *   **Why**: `SentenceTransformers` are highly effective for generating dense vector embeddings that capture semantic similarity. The fact that it's *fine-tuned* specifically for quotes suggests it's optimized to understand the nuances and themes within our quote dataset, leading to more accurate retrieval of relevant context.

3.  **For Ragas Evaluation Metrics**:
    *   **LLM**: `llama-3.3-70b-versatile` from Groq (re-used via `langchain_groq.ChatGroq`).
    *   **Why**: These metrics rely on the LLM to judge aspects like relevance and faithfulness, so a capable and fast model keeps evaluation quick and the judgments reasonably reliable. Because the same model also generates the answers, some bias in its own favor is possible.
    *   **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2` (via `HuggingFaceEmbeddings`).
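    *   **Why**: This is a widely used and efficient general-purpose embedding model. Ragas needs it for embedding-based metrics such as `AnswerRelevancy`, which compares the original question with questions regenerated from the answer; `ContextPrecision`, `ContextRecall`, and `Faithfulness` are judged by the LLM.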


### Evaluation Results (from Ragas):

After running the evaluation, we obtained the following scores for our RAG system:

```
{'context_precision': 1.0000, 'context_recall': 0.3790, 'faithfulness': 0.4556, 'answer_relevancy': 0.7978}
```

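*   **Context Precision: 1.0000**
    *   **Meaning**: The judge LLM found every retrieved quote useful with respect to the reference answer, with relevant quotes ranked first. Note that the evaluation script passes the generated answer itself as `ground_truth`, so this score shows that the quotes are consistent with what the LLM went on to say rather than independently confirming that they are *on-topic* for the question.
    *   **Implication**: The fine-tuned embedding model and FAISS index appear to identify relevant quotes, but confirming this requires hand-written reference answers.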

*   **Context Recall: 0.3790**
    *   **Meaning**: Only about 37.9% of the claims in the reference answers could be attributed to the retrieved quotes. Since the reference here is the generated answer, this figure says less about missing context than about how much of the answer goes beyond the three retrieved quotes.
    *   **Implication**: This is an area for significant improvement. We might need to retrieve more quotes (a larger `k`), refine the embedding model further, or explore alternative retrieval strategies. Measuring how much necessary context is actually missed requires independent reference answers.

*   **Faithfulness: 0.4556**
    *   **Meaning**: Approximately 45.6% of the statements made in the generated answers were directly supported by the retrieved quotes. The LLM often generates information not explicitly present in the provided context.
    *   **Implication**: This suggests a degree of 'hallucination' or generation beyond the given facts. Part of it is by design: the prompt tells the model to use the quotes as inspiration when they are not enough, and generation runs at `temperature=0.7`. Improving `Context Recall` could indirectly boost faithfulness, as more complete context gives the LLM more material to ground its answers. We could also tighten the prompt so that the model sticks to the provided context.

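*   **Answer Relevancy: 0.7978**
    *   **Meaning**: The score is the average similarity between each question and questions the judge LLM regenerates from the answer. A value near 0.80 indicates answers that mostly address what was asked.
    *   **Implication**: Despite the issues with `Context Recall` and `Faithfulness`, the LLM is generally successful at addressing the core of the user's query. The evaluation log reports three jobs that failed with a Groq `'n' : number must be at most 1` error. If those failures belong to this metric, which by default asks the LLM for several completions per call, the average rests on fewer than five queries.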
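
To make the arithmetic behind these scores concrete, here is a rough sketch of how three of the ratios are formed once the judge LLM has produced per-claim or per-quote verdicts. The verdict lists are hypothetical and the helper names are not part of Ragas; the context-precision function follows the rank-weighted definition described in the Ragas documentation, as far as we understand it.

```python
def ratio_supported(verdicts):
    """Share of claims judged supported (the form of faithfulness and context recall)."""
    return sum(verdicts) / len(verdicts) if verdicts else float("nan")

def context_precision(relevance):
    """Mean of precision@k taken over the ranks that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical verdicts for one query (1 = supported / relevant, 0 = not).
answer_claims = [1, 0, 0, 1, 0]   # claims in the generated answer vs. retrieved quotes
reference_claims = [1, 0, 0]      # claims in the reference answer vs. retrieved quotes
retrieved_relevance = [1, 1, 1]   # one verdict per retrieved quote, in rank order

print(ratio_supported(answer_claims))          # faithfulness-style ratio
print(ratio_supported(reference_claims))       # context-recall-style ratio
print(context_precision(retrieved_relevance))  # every quote relevant
print(context_precision([0, 1, 1]))            # top-ranked quote irrelevant
```

A context precision of exactly 1.0 with `k=3` therefore requires all three quotes to be judged relevant for every query, while a single irrelevant quote at rank 1 already pulls the per-query value well below 1.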

### Overall Summary of Results:

Our RAG system demonstrates strong capabilities in identifying relevant information (`Context Precision`) and generating overall relevant answers (`Answer Relevancy`). However, its main weaknesses lie in `Context Recall` and `Faithfulness`: much of what the LLM says cannot be traced back to the retrieved quotes. Two caveats apply to all four numbers: the reference answers are the system's own outputs, and the test set has only five queries. Future work should focus on improving the comprehensiveness of retrieval, ensuring the LLM adheres more strictly to the given context, and re-running the evaluation against independent reference answers, to enhance the reliability and completeness of its philosophical responses.
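
One low-cost way to act on the faithfulness finding is to tighten the generation prompt. The sketch below is one possible wording, not the prompt used in the notebook; `build_strict_prompt` and its exact instructions are assumptions to be tuned against the evaluation.

```python
def build_strict_prompt(query, retrieved_quotes):
    """Build a prompt that asks the model to stay within the retrieved quotes."""
    context_text = "\n".join(f"- {q}" for q in retrieved_quotes)
    return (
        "You are a wise philosopher. Answer the user's question using ONLY the quotes below.\n"
        "Do not add facts, attributions, or quotes that are not listed.\n"
        "If the quotes do not address the question, say so instead of guessing.\n\n"
        f"Retrieved Quotes:\n{context_text}\n\n"
        f"User Question: {query}\n\n"
        "Respond in JSON format with your answer."
    )

# Placeholder quotes; in the notebook these would come from the FAISS lookup.
prompt = build_strict_prompt("Quotes about hope", ["placeholder quote 1", "placeholder quote 2"])
print(prompt)
```

Whether this wording actually raises the faithfulness score is an open question; lowering the generation temperature is another knob that may be worth testing alongside it.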

In [1]:
!pip install -U ragas datasets langchain-community langchain-core sentence-transformers groq  langchain-groq faiss-cpu


Collecting ragas
  Downloading ragas-0.4.3-py3-none-any.whl.metadata (23 kB)
Collecting datasets
  Downloading datasets-4.4.2-py3-none-any.whl.metadata (19 kB)
Collecting langchain-community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-core
  Downloading langchain_core-1.2.7-py3-none-any.whl.metadata (3.7 kB)
Collecting groq
  Downloading groq-1.0.0-py3-none-any.whl.metadata (16 kB)
Collecting langchain-groq
  Downloading langchain_groq-1.1.1-py3-none-any.whl.metadata (2.4 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Collecting appdirs (from ragas)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting diskcache>=5.6.3 (from ragas)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting instructor (from ragas)
  Downloading instructor-1.14.3-py3-none-any.whl.metadata (12 kB)
Collecting scikit-network (from r

In [3]:
from groq import Groq
import faiss
import pandas as pd
from sentence_transformers import SentenceTransformer
import json

client_key_config = Groq(api_key="Enter Groq API key")  # placeholder: supply your own key
fine_tune_embedding = SentenceTransformer('./fine_tuned_qoute-retriever')
fine_tune_vector_index = faiss.read_index("quotes_vector_db.faiss")
fine_tune_metadata = pd.read_pickle("quotes_metadata.pkl")

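def query_response(query):
    """Retrieve the top-3 quotes for `query` and ask the LLM to answer from them."""
    # Embed the query with the same model used to build the index; FAISS expects float32.
    query_vector = fine_tune_embedding.encode([query]).astype('float32')
    # indices has shape (1, k): row 0 holds the positions of the nearest quotes.
    distances, indices = fine_tune_vector_index.search(query_vector, k=3)

    # Look up the original quote text for each hit and format it as a bullet list.
    retrieved_quotes = [fine_tune_metadata.iloc[idx]['quote_clean'] for idx in indices[0]]
    context_text = "\n".join([f"- {q}" for q in retrieved_quotes])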

    prompt = f"""
    You are a wise philosopher. Use the following quotes to answer the user's question.
    If the quotes aren't enough, use them as inspiration for your answer.

    Retrieved Quotes:
    {context_text}

    User Question: {query}

    Respond in JSON format with your answer.
    """

    response = client_key_config.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": "You are a wise philosopher who responds in JSON format."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        response_format={"type": "json_object"}
    )

    return response.choices[0].message.content, retrieved_quotes
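
Because `query_response` returns the raw JSON string, a caller that wants plain text has to parse it first. The prompt does not fix a schema, so the key name `answer` below is an assumption, and `extract_answer_text` is a hypothetical helper rather than part of the notebook.

```python
import json

def extract_answer_text(raw_response, key="answer"):
    """Return the answer text from a JSON reply, falling back to the raw string."""
    try:
        payload = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return raw_response  # not valid JSON: keep whatever the model produced
    if isinstance(payload, dict):
        value = payload.get(key)
        if isinstance(value, str):
            return value
        # Unknown schema: join any string values so no text is silently dropped.
        texts = [v for v in payload.values() if isinstance(v, str)]
        if texts:
            return " ".join(texts)
    return raw_response

# Made-up replies that only illustrate the three code paths.
print(extract_answer_text('{"answer": "example answer text"}'))
print(extract_answer_text('{"response": "example answer text"}'))
print(extract_answer_text('not json'))
```

Passing the extracted text, rather than the JSON wrapper, to the evaluation might also give the judge LLM cleaner statements to check, though that effect is untested here.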

In [4]:

import pandas as pd
from datasets import Dataset
from ragas import evaluate
from langchain_community.embeddings import HuggingFaceEmbeddings

from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall
)

from langchain_groq import ChatGroq


llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0,
    groq_api_key="enter your api key"
)


embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)


evaluation_queries = [
    "Quotes about hope",
    "Oscar Wilde quotes about life",
    "Motivational quotes on success",
    "Quotes related to wisdom and philosophy",
    "Humorous quotes about human nature"
]

records = []

for query in evaluation_queries:
    answer, retrieved_quotes = query_response(query)

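    records.append({
        "question": query,
        "answer": answer,
        "contexts": retrieved_quotes,
        # No hand-written references exist, so the generated answer doubles as the
        # ground truth. Reference-based metrics (context precision and recall) are
        # therefore scored against the system's own output.
        "ground_truth": answer
    })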

dataset = Dataset.from_pandas(pd.DataFrame(records))


results = evaluate(
    dataset,
    metrics=[
        ContextPrecision(),
        ContextRecall(),
        Faithfulness(),
        AnswerRelevancy()
    ],
    llm=llm,
    embeddings=embeddings
)

print("\nRAG EVALUATION RESULTS\n")
print(results)


ERROR:ragas.executor:Exception raised in Job[19]: BadRequestError(Error code: 400 - {'error': {'message': "'n' : number must be at most 1", 'type': 'invalid_request_error'}})
ERROR:ragas.executor:Exception raised in Job[11]: BadRequestError(Error code: 400 - {'error': {'message': "'n' : number must be at most 1", 'type': 'invalid_request_error'}})
ERROR:ragas.executor:Exception raised in Job[3]: BadRequestError(Error code: 400 - {'error': {'message': "'n' : number must be at most 1", 'type': 'invalid_request_error'}})



===== RAG EVALUATION RESULTS (GROQ) =====

{'context_precision': 1.0000, 'context_recall': 0.3790, 'faithfulness': 0.4556, 'answer_relevancy': 0.7978}
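
The three `BadRequestError` lines in this output mean that some metric computations never produced a score. Ragas appears to record a failed job as NaN and average the remaining values, so an aggregate can hide how few samples it rests on. The sketch below shows that bookkeeping with hypothetical per-query scores (not the values from this run); `summarize` is an illustrative helper, not a Ragas function. The error text points to a request asking for more than one completion per call, which Groq rejects. In Ragas versions where `AnswerRelevancy` exposes a `strictness` setting, lowering it to 1 is a workaround worth trying, though that is an assumption to verify against the installed version.

```python
import math

def summarize(scores):
    """Mean over the scores that were actually computed, plus how many were missing."""
    valid = [s for s in scores if not math.isnan(s)]
    mean = sum(valid) / len(valid) if valid else float("nan")
    return {"mean": mean, "scored": len(valid), "failed": len(scores) - len(valid)}

# Hypothetical per-query scores for five queries; NaN marks a failed job.
per_query = [float("nan"), 0.82, float("nan"), 0.78, float("nan")]
summary = summarize(per_query)
print(summary)
```

Reporting the `scored` and `failed` counts next to each mean makes it clear when a headline number is based on only part of the test set.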
