# Retrieval-Augmented Generation (RAG) using Ollama and FAISS

This project is a Python-based Retrieval-Augmented Generation (RAG) system that uses a local Ollama model to answer questions based on a provided PDF document. The system works by first extracting text from the PDF and splitting it into manageable chunks. These text chunks are then converted into numerical embeddings using a Sentence Transformer model. The embeddings are stored in a FAISS index, which enables fast and efficient similarity searches.

When a user submits a query, the system retrieves the most relevant text chunks from the FAISS index and passes them to the Ollama large language model together with the original question. The LLM uses this retrieved context to formulate a concise answer grounded in the document's content. Because the prompt restricts the model to information found in the PDF, hallucinations become less likely, although the model can still misread or over-generalize from the retrieved passages.

### Importing Libraries

In [8]:
from textwrap import wrap  # fixed-width text chunking

import fitz  # PyMuPDF, used to read the PDF
import numpy as np
from sentence_transformers import SentenceTransformer  # embedding model

### Defining the PDF Path

This cell sets the path of the PDF document that serves as the knowledge base.

In [6]:
# the pdf (math chapter)
file = "data/data.pdf"

### Extracting Text from the PDF

This cell opens the PDF with the `fitz` library (PyMuPDF), extracts the text of every page, and prints the total number of characters extracted.

In [9]:
pdf_path = file

# Open the PDF, concatenate the plain text of every page, and close the file afterwards
with fitz.open(pdf_path) as doc:
    text = "".join(page.get_text() for page in doc)

print(f"Extracted {len(text)} characters from PDF.")

Extracted 69078 characters from PDF.


### Splitting the Text into Chunks

This cell splits the extracted text into chunks of at most 500 characters using `textwrap.wrap`, which breaks on whitespace so that words stay intact. Chunking keeps each passage small enough to embed and retrieve on its own.

In [10]:
chunk_size = 500  # maximum number of characters per chunk
chunks = wrap(text, chunk_size)  # breaks on whitespace; newlines become spaces
print(f"Document split into {len(chunks)} chunks.")

Document split into 139 chunks.
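
`textwrap.wrap` was designed for wrapping lines of prose, and two of its default behaviours matter for chunking: no chunk exceeds the requested width, and newlines are replaced by spaces, so paragraph boundaries are lost. The sketch below shows both on a tiny string. It also includes a hypothetical `chunk_with_overlap` helper (not part of the original notebook) that repeats a few words between neighbouring chunks, one common way to keep a sentence that straddles a boundary retrievable; the word counts used are arbitrary illustrative values.

```python
from textwrap import wrap

sample = "FAISS stores vectors.\nOllama runs the model.\n\nChunks link the two."

# wrap() breaks on whitespace and keeps every piece within the width.
pieces = wrap(sample, 30)
for piece in pieces:
    print(repr(piece))


def chunk_with_overlap(text, max_words=8, overlap_words=2):
    """Hypothetical helper: word-based chunks sharing `overlap_words` words."""
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap_words  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


overlapping = chunk_with_overlap(sample, max_words=6, overlap_words=2)
for chunk in overlapping:
    print(repr(chunk))
```

Whether overlap helps depends on the document; it increases the number of chunks and therefore the size of the index.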


### Generating Embeddings

This cell loads the pre-trained `all-MiniLM-L6-v2` SentenceTransformer model, encodes every chunk into a 384-dimensional vector (embedding), and stores the results in a `float32` NumPy array, the dtype FAISS expects.

In [11]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() accepts a list of strings and batches them internally
embeddings = embedding_model.encode(chunks)
embeddings = np.asarray(embeddings, dtype="float32")  # FAISS expects float32

print(f"Generated embeddings with shape: {embeddings.shape}")

Generated embeddings with shape: (139, 384)


### Building the FAISS Index

This cell imports `faiss`, a library for efficient similarity search, creates a flat L2 index (`IndexFlatL2`), and adds all the embeddings to it so they can be searched.

In [13]:
import faiss

embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings)

print("FAISS index created with", index.ntotal, "chunks.")

FAISS index created with 139 chunks.
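
`IndexFlatL2` performs an exact, brute-force search: the query is compared against every stored vector and the results are ranked by squared Euclidean distance. A minimal pure-Python sketch of that computation is shown below, assuming toy 3-dimensional vectors and a hypothetical function name `flat_l2_search`; it is meant to illustrate the idea, not to mirror FAISS internals.

```python
def flat_l2_search(vectors, query, top_k=3):
    """Return (squared_distances, indices) of the top_k nearest vectors."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared L2 distance: sum of squared coordinate differences.
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    best = scored[:top_k]
    return [d for d, _ in best], [i for _, i in best]


# Toy data standing in for the real 384-dimensional embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 0.0, 0.0], top_k=2)
print(toy_indices, toy_distances)
```

Because every vector is visited, the cost grows linearly with the number of chunks, which is unproblematic for the 139 chunks indexed here.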


### Defining the Search Function

`search` takes a user query, encodes it into an embedding, and asks the FAISS index for the `top_k` nearest chunks. It returns each chunk together with its distance to the query; `IndexFlatL2` reports squared L2 distances, so smaller values mean closer matches.

In [14]:
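def search(query, top_k=3):
    """Return up to top_k (chunk, squared L2 distance) pairs for a query."""
    # FAISS expects a 2-D float32 array of shape (n_queries, dim)
    query_emb = np.array([embedding_model.encode(query)], dtype="float32")
    distances, indices = index.search(query_emb, top_k)
    # FAISS pads with index -1 when fewer than top_k vectors exist; skip those
    return [(chunks[i], dist) for dist, i in zip(distances[0], indices[0]) if i != -1]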
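
The distances returned by `search` come from `IndexFlatL2`, which reports squared L2 distances, so raw values are hard to interpret on their own. If the embeddings are unit-normalized (an assumption worth verifying for the model in use, for example by checking `np.linalg.norm` on a few rows), a squared distance `d2` corresponds to a cosine similarity of `1 - d2 / 2`, because `|a - b|^2 = |a|^2 + |b|^2 - 2 a.b`. The sketch below checks that identity on two toy unit vectors and uses it in a hypothetical filter; both helper names and the `0.3` threshold are illustrative, not part of the original notebook.

```python
import math


def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - squared_distance / 2.0


def filter_by_similarity(results, min_cosine=0.3):
    """Keep (chunk, squared_distance) pairs that are similar enough.

    The default threshold is an arbitrary illustrative value.
    """
    return [
        (chunk, dist)
        for chunk, dist in results
        if squared_l2_to_cosine(dist) >= min_cosine
    ]


# Two unit vectors 60 degrees apart.
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
d2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
print(d2, cosine, squared_l2_to_cosine(d2))

# Made-up results in the same (chunk, distance) format that search() returns.
kept = filter_by_similarity([("near chunk", 0.4), ("far chunk", 1.8)])
print(kept)
```

Dropping weak matches before building the prompt is one way to make the "I don't know" path more likely when the document does not cover the question.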

### Answering Questions with Ollama

This cell imports `Client` from the `ollama` library, which talks to a local Ollama server running a large language model. The `answer_question` function ties the RAG steps together: it retrieves relevant context with `search`, builds a prompt from that context, and sends the prompt to the `llama3` model to generate the final answer.

In [None]:
from ollama import Client

client = Client()

def answer_question(query):
    # Retrieve context
    relevant_chunks = search(query, top_k=3)
    context = "\n".join([chunk for chunk, _ in relevant_chunks])

    # The system message sets the ground rules; the user message carries context and question.
    # The prompt text stays flush left so no indentation leaks into the string.
    messages = [
        {
            "role": "system",
            "content": "You are an AI assistant that should only rely on the supplied context when answering.",
        },
        {
            "role": "user",
            "content": f"""Rely on the context below to respond to the question.
Guidelines:
1. If the answer is not in the context, reply with "I don't know".
2. Keep your response short: no more than five sentences.
3. Use strictly the provided context.

Context:
{context}

Question: {query}

Answer:""",
        },
    ]


    response = client.chat(
        model="llama3",  # replace with your local model
        messages=messages
    )

    # The reply text is the content of the assistant message in the chat response
    return response["message"]["content"]
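
Because the prompt is assembled inside `answer_question`, it can only be inspected by calling the model. One possible refactoring, sketched below with a hypothetical `build_messages` helper that is not part of the original notebook, separates prompt construction from the Ollama call so the exact text sent to the model can be printed or tested without a running server.

```python
def build_messages(context_chunks, query):
    """Hypothetical helper: build the chat messages for the RAG prompt."""
    context = "\n".join(context_chunks)
    user_prompt = (
        "Rely on the context below to respond to the question.\n"
        "Guidelines:\n"
        '1. If the answer is not in the context, reply with "I don\'t know".\n'
        "2. Keep your response short: no more than five sentences.\n"
        "3. Use strictly the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
    return [
        {
            "role": "system",
            "content": "You are an AI assistant that should only rely on "
                       "the supplied context when answering.",
        },
        {"role": "user", "content": user_prompt},
    ]


# Made-up chunks and question, used only to show the resulting prompt.
demo_messages = build_messages(["Chunk one.", "Chunk two."], "What is in chunk one?")
print(demo_messages[1]["content"])
```

The returned list has the same shape as the `messages` argument passed to `client.chat` above, so it could be dropped into `answer_question` unchanged.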


### Example Usage

The cell below calls `answer_question` with the query "What is a Pre-trained neural language model?" and prints the answer the LLM generates from the retrieved context.

In [42]:
# Example
print(answer_question("What is a Pre-trained neural language model?"))

According to the provided context, a pre-trained neural language model refers to a single, left-to-right or encoder-decoder model that can achieve strong performance across both discriminative and generative tasks. This type of model is trained on a large corpus of text data before being fine-tuned for specific tasks.
