## Local RAG using Llama 3.2 1B and 3B parameter models
References:
- [Langchain RAG tutorial](https://python.langchain.com/docs/tutorials/rag/)
- [Langchain document loaders][def2]
- [Ollama library][def]
- [Tutorial adaptive RAG from Langchain][def3]

[def]: https://ollama.com/library
[def2]: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/
[def3]: https://langchain-ai.github.io/langgraph/tutorials/rag/langgraph_adaptive_rag_local/

In [None]:
#install necessary packages
%pip install ipykernel -U --user --force-reinstall
%pip install --quiet --upgrade langchain langchain-community
%pip install pypdf
%pip install --upgrade --quiet  sentence_transformers
%pip install faiss-cpu
%pip install --quiet -U langchain-ollama scikit-learn

FAISS (Facebook AI Similarity Search) is a library for fast similarity search over dense vectors, such as the embeddings of documents. Traditional search engines are optimized for exact, hash-based lookups. FAISS instead finds the stored vectors that are nearest to a query vector, and it scales to large collections.
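
To make the idea of nearest-vector search concrete, here is a minimal pure-Python sketch of a brute-force top-k lookup by squared Euclidean distance. It is an illustration only, not how FAISS is implemented internally; the helper `top_k` and the toy vectors are hypothetical and are not part of the notebook.

```python
def top_k(query, vectors, k=2):
    """Return the indices of the k stored vectors closest to the query."""
    def sq_dist(a, b):
        # Squared Euclidean distance; the square root is not needed for ranking.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    ranked = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real embedding models use far more dimensions.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```

A brute-force scan like this compares the query against every stored vector, which is fine for a single PDF but is the cost that dedicated similarity-search libraries are designed to reduce.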

## Part 1 - Data ingestion workflow
In this part, we prepare the data in four steps:
- First, we load the documents using the PyPDF loader from the LangChain `document_loaders` module.
- Second, we chunk the documents.
- Third, once the documents are chunked, we use the embeddings model to generate a vector embedding for each chunk.
- Fourth, we store the embeddings in an in-memory vector store/index so that they can be searched.

In [None]:
# Load PDF documents 
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/2401.07883v1.pdf")
pages = loader.load_and_split()
print(f"Length of document: {len(pages)} \n Review first page: {pages[0]}")

In [None]:
# Chunk documents
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of up to 1000 characters, with 200 characters of overlap between consecutive chunks.
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(pages)
print("Sample chunks: ", chunks[:1])
print(len(chunks))
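
The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-window splitter. This is only a sketch under the assumption of plain character windows; `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraphs and sentences, so its output will differ. The function `sliding_chunks` is a hypothetical helper, not a LangChain API.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(toy_chunks)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.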

In [None]:
# Initialize embeddings model
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)
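
Setting `normalize_embeddings` to `True` scales every embedding to unit length. One consequence, sketched below with toy vectors, is that the squared Euclidean distance between two unit vectors equals `2 - 2 * cosine similarity`, so ranking by distance and ranking by cosine similarity give the same order. The helper names here are hypothetical, and whether the vector store ranks by Euclidean distance is an assumption to check against its documentation.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

cosine = dot(u, v)  # for unit vectors the dot product is the cosine similarity
identity_gap = abs(sq_l2(u, v) - (2 - 2 * cosine))
print(cosine, identity_gap)
```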

In [None]:
#initialize in-memory vector store
from langchain_community.vectorstores import FAISS

# Step 1: Create in-memory vector index using HuggingFace embeddings
faiss_index = FAISS.from_documents(chunks, hf)

# Step 2: Perform similarity search to retrieve the top 5 chunks which are similar to the query
# Step 2a: faiss_index.similarity_search will first embed the query using the embedding model
# Step 2b: Search the vector store to retrieve the top 5 matching chunks
docs = faiss_index.similarity_search("Full form of RAG", k=5)

# Print the page number and first 300 characters of each matching chunk.
for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])

## Part 2 - Retrieval and text generation workflow

In this section, the following actions will be performed:
- First, retrieve the top 5 chunks from the vector index that match the user query.
- Second, create a RAG prompt which will be passed as input to the model.
- Third, initialize the text generation model.
- Fourth, augment the prompt with the retrieved context and pass it to the model to get the answer.
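
The flow of these steps can be sketched without any model or vector index. In the sketch below, `ToyDoc`, `keyword_retrieve`, and `build_prompt` are hypothetical stand-ins: keyword overlap replaces vector search, and the model call is left out entirely, so only the retrieve-then-augment part of the workflow is shown.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def keyword_retrieve(query, corpus, k=2):
    """Rank documents by shared words with the query (crude stand-in for vector search)."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: -len(words & set(d.page_content.lower().split())),
    )
    return ranked[:k]


def build_prompt(template, docs, question):
    """Join the retrieved text into one context block and fill the template."""
    context = "\n\n".join(d.page_content for d in docs)
    return template.format(context=context, question=question)


toy_corpus = [
    ToyDoc("retrieval augmented generation combines search with a language model", {"page": 0}),
    ToyDoc("chunking splits long documents into smaller pieces", {"page": 1}),
    ToyDoc("bananas are yellow", {"page": 2}),
]
toy_template = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
toy_question = "what is retrieval augmented generation"

toy_prompt = build_prompt(
    toy_template, keyword_retrieve(toy_question, toy_corpus, k=1), toy_question
)
print(toy_prompt)
```

The resulting string is what would be sent to the text generation model as a single message.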

In [None]:
# Retrieval: wrap the index as a retriever that returns the 5 chunks closest in meaning to the query.
retriever = faiss_index.as_retriever(search_type="similarity", search_kwargs={"k": 5})
context = retriever.invoke("What is RAG?")
print("length of search results: ", len(context))
print("Search results: page content", context[0].page_content)
print("Search results: metadata", context[0].metadata)

## Download the Llama 3.2 models using Ollama
Llama 3.2 is available on Ollama. The 1B and 3B models are lightweight, text-only models that run quickly on a local machine. Pull the two models used in this notebook:

`ollama pull llama3.2:1b`

`ollama pull llama3.2:3b-instruct-fp16`

This downloads the models to your local computer. The `fp16` tag stores each weight as a 16-bit floating point number (2 bytes), which takes half the memory of 32-bit floating point (fp32, 4 bytes per weight).
At roughly 3 billion parameters, the fp16 model is approximately 6.4 GB, which fits on most local machines.
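
The size estimate follows from simple arithmetic: number of parameters times bytes per parameter. The sketch below assumes a parameter count of about 3.2 billion for the 3B model (an assumption; check the model card) and counts only the raw weights, ignoring file-format overhead and runtime memory such as the context cache.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_gb(num_params, dtype):
    """Approximate size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumed parameter count for the 3B model; not taken from the notebook.
ASSUMED_PARAMS_3B = 3.2e9

size_fp16 = weight_size_gb(ASSUMED_PARAMS_3B, "fp16")
size_fp32 = weight_size_gb(ASSUMED_PARAMS_3B, "fp32")
print(f"fp16: {size_fp16:.1f} GB, fp32: {size_fp32:.1f} GB")
```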



In [None]:
### LLM
from langchain_ollama import ChatOllama

# model name
local_llm_eval = "llama3.2:3b-instruct-fp16"

local_llm = "llama3.2:1b"

# will be used to generate responses from the model
llm = ChatOllama(model=local_llm, temperature=0)

# will be used later while detecting hallucination in the generated response
llm_json_mode = ChatOllama(model=local_llm_eval, temperature=0, format="json")

In [None]:
### Generate
from langchain_core.messages import HumanMessage

# Prompt
rag_prompt = """You are an assistant for question-answering tasks. 

Here is the context to use to answer the question:

{context} 

Think carefully about the above context. 

Now, review the user question:

{question}

Provide an answer to this question using only the above context. 

Use five sentences maximum and keep the answer concise.
If you don't know the answer simply say "Sorry, answer not found in the context provided".
DO NOT use your existing knowledge, only provide answer based on the context provided. 

Answer:"""


# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


In [None]:
# Test
import pprint

question = "Full form of RAG?"
docs = retriever.invoke(question)
docs_txt = format_docs(docs)
rag_prompt_formatted = rag_prompt.format(context=docs_txt, question=question)
generation = llm.invoke([HumanMessage(content=rag_prompt_formatted)])
pprint.pp(generation.content)

In [None]:
# Returning sources or source attribution
for i, doc in enumerate(docs):
    print("chunk number: ", i+1)
    print(doc.page_content)
    print(doc.metadata)
    print()
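
Printing every chunk is useful for debugging, but an answer usually needs only a short list of sources. A possible sketch is shown below: `format_sources` is a hypothetical helper that de-duplicates the `source` and `page` entries found in chunk metadata. It assumes page numbers are stored 0-indexed, which should be verified against the loader in use.

```python
def format_sources(metadatas):
    """Build an ordered, de-duplicated list of 'source, p. N' strings."""
    seen = []
    for meta in metadatas:
        source = meta.get("source", "unknown source")
        page = meta.get("page")
        # Assumption: pages are 0-indexed in the metadata, so add 1 for display.
        label = f"{source}, p. {page + 1}" if page is not None else source
        if label not in seen:
            seen.append(label)
    return seen


# Toy metadata with a made-up file name, mimicking what retrieved chunks carry.
toy_metadata = [
    {"source": "data/example.pdf", "page": 0},
    {"source": "data/example.pdf", "page": 3},
    {"source": "data/example.pdf", "page": 0},
]
citations = format_sources(toy_metadata)
print(citations)
```

With the retrieved chunks from the cell above, the call would look like `format_sources(doc.metadata for doc in docs)`.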

In [None]:
# Test 2
question = "What are the different types of chunking strategies?"
docs = retriever.invoke(question)
docs_txt = format_docs(docs)
rag_prompt_formatted = rag_prompt.format(context=docs_txt, question=question)
generation = llm.invoke([HumanMessage(content=rag_prompt_formatted)])
pprint.pp(generation.content)

In [None]:
for i, doc in enumerate(docs):
    print("chunk number: ", i+1)
    print(doc.page_content)
    print(doc.metadata)
    print()

In [None]:
### Hallucination Grader
from langchain_core.messages import HumanMessage, SystemMessage
import json
def halluncination_evaluator(generation, docs_txt):
    # Hallucination grader instructions
    hallucination_grader_instructions = """

    You are a teacher grading a quiz. 

    You will be given FACTS and a STUDENT ANSWER. 

    Here is the grade criteria to follow:

    (1) Ensure the STUDENT ANSWER is grounded in the FACTS. 

    (2) Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

    Score:

    A score of yes means that the student's answer meets all of the criteria. This is the highest (best) score. 

    A score of no means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.

    Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

    Avoid simply stating the correct answer at the outset."""

    # Grader prompt
    hallucination_grader_prompt = """FACTS: \n\n {documents} \n\n STUDENT ANSWER: {generation}. 

    Return JSON with two keys: binary_score, a 'yes' or 'no' score that indicates whether the STUDENT ANSWER is grounded in the FACTS, and explanation, which contains an explanation of the score."""

    # Fill the grader prompt with the retrieved context and the generated answer
    hallucination_grader_prompt_formatted = hallucination_grader_prompt.format(
        documents=docs_txt, generation=generation.content
    )
    result = llm_json_mode.invoke(
        [SystemMessage(content=hallucination_grader_instructions)]
        + [HumanMessage(content=hallucination_grader_prompt_formatted)]
    )
    return json.loads(result.content)
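
A small model does not always return well-formed JSON or the exact keys requested, so `json.loads` can raise or return an unexpected shape. One way to guard against this is sketched below. `parse_grade` is a hypothetical helper, and treating malformed output as "not grounded" is a design assumption rather than part of the original grader.

```python
import json


def parse_grade(raw):
    """Return (is_grounded, explanation); malformed output counts as not grounded."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False, "grader output was not valid JSON"
    if not isinstance(data, dict):
        return False, "grader output was not a JSON object"
    # Normalize the score so that "Yes", " yes " and "yes" are treated alike.
    score = str(data.get("binary_score", "")).strip().lower()
    explanation = str(data.get("explanation", ""))
    return score == "yes", explanation


good = parse_grade('{"binary_score": "Yes", "explanation": "supported by the facts"}')
bad = parse_grade("not json at all")
print(good, bad)
```

Failing closed keeps an unparseable grader response from being mistaken for a passing grade.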

In [None]:
# Test 3
question = "What is RAG?"
docs = retriever.invoke(question)
docs_txt = format_docs(docs)
rag_prompt_formatted = rag_prompt.format(context=docs_txt, question=question)
generation = llm.invoke([HumanMessage(content=rag_prompt_formatted)])

print('------------------------------------------------')
print('------------- Generated answer -----------------')
print('------------------------------------------------')
pprint.pp(generation.content)


In [None]:
print('------------------------------------------------')
print('---------- Hallucination evaluator -------------')
print('------------------------------------------------')
pprint.pp(halluncination_evaluator(generation, docs_txt))