# Talk to your data with RAG and Llama 3.2

In this notebook, you will learn how to use retrieval-augmented generation (RAG) and Llama 3.2 to talk to your data. Llama 3.2 is chosen because it is smaller and faster than the original Llama, which lets us run the code locally. RAG retrieves relevant passages from your documents and passes them to the model as context, which helps it generate text that is factually accurate and coherent.

In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.llms import Ollama

In [8]:
# Load the paper
loader = PyMuPDFLoader("~/Downloads/2410.05258v1.pdf")
documents = loader.load()

# Split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# Create embeddings
embeddings = HuggingFaceEmbeddings()

# Create a vector store
db = Chroma.from_documents(texts, embeddings)


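To build some intuition for `chunk_size` and `chunk_overlap`, here is a simplified, pure-Python sketch of overlapping chunks. It is not LangChain's algorithm: `CharacterTextSplitter` splits on a separator first and then merges the pieces, so its chunks vary in length. The function name `sliding_chunks` and the sample text are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The last four characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.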

In [9]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

retriever = db.as_retriever()

llm = Ollama(model="llama3.2")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to augment your own knowledge. The context may not have all the information needed to answer the question, so use your own knowledge to provide a complete answer."),
    ("human", "Context: {context}"),
    ("human", "Question: {input}"),
    ("human", "Please provide a detailed answer, combining information from the context (if relevant) and your own knowledge.")
])

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

llm_default = Ollama(model="llama3.2")

In [10]:
def ask_question(chain, question):
    result = chain.invoke({"input": question})
    print("Question:", question)
    print("\n ** WITH CONTEXT **\n")
    print("Answer:", result['answer'])
    print("\nSources:")
    for doc in result['context']:
        print(doc.metadata)
    print("\n")

    default_result = llm_default.invoke(question)
    print("\n** WITHOUT CONTEXT **\n")
    print(default_result)

question = "How does the differential transformer differ from a traditional transformer?"
ask_question(rag_chain, question)

Question: How does the differential transformer differ from a traditional transformer?

 ** WITH CONTEXT **

Answer: Here's a detailed explanation of how the Differential Transformer differs from a traditional Transformer:

**Traditional Transformer**

A traditional Transformer is a neural network architecture introduced in 2017 by Vaswani et al. [1] for natural language processing tasks, such as machine translation and text generation. The core idea behind the Transformer is to use self-attention mechanisms instead of convolutional layers to process sequential data.

In a traditional Transformer, the input sequence is divided into overlapping windows of fixed size, which are then processed using self-attention mechanisms. This allows the model to attend to all positions in the sequence simultaneously and weigh their importance relative to each other.

**Differential Transformer**

The Differential Transformer is an extension of the original Transformer architecture that incorporates a

Run the query a few times and analyze the results. With or without context, the quality of the generated text is extremely inconsistent. If the purpose of this agent is to act as a study guide, there are certain pieces of information we would always like it to provide in the response:

- A general definition or answer to the question
- A specific example
- A related fact or piece of information

We might be able to enforce this structure by spelling it out in the system prompt. Let's try it.

In [11]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to augment your own knowledge. The context may not have all the information needed to answer the question, so use your own knowledge to provide a complete answer. Your answer should always include the following: 1. A general definition or answer to the question. 2. A specific example. This includes code snippets either from the context provided or your own knowledge. 3. A related fact or piece of information."),
    ("human", "Context: {context}"),
    ("human", "Question: {input}"),
    ("human", "Please provide a detailed answer, combining information from the context (if relevant) and your own knowledge.")
])

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [12]:
question = "How does the differential transformer differ from a traditional transformer?"
ask_question(rag_chain, question)

Question: How does the differential transformer differ from a traditional transformer?

 ** WITH CONTEXT **

Answer: I'll do my best to provide a detailed answer.

The Differential Transformer is a novel architecture for natural language processing tasks that differs significantly from traditional Transformers in several key ways. Here's a breakdown of the main differences:

1. **Differential Attention Mechanism**: The most distinctive feature of Differential Transformer is its differential attention mechanism, which modifies the standard self-attention mechanism used in traditional Transformers. In traditional Transformers, all tokens are equally important and interact with each other through self-attention. In contrast, the Differential Transformer introduces a new attention mechanism that takes into account the relative importance of different tokens. This is achieved by learning a set of differential weights (W G, W1) for each token, which are used to compute the attention weights.

Based on the above testing, our RAG agent is beginning to produce more consistent results. One thing that has been consistent is that it always retrieves the correct sources. Another way of enhancing the responses lies in *query translation*. The first approach we will try is called **multi-query**: the model generates several variations of the user's initial question, and the documents retrieved for each variation are combined.

In [13]:
# Multi Query
template = """Your task is to generate five different versions of the given user question to retrieve relevant documents from a vector database.
Using the different perspectives from the retrieved documents, you should generate a response to the user question. Original question: {question}"""
prompt_multi_query = ChatPromptTemplate.from_template(template)

from langchain_core.output_parsers import StrOutputParser
from langchain.load import dumps, loads

generate_queries = (
    prompt_multi_query
    | llm
    | StrOutputParser()
    # Split into one query per line and drop blank lines
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)

def get_unique_union(documents: list[list]):
    """ Unique union of retrieved docs """
    # Flatten list of lists, and convert each Document to string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    # Get unique documents
    unique_docs = list(set(flattened_docs))
    # Return
    return [loads(doc) for doc in unique_docs]

# Retrieve
retrieval_chain = generate_queries | retriever.map() | get_unique_union

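One detail worth knowing about `get_unique_union`: it de-duplicates with `set`, which does not guarantee any particular order, so the combined documents may come back in a different order on each run. If order matters, a dictionary can de-duplicate while keeping first-seen order. The sketch below uses plain strings as stand-ins for the serialized documents; `unique_union_ordered` and the sample data are hypothetical.

```python
def unique_union_ordered(result_lists: list[list[str]]) -> list[str]:
    """De-duplicate items across several result lists, keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    seen = dict.fromkeys(doc for results in result_lists for doc in results)
    return list(seen)


# Stand-ins for what the retriever might return for three generated queries
results = [
    ["chunk A", "chunk B"],
    ["chunk B", "chunk C"],
    ["chunk A", "chunk D"],
]
print(unique_union_ordered(results))
```

The same idea would apply to the strings produced by `dumps(doc)` in the cell above.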
In [14]:
from operator import itemgetter

# RAG
template = """You are a helpful AI oracle used to assist students in studying.
Use the following pieces of retrieved context along with your own knowledge to provide thorough answers to the user's questions.
The context may not have all the information needed to answer the question, so use your own knowledge to provide a complete answer.
Your answer should always include the following:
# Summary
A general definition or answer to the question.

# Example
A specific example from the context, including code examples. If no examples are provided in the context, you can use your own knowledge to provide an example.

# Related Information
A related fact or piece of information.

{context}

Question: {question}

Please provide a detailed answer, combining information from the context (if relevant) and your own knowledge.
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    {"context": retrieval_chain, 
     "question": itemgetter("question")} 
    | prompt
    | llm
    | StrOutputParser()
)

result = final_rag_chain.invoke({"question": question})
print(result)




The Differential Transformer (DIFF) is a proposed architecture that differs from the traditional Transformer in several key ways. Here's a detailed comparison:

**1. Multi-needle retrieval protocol:** DIFF uses a multi-needle evaluation protocol, where multiple needles are inserted into varying depths within contexts of different lengths. Each needle consists of a concise sentence that assigns a unique magic number to a specific city. This is distinct from the traditional Transformer, which typically uses a single query and key.

**2. Distraction noise:** DIFF introduces distraction noise by placing other distracting needles randomly, while maintaining a constant depth and length for the answer needle. This simulates real-world scenarios where relevant information may be surrounded by irrelevant noise. In contrast, traditional Transformers do not use distraction noise in their evaluation protocols.

**3. Multi-needle placement:** The DIFF architecture evaluates 50 samples for each comb

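Because the prompt asks for three titled sections, one rough way to measure consistency across repeated runs is to check which section titles actually appear in each answer. This is a minimal sketch, not part of the original notebook: `missing_sections` is a hypothetical helper, and matching titles on their own line is an assumption about how the model formats its headings.

```python
REQUIRED_SECTIONS = ("Summary", "Example", "Related Information")


def missing_sections(answer: str) -> list[str]:
    """Return the required section titles that never appear on a line of their own."""
    # Strip Markdown heading/bold markers so "# Summary" and "**Summary:**" both count
    titles = {line.strip().strip("#*: ").lower() for line in answer.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in titles]


sample_answer = """# Summary
Differential attention subtracts two softmax attention maps.

**Example**
A short code snippet would go here.
"""
print(missing_sections(sample_answer))  # ['Related Information']
```

Calling `missing_sections(result)` on the string returned by `final_rag_chain.invoke(...)` over several runs would show how often the model follows the requested structure.
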
# RAG-Fusion

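Generating multiple queries did not seem to enhance the quality of the model's responses in this case. Next we try RAG-Fusion, which also generates multiple queries but re-ranks the retrieved documents with reciprocal rank fusion (RRF) before passing them to the model.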

In [15]:
# RAG-Fusion: Related
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)

In [16]:
generate_queries = (
    prompt_rag_fusion 
    | llm
    | StrOutputParser() 
    # Split into one query per line and drop blank lines
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)

In [17]:
def reciprocal_rank_fusion(results: list[list], k=60):
    """ Reciprocal_rank_fusion that takes multiple lists of ranked documents 
        and an optional parameter k used in the RRF formula """
    
    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Add this list's contribution using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    # Return the reranked results as a list of tuples, each containing the document and its fused score
    return reranked_results

retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
docs = retrieval_chain_rag_fusion.invoke({"question": question})
len(docs)

12

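To see what the fusion step does, here is a small worked example of the same scoring rule on two toy rankings. It uses plain string IDs instead of serialized documents; `rrf_scores` and the rankings are made up for illustration.

```python
def rrf_scores(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (rank + k) over every list a document appears in (rank is 0-based)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)
    return scores


rankings = [
    ["a", "b", "c"],  # results for the first generated query
    ["b", "c", "d"],  # results for the second generated query
]
scores = rrf_scores(rankings)
ordered = sorted(scores, key=scores.get, reverse=True)
for doc_id in ordered:
    print(doc_id, round(scores[doc_id], 6))
```

Document `b` scores 1/61 + 1/60 and `c` scores 1/62 + 1/61, so both outrank `a`, which is first in one list (1/60) but missing from the other. Documents that several queries agree on rise to the top. Because `enumerate` starts at 0, the top document contributes 1/k here; counting ranks from 1 would give 1/(k + 1) and slightly different scores.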
In [18]:
# RAG
template = """You are a helpful AI oracle used to assist students in studying.
Use the following pieces of retrieved context along with your own knowledge to provide thorough answers to the user's questions.
The context may not have all the information needed to answer the question, so use your own knowledge to provide a complete answer.
Your answer should always include the following:
# Summary
A general definition or answer to the question.

# Example
A specific example from the context, including code examples. If no examples are provided in the context, you can use your own knowledge to provide an example.

# Related Information
A related fact or piece of information.

{context}

Question: {question}

Please provide a detailed answer, combining information from the context (if relevant) and your own knowledge.
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    {"context": retrieval_chain_rag_fusion, 
     "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

print("** WITH CONTEXT **")
result = final_rag_chain.invoke({"question":question})
print(result)

print("\n** WITHOUT CONTEXT **\n")
result = llm.invoke(question)
print(result)

** WITH CONTEXT **
The Differential Transformer (DIFF Transformer) differs from a traditional Transformer in its attention mechanism. In a traditional Transformer, the attention mechanism is based on softmax functions to compute attention scores between query, key, and value vectors. However, this mechanism can be noisy due to the presence of noise in the attention scores.

In contrast, the DIFF Transformer uses a differential attention mechanism to cancel out the noise in the attention scores. This mechanism involves computing two separate softmax attention maps for two groups of query and key vectors, and then subtracting these two maps to obtain the final attention scores.

The main difference between DIFF Transformer and traditional Transformer can be summarized as follows:

1. **Attention Mechanism**: Traditional Transformer uses a single softmax function to compute attention scores, while DIFF Transformer uses two separate softmax functions to cancel out noise in the attention sc