# Retrieval-Augmented Generation with Ollama and LangChain

Retrieval-Augmented Generation (RAG) combines document retrieval with a generative model so that answers are grounded in retrieved context rather than in the model's training data alone. In this notebook, we use the LangChain ecosystem to build a conversational RAG system that stores document embeddings in a vector store and retrieves the most relevant ones for each question.

## Lab Description

In this lab, you will build a conversational Retrieval-Augmented Generation (RAG) system using LangChain and Ollama. The system retrieves relevant context from PDF documents, passes it to an LLM served by Ollama (`llama3.1:8b` in the notebook cells, `mistral` in the Streamlit app), and generates concise, context-aware answers.

## Lab Objectives

After completing the lab, participants will be able to:

- Load PDF documents, embed them, and index the embeddings in a FAISS vector store.
- Implement a history-aware retriever to enhance conversational understanding.
- Construct a RAG pipeline for efficient document-based Q&A.
- Run an interactive chat loop that enables dynamic question-answering based on document content.

## Importing the Libraries

In [1]:
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains import create_history_aware_retriever
from langchain_core.messages import HumanMessage
import os
import asyncio
import nest_asyncio

## Retrieval-Augmented Generation (RAG) Workflow

### Document Processing Pipeline
The following image represents the **document processing pipeline** in a **Retrieval-Augmented Generation (RAG) system**. The pipeline consists of four main stages:

1. **Load**: Raw documents in various formats (JSON, PDFs, URLs, etc.) are ingested.
2. **Split**: The documents are chunked into smaller pieces to facilitate efficient retrieval.
3. **Embed**: Each chunk is converted into numerical representations (embeddings) using an embedding model.
4. **Store**: The embeddings are stored in a vector database, such as FAISS, for fast retrieval.

This process ensures that information is structured efficiently for retrieval-based generation.

<img src="one.png" alt="Document Processing Pipeline" width="800">
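
The four stages above can be sketched in plain Python. This is a toy illustration only, not the code the lab uses: `split_text` and `toy_embed` are hypothetical helpers, the hashed bag-of-words vector stands in for a real embedding model, and a Python list stands in for FAISS.

```python
import hashlib
import math


def split_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size character chunks that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def toy_embed(text, dim=16):
    """Hash each word into one of `dim` buckets, then L2-normalise the counts."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


# Load: a string stands in for the text extracted from a PDF page.
document = "Attention lets a model weigh input tokens. " * 3

# Split -> Embed -> Store: the "vector store" is a list of (vector, chunk) pairs.
chunks = split_text(document)
store = [(toy_embed(chunk), chunk) for chunk in chunks]
print(len(chunks), "chunks stored, each with a", len(store[0][0]), "dimensional vector")
```

The overlap between neighbouring chunks is a common way to avoid cutting a sentence's meaning in half at a chunk boundary; the sizes used here are arbitrary.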

### Retrieval & Response Generation
This image illustrates how a **Retrieval-Augmented Generation (RAG) pipeline** answers user queries:

1. **User Question**: A query is input by the user.
2. **Retrieve**: The system fetches relevant document chunks from the vector store based on semantic similarity.
3. **Prompt**: The retrieved context is combined into a structured prompt.
4. **LLM (Large Language Model)**: The prompt is sent to an LLM (e.g., Mistral, GPT) to generate an informed response.
5. **Answer**: The final answer is provided to the user.

This process ensures that responses are grounded in factual and retrieved data rather than purely relying on the model's pre-trained knowledge.

<img src="two.png" alt="RAG Retrieval & Response Generation" width="800">
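
The Retrieve and Prompt steps can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors are hand-written stand-ins for real embeddings, `retrieve` is a hypothetical helper, and cosine similarity is used for clarity (the metric used by the notebook's FAISS index may differ).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


# Hand-written vectors stand in for real embeddings.
store = [
    ([1.0, 0.0, 0.0], "Attention weighs the importance of input tokens."),
    ([0.0, 1.0, 0.0], "The server supports several GPU configurations."),
    ([0.9, 0.1, 0.0], "Self-attention relates positions within a sequence."),
]
question = "How does attention work?"
query_vec = [1.0, 0.05, 0.0]  # pretend embedding of the question

top_chunks = retrieve(query_vec, store)
prompt = (
    "Answer using only this context:\n"
    + "\n".join(top_chunks)
    + f"\n\nQuestion: {question}"
)
print(prompt)
```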


## Allow nested asynchronous loops

Jupyter notebooks already run an event loop in the background, so calling `asyncio.run()` from a cell normally raises a `RuntimeError`. `nest_asyncio.apply()` patches asyncio to allow nested event loops, so asynchronous code can run inside a notebook cell even while the loop is active.

In [5]:
nest_asyncio.apply()
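
To see the problem that nested loops cause, the sketch below reproduces it with the standard library only and without `nest_asyncio`. The names `inner` and `outer` are illustrative; `outer` plays the role of the notebook's already-running loop.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # An event loop is already running here, just as it is inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second loop inside the first
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that never ran
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With the patch applied, the nested `asyncio.run()` call is expected to succeed instead of raising.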

## The `get_conversational_answer` Function

### Contextualizing the Question
- The function starts by setting up a `contextualize_q_system_prompt`, which is a system instruction that reformulates the user's question based on the chat history. This step ensures that questions referencing past conversation context are rewritten as standalone questions that can be understood without that context.
- The prompt is then fed into a `ChatPromptTemplate`, which organizes the messages for the language model. It includes placeholders for the system message, chat history, and user input.
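
The message layout that this prompt template produces can be sketched without LangChain. `build_messages` is a hypothetical helper, and the `(role, content)` tuples are a simplified stand-in for LangChain message objects.

```python
def build_messages(system_prompt, chat_history, user_input):
    """Assemble the message list: system instruction, prior turns, then the new question."""
    messages = [("system", system_prompt)]
    messages.extend(chat_history)           # fills the chat_history placeholder
    messages.append(("human", user_input))  # fills the {input} slot
    return messages


chat_history = [
    ("human", "What is the paper about?"),
    ("ai", "It discusses attention mechanisms in neural networks."),
]
messages = build_messages(
    "Reformulate the latest question as a standalone question.",
    chat_history,
    "Explain more about it.",
)
for role, content in messages:
    print(f"{role}: {content}")
```

Given this history, the model has what it needs to rewrite "Explain more about it." as a question that names attention mechanisms explicitly; the exact wording of the rewrite depends on the model.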

### Creating a History-Aware Retriever
- `llama3.1:8b`, served by Ollama at the configured `base_url`, is initialized as the LLM.
- Using this LLM, `create_history_aware_retriever` is called, which combines the LLM with a retriever (a tool that fetches relevant documents). This retriever will be context-aware, ensuring conversational flow.

### Setting Up the Question-Answering (QA) System Prompt
- The `qa_system_prompt` is another system message that directs the assistant to answer the question concisely and to only respond if it has enough information.
- A second `ChatPromptTemplate` is created to format these QA instructions, integrating context, chat history, and user input.

### Creating the RAG Chain
- A `question_answer_chain` is created using `create_stuff_documents_chain`, which combines the LLM and the QA prompt. This chain processes the retrieved documents (context) and provides answers.
- Next, `create_retrieval_chain` links the history-aware retriever and the question-answering chain to form a RAG pipeline. The pipeline retrieves relevant context from documents and uses it to generate concise and precise answers.
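
Conceptually, the two chains compose as in the sketch below. This is not LangChain's implementation: `stuff_documents`, `make_rag_chain`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins that only illustrate the data flow, in which all retrieved chunks are "stuffed" into a single context string.

```python
def stuff_documents(chunks, separator="\n\n"):
    """Join the retrieved chunks into one context string."""
    return separator.join(chunks)


def make_rag_chain(retrieve, generate):
    """Compose a retriever and a generator into a single callable."""
    def chain(inputs):
        chunks = retrieve(inputs["input"], inputs["chat_history"])
        context = stuff_documents(chunks)
        answer = generate(inputs["input"], context)
        # Echo the inputs and add the retrieved context plus the answer.
        return {**inputs, "context": chunks, "answer": answer}
    return chain


# Hypothetical stand-ins for the history-aware retriever and the LLM call.
def fake_retrieve(question, chat_history):
    return ["Attention weighs input tokens.", "Transformers stack attention layers."]


def fake_generate(question, context):
    return f"(answer to {question!r} drawn from {len(context)} characters of context)"


rag_chain = make_rag_chain(fake_retrieve, fake_generate)
result = rag_chain({"input": "What is attention?", "chat_history": []})
print(sorted(result))
```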

### Generating the Answer
- The `rag_chain.invoke` method is called with the user input and chat history, returning a response (`ai_msg`) from the RAG pipeline. This response is structured to provide clear, contextually accurate answers based on both the user’s question and the retrieved documents.
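
Besides `answer`, the returned dictionary also carries the retrieved documents under `context`, which makes it possible to show where an answer came from. The sketch below uses a mocked result: `Doc` is a hypothetical stand-in for a retrieved document, and the `source` and `page` metadata keys are an assumption about what the PDF loader records.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Mocked result with the keys the chain is expected to return.
ai_msg = {
    "input": "What is attention?",
    "chat_history": [],
    "context": [
        Doc("Attention weighs input tokens.", {"source": "data/paper.pdf", "page": 2}),
        Doc("Self-attention relates positions.", {"source": "data/paper.pdf", "page": 3}),
    ],
    "answer": "Attention weighs the relevance of each input token.",
}


def format_sources(result):
    """List the distinct (source, page) pairs behind an answer, in retrieval order."""
    seen = []
    for doc in result["context"]:
        ref = (doc.metadata.get("source", "unknown"), doc.metadata.get("page"))
        if ref not in seen:
            seen.append(ref)
    return [f"{src} (page {page})" for src, page in seen]


print(ai_msg["answer"])
print(format_sources(ai_msg))
```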


In [6]:
async def get_conversational_answer(retriever, input, chat_history):
    contextualize_q_system_prompt = """Given a chat history and the latest user question \
    which might reference context in the chat history, formulate a standalone question \
    which can be understood without the chat history. Do NOT answer the question, \
    just reformulate it if needed and otherwise return it as is."""
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    llm = ChatOllama(model="llama3.1:8b", base_url="http://10.79.253.112:11434")

    history_aware_retriever = create_history_aware_retriever(
        llm, retriever, contextualize_q_prompt
    )

    qa_system_prompt = """You are an assistant for question-answering tasks. \
    Use the following pieces of retrieved context to answer the question. \
    If you don't know the answer, just say that you don't know. \
    Use three sentences maximum and keep the answer concise.\
    Do not generate any additional text unless you are asked to.\
    Keep the answers really short and concise.\

    {context}"""
    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", qa_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
    rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
    ai_msg = rag_chain.invoke({"input": input, "chat_history": chat_history})
    return ai_msg

## The `main` Function 

The `main` function initiates the conversational question-answering chain based on PDF documents stored in the specified directory.

### PDF Document Loading
- The directory containing PDF files is specified (`./data`), and these documents are loaded using `PyPDFDirectoryLoader`.
- `documents` stores the loaded PDF data, which will be converted into embeddings for retrieval.

### Vector Store Initialization
- `OllamaEmbeddings` (using the `nomic-embed-text:latest` model) generates embeddings for the documents, which enables semantic similarity search.
- Facebook AI Similarity Search (FAISS) is an open-source library for fast similarity search over dense vectors. Here it stores the document embeddings so that the chunks most similar to the user's question can be retrieved quickly.
- The vector store’s `as_retriever()` method provides a retriever object for retrieving relevant document chunks.

### Conversation State Initialization
- `chat_history` is initialized as an empty list to store user inputs and assistant responses. This is later used for reformulating user questions to enable contextual question answering. 

### Interactive Question-Answer Loop
- A loop takes user input (prompt) to ask questions based on the uploaded PDF documents.
- The loop breaks if the user types "exit".

### Getting the AI Response
- The `get_conversational_answer` function is called using `asyncio.run()`, taking in the retriever, user prompt, and chat history to generate contextually relevant responses.
- The user's question and the AI's answer (`ai_msg["answer"]`) are appended to `chat_history` so that later questions can be interpreted in context.
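
Because the history grows with every turn, long sessions can eventually exceed the model's context window. The sketch below shows one possible way to record turns while optionally capping the history; `append_turn` and `max_turns` are hypothetical and not part of the lab code.

```python
def append_turn(chat_history, question, answer, max_turns=None):
    """Record one question/answer turn; optionally keep only the latest max_turns turns."""
    chat_history.extend([("human", question), ("ai", answer)])
    if max_turns is not None:
        keep = 2 * max_turns  # each turn contributes two messages
        chat_history[:] = chat_history[-keep:] if keep > 0 else []
    return chat_history


history = []
append_turn(history, "q1", "a1", max_turns=2)
append_turn(history, "q2", "a2", max_turns=2)
append_turn(history, "q3", "a3", max_turns=2)
print(history)
```

Dropping the oldest turns is the simplest policy; it trades long-range conversational memory for a bounded prompt size.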

### Displaying the Assistant’s Response
- The assistant’s response is printed to the console.
- This loop continues until the user types in 'exit'. 


In [7]:
def main():
    # Specify the directory where the PDF is stored
    pdf_directory = "./data"

    # Load the PDF documents
    loader = PyPDFDirectoryLoader(pdf_directory)
    documents = loader.load()

    # Initialize the vector store using the embeddings model
    embed_model = OllamaEmbeddings(model='nomic-embed-text:latest', base_url="http://10.79.253.112:11434")
    vector_store = FAISS.from_documents(documents, embed_model)
    retriever = vector_store.as_retriever()

    # Initialize the conversation state
    chat_history = []

    while True:
        # Take user input for a question
        prompt = input("Ask your question based on the uploaded PDF (or type 'exit' to quit): ")

        if prompt.lower() == 'exit':
            print("Exiting the conversation.")
            break

        # Get the AI response using the retriever and chain
        ai_msg = asyncio.run(get_conversational_answer(retriever, prompt, chat_history))

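        # Store the user input and AI response in the chat history.
        # The ("ai", text) tuple marks the answer as an assistant message;
        # a bare string would be treated as another human message.
        chat_history.extend([HumanMessage(content=prompt), ("ai", ai_msg["answer"])])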

        # Display the assistant's response
        print("Assistant: ", ai_msg["answer"])

## Call the `main` function to initiate the chat

In [8]:
if __name__ == '__main__':
    main()

Ask your question based on the uploaded PDF (or type 'exit' to quit):  what is the paper about ?



Assistant:  The provided context discusses attention mechanisms in neural networks, particularly focusing on their applications in natural language processing (NLP). It specifically mentions:

* **Attention mechanism:** A method to weigh the importance of different parts of the input sequence for generating an output. 
* **Contextual representation:**  Neural networks learn representations that take into account the relationship between words and phrases based on surrounding context. This is achieved through attention mechanisms.
* **Applications in NLP:** The paper highlights how attention mechanisms are used to improve natural language understanding, translation, and summarization tasks.

The context suggests a deep dive into the workings of these neural network techniques and their contributions to NLP research and applications. 





Ask your question based on the uploaded PDF (or type 'exit' to quit):  explain more about attention mechanism 


Assistant:  Let's delve into the fascinating world of attention mechanisms!

**The Essence of Attention: Focusing on Relevant Information**

Imagine you're reading a book. You don't process every word equally—you focus on key phrases, words related to your current understanding, and important entities.  This is what attention does for AI models. It helps them determine which parts of the input data are most relevant for a specific task at hand.

**How Attention Works: A Step-by-Step Guide**

1. **Input Representation:** The model starts with a raw sequence (like words in a sentence) and encodes it into a representation called an embedding, essentially turning each word into a vector. 
2. **Query, Key, Value Matrices:** The encoding is transformed through multiple matrices: the query (Q), key (K), and value (V) matrices. Think of these as a set of spotlights that focus on specific parts of the input sequence.
3. **Attention Score Calculation:**  The attention mechanism calculates an "at

Ask your question based on the uploaded PDF (or type 'exit' to quit):  exit


Exiting the conversation.


## Integration with Streamlit UI 

Run this cell to write the entire code to a file named `app.py`. Then open a new terminal and type `streamlit run app.py` to see the RAG system demonstrated above with an interactive UI.

In [None]:
%%writefile ./app.py

import streamlit as st
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate,MessagesPlaceholder
from langchain.chains import create_history_aware_retriever
from langchain_core.messages import HumanMessage
import os
import asyncio


async def get_conversational_answer(retriever,input,chat_history):
    contextualize_q_system_prompt = """Given a chat history and the latest user question \
    which might reference context in the chat history, formulate a standalone question \
    which can be understood without the chat history. Do NOT answer the question, \
    just reformulate it if needed and otherwise return it as is."""
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )


    llm = ChatOllama(model="mistral")

    history_aware_retriever = create_history_aware_retriever(
        llm, retriever, contextualize_q_prompt
    )

    qa_system_prompt = """You are an assistant for question-answering tasks. \
    Use the following pieces of retrieved context to answer the question. \
    If you don't know the answer, just say that you don't know. \
    Use three sentences maximum and keep the answer concise.\
    Do not generate any additional text unless you are asked to.\
    Keep the answers really short and concise.\

    {context}"""
    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", qa_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
    rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
    ai_msg = rag_chain.invoke({"input": input, "chat_history": chat_history})
    return  ai_msg


def main():
    st.header('Chat with your PDF')
    
    if "conversation" not in st.session_state:
        st.session_state.conversation = None

    if "activate_chat" not in st.session_state:
        st.session_state.activate_chat = False

    if "messages" not in st.session_state:
        st.session_state.messages = []
        st.session_state.chat_history=[]

    for message in st.session_state.messages:
        with st.chat_message(message["role"], avatar = message['avatar']):
            st.markdown(message["content"])

    embed_model = OllamaEmbeddings(model='mistral')

    with st.sidebar:
        st.subheader('Upload Your PDF File')
        docs = st.file_uploader('Upload your PDF & Click to process',accept_multiple_files = True, type=['pdf'])
        if st.button('Process'):
            # file_uploader returns an empty list when nothing has been uploaded
            if docs:
                os.makedirs('./data', exist_ok=True)
                for doc in docs:
                    save_path = os.path.join('./data', doc.name)
                    with open(save_path, 'wb') as f:
                        f.write(doc.getbuffer())
                    st.write(f'Processed file: {save_path}')

                with st.spinner('Processing'):
                    loader = PyPDFDirectoryLoader("./data")
                    documents = loader.load()
                    vector_store = FAISS.from_documents(documents, embed_model)
                    # Always replace the retriever so newly uploaded PDFs take effect
                    st.session_state.retriever = vector_store.as_retriever()
                    st.session_state.activate_chat = True

                # Delete uploaded PDF files after loading
                for doc in os.listdir('./data'):
                    os.remove(os.path.join('./data', doc))
            else:
                st.warning('Please upload at least one PDF before processing.')

    if st.session_state.activate_chat:
        if prompt := st.chat_input("Ask your question based on the uploaded PDF"):
            with st.chat_message("user", avatar='👨🏻'):
                st.markdown(prompt)
            st.session_state.messages.append({"role": "user", "avatar": '👨🏻', "content": prompt})
            retriever = st.session_state.retriever

            ai_msg = asyncio.run(get_conversational_answer(retriever, prompt, st.session_state.chat_history))
            # Tag the answer as an assistant message; a bare string would be read as a human message
            st.session_state.chat_history.extend([HumanMessage(content=prompt), ("ai", ai_msg["answer"])])
            cleaned_response = ai_msg["answer"]
            with st.chat_message("assistant", avatar='🤖'):
                st.markdown(cleaned_response)
            st.session_state.messages.append({"role": "assistant", "avatar": '🤖', "content": cleaned_response})
    else:
        # Shown until at least one PDF has been processed
        st.markdown('Upload your PDFs to chat')


if __name__ == '__main__':
    main()

## Accessing the UI

### **Step 1: Open http://10.79.253.111:8501 in a new tab**
The application starts with an interface where users can upload **PDF files**. The sidebar provides an option to **browse** and select files.

<img src="1.png" alt="Chat with Your PDF - Initial Interface" width="800">

---

### **Step 2: Selecting a PDF File**
Users can browse their system and select a **PDF file** for processing.

<img src="2.png" alt="Selecting a PDF File" width="800">

---

### **Step 3: Uploading and Processing the PDF**
After selecting the file, users click **"Process"** to upload and analyze the document.

<img src="3.png" alt="Uploading and Processing the PDF" width="800">

---

### **Step 4: Conversational Chat Based on PDF**
Once processing is complete, users can chat with the PDF content by asking questions. The system retrieves relevant information and provides responses.

<img src="4.png" alt="Chatting with the PDF Document" width="800">


## **Testing the RAG System with HPE ProLiant Compute DL380a Gen12 QuickSpecs**

### **Instructions:**
Use the following questions to test the **Retrieval-Augmented Generation (RAG) system**. These questions are designed to verify if the system can accurately retrieve and generate responses from the **HPE ProLiant Compute DL380a Gen12 QuickSpecs** document.

### **Test Questions:**
1. **What are the key features of the HPE ProLiant Compute DL380a Gen12 server?**
2. **Which Intel® Xeon® 6 processors are supported by the HPE ProLiant Compute DL380a Gen12, and what are their specifications?**
3. **What are the supported GPU configurations for the HPE ProLiant Compute DL380a Gen12, and what are the power requirements for different configurations?**

### **How to Use:**
- Upload the **QuickSpecs PDF** to the RAG system.
- Ask each question one by one.
- Verify if the response is **accurate and based on the document**.

**If the system returns correct answers that are grounded in the document, the RAG pipeline is working as intended.**
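
As a rough aid for the verification step, the sketch below scores how much of an answer's vocabulary also appears in the retrieved context. It is a crude keyword heuristic and an assumption of this write-up, not part of the lab; `grounding_score` is a hypothetical helper, the sample sentences are invented placeholders rather than text from the QuickSpecs, and a high score does not prove that an answer is correct.

```python
import re


def _words(text):
    """Lower-cased alphanumeric tokens of four or more characters."""
    return set(re.findall(r"[a-z0-9]{4,}", text.lower()))


def grounding_score(answer, context):
    """Fraction of distinct answer words that also occur in the context."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _words(context)) / len(answer_words)


# Invented placeholder text, not taken from any real document.
context = "The example server supports up to four accelerator cards and two power supplies."
grounded = "The server supports four accelerator cards."
ungrounded = "It ships with liquid cooling by default."

grounded_score = grounding_score(grounded, context)
ungrounded_score = grounding_score(ungrounded, context)
print(grounded_score, ungrounded_score)
```

A low score is a prompt to read the answer against the source PDF by hand, which remains the reliable check.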
