# Project: Building a RAG chatbot using Python, LangChain and a pre-defined knowledge base

# 1. Retrieval-Augmented Generation (RAG)

## What is RAG?
One of the most powerful applications enabled by LLMs is sophisticated question-answering (Q&A) chatbots. These are applications that can answer questions about specific source information. They use a technique known as Retrieval-Augmented Generation (RAG): before answering, the application retrieves relevant passages from the source data and passes them to the model together with the question.

**This tutorial shows how to build a simple Q&A application over a text data source.**

### Overview
A typical RAG application has two main components:

### Indexing

1. **Load**:
   - First, we need to load our data. This is done using **Document Loaders**.
   - Document loaders handle various file formats (e.g., text files, PDFs, web pages) and convert them into a standardized format for processing.

2. **Split**:
   - Text splitters break large documents into smaller chunks.
   - This is useful for both indexing data and passing it into a model, as:
     - Large chunks are harder to search over.
     - They may not fit in a model's finite context window.

3. **Store**:
   - We need a place to store and index our splits so they can be searched over later.
   - This is often done using a **VectorStore** and an **Embeddings model**.
   - The embeddings model converts text into numerical vectors, and the vector store allows for efficient similarity search.

### Retrieval and generation

1. **Retrieve**:
   - Given a user input, relevant splits are retrieved from storage using a **Retriever**.

2. **Generate**:
   - A **ChatModel** / **LLM** produces an answer using a prompt that includes both the question and the retrieved data.
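
To make the two phases concrete before introducing any library, here is a minimal pure-Python sketch. It is only an illustration under simplifying assumptions: the "store" is a plain list, retrieval ranks chunks by shared words instead of embeddings, and generation stops at building the prompt that would be sent to a model. The names `simple_retrieve` and `simple_prompt` are hypothetical helpers, not part of LangChain.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return the set of its word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def simple_retrieve(question: str, chunks: list, k: int = 2) -> list:
    """Retrieve: rank chunks by the number of words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]


def simple_prompt(question: str, context: list) -> str:
    """Generate (first half): build the prompt that combines question and context."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Indexing, reduced to its bare minimum: already-split chunks kept in a list.
toy_chunks = [
    "Solar towers concentrate sunlight onto a central receiver.",
    "Parabolic troughs focus sunlight onto a pipe along the focal line.",
    "Wind turbines convert the kinetic energy of wind into electricity.",
]

toy_question = "How do solar towers collect sunlight?"
toy_context = simple_retrieve(toy_question, toy_chunks, k=1)
toy_prompt = simple_prompt(toy_question, toy_context)
print(toy_prompt)
```

In the real pipeline built in this tutorial, the word-overlap ranking is replaced by embedding-based vector search, and the prompt is passed to a chat model.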

# 2. Building the Bot

## 1. Installing required packages

For this tutorial we need the following Python packages:


1. **langchain-text-splitters**: A library for splitting text into chunks (e.g., for RAG pipelines).
2. **langchain-community**: A collection of community-contributed tools and integrations for LangChain.
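3. **langchain[mistralai]**: The main `langchain` package installed with its `mistralai` extra. It provides `init_chat_model` and pulls in the Mistral AI chat model integration. Using the hosted Mistral models requires an API key.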
4. **langchain-mistralai**: A library for integrating Mistral AI models with LangChain.
5. **langchain-core**: Provides core functionalities for LangChain, including an in-memory vector store.
6. **faiss-cpu**: A library for efficient similarity search and clustering of dense vectors (CPU version).
7. **Flask**: A lightweight Python web framework for building APIs and web servers.  
8. **Flask-CORS**: Enables Cross-Origin Resource Sharing (CORS) in Flask for frontend-backend communication.  
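9. **python-dotenv**: Loads environment variables from a `.env` file for secure configuration management.
10. **langgraph**: Provides the `StateGraph` used later to connect the retrieval and generation steps.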

In [8]:
# Install the required packages with pip.
# -U (--upgrade) upgrades a package to its latest version if it is already installed.

!pip install -U langchain-text-splitters langchain-community
!pip install -U "langchain[mistralai]"
!pip install -U langchain-mistralai
!pip install -U langchain-core
# langgraph is imported later to build the retrieval/generation workflow
!pip install -U langgraph
!pip install faiss-cpu
!pip install Flask
!pip install python-dotenv
!pip install flask-cors

## 2. Detailed walkthrough

### 1. Indexing:
- **Loading documents**:
  - We need to first load the articles we already have.
  - We can use [DocumentLoaders](https://python.langchain.com/docs/concepts/document_loaders/) for this, which are objects
    that load in data from a source and return a list of Document objects.
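
To illustrate what a directory-based document loader does conceptually, here is a small sketch that uses only the standard library. It is not LangChain's implementation: `SimpleDocument` and `load_text_files` are hypothetical stand-ins, and the sketch assumes the files are UTF-8 encoded.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: the text plus metadata about its origin."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_files(directory: str, pattern: str = "**/*.txt") -> list:
    """Read every file under `directory` that matches `pattern` into a SimpleDocument."""
    documents = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            text = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
            documents.append(SimpleDocument(text, {"source": str(path)}))
    return documents


# Demo on a temporary directory so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("First article", encoding="utf-8")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.txt").write_text("Second article", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("not a .txt file", encoding="utf-8")
    loaded = load_text_files(tmp)

print(f"Number of documents loaded: {len(loaded)}")
```

The `source` entry in the metadata is what later allows an answer to name the file a passage came from.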

In [9]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

# Step 1: Load all .txt files from the articles directory
loader = DirectoryLoader(
    path="articles",        # Path to the directory
    glob="**/*.txt",        # Pattern to match .txt files, including subdirectories
    loader_cls=TextLoader,  # Use TextLoader for .txt files
    show_progress=True,     # Show a progress bar while loading (requires tqdm)
)

# Step 2: Load the documents
docs = loader.load()

# Step 3: Verify the loaded documents
print(f"Number of documents loaded: {len(docs)}")
print(f"Total characters in the first document: {len(docs[0].page_content)}")
print("First 200 characters of the first document:")
print(docs[0].page_content[:200])  # Print the first 200 characters for a quick preview

100%|█████████████████████████████████████| 15/15 [00:00<00:00, 545.18it/s]

Number of documents loaded: 15
Total characters in the first document: 30930
First 200 characters of the first document:
 Willkommen zum zweiten Teil für konzentrierte Systeme zur Nutzung der Solarenergie. Ich wollte mit Ihnen über Solarturme sprechen, aber vorher noch über ein paar Besonderheiten bezüglich der Parabolr





### 2. Splitting Documents
Splitting documents into smaller chunks ensures that each piece fits within a model's context window and makes retrieval more precise, because each chunk covers a narrower topic. Smaller chunks also tend to produce more focused embeddings and can be embedded in batches.

We will use a [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/recursive_text_splitter/), which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
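
The idea of recursive splitting can be sketched in a few lines of plain Python. The function below is a simplified illustration, not the LangChain implementation: it ignores chunk overlap, and `recursive_split` is a hypothetical name. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too long. It assumes the separator list ends with the empty string, which triggers a hard cut.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []

    separator, finer_separators = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer_separators))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
sample_chunks = recursive_split(sample, chunk_size=30)
for chunk in sample_chunks:
    print(repr(chunk))
```

Paragraphs that fit are kept whole, and only the long middle paragraph is broken up at word boundaries.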


In [10]:
# Import the RecursiveCharacterTextSplitter class
# This is a text splitter that recursively splits documents into smaller chunks
# using common separators like new lines, spaces, and punctuation.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the RecursiveCharacterTextSplitter with the following parameters:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # The maximum size (in characters) of each chunk.
                      # Documents will be split into chunks of this size or smaller.
    
    chunk_overlap=200,  # The number of characters that adjacent chunks will overlap.
                        # Overlapping helps preserve context between chunks.
    
    add_start_index=True,  # If True, adds a metadata field `start_index` to each chunk,
                           # which tracks the starting position of the chunk in the original document.
)

# Split the documents into smaller chunks using the text splitter
# `docs` is a list of Document objects (e.g., loaded from a file or directory).
all_splits = text_splitter.split_documents(docs)

# Print the number of chunks created
print(f"Split {len(docs)} documents into {len(all_splits)} chunks.")

Split 15 documents into 1006 chunks.
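
To see how `chunk_size`, `chunk_overlap`, and `start_index` relate to each other, here is a small fixed-window sketch in plain Python. It is a deliberately simplified assumption: the real splitter cuts at separators, so its chunks are usually shorter than `chunk_size` and the overlap is approximate. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-size windows in which neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(demo_text, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk["start_index"], chunk["text"])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the end of one chunk is repeated at the beginning of the next.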


### 3. Storing documents

**What is Embedding?**

**Embedding** is the process of converting text into numerical vectors (arrays of numbers) that capture the semantic meaning of the text. These vectors allow us to perform mathematical operations, such as calculating similarity between texts.

Now we need to index our generated text chunks so that we can search over them at runtime. Our approach is to embed the contents of each chunk and insert these embeddings into a vector store. Given an input query, we can then use vector search to retrieve the most relevant chunks.
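
As a rough illustration of vector search, the sketch below ranks a few stored vectors by cosine similarity to a query vector. The three-dimensional "embeddings" are made up for the example (real embedding models return much longer vectors), and cosine similarity is only one common measure; a vector store may be configured with a different metric, such as Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k (score, text) pairs whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, invented for this sketch.
toy_store = [
    ("solar tower", [0.9, 0.1, 0.0]),
    ("parabolic trough", [0.7, 0.3, 0.1]),
    ("wind turbine", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]
results = top_k(query, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```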


In [11]:
# Step 1: Import required modules
import os  # For interacting with the operating system (e.g., environment variables)
from langchain_mistralai import MistralAIEmbeddings  # For MistralAI embeddings
from langchain_community.vectorstores import FAISS  # For FAISS vector storage
from dotenv import load_dotenv  # For loading environment variables from a .env file

# Step 2: Load environment variables from a .env file
# This is useful for storing sensitive information like API keys
load_dotenv()

# Step 3: Set up MistralAI API key
# Check if the MistralAI API key is already set in the environment variables
mistral_api_key = os.environ.get("MISTRAL_API_KEY")
if not mistral_api_key:
    raise ValueError("MISTRAL_API_KEY not found in .env file. Please add it.")

# Step 4: Set up Hugging Face Token (optional, if needed for other tasks)
# Check if the Hugging Face token is already set in the environment variables
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    # If not, try to get it from another environment variable (e.g., HF_API_TOKEN)
    hf_token = os.environ.get("HF_API_TOKEN")
    if hf_token:
        os.environ["HF_TOKEN"] = hf_token  # Set HF_TOKEN if HF_API_TOKEN is found
    else:
        print("Warning: Hugging Face token not found in .env file. Some features may not work.")

# Step 5: Initialize MistralAI embeddings
# Use the MistralAI embedding model ("mistral-embed") to convert text into numerical vectors
embeddings = MistralAIEmbeddings(model="mistral-embed", mistral_api_key=mistral_api_key)

# Step 6: Create a FAISS vector store
# FAISS is a library for efficient similarity search and clustering of dense vectors
# `from_documents` creates a vector store from the split documents (`all_splits`) and their embeddings
vector_store = FAISS.from_documents(documents=all_splits, embedding=embeddings)

# Step 7: Verify the documents were added to the vector store
# `from_documents` does not return document IDs, so we check the size of the underlying FAISS index.
# `vector_store.index.ntotal` is the number of stored vectors, one per chunk.
print(f"Number of documents in vector store: {vector_store.index.ntotal}")

Number of documents in vector store: 1006


### 4. Retrieval and Generation

We’ll use **LangGraph** to tie together the **retrieval** and **generation** steps into a single application. This makes our RAG pipeline more modular, scalable, and easier to debug.

#### What is LangGraph?

LangGraph is a tool that helps you build workflows for your application. Think of it like a recipe where each step is clearly defined, and you can easily see how the steps connect to each other.

---

#### Why Use LangGraph for RAG?

Imagine you’re building a RAG application that does two main things:
1. **Retrieval**: Find relevant information from a knowledge base (like searching a library).
2. **Generation**: Use a language model (like MistralAI) to generate an answer based on the retrieved information.

LangGraph helps you organize these steps and make them work together smoothly.

---

#### How LangGraph Helps

1. **Modular Design**:
   - Break your RAG pipeline into smaller steps (e.g., retrieval, generation).
   - Each step is like a building block that you can test and reuse.

2. **Scalability**:
   - A compiled graph can be invoked in several ways: streaming (results as they are produced), async (non-blocking), and batched (several inputs in one call).
   - This makes it easier to serve many questions without restructuring the pipeline.

3. **Debugging**:
   - Use LangSmith (a tracing and observability tool) to see what happens at each step of your pipeline.
   - This helps you locate the step in which a problem occurs.

4. **Flexibility**:
   - Add new features (like saving results or asking for human approval) without rewriting your entire code.

---

#### Simple Example

Imagine you’re building a RAG application to answer questions. Here’s how LangGraph helps:

1. **Step 1: Retrieve Documents**:
   - You ask a question (e.g., "What is LangChain?").
   - The system searches a knowledge base (like a library) to find relevant information.

2. **Step 2: Generate an Answer**:
   - The system takes the retrieved information and uses a language model (like MistralAI) to generate an answer.

3. **LangGraph Workflow**:
   - LangGraph connects these two steps into a workflow.
   - You can easily add more steps (e.g., saving the answer or asking for human approval).

---

#### Why is This Useful?

- **Reusability**: You can reuse the same workflow for different tasks (e.g., answering questions, summarizing documents).
- **Scalability**: Run the same workflow for many questions using batched or asynchronous calls.
- **Debugging**: Easily find and fix problems in your pipeline.
- **Flexibility**: Add new features without rewriting everything.

---

#### Key Takeaway

LangGraph makes it easy to build and manage complex workflows (like RAG) by breaking them into smaller, reusable steps. It’s like having a recipe for your application that you can tweak and improve over time.


To use **LangGraph**, we need to define three things:

1. The **state** of our application: controls what data is input to the application, transferred between steps, and output by the application.
2. The **nodes** of our application: the individual steps (here, retrieval and generation).
3. The **control flow** of our application: the ordering of the steps, which is compiled into a single runnable object.
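
Before using the library, it can help to see these three ideas in plain Python. The sketch below is not LangGraph: `run_sequence` is a hypothetical helper that mimics a linear workflow, in which each node receives the current state and returns only the keys it wants to update.

```python
from typing import TypedDict


class ToyState(TypedDict, total=False):
    """The state: the data that flows into, between, and out of the steps."""
    question: str
    context: list
    answer: str


def toy_retrieve(state: ToyState) -> dict:
    """Node 1: return context for the question (hard-coded for this sketch)."""
    return {"context": ["LangGraph compiles nodes into one runnable object."]}


def toy_generate(state: ToyState) -> dict:
    """Node 2: build an answer from the question and the retrieved context."""
    answer = f"Q: {state['question']} | based on {len(state['context'])} fact(s)"
    return {"answer": answer}


def run_sequence(nodes, initial_state):
    """Control flow: call the nodes in order and merge each partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))
    return state


final_state = run_sequence([toy_retrieve, toy_generate], {"question": "What does compile do?"})
print(final_state["answer"])
```

The cell that follows expresses the same pattern with `StateGraph`, where the state type, the node functions, and their order are declared explicitly.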

In [12]:
# Step 1: Import required modules
import getpass  # For prompting for the API key without echoing it
import os  # For reading and setting environment variables

from langchain.chat_models import init_chat_model  # For initializing the chat model
from langchain_core.documents import Document  # For handling document objects
from langchain_core.prompts import PromptTemplate  # For customizing the prompt
from langgraph.graph import START, StateGraph  # For building the workflow graph
from typing_extensions import List, TypedDict  # For type hints

# Step 2: Set up the Mistral API key
# If the Mistral API key is not already set in the environment, prompt the user to enter it
if not os.environ.get("MISTRAL_API_KEY"):
    os.environ["MISTRAL_API_KEY"] = getpass.getpass("Enter API key for Mistral AI: ")

# Step 3: Initialize the Mistral language model with the specified model name and provider
llm = init_chat_model("mistral-large-latest", model_provider="mistralai")

# Step 4: Define the State
# The State is a dictionary that holds the data passed between steps in the workflow
class State(TypedDict):
    question: str  # The user's question
    context: List[Document]  # The retrieved documents
    answer: str  # The generated answer

# Step 5: Define the retrieval step
def retrieve(state: State):
    """
    Retrieves relevant documents for the given question.
    
    Args:
        state (State): The current state of the workflow, containing the question.
    
    Returns:
        dict: A dictionary with the "context" key containing the retrieved documents.
    """
    # Search the vector store for the chunks most similar to the question
    # (similarity_search returns the top 4 matches by default)
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}  # Update the state with the retrieved documents

# Step 6: Define the enhanced prompt template
template = """
You are a knowledgeable and helpful assistant. Your task is to answer the user's question based on the provided context. Follow these rules strictly:

1. **Language**:
   - The answer must be written exclusively in **German**.

2. **Document Processing**:
   - Carefully analyze all provided documents to extract relevant information.
   - Do not provide an answer until you have checked all documents.

3. **Answer Synthesis**:
   - Combine all relevant information from the documents into a single, well-structured answer.
   - The answer must be exclusively from the documents only. Do not generate answers from your own knowledge.
   - For each piece of information, you must follow this style:
       [Dokumentname: [Dokumentname] | Absatz: [Absatzindex] | Zeilen: [Startzeile - Endzeile]] : [die Antwort]

4. **Answer Structure**:
   - Begin your answer with a brief summary of the key points.
   - List every single piece of relevant information found in the documents, formatted as specified.
   - Ensure the answer is concise, clear, and easy to read.
   - Avoid redundancy and focus on providing unique information.

5. **Source Attribution**:
   - Always include the document name, paragraph index, and line numbers for each piece of information.
   - If the document name is missing, use "Unbekannt" as the document name.
   - If paragraph index or line numbers are missing, use "N/A" as a placeholder.

6. **Handling Missing Information**:
   - If the context does not contain enough information to answer the question, say "Ich weiß es nicht."
   - Do not make up or assume any information.

---

Frage: {question}

Kontext: {context}

Antwort:
"""

# Step 7: Define the generation step
def generate(state: State):
    """
    Generates an answer using the retrieved documents and the question.
    
    Args:
        state (State): The current state of the workflow, containing the question and context.
    
    Returns:
        dict: A dictionary with the "answer" key containing the generated response.
    """
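    # Combine the content of all retrieved documents into a single string with source information.
    # `start_index` (added by the text splitter) gives the model a position it can cite.
    docs_content = "\n\n".join(
        f"Dokumentname: {doc.metadata.get('source', 'Unbekannt')}\n"
        f"Dokument-ID: {doc.id}\n"
        f"Startindex: {doc.metadata.get('start_index', 'N/A')}\n"
        f"Inhalt: {doc.page_content}"
        for doc in state["context"]
    )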
    
    # Define the enhanced prompt template
    prompt_template = PromptTemplate(
        input_variables=["question", "context"],
        template=template
    )
    
    # Format the prompt using the PromptTemplate
    formatted_prompt = prompt_template.format(question=state["question"], context=docs_content)
    
    # Generate a response using the language model
    response = llm.invoke(formatted_prompt)
    
    return {"answer": response.content}

# Step 8: Build the workflow graph
graph_builder = StateGraph(State)

# Add the retrieval and generation steps to the graph
graph_builder.add_sequence([retrieve, generate])

# Define the starting point of the workflow
graph_builder.add_edge(START, "retrieve")

# Compile the graph into an executable workflow
graph = graph_builder.compile()



# 3. Creating the API endpoint to communicate with the bot

Creating an API endpoint enables communication between the frontend (e.g., a web or mobile app) and the backend (our RAG-based bot). The endpoint acts as a bridge: clients send a query as JSON and receive the bot's answer in the response. Encapsulating the bot's logic behind an API keeps the client and the pipeline decoupled, so the same bot can serve several client applications. It also centralizes request handling, which makes the system easier to manage, debug, and extend.
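
From the client's point of view, talking to the endpoint defined in the next cell could look like the sketch below. It uses only the standard library and assumes the server runs locally on port 5800 with the `/api` route, accepts a JSON body with a `message` field, and returns the answer under the `data` key, as in this tutorial. `build_chat_request` and `ask_bot` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:5800/api"  # assumption: host, port, and route used in this tutorial


def build_chat_request(message: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request whose JSON body carries the user's message."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_bot(message: str, timeout: float = 60.0) -> str:
    """Send the message to the running server and return the text under the "data" key."""
    with urllib.request.urlopen(build_chat_request(message), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["data"]


# Building the request does not need a running server, so it can be inspected offline.
request_obj = build_chat_request("Was ist ein Solarturm?")
print(request_obj.get_method(), request_obj.full_url)
```

Once the Flask app is running, calling `ask_bot("Was ist ein Solarturm?")` would send the request and return the generated answer.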

In [None]:
from flask import Flask, request, jsonify
from flask_cors import CORS
import threading
import time


# Create an instance of the Flask class
app = Flask(__name__)

CORS(app)

port = 5800

@app.route("/api", methods=['POST'])
def chat():
    # Parse the JSON body; silent=True returns None instead of raising on invalid JSON
    data = request.get_json(silent=True)
    if not isinstance(data, dict) or not data.get("message"):
        # Client error: the request body is missing or malformed
        return jsonify({"errors": ["Request body must be JSON with a non-empty 'message' field"]}), 400

    try:
        # Run the RAG workflow and extract the generated answer
        answer = graph.invoke({"question": data["message"]})["answer"]
    except Exception as e:
        # Server error: retrieval or generation failed
        return jsonify({"errors": [str(e)]}), 500

    return jsonify({"data": answer}), 200

def run_flask():
    print(f"Flask app is running on http://127.0.0.1:{port}/")
    app.run(port=port, debug=False, use_reloader=False)

# Global variable to track the Flask thread
flask_thread = None

# Main entry point
if __name__ == '__main__':
    # Start the Flask app in a separate thread
    flask_thread = threading.Thread(target=run_flask)
    flask_thread.daemon = True  # Daemonize the thread so it exits when the main program exits
    flask_thread.start()

    try:
        # Keep the main thread alive to keep the Flask app running
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("Shutting down Flask app...")
        # No need to explicitly stop the thread since it's daemonized

Flask app is running on http://127.0.0.1:5800/
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5800
Press CTRL+C to quit
127.0.0.1 - - [23/Mar/2025 09:37:35] "OPTIONS /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:37:39] "POST /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:40:45] "OPTIONS /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:40:57] "POST /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:42:43] "OPTIONS /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:43:08] "POST /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:43:44] "OPTIONS /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:43:56] "POST /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:44:39] "OPTIONS /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:44:40] "POST /api HTTP/1.1" 400 -
127.0.0.1 - - [23/Mar/2025 09:44:55] "OPTIONS /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:45:01] "POST /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:45:24] "OPTIONS /api HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2025 09:45:39] "POST /api HTTP/1.1" 200 -
127.