# Goal: Building a Retrieval-Augmented Generation (RAG) Pipeline
- **Overview**: After completing this tutorial, you'll have learned how to build an indexing pipeline that will preprocess files based on their file type, using the `FileTypeRouter`.

> 💡 (Optional): After creating the indexing pipeline in this tutorial, there is an optional section that shows you how to create a RAG pipeline on top of the document store you just created. You must have a [Mistral AI](https://mistral.ai/) API key for this section. A [Hugging Face API Key](https://huggingface.co/settings/tokens) is also read there, for the alternative Hugging Face generator.

## Components Used

- [`FileTypeRouter`](https://docs.haystack.deepset.ai/docs/filetyperouter): This component will help you route files based on their corresponding MIME type to different components
- [`MarkdownToDocument`](https://docs.haystack.deepset.ai/docs/markdowntodocument): This component will help you convert markdown files into Haystack Documents
- [`PyPDFToDocument`](https://docs.haystack.deepset.ai/docs/pypdftodocument): This component will help you convert pdf files into Haystack Documents
- [`TextFileToDocument`](https://docs.haystack.deepset.ai/docs/textfiletodocument): This component will help you convert text files into Haystack Documents
- [`DocumentJoiner`](https://docs.haystack.deepset.ai/docs/documentjoiner): This component will help you to join Documents coming from different branches of a pipeline
- [`DocumentCleaner`](https://docs.haystack.deepset.ai/docs/documentcleaner) (optional): This component will help you to make Documents more readable by removing extra whitespaces etc.
- [`DocumentSplitter`](https://docs.haystack.deepset.ai/docs/documentsplitter): This component will help you to split your Document into chunks
- [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder): This component will help you create embeddings for Documents.
- [`DocumentWriter`](https://docs.haystack.deepset.ai/docs/documentwriter): This component will help you write Documents into the DocumentStore

## Overview

In this tutorial, you'll build an indexing pipeline that preprocesses different types of files (Markdown, plain text, and PDF). Each file type will have its own converter component. The rest of the indexing pipeline is fairly standard: clean up extra whitespace, split the documents into chunks, create embeddings, and write them to a Document Store.

Optionally, you can keep going to see how to use these documents in a query pipeline as well.

## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/logging)

## Installing dependencies


In [None]:
# Install Haystack-AI: A framework for building search systems and question answering with AI
# This provides the core functionality for document retrieval and question answering pipelines
!pip install haystack-ai

# Install Mistral-Haystack integration: Enables using Mistral AI models within Haystack pipelines
# Mistral models are efficient open-weight language models
!pip install mistral-haystack

# Install sentence-transformers and huggingface_hub with version constraints:
# - sentence-transformers: For generating embeddings (dense vector representations of text)
# - huggingface_hub: To interact with Hugging Face's model repository
# Version constraints ensure compatibility with other packages
!pip install "sentence-transformers>=3.0.0" "huggingface_hub>=0.23.0"

# Install text processing utilities:
# - markdown-it-py: Markdown parser for processing markdown documents
# - mdit_plain: Extension to extract plain text from markdown
# - pypdf: For processing and extracting text from PDF documents
!pip install markdown-it-py mdit_plain pypdf

# Installing Flask: A lightweight web framework for Python
# This will be used to create the API endpoints for the application
!pip install Flask

# Install python-dotenv: For loading environment variables from .env files
# This helps manage configuration and secrets separately from code
!pip install python-dotenv

# Install flask-cors: Flask extension for handling Cross-Origin Resource Sharing (CORS)
# This is necessary when the API needs to be accessed from different domains
!pip install flask-cors

## Create a Pipeline to Index Documents

Next, you'll create a pipeline to index documents. To keep things uncomplicated, you'll use an `InMemoryDocumentStore` but this approach would also work with any other flavor of `DocumentStore`.

You'll need a different file converter class for each file type in our data sources: `.pdf`, `.txt`, and `.md` in this case. Our `FileTypeRouter` connects each file type to the proper converter.

Once all our files have been converted to Haystack Documents, we can use the `DocumentJoiner` component to make these a single list of documents that can be fed through the rest of the indexing pipeline all together.
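
To make the routing and joining steps concrete, here is a minimal standard-library sketch of the idea. It is not Haystack's implementation: `MIME_BY_SUFFIX`, `route_by_mime_type`, and `join_branches` are hypothetical names, and the simple suffix lookup is an assumption made for illustration.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical lookup table for this sketch; it mirrors the three MIME types
# used in this tutorial.
MIME_BY_SUFFIX = {
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".md": "text/markdown",
}


def route_by_mime_type(sources):
    """Group file paths by MIME type; unknown suffixes go to 'unclassified'."""
    routed = defaultdict(list)
    for source in sources:
        path = Path(source)
        mime_type = MIME_BY_SUFFIX.get(path.suffix.lower(), "unclassified")
        routed[mime_type].append(path)
    return dict(routed)


def join_branches(*branches):
    """Concatenate the lists produced by several branches into one list."""
    joined = []
    for branch in branches:
        joined.extend(branch)
    return joined


routed = route_by_mime_type(["notes.txt", "paper.PDF", "readme.md", "logo.png"])
for mime_type, paths in routed.items():
    print(mime_type, [p.name for p in paths])
```

Each branch would hand its files to a dedicated converter, and the converter outputs would then be merged by something like `join_branches` before the shared cleaning, splitting, and embedding steps.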

In [None]:
# Import DocumentWriter: Component that writes documents to a document store
from haystack.components.writers import DocumentWriter

# Import document converters for different file types:
# - MarkdownToDocument: Converts markdown files to Haystack Document objects
# - PyPDFToDocument: Extracts text from PDF files and converts to Documents
# - TextFileToDocument: Reads plain text files and converts to Documents
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument

# Import document preprocessors:
# - DocumentSplitter: Splits large documents into smaller chunks
# - DocumentCleaner: Cleans and normalizes document content
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner

# Import FileTypeRouter: Routes files to appropriate converters based on MIME type
from haystack.components.routers import FileTypeRouter

# Import DocumentJoiner: Merges documents from multiple sources into a single sequence
from haystack.components.joiners import DocumentJoiner

# Import SentenceTransformersDocumentEmbedder: Generates embeddings for documents using sentence-transformers
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

# Import Pipeline: For creating and running document processing pipelines
from haystack import Pipeline

# Import InMemoryDocumentStore: Lightweight document storage that keeps data in memory
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Initialize an in-memory document store for temporary storage of documents
# Note: Data will be lost when the program ends since it's in-memory
document_store = InMemoryDocumentStore()

# Create a file type router that can identify and route different file types:
# - text/plain: Plain text files
# - application/pdf: PDF documents
# - text/markdown: Markdown files
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])

# Initialize converters for different file types:
text_file_converter = TextFileToDocument()  # For plain text files
markdown_converter = MarkdownToDocument()   # For markdown files
pdf_converter = PyPDFToDocument()           # For PDF files

# Initialize document joiner to merge documents from different converters
document_joiner = DocumentJoiner()

From there, the steps of this indexing pipeline are a bit more standard. The `DocumentCleaner` removes extra whitespace. Then the `DocumentSplitter` breaks the documents into chunks of 150 words, with an overlap of 50 words so that context is not lost at chunk boundaries.
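
The following sketch illustrates what cleaning and overlapping word-based splitting mean in practice. It is a simplified stand-in written with the standard library, not the code Haystack runs: `clean_text` and `split_words` are hypothetical helpers, and edge-case behavior (for example, how the final short chunk is handled) is an assumption.

```python
import re


def clean_text(text):
    """Collapse runs of spaces/tabs and drop empty lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def split_words(text, split_length=150, split_overlap=50):
    """Split text into chunks of `split_length` words that share
    `split_overlap` words with the previous chunk."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(400))
chunks = split_words(clean_text(sample))
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```

With a window of 150 words and an overlap of 50, each new chunk starts 100 words after the previous one, so the last 50 words of one chunk reappear as the first 50 words of the next.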

In [None]:
# Initialize DocumentCleaner: A component that cleans up document content
# With its default settings it removes:
# - Empty lines
# - Extra whitespace
# Other cleanup steps are opt-in via constructor arguments
# Helps ensure consistent document processing downstream
document_cleaner = DocumentCleaner()

# Initialize DocumentSplitter: Splits documents into smaller chunks for processing
# Configuration:
# - split_by="word": Splits documents based on word count
# - split_length=150: Creates chunks of approximately 150 words each
# - split_overlap=50: Maintains 50 words overlap between chunks to preserve context
# This is particularly important for:
# - Language models with limited context windows
# - Maintaining semantic relationships across chunks
# - Preventing information loss at chunk boundaries
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)

Now you'll add a `SentenceTransformersDocumentEmbedder` to create embeddings from the documents. As the last step in this pipeline, the `DocumentWriter` will write them to the `InMemoryDocumentStore`.


In [None]:
# Initialize SentenceTransformersDocumentEmbedder: 
# This component generates vector embeddings for documents using a specified sentence-transformers model
# - model="sentence-transformers/all-MiniLM-L6-v2": Uses the MiniLM-L6-v2 model which is:
#   * A lightweight but powerful sentence embedding model
#   * 384-dimensional embeddings
#   * Good balance between performance and speed
#   * Trained on large datasets for general-purpose semantic understanding
# Embeddings enable semantic search and retrieval by converting text to numerical vectors
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Initialize DocumentWriter:
# This component writes processed documents to the specified document store
# - document_store: The InMemoryDocumentStore instance created earlier
# Each Document is stored as a whole, so its content, metadata,
# and embedding are written together
# This is typically the final step in an indexing pipeline
document_writer = DocumentWriter(document_store)

After creating all the components, add them to the indexing pipeline.

In [None]:
# Initialize the preprocessing pipeline - this will handle the complete document processing workflow
# The pipeline will process files through multiple stages:
# 1. File type identification
# 2. File conversion
# 3. Document cleaning and splitting
# 4. Embedding generation
# 5. Storage in document store
preprocessing_pipeline = Pipeline()

# Add file type router as first component:
# - Routes files to appropriate converters based on MIME type
# - Acts as the entry point for different file types
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")

# Add file converters for different file types:
# - Each converter handles a specific file format
# - Each converter only receives the files that the router sends to its branch
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")  # Handles .txt files
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")  # Handles .md files
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")  # Handles .pdf files

# Add document joiner:
# - Merges documents from all converters into a single stream
# - Ensures uniform processing regardless of original file type
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")

# Add document cleaner:
# - Standardizes document format and cleans content
# - Removes extra whitespace, normalizes text, etc.
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")

# Add document splitter:
# - Splits large documents into manageable chunks
# - Uses word-based splitting with overlap (as configured earlier)
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")

# Add document embedder:
# - Generates vector embeddings for each document chunk
# - Uses the specified sentence-transformers model
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")

# Add document writer as final component:
# - Stores processed documents in the document store
# - Includes both content and generated embeddings
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

Next, connect them 👇

In [None]:
# Connect the file type router outputs to their respective converters:
# - Plain text files (.txt) get routed to the text file converter
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
# - PDF files get routed to the PDF converter
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
# - Markdown files (.md) get routed to the markdown converter
preprocessing_pipeline.connect("file_type_router.text/markdown", "markdown_converter.sources")

# Connect all converters to the document joiner:
# This merges the output from all file type converters into a single document stream
# regardless of their original file format
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("markdown_converter", "document_joiner")

# Connect the joiner to the document cleaner:
# The unified document stream now goes through cleaning/normalization
preprocessing_pipeline.connect("document_joiner", "document_cleaner")

# Connect the cleaner to the document splitter:
# Cleaned documents are split into appropriately sized chunks
preprocessing_pipeline.connect("document_cleaner", "document_splitter")

# Connect the splitter to the document embedder:
# Each document chunk gets converted to a vector embedding
preprocessing_pipeline.connect("document_splitter", "document_embedder")

# Connect the embedder to the document writer:
# Final step stores both the document content and its embedding in the document store
preprocessing_pipeline.connect("document_embedder", "document_writer")

Let's test this pipeline with a few articles.

In [None]:
# Import Path from pathlib for cross-platform path operations
from pathlib import Path

# Define the directory containing documents to process
# 'articles' is the folder where input documents are stored
# This should contain various file types (txt, pdf, md) in any subdirectory structure
output_dir = 'articles'

# Execute the preprocessing pipeline by:
# 1. Using Path(output_dir).glob("**/*") to:
#    - Recursively find all files in 'articles' and its subdirectories
#    - Return Path objects for each file found
# 2. Converting the Path objects to a list to feed into the pipeline
# 3. Running the pipeline with these files as input to the file_type_router component
preprocessing_pipeline.run({
    "file_type_router": {
        "sources": list(Path(output_dir).glob("**/*"))  # Process all files recursively
    }
})

## (Optional) Build a pipeline to query documents

Now, let's build a RAG pipeline that answers queries based on the documents you just indexed in the section above. For this step, we will be using the [`MistralChatGenerator`](https://docs.haystack.deepset.ai/docs/mistralchatgenerator), so you must have a [Mistral AI](https://mistral.ai/) API key for this section. The cell below also reads a [Hugging Face API Key](https://huggingface.co/settings/tokens), which is used only by the commented-out `HuggingFaceAPIChatGenerator` alternative.

In [None]:
# Import required modules
import os
from getpass import getpass  # For prompting without echoing the secret
from dotenv import load_dotenv  # For loading environment variables from a .env file

# Load variables from a .env file (if one exists) into os.environ.
# This is typically used for:
# - API keys and secrets
# - Configuration settings
# - Sensitive credentials that shouldn't be hardcoded
# Variables that are already set in the environment are not overridden.
load_dotenv()

# MISTRAL_API_KEY is read by MistralChatGenerator below.
# HF_API_TOKEN is used by the HuggingFaceAPIChatGenerator alternative.
# Prompt for any key that is still missing. Assigning the result of
# os.getenv() for a missing key would put None into os.environ and
# raise a TypeError.
for key in ("HF_API_TOKEN", "MISTRAL_API_KEY"):
    if not os.environ.get(key):
        os.environ[key] = getpass(f"Enter {key}: ")

In this step you'll build a query pipeline to answer questions about the documents.

This pipeline takes the prompt, searches the document store for relevant documents, and passes those documents along to the LLM to formulate an answer.
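
As a rough illustration of the last step, the sketch below assembles a user prompt from retrieved chunks in plain Python. It imitates the structure of the template used in the next cell but is not how `ChatPromptBuilder` works internally: `build_user_prompt` is a hypothetical helper and the example chunks are made up.

```python
def build_user_prompt(documents, question):
    """Place the retrieved chunks above the question, one chunk per line."""
    context = "\n".join(f"    {content}" for content in documents)
    return (
        "Answer the questions based on the given context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the retriever's output
retrieved = ["first chunk", "second chunk"]
print(build_user_prompt(retrieved, "What is indexed?"))
```

The LLM only sees what ends up in this prompt, which is why the quality of the retrieved chunks bounds the quality of the answer.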

> ⚠️ Notice how we used `sentence-transformers/all-MiniLM-L6-v2` to create embeddings for our documents before. Query and document embeddings are only comparable when they come from the same model, so we will use the same model to embed incoming questions.
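
The toy example below shows why this matters: retrieval ranks documents by how close their vectors are to the query vector, which is only meaningful if both live in the same vector space. The three-dimensional vectors and document ids are invented for illustration, and cosine similarity is one common choice, assumed here rather than taken from the retriever's configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, doc_embeddings, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), doc_id)
        for doc_id, embedding in doc_embeddings.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Invented toy embeddings; real ones from all-MiniLM-L6-v2 have 384 dimensions
doc_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 0.2, 0.9],
    "doc_c": [0.8, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]
print(top_k(query_embedding, doc_embeddings))
```

If the query were embedded with a different model, its vector would have no defined relationship to the stored document vectors, and the ranking would be arbitrary.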

In [None]:
# Import required components for building a RAG (Retrieval-Augmented Generation) pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder  # For embedding user queries
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever  # For retrieving relevant documents
from haystack.components.builders import ChatPromptBuilder  # For constructing the LLM prompt
from haystack.dataclasses import ChatMessage  # For structured chat messages
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator  # HF chat generator (alternative)
from haystack_integrations.components.generators.mistral import MistralChatGenerator  # Mistral chat generator
from haystack.utils import Secret  # For secure API key handling
from haystack import Pipeline  # For building the processing pipeline

# Define the chat prompt template with system and user messages
template = [
    # System message defining the assistant's behavior and response format
    ChatMessage.from_system("""
You are a helpful assistant that answers questions in German language only. Your goal is to provide accurate and detailed answers based exclusively on the specified context. If the answer to a question is not found in the context, you must respond with: "Ich weiß es nicht" (I don't know). 

When answering, always include the following details:
1. Dokumentenindex: Specify the document number or identifier.
2. Zeilen: Reference the exact line numbers where the information is located.
3. Detailed Explanation: Provide a thorough and clear explanation in German, ensuring the response is based solely on the provided context.

Answer Format:
[Dokumentenindex : {} | Zeilen : {}] {detailed answer in German}

Rules:
1. Do not make assumptions or provide information outside the given context.
2. If the context does not contain the answer, respond with: "Ich weiß es nicht".
3. Always answer in German and maintain a formal and professional tone.
4. Do not include source citations, annotations, or any additional information outside the specified format.
    """),
    
    # User message template that will be filled with context and question
    ChatMessage.from_user(
        """
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
    )
]

# Initialize the processing pipeline
pipe = Pipeline()

# Add components to the pipeline:
# 1. Embedder - converts text queries to embeddings using MiniLM model
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))

# 2. Retriever - finds relevant documents using embeddings
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))

# 3. Prompt Builder - constructs the LLM prompt using our template
pipe.add_component("chat_prompt_builder", ChatPromptBuilder(template=template))

# 4. LLM Generator - Mistral chat model (currently active)
# Alternative HuggingFace generator is commented out
pipe.add_component(
    "llm",
    MistralChatGenerator(api_key=Secret.from_env_var("MISTRAL_API_KEY")),
    # Alternative option:
    # HuggingFaceAPIChatGenerator(
    #     api_type="serverless_inference_api", 
    #     api_params={"model": "HuggingFaceH4/zephyr-7b-beta"}
    # ),
)

# Connect the pipeline components:
# 1. Query embedding to retriever
pipe.connect("embedder.embedding", "retriever.query_embedding")

# 2. Retrieved documents to prompt builder
pipe.connect("retriever", "chat_prompt_builder.documents")

# 3. Constructed prompt to LLM
pipe.connect("chat_prompt_builder.prompt", "llm.messages")

# Interactive usage example (commented out):
# This block would enable a question-answering loop in the console
# while True:
#     question = input("Ask a question: ")
#     answer = pipe.run({
#         "embedder": {"text": question}, 
#         "chat_prompt_builder": {"question": question}
#     })
#     print(answer['llm']['replies'][0].text)

## Creating the API endpoint to communicate with the bot

In [None]:
# Import required Flask modules and supporting libraries
from flask import Flask, request, jsonify  # Flask web framework components
from flask_cors import CORS  # Cross-Origin Resource Sharing support
import threading  # For running Flask in a separate thread
import time  # For sleep operations

# Initialize the Flask application
app = Flask(__name__)

# Enable CORS for all routes to allow frontend connections
CORS(app)

# Define the port number for the Flask server
port = 5800

# Define the main chat API endpoint (POST only; Flask rejects other methods with 405)
@app.route("/api", methods=['POST'])
def chat():
    # Parse the JSON body; silent=True returns None instead of raising on invalid JSON
    data = request.get_json(silent=True)

    # Validate that the request carries a non-empty "message" field
    if not isinstance(data, dict) or not data.get('message'):
        return jsonify({"errors": ['Request body must be JSON with a "message" field']}), 400

    # Get the user's question from the message field
    question = data['message']

    try:
        # Run the question through the Haystack pipeline
        answer = pipe.run({
            "embedder": {"text": question},
            "chat_prompt_builder": {"question": question}
        })
    except Exception as e:
        # Pipeline failures are server-side errors, not bad requests
        return jsonify({"errors": [str(e)]}), 500

    # Return the LLM's first reply
    return jsonify({"data": answer['llm']['replies'][0].text}), 200

def run_flask():
    """Function to run the Flask server with configuration"""
    print(f"Flask app is running on http://127.0.0.1:{port}/")
    app.run(port=port, debug=False, use_reloader=False)

# Global variable to track the Flask thread
flask_thread = None

# Main entry point
if __name__ == '__main__':
    # Start the Flask app in a separate thread to allow for graceful shutdown
    
    # Create and configure the thread
    flask_thread = threading.Thread(target=run_flask)
    flask_thread.daemon = True  # Daemonize thread (will exit when main program exits)
    
    # Start the Flask server thread
    flask_thread.start()

    try:
        # Keep the main thread alive while Flask runs in background
        while True:
            time.sleep(1)  # Prevent CPU overload
    except KeyboardInterrupt:
        # Handle Ctrl+C gracefully
        print("Shutting down Flask app...")
        # Daemon thread will automatically terminate with main program
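
To try the endpoint, a client can POST a JSON body with a `message` field and read the `data` field of the response. The sketch below uses only the standard library and assumes the server from the cell above is running locally on port 5800. `build_chat_request` and `ask` are hypothetical helper names. Because the cell above keeps its main thread busy, run the client from a separate process.

```python
import json
import urllib.request

# Address of the Flask app as configured above (port 5800, route /api)
API_URL = "http://127.0.0.1:5800/api"


def build_chat_request(question, url=API_URL):
    """Build a POST request carrying the question as JSON."""
    payload = json.dumps({"message": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, url=API_URL, timeout=60):
    """Send the question to the running server and return the answer text."""
    request = build_chat_request(question, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["data"]


# Building the request does not contact the server; ask() does.
request = build_chat_request("Worum geht es in den Dokumenten?")
print(request.get_method(), request.full_url)
```

Calling `ask("...")` while the server is running returns the text of the LLM's reply. If the server responds with an error status, `urllib` raises an `HTTPError` whose body contains the `errors` list.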