# RAG Load & Split

### Indexing

- Load: First we load our data. In this notebook, PDF files are converted to markdown and read alongside any existing markdown files.

- Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it to a model, since large chunks are harder to search over and may not fit in a model's finite context window.

- Store: We need somewhere to store and index our splits so that they can be searched later. This is often done using a VectorStore and an Embeddings model.

- Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.

- Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.
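
The Split step above can be sketched in plain Python to show what chunk size and chunk overlap mean. This is only an illustration: `split_with_overlap` is a hypothetical helper, not part of LangChain, and it cuts on fixed character offsets, whereas `RecursiveCharacterTextSplitter` (used later in this notebook) also tries to break on separators such as paragraphs and lines.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 128) -> List[str]:
    """Cut text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.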

In [1]:
# !pip install --upgrade pymupdf
# !pip install pymupdf4llm

In [2]:
import os
import pathlib
from typing import Generator
import pymupdf4llm
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter


In [3]:
class DocLoader:
    """
    A class to load and process documents from a specified directory.
    
    Attributes:
        path (str): The path to the directory containing the documents.
        chunk_size (int): The size of the text chunks to split the documents into.
        chunk_overlap (int): The amount of overlap between text chunks.
        enable_logging (bool): Flag to show progress of loading documents.
        docs (Generator): A generator yielding (file, documents, metadatas) tuples, one per markdown file.
        file_count (int): The number of markdown files found under path.
    """
    
    def __init__(self, path: str, chunk_size: int = 512, chunk_overlap: int = 128, enable_logging: bool = True):
        """
        Initializes the DocLoader with the specified path and loads the documents.
        
        Args:
            path (str): The path to the directory containing the documents.
            chunk_size (int): The size of the text chunks to split the documents into.
            chunk_overlap (int): The amount of overlap between text chunks.
            enable_logging (bool): Flag to show progress of loading documents.
        """
        self.path = path
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.enable_logging = enable_logging
        if self.enable_logging:
            print(f"\nLoading files from {self.path}")
        self.docs = self.load()
        self.file_count = self.count_total_markdown_files()

    def load(self) -> Generator:
        """
        Loads the documents by processing PDFs and loading markdown files.
        
        Returns:
            Generator: A generator yielding processed markdown documents.
        """
        self.process_pdfs()
        return self.load_markdown_files()

    def process_pdfs(self):
        """
        Converts PDF files in the specified directory and all subdirectories to markdown format.
        """
        if self.enable_logging:
            print("\nProcessing PDFs")
        for root, _, files in os.walk(self.path):
            for file in files:
                if file.endswith(".pdf"):
                    base_name = os.path.splitext(file)[0]
                    md_file_path = os.path.join(root, f"{base_name}_pdf_converted.md")
                    if not os.path.exists(md_file_path):
                        md_text = pymupdf4llm.to_markdown(os.path.join(root, file))
                        pathlib.Path(md_file_path).write_bytes(md_text.encode())
                        # Log only after the markdown file has actually been written.
                        if self.enable_logging:
                            print(f"Converted {os.path.join(root, file)} to markdown")
                    else:
                        if self.enable_logging:
                            print(f"Skipping {os.path.join(root, file)} as it has already been converted to markdown")

    def count_total_markdown_files(self) -> int:
        """
        Counts the total number of markdown files in the specified directory and all subdirectories.
        
        Returns:
            int: The total number of markdown files.
        """
        count = 0
        for root, _, files in os.walk(self.path):
            for file in files:
                if file.endswith(".md"):
                    count += 1
        return count
    
    def load_markdown_files(self) -> Generator:
        """
        Loads markdown files from the specified directory and all subdirectories.
        
        Yields:
            tuple: (file, documents, metadatas) for each markdown file, as returned by load_markdown.
        """
        for root, _, files in os.walk(self.path):
            for file in files:
                if file.endswith(".md"):
                    yield self.load_markdown(os.path.join(root, file), self.chunk_size, self.chunk_overlap)

    @staticmethod
    def load_markdown(file: str, chunks=512, chunk_overlap=128) -> tuple:
        """
        Loads and splits a markdown file into headers and text chunks.
        
        Args:
            file (str): The path to the markdown file.
            chunks (int): The size of the text chunks to split the documents into.
            chunk_overlap (int): The amount of overlap between text chunks.
        
        Returns:
            tuple: A tuple containing the file name, text chunks, and metadata.
        """
        # Converted PDFs are written as UTF-8 bytes, so read with the same encoding.
        with open(file, 'r', encoding='utf-8') as f:
            md_text = f.read()

        headers_to_split_on = [
            ("#", "Header 1"),
            ("##", "Header 2"),
            ("###", "Header 3"),
            ("####", "Header 4"),
            ("#####", "Header 5"),
        ]
        markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
        md_header_splits = markdown_splitter.split_text(md_text)
        
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunks, chunk_overlap=chunk_overlap)
        # Use a separate name so the integer `chunks` argument is not overwritten.
        split_docs = text_splitter.split_documents(md_header_splits)

        documents = [doc.page_content for doc in split_docs]
        for doc in split_docs:
            doc.metadata.update({"file": file})
        metadatas = [doc.metadata for doc in split_docs]

        return file, documents, metadatas
    
    @staticmethod
    def load_single_file(file: str, chunks=512, chunk_overlap=128) -> tuple:
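        """
        Loads and splits a single PDF or markdown file.

        A PDF is first converted to markdown (unless a converted copy already exists)
        and then passed to load_markdown.

        Raises:
            ValueError: If the file is neither a PDF nor a markdown file.
        """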
        if file.endswith(".pdf"):
            base_name = os.path.splitext(file)[0]
            md_file_path = f"{base_name}_pdf_converted.md"
            if not os.path.exists(md_file_path):
                md_text = pymupdf4llm.to_markdown(file)
                pathlib.Path(md_file_path).write_bytes(md_text.encode())
            return DocLoader.load_markdown(md_file_path, chunks, chunk_overlap)
        elif file.endswith(".md"):
            return DocLoader.load_markdown(file, chunks, chunk_overlap)
        else:
            raise ValueError("File must be a PDF or markdown file")
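
`load_markdown` splits in two stages: first on markdown headers, then by size. The first stage can be pictured with the simplified sketch below. It is a stand-in written for illustration, not the `MarkdownHeaderTextSplitter` implementation: `split_on_headers` is a hypothetical helper, it drops the header lines from the section text (the notebook keeps them via `strip_headers=False`), and it does not treat fenced code blocks specially.

```python
from typing import Dict, List, Tuple


def split_on_headers(md_text: str, max_level: int = 5) -> List[Tuple[Dict[str, str], str]]:
    """Group markdown lines into (header metadata, section text) pairs."""
    sections: List[Tuple[Dict[str, str], str]] = []
    current_headers: Dict[str, str] = {}
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one section under the headers seen so far.
        body = "\n".join(buffer).strip()
        if body:
            sections.append((dict(current_headers), body))
        buffer.clear()

    for line in md_text.splitlines():
        stripped = line.strip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        is_header = 1 <= hashes <= max_level and stripped[hashes:hashes + 1] == " "
        if is_header:
            flush()
            # A new header replaces any header at the same or a deeper level.
            current_headers = {
                key: value
                for key, value in current_headers.items()
                if int(key.split()[-1]) < hashes
            }
            current_headers[f"Header {hashes}"] = stripped[hashes:].strip()
        else:
            buffer.append(line)
    flush()
    return sections


sample = "# A\nintro\n## B\nbody b\n# C\nbody c"
for metadata, text in split_on_headers(sample):
    print(metadata, "->", text)
```

Each section carries the headers it sits under as metadata, which is the information the size-based second stage then propagates to every chunk.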

In [4]:
dl = DocLoader("doc")
for doc in dl.docs:
    print(doc[0])


Loading files from doc

Processing PDFs
Skipping doc/example_doc.pdf as it has already been converted to markdown
Skipping doc/subdir/subdir_example_doc.pdf as it has already been converted to markdown
doc/example_doc_pdf_converted.md
doc/example_doc.md
doc/subdir/subdir_example_doc.md
doc/subdir/subdir_example_doc_pdf_converted.md
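
Because `dl.docs` is a generator, the loop above consumes it, and a second pass over the same object would yield nothing. The following sketch shows that behaviour with a hypothetical stand-in generator (`fake_docs` and its file names are made up for illustration); it is presumably why the loader is created again in the next cell.

```python
from typing import Dict, Generator, List, Tuple


def fake_docs() -> Generator[Tuple[str, List[str], List[Dict[str, str]]], None, None]:
    """Hypothetical stand-in for DocLoader.docs: yields (file, documents, metadatas)."""
    for name in ("a.md", "b.md"):
        yield name, [f"chunk of {name}"], [{"file": name}]


docs = fake_docs()
first_pass = [file for file, _, _ in docs]
second_pass = [file for file, _, _ in docs]  # the generator is already exhausted

print(first_pass)   # ['a.md', 'b.md']
print(second_pass)  # []
```

If the chunks are needed more than once, one option is to materialize them with `list(...)`, at the cost of holding every file's chunks in memory.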


In [5]:
dl = DocLoader("doc")


Loading files from doc

Processing PDFs
Skipping doc/example_doc.pdf as it has already been converted to markdown
Skipping doc/subdir/subdir_example_doc.pdf as it has already been converted to markdown


In [6]:
dl.file_count

4

In [7]:
file, documents, metadatas = next(dl.docs)
print("current file:", file)

current file: doc/example_doc_pdf_converted.md


In [8]:
documents

['**Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs)**  \n**1. Introduction**  \nRetrieval-Augmented Generation (RAG) is a powerful approach that combines the strengths of\ninformation retrieval and generative models. By integrating a retrieval mechanism with a large  \nlanguage model (LLM), RAG can provide more accurate and contextually relevant responses, especially\nin knowledge-intensive tasks.  \n**2. Objectives**  \nEnhance the accuracy of responses generated by LLMs.',
 'in knowledge-intensive tasks.  \n**2. Objectives**  \nEnhance the accuracy of responses generated by LLMs.  \nProvide up-to-date information by retrieving relevant documents from a knowledge base.  \nImprove user experience by delivering contextually rich and informative answers.  \n**3. Components**  \n3.1 Large Language Model (LLM)  \nAn LLM, such as GPT-4, is responsible for generating human-like text based on input prompts. It can',
 'An LLM, such as GPT-4, is responsible for generating

In [9]:
metadatas

[{'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'},
 {'file': 'doc/example_doc_pdf_converted.md'}]
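
The `documents` and `metadatas` lists are parallel: entry `i` of one describes entry `i` of the other. Before the Store step, they could be zipped into one record per chunk, as in the sketch below. The `to_records` helper and the `"<file>::<index>"` id scheme are assumptions made for illustration; the notebook does not prescribe a storage format, and the sample values are placeholders.

```python
from typing import Dict, List


def to_records(file: str, documents: List[str], metadatas: List[Dict]) -> List[Dict]:
    """Pair each chunk with its metadata and a per-file id (hypothetical id scheme)."""
    if len(documents) != len(metadatas):
        raise ValueError("documents and metadatas must have the same length")
    return [
        {"id": f"{file}::{index}", "text": text, "metadata": metadata}
        for index, (text, metadata) in enumerate(zip(documents, metadatas))
    ]


records = to_records(
    "doc/example.md",
    ["first chunk", "second chunk"],
    [{"file": "doc/example.md"}, {"file": "doc/example.md"}],
)
print(records[1]["id"])  # doc/example.md::1
```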

In [13]:
dl.load_single_file("doc/example_doc.pdf")

('doc/example_doc_pdf_converted.md',
 ['**Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs)**  \n**1. Introduction**  \nRetrieval-Augmented Generation (RAG) is a powerful approach that combines the strengths of\ninformation retrieval and generative models. By integrating a retrieval mechanism with a large  \nlanguage model (LLM), RAG can provide more accurate and contextually relevant responses, especially\nin knowledge-intensive tasks.  \n**2. Objectives**  \nEnhance the accuracy of responses generated by LLMs.',
  'in knowledge-intensive tasks.  \n**2. Objectives**  \nEnhance the accuracy of responses generated by LLMs.  \nProvide up-to-date information by retrieving relevant documents from a knowledge base.  \nImprove user experience by delivering contextually rich and informative answers.  \n**3. Components**  \n3.1 Large Language Model (LLM)  \nAn LLM, such as GPT-4, is responsible for generating human-like text based on input prompts. It can',
  'An LLM, such