In [1]:
%%capture
%pip install llama-index llama-index-readers-smart-pdf-loader pymupdf llmsherpa

Note, you will need to install the following before running this notebook:

`pip install llama-index-readers-smart-pdf-loader`

`pip install pymupdf`

`pip install llmsherpa`


In [2]:
import os
import sys
import getpass
import nest_asyncio
import fitz
from dotenv import load_dotenv 

nest_asyncio.apply()

load_dotenv()

sys.path.append('../helpers')

from text_cleaning_helpers import clean

import re

# Strip bracketed numeric citations such as "[8]" from the text
remove_citations = lambda text: re.sub(r"\[\d{1,3}\]", "", text)


# Data Preparation and Cleaning for RAG

Your RAG system is only as good as the data you retrieve. 

That's why data preparation and cleaning are important steps to ensure high-quality results. **This course purposefully uses simple PDF files, specifically books, to demonstrate the process.** Real-world PDFs and other documents can be much more complex and may require additional processing and cleaning techniques. There's enough to data preparation for RAG that I could write another two-hour course on that topic alone.

### Considerations for data prep

- 📜 **Document Content**: Utilize text from documents for keyword searches or to find similar content in RAG applications.

- 📑 **Document Elements**: Break down documents into fundamental parts to assist in RAG tasks such as filtering and segmenting. Typical elements include:
  - Titles
  - Narrative text
  - List items
  - Tables
  - Images

- 🏷 **Element Metadata**: Provide additional details for each document element to support hybrid search and track information origin, such as:
  - Filename
  - Filetype
  - Page number
  - Section

- 🔄 **Summary**: Document preprocessing for retrieval systems means transforming documents into searchable elements and metadata.
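
To make the idea of elements plus metadata concrete, here is a minimal sketch. The `DocumentElement` class, its field names, and the sample values are illustrative assumptions, not part of LlamaIndex or of this notebook's pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentElement:
    """One piece of a document plus metadata describing where it came from."""

    text: str
    element_type: str  # e.g. "Title", "NarrativeText", "ListItem", "Table"
    metadata: dict = field(default_factory=dict)


# Illustrative values only; a real parser would fill these in.
source = {"filename": "almanack_of_naval_ravikant.pdf", "filetype": "pdf", "page_number": 101}

elements = [
    DocumentElement("SHED YOUR IDENTITY TO SEE REALITY", "Title", dict(source)),
    DocumentElement("Our egos are constructed in our formative years.", "NarrativeText", dict(source)),
]

# Element types and metadata make filtering cheap: keep only narrative text from PDFs.
narrative = [
    e for e in elements
    if e.element_type == "NarrativeText" and e.metadata["filetype"] == "pdf"
]
print(narrative[0].text)
```

Keeping the metadata next to each element is what later allows hybrid search (filter on metadata, then rank by similarity) and lets an answer be traced back to its page.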



#### Let's inspect our PDFs

**Now that the disclaimer is out of the way, let's work with the PDFs that we have.**

In [3]:
PDF_PATH = "../data/almanack_of_naval_ravikant.pdf"

#LLMSHERPA_API_URL = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"

In [4]:
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PDFReader
from llama_index.readers.smart_pdf_loader import SmartPDFLoader

simple_directory_reader_docs = SimpleDirectoryReader(input_files=[PDF_PATH]).load_data()

#smart_pdf_loader_docs = SmartPDFLoader(llmsherpa_api_url=LLMSHERPA_API_URL).load_data(PDF_PATH)

pdf_reader_docs = PDFReader().load_data(PDF_PATH)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /opt/conda/envs/llama/lib/python3.13/site-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Package punkt_tab is already up-to-date!


In [5]:
len(simple_directory_reader_docs)

242

In [6]:
print(simple_directory_reader_docs[100].get_content())

BUILDING  JUDGMENT ·  101SHED YOUR IDENTITY TO SEE REALITY
Our egos are constructed in our formative years—our first 
two decades. They get constructed by our environment, our 
parents, society. Then, we spend the rest of our life trying to 
make our ego happy. We interpret anything new through our 
ego: “How do I change the external world to make it more how 
I would like it to be?” [8]
“Tension is who you think you should be.  
Relaxation is who you are.”
—Buddhist saying
You absolutely need habits to function. You cannot solve every 
problem in life as if it is the first time it’s thrown at you. We 
accumulate all these habits. We put them in the bundle of 
identity, ego, ourselves, and then we get attached to them. “I’m 
Naval. This is the way I am.”
It’ s really important to be able to uncondition yourself, to be 
able to take your habits apart and say, “Okay, this is a habit I 
probably picked up when I was a toddler trying to get my par-
ent’s attention. Now I’ve reinforced it a

In [None]:
#len(smart_pdf_loader_docs)

In [None]:
#print(smart_pdf_loader_docs[100].get_content())

In [7]:
len(pdf_reader_docs)

242

In [8]:
print(pdf_reader_docs[100].get_content())

BUILDING  JUDGMENT ·  101SHED YOUR IDENTITY TO SEE REALITY
Our egos are constructed in our formative years—our first 
two decades. They get constructed by our environment, our 
parents, society. Then, we spend the rest of our life trying to 
make our ego happy. We interpret anything new through our 
ego: “How do I change the external world to make it more how 
I would like it to be?” [8]
“Tension is who you think you should be.  
Relaxation is who you are.”
—Buddhist saying
You absolutely need habits to function. You cannot solve every 
problem in life as if it is the first time it’s thrown at you. We 
accumulate all these habits. We put them in the bundle of 
identity, ego, ourselves, and then we get attached to them. “I’m 
Naval. This is the way I am.”
It’ s really important to be able to uncondition yourself, to be 
able to take your habits apart and say, “Okay, this is a habit I 
probably picked up when I was a toddler trying to get my par-
ent’s attention. Now I’ve reinforced it a

In [9]:
pdf_reader_docs[100].text == simple_directory_reader_docs[100].text

True

In [10]:
document = fitz.open(PDF_PATH)

def extract_text(page, opt="text"):
    '''Extract text from a page and return it as a list of lines'''
    text = page.get_text(opt, sort=True)
    text = text.split("\n")
    return text

pages = [extract_text(page) for page in document]

In [11]:
pages[42] 

['I have some sales skills, which is a form of specific knowledge.',
 'I have some analytical skills on how to make money. And I',
 'have this ability to absorb data, obsess about it, and break it',
 'down—that is a specific skill that I have. I also love tinkering',
 'with technology. And all of this stuff feels like play to me, but',
 'it looks like work to others.',
 '',
 'There are other people to whom these things would be hard,',
 'and they say, “Well, how do I get good at being pithy and sell-',
 'ing ideas?” Well, if you’re not already good at it or if you’re not',
 'really into it, maybe it’s not your thing—focus on the thing',
 'that you are really into.',
 '',
 'The first person to actually point out my real specific knowl-',
 'edge was my mother. She did it as an aside, talking from the',
 'kitchen, and she said it when I was fifteen or sixteen years old.',
 'I was telling a friend of mine that I want to be an astrophysi-',
 'cist, and she said, “No, you’re going to go into

In [12]:
def get_document(file_path, pages=None):
    """
    Opens a PDF file and optionally selects specific pages to create a document object.

    This function uses the `fitz` library (PyMuPDF) to open the PDF file located at `file_path`.
    If a list of `pages` is provided, the document is restricted to those pages, so that
    later iteration only visits the selected part of the PDF.

    Parameters:
        file_path (str): The path to the PDF file to be opened.
        pages (list of int, optional): Zero-based page numbers to select from the PDF.
            If `None`, the entire document is kept.

    Returns:
        fitz.Document: The opened document, restricted to `pages` if given.
    """
    document = fitz.open(file_path)
    if pages is not None:
        document.select(pages)  # Select specific pages if pages are provided
    return document


def handle_chapter_headers_footers(strings, flag):
    """
    Modify a list of strings based on a specified flag and join them into a single string.

    This function first removes any empty strings from the input list. It then checks if the
    remaining list has more than three elements. If so, it modifies the list by removing the
    first element, last element, or both, based on the value of the flag. The final list is then
    joined into a single string with spaces separating the elements.

    Parameters:
        strings (list of str): The list of strings to modify.
        flag (str): A flag indicating the modification to perform on the list:
            - 'remove_first': Remove the first element of the list.
            - 'remove_last': Remove the last element of the list.
            - 'remove_first_last': Remove both the first and last elements of the list.
            - 'remove_first_two': Remove the first two elements of the list.
            - Any other value leaves the list unchanged.

    Returns:
        str: A single string composed of the modified list elements, separated by spaces.
    """
    # Filter out empty strings
    filtered_strings = [s for s in strings if s]
    
    # Check if the filtered list has more than three elements
    if len(filtered_strings) > 3:
        if flag == 'remove_first':
            filtered_strings = filtered_strings[1:]  # Slice off the first element
        elif flag == 'remove_last':
            filtered_strings = filtered_strings[:-1]  # Slice off the last element
        elif flag == 'remove_first_last':
            filtered_strings = filtered_strings[1:-1]  # Slice off the first and last elements
        elif flag == 'remove_first_two':
            filtered_strings = filtered_strings[2:]  # Slice off the first two elements
    
    # Join all strings with a space and return the result
    return ' '.join(filtered_strings).strip()

def extract_text(page, file_name, title, author, flag, opt="text"):
    """
    Extracts text from a specified page of a document and returns a dictionary containing
    the extracted text and associated metadata.

    The function first retrieves text from the given `page` object using the specified `opt` method.
    It then processes this text to remove chapter headers, footers, and applies various cleaning
    procedures according to the `flag` and other parameters set in the `clean` function.

    Parameters:
        page (fitz.Page): The page object from which to extract text.
        file_name (str): The name of the file from which the page is taken.
        title (str): The title of the document.
        author (str): The author of the document.
        flag (str): A flag used to customize how chapter headers and footers are handled.
        opt (str, optional): The method of text extraction to be used by `get_text`.
            Defaults to "text", but can be changed to other methods supported by the library.

    Returns:
        dict: A dictionary with two keys:
            - 'text': A string containing the cleaned and processed text from the page.
            - 'metadata': A dictionary containing metadata about the text, including the
                          page number, file name, title, and author.
    """
    
    text = page.get_text(opt, sort=True)

    text = text.split("\n")

    text = handle_chapter_headers_footers(text, flag)

    text = clean(
        text,
        extra_whitespace=True,
        broken_paragraphs=True,
        bullets=True,
        ascii=True,
        lowercase=False,
        citations=True,
        merge_split_words=True,
    )

    return {
        "text": text,
        "metadata": {
            # After Document.select(), page.number is the zero-based index within the selected pages
            "page_number": page.number,
            "file_name": file_name,
            "title": title,
            "author": author
        }
    }

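def extract_texts_from_pdf(file_path, title, author, pages, flag):
    """Extract cleaned text and metadata for each selected page of one PDF."""
    document = get_document(file_path, pages)
    # The full path is stored under `file_name` in the metadata;
    # pass os.path.basename(file_path) instead to store only the file's name.
    extracted_texts = [extract_text(page, file_path, title, author, flag) for page in document]
    return extracted_texts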

In [13]:
pdf_files = [
    {
        "file_path": "../data/almanack_of_naval_ravikant.pdf", 
        "title": "The Almanack of Naval Ravikant", 
        "author": "Naval Ravikant", 
        "pages": list(range(29, 203)),
        "flag": "remove_last"
        },
    {
        "file_path": "../data/anthology_of_balaji.pdf", 
        "title": "The Anthology of Balaji Srinivasan", 
        "author": "Balaji Srinivasan", 
        "pages": list(range(32, 261)),
        "flag": "remove_last"
        },
    {
        "file_path": "../data/hackers_and_painters.pdf", 
        "title": "Hackers and Painters", 
        "author": "Paul Graham", 
        "pages": list(range(14,221)),
        "flag": "remove_first_last"
        },
    {
        "file_path": "../data/skin_in_the_game.pdf", 
        "title": "Skin in the Game", 
        "author": "Nassim Nicholas Taleb", 
        "pages": list(range(15,272)),
        "flag": None
        },
    {
        "file_path": "../data/taoofseneca_vol1-1.pdf", 
        "title": "Letters From a Stoic Volume 1",
        "author": "Seneca", 
        "pages": list(range(15,308)),
        "flag": "remove_first_two"
        },
    {
        "file_path": "../data/taoofseneca_vol2.pdf", 
        "title": "Letters From a Stoic Volume 2",  
        "author": "Seneca", 
        "pages": list(range(7,283)),
        "flag": "remove_first_two"
        },
    {
        "file_path": "../data/taoofseneca_vol3.pdf", 
        "title": "Letters From a Stoic Volume 3",  
        "author": "Seneca", 
        "pages": list(range(7,258)),
        "flag": "remove_first_two"
        },
    {
        "file_path": "../data/striking-thoughts.pdf", 
        "title": "Striking Thoughts",  
        "author": "Bruce Lee", 
        "pages": list(range(20,217)),
        "flag": None
        },
]

all_texts = []

for pdf in pdf_files:
    print(f"Extracting texts from {pdf['title']} by {pdf['author']}...")
    texts = extract_texts_from_pdf(pdf["file_path"], pdf["title"], pdf["author"], pdf["pages"], pdf["flag"])
    print(f"Finished extracting texts from {pdf['title']}.")
    all_texts.extend(texts)

Extracting texts from The Almanack of Naval Ravikant by Naval Ravikant...
Finished extracting texts from The Almanack of Naval Ravikant.
Extracting texts from The Anthology of Balaji Srinivasan by Balaji Srinivasan...
Finished extracting texts from The Anthology of Balaji Srinivasan.
Extracting texts from Hackers and Painters by Paul Graham...
Finished extracting texts from Hackers and Painters.
Extracting texts from Skin in the Game by Nassim Nicholas Taleb...
Finished extracting texts from Skin in the Game.
Extracting texts from Letters From a Stoic Volume 1 by Seneca...
Finished extracting texts from Letters From a Stoic Volume 1.
Extracting texts from Letters From a Stoic Volume 2 by Seneca...
Finished extracting texts from Letters From a Stoic Volume 2.
Extracting texts from Letters From a Stoic Volume 3 by Seneca...
Finished extracting texts from Letters From a Stoic Volume 3.
Extracting texts from Striking Thoughts by Bruce Lee...
Finished extracting texts from Striking Thoughts.

In [14]:
len(all_texts)

1884

In [15]:
all_texts[42]

{'text': 'Set a very high hourly aspirational rate for yourself and stick to it. It should seem and feel absurdly high. If it doesnt, its not high enough. Whatever you picked, my advice to you would be to raise it. Like I said, for myself, even before I had money, for the longest time I used $5,000 an hour. And if you extrapolate that out into what it looks like as an annual salary, its multiple millions of dollars per year. Ironically, I actually think Ive beaten it. Im not the hardest working personIm actually a lazy person. I work through bursts of energy where Im really motivated with something. If I actually look at how much Ive earned per actual hour that Ive put in, its probably quite a bit higher than that. Can you expand on your statement, If you secretly despise wealth, it will elude you? If you get into a relative mindset, youre always going to hate people who do better than you, youre always going to be jealous or envious of them. Theyll sense those feelings when you try an
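
Notice that the cleaned text above has lost its apostrophes and dashes ("doesnt", "Ive", "personIm"). That is what one would expect if `ascii=True` simply drops non-ASCII characters such as the typographic apostrophe, though this is an assumption about `clean`, whose source is not shown here. A minimal sketch of one alternative is to map common typographic punctuation to ASCII first. The mapping and the function name below are hypothetical.

```python
import unicodedata

# Assumed mapping of common typographic characters to ASCII equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": " - ",  # em dash
})


def to_ascii_preserving_punctuation(text):
    """Map typographic punctuation to ASCII, then drop remaining non-ASCII characters."""
    text = text.translate(PUNCTUATION_MAP)
    # NFKD splits accented letters into base letter + combining mark,
    # so the base letter survives the ASCII encoding below.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii_preserving_punctuation("If it doesn\u2019t, it\u2019s not high enough."))
```

Whether this matters depends on the retriever: keyword search in particular may treat "its" and "it's" as different tokens.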

# Create and persist a Document store

In [16]:
from llama_index.core import Document

llama_index_docs = [Document(text=doc["text"], metadata=doc["metadata"]) for doc in all_texts]

In [17]:
len(llama_index_docs)

1884

In [19]:
llama_index_docs[42].__dict__

{'id_': '0bd19825-4d4f-48e3-a267-c3d0dff337ec',
 'embedding': None,
 'metadata': {'page_number': 42,
  'file_name': '../data/almanack_of_naval_ravikant.pdf',
  'title': 'The Almanack of Naval Ravikant',
  'author': 'Naval Ravikant'},
 'excluded_embed_metadata_keys': [],
 'excluded_llm_metadata_keys': [],
 'relationships': {},
 'text': 'Set a very high hourly aspirational rate for yourself and stick to it. It should seem and feel absurdly high. If it doesnt, its not high enough. Whatever you picked, my advice to you would be to raise it. Like I said, for myself, even before I had money, for the longest time I used $5,000 an hour. And if you extrapolate that out into what it looks like as an annual salary, its multiple millions of dollars per year. Ironically, I actually think Ive beaten it. Im not the hardest working personIm actually a lazy person. I work through bursts of energy where Im really motivated with something. If I actually look at how much Ive earned per actual hour that Iv

In [20]:
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage import StorageContext

# Create a SimpleDocumentStore and add the documents
docstore = SimpleDocumentStore()
docstore.add_documents(llama_index_docs)

# Create a storage context
storage_context = StorageContext.from_defaults(docstore=docstore)

# Persist the document store to disk
storage_context.persist("../data/words-of-the-senpais")
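
Once persisted, the document store can be reloaded in a later notebook without re-parsing the PDFs. The sketch below assumes the same directory as above and uses `SimpleDocumentStore.from_persist_dir`. The wrapper function `load_docstore` is a hypothetical convenience, not something defined elsewhere in this course.

```python
def load_docstore(persist_dir="../data/words-of-the-senpais"):
    """Reload the document store that was persisted to `persist_dir`."""
    # Imported lazily so this function can be defined before llama-index is installed.
    from llama_index.core.storage.docstore import SimpleDocumentStore

    return SimpleDocumentStore.from_persist_dir(persist_dir)


# Usage (requires the directory written by storage_context.persist above):
# docstore = load_docstore()
# print(len(docstore.docs))
```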

# Challenges with Complex PDFs and Documents

1. 📑 **Formatting inconsistencies**: PDFs and other documents can have varying layouts, fonts, and styles, making it difficult to extract text consistently.

2. 🏞️ **Images and graphics**: Documents may contain images, charts, and other visual elements that need to be handled separately or extracted using Optical Character Recognition (OCR) techniques.

3. 💽 **Tables and structured data**: Extracting information from tables and structured data within documents can be challenging and may require specialized tools or techniques.

4. 💾 **Metadata and noise**: Documents may include metadata, headers, footers, and other noise that needs to be handled before processing.

While this course won't cover these complex scenarios in depth, it's essential to understand the potential challenges and the need for more advanced data preparation and cleaning techniques when working with diverse document types.
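
As a small illustration of the "noise" problem, the `flag` values used earlier were chosen by hand for each book. A rough way to spot running headers or footers automatically is to count how often the first and last line of each page repeats once digits are masked. This is a minimal sketch on made-up sample pages. The function names and the 50% threshold are assumptions, not a tested heuristic.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse whitespace and mask digits so page numbers do not hide repeats."""
    return re.sub(r"\d+", "#", " ".join(line.split()))


def find_repeated_edge_lines(pages, min_fraction=0.5):
    """Return normalized first/last lines that occur on at least `min_fraction` of pages."""
    counts = Counter()
    for lines in pages:
        non_empty = [line for line in lines if line.strip()]
        if not non_empty:
            continue
        # A set avoids counting a one-line page twice.
        for edge in {normalize(non_empty[0]), normalize(non_empty[-1])}:
            counts[edge] += 1
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


# Made-up pages in the same list-of-lines shape produced by page.get_text().split("\n").
pages = [
    ["BUILDING  JUDGMENT ·  101", "Our egos are constructed in our formative years.", "make our ego happy."],
    ["BUILDING  JUDGMENT ·  102", "You absolutely need habits to function.", "This is the way I am."],
    ["BUILDING  JUDGMENT ·  103", "It is really important to uncondition yourself.", "take your habits apart."],
]
print(find_repeated_edge_lines(pages))
```

Lines flagged this way are candidates for removal. They still deserve a manual look, since chapter titles can repeat for legitimate reasons.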

## Options for parsing complex PDFs

### General PDFs

- [LlamaParse](https://docs.llamaindex.ai/en/stable/module_guides/loading/connector/llama_parse/) - An API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation with LlamaIndex frameworks.

- [pdfminer.six](https://pdfminersix.readthedocs.io/en/latest/) - A tool for extracting information from PDF documents. It focuses on getting and analyzing text data.

- [pdfplumber](https://github.com/jsvine/pdfplumber) - Gives you detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

- [pypdf](https://pypdf.readthedocs.io/en/latest/) - Capable of splitting, merging, cropping, and transforming the pages of PDF files.

- [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) - A high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

- [Camelot](https://camelot-py.readthedocs.io/en/master/) - This library is specifically for extracting data from tables in PDFs. This repo also has a [nice comparison](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools) of other table extraction libraries.

- [LLMSherpa](https://github.com/nlmatics/llmsherpa) - The main class here is the `LayoutPDFReader`. A good read about the problem and the proposed solution is [here](https://ambikasukla.substack.com/p/efficient-rag-with-document-layout).

- [unstructured](https://github.com/Unstructured-IO/unstructured) - This has components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. 

- [Table Transformer](https://github.com/microsoft/table-transformer) - A deep learning model for extracting tables from unstructured documents (PDFs and images)

- [Layout Parser](https://github.com/Layout-Parser/layout-parser) - This is a unified toolkit for deep learning based document image analysis which has a rich repository of deep learning models for layout detection, as well as a set of unified APIs for using them.

- [marker](https://github.com/VikParuchuri/marker) - Converts PDF to markdown quickly with high accuracy.

- [surya](https://github.com/VikParuchuri/surya) - A document OCR toolkit for accurate OCR in 90+ languages, line-level text detection in any language, and layout analysis (detection of tables, images, headers, etc.) in any language.

### Academic PDFs

- [nougat](https://github.com/facebookresearch/nougat) - This is an academic document PDF parser that understands LaTeX math and tables.

- [GROBID](https://grobid.readthedocs.io/en/latest/Introduction/) - A machine learning library for extracting, parsing, and re-structuring raw documents such as PDFs into structured XML/TEI encoded documents, with a particular focus on technical and scientific publications.

- [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR/) - Uses a vision transformer (ViT) to convert images of equations into LaTeX code.


