# Loading a Multimodal PDF with the PyMuPDF Library

In [6]:
# Import libraries
import pymupdf
from langchain_core.documents import Document

In [2]:
pdf_path = "./pdf-docs/rag_llm.pdf"

### Extract text

In [3]:
doc = pymupdf.open(pdf_path)

In [4]:
doc

Document('./pdf-docs/rag_llm.pdf')

In [11]:
# Extract only text
docs = []
for i, page in enumerate(doc):
    text = page.get_text()
    # Skip pages with no extractable text (e.g. scanned or image-only pages)
    if text.strip():
        text_doc = Document(
            metadata={
                "page": i,  # zero-based page index
                "source": pdf_path,
                "type": "text",
            },
            page_content=text.strip(),
        )
        docs.append(text_doc)

In [10]:
docs

[Document(metadata={'page': 0, 'source': './pdf-docs/rag_llm.pdf', 'type': 'text'}, page_content='A Retrieval-Augmented Generation Based Large \nLanguage Model Benchmarked on a Novel Dataset \nKieran Pichai \nMenlo School \nABSTRACT \nThe evolution of natural language processing has seen marked advancements, particularly with the advent of models \nlike BERT, Transformers, and GPT variants, with recent additions like GPT and Bard. This paper investigates the \nRetrieval-Augmented Generation (RAG) framework, providing insights into its modular design and the impact of its \nconstituent modules on performance. Leveraging a unique dataset from Amazon Rainforest natives and biologists, our \nresearch demonstrates the signiﬁcance of preserving indigenous cultures and biodiversity. The experiment employs a \ncustomizable RAG methodology, allowing for the interchangeability of various components, such as the base language \nmodel and similarity score tools. Findings indicate that while GPT pe
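The page loop above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original notebook: `pages_to_records` and `FakePage` are hypothetical names, and the helper returns plain dictionaries rather than `Document` objects so that it runs without any PDF open. It assumes only that each page exposes `get_text()`, as pymupdf pages do.

```python
def pages_to_records(pages, source, doc_type="text"):
    """Build one record per non-empty page.

    `pages` is any iterable of objects exposing `get_text()`, such as an
    open pymupdf document. Each record uses the keyword names accepted by
    the `Document` constructor (`page_content`, `metadata`).
    """
    records = []
    for i, page in enumerate(pages):
        text = page.get_text().strip()
        if not text:
            continue  # skip pages with no extractable text
        records.append({
            "page_content": text,
            "metadata": {"page": i, "source": source, "type": doc_type},
        })
    return records


# Hypothetical stand-in for a pymupdf page, used only to exercise the helper.
class FakePage:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


records = pages_to_records(
    [FakePage(" Title \n"), FakePage("   "), FakePage("Body")],
    source="demo.pdf",
)
print([r["metadata"]["page"] for r in records])  # [0, 2]
```

With a real file, `pages_to_records(doc, pdf_path)` should yield the same content as the loop above, and `[Document(**r) for r in records]` would turn the records into LangChain documents.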

In [15]:
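# Extract text and OCR the images on each page
# Note: get_textpage_ocr() requires a Tesseract installation
docs = []
for i, page in enumerate(doc):
    # Build a text page that also contains OCR results for image regions
    tp = page.get_textpage_ocr()
    text = page.get_text(textpage=tp)
    if text.strip():
        text_doc = Document(
            metadata={
                "page": i,  # zero-based page index
                "source": pdf_path,
                "type": "text",
            },
            page_content=text.strip(),
        )
        docs.append(text_doc)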

In [16]:
print(docs[5].page_content)

In conclusion, the proposed experiment holds the potential to make significant contributions to both the field of AI and 
the preservation of human cultural heritage. The insights gained could lead to a more inclusive and representative 
future for LLMs, where the voices of all communities can be heard and understood. 
 
 
 
 
Figure 1. Venn Diagram of Data Sources for RAG. This figure represents a venn diagram of 3 sources of information 
(google search results, OpenAI/Palm, proprietary data collected by the author) combined in order to create the “out-
putted answer.” 
 
 
Figure 2. Executive Diagram of Proposed RAG. This diagram outlines the various steps and procedures of the RAG 
algorithm from the input of the “user question” to the “outputted answer of the user question.” 
 
Volume 12 Issue 4 (2023) 
ISSN: 2167-1907
www.JSR.org/hs
6
=
Diferent
AP! Calls
nat@PT
SepAP Google
‘Search Results)
Scrape tough a premade
lst
nd whch moet
close she user question
pamembed |
open mbed
Large

### Extract images

In [17]:
from PIL import Image
import io

In [None]:

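import os

# Make sure the output directory exists before saving
os.makedirs("./extracted_images", exist_ok=True)

for i, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        try:
            xref = img[0]  # first entry of each image tuple is its xref
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            # Convert to PIL Image
            pil_image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            # Create unique identifier
            image_id = f"page_{i}_img_{img_index}"
            img_path = f"./extracted_images/{image_id}.png"
            # Save the image
            pil_image.save(img_path, format="PNG")
        except Exception as e:
            # Report the failure and continue with the next image
            print(f"Failed to extract image {img_index} on page {i}: {e}")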
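An image that appears on several pages (a logo, for example) is often stored once in the PDF and referenced by the same xref, so the loop above may save it more than once. The sketch below shows one possible way to list each distinct image a single time and attach metadata that mirrors the text documents. It is not from the original notebook: `collect_image_records`, `FakeImagePage`, the `"image"` type label, and the `image_id` and `path` metadata keys are all assumptions made for illustration.

```python
from pathlib import Path


def collect_image_records(doc, source, out_dir="./extracted_images"):
    """List each distinct embedded image once, with metadata for later use.

    `doc` is assumed to iterate over pages exposing `get_images(full=True)`,
    where the first element of every entry is the image xref.
    """
    seen = set()
    records = []
    for i, page in enumerate(doc):
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            if xref in seen:
                continue  # same image object already listed on an earlier page
            seen.add(xref)
            image_id = f"page_{i}_img_{img_index}"
            records.append({
                "xref": xref,
                "metadata": {
                    "page": i,
                    "source": source,
                    "type": "image",  # assumed label, parallel to "text"
                    "image_id": image_id,
                    "path": str(Path(out_dir) / f"{image_id}.png"),
                },
            })
    return records


# Hypothetical stand-in for a pymupdf page, used only to exercise the helper.
class FakeImagePage:
    def __init__(self, xrefs):
        self._xrefs = xrefs

    def get_images(self, full=False):
        return [(xref,) for xref in self._xrefs]


image_records = collect_image_records(
    [FakeImagePage([5, 7]), FakeImagePage([5, 9])],
    source="demo.pdf",
)
print([r["xref"] for r in image_records])  # [5, 7, 9]
```

Each record's `xref` could then be passed to `doc.extract_image(xref)` and the result saved at `metadata["path"]`, keeping the saving step from the cell above unchanged.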