<a href="https://colab.research.google.com/github/ahsanrazi/LangChain/blob/main/10_Multimodal_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multimodal RAG

In [None]:
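# There are three common ways to build a multimodal RAG system.

# Option 1: Embed images and text together with a multimodal embedding model (such as CLIP)
# Option 2: Convert images to text summaries first, then embed and retrieve only the summaries
# Option 3: Embed and retrieve image summaries, but keep a link to the raw images for answer synthesis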

# Doc Link: https://blog.langchain.dev/semi-structured-multi-modal-rag/

# Simple, powerful idea for RAG: decouple the documents used for answer synthesis from the references used for retrieval.
# For example, we can create a summary of a verbose document that is optimized for vector-based similarity search,
# but still pass the full document to the LLM so that no context is lost during answer synthesis.


# Unstructured partitions a PDF into meaningful text sections. In simple terms:

# Removes images → It first removes all embedded image blocks from the PDF so it can focus on the text.
# Detects layout with a model → It uses a layout-detection model (YOLOX) to locate important parts of the document, such as:
#   - Titles (e.g., "Introduction", "Conclusion")
#   - Tables (recognized by their position and layout)
# Groups text under titles → Once titles are detected, it collects the text that belongs to each section.
# Breaks text into chunks → It then divides each section into smaller chunks according to user settings, such as a minimum chunk size.


# Unstructured file parsing and multi-vector retrieval work together to improve RAG on semi-structured data (such as PDFs that mix text and tables).

# Here’s a simple breakdown:
# Problem with basic chunking → Fixed-size chunking can split a table across chunks, making it hard for the LLM to interpret.
# How Unstructured helps → It parses the file with layout awareness and keeps each table as a single element, so the table can be summarized as a whole instead of being split arbitrarily.
# How multi-vector retrieval helps → Instead of searching over full documents, it indexes text chunks and table summaries and matches the question against them by semantic similarity.
# Better retrieval process:
#   - If a table summary matches the user’s question, the system retrieves it.
#   - The full/raw table linked to that summary is then sent to the LLM for a more accurate and complete answer.
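
The title-based grouping and minimum-size chunking described in the Unstructured notes above can be sketched in plain Python. This is a minimal illustration, not Unstructured's actual implementation: the `Element` class, `group_by_title`, `merge_small_sections`, and the `min_chars` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""
    category: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def group_by_title(elements):
    """Start a new section at every Title; each Table becomes its own section."""
    sections, current = [], []
    for el in elements:
        if el.category == "Table":
            if current:
                sections.append(current)
                current = []
            sections.append([el])
        elif el.category == "Title" and current:
            sections.append(current)
            current = [el]
        else:
            current.append(el)
    if current:
        sections.append(current)
    return sections


def merge_small_sections(sections, min_chars):
    """Fold a text section into the previous text chunk while that chunk is shorter than min_chars."""
    chunks = []
    last_is_table = False
    for section in sections:
        is_table = section[0].category == "Table"
        text = "\n".join(el.text for el in section)
        can_merge = chunks and not is_table and not last_is_table
        if can_merge and len(chunks[-1]) < min_chars:
            chunks[-1] += "\n" + text
        else:
            chunks.append(text)
            last_is_table = is_table
    return chunks


# Illustrative input only; the texts are made up for the example.
demo_elements = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Transformers rely on attention."),
    Element("Title", "Results"),
    Element("Table", "model | score"),
    Element("NarrativeText", "The table reports the scores."),
]
demo_chunks = merge_small_sections(group_by_title(demo_elements), min_chars=100)
for chunk in demo_chunks:
    print(repr(chunk))
```

Tables are kept as their own chunk here so that they are never merged into, or split by, the surrounding text.
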

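The multi-vector idea in the notes above (search over summaries, answer from the raw content) can be sketched as follows. This is a toy, self-contained illustration: `MultiVectorStore`, `embed`, and `cosine` are hypothetical names, and the bag-of-words `embed` only stands in for a real embedding model.

```python
import math
import re
import uuid
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MultiVectorStore:
    """Search over summaries, but return the raw documents they point to."""

    def __init__(self):
        self.index = []      # (summary vector, doc_id) pairs used for search
        self.docstore = {}   # doc_id -> raw text or table passed to the LLM

    def add(self, summary, raw):
        doc_id = str(uuid.uuid4())
        self.index.append((embed(summary), doc_id))
        self.docstore[doc_id] = raw
        return doc_id

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.index, key=lambda item: cosine(q, item[0]), reverse=True)
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


# Illustrative data only; the table contents are made up for the example.
raw_table = "model | score\nmodel-a | 0.1\nmodel-b | 0.2"
raw_text = "Attention maps a query and a set of key-value pairs to an output."

store = MultiVectorStore()
store.add("Table comparing scores of the candidate models", raw_table)
store.add("Overview of the attention mechanism", raw_text)

# The query is matched against the summaries; the raw table is what comes back.
print(store.retrieve("table of model scores"))
```

The summary is only a retrieval key. What the LLM finally sees is whatever sits in `docstore` under the same id, which is why the raw table can be returned unmodified.
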
In [None]:
# https://github.com/sunnysavita10/Generative-AI-Indepth-Basic-to-Advance/blob/main/MultiModal%20RAG/Extract_Image%2CTable%2CText_from_Document_MultiModal_Summrizer_AAG_App_YT.ipynb

# Extract Images, Tables, Text from Documents

In [1]:
!pip install "unstructured[all-docs]" pillow pydantic lxml matplotlib

Collecting unstructured[all-docs]
  Downloading unstructured-0.16.17-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured[all-docs])
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured[all-docs])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured[all-docs])
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting dataclasses-json (from unstructured[all-docs])
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting python-iso639 (from unstructured[all-docs])
  Downloading python_iso639-2025.1.28-py3-none-any.whl.metadata (13 kB)
Collecting langdetect (from unstructured[all-docs])
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 22.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collec

In [1]:
!sudo apt-get update

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,643 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,521 kB]
Get:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2,904 kB]
Hit:13 https://ppa.launchpadc

In [2]:
!sudo apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 20 not upgraded.
Need to get 186 kB of archives.
After this operation, 696 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.6 [186 kB]
Fetched 186 kB in 0s (403 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package poppler-utils.
(Reading database ... 124926 

In [3]:
!sudo apt-get install -y libleptonica-dev tesseract-ocr libtesseract-dev python3-pil tesseract-ocr-eng tesseract-ocr-script-latn


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libarchive-dev libimagequant0 libraqm0 python3-olefile tesseract-ocr-osd
Suggested packages:
  python-pil-doc
The following NEW packages will be installed:
  libarchive-dev libimagequant0 libleptonica-dev libraqm0 libtesseract-dev
  python3-olefile python3-pil tesseract-ocr tesseract-ocr-eng
  tesseract-ocr-osd tesseract-ocr-script-latn
0 upgraded, 11 newly installed, 0 to remove and 20 not upgraded.
Need to get 39.9 MB of archives.
After this operation, 123 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 libarchive-dev amd64 3.6.0-1ubuntu1.3 [581 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libimagequant0 amd64 2.17.0-1 [34.6 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libleptonica-dev amd64 1.82.0-3build1 [1,562 kB]
Get:4 http://archive.ubuntu

In [4]:
!pip install unstructured-pytesseract
!pip install tesseract-ocr

Collecting tesseract-ocr
  Downloading tesseract-ocr-0.0.1.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tesseract-ocr
  Building wheel for tesseract-ocr (setup.py) ... done
  Created wheel for tesseract-ocr: filename=tesseract_ocr-0.0.1-cp311-cp311-linux_x86_64.whl size=179086 sha256=368248d079503baf63765458beec5f40aafd23f312fb2a3168ba513c07baebc4
  Stored in directory: /root/.cache/pip/wheels/90/83/3c/d2b68d844d169d6015fc2ad8c93207d778829c87e26c6f2206
Successfully built tesseract-ocr
Installing collected packages: tesseract-ocr
Successfully installed tesseract-ocr-0.0.1
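
Before parsing a PDF, it can help to confirm that the system tools installed above are actually on the `PATH`. The following is a small sketch using only the standard library. `find_system_tools` is a hypothetical helper name, and the tool list assumes that `tesseract` comes from the `tesseract-ocr` apt package and that `pdftoppm` and `pdfinfo` come from `poppler-utils`.

```python
import shutil


def find_system_tools(names=("tesseract", "pdftoppm", "pdfinfo")):
    """Map each executable name to its location on PATH, or None when it is missing."""
    return {name: shutil.which(name) for name in names}


for name, path in find_system_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```
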


In [5]:
from unstructured.partition.pdf import partition_pdf

In [6]:
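# Partition the PDF with the layout-detection ("hi_res") strategy and save
# each detected image and table as an image file under extracted_data/.
raw_pdf_elements = partition_pdf(
    filename="/content/transformer.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
    extract_image_block_types=["Image", "Table"],  # element types to crop and save
    extract_image_block_to_payload=False,  # write files to disk instead of storing base64 in metadata
    extract_image_block_output_dir="extracted_data",
)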

yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]
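
Once partitioning finishes, a natural next step is to separate the returned elements by type so that tables and text can be handled differently. The sketch below assumes each element exposes a `category` string (such as "Table" or "NarrativeText") and a `text` attribute. `split_by_category` and `FakeElement` are hypothetical names used only for illustration, and `FakeElement` stands in for real parsed elements so the example runs without a PDF.

```python
from collections import defaultdict
from dataclasses import dataclass


def split_by_category(elements):
    """Group element text by each element's `category` attribute."""
    groups = defaultdict(list)
    for el in elements:
        category = getattr(el, "category", "Unknown")
        groups[category].append(getattr(el, "text", ""))
    return dict(groups)


@dataclass
class FakeElement:
    """Hypothetical stand-in for a parsed element; not part of any library."""
    category: str
    text: str


# Illustrative elements only; the texts are made up for the example.
demo = [
    FakeElement("Title", "Attention"),
    FakeElement("Table", "a | b"),
    FakeElement("NarrativeText", "Some text."),
    FakeElement("Table", "c | d"),
]
for category, texts in split_by_category(demo).items():
    print(category, len(texts))
```

Under the same assumption, the helper could be applied to the parsed output as `split_by_category(raw_pdf_elements)`.
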