# Multimodal RAG for a Website

### Problem Statement
Extract the text, images, and tables from each page of a website, summarize the extracted images with a vision model, then embed the text and image summaries and store them in a vector database. Finally, build a RAG pipeline that answers questions about the website.

In [1]:
# Define the list of pages of the softmeets.com website
page_list = [
    "https://softmeets.com/",
    "https://softmeets.com/automation/",
    "https://softmeets.com/internet-of-things/",
    "https://softmeets.com/artificial-intelligence/",
    "https://softmeets.com/analytics/",
    "https://softmeets.com/about-us/",
    "https://softmeets.com/customers/",
    "https://softmeets.com/partners/",
    "https://softmeets.com/contact-us/",
    "https://softmeets.com/vetting-of-detail-project-report-of-automation-of-drinking-water-project-with-national-institute-of-technology-durgapur/",
    "https://softmeets.com/implementation-of-document-management-system-at-sail-bokaro-steel-plant/",
    "https://softmeets.com/application-development-system-modernization-and-support-upv-tms/",
    "https://softmeets.com/barqat/",
    "https://softmeets.com/rbms/",
    "https://softmeets.com/upv/",
    "https://softmeets.com/cms/",
    "https://softmeets.com/privacy-policy/"
]

In [2]:
# Import libraries
from unstructured.partition.html import partition_html

In [44]:
# Read a single page (page_list[5], the About Us page) to inspect its elements
elements = partition_html(
    url=page_list[5],
    extract_image_block_types=["Image", "Table"],  # Extract image blocks that are images or tables
    extract_image_block_to_payload=False,          # Don't embed image bytes directly into `el.metadata["image"]`
)

In [50]:
# Print the content types and preview
for el in elements:
    print(f"Type: {el.category}")
    print(el.metadata.to_dict())
    print(f"Preview: {el.text[:100]}...")
    print("=" * 80)

Type: Image
{'image_url': '//softmeets.com/wp-content/uploads/2024/11/soft-logo-1.png', 'link_texts': ['Softmeets Info Solutions Pvt. Ltd.'], 'link_urls': ['https://softmeets.com/'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/about-us/'}
Preview: Softmeets Info Solutions Pvt. Ltd....
Type: UncategorizedText
{'link_texts': ['Navigation'], 'link_urls': ['#'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/about-us/'}
Preview: Navigation...
Type: ListItem
{'category_depth': 1, 'link_texts': ['Home'], 'link_urls': ['https://softmeets.com/'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/about-us/'}
Preview: Home...
Type: ListItem
{'category_depth': 1, 'link_texts': ['Innovation'], 'link_urls': ['#'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/about-us/'}
Preview: Innovation...
Type: ListItem
{'category_depth': 2, 'link_texts': ['Automation'], 'link_urls': ['https

In [27]:
# Print the metadata and text of the image elements only
for el in elements:
    if el.category == 'Image':
        print(el.metadata.to_dict())
        print(f"Preview:{el.text}")
        print("=" * 80)

{'image_url': '//softmeets.com/wp-content/uploads/2024/11/soft-logo-1.png', 'link_texts': ['Softmeets Info Solutions Pvt. Ltd.'], 'link_urls': ['https://softmeets.com/'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/'}
Preview:Softmeets Info Solutions Pvt. Ltd.
{'image_url': 'https://softmeets.com/wp-content/uploads/2024/11/automation-icon.png', 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/'}
Preview:Image
{'image_url': 'https://softmeets.com/wp-content/uploads/2024/11/world.png', 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/'}
Preview:Image
{'image_url': 'https://softmeets.com/wp-content/uploads/2024/11/artificial-intelligence-icon.png', 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/'}
Preview:Image
{'image_url': 'https://softmeets.com/wp-content/uploads/2024/11/doughnut.png', 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://softmeets.com/'}
Prev

In [None]:
# Group the elements by type
text_elements = [el for el in elements if el.category in ["NarrativeText", "Title", "ListItem", "UncategorizedText"] and el.text]
table_elements = [el for el in elements if el.category == "Table"]
image_elements = [el for el in elements if el.category == "Image"]

In [29]:
print(len(text_elements))
print(len(table_elements))
print(len(image_elements))

40
0
17


#### Extract image name from image URL

In [3]:
import os
from urllib.parse import urlparse, unquote

In [4]:
def extract_image_filename(image_url):
    """
    Extracts the image filename from a given URL.

    Args:
        image_url (str): The URL of the image.

    Returns:
        str: The extracted filename, or None if no valid URL is provided.
    """
    if not image_url:
        return None

    # Parse the URL into its components
    parsed_url = urlparse(image_url)

    # Get the path part of the URL
    url_path = parsed_url.path

    # Extract the base filename from the path
    image_name = os.path.basename(url_path)

    # Decode any URL-encoded characters (e.g., spaces represented as %20)
    image_name = unquote(image_name)

    return image_name

In [5]:
extract_image_filename("https://softmeets.com/wp-content/uploads/2024/11/Sail-logo.png")

'Sail-logo.png'

### Extract data from each page

In [6]:
# Import libraries
import google.generativeai as genai
from PIL import Image



In [7]:
# Load the vision model
vision_model = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")

vision_model

genai.GenerativeModel(
    model_name='models/gemini-1.5-pro-latest',
    generation_config={},
    safety_settings={},
    tools=None,
    system_instruction=None,
    cached_content=None
)

In [8]:
# Define Prompt
prompt_text = """You are an assistant tasked with summarizing tables and images for retrieval.
        These summaries will be embedded and used to retrieve the raw image or raw table elements.
        Give a concise summary of the image or table that will be optimized for retrieval.
        """

In [14]:
import requests
from io import BytesIO

In [24]:
# Summarize an image, or a table rendered as an image

def summarize_image(image_source):
    """Summarize an image (local path or URL) using Gemini Vision."""
    try:
        if image_source.startswith("http"):
            response = requests.get(image_source, timeout=30)  # avoid hanging on a slow server
            response.raise_for_status()
            img = Image.open(BytesIO(response.content))
        else:
            img = Image.open(image_source)

        gemini_response = vision_model.generate_content([prompt_text, img])
        return gemini_response.text
    except Exception as e:
        print(f"Error processing {image_source}: {e}")
        return None
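
In the element metadata printed earlier, the logo's `image_url` is protocol-relative (`//softmeets.com/...`). A value like that does not start with `"http"`, so `summarize_image` passes it to `Image.open` as if it were a local path, which is a likely cause of failed summaries for such images. A minimal sketch of one way to resolve these references before downloading, using the standard-library `urljoin`; the helper name `resolve_image_url` is hypothetical and not part of the original notebook:

```python
from urllib.parse import urljoin


def resolve_image_url(page_url, image_url):
    """Resolve an image reference against the page it was found on.

    Hypothetical helper, not part of the original notebook.
    """
    if not image_url:
        return None
    # urljoin leaves absolute URLs unchanged and fills in the missing
    # scheme (or host and path) from page_url for relative references.
    return urljoin(page_url, image_url)


page = "https://softmeets.com/about-us/"
logo = "//softmeets.com/wp-content/uploads/2024/11/soft-logo-1.png"
print(resolve_image_url(page, logo))
```

Calling `summarize_image(resolve_image_url(page_url, image_url))` would then send every image through the HTTP branch.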

In [16]:
# Import libraries
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [17]:
# Cache of image summaries keyed by filename, shared across all pages
image_summaries = {}
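
Because this dictionary lives only in memory, every restart of the notebook calls the vision model again for each image. A minimal sketch of persisting it to disk with the standard library, assuming a JSON file is acceptable; the file name `image_summaries.json` and both helper names are hypothetical:

```python
import json
from pathlib import Path

# Hypothetical cache location; not part of the original notebook.
CACHE_PATH = Path("image_summaries.json")


def load_summaries(path=CACHE_PATH):
    """Return cached summaries, or an empty dict if no cache file exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_summaries(summaries, path=CACHE_PATH):
    """Write summaries to disk, dropping failed (None) entries so they are retried."""
    kept = {name: text for name, text in summaries.items() if text}
    path.write_text(json.dumps(kept, ensure_ascii=False, indent=2), encoding="utf-8")
```

With these helpers, the cell above could become `image_summaries = load_summaries()`, with `save_summaries(image_summaries)` called once all pages are processed.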

In [25]:
# Extract data from each page
def extract_from_page(page_url):
    print(f"Processing {page_url}")
    # Read the page
    elements = partition_html(
        url=page_url,
        extract_image_block_types=["Image", "Table"],  # Extract image blocks that are images or tables
        extract_image_block_to_payload=False,          # Don't embed image bytes directly into `el.metadata["image"]`
    )
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    # Extract the images
    image_elements = []
    page_text = ""
    # top_menu = True
    for el in elements:
        if el.category == 'Image':
            image_elements.append(el)
        elif el.category in ["NarrativeText", "Title", "ListItem", "UncategorizedText"] and el.text:
            # if el.category == "ListItem" and top_menu:
            #     continue
            # else:
            #     top_menu = False
            if el.text.strip():
                page_text += el.text.strip() + "\n"
    docs = []
    # Create documents for text
    if page_text.strip():
        doc = Document(
            page_content=page_text.strip(),
            metadata={
                "type": "Text",
                "url": page_url
            }
        )
        docs.append(doc)
    print("Text extracted")
    # Split the docs
    splitted_docs = splitter.split_documents(docs)
    # Summarize each image of the page
    print(f"Summarizing {len(image_elements)} image(s):")
    for el in image_elements:
        image_url = el.metadata.to_dict().get("image_url", "")
        print(f"Summarizing {image_url}")
        if image_url:
            image_name = extract_image_filename(image_url)
            print(f"Image Name:{image_name}")
            if image_name not in image_summaries:
                summary = summarize_image(image_url)
                image_summaries[image_name] = summary
                # Skip images whose summary failed (summarize_image returned None)
                if summary:
                    doc = Document(
                        page_content=summary,
                        metadata={
                            "type": "Image",
                            "url": page_url,
                            "image_url": image_url
                        }
                    )
                    splitted_docs.append(doc)
    return splitted_docs

In [26]:
# Extract data from all pages
all_docs = []
for page in page_list:
    page_docs = extract_from_page(page)
    all_docs.extend(page_docs)

Processing https://softmeets.com/


Text extracted
Summarizing 17 image(s):
Summarizing //softmeets.com/wp-content/uploads/2024/11/soft-logo-1.png
Image Name:soft-logo-1.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/automation-icon.png
Image Name:automation-icon.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/world.png
Image Name:world.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/artificial-intelligence-icon.png
Image Name:artificial-intelligence-icon.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/doughnut.png
Image Name:doughnut.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/Rail-logo.png
Image Name:Rail-logo.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/Sail-logo.png
Image Name:Sail-logo.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/Ministry-of-Communicationlogo.png
Image Name:Ministry-of-Communicationlogo.png
Summarizing https://softmeets.com/wp-content/uploads/2024/11/indian-army-loo.png
I

In [27]:
image_summaries

{'soft-logo-1.png': None,
 'about-2.jpg': 'Two people working together on a laptop.  A man and a woman are sitting at a desk, facing each other, with a laptop between them.  They are both wearing glasses and appear to be collaborating on a project.  Office environment suggested by potted plants and overhead light.  Cartoon/vector illustration style.\n',
 'automation-icon.png': 'Website redirect settings or configuration, symbolized by a computer screen with a redirect arrow and a gear.',
 'world.png': 'Blue globe with grid lines and three white dots. Icon, flat design. Represents global network, internet, world wide web, or global communication.',
 'artificial-intelligence-icon.png': 'Brain connected to circuit board, artificial intelligence icon.',
 'doughnut.png': 'A donut chart with four segments in light blue, dark blue-grey, gold/yellow, and coral pink. The dark blue-grey segment is the largest and appears to be slightly separated/offset from the rest of the chart.',
 'certified-4

In [28]:
all_docs

[Document(metadata={'type': 'Text', 'url': 'https://softmeets.com/'}, page_content='Navigation\nHome\nInnovation\nAutomation\nInternet of Things\nArtificial intelligence\nAnalytics\nAbout\nAbout Us\nCase Study\nCustomers\nPartners\nContact Us\nYour modernization journey starts here.\nMake technology your growth partner, Optimize operations and improve efficiency.\nLearn More\nServing up transformation & modernization solutions using:\nAutomation\nLearn More\nInternet of Things\nLearn More\nArtificial intelligence\nLearn More\nAnalytics\nLearn More\nTrusted by our Government of India customers.'),
 Document(metadata={'type': 'Text', 'url': 'https://softmeets.com/'}, page_content='Trusted by our Government of India customers.\nTrusted by our esteemed Government of India customers, we deliver reliable, innovative solutions tailored to meet critical operational and strategic needs. Our commitment to excellence ensures robust performance, security, and compliance, fostering enduring partner

### Perform embedding on extracted data and store in ChromaDB

In [29]:
# Import libraries
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

In [30]:
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
embedding

HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [31]:
# Directory to store Chroma DB
persist_directory = "./chroma_db"

In [32]:
# Store embedded documents
vectorstore = Chroma.from_documents(
    documents=all_docs,
    embedding=embedding,
    persist_directory=persist_directory
)


### Create Retriever

In [33]:
vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [34]:
# Return the 3 most similar chunks per query
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

In [37]:
# Test the retriever
query = "What is the address of the company?"
retrieved_docs = retriever.invoke(query)
retrieved_docs


[Document(id='d62ee98f-7182-4f4d-a1c1-c5f875b1ab67', metadata={'url': 'https://softmeets.com/automation/', 'type': 'Text'}, page_content='Get in touch\nContact us\n+ (91) 9434 811 929, 0341-3500346\nINFO@SOFTMEETS.COM\nSoftmeets Office Address\nWebel IT Park (3rd Floor), Kalyanpur Satellite Township, Asansol – 713302, Dist – Paschim Bardhaman (W.B). India\nImportant Link\nAbout\nTerms & Conditions\nPrivacy Policy\nContact\n© 2005-2024, Softmeets. All Rights Reserved.'),
 Document(id='032720ea-9c55-4bba-8b5e-99d3b8c45c1f', metadata={'type': 'Text', 'url': 'https://softmeets.com/analytics/'}, page_content='Get In Touch\nContact us\n+ (91) 9434 811 929, 0341-3500346\nINFO@SOFTMEETS.COM\nSoftmeets Office Address\nWebel IT Park (3rd Floor), Kalyanpur Satellite Township, Asansol – 713302, Dist – Paschim Bardhaman (W.B). India\nImportant Link\nAbout\nTerms & Conditions\nPrivacy Policy\nContact\n© 2005-2024, Softmeets. All Rights Reserved.'),
 Document(id='04cb8929-d590-4d79-9986-7cd61b5e678a'
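
The first two results above are the same site footer captured from two different pages, so they use up retrieval slots without adding information. A minimal sketch of dropping such near-duplicates after retrieval; the helper name `dedupe_by_content` is hypothetical, and it relies only on each object exposing a `page_content` string:

```python
import re


def dedupe_by_content(docs):
    """Keep the first document for each distinct normalized page_content.

    Works with any objects exposing a `page_content` string, such as
    LangChain `Document` instances.
    """
    seen = set()
    unique = []
    for doc in docs:
        # Collapse whitespace and ignore case so "Get in touch" and
        # "Get In Touch" footers count as the same text.
        key = re.sub(r"\s+", " ", doc.page_content).strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

For example, `dedupe_by_content(retriever.invoke(query))` would keep one copy of the footer. Removing the footer lines before embedding would be an alternative that also saves index space.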

### Create RAG Pipeline

In [38]:
# Import libraries
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

In [39]:
llm = ChatGoogleGenerativeAI(model="models/gemini-1.5-pro")
llm

ChatGoogleGenerativeAI(model='models/gemini-1.5-pro', google_api_key=SecretStr('**********'), client=<google.ai.generativelanguage_v1beta.services.generative_service.client.GenerativeServiceClient object at 0x7306f412fa40>, default_metadata=())

In [40]:
# Define prompt
template = """You are a helpful assistant for question-answering tasks.
Your task is to answer the user's query using the provided context.
Your answer should be concise and to the point.

Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

In [41]:
# Define the RAG chain
chain = ({"question": RunnablePassthrough(), "context": retriever}
         | prompt
         | llm
         | StrOutputParser())
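
In the chain above, the retriever's output (a list of `Document` objects) goes straight into the `{context}` slot, so the prompt receives the list's string representation, metadata and all. A minimal sketch of a formatter that turns the documents into plain text first; the name `format_docs` and the `Source:` prefix are assumptions, not part of the original notebook:

```python
def format_docs(docs):
    """Join retrieved documents into a plain-text context block."""
    parts = []
    for doc in docs:
        # Prefix each chunk with the page it came from, when known.
        source = doc.metadata.get("url", "unknown source")
        parts.append(f"Source: {source}\n{doc.page_content}")
    return "\n\n".join(parts)
```

It could be wired in by replacing `"context": retriever` with `"context": retriever | format_docs`, since LangChain runnables accept a plain function on the right-hand side of `|`.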


### Test RAG

In [42]:
response = chain.invoke("What is the address of the company?")
print(response)


Webel IT Park (3rd Floor), Kalyanpur Satellite Township, Asansol – 713302, Dist – Paschim Bardhaman (W.B). India


In [43]:
response = chain.invoke("What is the name of the company?")
print(response)


Softmeets


In [44]:
response = chain.invoke("Which technologies do they use?")
print(response)


Platforms: Web and Android
Frontend: Angular, HTML-5, CSS3
Backend: js & Express


In [45]:
response = chain.invoke("Name some of their clients?")
print(response)


The provided text doesn't name specific clients. It only mentions working with "top executives" and helping "the world's great companies".


In [46]:
response = chain.invoke("What is the contact number and email of the company?")
print(response)


+91 9434 811 929, 0341-3500346 and INFO@SOFTMEETS.COM


In [47]:
response = chain.invoke("Tell me about the company?")
print(response)


Softmeets Info Solutions is a company with a four-decade track record of helping large companies become more efficient and intelligent. Their mission is to help management teams create high economic value and redefine industries.  They are located in Webel IT Park, Kalyanpur Satellite Township, Asansol, India.
