<a href="https://colab.research.google.com/github/duper203/RAG_Techniques_with_upstage/blob/main/upstage/19_1_multi_model_rag_with_captioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Modal RAG with Captioning

This notebook implements one of several possible approaches to multi-modal RAG. It extracts text and images from a PDF, summarizes each image with a multi-modal model, and indexes the page text and image summaries together so that a Retrieval-Augmented Generation (RAG) pipeline can retrieve either kind of content for question answering.



## Key Components
1. PyMuPDF: For extracting text and images from PDFs.
2. Gemini 1.5 Flash (`gemini-1.5-flash`): To summarize the extracted images, including figures and tables.
3. Upstage Embeddings: For embedding document splits.
4. Chroma Vectorstore: To store and retrieve document embeddings.
5. LangChain: To orchestrate the retrieval and generation pipeline.

## Import relevant libraries

In [10]:
! pip3 install -qU langchain-upstage langchain langchain-community chromadb PyMuPDF google-generativeai tiktoken pillow

In [2]:
import fitz  # PyMuPDF
from PIL import Image
import io
import os
from google.colab import userdata

import google.generativeai as genai
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_upstage import ChatUpstage, UpstageEmbeddings

# Read the API keys from Colab secrets and expose them as environment variables
os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")

## Download the "Attention is all you need" paper

In [3]:
!wget https://arxiv.org/pdf/1706.03762
!mv 1706.03762 attention_is_all_you_need.pdf

--2024-10-22 23:35:22--  https://arxiv.org/pdf/1706.03762
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.195.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/pdf]
Saving to: ‘1706.03762’


2024-10-22 23:35:22 (30.7 MB/s) - ‘1706.03762’ saved [2215244/2215244]



## Data Extraction

In [4]:
text_data = []
img_data = []

In [5]:
with fitz.open('attention_is_all_you_need.pdf') as pdf_file:
    # Create a directory to store the images (no error if it already exists)
    os.makedirs("extracted_images", exist_ok=True)

    # Loop through every page in the PDF
    for page_number in range(len(pdf_file)):
        page = pdf_file[page_number]

        # Get the text on the page; "name" stores the 1-based page number
        text = page.get_text().strip()
        text_data.append({"response": text, "name": page_number+1})
        # Get the list of images on the page
        images = page.get_images(full=True)

        # Loop through all images found on the page
        for image_index, img in enumerate(images, start=1):
            xref = img[0]  # Get the XREF of the image
            base_image = pdf_file.extract_image(xref)  # Extract the image
            image_bytes = base_image["image"]  # Get the image bytes
            image_ext = base_image["ext"]  # Get the image extension

            # Load the image using PIL and save it as image_<page>_<index>.<ext>
            image = Image.open(io.BytesIO(image_bytes))
            image.save(f"extracted_images/image_{page_number+1}_{image_index}.{image_ext}")
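
Because the files are saved as `image_<page>_<index>.<ext>`, the page number can be recovered from a file name later, for example to attach it as metadata to an image summary. The following is a minimal sketch that assumes this naming scheme; `parse_image_name` is a hypothetical helper and is not part of the notebook.

```python
import re

# Matches names produced by the extraction loop, e.g. "image_4_2.png".
IMAGE_NAME = re.compile(r"^image_(\d+)_(\d+)\.(\w+)$")


def parse_image_name(filename):
    """Return (page_number, image_index, extension), or None if the name does not match."""
    match = IMAGE_NAME.match(filename)
    if match is None:
        return None
    page, index, ext = match.groups()
    return int(page), int(index), ext


print(parse_image_name("image_4_2.png"))  # (4, 2, 'png')
```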

In [6]:
genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

## Image Captioning

In [24]:
# Prompt sent along with every extracted image
caption_prompt = (
    "You are an assistant tasked with summarizing tables, images and text for retrieval. "
    "These summaries will be embedded and used to retrieve the raw text or table elements. "
    "Give a concise summary of the table or text that is well optimized for retrieval. "
    "Table or text or image:"
)

for img in os.listdir("extracted_images"):
    image = Image.open(f"extracted_images/{img}")
    response = model.generate_content([image, caption_prompt])
    img_data.append({"response": response.text, "name": img})
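
The loop above stops at the first failed model call, and `os.listdir` returns file names in arbitrary order. One possible way to make the step more robust is sketched below. It assumes the captioning call is wrapped in a function; `caption_all` and `fake_caption` are hypothetical names, and `fake_caption` only stands in for the real `model.generate_content` call so the sketch runs without an API key.

```python
def caption_all(names, caption_fn):
    """Caption each name in sorted order; collect failures instead of aborting."""
    captions, failures = [], []
    for name in sorted(names):
        try:
            captions.append({"response": caption_fn(name), "name": name})
        except Exception as exc:  # e.g. a rate limit or an unreadable file
            failures.append({"name": name, "error": str(exc)})
    return captions, failures


# Stand-in for the model call (assumption: the real function would open the
# image and return the text of the model response).
def fake_caption(name):
    if name.endswith(".bin"):
        raise ValueError("unsupported image")
    return f"caption for {name}"


captions, failures = caption_all(
    ["image_4_2.png", "image_3_1.png", "broken.bin"], fake_caption
)
print([c["name"] for c in captions])
print(failures)
```

Failed files end up in `failures`, so they can be retried without captioning the other images again.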

In [32]:
img_data

[{'response': 'This image shows the architecture of a scaled dot-product attention mechanism. The input is three tensors V, K, and Q, which are first linearly transformed. Then, the transformed K and Q tensors are multiplied with each other. The resulting product is then normalized by the dimension of the Q tensor and scaled down by a factor of the square root of the dimension. Finally, the result is multiplied with the transformed V tensor to produce the final output.',
  'name': 'image_4_2.png'},
 {'response': 'This image shows the steps of the scaled dot-product attention mechanism used in Transformers.  Q, K, and V are inputs that are multiplied with each other and then scaled to ensure that the gradients are well behaved.  The resulting output is masked and then fed into a softmax layer to calculate the attention weights.  The weighted values are then multiplied by V to get the final output.',
  'name': 'image_4_1.png'},
 {'response': 'This is a diagram of the Transformer architec

## Vectorstore

In [26]:
# Set embeddings
embedding_model = UpstageEmbeddings(model="solar-embedding-1-large")

# Wrap the page text and the image summaries as LangChain Documents
docs_list = [Document(page_content=text['response'], metadata={"name": text['name']}) for text in text_data]
img_list = [Document(page_content=img['response'], metadata={"name": img['name']}) for img in img_data]

# Split
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=400, chunk_overlap=50
)

doc_splits = text_splitter.split_documents(docs_list)
img_splits = text_splitter.split_documents(img_list)
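
To build intuition for `chunk_size=400` and `chunk_overlap=50`, the sketch below applies a plain sliding window to a list of tokens. This is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (which splits on separators and then merges the pieces); `sliding_chunks` is a hypothetical helper that only illustrates how the two parameters interact.

```python
def sliding_chunks(tokens, chunk_size=400, chunk_overlap=50):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(1000))  # stand-in for 1000 token ids
chunks = sliding_chunks(tokens)
print([len(c) for c in chunks])  # [400, 400, 300]
```

Each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.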

In [34]:
# Add to vectorstore
vectorstore = Chroma.from_documents(
    documents=doc_splits + img_splits, # add both the text and the image splits
    collection_name="multi_model_rag",
    embedding=embedding_model,
)

retriever = vectorstore.as_retriever(
                search_type="similarity",
                search_kwargs={'k': 1}, # number of documents to retrieve
            )
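
Conceptually, a similarity retriever ranks the stored embeddings against the query embedding and returns the `k` best matches. The sketch below shows that idea on toy 2-dimensional vectors with cosine similarity. The vectors, names, and the `top_k` helper are illustrative assumptions; the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k=1):
    """Return the names of the k items most similar to query_vec."""
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy "embeddings" for two text chunks and one image summary.
items = [
    ("page_1", [1.0, 0.0]),
    ("image_3_1.png", [0.6, 0.8]),
    ("page_2", [0.0, 1.0]),
]
print(top_k([0.5, 0.9], items, k=1))  # ['image_3_1.png']
```

Because text chunks and image summaries share one index, the best match for a query can be either kind of document.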

## Query

In [35]:
query = "What is a Transformer model?"

In [36]:
docs = retriever.invoke(query)

## Output

In [38]:
from langchain_core.output_parsers import StrOutputParser

# Prompt
system = """You are an assistant for question-answering tasks. Answer the question using the retrieved documents as context.
Use three to five sentences and keep the answer concise."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Retrieved documents: \n\n <docs>{documents}</docs> \n\n User question: <question>{question}</question>"),
    ]
)

# LLM
llm = ChatUpstage(model='solar-pro')

# Chain
rag_chain = prompt | llm | StrOutputParser()

# Run
generation = rag_chain.invoke({"documents":docs[0].page_content, "question": query})
print(generation)

A Transformer model is a type of neural network used for natural language processing tasks. It has an encoder-decoder architecture, where the encoder creates a representation of an input sequence and the decoder generates a new sequence based on that representation. The model consists of multiple layers, including attention and feed-forward layers, and is trained to predict the next word in a sequence. Transformer models are commonly used for machine translation and text summarization.
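
The chain above passes only `docs[0].page_content`, which matches `k=1`. If `k` were raised, the retrieved documents would need to be combined into one context string first. A minimal sketch follows; `Doc` is a stand-in for LangChain's `Document` so the example runs on its own, and `format_docs` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join retrieved documents, labelling each with its source name."""
    return "\n\n".join(
        f"[source: {doc.metadata.get('name', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )


retrieved = [
    Doc("The Transformer uses attention.", {"name": 3}),
    Doc("Diagram of the encoder-decoder stack.", {"name": "image_3_1.png"}),
]
context = format_docs(retrieved)
print(context)
```

In the notebook, `format_docs(docs)` could then be passed as the `documents` value in `rag_chain.invoke` instead of `docs[0].page_content`.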
