# Building a RAG Pipeline over IKEA Product Instruction Manuals

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/product_manual_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows how to use LlamaParse and OpenAI's multimodal models to query IKEA instruction manual PDFs, which consist mainly of images and diagrams showing how to assemble each product.

LlamaParse and multimodal LLMs can interpret these diagrams and translate them into textual instructions. With textual assistance, confusing visual instructions within the IKEA product manuals can be made easier to understand and interpret. Additionally, textual instructions can be helpful for those who are visually impaired.

## Install and Setup

Install LlamaIndex, download the data, and apply `nest_asyncio`.

In [1]:
%pip install llama-index llama-parse llama-index-multi-modal-llms-openai git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-eu7l0d13
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-eu7l0d13
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.


In [2]:
!wget https://github.com/user-attachments/files/16461058/data.zip -O data.zip
!unzip -o data.zip
!rm data.zip

--2024-08-11 20:40:46--  https://github.com/user-attachments/files/16461058/data.zip
Resolving github.com (github.com)... 20.26.156.215
Connecting to github.com (github.com)|20.26.156.215|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-repository-file-5c1aeb/835367238/16461058?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240811%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240811T204046Z&X-Amz-Expires=300&X-Amz-Signature=0f84f886c6a0dd95cd616851d7871abed95bd063f721a847ddeb987c899d2e37&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=835367238&response-content-disposition=attachment%3Bfilename%3Ddata.zip&response-content-type=application%2Fzip [following]
--2024-08-11 20:40:46--  https://objects.githubusercontent.com/github-production-repository-file-5c1aeb/835367238/16461058?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240811%2Fus-east-1%

In [3]:
import nest_asyncio

nest_asyncio.apply()

Set up your OpenAI and LlamaCloud keys.

In [4]:
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["LLAMA_CLOUD_API_KEY"] = ""
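Hard-coding keys in a notebook makes them easy to leak. As a minimal sketch, assuming you export the keys in your shell instead of filling in the empty strings above, the hypothetical helper `require_env` (not part of LlamaIndex or LlamaParse) fails early with a clear message when a key is missing or blank.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank.

    `require_env` is a hypothetical helper for this notebook, not a library function.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value
```

With the empty-string assignments removed, calling `require_env("OPENAI_API_KEY")` and `require_env("LLAMA_CLOUD_API_KEY")` at this point surfaces a missing key before any parsing job is started.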

## Code Implementation

Set up LlamaParse to parse the PDF files into markdown with the GPT-4o multimodal model, then collect the files to parse.

In [5]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="You are given IKEA assembly instruction manuals",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    show_progress=True,
)

In [6]:
DATA_DIR = "data"


def get_data_files(data_dir=DATA_DIR) -> list[str]:
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files


files = get_data_files()
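`get_data_files` returns every regular file in the directory, so stray files such as `.DS_Store` would also be sent to the parser. A possible variant, sketched here under the assumption that only PDFs should be parsed, filters by extension and sorts the result so the order is stable across runs. The name `get_pdf_files` is made up for this example.

```python
from pathlib import Path


def get_pdf_files(data_dir: str = "data") -> list[str]:
    """Return the PDF files directly inside data_dir, sorted by path."""
    return sorted(
        str(p)
        for p in Path(data_dir).iterdir()
        # skip sub-directories and anything without a .pdf extension
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

It could stand in for the call above as `files = get_pdf_files(DATA_DIR)`.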

Parse the files and save the page images from the PDFs into the `data_images` directory.

In [7]:
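md_json_objs = parser.get_json_result(files)
# collect the pages of every parsed file (indexing [0] would keep only the first file)
md_json_list = [page for obj in md_json_objs for page in obj["pages"]]
image_dicts = parser.get_images(md_json_objs, download_path="data_images")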

Parsing files:   0%|          | 0/5 [00:00<?, ?it/s]

Started parsing the file under job_id 6ee13e2f-7cbc-4e48-9e6e-1cc7457671e7
Started parsing the file under job_id be838cd0-34f0-461a-88cb-c058b0fe0536
Started parsing the file under job_id e2e73452-4caa-449e-9143-16f9142e9447
Started parsing the file under job_id 13422c96-3546-49d2-9081-55dbab21717a


Parsing files:  20%|██        | 1/5 [00:02<00:10,  2.65s/it]

Started parsing the file under job_id 5a9b18b6-adf5-476e-b105-39118aad60e7


Parsing files: 100%|██████████| 5/5 [00:12<00:00,  2.50s/it]


> Image for page 1: [{'name': 'page-0.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 2: [{'name': 'page-1.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 3: [{'name': 'page-2.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 4: [{'name': 'page-3.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 5: [{'name': 'page-4.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 6: [{'name': 'page-5.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 7: [{'name': 'page-6.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 8: [{'name': 'page-7.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 9: [{'name': 'page-8.jpg', 'height': 0,

Save the parsed results to disk, then create helper functions that build a list of `TextNode`s from the parsed markdown to feed into the `VectorStoreIndex`.

In [9]:
import json

# save the parsed JSON results
fn = "./mdobjects.json"
with open(fn, "w") as f:
    json.dump(md_json_objs, f)

# save the image metadata returned by get_images
fn2 = "./imagedicts.json"
with open(fn2, "w") as f:
    json.dump(image_dicts, f)
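Since the results are written to disk, a later session could reuse them instead of calling the parser again. The following is only a sketch: `load_cached_json` is a hypothetical helper, and it assumes the files were written by the cell above and are still valid for your data.

```python
import json
from pathlib import Path


def load_cached_json(path: str):
    """Return the parsed JSON stored at path, or None if the file does not exist."""
    cache = Path(path)
    if not cache.exists():
        return None
    with cache.open() as f:
        return json.load(f)
```

For example, `cached = load_cached_json("./mdobjects.json")` returns the saved list when the file is present, and `None` otherwise, in which case you would fall back to `parser.get_json_result(files)`.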

In [10]:
import re
from pathlib import Path
import typing as t
from llama_index.core.schema import TextNode


def get_page_number(file_name):
    """Gets page number of images using regex on file names"""
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0


def _get_sorted_image_files(image_dir):
    """Get image files sorted by page."""
    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]
    sorted_files = sorted(raw_files, key=get_page_number)
    return sorted_files


def get_text_nodes(json_dicts, image_dir) -> t.List[TextNode]:
    """Creates nodes from json + images"""

    nodes = []

    docs = [doc["md"] for doc in json_dicts]  # extract text
    image_files = _get_sorted_image_files(image_dir)  # extract images

    for idx, doc in enumerate(docs):
        # create one text node per page; the path of the page's screenshot is stored in metadata
        node = TextNode(
            text=doc,
            metadata={"image_path": str(image_files[idx]), "page_num": idx + 1},
        )
        nodes.append(node)

    return nodes


text_nodes = get_text_nodes(md_json_list, "data_images")
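Note that `get_page_number` keys only on the trailing page number, so when `data_images` holds screenshots from several manuals, pages with the same number from different manuals end up next to each other. The sketch below shows one possible sort key that groups by file prefix first. It assumes the file names follow a `<prefix>-page-<n>.jpg` pattern, as the regex above implies; `page_sort_key` and the sample names are made up for illustration.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"^(?P<prefix>.+)-page-(?P<page>\d+)\.jpg$")


def page_sort_key(file_name) -> tuple[str, int]:
    """Sort key: (everything before '-page-N.jpg', N as an integer)."""
    name = Path(file_name).name
    match = PAGE_RE.match(name)
    if match:
        return match.group("prefix"), int(match.group("page"))
    # files that do not follow the pattern sort by their name, with page 0
    return name, 0


names = ["b-page-0.jpg", "a-page-10.jpg", "a-page-2.jpg", "b-page-1.jpg"]
ordered = sorted(names, key=page_sort_key)
print(ordered)
# ['a-page-2.jpg', 'a-page-10.jpg', 'b-page-0.jpg', 'b-page-1.jpg']
```

This keeps each manual's pages together and in numeric order. Pairing images with text by list index still requires the text pages to be arranged in the same file order, which this sketch does not guarantee.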

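Index the nodes. The index is persisted to `storage_ikea`, so later runs load it from disk instead of embedding the nodes again.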

In [11]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI("gpt-4o")

Settings.llm = llm
Settings.embed_model = embed_model

if not os.path.exists("storage_ikea"):
    index = VectorStoreIndex(text_nodes, embed_model=embed_model)
    index.storage_context.persist(persist_dir="./storage_ikea")
else:
    ctx = StorageContext.from_defaults(persist_dir="./storage_ikea")
    index = load_index_from_storage(ctx)

retriever = index.as_retriever()

Create a custom query engine that uses GPT-4o's multimodal model.

In [12]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import NodeWithScore, MetadataMode
from llama_index.core.base.response.schema import Response
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import ImageNode

QA_PROMPT_TMPL = """\
Below we give parsed text from slides in two different formats, as well as the image.

We parse the text in both 'markdown' mode as well as 'raw text' mode. Markdown mode attempts \
to convert relevant diagrams into tables, whereas raw text tries to maintain the rough spatial \
layout of the text.

Use the image information first and foremost. ONLY use the text/markdown information 
if you can't understand the image.

---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query. Explain whether you got the answer
from the parsed markdown or raw text or image, and if there's discrepancies, and your reasoning for the final answer.

Query: {query_str}
Answer: """

QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)

gpt_4o_mm = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)


class MultimodalQueryEngine(CustomQueryEngine):
    qa_prompt: PromptTemplate
    retriever: BaseRetriever
    multi_modal_llm: OpenAIMultiModal

    def __init__(
        self,
        qa_prompt: PromptTemplate,
        retriever: BaseRetriever,
        multi_modal_llm: OpenAIMultiModal,
    ):
        super().__init__(
            qa_prompt=qa_prompt, retriever=retriever, multi_modal_llm=multi_modal_llm
        )

    def custom_query(self, query_str: str):
        # retrieve most relevant nodes
        nodes = self.retriever.retrieve(query_str)

        # create image nodes from the image associated with those nodes
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=n.node.metadata["image_path"]))
            for n in nodes
        ]

        # create context string from parsed markdown text
        ctx_str = "\n\n".join(
            [r.node.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )
        # prompt for the LLM
        fmt_prompt = self.qa_prompt.format(context_str=ctx_str, query_str=query_str)

        # use the multimodal LLM to interpret images and generate a response to the prompt
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=[image_node.node for image_node in image_nodes],
        )
        return Response(
            response=str(llm_response),
            source_nodes=nodes,
            # report the retrieved nodes, not the global list of all text nodes
            metadata={"text_nodes": nodes, "image_nodes": image_nodes},
        )
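
To make the prompt assembly in `custom_query` concrete, here is a dependency-free sketch that uses plain `str.format` in place of `PromptTemplate` and plain strings in place of retrieved nodes. The shortened template and the page texts are invented for illustration. In the real engine each page's content is rendered with `MetadataMode.LLM`, which roughly prepends the node's metadata to its text.

```python
TEMPLATE = (
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(page_texts: list[str], query: str) -> str:
    """Join retrieved page texts with blank lines and fill the template."""
    context = "\n\n".join(page_texts)
    return TEMPLATE.format(context_str=context, query_str=query)


prompt = build_prompt(
    ["page_num: 1\n\nStep 1 ...", "page_num: 2\n\nStep 2 ..."],
    "What is step 2?",
)
print(prompt)
```

The images are not part of this string. They are passed separately through `image_documents`, one per retrieved node.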

Create a query engine instance.

In [13]:
query_engine = MultimodalQueryEngine(
    qa_prompt=QA_PROMPT,
    retriever=index.as_retriever(similarity_top_k=9),
    multi_modal_llm=gpt_4o_mm,
)
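
`similarity_top_k=9` asks the retriever for the nine highest-scoring nodes. As a rough illustration of what top-k retrieval by cosine similarity means, the standalone sketch below ranks made-up two-dimensional vectors. It is not LlamaIndex's implementation, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs with the highest cosine similarity to query."""
    scored = [(node_id, cosine(query, vec)) for node_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


pages = {"page-1": [1.0, 0.0], "page-2": [0.6, 0.8], "page-3": [0.0, 1.0]}
best = top_k([1.0, 0.0], pages, k=2)
print([node_id for node_id, _ in best])
```

Because `custom_query` builds one image node per retrieved text node, a larger `similarity_top_k` also means more page images are sent to the multimodal model with each query.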


## Example Queries

In [14]:
from IPython.display import display, Markdown

response = query_engine.query("What parts are included in the Uppspel?")
display(Markdown(str(response)))

The query asks for the parts included in the Uppspel. However, the images provided are for different IKEA products (SMÅGÖRA, FREDDE, and TUFFING) and do not include any information about the Uppspel.

Given the context information from the parsed markdown and raw text, there is no mention of Uppspel. The parts listed in the parsed text are for different sections labeled A, B, C2, and Additional Parts, but none of these sections are explicitly linked to Uppspel.

Therefore, based on the provided images and text, there is no information available about the parts included in the Uppspel.

In [15]:
response = query_engine.query("What does the Tuffing look like?")
display(Markdown(str(response)))

The query asks about the appearance of the "Tuffing." However, based on the provided images and parsed text, there is no mention or depiction of a product named "Tuffing." The images and text primarily refer to products like "NORDLI," "FREDDE," and "SMÅGÖRA."

Therefore, I cannot provide an image or description of the "Tuffing" as it is not included in the provided materials. My conclusion is based on the images and the parsed text, which do not contain any reference to "Tuffing."

In [16]:
response = query_engine.query("What is step 4 of assembling the Nordli?")
display(Markdown(str(response)))

Step 4 of assembling the Nordli involves the following:

- Insert 4x screws (117345) into the designated holes on the two panels as shown.

This information was obtained from the parsed markdown text provided. The image associated with step 4 was not included in the provided images, so the text was used to determine the step. There were no discrepancies between the markdown and raw text for this step.

In [17]:
response = query_engine.query(
    "What should I do if I'm confused with reading the manual?"
)
display(Markdown(str(response)))

If you are confused with reading the manual, you should refer to the manual itself for additional guidance. If you still need help, you can contact IKEA customer service.

This information was derived from the image associated with the text:

- "If you have questions, refer to the manual."
- "If you still need help, contact IKEA customer service."

There are no discrepancies between the parsed markdown/raw text and the image. The image clearly shows the instructions for seeking help if you are confused.
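
When an answer looks wrong, it helps to check which pages the retriever actually returned. The sketch below relies only on attributes the custom engine above populates: `response.source_nodes`, each item's `score`, and the `page_num` and `image_path` metadata set in `get_text_nodes`. The helper name `summarize_sources` is hypothetical.

```python
from typing import Any, Optional


def summarize_sources(response: Any) -> list[tuple[Any, Any, Optional[float]]]:
    """List (page_num, image_path, score) for each retrieved source node."""
    rows = []
    for item in response.source_nodes:
        meta = item.node.metadata
        rows.append((meta.get("page_num"), meta.get("image_path"), item.score))
    return rows
```

Running `for row in summarize_sources(response): print(row)` after a query shows whether the retrieved pages and their paired images belong to the manual you asked about.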

You can also create an agent around the query engine and chat with the agent.

In [18]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.tools import QueryEngineTool

query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="query_engine_tool",
    description="Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data.",
)
agent = FunctionCallingAgentWorker.from_tools(
    [query_engine_tool], llm=llm, verbose=True
).as_agent()

In [19]:
response = agent.chat(
    "Give a step-by-step instruction guide on how to assemble the Smagora"
)
display(Markdown(str(response)))

Added user message to memory: Give a step-by-step instruction guide on how to assemble the Smagora


=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "step-by-step instruction guide on how to assemble the Smagora"}
=== Function Output ===
The provided images and parsed text do not contain any specific instructions for assembling a product named "Smagora." The images and text provided are related to different IKEA products and their assembly instructions, but none of them mention "Smagora."

Therefore, I cannot provide a step-by-step instruction guide for assembling the "Smagora" based on the given information. If you have the specific instructions or images for "Smagora," please provide them, and I can assist you further.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Smagora assembly instructions"}
=== Function Output ===
The query "Smagora assembly instructions" does not match any of the provided images or parsed text. The images and text provided are related to IKEA assembly instructions for different furniture it

It appears that there are no available instructions for assembling a product named "Smagora" in the provided data. If you have any specific details or additional information about the product, please share them, and I will do my best to assist you further. Alternatively, if you have access to the product manual or any images related to the assembly, those would be helpful as well.

In [20]:
response = agent.chat("How do I assemble the Fredde?")
display(Markdown(str(response)))

Added user message to memory: How do I assemble the Fredde?


=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Fredde assembly instructions"}
=== Function Output ===
The query asks for the assembly instructions for the "Fredde" furniture. Based on the provided images and parsed text, there is no direct mention or image of the "Fredde" assembly instructions. The images and text provided refer to various other furniture assembly instructions, but none specifically for "Fredde."

Therefore, I cannot provide the assembly instructions for "Fredde" from the given information. If you have the specific manual or images for "Fredde," please provide them, and I can assist you further.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "IKEA Fredde assembly instructions"}
=== Function Output ===
The query asks for IKEA Fredde assembly instructions. Based on the provided images and parsed text, the instructions for IKEA Fredde are not included. The images and text provided are for different IKE

It seems that the assembly instructions for the IKEA Fredde desk are not available in the provided data. However, you can typically find the assembly instructions for IKEA products on their official website or in the manual that comes with the product.

Here are general steps you can follow to assemble the IKEA Fredde desk:

1. **Unpack the Components:**
   - Lay out all the parts and hardware included in the package.
   - Check the instruction manual to ensure you have all the necessary components.

2. **Prepare Your Tools:**
   - You will typically need a screwdriver (both flathead and Phillips), a hammer, and possibly an Allen wrench (usually included in the package).

3. **Assemble the Frame:**
   - Start by assembling the main frame of the desk. This usually involves connecting the side panels to the back panel.
   - Use the provided screws and dowels to secure the panels together.

4. **Attach the Shelves:**
   - Follow the instructions to attach the shelves to the frame. This may include the main desk surface and any additional shelves for monitors or accessories.
   - Ensure that each shelf is level and securely fastened.

5. **Install the Legs:**
   - Attach the legs to the bottom of the desk frame. Make sure they are securely fastened and stable.

6. **Add Accessories:**
   - If the desk includes any additional accessories such as cable management systems or monitor stands, attach these according to the instructions.

7. **Final Adjustments:**
   - Check all screws and connections to ensure everything is tight and secure.
   - Adjust the desk to its final position and make sure it is level.

8. **Clean Up:**
   - Remove any packaging materials and tools from the assembly area.
   - Wipe down the desk to remove any dust or fingerprints.

If you need the specific manual for the IKEA Fredde desk, you can usually download it from the [IKEA website](https://www.ikea.com) by searching for the product name and looking for the assembly instructions PDF.