# Building a RAG Pipeline

## Preprocessing with Unstructured.io
The first leg of the RAG pipeline takes unstructured data in myriad formats and converts it to semi-structured records (JSON). Unstructured.io provides a toolkit that makes it easy to build a pipeline for transforming many forms of unstructured data. In this demo, I focus on PDF extraction because PDFs are among the most difficult documents to parse. However, Unstructured.io provides tools to parse many other document types.

In [1]:
import logging

# Set up logging to display only ERROR and higher level messages
logging.basicConfig(level=logging.ERROR)

from utils import Utils, Preprocess
from unstructured_client import UnstructuredClient

utils = Utils()

UNSTRUCTURED_API_KEY = utils.get_api_key("UNSTRUCTURED")
client = UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY)

filename = "rtx_paper.pdf"

Documents like HTML, Word, and Markdown can be processed with rules-based parsers because formatting information is inherent in the document structure (e.g. HTML tags for headers, tables, etc.). However, other documents like PDFs and images don't have this structural metadata, so it has to be inferred visually. **YOLOX is the Document Layout Detection model I used to draw bounding boxes around text for classification.**

In [2]:

# instantiate Preprocess class with file path
doc = Preprocess(filename)

# read in the file
files = doc.read_file()

# build request (instruct the API on how to parse the file)
req = doc.partition_file(files, strategy='hi_res', model_name='yolox')

# store the parsed file as records (list of dictionaries) and Elements (list of Element objects)
records, elements = doc.get_structured_text(client, req)


An **Element** is the atomic object of the unstructured.io output. It represents a component of the source document that has been partitioned and is designed to preserve the semantic structure of the document. Each element object has the following information:
- **Type:** indicates the type of element (NarrativeText, Title, Header, etc)
- **Element ID:** unique identifier for the element
- **Text:** the extracted text content of the element
- **Metadata:** dictionary containing additional information about the element

Sometimes it's easier to inspect the partitioned output as records. Other times, it's easier with the Element object. It's the same content either way.

## Inspect

Inspecting the partitioned output is important to get a feel for what matters and what doesn't. In this case, I'm building a RAG pipeline for arXiv research papers. These PDFs tend to be similarly structured (with headers, references, tables and images).

In [3]:
from utils import Inspect

# instantiate Inspect class
inspect = Inspect(records=records, elements=elements)

# get unique elements and counts
inspect.count_elements()


[('ListItem', 101),
 ('NarrativeText', 64),
 ('Title', 6),
 ('Image', 5),
 ('FigureCaption', 5),
 ('UncategorizedText', 4),
 ('Table', 2)]

Above, I can see there are seven types of elements in the PDF. For example, there are 101 elements classified as ListItem (this seems like a lot). I can now start to understand which elements are important (i.e. not only what I care about in this particular PDF, but what will generalize over all arXiv PDFs I want to query in the future).
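
A count like the one above takes only a few lines. The sketch below is a guess at what a helper such as `count_elements` might do, assuming each record is a dictionary with a `type` key (as in the records printed in this notebook). The function name `count_element_types` and the sample records are made up for illustration.

```python
from collections import Counter


def count_element_types(records):
    """Return (element type, count) pairs, most frequent first."""
    counts = Counter(record["type"] for record in records)
    return counts.most_common()


# Made-up sample records; real ones come from the partitioned PDF
sample_records = [
    {"type": "ListItem", "text": "first bullet"},
    {"type": "NarrativeText", "text": "a paragraph"},
    {"type": "ListItem", "text": "second bullet"},
    {"type": "Title", "text": "I. INTRODUCTION"},
]

print(count_element_types(sample_records))
```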

To start, there are six Title elements. When I inspect these, I see they're section headers. These will be useful (more on this later). However, when I look at the document, it's clear that not all section headers were identified. There are 15 section headers in the document (I counted these manually), and only 5 of the 6 Title elements are real headers, so only a third were identified. There's also an erroneous title (the first one).

**There's room for improvement in the document layout detection model. It would perhaps be worth having a separate model just for Title identification, as this is the most crucial category to identify.**

While this isn't perfect, the Titles are still good to have (I'll use them later). Removing the erroneous title won't generalize to other arXiv papers, so it's not worth discarding.

In [4]:
# get dictionary of all Title elements with unique IDs
section_ids = inspect.get_section_id_dict()
section_ids

{'735c6af3e3d8a967022eda94d9e1434f': 'O R . s c [',
 '7e8a689d89b024a7dc47ebee59bb07fa': 'III. THE OPEN X-EMBODIMENT REPOSITORY',
 '482139f50b98da4165176eea35acb871': 'B. Dataset Analysis',
 'e110774179bc4d92f2382f2063c16a2f': 'IV. RT-X DESIGN',
 '6ffcbfd5e933ece63fe9d715a734707c': 'A. Data format consolidation',
 '0200b826f862b4f75389fd989b474835': 'REFERENCES'}

Document Layout Detection works by first using computer vision models (I used YOLOX) to classify regions of the page as one of the element types the model is trained to identify and to put bounding boxes around those elements. Text is then extracted from each bounding box, using Optical Character Recognition (OCR) when it can't be extracted directly from the document.

I wondered whether the issue was with the bounding box or the text extraction, so I wrote the following check. "II. RELATED WORK" is one of the section headers in the PDF. If it were missing from the output entirely, it would be a text extraction issue. But it's there, just nested in a NarrativeText element.

Takeaway: the bounding box isn't super accurate, but this isn't a major issue (failed text extraction would have been). As an alternative approach, I could try using a Vision Transformer, where the image/PDF is encoded and text is output in one step, but this is subject to hallucination. 
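
The same check can be run for every expected header at once. This is a minimal sketch, assuming records are dictionaries with `type` and `text` keys. The function `audit_headers` and the sample data are hypothetical, and the matching is a plain case-sensitive substring test, so a header like "II. ..." would also match inside "III. ...".

```python
def audit_headers(records, expected_headers):
    """Classify each expected header as 'title', 'nested', or 'missing'.

    'title'   -> detected as its own Title element (bounding box was right)
    'nested'  -> text was extracted but sits inside another element
    'missing' -> text not found anywhere (points to a text extraction problem)
    """
    report = {}
    for header in expected_headers:
        status = "missing"
        for record in records:
            if header not in record["text"]:
                continue
            if record["type"] == "Title":
                status = "title"
                break  # best possible outcome, stop looking
            status = "nested"
        report[header] = status
    return report


# Made-up sample records for illustration
sample_records = [
    {"type": "Title", "text": "IV. RT-X DESIGN"},
    {"type": "NarrativeText", "text": "end of a paragraph. II. RELATED WORK start of the next"},
]

report = audit_headers(
    sample_records,
    ["IV. RT-X DESIGN", "II. RELATED WORK", "X. NOT IN DOCUMENT"],
)
print(report)
```

A report dominated by 'nested' entries points at the bounding boxes, while 'missing' entries point at text extraction.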

In [5]:
# only 5 of 15 section titles are accurately identified in the metadata
# the unidentified section titles show up in narrative text not assigned a parent id 
from pprint import pprint

lookup_string = 'II. RELATED WORK'

for record in records:
    if lookup_string in record['text']:
        pprint(record)
        break

{'element_id': 'af2c605feab1658a4551124324242f92',
 'metadata': {'filename': 'rtx_paper.pdf',
              'filetype': 'application/pdf',
              'languages': ['eng'],
              'page_number': 2},
 'text': 'Following this rationale, we have two goals: (1) Evaluate whether '
         'policies trained on data from many different robots and environments '
         'enjoy the benefits of positive transfer, attaining better '
         'performance than policies trained only on data from each evaluation '
         'setup. (2) Organize large robotic datasets to enable future research '
         'on X-embodiment models. We focus our work on robotic manipulation. '
         'Addressing goal (1), our empirical contribution is to demonstrate '
         'that several recent robotic learning methods, with minimal mod- '
         'ification, can utilize X-embodiment data and enable positive '
         'transfer. Specifically, we train the RT-1 [8] and RT-2 [9] models on '
         '9 dif

Inspect the record types to see if the content is useful.
- ListItem: these are important bullet points that should be stored
- Image: text from the images isn't well structured, so it's not very interpretable
- UncategorizedText: not much useful information, mainly list of contributor names


In [6]:
# Inspect two records of type 'ListItem'
# --> 'text' key shows these are informative bullet points that should be stored
inspect.inspect_record_type('ListItem', max_items=2)

[{'type': 'ListItem',
  'element_id': '6821f000be4d925646adb547846faa3a',
  'text': '• Open X-Embodiment Dataset: robot learning dataset with 1M+ robot trajectories from 22 robot embodi- ments.',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 3,
   'parent_id': '7e8a689d89b024a7dc47ebee59bb07fa',
   'filename': 'rtx_paper.pdf'}},
 {'type': 'ListItem',
  'element_id': '784f76bdee32da2b616660d706a6013b',
  'text': '• Pre-Trained Checkpoints: a selection of RT-X model checkpoints ready for inference and finetuning.',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 3,
   'parent_id': '7e8a689d89b024a7dc47ebee59bb07fa',
   'filename': 'rtx_paper.pdf'}}]

In [7]:
# image items don't contain much useful text information
# --> remove image content from data set
inspect.inspect_record_type('Image', 2)

[{'type': 'Image',
  'element_id': '89eef53e1c83117154f60e8d8bfaaaab',
  'text': 'TOTO from Cable Routing RT-1 QT-Opt 1 @ 1M Episodes 311 Scenes [ ] — — L Research Labs across 21 Institutions — pour pick anything, 22 Embodiments ‘sweep the green Cloth to the lft ¢ side of the table RERE & RS pick green chip bag from counter S ) 527 Skills L A i o — A A 5 e e A o A= : set the bowl to o1 T8 b L pour stack route the right side of the table sl Stach cups 60 Datasets & place the black block L‘. & ﬁbo wl i the dish rack Y /E. 1) Pog < = A A 1,798 Attributes. 5,228 Objects. 23,486 S elations i = Jaco Play ALOHA Bridge Opening',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 1,
   'filename': 'rtx_paper.pdf'}},
 {'type': 'Image',
  'element_id': '7eb04e8cb6b49bb456577b8e9adea2df',
  'text': 'In ;% & FLS £ Sawyer < & &5 S § S 5 $ ;“ & § e’p & $ K 4 Kinova Gen3 FES & ° & K4 ey g S € ¢ Hello Stretch Bﬁcs\'l g e i WidowX Jackal WidowX Sawyer ¥ (a) # Datase

In [8]:
# uncategorized text doesn't contain much useful text information (it's a long list of contributor names)
# --> remove uncategorized text content from data set
inspect.inspect_record_type('uncategorizedtext')

[{'type': 'UncategorizedText',
  'element_id': 'f50ad2eb90733ac345a808a1a9e6e75f',
  'text': '4 2 0 2',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 1,
   'filename': 'rtx_paper.pdf'}},
 {'type': 'UncategorizedText',
  'element_id': '7630b82f56224e713057fe23dbfc1122',
  'text': ']',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 1,
   'filename': 'rtx_paper.pdf'}},
 {'type': 'UncategorizedText',
  'element_id': '661298c7e2144c15579c599385497ab9',
  'text': '6 v 4 6 8 8 0 . 0 1 3 2 : v i X r a',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 1,
   'parent_id': '735c6af3e3d8a967022eda94d9e1434f',
   'filename': 'rtx_paper.pdf'}},
 {'type': 'UncategorizedText',
  'element_id': 'bd57b596576ece5d43181c3f0e568935',
  'text': 'Open X-Embodiment: Robotic Learning Datasets and RT-X Models Open X-Embodiment Collaboration0 robotics-transformer-x.github.io Abby O’Neill32,

I wrote a method to view an org chart of the data. It shows every parent element along with the type and number of child elements that belong to it. I see that half of the UncategorizedText belongs to the erroneous Title section and the other half is without a parent. I can be comfortable dropping these.

Generally, this method helps build intuition about the data. For example, the title 'III. THE OPEN X-EMBODIMENT REPOSITORY' has two child element types that belong to it (NarrativeText and ListItem). There were also a lot of list items (101), and I can see that many of them belong to the References section (which makes sense).
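
The grouping behind such an org chart can be sketched as follows. This is only an illustration of the idea, not the actual `print_child_records` implementation. It assumes records shaped like the ones printed in this notebook (`element_id`, `type`, `text`, and an optional `parent_id` inside `metadata`); `summarize_children` and the sample records are hypothetical.

```python
from collections import Counter


def summarize_children(records):
    """Count child elements per (parent, child type) pair."""
    # Map element_id -> text so parents can be shown by their title text
    text_by_id = {r["element_id"]: r["text"] for r in records}
    counts = Counter()
    for r in records:
        parent_id = r["metadata"].get("parent_id")
        if parent_id is not None:  # parentless elements are skipped
            counts[(parent_id, r["type"])] += 1
    return [
        {
            "parent_text": text_by_id.get(parent_id, "<unknown parent>"),
            "child_type": child_type,
            "n_children": n,
        }
        for (parent_id, child_type), n in counts.items()
    ]


# Made-up sample records with shortened IDs
sample_records = [
    {"element_id": "t1", "type": "Title", "text": "III. THE OPEN X-EMBODIMENT REPOSITORY", "metadata": {}},
    {"element_id": "n1", "type": "NarrativeText", "text": "a paragraph", "metadata": {"parent_id": "t1"}},
    {"element_id": "l1", "type": "ListItem", "text": "a bullet", "metadata": {"parent_id": "t1"}},
    {"element_id": "l2", "type": "ListItem", "text": "another bullet", "metadata": {"parent_id": "t1"}},
    {"element_id": "u1", "type": "UncategorizedText", "text": "]", "metadata": {}},
]

summary = summarize_children(sample_records)
for row in summary:
    print(row)
```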

In [9]:

# display parent and child relationship structure
inspect.print_child_records()
    

PARENT TYPE: Title, PARENT ID: 735c6af3e3d8a967022eda94d9e1434f
PARENT TEXT: O R . s c [
    CHILD TYPE: UncategorizedText --> NUMBER OF CHILDREN: 2

PARENT TYPE: Title, PARENT ID: 7e8a689d89b024a7dc47ebee59bb07fa
PARENT TEXT: III. THE OPEN X-EMBODIMENT REPOSITORY
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 4

PARENT TYPE: Title, PARENT ID: 7e8a689d89b024a7dc47ebee59bb07fa
PARENT TEXT: III. THE OPEN X-EMBODIMENT REPOSITORY
    CHILD TYPE: ListItem --> NUMBER OF CHILDREN: 2

PARENT TYPE: Title, PARENT ID: 482139f50b98da4165176eea35acb871
PARENT TEXT: B. Dataset Analysis
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 1

PARENT TYPE: Title, PARENT ID: e110774179bc4d92f2382f2063c16a2f
PARENT TEXT: IV. RT-X DESIGN
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 1

PARENT TYPE: Title, PARENT ID: 6ffcbfd5e933ece63fe9d715a734707c
PARENT TEXT: A. Data format consolidation
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 8

PARENT TYPE: Title, PARENT ID: 6ffcbfd5e933e

There are two tables. I'm interested to know how well the model did at inferring the table structure and extracting the text.
- Table 1: The table structure is perfect. There are 6 rows and 4 columns. Only one text entry is incorrect: '9i%' should be '92%'.
- Table 2: Again, the table structure is perfect. There are 8 rows and 9 columns. However, many entries in the Row and Model columns are incorrect. The Row column isn't important for retrieval, but the Model column is. Every row below the first should be 'RT-2-X' instead of the varied entries below.

**Takeaway: Text extraction from tables is nearly flawless. My sense is that dashes throw off the OCR, because in the cases where there were issues, dashes either preceded or were embedded in the text.**
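
Table structure can also be checked programmatically rather than by eye. The sketch below parses an HTML table string (the kind of string stored in `text_as_html`) with the standard library and flags percentage cells that don't look numeric. The class and function names and the sample HTML are made up for illustration, and nested tables or `colspan`/`rowspan` attributes are not handled.

```python
import re
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every row in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_rows(html):
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


def suspect_percent_cells(rows):
    """Return cells that end in '%' but are not a plain number before it."""
    return [
        cell
        for row in rows
        for cell in row
        if cell.endswith("%") and not re.fullmatch(r"\d+(\.\d+)?%", cell)
    ]


# Made-up HTML that mimics one OCR slip
sample_html = (
    "<table>"
    "<tr><th>Model</th><th>Score</th></tr>"
    "<tr><td>RT-1</td><td>9i%</td></tr>"
    "<tr><td>RT-1-X</td><td>27%</td></tr>"
    "</table>"
)

rows = table_rows(sample_html)
print(len(rows), "rows x", len(rows[0]), "cols")
print(suspect_percent_cells(rows))
```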

In [10]:
# there are two tables extracted from the PDF
tables = [el for el in elements if el.category == "Table"]
tables

[<unstructured.documents.elements.Table at 0x711b806aa020>,
 <unstructured.documents.elements.Table at 0x711b806aa5c0>]

In [11]:
# let's see how well the DLD model did at inferring the table structure and translating the visual content to text.
# (display and HTML are imported from IPython.display; the IPython.core.display path is deprecated)
from IPython.display import display, HTML

table_html_1 = tables[0].metadata.text_as_html
table_html_2 = tables[1].metadata.text_as_html

display(HTML(str(table_html_1)))
display(HTML(str(table_html_2)))


| Evaluation Setting | Bridge | Bridge.1 | RT-1 paper 6 skills |
|---|---|---|---|
| Evaluation Location Robot Embodiment Original Method | RIS (Stanford) WidowX LCBC [95] | RAIL Lab (UCB) WidowX LCBC [95] | Google Robotic Lab Google Robot |
| Original Method | 13% | 13% | |
| RT-1 | 40% | 30% | 9i% |
| RT-1-X | 27% | 27% | 73% |
| RT-2-X (55B) | 50% | 30% | 91% |


| Row | Model | Size | History Length | Dataset | Co-Trained w/ Web | Initial Checkpoint | Emergent Skills Evaluation | RT-2 Generalization Evaluation |
|---|---|---|---|---|---|---|---|---|
| (6] | RT-2 | 55B | none | Google Robot action | Yes | Web-pretrained | 27.3% | 62% |
| ) | RT-2-X | 55B | none | Robotics data | Yes | Web-pretrained | 75.8% | 61% |
| 3) | RT-2-X | 55B | none | Robotics data except Bridge | Yes | Web-pretrained | 42.8% | 54% |
| “) | RT- | 5B | 2 | Robotics data | Yes | Web-pretrained | 44.4% | 52% |
| ) | RT-2- | 5B | none | Robotics data | Yes | Web-pretrained | 14.5% | 30% |
| ©6) | RT-2-X | 5B | 2 | Robotics data | No | From scratch | 0% | 1% |
| 7 | RT-2-. | 5B | 2 | Robotics data | No | Web-pretrained | 48.7% | 47% |


## Finish Preprocessing
As a result of inspecting the data, I decided to remove references and header section content (if any) from the output that will be used for retrieval. In the counts below, ListItem elements drop by the 52 list items that were in References (from 101 to 49), and the Image and UncategorizedText elements are removed entirely.

I then converted the final element objects to records and added section titles to metadata with the add_parent_to_metadata function. Metadata is important because it can be used explicitly in the query with hybrid search or implicitly with chunking. 
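
The metadata step can be sketched as a simple lookup. This is not the actual `add_parent_to_metadata` implementation, only an illustration of the idea, assuming records are dictionaries whose `metadata` may contain a `parent_id`, and that `section_ids` maps a Title's `element_id` to its text (as in the dictionary printed earlier). The function name and the shortened IDs are hypothetical.

```python
import copy


def add_section_to_metadata(records, section_ids):
    """Return copies of the records with a 'section' key added to metadata.

    Records without a parent, or whose parent is not a known Title,
    are returned without a 'section' key.
    """
    enriched = []
    for record in records:
        record = copy.deepcopy(record)  # do not mutate the caller's data
        parent_id = record["metadata"].get("parent_id")
        if parent_id in section_ids:
            record["metadata"]["section"] = section_ids[parent_id]
        enriched.append(record)
    return enriched


# Made-up sample data with shortened IDs
section_ids = {"t1": "III. THE OPEN X-EMBODIMENT REPOSITORY"}
sample_records = [
    {"element_id": "l1", "type": "ListItem", "text": "a bullet", "metadata": {"parent_id": "t1"}},
    {"element_id": "n1", "type": "NarrativeText", "text": "an orphan paragraph", "metadata": {}},
]

enriched = add_section_to_metadata(sample_records, section_ids)
print(enriched[0]["metadata"])
```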

Hybrid search combines semantic search with a structured query. For example, I could ask the LLM a question about the RTX paper and include a WHERE clause in the context that filters the retrieval to a particular section by using the 'section' key I added to the metadata. An enterprise-relevant use case might be filtering to the most recent policy and procedure document for employee onboarding queries. This would require a last-updated or created date in the metadata.
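
To make the idea concrete, here is a toy sketch of a metadata filter followed by similarity ranking. It assumes records carry a `section` key in `metadata`. The bag-of-words cosine similarity is a deliberately simple stand-in for a real embedding model, and `hybrid_search` and the sample records are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_search(records, query, section=None, top_k=3):
    # Structured part: keep only records from the requested section
    candidates = [
        r for r in records
        if section is None or r["metadata"].get("section") == section
    ]
    # Semantic part (toy version): rank what is left by similarity to the query
    query_vector = bag_of_words(query)
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vector, bag_of_words(r["text"])),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up sample records
sample_records = [
    {"element_id": "a", "text": "Robot learning dataset with many robot trajectories.",
     "metadata": {"section": "III. THE OPEN X-EMBODIMENT REPOSITORY"}},
    {"element_id": "b", "text": "Made-up text describing the model architecture.",
     "metadata": {"section": "IV. RT-X DESIGN"}},
    {"element_id": "c", "text": "Pre-trained checkpoints: a selection of model checkpoints.",
     "metadata": {"section": "III. THE OPEN X-EMBODIMENT REPOSITORY"}},
]

unfiltered = hybrid_search(sample_records, "model checkpoints", top_k=1)
filtered = hybrid_search(sample_records, "model checkpoints", section="IV. RT-X DESIGN")
print(unfiltered[0]["element_id"], [r["element_id"] for r in filtered])
```

With the filter applied, the best textual match is excluded because it belongs to a different section, which is exactly the behavior the WHERE clause is meant to enforce.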

Alternatively, metadata can be leveraged implicitly with chunking. This is the approach that's most relevant for my use case. A basic chunking strategy is naive: chunks simply include text up to a certain character limit, regardless of topic. A chunking strategy informed by document structure can improve response quality because each chunk only includes topically related text. Unstructured.io has a convenient chunk_by_title function that starts a new chunk at each Title element, so section content is grouped under the relevant title (parent). The drawback is that the vision model correctly identified only 5 of 15 section titles for the RTX paper, so this strategy is probably less impactful than it could be in this case.
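
The difference between the two strategies can be sketched in plain Python. This is a simplified illustration, not the library's `chunk_by_title` (for example, it has no equivalent of `combine_text_under_n_chars`). It assumes records are dictionaries with `type` and `text` keys; the function names and sample records are made up.

```python
def chunk_by_characters(text, max_characters):
    """Naive chunking: cut every max_characters, ignoring document structure."""
    return [text[i:i + max_characters] for i in range(0, len(text), max_characters)]


def chunk_by_section(records, max_characters):
    """Start a new chunk at every Title, or when the size limit would be exceeded.

    A single record longer than max_characters is kept whole in its own chunk.
    """
    chunks, current = [], []
    for record in records:
        # Length of the chunk if this record were appended (newline-joined)
        joined_length = sum(len(t) for t in current) + len(current) + len(record["text"])
        starts_section = record["type"] == "Title"
        if current and (starts_section or joined_length > max_characters):
            chunks.append("\n".join(current))
            current = []
        current.append(record["text"])
    if current:
        chunks.append("\n".join(current))
    return chunks


# Made-up sample records
sample_records = [
    {"type": "Title", "text": "A. Data format"},
    {"type": "NarrativeText", "text": "Text about data formats."},
    {"type": "Title", "text": "B. Analysis"},
    {"type": "NarrativeText", "text": "Text about the analysis."},
]

full_text = " ".join(r["text"] for r in sample_records)
naive_chunks = chunk_by_characters(full_text, 30)
section_chunks = chunk_by_section(sample_records, 200)

print(naive_chunks)
print(section_chunks)
```

The naive chunks cut across section boundaries, while each section-aware chunk starts with its title and contains only that section's text.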

In [14]:
# filter out irrelevant categories from pdf_elements
filter_category_list = ['image','uncategorizedtext']
pdf_elements = [el for el in elements if el.category.lower() not in (filter_category_list)]

# get parent_id of child elements that belong to header and references Title sections 
header_id, references_id = inspect.get_references_and_header_id(records)

# filter child elements from the references and header section
pdf_elements = [el for el in pdf_elements if el.metadata.parent_id not in (references_id, header_id)]

inspect.count_elements(elements=pdf_elements)

Header ID not found.


[('NarrativeText', 50),
 ('ListItem', 49),
 ('Title', 6),
 ('FigureCaption', 5),
 ('Table', 2)]

There are still 6 Title elements, which means the 'REFERENCES' title is still present (although its children have been removed). It should be possible to remove these parent elements by accessing the element_id attribute, but it failed. This may be an issue with the unstructured.io class. Converting the elements to records and operating on the records is an easy workaround.
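
On plain records, a section and its children can be removed in one pass. The sketch below assumes records with an `element_id` key and an optional `parent_id` in `metadata`; `drop_sections` and the sample records are hypothetical. One possible explanation for the failed attribute access, not verified against the version used here, is that the Element object exposes its identifier under a different attribute name (`id`) than the `element_id` key used in the serialized record.

```python
def drop_sections(records, section_ids_to_drop):
    """Remove the given Title records and every record that points to them."""
    # Ignore missing IDs (None) so parentless records are not dropped by accident
    drop = set(section_ids_to_drop) - {None}
    return [
        r for r in records
        if r["element_id"] not in drop
        and r["metadata"].get("parent_id") not in drop
    ]


# Made-up sample records with shortened IDs
sample_records = [
    {"element_id": "ref", "type": "Title", "text": "REFERENCES", "metadata": {}},
    {"element_id": "r1", "type": "ListItem", "text": "[1] a citation", "metadata": {"parent_id": "ref"}},
    {"element_id": "t3", "type": "Title", "text": "IV. RT-X DESIGN", "metadata": {}},
    {"element_id": "n1", "type": "NarrativeText", "text": "a paragraph", "metadata": {"parent_id": "t3"}},
    {"element_id": "n2", "type": "NarrativeText", "text": "an orphan paragraph", "metadata": {}},
]

# The second ID is None here to mimic a header section that was not found
kept = drop_sections(sample_records, ["ref", None])
print([r["element_id"] for r in kept])
```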

In [15]:

# 1. convert elements to records, 2. add section title to metadata
pdf_records_with_parent = doc.add_parent_to_metadata(pdf_elements, section_ids)

# remove references and header parent elements
# Note: this should be possible by accessing the Element class attributes (i.e. element_id), but didn't work for some reason.
#   Perhaps there's an issue with the unstructured.io class. Fortunately, operating on the record instead is an easy workaround.
pdf_records_with_parent = [record for record in pdf_records_with_parent if record['element_id'] not in (references_id, header_id)]


In [16]:
# save records as JSON
utils.save_json_line_by_line('data.json', pdf_records_with_parent)

Data saved successfully to data.json


In [17]:
# load json to work from saved data
data = utils.load_json_line_by_line('data.json')

Data loaded successfully from data.json
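
The helper names suggest a JSON Lines layout, with one record per line. A minimal sketch of such a save/load pair follows, assuming that layout; `save_jsonl` and `load_jsonl` are hypothetical names and are not the notebook's `Utils` methods.

```python
import json
import os
import tempfile


def save_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample records
sample_records = [
    {"element_id": "l1", "type": "ListItem", "text": "• a bullet", "metadata": {"page_number": 3}},
    {"element_id": "n1", "type": "NarrativeText", "text": "a paragraph", "metadata": {"page_number": 2}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")
    save_jsonl(path, sample_records)
    restored = load_jsonl(path)

print(restored == sample_records)
```

Writing line by line keeps each record independently parseable, which makes it easy to append new records or stream a large file.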


In [18]:
# check a known record to make sure the section name was added to metadata
for record in data:
    if record['type'].lower() == 'listitem' and record['element_id'] == '6821f000be4d925646adb547846faa3a':
        pprint(record)

{'element_id': '6821f000be4d925646adb547846faa3a',
 'metadata': {'filename': 'rtx_paper.pdf',
              'filetype': 'application/pdf',
              'languages': ['eng'],
              'page_number': 3,
              'parent_id': '7e8a689d89b024a7dc47ebee59bb07fa',
              'section': 'III. THE OPEN X-EMBODIMENT REPOSITORY'},
 'text': '• Open X-Embodiment Dataset: robot learning dataset with 1M+ robot '
         'trajectories from 22 robot embodi- ments.',
 'type': 'ListItem'}


In [19]:
# Note: does not show parentless child elements
inspect.print_child_records(records=data)

PARENT TYPE: Title, PARENT ID: 7e8a689d89b024a7dc47ebee59bb07fa
PARENT TEXT: III. THE OPEN X-EMBODIMENT REPOSITORY
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 4

PARENT TYPE: Title, PARENT ID: 7e8a689d89b024a7dc47ebee59bb07fa
PARENT TEXT: III. THE OPEN X-EMBODIMENT REPOSITORY
    CHILD TYPE: ListItem --> NUMBER OF CHILDREN: 2

PARENT TYPE: Title, PARENT ID: 482139f50b98da4165176eea35acb871
PARENT TEXT: B. Dataset Analysis
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 1

PARENT TYPE: Title, PARENT ID: e110774179bc4d92f2382f2063c16a2f
PARENT TEXT: IV. RT-X DESIGN
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 1

PARENT TYPE: Title, PARENT ID: 6ffcbfd5e933ece63fe9d715a734707c
PARENT TEXT: A. Data format consolidation
    CHILD TYPE: NarrativeText --> NUMBER OF CHILDREN: 8

PARENT TYPE: Title, PARENT ID: 6ffcbfd5e933ece63fe9d715a734707c
PARENT TEXT: A. Data format consolidation
    CHILD TYPE: ListItem --> NUMBER OF CHILDREN: 1



In [20]:
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements

elements_with_parent = dict_to_elements(data)

chunks = chunk_by_title(
    elements_with_parent,
    combine_text_under_n_chars=100,
    max_characters=3000,
)

In [21]:
from IPython.display import JSON
import json

# take a look at a chunk
JSON(json.dumps(chunks[0].to_dict(), indent=2))



<IPython.core.display.JSON object>

## Indexing with LlamaIndex

In [22]:
# index documents
from llama_index.core import VectorStoreIndex, ServiceContext, Document
from llama_index.llms import openai

In [23]:
# compile the chunks into one document to search over
document = Document(text='\n\n'.join([chunk.text for chunk in chunks]))

In [25]:
# get OpenAI API key
OPEN_API_KEY = utils.get_api_key("OPENAI")

# define a service context that contains both the llm and the embedding model
# (the key is passed to the LLM directly: `openai` here is the llama_index.llms.openai module,
#  so assigning `openai.api_key` on it would not configure the OpenAI client)
llm = openai.OpenAI(model='gpt-3.5-turbo', temperature=0.1, api_key=OPEN_API_KEY)
service_context = ServiceContext.from_defaults(llm=llm, embed_model='local:BAAI/bge-small-en-v1.5')
index = VectorStoreIndex.from_documents([document], service_context=service_context)


In [26]:
query_engine = index.as_query_engine()

In [27]:
response = query_engine.query(
    "what is x-embodiment training?"
)

pprint(str(response))

('X-embodiment training involves training a policy on data from multiple robot '
 'embodiments without employing mechanisms to reduce the embodiment gap. The '
 'goal is to leverage diverse data from various robot embodiments to achieve '
 'positive transfer and improve the range of tasks that can be performed by a '
 'robot.')
