<a href="https://colab.research.google.com/github/asadjv/data/blob/main/notebooks/en/rag_with_unstructured_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building RAG with Custom Unstructured Data

_Authored by: [Maria Khalusova](https://github.com/MKhalusova)_

If you're new to RAG, please explore the basics of RAG first in [this other notebook](https://huggingface.co/learn/cookbook/rag_zephyr_langchain), and then come back here to learn about building RAG with custom data.

Whether you're building your own RAG-based personal assistant, a pet project, or an enterprise RAG system, you will quickly discover that a lot of important knowledge is stored in various formats like PDFs, emails, Markdown files, PowerPoint presentations, HTML pages, Word documents, and so on.

How do you preprocess all of this data in a way that you can use it for RAG?
In this quick tutorial, you'll learn how to build a RAG system that will incorporate data from multiple data types. You'll use [Unstructured](https://github.com/Unstructured-IO/unstructured) for data preprocessing, open-source models from Hugging Face Hub for embeddings and text generation, ChromaDB as a vector store, and LangChain for bringing everything together.

Let's go! We'll begin by installing the required dependencies:

In [1]:
!pip install -q torch transformers accelerate bitsandbytes sentence-transformers unstructured[all-docs] langchain chromadb langchain_community


Next, let's get a mix of documents. Suppose I want to build a RAG system that'll help me manage pests in my garden. For this purpose, I'll use diverse documents that cover the topic of IPM (integrated pest management):
* PDF: `https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf`
* PowerPoint: `https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx`
* EPUB: `https://www.gutenberg.org/ebooks/45957`
* HTML: `https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html`

Feel free to use your own documents for your topic of choice from the list of document types supported by Unstructured: `.eml`, `.html`, `.md`, `.msg`, `.rst`, `.rtf`, `.txt`, `.xml`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.heic`, `.csv`, `.doc`, `.docx`, `.epub`, `.odt`, `.pdf`, `.ppt`, `.pptx`, `.tsv`, `.xlsx`.
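
If you bring your own files, a quick pre-check can show which of them have an extension from the list above. The sketch below is only an illustration: the helper name `find_supported_files` is made up for this example, the extension set is copied from the list above, and `./documents` is assumed to be the folder you keep your files in.

```python
from pathlib import Path

# Extensions copied from the list of document types above
SUPPORTED_EXTENSIONS = {
    ".eml", ".html", ".md", ".msg", ".rst", ".rtf", ".txt", ".xml",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic", ".csv", ".doc",
    ".docx", ".epub", ".odt", ".pdf", ".ppt", ".pptx", ".tsv", ".xlsx",
}


def find_supported_files(directory, recursive=False):
    """Return sorted paths under `directory` whose extension is in the supported set."""
    root = Path(directory)
    if not root.is_dir():
        return []
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path for path in candidates
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Hypothetical usage: list the files in ./documents that have a supported extension
for path in find_supported_files("./documents"):
    print(path)
```

The check only looks at file extensions, so it cannot tell whether a file is actually readable; treat it as a first filter before partitioning.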

In [3]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

# Create the target folder
!mkdir -p "./documents"

# Copy your contract PDF from Google Drive to local working folder
!cp "/content/drive/My Drive/document/100 Contract & Scope - Trial Data.pdf" "./documents/100 Contract & Scope - Trial Data.pdf"

Mounted at /content/drive


## Unstructured data preprocessing

You can use the Unstructured library to preprocess documents one by one and write your own script to walk through a directory, but it's easier to use a Local source connector to ingest all documents in a given directory. Unstructured can ingest documents from local directories, S3 buckets, blob storage, SFTP, and many other places where your documents might be stored. Ingestion from those sources is very similar, differing mostly in authentication options.
Here you'll use the Local source connector, but feel free to explore other options in the [Unstructured documentation](https://docs.unstructured.io/open-source/ingest/source-connectors/overview).

Optionally, you can also choose a [destination](https://docs.unstructured.io/open-source/ingest/destination-connectors/overview) for the processed documents - this could be MongoDB, Pinecone, Weaviate, etc. In this notebook, we'll keep everything local.

In [4]:
# Optional cell to reduce the amount of logs

import logging

logger = logging.getLogger("unstructured.ingest")
# Drop the first root handler, if any, to reduce log output
if logger.root.handlers:
    logger.root.removeHandler(logger.root.handlers[0])

In [9]:
# Install Unstructured with all document extras (this covers PDF parsing),
# plus LangChain and ChromaDB
!pip install -q "unstructured[all-docs]" langchain chromadb

from unstructured.partition.pdf import partition_pdf

# Parse your contract PDF into structured elements
elements = partition_pdf(filename="./documents/100 Contract & Scope - Trial Data.pdf")

# Convert elements to clean text for embedding
texts = [element.text for element in elements if element.text.strip() != ""]

print(texts[:3])  # Preview first 3 chunks




['DATED', '2023', 'CONFIRMATION NOTICE NO. 2']



Let's take a closer look at the configs that we have here.

`ProcessorConfig` controls various aspects of the processing pipeline, including output locations, number of workers, error handling behavior, logging verbosity and more. The only mandatory parameter here is the `output_dir` - the local directory where you want to store the outputs.

`ReadConfig` can be used to customize the data reading process for different scenarios, such as re-downloading data, preserving downloaded files, or limiting the number of documents processed. In most cases the default `ReadConfig` will work.

In the `PartitionConfig` you can choose whether to partition the documents locally or via the API. This example uses the API, and for this reason requires an Unstructured API key. You can get yours [here](https://unstructured.io/api-key-free). The free Unstructured API is capped at 1000 pages, and offers better OCR models for image-based documents than a local installation of Unstructured.
If you remove these two parameters, the documents will be processed locally, but you may need to install additional dependencies if the documents require OCR and/or document understanding models. Namely, you may need to install poppler and tesseract in this case, which you can get with brew:

```
!brew install poppler
!brew install tesseract
```

If you're on Windows, you can find alternative installation instructions in the [Unstructured docs](https://docs.unstructured.io/open-source/installation/full-installation).

Finally, in the `SimpleLocalConfig` you need to specify where your original documents reside, and whether you want to walk through the directory recursively.

Once the documents are processed, you'll find 4 JSON files in the `local-ingest-output` directory, one per document that was processed.
Unstructured partitions all types of documents in a uniform manner, and returns JSON with document elements.

[Document elements](https://docs.unstructured.io/api-reference/api-services/document-elements) have a type, e.g. `NarrativeText`, `Title`, or `Table`, and they contain the extracted text along with the metadata that Unstructured was able to obtain. Some metadata is common to all elements, such as the filename of the document the element comes from. Other metadata depends on the file type or element type. For example, a `Table` element will contain the table's representation as HTML in the metadata, and metadata for emails will contain information about senders and recipients.
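
To make the element structure more concrete, here is a small hand-written sample parsed with the standard `json` module. The values are invented for illustration. The keys `type`, `text`, `metadata`, `filename`, and `text_as_html` follow the description above, but treat the exact layout as an assumption and check it against your own output files.

```python
import json
from collections import Counter

# Hand-written sample in the shape described above; the values are made up
sample_json = """
[
  {"type": "Title", "text": "Pest management basics",
   "metadata": {"filename": "guide.html"}},
  {"type": "NarrativeText", "text": "Aphids feed on plant sap.",
   "metadata": {"filename": "guide.html"}},
  {"type": "Table", "text": "Pest Treatment Aphid Soap",
   "metadata": {"filename": "guide.html",
                "text_as_html": "<table><tr><td>Aphid</td><td>Soap</td></tr></table>"}}
]
"""

elements_as_dicts = json.loads(sample_json)

# Count how many elements of each type the sample contains
type_counts = Counter(element["type"] for element in elements_as_dicts)
print(type_counts)

# Table elements carry an HTML rendering of the table in their metadata
tables_html = [
    element["metadata"]["text_as_html"]
    for element in elements_as_dicts
    if element["type"] == "Table"
]
print(tables_html)
```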

Let's import element objects from these json files.

In [11]:
from unstructured.partition.pdf import partition_pdf

# Parse the PDF directly from your documents folder
pdf_path = "./documents/100 Contract & Scope - Trial Data.pdf"
elements = partition_pdf(filename=pdf_path)

# Collect the text of every non-empty element into one list
texts = [element.text for element in elements if element.text.strip()]

# Now `texts` contains all text chunks extracted from your PDF
print(f"Extracted {len(texts)} text elements.")


Extracted 3679 text elements.


Now that you have extracted the elements from the documents, you can chunk them to fit the context window of the embeddings model.

## Chunking

If you are familiar with chunking methods that split long text documents into smaller chunks, you'll notice that Unstructured's chunking methods differ slightly, since the partitioning step already divides an entire document into its structural elements: titles, list items, tables, text, etc. By partitioning documents this way, you can avoid a situation where unrelated pieces of text end up in the same element, and then in the same chunk.

When you chunk the document elements with Unstructured, individual elements are already small, so they are split only if they exceed the desired maximum chunk size. Otherwise, they remain as is. You can also optionally combine consecutive text elements, such as list items, that together fit within the chunk size limit.
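
The two rules described here (split only what is too long, optionally merge what is very short) can be sketched in plain Python. This is a simplified illustration and not Unstructured's implementation: the function `chunk_texts` and its parameter names are hypothetical, it works on plain strings rather than elements, and it ignores section boundaries such as titles.

```python
def chunk_texts(texts, max_characters=512, combine_under_n_chars=200):
    """Simplified chunking: merge short consecutive texts, split oversized ones."""
    chunks = []
    buffer = ""
    for text in texts:
        # Split a single text that is too long into fixed-size pieces
        pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
        for piece in pieces:
            joined = f"{buffer} {piece}" if buffer else piece
            if buffer and len(buffer) < combine_under_n_chars and len(joined) <= max_characters:
                # The buffered chunk is still small: merge the next piece into it
                buffer = joined
            else:
                # Flush the buffered chunk (if any) and start a new one
                if buffer:
                    chunks.append(buffer)
                buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks


# Two short list items followed by one oversized paragraph
example_chunks = chunk_texts(["Item one", "Item two", "x" * 600])
print([len(chunk) for chunk in example_chunks])
```

In this toy run the two short items end up in one chunk, while the long text is cut into pieces that respect the maximum size.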


In [12]:
from unstructured.chunking.title import chunk_by_title

chunked_elements = chunk_by_title(elements,
                                  # maximum for chunk size
                                  max_characters=512,
                                  # You can choose to combine consecutive elements that are too small
                                  # e.g. individual list items
                                  combine_text_under_n_chars=200,
                                  )


The chunks are ready for RAG. To use them with LangChain, you can easily convert Unstructured elements to LangChain documents.

In [13]:
from langchain_core.documents import Document

documents = []
for chunked_element in chunked_elements:
    metadata = chunked_element.metadata.to_dict()
    metadata["source"] = metadata["filename"]
    # `languages` is a list; drop it if present (pop avoids a KeyError when it is missing)
    metadata.pop("languages", None)
    documents.append(Document(page_content=chunked_element.text, metadata=metadata))
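
Element metadata can hold lists or nested dictionaries (the `languages` field, for instance, is a list), and some vector stores only accept flat, scalar values. The sketch below shows the general idea of keeping only simple values. The helper `keep_simple_metadata` and the sample dictionary are hypothetical, and this is not how any particular library implements its own filtering.

```python
def keep_simple_metadata(metadata):
    """Keep only metadata entries whose values are plain scalars."""
    allowed_types = (str, int, float, bool)
    return {
        key: value
        for key, value in metadata.items()
        if isinstance(value, allowed_types)
    }


# Made-up example in the spirit of an element's metadata dictionary
example_metadata = {
    "filename": "contract.pdf",
    "page_number": 3,
    "languages": ["eng"],
    "coordinates": {"points": [[0, 0], [1, 1]]},
}
print(keep_simple_metadata(example_metadata))
```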

## Setting up the retriever

This example uses ChromaDB as a vector store and the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. Feel free to use any other vector store.
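
Conceptually, a similarity retriever embeds the query, compares it with the stored chunk embeddings, and returns the closest `k` chunks. The sketch below illustrates that idea with cosine similarity over toy vectors. It is an assumption-laden simplification: the function names and the toy index are made up, and a real vector store may use a different distance metric and an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=3):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in indexed_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings"; a real embeddings model produces much longer vectors
toy_index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
    "chunk-d": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.05, 0.0], toy_index, k=2))
```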

In [14]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import utils as chromautils

# ChromaDB doesn't support complex metadata, e.g. lists, so we drop it here.
# If you're using a different vector store, you may not need to do this
docs = chromautils.filter_complex_metadata(documents)

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
# Index the filtered `docs`, not the unfiltered `documents` list
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})


If you plan to use a gated model from the Hugging Face Hub, be it an embeddings or text generation model, you'll need to authenticate yourself with your Hugging Face token, which you can get in your Hugging Face profile's settings.

In [15]:
from huggingface_hub import notebook_login

notebook_login()


## RAG with LangChain

Let's bring everything together and build RAG with LangChain.
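This tutorial was written for [`Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) from Meta, which needs to be quantized to run smoothly in the free T4 runtime from Google Colab. The cells below load GPT-2 instead, which needs neither authentication nor quantization, so expect noticeably weaker answers from it.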

In [16]:
import torch
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

In [19]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# --- Load GPT-2 locally without authentication ---
model_name = "gpt2"

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Ensure GPT-2 has a pad token
tokenizer.pad_token = tokenizer.eos_token

# --- Create the generation pipeline ---
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=200,
)

# --- Wrap in LangChain LLM ---
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# --- Create the prompt ---
prompt_template = """
You are an assistant for answering questions using provided context.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Question: {question}
Context: {context}
Answer:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# --- Create QA chain ---
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,  # assumes retriever is set up with your chunks and embeddings
    chain_type_kwargs={"prompt": prompt},
)

# Example usage:
query = "What are the payment terms specified in the contract?"
result = qa_chain.invoke({"query": query})

print(result["result"])




"The Contracts entered into between us on behalf or by way about our business will contain provisions which may affect your rights if they apply only where it would have been possible otherwise". This means we can use them without affecting those other things mentioned above except when necessary so that no one else has access thereto from outside sources who might want their information protected against disclosure through third party intermediaries like Google Analytics. We also give some examples here but I'm going ahead because what's important now isn´t anything specific - rather how much money should go towards making sure people understand exactly why certain contracts were signed before signing others...


## Results and next steps

Now that you have your RAG chain, let's ask it about aphids. Are they a pest in my garden?

In [20]:
queries = [
    "What are the Key Dates and conditions to be met for the handover of the CCR Room?",
    "What is the period for reply to a communication as specified in the contract?",
    "What are the insurance coverage requirements for public liability under this contract?",
    "What are the contractor’s share percentages for Stages 4, 5, and 6?",
    "Under which circumstances can the contractor claim a compensation event for weather conditions?",
    "What is the law and jurisdiction governing this contract?",
    "What is the retention percentage and the retention free amount stated in the contract?",
    "What are the delay damages for each section under Option X5 and X7?",
    "What are the contractor’s obligations regarding the Key Clinical Equipment Design Data?",
    "What is the defect correction period and the defects date defined in the contract?",
]

for idx, query in enumerate(queries, 1):
    result = qa_chain.invoke({"query": query})["result"]
    print(f"{idx}️⃣ Query: {query}\nAnswer: {result}\n{'-'*80}\n")

1️⃣ Query: What are the Key Dates and conditions to be met for the handover of the CCR Room?
Answer: - The work is complete in 3 days or less but no more than 2 weeks after completion; if there was any delay during this period it would have been considered by us that we should take action immediately so please let me hear from anyone who has had problems while working at our office which could affect their ability access within 24 hours
--------------------------------------------------------------------------------

2️⃣ Query: What is the period for reply to a communication as specified in the contract?
Answer: 
"What does this mean?"



--------------------------------------------------------------------------------

3️⃣ Query: What are the insurance coverage requirements for public liability under this contract?
Answer: - The maximum number(s). - If there is no more than two events at all times during each period covered by the contracted term; then it would mean if both were occurr



9️⃣ Query: What are the contractor’s obligations regarding the Key Clinical Equipment Design Data?
Answer: "The key clinical equipment is part-time work performed at home during normal working hours". This means it can take place only after 3pm every day from 7am until 6pm each night between 8am and 5pm daily throughout the week - this includes all days off which may include weekends/evenings where there might be no other activity available such like school holidays etc. In addition, if your employer does not have access then they must provide some sort 'workplace' service including training sessions under their own direction but these should also cover both primary care services offered through NHS Carers Partnership Service providers who offer specialist support over time while providing relevant information about how patients get involved via social media platforms – see below. If we were talking about healthcare workers here I would assume most employers want people doing basic hea

In [21]:
print("🔹 Contract QA Assistant ready.")
print("🔹 Ask any question about your contract. Type 'bye' to exit.\n")

while True:
    user_query = input("❓ Your question: ").strip()

    if user_query.lower() == "bye":
        print("👋 Goodbye!")
        break

    try:
        result = qa_chain.invoke({"query": user_query})["result"]
        print(f"🪐 Answer: {result}\n")
    except Exception as e:
        print(f"⚠️ Error: {e}\nPlease try again or check your retriever and pipeline.\n")


🔹 Contract QA Assistant ready.
🔹 Ask any question about your contract. Type 'bye' to exit.

❓ Your question: how to save budget if the contract is violated?
🪐 Answer: "In order with respect thereto there must have been at least two such failures". In other words, it would take ten years from date of failure until all three were fixed before they could cause further problems because no fault had occurred during those six months but only after their work began again when repairs commenced; so even though he has failed twice now since then - once while repairing himself first time through making sure everything worked properly without causing more than half way between them – what does 'two' mean exactly?? The problem here lies in whether your contractor's negligence resulted directly into injury resulting either direct result of faulty equipment installed within its premises OR indirect consequence arising indirectly via improper use of materials used therein(i). If both causes can occur

Output:

```bash
Yes, aphids are considered pests because they feed on the nutrient-rich liquids within plants, causing damage and potentially spreading disease. In fact, they're known to multiply quickly, which is why it's essential to control them promptly. As mentioned in the text, aphids can also attract ants, which are attracted to the sweet, sticky substance they produce called honeydew. So, yes, aphids are indeed a pest that requires attention to prevent further harm to your plants!
```

In [30]:
# Install the OpenAI integration, FAISS, and the PDF reader used by PyPDFLoader
!pip install -q --upgrade langchain langchain-openai langchain-community faiss-cpu pypdf

import os

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings

# OpenAIEmbeddings and OpenAI both read the API key from this environment variable
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this cell.")

# 1. Load documents
loader = DirectoryLoader("./documents", glob="*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()

# 2. Create embeddings
embeddings = OpenAIEmbeddings()

# 3. Create FAISS vector store
db = FAISS.from_documents(docs, embeddings)
retriever = db.as_retriever()

# 4. Setup LLM with OpenAI
llm = OpenAI(temperature=0)

# 5. Prompt template
prompt_template = """
You are an assistant answering questions using contract context.
If you don't know, say "I do not know.".

Question: {question}
Context: {context}

Answer:
"""
prompt = PromptTemplate(input_variables=["context", "question"], template=prompt_template)

# 6. Build RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type_kwargs={"prompt": prompt})

# 7. Interactive loop
print("Ask me anything about the contract (type 'bye' to quit).")
while True:
    query = input("Question: ").strip()
    if query.lower() == "bye":
        break
    # `invoke` replaces the deprecated `run` method
    result = qa_chain.invoke({"query": query})["result"]
    print(f"Answer:\n{result}\n")





This looks like a promising start! Now that you know the basics of preprocessing complex unstructured data for RAG, you can continue improving upon this example. Here are some ideas:

* You can connect to a different source to ingest the documents from, for example, an S3 bucket.
* You can add `return_source_documents=True` in the `qa_chain` arguments to make the chain return the documents that were passed to the prompt as context. This can be useful to understand what sources were used to generate the answer.
* If you want to leverage the element metadata at the retrieval stage, consider using Hugging Face agents and creating a custom retriever tool as described in [this other notebook](https://huggingface.co/learn/cookbook/agents#2--rag-with-iterative-query-refinement--source-selection).
* There are many things you could do to improve search results. For instance, you could use hybrid search instead of a single similarity-search retriever. Hybrid search combines multiple search algorithms to improve the accuracy and relevance of search results. Typically, it's a combination of keyword-based search algorithms with vector search methods.
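
To give a feel for the hybrid search idea in the last bullet, the sketch below merges a keyword-based ranking and a vector-based ranking with reciprocal rank fusion. Choosing this fusion method is an assumption made for illustration, since the text above does not prescribe one. The chunk ids and the two input rankings are made up, and `k=60` is just a commonly used smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list using reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks near the top
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from two different retrievers
keyword_ranking = ["chunk-2", "chunk-7", "chunk-1"]
vector_ranking = ["chunk-7", "chunk-3", "chunk-2"]
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
```

Chunks that appear near the top of both lists rise to the top of the fused list, which is the behavior you would want from combining keyword and vector search.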

Have fun building RAG applications with Unstructured data!