# RAG without embeddings: A keyword-first retrieval stack

Not every search problem needs a vector store.

There are plenty of use cases, especially in incident response, enterprise ops, or tightly scoped document corpora, where plain keyword retrieval gets you surprisingly far.

This notebook explores what that looks like in practice:  
A **BM25-powered RAG pipeline** built entirely without embeddings.

We’ll use:
- **Unstructured** to extract and chunk source docs from S3
- **Elasticsearch Serverless** to handle retrieval via BM25
- **LangChain + OpenAI** to run natural language queries over the results

Along the way, we’ll see where this setup shines and where it quietly falls apart.  
Some queries will resolve beautifully. Others will fail in subtle ways, with answers that *sound* right but aren't grounded.

This isn’t about proving BM25 is enough. It’s about understanding what you get when you start simple.


In [None]:
!pip install -U unstructured-client elasticsearch langchain-community langchain-openai langchain-elasticsearch


## Setting up credentials and environment variables

Before we define our workflow, we’ll load the necessary credentials for all the external services we’ll be using — Unstructured, AWS S3 (as our source), and Elasticsearch (as our destination).

These are securely pulled from Colab secrets using `userdata.get(...)`, so make sure you’ve already added them via the “🔐 Secrets” tab in Colab.

Here’s what each one is used for:

- **Unstructured API key**: Required to access the Unstructured Workflows API.
- **S3 credentials**: Used to fetch documents from an S3 bucket or folder.
- **Elasticsearch credentials**: Used to push the processed, structured data into an Elasticsearch Serverless index.

---

### Where to get these values

Here’s a quick guide on how to fetch the required credentials:

#### 🔑 Unstructured API Key
[Contact us](https://unstructured.io/enterprise) to get access or log in if you're already a user.


#### 🪣 S3 Credentials
We’re using the [S3 Source Connector](https://docs.unstructured.io/api-reference/workflow/sources/s3). You’ll need:

- **AWS Access Key ID** and **Secret Access Key**: You can create these from your AWS IAM dashboard by creating a user with “AmazonS3ReadOnlyAccess” or similar permissions.
- **S3 Remote URL**: This should point to the folder or bucket you want to ingest from — e.g. `s3://your-bucket-name/path-to-folder/`. Make sure it’s in URI format.
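A quick format check can catch a malformed remote URL before the connector is registered. The sketch below is only an illustration: `is_valid_s3_url` is a hypothetical helper (not part of the Unstructured client), and it checks the shape of the URI, not whether the bucket exists or is readable.

```python
from urllib.parse import urlparse


def is_valid_s3_url(remote_url: str) -> bool:
    """Return True if remote_url looks like s3://bucket or s3://bucket/prefix/."""
    parsed = urlparse(remote_url)
    # The scheme must be "s3" and the bucket name (netloc) must be non-empty.
    return parsed.scheme == "s3" and bool(parsed.netloc)


# Placeholder values for illustration only.
print(is_valid_s3_url("s3://your-bucket-name/path-to-folder/"))
print(is_valid_s3_url("https://your-bucket-name.s3.amazonaws.com/path"))
```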



#### 🔍 Elasticsearch (Serverless)
We’re using the [Elasticsearch destination connector](https://docs.unstructured.io/api-reference/workflow/destinations/elasticsearch). To set this up:

1. Go to [https://cloud.elastic.co](https://cloud.elastic.co) and create a **Serverless Project**.
2. Under **Project Settings → API Keys**, create a new key.
3. Grab the following values:
   - **API key** (you’ll use this as `ES_API_KEY`)
   - **Deployment URL** (this becomes `ES_HOST_NAME`)
   - Your target **index name** (set this as `ES_INDEX_NAME`)

That’s it — once these are in place as secrets, we’re ready to configure the connectors programmatically in the next step.


In [None]:
import os
import time
from datetime import datetime
from google.colab import userdata

# Unstructured
os.environ['UNSTRUCTURED_API_KEY'] = userdata.get('UNSTRUCTURED_API_KEY')

# AWS S3
os.environ['AWS_ACCESS'] = userdata.get('AWS_ACCESS')
os.environ['AWS_SECRET'] = userdata.get('AWS_SECRET')
os.environ['S3_REMOTE_URL'] = userdata.get("S3_REMOTE_URL")


# Elasticsearch Serverless
os.environ['ES_INDEX_NAME'] = userdata.get('ES_INDEX_NAME')
os.environ['ES_HOST_NAME'] = userdata.get('ES_HOST_NAME')
os.environ['ES_API_KEY'] = userdata.get('ES_API_KEY')
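Once the secrets are loaded, a small sanity check can flag values that are empty before any connector is created. This is a minimal sketch: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here, and the list simply mirrors the variables set above.

```python
import os

# Names mirror the environment variables set in the cell above.
REQUIRED_VARS = [
    "UNSTRUCTURED_API_KEY",
    "AWS_ACCESS",
    "AWS_SECRET",
    "S3_REMOTE_URL",
    "ES_INDEX_NAME",
    "ES_HOST_NAME",
    "ES_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or contain only whitespace."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing or empty secrets: {', '.join(missing)}")
else:
    print("All required secrets are set.")
```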





In [None]:
# instantiate Unstructured Client
from unstructured_client import UnstructuredClient

unstructured_client = UnstructuredClient(api_key_auth=os.environ['UNSTRUCTURED_API_KEY'])

# helper function
def pretty_print_model(response_model):
    print(response_model.model_dump_json(indent=4))

### Registering the S3 source connector

Now that our credentials are set, let’s connect to the raw data stored in S3.

This step registers an **S3 source connector** with the Unstructured API. Once created, this connector tells the system where to pull documents from during workflow execution.

Here’s what’s happening:
- We use the S3 credentials and remote URL from earlier.
- `recursive=True` ensures that files inside nested folders will also be processed.

Once the source is registered, Unstructured will return a unique `source_id` — you’ll use this to define the pipeline input in the next step.

In [None]:
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import CreateSourceConnector

formatted_time = datetime.now().strftime("%H:%M:%S")
source_response = unstructured_client.sources.create_source(
    request=CreateSourceRequest(
        create_source_connector=CreateSourceConnector(
            name=f"Rag w/o Embeddings Source_ {formatted_time}",
            type="s3",
            config={
              "key": os.environ.get('AWS_ACCESS'),
              "secret": os.environ.get('AWS_SECRET'),
              "remote_url": os.environ.get('S3_REMOTE_URL'),
              "recursive": True
            }
        )
    )
)

pretty_print_model(source_response.source_connector_information)

{
    "config": {
        "anonymous": false,
        "recursive": true,
        "remote_url": "s3://ajay-uns-devrel-content/agentic-analysis/",
        "key": "**********",
        "secret": "**********"
    },
    "created_at": "2025-08-06T14:34:21.898458Z",
    "id": "fbb2a2da-156e-4317-a394-40596bc7b102",
    "name": "Rag w/o Embeddings Source_ 14:34:21",
    "type": "s3",
    "updated_at": "2025-08-06T14:34:22.081140Z"
}


### Registering the Elasticsearch destination connector

With our source in place, we now define where the processed data should go.

In this case, we’re using **Elasticsearch Serverless** as our destination. This connector pushes cleaned, structured chunks directly into your configured index — making them queryable for downstream RAG tasks.

Here’s a breakdown of what’s passed into the connector:
- `hosts`: The Elasticsearch deployment URL (from your Serverless project).
- `es_api_key`: The API key you created earlier for secure access.
- `index_name`: The target index where documents will be stored.

> 📌 Note: The index will be created automatically if it doesn’t already exist.

After this step, Unstructured will return a `destination_id`, which we’ll use to tie the source and destination together in the next step: building the workflow.


In [None]:
from unstructured_client.models.operations import CreateDestinationRequest
from unstructured_client.models.shared import CreateDestinationConnector

destination_response = unstructured_client.destinations.create_destination(
    request=CreateDestinationRequest(
        create_destination_connector=CreateDestinationConnector(
            name=f"ES_Destination_connector_{formatted_time}",
            type="elasticsearch",
            config={
                "hosts": [os.environ['ES_HOST_NAME']],
                "es_api_key": os.environ['ES_API_KEY'],
                "index_name": os.environ['ES_INDEX_NAME']
            }
        )
    )
)

pretty_print_model(destination_response.destination_connector_information)

{
    "config": {
        "es_api_key": "**********",
        "hosts": [
            "https://my-elasticsearch-project-cf9288.es.us-east-1.aws.elastic.cloud:443"
        ],
        "index_name": "es-demo"
    },
    "created_at": "2025-08-06T14:36:34.442290Z",
    "id": "19bd8287-d7b5-4d7d-84ab-63ad14e07b70",
    "name": "ES_Destination_connector_14:34:21",
    "type": "elasticsearch",
    "updated_at": "2025-08-06T14:36:34.562580Z"
}


### Building an Unstructured workflow

Now that we’ve registered both our source and destination connectors, it’s time to define how documents should be processed.

This step creates a **custom workflow** in Unstructured that connects:
1. The S3 source (documents in)
2. A two-step transformation pipeline
3. The Elasticsearch destination (clean chunks out)

Here’s what the processing nodes do:

- **Partitioner**: Uses a Vision-Language Model (Anthropic Claude Sonnet) to extract clean structured content — preserving layout, tables, and section headers.
- **Chunker**: Breaks up the content into smaller pieces. We’re using a title-aware strategy (`chunk_by_title`) with a soft limit of `1000` characters (`new_after_n_chars`), a hard limit of `4096` characters (`max_characters`), and a `150`-character overlap to preserve context for retrieval.
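To build intuition for what "title-aware chunking with overlap" means, here is a deliberately simplified sketch. It is not Unstructured's implementation: the function names are hypothetical, the soft limit (`new_after_n_chars`) is not modeled, and sections are assumed to arrive as `(title, body)` pairs. It only shows two ideas: a new chunk always starts at a title, and an oversized section is split into windows that share a few characters.

```python
def chunk_with_overlap(text: str, max_characters: int = 4096, overlap: int = 150) -> list[str]:
    """Split text into windows of at most max_characters.

    Each window after the first repeats the last `overlap` characters
    of the previous window.
    """
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so the next window re-reads the tail of this one.
        start = end - overlap
    return chunks


def chunk_by_sections(sections, max_characters: int = 4096, overlap: int = 150) -> list[str]:
    """sections is a list of (title, body) pairs; chunks never span two titles."""
    chunks = []
    for title, body in sections:
        chunks.extend(chunk_with_overlap(f"{title}\n\n{body}", max_characters, overlap))
    return chunks


# Tiny limits so the overlap is visible in the output.
for piece in chunk_with_overlap("abcdefghij", max_characters=4, overlap=1):
    print(repr(piece))
```

With the tiny limits in the demo, each printed window starts with the last character of the previous one, which is the same effect the `150`-character overlap has at full scale.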

> 🔍 No embedder here, and that’s intentional.  
> For this tutorial, we’ll be using **BM25** for retrieval instead of dense vector embeddings, so there’s no need to generate embeddings in this pipeline.

Once the workflow is created, we save the `workflow_id` so we can run it in the next step.


In [None]:
from unstructured_client.models.shared import (
    WorkflowNode,
    WorkflowType,
    Schedule
)

partition_node = WorkflowNode(
    name="Partitioner",
    subtype="vlm",
    type="partition",
    settings={
        "provider": "anthropic",
        "model": "claude-sonnet-4-5-20250929",
    }
)

chunk_node = WorkflowNode(
    name="Chunker",
    subtype="chunk_by_title",
    type="chunk",
    settings={
        "new_after_n_chars": 1000,
        "max_characters": 4096,
        "overlap": 150
    }
)

response = unstructured_client.workflows.create_workflow(
    request={
        "create_workflow": {
            "name": f"Rag w/o Embeddings Tutorial Workflow_ {time.time()}",
            "source_id": source_response.source_connector_information.id,
            "destination_id": destination_response.destination_connector_information.id,
            "workflow_type": WorkflowType.CUSTOM,
            "workflow_nodes": [
                partition_node,
                chunk_node
            ]
        }
    }
)

pretty_print_model(response.workflow_information)
workflow_id = response.workflow_information.id

### Run the workflow

Run the following cell to start the workflow.

In [None]:
res = unstructured_client.workflows.run_workflow(
    request={
        "workflow_id": workflow_id,
    }
)

pretty_print_model(res.job_information)

{
    "created_at": "2025-07-19T19:40:36.320615Z",
    "id": "5270557c-2e97-4bc2-998c-0eb8af189c18",
    "status": "SCHEDULED",
    "workflow_id": "46a8b815-0528-4afe-bba4-03f05f4310b5",
    "workflow_name": "Rag w/o Embeddings Tutorial Workflow_ 1752954034.9518843",
    "job_type": "ephemeral"
}


### Get the workflow run's job ID

Run the following cell to get the workflow run's job ID, which is needed to poll for job completion later. If successful, Unstructured prints the job's ID.

In [None]:
response = unstructured_client.jobs.list_jobs(
    request={
        "workflow_id": workflow_id
    }
)

last_job = response.response_list_jobs[0]
job_id = last_job.id
print(f"job_id: {job_id}")

job_id: 5270557c-2e97-4bc2-998c-0eb8af189c18


### Poll for job completion

Run the following cell to confirm the job has finished running. If successful, Unstructured prints `"status": "COMPLETED"` within the information about the job.

In [None]:
def poll_job_status(job_id, wait_time=30):
    while True:
        response = unstructured_client.jobs.get_job(
            request={
                "job_id": job_id
            }
        )

        job = response.job_information

        if job.status == "SCHEDULED":
            print(f"Job is scheduled, polling again in {wait_time} seconds...")
            time.sleep(wait_time)
        elif job.status == "IN_PROGRESS":
            print(f"Job is in progress, polling again in {wait_time} seconds...")
            time.sleep(wait_time)
        elif job.status == "COMPLETED":
            print("Job is completed")
            break
        else:
            # Any other status also ends polling; report it instead of claiming success.
            print(f"Job ended with status: {job.status}")
            break

    return job

job = poll_job_status(job_id)
pretty_print_model(job)

Job is scheduled, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is completed
{
    "created_at": "2025-07-19T19:40:36.320615",
    "id": "5270557c-2e97-4bc2-998c-0eb8af189c18",
    "status": "COMPLETED",
    "workflow_id": "46a8b815-0528-4afe-bba4-03f05f4310b5",
    "workflow_name": "Rag w/o Embeddings Tutorial Workflow_ 1752954034.9518843",
    "job_type": "ephemeral",
    "runtime": "PT0S"
}
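The loop above polls until the job leaves the `SCHEDULED` and `IN_PROGRESS` states, with no upper bound on how long it waits. If you want a cap, a variant along these lines may help. It is a sketch under assumptions: `wait_for_status` is a hypothetical helper, and `get_status` stands in for a call such as `unstructured_client.jobs.get_job(...)` that returns the current status string. The demo uses a canned sequence of statuses so it runs without any API access.

```python
import time

PENDING = {"SCHEDULED", "IN_PROGRESS"}


def wait_for_status(get_status, wait_time=30, timeout=1800, sleep=time.sleep):
    """Poll get_status() until it returns a non-pending status.

    Raises TimeoutError if the job is still pending after `timeout`
    seconds of accumulated waiting.
    """
    waited = 0
    while True:
        status = get_status()
        if status not in PENDING:
            return status
        if waited >= timeout:
            raise TimeoutError(f"Job still {status} after {waited} seconds")
        sleep(wait_time)
        waited += wait_time


# Demo with canned statuses instead of a real job.
statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
final = wait_for_status(lambda: next(statuses), wait_time=0, timeout=10)
print(f"Final status: {final}")
```

Returning the final status, rather than printing a fixed message, leaves it to the caller to decide what to do when the job ends in anything other than `COMPLETED`.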


At this point, we’ve successfully run the full Unstructured pipeline:

- Documents were pulled from S3
- Cleaned and chunked using the Partitioner and Chunker nodes
- And indexed into our Elasticsearch Serverless instance

All of this happened without generating embeddings — and that’s by design.

In the next section, we’ll build a lightweight **RAG pipeline** that uses traditional keyword-based search (**BM25**) to retrieve context from Elasticsearch.

## RAG

In this section, we’ll build a Retrieval-Augmented Generation (RAG) pipeline but without using any embeddings.  
Instead, we’ll rely on a classic scoring algorithm called **BM25**, which powers the keyword-based search inside Elasticsearch.

### What is BM25?

BM25 is a **ranking function** that scores each document by the query terms it contains: how often each term appears in the document, how rare the term is across the index, and how long the document is compared to the average.

It’s been a staple in information retrieval for decades, and it still holds up remarkably well when:
- Your documents are chunked cleanly
- Your queries are fairly literal (i.e., not abstract or fuzzy)

Here’s how it works, at a high level:

- **Matching terms boost relevance**: If a chunk contains your search terms, it scores higher.
- **Rare words carry more weight**: Matches on uncommon terms matter more than matches on generic words.
- **Document length is normalized**: Longer chunks don’t get an unfair advantage just because they mention everything.

Unlike dense embeddings, BM25 doesn’t “understand” semantic meaning. It’s not going to connect synonyms or paraphrases. But when your queries are sharp and your chunking is good, it can work surprisingly well.
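The three ideas above (term matches, rarity, length normalization) fit in a few lines of plain Python. The sketch below uses a common textbook form of the BM25 formula with a whitespace tokenizer. It is meant for intuition only and is not how Elasticsearch implements scoring internally. The sample documents are made up for the demo.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace (no stemming).
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # No exact match, no contribution.
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are scaled down.
            norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


docs = [
    "contain the incident and preserve evidence",
    "the incident response team reports the incident",
    "the glossary defines the terms",
]
for doc, score in zip(docs, bm25_scores("preserve evidence", docs)):
    print(f"{score:.3f}  {doc}")
```

Only the first document contains the query terms, so it is the only one with a non-zero score. A query like "containment" would score zero everywhere in this sketch, because the naive tokenizer does not relate it to "contain".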

> 🧠 Why use this?
> - It’s **fast**, **transparent**, and doesn’t need a GPU or embedding model.
> - It’s perfect for bootstrapping or low-latency use cases.

We’ll now query the indexed data in Elasticsearch using BM25 and pass the results into our LLM to generate grounded answers.


In [None]:
from langchain_elasticsearch import ElasticsearchStore, BM25Strategy
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from elasticsearch import Elasticsearch

### Setting up the BM25-backed RAG pipeline

With our data indexed and ready, we can now run queries over it using BM25 retrieval.

Here’s how this section works:
1. We connect to the Elasticsearch Serverless instance using the `Elasticsearch` Python client.
2. We initialize a `BM25Strategy` — this wraps keyword-based scoring around our document chunks.
3. We query Elasticsearch for the top-k most relevant chunks (`similarity_search`), and pass them to GPT-4o to generate an answer.

#### BM25 parameters: `k1` and `b`

- **`k1` (default: `1.2`)**  
  Controls **term frequency scaling** — how much repeated terms matter.  
  - Higher `k1` = more boost for repeated keywords  
  - Lower `k1` = frequency saturates quickly

- **`b` (default: `0.75`)**  
  Controls **document length normalization** — i.e., should longer chunks be penalized?  
  - `b = 0` → No length penalty (longer chunks may dominate)  
  - `b = 1` → Full normalization (neutralizes doc length bias)

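These values work well in practice, but you can tune them if:
- Your chunks are very short/long
- You see irrelevant long documents dominating results

Keep in mind that Elasticsearch configures BM25 similarity in the index settings, so `k1` and `b` passed to `BM25Strategy` most likely only take effect when the store creates the index itself. For an index that already exists, such as the one written by the Unstructured connector here, the index’s own similarity settings apply.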
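To see what `k1` and `b` do numerically, it helps to isolate the term-frequency part of the BM25 formula. The sketch below uses the textbook form of that component. `tf_component` is a hypothetical helper written for this illustration, not an Elasticsearch or LangChain function.

```python
def tf_component(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of textbook BM25 for a single term."""
    # b controls how strongly document length scales the score.
    norm = 1 - b + b * doc_len / avg_len
    # k1 controls how quickly repeated occurrences stop adding value.
    return tf * (k1 + 1) / (tf + k1 * norm)


print("tf   k1=1.2  k1=3.0   (document of average length)")
for tf in (1, 2, 5, 20):
    low = tf_component(tf, doc_len=100, avg_len=100, k1=1.2)
    high = tf_component(tf, doc_len=100, avg_len=100, k1=3.0)
    print(f"{tf:<4} {low:<7.3f} {high:.3f}")

print("\ndoc_len  b=0.0   b=0.75  (tf=3, average length 100)")
for doc_len in (50, 100, 400):
    flat = tf_component(3, doc_len, 100, b=0.0)
    normalized = tf_component(3, doc_len, 100, b=0.75)
    print(f"{doc_len:<8} {flat:<7.3f} {normalized:.3f}")
```

In the first table the column for the higher `k1` keeps growing for longer as `tf` increases. In the second, the `b=0.0` column ignores document length entirely, while `b=0.75` lowers the score for the longer document.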


The `run_query_direct(...)` function wraps the whole RAG flow:

- It retrieves the top-k hits via BM25
- Assembles a context string
- Injects it into a prompt
- And uses GPT-4o to answer based only on that context
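The "assemble a context string" and "inject it into a prompt" steps can be sketched on their own, without Elasticsearch or an LLM. The helpers below (`build_context`, `build_prompt`) are hypothetical and go slightly beyond the notebook's function: they number each chunk and cap the total context length, which is one possible design rather than what `run_query_direct` does.

```python
PROMPT = (
    "Answer the following question based only on the provided context:\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_context(chunks, max_chars=12000):
    """Join chunks into one numbered context string, stopping at max_chars."""
    parts = []
    used = 0
    for i, chunk in enumerate(chunks, 1):
        block = f"[{i}] {chunk.strip()}"
        if used + len(block) > max_chars:
            break  # Drop lower-ranked chunks that no longer fit.
        parts.append(block)
        used += len(block) + 2  # +2 for the blank-line separator
    return "\n\n".join(parts)


def build_prompt(question, chunks, max_chars=12000):
    """Fill the prompt template with the assembled context and the question."""
    return PROMPT.format(context=build_context(chunks, max_chars), question=question)


# Made-up chunk texts for illustration.
sample_chunks = [
    "7a. Determine appropriate containment strategy.",
    "9c. Reset passwords on compromised accounts.",
]
print(build_prompt("What are the containment procedures?", sample_chunks))
```

Numbering the chunks makes it easier to trace which retrieved passage an answer draws on when you inspect the prompt by hand.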


In [None]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

def connect_elasticsearch():
    return Elasticsearch(
        os.environ['ES_HOST_NAME'],
        api_key=os.environ['ES_API_KEY']
    )

def init_bm25_store(es_client, index_name):
    bm25_strategy = BM25Strategy(k1=1.2, b=0.75)

    store = ElasticsearchStore(
        es_connection=es_client,
        index_name=index_name,
        strategy=bm25_strategy
    )
    return store

def run_query_direct(store, query, k=5):
    print(f"\n--- QUERY: {query} ---")

    docs = store.similarity_search(query, k=k)

    context = "\n\n".join([doc.page_content for doc in docs])

    llm = ChatOpenAI(model="gpt-4o")

    prompt = ChatPromptTemplate.from_template("""
    Answer the following question based only on the provided context:

    Context: {context}

    Question: {question}

    Answer:
    """)

    formatted_prompt = prompt.format(context=context, question=query)
    response = llm.invoke(formatted_prompt)

    print("RETRIEVED DOCUMENTS:")
    for i, doc in enumerate(docs, 1):
        print(f"{i}. {doc.page_content[:200]}...")


    print("\nANSWER:")
    print(response.content)

    return response.content, docs

In [None]:
es_client = connect_elasticsearch()
store = init_bm25_store(es_client, os.environ['ES_INDEX_NAME'])


Now let's run a sample question on our data.

In [None]:
response, docs = run_query_direct(store, "What are the containment procedures?")


--- QUERY: What are the containment procedures? ---
RETRIEVED DOCUMENTS:
1. Analyze for Common Adversary TTPs

Compare TTPs to adversary TTPs documented in ATT&CK and analyze how the TTPs fit into the attack lifecycle. TTPs describe "why," "what," and "how." Tactics describe ...
2. TLP:CLEAR

Incident Response Process flowchart showing the workflow from START through various phases including Declare Incident, Determine Investigation Scope, Share CTI, Collect and Preserve Data, P...
3. TLP:CLEAR

Step Incident Response Procedure Action Taken Date Completed 9c. Reset passwords on compromised accounts. 9d. Implement multi-factor authentication for all access methods. 9e. Install updat...
4. 7. Contain Activity (Short-term Mitigations)

7a. Determine appropriate containment strategy, including: • Requirement to preserve evidence • Availability of services (e.g., network connectivity, serv...
5. TLP:CLEAR

Term Definition Source National Security Systems (NSS) National Security Systems (NS


Because the query used clear, operational language (“containment procedures”), BM25 was able to surface high-signal chunks that directly addressed the topic — including full containment checklists and tactical steps.

The LLM then stitched together the overlapping context into a clean, actionable list — covering evidence preservation, system isolation, access revocation, and more.

> ✅ This is where keyword search shines: when your documents are structured, and your query terms match section headers or list items directly.


Now let's try a more abstract query.

In [None]:
response, docs = run_query_direct(store, "Where in the document does it describe coordination between the SOC and executive leadership?")


--- QUERY: Where in the document does it describe coordination between the SOC and executive leadership? ---
RETRIEVED DOCUMENTS:
1. TLP:CLEAR TLP:CLEAR label in black background with white text CISA | Cybersecurity and Infrastructure Security Agency 2

TLP:CLEAR

INTRODUCTION

The Cybersecurity and Infrastructure Security Agency (...
2. TLP:CLEAR

Term Definition Source National Security Systems (NSS) National Security Systems (NSS) are information systems as defined in 44 U.S.C.3552(b)(6). {A}The term "national security system" mean...
3. APPENDIX G: SOURCE TEXT

Agency Responsibilities References Cyber Response Group (CRG) Coordinates the development and implementation of the federal government's policies, strategies, and procedures f...
4. TLP:CLEAR

Step Incident Response Procedure Action Taken Date Completed 9c. Reset passwords on compromised accounts. 9d. Implement multi-factor authentication for all access methods. 9e. Install updat...
5. Coordination with CISA

Cyber defense 

We asked about coordination between the **SOC** and **executive leadership**, but the documents retrieved didn’t contain an exact match. Instead, they surfaced adjacent topics like:

- Reporting incidents to **CISA** and **IT leadership**
- Establishing cross-agency communications protocols

The LLM still produced a fluent answer — but it was largely inferred from context, not grounded in an exact passage. This is a classic case where **BM25 lacks the fuzziness or semantic awareness** needed to bridge slightly different wording.

> ⚠️ This is where embedding-based retrieval would outperform:  
> A vector store could connect “SOC coordination” with descriptions of escalation protocols, even if those words aren’t used verbatim.

So while BM25 gave us *close-ish* chunks, the final answer wasn’t fully supported by the source, and that’s important to catch.
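One cheap way to catch this kind of miss is to check how many of the query's key terms actually appear in the retrieved chunks. The sketch below is a rough heuristic, not a grounding guarantee: `query_term_coverage` and the small stopword list are assumptions made for this example, and the chunk texts are shortened from the retrieval output above.

```python
import re

# Small, hand-picked stopword list for this demo only.
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "does", "do",
    "it", "what", "where", "how", "between", "for", "with", "document", "describe",
}


def query_term_coverage(query, chunks):
    """Return (coverage, missing_terms) for the query's non-stopword terms."""
    terms = {t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOPWORDS}
    if not terms:
        return 1.0, set()
    found = set(re.findall(r"[a-z0-9]+", " ".join(chunks).lower()))
    missing = terms - found
    return 1 - len(missing) / len(terms), missing


query = (
    "Where in the document does it describe coordination "
    "between the SOC and executive leadership?"
)
retrieved = [
    "Coordination with CISA",
    "Reset passwords on compromised accounts.",
]
coverage, missing = query_term_coverage(query, retrieved)
print(f"coverage={coverage:.2f}, missing={sorted(missing)}")
```

A low coverage value does not prove the answer is wrong, but it is a useful signal to show the user a warning or to fall back to another retrieval strategy.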

## Conclusion

This walkthrough demonstrated how to build a RAG pipeline without using embeddings — relying instead on **BM25 keyword search** for document retrieval.

We saw that:

- 🔍 BM25 performs well when queries use **precise terms** that align closely with the document’s language or structure.
- ⚠️ It falls short when the language **diverges** — like asking abstract or cross-functional questions not spelled out in exact keywords.
- 🤖 The LLM can sometimes *paper over* poor retrieval by guessing — but that breaks the grounding contract of RAG.

### When does this approach make sense?

Use BM25-based RAG when:
- Your document set is small to medium-sized
- You don’t want to manage embeddings or vector stores
- Your queries are likely to match real wording in the docs (e.g., checklists, procedures, FAQs)

But if you’re working with more ambiguous queries — or documents with varied phrasing — **embedding-based search or a hybrid strategy** will perform better.

---

### ✅ Next steps

Try extending this notebook by:
- Swapping in a **hybrid retrieval strategy** (BM25 + vectors)
- Adding an **embedding step** to the Unstructured workflow
- Testing queries that deliberately push the limits of lexical matching
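If you try the hybrid route, you will need some way to merge a BM25 ranking with a vector ranking. Reciprocal rank fusion (RRF) is one commonly used option. The sketch below is a generic, library-free version: the function name, the `k=60` constant, and the document IDs are illustrative assumptions, and this notebook does not itself use RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from two retrievers.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["c", "a", "d"]
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and vector similarities live on different scales.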

You now have a full BM25-based RAG system running. Feel free to plug in your own docs and explore how it holds up.

> ⚡️ Want to go deeper?  
> Check out [Unstructured’s API docs](https://docs.unstructured.io) for advanced connectors, chunking strategies, and embedding options.
