# Context Poisoning in RAG Systems: Hands-on Examples

This notebook demonstrates three common patterns of context poisoning in RAG systems and how to defend against them using Elasticsearch's search capabilities.

**What you'll learn:**
1. **Temporal Degradation**: Filter outdated documents with range queries
2. **Information Conflicts**: Prioritize relevant context with metadata boosting
3. **Semantic Noise**: Eliminate irrelevant results with product filters

**Requirements:**
- Elasticsearch 9.x or higher 
- Jina embeddings v3 inference endpoint (created automatically)
- Python 3.8+

---
## Section 1: Setup and Configuration

### Install Dependencies

First, let's install the required Python packages.

In [1]:
!pip install -qU elasticsearch python-dotenv

### Connect to Elasticsearch

To run this notebook, you need an Elasticsearch deployment. 

**Don't have one?** [Sign up for a free Elastic Cloud trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook).

Set the following environment variables:
- `ES_URL`: Your Elasticsearch endpoint URL
- `ES_API_KEY`: Your API key for authentication

In [2]:
import os
from elasticsearch import Elasticsearch
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

ES_URL = os.environ.get("ES_URL")
ES_API_KEY = os.environ.get("ES_API_KEY")

if not ES_URL or not ES_API_KEY:
    raise ValueError(
        "Please set ES_URL and ES_API_KEY environment variables (or create a .env file)"
    )

es = Elasticsearch(ES_URL, api_key=ES_API_KEY)

# Verify connection
info = es.info()
print(f"Connected to Elasticsearch {info['version']['number']}")

Connected to Elasticsearch 9.2.1


### Configure Inference Endpoint

We'll use Jina embeddings v3 for semantic search. The following cell will create the inference endpoint if it doesn't exist, or verify it's available if already created.

In [3]:
INFERENCE_ID = "jina-embeddings-v3"

# Try to create the inference endpoint, or verify it exists
try:
    es.inference.put(
        task_type="text_embedding",
        inference_id=INFERENCE_ID,
        body={
            "service": "elastic",
            "service_settings": {"model_id": "jina-embeddings-v3"},
        },
    )
    print(f"Inference endpoint '{INFERENCE_ID}' created")
except Exception as e:
    if (
        "already exists" in str(e).lower()
        or "resource_already_exists" in str(e).lower()
    ):
        print(f"Inference endpoint '{INFERENCE_ID}' already exists")
    else:
        # The endpoint may exist even though the error message differs, so look it up
        try:
            es.inference.get(inference_id=INFERENCE_ID)
            print(f"Inference endpoint '{INFERENCE_ID}' is available")
        except Exception:
            print(f"Error creating/accessing inference endpoint: {e}")
            raise

Inference endpoint 'jina-embeddings-v3' already exists


### Helper Function

We'll use a reusable function to load JSON datasets and index them into Elasticsearch.

In [4]:
import json
from elasticsearch.helpers import bulk


def load_and_index(es_client, index_name, json_file, mapping):
    """Load JSON dataset and index to Elasticsearch."""
    # Delete if exists
    if es_client.indices.exists(index=index_name):
        es_client.indices.delete(index=index_name)
        print(f"Deleted existing index: {index_name}")

    # Create index
    es_client.indices.create(index=index_name, body=mapping)
    print(f"Created index: {index_name}")

    # Load and index documents
    with open(json_file, "r") as f:
        docs = json.load(f)

    actions = [
        {"_index": index_name, "_id": doc.pop("_id"), "_source": doc} for doc in docs
    ]

    success, errors = bulk(es_client, actions, refresh=True)
    print(f"Indexed {success} documents to {index_name}")

    if errors:
        print(f"Errors: {errors}")

    return success

---
## Section 2: Temporal Degradation

### The Problem

Outdated docs remain semantically similar to current queries but contain obsolete information. A query for "OAuth authentication" might retrieve docs from 6.x (Shield plugin), 7.x (legacy syntax), and 9.x (current)—all relevant, but only the latest is accurate.

### The Solution

Use **date range filters** in your RRF query to exclude documents older than a threshold (e.g., 6 months).
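
A minimal sketch of how the threshold could be made configurable is shown below. The helper name `staleness_filter` and its parameters are hypothetical; only the `last_updated` field and the `now-6M` date math come from this notebook. The optional `/d` suffix rounds the bound down to the start of the day, which the Elasticsearch tuning guidance suggests can make `now`-based ranges easier to cache.

```python
def staleness_filter(max_age_months=6, date_field="last_updated", round_to_day=True):
    """Build a range clause that keeps documents updated in the last N months.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if max_age_months <= 0:
        raise ValueError("max_age_months must be positive")

    # Elasticsearch date math: "now-6M" is six months before query time
    lower_bound = f"now-{max_age_months}M"
    if round_to_day:
        # "/d" rounds the bound down to the start of the day
        lower_bound += "/d"

    return {"range": {date_field: {"gte": lower_bound}}}


print(staleness_filter())
print(staleness_filter(12, round_to_day=False))
```

The returned clause can be placed in the `filter` list of the RRF retriever, alongside any other filters.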

### Create the Index

We'll create an index for product documentation with a `semantic_text` field for vector search.

In [5]:
INDEX_TEMPORAL = "product-docs"

# Mapping with copy_to for semantic_text field
mapping_temporal = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text", "copy_to": "content_semantic"},
            "content_semantic": {"type": "semantic_text", "inference_id": INFERENCE_ID},
            "last_updated": {"type": "date"},
            "version": {"type": "keyword"},
            "status": {"type": "keyword"},
            "content_snippet": {"type": "text"},
        }
    }
}

load_and_index(es, INDEX_TEMPORAL, "data/product-docs.json", mapping_temporal)

Created index: product-docs
Indexed 15 documents to product-docs


15

### Query WITHOUT Temporal Filter

First, let's see what happens when we query without any date filtering. The RRF (Reciprocal Rank Fusion) query combines semantic and keyword search but retrieves documents from all time periods.
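
As a brief aside, the fusion step can be sketched in plain Python. This is a minimal illustration of the RRF formula (each document scores the sum of `1 / (rank_constant + rank)` over the retrievers that returned it), not the engine's actual implementation. The document IDs and both rankings are made up for illustration, and the lists are assumed to be already truncated to `rank_window_size`.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=20):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each retriever has rank 1
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from a semantic and a keyword retriever
semantic_ranking = ["doc-9x", "doc-7x", "doc-6x"]
keyword_ranking = ["doc-6x", "doc-9x", "doc-7x"]

for doc_id, score in rrf_fuse([semantic_ranking, keyword_ranking]):
    print(f"{doc_id}: {score:.4f}")
```

In this made-up example the deprecated `doc-6x` still lands second because one retriever ranks it first. Rank fusion on its own does not remove outdated documents, which is why the date filter applied later in this section matters.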

In [6]:
query_no_filter = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "semantic": {
                                "field": "content_semantic",
                                "query": "how to configure OAuth authentication",
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "multi_match": {
                                "query": "configure OAuth authentication",
                                "fields": ["title^2", "content"],
                            }
                        }
                    }
                },
            ],
            "rank_window_size": 50,
            "rank_constant": 20,
        }
    },
    "_source": ["title", "last_updated", "version", "content_snippet"],
    "size": 5,
}

print("Query: 'how to configure OAuth authentication'")
print("Filter: NONE")
print("-" * 60)

results = es.search(index=INDEX_TEMPORAL, body=query_no_filter)

for i, hit in enumerate(results["hits"]["hits"], 1):
    src = hit["_source"]
    print(f"\n{i}. {src['title']}")
    print(f"   Version: {src['version']} | Updated: {src['last_updated']}")
    print(f"   {src['content_snippet'][:80]}...")

Query: 'how to configure OAuth authentication'
Filter: NONE
------------------------------------------------------------

1. Setting Up OAuth (Deprecated)
   Version: 6.x | Updated: 2022-03-15
   Configure OAuth via Shield plugin (deprecated, replaced by X-Pack)....

2. OAuth 2.0 Authentication Setup
   Version: 9.x | Updated: 2026-01-15
   Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Manage...

3. OAuth Authentication Configuration
   Version: 7.x | Updated: 2023-06-15
   Configure OAuth via elasticsearch.yml with xpack.security.authc.realms.oidc sett...

4. OAuth Realm Setup
   Version: 7.x | Updated: 2023-05-20
   Set up OAuth realm in xpack.security.authc.realms.oidc with op.* settings....

5. OAuth Client Registration
   Version: 7.x | Updated: 2023-04-10
   Register OAuth clients via rp.client_id and rp.client_secret in elasticsearch.ym...


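Without filtering, the results mix documents dated from 2022 to 2026. The RAG system would receive conflicting information about the deprecated Shield plugin, the legacy `elasticsearch.yml` configuration, and the current API-based setup.

### Query WITH Temporal Filter

Now let's add filters to the RRF retriever so that only documents updated in the last six months and marked as `published` are returned.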

In [7]:
query_with_filter = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "semantic": {
                                "field": "content_semantic",
                                "query": "how to configure OAuth authentication",
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "multi_match": {
                                "query": "configure OAuth authentication",
                                "fields": ["title^2", "content"],
                            }
                        }
                    }
                },
            ],
            "filter": [
                {"range": {"last_updated": {"gte": "now-6M"}}},
                {"term": {"status": "published"}},
            ],
            "rank_window_size": 50,
            "rank_constant": 20,
        }
    },
    "_source": ["title", "last_updated", "version", "content_snippet"],
    "size": 5,
}

print("Query: 'how to configure OAuth authentication'")
print("Filter: last_updated >= now-6M AND status = published")
print("-" * 60)

results = es.search(index=INDEX_TEMPORAL, body=query_with_filter)

for i, hit in enumerate(results["hits"]["hits"], 1):
    src = hit["_source"]
    print(f"\n{i}. {src['title']}")
    print(f"   Version: {src['version']} | Updated: {src['last_updated']}")
    print(f"   {src['content_snippet'][:80]}...")

Query: 'how to configure OAuth authentication'
Filter: last_updated >= now-6M AND status = published
------------------------------------------------------------

1. OAuth 2.0 Authentication Setup
   Version: 9.x | Updated: 2026-01-15
   Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Manage...

2. OAuth Provider Configuration
   Version: 9.x | Updated: 2025-12-20
   Configure Okta, Azure AD, Auth0 via security API with OIDC auto-discovery....

3. OAuth Token Management
   Version: 9.x | Updated: 2026-01-10
   Manage OAuth tokens via /_security/oauth2/token endpoint for refresh and revocat...

4. OAuth Security Best Practices
   Version: 9.x | Updated: 2026-01-05
   OAuth best practices: short-lived tokens, PKCE, proper redirect URIs, state vali...

5. OAuth Troubleshooting Guide
   Version: 9.x | Updated: 2025-12-15
   Troubleshoot OAuth: check token expiration, redirect URIs, credentials, use debu...


All results are now from version 9.x, providing consistent and current OAuth configuration guidance.
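
One way to check this programmatically before passing context to the LLM is sketched below. The helper names are hypothetical, and the sample hits are made-up stand-ins shaped like `results["hits"]["hits"]`; only the `version` field comes from the mapping above.

```python
from collections import Counter


def version_spread(hits):
    """Count how many retrieved documents belong to each version."""
    return Counter(hit["_source"].get("version", "unknown") for hit in hits)


def is_version_consistent(hits):
    """Return True when every hit reports the same version (or there are no hits)."""
    return len(version_spread(hits)) <= 1


# Hypothetical hits mirroring the unfiltered and filtered result sets
unfiltered_hits = [
    {"_source": {"version": "6.x"}},
    {"_source": {"version": "9.x"}},
    {"_source": {"version": "7.x"}},
]
filtered_hits = [
    {"_source": {"version": "9.x"}},
    {"_source": {"version": "9.x"}},
]

print(dict(version_spread(unfiltered_hits)))
print(is_version_consistent(unfiltered_hits))
print(is_version_consistent(filtered_hits))
```

A check like this could log a warning or trigger a stricter query when the retrieved context spans several versions.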

---

### Key Takeaway

Temporal filtering prevents context poisoning from outdated documentation:
- Set appropriate staleness thresholds based on your documentation lifecycle
- Consider recency boosting for soft preferences vs. hard cutoffs
- Mark evergreen content (core concepts) to exempt from time filters
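
The last two points can be sketched as query fragments. This is an illustration under stated assumptions, not part of the notebook's dataset: the boolean `evergreen` field is hypothetical and would need to be added to the mapping, and the `scale` and `decay` values are arbitrary starting points.

```python
def temporal_filter_with_evergreen(max_age="now-6M", evergreen_field="evergreen"):
    """Keep documents that are either recent or flagged as evergreen.

    `evergreen_field` is a hypothetical boolean field, not in the sample mapping.
    """
    return {
        "bool": {
            "should": [
                {"range": {"last_updated": {"gte": max_age}}},
                {"term": {evergreen_field: True}},
            ],
            # At least one of the two conditions must hold
            "minimum_should_match": 1,
        }
    }


def recency_boosted(query, scale="180d", decay=0.5):
    """Wrap a query so newer documents score higher (a soft preference)."""
    return {
        "function_score": {
            "query": query,
            "functions": [
                # Gaussian decay: the multiplier drops to `decay` at `scale` from now
                {
                    "gauss": {
                        "last_updated": {
                            "origin": "now",
                            "scale": scale,
                            "decay": decay,
                        }
                    }
                }
            ],
            "boost_mode": "multiply",
        }
    }


keyword_query = {"multi_match": {"query": "configure OAuth authentication", "fields": ["title^2", "content"]}}
print(temporal_filter_with_evergreen())
print(recency_boosted(keyword_query))
```

The first fragment could replace the plain `range` clause in the retriever's `filter` list. The second could wrap the `multi_match` query inside a `standard` retriever; note that it only changes that retriever's ranking, since RRF fuses ranks rather than raw scores.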

---
## Section 3: Information Conflicts

### The Problem

Semantically similar documents may contain contradictory information based on different contexts. A query for "configure custom users in serverless" might retrieve Serverless docs ("use SSO"), Cloud docs ("create in Stack Management"), and Self-hosted docs ("configure native realm")—all valid, but only one matches the user's context.

### The Solution

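Use **metadata boosting** with `should` clauses to prioritize documents matching the user's context (deployment type, product version, etc.).

### Create the Index

We'll create an index for platform documentation with `deployment_type` and `doc_status` keyword fields that the boosts can target.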

In [8]:
INDEX_CONFLICTS = "platform-docs"

# Mapping with copy_to for semantic_text field
mapping_conflicts = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text", "copy_to": "content_semantic"},
            "content_semantic": {"type": "semantic_text", "inference_id": INFERENCE_ID},
            "deployment_type": {"type": "keyword"},
            "feature_supported": {"type": "boolean"},
            "doc_status": {"type": "keyword"},
            "content_snippet": {"type": "text"},
        }
    }
}

load_and_index(es, INDEX_CONFLICTS, "data/platform-docs.json", mapping_conflicts)

Created index: platform-docs
Indexed 15 documents to platform-docs


15

### Query WITHOUT Metadata Boosting

First, let's query without any deployment-type boosting to see results from all deployment types.

In [None]:
user_query = "How do I configure custom users in serverless?"

query_no_boosting = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "multi_match": {
                                "query": user_query,
                                "fields": ["title^2", "content"],
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "semantic": {
                                "field": "content_semantic",
                                "query": user_query,
                            }
                        }
                    }
                },
            ],
            "rank_window_size": 50,
            "rank_constant": 20,
        }
    },
    "_source": ["title", "deployment_type", "feature_supported", "content_snippet"],
    "size": 5,
}

print(f"Query: '{user_query}'")
print("Boosting: NONE")
print("-" * 60)

results = es.search(index=INDEX_CONFLICTS, body=query_no_boosting)

for i, hit in enumerate(results["hits"]["hits"], 1):
    src = hit["_source"]
    supported = "Supported" if src["feature_supported"] else "NOT Supported"
    print(f"\n{i}. {src['title']}")
    print(f"   Deployment: {src['deployment_type']} | Feature: {supported}")
    print(f"   {src['content_snippet'][:80]}...")

### Query with Metadata Boosting

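We use `should` clauses to boost documents that match the user's deployment context (serverless). The boosts raise matching documents in the keyword retriever's ranking. Because RRF fuses ranks rather than raw scores, this makes contextually relevant documents more likely to rank above semantically similar but contextually wrong results.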

In [None]:
query_with_boosting = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "multi_match": {
                                            "query": user_query,
                                            "fields": ["title^2", "content"],
                                        }
                                    }
                                ],
                                "should": [
                                    {
                                        "term": {
                                            "deployment_type": {
                                                "value": "serverless",
                                                "boost": 3.0,
                                            }
                                        }
                                    },
                                    {
                                        "term": {
                                            "doc_status": {
                                                "value": "current",
                                                "boost": 2.0,
                                            }
                                        }
                                    },
                                ],
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "semantic": {
                                "field": "content_semantic",
                                "query": user_query,
                            }
                        }
                    }
                },
            ],
            "rank_window_size": 50,
            "rank_constant": 20,
        }
    },
    "_source": ["title", "deployment_type", "feature_supported", "content_snippet"],
    "size": 5,
}

print(f"Query: '{user_query}'")
print("Boosting: deployment_type=serverless (3.0x), doc_status=current (2.0x)")
print("-" * 60)

results = es.search(index=INDEX_CONFLICTS, body=query_with_boosting)

for i, hit in enumerate(results["hits"]["hits"], 1):
    src = hit["_source"]
    supported = "Supported" if src["feature_supported"] else "NOT Supported"
    print(f"\n{i}. {src['title']}")
    print(f"   Deployment: {src['deployment_type']} | Feature: {supported}")
    print(f"   {src['content_snippet'][:80]}...")

Serverless documents now rank at the top, correctly informing the user that custom users are NOT supported and they should use SSO instead.

---

### Key Takeaway

Metadata boosting resolves conflicts by prioritizing context-relevant documents:
- Extract user context from the query (deployment type, version, etc.)
- Apply appropriate boosts to matching metadata
- Use strict filters when context is unambiguous
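
The steps above can be sketched with a simple keyword lookup. This is a deliberately naive illustration: the keyword table and helper names are hypothetical, and the `cloud` and `self-hosted` values are assumptions about how `deployment_type` might be populated. A production system would more likely take the context from user or session metadata, or from a classifier.

```python
# Hypothetical keyword table; only "serverless" is confirmed by the notebook's query
DEPLOYMENT_KEYWORDS = {
    "serverless": ["serverless"],
    "cloud": ["elastic cloud", "cloud hosted"],
    "self-hosted": ["self-hosted", "self-managed", "on-prem"],
}


def detect_deployment_type(query):
    """Return the single deployment type mentioned in the query, else None."""
    text = query.lower()
    matches = [
        deployment
        for deployment, keywords in DEPLOYMENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # Zero or several matches means the context is ambiguous
    return matches[0] if len(matches) == 1 else None


def context_clauses(query, boost=3.0, strict=False):
    """Build bool-query clauses: a hard filter if strict, else a boosted should."""
    deployment = detect_deployment_type(query)
    if deployment is None:
        return {}
    if strict:
        return {"filter": [{"term": {"deployment_type": deployment}}]}
    return {
        "should": [
            {"term": {"deployment_type": {"value": deployment, "boost": boost}}}
        ]
    }


question = "How do I configure custom users in serverless?"
print(detect_deployment_type(question))
print(context_clauses(question))
print(context_clauses(question, strict=True))
```

The returned dictionary can be merged into the `bool` query of the keyword retriever. When no single deployment type is detected, it is empty and the query runs without context-specific clauses.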

---
## Section 4: Semantic Noise

### The Problem

Documents about different products may share terminology, causing irrelevant results. A query for "configure agents" could match both **Elastic Agent** (Observability—collects logs/metrics) and **Agent Builder** (GenAI—builds LLM workflows). Same term, completely different products.

### The Solution

Use **product filters** in the RRF query to exclude irrelevant product documentation entirely.
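
Since the filtered and unfiltered queries differ only in the `filter` list, one option is to derive the filtered body from the unfiltered one. The helper below is a hypothetical sketch; `base_body` uses an empty `retrievers` list purely as a placeholder and would need real retrievers before being sent to Elasticsearch.

```python
import copy


def with_rrf_filter(search_body, filters):
    """Return a copy of an RRF search body with extra filter clauses appended."""
    body = copy.deepcopy(search_body)  # leave the caller's dictionary untouched
    rrf = body["retriever"]["rrf"]

    existing = rrf.get("filter", [])
    if isinstance(existing, dict):
        # A single filter clause may be given as a dict; normalize to a list
        existing = [existing]

    rrf["filter"] = existing + list(filters)
    return body


# Placeholder body: the retrievers list is left empty for brevity
base_body = {
    "retriever": {
        "rrf": {"retrievers": [], "rank_window_size": 50, "rank_constant": 20}
    },
    "size": 5,
}

product_filters = [
    {"terms": {"product": ["observability", "elastic-agent"]}},
    {"term": {"doc_type": "configuration"}},
]

filtered_body = with_rrf_filter(base_body, product_filters)
print(filtered_body["retriever"]["rrf"]["filter"])
```

Because the filter sits on the RRF retriever itself, it applies to every child retriever, so neither the keyword nor the semantic side can reintroduce documents from the excluded product.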

### Create the Index

In [None]:
INDEX_NOISE = "elastic-docs"

mapping_noise = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text", "copy_to": "content_semantic"},
            "content_semantic": {"type": "semantic_text", "inference_id": INFERENCE_ID},
            "product": {"type": "keyword"},
            "doc_type": {"type": "keyword"},
            "tags": {"type": "keyword"},
            "url": {"type": "keyword"},
        }
    }
}

load_and_index(es, INDEX_NOISE, "data/elastic-docs.json", mapping_noise)

### Query WITHOUT Product Filter

First, let's see what happens without filtering. A query about agent configuration for log and metric collection matches both product areas.

In [None]:
user_query = "agent configuration logs metrics collection"

query_no_filter = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "multi_match": {
                                "query": user_query,
                                "fields": ["title^3", "content", "tags^2"],
                                "type": "best_fields",
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "semantic": {
                                "field": "content_semantic",
                                "query": "configuring agents to collect logs and metrics from hosts",
                            }
                        }
                    }
                },
            ],
            "rank_window_size": 50,
            "rank_constant": 20,
        }
    },
    "_source": ["title", "product", "tags", "url"],
    "size": 5,
}

print(f"Query: '{user_query}'")
print("Filter: NONE")
print("-" * 60)

results = es.search(index=INDEX_NOISE, body=query_no_filter)

for i, hit in enumerate(results["hits"]["hits"], 1):
    src = hit["_source"]
    tags = ", ".join(src.get("tags", [])[:3])
    print(f"\n{i}. {src['title']}")
    print(f"   Product: {src['product']} | Tags: {tags}")
    print(f"   URL: {src['url']}")

Results mix Elastic Agent and Agent Builder documentation. "Agent Builder Configuration" is semantically similar but irrelevant to log/metric collection.

### Query WITH Product Filter

Now let's apply a product filter so that only Observability and Elastic Agent documents are retrieved, and restrict them further to the `configuration` document type.

In [13]:
query_with_filter = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "multi_match": {
                                "query": user_query,
                                "fields": ["title^3", "content", "tags^2"],
                                "type": "best_fields",
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "semantic": {
                                "field": "content_semantic",
                                "query": "configuring agents to collect logs and metrics from hosts",
                            }
                        }
                    }
                },
            ],
            "filter": [
                {"terms": {"product": ["observability", "elastic-agent"]}},
                {"term": {"doc_type": "configuration"}},
            ],
            "rank_window_size": 50,
            "rank_constant": 20,
        }
    },
    "_source": ["title", "product", "tags", "url"],
    "size": 5,
}

print(f"Query: '{user_query}'")
print("Filter: product IN [observability, elastic-agent] AND doc_type = configuration")
print("-" * 60)

results = es.search(index=INDEX_NOISE, body=query_with_filter)

for i, hit in enumerate(results["hits"]["hits"], 1):
    src = hit["_source"]
    tags = ", ".join(src.get("tags", [])[:3])
    print(f"\n{i}. {src['title']}")
    print(f"   Product: {src['product']} | Tags: {tags}")
    print(f"   URL: {src['url']}")

Query: 'agent configuration logs metrics collection'
Filter: product IN [observability, elastic-agent] AND doc_type = configuration
------------------------------------------------------------

1. Elastic Agent Input Configuration
   Product: elastic-agent | Tags: inputs, logs, metrics
   URL: /docs/elastic-agent/inputs

2. Configure Elastic Agent for Log and Metric Collection
   Product: elastic-agent | Tags: configuration, logs, metrics
   URL: /docs/elastic-agent/configure

3. Agent Policies and Integrations
   Product: observability | Tags: policies, integrations, fleet
   URL: /docs/fleet/policies

4. Configuring Agent Outputs
   Product: elastic-agent | Tags: outputs, elasticsearch, logstash
   URL: /docs/elastic-agent/outputs

5. Manage Elastic Agents with Fleet
   Product: observability | Tags: fleet, agent-management, deployment
   URL: /docs/fleet/manage-agents


---
## Section 5: Cleanup

Uncomment and run the following cell to delete the indices created in this notebook.

In [None]:
# Uncomment to delete indices
# es.indices.delete(index=INDEX_TEMPORAL, ignore_unavailable=True)
# es.indices.delete(index=INDEX_CONFLICTS, ignore_unavailable=True)
# es.indices.delete(index=INDEX_NOISE, ignore_unavailable=True)
# print("Indices deleted")