# Parent Child Retriever Examples
**Using Elasticsearch Nested Dense Vector Support**

When splitting documents for retrieval, there are often conflicting desires:

- You may want small documents, so that their embeddings reflect their meaning as accurately as possible. If a document is too long, its embedding can lose meaning.
- You want to have long enough documents that the context of each chunk is retained.

We can take advantage of Nested Dense Vector capability in Elasticsearch to store both large passages and smaller linked passages in one document. During retrieval, we query for small passages which link back to a larger parent passage.

Note that “parent document” refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.
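To make the layout concrete, here is a minimal sketch of the shape one indexed document takes in this notebook. The field names `content`, `metadata`, and `passages` match the ones used later on; the values are made up for illustration, and the embedding that the ingest pipeline adds to each passage is left out.

```python
# Illustrative shape of a single indexed document (values are invented).
example_doc = {
    # The parent: either the whole raw document or a larger chunk of it.
    "content": "Full text of the parent document or parent chunk ...",
    "metadata": {"name": "Work From Home Policy", "category": "teams"},
    # The children: small passages stored as nested objects inside the parent.
    "passages": [
        {"text": "First small passage."},
        {"text": "Second small passage."},
    ],
}

# Each passage lives inside its parent, so a match on any passage
# resolves to exactly one parent document.
passage_texts = [p["text"] for p in example_doc["passages"]]
print(passage_texts)
```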

## Dependencies
In this notebook, we're going to use LangChain and the Elasticsearch Python client.

We will also require a running Elasticsearch instance with an ML node and model deployed to it.

In [36]:
!python3 -m pip install -qU langchain elasticsearch jq

### Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

In [3]:

from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

In [29]:
from elasticsearch import Elasticsearch

client = Elasticsearch(cloud_id=ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY)

### Download our example Dataset
We are going to use LangChain's tooling to ingest and split raw documents into smaller chunks. We are using our example workplace search dataset.

LangChain has a number of other loaders to ingest data from other sources. See their [core loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/) or [loaders integration](https://python.langchain.com/docs/integrations/document_loaders) for more information. 

In [1]:
from urllib.request import urlopen
import json

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/workplace-documents.json"

response = urlopen(url)
data = json.load(response)

with open('temp.json', 'w') as json_file:
    json.dump(data, json_file)


In [5]:
from langchain.document_loaders import JSONLoader 

def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["summary"] = record.get("summary")
    metadata["url"] = record.get("url")
    metadata["category"] = record.get("category")
    metadata["updated_at"] = record.get("updated_at")

    return metadata

# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",
    jq_schema=".[]",
    content_key="content",
    metadata_func=metadata_func,
)

### Setting up our Elasticsearch Index
In this example, we use an ingest pipeline to run inference and store the embeddings in our index.

We are using the sentence-transformers all-minilm-l6-v2 model, which must be deployed and running on the ML node. The index uses this pipeline as its `default_pipeline`, so inference runs on every passage as documents are indexed.

In [43]:
PIPELINE_ID = "chunk_text_to_passages"
MODEL_ID = "sentence-transformers__all-minilm-l6-v2"
MODEL_DIMS = 384
INDEX_NAME = "nb_parent_retriever_index"

# Create the pipeline
client.ingest.put_pipeline(
  id=PIPELINE_ID, 
  processors=[
    {
      "foreach": {
        "field": "passages",
        "processor": {
          "inference": {
            "field_map": {
              "_ingest._value.text": "text_field"
            },
            "model_id": MODEL_ID,
            "target_field": "_ingest._value.vector",
            "on_failure": [
              {
                "append": {
                  "field": "_source._ingest.inference_errors",
                  "value": [
                    {
                      "message": "Processor 'inference' in pipeline '" + PIPELINE_ID + "' failed with message '{{ _ingest.on_failure_message }}'",
                      "pipeline": PIPELINE_ID,
                      "timestamp": "{{{ _ingest.timestamp }}}"
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
  ]
)

# Create the index
client.indices.create( 
  index=INDEX_NAME, 
  settings={
    "index": {
      "default_pipeline": PIPELINE_ID
    }
  },
  mappings={
    "dynamic": "true",
    "properties": {
      "passages": {
        "type": "nested",
        "properties": {
          "vector": {
            "properties": {
              "predicted_value": {
                "type": "dense_vector",
                "index": True,
                "dims": MODEL_DIMS,
                "similarity": "dot_product"
              }
            }
          }
        }
      }
    }
  }
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'nb_parent_retriever_index'})
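The mapping above uses `dot_product` similarity, which Elasticsearch expects to be used with unit-length vectors. The sketch below is a plain-Python illustration, not part of the notebook's pipeline: it shows that for normalized vectors the dot product and the cosine similarity agree. The toy two-dimensional vectors are assumptions chosen for readability. Whether your deployed model emits normalized embeddings is worth verifying before relying on `dot_product`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, valid for vectors of any non-zero length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for a query embedding and a passage embedding.
query = normalize([3.0, 4.0])
passage = normalize([4.0, 3.0])

# For unit-length vectors the two measures agree (about 0.96 here).
print(dot(query, passage), cosine(query, passage))
```

The Elasticsearch documentation describes the `_score` for float vectors under `dot_product` as `(1 + dot_product) / 2`; treat that formula as something to confirm against the version you are running.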

### Utils: Parent Child Splitter Function
This function will split a document into multiple passages, and return the parent document with the child passages. 

It also has an option to chunk the parent document into smaller documents, meaning the parent document will be split into multiple index documents. We will use this in example 2.

In [44]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def parent_child_splitter(data, parent_chunk_size: int | None = None, child_chunk_size: int = 200):
  if parent_chunk_size:
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=parent_chunk_size)
    documents = parent_splitter.split_documents(data)
  else:
    documents = data

  child_splitter = RecursiveCharacterTextSplitter(chunk_size=child_chunk_size)

  docs = []
  for i, doc in enumerate(documents):
    passages = []

    for _doc in child_splitter.split_documents([doc]):
        passages.append({
            "text": _doc.page_content,
        })

    doc = {
        "content": doc.page_content,
        "metadata": doc.metadata,
        "passages": passages
    }
    docs.append(doc)
    
  return docs
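To see the output shape without installing anything, here is a dependency-free sketch of the same idea. It swaps `RecursiveCharacterTextSplitter` for a naive fixed-size character splitter, so chunk boundaries will differ from the real function. `fixed_size_chunks` and `sketch_parent_child` are hypothetical names used only for this illustration.

```python
def fixed_size_chunks(text, size):
    """Naive stand-in for a text splitter: cut every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sketch_parent_child(text, parent_chunk_size=None, child_chunk_size=200):
    """Return one dict per parent, each carrying its nested child passages."""
    if parent_chunk_size:
        parents = fixed_size_chunks(text, parent_chunk_size)
    else:
        parents = [text]  # the whole document is the single parent
    return [
        {
            "content": parent,
            "passages": [
                {"text": t} for t in fixed_size_chunks(parent, child_chunk_size)
            ],
        }
        for parent in parents
    ]

text = "x" * 450

# No parent chunking: one parent holding three passages.
whole = sketch_parent_child(text, child_chunk_size=200)
print(len(whole), [len(p["text"]) for p in whole[0]["passages"]])
# 1 [200, 200, 50]

# Parent chunking: two parents, each with its own passages.
chunked = sketch_parent_child(text, parent_chunk_size=300, child_chunk_size=200)
print(len(chunked), [len(d["passages"]) for d in chunked])
# 2 [2, 1]
```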


### Utils: Pretty Response
This function prints the response from Elasticsearch in an easier-to-read format.

In [58]:
def pretty_response(response, show_parent_text=False):
  if len(response['hits']['hits']) == 0:
      print('Your search returned no results.')
  else:
    for hit in response['hits']['hits']:
      id = hit['_id']
      score = hit['_score']
      doc_title = hit['_source']["metadata"]['name']
      parent_text = ""

      if show_parent_text:
          parent_text = hit['_source']["content"]

      passage_text = ""

      for passage in hit['inner_hits']['passages']['hits']['hits']:
          passage_text += passage["fields"]["passages"][0]['text'][0] + "\n\n"

      pretty_output = (f"\nID: {id}\nDoc Title: {doc_title}\nparent text:\n{parent_text}\nPassage Text:\n{passage_text}\nScore: {score}\n")
      print(pretty_output)
      print("---")
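For reference, this is roughly the response shape the function above walks through. The dictionary is a hand-written mock, not real Elasticsearch output: the ID, score, and text are invented, and only the keys that `pretty_response` reads are included.

```python
# Hand-written mock of a search response with one hit and one inner hit.
mock_response = {
    "hits": {
        "hits": [
            {
                "_id": "example-id",
                "_score": 0.84,
                "_source": {
                    "content": "Full parent text ...",
                    "metadata": {"name": "Work From Home Policy"},
                },
                "inner_hits": {
                    "passages": {
                        "hits": {
                            "hits": [
                                {
                                    "fields": {
                                        "passages": [
                                            {"text": ["Matching passage text."]}
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                },
            }
        ]
    }
}

# The same path pretty_response follows to reach the passage text.
hit = mock_response["hits"]["hits"][0]
inner = hit["inner_hits"]["passages"]["hits"]["hits"]
first_passage = inner[0]["fields"]["passages"][0]["text"][0]
print(first_passage)
```

Passing `mock_response` to `pretty_response` is a quick way to check the formatting without a running cluster.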

## Example 1: Full Document, nested passages
In this example we will split a document into passages, and store the full document as a parent document. We will then store the passages as nested documents, with a link back to the parent document.

Below we are using the parent child splitter to split the full documents into passages. The `parent_child_splitter` function returns a list of documents, each with an array of nested passages.

We then index these documents into Elasticsearch. This will index the full document and the passages will be stored in a nested field. 

Our index pipeline processor will then run the inference on the passages, and store the embeddings in the index.

In [45]:
from elasticsearch import helpers

chunked_docs = parent_child_splitter(loader.load(), parent_chunk_size=None)

count, errors = helpers.bulk(
  client, 
  chunked_docs,
  index=INDEX_NAME
)

print(f"Indexed {count} documents with {errors} errors")

Indexed 15 documents with [] errors


### Perform a Nested Search
We can now perform a nested search, to find the passages that match our query, which will be returned in `inner_hits`.

In [47]:
response = client.search(
  index=INDEX_NAME, 
  knn={
    "inner_hits": {
      "_source": False,
      "fields": [
        "passages.text"
      ]
    },
    "field": "passages.vector.predicted_value",
    "k": 5,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "Whats the work from home policy?"
      }
    }
  }
)

pretty_response(response)


ID: AvgyPowBeCQuLJUsS_Tv
Doc Title: Work From Home Policy
Passage Text:
Effective: March 2020
Purpose

The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.
Scope


Score: 0.84295774

---

ID: CfgyPowBeCQuLJUsS_Tv
Doc Title: Intellectual Property Policy
Passage Text:
Scope
This policy applies to all employees, including full-time, part-time, temporary, and contract employees.


Score: 0.7304177

---

ID: BvgyPowBeCQuLJUsS_Tv
Doc Title: Company Vacation Policy
Passage Text:
Purpose

The purpose of this vacation policy is to outline the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes. This policy aims to promote a healthy work-life balance and encourage employees to take time to rest and recharge.
Scope


Score: 0.71928245

---

ID: BPgyPowBeCQuLJU
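The same kNN clause is reused several times in this notebook, so it can help to build it in one place. The helper below is a sketch: `build_nested_knn` is a hypothetical name, and the defaults simply mirror the values used in the search above.

```python
def build_nested_knn(query_text, model_id, k=5, num_candidates=100):
    """Build the knn clause for a nested passage search."""
    return {
        # Search the nested passage vectors, not a top-level field.
        "field": "passages.vector.predicted_value",
        "k": k,
        "num_candidates": num_candidates,
        # Let Elasticsearch embed the query text with the deployed model.
        "query_vector_builder": {
            "text_embedding": {"model_id": model_id, "model_text": query_text}
        },
        # Return the matching passage text alongside each parent hit.
        "inner_hits": {"_source": False, "fields": ["passages.text"]},
    }

knn = build_nested_knn(
    "Whats the work from home policy?",
    "sentence-transformers__all-minilm-l6-v2",
)
print(knn["k"], knn["field"])
```

With a connected client, the result can be passed as `client.search(index=INDEX_NAME, knn=knn)`.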

### With Langchain
We can also perform this search within LangChain with an adjustment to the query.

We also override the `doc_builder` to populate `page_content` with the matching passages rather than the full document.

In [62]:
from langchain.vectorstores.elasticsearch import ElasticsearchStore, ApproxRetrievalStrategy
from typing import List, Union
from langchain_core.documents import Document

class CustomRetrievalStrategy(ApproxRetrievalStrategy):

    def query(
      self,
      query: Union[str, None],
      filter: List[dict],
      **kwargs,
    ):
                
      es_query = {
        "knn": {
          "inner_hits": {
              "_source": False,
              "fields": [
                  "passages.text"
              ]
          },
          "field": "passages.vector.predicted_value",
          "filter": filter,
          "k": 5,
          "num_candidates": 100,
          "query_vector_builder": {
            "text_embedding": {
              "model_id": "sentence-transformers__all-minilm-l6-v2",
              "model_text": query
            }
          }
        }
      }

      return es_query
    

vector_store = ElasticsearchStore(
    index_name=INDEX_NAME,
    es_connection=client,
    query_field="content",
    strategy=CustomRetrievalStrategy(),
)

def doc_builder(hit):
  passage_hits = hit.get("inner_hits", {}).get("passages", {}).get("hits", {}).get("hits", [])
  page_content = ""
  for passage_hit in passage_hits:
    passage_fields = passage_hit.get("fields", {}).get("passages", [])[0]
    page_content += passage_fields.get("text", [])[0] + "\n\n"

  # Build the Document once, after every passage hit has been appended
  return Document(
    page_content=page_content,
    metadata=hit["_source"]["metadata"],
  )

vector_store.similarity_search(query="Whats the work from home policy?", doc_builder=doc_builder)

[Document(page_content='Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\n', metadata={'summary': 'This policy outlines the guidelines for full-time remote work, including eligibility, equipment and resources, workspace requirements, communication expectations, performance expectations, time tracking and overtime, confidentiality and data security, health and well-being, and policy reviews and updates. Employees are encouraged to direct any questions or concerns', 'updated_at': '2020-03-01', 'name': 'Work From Home Policy', 'source': '/Users/joe/projects/elastic/elasticsearch-labs/notebooks/langchain/temp.json', 'category': 'teams', 'seq_num': 1, 'url': './sharepoint/Work from home policy.txt'}),
 Document(page_content='Scope\nThis policy applies to all em

## Example 2: Parent Child Retriever
In the example above, the parent document holds the full document. Alternatively, you can split each document into chunks that are large enough to retain context, then split each chunk into many small passages stored inside it. The passage embeddings stay precise, and each passage links back to its parent chunk, which is what the search returns.

Below we use the same `parent_child_splitter`, but set `parent_chunk_size` to 2000 characters. Each parent chunk is then up to 2000 characters long, and each passage up to 200 characters long.

The output below shows that the index now holds 32 documents, representing the 15 documents from our dataset.

In [64]:
# delete documents in the index
client.delete_by_query(index=INDEX_NAME, query={"match_all": {}})

chunked_docs = parent_child_splitter(loader.load(), parent_chunk_size=2000, child_chunk_size=200)

count, errors = helpers.bulk(
  client, 
  chunked_docs,
  index=INDEX_NAME
)

print(f"Indexed {count} documents with {errors} errors")

Indexed 32 documents with [] errors
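To see how the parent chunks are distributed across source documents, you can count them by name. The sketch below runs on toy data so it works without a cluster; `chunks_per_document` is a hypothetical helper, and it assumes each chunk keeps the `metadata["name"]` of the document it came from.

```python
from collections import Counter

def chunks_per_document(chunked_docs):
    """Count how many parent chunks each source document produced."""
    return Counter(doc["metadata"]["name"] for doc in chunked_docs)

# Toy stand-in for the list returned by parent_child_splitter.
toy_chunked_docs = [
    {"content": "...", "metadata": {"name": "Doc A"},
     "passages": [{"text": "a1"}, {"text": "a2"}]},
    {"content": "...", "metadata": {"name": "Doc A"},
     "passages": [{"text": "a3"}]},
    {"content": "...", "metadata": {"name": "Doc B"},
     "passages": [{"text": "b1"}]},
]

counts = chunks_per_document(toy_chunked_docs)
total_passages = sum(len(doc["passages"]) for doc in toy_chunked_docs)
print(dict(counts), total_passages)
# {'Doc A': 2, 'Doc B': 1} 4
```

Calling `chunks_per_document(chunked_docs)` on the real list should show which of the source documents were long enough to be split into several parent chunks.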


### Retrieving the parent Chunks
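We can run the same nested dense vector query to retrieve the parent chunks. Setting `source_includes` to `content` and `metadata` returns each parent chunk without its full `passages` array; the matching passage text is still available through `inner_hits`.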

In [67]:
response = client.search(
  index=INDEX_NAME,
  source_includes=["content", "metadata"],
  knn={
    "inner_hits": {
      "_source": False,
      "fields": [
        "passages.text"
      ]
    },
    "field": "passages.vector.predicted_value",
    "k": 5,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "Whats the work from home policy?"
      }
    }
  }
)

pretty_response(response, show_parent_text=True)


ID: FfltP4wBeCQuLJUsARJC
Doc Title: Work From Home Policy
parent text:
Effective: March 2020
Purpose

The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.
Scope

This policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.
Eligibility

Employees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.
Equipment and Resources

The necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to s

### With Langchain
You can also use LangChain to retrieve the parent chunks. Combined with the nested query configured in the `ElasticsearchStore` strategy, this returns the parent chunks that match on one or more of their passages. Because no `doc_builder` is passed here, `page_content` holds the parent chunk text rather than the matching passages.

In [68]:
vector_store.similarity_search(query="Whats the work from home policy?")

[Document(page_content="Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\nThis policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\nEligibility\n\nEmployees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.\nEquipment and Resources\n\nThe necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees

In [22]:
client.indices.delete(index=INDEX_NAME)

ObjectApiResponse({'acknowledged': True})