diff --git a/README.md b/README.md index 8671f74a..ba49ad63 100644 --- a/README.md +++ b/README.md @@ -38,7 +38,9 @@ The [`notebooks`](notebooks/README.md) folder contains a range of executable Pyt - [`langchain-using-own-model.ipynb`](./notebooks/langchain/langchain-using-own-model.ipynb) ### Document Chunking + - [`Document Chunking with Ingest Pipelines`](./notebooks/document-chunking/with-index-pipelines.ipynb) +- ['Document Chunking with LangChain Splitters'](./notebooks/document-chunking/with-langchain-splitters.ipynb) ### Search diff --git a/notebooks/document-chunking/with-index-pipelines.ipynb b/notebooks/document-chunking/with-index-pipelines.ipynb index cd5e92fa..c4fced89 100644 --- a/notebooks/document-chunking/with-index-pipelines.ipynb +++ b/notebooks/document-chunking/with-index-pipelines.ipynb @@ -309,8 +309,6 @@ " index=INDEX_NAME, \n", " settings={\n", " \"index\": {\n", - " \"number_of_replicas\": \"1\",\n", - " \"number_of_shards\": \"1\",\n", " \"default_pipeline\": \"chunk_text_to_passages\"\n", " }\n", " },\n", diff --git a/notebooks/document-chunking/with-langchain-splitters.ipynb b/notebooks/document-chunking/with-langchain-splitters.ipynb new file mode 100644 index 00000000..cb28df95 --- /dev/null +++ b/notebooks/document-chunking/with-langchain-splitters.ipynb @@ -0,0 +1,641 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Document Chunking With LangChain Document Splitters\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/document-chunking/with-langchain-splitters.ipynb)\n", + "\n", + "**Using Elasticsearch Nested Dense Vector Support**\n", + "\n", + "This interactive notebook will:\n", + "- load the model \"sentence-transformers__all-minilm-l6-v2\" from Hugging Face and into Elasticsearch ML Node\n", + "- Use LangChain splitters to chunk the passages into sentences and index them into Elasticsearch with nested dense vector\n", + "- perform a search and return docs with the most relevant passages" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dependencies\n", + "In this notebook, we're going to use Langchain and the Elasticsearch python client.\n", + "\n", + "We will also require a running Elasticsearch instance with an ML node and model deployed to it." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "langserve 0.0.21 requires pydantic<2,>=1, but you have pydantic 2.3.0 which is incompatible.\n", + "poetry 1.6.1 requires build<0.11.0,>=0.10.0, but you have build 1.0.3 which is incompatible.\n", + "poetry 1.6.1 requires jsonschema<4.18.0,>=4.10.0, but you have jsonschema 4.19.1 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!python3 -m pip install -qU langchain elasticsearch " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to Elasticsearch\n", + "\n", + "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. 
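If you would rather point this notebook at a local or self-managed cluster instead, the same client can be created from a URL rather than a Cloud ID. A minimal sketch with placeholder credentials (the Elastic Cloud path used in the rest of this notebook follows below):

```python
from elasticsearch import Elasticsearch

# Sketch only: the endpoint, API key and CA path are placeholders for a self-managed cluster
client = Elasticsearch(
    "https://localhost:9200",   # assumed local endpoint
    api_key="YOUR_API_KEY",     # hypothetical credential
    ca_certs="./http_ca.crt",   # assumed path to the cluster's CA certificate
)
print(client.info())            # quick connectivity check
```
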
If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. \n", + "\n", + "We'll use the **Cloud ID** to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "from getpass import getpass\n", + "\n", + "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", + "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", + "\n", + "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", + "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch\n", + "\n", + "client = Elasticsearch(cloud_id=ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Download our example Dataset\n", + "We are going to use Langchain's tooling to ingest and split raw documents into smaller chunks. We are using our example workplace search dataset.\n", + "\n", + "LangChain has a number of other loaders to ingest data from other sources. See their [core loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/) or [loaders integration](https://python.langchain.com/docs/integrations/document_loaders) for more information. " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from urllib.request import urlopen\n", + "import json\n", + "\n", + "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/workplace-documents.json\"\n", + "\n", + "response = urlopen(url)\n", + "data = json.load(response)\n", + "\n", + "with open('temp.json', 'w') as json_file:\n", + " json.dump(data, json_file)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import JSONLoader \n", + "\n", + "def metadata_func(record: dict, metadata: dict) -> dict:\n", + " metadata[\"name\"] = record.get(\"name\")\n", + " metadata[\"summary\"] = record.get(\"summary\")\n", + " metadata[\"url\"] = record.get(\"url\")\n", + " metadata[\"category\"] = record.get(\"category\")\n", + " metadata[\"updated_at\"] = record.get(\"updated_at\")\n", + "\n", + " return metadata\n", + "\n", + "# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/\n", + "# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders\n", + "loader = JSONLoader(\n", + " file_path=\"temp.json\",\n", + " jq_schema=\".[]\",\n", + " content_key=\"content\",\n", + " metadata_func=metadata_func,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load Model from hugging face\n", + "The first thing you will need is a model to create the text embeddings out of the chunks, you can use whatever you would like, but this example will run end to end on the minilm-l6-v2 model. 
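The upload step below uses the `eland_import_hub_model` CLI, which ships with the eland package and was not part of the dependencies installed earlier. If it is missing from your environment, something like the following should pull it in; the `[pytorch]` extra is assumed here because the NLP model import path needs PyTorch:

```python
# eland provides the eland_import_hub_model CLI used below;
# the [pytorch] extra brings in the libraries needed for NLP model uploads
!python3 -m pip install -qU 'eland[pytorch]'
```
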
With an Elastic Cloud cluster created or another Elasticsearch cluster ready, we can upload the text embedding model using the eland library." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MODEL_ID = \"sentence-transformers__all-minilm-l6-v2\"\n", + "\n", + "!eland_import_hub_model \\\n", + " --cloud-id $ELASTIC_CLOUD_ID \\\n", + " --es-username elastic \\\n", + " --es-api-key $ELASTIC_API_KEY \\\n", + " --hub-model-id \"sentence-transformers/all-MiniLM-L6-v2\" \\\n", + " --task-type text_embedding \\\n", + " --clear-previous \\\n", + " --start" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setting up our Elasticsearch Index\n", + "In this example we're going to use a pipeline to do the inference and store the embeddings in our index. \n", + "\n", + "In this example, we are using the sentence transformers minilm-l6-v2 model, which you will need to is running on the ML node. With this model, we are setting up an index_pipeline to do the inference and store the embeddings in our index." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'nb_parent_retriever_index'})" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "PIPELINE_ID = \"chunk_text_to_passages\"\n", + "MODEL_DIMS = 384\n", + "INDEX_NAME = \"nb_parent_retriever_index\"\n", + "\n", + "# Create the pipeline\n", + "client.ingest.put_pipeline(\n", + " id=PIPELINE_ID, \n", + " processors=[\n", + " {\n", + " \"foreach\": {\n", + " \"field\": \"passages\",\n", + " \"processor\": {\n", + " \"inference\": {\n", + " \"field_map\": {\n", + " \"_ingest._value.text\": \"text_field\"\n", + " },\n", + " \"model_id\": MODEL_ID,\n", + " \"target_field\": \"_ingest._value.vector\",\n", + " \"on_failure\": [\n", + " {\n", + " \"append\": {\n", + " \"field\": \"_source._ingest.inference_errors\",\n", + " \"value\": [\n", + " {\n", + " \"message\": \"Processor 'inference' in pipeline 'ml-inference-title-vector' failed with message '{{ _ingest.on_failure_message }}'\",\n", + " \"pipeline\": \"ml-inference-title-vector\",\n", + " \"timestamp\": \"{{{ _ingest.timestamp }}}\"\n", + " }\n", + " ]\n", + " }\n", + " }\n", + " ]\n", + " }\n", + " }\n", + " }\n", + " }\n", + " ]\n", + ")\n", + "\n", + "# Create the index\n", + "client.indices.create( \n", + " index=INDEX_NAME, \n", + " settings={\n", + " \"index\": {\n", + " \"default_pipeline\": PIPELINE_ID\n", + " }\n", + " },\n", + " mappings={\n", + " \"dynamic\": \"true\",\n", + " \"properties\": {\n", + " \"passages\": {\n", + " \"type\": \"nested\",\n", + " \"properties\": {\n", + " \"vector\": {\n", + " \"properties\": {\n", + " \"predicted_value\": {\n", + " \"type\": \"dense_vector\",\n", + " \"index\": True,\n", + " \"dims\": MODEL_DIMS,\n", + " \"similarity\": \"dot_product\"\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Utils: Parent Child Splitter Function\n", + "This function will split a document into multiple passages, and return the parent document with the child passages. \n", + "\n", + "It also has an option to chunk the parent document into smaller documents, meaning the parent document will be split into multiple index documents. We will use this in example 2." 
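The splitter used inside this helper is LangChain's `RecursiveCharacterTextSplitter`, and its `chunk_size` is measured in characters, not tokens. A tiny self-contained sketch of what it produces (the sample text is made up purely for illustration):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)

sample = (
    "Remote work is allowed for all eligible employees. "
    "Requests should be sent to your manager. "
    "Equipment is provided by the company."
)

# Each chunk is at most ~50 characters, split on natural boundaries where possible
for chunk in splitter.split_text(sample):
    print(repr(chunk))
```
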
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "\n", + "def parent_child_splitter(documents, chunk_size: int = 200):\n", + "\n", + " child_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)\n", + "\n", + " docs = []\n", + " for i, doc in enumerate(documents):\n", + " passages = []\n", + "\n", + " for _doc in child_splitter.split_documents([doc]):\n", + " passages.append({\n", + " \"text\": _doc.page_content,\n", + " })\n", + "\n", + " doc = {\n", + " \"content\": doc.page_content,\n", + " \"metadata\": doc.metadata,\n", + " \"passages\": passages\n", + " }\n", + " docs.append(doc)\n", + " \n", + " return docs\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Utils: Pretty Response\n", + "This function will print out the response from Elasticsearch in an easier to read format." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "def pretty_response(response, show_parent_text=False):\n", + " if len(response['hits']['hits']) == 0:\n", + " print('Your search returned no results.')\n", + " else:\n", + " for hit in response['hits']['hits']:\n", + " id = hit['_id']\n", + " score = hit['_score']\n", + " doc_title = hit['_source'][\"metadata\"]['name']\n", + " parent_text = \"\"\n", + "\n", + " if show_parent_text:\n", + " parent_text = hit['_source'][\"content\"]\n", + "\n", + " passage_text = \"\"\n", + "\n", + " for passage in hit['inner_hits']['passages']['hits']['hits']:\n", + " passage_text += passage[\"fields\"][\"passages\"][0]['text'][0] + \"\\n\\n\"\n", + "\n", + " pretty_output = (f\"\\nID: {id}\\nDoc Title: {doc_title}\\nparent text:\\n{parent_text}\\nPassage Text:\\n{passage_text}\\nScore: {score}\\n\")\n", + " print(pretty_output)\n", + " print(\"---\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Full Document, nested passages\n", + "In this example we will split a document into passages, and store the full document as a parent document. We will then store the passages as nested documents, with a link back to the parent document.\n", + "\n", + "Below we are using the parent child splitter to split the full documents into passages. The `parent_child_splitter` fn returns a list of documents, with an array of nested passages. \n", + "\n", + "We then index these documents into Elasticsearch. This will index the full document and the passages will be stored in a nested field. \n", + "\n", + "Our index pipeline processor will then run the inference on the passages, and store the embeddings in the index." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Indexed 15 documents with [] errors\n" + ] + } + ], + "source": [ + "from elasticsearch import helpers\n", + "\n", + "chunked_docs = parent_child_splitter(loader.load(), chunk_size=600)\n", + "\n", + "count, errors = helpers.bulk(\n", + " client, \n", + " chunked_docs,\n", + " index=INDEX_NAME\n", + ")\n", + "\n", + "print(f\"Indexed {count} documents with {errors} errors\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Perform a Nested Search\n", + "We can now perform a nested search, to find the passages that match our query, which will be returned in `inner_hits`." 
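One detail worth knowing: Elasticsearch returns at most three nested matches per parent document in `inner_hits` by default. If you want only the single best passage per document (or more than three), add a `size` to the inner hits block — a sketch reusing the `INDEX_NAME`, `MODEL_ID` and `pretty_response` defined above:

```python
# Variation on the search below: limit inner_hits to the single best passage per parent
response = client.search(
    index=INDEX_NAME,
    knn={
        "inner_hits": {
            "size": 1,                      # best-matching passage only
            "_source": False,
            "fields": ["passages.text"],
        },
        "field": "passages.vector.predicted_value",
        "k": 5,
        "num_candidates": 100,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": MODEL_ID,
                "model_text": "Whats the work from home policy?",
            }
        },
    },
)

pretty_response(response)
```
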
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "ID: HP6WfIwBeCQuLJUs19ov\n", + "Doc Title: Work From Home Policy\n", + "parent text:\n", + "\n", + "Passage Text:\n", + "Effective: March 2020\n", + "Purpose\n", + "\n", + "The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\n", + "Scope\n", + "\n", + "This policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\n", + "Eligibility\n", + "\n", + "\n", + "Score: 0.8483097\n", + "\n", + "---\n", + "\n", + "ID: I_6WfIwBeCQuLJUs19ov\n", + "Doc Title: Intellectual Property Policy\n", + "parent text:\n", + "\n", + "Passage Text:\n", + "Purpose\n", + "The purpose of this Intellectual Property Policy is to establish guidelines and procedures for the ownership, protection, and utilization of intellectual property generated by employees during their employment. This policy aims to encourage creativity and innovation while ensuring that the interests of both the company and its employees are protected.\n", + "\n", + "Scope\n", + "This policy applies to all employees, including full-time, part-time, temporary, and contract employees.\n", + "\n", + "\n", + "Score: 0.7292882\n", + "\n", + "---\n", + "\n", + "ID: IP6WfIwBeCQuLJUs19ov\n", + "Doc Title: Company Vacation Policy\n", + "parent text:\n", + "\n", + "Passage Text:\n", + "Purpose\n", + "\n", + "The purpose of this vacation policy is to outline the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes. This policy aims to promote a healthy work-life balance and encourage employees to take time to rest and recharge.\n", + "Scope\n", + "\n", + "This policy applies to all full-time and part-time employees who have completed their probationary period.\n", + "Vacation Accrual\n", + "\n", + "\n", + "Score: 0.7137784\n", + "\n", + "---\n", + "\n", + "ID: Hv6WfIwBeCQuLJUs19ov\n", + "Doc Title: Wfh Policy Update May 2023\n", + "parent text:\n", + "\n", + "Passage Text:\n", + "As we continue to prioritize the well-being of our employees, we are making a slight adjustment to our hybrid work policy. Starting May 1, 2023, employees will be required to work from the office three days a week, with two days designated for remote work. Please communicate with your supervisor and HR department to establish your updated in-office workdays.\n", + "\n", + "\n", + "Score: 0.70840263\n", + "\n", + "---\n", + "\n", + "ID: Kv6WfIwBeCQuLJUs19ov\n", + "Doc Title: New Employee Onboarding Guide\n", + "parent text:\n", + "\n", + "Passage Text:\n", + "Review benefits options: Carefully review the benefits package and choose the options that best meet your needs.\n", + "Complete enrollment forms: Fill out the necessary forms to enroll in your chosen benefits. 
Submit these forms to the HR department within 30 days of your start date.\n", + "Designate beneficiaries: If applicable, designate beneficiaries for your life insurance and retirement plans.\n", + "Getting Settled in Your Workspace\n", + "To help you feel comfortable and productive in your new workspace, take the following steps:\n", + "\n", + "\n", + "Score: 0.6890813\n", + "\n", + "---\n" + ] + } + ], + "source": [ + "response = client.search(\n", + " index=INDEX_NAME, \n", + " knn={\n", + " \"inner_hits\": {\n", + " \"_source\": False,\n", + " \"fields\": [\n", + " \"passages.text\"\n", + " ]\n", + " },\n", + " \"field\": \"passages.vector.predicted_value\",\n", + " \"k\": 5,\n", + " \"num_candidates\": 100,\n", + " \"query_vector_builder\": {\n", + " \"text_embedding\": {\n", + " \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n", + " \"model_text\": \"Whats the work from home policy?\"\n", + " }\n", + " }\n", + " }\n", + ")\n", + "\n", + "pretty_response(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### With Langchain\n", + "We can also peform this search within Langchain with an adjustment to the query.\n", + "\n", + "We also override the `doc_builder` to populate the `site_content` with the passages rather than the full document." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='Effective: March 2020\\nPurpose\\n\\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\\nScope\\n\\nThis policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\\nEligibility\\n\\n', metadata={'summary': 'This policy outlines the guidelines for full-time remote work, including eligibility, equipment and resources, workspace requirements, communication expectations, performance expectations, time tracking and overtime, confidentiality and data security, health and well-being, and policy reviews and updates. Employees are encouraged to direct any questions or concerns', 'updated_at': '2020-03-01', 'name': 'Work From Home Policy', 'source': '/Users/joe/projects/elastic/elasticsearch-labs/notebooks/document-chunking/temp.json', 'category': 'teams', 'seq_num': 1, 'url': './sharepoint/Work from home policy.txt'}),\n", + " Document(page_content='Purpose\\nThe purpose of this Intellectual Property Policy is to establish guidelines and procedures for the ownership, protection, and utilization of intellectual property generated by employees during their employment. This policy aims to encourage creativity and innovation while ensuring that the interests of both the company and its employees are protected.\\n\\nScope\\nThis policy applies to all employees, including full-time, part-time, temporary, and contract employees.\\n\\n', metadata={'summary': \"This Intellectual Property Policy outlines guidelines and procedures for the ownership, protection, and utilization of intellectual property generated by employees during their employment. 
It establishes the company's ownership of work generated on company time, while recognizing employee ownership of work generated outside of company time without the use of company resources. The policy\", 'updated_at': None, 'name': 'Intellectual Property Policy', 'source': '/Users/joe/projects/elastic/elasticsearch-labs/notebooks/document-chunking/temp.json', 'category': 'sharepoint', 'seq_num': 8, 'url': 'https://enterprisesearch.sharepoint.com/:t:/s/MSBuilddemo/EWz3cYEVdzBNsiHsYbKhms4BVYGhravyrUw3T3lzxL4pTg?e=mPIgbO'}),\n", + " Document(page_content='Purpose\\n\\nThe purpose of this vacation policy is to outline the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes. This policy aims to promote a healthy work-life balance and encourage employees to take time to rest and recharge.\\nScope\\n\\nThis policy applies to all full-time and part-time employees who have completed their probationary period.\\nVacation Accrual\\n\\n', metadata={'summary': ': This policy outlines the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes. Full-time employees accrue vacation time at a rate of [X hours] per month, equivalent to [Y days] per year. Vacation requests must be submitted to supervisors at least', 'updated_at': '2018-04-16', 'name': 'Company Vacation Policy', 'source': '/Users/joe/projects/elastic/elasticsearch-labs/notebooks/document-chunking/temp.json', 'category': 'sharepoint', 'seq_num': 5, 'url': 'https://enterprisesearch.sharepoint.com/:t:/s/MSBuilddemo/ES6rw9bKZxVBobG1WUoJpikBF9Bhx1pw_GvJWbsg-Z_HNA?e=faSHVt'}),\n", + " Document(page_content='As we continue to prioritize the well-being of our employees, we are making a slight adjustment to our hybrid work policy. Starting May 1, 2023, employees will be required to work from the office three days a week, with two days designated for remote work. 
Please communicate with your supervisor and HR department to establish your updated in-office workdays.\\n\\n', metadata={'summary': 'Starting May 1, 2023, our hybrid work policy will require employees to work from the office three days a week and two days remotely.', 'updated_at': '2023-05-01', 'name': 'Wfh Policy Update May 2023', 'source': '/Users/joe/projects/elastic/elasticsearch-labs/notebooks/document-chunking/temp.json', 'category': 'teams', 'seq_num': 3, 'url': './sharepoint/WFH policy update May 2023.txt'})]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain.vectorstores.elasticsearch import ElasticsearchStore, ApproxRetrievalStrategy\n", + "from typing import List, Union\n", + "from langchain_core.documents import Document\n", + "\n", + "class CustomRetrievalStrategy(ApproxRetrievalStrategy):\n", + "\n", + " def query(\n", + " self,\n", + " query: Union[str, None],\n", + " filter: List[dict],\n", + " **kwargs,\n", + " ):\n", + " \n", + " es_query = {\n", + " \"knn\": {\n", + " \"inner_hits\": {\n", + " \"_source\": False,\n", + " \"fields\": [\n", + " \"passages.text\"\n", + " ]\n", + " },\n", + " \"field\": \"passages.vector.predicted_value\",\n", + " \"filter\": filter,\n", + " \"k\": 5,\n", + " \"num_candidates\": 100,\n", + " \"query_vector_builder\": {\n", + " \"text_embedding\": {\n", + " \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n", + " \"model_text\": query\n", + " }\n", + " }\n", + " }\n", + " }\n", + "\n", + " return es_query\n", + " \n", + "\n", + "vector_store = ElasticsearchStore(\n", + " index_name=INDEX_NAME,\n", + " es_connection=client,\n", + " query_field=\"content\",\n", + " strategy=CustomRetrievalStrategy(),\n", + ")\n", + "\n", + "def doc_builder(hit):\n", + " passage_hits = hit.get(\"inner_hits\", {}).get(\"passages\", {}).get(\"hits\", {}).get(\"hits\", [])\n", + " page_content = \"\"\n", + " for passage_hit in passage_hits:\n", + " passage_fields = passage_hit.get(\"fields\", {}).get(\"passages\", [])[0]\n", + " page_content += passage_fields.get(\"text\", [])[0] + \"\\n\\n\"\n", + "\n", + " return Document(\n", + " page_content=page_content,\n", + " metadata=hit[\"_source\"][\"metadata\"],\n", + " )\n", + "\n", + "vector_store.similarity_search(query=\"Whats the work from home policy?\", doc_builder=doc_builder)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'acknowledged': True})" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.indices.delete(index=INDEX_NAME)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb b/notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb index 5ca8cf9b..04acd6ec 100644 --- a/notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb +++ b/notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb @@ -303,33 
+303,6 @@ "from langchain.prompts import ChatPromptTemplate, PromptTemplate\n", "from langchain.schema import format_document\n", "\n", - "def custom_query(query_body, query):\n", - " filters = query_body.get(\"knn\", {}).get(\"filter\", [])\n", - " \n", - " print(f\"filters: {filters}\")\n", - " print(f\"query: {query}\")\n", - "\n", - " if query.strip() != \"\":\n", - " query_clause = [{\n", - " \"multi_match\": {\n", - " \"query\": query,\n", - " \"fields\": [\"text\"],\n", - " \"fuzziness\": \"AUTO\",\n", - " }\n", - " }]\n", - " else:\n", - " query_clause = []\n", - "\n", - "\n", - " return {\n", - " \"query\": {\n", - " \"bool\": {\n", - " \"filter\": filters,\n", - " \"must\": query_clause\n", - " }\n", - " },\n", - " }\n", - "\n", "retriever = SelfQueryRetriever.from_llm(\n", " llm, \n", " vectorstore, \n",