diff --git a/colab-notebooks-examples/generative-ai/chatbot.ipynb b/colab-notebooks-examples/generative-ai/chatbot.ipynb new file mode 100644 index 00000000..e69de29b diff --git a/colab-notebooks-examples/generative-ai/question-answering.ipynb b/colab-notebooks-examples/generative-ai/question-answering.ipynb new file mode 100644 index 00000000..6a8313d3 --- /dev/null +++ b/colab-notebooks-examples/generative-ai/question-answering.ipynb @@ -0,0 +1,340 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "tZnIXBfrRpex" + }, + "source": [ + "# Question Answering with Langchain and OpenAI\n", + "\n", + "\"Open\n", + "\n", + "This interactive notebook uses Langchain to split fictional workplace documents into passages and uses OpenAI to transform these passages into embeddings and store them into Elasticsearch.\n", + "\n", + "\n", + "![image.png]()\n", + "\n", + "Then when we ask a question, we retrieve the relevant passages from the vector store and use langchain and OpenAI to provide a summary for the question." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "GyAst2W-VpHb" + }, + "source": [ + "## Install packages and import modules\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to install the `elasticsearch` Python client." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "33A-cP-XvFCr" + }, + "outputs": [], + "source": [ + "!pip install -qU langchain jq openai elasticsearch tiktoken" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "J8-93TiJsNyK" + }, + "outputs": [], + "source": [ + "import json\n", + "from urllib.request import urlopen\n", + "\n", + "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/esre-examples/python-flask-example/example-data/data.json\"\n", + "\n", + "response = urlopen(url)\n", + "\n", + "workplace_docs = json.loads(response.read())\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "qtEOCsCLWCZp" + }, + "source": [ + "## Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + "- Select **Create deployment**\n", + "\n", + "Now we can instantiate the [Elasticsearch python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the cloud id and password in your deployment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "a-t1mglib54F" + }, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch\n", + "\n", + "client = Elasticsearch(\n", + " cloud_id=\"CLOUD_ID\",\n", + " basic_auth=(\"elastic\", \"PASSWORD\")\n", + ")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "p0cQFDl1b9v4" + }, + "source": [ + "### Split Documents into Passages\n", + "\n", + "We’ll chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.\n", + "\n", + "Here we are chunking documents into passages of up to 800 characters with an overlap of 400 characters (`CharacterTextSplitter` measures length in characters by default, not tokens).\n", + "\n", + "Here we are using a simple splitter, but Langchain offers more advanced splitters to reduce the chance of context being lost." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dbHEoTF6vBXE" + }, + "outputs": [], + "source": [ + "from langchain.embeddings.openai import OpenAIEmbeddings\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "from langchain.embeddings import OpenAIEmbeddings\n", + "\n", + "metadata = []\n", + "content = []\n", + "\n", + "for doc in workplace_docs:\n", + " content.append(doc[\"content\"])\n", + " metadata.append({\n", + " \"name\": doc[\"name\"],\n", + " \"summary\": doc[\"summary\"]\n", + " })\n", + "\n", + "text_splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=400)\n", + "docs = text_splitter.create_documents(content, metadatas=metadata)\n", + "\n", + "embeddings = OpenAIEmbeddings(openai_api_key=\"OPENAI_KEY\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "kRPxrJnXWfMD" + }, + "source": [ + "### Set Up the Index\n", + "Next, define the mapping for the passages. Langchain relies on two fields: `text` and `vector`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BfhBEjLsOr0l" + }, + "outputs": [], + "source": [ + "# Define the mapping\n", + "mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"text\": { \"type\": \"keyword\" },\n", + " \"vector\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": 1536,\n", + " \"index\": True,\n", + " \"similarity\": \"cosine\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "# Create the index\n", + "client.indices.create(index='workplace_index', body=mapping)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "RmCUl0hxW4lG" + }, + "source": [ + "### Enrich Passages with OpenAI Model\n", + "\n", + "Next we are going to enrich each passage with an embedding from OpenAI." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Qc1LXk-rOzNR" + }, + "outputs": [], + "source": [ + "# get the embeddings from openAI\n", + "\n", + "texts = []\n", + "for passage in docs:\n", + " texts.append(passage.page_content)\n", + "\n", + "textEmbeddings = embeddings.embed_documents(texts)\n", + "\n", + "# persist the passage documents into elasticsearch\n", + "\n", + "actions = []\n", + "for i, passage in enumerate(docs):\n", + " actions.append({\"index\": {\"_index\": \"workplace_index\"}})\n", + " actions.append({\n", + " \"text\": passage.page_content,\n", + " \"vector\": textEmbeddings[i],\n", + " \"metadata\": passage.metadata\n", + " })\n", + "\n", + "client.bulk(operations=actions)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "rXJH_MiWejv7" + }, + "source": [ + "## Asking a question\n", + "Now that we have the passages stored in Elasticsearch, we can now ask a question to get the relevant passages." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "OobeBT6rek7Q", + "outputId": "ba7b3a7a-253e-4e7f-83b9-cec07ebdac09" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "---- answer ----\n", + " The NASA Sales Team is a regional sales team consisting of dedicated account managers, sales representatives, and support staff, led by their respective Area Vice-Presidents (Laura Martinez and Gary Johnson). They are responsible for identifying and pursuing new business opportunities, nurturing existing client relationships, and ensuring customer satisfaction.\n", + "---- sources ----\n", + "Sales Organization Overview\n", + "Our sales organization is structured to effectively serve our customers and achieve our business objectives across multiple regions. The organization is divided into the following main regions:\n", + "\n", + "The Americas: This region includes the United States, Canada, Mexico, as well as Central and South America. The North America South America region (NASA) has two Area Vice-Presidents: Laura Martinez is the Area Vice-President of North America, and Gary Johnson is the Area Vice-President of South America.\n", + "Sales Organization Overview\n", + "Each regional sales team consists of dedicated account managers, sales representatives, and support staff, led by their respective Area Vice-Presidents. They are responsible for identifying and pursuing new business opportunities, nurturing existing client relationships, and ensuring customer satisfaction. 
The teams collaborate closely with other departments, such as marketing, product development, and customer support, to ensure we consistently deliver high-quality products and services to our clients.\n", + "Sales Engineering Collaboration\n", + "Title: Working with the Sales Team as an Engineer in a Tech Company\n", + "Sales Engineering Collaboration\n", + "Introduction:\n", + "As an engineer in a tech company, collaboration with the sales team is essential to ensure the success of the company's products and services. This guidance document aims to provide an overview of how engineers can effectively work with the sales team, fostering a positive and productive working environment.\n", + "Understanding the Sales Team's Role:\n", + "The sales team is responsible for promoting and selling the company's products and services to potential clients. Their role involves establishing relationships with customers, understanding their needs, and ensuring that the offered solutions align with their requirements.\n" + ] + } + ], + "source": [ + "from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch\n", + "from langchain.llms import OpenAI\n", + "from langchain.chains import RetrievalQA\n", + "\n", + "db = ElasticKnnSearch(\n", + " es_connection=client, index_name=\"workplace_index\", embedding=embeddings\n", + ")\n", + "\n", + "retriever = db.as_retriever()\n", + "\n", + "llm = OpenAI(openai_api_key=\"OPENAI_KEY\")\n", + "\n", + "qa = RetrievalQA.from_chain_type(\n", + " llm=llm,\n", + " chain_type=\"stuff\",\n", + " retriever=retriever,\n", + " return_source_documents=True\n", + ")\n", + "\n", + "ans = qa({\"query\": \"what is the nasa sales team?\"})\n", + "\n", + "print(\"---- answer ----\")\n", + "print(ans[\"result\"])\n", + "print(\"---- sources ----\")\n", + "for doc in ans[\"source_documents\"]:\n", + " print(doc.metadata[\"name\"])\n", + " print(doc.page_content)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + 
"colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SOeP4-DLnwDB", + "outputId": "acff2056-eec6-46e5-cddc-9af9050c046e" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'acknowledged': True})" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.indices.delete(index=\"workplace_index\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/colab-notebooks-examples/index.md b/colab-notebooks-examples/index.md new file mode 100644 index 00000000..e69de29b diff --git a/colab-notebooks-examples/integrations/hugging-face/hugging-face-endpoints.ipynb b/colab-notebooks-examples/integrations/hugging-face/hugging-face-endpoints.ipynb new file mode 100644 index 00000000..e69de29b diff --git a/colab-notebooks-examples/integrations/hugging-face/loading-model-from-hugging-face.ipynb b/colab-notebooks-examples/integrations/hugging-face/loading-model-from-hugging-face.ipynb new file mode 100644 index 00000000..e69de29b diff --git a/colab-notebooks-examples/integrations/llama-index/intro.ipynb b/colab-notebooks-examples/integrations/llama-index/intro.ipynb new file mode 100644 index 00000000..e69de29b diff --git a/colab-notebooks-examples/integrations/openai/intro.ipynb b/colab-notebooks-examples/integrations/openai/intro.ipynb new file mode 100644 index 00000000..e69de29b diff --git a/colab-notebooks-examples/langchain/langchain-elasticsearch-embeddings.ipynb b/colab-notebooks-examples/langchain/langchain-elasticsearch-embeddings.ipynb new file mode 100644 index 00000000..264fe34f --- /dev/null +++ b/colab-notebooks-examples/langchain/langchain-elasticsearch-embeddings.ipynb @@ -0,0 +1,18 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + 
"metadata": {}, + "source": [] + } + ], + "metadata": { + "language_info": { + "name": "python" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/colab-notebooks-examples/langchain/langchain-vector-store.ipynb b/colab-notebooks-examples/langchain/langchain-vector-store.ipynb new file mode 100644 index 00000000..e69de29b diff --git a/colab-notebooks-examples/search/00-quick-start.ipynb b/colab-notebooks-examples/search/00-quick-start.ipynb new file mode 100644 index 00000000..daccb2fe --- /dev/null +++ b/colab-notebooks-examples/search/00-quick-start.ipynb @@ -0,0 +1,783 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "87773ce7", + "metadata": { + "id": "87773ce7" + }, + "source": [ + "# Elasticsearch Quick Start\n", + "\n", + "\"Open\n", + "\n", + "This interactive notebook will introduce you to the very basics of getting started with simple Elasticsearch queries, using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html).\n", + "We'll run through getting the client up and running, indexing a small data set into Elasticsearch, and performing basic searches against your data." 
+ ] + }, + { + "cell_type": "markdown", + "id": "a32202e2", + "metadata": { + "id": "a32202e2" + }, + "source": [ + "## Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Select **Create deployment**" + ] + }, + { + "cell_type": "markdown", + "id": "52a6a607", + "metadata": { + "id": "52a6a607" + }, + "source": [ + "## Install packages and import modules\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to install the `elasticsearch` Python client." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ffc5fa6f", + "metadata": { + "id": "ffc5fa6f", + "scrolled": false + }, + "outputs": [], + "source": [ + "!pip install -qU elasticsearch sentence-transformers==2.2.2" + ] + }, + { + "cell_type": "markdown", + "id": "28AH8LhI-0UD", + "metadata": { + "id": "28AH8LhI-0UD" + }, + "source": [ + "# Set Up the Embedding Model\n", + "For this example, we're using the `all-MiniLM-L6-v2` model, part of the `sentence_transformers` library. You can read more about this model on [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "WHC3hHGW-wbI", + "metadata": { + "id": "WHC3hHGW-wbI" + }, + "outputs": [], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "import torch\n", + "\n", + "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n", + "\n", + "model = SentenceTransformer('all-MiniLM-L6-v2', device=device)\n", + "model" + ] + }, + { + "cell_type": "markdown", + "id": "0241694c", + "metadata": { + "id": "0241694c" + }, + "source": [ + "## Initialize the Elasticsearch client\n", + "\n", + "Now we can instantiate the [Elasticsearch python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the cloud id and password in your deployment." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "f38e0397", + "metadata": { + "id": "f38e0397", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "ad6df489-d242-4229-a42a-39c5ca19d124" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Elastic Cloud ID··········\n", + "Elastic Password··········\n" + ] + } + ], + "source": [ + "from elasticsearch import Elasticsearch\n", + "from getpass import getpass\n", + "\n", + "CLOUD_ID = getpass(\"Elastic Cloud ID\")\n", + "CLOUD_PASSWORD = getpass(\"Elastic Password\")\n", + "\n", + "# Create the client instance\n", + "client = Elasticsearch(\n", + " cloud_id=CLOUD_ID,\n", + " basic_auth=(\"elastic\", CLOUD_PASSWORD)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fcd165fa", + "metadata": { + "id": "fcd165fa" + }, + "source": [ + "If you're running Elasticsearch locally or self-managed, you can pass in the Elasticsearch host instead. 
[Read more](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#_verifying_https_with_certificate_fingerprints_python_3_10_or_later) on how to connect to Elasticsearch locally" + ] + }, + { + "cell_type": "markdown", + "id": "1462ebd8", + "metadata": { + "id": "1462ebd8" + }, + "source": [ + "Confirm that the client has connected with this test." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "25c618eb", + "metadata": { + "id": "25c618eb", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "30a6ba5b-5109-4457-ddfe-5633a077ca9b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'name': 'instance-0000000000', 'cluster_name': '1a56ad21587c44d3930932eb9fa1d8e8', 'cluster_uuid': 'gX4zlwtlR4qhZpp1SPm4Yg', 'version': {'number': '8.8.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '98e1271edf932a480e4262a471281f1ee295ce6b', 'build_date': '2023-06-26T05:16:16.196344851Z', 'build_snapshot': False, 'lucene_version': '9.6.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}\n" + ] + } + ], + "source": [ + "print(client.info())" + ] + }, + { + "cell_type": "markdown", + "id": "61e1e6d8", + "metadata": { + "id": "61e1e6d8" + }, + "source": [ + "## Index some test data\n", + "\n", + "Our client is set up and connected to our Elastic deployment.\n", + "Now we need some data to test out the basics of Elasticsearch queries.\n", + "We'll use a small index of books with the following fields:\n", + "\n", + "- `title`\n", + "- `authors`\n", + "- `publish_date`\n", + "- `num_reviews`\n", + "- `publisher`\n", + "\n", + "### Create index\n", + "\n", + "Let's create an Elasticsearch index with the correct mappings for our test data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bc95238", + "metadata": { + "id": "6bc95238" + }, + "outputs": [], + "source": [ + "# Define the mapping\n", + "mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"title\": {\"type\": \"text\"},\n", + " \"authors\": {\"type\": \"keyword\"},\n", + " \"summary\": {\"type\": \"text\"},\n", + " \"publish_date\": {\"type\": \"date\"},\n", + " \"num_reviews\": {\"type\": \"integer\"},\n", + " \"publisher\": {\"type\": \"keyword\"},\n", + " \"title_vector\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": 384,\n", + " \"index\": \"true\",\n", + " \"similarity\": \"cosine\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "# Create the index\n", + "client.indices.create(index='book_index', body=mapping)\n" + ] + }, + { + "cell_type": "markdown", + "id": "075f5eb6", + "metadata": { + "id": "075f5eb6" + }, + "source": [ + "### Index test data\n", + "\n", + "Run the following command to upload some test data, containing information about 10 popular programming books from this [dataset](https://raw.githubusercontent.com/joemcelroy/elasticsearch-labs/notebooks-guides/colab-notebooks-examples/search/data.json)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "008d723e", + "metadata": { + "id": "008d723e" + }, + "outputs": [], + "source": [ + "import json\n", + "from urllib.request import urlopen\n", + "\n", + "url = \"https://raw.githubusercontent.com/joemcelroy/elasticsearch-labs/notebooks-guides/colab-notebooks-examples/search/data.json\"\n", + "response = urlopen(url)\n", + "books = json.loads(response.read())\n", + "\n", + "actions = []\n", + "for book in books:\n", + " actions.append({\"index\": {\"_index\": \"book_index\"}})\n", + " # Transforming the title into an embedding using the model\n", + " book[\"title_vector\"] = model.encode(book[\"title\"]).tolist()\n", + " actions.append(book)\n", + "client.bulk(index=\"book_index\", operations=actions)\n" + ] + }, + { + "cell_type": "markdown", + "id": "cd8b03e0", + "metadata": { + "id": "cd8b03e0" + }, + "source": [ + "## Aside: Pretty printing Elasticsearch responses\n", + "\n", + "Your API calls will return hard-to-read nested JSON.\n", + "We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "f12ce2c9", + "metadata": { + "id": "f12ce2c9" + }, + "outputs": [], + "source": [ + "def pretty_response(response):\n", + " for hit in response['hits']['hits']:\n", + " id = hit['_id']\n", + " publication_date = hit['_source']['publish_date']\n", + " score = hit['_score']\n", + " title = hit['_source']['title']\n", + " summary = hit['_source']['summary']\n", + " publisher = hit[\"_source\"][\"publisher\"]\n", + " num_reviews = hit[\"_source\"][\"num_reviews\"]\n", + " authors = hit[\"_source\"][\"authors\"]\n", + " pretty_output = (f\"\\nID: {id}\\nPublication date: {publication_date}\\nTitle: {title}\\nSummary: {summary}\\nPublisher: {publisher}\\nReviews: {num_reviews}\\nAuthors: {authors}\\nScore: {score}\")\n", + " print(pretty_output)" + ] + }, + { + "cell_type": "markdown", + "id": "39bdefe0", + "metadata": { + "id": "39bdefe0" + }, + "source": [ + "## Making Queries\n", + "\n", + "Now that we have indexed the books, we want to perform a semantic search for books whose titles are similar in meaning to the query. We embed the query with the same model and run a kNN search against the `title_vector` field." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "Df7hwcIjYwMT", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Df7hwcIjYwMT", + "outputId": "e63884d7-d4a5-4f5d-ea43-fc2f0793f040" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "ID: OOlWP4kB-GB5Evg6zHVx\n", + "Publication date: 2008-05-15\n", + "Title: JavaScript: The Good Parts\n", + "Summary: A deep dive into the parts of JavaScript that are essential to writing maintainable code\n", + "Publisher: oreilly\n", + "Reviews: 51\n", + "Authors: ['douglas crockford']\n", + "Score: 0.8075247\n", + "\n", + "ID: NOlWP4kB-GB5Evg6zHVx\n", + "Publication date: 2015-03-27\n", + "Title: You Don't Know JS: Up & Going\n", + "Summary: Introduction to JavaScript and programming as a whole\n", + "Publisher: oreilly\n", + "Reviews: 36\n", + "Authors: ['kyle simpson']\n", + "Score: 0.6946182\n", + "\n", + "ID: NelWP4kB-GB5Evg6zHVx\n", + "Publication date: 2018-12-04\n", + "Title: Eloquent JavaScript\n", + "Summary: A modern introduction to programming\n", + "Publisher: no starch press\n", + "Reviews: 38\n", + "Authors: ['marijn haverbeke']\n", + "Score: 0.66179085\n", + "\n", + "ID: MOlWP4kB-GB5Evg6zHVx\n", + "Publication date: 2019-10-29\n", + "Title: The Pragmatic Programmer: Your Journey to Mastery\n", + "Summary: A guide to pragmatic programming for software engineers and developers\n", + "Publisher: addison-wesley\n", + "Reviews: 30\n", + "Authors: ['andrew hunt', 'david thomas']\n", + "Score: 0.61159486\n", + "\n", + "ID: OelWP4kB-GB5Evg6zHVx\n", + "Publication date: 2012-06-27\n", + "Title: Introduction to the Theory of Computation\n", + "Summary: Introduction to the theory of computation and complexity theory\n", + "Publisher: cengage learning\n", + "Reviews: 33\n", + "Authors: ['michael sipser']\n", + "Score: 0.58697784\n", + "\n", + "ID: N-lWP4kB-GB5Evg6zHVx\n", + "Publication date: 2011-05-13\n", + "Title: The Clean Coder: 
A Code of Conduct for Professional Programmers\n", + "Summary: A guide to professional conduct in the field of software engineering\n", + "Publisher: prentice hall\n", + "Reviews: 20\n", + "Authors: ['robert c. martin']\n", + "Score: 0.57042736\n", + "\n", + "ID: NulWP4kB-GB5Evg6zHVx\n", + "Publication date: 1994-10-31\n", + "Title: Design Patterns: Elements of Reusable Object-Oriented Software\n", + "Summary: Guide to design patterns that can be used in any object-oriented language\n", + "Publisher: addison-wesley\n", + "Reviews: 45\n", + "Authors: ['erich gamma', 'richard helm', 'ralph johnson', 'john vlissides']\n", + "Score: 0.56175697\n", + "\n", + "ID: M-lWP4kB-GB5Evg6zHVx\n", + "Publication date: 2008-08-11\n", + "Title: Clean Code: A Handbook of Agile Software Craftsmanship\n", + "Summary: A guide to writing code that is easy to read, understand and maintain\n", + "Publisher: prentice hall\n", + "Reviews: 55\n", + "Authors: ['robert c. martin']\n", + "Score: 0.55407417\n", + "\n", + "ID: MulWP4kB-GB5Evg6zHVx\n", + "Publication date: 2020-04-06\n", + "Title: Artificial Intelligence: A Modern Approach\n", + "Summary: Comprehensive introduction to the theory and practice of artificial intelligence\n", + "Publisher: pearson\n", + "Reviews: 39\n", + "Authors: ['stuart russell', 'peter norvig']\n", + "Score: 0.5461982\n", + "\n", + "ID: MelWP4kB-GB5Evg6zHVx\n", + "Publication date: 2019-05-03\n", + "Title: Python Crash Course\n", + "Summary: A fast-paced, no-nonsense guide to programming in Python\n", + "Publisher: no starch press\n", + "Reviews: 42\n", + "Authors: ['eric matthes']\n", + "Score: 0.536102\n" + ] + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + ":1: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. 
Instead use individual parameters.\n", + " response = client.search(index=\"book_index\", body={\n" + ] + } + ], + "source": [ + "response = client.search(index=\"book_index\", body={\n", + " \"knn\": {\n", + " \"field\": \"title_vector\",\n", + " \"query_vector\": model.encode(\"Best javascript books?\"),\n", + " \"k\": 10,\n", + " \"num_candidates\": 100\n", + " }\n", + "})\n", + "\n", + "pretty_response(response)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Filtering\n", + "\n", + "Filter context is mostly used for filtering structured data, for example:\n", + "\n", + "Does this timestamp fall into the range 2015 to 2016?\n", + "Is the status field set to \"published\"?\n", + "Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in the bool query.\n", + "\n", + "[Read more](https://)" + ], + "metadata": { + "id": "LdJCpbQMeml5" + }, + "id": "LdJCpbQMeml5" + }, + { + "cell_type": "markdown", + "source": [ + "### Example: Keyword Filtering\n", + "\n", + "This is an example of adding a keyword filter to the query.\n", + "\n", + "It narrows down the results by including only documents where the \"publisher\" field is equal to \"addison-wesley\".\n", + "\n", + "Overall, the code retrieves the top books that are similar to \"Best javascript books?\" based on their title vectors and have \"addison-wesley\" as the publisher." 
+ ], + "metadata": { + "id": "dRSrPMyFf7w7" + }, + "id": "dRSrPMyFf7w7" + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", body={\n", + " \"knn\": {\n", + " \"field\": \"title_vector\",\n", + " \"query_vector\": model.encode(\"Best javascript books?\"),\n", + " \"k\": 10,\n", + " \"num_candidates\": 100,\n", + " \"filter\": {\n", + " \"term\": {\n", + " \"publisher\": \"addison-wesley\"\n", + " }\n", + " }\n", + " }\n", + "})\n", + "\n", + "pretty_response(response)" + ], + "metadata": { + "id": "WoE0yTchfj3A" + }, + "id": "WoE0yTchfj3A", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Example: Advanced Filtering\n", + "\n", + "TODO: Help the developer understand more about the different types of filtering they can do.\n", + "\n", + "Provide a link to show more advanced use cases of filtering on date-range, geo-location etc." + ], + "metadata": { + "id": "YY2SrWDtgnF3" + }, + "id": "YY2SrWDtgnF3" + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", body={\n", + " \"knn\": {\n", + " \"field\": \"title_vector\",\n", + " \"query_vector\": model.encode(\"Best javascript books?\"),\n", + " \"k\": 10,\n", + " \"num_candidates\": 100,\n", + " \"filter\": {\n", + " \"bool\": {\n", + " \"should\": [\n", + " {\n", + " \"term\": {\n", + " \"publisher\": \"addison-wesley\"\n", + " }\n", + " },\n", + " {\n", + " \"term\": {\n", + " \"authors\": \"robert c. 
martin\"\n", + " }\n", + " }\n", + " ],\n", + "\n", + " }\n", + " }\n", + " }\n", + "})\n", + "\n", + "pretty_response(response)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "fcDfiJC9g6AX", + "outputId": "0909515a-5b94-4863-94c6-e67015baeadf" + }, + "id": "fcDfiJC9g6AX", + "execution_count": 43, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "ID: MOlWP4kB-GB5Evg6zHVx\n", + "Publication date: 2019-10-29\n", + "Title: The Pragmatic Programmer: Your Journey to Mastery\n", + "Summary: A guide to pragmatic programming for software engineers and developers\n", + "Publisher: addison-wesley\n", + "Reviews: 30\n", + "Authors: ['andrew hunt', 'david thomas']\n", + "Score: 0.61159486\n", + "\n", + "ID: N-lWP4kB-GB5Evg6zHVx\n", + "Publication date: 2011-05-13\n", + "Title: The Clean Coder: A Code of Conduct for Professional Programmers\n", + "Summary: A guide to professional conduct in the field of software engineering\n", + "Publisher: prentice hall\n", + "Reviews: 20\n", + "Authors: ['robert c. martin']\n", + "Score: 0.57042736\n", + "\n", + "ID: NulWP4kB-GB5Evg6zHVx\n", + "Publication date: 1994-10-31\n", + "Title: Design Patterns: Elements of Reusable Object-Oriented Software\n", + "Summary: Guide to design patterns that can be used in any object-oriented language\n", + "Publisher: addison-wesley\n", + "Reviews: 45\n", + "Authors: ['erich gamma', 'richard helm', 'ralph johnson', 'john vlissides']\n", + "Score: 0.56175697\n", + "\n", + "ID: M-lWP4kB-GB5Evg6zHVx\n", + "Publication date: 2008-08-11\n", + "Title: Clean Code: A Handbook of Agile Software Craftsmanship\n", + "Summary: A guide to writing code that is easy to read, understand and maintain\n", + "Publisher: prentice hall\n", + "Reviews: 55\n", + "Authors: ['robert c. 
martin']\n", + "Score: 0.55407417\n" + ] + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + ":1: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. Instead use individual parameters.\n", + " response = client.search(index=\"book_index\", body={\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Hybrid Search\n", + "\n", + "In this example, we are investigating the combination of two search algorithms: BM25 for text search and HNSW for nearest neighbor search. By combining multiple ranking methods, such as BM25 and an ML model that generates dense vector embeddings, we can achieve the best ranking results. This approach allows us to leverage the strengths of each algorithm and improve the overall search performance.\n", + "\n", + "TODO: Explain why we use RRF here\n" + ], + "metadata": { + "id": "IUMOK8h-iYrq" + }, + "id": "IUMOK8h-iYrq" + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", body={\n", + " \"query\": {\n", + " \"match\": {\n", + " \"summary\": \"python\"\n", + " }\n", + " },\n", + " \"knn\": {\n", + " \"field\": \"title_vector\",\n", + " # generate embedding for query so it can be compared to `title_vector`\n", + " \"query_vector\" : model.encode(\"python programming\").tolist(),\n", + " \"k\": 5,\n", + " \"num_candidates\": 10\n", + " },\n", + " \"rank\": {\n", + " \"rrf\": {\n", + " \"window_size\": 100,\n", + " \"rank_constant\": 20\n", + " }\n", + " }\n", + "})\n", + "\n", + "pretty_response(response)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1BwZ-yjli7xA", + "outputId": "26eea86c-5cda-42d0-ba1e-2904e2b7865a" + }, + "id": "1BwZ-yjli7xA", + "execution_count": 51, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "ID: MelWP4kB-GB5Evg6zHVx\n", + "Publication date: 2019-05-03\n", + "Title: Python Crash Course\n", + "Summary: A fast-paced, 
no-nonsense guide to programming in Python\n", + "Publisher: no starch press\n", + "Reviews: 42\n", + "Authors: ['eric matthes']\n", + "Score: None\n", + "\n", + "ID: MOlWP4kB-GB5Evg6zHVx\n", + "Publication date: 2019-10-29\n", + "Title: The Pragmatic Programmer: Your Journey to Mastery\n", + "Summary: A guide to pragmatic programming for software engineers and developers\n", + "Publisher: addison-wesley\n", + "Reviews: 30\n", + "Authors: ['andrew hunt', 'david thomas']\n", + "Score: None\n", + "\n", + "ID: OelWP4kB-GB5Evg6zHVx\n", + "Publication date: 2012-06-27\n", + "Title: Introduction to the Theory of Computation\n", + "Summary: Introduction to the theory of computation and complexity theory\n", + "Publisher: cengage learning\n", + "Reviews: 33\n", + "Authors: ['michael sipser']\n", + "Score: None\n", + "\n", + "ID: N-lWP4kB-GB5Evg6zHVx\n", + "Publication date: 2011-05-13\n", + "Title: The Clean Coder: A Code of Conduct for Professional Programmers\n", + "Summary: A guide to professional conduct in the field of software engineering\n", + "Publisher: prentice hall\n", + "Reviews: 20\n", + "Authors: ['robert c. martin']\n", + "Score: None\n", + "\n", + "ID: MulWP4kB-GB5Evg6zHVx\n", + "Publication date: 2020-04-06\n", + "Title: Artificial Intelligence: A Modern Approach\n", + "Summary: Comprehensive introduction to the theory and practice of artificial intelligence\n", + "Publisher: pearson\n", + "Reviews: 39\n", + "Authors: ['stuart russell', 'peter norvig']\n", + "Score: None\n" + ] + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + ":1: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. 
Instead use individual parameters.\n", + " response = client.search(index=\"book_index\", body={\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "client.indices.delete(index=\"book_index\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_OAahfg-tqrf", + "outputId": "d8f81ba4-cdc9-4e30-edf7-6d5bb16920eb" + }, + "id": "_OAahfg-tqrf", + "execution_count": 52, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "ObjectApiResponse({'acknowledged': True})" + ] + }, + "metadata": {}, + "execution_count": 52 + } + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/colab-notebooks-examples/search/01-keyword-querying-filtering.ipynb b/colab-notebooks-examples/search/01-keyword-querying-filtering.ipynb new file mode 100644 index 00000000..deb44e47 --- /dev/null +++ b/colab-notebooks-examples/search/01-keyword-querying-filtering.ipynb @@ -0,0 +1,1721 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cSj1acY8uPR2" + }, + "outputs": [], + "source": [ + "#@title Prepare elasticsearch client { display-mode: \"form\" }\n", + "!pip install elasticsearch\n", + "from elasticsearch import Elasticsearch\n", + "import pandas as pd\n", + "from google.colab import data_table\n", + "\n", 
+ "data_table.enable_dataframe_formatter()\n", + "# Create the client instance\n", + "cloud_id = '\\u003Ccloud_id>' #@param {type:\"string\"}\n", + "elastic_username = 'elastic' #@param {type:\"string\"}\n", + "elastic_password = '\\u003Cpassword>' #@param {type: \"string\"}\n", + "client = Elasticsearch(\n", + "    cloud_id=cloud_id,\n", + "    basic_auth=(elastic_username, elastic_password)\n", + ")\n", + "\n", + "# Flatten the search hits into a list of records for display as a DataFrame\n", + "def pretty_response_transform(response):\n", + "    result = []\n", + "    for hit in response['hits']['hits']:\n", + "        result.append({\n", + "            'id' : hit['_id'],\n", + "            'publication_date' : hit['_source']['publish_date'],\n", + "            'score' : hit['_score'],\n", + "            'title' : hit['_source']['title'],\n", + "            'summary' : hit['_source']['summary']\n", + "        })\n", + "    return result" + ] + }, + { + "cell_type": "markdown", + "source": [ + "[Quick Start](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/00-quick-start.ipynb) || **Keyword Querying Filtering** || [Hybrid search with RRF](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb) || [ELSER](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/03-ELSER.ipynb)" + ], + "metadata": { + "id": "0wgbLWl2udLQ" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Keyword querying and filtering\n", + "\n", + "This interactive notebook introduces basic Elasticsearch queries, using the official Elasticsearch Python client. Before starting this section, we recommend working through our [quick start](https://github.com/yansavitski/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/00-quick-start.ipynb)." 
+ ], + "metadata": { + "id": "83LdOUCwwHzs" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Querying\n", + "In the query context, a query clause answers the question _“How well does this document match this query clause?”_. In addition to deciding whether or not the document matches, the query clause also calculates a relevance score in the `_score` metadata field.\n", + "\n", + "### Full text queries\n", + "\n", + "Full text queries enable you to search analyzed text fields such as the body of an email. The query string is processed using the same analyzer that was applied to the field during indexing.\n", + "\n", + "* **match**.\n", + " The standard query for performing full text queries, including fuzzy matching and phrase or proximity queries.\n", + "* **multi-match**.\n", + " The multi-field version of the match query." + ], + "metadata": { + "id": "22onltbgxxGm" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Match query\n", + "Returns documents that `match` a provided text, number, date or boolean value. 
The provided text is analyzed before matching.\n", + "\n", + "The `match` query is the standard query for performing a full-text search, including options for fuzzy matching.\n", + "\n", + "[Read more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#match-query-ex-request).\n", + "\n" + ], + "metadata": { + "id": "clXQwoFQ6x61" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"match\": {\n", + " \"summary\": {\n", + " \"query\": \"guide\"\n", + " }\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + }, + "id": "q_OE0XVx6_qX", + "outputId": "6a1d7760-5fb9-4809-e060-e35a398ed3c4" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score \\\n", + "0 3cXgIYkBfxlbyhU5Krfc 2019-10-29 0.704228 \n", + "1 3sXgIYkBfxlbyhU5Krfc 2019-05-03 0.704228 \n", + "2 5MXgIYkBfxlbyhU5Krfd 2011-05-13 0.677165 \n", + "3 4MXgIYkBfxlbyhU5Krfc 2008-08-11 0.628835 \n", + "4 48XgIYkBfxlbyhU5Krfd 1994-10-31 0.628835 \n", + "\n", + " title \\\n", + "0 The Pragmatic Programmer: Your Journey to Mastery \n", + "1 Python Crash Course \n", + "2 The Clean Coder: A Code of Conduct for Profess... \n", + "3 Clean Code: A Handbook of Agile Software Craft... \n", + "4 Design Patterns: Elements of Reusable Object-O... \n", + "\n", + " summary \n", + "0 A guide to pragmatic programming for software ... \n", + "1 A fast-paced, no-nonsense guide to programming... \n", + "2 A guide to professional conduct in the field o... \n", + "3 A guide to writing code that is easy to read, ... \n", + "4 Guide to design patterns that can be used in a... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"3cXgIYkBfxlbyhU5Krfc\",\n\"2019-10-29\",\n{\n 'v': 0.7042277,\n 'f': \"0.7042277\",\n },\n\"The Pragmatic Programmer: Your Journey to Mastery\",\n\"A guide to pragmatic programming for software engineers and developers\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"3sXgIYkBfxlbyhU5Krfc\",\n\"2019-05-03\",\n{\n 'v': 0.7042277,\n 'f': \"0.7042277\",\n },\n\"Python Crash Course\",\n\"A fast-paced, no-nonsense guide to programming in Python\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"5MXgIYkBfxlbyhU5Krfd\",\n\"2011-05-13\",\n{\n 'v': 0.6771651,\n 'f': \"0.6771651\",\n },\n\"The Clean Coder: A Code of Conduct for Professional Programmers\",\n\"A guide to professional conduct in the field of software engineering\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"4MXgIYkBfxlbyhU5Krfc\",\n\"2008-08-11\",\n{\n 'v': 0.62883455,\n 'f': \"0.62883455\",\n },\n\"Clean Code: A Handbook of Agile Software Craftsmanship\",\n\"A guide to writing code that is easy to read, understand and maintain\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"48XgIYkBfxlbyhU5Krfd\",\n\"1994-10-31\",\n{\n 'v': 0.62883455,\n 'f': \"0.62883455\",\n },\n\"Design Patterns: Elements of Reusable Object-Oriented Software\",\n\"Guide to design patterns that can be used in any object-oriented language\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let 
quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-fe653608-986c-4ebe-b427-0ec37ccb087b button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Multi-match query\n", + "\n", + "The `multi_match` query builds on the match query to allow multi-field queries.\n", + "\n", + "[Read more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html)." + ], + "metadata": { + "id": "H-n6hoVsfAqc" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"multi_match\": {\n", + " \"query\": \"javascript\",\n", + " \"fields\": [\"summary\", \"title\"]\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 154 + }, + "id": "TRmGYM94gCtb", + "outputId": "dc58b19f-e585-4d0a-d065-ac3fc18ae123" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score \\\n", + "0 4sXgIYkBfxlbyhU5Krfc 2018-12-04 2.030753 \n", + "1 5cXgIYkBfxlbyhU5Krfd 2008-05-15 1.706409 \n", + "2 4cXgIYkBfxlbyhU5Krfc 2015-03-27 1.636058 \n", + "\n", + " title \\\n", + "0 Eloquent JavaScript \n", + "1 JavaScript: The Good Parts \n", + "2 You Don't Know JS: Up & Going \n", + "\n", + " summary \n", + "0 A modern introduction to programming \n", + "1 A deep dive into the parts of JavaScript that ... \n", + "2 Introduction to JavaScript and programming as ... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"4sXgIYkBfxlbyhU5Krfc\",\n\"2018-12-04\",\n{\n 'v': 2.0307527,\n 'f': \"2.0307527\",\n },\n\"Eloquent JavaScript\",\n\"A modern introduction to programming\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"5cXgIYkBfxlbyhU5Krfd\",\n\"2008-05-15\",\n{\n 'v': 1.7064086,\n 'f': \"1.7064086\",\n },\n\"JavaScript: The Good Parts\",\n\"A deep dive into the parts of JavaScript that are essential to writing maintainable code\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"4cXgIYkBfxlbyhU5Krfc\",\n\"2015-03-27\",\n{\n 'v': 1.6360576,\n 'f': \"1.6360576\",\n },\n\"You Don't Know JS: Up & Going\",\n\"Introduction to JavaScript and programming as a whole\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-d42669b9-d6a4-4402-a5be-2092a0f1ebbd button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 50 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Individual fields can be boosted with the caret (^) notation." + ], + "metadata": { + "id": "FnBeBIVKiPnS" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"multi_match\": {\n", + " \"query\": \"javascript\",\n", + " \"fields\": [\"summary\", \"title^3\"]\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 154 + }, + "id": "_aI7hnH0ixkG", + "outputId": "2af27f3d-f9fd-4c7a-cab5-7cb06132582c" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score \\\n", + "0 4sXgIYkBfxlbyhU5Krfc 2018-12-04 6.092258 \n", + "1 5cXgIYkBfxlbyhU5Krfd 2008-05-15 5.119226 \n", + "2 4cXgIYkBfxlbyhU5Krfc 2015-03-27 1.636058 \n", + "\n", + " title \\\n", + "0 Eloquent JavaScript \n", + "1 JavaScript: The Good Parts \n", + "2 You Don't Know JS: Up & Going \n", + "\n", + " summary \n", + "0 A modern introduction to programming \n", + "1 A deep dive into the parts of JavaScript that ... \n", + "2 Introduction to JavaScript and programming as ... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"4sXgIYkBfxlbyhU5Krfc\",\n\"2018-12-04\",\n{\n 'v': 6.0922585,\n 'f': \"6.0922585\",\n },\n\"Eloquent JavaScript\",\n\"A modern introduction to programming\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"5cXgIYkBfxlbyhU5Krfd\",\n\"2008-05-15\",\n{\n 'v': 5.1192265,\n 'f': \"5.1192265\",\n },\n\"JavaScript: The Good Parts\",\n\"A deep dive into the parts of JavaScript that are essential to writing maintainable code\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"4cXgIYkBfxlbyhU5Krfc\",\n\"2015-03-27\",\n{\n 'v': 1.6360576,\n 'f': \"1.6360576\",\n },\n\"You Don't Know JS: Up & Going\",\n\"Introduction to JavaScript and programming as a whole\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-85cba031-2c1b-41fe-95b1-be7944cc54f1 button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 49 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Prefix search\n", + "\n", + "Returns documents that contain a specific prefix in a provided field.\n", + "\n", + "[Read more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html)" + ], + "metadata": { + "id": "yXipv0xSk-nK" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"prefix\": {\n", + " \"title\": {\n", + " \"value\": 'java'\n", + " }\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 133 + }, + "id": "dCr1pwlqlOE7", + "outputId": "ae55cd66-0ded-4868-dac5-5815ea317c44" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score title \\\n", + "0 4sXgIYkBfxlbyhU5Krfc 2018-12-04 1.0 Eloquent JavaScript \n", + "1 5cXgIYkBfxlbyhU5Krfd 2008-05-15 1.0 JavaScript: The Good Parts \n", + "\n", + " summary \n", + "0 A modern introduction to programming \n", + "1 A deep dive into the parts of JavaScript that ... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"4sXgIYkBfxlbyhU5Krfc\",\n\"2018-12-04\",\n{\n 'v': 1.0,\n 'f': \"1.0\",\n },\n\"Eloquent JavaScript\",\n\"A modern introduction to programming\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"5cXgIYkBfxlbyhU5Krfd\",\n\"2008-05-15\",\n{\n 'v': 1.0,\n 'f': \"1.0\",\n },\n\"JavaScript: The Good Parts\",\n\"A deep dive into the parts of JavaScript that are essential to writing maintainable code\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-14c8ac74-36d1-4dbe-b3a6-c23015d99d02 button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 48 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Fuzzy search\n", + "\n", + "Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance.\n", + "\n", + "An edit distance is the number of one-character changes needed to turn one term into another. These changes can include:\n", + "\n", + "* Changing a character (box → fox)\n", + "* Removing a character (black → lack)\n", + "* Inserting a character (sic → sick)\n", + "* Transposing two adjacent characters (act → cat)\n", + "\n", + "[Read more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html)\n", + "\n" + ], + "metadata": { + "id": "a606YcCmmLHW" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"fuzzy\": {\n", + " \"title\": {\n", + " \"value\": 'pyvascript'\n", + " }\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 133 + }, + "id": "dTMc-IxPmbtC", + "outputId": "9acf74fd-bc16-45df-80f3-49504860b10a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score \\\n", + "0 4sXgIYkBfxlbyhU5Krfc 2018-12-04 1.624602 \n", + "1 5cXgIYkBfxlbyhU5Krfd 2008-05-15 1.365127 \n", + "\n", + " title \\\n", + "0 Eloquent JavaScript \n", + "1 JavaScript: The Good Parts \n", + "\n", + " summary \n", + "0 A 
modern introduction to programming \n", + "1 A deep dive into the parts of JavaScript that ... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"4sXgIYkBfxlbyhU5Krfc\",\n\"2018-12-04\",\n{\n 'v': 1.6246022,\n 'f': \"1.6246022\",\n },\n\"Eloquent JavaScript\",\n\"A modern introduction to programming\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"5cXgIYkBfxlbyhU5Krfd\",\n\"2008-05-15\",\n{\n 'v': 1.3651271,\n 'f': \"1.3651271\",\n },\n\"JavaScript: The Good Parts\",\n\"A deep dive into the parts of JavaScript that are essential to writing maintainable code\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-39be6dc9-f7bc-45e8-811b-98a2b550ecf2 button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 47 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Filtering\n", + "\n", + "In a filter context, a query clause answers the question *“Does this document match this query clause?”* The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, for example:\n", + "* Does this `timestamp` fall into the range 2015 to 2016?\n", + "* Is the `status` field set to `\"published\"`?\n", + "\n", + "Filter context is in effect whenever a query clause is passed to a `filter` parameter, such as the `filter` or `must_not` parameters in the `bool` query.\n", + "\n", + "[Read more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html)" + ], + "metadata": { + "id": "PG9TYqL-8H29" + } + }, + { + "cell_type": "markdown", + "source": [ + "### **bool.must**\n", + "The clause (query) must appear in matching documents and will contribute to the score." 
+ ], + "metadata": { + "id": "7do0lmxA_v25" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"bool\": {\n", + " \"must\": [{\n", + " \"term\": {\n", + " \"summary\": \"guide\"\n", + " }\n", + " }, {\n", + " \"term\": {\n", + " \"summary\": \"code\"\n", + " }\n", + " }]\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 111 + }, + "id": "8_C-JHRQFDl7", + "outputId": "be59d18b-5e20-4db0-8697-2e7746251742" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score \\\n", + "0 4MXgIYkBfxlbyhU5Krfc 2008-08-11 1.97297 \n", + "\n", + " title \\\n", + "0 Clean Code: A Handbook of Agile Software Craft... \n", + "\n", + " summary \n", + "0 A guide to writing code that is easy to read, ... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"4MXgIYkBfxlbyhU5Krfc\",\n\"2008-08-11\",\n{\n 'v': 1.9729705,\n 'f': \"1.9729705\",\n },\n\"Clean Code: A Handbook of Agile Software Craftsmanship\",\n\"A guide to writing code that is easy to read, understand and maintain\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-3d2b4397-5161-4160-a602-92c29337f1d1 button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 46 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### **bool.should**\n", + "\n", + "The clause (query) should appear in the matching document." + ], + "metadata": { + "id": "eNlncytRIl9h" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"bool\": {\n", + " \"should\": [{\n", + " \"term\": {\n", + " \"summary\": \"guide\"\n", + " }\n", + " }, {\n", + " \"term\": {\n", + " \"summary\": \"code\"\n", + " }\n", + " }]\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 219 + }, + "id": "GRm9T1vfIsmF", + "outputId": "d9fb6936-3ffb-4fff-9467-1f7ac7b41490" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score \\\n", + "0 4MXgIYkBfxlbyhU5Krfc 2008-08-11 1.972970 \n", + "1 5cXgIYkBfxlbyhU5Krfd 2008-05-15 1.254593 \n", + "2 3cXgIYkBfxlbyhU5Krfc 2019-10-29 0.704228 \n", + "3 3sXgIYkBfxlbyhU5Krfc 2019-05-03 0.704228 \n", + "4 5MXgIYkBfxlbyhU5Krfd 2011-05-13 0.677165 \n", + "5 48XgIYkBfxlbyhU5Krfd 1994-10-31 0.628835 \n", + "\n", + " title \\\n", + "0 Clean Code: A Handbook of Agile Software Craft... \n", + "1 JavaScript: The Good Parts \n", + "2 The Pragmatic Programmer: Your Journey to Mastery \n", + "3 Python Crash Course \n", + "4 The Clean Coder: A Code of Conduct for Profess... \n", + "5 Design Patterns: Elements of Reusable Object-O... 
\n", + "\n", + " summary \n", + "0 A guide to writing code that is easy to read, ... \n", + "1 A deep dive into the parts of JavaScript that ... \n", + "2 A guide to pragmatic programming for software ... \n", + "3 A fast-paced, no-nonsense guide to programming... \n", + "4 A guide to professional conduct in the field o... \n", + "5 Guide to design patterns that can be used in a... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"4MXgIYkBfxlbyhU5Krfc\",\n\"2008-08-11\",\n{\n 'v': 1.9729705,\n 'f': \"1.9729705\",\n },\n\"Clean Code: A Handbook of Agile Software Craftsmanship\",\n\"A guide to writing code that is easy to read, understand and maintain\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"5cXgIYkBfxlbyhU5Krfd\",\n\"2008-05-15\",\n{\n 'v': 1.2545931,\n 'f': \"1.2545931\",\n },\n\"JavaScript: The Good Parts\",\n\"A deep dive into the parts of JavaScript that are essential to writing maintainable code\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"3cXgIYkBfxlbyhU5Krfc\",\n\"2019-10-29\",\n{\n 'v': 0.7042277,\n 'f': \"0.7042277\",\n },\n\"The Pragmatic Programmer: Your Journey to Mastery\",\n\"A guide to pragmatic programming for software engineers and developers\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"3sXgIYkBfxlbyhU5Krfc\",\n\"2019-05-03\",\n{\n 'v': 0.7042277,\n 'f': \"0.7042277\",\n },\n\"Python Crash Course\",\n\"A fast-paced, no-nonsense guide to programming in Python\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"5MXgIYkBfxlbyhU5Krfd\",\n\"2011-05-13\",\n{\n 'v': 0.6771651,\n 'f': \"0.6771651\",\n },\n\"The Clean Coder: A Code of Conduct for Professional Programmers\",\n\"A guide to professional conduct in the field of software engineering\"],\n [{\n 'v': 5,\n 'f': \"5\",\n },\n\"48XgIYkBfxlbyhU5Krfd\",\n\"1994-10-31\",\n{\n 'v': 0.62883455,\n 'f': \"0.62883455\",\n },\n\"Design Patterns: Elements of Reusable Object-Oriented Software\",\n\"Guide to design patterns that can be used in any object-oriented language\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": 
\"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-7b970c43-3212-4654-a3ff-1fbcd54e4493 button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 44 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### **bool.filter**\n", + "\n", + "The clause (query) must appear in matching documents. **However unlike `must` the `score` of the query will be ignored.** Filter clauses are executed in filter context, meaning that scoring is ignored and clauses are considered for caching." + ], + "metadata": { + "id": "PGTFXUIkJG4t" + } + }, + { + "cell_type": "code", + "source": [ + "response = client.search(index=\"book_index\", query={\n", + " \"bool\": {\n", + " \"filter\": [{\n", + " \"term\": {\n", + " \"summary\": \"guide\"\n", + " }\n", + " }]\n", + " }\n", + " })\n", + "\n", + "pd.DataFrame.from_records(pretty_response_transform(response))" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + }, + "id": "6RH0OALLJPHv", + "outputId": "338419b0-3e60-4ac9-ddeb-67cac6202ca2" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id publication_date score \\\n", + "0 3cXgIYkBfxlbyhU5Krfc 2019-10-29 0.0 \n", + "1 3sXgIYkBfxlbyhU5Krfc 2019-05-03 0.0 \n", + "2 4MXgIYkBfxlbyhU5Krfc 2008-08-11 0.0 \n", + "3 48XgIYkBfxlbyhU5Krfd 1994-10-31 0.0 \n", + "4 5MXgIYkBfxlbyhU5Krfd 2011-05-13 0.0 \n", + "\n", + " title \\\n", + "0 The Pragmatic Programmer: Your Journey to Mastery \n", + "1 Python Crash Course \n", + "2 Clean Code: A Handbook of Agile Software Craft... \n", + "3 Design Patterns: Elements of Reusable Object-O... \n", + "4 The Clean Coder: A Code of Conduct for Profess... 
\n", + "\n", + " summary \n", + "0 A guide to pragmatic programming for software ... \n", + "1 A fast-paced, no-nonsense guide to programming... \n", + "2 A guide to writing code that is easy to read, ... \n", + "3 Guide to design patterns that can be used in a... \n", + "4 A guide to professional conduct in the field o... " + ], + "text/html": [ + "\n", + "
\n", + " " + ], + "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"3cXgIYkBfxlbyhU5Krfc\",\n\"2019-10-29\",\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"The Pragmatic Programmer: Your Journey to Mastery\",\n\"A guide to pragmatic programming for software engineers and developers\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"3sXgIYkBfxlbyhU5Krfc\",\n\"2019-05-03\",\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"Python Crash Course\",\n\"A fast-paced, no-nonsense guide to programming in Python\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"4MXgIYkBfxlbyhU5Krfc\",\n\"2008-08-11\",\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"Clean Code: A Handbook of Agile Software Craftsmanship\",\n\"A guide to writing code that is easy to read, understand and maintain\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"48XgIYkBfxlbyhU5Krfd\",\n\"1994-10-31\",\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"Design Patterns: Elements of Reusable Object-Oriented Software\",\n\"Guide to design patterns that can be used in any object-oriented language\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"5MXgIYkBfxlbyhU5Krfd\",\n\"2011-05-13\",\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"The Clean Coder: A Code of Conduct for Professional Programmers\",\n\"A guide to professional conduct in the field of software engineering\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"id\"], [\"string\", \"publication_date\"], [\"number\", \"score\"], [\"string\", \"title\"], [\"string\", \"summary\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n \n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n 
quickchartButtonContainerElement.innerHTML = `\n
\n \n
\n\n\n\n \n`;\n parentElement.appendChild(quickchartButtonContainerElement);\n \nfunction displayQuickchartButton(domScope) {\n let quickchartButtonEl =\n domScope.querySelector('#df-7984fec6-fcda-46b7-9b9e-3e31a15d21ad button.colab-df-quickchart');\n quickchartButtonEl.style.display =\n google.colab.kernel.accessAllowed ? 'block' : 'none';\n}\n\n displayQuickchartButton(parentElement);\n }\n " + }, + "metadata": {}, + "execution_count": 45 + } + ] + } + ] +} \ No newline at end of file diff --git a/colab-notebooks-examples/search/02-hybrid-search.ipynb b/colab-notebooks-examples/search/02-hybrid-search.ipynb new file mode 100644 index 00000000..ba55bdcf --- /dev/null +++ b/colab-notebooks-examples/search/02-hybrid-search.ipynb @@ -0,0 +1,794 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "s49gpkvZ7q53" + }, + "source": [ + "# Hybrid Search using RRF\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/leemthompo/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb)\n", + "\n", + "In this example we'll use the reciprocal rank fusion algorithm to combine the results of BM25 and kNN semantic search.\n", + "We'll use the same dataset we used in our [quickstart](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/00-quick-start.ipynb) guide.\n", + "You can use RRF for hybrid search out of the box, without any additional configuration.\n", + "\n", + "We also provide a walkthrough of a toy example, which demonstrates how RRF ranking works at a basic level." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Y01AXpELkygt" + }, + "source": [ + "# 🧰 Requirements\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.6 or later\n", + "- An Elastic deployment with minimum **4GB machine learning node**\n", + " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n", + "- The [ELSER](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html) model installed on your Elastic deployment\n", + "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "N4pI1-eIvWrI" + }, + "source": [ + "# Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Under **Advanced settings**, go to **Machine Learning instances**\n", + " - You'll need at least **4GB** RAM per zone for this tutorial\n", + " - Select **Create deployment**" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gaTFHLJC-Mgi" + }, + "source": [ + "# Install packages and initialize the Elasticsearch Python client\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to `pip` install the packages we need for this example." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K9Q1p2C9-wce", + "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" + }, + "outputs": [], + "source": [ + "import sys  # needed so that {sys.executable} below resolves to this kernel's Python\n", + "!git clone https://github.com/elastic/elasticsearch-py.git\n", + "%cd elasticsearch-py\n", + "!git checkout v8.8.2\n", + "!{sys.executable} -m pip install .\n", + "!pip install sentence_transformers\n", + "!pip install torch\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gEzq2Z1wBs3M" + }, + "source": [ + "Next we need to import the `elasticsearch` module and the `getpass` module, and load the sentence transformer model we'll use to embed text.\n", + "`getpass` is part of the Python standard library and is used to securely prompt for credentials." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uP_GTVRi-d96" + }, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "from urllib.request import urlopen\n", + "import getpass\n", + "from sentence_transformers import SentenceTransformer\n", + "import torch\n", + "\n", + "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n", + "\n", + "model = SentenceTransformer('all-MiniLM-L6-v2', device=device)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "AMSePFiZCRqX" + }, + "source": [ + "Now we can instantiate the Python Elasticsearch client.\n", + "First we prompt the user for their password and Cloud ID.\n", + "\n", + "🔐 NOTE: `getpass` enables us to prompt the user for credentials without echoing them to the terminal.\n", + "\n", + "Then we create a `client` object, which is an instance of the `Elasticsearch` class." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "h0MdAZ53CdKL", + "outputId": "96ea6f81-f935-4d51-c4a7-af5a896180f1" + }, + "outputs": [], + "source": [ + "# Found in the 'Manage Deployment' page\n", + "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n", + "\n", + "# Password for the 'elastic' user generated by Elasticsearch\n", + "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n", + "\n", + "# Create the client instance\n", + "client = Elasticsearch(\n", + " cloud_id=CLOUD_ID,\n", + " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "bRHbecNeEDL3" + }, + "source": [ + "Confirm that the client has connected with this test" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rdiUKqZbEKfF", + "outputId": "43b6f1cd-a43e-4dbe-caa5-7fd170464881" + }, + "outputs": [], + "source": [ + "print(client.info())" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "enHQuT57DhD1" + }, + "source": [ + "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n", + "\n", + "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect using API keys.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "TF_wxIAhD07a" + }, + "source": [ + "# Create Elasticsearch index with required mappings\n", + "\n", + "We need to add a field to support dense vector storage and search.\n", + "Note the `title_vector` field below, which is used to store the dense vector representation of the `title` field." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cvYECABJJs_2", + "outputId": "18fb51e4-c4f6-4d1b-cb2d-bc6f8ec1aa84" + }, + "outputs": [], + "source": [ + "# Define the mapping\n", + "mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"title\": {\"type\": \"text\"},\n", + " \"authors\": {\"type\": \"keyword\"},\n", + " \"summary\": {\"type\": \"text\"},\n", + " \"publish_date\": {\"type\": \"date\"},\n", + " \"num_reviews\": {\"type\": \"integer\"},\n", + " \"publisher\": {\"type\": \"keyword\"},\n", + " \"title_vector\": { \n", + " \"type\": \"dense_vector\", \n", + " \"dims\": 384, \n", + " \"index\": \"true\", \n", + " \"similarity\": \"dot_product\" \n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "# Create the index\n", + "client.indices.create(index='rrf_book_index', body=mapping)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset\n", + "\n", + "Let's index some data.\n", + "Note that we are embedding the `title` field using the sentence transformer model.\n", + "Once indexed, you'll see that your documents contain a `title_vector` field (`\"type\": \"dense_vector\"`) which contains a vector of floating point values.\n", + "This is the embedding of the `title` field in vector space.\n", + "We'll use this field to perform semantic search using kNN." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "books = [\n", + " {\n", + " \"title\": \"The Pragmatic Programmer: Your Journey to Mastery\",\n", + " \"authors\": [\"andrew hunt\", \"david thomas\"],\n", + " \"summary\": \"A guide to pragmatic programming for software engineers and developers\",\n", + " \"publish_date\": \"2019-10-29\",\n", + " \"num_reviews\": 30,\n", + " \"publisher\": \"addison-wesley\"\n", + " },\n", + " {\n", + " \"title\": \"Python Crash Course\",\n", + " \"authors\": [\"eric matthes\"],\n", + " \"summary\": \"A fast-paced, no-nonsense guide to programming in Python\",\n", + " \"publish_date\": \"2019-05-03\",\n", + " \"num_reviews\": 42,\n", + " \"publisher\": \"no starch press\"\n", + " },\n", + " {\n", + " \"title\": \"Artificial Intelligence: A Modern Approach\",\n", + " \"authors\": [\"stuart russell\", \"peter norvig\"],\n", + " \"summary\": \"Comprehensive introduction to the theory and practice of artificial intelligence\",\n", + " \"publish_date\": \"2020-04-06\",\n", + " \"num_reviews\": 39,\n", + " \"publisher\": \"pearson\"\n", + " },\n", + " {\n", + " \"title\": \"Clean Code: A Handbook of Agile Software Craftsmanship\",\n", + " \"authors\": [\"robert c. 
martin\"],\n", + " \"summary\": \"A guide to writing code that is easy to read, understand and maintain\",\n", + " \"publish_date\": \"2008-08-11\",\n", + " \"num_reviews\": 55,\n", + " \"publisher\": \"prentice hall\"\n", + " },\n", + " {\n", + " \"title\": \"You Don't Know JS: Up & Going\",\n", + " \"authors\": [\"kyle simpson\"],\n", + " \"summary\": \"Introduction to JavaScript and programming as a whole\",\n", + " \"publish_date\": \"2015-03-27\",\n", + " \"num_reviews\": 36,\n", + " \"publisher\": \"oreilly\"\n", + " },\n", + " {\n", + " \"title\": \"Eloquent JavaScript\",\n", + " \"authors\": [\"marijn haverbeke\"],\n", + " \"summary\": \"A modern introduction to programming\",\n", + " \"publish_date\": \"2018-12-04\",\n", + " \"num_reviews\": 38,\n", + " \"publisher\": \"no starch press\"\n", + " },\n", + " {\n", + " \"title\": \"Design Patterns: Elements of Reusable Object-Oriented Software\",\n", + " \"authors\": [\"erich gamma\", \"richard helm\", \"ralph johnson\", \"john vlissides\"],\n", + " \"summary\": \"Guide to design patterns that can be used in any object-oriented language\",\n", + " \"publish_date\": \"1994-10-31\",\n", + " \"num_reviews\": 45,\n", + " \"publisher\": \"addison-wesley\"\n", + " },\n", + " {\n", + " \"title\": \"The Clean Coder: A Code of Conduct for Professional Programmers\",\n", + " \"authors\": [\"robert c. 
martin\"],\n", + " \"summary\": \"A guide to professional conduct in the field of software engineering\",\n", + " \"publish_date\": \"2011-05-13\",\n", + " \"num_reviews\": 20,\n", + " \"publisher\": \"prentice hall\"\n", + " },\n", + " {\n", + " \"title\": \"JavaScript: The Good Parts\",\n", + " \"authors\": [\"douglas crockford\"],\n", + " \"summary\": \"A deep dive into the parts of JavaScript that are essential to writing maintainable code\",\n", + " \"publish_date\": \"2008-05-15\",\n", + " \"num_reviews\": 51,\n", + " \"publisher\": \"oreilly\"\n", + " },\n", + " {\n", + " \"title\": \"Introduction to the Theory of Computation\",\n", + " \"authors\": [\"michael sipser\"],\n", + " \"summary\": \"Introduction to the theory of computation and complexity theory\",\n", + " \"publish_date\": \"2012-06-27\",\n", + " \"num_reviews\": 33,\n", + " \"publisher\": \"cengage learning\"\n", + " },\n", + "]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Index documents\n", + "\n", + "Our dataset is a Python list that contains dictionaries of book titles, summaries, and other metadata.\n", + "We'll use the client's `bulk` method to index all of our documents in a single request.\n", + "\n", + "The following code iterates over the list of books and creates a list of actions to be performed.\n", + "Each action is a dictionary containing an \"index\" operation on our Elasticsearch index.\n", + "The book's title is encoded using our selected model, and the encoded vector is added to the book document.\n", + "The book document is then added to the list of actions.\n", + "\n", + "Finally, we call the `bulk` method, specifying the index name and the list of actions."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "actions = []\n", + "for book in books:\n", + "    # Each document is preceded by an \"index\" action line\n", + "    actions.append({\"index\": {\"_index\": \"rrf_book_index\"}})\n", + "    # Embed the title and store the vector on the document\n", + "    titleEmbedding = model.encode(book[\"title\"]).tolist()\n", + "    book[\"title_vector\"] = titleEmbedding\n", + "    actions.append(book)\n", + "\n", + "client.bulk(index=\"rrf_book_index\", operations=actions)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "WgWDMgf9NkHL" + }, + "source": [ + "## Pretty printing Elasticsearch responses\n", + "\n", + "This is a helper function to print Elasticsearch responses in a readable format." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def pretty_response(response):\n", + "    for hit in response['hits']['hits']:\n", + "        id = hit['_id']\n", + "        publication_date = hit['_source']['publish_date']\n", + "        score = hit['_score']\n", + "        title = hit['_source']['title']\n", + "        summary = hit['_source']['summary']\n", + "        pretty_output = (f\"\\nID: {id}\\nPublication date: {publication_date}\\nTitle: {title}\\nSummary: {summary}\\nScore: {score}\")\n", + "        print(pretty_output)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "MrBCHdH1u8Wd" + }, + "source": [ + "# Hybrid search using RRF\n", + "\n", + "## RRF overview\n", + "\n", + "[Reciprocal Rank Fusion (RRF)](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html) is a state-of-the-art ranking algorithm for combining results from different information retrieval strategies.\n", + "RRF consistently improves the combined results of different search algorithms.\n", + "In the original evaluation it outperformed the other rank fusion methods it was compared against, and it often surpasses the best individual results, without calibration.\n", + "In brief, it enables high-quality hybrid search out of the box.\n", + "\n", + "## How RRF works in Elasticsearch\n", + 
"\n", + "You can use RRF as part of a search to combine and rank documents using result sets from a combination of query and/or kNN searches.\n", + "A minimum of 2 result sets is required for ranking from the specified sources.\n", + "Check out the [RRF API reference](https://www.elastic.co/guide/en/elasticsearch/reference/master/rrf.html#rrf-api) for full details.\n", + "\n", + "In the following example, we'll use RRF to combine the results of a `match` query and a kNN semantic search.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "body = {\n", + " \"size\": 5,\n", + " \"query\": {\n", + " \"match\": {\n", + " \"summary\": \"shoes\"\n", + " },\n", + " \n", + " },\n", + " \"knn\": {\n", + " \"field\": \"title_vector\",\n", + " \"query_vector\" : model.encode(\"python programming\").tolist(), # generate embedding for query so it can be compared to `title_vector`\n", + " \"k\": 5,\n", + " \"num_candidates\": 10},\n", + " \"rank\": {\n", + " \"rrf\": {\n", + " \"window_size\": 5,\n", + " \"rank_constant\": 20\n", + " }\n", + " }\n", + "}\n", + "\n", + "response = client.search(index=\"rrf_book_index\", body=body)\n", + "\n", + "print(response)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above example, we first execute the kNN search to get its global top 5 results.\n", + "Then we execute the match query to get its global top 5 results.\n", + "Then we combine the kNN search and match query results and rank them based on the RRF method to get the final top 5 results (the request sets `size` to 5).\n", + "\n", + "ℹ️ Note that if `k` from a kNN search is larger than `window_size`, the results are truncated to `window_size`.\n", + "If `k` is smaller than `window_size`, the result set contains `k` results." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RRF toy example\n", + "\n", + "This very simple example demonstrates how RRF ranks documents from different search strategies.\n", + "We begin by creating a mapping for an index with a text field, a vector field, and an integer field along with indexing several documents. For this example we are going to use a vector with only a single dimension to make the ranking easier to explain." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "body = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"text\" : {\n", + " \"type\" : \"text\"\n", + " },\n", + " \"vector\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": 1,\n", + " \"similarity\": \"l2_norm\",\n", + " \"index\": \"true\"\n", + "\n", + " },\n", + " \"integer\" : {\n", + " \"type\" : \"integer\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "client.indices.create(index=\"example-index\", body=body)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next let's index some documents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "doc1 = {\n", + " \"text\" : \"rrf\",\n", + " \"vector\" : [5],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "doc2 ={\n", + " \"text\" : \"rrf rrf\",\n", + " \"vector\" : [4],\n", + " \"integer\": 2\n", + "}\n", + "\n", + "doc3 = {\n", + " \"text\" : \"rrf rrf rrf\",\n", + " \"vector\" : [3],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "doc4 = {\n", + " \"text\" : \"rrf rrf rrf rrf\",\n", + " \"integer\": 2\n", + "}\n", + "\n", + "doc5 ={\n", + " \"vector\" : [0],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "docs = [doc1, doc2, doc3, doc4, doc5]\n", + "\n", + "actions = []\n", + "for doc in docs:\n", + " actions.append({\"index\": {\"_index\": \"example-index\"}})\n", + " actions.append(doc)\n", + "\n", + "client.bulk(index=\"example-index\", operations=actions)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now execute a search using RRF with a query, a kNN search, and a terms aggregation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "body = {\n", + " \"query\": {\n", + " \"term\": {\n", + " \"text\": \"rrf\"\n", + " }\n", + " },\n", + " \"knn\": {\n", + " \"field\": \"vector\",\n", + " \"query_vector\": [3],\n", + " \"k\": 5,\n", + " \"num_candidates\": 5\n", + " },\n", + " \"rank\": {\n", + " \"rrf\": {\n", + " \"window_size\": 5,\n", + " \"rank_constant\": 1\n", + " }\n", + " },\n", + " \"size\": 3,\n", + " \"aggs\": {\n", + " \"int_count\": {\n", + " \"terms\": {\n", + " \"field\": \"integer\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "response = client.search(index=\"example-index\", body=body)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We receive a response with ranked hits and the terms aggregation result.\n", + "Note that _score is null, and we instead use _rank to show our top-ranked documents.\n", + "\n", + "Let’s break down how these hits were ranked.\n", + "We start by running the query and the kNN search separately to collect what their individual hits are.\n", + "\n", + "First, we look at the hits for the query.\n", + "\n", + "```json\n", + "\"hits\" : [\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"4\",\n", + " \"_score\" : 0.16152832, (1) \n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"text\" : \"rrf rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"3\", (2) \n", + " \"_score\" : 0.15876243,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [3],\n", + " \"text\" : \"rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"2\", (3) \n", + " \"_score\" : 0.15350538,\n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"vector\" : [4],\n", + " \"text\" : \"rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " 
\"_id\" : \"1\", (4)\n", + " \"_score\" : 0.13963442,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [5],\n", + " \"text\" : \"rrf\"\n", + " }\n", + " }\n", + "]\n", + "```\n", + "\n", + "Note the following information about the hits:\n", + "\n", + "- **(1)** rank 1, `_id` 4\n", + "- **(2)** rank 2, `_id` 3\n", + "- **(3)** rank 3, `_id` 2\n", + "- **(4)** rank 4, `_id` 1\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Note that our first hit doesn’t have a value for the vector field.\n", + "\n", + "Now, we look at the results for the kNN search.\n", + "\n", + "```json\n", + "\"hits\" : [\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"3\", (1)\n", + " \"_score\" : 1.0,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [3],\n", + " \"text\" : \"rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"2\", (2)\n", + " \"_score\" : 0.5,\n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"vector\" : [4],\n", + " \"text\" : \"rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"1\", (3)\n", + " \"_score\" : 0.2,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [5],\n", + " \"text\" : \"rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"5\", (4)\n", + " \"_score\" : 0.1,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [0]\n", + " }\n", + " }\n", + "]\n", + "```\n", + "\n", + "Note the following information about the hits:\n", + "\n", + "- **(1)** rank 1, `_id` 3\n", + "- **(2)** rank 2, `_id` 2\n", + "- **(3)** rank 3, `_id` 1\n", + "- **(4)** rank 4, `_id` 5\n", + "\n", + "\n", + "We can now take the two individually ranked result sets and apply the RRF formula to them to get our final ranking.\n", + "\n", + "```python\n", + "# doc | query | knn | score\n", + 
"_id: 1 = 1.0/(1+4) + 1.0/(1+3) = 0.4500\n", + "_id: 2 = 1.0/(1+3) + 1.0/(1+2) = 0.5833\n", + "_id: 3 = 1.0/(1+2) + 1.0/(1+1) = 0.8333\n", + "_id: 4 = 1.0/(1+1) = 0.5000\n", + "_id: 5 = 1.0/(1+4) = 0.2000\n", + "```\n", + "\n", + "We rank the documents based on the RRF formula with a `window_size` of `5`\n", + "truncating the bottom `2` docs in our RRF result set with a `size` of `3`.\n", + "\n", + "We end up with `_id: 3` as `_rank: 1`, `_id: 2` as `_rank: 2`, and\n", + "`_id: 4` as `_rank: 3`.\n", + "\n", + "This ranking matches the result set from the\n", + "original RRF search as expected." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/colab-notebooks-examples/search/03-ELSER.ipynb b/colab-notebooks-examples/search/03-ELSER.ipynb new file mode 100644 index 00000000..8956d6d2 --- /dev/null +++ b/colab-notebooks-examples/search/03-ELSER.ipynb @@ -0,0 +1,797 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "s49gpkvZ7q53" + }, + "source": [ + "# Semantic Search using ELSER text expansion\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/03-ELSER.ipynb)\n", + "\n", + "\n", + "Learn how to use the [Elastic Learned Sparse Encoder](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) for text expansion-powered semantic search." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Y01AXpELkygt" + }, + "source": [ + "# 🧰 Requirements\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.6 or later\n", + "- An Elastic deployment with a minimum **4GB machine learning node**\n", + " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n", + "- The [ELSER](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html) model installed on your Elastic deployment\n", + "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "N4pI1-eIvWrI" + }, + "source": [ + "# Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Under **Advanced settings**, go to **Machine Learning instances**\n", + " - You'll need at least **4GB** RAM per zone for this tutorial\n", + " - Select **Create deployment**" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "nSw1R8e28F_E" + }, + "source": [ + "# Set up ELSER\n", + "To use ELSER, you must have the [appropriate subscription]() level\n", + "for semantic search or the trial period activated.\n", + "\n", + "Follow these [instructions](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html#trained-model) to download and deploy ELSER in the Kibana UI or using the Dev Tools **Console**.\n", + "\n", + "(Console commands in comments 👇)\n", + "\n", + "" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + 
"metadata": { + "id": "gaTFHLJC-Mgi" + }, + "source": [ + "# Install packages and initialize the Elasticsearch Python client\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to `pip` install the following packages:\n", + "\n", + "- `elasticsearch`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K9Q1p2C9-wce", + "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" + }, + "outputs": [], + "source": [ + "!pip install elasticsearch" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gEzq2Z1wBs3M" + }, + "source": [ + "Next, we need to import the modules we need.\n", + "🔐 NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal, or storing it in memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uP_GTVRi-d96" + }, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "from urllib.request import urlopen\n", + "import getpass", + "import json" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "AMSePFiZCRqX" + }, + "source": [ + "Now we can instantiate the Python Elasticsearch client.\n", + "\n", + "First we prompt the user for their password and Cloud ID.\n", + "Then we create a `client` object that instantiates an instance of the `Elasticsearch` class." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "h0MdAZ53CdKL", + "outputId": "96ea6f81-f935-4d51-c4a7-af5a896180f1" + }, + "outputs": [], + "source": [ + "# Found in the 'Manage Deployment' page\n", + "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n", + "\n", + "# Password for the 'elastic' user generated by Elasticsearch\n", + "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n", + "\n", + "# Create the client instance\n", + "client = Elasticsearch(\n", + " cloud_id=CLOUD_ID,\n", + " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "bRHbecNeEDL3" + }, + "source": [ + "Confirm that the client has connected by running this test." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rdiUKqZbEKfF", + "outputId": "43b6f1cd-a43e-4dbe-caa5-7fd170464881" + }, + "outputs": [], + "source": [ + "print(client.info())" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "enHQuT57DhD1" + }, + "source": [ + "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n", + "\n", + "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect using API keys.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "TF_wxIAhD07a" + }, + "source": [ + "# Create Elasticsearch index with required mappings\n", + "\n", + "To use the ELSER model at index time, we'll need to create an index mapping that supports a [`text_expansion`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-text-expansion-query.html) 
query.\n", + "The mapping must include a field of type [`rank_features`](https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-features.html) to work with our feature vectors of interest.\n", + "This field contains the token-weight pairs the ELSER model created based on the input text.\n", + "\n", + "Let's create an index named `elser-movies` with the mappings we need.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cvYECABJJs_2", + "outputId": "18fb51e4-c4f6-4d1b-cb2d-bc6f8ec1aa84" + }, + "outputs": [], + "source": [ + "INDEX = 'elser-movies'\n", + "client.indices.create(\n", + " index=INDEX,\n", + " settings={\n", + " \"index\": {\n", + " \"number_of_shards\": 1,\n", + " \"number_of_replicas\": 1\n", + " }\n", + " },\n", + " mappings={\n", + " \"properties\": {\n", + " \"genre\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"keyScene\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"plot\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"released\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"runtime\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"title\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"ml.tokens\": {\n", + " \"type\": \"rank_features\"\n", + " }\n", + " }\n", + "}\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + 
"metadata": { + "id": "ohcvdngCGJlo" + }, + "source": [] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "EmELvr_JK_22" + }, + "source": [ + "# Create an ingest pipeline with an inference processor to use ELSER\n", + "\n", + "In order to use ELSER on our Elastic Cloud deployment we'll need to create an ingest pipeline that contains an inference processor that runs the ELSER model.\n", + "Let's add that pipeline using the [`put_pipeline`](https://www.elastic.co/guide/en/elasticsearch/reference/master/put-pipeline-api.html) method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "XhRng99KLQsd", + "outputId": "00ea73b5-45a4-472b-f4bc-2c2c790ab94d" + }, + "outputs": [], + "source": [ + "\n", + "client.ingest.put_pipeline(id=\"elser-v1-test\", body={\n", + " \"processors\": [\n", + " {\n", + " \"inference\": {\n", + " \"model_id\": \".elser_model_1\",\n", + " \"target_field\": \"ml\",\n", + " \"field_map\": {\n", + " \"keyScene\": \"text_field\",\n", + " \"plot\": \"text_field\"\n", + " },\n", + " \"inference_config\": {\n", + " \"text_expansion\": {\n", + " \"results_field\": \"tokens\"\n", + " }\n", + " }\n", + " }\n", + " }\n", + " ]\n", + "})" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "0wCH7YHLNW3i" + }, + "source": [ + "Let's note a few important parameters from that API call:\n", + "\n", + "- `inference`: A processor that performs inference using a machine learning model.\n", + "- `model_id`: Specifies the ID of the machine learning model to be used. In this example, the model ID is set to `.elser_model_1`.\n", + "- `target_field`: Defines the field where the inference result will be stored. Here, it is set to `ml`.\n", + "- `text_expansion`: Configures text expansion options for the inference process.\n", + "In this example, the inference results will be stored in a field named \"tokens\"." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "WgWDMgf9NkHL" + }, + "source": [] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "U3vT2g5LVIQF" + }, + "source": [ + "# Create index and mapping for test data\n", + "\n", + "\n", + "We have some test data in a `json` file at this [URL](https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json).\n", + "Let's load that into our Elastic deployment.\n", + "First we'll create an index named `search-movies` to store that data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "X3ONJckPnUIT", + "outputId": "07ea0766-c226-4510-c910-893db89757ad" + }, + "outputs": [], + "source": [ + "client.indices.create(\n", + " index=\"search-movies\",\n", + " mappings= {\n", + " \"properties\": {\n", + " \"genre\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"keyScene\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"plot\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"released\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"runtime\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"title\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " }\n", + " }\n", + "})" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "lFHgRUYVpNKP" + }, + "source": [ + "# Upload sample data\n", + "\n", + "> ⚠ To use the UI to 
upload data, follow the approach described [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-search-elser.html#load-data).\n", + "\n", + "Let's upload the JSON data.\n", + "The dataset provides information on twelve iconic films.\n", + "Each film's entry includes its title, runtime, plot summary, a key scene, genre classification, and release year." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IBfqgdAcuKRG", + "outputId": "3b86daa1-ade1-4ff3-da81-4207fa814d30" + }, + "outputs": [], + "source": [ + "url = \"https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json\"\n", + "\n", + "# Send a request to the URL and get the response\n", + "response = urlopen(url)\n", + "\n", + "# Load the response data into a JSON object\n", + "data_json = json.loads(response.read())\n", + "\n", + "def create_index_body(doc):\n", + "    \"\"\" Generate the body for an Elasticsearch document. \"\"\"\n", + "    return {\n", + "        \"_index\": \"search-movies\",\n", + "        \"_source\": doc,\n", + "    }\n", + "\n", + "# Prepare the documents to be indexed\n", + "documents = [create_index_body(doc) for doc in data_json]\n", + "\n", + "# Use helpers.bulk to index\n", + "helpers.bulk(client, documents)\n", + "\n", + "print(\"Done indexing documents into `search-movies` index!\")\n", + "\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "73d3Td-1ubhv" + }, + "source": [ + "# Ingest the data through the inference ingest pipeline\n", + "\n", + "Create tokens from the text by reindexing the data through the inference pipeline that uses ELSER as the inference model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ysYobyC9uhn5", + "outputId": "27af8c88-9039-4ff8-a20f-9af9ffcff05c" + }, + "outputs": [], + "source": [ + "client.reindex(wait_for_completion=False,\n", + " source={\n", + " \"index\": \"search-movies\"\n", + " },\n", + " dest= {\n", + " \"index\": \"elser-movies\",\n", + " \"pipeline\": \"elser-v1-test\"\n", + " }\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "tUDGeY7e2-I2" + }, + "source": [ + "# Confirm documents are indexed with additional fields\n", + "\n", + "A successful API call in the previous step returns a task ID to monitor the job's progress.\n", + "Use the [task management API](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html) to check progress.\n", + "You can also monitor this task using the **Trained Models** UI in Kibana, selecting the **Pipelines** tab under **ELSER**.\n", + "\n", + "Call the following, replacing the `task_id` value with the task ID returned in the previous step." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "2KXeXCc63WVw", + "outputId": "e8fee6dd-34a1-401d-c879-71fd54de3c90" + }, + "outputs": [], + "source": [ + "client.tasks.get(task_id='cxy4bU9ASFKpFgZUpa-jnA:19545263')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "oCj3jHHML4Tn" + }, + "source": [ + "Inspect a new document to confirm that it now has an `\"ml\": {\"tokens\":...}` field that contains a list of new, additional terms.\n", + "These terms are the **text expansion** of the field(s) you targeted for ELSER inference.\n", + "ELSER essentially creates a tree of expanded terms to improve the semantic searchability of your documents.\n", + "We'll be able to search these documents using a `text_expansion` query.\n", + "\n", + "But first let's start with a simple keyword search, to see how ELSER delivers semantically relevant results out of the box." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "_KahQAbPPd9l" + }, + "source": [ + "# Keyword match\n", + "\n", + "## Successful match\n", + "\n", + "Let's start by assuming a user queries the data set and hits an exact match.\n", + "BM25 is perfect for exact keyword matches.\n", + "Imagine our user remembers a movie where a child's spinning top was a recurring image.\n", + "They search for `spinning top` and because these exact words are used in the key scene description, we get a perfect hit.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FsZkFhGaYnzD", + "outputId": "843c72f1-6a0c-43ce-c1e4-ad5e763ebc95" + }, + "outputs": [], + "source": [ + "response = client.search(\n", + " index=\"elser-movies\",\n", + " query= {\n", + " \"match\": {\n", + " \"keyScene\": \"spinning top\"\n", + " }\n", + " }\n", + ")\n", + "for hit in 
response['hits']['hits']:\n", + "    doc_id = hit['_id']\n", + "    score = hit['_score']\n", + "    title = hit['_source']['title']\n", + "    text = hit['_source']['keyScene']\n", + "    print(f\"\\nTitle: {title}\\nKey scene description: {text}\\n\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Y01WHeOtbTZ-" + }, + "source": [ + "## Unsuccessful match\n", + "\n", + "Unfortunately, searches that rely on exact matches are brittle.\n", + "What if you can't remember the exact name of the thing you're searching for?\n", + "Who knows what a spinning top is anyway?\n", + "\n", + "Imagine I can only think of the words `child toy` to describe this object.\n", + "A match query won't find any relevant documents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "osifkhqidjYw", + "outputId": "6b917df6-b0af-4947-9280-98f7b17f2ff9" + }, + "outputs": [], + "source": [ + "response = client.search(\n", + " index=\"elser-movies\",\n", + " query= {\n", + " \"match\": {\n", + " \"keyScene\": \"child toy\"\n", + " }\n", + " }\n", + ")\n", + "hits = response['hits']['hits']\n", + "\n", + "if not hits:\n", + "    print(\"No matches found\")\n", + "else:\n", + "    for hit in hits:\n", + "        doc_id = hit['_id']\n", + "        score = hit['_score']\n", + "        title = hit['_source']['title']\n", + "        text = hit['_source']['keyScene']\n", + "        print(f\"\\nTitle: {title}\\nKey scene description: {text}\\n\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "MPCVztOLeAk_" + }, + "source": [ + "So it turns out classical term matching strategies are very good if you know precisely what you're looking for.\n", + "But they break down when a user has a hard time articulating what they're trying to find.\n", + "Here's where semantic search shines.\n", + "It helps capture a user's intent or meaning better, without relying on brittle term matches.\n", + 
"\n", + "Traditional dense vector based similarity strategies require you to generate embeddings for your data and then map queries into the same mathematical space as the data.\n", + "This works well but is time consuming and requires a lot of legwork.\n", + "The beauty of the Elastic Learned Sparse Encoder model is that it works out-of-the-box, without the need to fine tune on your data.\n", + "\n", + "The Elastic Learned Sparse Encoder creates a tree of expanded terms, adds them to your documents, improving their semantic searchability.\n", + "The fields that you targeted for inference are now enriched with a range of relevant synonyms and related terms, that increase the probability of a successful search." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Zy5GT2xb38oz" + }, + "source": [ + "# Semantic search with the `text_expansion` query\n", + "\n", + "Let's test out semantic search using the Elastic Learned Sparse Encoder, and see if we can improve our earlier unsuccessful search, using the query `child toy`.\n", + "\n", + "To perform semantic search using the Elastic Learned Sparse Encoder, you need the following:\n", + "- A `text_expansion` query\n", + "- Query text\n", + " - In this example we use `child toy`\n", + "- ELSER model ID" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bAZRxja-5Q6X", + "outputId": "37a26a2c-4284-4e51-c34e-9a55edf77cb8" + }, + "outputs": [], + "source": [ + "response = client.search(index='elser-movies', size=3,\n", + " query={\n", + " \"text_expansion\": {\n", + " \"ml.tokens\": {\n", + " \"model_id\":\".elser_model_1\",\n", + " \"model_text\":\"child toy\"\n", + " \n", + " }\n", + " }\n", + "}\n", + ")\n", + "\n", + "for hit in response['hits']['hits']:\n", + " doc_id = hit['_id']\n", + " score = hit['_score']\n", + " title = hit['_source']['title']\n", + " text = 
hit['_source']['keyScene']\n", + " print(f\"Score: {score}\\nTitle: {title}\\nKey scene description: {text}\\n\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "yYSJ7fnv5uWd" + }, + "source": [ + "Success! Out of the box ELSER has taken a fuzzy, but semantically similar query and found the correct match.\n", + "Our user has found the movie they're looking for!" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/colab-notebooks-examples/search/data.json b/colab-notebooks-examples/search/data.json new file mode 100644 index 00000000..ebc81684 --- /dev/null +++ b/colab-notebooks-examples/search/data.json @@ -0,0 +1,87 @@ +[ + { + "title": "The Pragmatic Programmer: Your Journey to Mastery", + "authors": ["andrew hunt", "david thomas"], + "summary": "A guide to pragmatic programming for software engineers and developers", + "publish_date": "2019-10-29", + "num_reviews": 30, + "publisher": "addison-wesley" + }, + { + "title": "Python Crash Course", + "authors": ["eric matthes"], + "summary": "A fast-paced, no-nonsense guide to programming in Python", + "publish_date": "2019-05-03", + "num_reviews": 42, + "publisher": "no starch press" + }, + { + "title": "Artificial Intelligence: A Modern Approach", + "authors": ["stuart russell", "peter norvig"], + "summary": "Comprehensive introduction to the theory and practice of artificial intelligence", + "publish_date": "2020-04-06", + "num_reviews": 39, + "publisher": "pearson" + }, + { + "title": "Clean Code: A Handbook of Agile Software Craftsmanship", + "authors": ["robert c. 
martin"], + "summary": "A guide to writing code that is easy to read, understand and maintain", + "publish_date": "2008-08-11", + "num_reviews": 55, + "publisher": "prentice hall" + }, + { + "title": "You Don't Know JS: Up & Going", + "authors": ["kyle simpson"], + "summary": "Introduction to JavaScript and programming as a whole", + "publish_date": "2015-03-27", + "num_reviews": 36, + "publisher": "oreilly" + }, + { + "title": "Eloquent JavaScript", + "authors": ["marijn haverbeke"], + "summary": "A modern introduction to programming", + "publish_date": "2018-12-04", + "num_reviews": 38, + "publisher": "no starch press" + }, + { + "title": "Design Patterns: Elements of Reusable Object-Oriented Software", + "authors": [ + "erich gamma", + "richard helm", + "ralph johnson", + "john vlissides" + ], + "summary": "Guide to design patterns that can be used in any object-oriented language", + "publish_date": "1994-10-31", + "num_reviews": 45, + "publisher": "addison-wesley" + }, + { + "title": "The Clean Coder: A Code of Conduct for Professional Programmers", + "authors": ["robert c. martin"], + "summary": "A guide to professional conduct in the field of software engineering", + "publish_date": "2011-05-13", + "num_reviews": 20, + "publisher": "prentice hall" + }, + { + "title": "JavaScript: The Good Parts", + "authors": ["douglas crockford"], + "summary": "A deep dive into the parts of JavaScript that are essential to writing maintainable code", + "publish_date": "2008-05-15", + "num_reviews": 51, + "publisher": "oreilly" + }, + { + "title": "Introduction to the Theory of Computation", + "authors": ["michael sipser"], + "summary": "Introduction to the theory of computation and complexity theory", + "publish_date": "2012-06-27", + "num_reviews": 33, + "publisher": "cengage learning" + } +]