diff --git a/README.md b/README.md index c2b8a51..9ecd17e 100644 --- a/README.md +++ b/README.md @@ -32,3 +32,4 @@ To contribute to the tutorials, please check out our [Contributing Guidelines](. | [Query Classification with TransformersTextRouter](./tutorials/41_Query_Classification_with_TransformersTextRouter_and_TransformersZeroShotTextRouter.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/41_Query_Classification_with_TransformersTextRouter_and_TransformersZeroShotTextRouter.ipynb) | | [Retrieving a Context Window Around a Sentence](./tutorials/42_Sentence_Window_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/42_Sentence_Window_Retriever.ipynb) | | | [Build a Tool-Calling Agent](./tutorials/43_Building_a_Tool_Calling_Agent.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/43_Building_a_Tool_Calling_Agent.ipynb) | | +| [Creating Custom SuperComponents](./tutorials/44_Creating_Custom_SuperComponents.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/44_Creating_Custom_SuperComponents.ipynb) | | diff --git a/index.toml b/index.toml index f30be23..822e47c 100644 --- a/index.toml +++ b/index.toml @@ -206,3 +206,15 @@ completion_time = "10 min" created_at = 2025-04-03 dependencies = ["docstring-parser", "trafilatura"] featured = true + +[[tutorial]] +title = "Creating Custom SuperComponents" +description = "Learn how to use the @super_component decorator to create custom SuperComponents with input and output mappings" +level = "intermediate" +weight = 8 
+notebook = "44_Creating_Custom_SuperComponents.ipynb" +aliases = [] +completion_time = "20 min" +created_at = 2025-04-22 +dependencies = ["sentence-transformers>=3.0.0", "datasets", "accelerate"] +featured = true diff --git a/tutorials/44_Creating_Custom_SuperComponents.ipynb b/tutorials/44_Creating_Custom_SuperComponents.ipynb new file mode 100644 index 0000000..8ea12f3 --- /dev/null +++ b/tutorials/44_Creating_Custom_SuperComponents.ipynb @@ -0,0 +1,3594 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "2OvkPji9O-qX" + }, + "source": [ + "# Tutorial: Creating Custom SuperComponents\n", + "\n", + "- **Level**: Intermediate\n", + "- **Time to complete**: 20 minutes\n", + "- **Concepts and Components Used**: [`@super_component`](https://docs.haystack.deepset.ai/docs/supercomponents), [`Pipeline`](https://docs.haystack.deepset.ai/docs/pipeline), [`DocumentJoiner`](https://docs.haystack.deepset.ai/docs/documentjoiner), [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder), [`InMemoryBM25Retriever`](https://docs.haystack.deepset.ai/docs/inmemorybm25retriever), [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever), [`TransformersSimilarityRanker`](https://docs.haystack.deepset.ai/docs/transformerssimilarityranker)\n", + "- **Goal**: After completing this tutorial, you'll have learned how to create custom SuperComponents using the `@super_component` decorator to simplify complex pipelines and make them reusable as components." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LFqHcXYPO-qZ" + }, + "source": [ + "## Overview\n", + "\n", + "In this tutorial, you'll learn how to create custom SuperComponents using the `@super_component` decorator. SuperComponents are a powerful way to encapsulate complex pipelines into reusable components with simplified interfaces.\n", + "\n", + "We'll explore several examples:\n", + "\n", + "1. 
Creating a simple HybridRetriever SuperComponent\n", + "2. Extending our HybridRetriever with a ranker component\n", + "3. Creating a SuperComponent with custom input and output mappings\n", + "4. Creating a SuperComponent that exposes outputs from non-leaf components\n", + "\n", + "The `@super_component` decorator makes it easy to convert a class that defines a pipeline into a fully functional Haystack component that can be used in other pipelines or applications without losing pipeline functionalities like content tracing and debugging. All it requires is that the class has an attribute called `pipeline`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QXjVlbPiO-qZ" + }, + "source": [ + "## Preparing the Environment\n", + "\n", + "First, let's install Haystack and the dependencies we'll need:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UQbU8GUfO-qZ" + }, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "pip install haystack-ai\n", + "pip install \"sentence-transformers>=3.0.0\" datasets transformers[torch,sentencepiece]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lvfew16O-qa" + }, + "source": [ + "## Understanding the @super_component Decorator\n", + "\n", + "The `@super_component` decorator is a powerful tool that allows you to create custom components by wrapping a Pipeline. It handles all the complexity of mapping inputs and outputs between the component interface and the underlying pipeline.\n", + "\n", + "When you use the `@super_component` decorator, you need to define a class with:\n", + "\n", + "1. An `__init__` method that creates a Pipeline and assigns it to `self.pipeline`\n", + "2. Optionally, `input_mapping` and `output_mapping` attributes to customize how inputs and outputs are mapped\n", + "\n", + "The decorator then:\n", + "\n", + "1. Creates a new class that inherits from `SuperComponent`\n", + "2. 
Copies all methods and attributes from your original class\n", + "3. Adds initialization logic to properly set up the SuperComponent\n", + "\n", + "Let's see how this works with some practical examples." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yL8nuJdWO-qa" + }, + "source": [ + "## 1. Creating a HybridRetriever SuperComponent\n", + "\n", + "Let's start with a simple example: creating a HybridRetriever that combines BM25 and embedding-based retrieval. This SuperComponent will take a query and return relevant documents." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "XvLVaFHTO-qb" + }, + "outputs": [], + "source": [ + "from haystack import Document, Pipeline, super_component\n", + "from haystack.components.joiners import DocumentJoiner\n", + "from haystack.components.embedders import SentenceTransformersTextEmbedder\n", + "from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever\n", + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "\n", + "from datasets import load_dataset\n", + "\n", + "\n", + "@super_component\n", + "class HybridRetriever:\n", + " def __init__(self, document_store: InMemoryDocumentStore, embedder_model: str = \"BAAI/bge-small-en-v1.5\"):\n", + " # Create the components\n", + " embedding_retriever = InMemoryEmbeddingRetriever(document_store)\n", + " bm25_retriever = InMemoryBM25Retriever(document_store)\n", + " text_embedder = SentenceTransformersTextEmbedder(embedder_model)\n", + " document_joiner = DocumentJoiner(join_mode=\"reciprocal_rank_fusion\")\n", + "\n", + " # Create the pipeline\n", + " self.pipeline = Pipeline()\n", + " self.pipeline.add_component(\"text_embedder\", text_embedder)\n", + " self.pipeline.add_component(\"embedding_retriever\", embedding_retriever)\n", + " self.pipeline.add_component(\"bm25_retriever\", bm25_retriever)\n", + " self.pipeline.add_component(\"document_joiner\", document_joiner)\n", + 
"\n", + " # Connect the components\n", + " self.pipeline.connect(\"text_embedder\", \"embedding_retriever\")\n", + " self.pipeline.connect(\"bm25_retriever\", \"document_joiner\")\n", + " self.pipeline.connect(\"embedding_retriever\", \"document_joiner\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RkaQjJJX0FAU" + }, + "source": [ + "Now, let's load a dataset and test our HybridRetriever:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aNzUi4iz0FAU" + }, + "outputs": [], + "source": [ + "# Load a dataset\n", + "dataset = load_dataset(\"HaystackBot/medrag-pubmed-chunk-with-embeddings\", split=\"train\")\n", + "docs = [Document(content=doc[\"contents\"], embedding=doc[\"embedding\"]) for doc in dataset]\n", + "document_store = InMemoryDocumentStore()\n", + "document_store.write_documents(docs)\n", + "\n", + "# Create and run the HybridRetriever\n", + "query = \"What treatments are available for chronic bronchitis?\"\n", + "retriever = HybridRetriever(document_store)\n", + "result = retriever.run(\n", + " text=query, query=query\n", + ") # `query` variable will match with `text` and `query` inputs of components in the pipeline." + ] + }, + { + "cell_type": "code", + "source": [ + "# Print the results\n", + "print(f\"Found {len(result['documents'])} documents\")\n", + "for i, doc in enumerate(result[\"documents\"][:3]): # Show first 3 documents\n", + " print(f\"\\nDocument {i+1} (Score: {doc.score:.4f}):\")\n", + " print(doc.content[:200] + \"...\")" + ], + "metadata": { + "id": "ZoTmeqV_j-OI", + "outputId": "7fa55a12-563c-4341-bbaa-3c578cc5e976", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": 13, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Found 15 documents\n", + "\n", + "Document 1 (Score: 0.9841):\n", + "[Proceedings: Long-term therapy with antibiotics in chronic bronchitis]. 
Longterm therapy of chronic bacterial bronchitis assumes two forms: (a) therapy of acute exacerbations, and (b) continuous long...\n", + "\n", + "Document 2 (Score: 0.9541):\n", + "An investigation of renal function in chronic bronchitis. An investigation has been made into various parameters of renal function in patients with chronic bronchitis and in a group of hypoxic control...\n", + "\n", + "Document 3 (Score: 0.9186):\n", + "Haematologic adaptation in patients with chronic bronchitis and pulmonary insufficiency. The relationship between respiratory insufficiency, expressed by gas tensions in blood and bone marrow, and hae...\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZnuXEecr0FAU" + }, + "source": [ + "### How the HybridRetriever Works\n", + "\n", + "Let's break down what's happening in our HybridRetriever SuperComponent:\n", + "\n", + "1. We define a class decorated with `@super_component`.\n", + "2. In the `__init__` method, we:\n", + "   - Create all the components we need (embedding retriever, BM25 retriever, etc.)\n", + "   - Create a Pipeline and add all components to it\n", + "   - Connect the components to define the flow of data\n", + "3. The `@super_component` decorator handles all the complexity of making our class work as a component.\n", + "\n", + "\n", + "Because this first version defines no `input_mapping`, we had to pass the query to `run()` twice: once as `text` and once as `query`. If we define an input mapping like `{\"query\": [\"text_embedder.text\", \"bm25_retriever.query\"]}`, we can instead call `retriever.run(query=query)`, and the query will automatically be routed to both the text embedder's `text` input and the BM25 retriever's `query` input.\n", + "\n", + "You can also specify how the outputs should be exposed through `output_mapping`. For example, the output mapping `{\"document_joiner.documents\": \"documents\"}` means that the documents produced by the `document_joiner` will be returned under the name `documents` when you call `retriever.run(...)`." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HryYZP9ZO-qb" + }, + "source": [ + "## 2. A HybridRetriever with Re-Ranking and Custom 'input_mapping'\n", + "\n", + "Now, let's enhance our HybridRetriever by adding a ranker component. This will re-rank the documents based on their semantic similarity to the query, potentially improving the quality of the results. We also define a custom `input_mapping` so that a single `query` argument is routed to the text embedder, the BM25 retriever, and the ranker." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "INdC3WvLO-qb" + }, + "outputs": [], + "source": [ + "from haystack import Document, Pipeline, super_component\n", + "from haystack.components.joiners import DocumentJoiner\n", + "from haystack.components.embedders import SentenceTransformersTextEmbedder\n", + "from haystack.components.rankers import TransformersSimilarityRanker\n", + "from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever\n", + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "\n", + "from datasets import load_dataset\n", + "\n", + "\n", + "@super_component\n", + "class HybridRetrieverWithRanker:\n", + " def __init__(\n", + " self,\n", + " document_store: InMemoryDocumentStore,\n", + " embedder_model: str = \"BAAI/bge-small-en-v1.5\",\n", + " ranker_model: str = \"BAAI/bge-reranker-base\",\n", + " ):\n", + " # Create the components\n", + " embedding_retriever = InMemoryEmbeddingRetriever(document_store)\n", + " bm25_retriever = InMemoryBM25Retriever(document_store)\n", + " text_embedder = SentenceTransformersTextEmbedder(embedder_model)\n", + " document_joiner = DocumentJoiner()\n", + " ranker = TransformersSimilarityRanker(ranker_model)\n", + "\n", + " # Create the pipeline\n", + " self.pipeline = Pipeline()\n", + " self.pipeline.add_component(\"text_embedder\", text_embedder)\n", + " self.pipeline.add_component(\"embedding_retriever\", embedding_retriever)\n", + " self.pipeline.add_component(\"bm25_retriever\", bm25_retriever)\n", + 
self.pipeline.add_component(\"document_joiner\", document_joiner)\n", + " self.pipeline.add_component(\"ranker\", ranker)\n", + "\n", + " # Connect the components\n", + " self.pipeline.connect(\"text_embedder\", \"embedding_retriever\")\n", + " self.pipeline.connect(\"bm25_retriever\", \"document_joiner\")\n", + " self.pipeline.connect(\"embedding_retriever\", \"document_joiner\")\n", + " self.pipeline.connect(\"document_joiner\", \"ranker\")\n", + "\n", + " # Define input mapping\n", + " self.input_mapping = {\"query\": [\"text_embedder.text\", \"bm25_retriever.query\", \"ranker.query\"]}" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "yxaN3KBo65pv", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 440, + "referenced_widgets": [ + "b4246cbc7b0b4f7784293fbd8337befe", + "97e4db1d981344ac80ce2e59745f50ff", + "49c3d271587a46a08edfc1a059887131", + "df28d21a35204da58c8642cfe1687a4d", + "4c933b0299f9478e9c39ba079302035f", + "6873d6a802f44e86ad11c44a47b43452", + "4f5e209fdc084e6fabbc480d673b14cf", + "c1c3b0e627944396ab365a6c7b91cc54", + "6b44fcace7a84fa9a863bb782668789c", + "fc33fd0c55554ea39216330ea4f7c745", + "547c5f3ca88747f8b9744da7456335b9", + "f41d46b53c454c6d886acb02ad054cda", + "3f9b55a507414540808db89d2af7302c", + "3155d69cb8b54b1899d447179d1eff9b", + "980f33d3987a4a21983d8e36f45f27f3", + "0974fcf1f2cf43c79d0c9e925f95872e", + "e863cb07b1134c9291cee19eeb2ee8f1", + "a2951397273141b1b5c05a3dd89dea7c", + "d0d28d8937b548fbafd2c72d33e4e4ab", + "4c78d723207346cdba9a2be2650b0bbc", + "e6b0473eccbf42adbc79375c48bab702", + "0aea00b2a8d443719572911525df5727", + "165ba79a9c1a465b829b16643b485e54", + "e02626edcde54a918ba58711fb78f9eb", + "a8bf397d535545cca2916a3103189591", + "362d04b496e04dd896555cdc562c0387", + "40282da198fe4350b5a806cf76492d13", + "206acb3589a049f38bab5e674a97cf75", + "5645ab5f93b9493a915f34dfbf444617", + "70a6ccd0ba6d491a9fc5284eedd0d736", + "c56370cd22b04d7d8d6b04abb22eb8d6", + 
"9054763a91c84642bd9d2c2db3b6f25a", + "e5bf035129f543feb9c5fa7a978fc5e8", + "c511266135c740e483df076e2812c139", + "9ebf405398e04047bcf7fdf0c4ec1beb", + "5ddb907fa72940d4b6b04f37d8c1d03c", + "66cbc136d6fa4c2ab97706cf51f691b4", + "41803f1ed4474a4991011e486ba09c3b", + "dafb0a71240c499f8481ee8ccc551289", + "269b9dc104ac47f3a6861f868f2e3419", + "046d644b846740039469eb061d47218b", + "58f6e0621d1749b0bdb3855a9af27e47", + "23cae104afce477f8514919db4e3da9b", + "da8b68a02ff642139022e5bc09d39cda", + "858d7e87d43f4f6c9c5ce83488b160b8", + "5bd56c5f80784646b74ac3fde1c9c7b9", + "290dbd2bef85499a967738a3ed87453b", + "5e17e82b3b154a8aa731833992b852e5", + "95b52f609cac4ecb916a31d1f8c13c8c", + "1027aa20fdcd4311a2409047ccf29520", + "b6fb44afe94e48f18eb41d982cd7dfdc", + "aaab7ce16fc44608815cdfea8606cb09", + "5e21b9d4e394464bb8402a586074cef7", + "569b3110771b4e2e83498ea770b97abe", + "a6e7b602094e41e5a3db0031b2ede035", + "573b84602f75411f801c7a69509754d2", + "fec4f36eb14f4ef8b2d3932327eb1311", + "e7c9affa0c6f4fa189c70188bc5fc0e6", + "0dba51b2e7cc4927bb9c94f820abbdc8", + "16ebad6926f34338a088a6cdfb8ced86", + "c040063684024574a3cb7e7db7a10297", + "0023bce9a9434741a59910e734902d83", + "a5951c7a49d24ee0904e90a1193f683f", + "f4e7897f36d94a50b0d546cdd032d1ea", + "3d678d1ee9f14c3380208c49a5108221", + "ba108e9e3b0248be9d4413bd7d1e6f2f", + "59c201c7fd8b4e04b23246db1bc7061a", + "ec9d2a77052b4979ad999c8e843e4c55", + "9a791ad8efc340c2876a5335eb32e4f5", + "300576212d2a4109a77f732e6ebe8d8f", + "256d83e568f240f5b1ea5b611b599375", + "5b24575d05a64ec98c0a514042006aac", + "6d55e29f7df248dda6da276bb55a9f1a", + "50dd4ef6a384418694dce3209a1a35cd", + "3bc892d8b29f4e51a7091b5a80886802", + "f6aac71ee87f4f72ab8a854e587116c6", + "936bf0c898e648cfa7a2b8d5f69c44a0" + ] + }, + "outputId": "21486dba-7914-4349-a579-2993ec212d86" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "config.json: 0%| | 0.00/799 [00:00\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 
11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;31m# Deserialize the component from the dictionary\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mdeserialized\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcomponent_from_dict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mDocumentPreprocessor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mserialized\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"document_preprocessor\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\\nDeserialized component:\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdeserialized\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/haystack/core/serialization.py\u001b[0m in \u001b[0;36mcomponent_from_dict\u001b[0;34m(cls, data, name, callbacks)\u001b[0m\n\u001b[1;32m 164\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 165\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcallbacks\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mcallbacks\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcomponent_pre_init\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 166\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mdo_from_dict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 167\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 168\u001b[0m \u001b[0;32mwith\u001b[0m 
\u001b[0m_hook_component_init\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcomponent_pre_init_callback\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/haystack/core/serialization.py\u001b[0m in \u001b[0;36mdo_from_dict\u001b[0;34m()\u001b[0m\n\u001b[1;32m 159\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdo_from_dict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 160\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcls\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"from_dict\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 161\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mcls\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfrom_dict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 162\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 163\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdefault_from_dict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcls\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/haystack/components/preprocessors/document_preprocessor.py\u001b[0m in \u001b[0;36mfrom_dict\u001b[0;34m(cls, data)\u001b[0m\n\u001b[1;32m 188\u001b[0m \"\"\"\n\u001b[1;32m 189\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m\"splitting_function\"\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"init_parameters\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 190\u001b[0;31m data[\"init_parameters\"][\"splitting_function\"] = 
deserialize_callable(\n\u001b[0m\u001b[1;32m 191\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"init_parameters\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"splitting_function\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 192\u001b[0m )\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/haystack/utils/callable_serialization.py\u001b[0m in \u001b[0;36mdeserialize_callable\u001b[0;34m(callable_handle)\u001b[0m\n\u001b[1;32m 51\u001b[0m \u001b[0;34m:\u001b[0m\u001b[0mraises\u001b[0m \u001b[0mDeserializationError\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mIf\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mcallable\u001b[0m \u001b[0mcannot\u001b[0m \u001b[0mbe\u001b[0m \u001b[0mfound\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 52\u001b[0m \"\"\"\n\u001b[0;32m---> 53\u001b[0;31m \u001b[0mparts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcallable_handle\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\".\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 54\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 55\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparts\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'split'" + ] + } + ], + "source": [ + "from haystack.core.serialization import component_to_dict, component_from_dict\n", + "from haystack.components.preprocessors import DocumentPreprocessor\n", + "\n", + "# Create an instance of our SuperComponent\n", + 
"preprocessor = DocumentPreprocessor()\n", + "\n", + "# Serialize the component to a dictionary\n", + "serialized = component_to_dict(preprocessor, \"document_preprocessor\")\n", + "print(\"Serialized component:\")\n", + "print(serialized)\n", + "\n", + "# Deserialize the component from the dictionary\n", + "deserialized = component_from_dict(DocumentPreprocessor, serialized, \"document_preprocessor\")\n", + "print(\"\\nDeserialized component:\")\n", + "print(deserialized)\n", + "\n", + "# Verify that the deserialized component works\n", + "doc = Document(content=\"I love pizza!\")\n", + "result = deserialized.run(documents=[doc])\n", + "print(f\"\\nDeserialized component produced {len(result['documents'])} documents\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6zbojiWXgEe" + }, + "source": [ + "The serialization and deserialization process works seamlessly with SuperComponents because the `@super_component` decorator automatically adds the necessary functionality. This is particularly useful when you want to:\n", + "\n", + "1. **Save and load pipelines**: You can save your entire pipeline, including SuperComponents, to a file and load it later.\n", + "2. **Deploy components**: You can deploy your SuperComponents to a server or cloud environment.\n", + "3. **Share components**: You can share your SuperComponents with others, who can then load and use them in their own pipelines.\n", + "\n", + "The serialization process captures all the initialization parameters of your SuperComponent, ensuring that when it's deserialized, it's recreated with the same configuration." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9y4iJE_SrS4K" + }, + "source": [ + "## 4. Creating a SuperComponent with Outputs from Non-Leaf Components\n", + "\n", + "One of the powerful features of SuperComponents is the ability to expose outputs from any component in the pipeline, not just the leaf components. 
Here, leaf components are those that do not send any outputs to other components in the pipeline. Let's create a SuperComponent that demonstrates this capability." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "HryYZP9ZO-qc" + }, + "outputs": [], + "source": [ + "from haystack import Document, Pipeline, super_component\n", + "from haystack.components.joiners import DocumentJoiner\n", + "from haystack.components.embedders import SentenceTransformersTextEmbedder\n", + "from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever\n", + "from haystack.components.rankers import TransformersSimilarityRanker\n", + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "\n", + "\n", + "@super_component\n", + "class AdvancedHybridRetriever:\n", + " def __init__(\n", + " self,\n", + " document_store: InMemoryDocumentStore,\n", + " embedder_model: str = \"BAAI/bge-small-en-v1.5\",\n", + " ranker_model: str = \"BAAI/bge-reranker-base\",\n", + " ):\n", + " # Create the components\n", + " embedding_retriever = InMemoryEmbeddingRetriever(document_store)\n", + " bm25_retriever = InMemoryBM25Retriever(document_store)\n", + " text_embedder = SentenceTransformersTextEmbedder(embedder_model)\n", + " document_joiner = DocumentJoiner()\n", + " ranker = TransformersSimilarityRanker(ranker_model)\n", + "\n", + " # Create the pipeline\n", + " self.pipeline = Pipeline()\n", + " self.pipeline.add_component(\"text_embedder\", text_embedder)\n", + " self.pipeline.add_component(\"embedding_retriever\", embedding_retriever)\n", + " self.pipeline.add_component(\"bm25_retriever\", bm25_retriever)\n", + " self.pipeline.add_component(\"document_joiner\", document_joiner)\n", + " self.pipeline.add_component(\"ranker\", ranker)\n", + "\n", + " # Connect the components\n", + " self.pipeline.connect(\"text_embedder\", \"embedding_retriever\")\n", + " self.pipeline.connect(\"bm25_retriever\", 
\"document_joiner\")\n", + " self.pipeline.connect(\"embedding_retriever\", \"document_joiner\")\n", + " self.pipeline.connect(\"document_joiner\", \"ranker\")\n", + "\n", + " # Define input and output mappings\n", + " self.input_mapping = {\"query\": [\"text_embedder.text\", \"bm25_retriever.query\", \"ranker.query\"]}\n", + "\n", + " # Expose outputs from multiple components, including non-leaf components\n", + " self.output_mapping = {\n", + " \"bm25_retriever.documents\": \"bm25_documents\",\n", + " \"embedding_retriever.documents\": \"embedding_documents\",\n", + " \"document_joiner.documents\": \"joined_documents\",\n", + " \"ranker.documents\": \"ranked_documents\",\n", + " \"text_embedder.embedding\": \"query_embedding\",\n", + " }" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "INdC3WvLO-qc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 373, + "referenced_widgets": [ + "1c63ba6f43574419b66fe173a763b04a", + "19f722a52a8e4e709bb99e6904187ced", + "d85c427101844bca92b072ddcad78faf", + "df0c5bd77043417fa206f28633170b75", + "efa58b66150e4f7bbce5ed3dd63f13a7", + "624236744c8043e8bed74fd8f6dceacc", + "6f9e161785a84bfe8b502f7cfca3bc62", + "ffac4adb3c85410992373480b9902b09", + "6072bee9c392439e95850c73df892e0c", + "7e3cfba7404a45a9920f2e9f07f1bf6e", + "90f7536c69e84b25a48386a7bffbfe42" + ] + }, + "outputId": "a9d0a257-6969-49bd-f141-7d3d4fe24813" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "Batches: 0%| | 0/1 [00:00