Merge branch 'main' into cohere

deepset-ai · May 12, 2023 · da90aa7
2 parents ca6636e + d322bee
commit da90aa7
Showing 9 changed files with 176 additions and 53 deletions.
diff --git a/.github/workflows/examples-tests.yml b/.github/workflows/examples-tests.yml
@@ -40,7 +40,7 @@ jobs:
           python-version: ${{ env.PYTHON_VERSION }}
 
       - name: Install Haystack
-        run: pip install .[all]
+        run: pip install .[all,dev]
 
       - name: Run
         run: pytest examples/

diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -36,7 +36,7 @@ jobs:
           python-version: ${{ env.PYTHON_VERSION }}
 
       - name: Install Haystack
-        run: pip install ".[all]"
+        run: pip install ".[all,dev]"
 
       - name: Mypy
         if: steps.files.outputs.any_changed == 'true'
@@ -70,7 +70,7 @@ jobs:
 
       - name: Install Haystack
         run: |
-          pip install ".[all]"
+          pip install ".[all,dev]"
           pip install ./haystack-linter
 
       - name: Pylint

diff --git a/.github/workflows/openapi_sync.yml b/.github/workflows/openapi_sync.yml
@@ -26,7 +26,7 @@ jobs:
       - name: Install Haystack
         run: |
           pip install --upgrade pip
-          pip install -U -e .[all]
+          pip install -U -e .[all,dev]
           pip install -e ./rest_api
 
       - name: Update OpenAPI specs

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -155,7 +155,7 @@ jobs:
           python-version: ${{ env.PYTHON_VERSION }}
 
       - name: Install Haystack
-        run: pip install .[all]
+        run: pip install .[all,dev]
 
       - name: Run
         run: pytest --cov-report xml:coverage.xml --cov="haystack" -m "unit" test/${{ matrix.topic }}
@@ -1051,7 +1051,7 @@ jobs:
           sudo apt-get install ffmpeg
 
       - name: Install Haystack
-        run: pip install .[all]
+        run: pip install .[all,dev]
 
       - name: Run tests
         env:
@@ -1138,7 +1138,7 @@ jobs:
       # - name: Install sndfile (audio support) # https://github.com/libsndfile/libsndfile/releases/download/1.1.0/libsndfile-1.1.0-win64.zip
 
       - name: Install Haystack
-        run: pip install .[all]
+        run: pip install .[all,dev]
 
       - name: Run tests
         env:
@@ -1252,7 +1252,7 @@ jobs:
           sudo apt-get install libsndfile1 ffmpeg
 
       - name: Install Haystack
-        run: pip install .[all]
+        run: pip install .[all,dev]
 
       - name: Cache HF models
         id: cache-hf-models
@@ -1371,7 +1371,7 @@ jobs:
           python-version: ${{ env.PYTHON_VERSION }}
 
       - name: Install Haystack
-        run: pip install .[all]
+        run: pip install .[all,dev]
 
       - name: Run tests
         env:

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -54,7 +54,7 @@ pip install --upgrade pip
 # Install Haystack in editable mode
 pip install -e '.[dev]'
 ```
-Note that the `.[dev]` part is enough in many development scenarios when adding minor code fixes. However, if your changes require a schema change, then you'll need to install all dependencies with `pip install -e '.[all]' ` command. Introducing new components or changing their interface requires a schema change.
+Note that the `.[dev]` part is enough in many development scenarios when adding minor code fixes. However, if your changes require a schema change, then you'll need to install all dependencies with `pip install -e '.[all,dev]' ` command. Introducing new components or changing their interface requires a schema change.
 This will install all the dependencies you need to work on the codebase, plus testing and formatting dependencies.
 
 Last, install the pre-commit hooks with:
@@ -278,7 +278,7 @@ We formally define three scopes for tests in Haystack with different requirement
 - Might not be possible to run locally due to system and hardware requirements
 - **Goal: being confident in releasing Haystack**
 
-> **Note**: migrating the existing tests into these new categories is still in progress. Please ask the maintainers if you are in doubt about how to 
+> **Note**: migrating the existing tests into these new categories is still in progress. Please ask the maintainers if you are in doubt about how to
 classify your tests or where to place them.
 
 If you are writing a test that depend on a document store, there are a few conventions to define on which document store

diff --git a/README.md b/README.md
@@ -71,19 +71,19 @@ An example pipeline would consist of one `Retriever` Node and one `Reader` Node.
 -   **Continuous Learning**: Collect new training data from user feedback in production & improve your models continuously.
 
 ## Resources
-|                                                                                               |                                                                                                                                                                                                                                                   |
-| --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| 📒 [Docs](https://docs.haystack.deepset.ai)                                             | Components, Pipeline Nodes, Guides, API Reference                                                                                                                                                                                                 |
+|                                                                        |                                                                                                                                                                                                                                                   |
+| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 📒 [Docs](https://docs.haystack.deepset.ai)                             | Components, Pipeline Nodes, Guides, API Reference                                                                                                                                                                                                 |
 | 💾 [Installation](https://github.com/deepset-ai/haystack#-installation) | How to install Haystack                                                                                                                                                                                                                           |
-| 🎓 [Tutorials](https://haystack.deepset.ai/tutorials)     | See what Haystack can do with our Notebooks & Scripts                                                                                                                                                                                             |
-| 🎉 [Haystack Extras](https://github.com/deepset-ai/haystack-extras)               | A repository that lists extra Haystack packages and components that can be installed separately.                                                                                                                                                                                             |
-| 🔰 [Demos](https://github.com/deepset-ai/haystack-demos)           | A repository containing Haystack demo applications with Docker Compose and a REST API                                                                                                                                                                                  |
-| 🖖 [Community](https://github.com/deepset-ai/haystack#-community)   | [Discord](https://haystack.deepset.ai/community/join), [Twitter](https://twitter.com/deepset_ai), [Stack Overflow](https://stackoverflow.com/questions/tagged/haystack), [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) |
-| 💙 [Contributing](https://github.com/deepset-ai/haystack#-contributing)             | We welcome all contributions!                                                                                                                                                                                                                     |
-| 📊 [Benchmarks](https://haystack.deepset.ai/benchmarks/)                             | Speed & Accuracy of Retriever, Readers and DocumentStores                                                                                                                                                                                         |
-| 🔭 [Roadmap](https://haystack.deepset.ai/overview/roadmap)                           | Public roadmap of Haystack                                                                                                                                                                                                                        |
-| 📰 [Blog](https://haystack.deepset.ai/blog)                                             | Learn about the latest with Haystack and NLP                                                                                                                                                                   |
-| ☎️ [Jobs](https://www.deepset.ai/jobs)                                                   | We're hiring! Have a look at our open positions                                                                                                                                                                                                   |
+| 🎓 [Tutorials](https://haystack.deepset.ai/tutorials)                   | See what Haystack can do with our Notebooks & Scripts                                                                                                                                                                                             |
+| 🎉 [Haystack Extras](https://github.com/deepset-ai/haystack-extras)     | A repository that lists extra Haystack packages and components that can be installed separately.                                                                                                                                                  |
+| 🔰 [Demos](https://github.com/deepset-ai/haystack-demos)                | A repository containing Haystack demo applications with Docker Compose and a REST API                                                                                                                                                             |
+| 🖖 [Community](https://github.com/deepset-ai/haystack#-community)       | [Discord](https://haystack.deepset.ai/community/join), [Twitter](https://twitter.com/deepset_ai), [Stack Overflow](https://stackoverflow.com/questions/tagged/haystack), [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) |
+| 💙 [Contributing](https://github.com/deepset-ai/haystack#-contributing) | We welcome all contributions!                                                                                                                                                                                                                     |
+| 📊 [Benchmarks](https://haystack.deepset.ai/benchmarks/)                | Speed & Accuracy of Retriever, Readers and DocumentStores                                                                                                                                                                                         |
+| 🔭 [Roadmap](https://haystack.deepset.ai/overview/roadmap)              | Public roadmap of Haystack                                                                                                                                                                                                                        |
+| 📰 [Blog](https://haystack.deepset.ai/blog)                             | Learn about the latest with Haystack and NLP                                                                                                                                                                                                      |
+| ☎️ [Jobs](https://www.deepset.ai/jobs)                                  | We're hiring! Have a look at our open positions                                                                                                                                                                                                   |
 
 
 ## 💾 Installation
@@ -94,25 +94,25 @@ For a detailed installation guide see [the official documentation](https://docs.
 
 Use [pip](https://github.com/pypa/pip) to install a basic version of Haystack's latest release:
 
-```
-    pip install farm-haystack
+```sh
+pip install farm-haystack
 ```
 
-This command installs everything needed for basic Pipelines that use an Elasticsearch DocumentStore.
+This command installs everything needed for basic Pipelines that use an in-memory DocumentStore.
 
 **Full Installation**
 
-To use more advanced features, like certain DocumentStores, FileConverters, OCR, or Ray, install further dependencies. The following command installs the [latest release](https://github.com/deepset-ai/haystack/releases) of Haystack and all its dependencies:
+To use more advanced features, like certain DocumentStores, FileConverters, OCR, or Ray,
+you need to install further dependencies. The following command installs the [latest release](https://github.com/deepset-ai/haystack/releases) of Haystack and all its dependencies:
 
 ```bash
-pip install --upgrade pip
 pip install 'farm-haystack[all]' ## or 'all-gpu' for the GPU-enabled dependencies
 ```
 
-If you want to try out the newest features that are not in an official release yet, you can install Haystack from the main branch. The following command installs from `main` branch with `dev` dependencies:
+If you want to try out the newest features that are not in an official release yet, you can install the unstable version from the main branch with the following command:
 
-```
-pip install git+https://github.com/deepset-ai/haystack.git@main#egg=farm-haystack[dev]
+```sh
+pip install git+https://github.com/deepset-ai/haystack.git@main#egg=farm-haystack
 ```
 
 To be able to make changes to Haystack code, install with the following commands:
@@ -127,8 +127,8 @@ cd haystack
 # Upgrade pip
 pip install --upgrade pip
 
-# Install Haystack in editable mode
-pip install -e '.[all]'
+# Install Haystack in editable mode (-e) with the development tools
+pip install -e '.[dev]'
 ```
 
 If you want to contribute to the Haystack repo, check our [Contributor Guidelines](#💙-contributing) first.

diff --git a/haystack/preview/dataclasses/document.py b/haystack/preview/dataclasses/document.py
@@ -4,7 +4,7 @@
 import hashlib
 import logging
 from pathlib import Path
-from dataclasses import asdict, dataclass, field, fields
+from dataclasses import dataclass, field, fields, asdict
 
 import numpy
 import pandas
@@ -21,6 +21,12 @@
     "audio": Path,
 }
 
+EQUALS_BY_TYPE = {
+    Path: lambda self, other: self.absolute() == other.absolute(),
+    numpy.ndarray: lambda self, other: self.shape == other.shape and (self == other).all(),
+    pandas.DataFrame: lambda self, other: self.equals(other),
+}
+
 
 def _create_id(
     classname: str, content: Any, metadata: Optional[Dict[str, Any]] = None, id_hash_keys: Optional[List[str]] = None
@@ -36,20 +42,39 @@ def _create_id(
     return hashlib.sha256(str(content_to_hash).encode("utf-8")).hexdigest()
 
 
+def _safe_equals(obj_1, obj_2) -> bool:
+    """
+    Compares two dictionaries for equality, taking arrays, dataframes and other objects into account.
+    """
+    if type(obj_1) != type(obj_2):
+        return False
+
+    if isinstance(obj_1, dict):
+        if obj_1.keys() != obj_2.keys():
+            return False
+        return all(_safe_equals(obj_1[key], obj_2[key]) for key in obj_1)
+
+    for type_, equals in EQUALS_BY_TYPE.items():
+        if isinstance(obj_1, type_):
+            return equals(obj_1, obj_2)
+
+    return obj_1 == obj_2
+
+
 @dataclass(frozen=True)
 class Document:
     """
     Base data class containing some data to be queried.
-    Can contain text snippets, tables, file paths to files like images or audios.
-    Documents can be sorted by score, serialized to/from dictionary and JSON, and are immutable.
+    Can contain text snippets, tables, and file paths to images or audios.
+    Documents can be sorted by score, saved to/from dictionary and JSON, and are immutable.
 
     Immutability is due to the fact that the document's ID depends on its content, so upon changing the content, also
     the ID should change.  To avoid keeping IDs in sync with the content by using properties, and asking docstores to
     be aware of this corner case, we decide to make Documents immutable and remove the issue. If you need to modify a
     Document, consider using `to_dict()`, modifying the dict, and then create a new Document object using
     `Document.from_dict()`.
 
-    Note that `id_hash_keys` are referring to keys in the metadata. `content` is always included in the id hash.
+    Note that `id_hash_keys` are referring to keys in the metadata. `content` is always included in the ID hash.
     In case of file-based documents (images, audios), the content that is hashed is the file paths,
     so if the file is moved, the hash is different, but if the file is modified without renaming it, the has will
     not differ.
@@ -59,17 +84,26 @@ class Document:
     content: Any = field(default_factory=lambda: None)
     content_type: ContentType = "text"
     metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
-    id_hash_keys: List[str] = field(default_factory=lambda: [], hash=False)
+    id_hash_keys: List[str] = field(default_factory=list, hash=False)
     score: Optional[float] = field(default=None, compare=True)
     embedding: Optional[numpy.ndarray] = field(default=None, repr=False)
 
     def __str__(self):
         return f"{self.__class__.__name__}('{self.content}')"
 
+    def __eq__(self, other):
+        """
+        Compares documents for equality. Compares `embedding` properly and checks the metadata taking care of
+        embeddings, paths, dataframes, nested dictionaries and other objects.
+        """
+        if type(self) == type(other):
+            return _safe_equals(self.to_dict(), other.to_dict())
+        return False
+
     def __post_init__(self):
         """
-        Generate the ID based on the init parameters and make sure that content_type
-        matches the actual type of content.
+        Generate the ID based on the init parameters and make sure that `content_type` matches the actual type of
+        content.
         """
         # Validate content_type
         if not isinstance(self.content, PYTHON_TYPES_FOR_CONTENT[self.content_type]):
@@ -100,17 +134,29 @@ def __post_init__(self):
         object.__setattr__(self, "id", hashed_content)
 
     def to_dict(self):
+        """
+        Saves the Document into a dictionary.
+        """
         return asdict(self)
 
     def to_json(self, **json_kwargs):
-        return json.dumps(self.to_dict(), *json_kwargs)
+        """
+        Saves the Document into a JSON string that can be later loaded back.
+        """
+        return json.dumps(self.to_dict(), **json_kwargs)
 
     @classmethod
     def from_dict(cls, dictionary):
+        """
+        Creates a new Document object from a dictionary of its fields.
+        """
         return cls(**dictionary)
 
     @classmethod
     def from_json(cls, data, **json_kwargs):
+        """
+        Creates a new Document object from a JSON string.
+        """
         dictionary = json.loads(data, **json_kwargs)
         return cls.from_dict(dictionary=dictionary)
 

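Note on the `to_json` fix in the same file (`*json_kwargs` changed to `**json_kwargs`): a single `*` unpacks a dictionary's keys as positional arguments, and `json.dumps` accepts only the object positionally, so the old form fails as soon as any keyword is passed. A minimal reproduction, using two hypothetical helper functions and a plain dictionary in place of a `Document`:

```python
import json


def to_json_old(data: dict, **json_kwargs) -> str:
    # Pre-fix form: `*` unpacks the KEYS of json_kwargs as positional arguments.
    return json.dumps(data, *json_kwargs)


def to_json_new(data: dict, **json_kwargs) -> str:
    # Post-fix form: `**` forwards the keyword arguments unchanged.
    return json.dumps(data, **json_kwargs)


doc = {"content": "hello", "score": None}

# The old form raises TypeError because "indent" is passed positionally.
try:
    to_json_old(doc, indent=2)
    old_form_failed = False
except TypeError:
    old_form_failed = True

# The new form forwards indent=2 as intended.
pretty = to_json_new(doc, indent=2)
```

With no extra keyword arguments both forms behave the same, which may be why the bug went unnoticed.
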
diff --git a/pyproject.toml b/pyproject.toml
@@ -227,11 +227,11 @@ formatting = [
 ]
 
 all = [
-  "farm-haystack[docstores,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,dev,onnx,beir,metrics]",
+  "farm-haystack[docstores,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,onnx,beir,metrics]",
 ]
 all-gpu = [
   # beir is incompatible with faiss-gpu: https://github.com/beir-cellar/beir/issues/71
-  "farm-haystack[docstores-gpu,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,dev,onnx-gpu,metrics]",
+  "farm-haystack[docstores-gpu,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,onnx-gpu,metrics]",
 ]
 
 [project.scripts]
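
Note on the `pyproject.toml` hunk above: the only change to the `all` extra is that `dev` is no longer bundled, which is consistent with the workflow files switching from `.[all]` to `.[all,dev]`. The short check below compares the two requirement strings copied from the diff; the `extras` helper is a hypothetical illustration, not part of Haystack or pip.

```python
def extras(requirement: str) -> set:
    """Return the set of extras named between the square brackets of a requirement string."""
    inside = requirement[requirement.index("[") + 1 : requirement.rindex("]")]
    return set(inside.split(","))


# Old and new definitions of the `all` extra, copied from the diff.
old_all = extras(
    "farm-haystack[docstores,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,dev,onnx,beir,metrics]"
)
new_all = extras(
    "farm-haystack[docstores,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,onnx,beir,metrics]"
)

removed = old_all - new_all  # extras dropped from `all`
added = new_all - old_all    # extras newly added to `all`
```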