Commit

Merge branch 'main' into cohere
bogdankostic committed May 12, 2023
2 parents ca6636e + d322bee commit da90aa7
Showing 9 changed files with 176 additions and 53 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/examples-tests.yml
@@ -40,7 +40,7 @@ jobs:
python-version: ${{ env.PYTHON_VERSION }}

- name: Install Haystack
run: pip install .[all]
run: pip install .[all,dev]

- name: Run
run: pytest examples/
4 changes: 2 additions & 2 deletions .github/workflows/linting.yml
@@ -36,7 +36,7 @@ jobs:
python-version: ${{ env.PYTHON_VERSION }}

- name: Install Haystack
run: pip install ".[all]"
run: pip install ".[all,dev]"

- name: Mypy
if: steps.files.outputs.any_changed == 'true'
@@ -70,7 +70,7 @@ jobs:

- name: Install Haystack
run: |
pip install ".[all]"
pip install ".[all,dev]"
pip install ./haystack-linter
- name: Pylint
2 changes: 1 addition & 1 deletion .github/workflows/openapi_sync.yml
@@ -26,7 +26,7 @@ jobs:
- name: Install Haystack
run: |
pip install --upgrade pip
pip install -U -e .[all]
pip install -U -e .[all,dev]
pip install -e ./rest_api
- name: Update OpenAPI specs
10 changes: 5 additions & 5 deletions .github/workflows/tests.yml
@@ -155,7 +155,7 @@ jobs:
python-version: ${{ env.PYTHON_VERSION }}

- name: Install Haystack
run: pip install .[all]
run: pip install .[all,dev]

- name: Run
run: pytest --cov-report xml:coverage.xml --cov="haystack" -m "unit" test/${{ matrix.topic }}
@@ -1051,7 +1051,7 @@ jobs:
sudo apt-get install ffmpeg
- name: Install Haystack
run: pip install .[all]
run: pip install .[all,dev]

- name: Run tests
env:
@@ -1138,7 +1138,7 @@ jobs:
# - name: Install sndfile (audio support) # https://github.com/libsndfile/libsndfile/releases/download/1.1.0/libsndfile-1.1.0-win64.zip

- name: Install Haystack
run: pip install .[all]
run: pip install .[all,dev]

- name: Run tests
env:
@@ -1252,7 +1252,7 @@ jobs:
sudo apt-get install libsndfile1 ffmpeg
- name: Install Haystack
run: pip install .[all]
run: pip install .[all,dev]

- name: Cache HF models
id: cache-hf-models
@@ -1371,7 +1371,7 @@ jobs:
python-version: ${{ env.PYTHON_VERSION }}

- name: Install Haystack
run: pip install .[all]
run: pip install .[all,dev]

- name: Run tests
env:
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -54,7 +54,7 @@ pip install --upgrade pip
# Install Haystack in editable mode
pip install -e '.[dev]'
```
Note that the `.[dev]` part is enough in many development scenarios when adding minor code fixes. However, if your changes require a schema change, then you'll need to install all dependencies with `pip install -e '.[all]' ` command. Introducing new components or changing their interface requires a schema change.
Note that the `.[dev]` part is enough in many development scenarios when adding minor code fixes. However, if your changes require a schema change, then you'll need to install all dependencies with `pip install -e '.[all,dev]' ` command. Introducing new components or changing their interface requires a schema change.
This will install all the dependencies you need to work on the codebase, plus testing and formatting dependencies.

Last, install the pre-commit hooks with:
@@ -278,7 +278,7 @@ We formally define three scopes for tests in Haystack with different requirement
- Might not be possible to run locally due to system and hardware requirements
- **Goal: being confident in releasing Haystack**

> **Note**: migrating the existing tests into these new categories is still in progress. Please ask the maintainers if you are in doubt about how to
> **Note**: migrating the existing tests into these new categories is still in progress. Please ask the maintainers if you are in doubt about how to
classify your tests or where to place them.

If you are writing a test that depends on a document store, there are a few conventions to define on which document store
44 changes: 22 additions & 22 deletions README.md
@@ -71,19 +71,19 @@ An example pipeline would consist of one `Retriever` Node and one `Reader` Node.
- **Continuous Learning**: Collect new training data from user feedback in production & improve your models continuously.

## Resources
| | |
| --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 📒 [Docs](https://docs.haystack.deepset.ai) | Components, Pipeline Nodes, Guides, API Reference |
| | |
| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 📒 [Docs](https://docs.haystack.deepset.ai) | Components, Pipeline Nodes, Guides, API Reference |
| 💾 [Installation](https://github.com/deepset-ai/haystack#-installation) | How to install Haystack |
| 🎓 [Tutorials](https://haystack.deepset.ai/tutorials) | See what Haystack can do with our Notebooks & Scripts |
| 🎉 [Haystack Extras](https://github.com/deepset-ai/haystack-extras) | A repository that lists extra Haystack packages and components that can be installed separately. |
| 🔰 [Demos](https://github.com/deepset-ai/haystack-demos) | A repository containing Haystack demo applications with Docker Compose and a REST API |
| 🖖 [Community](https://github.com/deepset-ai/haystack#-community) | [Discord](https://haystack.deepset.ai/community/join), [Twitter](https://twitter.com/deepset_ai), [Stack Overflow](https://stackoverflow.com/questions/tagged/haystack), [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) |
| 💙 [Contributing](https://github.com/deepset-ai/haystack#-contributing) | We welcome all contributions! |
| 📊 [Benchmarks](https://haystack.deepset.ai/benchmarks/) | Speed & Accuracy of Retriever, Readers and DocumentStores |
| 🔭 [Roadmap](https://haystack.deepset.ai/overview/roadmap) | Public roadmap of Haystack |
| 📰 [Blog](https://haystack.deepset.ai/blog) | Learn about the latest with Haystack and NLP |
| ☎️ [Jobs](https://www.deepset.ai/jobs) | We're hiring! Have a look at our open positions |
| 🎓 [Tutorials](https://haystack.deepset.ai/tutorials) | See what Haystack can do with our Notebooks & Scripts |
| 🎉 [Haystack Extras](https://github.com/deepset-ai/haystack-extras) | A repository that lists extra Haystack packages and components that can be installed separately. |
| 🔰 [Demos](https://github.com/deepset-ai/haystack-demos) | A repository containing Haystack demo applications with Docker Compose and a REST API |
| 🖖 [Community](https://github.com/deepset-ai/haystack#-community) | [Discord](https://haystack.deepset.ai/community/join), [Twitter](https://twitter.com/deepset_ai), [Stack Overflow](https://stackoverflow.com/questions/tagged/haystack), [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) |
| 💙 [Contributing](https://github.com/deepset-ai/haystack#-contributing) | We welcome all contributions! |
| 📊 [Benchmarks](https://haystack.deepset.ai/benchmarks/) | Speed & Accuracy of Retriever, Readers and DocumentStores |
| 🔭 [Roadmap](https://haystack.deepset.ai/overview/roadmap) | Public roadmap of Haystack |
| 📰 [Blog](https://haystack.deepset.ai/blog) | Learn about the latest with Haystack and NLP |
| ☎️ [Jobs](https://www.deepset.ai/jobs) | We're hiring! Have a look at our open positions |


## 💾 Installation
@@ -94,25 +94,25 @@ For a detailed installation guide see [the official documentation](https://docs.

Use [pip](https://github.com/pypa/pip) to install a basic version of Haystack's latest release:

```
pip install farm-haystack
```sh
pip install farm-haystack
```

This command installs everything needed for basic Pipelines that use an Elasticsearch DocumentStore.
This command installs everything needed for basic Pipelines that use an in-memory DocumentStore.

**Full Installation**

To use more advanced features, like certain DocumentStores, FileConverters, OCR, or Ray, install further dependencies. The following command installs the [latest release](https://github.com/deepset-ai/haystack/releases) of Haystack and all its dependencies:
To use more advanced features, like certain DocumentStores, FileConverters, OCR, or Ray,
you need to install further dependencies. The following command installs the [latest release](https://github.com/deepset-ai/haystack/releases) of Haystack and all its dependencies:

```bash
pip install --upgrade pip
pip install 'farm-haystack[all]' ## or 'all-gpu' for the GPU-enabled dependencies
```

If you want to try out the newest features that are not in an official release yet, you can install Haystack from the main branch. The following command installs from `main` branch with `dev` dependencies:
If you want to try out the newest features that are not in an official release yet, you can install the unstable version from the main branch with the following command:

```
pip install git+https://github.com/deepset-ai/haystack.git@main#egg=farm-haystack[dev]
```sh
pip install git+https://github.com/deepset-ai/haystack.git@main#egg=farm-haystack
```

To be able to make changes to Haystack code, install with the following commands:
@@ -127,8 +127,8 @@ cd haystack
# Upgrade pip
pip install --upgrade pip

# Install Haystack in editable mode
pip install -e '.[all]'
# Install Haystack in editable mode (-e) with the development tools
pip install -e '.[dev]'
```

If you want to contribute to the Haystack repo, check our [Contributor Guidelines](#💙-contributing) first.
62 changes: 54 additions & 8 deletions haystack/preview/dataclasses/document.py
@@ -4,7 +4,7 @@
import hashlib
import logging
from pathlib import Path
from dataclasses import asdict, dataclass, field, fields
from dataclasses import dataclass, field, fields, asdict

import numpy
import pandas
@@ -21,6 +21,12 @@
"audio": Path,
}

EQUALS_BY_TYPE = {
Path: lambda self, other: self.absolute() == other.absolute(),
numpy.ndarray: lambda self, other: self.shape == other.shape and (self == other).all(),
pandas.DataFrame: lambda self, other: self.equals(other),
}


def _create_id(
classname: str, content: Any, metadata: Optional[Dict[str, Any]] = None, id_hash_keys: Optional[List[str]] = None
@@ -36,20 +42,39 @@ def _create_id(
return hashlib.sha256(str(content_to_hash).encode("utf-8")).hexdigest()


def _safe_equals(obj_1, obj_2) -> bool:
"""
Compares two dictionaries for equality, taking arrays, dataframes and other objects into account.
"""
if type(obj_1) != type(obj_2):
return False

if isinstance(obj_1, dict):
if obj_1.keys() != obj_2.keys():
return False
return all(_safe_equals(obj_1[key], obj_2[key]) for key in obj_1)

for type_, equals in EQUALS_BY_TYPE.items():
if isinstance(obj_1, type_):
return equals(obj_1, obj_2)

return obj_1 == obj_2
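
The type-dispatch pattern used by `_safe_equals` can be sketched with the standard library alone. This is a minimal illustration, not the Haystack implementation: the names `COMPARATORS` and `safe_equals` are hypothetical, and only the `Path` comparator is included so the snippet runs without NumPy or pandas.

```python
from pathlib import Path
from typing import Any, Callable, Dict

# Hypothetical stand-in for EQUALS_BY_TYPE: maps a type to a custom comparator.
COMPARATORS: Dict[type, Callable[[Any, Any], bool]] = {
    # Two paths are considered equal if they point to the same absolute location.
    Path: lambda a, b: a.absolute() == b.absolute(),
}


def safe_equals(obj_1: Any, obj_2: Any) -> bool:
    """Compare two objects, recursing into dicts and using custom comparators."""
    if type(obj_1) != type(obj_2):
        return False

    if isinstance(obj_1, dict):
        if obj_1.keys() != obj_2.keys():
            return False
        return all(safe_equals(obj_1[key], obj_2[key]) for key in obj_1)

    for type_, equals in COMPARATORS.items():
        if isinstance(obj_1, type_):
            return equals(obj_1, obj_2)

    return obj_1 == obj_2


relative = {"meta": {"file": Path("data.txt")}}
absolute = {"meta": {"file": Path.cwd() / "data.txt"}}

print(relative == absolute)             # plain == compares the paths literally
print(safe_equals(relative, absolute))  # the comparator makes both paths absolute first
```

NumPy arrays and pandas DataFrames need the same kind of entry because their `==` operator returns an elementwise result rather than a single boolean.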


@dataclass(frozen=True)
class Document:
"""
Base data class containing some data to be queried.
Can contain text snippets, tables, file paths to files like images or audios.
Documents can be sorted by score, serialized to/from dictionary and JSON, and are immutable.
Can contain text snippets, tables, and file paths to images or audios.
Documents can be sorted by score, saved to/from dictionary and JSON, and are immutable.
Immutability is due to the fact that the document's ID depends on its content, so upon changing the content, also
the ID should change. To avoid keeping IDs in sync with the content by using properties, and asking docstores to
be aware of this corner case, we decide to make Documents immutable and remove the issue. If you need to modify a
Document, consider using `to_dict()`, modifying the dict, and then create a new Document object using
`Document.from_dict()`.
Note that `id_hash_keys` are referring to keys in the metadata. `content` is always included in the id hash.
Note that `id_hash_keys` are referring to keys in the metadata. `content` is always included in the ID hash.
In case of file-based documents (images, audios), the content that is hashed is the file paths,
so if the file is moved, the hash is different, but if the file is modified without renaming it, the hash will
not differ.
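
The immutability reasoning in this docstring can be illustrated with a reduced frozen dataclass. This is a sketch under simplifying assumptions, not the actual `Document` class: `MiniDocument` and its fields are hypothetical, and the hash covers only the content string.

```python
import hashlib
from dataclasses import FrozenInstanceError, asdict, dataclass, field


@dataclass(frozen=True)
class MiniDocument:
    """Hypothetical, reduced stand-in for Document: the ID is derived from the content."""

    content: str
    id: str = field(default="", init=False)

    def __post_init__(self):
        digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()
        # Frozen dataclasses block normal attribute assignment, so the computed
        # ID is written through object.__setattr__ during initialization.
        object.__setattr__(self, "id", digest)


doc = MiniDocument(content="hello")

try:
    doc.content = "changed"  # direct mutation is rejected
except FrozenInstanceError:
    print("documents are immutable")

# To "modify" a document, go through a dict and build a new object.
data = asdict(doc)
data.pop("id")  # in this sketch the ID is not an init argument; it is recomputed
data["content"] = "changed"
new_doc = MiniDocument(**data)

print(doc.id != new_doc.id)
```

Because the ID is recomputed only in `__post_init__`, building a new object is the one path by which content and ID can change, and they always change together.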
@@ -59,17 +84,26 @@ class Document:
content: Any = field(default_factory=lambda: None)
content_type: ContentType = "text"
metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
id_hash_keys: List[str] = field(default_factory=lambda: [], hash=False)
id_hash_keys: List[str] = field(default_factory=list, hash=False)
score: Optional[float] = field(default=None, compare=True)
embedding: Optional[numpy.ndarray] = field(default=None, repr=False)

def __str__(self):
return f"{self.__class__.__name__}('{self.content}')"

def __eq__(self, other):
"""
Compares documents for equality. Compares `embedding` properly and checks the metadata taking care of
embeddings, paths, dataframes, nested dictionaries and other objects.
"""
if type(self) == type(other):
return _safe_equals(self.to_dict(), other.to_dict())
return False

def __post_init__(self):
"""
Generate the ID based on the init parameters and make sure that content_type
matches the actual type of content.
Generate the ID based on the init parameters and make sure that `content_type` matches the actual type of
content.
"""
# Validate content_type
if not isinstance(self.content, PYTHON_TYPES_FOR_CONTENT[self.content_type]):
@@ -100,17 +134,29 @@ def __post_init__(self):
object.__setattr__(self, "id", hashed_content)

def to_dict(self):
"""
Saves the Document into a dictionary.
"""
return asdict(self)

def to_json(self, **json_kwargs):
return json.dumps(self.to_dict(), *json_kwargs)
"""
Saves the Document into a JSON string that can be later loaded back.
"""
return json.dumps(self.to_dict(), **json_kwargs)

@classmethod
def from_dict(cls, dictionary):
"""
Creates a new Document object from a dictionary of its fields.
"""
return cls(**dictionary)

@classmethod
def from_json(cls, data, **json_kwargs):
"""
Creates a new Document object from a JSON string.
"""
dictionary = json.loads(data, **json_kwargs)
return cls.from_dict(dictionary=dictionary)
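
The `to_json` change from `*json_kwargs` to `**json_kwargs` matters as soon as a caller passes any keyword argument. The sketch below reproduces the difference with plain functions; `to_json_buggy` and `to_json_fixed` are hypothetical names used only for illustration.

```python
import json


def to_json_buggy(data: dict, **json_kwargs) -> str:
    # A single star unpacks the dict's keys as positional arguments,
    # but json.dumps accepts only one positional argument.
    return json.dumps(data, *json_kwargs)


def to_json_fixed(data: dict, **json_kwargs) -> str:
    # A double star forwards the options as keyword arguments.
    return json.dumps(data, **json_kwargs)


payload = {"content": "hello", "score": None}

# Without extra options both versions behave the same, which hides the bug.
print(to_json_buggy(payload) == to_json_fixed(payload))

try:
    to_json_buggy(payload, indent=2)
except TypeError as error:
    print(f"buggy version fails: {error}")

print(to_json_fixed(payload, sort_keys=True))
```

The failure appears only when options are supplied, which is why a test that calls `to_json()` with no arguments would not catch it.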

4 changes: 2 additions & 2 deletions pyproject.toml
@@ -227,11 +227,11 @@ formatting = [
]

all = [
"farm-haystack[docstores,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,dev,onnx,beir,metrics]",
"farm-haystack[docstores,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,onnx,beir,metrics]",
]
all-gpu = [
# beir is incompatible with faiss-gpu: https://github.com/beir-cellar/beir/issues/71
"farm-haystack[docstores-gpu,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,dev,onnx-gpu,metrics]",
"farm-haystack[docstores-gpu,audio,crawler,preprocessing,file-conversion,pdf,ocr,ray,onnx-gpu,metrics]",
]

[project.scripts]