Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend TransformersQueryClassifier #2903

Closed
wants to merge 1,456 commits into from
Closed

Extend TransformersQueryClassifier #2903

wants to merge 1,456 commits into from

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Jul 27, 2022

Related Issue(s): #2587

Proposed changes:

  • Very first draft to extend TransformersQueryClassifier, to support text classification models (even with non-binary output) and also zero-shot classification

Pre-flight checklist

I had some little issues with pre-commit hooks and docs generation (hook id: pydoc-markdown)

tstadel and others added 30 commits April 26, 2022 20:28
* Update docs of DeepsetCloudDocumentStore

* Update Documentation & Code Style

* Update docstring

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

* Update Documentation & Code Style

* move DEFAULT_API_ENDPOINT

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
* ignore mypy issues regarding files param of requests.post

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add support for aliases in elasticsearch document store

* Add alias support for OpenSearch

* Missing variable index

* Update Documentation & Code Style

* Add unit test for elasticsearch alias support

* Fix unit test when index is not compatible with haystack

* Fix auto format conflict

* Add comment explaining for loop for alias

* Update Documentation & Code Style

Co-authored-by: Jonathan Gallon <jonathan.gallon@totalenergies.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
* Linearize tables in EmbeddingRetriever

* Update Documentation & Code Style

* Fix typing

* Update Documentation & Code Style

* simplify table linearization method + make it private

* Update Documentation & Code Style

* fix typing

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Renaming the ElasticsearchFilterOnlyRetriever to FilterRetriever

* adding missed init file

* Update Documentation & Code Style

* fixed docstring

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* add scale_scores_to_probabilities flag

* Update Documentation & Code Style

* fix tests

* fix sql mypy

* Update Documentation & Code Style

* fix responses

* Update Documentation & Code Style

* rename to scale_score_to_probability + docstrings

* use BaseDocumentStore.score_to_probability in elasticsearch and milvus2

* Update Documentation & Code Style

* fix tests

* Update Documentation & Code Style

* add tests

* improve naming

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* add support for positional args in pipeline.get_config()

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…2487)

* changing the name of the retrievers from es_retriever to retriever

* Update Documentation & Code Style

* name fix 2

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This Update will fix this exception `Exception: pdftotext is not installed. It is part of xpdf or poppler-utils software suite. ` Now, converting PDFs wouldn't have any issues.
* Align TransformersReader defaults with vFARMReader

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add warning message

* Update doc string

* Update Documentation & Code Style

* Change DeprecationWarning to FutureWarning

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…TF-8` (#2420)

* Change default encoding for PDFToTextConverter

* Update Documentation & Code Style

* Improve docstring

* Update Documentation & Code Style

* Add list of ligatures to ignore and add the possibility to modify such list at need

* Add docstring

* Add tests

* Rename parameter

* Update Documentation & Code Style

* Move implementation into the base converter to make mypy happier

* Update Documentation & Code Style

* mypy and pylint

* mypy

* move encoding parameter to init of PDFToTextConverter

* Update Documentation & Code Style

* make utf8 default and fix mypy

* Update Documentation & Code Style

* Update Documentation & Code Style

* remove note on encoding in tutorial8

* Update Documentation & Code Style

* skip OCRConverter and test converter.run

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
* Remove BasePipeline and make a module for RayPipeline

* Can load pipelines from yaml, plenty of issues left

* Extract graph validation logic into _add_node_to_pipeline_graph & refactor load_from_config and add_node to use it

* Fix pipeline tests

* Move some tests out of test_pipeline.py and create MockDenseRetriever

* myoy and pylint (silencing too-many-public-methods)

* Fix issue found in some yaml files and in schema files

* Fix paths to YAML and fix some typos in Ray

* Fix eval tests

* Simplify MockDenseRetriever

* Fix Ray test

* Accidentally pushed merge coinflict, fixed

* Typo in schemas

* Typo in _json_schema.py

* Slightly reduce noisyness of version validation warnings

* Fix version logs tests

* Fix version logs tests again

* remove seemingly unused file

* Add check and test to avoid adding the same node to the pipeline twice

* Update Documentation & Code Style

* Revert config to pipeline_config

* Remo0ve unused import

* Complete reverting to pipeline_config

* Some more stray config=

* Update Documentation & Code Style

* Feedback

* Move back other_nodes tests into pipeline tests temporarily

* Update Documentation & Code Style

* Fixing tests

* Update Documentation & Code Style

* Fixing ray and standard pipeline tests

* Rename colliding load() methods in dense retrievers and faiss

* Update Documentation & Code Style

* Fix mypy on ray.py as well

* Add check for no root node

* Fix tests to use load_from_directory and load_index

* Try to workaround the disabled add_node of RayPipeline

* Update Documentation & Code Style

* Fix Ray test

* Fix FAISS tests

* Relax class check in _add_node_to_pipeline_graph

* Update Documentation & Code Style

* Try to fix mypy in ray.py

* unused import

* Try another fix for Ray

* Fix connector tests

* Update Documentation & Code Style

* Fix ray

* Update Documentation & Code Style

* use BaseComponent.load() in pipelines/base.py

* another round of feedback

* stray BaseComponent.load()

* Update Documentation & Code Style

* Fix FAISS tests too

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
* Upgrade xpdf to 4.04 in Exception text

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* change milvus links from 2.0.0 to 2.0.x

* Update Documentation & Code Style

* fix two broken links

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…talled (#2486)

* automatically download correct torch-scatter version

* raise error if torch-scatter is not installed

* Update Documentation & Code Style

* catch all import errors and fix linter

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
)

* replace TableTextRetriever with EmbeddingRetriever in Tutorial 15

* Update Documentation & Code Style

* fix bug

* Update Documentation & Code Style

* update tutorial 15 outputs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>
* Move super in OpenSearchDocumentStore and add small test

* Update Documentation & Code Style

* Add Opensearch container to the CI

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* delete unneeded files of last release

* add v1.4.0 docs with updated links

* upgrade version number

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update version to 1.4.1rc0

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* add_member_class_prefix: true

* Update Documentation & Code Style

* Trigger redeploy

* Trigger redeploy

* Fix pydoc param

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Upgrade transformers version to 4.18.0

* Adapt tokenization test to upgrade

* Adapt tokenization test to upgrade
* fix small typo in Document doc string

Was going through the tutorial, then digging through the code and just noticed a small typo

* generate markdown file changes from docstrings

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
* Update version to 1.4.1rc0

* Elasticsearch is not an optional dependency

* Fix import path

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update version to 1.4.1rc0

* Add hint of enabling action on the fork in the PR template

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Add sort arg to JoinAnswers

* Update Documentation & Code Style

* Change naming and docstring

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
anakin87 and others added 4 commits August 3, 2022 14:55
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
ZanSara and others added 3 commits August 3, 2022 09:46
…loyments (#2918)

* Extending the Ray Serve integration to allow attributes for Serve deployments

This closes #2917

We should be able to set Ray Serve attributes for the nodes of pipelines, like amount of GPU to use, max_concurrent_queries, etc.

Now this is possible from the pipeline yaml file for each node of the pipeline.

* Ran black and regenerated the json schemas

* Fixing the JSON Schema generation

* Trying to fix the schema CI test issue

* Fixing the test and the schemas

Python 3.8 was generating a different schema than Python 3.7 is creating in the CI. You MUST use Python 3.7 to generate the schemas, otherwise the CIs will fail.

* Merge the two Ray pipeline test cases

* Generate the JSON schemas again after `$ pip install .[all]`

* Removing `haystack/json-schemas/haystack-pipeline-1.16.schema.json`

This was generated by the JSON generator, but based on @ZanSara's instructions, I am removing it.

* Making changes based on @ZanSara's request - the newly requested test is failing

* Fixing the JSON schema generation again

* Renaming `replicas` and moving it under `serve_deployment_kwargs`

* add extras validation, untested

* Dcoumentation update

* Black

* [EMPTY] Re-trigger CI

Co-authored-by: Sara Zan <sarazanzo94@gmail.com>
@ZanSara ZanSara mentioned this pull request Aug 3, 2022
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach a lot! Very clear and flexible. Thanks for the tests especially 😊 I would add a few more of them to test unhappy paths too (wrong labels given, no label, non-existent task names, etc...) but in general the PR is sound. Thank you! 🙌

haystack/nodes/query_classifier/transformers.py Outdated Show resolved Hide resolved
Comment on lines +107 to +115
def _get_edge_number(self, label):
return self.labels.index(label) + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're about to merge a PR that improves management of nodes with a dynamic number of output edges: #2850 Can you wait for it to be merged and adapt the code to use the new system? It's going to be merged today or tomorrow anyway

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This little method (beautifully written by you 😃) simply matches a label to a specific output edge.
Does #2850 impact this? How?

test/conftest.py Outdated Show resolved Hide resolved
test/nodes/test_query_classifier.py Outdated Show resolved Hide resolved
@ZanSara
Copy link
Contributor

ZanSara commented Aug 3, 2022

I will try to figure out what to do for the CLAssistant bot in the meantime 😅

bglearning and others added 3 commits August 3, 2022 18:43
* Tutorial 06: Replace DPR with EmbeddingRetriever

Closes #2887

* Add updated tutorials/6.md file

Replace `DensePassageRetriever` with `EmbeddingRetriever`

* Update Tutorial 06 based on PR feedback

* Further updates to Tutorial-06 according to review feedback

* [Tutorial 06] Put in review feedback for the py file
* Ability to run Ray Serve detached

Fixes #2944

Ability to run Ray Serve detached - to allow running multiple instances of the app (HA).

See https://docs.ray.io/en/latest/serve/package-ref.html#core-apis

* Generating the docs

* Re-trigger the CI pipeline

* Retrigger the CI Pipeline

* Typo in docstrings

* Fixing docstring and typing issues

* Regenerating docs

* [EMPTY] Re-trigger CI

* [EMPTY] Re-trigger CI

* Refactoring to allow any number of args for the `serve.start()` method

There seems to be additional arguments of the `serve.start()` method, so we should probably cover all of them at once, instead of only the `detached` option.

* [EMPTY] Re-trigger CI

* Test whether the ServeControllerClient in fact has the supplied `detached` parameter
)

* fixed tokens in question generation

* simplified assignment

* same behavior also for pad and eos

* use skip_special_tokens in batch_decode

* fixed black error and update docs

* fixed schemas ci error

* JSON schemas

* Add git diff to debug schema issues

* opensearch schema was missing

* Add missing instruction in the workflow error message

* typo
@ZanSara
Copy link
Contributor

ZanSara commented Aug 3, 2022

One way around this is to squash all commits into a single one, so the history won't have trace of the non-existing email address and CLAssistant will stop bothering. There are a few ways to do that, let me know if you need help 🙂

masci and others added 7 commits August 3, 2022 19:19
* enable Opensearch unit tests under Win

* move unit tests into a dedicated job

* skip audio tests on missing dependencies

* avoid failing test collection when soundfile is not available

* Update .github/workflows/tests.yml

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* fix validation for dynamic outgoing edges

* Update Documentation & Code Style

* use class outgoing_edges as fallback if no instance is provided

* implement classmethod approach

* readd comment

* fix mypy

* fix tests

* set outgoing_edges for all components

* set outgoing_edges for mocks too

* set document store outgoing_edges to 1

* set last missing outgoing_edges

* enforce BaseComponent subclasses to define outgoing_edges

* override _calculate_outgoing_edges for FileTypeClassifier

* remove superfluous test

* set rest_api's custom component's outgoing_edges

* Update docstring

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* remove unnecessary else

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@anakin87
Copy link
Member Author

anakin87 commented Aug 4, 2022

Sorry, I made a big mess with git. 🙁
At this point, the simplest thing is to close this PR and reopen a clean PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet