Extend `TransformersQueryClassifier` #2903

anakin87 · 2022-07-27T23:13:09Z

Related Issue(s): #2587

Proposed changes:

A very first draft that extends `TransformersQueryClassifier` to support text classification models (including those with non-binary output) as well as zero-shot classification.

Pre-flight checklist

I have read the contributors guidelines
I have enabled actions on my fork
If this is a code change, I added tests or updated existing ones
If this is a code change, I updated the docstrings

I ran into some minor issues with the pre-commit hooks and docs generation (hook id: pydoc-markdown).

* Update docs of DeepsetCloudDocumentStore * Update Documentation & Code Style * Update docstring Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com> * Update Documentation & Code Style * move DEFAULT_API_ENDPOINT Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

* Add support for aliases in elasticsearch document store * Add alias support for OpenSearch * Missing variable index * Update Documentation & Code Style * Add unit test for elasticsearch alias support * Fix unit test when index is not compatible with haystack * Fix auto format conflict * Add comment explaining for loop for alias * Update Documentation & Code Style Co-authored-by: Jonathan Gallon <jonathan.gallon@totalenergies.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>

* Linearize tables in EmbeddingRetriever * Update Documentation & Code Style * Fix typing * Update Documentation & Code Style * simplify table linearization method + make it private * Update Documentation & Code Style * fix typing Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Renaming the ElasticsearchFilterOnlyRetriever to FilterRetriever * adding missed init file * Update Documentation & Code Style * fixed docstring * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* add scale_scores_to_probabilities flag * Update Documentation & Code Style * fix tests * fix sql mypy * Update Documentation & Code Style * fix responses * Update Documentation & Code Style * rename to scale_score_to_probability + docstrings * use BaseDocumentStore.score_to_probability in elasticsearch and milvus2 * Update Documentation & Code Style * fix tests * Update Documentation & Code Style * add tests * improve naming * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

…2487) * changing the name of the retrievers from es_retriever to retriever * Update Documentation & Code Style * name fix 2 * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

…TF-8` (#2420) * Change default encoding for PDFToTextConverter * Update Documentation & Code Style * Improve docstring * Update Documentation & Code Style * Add list of ligatures to ignore and add the possibility to modify such list at need * Add docstring * Add tests * Rename parameter * Update Documentation & Code Style * Move implementation into the base converter to make mypy happier * Update Documentation & Code Style * mypy and pylint * mypy * move encoding parameter to init of PDFToTextConverter * Update Documentation & Code Style * make utf8 default and fix mypy * Update Documentation & Code Style * Update Documentation & Code Style * remove note on encoding in tutorial8 * Update Documentation & Code Style * skip OCRConverter and test converter.run * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>

* Remove BasePipeline and make a module for RayPipeline * Can load pipelines from yaml, plenty of issues left * Extract graph validation logic into _add_node_to_pipeline_graph & refactor load_from_config and add_node to use it * Fix pipeline tests * Move some tests out of test_pipeline.py and create MockDenseRetriever * myoy and pylint (silencing too-many-public-methods) * Fix issue found in some yaml files and in schema files * Fix paths to YAML and fix some typos in Ray * Fix eval tests * Simplify MockDenseRetriever * Fix Ray test * Accidentally pushed merge coinflict, fixed * Typo in schemas * Typo in _json_schema.py * Slightly reduce noisyness of version validation warnings * Fix version logs tests * Fix version logs tests again * remove seemingly unused file * Add check and test to avoid adding the same node to the pipeline twice * Update Documentation & Code Style * Revert config to pipeline_config * Remo0ve unused import * Complete reverting to pipeline_config * Some more stray config= * Update Documentation & Code Style * Feedback * Move back other_nodes tests into pipeline tests temporarily * Update Documentation & Code Style * Fixing tests * Update Documentation & Code Style * Fixing ray and standard pipeline tests * Rename colliding load() methods in dense retrievers and faiss * Update Documentation & Code Style * Fix mypy on ray.py as well * Add check for no root node * Fix tests to use load_from_directory and load_index * Try to workaround the disabled add_node of RayPipeline * Update Documentation & Code Style * Fix Ray test * Fix FAISS tests * Relax class check in _add_node_to_pipeline_graph * Update Documentation & Code Style * Try to fix mypy in ray.py * unused import * Try another fix for Ray * Fix connector tests * Update Documentation & Code Style * Fix ray * Update Documentation & Code Style * use BaseComponent.load() in pipelines/base.py * another round of feedback * stray BaseComponent.load() * Update Documentation & Code Style * Fix FAISS 
tests too Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>

…talled (#2486) * automatically download correct torch-scatter version * raise error if torch-scatter is not installed * Update Documentation & Code Style * catch all import errors and fix linter * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

) * replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 * Update Documentation & Code Style * fix bug * Update Documentation & Code Style * update tutorial 15 outputs Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>

* add_member_class_prefix: true * Update Documentation & Code Style * Trigger redeploy * Trigger redeploy * Fix pydoc param * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

@ZanSara

…loyments (#2918) * Extending the Ray Serve integration to allow attributes for Serve deployments This closes #2917 We should be able to set Ray Serve attributes for the nodes of pipelines, like amount of GPU to use, max_concurrent_queries, etc. Now this is possible from the pipeline yaml file for each node of the pipeline. * Ran black and regenerated the json schemas * Fixing the JSON Schema generation * Trying to fix the schema CI test issue * Fixing the test and the schemas Python 3.8 was generating a different schema than Python 3.7 is creating in the CI. You MUST use Python 3.7 to generate the schemas, otherwise the CIs will fail. * Merge the two Ray pipeline test cases * Generate the JSON schemas again after `$ pip install .[all]` * Removing `haystack/json-schemas/haystack-pipeline-1.16.schema.json` This was generated by the JSON generator, but based on @ZanSara's instructions, I am removing it. * Making changes based on @ZanSara's request - the newly requested test is failing * Fixing the JSON schema generation again * Renaming `replicas` and moving it under `serve_deployment_kwargs` * add extras validation, untested * Dcoumentation update * Black * [EMPTY] Re-trigger CI Co-authored-by: Sara Zan <sarazanzo94@gmail.com>

ZanSara

I like this approach a lot! Very clear and flexible. Thanks for the tests especially 😊 I would add a few more of them to test unhappy paths too (wrong labels given, no label, non-existent task names, etc...) but in general the PR is sound. Thank you! 🙌
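
The unhappy paths listed above (wrong labels, no label, non-existent task names) could be exercised along these lines. This is only a sketch: `validate_classifier_config` and `raises_value_error` are hypothetical helpers written for illustration, not part of Haystack and not the tests eventually added to the PR. The two task names are the Hugging Face pipeline identifiers that seem relevant here; treating them as the only supported tasks is an assumption.

```python
from typing import List, Optional

# Assumption: the node would accept only these two Hugging Face pipeline tasks.
SUPPORTED_TASKS = ("text-classification", "zero-shot-classification")


def validate_classifier_config(task: str, labels: Optional[List[str]]) -> None:
    """Hypothetical check: raise ValueError for configurations that cannot be routed."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported task: {task!r}")
    if task == "zero-shot-classification" and not labels:
        raise ValueError("Zero-shot classification needs at least one candidate label")
    if labels is not None and len(set(labels)) != len(labels):
        raise ValueError("Labels must be unique")


def raises_value_error(task: str, labels: Optional[List[str]]) -> bool:
    """Small test helper: True if the configuration is rejected."""
    try:
        validate_classifier_config(task, labels)
    except ValueError:
        return True
    return False


# Unhappy paths: unknown task name, missing labels, duplicated labels.
print(raises_value_error("summarization", None))
print(raises_value_error("zero-shot-classification", []))
print(raises_value_error("text-classification", ["music", "music"]))

# Happy path: a valid zero-shot configuration is accepted.
print(raises_value_error("zero-shot-classification", ["music", "cinema"]))
```

In a real test suite the same cases would more naturally be written as parametrized tests that expect an exception.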

haystack/nodes/query_classifier/transformers.py

ZanSara · 2022-08-03T16:08:56Z

+    def _get_edge_number(self, label):
+        return self.labels.index(label) + 1


We're about to merge a PR that improves the management of nodes with a dynamic number of output edges: #2850. Can you wait for it to be merged and then adapt the code to use the new system? It's going to be merged today or tomorrow anyway.

This little method (beautifully written by you 😃) simply matches a label to a specific output edge.
Does #2850 impact this? How?
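
To make the label-to-edge mapping under discussion concrete, here is a minimal, self-contained sketch. `LabelRouter` and its `route` method are hypothetical stand-ins for illustration, not the Haystack class; only `_get_edge_number` mirrors the snippet quoted from the diff. The `output_<n>` edge naming is an assumption about how the node's outgoing edges are identified.

```python
from typing import List


class LabelRouter:
    """Hypothetical stand-in that routes a predicted label to an output edge."""

    def __init__(self, labels: List[str]):
        self.labels = labels

    def _get_edge_number(self, label: str) -> int:
        # Edges are 1-based, so the first label maps to edge 1.
        return self.labels.index(label) + 1

    def route(self, label: str) -> str:
        # Assumed naming convention for outgoing edges: "output_<n>".
        return f"output_{self._get_edge_number(label)}"


router = LabelRouter(labels=["music", "cinema", "sport"])
print(router.route("cinema"))  # output_2
```

Because `list.index` raises `ValueError` for a value that is not in the list, a label outside `labels` fails loudly here rather than being routed to an arbitrary edge.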

test/conftest.py

test/nodes/test_query_classifier.py

ZanSara · 2022-08-03T16:32:58Z

I will try to figure out what to do for the CLAssistant bot in the meantime 😅

* Tutorial 06: Replace DPR with EmbeddingRetriever Closes #2887 * Add updated tutorials/6.md file Replace `DensePassageRetriever` with `EmbeddingRetriever` * Update Tutorial 06 based on PR feedback * Further updates to Tutorial-06 according to review feedback * [Tutorial 06] Put in review feedback for the py file

* Ability to run Ray Serve detached Fixes #2944 Ability to run Ray Serve detached - to allow running multiple instances of the app (HA). See https://docs.ray.io/en/latest/serve/package-ref.html#core-apis * Generating the docs * Re-trigger the CI pipeline * Retrigger the CI Pipeline * Typo in docstrings * Fixing docstring and typing issues * Regenerating docs * [EMPTY] Re-trigger CI * [EMPTY] Re-trigger CI * Refactoring to allow any number of args for the `serve.start()` method There seems to be additional arguments of the `serve.start()` method, so we should probably cover all of them at once, instead of only the `detached` option. * [EMPTY] Re-trigger CI * Test whether the ServeControllerClient in fact has the supplied `detached` parameter

) * fixed tokens in question generation * simplified assignment * same behavior also for pad and eos * use skip_special_tokens in batch_decode * fixed black error and update docs * fixed schemas ci error * JSON schemas * Add git diff to debug schema issues * opensearch schema was missing * Add missing instruction in the workflow error message * typo

ZanSara · 2022-08-03T16:54:49Z

One way around this is to squash all commits into a single one, so the history won't have any trace of the non-existent email address and CLAssistant will stop complaining. There are a few ways to do that, let me know if you need help 🙂

* enable Opensearch unit tests under Win * move unit tests into a dedicated job * skip audio tests on missing dependencies * avoid failing test collection when soundfile is not available * Update .github/workflows/tests.yml Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* fix validation for dynamic outgoing edges * Update Documentation & Code Style * use class outgoing_edges as fallback if no instance is provided * implement classmethod approach * readd comment * fix mypy * fix tests * set outgoing_edges for all components * set outgoing_edges for mocks too * set document store outgoing_edges to 1 * set last missing outgoing_edges * enforce BaseComponent subclasses to define outgoing_edges * override _calculate_outgoing_edges for FileTypeClassifier * remove superfluous test * set rest_api's custom component's outgoing_edges * Update docstring Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai> * remove unnecessary else Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

review-notebook-app · 2022-08-04T09:27:01Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

anakin87 · 2022-08-04T10:20:51Z

Sorry, I made a big mess with git. 🙁
At this point, the simplest thing is to close this PR and reopen a clean PR.

tstadel and others added 30 commits April 26, 2022 20:28

Disable telemetry logs per default (#2463)

4f777b6

rename dataset to evaluation_set when logging to mlflow (#2457)

6392136

Ignore mypy issues regarding files param of requests.post (#2468)

6fe12f5

* ignore mypy issues regarding files param of requests.post * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

fix reader.eval and reader.eval_on_file output (#2476)

45f2a07

Add support for positional args in pipeline.get_config() (#2478)

6da2c7a

* add support for positional args in pipeline.get_config() * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Update xpdfreader package installation (#2491)

b42815c

This Update will fix this exception `Exception: pdftotext is not installed. It is part of xpdf or poppler-utils software suite. ` Now, converting PDFs wouldn't have any issues.

fix dot_product metric in pinecone (#2494)

31eedb4

Align TransformersReader defaults with FARMReader (#2490)

f4a55b0

* Align TransformersReader defaults with vFARMReader * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Deprecate Milvus1DocumentStore (#2495)

931bad7

* Add warning message * Update doc string * Update Documentation & Code Style * Change DeprecationWarning to FutureWarning Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Upgrade xpdf to 4.04 in Exception text (#2488)

142d49b

* Upgrade xpdf to 4.04 in Exception text * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

change milvus links from 2.0.0 to 2.0.x (#2496)

96ed996

* change milvus links from 2.0.0 to 2.0.x * Update Documentation & Code Style * fix two broken links * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Fix OpenSearchDocumentStore's __init__ (#2498)

e4e7bdc

* Move super in OpenSearchDocumentStore and add small test * Update Documentation & Code Style * Add Opensearch container to the CI Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Release v1.4.0 (#2502)

1d17ebc

* delete unneeded files of last release * add v1.4.0 docs with updated links * upgrade version number * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Update version to 1.4.1rc0 (#2509)

ae80434

* Update version to 1.4.1rc0 * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Upgrade transformers version to 4.18.0 (#2514)

9eb9a23

* Upgrade transformers version to 4.18.0 * Adapt tokenization test to upgrade * Adapt tokenization test to upgrade

fix small typo in Document doc string (#2520)

706a3cc

* fix small typo in Document doc string Was going through the tutorial, then digging through the code and just noticed a small typo * generate markdown file changes from docstrings Co-authored-by: Julian Risch <julian.risch@deepset.ai>

Remove safe import from ElasticsearchDocumentStore (#2522)

438b5f4

* Update version to 1.4.1rc0 * Elasticsearch is not an optional dependency * Fix import path * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

PR template mention of enabling Actions (#2523)

2962631

* Update version to 1.4.1rc0 * Add hint of enabling action on the fork in the PR template * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Add sort arg to JoinAnswers (#2436)

bf29ce3

* Add sort arg to JoinAnswers * Update Documentation & Code Style * Change naming and docstring * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

anakin87 and others added 4 commits August 3, 2022 14:55

Update haystack/nodes/query_classifier/transformers.py

4e0c254

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

Update haystack/nodes/query_classifier/transformers.py

a2c6b22

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

Merge branch 'deepset-ai:master' into extend_TransformersQueryClassifier

29d7ed8

fix docs

f5da12a

anakin87 requested a review from agnieszka-m August 3, 2022 13:07

agnieszka-m approved these changes Aug 3, 2022

ZanSara and others added 3 commits August 3, 2022 09:46

Add git diff to schema checks (#2959)

0483a82

fix docker tag for cuda (#2952)

4328103

ZanSara mentioned this pull request Aug 3, 2022

Revisit tests #2811

Closed

ZanSara reviewed Aug 3, 2022

bglearning and others added 3 commits August 3, 2022 18:43

masci and others added 7 commits August 3, 2022 19:19

Merge remote-tracking branch 'upstream/master' into extend_TransformersQueryClassifier

fc9763e

Merge branch 'deepset-ai:master' into extend_TransformersQueryClassifier

302524a

Merge branch 'extend_TransformersQueryClassifier' of https://github.com/anakin87/haystack into extend_TransformersQueryClassifier

32edd00

Merge remote-tracking branch 'upstream/master' into extend_TransformersQueryClassifier

d8d8681

add tests, other fixes

ac270e3

merge

aa753b6

anakin87 closed this Aug 4, 2022

This was referenced Aug 4, 2022

feat: Extend TransformersQueryClassifier: clean version #2965

Merged

feat: Support embedding dimensions on DeepsetCloudDocumentStore #2995

Merged

anakin87 deleted the extend_TransformersQueryClassifier branch August 16, 2022 17:20
