Add QueryClassifier incl. baseline models #1099

shahrukhx01 · 2021-05-26T18:08:06Z

Proposed changes:

Query classifier updated

Status (please check what you already did):

[] First draft (up for discussions & feedback)
Final code
Added tests
Updated documentation

Discussion Points:
I have made the changes as per the reviews in the other PR.

Issue: Linked Issue

shahrukhx01 · 2021-05-26T18:12:29Z

@lalitpagaria @tholor I have made the suggested changes, please let me know about your reviews.

haystack/pipeline.py

lalitpagaria · 2021-05-26T19:42:02Z

@shahrukhx01 Thanks for working on it.
Design part looks fine to me. @tholor can check further.

Can you please add relevant test cases and an example, documentation, or tutorial?

tholor · 2021-05-31T16:57:22Z

Finally had a look at the models and shared my feedback here

Now, let's talk about the implementation. What I'd suggest as next steps:

Let's add two classes to Haystack: TransformersQueryClassifier and SklearnQueryClassifier
For SklearnQueryClassifier: Let's allow loading models more flexibly from a local file or a remote URL, i.e. query_classifier can be None, a Path, or a str, similar to

haystack/haystack/retriever/dense.py

Line 38 in 84c3429

query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
For transformers, we can leverage the transformers classification pipeline
Let's make the question vs. keyword models the defaults if the user doesn't specify anything custom
Let's link the question vs. statement models in the docstring and the "usage documentation" as alternatives
I still have to verify this, but it might be helpful if we don't return "output_1" and "output_2" in the node, but rather the actual classification labels, i.e. "question" and "keywords". I think this will simplify setting up the right connections when plumbing the pipeline together as you don't need to remember what was "output_1" and "output_2"
Let's add tests
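
The "local file or remote URL" loading suggested above could look roughly like the following. This is only a minimal sketch under stated assumptions, not the code that was merged: `load_pickle` is a hypothetical helper name, and it treats any str starting with http:// or https:// as remote and everything else as a local path.

```python
import pickle
import urllib.request
from pathlib import Path
from typing import Any, Union


def load_pickle(source: Union[Path, str]) -> Any:
    """Hypothetical helper: load a pickled object from a local path or a remote URL."""
    if isinstance(source, str) and source.startswith(("http://", "https://")):
        # Remote case: download the raw bytes first.
        with urllib.request.urlopen(source) as response:
            data = response.read()
    else:
        # Local case: accepts both Path objects and plain path strings.
        data = Path(source).read_bytes()
    # Unpickling executes arbitrary code, so only load files from trusted sources.
    return pickle.loads(data)
```

The same helper would serve for both the model and the vectorizer, since each is stored as a separate pickle.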

shahrukhx01 · 2021-06-02T07:56:42Z

@lalitpagaria could you please do a quick review if you have time? I have added the two classes TransformersQueryClassifier and SklearnQueryClassifier as discussed above. However, I derived them from BaseComponent, not from DecisionNode as you pointed out in my earlier PR, as that was not discussed earlier. Also, please ignore most other changes in the file, as I ran the black formatter on it.

@tholor I have added the two classes and the necessary details as you mentioned earlier. Could you please host the following model and vectorizer on S3 for question vs. statement classification?

Model: https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_classifier.pickle
Vectorizer: https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_vectorizer.pickle

Although they point to the same location, these are two different objects based on the spaadia dataset. Thanks.

Once the code is in good shape, I'll move on to writing tests and updating the documentation.

lalitpagaria · 2021-06-02T08:17:17Z

@shahrukhx01 One suggestion. Please do not club formatting changes with code changes. It is very hard to review the code.

Also can you please add relevant tests.

tholor · 2021-06-02T11:46:13Z

@shahrukhx01 Uploaded the new files here:
https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_statements/model.pickle
https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_statements/vectorizer.pickle
https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_statements/readme.txt

shahrukhx01 · 2021-06-02T15:26:53Z

I have added the test test_query_keyword_statement_classifier() to test_pipeline.py. Please review and let me know if anything needs to change. Once this PR is merged, I'll move on to adding documentation.

tholor · 2021-06-03T06:16:54Z

Please make sure that the CI is passing. Right now mypy complains about some types in pipeline.py: https://github.com/deepset-ai/haystack/pull/1099/checks?check_run_id=2731190150

Let me know if you need help here

shahrukhx01 · 2021-06-03T06:59:26Z

@tholor I have added the following patch, could you please run the workflow again:

if isinstance(query_classifier, Path):            
    file_url = urllib.request.pathname2url(r"{}".format(query_classifier))            
    query_classifier = f"file:{file_url}"

lalitpagaria · 2021-06-03T07:04:41Z

I have started.

lalitpagaria · 2021-06-03T07:08:10Z

@shahrukhx01 The error is coming up again. I suggest you create a local test env and try running mypy locally as well.
See CI steps : https://github.com/deepset-ai/haystack/blob/master/.github/workflows/ci.yml

Successfully installed mypy-0.812 mypy-extensions-0.4.3 typed-ast-1.4.3 typing-extensions-3.10.0.0
haystack/pipeline.py:768: error: Argument 1 to "urlopen" has incompatible type "Union[Path, str]"; expected "Union[str, Request]"
Found 1 error in 1 file (checked 57 source files)
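
The error above says that urlopen does not accept a Path. One way to satisfy the checker, sketched here under assumptions rather than taken from the PR, is to convert the value into a str before it reaches urlopen. `to_url` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Union


def to_url(source: Union[Path, str]) -> str:
    """Hypothetical helper: turn a Path or str into a plain str URL."""
    if isinstance(source, Path):
        # as_uri() needs an absolute path, hence resolve() first.
        return source.resolve().as_uri()
    # In this branch the type checker has narrowed `source` to str.
    return source
```

Because the return type is declared as str, the result can be passed to urllib.request.urlopen without the Union[Path, str] complaint.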

shahrukhx01 · 2021-06-03T09:09:02Z

> @shahrukhx01 The error is coming up again. I suggest you create a local test env and try running mypy locally as well.
> See CI steps : https://github.com/deepset-ai/haystack/blob/master/.github/workflows/ci.yml
> Successfully installed mypy-0.812 mypy-extensions-0.4.3 typed-ast-1.4.3 typing-extensions-3.10.0.0
> haystack/pipeline.py:768: error: Argument 1 to "urlopen" has incompatible type "Union[Path, str]"; expected "Union[str, Request]"
> Found 1 error in 1 file (checked 57 source files)

@lalitpagaria I have fixed it, and mypy now passes the type check locally. Could you please run the workflow again?

shahrukhx01 · 2021-06-03T12:31:13Z

@tholor CI is passing now. Please give your overall review when you get time to see this. thanks!

shahrukhx01 · 2021-06-07T14:29:29Z

Hi @tholor, sorry to nudge you again, but could you please comment on this so that we can proceed?

tholor

Sure, I already had it on my todo list for today and was kicking off the CI earlier to see if it's passing :)
It so far looks very good to me - just left one comment about output names.
Could you please also revert all pure formatting changes that you have here? As @lalitpagaria mentioned, it makes it really hard to review and some parts are rather changed for the worse like:
https://github.com/shahrukhx01/haystack/blob/14b7f758886881d281d4a5392c5638b65ea649c3/haystack/pipeline.py#L159-L163
So I would not like to merge those formatting changes into master.

tholor · 2021-06-07T14:39:04Z

haystack/pipeline.py

+        is_question: bool = self.query_classifier.predict(query_vector)[0]
+        if is_question:
+            return (kwargs, "output_1")
+        else:


Shall we change the output name here from "output_1" -> "semantic_query" and "output_2" -> "keywords"?
I think this will help a lot when you stick pipelines together, since you can then refer to .keywords as the input:

pipeline.add_node(name="elastic", component=elastic_retriever, inputs=["SkQueryKeywordQuestionClassifier.keywords"])

Can you test quickly if this works already as expected or if anything in our pipeline class is currently blocking this? If yes, we can add two arguments to init (output_1_name, output_2_name) with the above defaults. This would allow users to switch it easily to their class names (e.g. when using the question vs statement model).

@tholor the output has to follow a specific format: the prefix always has to be "output_" and the full name is something like "output_{integer}", so I can't change the output names there. I have added the two arguments to init instead. Please let me know if that's what you wanted.

Ok my bad, sorry. I thought we already relaxed this requirement about the output format.
Then let's stick with your original version (hard coded output_1 and output_2; no init args) and document in the docstring which output belongs to which category for the available models.
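
To illustrate the agreed approach (hard-coded output_1 and output_2, with the mapping documented in the docstring), here is a minimal sketch. It is an assumption-laden stand-in, not the merged implementation: `ToyQueryClassifier` and its word-list heuristic are hypothetical and merely replace the trained sklearn/transformers models so that the example runs on its own.

```python
from typing import Any, Dict, Tuple

# Placeholder heuristic only; the PR uses trained models instead of a word list.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


class ToyQueryClassifier:
    """Hypothetical routing node for question vs. keyword queries.

    Outputs:
        output_1: the query looks like a natural-language question
        output_2: the query looks like a keyword query
    """

    outgoing_edges = 2

    def run(self, **kwargs: Any) -> Tuple[Dict[str, Any], str]:
        query = str(kwargs["query"]).strip().lower()
        words = query.split()
        first_word = words[0] if words else ""
        is_question = query.endswith("?") or first_word in QUESTION_WORDS
        # Pass the inputs through unchanged and name the edge to follow.
        return (kwargs, "output_1") if is_question else (kwargs, "output_2")
```

With this layout, a downstream node would be wired to the classifier's output_2 edge to receive keyword queries, and the docstring is the place where users look up which edge means what.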

@oryx1729 do you think we can (in general) relax the naming requirements for node outputs easily or do you see bigger drawbacks?

haystack/pipeline.py

tholor · 2021-06-07T15:19:04Z

Ah and one more thing: we are missing documentation here in terms of a "Usage page" or an additional section in our pipeline tutorial. I am also fine with adding this in a separate PR, but it should be done directly afterwards as we will forget about it otherwise.

shahrukhx01 · 2021-06-07T19:42:46Z

> Ah and one more thing: we are missing documentation here in terms of a "Usage page" or an additional section in our pipeline tutorial. I am also fine with adding this in a separate PR, but it should be done directly afterwards as we will forget about it otherwise.

@tholor I can create detailed documentation, and a tutorial for this on Colab, however, I'd need some time. For this I will create a separate PR. I have opened #1155 for this specifically, so that we don't forget about this. You can make me the assignee for that :)

shahrukhx01 · 2021-06-07T19:55:05Z

@tholor Overall, I have reverted formatting, added the output_names, please let me know if anything else needs to be added/changed.

tholor

Seems ready to merge once we have reverted output_1_name and output_2_name (see comment)

shahrukhx01 · 2021-06-08T07:22:05Z

@tholor I have reverted the changes, and updated the docstring.

lalitpagaria

LGTM
Thank you @shahrukhx01 for your contribution.

tholor · 2021-06-08T08:01:38Z

@shahrukhx01 Just adjusted the docstring for the SklearnQueryClassifier to make usage a bit more explicit.
Could you adjust the docstring for the transformers class in a similar way, please?

shahrukhx01 · 2021-06-08T08:37:08Z

> @shahrukhx01 Just adjusted the docstring for the SklearnQueryClassifier to make usage a bit more explicit.
> Could you adjust the docstring for the transformers class in a similar way, please?

@tholor updated the second docstring as per your style.

tholor · 2021-06-08T13:20:07Z

Did a few minor, last-mile adjustments - seems now ready to be merged :)

shahrukhx01 · 2021-06-08T13:24:00Z

@tholor do you have any ideas on how these baseline models can be improved? Is there any dataset that we can use for augmentation, or do we have to synthetically generate adversarial examples for the training set?

restructure query classifier code and add s3 based pickles

545ff22

lalitpagaria reviewed May 26, 2021

haystack/pipeline.py (outdated)

make model and vectorizer optional in query classifier

aeea566

lalitpagaria reviewed May 26, 2021

haystack/pipeline.py (outdated)

lalitpagaria reviewed May 26, 2021

haystack/pipeline.py (outdated)

update query classifier as per init style

9d7f752

lalitpagaria requested a review from tholor May 27, 2021 04:25

shahrukhx01 added 2 commits June 2, 2021 08:30

Merge branch 'deepset-ai:master' into add_query_classifier

8ee5ca3

add query classifiers sklearn/hf

a71e47c

shahrukhx01 added 2 commits June 2, 2021 14:26

update docstrings for query classifiers

19a7f6c

add unit test for query classifier

f027bb4

add type patch for sklearn classifier

91d4197

shahrukhx01 added 2 commits June 3, 2021 11:06

fix mypy type issue

2b0028d

Merge branch 'master' into add_query_classifier

7189ea2

Merge branch 'deepset-ai:master' into add_query_classifier

14b7f75

tholor requested changes Jun 7, 2021

shahrukhx01 added 5 commits June 7, 2021 17:21

revert to pure formatting

1287be6

add query classifiers

b2aa9d1

add query classifiers

0929a5e

resolve conflict

6704a1b

add output names for query classifier

94e05fd

shahrukhx01 mentioned this pull request Jun 7, 2021

Add documentation for query classifier #1155

Closed

Merge branch 'deepset-ai:master' into add_query_classifier

15fa518

tholor approved these changes Jun 8, 2021

shahrukhx01 added 2 commits June 8, 2021 09:19

revert output and update docstring queryclassifier

403ef3b

Merge branch 'add_query_classifier' of https://github.com/shahrukhx01…

5f4643e

…/haystack into add_query_classifier merge with master

lalitpagaria approved these changes Jun 8, 2021

lalitpagaria changed the title ~~WIP: add query baseline query classifier updated~~ Add query baseline query classifier updated Jun 8, 2021

Update docstring for SklearnQueryClassifier

5013e8a

update transformer query classifier docstring

c968fb3

shahrukhx01 and others added 2 commits June 8, 2021 10:43

fix typo

62cc0d8

change arg names in query classifier classes

5346e45

tholor changed the title ~~Add query baseline query classifier updated~~ Add QueryClassifier incl. baseline models Jun 8, 2021

tholor added 2 commits June 8, 2021 14:55

add set_config(). rename attributes

34fff38

fix set_config()

a3ef928

tholor merged commit 545c625 into deepset-ai:master Jun 8, 2021

tholor mentioned this pull request Jun 13, 2021

Introduce QueryClassifier #611

Closed
