Add QueryClassifier incl. baseline models #1099

Merged
merged 25 commits into from Jun 8, 2021

shahrukhx01
Contributor

@shahrukhx01 shahrukhx01 commented May 26, 2021

Proposed changes:

  • Query classifier updated

Status (please check what you already did):

  • [] First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

Discussion Points:
I have made the changes as per the reviews in the other PR.

Issue: Linked Issue

@shahrukhx01
Contributor Author

@lalitpagaria @tholor I have made the suggested changes. Please let me know what you think.

@lalitpagaria
Contributor

@shahrukhx01 Thanks for working on it.
The design looks fine to me. @tholor can check further.

Could you please add relevant test cases and an example, documentation, or a tutorial?

@lalitpagaria lalitpagaria requested a review from tholor May 27, 2021 04:25
@tholor
Member

tholor commented May 31, 2021

Finally had a look at the models and shared my feedback here

Now, let's talk about the implementation. What I'd suggest as next steps:

  • Let's add two classes to Haystack: TransformersQueryClassifier and SklearnQueryClassifier
  • For SklearnQueryClassifier: Let's allow loading models more flexibly from a local file or a remote URL, i.e. query_classifier can be None, a Path, or a str, similar to
    query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
  • For transformers, we can leverage the transformers classification pipeline
  • Let's make the question vs. keyword models the defaults if the user doesn't specify anything custom
  • Let's link the question vs. statement models in the docstring and the "usage documentation" as alternatives
  • I still have to verify this, but it might be helpful if we don't return "output_1" and "output_2" in the node, but rather the actual classification labels, i.e. "question" and "keywords". I think this will simplify setting up the right connections when plumbing the pipeline together as you don't need to remember what was "output_1" and "output_2"
  • Let's add tests

@shahrukhx01
Contributor Author

shahrukhx01 commented Jun 2, 2021

@lalitpagaria could you please do a quick review if you have time? I have added the two classes, TransformersQueryClassifier and SklearnQueryClassifier, as discussed above. However, I derived them from BaseComponent, not from DecisionNode as you pointed out in my earlier PR, as it was not discussed earlier. Also, please ignore most of the other changes in the file, as I ran the black formatter on it.

@tholor I have added the two classes and the necessary details as you mentioned earlier. Could you please host the following model and vectorizer on S3 for question vs. statement classification?

Model: https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_classifier.pickle
Vectorizer: https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_vectorizer.pickle

Although the two links look similar, they are two different objects based on the spaadia dataset. Thanks.

Once the code is in good shape, I'll move on to writing tests and updating the documentation.

@lalitpagaria
Contributor

@shahrukhx01 One suggestion. Please do not club formatting changes with code changes. It is very hard to review the code.

Also can you please add relevant tests.

@shahrukhx01
Contributor Author

I have added the test test_query_keyword_statement_classifier() to test_pipeline.py. Please review it and let me know if anything needs to change. Once this PR is merged, I'll move on to adding documentation.

@tholor
Member

tholor commented Jun 3, 2021

Please make sure that the CI is passing. Right now mypy complains about some types in pipeline.py: https://github.com/deepset-ai/haystack/pull/1099/checks?check_run_id=2731190150

Let me know if you need help here

@shahrukhx01
Contributor Author

@tholor I have added the following patch, could you please run the workflow again:

if isinstance(query_classifier, Path):
    file_url = urllib.request.pathname2url(r"{}".format(query_classifier))
    query_classifier = f"file:{file_url}"

@lalitpagaria
Contributor

I have started.

@lalitpagaria
Contributor

@shahrukhx01 The error is coming up again. I suggest you create a local test environment and also run mypy locally.
See the CI steps: https://github.com/deepset-ai/haystack/blob/master/.github/workflows/ci.yml

Successfully installed mypy-0.812 mypy-extensions-0.4.3 typed-ast-1.4.3 typing-extensions-3.10.0.0
haystack/pipeline.py:768: error: Argument 1 to "urlopen" has incompatible type "Union[Path, str]"; expected "Union[str, Request]"
Found 1 error in 1 file (checked 57 source files)
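
For readers following the type error above: one way to keep the checker satisfied is to convert a Path into a URL string held in a separate str-typed variable before calling urlopen. The sketch below is a minimal illustration under that assumption, not the code that was merged; the helper names to_url and load_pickled are hypothetical.

```python
import pickle
import urllib.request
from pathlib import Path
from typing import Any, Union


def to_url(source: Union[Path, str]) -> str:
    """Return a URL string; local paths become file:// URLs."""
    if isinstance(source, Path):
        # as_uri() requires an absolute path, so resolve() first.
        return source.resolve().as_uri()
    return source


def load_pickled(source: Union[Path, str]) -> Any:
    """Load a pickled object from a local path or a URL (hypothetical helper).

    Only unpickle data from sources you trust: unpickling can
    execute arbitrary code.
    """
    url: str = to_url(source)  # explicitly str-typed, so urlopen never sees a Path
    with urllib.request.urlopen(url) as response:
        return pickle.loads(response.read())
```

Because to_url always returns a str, the Union[Path, str] value never reaches urlopen, whichever kind of source the caller passes.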

@shahrukhx01
Contributor Author

shahrukhx01 commented Jun 3, 2021

@lalitpagaria I have fixed it, and the mypy type check now passes locally. Could you please run the workflow again?

@shahrukhx01
Contributor Author

@tholor CI is passing now. Please share your overall review when you get time. Thanks!

@shahrukhx01
Contributor Author

Hi @tholor, sorry to nudge you again, but could you please comment on this so that we can proceed?

Member

@tholor tholor left a comment

Sure, I already had it on my todo list for today and was kicking off the CI earlier to see if it's passing :)
So far it looks very good to me - I just left one comment about output names.
Could you please also revert all pure formatting changes that you have here? As @lalitpagaria mentioned, it makes it really hard to review and some parts are rather changed for the worse like:
https://github.com/shahrukhx01/haystack/blob/14b7f758886881d281d4a5392c5638b65ea649c3/haystack/pipeline.py#L159-L163
So I would not like to merge those formatting changes into master.

is_question: bool = self.query_classifier.predict(query_vector)[0]
if is_question:
    return (kwargs, "output_1")
else:

Member

Shall we change the output names here from "output_1" -> "semantic_query" and "output_2" -> "keywords"?
I think this will help a lot when you stick pipelines together, as you can then refer to .keywords as the input:

    pipeline.add_node(name="elastic", component=elastic_retriever, inputs=["SkQueryKeywordQuestionClassifier.keywords"])

Can you quickly test whether this already works as expected, or whether anything in our pipeline class is currently blocking it? If yes, we can add two arguments to init (output_1_name, output_2_name) with the above defaults. This would allow users to switch them easily to their own class names (e.g. when using the question vs. statement model).

Contributor Author

@tholor the output name follows a specific format: the prefix always has to be "output_", and the full format is "output_{integer}", so I can't change the returned output name there. I have added those two arguments to init instead. Please let me know if that's what you wanted.

Member

Ok, my bad, sorry. I thought we had already relaxed this requirement on the output format.
Then let's stick with your original version (hard-coded output_1 and output_2; no init args) and document in the docstring which output belongs to which category for the available models.
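
To make the agreed convention concrete, here is a minimal sketch of a node that routes a query to "output_1" or "output_2" and documents the mapping in its docstring. It is an illustration only, not the merged Haystack classes: QueryRouter and naive_is_question are hypothetical names, and the question-mark heuristic merely stands in for a trained model.

```python
from typing import Any, Callable, Dict, Tuple


class QueryRouter:
    """Route a query to one of two outputs (hypothetical sketch).

    output_1: question / natural-language query
    output_2: keyword query
    """

    def __init__(self, is_question: Callable[[str], bool]) -> None:
        # Any callable that maps a query string to True/False can be plugged in.
        self.is_question = is_question

    def run(self, **kwargs: Any) -> Tuple[Dict[str, Any], str]:
        # Pass the inputs through unchanged and pick the outgoing edge.
        if self.is_question(kwargs["query"]):
            return kwargs, "output_1"
        return kwargs, "output_2"


def naive_is_question(query: str) -> bool:
    # Toy heuristic standing in for a trained classifier.
    return query.strip().endswith("?")
```

With the mapping written in the docstring, whoever wires the pipeline can look up which branch receives questions without renaming the outputs.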

Member

@oryx1729 do you think we can (in general) relax the naming requirements for node outputs easily or do you see bigger drawbacks?

@tholor
Member

tholor commented Jun 7, 2021

Ah and one more thing: we are missing documentation here in terms of a "Usage page" or an additional section in our pipeline tutorial. I am also fine with adding this in a separate PR, but it should be done directly afterwards as we will forget about it otherwise.

@shahrukhx01
Contributor Author

@tholor I can create detailed documentation and a Colab tutorial for this, but I'll need some time, so I will do it in a separate PR. I have opened #1155 specifically for this so that we don't forget about it. You can make me the assignee for that :)

@shahrukhx01
Contributor Author

@tholor Overall, I have reverted the formatting changes and added the output names. Please let me know if anything else needs to be added or changed.

Member

@tholor tholor left a comment

Seems ready to merge once we have reverted output_1_name and output_2_name (see comment)

@shahrukhx01
Contributor Author

@tholor I have reverted the changes, and updated the docstring.

Contributor

@lalitpagaria lalitpagaria left a comment

LGTM
Thank you @shahrukhx01 for your contribution.

@lalitpagaria lalitpagaria changed the title WIP: add query baseline query classifier updated Add query baseline query classifier updated Jun 8, 2021
@tholor
Member

tholor commented Jun 8, 2021

@shahrukhx01 Just adjusted the docstring for the SklearnQueryClassifier to make usage a bit more explicit.
Could you adjust the docstring for the transformers class in a similar way, please?

@shahrukhx01
Contributor Author

@tholor I have updated the second docstring to match your style.

@tholor tholor changed the title Add query baseline query classifier updated Add QueryClassifier incl. baseline models Jun 8, 2021
@tholor
Member

tholor commented Jun 8, 2021

Did a few minor, last-mile adjustments - seems now ready to be merged :)

@tholor tholor merged commit 545c625 into deepset-ai:master Jun 8, 2021
@shahrukhx01
Contributor Author

@tholor do you have any ideas on how these baseline models can be improved? Is there a dataset that we can use for augmentation, or do we have to synthetically generate adversarial examples for the training set?

@tholor tholor mentioned this pull request Jun 13, 2021