New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add QueryClassifier incl. baseline models #1099
Conversation
@lalitpagaria @tholor I have made the suggested changes, please let me know about your reviews. |
@shahrukhx01 Thanks for working on it. Can you please add relevant test cases and example/document/tutorial |
Finally had a look at the models and shared my feedback here Now, let's talk about the implementation. What I'd suggest as next steps:
|
@lalitpagaria could you please do a quick review if you have time, I have added the two classes @tholor I have added the two classes and necessary details as you mentioned earlier, Could you please host the following model and vectorizer on S3 for question vs statement classiifcation? Model: https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_classifier.pickle Although they point to same links, however, these are two different objects based on spaadia dataset. Thanks. Once the code is in good shape, I'll move to writing tests and add update documentation. |
@shahrukhx01 One suggestion. Please do not club formatting changes with code changes. It is very hard to review the code. Also can you please add relevant tests. |
I have added this test |
Please make sure that the CI is passing. Right now mypy complains about some types in pipelines.py: https://github.com/deepset-ai/haystack/pull/1099/checks?check_run_id=2731190150 Let me know if you need help here |
@tholor I have added the following patch, could you please run the workflow again: if isinstance(query_classifier, Path):
file_url = urllib.request.pathname2url(r"{}".format(query_classifier))
query_classifier = f"file:{file_url}" |
I have started. |
@shahrukhx01 Again error it coming. I suggest you to create local test env and also try running mypy locally as well.
|
@lalitpagaria I have fixed it, and mypy is passing type test on locally, could you please run workflow now |
@tholor CI is passing now. Please give your overall review when you get time to see this. thanks! |
hi @tholor, Sorry to nudge you again. But could you please comment on this, so that we can proceed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I already had it on my todo list for today and was kicking off the CI earlier to see if it's passing :)
It so far looks very good to me - just left one comment about output names.
Could you please also revert all pure formatting changes that you have here? As @lalitpagaria mentioned, it makes it really hard to review and some parts are rather changed for the worse like:
https://github.com/shahrukhx01/haystack/blob/14b7f758886881d281d4a5392c5638b65ea649c3/haystack/pipeline.py#L159-L163
So I would not like to merge those formatting changes into master.
is_question: bool = self.query_classifier.predict(query_vector)[0] | ||
if is_question: | ||
return (kwargs, "output_1") | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we change the output name here from "output_1" -> "semantic_query" and "output_2" -> "keywords"?
I think this will be help a lot when you stick pipelines together and can rather refer to .keywords
as the input:
pipeline.add_node(name="elastic",component=elastic_retriever, inputs["SkQueryKeywordQuestionClassifier.keywords"])
Can you test quickly if this works already as expected or if anything in our pipeline class is currently blocking this? If yes, we can add two arguments to init (output_1_name, output_2_name) with the above defaults. This would allow users to switch it easily to their class names (e.g. when using the question vs statement model).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tholor the output follows a specific format where prefix always has to be "output_" and full format is something like "output_{integer}", so I can't change the output to that there, I have added those two in init, please let me know if that's the thing you wanted there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok my bad, sorry. I thought we already relaxed this requirement about the output format.
Then let's stick with your original version (hard coded output_1 and output_2; no init args) and document in the docstring which output belongs to which category for the available models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oryx1729 do you think we can (in general) relax the naming requirements for node outputs easily or do you see bigger drawbacks?
Ah and one more thing: we are missing documentation here in terms of a "Usage page" or an additional section in our pipeline tutorial. I am also fine with adding this in a separate PR, but it should be done directly afterwards as we will forget about it otherwise. |
@tholor I can create detailed documentation, and a tutorial for this on Colab, however, I'd need some time. For this I will create a separate PR. I have opened #1155 for this specifically, so that we don't forget about this. You can make me the assignee for that :) |
@tholor Overall, I have reverted formatting, added the output_names, please let me know if anything else needs to be added/changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems ready to merge once we reverted output_1_name
and output_2_name
(see comment)
…/haystack into add_query_classifier merge with master
@tholor I have reverted the changes, and updated the docstring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you @shahrukhx01 for your contribution.
@shahrukhx01 Just adjusted the docstring for the SklearnQueryClassifier to make usage a bit more explicit. |
@tholor updated the second docstring as per your style. |
Did a few minor, last-mile adjustments - seems now ready to be merged :) |
@tholor do you have any ideas on how these baseline models can be improved? is there any dataset that we can use for augmentation? or do we have to synthetically generate adversarial examples in the train set? |
Proposed changes:
Status (please check what you already did):
Discussion Points:
I have made the changes as per the reviews in the other PR.
Issue: Linked Issue