WIP: add query baseline query classifier and boilerplate in pipeline #1083

shahrukhx01 · 2021-05-22T11:50:59Z

Proposed changes:

Query classifier baseline and its boilerplate in the pipeline

Status (please check what you already did):

[x] First draft (up for discussions & feedback)
[ ] Final code
[ ] Added tests
[ ] Updated documentation

Discussion Points:
I have added the baseline model. However:

Currently, I'm reading the models from raw files temporarily hosted on GitHub. I'd need some direction on replacing these GitHub placeholder files with S3-hosted models for classification.
Could you please comment on how we are going to extend the QueryClassifier class, since I intend to add a fine-tuned transformer model for the same purpose? Should we then completely replace the sklearn-based classifier or keep both?
Please point out any other major issue that I have missed.

PS:
This is my first PR here, so please let me know about any contribution guidelines I might have missed. I would also appreciate any instruction manuals for contributors that would help me get started with the codebase quickly, since I'd like to contribute actively to haystack in general. Thanks!

shahrukhx01 · 2021-05-22T11:53:08Z

@tholor please let me know how to proceed with this.

lalitpagaria · 2021-05-24T07:05:43Z

haystack/pipeline.py

+        outgoing_edges = 2
+        query_vectorizer = pickle.load(
+            urllib.request.urlopen(
+                "https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_vectorizer.pickle"


I think the deepset team can host these models on their S3 (@tholor WDYT?).
Also, it is better to pass the model via the constructor.

tholor · 2021-05-26

Yep, I agree. I will upload it to our S3.

tholor · 2021-05-26

Done. You can find it at https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/model.pickle

Also added a tiny readme: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/readme.txt

shahrukhx01 · 2021-05-26

@lalitpagaria thanks for your feedback.
@tholor could you please also upload the TF-IDF vectorizer pickle used for feature extraction to S3? https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_vectorizer.pickle

tholor · 2021-05-26

Done: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/vectorizer.pickle
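
As a rough illustration of the two suggestions in this thread (load the pickles from a hosted URL, and pass the model in via the constructor), here is a minimal sketch. The names `load_pickle`, `QueryClassifierSketch` and `from_urls`, and the mapping of label 1 to `output_1`, are assumptions made for illustration; they are not taken from the PR. Unpickling can execute arbitrary code, so this only makes sense for trusted URLs.

```python
import pickle
import urllib.request
from typing import Any, Dict, Tuple


def load_pickle(url: str) -> Any:
    """Download and unpickle an object (hypothetical helper).

    Only call this with URLs you trust: unpickling can run arbitrary code.
    """
    with urllib.request.urlopen(url) as response:
        return pickle.loads(response.read())


class QueryClassifierSketch:
    """Routes a query to one of two outgoing edges.

    The vectorizer and classifier are injected through the constructor, so
    the node itself does not hard-code where the models are hosted.
    """

    outgoing_edges = 2

    def __init__(self, vectorizer: Any, classifier: Any) -> None:
        # Assumption: both objects follow the scikit-learn API
        # (`transform` on the vectorizer, `predict` on the classifier).
        self.vectorizer = vectorizer
        self.classifier = classifier

    @classmethod
    def from_urls(cls, vectorizer_url: str, classifier_url: str) -> "QueryClassifierSketch":
        """Convenience constructor that fetches both pickles first."""
        return cls(load_pickle(vectorizer_url), load_pickle(classifier_url))

    def run(self, **kwargs: Any) -> Tuple[Dict[str, Any], str]:
        features = self.vectorizer.transform([kwargs["query"]])
        label = self.classifier.predict(features)[0]
        # Assumption: label 1 goes to the first edge, anything else to the second.
        edge = "output_1" if label == 1 else "output_2"
        return kwargs, edge
```

With the S3 files linked above, usage might look like `QueryClassifierSketch.from_urls(vectorizer_url, model_url)`, assuming both pickles are fitted scikit-learn objects. Tests can pass small in-memory stand-ins to the constructor instead of downloading anything.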

lalitpagaria · 2021-05-24T07:07:28Z

haystack/pipeline.py

@@ -593,6 +595,29 @@ def run(self, **kwargs):
        return kwargs, "output_1"


+class QueryClassifier:


I think it should be a subclass of BaseComponent.

lalitpagaria · 2021-05-24T07:15:53Z

@shahrukhx01 Thank you for the PR.
I did a quick review; here are my comments:

> Currently, I'm reading models from temporarily hosted raw files on Github, I'd need some direction in replacing GitHub placeholder files with S3 hosted models for classification.

Yes, I think the deepset team can host these models on their S3.

> Could you please comment on how we are going to extend the QueryClassifier class since I intend to add a transformer finetuned model for the same purpose. Should we then completely replace sklearn based classifier or keep both?

In this case, I think:

There should be a DecisionNode (a kind of switch-case) derived from BaseComponent, like JoinDocuments, which uses a BaseClassifier to route each incoming request to one of its outgoing directions.
There should be a BaseClassifier base class, and then we can have SklearnQueryClassifier and TransformerQueryClassifier.
@tholor What do you think of this design?
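
One possible reading of this design, as a hedged sketch rather than the implementation that was eventually merged: `BaseComponent` below is a local stand-in, not haystack's actual class, and the `predict` signature (returning a 0-based edge index) is an assumption made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class BaseComponent(ABC):
    """Local stand-in for haystack's BaseComponent (assumption: nodes expose
    `outgoing_edges` and a `run` method returning (output, edge_name))."""

    outgoing_edges: int = 1

    @abstractmethod
    def run(self, **kwargs: Any) -> Tuple[Dict[str, Any], str]:
        ...


class BaseClassifier(ABC):
    """Hypothetical interface shared by all query classifiers."""

    @abstractmethod
    def predict(self, query: str) -> int:
        """Return the 0-based index of the outgoing edge for `query`."""


class SklearnQueryClassifier(BaseClassifier):
    """Wraps a fitted vectorizer and model that follow the scikit-learn API."""

    def __init__(self, vectorizer: Any, model: Any) -> None:
        self.vectorizer = vectorizer
        self.model = model

    def predict(self, query: str) -> int:
        features = self.vectorizer.transform([query])
        return int(self.model.predict(features)[0])


class DecisionNode(BaseComponent):
    """Switch-like node: delegates the routing decision to a BaseClassifier."""

    outgoing_edges = 2

    def __init__(self, classifier: BaseClassifier) -> None:
        self.classifier = classifier

    def run(self, **kwargs: Any) -> Tuple[Dict[str, Any], str]:
        edge = self.classifier.predict(kwargs["query"])
        if not 0 <= edge < self.outgoing_edges:
            raise ValueError(f"classifier returned invalid edge index {edge}")
        # Edge names are 1-based, matching the "output_1" seen in the diff.
        return kwargs, f"output_{edge + 1}"
```

A TransformerQueryClassifier would implement the same `predict` method, so under this layout the DecisionNode would not need to change when the underlying model is swapped.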

> Any other major pointer that I missed.

It would be better to add code or a script to train and benchmark the sklearn classifier.

add query baseline query classifier and boilerplate

0f93c33

shahrukhx01 changed the title ~~add query baseline query classifier and boilerplate in pipeline~~ WIP: add query baseline query classifier and boilerplate in pipeline May 22, 2021

lalitpagaria reviewed May 24, 2021

shahrukhx01 mentioned this pull request May 26, 2021: Add QueryClassifier incl. baseline models #1099 (merged)

shahrukhx01 closed this May 26, 2021

