Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add query baseline query classifier and boilerplate in pipeline #1083

Closed
wants to merge 1 commit into from
Closed

WIP: add query baseline query classifier and boilerplate in pipeline #1083

wants to merge 1 commit into from

Conversation

shahrukhx01
Copy link
Contributor

Proposed changes:

  • Query classifier baseline and its boilerplate in the pipeline

Status (please check what you already did):

  • [*] First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

Discussion Points:
I have added the baseline model, however,

  • Currently, I'm reading models from temporarily hosted raw files on Github, I'd need some direction in replacing GitHub placeholder files with S3 hosted models for classification.
  • Could you please comment on how we are going to extend the QueryClassifier class since I intend to add a transformer finetuned model for the same purpose. Should we then completely replace sklearn based classifier or keep both?
  • Any other major pointer that I missed.

Issue: Linked Issue

PS:
This is my first PR on here, please let me know any contribution guideline that I might have missed. Also, any instructions manuals for contributors, which would help me get started with the codebase quickly since I'd like to actively contribute to the haystack in general. Thanks!

@shahrukhx01
Copy link
Contributor Author

@tholor please let me know how to proceed with this.

@shahrukhx01 shahrukhx01 changed the title add query baseline query classifier and boilerplate in pipeline WIP: add query baseline query classifier and boilerplate in pipeline May 22, 2021
outgoing_edges = 2
query_vectorizer = pickle.load(
urllib.request.urlopen(
"https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_vectorizer.pickle"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Deepset team host these model on their s3 (@tholor WDYT?)
Also It is better to pass model via constructor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I agree. I will upload it to our s3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lalitpagaria thanks for your feedback.
@tholor could you also please upload the TF-IDF vectorizer pickle for feature extraction on S3. https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/query_vectorizer.pickle

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -593,6 +595,29 @@ def run(self, **kwargs):
return kwargs, "output_1"


class QueryClassifier:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be subclass of BaseComponent

@lalitpagaria
Copy link
Contributor

@shahrukhx01 Thank you for PR.
I did quick review here is my comments -

Currently, I'm reading models from temporarily hosted raw files on Github, I'd need some direction in replacing GitHub placeholder files with S3 hosted models for classification.

Yes I think deepset team can host these model on their s3

Could you please comment on how we are going to extend the QueryClassifier class since I intend to add a transformer finetuned model for the same purpose. Should we then completely replace sklearn based classifier or keep both?

In this case think

  • there should be DecisionNode (kind of switch case) derived from BaseComponent like JoinDocuments, which will use BaseClassifier to route incoming request to other direction.
  • There should be BaseClassifier base class and then we can have SklearnQueryClassifier and TransformerQueryClassifier
    @tholor What do you think of this design.

Any other major pointer that I missed.

  • Better to add code/script to train and benchmark Sklearn classifier

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants