Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FARMClassifier node for Document Classification #1265

Merged
merged 6 commits into from Jul 13, 2021

Conversation

julian-risch
Copy link
Member

@julian-risch julian-risch commented Jul 9, 2021

Proposed changes:

  • Add FARM classification node that enriches a List of Documents with class probabilities in meta field

**Status **:

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Implement evaluation (or remove methods copied from Ranker for now)
  • Updated documentation

@julian-risch julian-risch changed the title WIP: Add FARM classification node Add FARMClassifier node for Document Classification Jul 9, 2021
@julian-risch julian-risch marked this pull request as ready for review July 9, 2021 15:26
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool addition! I can see many use cases at query - but also at indexing time. Left a few minor comments.

return wrapper

def print_time(self):
print("Ranker (Speed)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left overs from "Ranker copy"

return_preds: bool = False,
) -> dict:
"""
Performs evaluation of the Ranker.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the eval method from the base class if we don't have it implemented in any child class

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=classifier, name="Classifier", inputs=["ESRetriever"])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it actually also work to use this node in an indexing pipeline? I mean something like FileConverter->Preprocessor->Classifier->DocStore

So we would basically append meta data to the docs at indexing time...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't have to be part of this PR if it requires bigger changes, but maybe you can document what's missing for that use case and create a separate issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue here: #1281


- Take a plain language model (e.g. `bert-base-cased`) and train it for TextClassification
- Take a TextClassification model and fine-tune it for your domain

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some info about the expected format of the train file (csv, what columns ...)

haystack/classifier/farm.py Outdated Show resolved Hide resolved
dev_split=dev_split,
test_filename=test_filename,
data_dir=Path(data_dir),
delimiter="\t"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should also add the delimiter as an option to the init?

"""
Use loaded classification model to classify the supplied list of Document.

Returns list of Document enriched with classification.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add here the info where the classification result is stored (Document.meta["classification"])

Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@julian-risch julian-risch merged commit 4e6f7f3 into master Jul 13, 2021
@julian-risch julian-risch deleted the classification-node branch July 13, 2021 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants