OPENNLP-1374: Initial support for long documents #422
This PR adds initial support for long documents by splitting them up into sections. (The size of the splits and size of the overlap of splits can be set in the |
|
I'm leaving this one in my inbox so I can read the issue and code more calmly later 👍 Feel free to ping me if I haven't reviewed it by next week. Thanks, Jeff!
kinow
left a comment
One typo in a public class name, but the rest looks good. Thanks for adding a link to the post where you got the idea to split the text into 200-word chunks with 50 overlapping words 👍
| * Calculates the document classification scores by averaging the scores for | ||
| * all individual parts of a document. | ||
| */ | ||
| public class AverageClassifcationScoringStrategy implements ClassificationScoringStrategy { |
Typo in class name?
s/AverageClassifcationScoringStrategy/AverageClassificationScoringStrategy
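For readers following the thread, the averaging idea in the Javadoc above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `AverageScoresSketch` and its `average` method are hypothetical names, not the PR's `ClassificationScoringStrategy` API, and the PR's real implementation may well do more than this.

```java
import java.util.List;

// Hypothetical helper, not part of OpenNLP: averages per-category scores
// across the parts (chunks) of one document.
final class AverageScoresSketch {

    // Each element of chunkScores holds one score per category for one chunk.
    static double[] average(List<double[]> chunkScores) {
        if (chunkScores.isEmpty()) {
            throw new IllegalArgumentException("At least one chunk is required");
        }
        int categories = chunkScores.get(0).length;
        double[] averages = new double[categories];

        // Sum the scores category by category.
        for (double[] scores : chunkScores) {
            if (scores.length != categories) {
                throw new IllegalArgumentException("All chunks must score the same categories");
            }
            for (int c = 0; c < categories; c++) {
                averages[c] += scores[c];
            }
        }

        // Divide each sum by the number of chunks.
        for (int c = 0; c < categories; c++) {
            averages[c] /= chunkScores.size();
        }
        return averages;
    }
}
```

With two chunks scored `{0.2, 0.8}` and `{0.6, 0.4}`, the per-category means are roughly `{0.4, 0.6}`, up to floating-point rounding.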
| } | ||
| return averages; |
For one second I thought this was simply calculating the average of arrays, and was going to suggest to use streams and other Java 8+ methods… then I realized it was doing something more elaborate. 👍 no need to change anything here then! 👍
| } | ||
| } catch (OrtException ex) { | ||
| throw new RuntimeException("Error performing namefinder inference: " + ex.getMessage(), ex); |
👍 better than the old System.err without the rest of the stack trace.
| // In this article as the paper suggests, we are going to segment the input into smaller text and feed | ||
| // each of them into BERT, it means for each row, we will split the text in order to have some | ||
| // smaller text (200 words long each) | ||
| // https://medium.com/analytics-vidhya/text-classification-with-bert-using-transformers-for-long-text-inputs-f54833994dfd |
👀 👍
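To make the splitting approach discussed here concrete, below is a minimal sketch of breaking a text into fixed-size word windows that overlap. The class `DocumentSplitterSketch`, its `split` method, and the whitespace tokenization are assumptions for illustration only; the PR's own splitting code and parameter names may differ. The 200-word size and 50-word overlap are the values mentioned in this conversation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: splits a text into overlapping word windows.
final class DocumentSplitterSketch {

    // chunkSize and overlap are counted in whitespace-separated words.
    static List<String> split(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");

        // Each window starts (chunkSize - overlap) words after the previous one.
        int step = chunkSize - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last window already reaches the end of the text
            }
        }
        return chunks;
    }
}
```

With `chunkSize = 200` and `overlap = 50`, a 450-word text yields three chunks covering words 0 to 199, 150 to 349, and 300 to 449. Each chunk can then be classified separately and the per-chunk scores combined.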
|
Great, thanks! @kinow |
kinow
left a comment
🎉
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.