Skip to content

Conversation

@jzonthemtn
Copy link
Contributor

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

@jzonthemtn
Copy link
Contributor Author

This PR adds initial support for long documents by splitting them up into sections. (The size of the splits and size of the overlap of splits can be set in the InferenceOptions.)

@kinow
Copy link
Member

kinow commented Aug 8, 2022

I'm leaving this one in my inbox to read the issue & code later with more calm 👍 feel free to ping me if I haven't reviewed it by next week. Thanks Jeff!

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One typo in a public method. But the rest looks good. Thanks for adding a link to the post where you got the idea to split it into 200 length chunks with 50 overlapping words 👍

* Calculates the document classification scores by averaging the scores for
* all individual parts of a document.
*/
public class AverageClassifcationScoringStrategy implements ClassificationScoringStrategy {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in class name?

s/AverageClassifcationScoringStrategy/AverageClassificationScoringStrategy


}

return averages;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For one second I thought this was simply calculating the average of arrays, and was going to suggest to use streams and other Java 8+ methods… then I realized it was doing something more elaborate. 👍 no need to change anything here then! 👍

}

} catch (OrtException ex) {
throw new RuntimeException("Error performing namefinder inference: " + ex.getMessage(), ex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 better than the old System.err without the rest of the stack trace.

// In this article as the paper suggests, we are going to segment the input into smaller text and feed
// each of them into BERT, it means for each row, we will split the text in order to have some
// smaller text (200 words long each)
// https://medium.com/analytics-vidhya/text-classification-with-bert-using-transformers-for-long-text-inputs-f54833994dfd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀 👍

@jzonthemtn
Copy link
Contributor Author

Great, thanks! @kinow

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@jzonthemtn jzonthemtn merged commit 2d4ad08 into apache:master Aug 15, 2022
@jzonthemtn jzonthemtn deleted the OPENNLP-1374 branch August 15, 2022 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants