OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, Catalan, and Polish) #516
 * {@link #DEFAULT_ALPHANUMERIC} pattern will be returned.
 * @return The alphanumeric {@link Pattern} for the language, or the default pattern.
 */
public Pattern getAlphanumeric(String languageCode) {
At some point would it make sense to have languageCode be an enum instead of a string?
Feel free to open a Jira epic (or single issue?) for this improvement and share a list of languages most widely used by OpenNLP's users. As a start, this Factory might cover most.
A good idea I think, +1 to both the enum and the jira.
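To make the enum idea concrete, here is a minimal sketch of what it could look like. It is not part of this PR or of OpenNLP: the names `TokenizerLanguage`, `fromCode`, and `alphanumeric` are hypothetical, the "es" code and the default pattern are assumptions, and only the Spanish and Catalan patterns are copied from the diff in this conversation.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: ties each language to its codes and its pattern,
// so callers no longer compare raw strings one by one.
enum TokenizerLanguage {
  // Patterns copied from the diff in this PR; the "es" code is an assumption.
  SPANISH(Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$"), "es", "spa"),
  CATALAN(Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$"), "ca", "cat"),
  // Assumed stand-in for DEFAULT_ALPHANUMERIC; the real value may differ.
  DEFAULT(Pattern.compile("^[A-Za-z0-9]+$"));

  private final Pattern alphanumeric;
  private final String[] codes;

  TokenizerLanguage(Pattern alphanumeric, String... codes) {
    this.alphanumeric = alphanumeric;
    this.codes = codes;
  }

  Pattern alphanumeric() {
    return alphanumeric;
  }

  // Falls back to DEFAULT for unknown codes, mirroring getAlphanumeric(String).
  static TokenizerLanguage fromCode(String code) {
    for (TokenizerLanguage lang : values()) {
      for (String c : lang.codes) {
        if (c.equals(code)) {
          return lang;
        }
      }
    }
    return DEFAULT;
  }
}
```

With something like this, `TokenizerLanguage.fromCode("cat").alphanumeric()` would replace the chain of `equals` checks, and a string-based overload could stay around for backwards compatibility.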
Just going to test something and see if I can suggest another language, then will review it 👍
kinow
left a comment
I wanted to add Catalan (which I am studying next, after Spanish) and found the list of characters. I also wanted to confirm whether the l with middle dot (ŀ) character is still used; it looks like ŀl (that l with middle dot followed by an l) has been replaced by l·l.
diff --git a/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java b/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
index 3099da73..5e7602a2 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
@@ -44,6 +44,9 @@ public class Factory {
// https://www.fundeu.es/consulta/tilde-en-la-y-y-griega-o-ye-24786/
private static final Pattern SPANISH = Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$");
+ // From: https://en.wikipedia.org/wiki/Catalan_orthography#Spelling_patterns
+ private static final Pattern CATALAN = Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");
+
/**
* Gets the alphanumeric pattern for a language.
*
@@ -71,6 +74,9 @@ public class Factory {
if ("nl".equals(languageCode) || "nld".equals(languageCode) || "dut".equals(languageCode)) {
return DUTCH;
}
+ if ("ca".equals(languageCode) || "cat".equals(languageCode)) {
+ return CATALAN;
+ }
return DEFAULT_ALPHANUMERIC;
}
I wanted to confirm it works OK, but it looks like the Factory is only used superficially in tests (they check the pattern, etc., without tokenizing anything) and when you train a tokenizer model.
I think we only have a stemmer for Catalan at the moment -
opennlp/opennlp-tools/src/main/java/opennlp/tools/stemmer/snowball/catalanStemmer.java
Line 42 in 1b5142b
public class catalanStemmer extends opennlp.tools.stemmer.snowball.AbstractSnowballStemmer {
What do you think about adding Catalan too, @mawiesne ? I can move it to a follow-up issue if that sounds better.
Cheers
Bruno
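Since the Factory tests only check the pattern superficially, a quick standalone check like the one below can help confirm the behaviour of the proposed Catalan pattern. This is a minimal sketch: the class `CatalanPatternCheck` and the sample words are made up for illustration, and only the regex itself is copied from the diff above.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying out the proposed pattern; not OpenNLP code.
class CatalanPatternCheck {
  // Copied verbatim from the diff above.
  static final Pattern CATALAN =
      Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");

  static boolean isAlphanumeric(String token) {
    return CATALAN.matcher(token).matches();
  }

  public static void main(String[] args) {
    // Illustrative tokens: Catalan words, digits, and a Spanish word with ñ.
    String[] tokens = {"però", "façana", "col·lecció", "2023", "niño"};
    for (String token : tokens) {
      System.out.println(token + " -> " + isAlphanumeric(token));
    }
  }
}
```

As written, the character class does not contain the middle dot (·), so a token such as "col·lecció" is not matched as a whole, and neither is "niño" because ñ is not in the Catalan set. Whether the middle dot should be added to the pattern is a separate question for the tokenizer's intended behaviour.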
We can integrate Catalan with this PR, I think. The issue title reads as open-ended to me. Maybe I'll add a Polish pattern as well, later today. However:
…lian, ...)
- adds Spanish, Italian, Catalan, and Polish alphabet regex patterns to `...lang.Factory`
- adjusts German pattern to include `é` and `É` to cover established loan words such as "Café" or "Cuvée"
- adjusts `TokenizerFactoryTest` to use langCode "eng" instead of "spa" as Spanish will (now) return a specialized pattern
ec19211 to 0e024a8
@kinow I added the Catalan and Polish regex. Please share your feedback if we are good to 🚀 with this PR.
kinow
left a comment
Thanks for adding these @mawiesne ! One minor comment (more a nit-pick, sorry!), but looks good to me otherwise.
// From: https://en.wikipedia.org/wiki/Polish_alphabet
// https://pl.wikipedia.org/wiki/Alfabet_polski
private static final Pattern POLISH = Pattern.compile("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$");
Just my OCD here, @mawiesne , but could we keep the same order for lower case and upper case? 😬
s/A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ/A-Za-z0-9żźćąśęłóńŻŹĆĄŚĘŁÓŃ (I was reading the upper case as "alphanum and Z Z Caselon", and thought it was an easy way to memorize it, so went with that for the lower case chars too, but we can change it if that makes more sense)
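The order of characters inside a regex character class does not affect what it matches, so the reordering is purely cosmetic. A minimal sketch to double-check that the suggested lower-case run contains the same letters as the original and lines up with the upper-case run (the class `PolishOrderCheck` and its helpers are hypothetical; the three character runs are copied from the comment above):

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; only the three character runs come from the review.
class PolishOrderCheck {
  static final String LOWER_ORIGINAL = "żźćńółęąś";
  static final String LOWER_REORDERED = "żźćąśęłóń";
  static final String UPPER = "ŻŹĆĄŚĘŁÓŃ";

  // Collects the distinct characters of a string, ignoring their order.
  static Set<Character> chars(String s) {
    return s.chars().mapToObj(c -> (char) c).collect(Collectors.toSet());
  }

  // Builds the Polish pattern with the given lower-case run.
  static Pattern polish(String lower) {
    return Pattern.compile("^[A-Za-z0-9" + lower + UPPER + "]+$");
  }

  public static void main(String[] args) {
    // Same set of letters in both lower-case orderings?
    System.out.println(chars(LOWER_ORIGINAL).equals(chars(LOWER_REORDERED)));
    // Does the reordered run upper-case to exactly the upper-case run?
    System.out.println(LOWER_REORDERED.toUpperCase(Locale.ROOT).equals(UPPER));
    // Both patterns should agree on a sample word.
    System.out.println(polish(LOWER_ORIGINAL).matcher("żółć").matches()
        == polish(LOWER_REORDERED).matcher("żółć").matches());
  }
}
```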
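Change
- adds Spanish, Italian, Catalan, and Polish alphabet regex patterns to `...lang.Factory`
- adjusts German pattern to include `é` and `É` to cover established loan words such as "Café" or "Cuvée"
- adjusts `TokenizerFactoryTest` to use langCode "eng" instead of "spa" as Spanish will (now) return a specialized pattern
Tasks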
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.