OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, Catalan, and Polish) #516

mawiesne · 2023-03-04T12:41:03Z

Change

adds Spanish, Italian, Catalan, and Polish alphabet regex patterns to ...lang.Factory
adjusts German pattern to include é and É to cover established loan words such as "Café" or "Cuvée"
adjusts TokenizerFactoryTest to use langCode "eng" instead of "spa" as Spanish will (now) return a specialized pattern

Tasks

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

jzonthemtn · 2023-03-04T17:43:11Z

opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java

   *                     {@link #DEFAULT_ALPHANUMERIC} pattern will be returned.
   * @return The alphanumeric {@link Pattern} for the language, or the default pattern.
   */
  public Pattern getAlphanumeric(String languageCode) {


jzonthemtn: At some point would it make sense to have languageCode be an enum instead of a string?

mawiesne: Feel free to open a Jira epic (or single issue?) for this improvement and share a list of languages most widely used by OpenNLP's users. As a start, this Factory might cover most.

kinow: A good idea I think, +1 to both the enum and the Jira.
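
To make the enum idea concrete, here is a minimal sketch of what it could look like. `TokenizerLanguage`, `fromCode` and `LanguageEnumSketch` are hypothetical names and not part of OpenNLP; the two patterns are the ones quoted in this thread, and the code lists ("es"/"spa", "ca"/"cat") are an assumption for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

class LanguageEnumSketch {
  public static void main(String[] args) {
    // Known code: resolves to a constant that carries its own pattern.
    System.out.println(TokenizerLanguage.fromCode("cat"));
    // Unknown code: empty, so the caller decides which fallback pattern to use.
    System.out.println(TokenizerLanguage.fromCode("xx"));
  }
}

// Hypothetical enum (not an OpenNLP type): each constant bundles the language
// codes it answers to with its alphanumeric pattern.
enum TokenizerLanguage {
  SPANISH("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$", "es", "spa"),
  CATALAN("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$", "ca", "cat");

  private final Pattern alphanumeric;
  private final List<String> codes;

  TokenizerLanguage(String regex, String... codes) {
    this.alphanumeric = Pattern.compile(regex);
    this.codes = Arrays.asList(codes);
  }

  Pattern getAlphanumeric() {
    return alphanumeric;
  }

  // Looks up a constant by any of its codes; empty when the code is unknown.
  static Optional<TokenizerLanguage> fromCode(String languageCode) {
    return Arrays.stream(values())
        .filter(language -> language.codes.contains(languageCode))
        .findFirst();
  }
}
```

Returning an `Optional` keeps the "fall back to the default pattern" decision with the caller, which is one possible design; a String-based overload could stay alongside it for compatibility.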

kinow · 2023-03-04T21:38:53Z

Just going to test something and see if I can suggest another language, then will review it 👍

kinow

I wanted to add Catalan (which I am studying next, after Spanish) and found the list of characters. I also wanted to confirm whether the l with middle dot (ŀ) character is still used; it looks like it was replaced by l.l instead of ŀl (that l with middle dot followed by an l).

diff --git a/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java b/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
index 3099da73..5e7602a2 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
@@ -44,6 +44,9 @@ public class Factory {
   //       https://www.fundeu.es/consulta/tilde-en-la-y-y-griega-o-ye-24786/
   private static final Pattern SPANISH = Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$");
 
+  // From: https://en.wikipedia.org/wiki/Catalan_orthography#Spelling_patterns
+  private static final Pattern CATALAN = Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");
+
   /**
    * Gets the alphanumeric pattern for a language.
    *
@@ -71,6 +74,9 @@ public class Factory {
     if ("nl".equals(languageCode) || "nld".equals(languageCode) || "dut".equals(languageCode)) {
       return DUTCH;
     }
+    if ("ca".equals(languageCode) || "cat".equals(languageCode)) {
+      return CATALAN;
+    }
 
     return DEFAULT_ALPHANUMERIC;
   }

I wanted to confirm that it works, but it looks like the Factory is exercised only superficially in tests (they check the pattern, etc., without parsing anything) and otherwise only when you train a tokenizer model.
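
As a quick sanity check outside the test suite, the Catalan pattern from the diff above can be tried against a few tokens directly. This is only a sketch: `CatalanPatternDemo` and `isAlphanumeric` are made-up names, and the sample words are my own picks, not from the PR.

```java
import java.util.regex.Pattern;

class CatalanPatternDemo {
  // Pattern copied verbatim from the diff above.
  static final Pattern CATALAN =
      Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");

  // True when the whole token consists of characters from the class.
  static boolean isAlphanumeric(String token) {
    return CATALAN.matcher(token).matches();
  }

  public static void main(String[] args) {
    String[] tokens = {"Barcelona", "països", "França", "col·legi", "l'àvia"};
    for (String token : tokens) {
      System.out.println(token + " -> " + isAlphanumeric(token));
    }
  }
}
```

Tokens such as "col·legi" and "l'àvia" do not match, because neither the middle dot nor the apostrophe is part of the character class.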

I think we only have a stemmer for Catalan at the moment:

opennlp/opennlp-tools/src/main/java/opennlp/tools/stemmer/snowball/catalanStemmer.java

Line 42 in 1b5142b

public class catalanStemmer extends opennlp.tools.stemmer.snowball.AbstractSnowballStemmer {

What do you think about adding Catalan too, @mawiesne ? I can move it to a follow-up issue if that sounds better.

Cheers
Bruno

mawiesne · 2023-03-05T07:45:10Z

> What do you think about adding Catalan too, @mawiesne ?

We can integrate Catalan with this PR, I think. The issue title reads as open-ended to me. Maybe I'll add a Polish pattern as well, later today.

However:
@kinow Could you open an issue for adding pattern verification tests per language we include? Each (parameterized) test should include some typical (rare) examples specific to that language.
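
A rough idea of what such per-language verification could look like, written as a plain table-driven check so it runs without any test framework (in the project itself, a JUnit parameterized test would be the natural fit). `PatternVerificationSketch`, `Case` and `countFailures` are hypothetical names; the patterns are the Spanish and Catalan ones quoted earlier in this thread, and the sample words are my own assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PatternVerificationSketch {
  static final Pattern SPANISH =
      Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$");
  static final Pattern CATALAN =
      Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");

  // One row of the table: a pattern, a token, and whether it should match.
  static final class Case {
    final String language;
    final Pattern pattern;
    final String token;
    final boolean expected;

    Case(String language, Pattern pattern, String token, boolean expected) {
      this.language = language;
      this.pattern = pattern;
      this.token = token;
      this.expected = expected;
    }
  }

  static final List<Case> CASES = Arrays.asList(
      new Case("spa", SPANISH, "mañana", true),
      new Case("spa", SPANISH, "pingüino", true),
      new Case("spa", SPANISH, "façana", false),  // ç is not in the Spanish class
      new Case("cat", CATALAN, "façana", true),
      new Case("cat", CATALAN, "països", true),
      new Case("cat", CATALAN, "mañana", false)); // ñ is not in the Catalan class

  // Runs every row and returns how many disagree with the expectation.
  static int countFailures() {
    int failures = 0;
    for (Case c : CASES) {
      boolean actual = c.pattern.matcher(c.token).matches();
      if (actual != c.expected) {
        failures++;
        System.out.println("FAIL " + c.language + " " + c.token);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    System.out.println("failures: " + countFailures());
  }
}
```

Including negative rows (a word that belongs to a neighbouring language) helps catch a pattern that is accidentally too permissive.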

mawiesne · 2023-03-05T09:49:55Z

@kinow I added the Catalan and Polish regex. Please share your feedback if we are good to 🚀 with this PR.

kinow

Thanks for adding these @mawiesne ! One minor comment (more a nit-pick, sorry!), but looks good to me otherwise.

kinow · 2023-03-05T10:10:58Z

opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java

+
+  // From: https://en.wikipedia.org/wiki/Polish_alphabet
+  //       https://pl.wikipedia.org/wiki/Alfabet_polski
+  private static final Pattern POLISH = Pattern.compile("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$");


Just my OCD here, @mawiesne , but could we keep the same order for lower case and upper case? 😬

s/A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ/A-Za-z0-9żźćąśęłóńŻŹĆĄŚĘŁÓŃ/ (I was reading the upper-case part as "alphanum and Z Z Caselon" and thought it was an easy way to memorize it, so I went with the same order for the lower-case chars too, but we can change it if another order makes more sense)
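
Since order inside a character class does not affect matching, the reordering is purely cosmetic. A small sketch to confirm that both lower-case sequences cover the same characters (`PolishClassOrderCheck` and its methods are made-up names for illustration):

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class PolishClassOrderCheck {
  // Collects the distinct code points of a string into a sorted set.
  static Set<Integer> codePoints(String chars) {
    return chars.codePoints()
        .boxed()
        .collect(Collectors.toCollection(TreeSet::new));
  }

  // True when both strings contain exactly the same set of characters.
  static boolean sameCharacters(String a, String b) {
    return codePoints(a).equals(codePoints(b));
  }

  public static void main(String[] args) {
    String current = "żźćńółęąś";   // lower-case order in the reviewed line
    String proposed = "żźćąśęłóń";  // lower-case order mirroring ŻŹĆĄŚĘŁÓŃ
    System.out.println(sameCharacters(current, proposed));
  }
}
```

The same helper could be reused to check that the lower-case and upper-case halves of a class stay in sync when a character is added later.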

kinow · 2023-03-05T10:15:32Z

> @kinow Could you open an issue for adding pattern verification tests per language we include? Each (parameterized) test should include some typical (rare) examples specific to that language.

Done! https://issues.apache.org/jira/browse/OPENNLP-1479

mawiesne requested review from kinow and rzo1 March 4, 2023 12:41

rzo1 approved these changes Mar 4, 2023

jzonthemtn reviewed Mar 4, 2023

jzonthemtn approved these changes Mar 4, 2023

kinow approved these changes Mar 5, 2023

mawiesne force-pushed the OPENNLP-1474_Create_tokenizer_factories_for_other_langs_(Spanish,_Italian,_.) branch from ec19211 to 0e024a8 Compare March 5, 2023 09:47

kinow reviewed Mar 5, 2023

kinow changed the title ~~OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, ...)~~ OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, Catalan, and Polish) Mar 5, 2023

kinow mentioned this pull request Mar 5, 2023

OPENNLP-1480: Quick HTML javadoc fix #518

Merged

10 tasks

mawiesne merged commit 12ec3db into main Mar 6, 2023

mawiesne deleted the OPENNLP-1474_Create_tokenizer_factories_for_other_langs_(Spanish,_Italian,_.) branch March 6, 2023 19:23
