Contributor

@mawiesne mawiesne commented Mar 4, 2023

Change

  • adds Spanish, Italian, Catalan, and Polish alphabet regex patterns to ...lang.Factory
  • adjusts German pattern to include é and É to cover established loan words such as "Café" or "Cuvée"
  • adjusts TokenizerFactoryTest to use langCode "eng" instead of "spa" as Spanish will (now) return a specialized pattern

Tasks

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

@mawiesne mawiesne requested review from kinow and rzo1 March 4, 2023 12:41
* {@link #DEFAULT_ALPHANUMERIC} pattern will be returned.
* @return The alphanumeric {@link Pattern} for the language, or the default pattern.
*/
public Pattern getAlphanumeric(String languageCode) {
Contributor

At some point would it make sense to have languageCode be an enum instead of a string?

Contributor Author

Feel free to open a Jira epic (or single issue?) for this improvement and share a list of languages most widely used by OpenNLP's users. As a start, this Factory might cover most.

Member

@kinow kinow Mar 4, 2023

A good idea I think, +1 to both the enum and the jira.
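For the Jira issue, a minimal sketch of what an enum-based lookup could look like. The enum name `AlphabetLanguage` and the method `patternFor` are hypothetical and not part of OpenNLP. The Spanish, Catalan, and Polish patterns are copied from this PR; the fallback pattern and the extra codes "es", "pl", and "pol" are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical enum, not part of OpenNLP: ties language codes to a pattern.
enum AlphabetLanguage {
  SPANISH("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$", "es", "spa"),
  CATALAN("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$", "ca", "cat"),
  POLISH("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$", "pl", "pol");

  // Assumed fallback; the real DEFAULT_ALPHANUMERIC in Factory may differ.
  static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

  private final Pattern alphanumeric;
  private final String[] codes;

  AlphabetLanguage(String regex, String... codes) {
    this.alphanumeric = Pattern.compile(regex);
    this.codes = codes;
  }

  Pattern alphanumeric() {
    return alphanumeric;
  }

  // Keeps a String-based entry point so existing callers would not break.
  static Pattern patternFor(String languageCode) {
    for (AlphabetLanguage lang : values()) {
      for (String code : lang.codes) {
        if (code.equals(languageCode)) {
          return lang.alphanumeric;
        }
      }
    }
    return DEFAULT_ALPHANUMERIC;
  }

  public static void main(String[] args) {
    System.out.println(patternFor("cat").matcher("façana").matches());
  }
}
```

With this shape, adding a language means adding one enum constant instead of another `if` branch, and unknown codes still fall back to the default pattern.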

Member

kinow commented Mar 4, 2023

Just going to test something and see if I can suggest another language, then will review it 👍

Member

@kinow kinow left a comment

I wanted to add Catalan (which I am studying next, after Spanish) and looked up the list of characters. I also wanted to confirm whether the l with middle dot (ŀ) is still used; it looks like it has been replaced by l.l rather than ŀl (the l with middle dot followed by an l).

diff --git a/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java b/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
index 3099da73..5e7602a2 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java
@@ -44,6 +44,9 @@ public class Factory {
   //       https://www.fundeu.es/consulta/tilde-en-la-y-y-griega-o-ye-24786/
   private static final Pattern SPANISH = Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$");
 
+  // From: https://en.wikipedia.org/wiki/Catalan_orthography#Spelling_patterns
+  private static final Pattern CATALAN = Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");
+
   /**
    * Gets the alphanumeric pattern for a language.
    *
@@ -71,6 +74,9 @@ public class Factory {
     if ("nl".equals(languageCode) || "nld".equals(languageCode) || "dut".equals(languageCode)) {
       return DUTCH;
     }
+    if ("ca".equals(languageCode) || "cat".equals(languageCode)) {
+      return CATALAN;
+    }
 
     return DEFAULT_ALPHANUMERIC;
   }

I wanted to confirm it works OK, but it looks like the Factory is only used superficially in tests (they check the pattern, etc., without parsing anything) and when you train a tokenizer model.
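A quick standalone check of the regex itself, as a sketch only: the class name `CatalanPatternCheck` and the sample words are assumptions chosen for illustration, and the pattern is copied from the diff above. It does not go through the Factory or a tokenizer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class CatalanPatternCheck {
  // Pattern copied verbatim from the proposed diff.
  static final Pattern CATALAN =
      Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");

  static boolean isAlphanumeric(String token) {
    return CATALAN.matcher(token).matches();
  }

  public static void main(String[] args) {
    // Sample words are illustrative; "col·legi" contains the middle dot (U+00B7).
    String[] tokens = {"plaça", "països", "Òmnium", "col·legi"};
    for (String token : tokens) {
      System.out.println(token + " -> " + isAlphanumeric(token));
    }
  }
}
```

The first three words match. "col·legi" does not, because the middle dot is not in the character class; whether that matters for tokenizer training is not something this check can answer.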

I think we only have a stemmer for Catalan at the moment -

public class catalanStemmer extends opennlp.tools.stemmer.snowball.AbstractSnowballStemmer {

What do you think about adding Catalan too, @mawiesne ? I can move it to a follow-up issue if that sounds better.

Cheers
Bruno

Contributor Author

mawiesne commented Mar 5, 2023

> What do you think about adding Catalan too, @mawiesne ?

We can integrate Catalan with this PR, I think. The issue title reads as open-ended to me. Maybe I'll add a Polish pattern as well later today.

However:
@kinow Could you open an issue for adding pattern verification tests per language we include? Each (parameterized) test should include some typical (rare) examples specific to that language.
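As a rough idea of what such per-language checks could look like, here is a table-driven sketch in plain Java. In the project this would more likely be a JUnit parameterized test; that, the class name `PatternVerification`, and the sample words are all assumptions. The patterns are copied from this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical table-driven check, not part of OpenNLP.
class PatternVerification {
  // Patterns copied from this PR, keyed by ISO 639-3 code.
  static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
  // Illustrative words that exercise language-specific letters.
  static final Map<String, List<String>> SAMPLES = new LinkedHashMap<>();

  static {
    PATTERNS.put("spa", Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$"));
    PATTERNS.put("cat", Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$"));
    PATTERNS.put("pol", Pattern.compile("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$"));

    SAMPLES.put("spa", List.of("pingüino", "niño", "Ávila"));
    SAMPLES.put("cat", List.of("plaça", "països", "Òmnium"));
    SAMPLES.put("pol", List.of("żółć", "gęś", "Łódź"));
  }

  // Returns "code:word" for every sample its language pattern rejects.
  static List<String> failures() {
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : SAMPLES.entrySet()) {
      Pattern pattern = PATTERNS.get(entry.getKey());
      for (String word : entry.getValue()) {
        if (!pattern.matcher(word).matches()) {
          failed.add(entry.getKey() + ":" + word);
        }
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    System.out.println("rejected samples: " + failures());
  }
}
```

Keeping the samples in a table makes it cheap to add a row whenever a new language pattern lands.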

@mawiesne mawiesne force-pushed the OPENNLP-1474_Create_tokenizer_factories_for_other_langs_(Spanish,_Italian,_.) branch from ec19211 to 0e024a8 Compare March 5, 2023 09:47
Contributor Author

mawiesne commented Mar 5, 2023

@kinow I added the Catalan and Polish regex. Please share your feedback if we are good to 🚀 with this PR.

Member

@kinow kinow left a comment

Thanks for adding these @mawiesne ! One minor comment (more a nit-pick, sorry!), but looks good to me otherwise.


// From: https://en.wikipedia.org/wiki/Polish_alphabet
// https://pl.wikipedia.org/wiki/Alfabet_polski
private static final Pattern POLISH = Pattern.compile("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$");
Member

Just my OCD here, @mawiesne , but could we keep the same order for lower case and upper case? 😬

s/A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ/A-Za-z0-9żźćąśęłóńŻŹĆĄŚĘŁÓŃ/ (I was reading the upper case as "alphanum and Z Z Caselon" and thought it was an easy way to memorize it, so I went with that order for the lower-case characters too, but we can change it if another order makes more sense.)
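To show that the reordering is purely cosmetic, here is a small sketch comparing the two character lists as sets. The class name `PolishOrderCheck` and its helper are hypothetical; the two strings are copied from the substitution above.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical check, not part of OpenNLP.
class PolishOrderCheck {
  // Polish-specific letters as currently ordered in the PR.
  static final String CURRENT = "żźćńółęąśŻŹĆĄŚĘŁÓŃ";
  // The same letters with lower case following the upper-case order.
  static final String SUGGESTED = "żźćąśęłóńŻŹĆĄŚĘŁÓŃ";

  // Collects the distinct code points of a string.
  static Set<Integer> codePoints(String s) {
    return s.codePoints().boxed().collect(Collectors.toCollection(TreeSet::new));
  }

  static boolean sameLetters() {
    return codePoints(CURRENT).equals(codePoints(SUGGESTED));
  }

  public static void main(String[] args) {
    System.out.println("same letters: " + sameLetters());
  }
}
```

Since order inside a regex character class does not affect what it matches, equal sets mean the two patterns accept exactly the same tokens.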

Member

kinow commented Mar 5, 2023

> @kinow Could you open an issue for adding pattern verification tests per language we include? Each (parameterized) test should include some typical (rare) examples specific to that language.

Done! https://issues.apache.org/jira/browse/OPENNLP-1479

@kinow kinow changed the title OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, ...) OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, Catalan, and Polish) Mar 5, 2023
@kinow kinow mentioned this pull request Mar 5, 2023
@mawiesne mawiesne merged commit 12ec3db into main Mar 6, 2023
@mawiesne mawiesne deleted the OPENNLP-1474_Create_tokenizer_factories_for_other_langs_(Spanish,_Italian,_.) branch March 6, 2023 19:23
4 participants