LUCENE-10312: Add PersianStemmer #540

raminmjj · 2021-12-14T09:09:47Z

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

apply :lucene:analysis:common:spotlessApply
add org.apache.lucene.analysis.fa.PersianStemFilterFactory
fix: Test PersianStemFilterFactory

NightOwl888 · 2022-03-02T06:52:12Z

Hi there. We (the Lucene.NET project) are waiting for approval of this stemmer before we will accept it into our codebase (apache/lucenenet#571). We aren't really sure how analysis components are vetted, so please let us know if there is anything else required for this to be accepted.

mocobeta · 2022-05-05T17:28:53Z

I'm sorry for the late response. I just kicked the CI - I'll take a look.

raminmjj · 2022-05-05T18:52:09Z

Hi Tomoko (@mocobeta).
I added "PersianStem" to the org.apache.lucene.analysis.TokenFilterFactory file.
What should I do to get past this error?
Thank you.

mocobeta · 2022-05-06T01:29:27Z

> I added "PersianStem" in org.apache.lucene.analysis.TokenFilterFactory file.

Yes, that's correct. It has now passed the tests/checks.

mocobeta

Maybe it's worth adding PersianStemFilter to the Javadoc of PersianAnalyzer#createComponents().

mocobeta · 2022-05-06T07:58:49Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemFilter.java

+import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
+
+/**
+ * A {@link TokenFilter} that applies {@link PersianStemmer} to stem Arabic words..


My IDE says "Two consecutive dots"; it looks like a typo.

mocobeta · 2022-05-06T08:08:02Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemFilterFactory.java

+/**
+ * Factory for {@link PersianStemFilter}.
+ *
+ * <pre class="prettyprint">
+ * &lt;fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100"&gt;
+ *   &lt;analyzer&gt;
+ *     &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
+ *     &lt;filter class="solr.PersianNormalizationFilterFactory"/&gt;
+ *     &lt;filter class="solr.PersianStemFilterFactory"/&gt;
+ *   &lt;/analyzer&gt;
+ * &lt;/fieldType&gt;</pre>
+ *
+ * @since 3.1
+ * @lucene.spi {@value #NAME}
+ */


This Solr schema example is obsolete and no longer needed in the Lucene Javadoc; can you please remove the XML? Instead, you can list the parameters like this.
Also, I suppose @since should be 9.2.0 (the next minor release).

PersianStemFilterFactory takes no parameters, so you can just delete <pre>...</pre>.

mocobeta · 2022-05-06T08:20:21Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+
+/**
+ * Stemmer for Persian.
+ *


It'd be worth mentioning which algorithm the stemmer implements, if possible.
For example, see

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/bg/BulgarianStemmer.java

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/hi/HindiStemmer.java

I found that ArabicStemmer does not mention the algorithms or rules it is based on. As @NightOwl888 told me, this PersianStemmer is derived from it, so I'm fine with the Javadoc as is.

mocobeta · 2022-05-06T08:26:39Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+  public static final char ALEF = '\u0627';
+  public static final char HEH = '\u0647';
+  public static final char TEH = '\u062A';
+  public static final char REH = '\u0631';
+  public static final char NOON = '\u0646';
+  public static final char YEH = '\u064A';
+  public static final char ZWNJ = '\u200c';
+
+  public static final char[][] suffixes = {
+    ("" + ALEF + TEH).toCharArray(),
+    ("" + ALEF + NOON).toCharArray(),
+    ("" + TEH + REH + YEH + NOON).toCharArray(),
+    ("" + TEH + REH).toCharArray(),
+    ("" + YEH + YEH).toCharArray(),
+    ("" + YEH).toCharArray(),
+    ("" + HEH + ALEF).toCharArray(),
+    ("" + ZWNJ).toCharArray(),
+  };


Can these constants be private?

Since this is based on the ArabicStemmer where these are public, it would seem odd to make them public in one case and private in the other. Same goes for the stem, stemSuffix and stemPrefix methods.

Thanks for the pointer. I hadn't noticed that.
The original ArabicStemmer was added in 2010 and those public static constants seem unchanged since then. These days it is bad practice to expose constants, variables, or methods unnecessarily. In particular, it isn't safe to expose the suffixes char array: its contents are mutable even though the field is marked final.
Please keep class members private as far as possible. I will open an issue for ArabicStemmer to make those members private.

Gotcha. Should that be expanded to include ArabicNormalizer and PersianNormalizer?

Ah yes, I think so. That is a separate issue; we can improve them later.
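To illustrate the mutability concern raised above, here is a minimal sketch. The class names `ExposedStemmer` and `EncapsulatedStemmer` and the accessor `suffixAt` are hypothetical stand-ins, not Lucene classes; the point is only that `final` fixes the array reference, not the array contents.

```java
final class SuffixVisibilityDemo {
  // Hypothetical stand-in for a stemmer that exposes its suffix table.
  static final class ExposedStemmer {
    public static final char[][] SUFFIXES = {{'\u0627', '\u062A'}}; // ALEF TEH
  }

  // Hypothetical stand-in that keeps the table private and hands out copies.
  static final class EncapsulatedStemmer {
    private static final char[][] SUFFIXES = {{'\u0627', '\u062A'}}; // ALEF TEH

    static char[] suffixAt(int i) {
      return SUFFIXES[i].clone(); // callers get a copy, never the shared array
    }
  }

  public static void main(String[] args) {
    // Compiles and succeeds: any caller can corrupt the shared table.
    ExposedStemmer.SUFFIXES[0][0] = 'x';
    System.out.println(ExposedStemmer.SUFFIXES[0][0] == 'x');

    // Mutating the returned copy leaves the private table untouched.
    char[] copy = EncapsulatedStemmer.suffixAt(0);
    copy[0] = 'x';
    System.out.println(EncapsulatedStemmer.suffixAt(0)[0] == '\u0627');
  }
}
```

Making the field private, as requested in the review, removes the problem without needing the defensive copy, as long as no accessor leaks the array.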

mocobeta · 2022-05-06T08:28:41Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+  }
+
+  /**
+   * Stem suffix(es) off an Persian word.


Again, my IDE suggests "use 'a' instead of 'an'".

mocobeta · 2022-05-06T08:33:02Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+   * @param len length of input buffer
+   * @return new length of input buffer after stemming
+   */
+  public int stemSuffix(char[] s, int len) {


I think this can also be private?

mocobeta · 2022-05-06T08:35:35Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+   * @param suffix suffix to check
+   * @return true if the suffix matches and can be stemmed
+   */
+  boolean endsWithCheckLength(char[] s, int len, char[] suffix) {


Same here - change the method visibility to private, please.

mocobeta · 2022-05-06T08:39:44Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+    for (int i = 0; i < suffixes.length; i++)
+      if (endsWithCheckLength(s, len, suffixes[i]))
+        len = deleteN(s, len - suffixes[i].length, len, suffixes[i].length);


Can you add { and } to these for and if clauses? Omitting them is error-prone.
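A sketch of what the braced loop could look like, for reference. The suffix table is copied from the diff above. The `deleteN` helper is a local stand-in written for this example (the diff calls a helper of that name whose body is not shown in this thread), and the class name `BracedSuffixLoop` is hypothetical. This is not necessarily the code that was merged.

```java
final class BracedSuffixLoop {
  // Suffix table taken from the diff under review, kept private here.
  private static final char[][] SUFFIXES = {
    "\u0627\u062A".toCharArray(), // ALEF TEH
    "\u0627\u0646".toCharArray(), // ALEF NOON
    "\u062A\u0631\u064A\u0646".toCharArray(), // TEH REH YEH NOON
    "\u062A\u0631".toCharArray(), // TEH REH
    "\u064A\u064A".toCharArray(), // YEH YEH
    "\u064A".toCharArray(), // YEH
    "\u0647\u0627".toCharArray(), // HEH ALEF
    "\u200C".toCharArray(), // ZWNJ
  };

  // Same logic as the diff, with explicit braces on both the for and the if.
  static int stemSuffix(char[] s, int len) {
    for (int i = 0; i < SUFFIXES.length; i++) {
      if (endsWithCheckLength(s, len, SUFFIXES[i])) {
        len = deleteN(s, len - SUFFIXES[i].length, len, SUFFIXES[i].length);
      }
    }
    return len;
  }

  // True if s[0..len) ends with suffix and at least 2 characters would remain.
  static boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
    if (len < suffix.length + 2) {
      return false;
    }
    for (int i = 0; i < suffix.length; i++) {
      if (s[len - suffix.length + i] != suffix[i]) {
        return false;
      }
    }
    return true;
  }

  // Local stand-in: removes nChars characters at pos and returns the new length.
  static int deleteN(char[] s, int pos, int len, int nChars) {
    if (pos + nChars < len) {
      System.arraycopy(s, pos + nChars, s, pos, len - pos - nChars);
    }
    return len - nChars;
  }

  public static void main(String[] args) {
    char[] word = "\u06A9\u062A\u0627\u0628\u0647\u0627".toCharArray();
    int len = stemSuffix(word, word.length);
    System.out.println(new String(word, 0, len));
  }
}
```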

mocobeta · 2022-05-06T08:43:46Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+  boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
+    if (len < suffix.length + 2) { // all suffixes require at least 2 characters after stemming
+      return false;
+    } else {


Let's remove this else. It isn't needed, and less nesting is better.

mocobeta · 2022-05-06T08:49:49Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+      for (int i = 0; i < suffix.length; i++) {
+        if (s[len - suffix.length + i] != suffix[i]) {
+          return false;
+        }


You could use Arrays.equals(...)?
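One possible shape for this suggestion, combined with the early return requested in the previous comment: a sketch that uses the range overload of `java.util.Arrays.equals` (available since Java 9). The class name `SuffixMatch` is hypothetical, and the thread does not confirm that the merged code looks like this.

```java
import java.util.Arrays;

final class SuffixMatch {
  // True if s[0..len) ends with suffix and at least 2 characters would remain.
  static boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
    if (len < suffix.length + 2) {
      return false; // early return, so no else branch is needed
    }
    // Compare the tail of s against the whole suffix in a single call.
    return Arrays.equals(s, len - suffix.length, len, suffix, 0, suffix.length);
  }

  public static void main(String[] args) {
    char[] word = "\u06A9\u062A\u0627\u0628\u0647\u0627".toCharArray();
    char[] heAlef = "\u0647\u0627".toCharArray(); // HEH ALEF
    System.out.println(endsWithCheckLength(word, word.length, heAlef));
  }
}
```

The range overload avoids allocating a temporary array for the tail of the buffer, which matters in a filter that runs once per token.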

mocobeta · 2022-05-06T08:59:24Z

lucene/analysis/common/src/test/org/apache/lucene/analysis/fa/TestPersianStemFilter.java

+    assertTokenStreamContents(filter, new String[] {"ساهدهات"});
+  }
+
+  private void check(final String input, final String expected) throws IOException {


We have BaseTokenStreamTestCase#checkOneTerm(Analyzer, input, expected). Is it possible to replace this with the built-in check method?

This is how it was done in the TestArabicStemFilter class, which this is based on.

Ah okay, thanks! I think both of them could be replaced with the built-in check method, which would make them consistent with other analyzers' tests; the built-in method also runs a few more consistency checks on the analyzed tokens than the current check() method. I'll look at both of them another time, so it's fine with me for now.

mocobeta · 2022-05-06T10:08:53Z

I left some minor comments.
I suppose we need a CHANGES entry in the 9.2.0 New Features section. Could you add the line?

NightOwl888 · 2022-05-06T12:53:58Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianStemmer.java

+  public static final char REH = '\u0631';
+  public static final char NOON = '\u0646';
+  public static final char YEH = '\u064A';
+  public static final char ZWNJ = '\u200c';


This seems inconsistent - it is not a letter, but a Zero-Width Non-Joining character. It seems that the abbreviation for this constant should be more descriptive than the others. Would you agree?

Good catch - do you have any suggestions for the name? I've actually never seen this character before.

Hmm...I discovered that SoraniNormalizer also uses ZWNJ. I guess the name isn't as big of a deal if we are making it private or package-private. But to me, it would be more intelligible to change them both to spell out ZERO_WIDTH_NON_JOINER than to use an unpronounceable constant for this one case.

The non-acronym version is fine with me; the constant is for private use anyway. I don't think the inconsistency with SoraniNormalizer would cause any problems.

Alright, for the sake of consistency, I yield. We should keep it as ZWNJ.

mocobeta · 2022-05-06T14:54:49Z

FYI, the feature freeze for the next release will be May 10th, according to the proposal for the 9.2 release on Lucene's dev mailing list.
We still have a few more days until then, so I optimistically suggested that we ship this in 9.2. There's no hurry, though. We'll be able to have this in 9.3 if it takes some more time.

raminmjj · 2022-05-07T06:50:47Z

Sorry for the delay in responding.
@mocobeta, I applied some changes based on your comments.

mocobeta · 2022-05-07T08:09:23Z

I just made a small change in 050cbf1.

Looks great, thank you @raminmjj and @NightOwl888! I'm merging this to main and will backport it to the 9x branch soon.
I'm also sorry for the delay in the review.

Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>

NightOwl888 · 2022-05-07T14:52:38Z

@mocobeta - Thanks for merging this. Please let me know the Jira issue number(s) for the related work on ArabicStemmer, ArabicNormalizer, and PersianNormalizer.

mocobeta · 2022-05-07T15:54:26Z

@NightOwl888 I opened https://issues.apache.org/jira/browse/LUCENE-10561 for them.


raminmjj and others added 4 commits December 14, 2021 12:20

Add PersianStemmer

41a268c

apply spotlessApply

bfacf95

add: Test PersianStemFilterFactory

674a0d8

apply :lucene:analysis:common:spotlessApply add org.apache.lucene.analysis.fa.PersianStemFilterFactory fix: Test PersianStemFilterFactory

Merge branch 'main' into PersianStemmer

71a109e

raminmjj mentioned this pull request Mar 2, 2022

Add PersianStemmer apache/lucenenet#571

Merged

raminmjj and others added 2 commits May 5, 2022 18:56

Merge branch 'apache:main' into PersianStemmer

d5aa8fd

update package path

f32e16c

mocobeta self-assigned this May 5, 2022

add PersianStemFilterFactory in module-info

6600c3e

mocobeta reviewed May 6, 2022

NightOwl888 reviewed May 6, 2022

refactor

2fea9a4

apply './gradlew tidy'

0b3147b

raminmjj added 2 commits May 7, 2022 10:03

fix Persian stemmer tests

18bfc68

add change log

5116b5c

make private suffixes char array

050cbf1

mocobeta merged commit 111d6b1 into apache:main May 7, 2022

mocobeta added a commit that referenced this pull request May 7, 2022

LUCENE-10312: Add PersianStemmer (#540)

9f04771

Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>

raminmjj added a commit to raminmjj/lucenenet that referenced this pull request May 9, 2022

refactor based on (apache/lucene#540)

1ad634d

This was referenced May 19, 2022

LUCENE-10312: Revert changes in PersianAnalyzer #904

Merged

LUCENE-10312: Make stemming configurable on PersianAnalyzer #906

Closed

NightOwl888 pushed a commit to apache/lucenenet that referenced this pull request May 22, 2022

Added PersianStemmer (#571)

c7ab459

Added changes based on apache/lucene#540 and https://issues.apache.org/jira/browse/LUCENE-10312

raminmjj deleted the PersianStemmer branch May 27, 2022 06:04

This was referenced May 14, 2022

Reduce class/member visibility of all normalizer and stemmer classes [LUCENE-10561] #11597

Closed

Add PersianStemmer [LUCENE-10312] #11348

Closed

cbuescher mentioned this pull request Sep 17, 2024

Lucene 10 "persian" analyzer now stems by default elastic/elasticsearch#113050

Open
