Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10312: Add PersianStemmer #540

Merged
merged 12 commits into from
May 7, 2022
Merged

Conversation

raminmjj
Copy link
Contributor

@raminmjj raminmjj commented Dec 14, 2021

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

raminmjj and others added 4 commits December 14, 2021 12:20
apply :lucene:analysis:common:spotlessApply

add org.apache.lucene.analysis.fa.PersianStemFilterFactory

fix: Test PersianStemFilterFactory
@NightOwl888
Copy link

Hi there. We (the Lucene.NET project) are waiting for approval of this stemmer before we will accept it into our codebase (apache/lucenenet#571). We aren't really sure how analysis components are vetted, so please let us know if there is anything else required for this to be accepted.

@mocobeta
Copy link
Contributor

mocobeta commented May 5, 2022

I'm sorry for the late response. I just kicked the CI - I'll take a look.

@mocobeta mocobeta self-assigned this May 5, 2022
@raminmjj
Copy link
Contributor Author

raminmjj commented May 5, 2022

Hi Tomoko(@mocobeta).
I added "PersianStem" in org.apache.lucene.analysis.TokenFilterFactory file.
What should I do to pass this error?
Thank you.

@mocobeta
Copy link
Contributor

mocobeta commented May 6, 2022

I added "PersianStem" in org.apache.lucene.analysis.TokenFilterFactory file.

Yes it's correct, now this has passed the tests/checks.

Copy link
Contributor

@mocobeta mocobeta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth adding PersianStemFilter to the Javadoc of PersianAnalyzer#createComponents().

import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

/**
* A {@link TokenFilter} that applies {@link PersianStemmer} to stem Arabic words..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My IDE says "Two consecutive dots"; it looks like a typo.

Comment on lines 23 to 37
/**
* Factory for {@link PersianStemFilter}.
*
* <pre class="prettyprint">
* &lt;fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100"&gt;
* &lt;analyzer&gt;
* &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
* &lt;filter class="solr.PersianNormalizationFilterFactory"/&gt;
* &lt;filter class="solr.PersianStemFilterFactory"/&gt;
* &lt;/analyzer&gt;
* &lt;/fieldType&gt;</pre>
*
* @since 3.1
* @lucene.spi {@value #NAME}
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Solr-scheme example is obsoleted and no longer needed in Lucene Javadoc, can you please remove the XML stuff? Instead, you can list the parameters like this.
Also, I suppose @since should be 9.2.0 (the next minor release).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PersianStemFilterFactory takes no parameters, so you can just delete <pre>...</pre>.


/**
* Stemmer for Persian.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@mocobeta mocobeta May 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the ArabicStemmer does not mention the algorithms or rules it bases on. As @NightOwl888 told me, this PersianStemmer is a derivative component of it; then I'm fine with the javadocs as is.

Comment on lines 34 to 51
public static final char ALEF = '\u0627';
public static final char HEH = '\u0647';
public static final char TEH = '\u062A';
public static final char REH = '\u0631';
public static final char NOON = '\u0646';
public static final char YEH = '\u064A';
public static final char ZWNJ = '\u200c';

public static final char[][] suffixes = {
("" + ALEF + TEH).toCharArray(),
("" + ALEF + NOON).toCharArray(),
("" + TEH + REH + YEH + NOON).toCharArray(),
("" + TEH + REH).toCharArray(),
("" + YEH + YEH).toCharArray(),
("" + YEH).toCharArray(),
("" + HEH + ALEF).toCharArray(),
("" + ZWNJ).toCharArray(),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These constants can be private?

Copy link

@NightOwl888 NightOwl888 May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is based on the ArabicStemmer where these are public, it would seem odd to make them public in one case and private in the other. Same goes for the stem, stemSuffix and stemPrefix methods.

Copy link
Contributor

@mocobeta mocobeta May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. I haven't noticed that.
The original ArabicStemmer was added in 2010 and those public static constants seem unchanged since then. It's a bad practice in these days to unnecessarily expose constants/variables/methods; especially it isn't safe to expose the suffixes char array - it's substantially mutable, even this is marked as final.
Please keep class members private as far as possible. I will open an issue for ArabicStemmer to make those members private.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Should that be expanded to include ArabicNormalizer and PersianNormalizer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I think so. This is another issue, we can improve them later.

}

/**
* Stem suffix(es) off an Persian word.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, my IDE suggests "use 'a' instead of 'an'".

* @param len length of input buffer
* @return new length of input buffer after stemming
*/
public int stemSuffix(char[] s, int len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also can be private I think?

* @param suffix suffix to check
* @return true if the suffix matches and can be stemmed
*/
boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - change the method visibility to private, please.

Comment on lines 74 to 76
for (int i = 0; i < suffixes.length; i++)
if (endsWithCheckLength(s, len, suffixes[i]))
len = deleteN(s, len - suffixes[i].length, len, suffixes[i].length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add { and } to these for and if clauses; ommiting them is error-prone.

boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
if (len < suffix.length + 2) { // all suffixes require at least 2 characters after stemming
return false;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this else - this is not needed and fewer nests are better.

Comment on lines 92 to 95
for (int i = 0; i < suffix.length; i++) {
if (s[len - suffix.length + i] != suffix[i]) {
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use Arrays.equals(...)?

assertTokenStreamContents(filter, new String[] {"ساهدهات"});
}

private void check(final String input, final String expected) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have BaseTokenStreamTestCase#checkOneTerm(Analyzer, input, expected). Is it possible to replace this with the built-in check method?

Copy link

@NightOwl888 NightOwl888 May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how it was done in the TestArabicStemFilter class, which this is based on.

Copy link
Contributor

@mocobeta mocobeta May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, thanks! I think both of them could be replaceable with the built-in check method so that they are consistent with other analyzer's tests, and the built-in check method includes a few more consistency checks for the analyzed tokens than the current check() method. I'll look at both of them another time. So it's fine with me for now.

@mocobeta
Copy link
Contributor

mocobeta commented May 6, 2022

I left some minor comments.
I suppose we need a CHANGES entry in the 9.2.0 New Features section. Could you add the line?

public static final char REH = '\u0631';
public static final char NOON = '\u0646';
public static final char YEH = '\u064A';
public static final char ZWNJ = '\u200c';
Copy link

@NightOwl888 NightOwl888 May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems inconsistent - it is not a letter, but a Zero-Width Non-Joining character. It seems that the abbreviation for this constant should be more descriptive than the others. Would you agree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - do you have any suggestions for the name? I've actually never seen before the character.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I discovered that SoraniNormalizer also uses ZWNJ. I guess the name isn't as big of a deal if we are making it private or package-private. But to me, it would be more intelligible to change them both to spell out ZERO_WIDTH_NON_JOINER than to use an unpronounceable constant for this one case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-acronym version is fine with me, anyway the constant is private use. I don't think the inconsistency with SoraniNormalizer would cause any problems.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, for the sake of consistency, I yield. We should keep it as ZWNJ.

@mocobeta
Copy link
Contributor

mocobeta commented May 6, 2022

FYI, the feature freeze for the next release will be 10th May, according to the proposal for 9.2 release at Lucene's dev mail list.
We still have a few more days until then, so I optimistically suggested that we ship this with it in 9.2. There's no hurry, though. We'll be able to have this in 9.3, in case this takes some more time.

@raminmjj
Copy link
Contributor Author

raminmjj commented May 7, 2022

Sorry for the delay in responding.
@mocobeta, I applied some changes based on your comments.

@mocobeta
Copy link
Contributor

mocobeta commented May 7, 2022

I just made a small change on it 050cbf1.

Looks great, thank you @raminmjj and @NightOwl888! I'm merging this to main and will backport it to the 9x branch soon.
I'm also sorry for the delay in the review.

@mocobeta mocobeta merged commit 111d6b1 into apache:main May 7, 2022
mocobeta added a commit that referenced this pull request May 7, 2022
Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>
@NightOwl888
Copy link

@mocobeta - Thanks for merging this. Please let me know the Jira issue number(s) for the related work on ArabicStemmer, ArabicNormalizer, and PersianNormalizer.

@mocobeta
Copy link
Contributor

mocobeta commented May 7, 2022

raminmjj added a commit to raminmjj/lucenenet that referenced this pull request May 9, 2022
wjp719 added a commit to wjp719/lucene that referenced this pull request May 10, 2022
* main:
  LUCENE-10532: remove @Slow annotation (apache#832)
  LUCENE-10312: Add PersianStemmer (apache#540)
  LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871)
  LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869)
  Disable liftbot, we have our own tools
  LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860)
  Make CONTRIBUTING.md a bit more succinct (apache#866)
  LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796)
  Add change line for LUCENE-9848
  LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862)
wjp719 added a commit to wjp719/lucene that referenced this pull request May 10, 2022
* main: (24 commits)
  LUCENE-10532: remove @Slow annotation (apache#832)
  LUCENE-10312: Add PersianStemmer (apache#540)
  LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871)
  LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869)
  Disable liftbot, we have our own tools
  LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860)
  Make CONTRIBUTING.md a bit more succinct (apache#866)
  LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796)
  Add change line for LUCENE-9848
  LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862)
  LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md (apache#853)
  LUCENE-10552: KnnVectorQuery has incorrect equals/ hashCode (apache#859)
  LUCENE-10534: MinFloatFunction / MaxFloatFunction calls exists twice (apache#837)
  LUCENE-10188: Give SortedSetDocValues a docValueCount() (apache#663)
  Allow to link to github PR from changes (apache#854)
  LUCENE-10551: improve testing of LowercaseAsciiCompression (apache#858)
  LUCENE-10542: FieldSource exists implementations can avoid value retrieval (apache#847)
  LUCENE-10539: Return a stream of completions from FSTCompletion. (apache#844)
  gradle 7.3.3 quick upgrade (apache#856)
  LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (apache#848)
  ...
NightOwl888 pushed a commit to apache/lucenenet that referenced this pull request May 22, 2022
@raminmjj raminmjj deleted the PersianStemmer branch May 27, 2022 06:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants