LUCENE-9030: Fix different Solr- and WordnetSynonymParser behaviour #981

cbuescher · 2019-10-28T20:06:20Z

This fixes an issue where sets of equivalent synonyms in the Wordnet format are
parsed and added to the SynonymMap in such a way that the original input token
is typed as SYNONYM instead of "word". In addition, the original token does not
appear first in the token stream output, as it does for equivalent
Solr-formatted synonym files.

Currently the WordnetSynonymParser adds all combinations of input/output pairs
of a synset entry to the synonym map, while the SolrSynonymParser excludes
those where the input and output terms are the same. This change adds the same
behaviour to WordnetSynonymParser and adds tests showing that the two formats
now output the same token order and types.
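
To illustrate the difference in pair generation described above, here is a minimal standalone sketch. It does not use Lucene; the class `SynsetPairs`, the method `pairs`, and the `includeSelf` flag are hypothetical names made up for this illustration, and the sketch assumes the terms within one synset are distinct, so that skipping equal indices is the same as skipping identical terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
class SynsetPairs {

    // Returns every (input, output) pair generated for one synset.
    // includeSelf = true  models the old Wordnet behaviour described above,
    //                     where a term is also mapped to itself.
    // includeSelf = false models the Solr behaviour that this change adopts.
    static List<String[]> pairs(List<String> synset, boolean includeSelf) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < synset.size(); i++) {
            for (int j = 0; j < synset.size(); j++) {
                // Assumption: terms are distinct, so i == j means "same term".
                if (includeSelf || i != j) {
                    result.add(new String[] {synset.get(i), synset.get(j)});
                }
            }
        }
        return result;
    }
}
```

For a three-term synset, the first mode yields all nine combinations, including the three self-mappings, while the second yields only the six pairs that map a term to a different term.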

romseygeek · 2019-11-13T14:51:22Z

Could you add a CHANGES.txt entry? All looks good other than that, and precommit passes locally for me.

cbuescher · 2019-11-13T15:38:34Z

@romseygeek thanks for the review, I updated the CHANGES.txt.

cbuescher mentioned this pull request Oct 28, 2019

Synonym token types differ between Solr/Wordnet format elastic/elasticsearch#44507

Closed

Adding entry to changes.txt

898e7c3

romseygeek merged commit 3a7b25b into apache:master Nov 13, 2019
