Highlighters sometimes highlight additional characters #11726

jeantil · 2015-06-17T16:09:49Z

Verified on 1.5.2 and 1.6.0

When using a char filter to remove some characters (in our case we want to ignore "(" and ")"), the highlighters also highlight the removed characters that trail a matched term.

Here is a sample Sense session that reproduces the issue.

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "filter": {       
      },
      "analyzer": {
        "analyzer_with_charfilter": {
          "filter": [
            "asciifolding",
            "lowercase"
          ],
          "char_filter": [
            "sign_mappings"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        "sign_mappings": {
          "type": "mapping",
          "mappings": [
            "*=>star",
            "+=>plus",
            "(=>",
            ")=>"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "index_options" : "offsets",
          "analyzer": "analyzer_with_charfilter"
        }
      }
    }
  }
}
PUT test/test/1
{
  "name":"(F31)"
}
PUT test/test/2
{
  "name":"(F31) foobar"
}
GET test/test/_search
{
  "fielddata_fields": ["name"], 
  "query": {
    "match": {
      "name": "F31 foobar"
    }
  },
  "highlight": {
    "require_field_match" : false,
    "fields": {
      "name":{
        "type" : "postings"
      }
    }
  }
}

The highlights for the last query look like:

"highlight": {
  "name": [
    "(<em>F31)</em> <em>foobar</em>"
  ]
}

and

"highlight": {
  "name": [
    "(<em>F31)</em>"
  ]
}

I would expect the closing paren not to be highlighted ...


clintongormley · 2015-06-18T17:23:28Z

The Lucene docs for the PatternReplaceCharFilter (https://lucene.apache.org/core/5_2_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilter.html) say:

NOTE: If you produce a phrase that has different length to source string and the field is used for highlighting for a term of the phrase, you will face a trouble.

(I know this isn't the one you're using, but the same advice probably applies to the mapping char filter)

That said, the output from the _analyze API seems weird to me. For instance:

GET test/_analyze?field=name&text=F31

returns:

{
   "tokens": [
      {
         "token": "f31",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      }
   ]
}

while this:

GET test/_analyze?field=name&text=(F31)

returns:

{
   "tokens": [
      {
         "token": "f31",
         "start_offset": 1,
         "end_offset": 5,
         "type": "word",
         "position": 1
      }
   ]
}

I would expect the end_offset in the second example to be 4, not 5....
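
To make the link between these offsets and the highlight explicit, here is a minimal Python sketch. It assumes the highlighter simply wraps the source substring between a token's start_offset and end_offset in <em> tags; the helper name wrap_highlight is made up for illustration and is not part of Elasticsearch or Lucene.

```python
def wrap_highlight(text, start, end, pre="<em>", post="</em>"):
    """Wrap text[start:end] in highlight tags (hypothetical helper)."""
    return text[:start] + pre + text[start:end] + post + text[end:]

source = "(F31)"

# Offsets reported by _analyze for "(F31)": start_offset=1, end_offset=5.
reported = wrap_highlight(source, 1, 5)

# Offsets one would expect: the token "F31" covers characters 1..4.
expected = wrap_highlight(source, 1, 4)

print(reported)  # (<em>F31)</em>
print(expected)  # (<em>F31</em>)
```

Under that assumption, an end_offset of 5 reproduces the highlight from the original report, and an end_offset of 4 would leave the closing parenthesis outside the tags.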

mikemccand · 2015-06-19T15:30:19Z

This definitely looks like a bug, and I've created a simple pure Lucene test case showing it, but I'm not yet sure how to fix it; it could be the API for correcting offsets from CharFilter is too simplistic ... I'll open a Lucene issue for discussion.
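
As an illustration of how a cumulative-offset correction could arrive at an end offset of 5 for "(F31)", here is a simplified Python model. It is a sketch under stated assumptions, not Lucene's code: the function names and the list-of-pairs representation are invented for illustration, and it only models characters deleted by a mapping char filter.

```python
import bisect

def build_corrections(text, removed):
    """Strip characters in `removed` and record, for each output position
    where a deletion happened, the cumulative number of deleted characters."""
    output = []
    corrections = []  # (output_offset, cumulative_diff), sorted by offset
    diff = 0
    for ch in text:
        if ch in removed:
            diff += 1
            pos = len(output)
            if corrections and corrections[-1][0] == pos:
                # Several deletions at the same output position: keep the latest total.
                corrections[-1] = (pos, diff)
            else:
                corrections.append((pos, diff))
        else:
            output.append(ch)
    return "".join(output), corrections

def correct_offset(corrections, offset):
    """Map an offset in the filtered text back to the original text, using the
    last correction recorded at or before that offset (assumed rule)."""
    keys = [pos for pos, _ in corrections]
    i = bisect.bisect_right(keys, offset) - 1
    return offset if i < 0 else offset + corrections[i][1]

filtered, corrections = build_corrections("(F31)", removed="()")
# filtered is "F31"; corrections is [(0, 1), (3, 2)]

start = correct_offset(corrections, 0)              # 1
end = correct_offset(corrections, len(filtered))    # 5
```

In this model the same rule corrects both start and end offsets, so an end offset that falls exactly where a deletion was recorded also picks up that deletion. That is one way the trailing ")" could end up inside the token's span.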

jeantil · 2015-06-19T15:47:31Z

Thank you!

mikemccand · 2015-06-20T10:12:01Z

OK I opened https://issues.apache.org/jira/browse/LUCENE-6595 but I'm not sure how to fix it!

svola · 2015-06-25T13:20:40Z

I'm encountering a similar issue:

I'm using a prefix-query on the field I'm highlighting.
I'm searching for the term "uhr" and the highlighted result looks like this:

 "Schmuck > Dame<em>nuh</em>ren > Modische <em>Uhren</em>"

So the highlight is shifted to the left.
I'm using a special analyzer for German word decomposition.

There seems to be some problem with finding the correct starting position.

  {
     "token": "damenuhr",
     "start_offset": 13,
     "end_offset": 23,
     "type": "<ALPHANUM>",
     "position": 2
  },
  {
     "token": "dam",
     "start_offset": 13,
     "end_offset": 17,
     "type": "<ALPHANUM>",
     "position": 2
  },
  {
     "token": "uhr",
     "start_offset": 17,
     "end_offset": 20,
     "type": "<ALPHANUM>",
     "position": 2
  },
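
The shift can be checked from the token offsets alone. A small Python sketch, assuming the "damenuhr" token's start_offset (13) marks the first character of "Damenuhren" in the indexed text; the variable names are illustrative:

```python
word = "Damenuhren"
word_start = 13          # start_offset of the "damenuhr" token

# Offsets reported for the "uhr" sub-token.
uhr_start, uhr_end = 17, 20

# Translate absolute offsets into positions inside the word.
rel_start = uhr_start - word_start
rel_end = uhr_end - word_start

highlighted = word[rel_start:rel_end]
actual_pos = word.lower().find("uhr")

print(highlighted)            # nuh
print(rel_start, actual_pos)  # 4 5
```

Under that assumption, the sub-token starts one character before "uhr" actually begins, which matches the highlight landing on "nuh".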

mikemccand · 2015-06-26T12:11:39Z

@svola are you also using a char_filter? If not, yours is likely a different issue: perhaps the German decomposition is not setting the right offsets for the tokens it creates.

cristiana93 · 2016-11-16T12:55:23Z

Is there any progress on the problem?
I am also having the same issues. I must remove some chars with char_filter using pattern_replace in order for my tokenizer to work, but my highlight is broken because of the start and end offsets.

svola · 2016-11-16T13:38:34Z

It didn't change in ES 5. Same problem for me.

romseygeek · 2018-03-13T14:00:16Z

The linked Lucene issue is still open, which is where this needs to be fixed.

cc @elastic/es-search-aggs

mayya-sharipova · 2019-06-10T10:58:06Z

The linked Lucene issue is still open, which is where this needs to be fixed.

jackpf · 2022-04-19T16:56:50Z

Was there ever a fix found for this? Still seems to be broken 7 years later

clintongormley added the discuss, :Search/Analysis and >bug labels and removed the discuss label Jun 18, 2015

clintongormley assigned mikemccand Jun 18, 2015

jeantil mentioned this issue Nov 6, 2015

Search optimization - Highlighter causes many query rewrites #11442

Closed

romseygeek unassigned mikemccand Mar 13, 2018

rjernst added the Team:Search label May 4, 2020

asfimport mentioned this issue Jul 14, 2015

CharFilter offsets correction is wonky [LUCENE-6595] apache/lucene#7653

Open
