Highlighters sometimes highlight additional characters #11726

Open
jeantil opened this issue Jun 17, 2015 · 11 comments
Labels
>bug, :Search/Analysis (How text is split into tokens), Team:Search (Meta label for search team)

Comments

@jeantil

jeantil commented Jun 17, 2015

Verified on 1.5.2 and 1.6.0

When using a char_filter to remove some characters (in our case we want to ignore "(" and ")"), the highlighters also highlight the removed characters that trail a matched token.

Here is a sample Sense session to reproduce the issue:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "filter": {},
      "analyzer": {
        "analyzer_with_charfilter": {
          "filter": [
            "asciifolding",
            "lowercase"
          ],
          "char_filter": [
            "sign_mappings"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        "sign_mappings": {
          "type": "mapping",
          "mappings": [
            "*=>star",
            "+=>plus",
            "(=>",
            ")=>"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "index_options" : "offsets",
          "analyzer": "analyzer_with_charfilter"
        }
      }
    }
  }
}
PUT test/test/1
{
  "name":"(F31)"
}
PUT test/test/2
{
  "name":"(F31) foobar"
}
GET test/test/_search
{
  "fielddata_fields": ["name"], 
  "query": {
    "match": {
      "name": "F31 foobar"
    }
  },
  "highlight": {
    "require_field_match" : false,
    "fields": {
      "name":{
        "type" : "postings"
      }
    }
  }
}

The highlights for the last query look like this:

"highlight": {
  "name": [
    "(<em>F31)</em> <em>foobar</em>"
  ]
}

and

"highlight": {
  "name": [
    "(<em>F31)</em>"
  ]
}

I would expect the closing parenthesis not to be highlighted.
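To make the expectation concrete, here is a minimal Python sketch of how a highlighter might wrap character spans in tags. It is not Elasticsearch code; `apply_highlights` is a hypothetical helper, and the spans are assumed to be the token offsets stored for the field (with `index_options: offsets`). The span (1, 5) reproduces the reported output, while (1, 4) gives the expected one.

```python
def apply_highlights(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) character span of `text` in highlight tags.

    Hypothetical helper for illustration; spans must not overlap.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])               # untouched text before the span
        out.append(pre + text[start:end] + post)     # highlighted span
        cursor = end
    out.append(text[cursor:])                        # remainder after the last span
    return "".join(out)


reported = apply_highlights("(F31)", [(1, 5)])   # end offset includes ")"
expected = apply_highlights("(F31)", [(1, 4)])   # end offset stops before ")"
print(reported)  # (<em>F31)</em>
print(expected)  # (<em>F31</em>)
```

In other words, the highlighter itself may be behaving consistently; the extra character comes from the end offset it is given.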

@clintongormley

The Lucene docs for the PatternReplaceCharFilter (https://lucene.apache.org/core/5_2_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilter.html) say:

NOTE: If you produce a phrase that has different length to source string and the field is used for highlighting for a term of the phrase, you will face a trouble.

(I know this isn't the one you're using, but the same advice probably applies to the mapping char filter)

That said, the output from the _analyze API seems weird to me. For instance:

 GET test/_analyze?field=name&text=F31

returns:

{
   "tokens": [
      {
         "token": "f31",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      }
   ]
}

while this:

GET test/_analyze?field=name&text=(F31)

returns:

{
   "tokens": [
      {
         "token": "f31",
         "start_offset": 1,
         "end_offset": 5,
         "type": "word",
         "position": 1
      }
   ]
}

I would expect the end_offset in the second example to be 4, not 5....
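One possible explanation for the 5 can be sketched in Python. This is a simplified model, not Lucene code: it assumes the char filter records a cumulative difference keyed by the offset in the filtered text, and that an offset is corrected by adding the difference of the last entry at or before it. The names `strip_chars`, `correct_offset`, and `whitespace_tokens` are hypothetical, and only deletions (not the `*=>star` style expansions) are modelled.

```python
import bisect


def strip_chars(text, removed="()"):
    """Delete `removed` characters and record offset corrections.

    Returns the filtered text and a list of (filtered_offset, cumulative_diff)
    pairs: from filtered_offset onward, add cumulative_diff to map an offset
    back into the original text.
    """
    out = []
    corrections = []
    diff = 0
    for ch in text:
        if ch in removed:
            diff += 1
            if corrections and corrections[-1][0] == len(out):
                corrections[-1] = (len(out), diff)   # same position: overwrite
            else:
                corrections.append((len(out), diff))
        else:
            out.append(ch)
    return "".join(out), corrections


def correct_offset(offset, corrections):
    """Map an offset in the filtered text back to the original text."""
    keys = [off for off, _ in corrections]
    idx = bisect.bisect_right(keys, offset) - 1      # last entry at or before offset
    return offset if idx < 0 else offset + corrections[idx][1]


def whitespace_tokens(text, removed="()"):
    """Tokenize the filtered text on whitespace and report corrected offsets."""
    filtered, corrections = strip_chars(text, removed)
    tokens = []
    pos = 0
    for word in filtered.split():
        start = filtered.index(word, pos)
        end = start + len(word)
        pos = end
        tokens.append((word.lower(),
                       correct_offset(start, corrections),
                       correct_offset(end, corrections)))
    return tokens


print(whitespace_tokens("F31"))    # [('f31', 0, 3)]
print(whitespace_tokens("(F31)"))  # [('f31', 1, 5)]
```

Under this model the filtered offset 3 is ambiguous: it is both the end of "F31" (original offset 4) and the position just after the deleted ")" (original offset 5). A single correction per filtered offset cannot tell a token end from a token start, so the end offset lands on 5.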

@mikemccand
Contributor

This definitely looks like a bug, and I've created a simple pure Lucene test case showing it, but I'm not yet sure how to fix it; it could be the API for correcting offsets from CharFilter is too simplistic ... I'll open a Lucene issue for discussion.

@jeantil
Author

jeantil commented Jun 19, 2015

Thank you!


@mikemccand
Contributor

OK I opened https://issues.apache.org/jira/browse/LUCENE-6595 but I'm not sure how to fix it!

@svola

svola commented Jun 25, 2015

I'm encountering a similar issue:

I'm using a prefix-query on the field I'm highlighting.
I'm searching for the term "uhr" and the highlighted result looks like this:

 "Schmuck > Dame<em>nuh</em>ren > Modische <em>Uhren</em>" 

So the highlight is shifted to the left.
I'm using a special analyzer for German word decomposition.

There seems to be some problem with finding the correct starting position.

  {
     "token": "damenuhr",
     "start_offset": 13,
     "end_offset": 23,
     "type": "<ALPHANUM>",
     "position": 2
  },
  {
     "token": "dam",
     "start_offset": 13,
     "end_offset": 17,
     "type": "<ALPHANUM>",
     "position": 2
  },
  {
     "token": "uhr",
     "start_offset": 17,
     "end_offset": 20,
     "type": "<ALPHANUM>",
     "position": 2
  },
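A quick way to check these offsets is to slice the compound word with them. The Python sketch below assumes that "Damenuhren" starts at offset 13 in the analyzed text, as the `damenuhr` token's `start_offset` suggests; the exact analyzed string is not shown in this thread, so that is an assumption. `slice_by_offsets` is a hypothetical helper.

```python
def slice_by_offsets(word, word_start, tokens):
    """Return the substring of `word` covered by each token's offsets.

    `word_start` is the assumed offset of `word` in the analyzed text.
    """
    return {
        tok["token"]: word[tok["start_offset"] - word_start:
                           tok["end_offset"] - word_start]
        for tok in tokens
    }


tokens = [
    {"token": "damenuhr", "start_offset": 13, "end_offset": 23},
    {"token": "dam", "start_offset": 13, "end_offset": 17},
    {"token": "uhr", "start_offset": 17, "end_offset": 20},
]

covered = slice_by_offsets("Damenuhren", 13, tokens)
print(covered)  # {'damenuhr': 'Damenuhren', 'dam': 'Dame', 'uhr': 'nuh'}
```

Under that assumption the `uhr` token covers "nuh", which matches the shifted highlight: its start offset appears to be placed right after the 4 characters of "Dame" rather than at the "u".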

@mikemccand
Contributor

@svola are you also using a char_filter? If not, yours is likely a different issue: perhaps the German decomposition is not setting the right offsets for the tokens it creates.

@cristiana93

cristiana93 commented Nov 16, 2016

Is there any progress on this problem?
I am having the same issue. I must remove some characters with a pattern_replace char_filter for my tokenizer to work, but highlighting is broken because the start and end offsets are wrong.

@svola

svola commented Nov 16, 2016

It didn't change in ES 5. Same problem for me.

@romseygeek
Contributor

The linked Lucene issue is still open, which is where this needs to be fixed.

cc @elastic/es-search-aggs

@mayya-sharipova
Contributor

The linked Lucene issue is still open, which is where this needs to be fixed.

rjernst added the Team:Search label (Meta label for search team) on May 4, 2020
@jackpf

jackpf commented Apr 19, 2022

Was a fix ever found for this? It still seems to be broken 7 years later.
